Division of Cancer Control & Population Sciences

Grant Details

Grant Number:	3U24CA275783-02S1
Primary Investigator:	Griffith, Malachi
Organization:	Washington University
Project Title:	Text Mining and Large Language Models for AI-Driven Evidence-Based Functional Annotation of Clinical Variants
Fiscal Year:	2024

Abstract

Project Summary

Biomedical knowledgebases face the challenge of sustaining high-quality curation amid ever-increasing amounts of biomedical data and limited curator resources. These resources have successfully used natural language processing (NLP) technologies to automate some curation tasks, such as document triage; however, other tasks, such as free-text annotation and information extraction, still require intensive manual effort that creates bottlenecks in curation workflows. The advent of Large Language Models (LLMs), which have demonstrated impressive performance in interpreting and producing natural language, opens up the possibility of automating these time-consuming tasks so as to maximize the value of curator effort. Discussions at the recent NIH data repository and knowledgebase (DRKB) program meeting in February 2024 showcased the great interest among resources in using LLMs to scale up curation.

In this supplement application, the Clinical Interpretation of Variants in Cancer (CIViC) resource and UniProt will collaborate to develop AI-driven data curation strategies to benefit our resources and to serve as a model for other DRKB members. CIViC is dedicated to the expert curation of information about the clinical significance of cancer genome alterations to enable precision medicine. To support CIViC curation, we have previously developed a BERT-based NLP system that extracts relationships between genes, genetic variants, cancers, and drugs from sentences in scientific articles.

In Aim 1 of this project, we will enhance this tool in two ways. First, we will add functionality to classify sentences according to CIViC evidence types for somatic variants: predictive, diagnostic, prognostic, oncogenic, and functional. Second, we will use an LLM to verify the information extracted by the BERT-based tool. Relations that are supported by both methodologies will be scored as high confidence, necessitating less manual curator review.

In Aim 2, we will use an LLM to prepare drafts of CIViC evidence statements, which are free-text descriptions of the literature evidence supporting asserted relations. To increase the accuracy and relevance of the statements, we will provide sentences identified by the BERT-based tool as enhanced context to the LLM. Supplementing an LLM with domain-specific information, an approach known as Retrieval-Augmented Generation (RAG), has been shown to improve LLM performance on biomedical tasks.

Finally, in Aim 3, we will disseminate the results from Aims 1 and 2 via the Hypothes.is community annotation platform, which is used by curators at CIViC and also at ClinGen, an NIH-funded resource focusing on the clinical relevance of genes and genetic variants. Moreover, UniProt will import CIViC relations and evidence statements for display in its computationally mapped bibliography and will establish cross-links with CIViC. The prototype framework developed here is generalizable to other biomedical knowledge domains and can be adopted by other data resources.
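
The abstract does not specify an implementation, so the following is only a minimal Python sketch of two ideas it describes: labeling a relation as high confidence when a second method confirms it (Aim 1), and passing extracted sentences to an LLM as context for a draft evidence statement (Aim 2). The `Relation` record, its field names, the `needs_review` label, the function names, and the prompt wording are all assumptions made for illustration; none of them come from CIViC, UniProt, or the project itself, and no LLM is actually called.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Evidence types for somatic variants, as listed in the abstract.
EVIDENCE_TYPES = ("predictive", "diagnostic", "prognostic", "oncogenic", "functional")


@dataclass(frozen=True)
class Relation:
    """Hypothetical record for one extracted relation (field names are assumptions)."""
    gene: str
    variant: str
    disease: str
    drug: Optional[str]
    evidence_type: str
    sentence: str  # sentence the relation was extracted from


def score_relations(
    relations: List[Relation], llm_verdicts: Dict[Relation, bool]
) -> List[Tuple[Relation, str]]:
    """Label each extracted relation by whether the second method confirmed it.

    A relation missing from llm_verdicts is treated as unconfirmed.
    """
    scored = []
    for rel in relations:
        if rel.evidence_type not in EVIDENCE_TYPES:
            raise ValueError(f"unknown evidence type: {rel.evidence_type}")
        label = "high" if llm_verdicts.get(rel, False) else "needs_review"
        scored.append((rel, label))
    return scored


def build_prompt(relation: Relation, context_sentences: List[str]) -> str:
    """Assemble a prompt that supplies extracted sentences as context (RAG-style)."""
    context = "\n".join(f"- {s}" for s in context_sentences)
    drug = relation.drug or "no drug"
    return (
        "Draft an evidence statement for the relation below, "
        "using only the context sentences.\n"
        f"Relation: {relation.gene} {relation.variant} | {relation.disease} | "
        f"{drug} | {relation.evidence_type}\n"
        f"Context:\n{context}\n"
    )
```

In this sketch, a curator-facing tool would call `score_relations` with the extractor's output and a mapping of LLM verdicts, then call `build_prompt` for each relation to be drafted; how verdicts are obtained and which model receives the prompt are left open.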

Publications

None. See parent grant details.
