Grant Details
Grant Number: 3U24CA275783-02S1
Primary Investigator: Griffith, Malachi
Organization: Washington University
Project Title: Text Mining and Large Language Models for AI-Driven Evidence-Based Functional Annotation of Clinical Variants
Fiscal Year: 2024
Abstract
Project Summary
Biomedical knowledgebases face the challenge of sustaining high-quality curation amid ever-increasing
amounts of biomedical data and limited curator resources. These knowledgebases have successfully taken
advantage of natural language processing (NLP) technologies to automate some curation tasks, such as
document triage; however, other tasks such as free-text annotation and information extraction still require
intensive manual effort that causes bottlenecks in curation workflows. The advent of Large Language Models
(LLMs), which have demonstrated impressive performance in interpretation and production of natural
language, opens up the possibility of automating these time-consuming tasks, so as to maximize the value of
curator effort. Discussions at the recent NIH data repository and knowledgebase (DRKB) program meeting in
February 2024 showcased the great interest among resources in using LLMs to scale up curation. In this
supplement application, the Clinical Interpretation of Variants in Cancer (CIViC) resource and UniProt will
collaborate to develop AI-driven data curation strategies to benefit our resources and to serve as a model for
other DRKB members. CIViC is dedicated to the expert curation of information about the clinical significance of
cancer genome alterations to enable precision medicine. To support CIViC curation, we have previously
developed a BERT-based NLP system that extracts relationships between genes, genetic variants, cancers,
and drugs from sentences in scientific articles. In Aim 1 of this project, we will enhance this tool in two
ways. First, we will add functionality that will classify sentences according to CIViC evidence types for somatic
variants: predictive, diagnostic, prognostic, oncogenic, and functional. Second, we will use an LLM to verify the
information extracted by the BERT-based tool. Relations that are supported by both methodologies will be
scored as high confidence, necessitating less manual curator review. In Aim 2, we will use an LLM to prepare
drafts of CIViC evidence statements, which are free-text descriptions of the literature evidence supporting
asserted relations. To increase the accuracy and relevance of the statements, we will provide sentences
identified by the BERT-based tool as enhanced context to the LLM. Supplementing an LLM with
domain-specific information, an approach known as Retrieval Augmented Generation (RAG), has been shown
to improve LLM performance on biomedical tasks. Finally, in Aim 3, we will disseminate the results from Aims 1
and 2 via the Hypothes.is community annotation platform, which is used by curators at CIViC and also at
ClinGen, an NIH-funded resource focusing on the clinical relevance of genes and genetic variants. Moreover,
UniProt will import CIViC relations and evidence statements for display in its computationally mapped
bibliography and will establish cross-links with CIViC. The prototype framework developed here is
generalizable to other biomedical knowledge domains and can be adopted by other data resources.
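
The agreement rule described for Aim 1 (relations supported by both the BERT-based tool and the LLM are scored as high confidence) could be sketched roughly as follows. This is only an illustration under stated assumptions: the `Relation` record, the `score_relations` function, the exact-match comparison, and the "needs_review" label are hypothetical and are not taken from the project or from CIViC's data model. The sample values are illustrative inputs, not project results.

```python
from dataclasses import dataclass
from typing import Optional

# The five CIViC evidence types for somatic variants named in the abstract.
EVIDENCE_TYPES = {"predictive", "diagnostic", "prognostic", "oncogenic", "functional"}


@dataclass(frozen=True)
class Relation:
    """Hypothetical record for one extracted relation (not CIViC's schema)."""
    gene: str
    variant: str
    disease: str
    drug: Optional[str]
    evidence_type: str


def score_relations(bert_relations, llm_relations):
    """Label each BERT-extracted relation by whether the LLM pass also found it.

    Assumption: two relations "agree" only when every field matches exactly.
    A real system would likely normalize names before comparing.
    """
    llm_set = set(llm_relations)  # frozen dataclasses are hashable
    scored = []
    for rel in bert_relations:
        if rel.evidence_type not in EVIDENCE_TYPES:
            raise ValueError(f"unknown evidence type: {rel.evidence_type}")
        confidence = "high" if rel in llm_set else "needs_review"
        scored.append((rel, confidence))
    return scored


# Illustrative inputs only.
braf = Relation("BRAF", "V600E", "melanoma", "vemurafenib", "predictive")
kras = Relation("KRAS", "G12C", "lung adenocarcinoma", None, "prognostic")
scored = score_relations(bert_relations=[braf, kras], llm_relations=[braf])
```

In this sketch, only relations found by both passes are marked "high"; everything else is routed to a curator, which mirrors the stated goal of reducing, not eliminating, manual review.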
Publications
None. See parent grant details.