Grant Details

Grant Number: 2U24CA248010-06
Primary Investigator: Savova, Guergana
Organization: Boston Children's Hospital
Project Title: Cancer Deep Phenotype Extraction From Electronic Medical Records (RENEWAL)
Fiscal Year: 2025


Abstract

Summary: Cancer patients accumulate a wealth of electronic medical record (EMR) data during the diagnostic, decision-making, treatment, and follow-up processes of their care; most of these data are found in unstructured narrative form that remains dormant for secondary research purposes. Even when patients enroll in clinical trials that gather detailed case report forms, a holistic picture of their cancer journey is the exception, not the rule. Answering seemingly simple questions requires intensive manual review of patient records, a tedious process that can take hours per patient case, limiting researchers’ ability to construct large observational cohorts. With the exponential growth in the quantity of EMR data, it is not tenable for even a very large team of manual curators to evaluate records thoroughly and exhaustively at scale. Understanding the “deep phenotype” of a cancer patient requires a complete picture of both tumor and host. Critical cancer phenotypic variables include morphology, tumor location, extent of invasion, predictive and prognostic biomarkers, treatment exposure history, and response to treatment. Host phenotypic variables include fitness (e.g., performance status and comorbidities), adverse effects of treatment, and non-medical determinants of health (e.g., global distress, financial toxicity, and behavioral habits). Phenotypic profiles are typically constructed from multiple data sources, and temporality is critically important. As many phenotypic variables are available only in EMR free text created over time, the cancer research community needs new, openly available natural language processing (NLP) methods and systems to transform phenotypic detail from EMRs into data for advancing translational research. We have been developing DeepPhe, a platform for turning these rich data into computable longitudinal summaries of cancer diagnostic, prognostic, and treatment information.
Since our last submission in 2019, the field of artificial intelligence has advanced at unprecedented speed, particularly in its subfield of text processing, as exemplified by the advent of large language models (LLMs) and then very large language models. In this renewal, we will build on our own and the community’s methodological advances, including the use of LLMs for EMR processing, to deliver a state-of-the-art, comprehensive, modern open-source tool for extracting deep phenotype information and to provide novel visual analytics approaches. Our case studies will demonstrate the utility of our tools and drive the development of a vibrant community of cancer researchers using DeepPhe. Our community development efforts are aligned with the mission of the NCI Cancer Research Data Commons to advance methods of extracting and representing precision medicine phenotypes.



Publications

Informatics at the Frontier of Cancer Research.
Authors: Noller K., Botsis T., Camara P.G., Ciotti L., Cooper L.A.D., Goecks J., Griffith M., Haas B.J., Ideker T., Karchin R., et al.
Source: Cancer Research, 2025-08-15; 85(16), p. 2967-2986.
PMID: 40600473

Extracting Knowledge from Scientific Texts on Patient-Derived Cancer Models Using Large Language Models: Algorithm Development and Validation.
Authors: Yao J., Perova Z., Mandloi T., Lewis E., Parkinson H., Savova G.
Source: bioRxiv: The Preprint Server for Biology, 2025-01-29.
EPub date: 2025-01-29.
PMID: 39975119

As bleak as it sounds? Analysing trends in oncology clinical trial initiation in the UK from 2010 to 2022.
Authors: VanHelene A.D., Hadfield M.J., Trapani D., Warner J.L., Lythgoe M.P.
Source: BMJ Oncology, 2024; 3(1), p. e000410.
EPub date: 2024-08-14.
PMID: 39886121

DeepPhe-CR: Natural Language Processing Software Services for Cancer Registrar Case Abstraction.
Authors: Hochheiser H., Finan S., Yuan Z., Durbin E.B., Jeong J.C., Hands I., Rust D., Kavuluru R., Wu X.C., Warner J.L., et al.
Source: medRxiv: The Preprint Server for Health Sciences, 2023-10-26.
EPub date: 2023-10-26.
PMID: 37205575

An End-to-End Natural Language Processing System for Automatically Extracting Radiation Therapy Events From Clinical Texts.
Authors: Bitterman D.S., Goldner E., Finan S., Harris D., Durbin E.B., Hochheiser H., Warner J.L., Mak R.H., Miller T., Savova G.K.
Source: International Journal of Radiation Oncology, Biology, Physics, 2023-09-01; 117(1), p. 262-273.
EPub date: 2023-03-27.
PMID: 36990288

DeepPhe-CR: Natural Language Processing Software Services for Cancer Registrar Case Abstraction.
Authors: Hochheiser H., Finan S., Yuan Z., Durbin E.B., Jeong J.C., Hands I., Rust D., Kavuluru R., Wu X.C., Warner J.L., et al.
Source: JCO Clinical Cancer Informatics, 2023 Sep; 7, p. e2300156.
PMID: 38113411

Open-source Software Sustainability Models: Initial White Paper From the Informatics Technology for Cancer Research Sustainability and Industry Partnership Working Group.
Authors: Ye Y., Barapatre S., Davis M.K., Elliston K.O., Davatzikos C., Fedorov A., Fillion-Robin J.C., Foster I., Gilbertson J.R., Lasso A., et al.
Source: Journal of Medical Internet Research, 2021-12-02; 23(12), p. e20028.
EPub date: 2021-12-02.
PMID: 34860667

Characterizing the Anticancer Treatment Trajectory and Pattern in Patients Receiving Chemotherapy for Cancer Using Harmonized Observational Databases: Retrospective Study.
Authors: Jeon H., You S.C., Kang S.Y., Seo S.I., Warner J.L., Belenkaya R., Park R.W.
Source: JMIR Medical Informatics, 2021-04-06; 9(4), p. e25035.
EPub date: 2021-04-06.
PMID: 33720842

Use of Natural Language Processing to Extract Clinical Cancer Phenotypes from Electronic Medical Records.
Authors: Savova G.K., Danciu I., Alamudun F., Miller T., Lin C., Bitterman D.S., Tourassi G., Warner J.L.
Source: Cancer Research, 2019-11-01; 79(21), p. 5463-5470.
EPub date: 2019-08-08.
PMID: 31395609


