Grant Details

Grant Number: 5R21CA242940-02
Primary Investigator: Cai, Tianxi
Organization: Harvard School Of Public Health
Project Title: Semi-Supervised Algorithms for Risk Assessment with Noisy EHR Data
Fiscal Year: 2020


PROJECT SUMMARY
Large electronic health record (EHR) research data integrated with -omics data from linked biorepositories have expanded opportunities for precision medicine research. These integrated datasets make it possible to develop accurate EHR-based personalized models of cancer risk and progression, which can be easily incorporated into clinical practice and ultimately realize the promise of precision oncology. However, using EHR data efficiently and effectively for cancer research remains challenging due to practical and methodological obstacles. For example, obtaining precise event time information, such as time of cancer recurrence, is a major bottleneck in using EHR data for precision medicine research because it requires laborious medical record review and because documentation is often lacking. Simple estimates of the event time based on billing or procedure codes may poorly approximate the true event time, and naive use of such estimated event times can lead to highly biased estimates due to the approximation error. Such biases make it difficult to perform pragmatic trials when the study endpoint is a time to event captured using EHR data. The overall goal of this proposal is to fill these methodological gaps in risk assessment for cancer research using EHR data, which will advance our ability to achieve the promise of precision oncology. Statistical algorithms and software will be developed to (i) automatically assign event time information using longitudinally recorded EHR information and (ii) perform accurate risk assessment using noisy proxies of event times. The proposed tools for risk assessment using imperfect EHR data, without requiring extensive manual chart review, could greatly improve the utility of EHR data for oncology research.


Semisupervised Calibration of Risk with Noisy Event Times (SCORNET) using electronic health record data.
Authors: Ahuja Y., Liang L., Zhou D., Huang S., Cai T.
Source: Biostatistics (Oxford, England), 2023-07-14; 24(3), p. 760-775.
PMID: 35166342

Risk prediction with imperfect survival outcome information from electronic health records.
Authors: Hou J., Chan S.F., Wang X., Cai T.
Source: Biometrics, 2023 Mar; 79(1), p. 190-202.
EPub date: 2021-11-22.
PMID: 34747010

Weakly Semi-supervised phenotyping using Electronic Health records.
Authors: Nogues I.E., Wen J., Lin Y., Liu M., Tedeschi S.K., Geva A., Cai T., Hong C.
Source: Journal of biomedical informatics, 2022 Oct; 134, p. 104175.
EPub date: 2022-09-05.
PMID: 36064111

Semi-supervised approach to event time annotation using longitudinal electronic health records.
Authors: Liang L., Hou J., Uno H., Cho K., Ma Y., Cai T.
Source: Lifetime data analysis, 2022 Jul; 28(3), p. 428-491.
EPub date: 2022-06-26.
PMID: 35753014

Developing and evaluating risk prediction models with panel current status data.
Authors: Chan S., Wang X., Jazić I., Peskoe S., Zheng Y., Cai T.
Source: Biometrics, 2021 Jun; 77(2), p. 599-609.
EPub date: 2020-07-08.
PMID: 32562264

Robust and efficient semi-supervised estimation of average treatment effects with application to electronic health records data.
Authors: Cheng D., Ananthakrishnan A.N., Cai T.
Source: Biometrics, 2021 Jun; 77(2), p. 413-423.
EPub date: 2020-05-25.
PMID: 32413171

sureLDA: A multidisease automated phenotyping method for the electronic health record.
Authors: Ahuja Y., Zhou D., He Z., Sun J., Castro V.M., Gainer V., Murphy S.N., Hong C., Cai T.
Source: Journal of the American Medical Informatics Association (JAMIA), 2020-08-01; 27(8), p. 1235-1243.
PMID: 32548637
