Grant Details

Grant Number: 5R21CA242940-02
Primary Investigator: Cai, Tianxi
Organization: Harvard School Of Public Health
Project Title: Semi-Supervised Algorithms for Risk Assessment with Noisy EHR Data
Fiscal Year: 2020


PROJECT SUMMARY Large electronic health record (EHR) research data integrated with -omics data from linked biorepositories have expanded opportunities for precision medicine research. These integrated datasets make it possible to develop accurate EHR-based personalized models of cancer risk and progression, which can be easily incorporated into clinical practice and ultimately realize the promise of precision oncology. However, using EHR data efficiently and effectively for cancer research remains challenging due to practical and methodological obstacles. For example, obtaining precise event time information, such as the time of cancer recurrence, is a major bottleneck in using EHR data for precision medicine research because it requires laborious medical record review and the information is often not documented. Simple estimates of the event time based on billing or procedure codes may poorly approximate the true event time. Naive use of such estimated event times can lead to highly biased estimates because of the approximation error. Such biases pose challenges for pragmatic trials whose study endpoint is a time to event captured from the EHR. The overall goal of this proposal is to fill these methodological gaps in risk assessment for cancer research using EHR data, which will advance our ability to achieve the promise of precision oncology. Statistical algorithms and software will be developed to (i) automatically assign event time information using longitudinally recorded EHR information and (ii) perform accurate risk assessment using noisy proxies of event times. By enabling risk assessment with imperfect EHR data without extensive manual chart review, the proposed tools could greatly improve the utility of EHR data for oncology research.


Developing and evaluating risk prediction models with panel current status data.
Authors: Chan S., Wang X., Jazić I., Peskoe S., Zheng Y., Cai T.
Source: Biometrics, 2021-06; 77(2), p. 599-609.
EPub date: 2020-07-08.
PMID: 32562264

Robust and efficient semi-supervised estimation of average treatment effects with application to electronic health records data.
Authors: Cheng D., Ananthakrishnan A.N., Cai T.
Source: Biometrics, 2021-06; 77(2), p. 413-423.
EPub date: 2020-05-25.
PMID: 32413171

sureLDA: A multidisease automated phenotyping method for the electronic health record.
Authors: Ahuja Y., Zhou D., He Z., Sun J., Castro V.M., Gainer V., Murphy S.N., Hong C., Cai T.
Source: Journal of the American Medical Informatics Association: JAMIA, 2020-08-01; 27(8), p. 1235-1243.
PMID: 32548637