Grant Details
| Grant Number: | 5R01CA296289-02 |
| Primary Investigator: | Gu, Tian |
| Organization: | Columbia University Health Sciences |
| Project Title: | Enhanced Cancer Risk Predictions Through Robust Multi-Source Data Integration |
| Fiscal Year: | 2026 |
Abstract
Breast cancer (BC) and prostate cancer (PC) are the most commonly diagnosed cancers among American women and men, respectively, with disease burden varying substantially across population groups. Certain populations experience higher risks, compounded by less effective risk prediction and diagnostic tools, partly due to limited data available in biomedical studies. Beyond continued data collection, there is a pressing need to refine analytical methods that are responsive to population-specific characteristics and health needs.
Building on this premise, the current biomedical landscape offers a wealth of large-scale biobanks linked to electronic health records (EHR). When effectively harmonized, these resources hold great potential for robust evidence synthesis and model building, particularly for populations with limited sample sizes in any single study. However, harmonizing data poses many challenges. Existing methods are limited by (1) difficulty in leveraging data from a variety of sources in a unified manner, (2) concerns over data privacy when sharing patient-level data, (3) a high communication burden when implementing federated algorithms that share summary data iteratively, and (4) limited effectiveness in settings involving complex or admixed ancestral backgrounds.
To address these challenges, we propose novel data integration methods with broad applicability to improve risk prediction in these settings. Specifically, Aim 1 will develop a transfer learning method that integrates shared knowledge from models fitted in external studies, accommodating various data-sharing infrastructures (shareable, partially shareable, and non-shareable) to leverage cross-study data. The proposed method will allow external studies to use only a subset of the target study's covariates and will require only one-time sharing of summary-level data, ensuring efficient communication and computation. Aim 2 will develop a privacy-preserving semi-supervised risk prediction model that integrates heterogeneous data from multi-site studies containing surrogate outcomes widely available in EHR (e.g., diagnosis billing codes). To ensure robust estimation, we will consider multi-layered data heterogeneity in both patient covariate distribution and outcome modeling. Aim 3 will develop an interpretable and integrative tree-based model designed for populations with complex ancestral structure by incorporating external polygenic risk scores. It will combine the strengths of random forest (RF) and small linear models derived from RF terminal nodes to capture the complex population structure and provide interpretable results. Aim 4 will apply these methods to build BC and PC risk prediction tools using data from sources such as the All of Us program, MGB Biobank, and UK Biobank, and will develop open-source software for easy implementation.
Publications
Adaptive transfer learning for time-to-event modeling with applications in disease risk assessment.
Authors: Lu Y., Gu T., Duan R.
Source: Biostatistics (Oxford, England), 2026-01-20; 27(1).
PMID: 42231823
On the Connections Among Three Transfer Learning Paradigms.
Authors: Gu T., Li S., Duan R.
Source: Stat (International Statistical Institute), 2025 Dec; 14(4).
EPub date: 2025-09-22.
PMID: 41293088
Global Prevalence of Long COVID, Its Subtypes, and Risk Factors: An Updated Systematic Review and Meta-analysis.
Authors: Hou Y., Gu T., Ni Z., Shi X., Ranney M.L., Mukherjee B.
Source: Open Forum Infectious Diseases, 2025 Sep; 12(9), p. ofaf533.
EPub date: 2025-08-30.
PMID: 41018705
Robust angle-based transfer learning in high dimensions.
Authors: Gu T., Han Y., Duan R.
Source: Journal of the Royal Statistical Society. Series B, Statistical Methodology, 2025 Jul; 87(3), p. 723-745.
EPub date: 2024-12-03.
PMID: 40661839
Adaptive Transfer Learning for Time-to-Event Modeling with Applications in Disease Risk Assessment.
Authors: Lu Y., Gu T., Duan R.
Source: medRxiv: The Preprint Server for Health Sciences, 2025-05-12.
EPub date: 2025-05-12.
PMID: 40463526
EntroLLM: Leveraging Entropy and Large Language Model Embeddings for Enhanced Risk Prediction with Wearable Device Data.
Authors: Huang X., Gu T.
Source: AMIA Joint Summits on Translational Science Proceedings. AMIA Joint Summits on Translational Science, 2025; 2025, p. 225-234.
EPub date: 2025-06-10.
PMID: 40502232