Division of Cancer Control & Population Sciences

Grant Details

Grant Number:	1R01CA296289-01
Primary Investigator:	Gu, Tian
Organization:	Columbia University Health Sciences
Project Title:	Enhanced Cancer Risk Predictions in Underrepresented Populations Through Robust Multi-Source Data Integration
Fiscal Year:	2025

Abstract

Breast cancer (BC) and prostate cancer (PC) are the most commonly diagnosed cancers among American women and men, respectively, with the burden disproportionately affecting Black and Hispanic populations. These groups experience higher risks, compounded by less effective risk prediction and diagnostic tools, partly due to their notable underrepresentation in biomedical studies. To further address these disparities, beyond collecting data, there is a pressing need to refine analytical methods tailored to these groups and their specific health needs. The current biomedical landscape offers a wealth of large-scale biobanks linked to electronic health records (EHR). When effectively harmonized, these resources hold great potential for robust evidence synthesis and model building, which is especially beneficial for underrepresented populations with limited data in any single study. However, harmonizing data poses many challenges. Existing methods are limited by (1) difficulty in effectively leveraging data from diverse sources in a unified manner, (2) concerns over data privacy when sharing patient-level data, (3) a high communication burden when implementing federated algorithms with iterative sharing of summary data, and (4) ineffectiveness in handling Hispanic populations with an admixed ancestral background. Therefore, we propose novel data integration methods with general applicability to address these methodological gaps, targeting underrepresented populations.

Specifically, Aim 1 will develop a transfer learning method that integrates shared knowledge from models fitted in external studies, accommodating various data-sharing infrastructures (shareable, partially shareable, and non-shareable) to leverage cross-study data. The proposed method will allow external studies to use only a subset of the target study's covariates and will require only one-time summary-level data sharing, ensuring efficient communication and computation. Aim 2 will develop a privacy-preserving semi-supervised risk prediction model that integrates ancestrally diverse data from multi-site studies containing surrogate outcomes that are widely available in EHR (e.g., diagnosis billing codes). To ensure robust estimation, we will consider multi-layered data heterogeneity in both patient covariate distribution and outcome modeling. Aim 3 will develop an interpretable and integrative tree-based model for Hispanics with an admixed ancestral background by incorporating external polygenic risk scores. It will combine the power of random forest (RF) with small linear models derived from RF terminal nodes to capture the complex population structure and provide interpretable results. Aim 4 will apply these methods to build BC and PC risk prediction tools for Black and Hispanic populations using data from sources such as the All of Us program, the MGB Biobank, and the UK Biobank, and will develop open-source software for easy implementation.
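The idea of one-time summary-level data sharing mentioned under Aim 1 can be illustrated with a toy sketch. This is not the proposed transfer learning method: it assumes a plain linear regression, and the function names (site_summary, pooled_fit, solve) and the two-site data are hypothetical. Each site computes its own aggregate statistics once, and only those aggregates, not patient-level rows, are combined into a pooled fit.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def site_summary(X: Matrix, y: Vector) -> Tuple[Matrix, Vector]:
    """Run locally at one site: return X^T X and X^T y (aggregates only)."""
    p = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return xtx, xty


def solve(A: Matrix, b: Vector) -> Vector:
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def pooled_fit(summaries: List[Tuple[Matrix, Vector]]) -> Vector:
    """Run at the coordinator: add up site aggregates and solve the normal equations."""
    p = len(summaries[0][1])
    xtx = [[sum(s[0][i][j] for s in summaries) for j in range(p)] for i in range(p)]
    xty = [sum(s[1][i] for s in summaries) for i in range(p)]
    return solve(xtx, xty)


# Hypothetical toy data: columns are (intercept, x), generated from y = 1 + 2x.
site_a = ([[1.0, 0.0], [1.0, 1.0]], [1.0, 3.0])
site_b = ([[1.0, 2.0], [1.0, 3.0]], [5.0, 7.0])

# Each site shares its summary exactly once; no iteration is needed.
beta = pooled_fit([site_summary(*site_a), site_summary(*site_b)])
print(beta)
```

For least squares, the summed aggregates give the same coefficients as fitting on the stacked patient-level data, which is why a single round of communication suffices in this simplified setting. Methods that handle heterogeneity across sites or non-linear models generally need more than this.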

Publications

On the Connections Among Three Transfer Learning Paradigms.
Authors: Gu T., Li S., Duan R.
Source: Stat (International Statistical Institute), 2025 Dec; 14(4).
EPub date: 2025-09-22.
PMID: 41293088

Global Prevalence of Long COVID, Its Subtypes, and Risk Factors: An Updated Systematic Review and Meta-analysis.
Authors: Hou Y., Gu T., Ni Z., Shi X., Ranney M.L., Mukherjee B.
Source: Open Forum Infectious Diseases, 2025 Sep; 12(9), p. ofaf533.
EPub date: 2025-08-30.
PMID: 41018705

Robust angle-based transfer learning in high dimensions.
Authors: Gu T., Han Y., Duan R.
Source: Journal of the Royal Statistical Society. Series B, Statistical Methodology, 2025 Jul; 87(3), p. 723-745.
EPub date: 2024-12-03.
PMID: 40661839

Adaptive Transfer Learning for Time-to-Event Modeling with Applications in Disease Risk Assessment.
Authors: Lu Y., Gu T., Duan R.
Source: medRxiv: The Preprint Server for Health Sciences, 2025-05-12.
EPub date: 2025-05-12.
PMID: 40463526

EntroLLM: Leveraging Entropy and Large Language Model Embeddings for Enhanced Risk Prediction with Wearable Device Data.
Authors: Huang X., Gu T.
Source: AMIA Joint Summits on Translational Science Proceedings. AMIA Joint Summits on Translational Science, 2025; 2025, p. 225-234.
EPub date: 2025-06-10.
PMID: 40502232
