Grant Details

Grant Number: 1R01CA296289-01
Primary Investigator: Gu, Tian
Organization: Columbia University Health Sciences
Project Title: Enhanced Cancer Risk Predictions in Underrepresented Populations Through Robust Multi-Source Data Integration
Fiscal Year: 2025


Abstract

Breast cancer (BC) and prostate cancer (PC) are the most commonly diagnosed cancers among American women and men, respectively, with the burden disproportionately affecting Black and Hispanic populations. These groups experience higher risks, compounded by less effective risk prediction and diagnostic tools, partly because of their notable underrepresentation in biomedical studies. Addressing these disparities requires more than collecting additional data: there is a pressing need for analytical methods tailored to these groups and their specific health needs. The current biomedical landscape offers a wealth of large-scale biobanks linked to electronic health records (EHR). When effectively harmonized, these resources hold great potential for robust evidence synthesis and model building, which is especially beneficial for underrepresented populations with limited data in any single study. However, harmonizing data poses many challenges. Existing methods are limited by (1) difficulty in leveraging data from diverse sources in a unified manner, (2) concerns over data privacy when sharing patient-level data, (3) a high communication burden when implementing federated algorithms that iteratively share summary data, and (4) ineffectiveness in handling Hispanic populations with an admixed ancestral background. We therefore propose novel, generally applicable data integration methods to close these methodological gaps, targeting underrepresented populations. Specifically, Aim 1 will develop a transfer learning method that integrates shared knowledge from models fitted in external studies, accommodating various data-sharing infrastructures (shareable, partially shareable, and non-shareable) to leverage cross-study data.
The proposed method will allow external studies to use only a subset of the target study's covariates and will require only one-time sharing of summary-level data, ensuring efficient communication and computation. Aim 2 will develop a privacy-preserving semi-supervised risk prediction model that integrates ancestrally diverse data from multi-site studies containing surrogate outcomes that are widely available in EHR (e.g., diagnosis billing codes). To ensure robust estimation, we will account for multi-layered data heterogeneity in both patient covariate distributions and outcome models. Aim 3 will develop an interpretable and integrative tree-based model for Hispanic populations with an admixed ancestral background by incorporating external polygenic risk scores. It will combine the power of random forest (RF) with small linear models derived from RF terminal nodes to capture the complex population structure and provide interpretable results. Aim 4 will apply these methods to build BC and PC risk prediction tools for Black and Hispanic populations using data from sources such as the All of Us program, MGB Biobank, and UK Biobank, and will develop open-source software for easy implementation.



Publications


None
