Grant Details
Grant Number: |
1R01CA296289-01 |
Primary Investigator: |
Gu, Tian |
Organization: |
Columbia University Health Sciences |
Project Title: |
Enhanced Cancer Risk Predictions in Underrepresented Populations Through Robust Multi-Source Data Integration |
Fiscal Year: |
2025 |
Abstract
Summary/Abstract
Breast cancer (BC) and prostate cancer (PC) are the most commonly diagnosed cancers among American women
and men, respectively, with the burden disproportionately affecting Black and Hispanic populations. These groups
experience higher risks, compounded by less effective risk prediction and diagnostic tools, partly due to
their notable underrepresentation in biomedical studies. To further address these disparities, beyond collecting
data, there's a pressing need to refine analytical methods tailored to these groups and catered to their specific
health needs. Building on this premise, the current biomedical landscape boasts a wealth of large-scale electronic
health record (EHR)-linked biobanks. When effectively harmonized, these resources hold great potential for
robust evidence synthesis and model building, especially beneficial for underrepresented populations with limited
data in a single study. However, harmonizing data poses many challenges. Existing methods have limitations in
terms of (1) difficulty in effectively leveraging data from diverse sources in a unified manner, (2) concerns over
data privacy when sharing patient-level data, (3) a high communication burden when implementing federated
algorithms with iterative sharing of summary data, and (4) ineffectiveness in handling Hispanic populations with an
admixed ancestral background. Therefore, we propose novel data integration methods with general applicability
to navigate these methodological gaps effectively, targeting underrepresented populations. Specifically, Aim 1
will develop a transfer learning method that integrates shared knowledge from models fitted in external studies,
accommodating various data-sharing infrastructures (shareable, partially shareable, and non-shareable) to leverage
cross-study data. The proposed method will allow external studies to use only a subset of the target's covariates
and require one-time summary-level data-sharing, ensuring efficient communication and computation. Aim 2 will
develop a privacy-preserving semi-supervised risk prediction model that integrates ancestrally diverse data from
multi-site studies containing surrogate outcomes that are widely available in EHR (e.g., diagnosis billing codes).
To ensure robust estimation, we will consider multi-layered data heterogeneity in both patient covariate distribution
and outcome modeling. Aim 3 will develop an interpretable and integrative tree-based model for Hispanics with
an admixed ancestral background by incorporating external polygenic risk scores. It will combine the power of
random forest (RF) and small linear models derived from RF terminal nodes to capture the complex population
structure and provide interpretable results. Aim 4 will apply these methods to build BC and PC risk prediction tools
for Blacks and Hispanics using data from sources such as the All of Us program, MGB Biobank, and UK Biobank, and develop
open-source software for easy implementation.
Publications
None