Breast cancer (BC) and prostate cancer (PC) are the most commonly diagnosed cancers among American women and men, respectively, with disease burden varying substantially across population groups. Certain populations experience higher risks, compounded by less effective risk prediction and diagnostic tools, partly due to limited data available in biomedical studies. Beyond continued data collection, there is a pressing need to refine analytical methods that are responsive to population-specific characteristics and health needs.
Large-scale biobanks linked to electronic health records (EHR) now offer a wealth of data for this purpose. When effectively harmonized, these resources hold great potential for robust evidence synthesis and model building, particularly for populations with limited sample sizes in any single study. However, harmonizing data poses many challenges. Existing methods are limited by (1) difficulty in leveraging data from a variety of sources in a unified manner, (2) concerns over data privacy when sharing patient-level data, (3) a high communication burden when implementing federated algorithms that iteratively share summary data, and (4) limited effectiveness in settings involving complex or admixed ancestral backgrounds. To address these challenges, we propose novel, broadly applicable data integration methods to improve risk prediction in these settings. Specifically, Aim 1 will develop a transfer learning method that integrates shared knowledge from models fitted in external studies, accommodating various data-sharing infrastructures (shareable, partially shareable, and non-shareable) to leverage cross-study data. The proposed method will allow external studies to use only a subset of the covariates available in the target study and will require only one-time sharing of summary-level data, ensuring efficient communication and computation. Aim 2 will develop a privacy-preserving semi-supervised risk prediction model that integrates heterogeneous data from multi-site studies containing surrogate outcomes widely available in EHR (e.g., diagnosis billing codes). To ensure robust estimation, we will account for multi-layered data heterogeneity in both patient covariate distributions and outcome models. Aim 3 will develop an interpretable, integrative tree-based model designed for populations with complex ancestral structure by incorporating external polygenic risk scores. It will combine the strengths of random forests (RF) and small linear models derived from RF terminal nodes to capture complex population structure and provide interpretable results. Aim 4 will apply these methods to build BC and PC risk prediction tools using data from sources such as the All of Us Research Program, the MGB Biobank, and the UK Biobank, and will develop open-source software for easy implementation.