Grant Details

Grant Number: 5R35CA197449-09
Primary Investigator: Lin, Xihong
Organization: Harvard School Of Public Health
Project Title: Statistical Methods for Analysis of Massive Genetic and Genomic Data in Cancer Research
Fiscal Year: 2023


Project Summary

With massive genome, exposome and phenome data rapidly becoming available in population and clinical studies, data science has become critically important and provides unprecedented opportunities for new discoveries in cancer. This competing renewal application of an NCI Outstanding Investigator Award (R35) aims to develop and apply scalable, interpretable and transferable statistical and machine learning (ML) methods for integrative analysis of massive germline whole genome sequencing (WGS) and somatic whole exome sequencing (WES) data and epidemiological and clinical data, in large-scale multi-ethnic biobanks and population and clinical studies of cancer, together with experimental cell-specific multi-omic functional data, such as single cell RNA/ATAC-seq data. Our ultimate goal is to use advanced data science methods and different types of population, clinical, and experimental data to accelerate progress from cancer gene mapping to mechanisms to cancer prevention and medicine, discover new effective trans-ethnic precision cancer prevention and treatment strategies, and reduce health disparities in cancer genetic research. This application aims to meet the pressing quantitative needs for the analysis of massive data in cancer research.
Specifically, (A) for genetic cancer epidemiology, we will develop scalable, interpretable and transferable statistical and ML methods for (1) rare variant analysis by integrating population-based WGS and experimental single cell functional data; (2) advancing from associated variants with unknown causality and biology to causal variants, genes and pathways using causal mediation analysis and Mendelian randomization by integrating genetic, cell-specific omic, biomarker and phenotype data; (3) estimating transferable trans-ethnic polygenic risk scores (PRSs) and heritability using common and rare variants by integrating WGS data with experimental in silico cell-specific functional annotations and non-genetic data, for actionable prevention strategies; (4) federated and transferable trans-ethnic single phenotype and phenome-wide genetic analysis in large WGS studies and biobanks. (B) For cancer genetic medicine, we will develop scalable and interpretable statistical and ML methods for (1) joint analysis of germline WGS and tumor somatic WES data to identify genetic variants that predispose to cancer subtypes; (2) integrative analysis of tumor somatic WES data and clinicopathological characteristics to identify patient profiles for improved efficacy of immunotherapies; (3) analysis of the effects of clonal hematopoiesis, mitochondrial dysfunction, and leukocyte telomere length, called from germline WGS data, on tumor somatic events, cancer prognosis and responses to immunotherapies. We will apply the proposed methods in lung cancer and breast cancer genetic epidemiological and clinical studies and biobanks. We will develop open-access cluster- and cloud-based software for these methods and data resources and make them available at NIH Data Commons to the cancer research community.