Grant Details
Grant Number: 5R35CA197449-10
Primary Investigator: Lin, Xihong
Organization: Harvard School Of Public Health
Project Title: Statistical Methods for Analysis of Massive Genetic and Genomic Data in Cancer Research
Fiscal Year: 2024
Abstract
Project Summary
With massive genome, exposome and phenome data rapidly becoming available in population and clinical studies,
data science has become critically important and provides unprecedented opportunities for new
discoveries in cancer. This competing renewal application of an NCI Outstanding Investigator Award (R35)
aims at developing and applying scalable, interpretable and transferable statistical and machine learning (ML)
methods for integrative analysis of massive germline whole genome sequencing (WGS) and somatic whole
exome sequencing (WES) data, epidemiological and clinical data, in large-scale multi-ethnic biobanks,
population and clinical studies of cancer, with experimental cell specific multi-omic functional data, such as
single cell RNA/ATAC-seq data. Our ultimate goal is to use advanced data science methods and different
types of population, clinical, and experimental data to accelerate progress in advancing from cancer gene
mapping to mechanisms to cancer prevention and medicine, discover new effective trans-ethnic precision
cancer prevention and treatment strategies, and reduce health disparities in cancer genetic research. This
application aims to meet the pressing quantitative needs for the analysis of massive data in cancer research.
Specifically, (A) for genetic cancer epidemiology, we will develop scalable, interpretable and transferable
statistical and ML methods for (1) rare variant analysis by integrating population-based WGS and experimental
single cell functional data; (2) advancing from associated variants with unknown causality and biology to causal
variants, genes and pathways using causal mediation analysis and Mendelian Randomization by integrating
genetic, cell-specific omic, biomarker and phenotype data; (3) estimating transferable trans-ethnic polygenic
risk scores (PRSs) and heritability using common and rare variants by integrating WGS data with experimental
in silico cell-specific functional annotations and non-genetic data, for actionable prevention strategies; (4)
federated and transferable trans-ethnic single phenotype and phenome-wide genetic analysis in large WGS
studies and biobanks. (B) For cancer genetic medicine, we will develop scalable and interpretable statistical
and machine learning methods for (1) joint analysis of germline WGS and tumor somatic WES data to identify
genetic variants that predispose to cancer subtypes; (2) integrative analysis of tumor somatic WES data and
clinicopathological characteristics to identify patient profiles for improved efficacy of immunotherapies; (3)
analysis of the effects of clonal hematopoiesis, mitochondrial dysfunction, and leukocyte telomere length, as called
from germline WGS data, on tumor somatic events, cancer prognosis and responses to immunotherapies. We
will apply the proposed methods in lung cancer and breast cancer genetic epidemiological and clinical studies
and biobanks. We will develop open-access cluster- and cloud-based software implementing these methods, along with data
resources, and make them available through the NIH Data Commons to the cancer research community.
Publications