Grant Details

Grant Number: 3U24CA180996-10S1
Primary Investigator: Waldron, Levi
Organization: Graduate School Of Public Health And Health Policy
Project Title: Cancer Genomics: Integrative and Scalable Solutions in R/Bioconductor
Fiscal Year: 2022


Project Summary

Bioconductor is an ecosystem of more than 2,000 open-source software packages for the reproducible bioinformatics analysis of various types of genomic data. Aim 1 of our parent grant, “Cancer Genomics: Integrative and Scalable Solutions in R/Bioconductor” (7U24CA180996), develops and maintains R/Bioconductor data structures for representation, downstream software development, and analysis of multimodal cancer datasets. Aim 3 of our parent grant establishes ExperimentHub web resources for the curation, distribution, maintenance, discoverability, and usability of cancer data resources for the R/Bioconductor community. This proposal targets hundreds of primarily cancer-focused genomic and metagenomic datasets that are optimized for R/Bioconductor-based usage and that add significant value over primary sources through harmonization and manual curation, but that currently require substantial domain and Bioconductor-specific expertise to translate into formats suitable for widely used AI/ML software. First, it creates the Bioconductor Machine Learning Repository for Omics by translating existing R/Bioconductor versions of TCGA, cBioPortal, metagenomics, and other datasets. Second, in order to assess the representation and generalizability of any models developed, it employs manual curation to uniformly annotate key characteristics of each study cohort, including race/ethnicity, sex as a biological variable, geographical location, and recruitment period. Finally, it provides runnable, documented examples of the import and use of these datasets in TensorFlow, PyTorch, and scikit-learn. In total, this proposal will produce the first large-scale, platform-independent, AI/ML-ready data repository for diverse and highly curated omics data. Thorough annotation of the minority status of the studies and samples in our repository will facilitate the identification of biases and health disparities affecting marginalized populations.


None. See parent grant details.
