Grant Details
Grant Number: 3U24CA180996-10S1
Primary Investigator: Waldron, Levi
Organization: Graduate School Of Public Health And Health Policy
Project Title: Cancer Genomics: Integrative and Scalable Solutions in R/Bioconductor
Fiscal Year: 2022
Abstract
Project Summary
Bioconductor is an ecosystem of more than 2,000 open-source software packages for the reproducible
bioinformatics analysis of various types of genomic data. Aim 1 of our parent grant, “Cancer Genomics:
Integrative and Scalable Solutions in R/Bioconductor” (7U24CA180996), develops and maintains
R/Bioconductor data structures for representation, downstream software development, and analysis of
multimodal cancer datasets. Aim 3 of our parent grant establishes ExperimentHub web resources for the
curation, distribution, maintenance, discoverability, and usability of cancer data resources for the
R/Bioconductor community. This proposal targets hundreds of primarily cancer-focused genomic and
metagenomic datasets that are optimized for R/Bioconductor-based usage and add significant value
over primary sources through harmonization and manual curation, but that currently require substantial domain and
Bioconductor-specific expertise to translate into formats suitable for widely used AI/ML
software. First, it creates the Bioconductor Machine Learning Repository for Omics by translating existing
R/Bioconductor versions of TCGA, cBioPortal, metagenomics, and other datasets. Second, in order to assess
representation and generalizability of any models developed, it employs manual curation to uniformly annotate
key characteristics of each study cohort including race/ethnicity, sex as a biological variable, geographical
location, and recruitment period. Finally, it provides runnable documented examples of the import and use of
these datasets in TensorFlow, PyTorch, and scikit-learn. In total, this proposal will produce the first large-scale,
platform-independent, AI/ML-ready data repository for diverse and highly curated omics data. Thorough
annotation of the minority status of the studies and samples in our repository will facilitate the identification of
biases and health disparities for marginalized populations.
Publications
None. See parent grant details.