Grant Details
Grant Number: 5R03CA272952-02
Primary Investigator: Schatz, Michael
Organization: Johns Hopkins University
Project Title: Optimized Workflows for Structural Variant Analysis of the Kids First Genomes Using Short and Long Reads
Fiscal Year: 2023
Abstract
Project Summary
The overall goal of the Gabriella Miller Kids First Pediatric Research Program is to alleviate suffering from
childhood cancer and structural birth defects by fostering collaborative research to uncover the etiology of
these diseases. A recent addition to the program is the Kids First Long Read Pilot Projects, which are leveraging
long-read sequencing technologies to further resolve the patients’ genomes. Already these technologies are
transforming genomics by allowing complete telomere-to-telomere (T2T) reconstructions of human genomes
for the first time, and by allowing the discovery of structural variants and other complex variants that were
previously inaccessible using short-read sequencing.
Here we will enhance the utility of the Kids First datasets by developing and applying optimized cloud-scale
workflows for analyzing short- and long-read datasets with the new T2T-CHM13 human genome. Within the
T2T consortium, we have led the effort to characterize how the CHM13 genome influences variant calling, and
have found that the T2T reference universally improves the analysis of genetic variation using both short- and
long-read sequencing. First, we will develop optimized workflows for analyzing short-read datasets with the
T2T-CHM13 reference genome using GATK for SNVs and small indels, and Parliament2 for short-read SV
discovery. Next, we will develop optimized workflows for long-read structural variant detection. Short reads
struggle to detect many classes of mutations (e.g., SVs and repeat expansions) and cannot resolve many
repetitive regions of the genome, including within many medically relevant genes. Long reads show great
promise to address these challenges and discover new disease associations due to their increased mappability,
variant resolution, and phasing capabilities. To enable these technologies for Kids First, we will develop
optimized workflows for accurately identifying and comparing SVs across long-read samples with Jasmine, as
well as genotyping SVs discovered by long reads within short-read datasets with Paragraph. This will enable us
to analyze and prioritize variants found by long reads within the much larger number of short-read datasets.
We will then apply these workflows to the Kids First data resource to develop improved variant calls and
improved variant analysis of these precious samples. This will lead to the discovery of thousands of SVs that
were previously missed and will reduce the number of false-positive variant calls that would otherwise confuse any
downstream analysis. We will also develop new statistical and machine learning approaches for prioritizing the
variants that are most likely to be related to the studied diseases, leveraging the pedigree information and
genome annotations available, in support of our overall goal of identifying the driver mutations for these
diseases. All workflows and software will be released as open source for use in CAVATICA, the
cloud-based analysis platform used by all Kids First researchers, ensuring scalability and reproducibility.
Publications
None