Grant Details

Grant Number: 5R03CA272952-02
Primary Investigator: Schatz, Michael
Organization: Johns Hopkins University
Project Title: Optimized Workflows for Structural Variant Analysis of the Kids First Genomes Using Short and Long Reads
Fiscal Year: 2023


Abstract

Project Summary: The overall goal of the Gabriella Miller Kids First Pediatric Research Program is to alleviate suffering from childhood cancer and structural birth defects by fostering collaborative research to uncover the etiology of these diseases. A recent addition to the program is the Kids First Long Read Pilot Projects, which are leveraging long-read sequencing technologies to further resolve the patients’ genomes. Already these technologies are transforming genomics by allowing complete telomere-to-telomere (T2T) reconstructions of human genomes for the first time, and by allowing the discovery of structural variants (SVs) and other complex variants that were previously inaccessible using short-read sequencing. Here we will enhance the utility of the Kids First data sets by developing and applying optimized cloud-scale workflows for analyzing short- and long-read datasets with the new T2T-CHM13 human genome. Within the T2T consortium, we have led the effort to characterize how the CHM13 genome influences variant calling, and have found that the T2T reference universally improves the analysis of genetic variation using both short- and long-read sequencing. First, we will develop optimized workflows for analyzing short-read datasets with the T2T-CHM13 reference genome using GATK for SNVs and small indels, and Parliament2 for short-read SV discovery. Next, we will develop optimized workflows for long-read structural variant detection. Short reads struggle to detect many classes of mutations (e.g., SVs and repeat expansions) and cannot resolve many repetitive regions of the genome, including within many medically relevant genes. Long reads show great promise to address these challenges and discover new disease associations due to their increased mappability, variant resolution, and phasing capabilities.
To enable these technologies for Kids First, we will develop optimized workflows for accurately identifying and comparing SVs across long-read samples with Jasmine, as well as genotyping SVs discovered by long reads within short-read datasets with Paragraph. This will enable us to analyze and prioritize variants found by long reads within the much larger numbers of short-read datasets. We will then apply these workflows to the Kids First data resource to develop improved variant calls and improved variant analysis of these precious samples. This will lead to the discovery of thousands of SVs that were previously missed, and will reduce the number of false variants that would otherwise confuse any downstream analysis. We will also develop new statistical and machine learning approaches for prioritizing the variants that are most likely to be related to the studied diseases, leveraging the pedigree information and genome annotations available, in support of our overall goal of identifying the driver mutations for these diseases. All workflows and software developments will be released as open source for use in CAVATICA, the cloud-based analysis platform used by all Kids First researchers, ensuring scalability and reproducibility.



Publications


None
