RESUMO
Accurate identification of germline de novo variants (DNVs) remains a challenging problem despite rapid advances in sequencing technologies as well as methods for the analysis of the data they generate, with putative solutions often involving ad hoc filters and visual inspection of identified variants. Here, we present a purely informatic method for the identification of DNVs by analyzing short-read genome sequencing data from proband-parent trios. Our method evaluates variant calls generated by three genome sequence analysis pipelines utilizing different algorithms-GATK HaplotypeCaller, DeepTrio and Velsera GRAF-exploring the assumption that a requirement of consensus can serve as an effective filter for high-quality DNVs. We assessed the efficacy of our method by testing DNVs identified using a previously established, highly accurate classification procedure that partially relied on manual inspection and used Sanger sequencing to validate a DNV subset comprising less confident calls. The results show that our method is highly precise and that applying a force-calling procedure to putative variants further removes false-positive calls, increasing precision of the workflow to 99.6%. Our method also identified novel DNVs, 87% of which were validated, indicating it offers a higher recall rate without compromising accuracy. We have implemented this method as an automated bioinformatics workflow suitable for large-scale analyses without need for manual intervention.
RESUMO
Graph-based genome reference representations have seen significant development, motivated by the inadequacy of the current human genome reference to represent the diverse genetic information from different human populations and its inability to maintain the same level of accuracy for non-European ancestries. While there have been many efforts to develop computationally efficient graph-based toolkits for NGS read alignment and variant calling, methods to curate genomic variants and subsequently construct genome graphs remain an understudied problem that inevitably determines the effectiveness of the overall bioinformatics pipeline. In this study, we discuss obstacles encountered during graph construction and propose methods for sample selection based on population diversity, graph augmentation with structural variants and resolution of graph reference ambiguity caused by information overload. Moreover, we present the case for iteratively augmenting tailored genome graphs for targeted populations and demonstrate this approach on the whole-genome samples of African ancestry. Our results show that population-specific graphs, as more representative alternatives to linear or generic graph references, can achieve significantly lower read mapping errors and enhanced variant calling sensitivity, in addition to providing the improvements of joint variant calling without the need of computationally intensive post-processing steps.
Assuntos
Análise de Dados , Sequenciamento de Nucleotídeos em Larga Escala , Genoma Humano/genética , Genômica/métodos , Humanos , Análise de Sequência de DNA/métodos , SoftwareRESUMO
The precisionFDA Truth Challenge V2 aimed to assess the state of the art of variant calling in challenging genomic regions. Starting with FASTQs, 20 challenge participants applied their variant-calling pipelines and submitted 64 variant call sets for one or more sequencing technologies (Illumina, PacBio HiFi, and Oxford Nanopore Technologies). Submissions were evaluated following best practices for benchmarking small variants with updated Genome in a Bottle benchmark sets and genome stratifications. Challenge submissions included numerous innovative methods, with graph-based and machine learning methods scoring best for short-read and long-read datasets, respectively. With machine learning approaches, combining multiple sequencing technologies performed particularly well. Recent developments in sequencing and variant calling have enabled benchmarking variants in challenging genomic regions, paving the way for the identification of previously unknown clinically relevant variants.