Generative Haplotype Prediction Outperforms Statistical Methods for Small Variant Detection in NGS Data.

O'Fallon, Brendan; Bolia, Ashini; Durtschi, Jacob; Yang, Luobin; Fredrickson, Eric; Best, Hunter

Affiliation

O'Fallon B; Institute for Research and Innovation, ARUP Labs, 560 Chipeta Way, 84112, UT, USA.
Bolia A; Institute for Clinical and Experimental Pathology, ARUP Labs, 560 Chipeta Way, 84112, UT, USA.
Durtschi J; Institute for Research and Innovation, ARUP Labs, 560 Chipeta Way, 84112, UT, USA.
Yang L; Institute for Research and Innovation, ARUP Labs, 560 Chipeta Way, 84112, UT, USA.
Fredrickson E; Institute for Clinical and Experimental Pathology, ARUP Labs, 560 Chipeta Way, 84112, UT, USA.
Best H; Institute for Research and Innovation, ARUP Labs, 560 Chipeta Way, 84112, UT, USA.

Bioinformatics; 2024 Sep 19.

Article in En | MEDLINE | ID: mdl-39298478

ABSTRACT

MOTIVATION:

Detection of germline variants in next-generation sequencing data is an essential component of modern genomics analysis. Variant detection tools typically rely on statistical algorithms such as de Bruijn graphs or Hidden Markov Models, and are often coupled with heuristic techniques and thresholds to maximize accuracy. Despite significant progress in recent years, current methods still generate thousands of false positive detections in a typical human whole genome, creating a significant manual review burden.

RESULTS:

We introduce a new approach that replaces the handcrafted statistical techniques of previous methods with a single deep generative model. Using a standard transformer-based encoder and double-decoder architecture, our model learns to construct diploid germline haplotypes in a generative fashion identical to modern Large Language Models (LLMs). We train our model on 37 Whole Genome Sequences (WGS) from Genome-in-a-Bottle samples, and demonstrate that our method learns to produce accurate haplotypes with correct phase and genotype for all classes of small variants. We compare our method, called Jenever, to FreeBayes, GATK HaplotypeCaller, Clair3 and DeepVariant, and demonstrate that our method has superior overall accuracy compared to other methods. At F1-maximizing quality thresholds, our model delivers the highest sensitivity, precision, and the fewest genotyping errors for insertion and deletion variants. For single nucleotide variants, our model demonstrates the highest sensitivity but at somewhat lower precision, and achieves the highest overall F1 score among all callers we tested.

AVAILABILITY AND IMPLEMENTATION:

Jenever is implemented as a Python-based command line tool. Source code is available at https://github.com/ARUP-NGS/jenever/.
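
The "F1-maximizing quality thresholds" mentioned in the results can be illustrated with a small sketch. This is not Jenever's evaluation code; the `Call` record, the function names, and the idea of sweeping every observed quality score are assumptions made for illustration. It only shows how precision, sensitivity (recall), and F1 are typically computed at a given quality cutoff, and how a cutoff that maximizes F1 could be chosen for a caller.

```python
from typing import List, NamedTuple, Tuple


class Call(NamedTuple):
    """Hypothetical record for one variant call (not a Jenever type)."""
    qual: float    # caller-reported quality score
    is_true: bool  # True if the call matches the truth set


def metrics_at_threshold(
    calls: List[Call], total_true: int, threshold: float
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) keeping calls with qual >= threshold.

    total_true is the number of variants in the truth set, so true
    variants that were never called count as false negatives.
    """
    kept = [c for c in calls if c.qual >= threshold]
    tp = sum(c.is_true for c in kept)
    precision = tp / len(kept) if kept else 0.0
    recall = tp / total_true if total_true else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def f1_maximizing_threshold(calls: List[Call], total_true: int) -> float:
    """Sweep every observed quality value and return the one with best F1."""
    candidates = sorted({c.qual for c in calls})
    return max(
        candidates,
        key=lambda t: metrics_at_threshold(calls, total_true, t)[2],
    )
```

Choosing the threshold separately for each caller, as the abstract describes, compares each tool at its own best operating point instead of at an arbitrary shared cutoff.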

Keywords

Deep Learning; Genomics; NGS; Variant Detection

Full text: 1 Collections: 01-international Database: MEDLINE Language: En Publication year: 2024 Document type: Article