ABSTRACT
DNA sample contamination is a major issue in clinical and research applications of whole-genome and -exome sequencing. Even modest levels of contamination can substantially affect the overall quality of variant calls and lead to widespread genotyping errors. Currently, popular tools for estimating the contamination level use short-read data (BAM/CRAM files), which are expensive to store and manipulate and often not retained or shared widely. We propose a metric to estimate DNA sample contamination from variant-level whole-genome and -exome sequence data called CHARR, contamination from homozygous alternate reference reads, which leverages the infiltration of reference reads within homozygous alternate variant calls. CHARR uses a small proportion of variant-level genotype information and thus can be computed from single-sample gVCFs or callsets in VCF or BCF formats, as well as efficiently stored variant calls in Hail VariantDataset format. Our results demonstrate that CHARR accurately recapitulates results from existing tools with substantially reduced costs, improving the accuracy and efficiency of downstream analyses of ultra-large whole-genome and exome sequencing datasets.
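As a rough illustration of the idea behind CHARR, the sketch below averages the fraction of reference-supporting reads across homozygous alternate calls. The abstract does not state the formula, so the function name, the input layout (genotype string, reference depth, alternate depth) and the absence of any allele-frequency normalization or site filtering are assumptions made purely for illustration.

```python
from typing import Iterable, Tuple


def mean_ref_fraction_at_hom_alt(calls: Iterable[Tuple[str, int, int]]) -> float:
    """Average reference-read fraction over homozygous alternate calls.

    Each call is (genotype, ref_depth, alt_depth). This is a simplified,
    hypothetical stand-in for the metric, not the published definition.
    """
    fractions = []
    for genotype, ref_depth, alt_depth in calls:
        total = ref_depth + alt_depth
        # Only homozygous alternate calls with at least one read are informative.
        if genotype != "1/1" or total == 0:
            continue
        fractions.append(ref_depth / total)
    if not fractions:
        raise ValueError("no informative homozygous alternate calls")
    return sum(fractions) / len(fractions)


# Toy data: in an uncontaminated sample, 1/1 calls should carry almost no
# reference reads, so a higher average hints at contamination.
example_calls = [
    ("1/1", 1, 39),
    ("0/1", 20, 20),  # heterozygous call, ignored
    ("1/1", 0, 40),
    ("1/1", 2, 38),
]
score = mean_ref_fraction_at_hom_alt(example_calls)
```

Because only per-variant genotypes and allele depths are needed, a calculation of this kind can run on variant-level files without access to the underlying reads.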
Subject(s)
DNA , Trout , Humans , Animals , DNA Sequence Analysis/methods , Genotype , Homozygote , High-Throughput Nucleotide Sequencing/methods , Software
ABSTRACT
Structural Variants (SVs) are a crucial type of genetic variant that can significantly impact phenotypes. Therefore, the identification of SVs is an essential part of modern genomic analysis. In this article, we present kled, an ultra-fast and sensitive SV caller for long-read sequencing data given the specially designed approach with a novel signature-merging algorithm, custom refinement strategies and a high-performance program structure. The evaluation results demonstrate that kled can achieve optimal SV calling compared to several state-of-the-art methods on simulated and real long-read data for different platforms and sequencing depths. Furthermore, kled excels at rapid SV calling and can efficiently utilize multiple Central Processing Unit (CPU) cores while maintaining low memory usage. The source code for kled can be obtained from https://github.com/CoREse/kled.
Subject(s)
Algorithms , Genomics , Phenotype , Software
ABSTRACT
Next-generation sequencing (NGS) has revolutionized the field of rare disease diagnostics. Whole exome and whole genome sequencing are now routinely used for diagnostic purposes; however, the overall diagnosis rate remains lower than expected. In this work, we review current approaches used for calling and interpretation of germline genetic variants in the human genome, and discuss the most important challenges that persist in the bioinformatic analysis of NGS data in medical genetics. We describe and attempt to quantitatively assess the remaining problems, such as the quality of the reference genome sequence, reproducible coverage biases, or variant calling accuracy in complex regions of the genome. We also discuss the prospects of switching to the complete human genome assembly or the human pan-genome and important caveats associated with such a switch. We touch on arguably the hardest problem of NGS data analysis for medical genomics, namely, the annotation of genetic variants and their subsequent interpretation. We highlight the most challenging aspects of annotation and prioritization of both coding and non-coding variants. Finally, we demonstrate the persistent prevalence of pathogenic variants in the coding genome, and outline research directions that may enhance the efficiency of NGS-based disease diagnostics.
Subject(s)
Computational Biology , Rare Diseases , Humans , Rare Diseases/diagnosis , Rare Diseases/genetics , Genomics , Human Genome , Germ Cells , High-Throughput Nucleotide Sequencing
ABSTRACT
In cancer genomics, variant calling has advanced, but traditional mean accuracy evaluations are inadequate for biomarkers like tumor mutation burden, which vary significantly across samples, affecting immunotherapy patient selection and threshold settings. In this study, we introduce TMBstable, an innovative method that dynamically selects optimal variant calling strategies for specific genomic regions using a meta-learning framework, distinguishing it from traditional callers with uniform sample-wide strategies. The process begins with segmenting the sample into windows and extracting meta-features for clustering, followed by using a pre-trained meta-model to select suitable algorithms for each cluster, thereby addressing strategy-sample mismatches, reducing performance fluctuations and ensuring consistent performance across various samples. We evaluated TMBstable using both simulated and real non-small cell lung cancer and nasopharyngeal carcinoma samples, comparing it with advanced callers. The assessment, focusing on stability measures, such as the variance and coefficient of variation in false positive rate, false negative rate, precision and recall, involved 300 simulated and 106 real tumor samples. Benchmark results showed TMBstable's superior stability with the lowest variance and coefficient of variation across performance metrics, highlighting its effectiveness in analyzing the counting-based biomarker. The TMBstable algorithm can be accessed at https://github.com/hello-json/TMBstable for academic usage only.
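The stability measures named above (variance and coefficient of variation of a per-sample metric) can be sketched as follows. This is a minimal illustration, not TMBstable's code: the function name, the use of population variance and the toy recall values are all assumptions.

```python
from statistics import mean, pstdev, pvariance


def stability_summary(values):
    """Return mean, population variance and coefficient of variation.

    The choice of population (rather than sample) variance is an
    assumption; the abstract does not specify which one is used.
    """
    mu = mean(values)
    var = pvariance(values, mu)
    # The coefficient of variation is undefined when the mean is zero.
    cv = pstdev(values, mu) / mu if mu != 0 else float("nan")
    return {"mean": mu, "variance": var, "cv": cv}


# Two hypothetical callers with the same mean recall across four samples
# but very different sample-to-sample fluctuation.
recall_stable = [0.90, 0.90, 0.90, 0.90]
recall_unstable = [0.99, 0.81, 0.95, 0.85]

summary_stable = stability_summary(recall_stable)
summary_unstable = stability_summary(recall_unstable)
```

Both toy callers share the same mean recall, which is why a mean-only evaluation would not separate them, whereas the variance and coefficient of variation do.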
Subject(s)
Non-Small-Cell Lung Carcinoma , Lung Neoplasms , Humans , High-Throughput Nucleotide Sequencing/methods , Genomics/methods , Genome , Algorithms
ABSTRACT
BACKGROUND: Whole genome sequencing (WGS) is becoming increasingly prevalent for molecular diagnosis, staging and prognosis because of its declining costs and the ability to detect nearly all genes associated with a patient's disease. The currently widely accepted variant calling pipeline, GATK, is limited in terms of its computational speed and efficiency, which cannot meet the growing analysis needs. RESULTS: Here, we propose a fast and accurate DNASeq variant calling workflow that is purely composed of tools from LUSH toolkit. The precision and recall measurements indicate that both the LUSH and GATK pipelines exhibit high levels of consistency, with precision and recall rates exceeding 99% on the 30x NA12878 dataset. In terms of processing speed, the LUSH pipeline outperforms the GATK pipeline, completing 30x WGS data analysis in just 1.6 h, which is approximately 17 times faster than GATK. Notably, the LUSH_HC tool completes the processing from BAM to VCF in just 12 min, which is around 76 times faster than GATK. CONCLUSION: These findings suggest that the LUSH pipeline is a highly promising alternative to the GATK pipeline for WGS data analysis, with the potential to significantly improve bedside analysis of acutely ill patients, large-scale cohort data analysis, and high-throughput variant calling in crop breeding programs. Furthermore, the LUSH pipeline is highly scalable and easily deployable, allowing it to be readily applied to various scenarios such as clinical diagnosis and genomic research.
Subject(s)
Software , Whole Genome Sequencing , Workflow , Humans , Whole Genome Sequencing/methods , Human Genome/genetics , High-Throughput Nucleotide Sequencing/methods , Single Nucleotide Polymorphism/genetics , Computational Biology/methods
ABSTRACT
Mycobacterium avium is one of the prominent disease-causing bacteria in humans. It causes lymphadenitis, chronic and extrapulmonary, and disseminated infections in adults, children, and immunocompromised patients. M. avium has ~4500 predicted protein-coding regions on average, which can help discover several variants at the proteome level. Many of them are potentially associated with virulence; thus, identifying such proteins can be a helpful feature in developing panel-based theranostics. In line with such a long-term goal, we carried out an in-depth proteomic analysis of M. avium with both data-dependent and data-independent acquisition methods. Further, a set of proteogenomic investigations were carried out using (i) a protein database for Mycobacterium tuberculosis, (ii) an M. avium genome six-frame-translated database, and (iii) a variant protein database of M. avium. A search of mass spectrometry data against M. avium protein database resulted in identifying 2954 proteins. Further, proteogenomic analyses aided in identifying 1301 novel peptide sequences and correcting translation start sites for 15 proteins. Ultimately, we created a spectral library of M. avium proteins, including novel genome search-specific peptides and variant peptides detected in this study. We validated the spectral library by a data-independent acquisition of the M. avium proteome. Thus, we present an M. avium spectral library of 29,033 peptide precursors supported by 0.4 million fragment ions for further use by the biomedical community.
Subject(s)
Mycobacterium avium , Proteogenomics , Child , Humans , Mycobacterium avium/genetics , Proteomics/methods , Proteome/genetics , Virulence , Bacterial Genome , Genomics/methods , Peptides/genetics , Mass Spectrometry
ABSTRACT
BACKGROUND: High-throughput sequencing (HTS) has become the gold standard approach for variant analysis in cancer research. However, somatic variants may occur at low fractions due to contamination from normal cells or tumor heterogeneity; this poses a significant challenge for standard HTS analysis pipelines. The problem is exacerbated in scenarios with minimal tumor DNA, such as circulating tumor DNA in plasma. Assessing sensitivity and detection of HTS approaches in such cases is paramount, but time-consuming and expensive: specialized experimental protocols and a sufficient quantity of samples are required for processing and analysis. To overcome these limitations, we propose a new computational approach specifically designed for the generation of artificial datasets suitable for this task, simulating ultra-deep targeted sequencing data with low-fraction variants and demonstrating their effectiveness in benchmarking low-fraction variant calling. RESULTS: Our approach enables the generation of artificial raw reads that mimic real data without relying on pre-existing data by using NEAT, a fine-grained read simulator that generates artificial datasets using models learned from multiple different datasets. Then, it incorporates low-fraction variants to simulate somatic mutations in samples with minimal tumor DNA content. To prove the suitability of the created artificial datasets for low-fraction variant calling benchmarking, we used them as ground truth to evaluate the performance of widely-used variant calling algorithms: they allowed us to define tuned parameter values of major variant callers, considerably improving their detection of very low-fraction variants. 
CONCLUSIONS: Our findings highlight both the pivotal role of our approach in creating adequate artificial datasets with low tumor fraction, facilitating rapid prototyping and benchmarking of algorithms for such dataset type, as well as the important need of advancing low-fraction variant calling techniques.
Subject(s)
Benchmarking , High-Throughput Nucleotide Sequencing , Neoplasms , High-Throughput Nucleotide Sequencing/methods , Humans , Neoplasms/genetics , Mutation , Algorithms , Neoplasm DNA/genetics , DNA Sequence Analysis/methods , Computational Biology/methods
ABSTRACT
BACKGROUND: Circulating tumour DNA (ctDNA) is a subset of cell free DNA (cfDNA) released by tumour cells into the bloodstream. Circulating tumour DNA has shown great potential as a biomarker to inform treatment in cancer patients. Collecting ctDNA is minimally invasive and reflects the entire genetic makeup of a patient's cancer. ctDNA variants in NGS data can be difficult to distinguish from sequencing and PCR artefacts due to low abundance, particularly in the early stages of cancer. Unique Molecular Identifiers (UMIs) are short sequences ligated to the sequencing library before amplification. These sequences are useful for filtering out low frequency artefacts. The utility of ctDNA as a cancer biomarker depends on accurate detection of cancer variants. RESULTS: In this study, we benchmarked six variant calling tools, including two UMI-aware callers for their ability to call ctDNA variants. The standard variant callers tested included Mutect2, bcftools, LoFreq and FreeBayes. The UMI-aware variant callers benchmarked were UMI-VarCal and UMIErrorCorrect. We used both datasets with known variants spiked in at low frequencies, and datasets containing ctDNA, and generated synthetic UMI sequences for these datasets. Variant callers displayed different preferences for sensitivity and specificity. Mutect2 showed high sensitivity, while returning more privately called variants than any other caller in data without synthetic UMIs - an indicator of false positive variant discovery. In data encoded with synthetic UMIs, UMI-VarCal detected fewer putative false positive variants than all other callers in synthetic datasets. Mutect2 showed a balance between high sensitivity and specificity in data encoded with synthetic UMIs. CONCLUSIONS: Our results indicate UMI-aware variant callers have potential to improve sensitivity and specificity in calling low frequency ctDNA variants over standard variant calling tools. There is a growing need for further development of UMI-aware variant calling tools if effective early detection methods for cancer using ctDNA samples are to be realised.
Subject(s)
Benchmarking , Circulating Tumor DNA , High-Throughput Nucleotide Sequencing , Circulating Tumor DNA/genetics , Circulating Tumor DNA/blood , Humans , High-Throughput Nucleotide Sequencing/methods , Tumor Biomarkers/genetics , Tumor Biomarkers/blood , Genetic Variation , Neoplasms/genetics , Neoplasms/blood , DNA Sequence Analysis/methods , Software , Sensitivity and Specificity
ABSTRACT
BACKGROUND: At a global scale, the SARS-CoV-2 virus did not remain in its initial genotype for a long period of time, with the first global reports of variants of concern (VOCs) in late 2020. Subsequently, genome sequencing has become an indispensable tool for characterizing the ongoing pandemic, particularly for typing SARS-CoV-2 samples obtained from patients or environmental surveillance. For such SARS-CoV-2 typing, various in vitro and in silico workflows exist, yet to date, no systematic cross-platform validation has been reported. RESULTS: In this work, we present the first comprehensive cross-platform evaluation and validation of in silico SARS-CoV-2 typing workflows. The evaluation relies on a dataset of 54 patient-derived samples sequenced with several different in vitro approaches on all relevant state-of-the-art sequencing platforms. Moreover, we present UnCoVar, a robust, production-grade reproducible SARS-CoV-2 typing workflow that outperforms all other tested approaches in terms of precision and recall. CONCLUSIONS: In many ways, the SARS-CoV-2 pandemic has accelerated the development of techniques and analytical approaches. We believe that this can serve as a blueprint for dealing with future pandemics. Accordingly, UnCoVar is easily generalizable towards other viral pathogens and future pandemics. The fully automated workflow assembles virus genomes from patient samples, identifies existing lineages, and provides high-resolution insights into individual mutations. UnCoVar includes extensive quality control and automatically generates interactive visual reports. UnCoVar is implemented as a Snakemake workflow. The open-source code is available under a BSD 2-clause license at github.com/IKIM-Essen/uncovar.
Subject(s)
COVID-19 , Viral Genome , SARS-CoV-2 , Workflow , SARS-CoV-2/genetics , Humans , COVID-19/virology , COVID-19/epidemiology , Software , Reproducibility of Results
ABSTRACT
BACKGROUND: Association testing between molecular phenotypes and genomic variants can help to understand how genotype affects phenotype. RNA sequencing provides access to molecular phenotypes such as gene expression and alternative splicing while DNA sequencing or microarray genotyping are the prevailing options to obtain genomic variants. RESULTS: We genotype variants for 74 male Braunvieh cattle from both DNA (~ 13-fold coverage) and deep total RNA sequencing from testis, vas deferens, and epididymis tissue (~ 250 million reads per tissue). We show that RNA sequencing can be used to identify approximately 40% of variants (7-10 million) called from DNA sequencing, with over 80% precision. Within highly expressed coding regions, over 92% of expected variants were called with nearly 98% precision. Allele-specific expression and putative post-transcriptional modifications negatively impact variant genotyping accuracy from RNA sequencing and contribute to RNA-DNA differences. Variants called from RNA sequencing detect roughly 75% of eGenes identified using variants called from DNA sequencing, demonstrating a nearly 2-fold enrichment of eQTL variants. We observe a moderate-to-strong correlation in nominal association p-values (Spearman ρ2 ~ 0.6), although only 9% of eGenes have the same top associated variant. CONCLUSIONS: We find hundreds of thousands of RNA-DNA differences in variants called from RNA and DNA sequencing on the same individuals. We identify several highly significant eQTL when using RNA sequencing variant genotypes which are not found with DNA sequencing variant genotypes, suggesting that using RNA sequencing variant genotypes for association testing results in an increased number of false positives. Our findings demonstrate that caution must be exercised beyond filtering for variant quality or imputation accuracy when analysing or imputing variants called from RNA sequencing.
Subject(s)
Quantitative Trait Loci , Animals , Cattle/genetics , Male , DNA/genetics , Genotype , RNA Sequence Analysis , Testis/metabolism , Genetic Variation , Single Nucleotide Polymorphism , RNA/genetics , DNA Sequence Analysis
ABSTRACT
BACKGROUND: Short tandem repeats (STRs) are widely distributed across the human genome and are associated with numerous neurological disorders. However, the extent that STRs contribute to disease is likely under-estimated because of the challenges calling these variants in short read next generation sequencing data. Several computational tools have been developed for STR variant calling, but none fully address all of the complexities associated with this variant class. RESULTS: Here we introduce LUSTR which is designed to address some of the challenges associated with STR variant calling by enabling more flexibility in defining STR loci, allowing for customizable modules to tailor analyses, and expanding the capability to call somatic and multiallelic STR variants. LUSTR is a user-friendly and easily customizable tool for targeted or unbiased genome-wide STR variant screening that can use either predefined or novel genome builds. Using both simulated and real data sets, we demonstrated that LUSTR accurately infers germline and somatic STR expansions in individuals with and without diseases. CONCLUSIONS: LUSTR offers a powerful and user-friendly approach that allows for the identification of STR variants and can facilitate more comprehensive studies evaluating the role of pathogenic STR variants across human diseases.
Subject(s)
Human Genome , Microsatellite Repeats , Humans , Microsatellite Repeats/genetics , Germ Cells , High-Throughput Nucleotide Sequencing
ABSTRACT
Mounting evidence recognizes structural variations (SVs) and repetitive DNA sequences as crucial players in shaping the existing grape phenotypic diversity at intra- and inter-species levels. To deepen our understanding on the abundance, diversity, and distribution of SVs and repetitive DNAs, including transposable elements (TEs) and tandemly repeated satellite DNA (satDNAs), we re-sequenced the genomes of the ancient grapes Aglianico and Falanghina. The analysis of large copy number variants (CNVs) detected candidate polymorphic genes that are involved in the enological features of these varieties. In a comparative analysis of Aglianico and Falanghina sequences with 21 publicly available genomes of cultivated grapes, we provided a genome-wide annotation of grape TEs at the lineage level. We disclosed that at least two main clusters of grape cultivars could be identified based on the TEs content. Multiple TEs families appeared either significantly enriched or depleted. In addition, in silico and cytological analyses provided evidence for a diverse chromosomal distribution of several satellite repeats between Aglianico, Falanghina, and other grapes. Overall, our data further improved our understanding of the intricate grape diversity held by two Italian traditional varieties, unveiling a pool of unique candidate genes never so far exploited in breeding for improved fruit quality.
Subject(s)
Vitis , Humans , Vitis/genetics , Plant Breeding , DNA Transposable Elements/genetics , Satellite DNA
ABSTRACT
Advances in whole-genome sequencing (WGS) promise to enable the accurate and comprehensive structural variant (SV) discovery. Dissecting SVs from WGS data presents a substantial number of challenges and a plethora of SV detection methods have been developed. Currently, evidence that investigators can use to select appropriate SV detection tools is lacking. In this article, we have evaluated the performance of SV detection tools on mouse and human WGS data using a comprehensive polymerase chain reaction-confirmed gold standard set of SVs and the genome-in-a-bottle variant set, respectively. In contrast to the previous benchmarking studies, our gold standard dataset included a complete set of SVs allowing us to report both precision and sensitivity rates of the SV detection methods. Our study investigates the ability of the methods to detect deletions, thus providing an optimistic estimate of SV detection performance as the SV detection methods that fail to detect deletions are likely to miss more complex SVs. We found that SV detection tools varied widely in their performance, with several methods providing a good balance between sensitivity and precision. Additionally, we have determined the SV callers best suited for low- and ultralow-pass sequencing data as well as for different deletion length categories.
Subject(s)
Benchmarking , Human Genome , Animals , High-Throughput Nucleotide Sequencing/methods , Humans , Mice , Whole Genome Sequencing/methods
ABSTRACT
The study of genetic minority variants is fundamental to the understanding of complex processes such as evolution, fitness, transmission, virulence, heteroresistance and drug tolerance in Mycobacterium tuberculosis (Mtb). We evaluated the performance of the variant calling tool LoFreq to detect de novo as well as drug resistance conferring minor variants in both in silico and clinical Mtb next generation sequencing (NGS) data. The in silico simulations demonstrated that LoFreq is a conservative variant caller with very high precision (≥96.7%) over the entire range of depth of coverage tested (30x to 1000x), independent of the type and frequency of the minor variant. Sensitivity increased with increasing depth of coverage and increasing frequency of the variant, and was higher for calling insertion and deletion (indel) variants than for single nucleotide polymorphisms (SNP). The variant frequency limit of detection was 0.5% and 3% for indel and SNP minor variants, respectively. For serial isolates from a patient with DR-TB, LoFreq successfully identified all minor Mtb variants in the Rv0678 gene (allele frequency as low as 3.22% according to targeted deep sequencing) in whole genome sequencing data (median coverage of 62X). In conclusion, LoFreq can successfully detect minor variant populations in Mtb NGS data, thus limiting the need for filtering of possible false positive variants due to sequencing error. The observed performance statistics can be used to determine the limit of detection in existing whole genome sequencing Mtb data and guide the required depth of future studies that aim to investigate the presence of minor variants.
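To make the relationship between depth, variant frequency and detectability concrete, the sketch below uses a plain binomial model of read sampling. It is not LoFreq's statistical model: it ignores sequencing error and base quality, and the minimum of three supporting reads is a hypothetical threshold chosen only for illustration.

```python
from math import comb


def prob_at_least_k_alt_reads(depth: int, allele_freq: float, k: int) -> float:
    """Probability of sampling at least k variant-supporting reads.

    Assumes each of `depth` reads independently carries the variant with
    probability `allele_freq` (binomial sampling, no sequencing error).
    """
    p_fewer = sum(
        comb(depth, i) * allele_freq**i * (1 - allele_freq) ** (depth - i)
        for i in range(k)
    )
    return 1.0 - p_fewer


# Figures taken from the abstract: 3.22% allele frequency, 62X median coverage.
expected_alt_reads = 62 * 0.0322  # about two supporting reads on average

# Hypothetical requirement of three supporting reads, at two depths.
p_at_62x = prob_at_least_k_alt_reads(62, 0.0322, 3)
p_at_1000x = prob_at_least_k_alt_reads(1000, 0.0322, 3)
```

Under these assumptions the chance of sampling enough supporting reads rises with depth and with variant frequency, which matches the direction of the sensitivity trend reported above.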
Subject(s)
Mycobacterium tuberculosis/genetics , Whole Genome Sequencing , Bacterial Proteins , Gene Frequency , Genotype , High-Throughput Nucleotide Sequencing , Humans , INDEL Mutation , Mutation , Single Nucleotide Polymorphism , Multidrug-Resistant Tuberculosis/microbiology
ABSTRACT
As recently demonstrated by the COVID-19 pandemic, large-scale pathogen genomic data are crucial to characterize transmission patterns of human infectious diseases. Yet, current methods to process raw sequence data into analysis-ready variants remain slow to scale, hampering rapid surveillance efforts and epidemiological investigations for disease control. Here, we introduce an accelerated, scalable, reproducible, and cost-effective framework for pathogen genomic variant identification and present an evaluation of its performance and accuracy across benchmark datasets of Plasmodium falciparum malaria genomes. We demonstrate superior performance of the GPU framework relative to standard pipelines with mean execution time and computational costs reduced by 27× and 4.6×, respectively, while delivering 99.9% accuracy at enhanced reproducibility.
Subject(s)
COVID-19 , Communicable Diseases , Malaria , COVID-19/epidemiology , COVID-19/genetics , Genomics/methods , Humans , Pandemics , Reproducibility of Results
ABSTRACT
Accurate identification of genetic variants from family child-mother-father trio sequencing data is important in genomics. However, state-of-the-art approaches treat variant calling from trios as three independent tasks, which limits their calling accuracy for Nanopore long-read sequencing data. For better trio variant calling, we introduce Clair3-Trio, the first variant caller tailored for family trio data from Nanopore long-reads. Clair3-Trio employs a Trio-to-Trio deep neural network model, which allows it to input the trio sequencing information and output all of the trio's predicted variants within a single model to improve variant calling. We also present MCVLoss, a novel loss function tailor-made for variant calling in trios, leveraging the explicit encoding of the Mendelian inheritance. Clair3-Trio showed comprehensive improvement in experiments. It predicted far fewer Mendelian inheritance violation variations than current state-of-the-art methods. We also demonstrated that our Trio-to-Trio model is more accurate than competing architectures. Clair3-Trio is accessible as a free, open-source project at https://github.com/HKU-BAL/Clair3-Trio.
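A Mendelian inheritance violation of the kind counted above can be checked with a few lines of code. The sketch below is an assumption-laden illustration, not Clair3-Trio's implementation or the MCVLoss function: it assumes diploid autosomal genotypes given as pairs of allele indices and treats any de novo mutation as a violation.

```python
from itertools import product
from typing import Tuple

Genotype = Tuple[int, int]  # two allele indices, e.g. (0, 1) for a heterozygote


def is_mendelian_consistent(child: Genotype, mother: Genotype, father: Genotype) -> bool:
    """True if the child can inherit one allele from each parent.

    Hypothetical helper for illustration; sex chromosomes, missing calls
    and de novo mutations are not handled.
    """
    child_sorted = sorted(child)
    # Try every combination of one maternal and one paternal allele.
    for maternal, paternal in product(mother, father):
        if sorted((maternal, paternal)) == child_sorted:
            return True
    return False


# A heterozygous child of two opposite homozygotes is consistent.
consistent = is_mendelian_consistent((0, 1), (0, 0), (1, 1))

# A homozygous alternate child cannot have a homozygous reference mother.
violation = not is_mendelian_consistent((1, 1), (0, 0), (0, 1))
```

Counting the sites where such a check fails gives a simple tally of Mendelian violations across a trio callset.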
Subject(s)
Nanopores , Genomics/methods , High-Throughput Nucleotide Sequencing/methods , Humans , Computer Neural Networks , DNA Sequence Analysis , Software
ABSTRACT
Macrohaplotype combines multiple types of phased DNA variants, increasing forensic discrimination power. High-quality long-sequencing reads, for example, PacBio HiFi reads, provide data to detect macrohaplotypes in multiploidy and DNA mixtures. However, the bioinformatics tools for detecting macrohaplotypes are lacking. In this study, we developed a bioinformatics software, MacroHapCaller, in which targeted loci (i.e., short TRs [STRs], single nucleotide polymorphisms, and insertion and deletions) are genotyped and combined with novel algorithms to call macrohaplotypes from long reads. MacroHapCaller uses physical phasing (i.e., read-backed phasing) to identify macrohaplotypes, and thus it can detect multi-allelic macrohaplotypes for a given sample. MacroHapCaller was validated with data generated from our designed targeted PacBio HiFi sequencing pipeline, which sequenced ~8-kb amplicon regions harboring 20 core forensic STR loci in human benchmark samples HG002 and HG003. MacroHapCaller also was validated in whole-genome long-read sequencing data. Robust and accurate genotyping and phased macrohaplotypes were obtained with MacroHapCaller compared with the known ground truth. MacroHapCaller achieved a higher or consistent genotyping accuracy and faster speed than existing tools HipSTR and DeepVar. MacroHapCaller enables efficient macrohaplotype analysis from high-throughput sequencing data and supports applications using discriminating macrohaplotypes.
Subject(s)
Haplotypes , High-Throughput Nucleotide Sequencing , Single Nucleotide Polymorphism , Polyploidy , DNA Sequence Analysis , Software , Humans , DNA Sequence Analysis/methods , High-Throughput Nucleotide Sequencing/methods , Algorithms , Computational Biology/methods , DNA/genetics , DNA/analysis , Microsatellite Repeats/genetics , Forensic Genetics/methods , Genotyping Techniques/methods
ABSTRACT
Precision oncology relies on the accurate identification of somatic mutations in cancer patients. While the sequencing of the tumoral tissue is frequently part of routine clinical care, the healthy counterparts are rarely sequenced. We previously published PipeIT, a somatic variant calling workflow specific for Ion Torrent sequencing data enclosed in a Singularity container. PipeIT combines user-friendly execution, reproducibility and reliable mutation identification, but relies on matched germline sequencing data to exclude germline variants. Expanding on the original PipeIT, here we describe PipeIT2 to address the clinical need to define somatic mutations in the absence of germline control. We show that PipeIT2 achieves a > 95% recall for variants with variant allele fraction >10%, reliably detects driver and actionable mutations and filters out most of the germline mutations and sequencing artifacts. With its performance, reproducibility, and ease of execution, PipeIT2 is a valuable addition to molecular diagnostics laboratories.
Subject(s)
Neoplasms , Humans , Neoplasms/diagnosis , Neoplasms/genetics , Molecular Pathology , Workflow , Reproducibility of Results , Precision Medicine , Mutation , High-Throughput Nucleotide Sequencing
ABSTRACT
Acute myeloid leukemia (AML) is a complex hematologic malignancy with high morbidity and mortality. Nucleophosmin 1 (NPM1) mutations occur in approximately 30% of AML cases, and NPM1-mutated AML is classified as a distinct entity. NPM1-mutated AML patients without additional genetic abnormalities have a favorable prognosis. Despite this, 30-50% of them experience relapse. This study aimed to investigate the potential of total RNAseq in improving the characterization of NPM1-mutated AML patients. We explored genetic variations independently of myeloid stratification, revealing a complex molecular scenario. We showed that total RNAseq enables the uncovering of different genetic alterations and clonal subtypes, allowing for a comprehensive evaluation of the real expression of exome transcripts in leukemic clones and the identification of aberrant fusion transcripts. This characterization may enhance understanding and guide improved treatment strategies for NPM1mut AML patients, contributing to better outcomes. Our findings underscore the complexity of NPM1-mutated AML, supporting the incorporation of advanced technologies for precise risk stratification and personalized therapeutic strategies. The study provides a foundation for future investigations into the clinical implications of identified genetic variations and highlights the importance of evolving diagnostic approaches in leukemia management.
Subject(s)
Hematologic Neoplasms , Acute Myeloid Leukemia , Humans , Clonal Cells , Exome , Acute Myeloid Leukemia/diagnosis , Acute Myeloid Leukemia/genetics , Nuclear Proteins/genetics
ABSTRACT
BACKGROUND: With the continuous advances in third-generation sequencing technology and the increasing affordability of next-generation sequencing technology, sequencing data from different sequencing technology platforms is becoming more common. While numerous benchmarking studies have been conducted to compare variant-calling performance across different platforms and approaches, little attention has been paid to the potential of leveraging the strengths of different platforms to optimize overall performance, especially integrating Oxford Nanopore and Illumina sequencing data. RESULTS: We investigated the impact of multi-platform data on the performance of variant calling through carefully designed experiments with a deep learning-based variant caller named Clair3-MP (Multi-Platform). Through our research, we not only demonstrated the capability of ONT-Illumina data for improved variant calling, but also identified the optimal scenarios for utilizing ONT-Illumina data. In addition, we revealed that the improvement in variant calling using ONT-Illumina data comes from an improvement in difficult genomic regions, such as the large low-complexity regions and segmental and collapse duplication regions. Moreover, Clair3-MP can incorporate reference genome stratification information to achieve a small but measurable improvement in variant calling. Clair3-MP is accessible as an open-source project at: https://github.com/HKU-BAL/Clair3-MP . CONCLUSIONS: These insights have important implications for researchers and practitioners alike, providing valuable guidance for improving the reliability and efficiency of genomic analysis in diverse applications.