Results 1 - 20 of 275
1.
BMC Genomics ; 25(1): 750, 2024 Aug 01.
Article in English | MEDLINE | ID: mdl-39090567

ABSTRACT

BACKGROUND: Association testing between molecular phenotypes and genomic variants can help to understand how genotype affects phenotype. RNA sequencing provides access to molecular phenotypes such as gene expression and alternative splicing while DNA sequencing or microarray genotyping are the prevailing options to obtain genomic variants. RESULTS: We genotype variants for 74 male Braunvieh cattle from both DNA (~13-fold coverage) and deep total RNA sequencing from testis, vas deferens, and epididymis tissue (~250 million reads per tissue). We show that RNA sequencing can be used to identify approximately 40% of variants (7-10 million) called from DNA sequencing, with over 80% precision. Within highly expressed coding regions, over 92% of expected variants were called with nearly 98% precision. Allele-specific expression and putative post-transcriptional modifications negatively impact variant genotyping accuracy from RNA sequencing and contribute to RNA-DNA differences. Variants called from RNA sequencing detect roughly 75% of eGenes identified using variants called from DNA sequencing, demonstrating a nearly 2-fold enrichment of eQTL variants. We observe a moderate-to-strong correlation in nominal association p-values (Spearman ρ² ~0.6), although only 9% of eGenes have the same top associated variant. CONCLUSIONS: We find hundreds of thousands of RNA-DNA differences in variants called from RNA and DNA sequencing on the same individuals. We identify several highly significant eQTL when using RNA sequencing variant genotypes which are not found with DNA sequencing variant genotypes, suggesting that using RNA sequencing variant genotypes for association testing results in an increased number of false positives. Our findings demonstrate that caution must be exercised beyond filtering for variant quality or imputation accuracy when analysing or imputing variants called from RNA sequencing.


Subjects
Quantitative Trait Loci , Animals , Cattle/genetics , Male , DNA/genetics , Genotype , RNA Sequence Analysis , Testis/metabolism , Genetic Variation , Single Nucleotide Polymorphism , RNA/genetics , DNA Sequence Analysis
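
The precision and recall figures quoted in the abstract above (roughly 40% of DNA-called variants recovered at over 80% precision) come from comparing two call sets. The following is a minimal sketch of such a comparison, not the authors' pipeline. It assumes the DNA call set is treated as truth and that each variant is keyed by a hypothetical (chrom, pos, ref, alt) tuple; the toy data are invented for illustration.

```python
def precision_recall(truth, query):
    """Compare a query call set against a truth call set.

    Both arguments are iterables of hashable variant keys. The
    (chrom, pos, ref, alt) tuple layout used below is an assumption
    made for illustration only.
    """
    truth, query = set(truth), set(query)
    true_pos = len(truth & query)
    # Precision: fraction of query calls that are also in the truth set.
    precision = true_pos / len(query) if query else 0.0
    # Recall: fraction of truth calls that the query set recovered.
    recall = true_pos / len(truth) if truth else 0.0
    return precision, recall


# Toy data: five DNA-derived calls, three RNA-derived calls.
dna_calls = [
    ("1", 100, "A", "G"),
    ("1", 250, "C", "T"),
    ("2", 75, "G", "A"),
    ("2", 900, "T", "C"),
    ("3", 42, "A", "T"),
]
rna_calls = [
    ("1", 100, "A", "G"),
    ("2", 75, "G", "A"),
    ("3", 60, "C", "G"),  # absent from the DNA set: counted as a false positive
]

precision, recall = precision_recall(dna_calls, rna_calls)
print(f"precision={precision:.2f} recall={recall:.2f}")
# prints: precision=0.67 recall=0.40
```

Exact tuple matching is the simplest possible criterion; real comparisons usually normalise indel representation first, which this sketch does not attempt.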
2.
Appl Plant Sci ; 12(4): e11607, 2024.
Article in English | MEDLINE | ID: mdl-39184203

ABSTRACT

Advancements in genome assembly and sequencing technology have made whole genome sequence (WGS) data and reference genomes accessible to study polyploid species. Compared to popular reduced-representation sequencing approaches, the genome-wide coverage and greater marker density provided by WGS data can greatly improve our understanding of polyploid species and polyploid biology. However, biological features that make polyploid species interesting also pose challenges in read mapping, variant identification, and genotype estimation. Accounting for characteristics in variant calling like allelic dosage uncertainty, homology between subgenomes, and variance in chromosome inheritance mode can reduce errors. Here, I discuss the challenges of variant calling in polyploid WGS data and where potential solutions can be integrated into a standard variant calling pipeline.

3.
Front Genet ; 15: 1435087, 2024.
Article in English | MEDLINE | ID: mdl-39045321

ABSTRACT

Introduction: Structural Variants (SVs) are a type of variation that can significantly influence phenotypes and cause diseases. Thus, the accurate detection of SVs is a vital part of modern genetic analysis. The advent of long-read sequencing technology ushers in a new era of more accurate and comprehensive SV calling, and many tools have been developed to call SVs using long-read data. Haplotype-tagging is a procedure that can tag haplotype information on reads and can thus potentially improve the SV detection; nevertheless, few methods make use of this information. In this article, we introduce HapKled, a new SV detection tool that can accurately detect SVs from Oxford Nanopore Technologies (ONT) long-read alignment data. Methods: HapKled utilizes haplotype information underlying alignment data by conducting haplotype-tagging using Whatshap on the reads to improve the detection performance, with three unique calling mechanics including altering clustering conditions according to haplotype information of signatures, determination of similar SVs based on haplotype information, and slack filtering conditions based on haplotype quality. Results: In our evaluations, HapKled outperformed state-of-the-art tools and can deliver better SV detection results on both simulated and real sequencing data. The code and experiments of HapKled can be obtained from https://github.com/CoREse/HapKled. Discussion: With the superb SV detection performance that HapKled can deliver, HapKled could be useful in bioinformatics research, clinical diagnosis, and medical research and development.

4.
Genes (Basel) ; 15(6)2024 May 27.
Article in English | MEDLINE | ID: mdl-38927635

ABSTRACT

The integration of target capture systems with next-generation sequencing has emerged as an efficient tool for exploring specific genetic regions with a high resolution and facilitating the rapid discovery of novel alleles. Despite these advancements, the application of targeted sequencing methodologies, such as the myBaits technology, in polyploid oat species remains relatively unexplored. In this study, we utilized the myBaits target capture method offered by Daicel Arbor Biosciences to detect variants and to assess the method's reliability for variant detection in oat genomics and breeding. Ten oat genotypes were carefully chosen for targeted sequencing, focusing on specific regions on chromosome 2A to detect variants. The selected region harbors 98 genes. Precisely designed baits targeting the genes within these regions were employed for the target capture sequencing. We employed various mappers and variant callers to identify variants. After the identification of variants, we focused on the variants identified via all variant callers to assess the applicability of the myBaits sequencing methodology in oat breeding. In our efforts to validate the identified variants, we focused on two SNPs, one deletion and one insertion identified via all variant callers in the genotypes KF-318 and NOS 819111-70 but absent in the remaining eight genotypes. The Sanger sequencing of targeted SNPs failed to reproduce target capture data obtained through the myBaits technology. Similarly, the validation of deletion and insertion variants via high-resolution melting (HRM) curve analysis also failed to reproduce target capture data, again suggesting limitations in the reliability of the myBaits target capture sequencing using short-read sequencing for variant detection in the oat genome. This study sheds light on the importance of exercising caution when employing the myBaits target capture strategy for variant detection in oats. This study provides valuable insights for breeders seeking to advance oat breeding efforts and marker development using myBaits target capture sequencing, emphasizing the significance of methodological sequencing considerations in oat genomics research.


Subjects
Avena , High-Throughput Nucleotide Sequencing , Plant Breeding , Single Nucleotide Polymorphism , Avena/genetics , High-Throughput Nucleotide Sequencing/methods , Plant Breeding/methods , Single Nucleotide Polymorphism/genetics , Plant Genome/genetics , Genomics/methods , Genotype , DNA Sequence Analysis/methods
5.
BMC Genomics ; 25(1): 647, 2024 Jun 28.
Article in English | MEDLINE | ID: mdl-38943066

ABSTRACT

BACKGROUND: At a global scale, the SARS-CoV-2 virus did not remain in its initial genotype for a long period of time, with the first global reports of variants of concern (VOCs) in late 2020. Subsequently, genome sequencing has become an indispensable tool for characterizing the ongoing pandemic, particularly for typing SARS-CoV-2 samples obtained from patients or environmental surveillance. For such SARS-CoV-2 typing, various in vitro and in silico workflows exist, yet to date, no systematic cross-platform validation has been reported. RESULTS: In this work, we present the first comprehensive cross-platform evaluation and validation of in silico SARS-CoV-2 typing workflows. The evaluation relies on a dataset of 54 patient-derived samples sequenced with several different in vitro approaches on all relevant state-of-the-art sequencing platforms. Moreover, we present UnCoVar, a robust, production-grade reproducible SARS-CoV-2 typing workflow that outperforms all other tested approaches in terms of precision and recall. CONCLUSIONS: In many ways, the SARS-CoV-2 pandemic has accelerated the development of techniques and analytical approaches. We believe that this can serve as a blueprint for dealing with future pandemics. Accordingly, UnCoVar is easily generalizable towards other viral pathogens and future pandemics. The fully automated workflow assembles virus genomes from patient samples, identifies existing lineages, and provides high-resolution insights into individual mutations. UnCoVar includes extensive quality control and automatically generates interactive visual reports. UnCoVar is implemented as a Snakemake workflow. The open-source code is available under a BSD 2-clause license at github.com/IKIM-Essen/uncovar.


Subjects
COVID-19 , Viral Genome , SARS-CoV-2 , Workflow , SARS-CoV-2/genetics , Humans , COVID-19/virology , COVID-19/epidemiology , Software , Reproducibility of Results
6.
J Alzheimers Dis Rep ; 8(1): 575-587, 2024.
Article in English | MEDLINE | ID: mdl-38746629

ABSTRACT

Background: Mitochondrial DNA (mtDNA) is a double-stranded circular DNA and has multiple copies in each cell. Excess heteroplasmy, the coexistence of distinct variants in copies of mtDNA within a cell, may lead to mitochondrial impairments. Accurate determination of heteroplasmy in whole-genome sequencing (WGS) data has posed a significant challenge because mitochondria carrying heteroplasmic variants cannot be distinguished during library preparation. Moreover, sequencing errors, contamination, and nuclear mtDNA segments can reduce the accuracy of heteroplasmic variant calling. Objective: To efficiently and accurately call mtDNA homoplasmic and heteroplasmic variants from the large-scale WGS data generated from the Alzheimer's Disease Sequencing Project (ADSP), and test their association with Alzheimer's disease (AD). Methods: In this study, we present MitoH3, a comprehensive computational pipeline for calling mtDNA homoplasmic and heteroplasmic variants and inferring haplogroups in the ADSP WGS data. We first applied MitoH3 to 45 technical replicates from 6 subjects to define a threshold for detecting heteroplasmic variants. Then using the threshold of 5% ≤ variant allele fraction ≤ 95%, we further applied MitoH3 to call heteroplasmic variants from a total of 16,113 DNA samples with 6,742 samples from cognitively normal controls and 6,183 from AD cases. Results: This pipeline is available through the Singularity container engine. For 4,311 heteroplasmic variants identified from 16,113 samples, no significant variant count difference was observed between AD cases and controls. Conclusions: Our streamlined pipeline, MitoH3, enables computationally efficient and accurate analysis of a large number of samples.
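
The heteroplasmy threshold stated in the Methods above (5% ≤ variant allele fraction ≤ 95%) can be read as a simple classification rule on read counts. The sketch below is only an illustration of that rule and is not taken from MitoH3. The function name, the labels, and the choice to call sites above the upper bound "homoplasmic" and sites below the lower bound "reference" are all assumptions.

```python
def classify_site(alt_reads, total_reads, low=0.05, high=0.95):
    """Label an mtDNA site by its variant allele fraction (VAF).

    The 5% and 95% bounds follow the threshold quoted in the abstract.
    The labels for VAF outside that window are assumptions made for
    illustration, not the pipeline's documented behaviour.
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    vaf = alt_reads / total_reads
    if vaf < low:
        return "reference"      # too few alternate reads to call a variant
    if vaf > high:
        return "homoplasmic"    # nearly all copies carry the variant
    return "heteroplasmic"      # mixed population of mtDNA copies


# Hypothetical read counts at 1,000x depth.
for alt in (30, 500, 990):
    print(alt, classify_site(alt, 1000))
```

A fixed window like this trades sensitivity for robustness: variants present in fewer than 5% of copies are not reported, which limits false calls from sequencing error.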

7.
BMC Bioinformatics ; 25(1): 180, 2024 May 08.
Article in English | MEDLINE | ID: mdl-38720249

ABSTRACT

BACKGROUND: High-throughput sequencing (HTS) has become the gold standard approach for variant analysis in cancer research. However, somatic variants may occur at low fractions due to contamination from normal cells or tumor heterogeneity; this poses a significant challenge for standard HTS analysis pipelines. The problem is exacerbated in scenarios with minimal tumor DNA, such as circulating tumor DNA in plasma. Assessing sensitivity and detection of HTS approaches in such cases is paramount, but time-consuming and expensive: specialized experimental protocols and a sufficient quantity of samples are required for processing and analysis. To overcome these limitations, we propose a new computational approach specifically designed for the generation of artificial datasets suitable for this task, simulating ultra-deep targeted sequencing data with low-fraction variants and demonstrating their effectiveness in benchmarking low-fraction variant calling. RESULTS: Our approach enables the generation of artificial raw reads that mimic real data without relying on pre-existing data by using NEAT, a fine-grained read simulator that generates artificial datasets using models learned from multiple different datasets. Then, it incorporates low-fraction variants to simulate somatic mutations in samples with minimal tumor DNA content. To prove the suitability of the created artificial datasets for low-fraction variant calling benchmarking, we used them as ground truth to evaluate the performance of widely-used variant calling algorithms: they allowed us to define tuned parameter values of major variant callers, considerably improving their detection of very low-fraction variants. 
CONCLUSIONS: Our findings highlight both the pivotal role of our approach in creating adequate artificial datasets with low tumor fraction, facilitating rapid prototyping and benchmarking of algorithms for such dataset type, as well as the important need of advancing low-fraction variant calling techniques.


Subjects
Benchmarking , High-Throughput Nucleotide Sequencing , Neoplasms , High-Throughput Nucleotide Sequencing/methods , Humans , Neoplasms/genetics , Mutation , Algorithms , Neoplasm DNA/genetics , DNA Sequence Analysis/methods , Computational Biology/methods
8.
Curr Protoc ; 4(5): e1046, 2024 May.
Article in English | MEDLINE | ID: mdl-38717471

ABSTRACT

Whole-genome sequencing is widely used to investigate population genomic variation in organisms of interest. Assorted tools have been independently developed to call variants from short-read sequencing data aligned to a reference genome, including single nucleotide polymorphisms (SNPs) and structural variations (SVs). We developed SNP-SVant, an integrated, flexible, and computationally efficient bioinformatic workflow that predicts high-confidence SNPs and SVs in organisms without benchmarked variants, which are traditionally used for distinguishing sequencing errors from real variants. In the absence of these benchmarked datasets, we leverage multiple rounds of statistical recalibration to increase the precision of variant prediction. The SNP-SVant workflow is flexible, with user options to trade off accuracy for sensitivity. The workflow predicts SNPs and small insertions and deletions using the Genome Analysis ToolKit (GATK) and predicts SVs using the Genome Rearrangement IDentification Software Suite (GRIDSS), and it culminates in variant annotation using custom scripts. A key utility of SNP-SVant is its scalability. Variant calling is a computationally expensive procedure, and thus, SNP-SVant uses a workflow management system with intermediary checkpoint steps to ensure efficient use of resources by minimizing redundant computations and omitting steps where dependent files are available. SNP-SVant also provides metrics to assess the quality of called variants and converts between VCF and aligned FASTA format outputs to ensure compatibility with downstream tools to calculate selection statistics, which are commonplace in population genomics studies. By accounting for both small and large structural variants, users of this workflow can obtain a wide-ranging view of genomic alterations in an organism of interest. Overall, this workflow advances our capabilities in assessing the functional consequences of different types of genomic alterations, ultimately improving our ability to associate genotypes with phenotypes. © 2024 The Authors. Current Protocols published by Wiley Periodicals LLC.
Basic Protocol: Predicting single nucleotide polymorphisms and structural variations
Support Protocol 1: Downloading publicly available sequencing data
Support Protocol 2: Visualizing variant loci using Integrated Genome Viewer
Support Protocol 3: Converting between VCF and aligned FASTA formats


Subjects
Single Nucleotide Polymorphism , Software , Workflow , Single Nucleotide Polymorphism/genetics , Computational Biology/methods , Genomics/methods , Molecular Sequence Annotation/methods , Whole Genome Sequencing/methods
9.
Brief Bioinform ; 25(3)2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38632951

ABSTRACT

In cancer genomics, variant calling has advanced, but traditional mean accuracy evaluations are inadequate for biomarkers like tumor mutation burden, which vary significantly across samples, affecting immunotherapy patient selection and threshold settings. In this study, we introduce TMBstable, an innovative method that dynamically selects optimal variant calling strategies for specific genomic regions using a meta-learning framework, distinguishing it from traditional callers with uniform sample-wide strategies. The process begins with segmenting the sample into windows and extracting meta-features for clustering, followed by using a pre-trained meta-model to select suitable algorithms for each cluster, thereby addressing strategy-sample mismatches, reducing performance fluctuations and ensuring consistent performance across various samples. We evaluated TMBstable using both simulated and real non-small cell lung cancer and nasopharyngeal carcinoma samples, comparing it with advanced callers. The assessment, focusing on stability measures, such as the variance and coefficient of variation in false positive rate, false negative rate, precision and recall, involved 300 simulated and 106 real tumor samples. Benchmark results showed TMBstable's superior stability with the lowest variance and coefficient of variation across performance metrics, highlighting its effectiveness in analyzing the counting-based biomarker. The TMBstable algorithm can be accessed at https://github.com/hello-json/TMBstable for academic usage only.


Subjects
Non-Small-Cell Lung Carcinoma , Lung Neoplasms , Humans , High-Throughput Nucleotide Sequencing/methods , Genomics/methods , Genome , Algorithms
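
The abstract above argues that mean accuracy hides sample-to-sample fluctuation and that stability should be measured by the variance and coefficient of variation of each metric. The sketch below illustrates that point only; it does not reproduce TMBstable. The caller names and per-sample precision values are invented, and the use of population (rather than sample) variance is an assumption, since the abstract does not say which was used.

```python
from statistics import mean, pstdev, pvariance


def stability(values):
    """Return (variance, coefficient of variation) for per-sample metric values.

    Population variance and standard deviation are used here; that
    choice is an assumption made for illustration.
    """
    m = mean(values)
    var = pvariance(values)
    cv = pstdev(values) / m if m else float("nan")
    return var, cv


# Hypothetical per-sample precision for two callers with the same mean (0.90).
per_sample_precision = {
    "caller_a": [0.90, 0.92, 0.88, 0.90],
    "caller_b": [0.99, 0.70, 0.95, 0.96],
}

for name, values in per_sample_precision.items():
    var, cv = stability(values)
    print(f"{name}: mean={mean(values):.2f} variance={var:.5f} cv={cv:.3f}")
```

Both hypothetical callers average 0.90 precision, yet the second has a far larger variance, which is the kind of difference a mean-only evaluation would miss.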
10.
Int J Mol Sci ; 25(7)2024 Mar 24.
Article in English | MEDLINE | ID: mdl-38612443

ABSTRACT

Acute myeloid leukemia (AML) is a complex hematologic malignancy with high morbidity and mortality. Nucleophosmin 1 (NPM1) mutations occur in approximately 30% of AML cases, and NPM1-mutated AML is classified as a distinct entity. NPM1-mutated AML patients without additional genetic abnormalities have a favorable prognosis. Despite this, 30-50% of them experience relapse. This study aimed to investigate the potential of total RNAseq in improving the characterization of NPM1-mutated AML patients. We explored genetic variations independently of myeloid stratification, revealing a complex molecular scenario. We showed that total RNAseq enables the uncovering of different genetic alterations and clonal subtypes, allowing for a comprehensive evaluation of the real expression of exome transcripts in leukemic clones and the identification of aberrant fusion transcripts. This characterization may enhance understanding and guide improved treatment strategies for NPM1mut AML patients, contributing to better outcomes. Our findings underscore the complexity of NPM1-mutated AML, supporting the incorporation of advanced technologies for precise risk stratification and personalized therapeutic strategies. The study provides a foundation for future investigations into the clinical implications of identified genetic variations and highlights the importance of evolving diagnostic approaches in leukemia management.


Subjects
Hematologic Neoplasms , Acute Myeloid Leukemia , Humans , Clone Cells , Exome , Acute Myeloid Leukemia/diagnosis , Acute Myeloid Leukemia/genetics , Nuclear Proteins/genetics
11.
Viruses ; 16(3)2024 Mar 11.
Article in English | MEDLINE | ID: mdl-38543795

ABSTRACT

Genomic sequencing of clinical samples to identify emerging variants of SARS-CoV-2 has been a key public health tool for curbing the spread of the virus. As a result, an unprecedented number of SARS-CoV-2 genomes were sequenced during the COVID-19 pandemic, which allowed for rapid identification of genetic variants, enabling the timely design and testing of therapies and deployment of new vaccine formulations to combat the new variants. However, despite the technological advances of deep sequencing, the analysis of the raw sequence data generated globally is neither standardized nor consistent, leading to vastly disparate sequences that may impact identification of variants. Here, we show that for both Illumina and Oxford Nanopore sequencing platforms, downstream bioinformatic protocols used by industry, government, and academic groups resulted in different virus sequences from the same sample. These bioinformatic workflows produced consensus genomes with differences in single nucleotide polymorphisms and in the inclusion or exclusion of insertions and/or deletions, despite using the same raw sequence as input datasets. Here, we compared and characterized such discrepancies and propose a specific suite of parameters and protocols that should be adopted across the field. Consistent results from bioinformatic workflows are fundamental to SARS-CoV-2 and future pathogen surveillance efforts, including pandemic preparation, to allow for a data-driven and timely public health response.


Subjects
COVID-19 , SARS-CoV-2 , Humans , SARS-CoV-2/genetics , COVID-19/epidemiology , Pandemics , Workflow , Computational Biology
12.
Brief Bioinform ; 25(2)2024 Jan 22.
Article in English | MEDLINE | ID: mdl-38385878

ABSTRACT

Structural Variants (SVs) are a crucial type of genetic variant that can significantly impact phenotypes. Therefore, the identification of SVs is an essential part of modern genomic analysis. In this article, we present kled, an ultra-fast and sensitive SV caller for long-read sequencing data given the specially designed approach with a novel signature-merging algorithm, custom refinement strategies and a high-performance program structure. The evaluation results demonstrate that kled can achieve optimal SV calling compared to several state-of-the-art methods on simulated and real long-read data for different platforms and sequencing depths. Furthermore, kled excels at rapid SV calling and can efficiently utilize multiple Central Processing Unit (CPU) cores while maintaining low memory usage. The source code for kled can be obtained from https://github.com/CoREse/kled.


Subjects
Algorithms , Genomics , Phenotype , Software
13.
Brief Funct Genomics ; 23(4): 303-313, 2024 Jul 19.
Article in English | MEDLINE | ID: mdl-38366908

ABSTRACT

Genome sequencing data have become increasingly important in the field of personalized medicine and diagnosis. However, accurately detecting genomic variations remains a challenging task. Traditional variation detection methods rely on manual inspection or predefined rules, which can be time-consuming and prone to errors. Consequently, deep learning-based approaches for variation detection have gained attention due to their ability to automatically learn genomic features that distinguish between variants. In our review, we discuss the recent advancements in deep learning-based algorithms for detecting small variations and structural variations in genomic data, as well as their advantages and limitations.


Subjects
Deep Learning , Humans , Genetic Variation , Genomics/methods , Algorithms
14.
BMC Plant Biol ; 24(1): 88, 2024 Feb 06.
Article in English | MEDLINE | ID: mdl-38317087

ABSTRACT

Mounting evidence recognizes structural variations (SVs) and repetitive DNA sequences as crucial players in shaping the existing grape phenotypic diversity at intra- and inter-species levels. To deepen our understanding of the abundance, diversity, and distribution of SVs and repetitive DNAs, including transposable elements (TEs) and tandemly repeated satellite DNA (satDNAs), we re-sequenced the genomes of the ancient grapes Aglianico and Falanghina. The analysis of large copy number variants (CNVs) detected candidate polymorphic genes that are involved in the enological features of these varieties. In a comparative analysis of Aglianico and Falanghina sequences with 21 publicly available genomes of cultivated grapes, we provided a genome-wide annotation of grape TEs at the lineage level. We disclosed that at least two main clusters of grape cultivars could be identified based on TE content. Multiple TE families appeared either significantly enriched or depleted. In addition, in silico and cytological analyses provided evidence for a diverse chromosomal distribution of several satellite repeats between Aglianico, Falanghina, and other grapes. Overall, our data further improved our understanding of the intricate grape diversity held by two Italian traditional varieties, unveiling a pool of unique candidate genes not yet exploited in breeding for improved fruit quality.


Subjects
Vitis , Humans , Vitis/genetics , Plant Breeding , DNA Transposable Elements/genetics , Satellite DNA
15.
Brief Bioinform ; 25(2)2024 Jan 22.
Article in English | MEDLINE | ID: mdl-38271481

ABSTRACT

Next-generation sequencing (NGS) has revolutionized the field of rare disease diagnostics. Whole exome and whole genome sequencing are now routinely used for diagnostic purposes; however, the overall diagnosis rate remains lower than expected. In this work, we review current approaches used for calling and interpretation of germline genetic variants in the human genome, and discuss the most important challenges that persist in the bioinformatic analysis of NGS data in medical genetics. We describe and attempt to quantitatively assess the remaining problems, such as the quality of the reference genome sequence, reproducible coverage biases, or variant calling accuracy in complex regions of the genome. We also discuss the prospects of switching to the complete human genome assembly or the human pan-genome and important caveats associated with such a switch. We touch on arguably the hardest problem of NGS data analysis for medical genomics, namely, the annotation of genetic variants and their subsequent interpretation. We highlight the most challenging aspects of annotation and prioritization of both coding and non-coding variants. Finally, we demonstrate the persistent prevalence of pathogenic variants in the coding genome, and outline research directions that may enhance the efficiency of NGS-based disease diagnostics.


Subjects
Computational Biology , Rare Diseases , Humans , Rare Diseases/diagnosis , Rare Diseases/genetics , Genomics , Human Genome , Germ Cells , High-Throughput Nucleotide Sequencing
16.
Electrophoresis ; 45(9-10): 877-884, 2024 May.
Article in English | MEDLINE | ID: mdl-38196015

ABSTRACT

A macrohaplotype combines multiple types of phased DNA variants, increasing forensic discrimination power. High-quality long-sequencing reads, for example, PacBio HiFi reads, provide data to detect macrohaplotypes in multiploidy and DNA mixtures. However, the bioinformatics tools for detecting macrohaplotypes are lacking. In this study, we developed a bioinformatics software, MacroHapCaller, in which targeted loci (i.e., short tandem repeats [STRs], single nucleotide polymorphisms, and insertions and deletions) are genotyped and combined with novel algorithms to call macrohaplotypes from long reads. MacroHapCaller uses physical phasing (i.e., read-backed phasing) to identify macrohaplotypes, and thus it can detect multi-allelic macrohaplotypes for a given sample. MacroHapCaller was validated with data generated from our designed targeted PacBio HiFi sequencing pipeline, which sequenced ∼8-kb amplicon regions harboring 20 core forensic STR loci in human benchmark samples HG002 and HG003. MacroHapCaller also was validated in whole-genome long-read sequencing data. Robust and accurate genotyping and phased macrohaplotypes were obtained with MacroHapCaller compared with the known ground truth. MacroHapCaller achieved a higher or consistent genotyping accuracy and faster speed than existing tools HipSTR and DeepVar. MacroHapCaller enables efficient macrohaplotype analysis from high-throughput sequencing data and supports applications using discriminating macrohaplotypes.


Subjects
Haplotypes , High-Throughput Nucleotide Sequencing , Single Nucleotide Polymorphism , Polyploidy , DNA Sequence Analysis , Software , Humans , DNA Sequence Analysis/methods , High-Throughput Nucleotide Sequencing/methods , Algorithms , Computational Biology/methods , DNA/genetics , DNA/analysis , Microsatellite Repeats/genetics , Forensic Genetics/methods , Genotyping Techniques/methods
17.
BMC Genomics ; 25(1): 115, 2024 Jan 26.
Article in English | MEDLINE | ID: mdl-38279154

ABSTRACT

BACKGROUND: Short tandem repeats (STRs) are widely distributed across the human genome and are associated with numerous neurological disorders. However, the extent to which STRs contribute to disease is likely under-estimated because of the challenges of calling these variants in short-read next-generation sequencing data. Several computational tools have been developed for STR variant calling, but none fully address all of the complexities associated with this variant class. RESULTS: Here we introduce LUSTR, which is designed to address some of the challenges associated with STR variant calling by enabling more flexibility in defining STR loci, allowing for customizable modules to tailor analyses, and expanding the capability to call somatic and multiallelic STR variants. LUSTR is a user-friendly and easily customizable tool for targeted or unbiased genome-wide STR variant screening that can use either predefined or novel genome builds. Using both simulated and real data sets, we demonstrated that LUSTR accurately infers germline and somatic STR expansions in individuals with and without diseases. CONCLUSIONS: LUSTR offers a powerful and user-friendly approach that allows for the identification of STR variants and can facilitate more comprehensive studies evaluating the role of pathogenic STR variants across human diseases.


Subjects
Human Genome , Microsatellite Repeats , Humans , Microsatellite Repeats/genetics , Germ Cells , High-Throughput Nucleotide Sequencing
18.
BMC Bioinformatics ; 24(1): 472, 2023 Dec 14.
Article in English | MEDLINE | ID: mdl-38097928

ABSTRACT

BACKGROUND: The accurate detection of variants is essential for genomics-based studies. Currently, there are various tools designed to detect genomic variants; however, it has always been a challenge to decide which tool to use, especially when various major genome projects have chosen to use different tools. Thus far, most of the existing tools were mainly developed to work on short-read data (i.e., Illumina); however, other sequencing technologies (e.g., PacBio and Oxford Nanopore) have recently shown that they can also be used for variant calling. In addition, with the emergence of artificial intelligence (AI)-based variant calling tools, there is a pressing need to compare these tools in terms of efficiency, accuracy, computational power, and ease of use. RESULTS: In this study, we evaluated five of the most widely used conventional and AI-based variant calling tools (BCFTools, GATK4, Platypus, DNAscope, and DeepVariant) in terms of accuracy and computational cost using both short-read and long-read data derived from three different sequencing technologies (Illumina, PacBio HiFi, and ONT) for the same set of samples from the Genome In A Bottle project. The analysis showed that AI-based variant calling tools supersede conventional ones for calling SNVs and INDELs using both long and short reads in most aspects. In addition, we demonstrate the advantages and drawbacks of each tool while ranking them in each aspect of these comparisons. CONCLUSION: This study provides best practices for variant calling using AI-based and conventional variant callers with different types of sequencing data.


Subjects
Artificial Intelligence , Software , Sequence Analysis, DNA/methods , High-Throughput Nucleotide Sequencing/methods , Genomics/methods
19.
Front Genet ; 14: 1277784, 2023.
Article in English | MEDLINE | ID: mdl-38155715

ABSTRACT

Exome sequencing (ES) is a recommended first-tier diagnostic test for many rare monogenic diseases. It allows for the detection of both single-nucleotide variants (SNVs) and copy number variants (CNVs) in coding exonic regions of the genome in a single test, and this dual analysis is a valuable approach, especially in limited-resource settings. Single-nucleotide variants are well studied; however, copy number variant analysis tools have not yet been incorporated into variant calling pipelines as a routine diagnostic test, and chromosomal microarray is still more widely used to detect copy number variants. Research shows that combined single-nucleotide and copy number variant analysis can lead to a diagnostic yield of up to 58%, increasing the yield by as much as 18% over the single-nucleotide-variant-only pipeline. Importantly, this is achieved at the expense of computational costs only, without incurring any additional sequencing costs. This mini review provides an overview of copy number variant analysis from exome data and of the current recommendations for this type of analysis. We also present an overview of standard practices in rare monogenic disease research in resource-limited settings. We present evidence that integrating copy number variant detection tools into a standard exome sequencing analysis pipeline improves diagnostic yield and should be considered a significantly beneficial addition, with relatively low cost implications. Routine implementation in underrepresented populations and limited-resource settings will promote generation and sharing of CNV datasets and provide momentum to build core centers for this niche within genomic medicine.

20.
BMC Bioinformatics ; 24(1): 424, 2023 Nov 08.
Article in English | MEDLINE | ID: mdl-37940870

ABSTRACT

BACKGROUND: Processing raw genomic data for downstream applications such as imputation, association studies, and modeling requires numerous third-party bioinformatics software tools. This processing is highly time-consuming and resource-intensive, and its computational demands and storage limitations pose significant challenges that increase cost. Using software tools independently of one another, in a disjointed stepwise fashion, increases the difficulty and leads to higher error rates because of fragmented job executions in alignment, variant calling, and/or build conversion complications. As sequencing data availability grows, the ability for biologists to process it using stable, automated, and reproducible workflows is paramount, as it significantly reduces the time needed to generate clean and reliable data. RESULTS: The Iliad suite of genomic data workflows was developed to provide users with seamless file transitions from raw genomic data to a quality-controlled variant call format (VCF) file for downstream applications. Iliad benefits from the efficiency of the Snakemake best practices framework coupled with Singularity and Docker containers for repeatability, portability, and ease of installation. The workflow runs from the initial download of any raw data type (FASTQ, CRAM, IDAT) straight through to the generation of a clean merged data file that can combine any user-preferred datasets, using robust programs such as BWA, Samtools, and BCFtools. Users can customize and direct their workflow with one straightforward configuration file. Iliad is compatible with Linux, MacOS, and Windows platforms and is scalable from a local machine to a high-performance computing cluster. CONCLUSION: Iliad offers automated workflows with optimized time and resource management that are comparable to other available workflows, but it generates analysis-ready VCF files from the most common data types using a single command. The storage footprint challenge of genomic data is overcome by using temporary intermediate files before the final VCF is generated. This file is ready for use in imputation, genome-wide association study (GWAS) pipelines, high-throughput population genetics studies, select gene candidate studies, and more. Iliad was developed to be portable, compatible, scalable, robust, and repeatable with a simple setup, so biologists who are less familiar with programming can manage their own big data with this open-source suite of workflows.


Subjects
Genome-Wide Association Study , Genomics , Workflow , Computational Biology , Software