Results 1 - 20 of 48
1.
Am J Hum Genet ; 111(5): 863-876, 2024 May 02.
Article in English | MEDLINE | ID: mdl-38565148

ABSTRACT

Copy number variants (CNVs) are significant contributors to the pathogenicity of rare genetic diseases and, with new methods, can now reliably be identified from exome sequencing. Challenges remain in the accurate classification of CNV pathogenicity. CNV calling using GATK-gCNV was performed on exomes from a cohort of 6,633 families (15,759 individuals) with heterogeneous phenotypes and variable prior genetic testing collected at the Broad Institute Center for Mendelian Genomics of the Genomics Research to Elucidate the Genetics of Rare Diseases consortium and analyzed using the seqr platform. The addition of CNV detection to exome analysis identified causal CNVs for 171 families (2.6%). The estimated sizes of CNVs ranged from 293 bp to 80 Mb. The causal CNVs consisted of 140 deletions, 15 duplications, 3 suspected complex structural variants (SVs), 3 insertions, and 10 complex SVs, the latter two groups being identified by orthogonal confirmation methods. To classify CNV pathogenicity, we used the 2020 American College of Medical Genetics and Genomics/ClinGen CNV interpretation standards and developed additional criteria to evaluate allelic and functional data as well as variants on the X chromosome to further advance the framework. We interpreted 151 CNVs as likely pathogenic/pathogenic and 20 CNVs as high-interest variants of uncertain significance. Calling CNVs from existing exome data increases the diagnostic yield for individuals undiagnosed after standard testing approaches, providing a higher-resolution alternative to arrays at a fraction of the cost of genome sequencing. Our improvements to the classification approach advance the systematic framework to assess the pathogenicity of CNVs.


Subjects
DNA Copy Number Variations , Exome Sequencing , Exome , Rare Diseases , Humans , DNA Copy Number Variations/genetics , Rare Diseases/genetics , Rare Diseases/diagnosis , Exome/genetics , Male , Female , Cohort Studies , Genetic Testing/methods
2.
BMC Biol ; 22(1): 13, 2024 Jan 25.
Article in English | MEDLINE | ID: mdl-38273258

ABSTRACT

BACKGROUND: Single-nucleotide polymorphisms (SNPs) are the most widely used form of molecular genetic variation studies. As reference genomes and resequencing data sets expand exponentially, tools must be in place to call SNPs at a similar pace. The genome analysis toolkit (GATK) is one of the most widely used SNP calling software tools publicly available, but unfortunately, high-performance computing versions of this tool have yet to become widely available and affordable. RESULTS: Here we report an open-source high-performance computing genome variant calling workflow (HPC-GVCW) for GATK that can run on multiple computing platforms from supercomputers to desktop machines. We benchmarked HPC-GVCW on multiple crop species for performance and accuracy with comparable results with previously published reports (using GATK alone). Finally, we used HPC-GVCW in production mode to call SNPs on a "subpopulation aware" 16-genome rice reference panel with ~ 3000 resequenced rice accessions. The entire process took ~ 16 weeks and resulted in the identification of an average of 27.3 M SNPs/genome and the discovery of ~ 2.3 million novel SNPs that were not present in the flagship reference genome for rice (i.e., IRGSP RefSeq). CONCLUSIONS: This study developed an open-source pipeline (HPC-GVCW) to run GATK on HPC platforms, which significantly improved the speed at which SNPs can be called. The workflow is widely applicable as demonstrated successfully for four major crop species with genomes ranging in size from 400 Mb to 2.4 Gb. Using HPC-GVCW in production mode to call SNPs on a 25 multi-crop-reference genome data set produced over 1.1 billion SNPs that were publicly released for functional and breeding studies. For rice, many novel SNPs were identified and were found to reside within genes and open chromatin regions that are predicted to have functional consequences. 
Combined, our results demonstrate the usefulness of combining a high-performance SNP calling architecture solution with a subpopulation-aware reference genome panel for rapid SNP discovery and public deployment.


Subjects
Genome, Plant , Polymorphism, Single Nucleotide , Workflow , Plant Breeding , Software , High-Throughput Nucleotide Sequencing/methods
3.
J Clin Microbiol ; 62(1): e0116123, 2024 01 17.
Article in English | MEDLINE | ID: mdl-38112529

ABSTRACT

Candida parapsilosis is a common cause of non-albicans candidemia. It can be transmitted in healthcare settings resulting in serious healthcare-associated infections and can develop drug resistance to commonly used antifungal agents. Following a significant increase in the percentage of fluconazole (FLU)-nonsusceptible isolates from sterile site specimens of patients in two Ontario acute care hospital networks, we used whole genome sequence (WGS) analysis to retrospectively investigate the genetic relatedness of isolates and to assess potential in-hospital spread. Phylogenomic analysis was conducted on all 19 FLU-resistant and seven susceptible-dose dependent (SDD) isolates from the two hospital networks, as well as 13 FLU susceptible C. parapsilosis isolates from the same facilities and 20 isolates from patients not related to the investigation. Twenty-five of 26 FLU-nonsusceptible isolates (resistant or SDD) and two susceptible isolates from the two hospital networks formed a phylogenomic cluster that was highly similar genetically and distinct from other isolates. The results suggest the presence of a persistent strain of FLU-nonsusceptible C. parapsilosis causing infections over a 5.5-year period. Results from WGS were largely comparable to microsatellite typing. Twenty-seven of 28 cluster isolates had a K143R substitution in lanosterol 14-α-demethylase (ERG11) associated with azole resistance. As the first report of a healthcare-associated outbreak of FLU-nonsusceptible C. parapsilosis in Canada, this study underscores the importance of monitoring local antimicrobial resistance trends and demonstrates the value of WGS analysis to detect and characterize clusters and outbreaks. Timely access to genomic epidemiological information can inform targeted infection control measures.


Subjects
Candida parapsilosis , Fluconazole , Humans , Fluconazole/pharmacology , Retrospective Studies , Microbial Sensitivity Tests , Drug Resistance, Fungal/genetics , Antifungal Agents/pharmacology , Antifungal Agents/therapeutic use , Genomics , Hospitals , Ontario
4.
Funct Integr Genomics ; 23(3): 227, 2023 Jul 08.
Article in English | MEDLINE | ID: mdl-37422603

ABSTRACT

Citrus is a source of nutritional and medicinal benefits and is cultivated worldwide, with major groups including sweet oranges, mandarins, grapefruits, kumquats, lemons and limes. Pakistan produces all major citrus groups, with mandarin (Citrus reticulata) being the prominent group, which includes the local commercial cultivars Feutral's Early, Dancy, Honey, and Kinnow. The present study was designed to understand the genetic architecture of the unique Citrus reticulata variety 'Kinnow.' Whole-genome resequencing and variant calling were performed to map the genomic variability that might be responsible for its particular characteristics, such as taste, seedlessness, juice content, peel thickness, and shelf-life. A total of 139,436,350 raw sequence reads were generated, yielding 20.9 Gb of data in FASTQ format with 98% effectiveness and a 0.2% base call error rate. Overall, 3,503,033 SNPs, 176,949 MNPs, 323,287 insertions (INS), and 333,083 deletions (DEL) were identified using the GATK4 variant calling pipeline against Citrus clementina. Furthermore, g:Profiler was applied to annotate the newly found variants, the genes/transcripts harboring them, and the pathways in which they are involved. A total of 73,864 transcripts harbor 4,336,352 variants; most of the observed variants were predicted in non-coding regions, and 1009 transcripts were found to be well annotated by different databases. Of the aforementioned transcripts, 588 are involved in biological processes, 234 in molecular functions, and 167 in cellular components. In summary, 18,153 high-impact variants and 216 genic variants were found in the current study, which may be used, after functional validation, in marker-assisted breeding programs of "Kinnow" to propagate its valued traits for the improvement of contemporary citrus varieties in the region.


Subjects
Citrus , Citrus/genetics , Pakistan , Plant Breeding , Genome, Plant , Sequence Analysis, DNA
5.
BMC Bioinformatics ; 22(1): 402, 2021 Aug 13.
Article in English | MEDLINE | ID: mdl-34388963

ABSTRACT

BACKGROUND: The advent of next generation sequencing has opened new avenues for basic and applied research. One application is the discovery of sequence variants causative of a phenotypic trait or a disease pathology. The computational task of detecting and annotating sequence differences of a target dataset between a reference genome is known as "variant calling". Typically, this task is computationally involved, often combining a complex chain of linked software tools. A major player in this field is the Genome Analysis Toolkit (GATK). The "GATK Best Practices" is a commonly referred recipe for variant calling. However, current computational recommendations on variant calling predominantly focus on human sequencing data and ignore ever-changing demands of high-throughput sequencing developments. Furthermore, frequent updates to such recommendations are counterintuitive to the goal of offering a standard workflow and hamper reproducibility over time. RESULTS: A workflow for automated detection of single nucleotide polymorphisms and insertion-deletions offers a wide range of applications in sequence annotation of model and non-model organisms. The introduced workflow builds on the GATK Best Practices, while enabling reproducibility over time and offering an open, generalized computational architecture. The workflow achieves parallelized data evaluation and maximizes performance of individual computational tasks. Optimized Java garbage collection and heap size settings for the GATK applications SortSam, MarkDuplicates, HaplotypeCaller, and GatherVcfs effectively cut the overall analysis time in half. CONCLUSIONS: The demand for variant calling, efficient computational processing, and standardized workflows is growing. The Open source Variant calling workFlow (OVarFlow) offers automation and reproducibility for a computationally optimized variant calling task. 
By reducing usage of computational resources, the workflow removes prior existing entry barriers to the variant calling field and enables standardized variant calling.
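The heap-size and garbage-collection tuning mentioned in this abstract can be illustrated with a small sketch. It assumes the `gatk` wrapper script and its `--java-options` argument, plus the standard JVM flags `-Xmx`, `-XX:+UseParallelGC` and `-XX:ParallelGCThreads`. The helper name `gatk_command`, the file names, and the numeric values are hypothetical placeholders and are not the settings reported by the OVarFlow authors.

```python
import shlex


def gatk_command(tool, tool_args, heap_gb, gc_threads):
    """Build a gatk invocation with an explicit heap cap and GC thread count.

    heap_gb and gc_threads are illustrative values, not published optima.
    """
    java_options = (
        f"-Xmx{heap_gb}g -XX:+UseParallelGC -XX:ParallelGCThreads={gc_threads}"
    )
    return ["gatk", "--java-options", java_options, tool, *tool_args]


# Hypothetical input and output file names.
cmd = gatk_command(
    "HaplotypeCaller",
    ["-R", "reference.fasta", "-I", "sample.bam", "-O", "sample.g.vcf.gz", "-ERC", "GVCF"],
    heap_gb=4,
    gc_threads=2,
)
print(shlex.join(cmd))
```

Capping the heap and the number of garbage-collection threads per process is what lets several such jobs share one machine without oversubscribing it; suitable values depend on the tool and the data.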


Subjects
High-Throughput Nucleotide Sequencing , Software , Genome , Humans , Polymorphism, Single Nucleotide , Reproducibility of Results , Workflow
6.
BMC Bioinformatics ; 22(1): 488, 2021 Oct 09.
Article in English | MEDLINE | ID: mdl-34627144

ABSTRACT

BACKGROUND: Population genetic studies of humans make increasing use of high-throughput sequencing in order to capture diversity in an unbiased way. There is an abundance of sequencing technologies, bioinformatic tools and the available genomes are increasing in number. Studies have evaluated and compared some of these technologies and tools, such as the Genome Analysis Toolkit (GATK) and its "Best Practices" bioinformatic pipelines. However, studies often focus on a few genomes of Eurasian origin in order to detect technical issues. We instead surveyed the use of the GATK tools and established a pipeline for processing high coverage full genomes from a diverse set of populations, including Sub-Saharan African groups, in order to reveal challenges from human diversity and stratification. RESULTS: We surveyed 29 studies using high-throughput sequencing data, and compared their strategies for data pre-processing and variant calling. We found that processing of data is very variable across studies and that the GATK "Best Practices" are seldom followed strictly. We then compared three versions of a GATK pipeline, differing in the inclusion of an indel realignment step and with a modification of the base quality score recalibration step. We applied the pipelines on a diverse set of 28 individuals. We compared the pipelines in terms of count of called variants and overlap of the callsets. We found that the pipelines resulted in similar callsets, in particular after callset filtering. We also ran one of the pipelines on a larger dataset of 179 individuals. We noted that including more individuals at the joint genotyping step resulted in different counts of variants. At the individual level, we observed that the average genome coverage was correlated to the number of variants called. 
CONCLUSIONS: We conclude that applying the GATK "Best Practices" pipeline, including their recommended reference datasets, to underrepresented populations does not lead to a decrease in the number of called variants compared to alternative pipelines. We recommend aiming for coverage of > 30X if identifying most variants is important, and working with large sample sizes at the variant calling stage, including for underrepresented individuals and populations.
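The comparison of pipelines by variant count and callset overlap described in this abstract can be sketched as simple set arithmetic. This is a minimal illustration and not the authors' code: the function name `callset_overlap`, the variant key `(chrom, pos, ref, alt)`, and the toy callsets are all assumptions.

```python
def callset_overlap(callset_a, callset_b):
    """Summarize counts and overlap between two variant callsets.

    Each variant is keyed as (chrom, pos, ref, alt); this key is an assumption.
    """
    a, b = set(callset_a), set(callset_b)
    shared = a & b
    union = a | b
    return {
        "count_a": len(a),
        "count_b": len(b),
        "shared": len(shared),
        "only_a": len(a - b),
        "only_b": len(b - a),
        # Jaccard index: shared variants over all variants seen by either pipeline.
        "jaccard": len(shared) / len(union) if union else 1.0,
    }


# Toy callsets standing in for the output of two pipeline versions.
pipeline_1 = [("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"), ("chr2", 42, "G", "GA")]
pipeline_2 = [("chr1", 100, "A", "G"), ("chr2", 42, "G", "GA"), ("chr2", 77, "T", "C")]
summary = callset_overlap(pipeline_1, pipeline_2)
print(summary)
```

In practice the keys would be read from normalized VCF files, since differently represented indels would otherwise be counted as distinct variants.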


Subjects
Genome , Polymorphism, Single Nucleotide , Computational Biology , High-Throughput Nucleotide Sequencing , Humans , INDEL Mutation
7.
BMC Genomics ; 22(1): 62, 2021 Jan 19.
Article in English | MEDLINE | ID: mdl-33468057

ABSTRACT

BACKGROUND: Next Generation Sequencing (NGS) is the foundation of various studies, providing insights into questions from biology and medicine. Nevertheless, integrating data from different experimental backgrounds can introduce strong biases. In order to methodically investigate the magnitude of systematic errors in single nucleotide variant calls, we performed a cross-sectional observational study on a genomic cohort of 99 subjects each sequenced via (i) Illumina HiSeq X, (ii) Illumina HiSeq, and (iii) Complete Genomics and processed with the respective bioinformatic pipeline. We also repeated variant calling for the Illumina cohorts with GATK, which allowed us to investigate the effect of the bioinformatics analysis strategy separately from the sequencing platform's impact. RESULTS: The number of detected variants/variant classes per individual was highly dependent on the experimental setup. We observed a statistically significant overrepresentation of variants uniquely called by a single setup, indicating potential systematic biases. Insertion/deletion polymorphisms (indels) were associated with decreased concordance compared to single nucleotide polymorphisms (SNPs). The discrepancies in indel absolute numbers were particularly prominent in introns, Alu elements, simple repeats, and regions with medium GC content. Notably, reprocessing sequencing data following the best practice recommendations of GATK considerably improved concordance between the respective setups. CONCLUSION: We provide empirical evidence of systematic heterogeneity in variant calls between alternative experimental and data analysis setups. Furthermore, our results demonstrate the benefit of reprocessing genomic data with harmonized pipelines when integrating data from different studies.


Subjects
Computational Biology , High-Throughput Nucleotide Sequencing , Cross-Sectional Studies , Genomics , Humans , Polymorphism, Single Nucleotide , Reproducibility of Results
8.
BMC Genomics ; 21(Suppl 10): 683, 2020 Nov 18.
Article in English | MEDLINE | ID: mdl-33208101

ABSTRACT

BACKGROUND: Immense improvements in sequencing technologies enable the production of large amounts of high-throughput and cost-effective next-generation sequencing (NGS) data. This data needs to be processed efficiently for further downstream analyses. Computing systems need these large amounts of data closer to the processor (with low latency) for fast and efficient processing. However, existing workflows depend heavily on disk storage and access, so processing this data incurs huge disk I/O overheads. Previously, due to the cost, volatility and other physical constraints of DRAM memory, it was not feasible to place large amounts of working data sets in memory. However, recent developments in storage-class memory and non-volatile memory technologies have enabled computing systems to place huge data in memory to process it directly from memory to avoid disk I/O bottlenecks. To exploit the benefits of such memory systems efficiently, properly formatted data placement in memory and high-throughput access to it are necessary, avoiding (de)-serialization and copy overheads between processes. For this purpose, we use the newly developed Apache Arrow, a cross-language development framework that provides a language-independent columnar in-memory data format for efficient in-memory big data analytics. This allows genomics applications developed in different programming languages to communicate in-memory without having to access disk storage, avoiding (de)-serialization and copy overheads. IMPLEMENTATION: We integrate the Apache Arrow in-memory based Sequence Alignment/Map (SAM) format and its shared memory objects store library in widely used genomics high throughput data processing applications like BWA-MEM, Picard and GATK to allow in-memory communication between these applications. In addition, this also allows us to exploit the cache locality of tabular data and parallel processing capabilities through shared memory objects. 
RESULTS: Our implementation shows that adopting in-memory SAM representation in genomics high throughput data processing applications results in better system resource utilization, low number of memory accesses due to high cache locality exploitation and parallel scalability due to shared memory objects. Our implementation focuses on the GATK best practices recommended workflows for germline analysis on whole genome sequencing (WGS) and whole exome sequencing (WES) data sets. We compare a number of existing in-memory data placing and sharing techniques like ramDisk and Unix pipes to show how columnar in-memory data representation outperforms both. We achieve a speedup of 4.85x and 4.76x for WGS and WES data, respectively, in overall execution time of variant calling workflows. Similarly, a speedup of 1.45x and 1.27x for these data sets, respectively, is achieved, as compared to the second fastest workflow. In some individual tools, particularly in sorting, duplicates removal and base quality score recalibration the speedup is even more promising. AVAILABILITY: The code and scripts used in our experiments are available in both container and repository form at: https://github.com/abs-tudelft/ArrowSAM .


Subjects
High-Throughput Nucleotide Sequencing , Software , Genomics , Whole Genome Sequencing , Workflow
9.
Proc Natl Acad Sci U S A ; 114(10): E1923-E1932, 2017 03 07.
Article in English | MEDLINE | ID: mdl-28223510

ABSTRACT

The analysis of human whole-genome sequencing data presents significant computational challenges. The sheer size of datasets places an enormous burden on computational, disk array, and network resources. Here, we present an integrated computational package, PEMapper/PECaller, that was designed specifically to minimize the burden on networks and disk arrays, create output files that are minimal in size, and run in a highly computationally efficient way, with the single goal of enabling whole-genome sequencing at scale. In addition to improved computational efficiency, we implement a statistical framework that allows for a base by base error model, allowing this package to perform as well as or better than the widely used Genome Analysis Toolkit (GATK) in all key measures of performance on human whole-genome sequences.


Subjects
Computational Biology/methods , Genome, Human/genetics , Software , Whole Genome Sequencing/methods , Algorithms , Databases, Genetic , Humans , Polymorphism, Single Nucleotide/genetics
10.
Genomics ; 111(4): 808-818, 2019 07.
Article in English | MEDLINE | ID: mdl-29857119

ABSTRACT

The Alzheimer's Disease Sequencing Project (ADSP) performed whole genome sequencing (WGS) of 584 subjects from 111 multiplex families at three sequencing centers. Genotype calling of single nucleotide variants (SNVs) and insertion-deletion variants (indels) was performed centrally using GATK-HaplotypeCaller and Atlas V2. The ADSP Quality Control (QC) Working Group applied QC protocols to project-level variant call format files (VCFs) from each pipeline, and developed and implemented a novel protocol, termed "consensus calling," to combine genotype calls from both pipelines into a single high-quality set. QC was applied to autosomal bi-allelic SNVs and indels, and included pipeline-recommended QC filters, variant-level QC, and sample-level QC. Low-quality variants or genotypes were excluded, and sample outliers were noted. Quality was assessed by examining Mendelian inconsistencies (MIs) among 67 parent-offspring pairs, and MIs were used to establish additional genotype-specific filters for GATK calls. After QC, 578 subjects remained. Pipeline-specific QC excluded ~12.0% of GATK and 14.5% of Atlas SNVs. Between pipelines, ~91% of SNV genotypes across all QCed variants were concordant; 4.23% and 4.56% of genotypes were exclusive to Atlas or GATK, respectively; the remaining ~0.01% of discordant genotypes were excluded. For indels, variant-level QC excluded ~36.8% of GATK and 35.3% of Atlas indels. Between pipelines, ~55.6% of indel genotypes were concordant; while 10.3% and 28.3% were exclusive to Atlas or GATK, respectively; and ~0.29% of discordant genotypes were excluded. The final WGS consensus dataset contains 27,896,774 SNVs and 3,133,926 indels and is publicly available.
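The two ideas in this abstract, merging genotypes from two pipelines and flagging Mendelian inconsistencies in parent-offspring pairs, can be sketched as follows. This is one reading of the abstract, not the ADSP protocol itself: concordant genotypes are kept, genotypes called by only one pipeline are assumed to be retained, and discordant genotypes are dropped. The function names and the encoding of bi-allelic genotypes as alternate-allele counts (0, 1, 2, or `None` for no call) are hypothetical.

```python
def consensus_genotype(gatk_gt, atlas_gt):
    """Combine two genotype calls for one sample at one bi-allelic site.

    Genotypes are alternate-allele counts (0, 1, 2) or None when a pipeline
    made no call. Returns the consensus genotype, or None if excluded.
    """
    if gatk_gt is None:
        return atlas_gt  # exclusive to Atlas, or no call from either pipeline
    if atlas_gt is None:
        return gatk_gt  # exclusive to GATK
    # Both pipelines called the genotype: keep it only if they agree.
    return gatk_gt if gatk_gt == atlas_gt else None


def is_mendelian_inconsistent(parent_gt, child_gt):
    """Flag a parent-offspring pair that shares no allele at a bi-allelic site.

    With one parent available, the only impossible combinations are opposite
    homozygotes: 0 alternate alleles in one and 2 in the other.
    """
    return {parent_gt, child_gt} == {0, 2}


print(consensus_genotype(1, 1), consensus_genotype(None, 2), consensus_genotype(0, 1))
print(is_mendelian_inconsistent(0, 2), is_mendelian_inconsistent(1, 2))
```

Counting such inconsistencies per genotype class is one way to decide where additional genotype-specific filters are worthwhile.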


Subjects
Alzheimer Disease/genetics , Genome-Wide Association Study/standards , Genotyping Techniques/standards , Quality Control , Whole Genome Sequencing/standards , Algorithms , Female , Genome-Wide Association Study/methods , Genotype , Genotyping Techniques/methods , Humans , Male , Polymorphism, Genetic , Whole Genome Sequencing/methods
11.
BMC Bioinformatics ; 20(1): 557, 2019 Nov 08.
Article in English | MEDLINE | ID: mdl-31703611

ABSTRACT

BACKGROUND: Use of the Genome Analysis Toolkit (GATK) continues to be the standard practice in genomic variant calling in both research and the clinic. Recently the toolkit has been rapidly evolving. Significant computational performance improvements have been introduced in GATK3.8 through collaboration with Intel in 2017. The first release of GATK4 in early 2018 revealed rewrites in the code base, as the stepping stone toward a Spark implementation. As the software continues to be a moving target for optimal deployment in highly productive environments, we present a detailed analysis of these improvements, to help the community stay abreast of changes in performance. RESULTS: We re-evaluated multiple options, such as threading, parallel garbage collection, I/O options and data-level parallelization. Additionally, we considered the trade-offs of using GATK3.8 and GATK4. We found optimized parameter values that reduce the time of executing the best practices variant calling procedure by 29.3% for GATK3.8 and 16.9% for GATK4. Further speedups can be accomplished by splitting data for parallel analysis, resulting in a run time of only a few hours on a whole human genome sequenced to a depth of 20X, for both versions of GATK. Nonetheless, GATK4 is already much more cost-effective than GATK3.8. Thanks to significant rewrites of the algorithms, the same analysis can be run largely in a single-threaded fashion, allowing users to process multiple samples on the same CPU. CONCLUSIONS: In time-sensitive situations, when a patient has a critical or rapidly developing condition, it is useful to minimize the time to process a single sample. In such cases we recommend using GATK3.8 by splitting the sample into chunks and computing across multiple nodes. The resultant walltime will be ∼3.4 hours at the cost of $41.60 on 4 c5.18xlarge instances of Amazon Cloud. 
For cost-effectiveness of routine analyses or for large population studies, it is useful to maximize the number of samples processed per unit time. Thus we recommend GATK4, running multiple samples on one node. The total walltime will be ∼34.1 hours on 40 samples, with 1.18 samples processed per hour at the cost of $2.60 per sample on c5.18xlarge instance of Amazon Cloud.
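The throughput and cost figures in this abstract can be cross-checked with a few lines of arithmetic. The sketch uses only numbers quoted above; the variable names are hypothetical, and the last step assumes that both runs were billed at the same hourly rate per instance, which the abstract does not state.

```python
# Figures quoted for the GATK4 run on a single instance.
samples = 40
walltime_hours = 34.1
cost_per_sample = 2.60

throughput = samples / walltime_hours        # samples processed per hour
total_cost = samples * cost_per_sample       # dollars for the whole batch
hourly_rate = total_cost / walltime_hours    # implied dollars per instance-hour

# Figures quoted for the GATK3.8 run split across nodes.
gatk38_cost = 41.60
gatk38_instances = 4

# Assumption: same instance type, so the same implied hourly rate applies.
implied_walltime = gatk38_cost / (gatk38_instances * hourly_rate)

print(f"throughput: {throughput:.2f} samples/hour")
print(f"implied rate: ${hourly_rate:.2f} per instance-hour")
print(f"implied GATK3.8 walltime: {implied_walltime:.2f} hours")
```

The computed throughput lands close to the 1.18 samples per hour quoted in the abstract; small differences are expected because the published figures are rounded.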


Subjects
Genomics/methods , Software , Algorithms , Chromosomes, Human/genetics , Genome, Human , Haplotypes/genetics , High-Throughput Nucleotide Sequencing , Humans
12.
BMC Genomics ; 20(1): 272, 2019 Apr 05.
Article in English | MEDLINE | ID: mdl-30952207

ABSTRACT

BACKGROUND: The interferon-induced transmembrane (IFITM) protein family comprises a class of restriction factors widely characterised in humans for their potent antiviral activity. Their biological activity is well documented in several animal species, but their genetic variation and biological mechanism are less well understood, particularly in avian species. RESULTS: Here we report the complete sequence of the domestic chicken Gallus gallus IFITM locus from a wide variety of chicken breeds to examine the detailed pattern of genetic variation of the locus on chromosome 5, including the flanking genes ATHL1 and B4GALNT4. We have generated chIFITM sequences from commercial breeds (supermarket-derived chicken breasts), indigenous chickens from Nigeria (Nsukka) and Ethiopia, European breeds and inbred chicken lines from the Pirbright Institute, totalling 206 chickens. Through mapping of genetic variants to the latest chIFITM consensus sequence, our data reveal that the chIFITM locus does not show structural variation across the populations analysed, despite spanning diverse breeds from different geographic locations. However, single nucleotide variants (SNVs) in functionally important regions of the proteins within certain groups of chickens were detected, in particular the European breeds and indigenous birds from Ethiopia and Nigeria. In addition, we also found that two out of four SNVs located in the chIFITM1 (Ser36 and Arg77) and chIFITM3 (Val103) proteins were simultaneously under positive selection. CONCLUSIONS: Together these data suggest that IFITM genetic variation may contribute to the capacities of different chicken populations to resist virus infection.


Subjects
Antigens, Differentiation/genetics , Evolution, Molecular , Genetic Loci , Genetic Markers , Polymorphism, Single Nucleotide , Selection, Genetic , Amino Acid Sequence , Animals , Chickens , Chromosome Mapping , DNA Copy Number Variations , Genome , Sequence Analysis, DNA , Sequence Homology
13.
BMC Genomics ; 20(Suppl 2): 184, 2019 Apr 04.
Article in English | MEDLINE | ID: mdl-30967111

ABSTRACT

BACKGROUND: Pairwise sequence alignment is widely used in many biological tools and applications. Existing GPU accelerated implementations mainly focus on calculating the optimal alignment score and omit identifying the optimal alignment itself. In GATK HaplotypeCaller (HC), the semi-global pairwise sequence alignment with traceback has so far been difficult to accelerate effectively on GPUs. RESULTS: We first analyze the characteristics of the semi-global alignment with traceback in GATK HC and then propose a new algorithm that allows for retrieving the optimal alignment efficiently on GPUs. For the first stage, we choose an intra-task parallelization model to calculate the position of the optimal alignment score and the backtracking matrix. Moreover, in the first stage, our GPU implementation also records the length of consecutive matches/mismatches in addition to lengths of consecutive insertions and deletions as in the CPU-based implementation. This helps efficiently retrieve the backtracking matrix to obtain the optimal alignment in the second stage. CONCLUSIONS: Experimental results show that our alignment kernel with traceback is up to 80x and 14.14x faster than its CPU counterpart with synthetic datasets and real datasets, respectively. When integrated into GATK HC (alongside a GPU accelerated pair-HMMs forward kernel), the overall acceleration is 2.3x faster than the baseline GATK HC implementation, and 1.34x faster than the GATK HC implementation with the integrated GPU-based pair-HMMs forward algorithm. Although the methods proposed in this paper are intended to improve the performance of GATK HC, they can also be used in other pairwise alignments and applications.
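The two stages described in this abstract, filling a score matrix together with a backtracking matrix and then walking the backtracking matrix to recover the alignment, can be illustrated with a textbook CPU sketch. This is not the GATK HC aligner and not the GPU algorithm from the paper. It assumes a linear gap penalty chosen for brevity, free leading and trailing gaps in the reference only, and hypothetical scoring values and names.

```python
def semi_global_align(query, ref, match=2, mismatch=-1, gap=-2):
    """Align all of `query` to a substring of `ref`, returning
    (score, ref_start, ops) where ops uses M (match/mismatch),
    I (query base with no reference base) and D (reference base skipped).
    """
    m, n = len(query), len(ref)
    # Stage 1: fill the score and backtracking matrices.
    score = [[0] * (n + 1) for _ in range(m + 1)]  # row 0 stays 0: free start in ref
    back = [[None] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        score[i][0] = i * gap
        back[i][0] = "I"
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if query[i - 1] == ref[j - 1] else mismatch
            diag = score[i - 1][j - 1] + sub
            ins = score[i - 1][j] + gap
            dele = score[i][j - 1] + gap
            best = max(diag, ins, dele)
            score[i][j] = best
            back[i][j] = "M" if best == diag else ("I" if best == ins else "D")
    # Free end in ref: the best score anywhere in the last row ends the alignment.
    end_j = max(range(n + 1), key=lambda j: score[m][j])
    # Stage 2: walk the backtracking matrix until the query is consumed.
    i, j, ops = m, end_j, []
    while i > 0:
        op = back[i][j]
        ops.append(op)
        if op == "M":
            i, j = i - 1, j - 1
        elif op == "I":
            i -= 1
        else:
            j -= 1
    return score[m][end_j], j, "".join(reversed(ops))


result = semi_global_align("ACGT", "TTACGTTT")
print(result)
```

The traceback is inherently sequential, each step depending on the previous one, which is the part the abstract describes as hard to accelerate on GPUs.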


Subjects
Algorithms , Computer Graphics , Genetic Variation , Genome, Human , Haplotypes , Sequence Alignment/methods , High-Throughput Nucleotide Sequencing , Humans , Sequence Analysis, DNA , Software
14.
BMC Genomics ; 20(1): 160, 2019 Feb 27.
Article in English | MEDLINE | ID: mdl-30813897

ABSTRACT

BACKGROUND: Single nucleotide polymorphisms (SNP) have been applied as important molecular markers in genetics and breeding studies. The rapid advance of next generation sequencing (NGS) provides a high-throughput means of SNP discovery. However, SNP development is limited by the availability of reliable SNP discovery methods. Especially, the optimum assembler and SNP caller for accurate SNP prediction from next generation sequencing data are not known. RESULTS: Herein we performed SNP prediction based on RNA-seq data of peach and mandarin peel tissue under a comprehensive comparison of two paired-end read lengths (125 bp and 150 bp), five assemblers (Trinity, IDBA, oases, SOAPdenovo, Trans-abyss) and two SNP callers (GATK and GBS). The predicted SNPs were compared with the authentic SNPs identified via PCR amplification followed by gene cloning and sequencing procedures. A total of 40 and 240 authentic SNPs were presented in five anthocyanin biosynthesis related genes in peach and in nine carotenogenic genes in mandarin. Putative SNPs predicted from the same RNA-seq data with different strategies led to quite divergent results. The rate of false positive SNPs was significantly lower when the paired-end read length was 150 bp compared with 125 bp. Trinity was superior to the other four assemblers and GATK was substantially superior to GBS due to a low rate of missing authentic SNPs. The combination of assembler Trinity, SNP caller GATK, and the paired-end read length 150 bp had the best performance in SNP discovery with 100% accuracy both in peach and in mandarin cases. This strategy was applied to the characterization of SNPs in peach and mandarin transcriptomes. 
CONCLUSIONS: Through comparison of authentic SNPs obtained by PCR cloning strategy and putative SNPs predicted from different combinations of five assemblers, two SNP callers, and two paired-end read lengths, we provided a reliable and efficient strategy, Trinity-GATK with 150 bp paired-end read length, for SNP discovery from RNA-seq data. This strategy discovered SNP at 100% accuracy in peach and mandarin cases and might be applicable to a wide range of plants and other organisms.


Subjects
Gene Expression Profiling/methods , High-Throughput Nucleotide Sequencing/methods , Polymorphism, Single Nucleotide , Sequence Analysis, RNA/methods , Citrus/genetics , Molecular Sequence Annotation , Prunus persica/genetics
15.
BMC Genomics ; 20(1): 453, 2019 Jun 03.
Article in English | MEDLINE | ID: mdl-31159724

ABSTRACT

BACKGROUND: Recent advances in genomics have greatly increased research opportunities for non-model species. For wildlife, a growing availability of reference genomes means that population genetics is no longer restricted to a small set of anonymous loci. When used in conjunction with a reference genome, reduced-representation sequencing (RRS) provides a cost-effective method for obtaining reliable diversity information for population genetics. Many software tools have been developed to process RRS data, though few studies of non-model species incorporate genome alignment in calling loci. A commonly used RRS analysis pipeline, Stacks, has this capacity and so it is timely to compare its utility with existing software originally designed for alignment and analysis of whole genome sequencing data. Here we examine population genetic inferences from two species for which reference-aligned reduced-representation data have been collected. Our two study species are a threatened Australian marsupial (Tasmanian devil Sarcophilus harrisii; declining population) and an Arctic-circle migrant bird (pink-footed goose Anser brachyrhynchus; expanding population). Analyses of these data are compared using Stacks versus two widely used genomics packages, SAMtools and GATK. We also introduce a custom R script to improve the reliability of single nucleotide polymorphism (SNP) calls in all pipelines and conduct population genetic inferences for non-model species with reference genomes. RESULTS: Although we identified orders of magnitude fewer SNPs in our devil dataset than for goose, we found remarkable symmetry between the two species in our assessment of software performance. For both datasets, all three methods were able to delineate population structure, even with varying numbers of loci. For both species, population structure inferences were influenced by the percent of missing data. CONCLUSIONS: For studies of non-model species with a reference genome, we recommend combining Stacks output with further filtering (as included in our R pipeline) for population genetic studies, paying particular attention to the potential impact of missing data thresholds. We recognise SAMtools as a viable alternative for researchers more familiar with this software. We caution against the use of GATK in studies with limited computational resources or time.


Subjects
Geese/genetics , Genome , Marsupialia/genetics , Metagenomics/methods , Metagenomics/standards , Polymorphism, Single Nucleotide , Animals , Computational Biology , High-Throughput Nucleotide Sequencing , Reference Standards , Software
16.
Arerugi ; 72(9): 1110-1112, 2023.
Article in Japanese | MEDLINE | ID: mdl-37967956
17.
BMC Genomics ; 18(1): 458, 2017 Jun 12.
Article in English | MEDLINE | ID: mdl-28606096

ABSTRACT

BACKGROUND: Cancer research to date has largely focused on somatically acquired genetic aberrations. In contrast, the degree to which germline, or inherited, variation contributes to tumorigenesis remains unclear, possibly due to a lack of accessible germline variant data. Here we called germline variants on 9618 cases from The Cancer Genome Atlas (TCGA) database representing 31 cancer types. RESULTS: We identified batch effects affecting loss of function (LOF) variant calls that can be traced back to differences in the way the sequence data were generated both within and across cancer types. Overall, LOF indel calls were more sensitive to technical artifacts than LOF Single Nucleotide Variant (SNV) calls. In particular, whole genome amplification of DNA prior to sequencing led to an artificially increased burden of LOF indel calls, which confounded association analyses relating germline variants to tumor type despite stringent indel filtering strategies. The samples affected by these technical artifacts include all acute myeloid leukemia and practically all ovarian cancer samples. CONCLUSIONS: We demonstrate how technical artifacts induced by whole genome amplification of DNA can lead to false positive germline-tumor type associations and suggest TCGA whole genome amplified samples be used with caution. This study draws attention to the need to be sensitive to problems associated with a lack of uniformity in data generation in TCGA data.


Subjects
Artifacts , Databases, Genetic , Genomics , Germ-Line Mutation , Neoplasms/genetics , Genome, Human/genetics , Humans , Loss of Function Mutation
18.
Genome ; 60(9): 743-755, 2017 Sep.
Article in English | MEDLINE | ID: mdl-28355490

ABSTRACT

The emergence of next generation sequencing has increased by several orders of magnitude the amount of data available for phylogenetics. Reduced representation approaches, such as restriction-site associated DNA sequencing (RADseq), have proven useful for phylogenetic studies of non-model species at a wide range of phylogenetic depths. However, analysis of these datasets is not uniform and we know little about the potential benefits and drawbacks of de novo assembly versus assembly by mapping to a reference genome. Using RADseq data for 83 oak samples representing 16 taxa, we identified variants via three pipelines: mapping sequence reads to a recently published draft genome of Quercus lobata, and de novo assembly under two sets of locus filters. For each pipeline, we inferred the maximum likelihood phylogeny. All pipelines produced similar trees, with minor shifts in relationships within well-supported clades, despite the fact that they yielded different numbers of loci (68 000 - 111 000 loci) and different degrees of overlap with the reference genome. We conclude that both the reference-aligned and de novo assembly pipelines yield reliable results, and that advantages and disadvantages of these approaches pertain mainly to downstream uses of RADseq data, not to phylogenetic inference per se.


Subjects
Quercus/genetics , California , DNA, Plant , Genetic Variation , Phylogeny , Quercus/classification , Sequence Analysis, DNA
19.
J Allergy Clin Immunol ; 132(3): 656-664.e17, 2013 Sep.
Article in English | MEDLINE | ID: mdl-23830146

ABSTRACT

BACKGROUND: Combined immunodeficiency with multiple intestinal atresias (CID-MIA) is a rare hereditary disease characterized by intestinal obstructions and profound immune defects. OBJECTIVE: We sought to determine the underlying genetic causes of CID-MIA by analyzing the exomic sequences of 5 patients and their healthy direct relatives from 5 unrelated families. METHODS: We performed whole-exome sequencing on 5 patients with CID-MIA and 10 healthy direct family members belonging to 5 unrelated families with CID-MIA. We also performed targeted Sanger sequencing for the candidate gene tetratricopeptide repeat domain 7A (TTC7A) on 3 additional patients with CID-MIA. RESULTS: Through analysis and comparison of the exomic sequence of the subjects from these 5 families, we identified biallelic damaging mutations in the TTC7A gene, for a total of 7 distinct mutations. Targeted TTC7A gene sequencing in 3 additional unrelated patients with CID-MIA revealed biallelic deleterious mutations in 2 of them, as well as an aberrant splice product in the third patient. Staining of normal thymus showed that the TTC7A protein is expressed in thymic epithelial cells, as well as in thymocytes. Moreover, severe lymphoid depletion was observed in the thymus and peripheral lymphoid tissues from 2 patients with CID-MIA. CONCLUSIONS: We identified deleterious mutations of the TTC7A gene in 8 unrelated patients with CID-MIA and demonstrated that the TTC7A protein is expressed in the thymus. Our results strongly suggest that TTC7A gene defects cause CID-MIA.


Subjects
Immunologic Deficiency Syndromes/genetics , Intestinal Atresia/genetics , Intestines/abnormalities , Proteins/genetics , Animals , Child, Preschool , Exome/genetics , Female , Humans , Infant , Infant, Newborn , Male , Mice , Mutation , Oligonucleotide Array Sequence Analysis , RNA, Messenger/metabolism , Thymus Gland/metabolism , Tissue Array Analysis
20.
Asian Pac J Cancer Prev ; 24(6): 2129-2134, 2023 Jun 01.
Article in English | MEDLINE | ID: mdl-37378944

ABSTRACT

BACKGROUND: The use of high-throughput genotyping techniques has enabled us to identify rare germline genetic variants with different pathogenicity and penetrance, and to understand their role in cancer predisposition. Here we report a familial cancer case from Western India. METHODS: NGS-WES was carried out in a lung cancer patient who has a family history of multiple cancers across generations, including tongue, lung, brain, cervical, urothelial, and esophageal cancer. The results were validated by data mining from available databases. I-TASSER, RasMol and PyMol were used for protein structure modelling. RESULTS: NGS-WES revealed a PPM1D c.1654C>T (p.Arg552Ter) mutation in the hotspot region of exon 6, leading to premature protein truncation and loss of the C-terminus due to the C>T substitution. This mutation was classified as a variant of uncertain significance (VUS) because of limited data on lung cancer. The three unaffected siblings of the proband did not show any pathogenic variants, and comparative analysis of the four siblings indicates 9 shared genetic variants, classified as benign as per ClinVar. CONCLUSION: Constitutional genetic alterations in PPM1D are rare across ethnic populations. This gene encodes a phosphatase that plays a role in regulating the P53 tumor suppressor pathway and the DNA damage response. Genetic alterations in the PPM1D gene may be linked to the onset of gliomas, breast cancer, and ovarian cancer in the proband's family.


Subjects
Breast Neoplasms , Lung Neoplasms , Ovarian Neoplasms , Female , Humans , Breast Neoplasms/genetics , Exons , Genetic Predisposition to Disease , Germ-Line Mutation/genetics , Lung Neoplasms/genetics , Mutation , Ovarian Neoplasms/genetics , Protein Phosphatase 2C/genetics