Search | BVS (Virtual Health Library) Search Portal

1.

Exome copy number variant detection, analysis, and classification in a large cohort of families with undiagnosed rare genetic disease.

Lemire, Gabrielle; Sanchis-Juan, Alba; Russell, Kathryn; Baxter, Samantha; Chao, Katherine R; Singer-Berk, Moriel; Groopman, Emily; Wong, Isaac; England, Eleina; Goodrich, Julia; Pais, Lynn; Austin-Tse, Christina; DiTroia, Stephanie; O'Heir, Emily; Ganesh, Vijay S; Wojcik, Monica H; Evangelista, Emily; Snow, Hana; Osei-Owusu, Ikeoluwa; Fu, Jack; Singh, Mugdha; Mostovoy, Yulia; Huang, Steve; Garimella, Kiran; Kirkham, Samantha L; Neil, Jennifer E; Shao, Diane D; Walsh, Christopher A; Argilli, Emanuela; Le, Carolyn; Sherr, Elliott H; Gleeson, Joseph G; Shril, Shirlee; Schneider, Ronen; Hildebrandt, Friedhelm; Sankaran, Vijay G; Madden, Jill A; Genetti, Casie A; Beggs, Alan H; Agrawal, Pankaj B; Bujakowska, Kinga M; Place, Emily; Pierce, Eric A; Donkervoort, Sandra; Bönnemann, Carsten G; Gallacher, Lyndon; Stark, Zornitza; Tan, Tiong Yang; White, Susan M; Töpf, Ana.

Am J Hum Genet ; 111(5): 863-876, 2024 05 02.

Article in English | MEDLINE | ID: mdl-38565148

ABSTRACT

Copy number variants (CNVs) are significant contributors to the pathogenicity of rare genetic diseases and, with new methods, can now reliably be identified from exome sequencing. Challenges remain in accurate classification of CNV pathogenicity. CNV calling using GATK-gCNV was performed on exomes from a cohort of 6,633 families (15,759 individuals) with heterogeneous phenotypes and variable prior genetic testing collected at the Broad Institute Center for Mendelian Genomics of the Genomics Research to Elucidate the Genetics of Rare Diseases consortium and analyzed using the seqr platform. The addition of CNV detection to exome analysis identified causal CNVs for 171 families (2.6%). The estimated sizes of CNVs ranged from 293 bp to 80 Mb. The causal CNVs consisted of 140 deletions, 15 duplications, 3 suspected complex structural variants (SVs), 3 insertions, and 10 complex SVs, the latter two groups being identified by orthogonal confirmation methods. To classify CNV variant pathogenicity, we used the 2020 American College of Medical Genetics and Genomics/ClinGen CNV interpretation standards and developed additional criteria to evaluate allelic and functional data as well as variants on the X chromosome to further advance the framework. We interpreted 151 CNVs as likely pathogenic/pathogenic and 20 CNVs as high-interest variants of uncertain significance. Calling CNVs from existing exome data increases the diagnostic yield for individuals undiagnosed after standard testing approaches, providing a higher-resolution alternative to arrays at a fraction of the cost of genome sequencing. Our improvements to the classification approach advance the systematic framework to assess the pathogenicity of CNVs.
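The counts quoted in this abstract can be cross-checked with a few lines of arithmetic. The sketch below is only a reader's sanity check in Python; the variable names are illustrative and the numbers are copied from the abstract, not taken from the study's data files.

```python
# Figures copied from the abstract above; names are illustrative only.
families_total = 6633
families_with_causal_cnv = 171

cnv_types = {
    "deletion": 140,
    "duplication": 15,
    "suspected complex SV": 3,
    "insertion": 3,
    "complex SV": 10,
}
classification = {
    "likely pathogenic/pathogenic": 151,
    "high-interest VUS": 20,
}

# Diagnostic yield added by CNV calling, as a percentage of families.
diagnostic_yield_pct = 100 * families_with_causal_cnv / families_total

print(f"yield: {diagnostic_yield_pct:.1f}%")
print("CNV types sum:", sum(cnv_types.values()))
print("classifications sum:", sum(classification.values()))
```

Both breakdowns sum to the 171 causal CNVs, and the yield rounds to the reported 2.6%.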

Subject(s)

DNA Copy Number Variations; Exome Sequencing; Exome; Rare Diseases; Humans; DNA Copy Number Variations/genetics; Rare Diseases/genetics; Rare Diseases/diagnosis; Exome/genetics; Male; Female; Cohort Studies; Genetic Testing/methods

2.

A high-performance computational workflow to accelerate GATK SNP detection across a 25-genome dataset.

Zhou, Yong; Kathiresan, Nagarajan; Yu, Zhichao; Rivera, Luis F; Yang, Yujian; Thimma, Manjula; Manickam, Keerthana; Chebotarov, Dmytro; Mauleon, Ramil; Chougule, Kapeel; Wei, Sharon; Gao, Tingting; Green, Carl D; Zuccolo, Andrea; Xie, Weibo; Ware, Doreen; Zhang, Jianwei; McNally, Kenneth L; Wing, Rod A.

BMC Biol ; 22(1): 13, 2024 Jan 25.

Article in English | MEDLINE | ID: mdl-38273258

ABSTRACT

BACKGROUND: Single-nucleotide polymorphisms (SNPs) are the most widely used form of molecular genetic variation studies. As reference genomes and resequencing data sets expand exponentially, tools must be in place to call SNPs at a similar pace. The genome analysis toolkit (GATK) is one of the most widely used SNP calling software tools publicly available, but unfortunately, high-performance computing versions of this tool have yet to become widely available and affordable. RESULTS: Here we report an open-source high-performance computing genome variant calling workflow (HPC-GVCW) for GATK that can run on multiple computing platforms from supercomputers to desktop machines. We benchmarked HPC-GVCW on multiple crop species for performance and accuracy with comparable results with previously published reports (using GATK alone). Finally, we used HPC-GVCW in production mode to call SNPs on a "subpopulation aware" 16-genome rice reference panel with ~ 3000 resequenced rice accessions. The entire process took ~ 16 weeks and resulted in the identification of an average of 27.3 M SNPs/genome and the discovery of ~ 2.3 million novel SNPs that were not present in the flagship reference genome for rice (i.e., IRGSP RefSeq). CONCLUSIONS: This study developed an open-source pipeline (HPC-GVCW) to run GATK on HPC platforms, which significantly improved the speed at which SNPs can be called. The workflow is widely applicable as demonstrated successfully for four major crop species with genomes ranging in size from 400 Mb to 2.4 Gb. Using HPC-GVCW in production mode to call SNPs on a 25 multi-crop-reference genome data set produced over 1.1 billion SNPs that were publicly released for functional and breeding studies. For rice, many novel SNPs were identified and were found to reside within genes and open chromatin regions that are predicted to have functional consequences. 
Combined, our results demonstrate the usefulness of combining a high-performance SNP calling architecture solution with a subpopulation-aware reference genome panel for rapid SNP discovery and public deployment.

Subject(s)

Genome, Plant; Polymorphism, Single Nucleotide; Workflow; Plant Breeding; Software; High-Throughput Nucleotide Sequencing/methods

3.

First Canadian report of transmission of fluconazole-resistant Candida parapsilosis within two hospital networks confirmed by genomic analysis.

McTaggart, Lisa R; Eshaghi, AliReza; Hota, Susy; Poutanen, Susan M; Johnstone, Jennie; De Luca, Domenica G; Bharat, Amrita; Patel, Samir N; Kus, Julianne V.

J Clin Microbiol ; 62(1): e0116123, 2024 01 17.

Article in English | MEDLINE | ID: mdl-38112529

ABSTRACT

Candida parapsilosis is a common cause of non-albicans candidemia. It can be transmitted in healthcare settings resulting in serious healthcare-associated infections and can develop drug resistance to commonly used antifungal agents. Following a significant increase in the percentage of fluconazole (FLU)-nonsusceptible isolates from sterile site specimens of patients in two Ontario acute care hospital networks, we used whole genome sequence (WGS) analysis to retrospectively investigate the genetic relatedness of isolates and to assess potential in-hospital spread. Phylogenomic analysis was conducted on all 19 FLU-resistant and seven susceptible-dose dependent (SDD) isolates from the two hospital networks, as well as 13 FLU susceptible C. parapsilosis isolates from the same facilities and 20 isolates from patients not related to the investigation. Twenty-five of 26 FLU-nonsusceptible isolates (resistant or SDD) and two susceptible isolates from the two hospital networks formed a phylogenomic cluster that was highly similar genetically and distinct from other isolates. The results suggest the presence of a persistent strain of FLU-nonsusceptible C. parapsilosis causing infections over a 5.5-year period. Results from WGS were largely comparable to microsatellite typing. Twenty-seven of 28 cluster isolates had a K143R substitution in lanosterol 14-α-demethylase (ERG11) associated with azole resistance. As the first report of a healthcare-associated outbreak of FLU-nonsusceptible C. parapsilosis in Canada, this study underscores the importance of monitoring local antimicrobial resistance trends and demonstrates the value of WGS analysis to detect and characterize clusters and outbreaks. Timely access to genomic epidemiological information can inform targeted infection control measures.

Subject(s)

Candida parapsilosis; Fluconazole; Humans; Fluconazole/pharmacology; Retrospective Studies; Microbial Sensitivity Tests; Drug Resistance, Fungal/genetics; Antifungal Agents/pharmacology; Antifungal Agents/therapeutic use; Genomics; Hospitals; Ontario

4.

Whole-genome sequencing and variant discovery of Citrus reticulata "Kinnow" from Pakistan.

Jabeen, Sadia; Saif, Rashid; Haq, Rukhama; Hayat, Akbar; Naz, Shagufta.

Funct Integr Genomics ; 23(3): 227, 2023 Jul 08.

Article in English | MEDLINE | ID: mdl-37422603

ABSTRACT

Citrus is a source of nutritional and medicinal advantages, cultivated worldwide with major groups of sweet oranges, mandarins, grapefruits, kumquats, lemons and limes. Pakistan produces all major citrus groups, with mandarin (Citrus reticulata) being the prominent group that includes local commercial cultivars Feutral's Early, Dancy, Honey, and Kinnow. The present study was designed to understand the genetic architecture of this unique variety, Citrus reticulata 'Kinnow.' Whole-genome resequencing and variant calling were performed to map the genomic variability that might be responsible for its particular characteristics such as taste, seedlessness, juice content, thickness of peel, and shelf-life. A total of 139,436,350 raw sequence reads were generated with 20.9 Gb data in Fastq format having 98% effectiveness and 0.2% base call error rate. Overall, 3,503,033 SNPs, 176,949 MNPs, 323,287 INS, and 333,083 DEL were identified using the GATK4 variant calling pipeline against Citrus clementina. Furthermore, g:Profiler was applied to annotate the newly found variants, the genes/transcripts harboring them, and their involved pathways. A total of 73,864 transcripts harbor 4,336,352 variants; most of the observed variants were predicted in non-coding regions, and 1009 transcripts were found to be well annotated by different databases. Of the aforementioned transcripts, 588 were involved in biological processes, 234 in molecular functions and 167 in cellular components. In summary, 18,153 high impact variants and 216 genic variants were found in the current study, which may be used after functional validation in marker-assisted breeding programs of "Kinnow" to propagate its valued traits for the improvement of contemporary citrus varieties in the region.

Subject(s)

Citrus; Citrus/genetics; Pakistan; Plant Breeding; Genome, Plant; Sequence Analysis, DNA

5.

OVarFlow: a resource optimized GATK 4 based Open source Variant calling workFlow.

Bathke, Jochen; Lühken, Gesine.

BMC Bioinformatics ; 22(1): 402, 2021 Aug 13.

Article in English | MEDLINE | ID: mdl-34388963

ABSTRACT

BACKGROUND: The advent of next generation sequencing has opened new avenues for basic and applied research. One application is the discovery of sequence variants causative of a phenotypic trait or a disease pathology. The computational task of detecting and annotating sequence differences between a target dataset and a reference genome is known as "variant calling". Typically, this task is computationally involved, often combining a complex chain of linked software tools. A major player in this field is the Genome Analysis Toolkit (GATK). The "GATK Best Practices" is a commonly referenced recipe for variant calling. However, current computational recommendations on variant calling predominantly focus on human sequencing data and ignore ever-changing demands of high-throughput sequencing developments. Furthermore, frequent updates to such recommendations are counterintuitive to the goal of offering a standard workflow and hamper reproducibility over time. RESULTS: A workflow for automated detection of single nucleotide polymorphisms and insertion-deletions offers a wide range of applications in sequence annotation of model and non-model organisms. The introduced workflow builds on the GATK Best Practices, while enabling reproducibility over time and offering an open, generalized computational architecture. The workflow achieves parallelized data evaluation and maximizes performance of individual computational tasks. Optimized Java garbage collection and heap size settings for the GATK applications SortSam, MarkDuplicates, HaplotypeCaller, and GatherVcfs effectively cut the overall analysis time in half. CONCLUSIONS: The demand for variant calling, efficient computational processing, and standardized workflows is growing. The Open source Variant calling workFlow (OVarFlow) offers automation and reproducibility for a computationally optimized variant calling task.
By reducing usage of computational resources, the workflow removes prior existing entry barriers to the variant calling field and enables standardized variant calling.
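The heap-size and garbage-collection tuning mentioned in this abstract is passed to GATK 4 through the wrapper's `--java-options` argument. The following Python sketch only assembles such a command line; the helper name `gatk_command`, the file names, and the values of 4 GB heap and 2 GC threads are placeholders chosen for illustration, not the settings reported by the OVarFlow authors.

```python
import shlex


def gatk_command(tool, tool_args, heap_gb, gc_threads):
    """Build a GATK 4 command with explicit JVM heap and parallel GC settings.

    heap_gb and gc_threads are illustrative knobs; suitable values depend on
    the tool and the machine and are not taken from the abstract.
    """
    java_options = f"-Xmx{heap_gb}g -XX:ParallelGCThreads={gc_threads}"
    return ["gatk", "--java-options", java_options, tool, *tool_args]


# Hypothetical input and output file names.
cmd = gatk_command(
    "HaplotypeCaller",
    ["-R", "reference.fasta", "-I", "sample.bam",
     "-O", "sample.g.vcf.gz", "-ERC", "GVCF"],
    heap_gb=4,
    gc_threads=2,
)

# Print a shell-safe rendering; the list form can go to subprocess.run(cmd).
print(shlex.join(cmd))
```

Keeping the command as a list avoids shell quoting problems when the JVM options contain spaces.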

Subject(s)

High-Throughput Nucleotide Sequencing; Software; Genome; Humans; Polymorphism, Single Nucleotide; Reproducibility of Results; Workflow

6.

Comparison of sequencing data processing pipelines and application to underrepresented African human populations.

Breton, Gwenna; Johansson, Anna C V; Sjödin, Per; Schlebusch, Carina M; Jakobsson, Mattias.

BMC Bioinformatics ; 22(1): 488, 2021 Oct 09.

Article in English | MEDLINE | ID: mdl-34627144

ABSTRACT

BACKGROUND: Population genetic studies of humans make increasing use of high-throughput sequencing in order to capture diversity in an unbiased way. There is an abundance of sequencing technologies, bioinformatic tools and the available genomes are increasing in number. Studies have evaluated and compared some of these technologies and tools, such as the Genome Analysis Toolkit (GATK) and its "Best Practices" bioinformatic pipelines. However, studies often focus on a few genomes of Eurasian origin in order to detect technical issues. We instead surveyed the use of the GATK tools and established a pipeline for processing high coverage full genomes from a diverse set of populations, including Sub-Saharan African groups, in order to reveal challenges from human diversity and stratification. RESULTS: We surveyed 29 studies using high-throughput sequencing data, and compared their strategies for data pre-processing and variant calling. We found that processing of data is very variable across studies and that the GATK "Best Practices" are seldom followed strictly. We then compared three versions of a GATK pipeline, differing in the inclusion of an indel realignment step and with a modification of the base quality score recalibration step. We applied the pipelines on a diverse set of 28 individuals. We compared the pipelines in terms of count of called variants and overlap of the callsets. We found that the pipelines resulted in similar callsets, in particular after callset filtering. We also ran one of the pipelines on a larger dataset of 179 individuals. We noted that including more individuals at the joint genotyping step resulted in different counts of variants. At the individual level, we observed that the average genome coverage was correlated to the number of variants called. 
CONCLUSIONS: We conclude that applying the GATK "Best Practices" pipeline, including their recommended reference datasets, to underrepresented populations does not lead to a decrease in the number of called variants compared to alternative pipelines. We recommend aiming for coverage of > 30X if identifying most variants is important, and working with large sample sizes at the variant calling stage, including for underrepresented individuals and populations.

Subject(s)

Genome; Polymorphism, Single Nucleotide; Computational Biology; High-Throughput Nucleotide Sequencing; Humans; INDEL Mutation

7.

Reliability of genomic variants across different next-generation sequencing platforms and bioinformatic processing pipelines.

Weißbach, Stephan; Sys, Stanislav; Hewel, Charlotte; Todorov, Hristo; Schweiger, Susann; Winter, Jennifer; Pfenninger, Markus; Torkamani, Ali; Evans, Doug; Burger, Joachim; Everschor-Sitte, Karin; May-Simera, Helen Louise; Gerber, Susanne.

BMC Genomics ; 22(1): 62, 2021 Jan 19.

Article in English | MEDLINE | ID: mdl-33468057

ABSTRACT

BACKGROUND: Next Generation Sequencing (NGS) is the foundation of various studies, providing insights into questions from biology and medicine. Nevertheless, integrating data from different experimental backgrounds can introduce strong biases. In order to methodically investigate the magnitude of systematic errors in single nucleotide variant calls, we performed a cross-sectional observational study on a genomic cohort of 99 subjects each sequenced via (i) Illumina HiSeq X, (ii) Illumina HiSeq, and (iii) Complete Genomics and processed with the respective bioinformatic pipeline. We also repeated variant calling for the Illumina cohorts with GATK, which allowed us to investigate the effect of the bioinformatics analysis strategy separately from the sequencing platform's impact. RESULTS: The number of detected variants/variant classes per individual was highly dependent on the experimental setup. We observed a statistically significant overrepresentation of variants uniquely called by a single setup, indicating potential systematic biases. Insertion/deletion polymorphisms (indels) were associated with decreased concordance compared to single nucleotide polymorphisms (SNPs). The discrepancies in indel absolute numbers were particularly prominent in introns, Alu elements, simple repeats, and regions with medium GC content. Notably, reprocessing sequencing data following the best practice recommendations of GATK considerably improved concordance between the respective setups. CONCLUSION: We provide empirical evidence of systematic heterogeneity in variant calls between alternative experimental and data analysis setups. Furthermore, our results demonstrate the benefit of reprocessing genomic data with harmonized pipelines when integrating data from different studies.
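To make "variants uniquely called by a single setup" and "concordance" concrete, here is a minimal Python sketch using set operations. The setup names echo the abstract, but the variant tuples are invented toy data, and Jaccard overlap is an assumed stand-in for whatever concordance measure the authors used.

```python
def uniquely_called(callsets):
    """Return, per setup, the variants that no other setup called."""
    unique = {}
    for name, variants in callsets.items():
        others = set().union(
            *(v for other, v in callsets.items() if other != name)
        )
        unique[name] = variants - others
    return unique


def jaccard(a, b):
    """Shared variants divided by all variants seen in either callset."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Toy callsets: (chromosome, position, ref, alt). Values are invented.
callsets = {
    "hiseq_x": {("chr1", 1000, "A", "G"), ("chr1", 2000, "C", "T"),
                ("chr2", 500, "G", "GA")},
    "hiseq": {("chr1", 1000, "A", "G"), ("chr1", 2000, "C", "T")},
    "complete_genomics": {("chr1", 1000, "A", "G"), ("chr3", 42, "T", "C")},
}

unique = uniquely_called(callsets)
overlap = jaccard(callsets["hiseq_x"], callsets["hiseq"])
print({name: len(v) for name, v in unique.items()}, round(overlap, 3))
```

In this toy example the only indel is also the only variant private to one Illumina setup, which mirrors the direction of the reported finding without reproducing its data.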

Subject(s)

Computational Biology; High-Throughput Nucleotide Sequencing; Cross-Sectional Studies; Genomics; Humans; Polymorphism, Single Nucleotide; Reproducibility of Results

8.

Optimizing performance of GATK workflows using Apache Arrow In-Memory data framework.

Ahmad, Tanveer; Ahmed, Nauman; Al-Ars, Zaid; Hofstee, H Peter.

BMC Genomics ; 21(Suppl 10): 683, 2020 Nov 18.

Article in English | MEDLINE | ID: mdl-33208101

ABSTRACT

BACKGROUND: Immense improvements in sequencing technologies enable producing large amounts of high throughput and cost effective next-generation sequencing (NGS) data. This data needs to be processed efficiently for further downstream analyses. Computing systems need these large amounts of data close to the processor (with low latency) for fast and efficient processing. However, existing workflows depend heavily on disk storage and access, so processing this data incurs huge disk I/O overheads. Previously, due to the cost, volatility and other physical constraints of DRAM memory, it was not feasible to place large amounts of working data sets in memory. However, recent developments in storage-class memory and non-volatile memory technologies have enabled computing systems to place huge data in memory to process it directly from memory to avoid disk I/O bottlenecks. To exploit the benefits of such memory systems efficiently, properly formatted data placement in memory and high-throughput access to it are necessary, avoiding (de)-serialization and copy overheads in between processes. For this purpose, we use the newly developed Apache Arrow, a cross-language development framework that provides a language-independent columnar in-memory data format for efficient in-memory big data analytics. This allows genomics applications developed in different programming languages to communicate in-memory without having to access disk storage and avoiding (de)-serialization and copy overheads. IMPLEMENTATION: We integrate Apache Arrow in-memory based Sequence Alignment/Map (SAM) format and its shared memory objects store library in widely used genomics high throughput data processing applications like BWA-MEM, Picard and GATK to allow in-memory communication between these applications. In addition, this also allows us to exploit the cache locality of tabular data and parallel processing capabilities through shared memory objects.
RESULTS: Our implementation shows that adopting in-memory SAM representation in genomics high throughput data processing applications results in better system resource utilization, low number of memory accesses due to high cache locality exploitation and parallel scalability due to shared memory objects. Our implementation focuses on the GATK best practices recommended workflows for germline analysis on whole genome sequencing (WGS) and whole exome sequencing (WES) data sets. We compare a number of existing in-memory data placing and sharing techniques like ramDisk and Unix pipes to show how columnar in-memory data representation outperforms both. We achieve a speedup of 4.85x and 4.76x for WGS and WES data, respectively, in overall execution time of variant calling workflows. Similarly, a speedup of 1.45x and 1.27x for these data sets, respectively, is achieved, as compared to the second fastest workflow. In some individual tools, particularly in sorting, duplicates removal and base quality score recalibration the speedup is even more promising. AVAILABILITY: The code and scripts used in our experiments are available in both container and repository form at: https://github.com/abs-tudelft/ArrowSAM .

Subject(s)

High-Throughput Nucleotide Sequencing; Software; Genomics; Whole Genome Sequencing; Workflow

9.

PEMapper and PECaller provide a simplified approach to whole-genome sequencing.

Johnston, H Richard; Chopra, Pankaj; Wingo, Thomas S; Patel, Viren; Epstein, Michael P; Mulle, Jennifer G; Warren, Stephen T; Zwick, Michael E; Cutler, David J.

Proc Natl Acad Sci U S A ; 114(10): E1923-E1932, 2017 03 07.

Article in English | MEDLINE | ID: mdl-28223510

ABSTRACT

The analysis of human whole-genome sequencing data presents significant computational challenges. The sheer size of datasets places an enormous burden on computational, disk array, and network resources. Here, we present an integrated computational package, PEMapper/PECaller, that was designed specifically to minimize the burden on networks and disk arrays, create output files that are minimal in size, and run in a highly computationally efficient way, with the single goal of enabling whole-genome sequencing at scale. In addition to improved computational efficiency, we implement a statistical framework that allows for a base by base error model, allowing this package to perform as well as or better than the widely used Genome Analysis Toolkit (GATK) in all key measures of performance on human whole-genome sequences.

Subject(s)

Computational Biology/methods; Genome, Human/genetics; Software; Whole Genome Sequencing/methods; Algorithms; Databases, Genetic; Humans; Polymorphism, Single Nucleotide/genetics

10.

Quality control and integration of genotypes from two calling pipelines for whole genome sequence data in the Alzheimer's disease sequencing project.

Naj, Adam C; Lin, Honghuang; Vardarajan, Badri N; White, Simon; Lancour, Daniel; Ma, Yiyi; Schmidt, Michael; Sun, Fangui; Butkiewicz, Mariusz; Bush, William S; Kunkle, Brian W; Malamon, John; Amin, Najaf; Choi, Seung Hoan; Hamilton-Nelson, Kara L; van der Lee, Sven J; Gupta, Namrata; Koboldt, Daniel C; Saad, Mohamad; Wang, Bowen; Nato, Alejandro Q; Sohi, Harkirat K; Kuzma, Amanda; Wang, Li-San; Cupples, L Adrienne; van Duijn, Cornelia; Seshadri, Sudha; Schellenberg, Gerard D; Boerwinkle, Eric; Bis, Joshua C; Dupuis, Josée; Salerno, William J; Wijsman, Ellen M; Martin, Eden R; DeStefano, Anita L.

Genomics ; 111(4): 808-818, 2019 07.

Article in English | MEDLINE | ID: mdl-29857119

ABSTRACT

The Alzheimer's Disease Sequencing Project (ADSP) performed whole genome sequencing (WGS) of 584 subjects from 111 multiplex families at three sequencing centers. Genotype calling of single nucleotide variants (SNVs) and insertion-deletion variants (indels) was performed centrally using GATK-HaplotypeCaller and Atlas V2. The ADSP Quality Control (QC) Working Group applied QC protocols to project-level variant call format files (VCFs) from each pipeline, and developed and implemented a novel protocol, termed "consensus calling," to combine genotype calls from both pipelines into a single high-quality set. QC was applied to autosomal bi-allelic SNVs and indels, and included pipeline-recommended QC filters, variant-level QC, and sample-level QC. Low-quality variants or genotypes were excluded, and sample outliers were noted. Quality was assessed by examining Mendelian inconsistencies (MIs) among 67 parent-offspring pairs, and MIs were used to establish additional genotype-specific filters for GATK calls. After QC, 578 subjects remained. Pipeline-specific QC excluded ~12.0% of GATK and 14.5% of Atlas SNVs. Between pipelines, ~91% of SNV genotypes across all QCed variants were concordant; 4.23% and 4.56% of genotypes were exclusive to Atlas or GATK, respectively; the remaining ~0.01% of discordant genotypes were excluded. For indels, variant-level QC excluded ~36.8% of GATK and 35.3% of Atlas indels. Between pipelines, ~55.6% of indel genotypes were concordant; while 10.3% and 28.3% were exclusive to Atlas or GATK, respectively; and ~0.29% of discordant genotypes were excluded. The final WGS consensus dataset contains 27,896,774 SNVs and 3,133,926 indels and is publicly available.
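The "consensus calling" idea described in this abstract can be illustrated with a short Python sketch. It is a deliberately simplified reading and not the ADSP protocol: it assumes that concordant genotypes are kept, genotypes reported by only one pipeline are kept, and genotypes on which the pipelines disagree are dropped. The function name, keys, and genotype strings are hypothetical.

```python
def consensus_calls(gatk, atlas):
    """Merge two genotype maps keyed by (sample, variant).

    Assumed rule: keep a genotype if both pipelines agree or if only one
    pipeline reported it; drop it when the two pipelines disagree.
    """
    consensus, discordant = {}, []
    for key in gatk.keys() | atlas.keys():
        g, a = gatk.get(key), atlas.get(key)
        if g is not None and a is not None and g != a:
            discordant.append(key)  # pipelines disagree: exclude
        else:
            consensus[key] = g if g is not None else a
    return consensus, sorted(discordant)


# Invented toy genotypes for two samples and three variants.
gatk_calls = {
    ("S1", "chr1:100:A:G"): "0/1",
    ("S1", "chr1:200:C:T"): "1/1",
    ("S2", "chr1:100:A:G"): "0/1",
}
atlas_calls = {
    ("S1", "chr1:100:A:G"): "0/1",   # concordant
    ("S1", "chr1:200:C:T"): "0/1",   # discordant
    ("S2", "chr2:300:G:A"): "0/1",   # exclusive to Atlas
}

merged, dropped = consensus_calls(gatk_calls, atlas_calls)
print(len(merged), dropped)
```

A production version would apply the pipeline-specific and Mendelian-inconsistency filters mentioned in the abstract before this merge step.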

Subject(s)

Alzheimer Disease/genetics; Genome-Wide Association Study/standards; Genotyping Techniques/standards; Quality Control; Whole Genome Sequencing/standards; Algorithms; Female; Genome-Wide Association Study/methods; Genotype; Genotyping Techniques/methods; Humans; Male; Polymorphism, Genetic; Whole Genome Sequencing/methods

11.

Recommendations for performance optimizations when using GATK3.8 and GATK4.

Heldenbrand, Jacob R; Baheti, Saurabh; Bockol, Matthew A; Drucker, Travis M; Hart, Steven N; Hudson, Matthew E; Iyer, Ravishankar K; Kalmbach, Michael T; Kendig, Katherine I; Klee, Eric W; Mattson, Nathan R; Wieben, Eric D; Wiepert, Mathieu; Wildman, Derek E; Mainzer, Liudmila S.

BMC Bioinformatics ; 20(1): 557, 2019 Nov 08.

Article in English | MEDLINE | ID: mdl-31703611

ABSTRACT

BACKGROUND: Use of the Genome Analysis Toolkit (GATK) continues to be the standard practice in genomic variant calling in both research and the clinic. Recently the toolkit has been rapidly evolving. Significant computational performance improvements have been introduced in GATK3.8 through collaboration with Intel in 2017. The first release of GATK4 in early 2018 revealed rewrites in the code base, as the stepping stone toward a Spark implementation. As the software continues to be a moving target for optimal deployment in highly productive environments, we present a detailed analysis of these improvements, to help the community stay abreast with changes in performance. RESULTS: We re-evaluated multiple options, such as threading, parallel garbage collection, I/O options and data-level parallelization. Additionally, we considered the trade-offs of using GATK3.8 and GATK4. We found optimized parameter values that reduce the time of executing the best practices variant calling procedure by 29.3% for GATK3.8 and 16.9% for GATK4. Further speedups can be accomplished by splitting data for parallel analysis, resulting in run time of only a few hours on whole human genome sequenced to the depth of 20X, for both versions of GATK. Nonetheless, GATK4 is already much more cost-effective than GATK3.8. Thanks to significant rewrites of the algorithms, the same analysis can be run largely in a single-threaded fashion, allowing users to process multiple samples on the same CPU. CONCLUSIONS: In time-sensitive situations, when a patient has a critical or rapidly developing condition, it is useful to minimize the time to process a single sample. In such cases we recommend using GATK3.8 by splitting the sample into chunks and computing across multiple nodes. The resultant walltime will be nnn.4 hours at the cost of $41.60 on 4 c5.18xlarge instances of Amazon Cloud. 
For cost-effectiveness of routine analyses or for large population studies, it is useful to maximize the number of samples processed per unit time. Thus we recommend GATK4, running multiple samples on one node. The total walltime will be ~34.1 hours on 40 samples, with 1.18 samples processed per hour at the cost of $2.60 per sample on c5.18xlarge instance of Amazon Cloud.
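The throughput and cost figures for the GATK4 recommendation can be related to each other with simple arithmetic. The Python sketch below uses only the numbers quoted in the abstract; the derived hourly rate is an implied value, not a price stated by the authors.

```python
# Figures quoted in the abstract for GATK4 on one c5.18xlarge node.
samples = 40
walltime_hours = 34.1        # reported as approximate
cost_per_sample_usd = 2.60

# Samples finished per hour of walltime.
throughput = samples / walltime_hours

# Total run cost and the hourly instance rate these figures imply.
total_cost_usd = samples * cost_per_sample_usd
implied_hourly_rate_usd = total_cost_usd / walltime_hours

print(f"throughput: {throughput:.2f} samples/hour")
print(f"total cost: ${total_cost_usd:.2f}")
print(f"implied rate: ${implied_hourly_rate_usd:.2f}/hour")
```

The computed throughput is about 1.17 samples per hour, close to the 1.18 quoted; the small gap is plausibly due to the walltime being rounded.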

Subject(s)

Genomics/methods; Software; Algorithms; Chromosomes, Human/genetics; Genome, Human; Haplotypes/genetics; High-Throughput Nucleotide Sequencing; Humans

12.

Comparative analysis of the chicken IFITM locus by targeted genome sequencing reveals evolution of the locus and positive selection in IFITM1 and IFITM3.

Bassano, Irene; Ong, Swee Hoe; Sanz-Hernandez, Maximo; Vinkler, Michal; Kebede, Adebabay; Hanotte, Olivier; Onuigbo, Ebele; Fife, Mark; Kellam, Paul.

BMC Genomics ; 20(1): 272, 2019 Apr 05.

Article in English | MEDLINE | ID: mdl-30952207

ABSTRACT

BACKGROUND: The interferon-induced transmembrane (IFITM) protein family comprises a class of restriction factors widely characterised in humans for their potent antiviral activity. Their biological activity is well documented in several animal species, but their genetic variation and biological mechanism are less well understood, particularly in avian species. RESULTS: Here we report the complete sequence of the domestic chicken Gallus gallus IFITM locus from a wide variety of chicken breeds to examine the detailed pattern of genetic variation of the locus on chromosome 5, including the flanking genes ATHL1 and B4GALNT4. We have generated chIFITM sequences from commercial breeds (supermarket-derived chicken breasts), indigenous chickens from Nigeria (Nsukka) and Ethiopia, European breeds and inbred chicken lines from the Pirbright Institute, totalling 206 chickens. Through mapping of genetic variants to the latest chIFITM consensus sequence, our data reveal that the chIFITM locus does not show structural variation across the populations analysed, despite spanning diverse breeds from different geographic locations. However, single nucleotide variants (SNVs) in functionally important regions of the proteins within certain groups of chickens were detected, in particular the European breeds and indigenous birds from Ethiopia and Nigeria. In addition, we also found that two out of four SNVs located in the chIFITM1 (Ser36 and Arg77) and chIFITM3 (Val103) proteins were simultaneously under positive selection. CONCLUSIONS: Together these data suggest that IFITM genetic variation may contribute to the capacities of different chicken populations to resist virus infection.

Subject(s)

Antigens, Differentiation/genetics; Evolution, Molecular; Genetic Loci; Genetic Markers; Polymorphism, Single Nucleotide; Selection, Genetic; Amino Acid Sequence; Animals; Chickens; Chromosome Mapping; DNA Copy Number Variations; Genome; Sequence Analysis, DNA; Sequence Homology

13.

GPU accelerated sequence alignment with traceback for GATK HaplotypeCaller.

Ren, Shanshan; Ahmed, Nauman; Bertels, Koen; Al-Ars, Zaid.

BMC Genomics ; 20(Suppl 2): 184, 2019 Apr 04.

Article in English | MEDLINE | ID: mdl-30967111

ABSTRACT

BACKGROUND: Pairwise sequence alignment is widely used in many biological tools and applications. Existing GPU accelerated implementations mainly focus on calculating the optimal alignment score and omit identifying the optimal alignment itself. In GATK HaplotypeCaller (HC), the semi-global pairwise sequence alignment with traceback has so far been difficult to accelerate effectively on GPUs. RESULTS: We first analyze the characteristics of the semi-global alignment with traceback in GATK HC and then propose a new algorithm that allows for retrieving the optimal alignment efficiently on GPUs. For the first stage, we choose an intra-task parallelization model to calculate the position of the optimal alignment score and the backtracking matrix. Moreover, in the first stage, our GPU implementation also records the length of consecutive matches/mismatches in addition to lengths of consecutive insertions and deletions as in the CPU-based implementation. This helps efficiently retrieve the backtracking matrix to obtain the optimal alignment in the second stage. CONCLUSIONS: Experimental results show that our alignment kernel with traceback is up to 80x and 14.14x faster than its CPU counterpart with synthetic datasets and real datasets, respectively. When integrated into GATK HC (alongside a GPU accelerated pair-HMMs forward kernel), the overall acceleration is 2.3x faster than the baseline GATK HC implementation, and 1.34x faster than the GATK HC implementation with the integrated GPU-based pair-HMMs forward algorithm. Although the methods proposed in this paper are intended to improve the performance of GATK HC, they can also be used in other pairwise alignments and applications.
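For readers unfamiliar with the two stages named in this abstract (fill a score and backtracking matrix, then trace back to recover the alignment), the following plain-Python sketch shows a textbook semi-global alignment on the CPU. It is not the authors' GPU kernel and not GATK's implementation: it uses a linear gap penalty, assumes leading and trailing reference bases are free, and the scoring values are arbitrary.

```python
def semi_global_align(ref, query, match=1, mismatch=-1, gap=-2):
    """Align all of `query` inside `ref`; unaligned ref ends cost nothing."""
    n, m = len(ref), len(query)
    # Stage 1: fill score and backtracking matrices.
    score = [[0] * (n + 1) for _ in range(m + 1)]
    back = [[None] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):          # query bases before ref starts
        score[i][0] = i * gap
        back[i][0] = "U"
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if query[i - 1] == ref[j - 1] else mismatch
            diag = score[i - 1][j - 1] + sub
            up = score[i - 1][j] + gap      # query base against a gap
            left = score[i][j - 1] + gap    # ref base against a gap
            best = max(diag, up, left)
            score[i][j] = best
            back[i][j] = "D" if best == diag else ("U" if best == up else "L")
    # Free trailing ref bases: take the best cell in the last row.
    end = max(range(n + 1), key=lambda j: score[m][j])
    # Stage 2: traceback from that cell until the query is consumed.
    i, j, a_ref, a_query = m, end, [], []
    while i > 0:
        move = back[i][j]
        if move == "D":
            a_ref.append(ref[j - 1]); a_query.append(query[i - 1])
            i, j = i - 1, j - 1
        elif move == "U":
            a_ref.append("-"); a_query.append(query[i - 1])
            i -= 1
        else:
            a_ref.append(ref[j - 1]); a_query.append("-")
            j -= 1
    return score[m][end], j, "".join(reversed(a_ref)), "".join(reversed(a_query))


best_score, ref_start, aligned_ref, aligned_query = semi_global_align(
    "GGACGTTT", "ACGT"
)
print(best_score, ref_start, aligned_ref, aligned_query)
```

The traceback stage is inherently sequential, which is one reason the abstract singles it out as hard to accelerate on GPUs.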

Subject(s)

Algorithms , Computer Graphics , Genetic Variation , Human Genome , Haplotypes , Sequence Alignment/methods , High-Throughput Nucleotide Sequencing , Humans , DNA Sequence Analysis , Software
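
As an illustration of the semi-global alignment with traceback summarized in entry 13, the following is a minimal CPU sketch in Python. It is neither the GPU kernel from the paper nor GATK's own implementation: the function name `semi_global_align`, the linear gap penalty, and the scoring values are assumptions chosen for brevity. It fills a score matrix and a backtracking matrix, walks the backtracking matrix from the best cell of the last row, and run-length encodes consecutive operations into a CIGAR-like string.

```python
from itertools import groupby


def semi_global_align(ref, query, match=1, mismatch=-1, gap=-2):
    """Align all of `query` to the best-scoring stretch of `ref`.

    Leading and trailing unaligned `ref` bases are free (semi-global).
    Scoring values are illustrative assumptions, not GATK defaults.
    Returns (score, start offset in ref, CIGAR-like string).
    """
    n, m = len(query), len(ref)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    back = [[""] * (m + 1) for _ in range(n + 1)]  # backtracking matrix

    # Column 0: query bases with no reference base to pair with (insertions).
    for i in range(1, n + 1):
        score[i][0] = i * gap
        back[i][0] = "I"
    # Row 0 stays 0 so the alignment may start anywhere in `ref` for free.

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if query[i - 1] == ref[j - 1] else mismatch
            diag = score[i - 1][j - 1] + sub  # match or mismatch
            ins = score[i - 1][j] + gap       # query base against a gap
            dele = score[i][j - 1] + gap      # reference base against a gap
            best = max(diag, ins, dele)
            score[i][j] = best
            back[i][j] = "M" if best == diag else ("I" if best == ins else "D")

    # The alignment may end anywhere in `ref`: take the best cell of the last row.
    end = max(range(m + 1), key=lambda j: score[n][j])

    # Traceback from (n, end) until every query base has been consumed.
    ops = []
    i, j = n, end
    while i > 0:
        op = back[i][j]
        ops.append(op)
        if op == "M":
            i, j = i - 1, j - 1
        elif op == "I":
            i -= 1
        else:  # "D"
            j -= 1
    ops.reverse()

    # Run-length encode consecutive identical operations, e.g. MMMM -> 4M.
    cigar = "".join(f"{len(list(run))}{op}" for op, run in groupby(ops))
    return score[n][end], j, cigar


if __name__ == "__main__":
    print(semi_global_align("ACGTACGT", "GTAC"))
```

Here `M` covers both matches and mismatches, `I` is a query base with no reference partner, and `D` is a skipped reference base. Storing run lengths instead of single steps is the same general idea the abstract mentions for speeding up traceback, although the paper's actual data layout is not described in this record.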

14.

A high-throughput SNP discovery strategy for RNA-seq data.

Zhao, Yun; Wang, Ke; Wang, Wen-Li; Yin, Ting-Ting; Dong, Wei-Qi; Xu, Chang-Jie.

BMC Genomics ; 20(1): 160, 2019 Feb 27.

Article in English | MEDLINE | ID: mdl-30813897

ABSTRACT

BACKGROUND: Single nucleotide polymorphisms (SNPs) have been applied as important molecular markers in genetics and breeding studies. The rapid advance of next generation sequencing (NGS) provides a high-throughput means of SNP discovery. However, SNP development is limited by the availability of reliable SNP discovery methods. In particular, the optimum assembler and SNP caller for accurate SNP prediction from next generation sequencing data are not known. RESULTS: Herein we performed SNP prediction based on RNA-seq data of peach and mandarin peel tissue under a comprehensive comparison of two paired-end read lengths (125 bp and 150 bp), five assemblers (Trinity, IDBA, Oases, SOAPdenovo, Trans-ABySS) and two SNP callers (GATK and GBS). The predicted SNPs were compared with the authentic SNPs identified via PCR amplification followed by gene cloning and sequencing procedures. A total of 40 and 240 authentic SNPs were identified in five anthocyanin biosynthesis related genes in peach and in nine carotenogenic genes in mandarin, respectively. Putative SNPs predicted from the same RNA-seq data with different strategies led to quite divergent results. The rate of false positive SNPs was significantly lower when the paired-end read length was 150 bp compared with 125 bp. Trinity was superior to the other four assemblers and GATK was substantially superior to GBS due to a low rate of missing authentic SNPs. The combination of assembler Trinity, SNP caller GATK, and the paired-end read length 150 bp had the best performance in SNP discovery, with 100% accuracy in both the peach and mandarin cases. This strategy was applied to the characterization of SNPs in peach and mandarin transcriptomes.
CONCLUSIONS: Through comparison of authentic SNPs obtained by a PCR cloning strategy and putative SNPs predicted from different combinations of five assemblers, two SNP callers, and two paired-end read lengths, we provided a reliable and efficient strategy, Trinity-GATK with 150 bp paired-end read length, for SNP discovery from RNA-seq data. This strategy discovered SNPs with 100% accuracy in the peach and mandarin cases and might be applicable to a wide range of plants and other organisms.

Subject(s)

Gene Expression Profiling/methods , High-Throughput Nucleotide Sequencing/methods , Single Nucleotide Polymorphism , RNA Sequence Analysis/methods , Citrus/genetics , Molecular Sequence Annotation , Prunus persica/genetics
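
The comparison in entry 14, predicted SNPs checked against a validated set, can be sketched with plain set arithmetic. This is only an illustration: the abstract does not define how its false positive and missing rates were computed, so the definitions below, the function name `evaluate_snp_calls`, and the `(gene, position)` site identifiers are assumptions.

```python
def evaluate_snp_calls(predicted, authentic):
    """Compare predicted SNP sites with a validated (authentic) set.

    Both arguments are iterables of hashable site identifiers, here
    hypothetical (gene, position) tuples.
    """
    predicted, authentic = set(predicted), set(authentic)
    true_pos = predicted & authentic
    false_pos = predicted - authentic
    missed = authentic - predicted
    return {
        "true_positives": len(true_pos),
        "false_positives": len(false_pos),
        "missed": len(missed),
        # Share of predicted SNPs that are not in the authentic set.
        "false_positive_rate": len(false_pos) / len(predicted) if predicted else 0.0,
        # Share of authentic SNPs that the pipeline failed to report.
        "missing_rate": len(missed) / len(authentic) if authentic else 0.0,
    }


if __name__ == "__main__":
    # Made-up sites for demonstration only.
    authentic = {("g1", 10), ("g1", 25), ("g2", 7), ("g2", 90)}
    predicted = {("g1", 10), ("g2", 7), ("g2", 50)}
    print(evaluate_snp_calls(predicted, authentic))
```

Running the same function over each assembler, caller, and read-length combination would give a table comparable in spirit to the one the abstract describes.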

15.

From reference genomes to population genomics: comparing three reference-aligned reduced-representation sequencing pipelines in two wildlife species.

Wright, Belinda; Farquharson, Katherine A; McLennan, Elspeth A; Belov, Katherine; Hogg, Carolyn J; Grueber, Catherine E.

BMC Genomics ; 20(1): 453, 2019 Jun 03.

Article in English | MEDLINE | ID: mdl-31159724

ABSTRACT

BACKGROUND: Recent advances in genomics have greatly increased research opportunities for non-model species. For wildlife, a growing availability of reference genomes means that population genetics is no longer restricted to a small set of anonymous loci. When used in conjunction with a reference genome, reduced-representation sequencing (RRS) provides a cost-effective method for obtaining reliable diversity information for population genetics. Many software tools have been developed to process RRS data, though few studies of non-model species incorporate genome alignment in calling loci. A commonly-used RRS analysis pipeline, Stacks, has this capacity and so it is timely to compare its utility with existing software originally designed for alignment and analysis of whole genome sequencing data. Here we examine population genetic inferences from two species for which reference-aligned reduced-representation data have been collected. Our two study species are a threatened Australian marsupial (Tasmanian devil Sarcophilus harrisii; declining population) and an Arctic-circle migrant bird (pink-footed goose Anser brachyrhynchus; expanding population). Analyses of these data are compared using Stacks versus two widely-used genomics packages, SAMtools and GATK. We also introduce a custom R script to improve the reliability of single nucleotide polymorphism (SNP) calls in all pipelines and conduct population genetic inferences for non-model species with reference genomes. RESULTS: Although we identified orders of magnitude fewer SNPs in our devil dataset than for goose, we found remarkable symmetry between the two species in our assessment of software performance. For both datasets, all three methods were able to delineate population structure, even with varying numbers of loci. For both species, population structure inferences were influenced by the percent of missing data. 
CONCLUSIONS: For studies of non-model species with a reference genome, we recommend combining Stacks output with further filtering (as included in our R pipeline) for population genetic studies, paying particular attention to potential impact of missing data thresholds. We recognise SAMtools as a viable alternative for researchers more familiar with this software. We caution against the use of GATK in studies with limited computational resources or time.

Subject(s)

Geese/genetics , Genome , Marsupialia/genetics , Metagenomics/methods , Metagenomics/standards , Single Nucleotide Polymorphism , Animals , Computational Biology , High-Throughput Nucleotide Sequencing , Reference Standards , Software
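
Entry 15 stresses the effect of missing-data thresholds on downstream inference. The sketch below shows one simple way such a threshold can be applied per locus. It is written in Python rather than R and is not the authors' script: the function name `filter_by_missingness`, the dictionary layout, and the 0.2 default are illustrative assumptions.

```python
def filter_by_missingness(genotypes, max_missing=0.2):
    """Keep loci whose fraction of missing genotype calls is <= max_missing.

    `genotypes` maps a locus name to a list of per-sample calls, with None
    marking a missing call. Names and the 0.2 default are illustrative.
    """
    kept = {}
    for locus, calls in genotypes.items():
        if not calls:
            continue  # no samples at all: nothing to keep
        missing = sum(call is None for call in calls) / len(calls)
        if missing <= max_missing:
            kept[locus] = calls
    return kept


if __name__ == "__main__":
    # Made-up genotype calls (0, 1, 2 = alternate allele count).
    data = {
        "snp1": [0, 1, 2, 0, 1],
        "snp2": [0, None, None, 1, 0],
        "snp3": [None, 1, 1, 1, 1],
    }
    for threshold in (0.0, 0.2, 0.5):
        print(threshold, sorted(filter_by_missingness(data, threshold)))
```

Repeating an analysis across several thresholds, as in the loop above, is one way to see how sensitive population structure results are to this choice.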

16.

[CURRENT PIPELINES FOR WHOLE-GENOME SEQUENCING ANALYSES].

Namba, Shinichi; Okada, Yukinori.

Arerugi ; 72(9): 1110-1112, 2023.

Article in Japanese | MEDLINE | ID: mdl-37967956

17.

Pan-cancer analysis reveals technical artifacts in TCGA germline variant calls.

Buckley, Alexandra R; Standish, Kristopher A; Bhutani, Kunal; Ideker, Trey; Lasken, Roger S; Carter, Hannah; Harismendy, Olivier; Schork, Nicholas J.

BMC Genomics ; 18(1): 458, 2017 06 12.

Article in English | MEDLINE | ID: mdl-28606096

ABSTRACT

BACKGROUND: Cancer research to date has largely focused on somatically acquired genetic aberrations. In contrast, the degree to which germline, or inherited, variation contributes to tumorigenesis remains unclear, possibly due to a lack of accessible germline variant data. Here we called germline variants on 9618 cases from The Cancer Genome Atlas (TCGA) database representing 31 cancer types. RESULTS: We identified batch effects affecting loss of function (LOF) variant calls that can be traced back to differences in the way the sequence data were generated both within and across cancer types. Overall, LOF indel calls were more sensitive to technical artifacts than LOF Single Nucleotide Variant (SNV) calls. In particular, whole genome amplification of DNA prior to sequencing led to an artificially increased burden of LOF indel calls, which confounded association analyses relating germline variants to tumor type despite stringent indel filtering strategies. The samples affected by these technical artifacts include all acute myeloid leukemia and practically all ovarian cancer samples. CONCLUSIONS: We demonstrate how technical artifacts induced by whole genome amplification of DNA can lead to false positive germline-tumor type associations and suggest TCGA whole genome amplified samples be used with caution. This study draws attention to the need to be sensitive to problems associated with a lack of uniformity in data generation in TCGA data.

Subject(s)

Artifacts , Genetic Databases , Genomics , Germ-Line Mutation , Neoplasms/genetics , Human Genome/genetics , Humans , Loss of Function Mutation

18.

Phylogenomic inferences from reference-mapped and de novo assembled short-read sequence data using RADseq sequencing of California white oaks (Quercus section Quercus).

Fitz-Gibbon, Sorel; Hipp, Andrew L; Pham, Kasey K; Manos, Paul S; Sork, Victoria L.

Genome ; 60(9): 743-755, 2017 Sep.

Article in English | MEDLINE | ID: mdl-28355490

ABSTRACT

The emergence of next generation sequencing has increased by several orders of magnitude the amount of data available for phylogenetics. Reduced representation approaches, such as restriction-site associated DNA sequencing (RADseq), have proven useful for phylogenetic studies of non-model species at a wide range of phylogenetic depths. However, analysis of these datasets is not uniform and we know little about the potential benefits and drawbacks of de novo assembly versus assembly by mapping to a reference genome. Using RADseq data for 83 oak samples representing 16 taxa, we identified variants via three pipelines: mapping sequence reads to a recently published draft genome of Quercus lobata, and de novo assembly under two sets of locus filters. For each pipeline, we inferred the maximum likelihood phylogeny. All pipelines produced similar trees, with minor shifts in relationships within well-supported clades, despite the fact that they yielded different numbers of loci (68 000 - 111 000 loci) and different degrees of overlap with the reference genome. We conclude that both the reference-aligned and de novo assembly pipelines yield reliable results, and that advantages and disadvantages of these approaches pertain mainly to downstream uses of RADseq data, not to phylogenetic inference per se.

Subject(s)

Quercus/genetics , California , Plant DNA , Genetic Variation , Phylogeny , Quercus/classification , DNA Sequence Analysis

19.

Whole-exome sequencing identifies tetratricopeptide repeat domain 7A (TTC7A) mutations for combined immunodeficiency with intestinal atresias.

Chen, Rui; Giliani, Silvia; Lanzi, Gaetana; Mias, George I; Lonardi, Silvia; Dobbs, Kerry; Manis, John; Im, Hogune; Gallagher, Jennifer E; Phanstiel, Douglas H; Euskirchen, Ghia; Lacroute, Philippe; Bettinger, Keith; Moratto, Daniele; Weinacht, Katja; Montin, Davide; Gallo, Eleonora; Mangili, Giovanna; Porta, Fulvio; Notarangelo, Lucia D; Pedretti, Stefania; Al-Herz, Waleed; Alfahdli, Wasmi; Comeau, Anne Marie; Traister, Russell S; Pai, Sung-Yun; Carella, Graziella; Facchetti, Fabio; Nadeau, Kari C; Snyder, Michael; Notarangelo, Luigi D.

J Allergy Clin Immunol ; 132(3): 656-664.e17, 2013 Sep.

Article in English | MEDLINE | ID: mdl-23830146

ABSTRACT

BACKGROUND: Combined immunodeficiency with multiple intestinal atresias (CID-MIA) is a rare hereditary disease characterized by intestinal obstructions and profound immune defects. OBJECTIVE: We sought to determine the underlying genetic causes of CID-MIA by analyzing the exomic sequences of 5 patients and their healthy direct relatives from 5 unrelated families. METHODS: We performed whole-exome sequencing on 5 patients with CID-MIA and 10 healthy direct family members belonging to 5 unrelated families with CID-MIA. We also performed targeted Sanger sequencing for the candidate gene tetratricopeptide repeat domain 7A (TTC7A) on 3 additional patients with CID-MIA. RESULTS: Through analysis and comparison of the exomic sequence of the subjects from these 5 families, we identified biallelic damaging mutations in the TTC7A gene, for a total of 7 distinct mutations. Targeted TTC7A gene sequencing in 3 additional unrelated patients with CID-MIA revealed biallelic deleterious mutations in 2 of them, as well as an aberrant splice product in the third patient. Staining of normal thymus showed that the TTC7A protein is expressed in thymic epithelial cells, as well as in thymocytes. Moreover, severe lymphoid depletion was observed in the thymus and peripheral lymphoid tissues from 2 patients with CID-MIA. CONCLUSIONS: We identified deleterious mutations of the TTC7A gene in 8 unrelated patients with CID-MIA and demonstrated that the TTC7A protein is expressed in the thymus. Our results strongly suggest that TTC7A gene defects cause CID-MIA.

Subject(s)

Immunologic Deficiency Syndromes/genetics , Intestinal Atresia/genetics , Intestines/abnormalities , Proteins/genetics , Animals , Preschool Child , Exome/genetics , Female , Humans , Infant , Newborn Infant , Male , Mice , Mutation , Oligonucleotide Array Sequence Analysis , Messenger RNA/metabolism , Thymus Gland/metabolism , Tissue Array Analysis

20.

Modeling biases from low-pass genome sequencing to enable accurate population genetic inferences.

Fonseca, Emanuel M; Tran, Linh N; Mendoza, Hannah; Gutenkunst, Ryan N.

bioRxiv ; 2024 Jul 23.

Article in English | MEDLINE | ID: mdl-39091836

ABSTRACT

Low-pass genome sequencing is cost-effective and enables analysis of large cohorts. However, it introduces biases by reducing heterozygous genotypes and low-frequency alleles, impacting subsequent analyses such as demographic history inference. We developed a probabilistic model of low-pass biases from the Genome Analysis Toolkit (GATK) multi-sample calling pipeline, and we implemented it in the population genomic inference software dadi. We evaluated the model using simulated low-pass datasets and found that it alleviated low-pass biases in inferred demographic parameters. We further validated the model by downsampling 1000 Genomes Project data, demonstrating its effectiveness on real data. Our model is widely applicable and substantially improves model-based inferences from low-pass population genomic data.
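
To give a feel for why low coverage reduces heterozygous genotypes, as stated in entry 20, here is a textbook-style sketch. It is not the probabilistic model implemented in dadi and does not model the GATK multi-sample calling pipeline. It assumes each read samples one of the two alleles with probability 1/2, independently and without sequencing error, and that read depth is Poisson distributed. All function names are hypothetical.

```python
import math


def het_dropout_given_depth(depth):
    """P(all reads at a true heterozygous site carry the same allele).

    Assumes each read samples either allele with probability 1/2,
    independently, with no sequencing error.
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    return 2 * 0.5 ** depth


def expected_het_dropout(mean_depth, max_depth=200):
    """Average the dropout probability over Poisson-distributed depth.

    Sites with zero reads contribute nothing here (they are missing,
    not miscalled), so the result is P(depth >= 1 and one allele seen).
    """
    total = 0.0
    pmf = math.exp(-mean_depth)  # Poisson P(depth = 0)
    for depth in range(1, max_depth + 1):
        pmf *= mean_depth / depth  # P(depth = k) from P(depth = k - 1)
        total += pmf * het_dropout_given_depth(depth)
    return total


def expected_het_dropout_closed_form(mean_depth):
    """Closed form of the same sum: 2 * (exp(-c / 2) - exp(-c))."""
    return 2 * (math.exp(-mean_depth / 2) - math.exp(-mean_depth))


if __name__ == "__main__":
    for c in (1.0, 2.0, 5.0, 10.0, 30.0):
        print(c, round(expected_het_dropout(c), 6),
              round(expected_het_dropout_closed_form(c), 6))
```

Under these simplifying assumptions the chance of seeing only one allele at a heterozygous site falls off quickly as mean depth grows, which is the qualitative effect the abstract describes. A real calling pipeline adds genotype likelihoods, quality filters, and multi-sample priors that this sketch ignores.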
