Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 8 de 8
Filtrar
1.
Nucleic Acids Res ; 49(15): 8471-8487, 2021 09 07.
Artículo en Inglés | MEDLINE | ID: mdl-34313777

RESUMEN

There is a pressing need today to mechanistically interpret sets of genomic variants associated with diseases. Here we present a tool called 'VarSAn' that uses a network analysis algorithm to identify pathways relevant to a given set of variants. VarSAn analyzes a configurable network whose nodes represent variants, genes and pathways, using a Random Walk with Restarts algorithm to rank pathways for relevance to the given variants, and reports P-values for pathway relevance. It treats non-coding and coding variants differently, properly accounts for the number of pathways impacted by each variant and identifies relevant pathways even if many variants do not directly impact genes of the pathway. We use VarSAn to identify pathways relevant to variants related to cancer and several other diseases, as well as drug response variation. We find VarSAn's pathway ranking to be complementary to the standard approach of enrichment tests on genes related to the query set. We adopt a novel benchmarking strategy to quantify its advantage over this baseline approach. Finally, we use VarSAn to discover key pathways, including the VEGFA-VEGFR2 pathway, related to de novo variants in patients of Hypoplastic Left Heart Syndrome, a rare and severe congenital heart defect.


Asunto(s)
Genómica/métodos , Polimorfismo de Nucleótido Simple , Programas Informáticos , Algoritmos , Neoplasias de la Mama/genética , Neoplasias de la Mama/metabolismo , Interpretación Estadística de Datos , Femenino , Genes , Humanos , Síndrome del Corazón Izquierdo Hipoplásico/genética , Síndrome del Corazón Izquierdo Hipoplásico/metabolismo , Masculino , Neoplasias de la Próstata/genética , Neoplasias de la Próstata/metabolismo , Transducción de Señal/genética
2.
Bioinformatics ; 36(7): 2275-2277, 2020 04 01.
Artículo en Inglés | MEDLINE | ID: mdl-31830243

RESUMEN

MOTIVATION: In an effort to provide a response to the ever-expanding generation of genomic data, the International Organization for Standardization (ISO) is designing a new solution for the representation, compression and management of genomic sequencing data: the Moving Picture Experts Group (MPEG)-G standard. This paper discusses the first implementation of an MPEG-G compliant entropy codec: GABAC. GABAC combines proven coding technologies, such as context-adaptive binary arithmetic coding, binarization schemes and transformations, into a straightforward solution for the compression of sequencing data. RESULTS: We demonstrate that GABAC outperforms well-established (entropy) codecs in a significant set of cases and thus can serve as an extension for existing genomic compression solutions, such as CRAM. AVAILABILITY AND IMPLEMENTATION: The GABAC library is written in C++. We also provide a command line application which exercises all features provided by the library. GABAC can be downloaded from https://github.com/mitogen/gabac. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Asunto(s)
Compresión de Datos , Secuenciación de Nucleótidos de Alto Rendimiento , Genoma , Genómica , Programas Informáticos
3.
BMC Bioinformatics ; 20(1): 557, 2019 Nov 08.
Artículo en Inglés | MEDLINE | ID: mdl-31703611

RESUMEN

BACKGROUND: Use of the Genome Analysis Toolkit (GATK) continues to be the standard practice in genomic variant calling in both research and the clinic. Recently the toolkit has been rapidly evolving. Significant computational performance improvements have been introduced in GATK3.8 through collaboration with Intel in 2017. The first release of GATK4 in early 2018 revealed rewrites in the code base, as the stepping stone toward a Spark implementation. As the software continues to be a moving target for optimal deployment in highly productive environments, we present a detailed analysis of these improvements, to help the community stay abreast with changes in performance. RESULTS: We re-evaluated multiple options, such as threading, parallel garbage collection, I/O options and data-level parallelization. Additionally, we considered the trade-offs of using GATK3.8 and GATK4. We found optimized parameter values that reduce the time of executing the best practices variant calling procedure by 29.3% for GATK3.8 and 16.9% for GATK4. Further speedups can be accomplished by splitting data for parallel analysis, resulting in run time of only a few hours on whole human genome sequenced to the depth of 20X, for both versions of GATK. Nonetheless, GATK4 is already much more cost-effective than GATK3.8. Thanks to significant rewrites of the algorithms, the same analysis can be run largely in a single-threaded fashion, allowing users to process multiple samples on the same CPU. CONCLUSIONS: In time-sensitive situations, when a patient has a critical or rapidly developing condition, it is useful to minimize the time to process a single sample. In such cases we recommend using GATK3.8 by splitting the sample into chunks and computing across multiple nodes. The resultant walltime will be nnn.4 hours at the cost of $41.60 on 4 c5.18xlarge instances of Amazon Cloud. For cost-effectiveness of routine analyses or for large population studies, it is useful to maximize the number of samples processed per unit time. Thus we recommend GATK4, running multiple samples on one node. The total walltime will be ∼34.1 hours on 40 samples, with 1.18 samples processed per hour at the cost of $2.60 per sample on c5.18xlarge instance of Amazon Cloud.


Asunto(s)
Genómica/métodos , Programas Informáticos , Algoritmos , Cromosomas Humanos/genética , Genoma Humano , Haplotipos/genética , Secuenciación de Nucleótidos de Alto Rendimiento , Humanos
4.
BMC Bioinformatics ; 20(1): 722, 2019 12 17.
Artículo en Inglés | MEDLINE | ID: mdl-31847808

RESUMEN

Following publication of the original article [1], the author explained that Table 2 is displayed incorrectly. The correct Table 2 is given below. The original article has been corrected.

5.
Sci Rep ; 11(1): 21680, 2021 11 04.
Artículo en Inglés | MEDLINE | ID: mdl-34737383

RESUMEN

The changing landscape of genomics research and clinical practice has created a need for computational pipelines capable of efficiently orchestrating complex analysis stages while handling large volumes of data across heterogeneous computational environments. Workflow Management Systems (WfMSs) are the software components employed to fill this gap. This work provides an approach and systematic evaluation of key features of popular bioinformatics WfMSs in use today: Nextflow, CWL, and WDL and some of their executors, along with Swift/T, a workflow manager commonly used in high-scale physics applications. We employed two use cases: a variant-calling genomic pipeline and a scalability-testing framework, where both were run locally, on an HPC cluster, and in the cloud. This allowed for evaluation of those four WfMSs in terms of language expressiveness, modularity, scalability, robustness, reproducibility, interoperability, ease of development, along with adoption and usage in research labs and healthcare settings. This article is trying to answer, which WfMS should be chosen for a given bioinformatics application regardless of analysis type?. The choice of a given WfMS is a function of both its intrinsic language and engine features. Within bioinformatics, where analysts are a mix of dry and wet lab scientists, the choice is also governed by collaborations and adoption within large consortia and technical support provided by the WfMS team/community. As the community and its needs continue to evolve along with computational infrastructure, WfMSs will also evolve, especially those with permissive licenses that allow commercial use. In much the same way as the dataflow paradigm and containerization are now well understood to be very useful in bioinformatics applications, we will continue to see innovations of tools and utilities for other purposes, like big data technologies, interoperability, and provenance.


Asunto(s)
Biología Computacional/métodos , Programas Informáticos , Flujo de Trabajo , Macrodatos , Genómica , Humanos , Reproducibilidad de los Resultados
6.
PLoS One ; 14(7): e0211608, 2019.
Artículo en Inglés | MEDLINE | ID: mdl-31287816

RESUMEN

Bioinformatics research is frequently performed using complex workflows with multiple steps, fans, merges, and conditionals. This complexity makes management of the workflow difficult on a computer cluster, especially when running in parallel on large batches of data: hundreds or thousands of samples at a time. Scientific workflow management systems could help with that. Many are now being proposed, but is there yet the "best" workflow management system for bioinformatics? Such a system would need to satisfy numerous, sometimes conflicting requirements: from ease of use, to seamless deployment at peta- and exa-scale, and portability to the cloud. We evaluated Swift/T as a candidate for such role by implementing a primary genomic variant calling workflow in the Swift/T language, focusing on workflow management, performance and scalability issues that arise from production-grade big data genomic analyses. In the process we introduced novel features into the language, which are now part of its open repository. Additionally, we formalized a set of design criteria for quality, robust, maintainable workflows that must function at-scale in a production setting, such as a large genomic sequencing facility or a major hospital system. The use of Swift/T conveys two key advantages. (1) It operates transparently in multiple cluster scheduling environments (PBS Torque, SLURM, Cray aprun environment, etc.), thus a single workflow is trivially portable across numerous clusters. (2) The leaf functions of Swift/T permit developers to easily swap executables in and out of the workflow, which makes it easy to maintain and to request resources optimal for each stage of the pipeline. While Swift/T's data-level parallelism eliminates the need to code parallel analysis of multiple samples, it does make debugging more difficult, as is common for implicitly parallel code. Nonetheless, the language gives users a powerful and portable way to scale up analyses in many computing architectures. The code for our implementation of a variant calling workflow using Swift/T can be found on GitHub at https://github.com/ncsa/Swift-T-Variant-Calling, with full documentation provided at http://swift-t-variant-calling.readthedocs.io/en/latest/.


Asunto(s)
Biología Computacional , Genómica , Programas Informáticos , Animales , Humanos , Flujo de Trabajo
7.
Front Genet ; 10: 736, 2019.
Artículo en Inglés | MEDLINE | ID: mdl-31481971

RESUMEN

As reliable, efficient genome sequencing becomes ubiquitous, the need for similarly reliable and efficient variant calling becomes increasingly important. The Genome Analysis Toolkit (GATK), maintained by the Broad Institute, is currently the widely accepted standard for variant calling software. However, alternative solutions may provide faster variant calling without sacrificing accuracy. One such alternative is Sentieon DNASeq, a toolkit analogous to GATK but built on a highly optimized backend. We conducted an independent evaluation of the DNASeq single-sample variant calling pipeline in comparison to that of GATK. Our results support the near-identical accuracy of the two software packages, showcase optimal scalability and great speed from Sentieon, and describe computational performance considerations for the deployment of DNASeq.

8.
PLoS One ; 11(11): e0167047, 2016.
Artículo en Inglés | MEDLINE | ID: mdl-27893777

RESUMEN

An obstacle to validating and benchmarking methods for genome analysis is that there are few reference datasets available for which the "ground truth" about the mutational landscape of the sample genome is known and fully validated. Additionally, the free and public availability of real human genome datasets is incompatible with the preservation of donor privacy. In order to better analyze and understand genomic data, we need test datasets that model all variants, reflecting known biology as well as sequencing artifacts. Read simulators can fulfill this requirement, but are often criticized for limited resemblance to true data and overall inflexibility. We present NEAT (NExt-generation sequencing Analysis Toolkit), a set of tools that not only includes an easy-to-use read simulator, but also scripts to facilitate variant comparison and tool evaluation. NEAT has a wide variety of tunable parameters which can be set manually on the default model or parameterized using real datasets. The software is freely available at github.com/zstephens/neat-genreads.


Asunto(s)
Biología Computacional/métodos , Secuenciación de Nucleótidos de Alto Rendimiento/métodos , Mutación/genética , Proteínas de Neoplasias/genética , Neoplasias/genética , Análisis de Secuencia de ADN/métodos , Algoritmos , Simulación por Computador , Genoma Humano , Genómica/métodos , Humanos , Neoplasias/patología
SELECCIÓN DE REFERENCIAS
DETALLE DE LA BÚSQUEDA