1.
Brief Bioinform ; 19(5): 893-904, 2018 Sep 28.
Article in English | MEDLINE | ID: mdl-28407084

ABSTRACT

Current variant discovery approaches often rely on an initial read mapping to the reference sequence. Their effectiveness is limited by the presence of gaps, potential misassemblies, regions of duplicates with a high-sequence similarity and regions of high-sequence divergence in the reference. Also, mapping-based approaches are less sensitive to large INDELs and complex variations and provide little phase information in personal genomes. A few de novo assemblers have been developed to identify variants through direct variant calling from the assembly graph, micro-assembly and whole-genome assembly, but mainly for whole-genome sequencing (WGS) data. We developed SGVar, a de novo assembly workflow for haplotype-based variant discovery from whole-exome sequencing (WES) data. Using simulated human exome data, we compared SGVar with five variation-aware de novo assemblers and with BWA-MEM together with three haplotype- or local de novo assembly-based callers. SGVar outperforms the other assemblers in sensitivity and tolerance of sequencing errors. We recapitulated the findings on whole-genome and exome data from a Utah residents with Northern and Western European ancestry (CEU) trio, showing that SGVar had high sensitivity both in the highly divergent human leukocyte antigen (HLA) region and in non-HLA regions of chromosome 6. In particular, SGVar is robust to sequencing error, k-mer selection, divergence level and coverage depth. Unlike mapping-based approaches, SGVar is capable of resolving long-range phase and identifying large INDELs from WES, more prominently from WGS. We conclude that SGVar represents an ideal platform for WES-based variant discovery in highly divergent regions and across the whole genome.


Subject(s)
Exome Sequencing/methods , Genetic Variation , Chromosome Mapping/methods , Chromosome Mapping/statistics & numerical data , Chromosomes, Human, Pair 6/genetics , Computational Biology/methods , Computer Simulation , Female , Genome, Human , HLA Antigens/genetics , Haplotypes , Humans , INDEL Mutation , Polymorphism, Single Nucleotide , Pregnancy , Exome Sequencing/statistics & numerical data , Whole Genome Sequencing/methods , Whole Genome Sequencing/statistics & numerical data
2.
BMC Bioinformatics ; 20(1): 557, 2019 Nov 08.
Article in English | MEDLINE | ID: mdl-31703611

ABSTRACT

BACKGROUND: Use of the Genome Analysis Toolkit (GATK) continues to be the standard practice in genomic variant calling in both research and the clinic. Recently the toolkit has been rapidly evolving. Significant computational performance improvements have been introduced in GATK3.8 through collaboration with Intel in 2017. The first release of GATK4 in early 2018 revealed rewrites in the code base, as the stepping stone toward a Spark implementation. As the software continues to be a moving target for optimal deployment in highly productive environments, we present a detailed analysis of these improvements, to help the community stay abreast with changes in performance. RESULTS: We re-evaluated multiple options, such as threading, parallel garbage collection, I/O options and data-level parallelization. Additionally, we considered the trade-offs of using GATK3.8 and GATK4. We found optimized parameter values that reduce the time of executing the best practices variant calling procedure by 29.3% for GATK3.8 and 16.9% for GATK4. Further speedups can be accomplished by splitting data for parallel analysis, resulting in run time of only a few hours on whole human genome sequenced to the depth of 20X, for both versions of GATK. Nonetheless, GATK4 is already much more cost-effective than GATK3.8. Thanks to significant rewrites of the algorithms, the same analysis can be run largely in a single-threaded fashion, allowing users to process multiple samples on the same CPU. CONCLUSIONS: In time-sensitive situations, when a patient has a critical or rapidly developing condition, it is useful to minimize the time to process a single sample. In such cases we recommend using GATK3.8 by splitting the sample into chunks and computing across multiple nodes. The resultant walltime will be nnn.4 hours at the cost of $41.60 on 4 c5.18xlarge instances of Amazon Cloud. 
For cost-effectiveness of routine analyses or for large population studies, it is useful to maximize the number of samples processed per unit time. Thus we recommend GATK4, running multiple samples on one node. The total walltime will be ∼34.1 hours on 40 samples, with 1.18 samples processed per hour at the cost of $2.60 per sample on c5.18xlarge instance of Amazon Cloud.
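The cost and throughput figures quoted for the GATK4 scenario can be cross-checked with simple arithmetic. The sketch below uses only the numbers stated in the abstract (40 samples, 34.1 hours, $2.60 per sample, $41.60 on 4 instances); the per-instance hourly rate is derived from them, not taken from any price list, and the last step rests on the assumption that both scenarios were billed at the same rate.

```python
# Figures quoted in the abstract for the GATK4 throughput scenario.
samples = 40
walltime_hours = 34.1
cost_per_sample_usd = 2.60

throughput = samples / walltime_hours                  # samples per hour
total_cost_usd = samples * cost_per_sample_usd         # whole batch on one node
implied_hourly_rate = total_cost_usd / walltime_hours  # USD per instance-hour

print(f"throughput: {throughput:.2f} samples/hour")
print(f"implied rate: ${implied_hourly_rate:.2f} per instance-hour")

# ASSUMPTION: the GATK3.8 scenario was billed at the same per-instance rate.
# This is a consistency check, not a value reported in the abstract.
gatk38_cost_usd = 41.60
gatk38_instances = 4
implied_gatk38_hours = gatk38_cost_usd / (gatk38_instances * implied_hourly_rate)
print(f"implied GATK3.8 walltime: {implied_gatk38_hours:.1f} hours")
```

The computed throughput comes out at roughly 1.17 samples per hour, close to the quoted 1.18; the small gap is presumably due to rounding of the walltime in the abstract.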


Subject(s)
Genomics/methods , Software , Algorithms , Chromosomes, Human/genetics , Genome, Human , Haplotypes/genetics , High-Throughput Nucleotide Sequencing , Humans
3.
BMC Bioinformatics ; 20(1): 722, 2019 Dec 17.
Article in English | MEDLINE | ID: mdl-31847808

ABSTRACT

Following publication of the original article [1], the author explained that Table 2 is displayed incorrectly. The correct Table 2 is given below. The original article has been corrected.

4.
BMC Bioinformatics ; 18(1): 269, 2017 May 22.
Article in English | MEDLINE | ID: mdl-28532394

ABSTRACT

BACKGROUND: The sequence logo has been widely used to represent DNA or RNA motifs for more than three decades. Despite its intelligibility and intuitiveness, the traditional sequence logo is unable to display the intra-motif dependencies and therefore is insufficient to fully characterize nucleotide motifs. Many methods have been developed to quantify the intra-motif dependencies, but fewer tools are available for visualization. RESULT: We developed CircularLogo, a web-based interactive application, which is able to not only visualize the position-specific nucleotide consensus and diversity but also display the intra-motif dependencies. Applying CircularLogo to HNF6 binding sites and tRNA sequences demonstrated its ability to show intra-motif dependencies and intuitively reveal biomolecular structure. CircularLogo is implemented in JavaScript and Python based on the Django web framework. The program's source code and user's manual are freely available at http://circularlogo.sourceforge.net . CircularLogo web server can be accessed from http://bioinformaticstools.mayo.edu/circularlogo/index.html . CONCLUSION: CircularLogo is an innovative web application that is specifically designed to visualize and interactively explore intra-motif dependencies.
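To make the distinction between position-specific diversity and intra-motif dependency concrete, here is a minimal sketch. It computes the per-position information content that a traditional logo displays (without the small-sample correction) and, as a stand-in dependency measure, the mutual information between two positions. The abstract does not say which dependency statistic CircularLogo uses, so mutual information is an assumption here, and the aligned sites are hypothetical toy data.

```python
import math
from collections import Counter

def information_content(column):
    """Bits of information at one DNA motif position (2 minus Shannon entropy)."""
    counts = Counter(column)
    total = len(column)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return 2.0 - entropy

def mutual_information(col_i, col_j):
    """Mutual information in bits between two motif positions."""
    n = len(col_i)
    p_i, p_j = Counter(col_i), Counter(col_j)
    p_ij = Counter(zip(col_i, col_j))
    return sum(
        (c / n) * math.log2((c / n) / ((p_i[a] / n) * (p_j[b] / n)))
        for (a, b), c in p_ij.items()
    )

# Hypothetical aligned binding sites: positions 0 and 3 co-vary (A..T or G..C).
sites = ["ACGT", "ACGT", "GCGC", "GCGC", "ACAT", "GCAC"]
columns = list(zip(*sites))

for idx, col in enumerate(columns):
    print(f"position {idx}: {information_content(col):.2f} bits")
print(f"MI(0, 3) = {mutual_information(columns[0], columns[3]):.2f} bits")
print(f"MI(0, 2) = {mutual_information(columns[0], columns[2]):.2f} bits")
```

In this toy alignment, positions 0 and 3 each look only moderately conserved on their own, yet they are fully dependent on each other. A traditional logo shows the former and hides the latter, which is the gap the abstract describes.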


Subject(s)
Internet , Nucleotide Motifs/genetics , Software , Base Sequence , Binding Sites , Eukaryotic Cells/metabolism , Introns/genetics , Nucleic Acid Conformation , RNA Splice Sites/genetics , RNA, Transfer/chemistry , RNA, Transfer/genetics , Sequence Analysis, DNA
5.
BMC Bioinformatics ; 17(1): 403, 2016 Oct 03.
Article in English | MEDLINE | ID: mdl-27716037

ABSTRACT

BACKGROUND: GATK Best Practices workflows are widely used in large-scale sequencing projects and recommend post-alignment processing before variant calling. Two key post-processing steps include the computationally intensive local realignment around known INDELs and base quality score recalibration (BQSR). Both have been shown to reduce erroneous calls; however, the findings are mainly supported by the analytical pipeline that incorporates BWA and GATK UnifiedGenotyper. It is not known whether there is any benefit of post-processing and to what extent the benefit might be for pipelines implementing other methods, especially given that both mappers and callers are typically updated. Moreover, because sequencing platforms are upgraded regularly and the new platforms provide better estimations of read quality scores, the need for post-processing is also unknown. Finally, some regions in the human genome show high sequence divergence from the reference genome; it is unclear whether there is benefit from post-processing in these regions. RESULTS: We used both simulated and NA12878 exome data to comprehensively assess the impact of post-processing for five or six popular mappers together with five callers. Focusing on chromosome 6p21.3, which is a region of high sequence divergence harboring the human leukocyte antigen (HLA) system, we found that local realignment had little or no impact on SNP calling, but increased sensitivity was observed in INDEL calling for the Stampy + GATK UnifiedGenotyper pipeline. No or only a modest effect of local realignment was detected on the three haplotype-based callers and no evidence of effect on Novoalign. BQSR had virtually negligible effect on INDEL calling and generally reduced sensitivity for SNP calling that depended on caller, coverage and level of divergence. 
Specifically, for SAMtools and FreeBayes calling in the regions with low divergence, BQSR reduced the SNP calling sensitivity but improved the precision when the coverage is insufficient. However, in regions of high divergence (e.g., the HLA region), BQSR reduced the sensitivity of both callers with little gain in precision rate. For the other three callers, BQSR reduced the sensitivity without increasing the precision rate regardless of coverage and divergence level. CONCLUSIONS: We demonstrated that the gain from post-processing is not universal; rather, it depends on mapper and caller combination, and the benefit is influenced further by sequencing depth and divergence level. Our analysis highlights the importance of considering these key factors in deciding to apply the computationally intensive post-processing to Illumina exome data.
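The sensitivity and precision trade-off discussed above can be illustrated with a small sketch. Variants are represented as (chromosome, position, ref, alt) tuples and compared by exact match against a truth set. The call sets below are hypothetical toy data chosen to show the pattern of lower sensitivity with higher precision; they are not results from the study, and real comparisons usually need variant normalization first.

```python
def sensitivity_and_precision(truth, calls):
    """Compare a call set with a truth set of (chrom, pos, ref, alt) tuples."""
    true_positives = len(truth & calls)
    sensitivity = true_positives / len(truth) if truth else 0.0
    precision = true_positives / len(calls) if calls else 0.0
    return sensitivity, precision

# Hypothetical truth set and two hypothetical call sets for the same sample.
truth = {
    ("chr6", 100, "A", "G"),
    ("chr6", 250, "C", "T"),
    ("chr6", 400, "G", "GA"),
    ("chr6", 900, "T", "C"),
}
calls_without_bqsr = {
    ("chr6", 100, "A", "G"),
    ("chr6", 250, "C", "T"),
    ("chr6", 400, "G", "GA"),
    ("chr6", 555, "A", "T"),  # false positive
}
calls_with_bqsr = {
    ("chr6", 100, "A", "G"),
    ("chr6", 250, "C", "T"),
}

for label, calls in [("no BQSR", calls_without_bqsr), ("BQSR", calls_with_bqsr)]:
    sens, prec = sensitivity_and_precision(truth, calls)
    print(f"{label}: sensitivity={sens:.2f} precision={prec:.2f}")
```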


Subject(s)
Computational Biology/methods , Computational Biology/standards , Exome/genetics , Sequence Alignment/methods , Software , Genome, Human , High-Throughput Nucleotide Sequencing/methods , Humans , Mutation/genetics , Polymorphism, Single Nucleotide/genetics , Workflow
6.
Sci Rep ; 11(1): 8318, 2021 Apr 15.
Article in English | MEDLINE | ID: mdl-33859327

ABSTRACT

T cell prolymphocytic leukemia (T-PLL) is a rare disease with aggressive clinical course. Cytogenetic analysis, whole-exome and whole-genome sequencing have identified primary structural alterations in T-PLL, including inversion, translocation and copy number variation. Recurrent somatic mutations were also identified in genes encoding chromatin regulators and those in the JAK-STAT signaling pathway. Epigenetic alterations are the hallmark of many cancers. However, genome-wide epigenomic profiles have not been reported in T-PLL, limiting the mechanistic study of its carcinogenesis. We hypothesize epigenetic mechanisms also play a key role in T-PLL pathogenesis. To systematically test this hypothesis, we generated genome-wide maps of regulatory regions using H3K4me3 and H3K27ac ChIP-seq, as well as RNA-seq data in both T-PLL patients and healthy individuals. We found that genes down-regulated in T-PLL are mainly associated with defense response, immune system or adaptive immune response, while up-regulated genes are enriched in developmental process, as well as WNT signaling pathway with crucial roles in cell fate decision. In particular, our analysis revealed a global alteration of regulatory landscape in T-PLL, with differential peaks highly enriched for binding motifs of immune related transcription factors, supporting the epigenetic regulation of oncogenes and genes involved in DNA damage response and T-cell activation. Together, our work reveals a causal role of epigenetic dysregulation in T-PLL.


Subject(s)
Cellular Reprogramming/genetics , Epigenesis, Genetic/genetics , Epigenesis, Genetic/physiology , Leukemia, Prolymphocytic, T-Cell/genetics , Leukemia, Prolymphocytic, T-Cell/pathology , Transcription, Genetic/genetics , DNA Copy Number Variations , DNA Damage/genetics , Genome-Wide Association Study , Humans , Leukemia, Prolymphocytic, T-Cell/immunology , Lymphocyte Activation/genetics , T-Lymphocytes/immunology , Wnt Signaling Pathway/physiology
7.
Sci Rep ; 11(1): 21680, 2021 Nov 04.
Article in English | MEDLINE | ID: mdl-34737383

ABSTRACT

The changing landscape of genomics research and clinical practice has created a need for computational pipelines capable of efficiently orchestrating complex analysis stages while handling large volumes of data across heterogeneous computational environments. Workflow Management Systems (WfMSs) are the software components employed to fill this gap. This work provides an approach and systematic evaluation of key features of popular bioinformatics WfMSs in use today: Nextflow, CWL, and WDL and some of their executors, along with Swift/T, a workflow manager commonly used in high-scale physics applications. We employed two use cases: a variant-calling genomic pipeline and a scalability-testing framework, where both were run locally, on an HPC cluster, and in the cloud. This allowed for evaluation of those four WfMSs in terms of language expressiveness, modularity, scalability, robustness, reproducibility, interoperability, ease of development, along with adoption and usage in research labs and healthcare settings. This article tries to answer the question: which WfMS should be chosen for a given bioinformatics application, regardless of analysis type? The choice of a given WfMS is a function of both its intrinsic language and engine features. Within bioinformatics, where analysts are a mix of dry and wet lab scientists, the choice is also governed by collaborations and adoption within large consortia and technical support provided by the WfMS team/community. As the community and its needs continue to evolve along with computational infrastructure, WfMSs will also evolve, especially those with permissive licenses that allow commercial use. In much the same way as the dataflow paradigm and containerization are now well understood to be very useful in bioinformatics applications, we will continue to see innovations of tools and utilities for other purposes, like big data technologies, interoperability, and provenance.
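As a rough illustration of the dataflow paradigm mentioned above, the sketch below declares each stage of a pipeline together with the stages whose outputs it consumes, and lets a scheduler derive a valid execution order. It uses Python's standard `graphlib` module (Python 3.9+) and does not reflect how Nextflow, CWL, WDL, or Swift/T are implemented; the stage names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical variant-calling pipeline: each stage maps to the set of
# stages whose outputs it depends on.
pipeline = {
    "align": set(),
    "sort": {"align"},
    "call_variants": {"sort"},
    "qc_report": {"sort"},
    "summary": {"call_variants", "qc_report"},
}

# static_order() yields stages so that every dependency comes first.
order = list(TopologicalSorter(pipeline).static_order())
print(order)
```

Because `call_variants` and `qc_report` depend only on `sort`, a real engine is free to run them in parallel; the topological order only guarantees that both finish before `summary` starts.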


Subject(s)
Computational Biology/methods , Software , Workflow , Big Data , Genomics , Humans , Reproducibility of Results
8.
PLoS One ; 14(4): e0214723, 2019.
Article in English | MEDLINE | ID: mdl-30943272

ABSTRACT

Chromatin immunoprecipitation and sequencing (ChIP-seq) has been widely used to map DNA-binding proteins, histone proteins and their modifications. ChIP-seq data contains redundant reads termed duplicates, referring to those mapping to the same genomic location and strand. There are two main sources of duplicates: polymerase chain reaction (PCR) duplicates and natural duplicates. Unlike natural duplicates that represent true signals from sequencing of independent DNA templates, PCR duplicates are artifacts originating from sequencing of identical copies amplified from the same DNA template. In analysis, duplicates are removed from peak calling and signal quantification. Nevertheless, a significant portion of the duplicates is believed to represent true signals. Obviously, removing all duplicates will underestimate the signal level in peaks and impact the identification of signal changes across samples. Therefore, an in-depth evaluation of the impact from duplicate removal is needed. Using eight public ChIP-seq datasets from three narrow-peak and two broad-peak marks, we tried to understand the distribution of duplicates in the genome, the extent by which duplicate removal impacts peak calling and signal estimation, and the factors associated with duplicate level in peaks. The three PCR-free histone H3 lysine 4 trimethylation (H3K4me3) ChIP-seq data had about 40% duplicates and 97% of them were within peaks. For the other datasets generated with PCR amplification of ChIP DNA, as expected, the narrow-peak marks have a much higher proportion of duplicates than the broad-peak marks. We found that duplicates are enriched in peaks and largely represent true signals, more conspicuous in those with high confidence. Furthermore, duplicate level in peaks is strongly correlated with the target enrichment level estimated using nonredundant reads, which provides the basis to properly allocate duplicates between noise and signal. 
Our analysis supports the feasibility of retaining the portion of signal duplicates into downstream analysis, thus alleviating the limitation of complete deduplication.
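The definition of duplicates used above, reads mapping to the same genomic location and strand, can be sketched in a few lines. The function below groups reads by a (chromosome, 5' position, strand) key and reports the duplicate fraction. The read tuples are hypothetical toy data, and the sketch does not attempt the allocation of duplicates between noise and signal that the abstract proposes.

```python
from collections import Counter

def duplicate_stats(reads):
    """Summarize duplicates among reads given as (chrom, five_prime_pos, strand)."""
    counts = Counter(reads)
    total = sum(counts.values())
    nonredundant = len(counts)  # one representative per location and strand
    fraction = (total - nonredundant) / total if total else 0.0
    return {"total": total, "nonredundant": nonredundant, "duplicate_fraction": fraction}

# Hypothetical mapped reads: three share the same location and strand.
reads = [
    ("chr1", 1000, "+"),
    ("chr1", 1000, "+"),
    ("chr1", 1000, "+"),
    ("chr1", 1000, "-"),  # same position, opposite strand: not a duplicate
    ("chr2", 5000, "+"),
]
print(duplicate_stats(reads))
```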


Subject(s)
Chromatin Immunoprecipitation Sequencing/methods , Cell Line, Tumor , Data Analysis , Datasets as Topic , HeLa Cells , Humans , MCF-7 Cells , Polymerase Chain Reaction , Reproducibility of Results
9.
Front Genet ; 10: 736, 2019.
Article in English | MEDLINE | ID: mdl-31481971

ABSTRACT

As reliable, efficient genome sequencing becomes ubiquitous, the need for similarly reliable and efficient variant calling becomes increasingly important. The Genome Analysis Toolkit (GATK), maintained by the Broad Institute, is currently the widely accepted standard for variant calling software. However, alternative solutions may provide faster variant calling without sacrificing accuracy. One such alternative is Sentieon DNASeq, a toolkit analogous to GATK but built on a highly optimized backend. We conducted an independent evaluation of the DNASeq single-sample variant calling pipeline in comparison to that of GATK. Our results support the near-identical accuracy of the two software packages, showcase optimal scalability and great speed from Sentieon, and describe computational performance considerations for the deployment of DNASeq.

10.
Dela J Public Health ; 9(2): 100-101, 2023 Jun.
Article in English | MEDLINE | ID: mdl-37622133