Results 1 - 11 of 11
1.
Genome Biol ; 24(1): 249, 2023 10 30.
Article in English | MEDLINE | ID: mdl-37904256

ABSTRACT

CHESS 3 represents an improved human gene catalog based on nearly 10,000 RNA-seq experiments across 54 body sites. It significantly improves current genome annotation by integrating the latest reference data and algorithms, machine learning techniques for noise filtering, and new protein structure prediction methods. CHESS 3 contains 41,356 genes, including 19,839 protein-coding genes and 158,377 transcripts, with 14,863 protein-coding transcripts not in other catalogs. It includes all MANE transcripts and at least one transcript for most RefSeq and GENCODE genes. On the CHM13 human genome, the CHESS 3 catalog contains an additional 129 protein-coding genes. CHESS 3 is available at http://ccb.jhu.edu/chess .


Subject(s)
Human Genome, Proteins, Humans, Phylogeny, Proteins/genetics, Algorithms, Software, Molecular Sequence Annotation
2.
Nat Protoc ; 17(12): 2815-2839, 2022 12.
Article in English | MEDLINE | ID: mdl-36171387

ABSTRACT

Metagenomic experiments expose the wide range of microscopic organisms in any microbial environment through high-throughput DNA sequencing. The computational analysis of the sequencing data is critical for the accurate and complete characterization of the microbial community. To facilitate efficient and reproducible metagenomic analysis, we introduce a step-by-step protocol for the Kraken suite, an end-to-end pipeline for the classification, quantification and visualization of metagenomic datasets. Our protocol describes the execution of the Kraken programs, via a sequence of easy-to-use scripts, in two scenarios: (1) quantification of the species in a given metagenomics sample; and (2) detection of a pathogenic agent from a clinical sample taken from a human patient. The protocol, which is executed within 1-2 h, is targeted to biologists and clinicians working in microbiome or metagenomics analysis who are familiar with the Unix command-line environment.


Subject(s)
Metagenome, Microbiota, Humans, Software, Metagenomics/methods, High-Throughput Nucleotide Sequencing/methods, Microbiota/genetics, DNA Sequence Analysis/methods
3.
Article in English | MEDLINE | ID: mdl-37602140

ABSTRACT

Kraken and KrakenUniq are widely used tools for classifying metagenomics sequences. A key requirement for these systems is a database containing all k-mers from all genomes that the users want to be able to detect, where k = 31 by default. This database can be very large, easily exceeding 100 gigabytes (GB) and sometimes 400 GB. Previously, Kraken and KrakenUniq required loading the entire database into main memory (RAM), and if RAM was insufficient, they used memory mapping, which significantly increased the running time for large datasets. We have implemented a new algorithm in KrakenUniq that allows it to load and process the database in chunks, with only a modest increase in running time. This enhancement now makes it feasible to run KrakenUniq on very large datasets and huge databases on virtually any computer, even a laptop, while providing the same very high classification accuracy as the previous system. Statement of need: The KrakenUniq software classifies reads from metagenomic samples to establish which organisms are present in the samples and estimate their abundance. The software is widely used by researchers and clinicians in medical diagnostics, microbiome and environmental studies. Typical databases used by KrakenUniq are tens to hundreds of gigabytes in size. The original KrakenUniq code required loading the entire database in RAM, which demanded expensive high-memory servers to run it efficiently. If a user did not have enough physical RAM to load the entire database, KrakenUniq resorted to memory-mapping the database, which significantly increased run times, frequently by a factor of more than 100. The new functionality described in this paper enables users who do not have access to high-memory servers to run KrakenUniq efficiently, with a CPU time increase of only 3- to 4-fold, down from more than 100-fold.
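The idea of processing a k-mer database in chunks can be illustrated with a small sketch. This is not KrakenUniq's actual algorithm or file format; the function name `chunked_lookup`, the in-memory chunk layout, and the toy 3-mers are all assumptions made for illustration. It only shows that a lookup can be answered while holding one sorted chunk in memory at a time.

```python
from bisect import bisect_left


def chunked_lookup(db_chunks, query_kmers):
    """Look up k-mers against a database that is streamed chunk by chunk.

    db_chunks:   iterable of lists of (kmer, taxon_id) pairs, each list
                 sorted by kmer (hypothetical layout, for illustration only)
    query_kmers: iterable of k-mer strings extracted from the reads
    """
    pending = set(query_kmers)  # k-mers not yet found in any chunk
    hits = {}
    for chunk in db_chunks:     # only this chunk needs to be in memory
        keys = [kmer for kmer, _ in chunk]
        for query in list(pending):
            i = bisect_left(keys, query)  # binary search in the sorted chunk
            if i < len(keys) and keys[i] == query:
                hits[query] = chunk[i][1]
                pending.discard(query)
    return hits


# Toy database split into two chunks (k = 3 here only to keep it short).
toy_chunks = [
    [("AAA", 1), ("ACG", 2)],
    [("CCC", 3), ("GGT", 4)],
]
result = chunked_lookup(toy_chunks, ["ACG", "GGT", "TTT"])
```

If `db_chunks` is a generator that reads each chunk from disk, peak memory is bounded by the chunk size rather than by the full database, at the cost of one pass over the query k-mers per chunk.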

4.
Bioinformatics ; 38(5): 1440-1442, 2022 02 07.
Article in English | MEDLINE | ID: mdl-34734986

ABSTRACT

SUMMARY: PhyloCSF++ is an efficient and parallelized C++ implementation of the popular PhyloCSF method to distinguish protein-coding and non-coding regions in a genome based on multiple sequence alignments (MSAs). It can score alignments or produce browser tracks for entire genomes in the wig file format. Additionally, PhyloCSF++ annotates coding sequences in GFF/GTF files using precomputed tracks or computes and scores MSAs on the fly with MMseqs2. AVAILABILITY AND IMPLEMENTATION: PhyloCSF++ is released under the AGPLv3 license. Binaries and source code are available at https://github.com/cpockrandt/PhyloCSFpp. The software can be installed through bioconda. A variety of tracks can be accessed through ftp://ftp.ccb.jhu.edu/pub/software/phylocsfpp/.


Subject(s)
Genome, Software, Sequence Alignment, Exons
5.
Bioinformatics ; 37(20): 3650-3651, 2021 Oct 25.
Article in English | MEDLINE | ID: mdl-33964128

ABSTRACT

SUMMARY: Although the ability to programmatically summarize and visually inspect sequencing data is an integral part of genome analysis, currently available methods are not capable of handling large numbers of samples. In particular, making a visual comparison of transcriptional landscapes between two sets of thousands of RNA-seq samples is limited by available computational resources, which can be overwhelmed due to the sheer size of the data. In this work, we present TieBrush, a software package designed to process very large sequencing datasets (RNA, whole-genome, exome, etc.) into a form that enables quick visual and computational inspection. TieBrush can also be used as a method for aggregating data for downstream computational analysis, and is compatible with most software tools that take aligned reads as input. AVAILABILITY AND IMPLEMENTATION: TieBrush is provided as a C++ package under the MIT License. Precompiled binaries, source code and example data are available on GitHub (https://github.com/alevar/tiebrush). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

6.
Genetics ; 218(3)2021 07 14.
Article in English | MEDLINE | ID: mdl-33983397

ABSTRACT

The ability to detect recombination in pathogen genomes is crucial to the accuracy of phylogenetic analysis and consequently to forecasting the spread of infectious diseases and to developing therapeutics and public health policies. However, in the case of SARS-CoV-2, the low divergence of near-identical genomes sequenced over a short period of time makes conventional analysis infeasible. Using a novel method, Bolotie, we identified 225 anomalous SARS-CoV-2 genomes of likely recombinant origins out of the first 87,695 genomes to be released, several of which have persisted in the population. Bolotie is specifically designed to perform a rapid search for inter-clade recombination events over extremely large datasets, facilitating analysis of novel isolates in seconds. In cases where raw sequencing data were available, we were able to rule out the possibility that these samples represented co-infections by analyzing the underlying sequence reads. The Bolotie software and other data from our study are available at https://github.com/salzberg-lab/bolotie.


Subject(s)
SARS-CoV-2, Viral Genome, Phylogeny, Genetic Recombination, Software
7.
F1000Res ; 10: 820, 2021.
Article in English | MEDLINE | ID: mdl-36212901

ABSTRACT

Background: Metagenomic sequencing has the potential to identify a wide range of pathogens in human tissue samples. Sarcoidosis is a complex disorder whose etiology remains unknown and for which a variety of infectious causes have been hypothesized. We sought to conduct metagenomic sequencing on cases of ocular and periocular sarcoidosis, none of them with previously identified infectious causes. Methods: Archival tissue specimens of 16 subjects with biopsies of ocular and periocular tissues that were positive for non-caseating granulomas were used as cases. Four archival tissue specimens that did not demonstrate non-caseating granulomas were also included as controls. Genomic DNA was extracted from tissue sections. DNA libraries were generated from the extracted genomic DNA and the libraries underwent next-generation sequencing. Results: We generated between 4.8 and 20.7 million reads for each of the 16 cases plus four control samples. For eight of the cases, we identified microbial pathogens that were present well above the background, with one potential pathogen identified for seven of the cases and two possible pathogens for one of the cases. Five of the eight cases were associated with bacteria (Campylobacter concisus, Neisseria elongata, Streptococcus salivarius, Pseudopropionibacterium propionicum, and Paracoccus yeei), two cases with fungi (Exophiala oligosperma, Lomentospora prolificans and Aspergillus versicolor) and one case with a virus (Mupapillomavirus 1). Interestingly, four of the five bacterial species are also part of the human oral microbiome. Conclusions: Using metagenomic sequencing, we identified possible infectious causes in half of the ocular and periocular sarcoidosis cases analyzed. Our findings support the proposition that sarcoidosis could be an etiologically heterogeneous disease. Because these are previously banked samples, direct follow-up in the respective patients is impossible, but these results suggest that sequencing may be a valuable tool in better understanding the etiopathogenesis of sarcoidosis and in diagnosing and treating this disease.


Subject(s)
Microbiota, Sarcoidosis, Bacteria/genetics, High-Throughput Nucleotide Sequencing/methods, Humans, Metagenome, Metagenomics/methods, Microbiota/genetics, Sarcoidosis/diagnosis, Sarcoidosis/genetics
8.
bioRxiv ; 2020 Sep 21.
Article in English | MEDLINE | ID: mdl-32995774

ABSTRACT

The ability to detect recombination in pathogen genomes is crucial to the accuracy of phylogenetic analysis and consequently to forecasting the spread of infectious diseases and to developing therapeutics and public health policies. However, previous methods for detecting recombination and reassortment events cannot handle the computational requirements of analyzing tens of thousands of genomes, a scenario that has now emerged in the effort to track the spread of the SARS-CoV-2 virus. Furthermore, the low divergence of near-identical genomes sequenced in short periods of time presents a statistical challenge not addressed by available methods. In this work we present Bolotie, an efficient method designed to detect recombination and reassortment events between clades of viral genomes. We applied our method to a large collection of SARS-CoV-2 genomes and discovered hundreds of isolates that are likely of a recombinant origin. In cases where raw sequencing data was available, we were able to rule out the possibility that these samples represented co-infections by analyzing the underlying sequence reads. Our findings further show that several recombinants appear to have persisted in the population.

9.
Bioinformatics ; 36(12): 3687-3692, 2020 06 01.
Article in English | MEDLINE | ID: mdl-32246826

ABSTRACT

MOTIVATION: Computing the uniqueness of k-mers for each position of a genome while allowing for up to e mismatches is computationally challenging. However, it is crucial for many biological applications such as the design of guide RNA for CRISPR experiments. More formally, the uniqueness or (k, e)-mappability can be described for every position as the reciprocal value of how often this k-mer occurs approximately in the genome, i.e. with up to e mismatches. RESULTS: We present a fast method GenMap to compute the (k, e)-mappability. We extend the mappability algorithm, such that it can also be computed across multiple genomes where a k-mer occurrence is only counted once per genome. This allows for the computation of marker sequences or finding candidates for probe design by identifying approximate k-mers that are unique to a genome or that are present in all genomes. GenMap supports different formats such as binary output, wig and bed files as well as csv files to export the location of all approximate k-mers for each genomic position. AVAILABILITY AND IMPLEMENTATION: GenMap can be installed via bioconda. Binaries and C++ source code are available on https://github.com/cpockrandt/genmap.
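The definition of (k, e)-mappability given in the abstract can be made concrete with a brute-force sketch. This is not GenMap's algorithm, which is designed to be fast on whole genomes; the function below is a quadratic-time illustration for toy inputs, and the name `mappability` and the example sequence are assumptions introduced here.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def mappability(genome, k, e):
    """Brute-force (k, e)-mappability for every k-mer start position.

    For each position, the value is the reciprocal of the number of
    positions in the genome whose k-mer matches with at most e mismatches
    (the k-mer always matches itself, so the count is at least 1).
    """
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    values = []
    for kmer in kmers:
        occurrences = sum(1 for other in kmers if hamming(kmer, other) <= e)
        values.append(1.0 / occurrences)
    return values


# Toy sequence "AAAAC" with k = 2 has the k-mers AA, AA, AA, AC.
exact = mappability("AAAAC", 2, 0)       # AA occurs 3 times, AC once
approximate = mappability("AAAAC", 2, 1)  # with 1 mismatch, all 4 k-mers match each other
```

A value of 1 marks a k-mer that is unique in the sequence under the chosen mismatch budget; smaller values mark repeats.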


Subject(s)
Genome, Software, Algorithms, Genomics, DNA Sequence Analysis
10.
BMC Biotechnol ; 19(1): 40, 2019 06 27.
Article in English | MEDLINE | ID: mdl-31248401

ABSTRACT

BACKGROUND: Natural variations in a genome can drastically alter the CRISPR-Cas9 off-target landscape by creating or removing sites. Despite the resulting potential side-effects from such unaccounted for sites, current off-target detection pipelines are not equipped to include variant information. To address this, we developed VARiant-aware detection and SCoring of Off-Targets (VARSCOT). RESULTS: VARSCOT identifies only 0.6% of off-targets to be common between 4 individual genomes and the reference, with an average of 82% of off-targets unique to an individual. VARSCOT is the most sensitive detection method for off-targets, finding 40 to 70% more experimentally verified off-targets compared to other popular software tools and its machine learning model allows for CRISPR-Cas9 concentration aware off-target activity scoring. CONCLUSIONS: VARSCOT allows researchers to take genomic variation into account when designing individual or population-wide targeting strategies. VARSCOT is available from https://github.com/BauerLab/VARSCOT .


Subject(s)
CRISPR-Cas Systems, Computational Biology/methods, Gene Editing/methods, Gene Targeting/methods, Genomics/methods, Software, Gene Editing/standards, Gene Targeting/standards, Genomics/standards, Internet, Reproducibility of Results
11.
J Biotechnol ; 261: 157-168, 2017 Nov 10.
Article in English | MEDLINE | ID: mdl-28888961

ABSTRACT

BACKGROUND: The use of novel algorithmic techniques is pivotal to many important problems in life science. For example, the sequencing of the human genome (Venter et al., 2001) would not have been possible without advanced assembly algorithms, and the development of practical BWT-based read mappers has been instrumental for NGS analysis. However, owing to the high speed of technological progress and the urgent need for bioinformatics tools, there was a widening gap between state-of-the-art algorithmic techniques and the actual algorithmic components of tools that are in widespread use. We previously addressed this by introducing the SeqAn library of efficient data types and algorithms in 2008 (Döring et al., 2008). RESULTS: The SeqAn library has matured considerably since its first publication 9 years ago. In this article we review its status as an established resource for programmers in the field of sequence analysis and its contributions to many analysis tools. CONCLUSIONS: We anticipate that SeqAn will continue to be a valuable resource, especially since it started to actively support various hardware acceleration techniques in a systematic manner.


Subject(s)
Genetic Databases, Genomics/methods, DNA Sequence Analysis/methods, Software, Algorithms, Human Genome, High-Throughput Nucleotide Sequencing, Humans, Sequence Alignment