Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 20 de 91
Filtrar
Más filtros

Bases de datos
País/Región como asunto
Tipo del documento
Intervalo de año de publicación
1.
Genome Res ; 33(7): 1069-1077, 2023 07.
Artículo en Inglés | MEDLINE | ID: mdl-37258301

RESUMEN

Tools that classify sequencing reads against a database of reference sequences require efficient index data-structures. The r-index is a compressed full-text index that answers substring presence/absence, count, and locate queries in space proportional to the amount of distinct sequence in the database: [Formula: see text] space, where r is the number of Burrows-Wheeler runs. To date, the r-index has lacked the ability to quickly classify matches according to which reference sequences (or sequence groupings, i.e., taxa) a match overlaps. We present new algorithms and methods for solving this problem. Specifically, given a collection D of d documents, [Formula: see text] over an alphabet of size σ, we extend the r-index with [Formula: see text] additional words to support document listing queries for a pattern [Formula: see text] that occurs in [Formula: see text] documents in D in [Formula: see text] time and [Formula: see text] space, where w is the machine word size. Applied in a bacterial mock community experiment, our method is up to three times faster than a comparable method that uses the standard r-index locate queries. We show that our method classifies both simulated and real nanopore reads at the strain level with higher accuracy compared with other approaches. Finally, we present strategies for compacting this structure in applications in which read lengths or match lengths can be bounded.


Asunto(s)
Algoritmos , Bacterias , Análisis de Secuencia , Bacterias/genética
2.
Bioinformatics ; 40(Supplement_1): i39-i47, 2024 Jun 28.
Artículo en Inglés | MEDLINE | ID: mdl-38940175

RESUMEN

MOTIVATION: World Health Organization estimates that there were over 10 million cases of tuberculosis (TB) worldwide in 2019, resulting in over 1.4 million deaths, with a worrisome increasing trend yearly. The disease is caused by Mycobacterium tuberculosis (MTB) through airborne transmission. Treatment of TB is estimated to be 85% successful, however, this drops to 57% if MTB exhibits multiple antimicrobial resistance (AMR), for which fewer treatment options are available. RESULTS: We develop a robust machine-learning classifier using both linear and nonlinear models (i.e. LASSO logistic regression (LR) and random forests (RF)) to predict the phenotypic resistance of Mycobacterium tuberculosis (MTB) for a broad range of antibiotic drugs. We use data from the CRyPTIC consortium to train our classifier, which consists of whole genome sequencing and antibiotic susceptibility testing (AST) phenotypic data for 13 different antibiotics. To train our model, we assemble the sequence data into genomic contigs, identify all unique 31-mers in the set of contigs, and build a feature matrix M, where M[i, j] is equal to the number of times the ith 31-mer occurs in the jth genome. Due to the size of this feature matrix (over 350 million unique 31-mers), we build and use a sparse matrix representation. Our method, which we refer to as MTB++, leverages compact data structures and iterative methods to allow for the screening of all the 31-mers in the development of both LASSO LR and RF. MTB++ is able to achieve high discrimination (F-1 >80%) for the first-line antibiotics. Moreover, MTB++ had the highest F-1 score in all but three classes and was the most comprehensive since it had an F-1 score >75% in all but four (rare) antibiotic drugs. We use our feature selection to contextualize the 31-mers that are used for the prediction of phenotypic resistance, leading to some insights about sequence similarity to genes in MEGARes. Lastly, we give an estimate of the amount of data that is needed in order to provide accurate predictions. AVAILABILITY: The models and source code are publicly available on Github at https://github.com/M-Serajian/MTB-Pipeline.


Asunto(s)
Aprendizaje Automático , Mycobacterium tuberculosis , Mycobacterium tuberculosis/genética , Mycobacterium tuberculosis/efectos de los fármacos , Farmacorresistencia Bacteriana/genética , Pruebas de Sensibilidad Microbiana , Antibacterianos/farmacología , Secuenciación Completa del Genoma/métodos , Genoma Bacteriano , Humanos
3.
Nucleic Acids Res ; 51(D1): D744-D752, 2023 01 06.
Artículo en Inglés | MEDLINE | ID: mdl-36382407

RESUMEN

Antimicrobial resistance (AMR) is considered a critical threat to public health, and genomic/metagenomic investigations featuring high-throughput analysis of sequence data are increasingly common and important. We previously introduced MEGARes, a comprehensive AMR database with an acyclic hierarchical annotation structure that facilitates high-throughput computational analysis, as well as AMR++, a customized bioinformatic pipeline specifically designed to use MEGARes in high-throughput analysis for characterizing AMR genes (ARGs) in metagenomic sequence data. Here, we present MEGARes v3.0, a comprehensive database of published ARG sequences for antimicrobial drugs, biocides, and metals, and AMR++ v3.0, an update to our customized bioinformatic pipeline for high-throughput analysis of metagenomic data (available at MEGLab.org). Database annotations have been expanded to include information regarding specific genomic locations for single-nucleotide polymorphisms (SNPs) and insertions and/or deletions (indels) when required by specific ARGs for resistance expression, and the updated AMR++ pipeline uses this information to check for presence of resistance-conferring genetic variants in metagenomic sequenced reads. This new information encompasses 337 ARGs, whose resistance-conferring variants could not previously be confirmed in such a manner. In MEGARes 3.0, the nodes of the acyclic hierarchical ontology include 4 antimicrobial compound types, 59 resistance classes, 233 mechanisms and 1448 gene groups that classify the 8733 accessions.


Asunto(s)
Antibacterianos , Antiinfecciosos , Antibacterianos/farmacología , Farmacorresistencia Bacteriana/genética , Programas Informáticos , Secuenciación de Nucleótidos de Alto Rendimiento
4.
Genome Res ; 31(1): 1-12, 2021 01.
Artículo en Inglés | MEDLINE | ID: mdl-33328168

RESUMEN

High-throughput sequencing data sets are usually deposited in public repositories (e.g., the European Nucleotide Archive) to ensure reproducibility. As the amount of data has reached petabyte scale, repositories do not allow one to perform online sequence searches, yet, such a feature would be highly useful to investigators. Toward this goal, in the last few years several computational approaches have been introduced to index and query large collections of data sets. Here, we propose an accessible survey of these approaches, which are generally based on representing data sets as sets of k-mers. We review their properties, introduce a classification, and present their general intuition. We summarize their performance and highlight their current strengths and limitations.


Asunto(s)
Algoritmos , Programas Informáticos , Secuenciación de Nucleótidos de Alto Rendimiento , Reproducibilidad de los Resultados
5.
Brief Bioinform ; 23(2)2022 03 10.
Artículo en Inglés | MEDLINE | ID: mdl-35212354

RESUMEN

Antimicrobial resistance (AMR) is a growing threat to public health and farming at large. In clinical and veterinary practice, timely characterization of the antibiotic susceptibility profile of bacterial infections is a crucial step in optimizing treatment. High-throughput sequencing is a promising option for clinical point-of-care and ecological surveillance, opening the opportunity to develop genotyping-based AMR determination as a possibly faster alternative to phenotypic testing. In the present work, we compare the performance of state-of-the-art methods for detection of AMR using high-throughput sequencing data from clinical settings. We consider five computational approaches based on alignment (AMRPlusPlus), deep learning (DeepARG), k-mer genomic signatures (KARGA, ResFinder) or hidden Markov models (Meta-MARC). We use an extensive collection of 585 isolates with available AMR resistance profiles determined by phenotypic tests across nine antibiotic classes. We show how the prediction landscape of AMR classifiers is highly heterogeneous, with balanced accuracy varying from 0.40 to 0.92. Although some algorithms-ResFinder, KARGA and AMRPlusPlus-exhibit overall better balanced accuracy than others, the high per-AMR-class variance and related findings suggest that: (1) all algorithms might be subject to sampling bias both in data repositories used for training and experimental/clinical settings; and (2) a portion of clinical samples might contain uncharacterized AMR genes that the algorithms-mostly trained on known AMR genes-fail to generalize upon. These results lead us to formulate practical advice for software configuration and application, and give suggestions for future study designs to further develop AMR prediction tools from proof-of-concept to bedside.


Asunto(s)
Antibacterianos , Farmacorresistencia Bacteriana , Antibacterianos/farmacología , Farmacorresistencia Bacteriana/genética , Empleo , Secuenciación de Nucleótidos de Alto Rendimiento , Pruebas de Sensibilidad Microbiana
6.
Bioinformatics ; 39(9)2023 09 02.
Artículo en Inglés | MEDLINE | ID: mdl-37688560

RESUMEN

MOTIVATION: The Positional Burrows-Wheeler Transform (PBWT) is a data structure that indexes haplotype sequences in a manner that enables finding maximal haplotype matches in h sequences containing w variation sites in O(hw) time. This represents a significant improvement over classical quadratic-time approaches. However, the original PBWT data structure does not allow for queries over Biobank panels that consist of several millions of haplotypes, if an index of the haplotypes must be kept entirely in memory. RESULTS: In this article, we leverage the notion of r-index proposed for the BWT to present a memory-efficient method for constructing and storing the run-length encoded PBWT, and computing set maximal matches (SMEMs) queries in haplotype sequences. We implement our method, which we refer to as µ-PBWT, and evaluate it on datasets of 1000 Genome Project and UK Biobank data. Our experiments demonstrate that the µ-PBWT reduces the memory usage up to a factor of 20% compared to the best current PBWT-based indexing. In particular, µ-PBWT produces an index that stores high-coverage whole genome sequencing data of chromosome 20 in about a third of the space of its BCF file. µ-PBWT is an adaptation of techniques for the run-length compressed BWT for the PBWT (RLPBWT) and it is based on keeping in memory only a succinct representation of the RLPBWT that still allows the efficient computation of set maximal matches (SMEMs) over the original panel. AVAILABILITY AND IMPLEMENTATION: Our implementation is open source and available at https://github.com/dlcgold/muPBWT. The binary is available at https://bioconda.github.io/recipes/mupbwt/README.html.


Asunto(s)
Bancos de Muestras Biológicas , Haplotipos , Secuenciación Completa del Genoma , Reino Unido
7.
Bioinformatics ; 38(Suppl 1): i177-i184, 2022 06 24.
Artículo en Inglés | MEDLINE | ID: mdl-35758776

RESUMEN

MOTIVATION: Bait enrichment is a protocol that is becoming increasingly ubiquitous as it has been shown to successfully amplify regions of interest in metagenomic samples. In this method, a set of synthetic probes ('baits') are designed, manufactured and applied to fragmented metagenomic DNA. The probes bind to the fragmented DNA and any unbound DNA is rinsed away, leaving the bound fragments to be amplified for sequencing. Metsky et al. demonstrated that bait-enrichment is capable of detecting a large number of human viral pathogens within metagenomic samples. RESULTS: We formalize the problem of designing baits by defining the Minimum Bait Cover problem, show that the problem is NP-hard even under very restrictive assumptions, and design an efficient heuristic that takes advantage of succinct data structures. We refer to our method as Syotti. The running time of Syotti shows linear scaling in practice, running at least an order of magnitude faster than state-of-the-art methods, including the method of Metsky et al. At the same time, our method produces bait sets that are smaller than the ones produced by the competing methods, while also leaving fewer positions uncovered. Lastly, we show that Syotti requires only 25 min to design baits for a dataset comprised of 3 billion nucleotides from 1000 related bacterial substrains, whereas the method of Metsky et al. shows clearly super-linear running time and fails to process even a subset of 17% of the data in 72 h. AVAILABILITY AND IMPLEMENTATION: https://github.com/jnalanko/syotti. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Asunto(s)
Algoritmos , Programas Informáticos , ADN , Humanos , Metagenómica/métodos , Análisis de Secuencia de ADN/métodos
8.
Bioinformatics ; 37(14): 1946-1952, 2021 08 04.
Artículo en Inglés | MEDLINE | ID: mdl-32462192

RESUMEN

MOTIVATION: The de Bruijn graph is one of the fundamental data structures for analysis of high throughput sequencing data. In order to be applicable to population-scale studies, it is essential to build and store the graph in a space- and time-efficient manner. In addition, due to the ever-changing nature of population studies, it has become essential to update the graph after construction, e.g. add and remove nodes and edges. Although there has been substantial effort on making the construction and storage of the graph efficient, there is a limited amount of work in building the graph in an efficient and mutable manner. Hence, most space efficient data structures require complete reconstruction of the graph in order to add or remove edges or nodes. RESULTS: In this article, we present DynamicBOSS, a succinct representation of the de Bruijn graph that allows for an unlimited number of additions and deletions of nodes and edges. We compare our method with other competing methods and demonstrate that DynamicBOSS is the only method that supports both addition and deletion and is applicable to very large samples (e.g. greater than 15 billion k-mers). Competing dynamic methods, e.g. FDBG cannot be constructed on large scale datasets, or cannot support both addition and deletion, e.g. BiFrost. AVAILABILITY AND IMPLEMENTATION: DynamicBOSS is publicly available at https://github.com/baharpan/dynboss. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Asunto(s)
Algoritmos , Programas Informáticos , Secuenciación de Nucleótidos de Alto Rendimiento , Proyectos de Investigación , Análisis de Secuencia de ADN
9.
Bioinformatics ; 36(22-23): 5275-5281, 2021 04 01.
Artículo en Inglés | MEDLINE | ID: mdl-32049324

RESUMEN

MOTIVATION: Metagenomics refers to the study of complex samples containing of genetic contents of multiple individual organisms and, thus, has been used to elucidate the microbiome and resistome of a complex sample. The microbiome refers to all microbial organisms in a sample, and the resistome refers to all of the antimicrobial resistance (AMR) genes in pathogenic and non-pathogenic bacteria. Single-nucleotide polymorphisms (SNPs) can be effectively used to 'fingerprint' specific organisms and genes within the microbiome and resistome and trace their movement across various samples. However, to effectively use these SNPs for this traceability, a scalable and accurate metagenomics SNP caller is needed. Moreover, such an SNP caller should not be reliant on reference genomes since 95% of microbial species is unculturable, making the determination of a reference genome extremely challenging. In this article, we address this need. RESULTS: We present LueVari, a reference-free SNP caller based on the read-colored de Bruijn graph, an extension of the traditional de Bruijn graph that allows repeated regions longer than the k-mer length and shorter than the read length to be identified unambiguously. LueVari is able to identify SNPs in both AMR genes and chromosomal DNA from shotgun metagenomics data with reliable sensitivity (between 91% and 99%) and precision (between 71% and 99%) as the performance of competing methods varies widely. Furthermore, we show that LueVari constructs sequences containing the variation, which span up to 97.8% of genes in datasets, which can be helpful in detecting distinct AMR genes in large metagenomic datasets. AVAILABILITY AND IMPLEMENTATION: Code and datasets are publicly available at https://github.com/baharpan/cosmo/tree/LueVari. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Asunto(s)
Metagenoma , Programas Informáticos , Algoritmos , Metagenómica , Polimorfismo de Nucleótido Simple , Análisis de Secuencia de ADN
10.
Nat Comput ; 21(1): 81-108, 2022 Mar.
Artículo en Inglés | MEDLINE | ID: mdl-36969737

RESUMEN

Computational pangenomics is an emerging research field that is changing the way computer scientists are facing challenges in biological sequence analysis. In past decades, contributions from combinatorics, stringology, graph theory and data structures were essential in the development of a plethora of software tools for the analysis of the human genome. These tools allowed computational biologists to approach ambitious projects at population scale, such as the 1000 Genomes Project. A major contribution of the 1000 Genomes Project is the characterization of a broad spectrum of genetic variations in the human genome, including the discovery of novel variations in the South Asian, African and European populations-thus enhancing the catalogue of variability within the reference genome. Currently, the need to take into account the high variability in population genomes as well as the specificity of an individual genome in a personalized approach to medicine is rapidly pushing the abandonment of the traditional paradigm of using a single reference genome. A graph-based representation of multiple genomes, or a graph pangenome, is replacing the linear reference genome. This means completely rethinking well-established procedures to analyze, store, and access information from genome representations. Properly addressing these challenges is crucial to face the computational tasks of ambitious healthcare projects aiming to characterize human diversity by sequencing 1M individuals (Stark et al. 2019). This tutorial aims to introduce readers to the most recent advances in the theory of data structures for the representation of graph pangenomes. We discuss efficient representations of haplotypes and the variability of genotypes in graph pangenomes, and highlight applications in solving computational problems in human and microbial (viral) pangenomes.

11.
Nucleic Acids Res ; 48(D1): D561-D569, 2020 01 08.
Artículo en Inglés | MEDLINE | ID: mdl-31722416

RESUMEN

Antimicrobial resistance (AMR) is a threat to global public health and the identification of genetic determinants of AMR is a critical component to epidemiological investigations. High-throughput sequencing (HTS) provides opportunities for investigation of AMR across all microbial genomes in a sample (i.e. the metagenome). Previously, we presented MEGARes, a hand-curated AMR database and annotation structure developed to facilitate the analysis of AMR within metagenomic samples (i.e. the resistome). Along with MEGARes, we released AmrPlusPlus, a bioinformatics pipeline that interfaces with MEGARes to identify and quantify AMR gene accessions contained within a metagenomic sequence dataset. Here, we present MEGARes 2.0 (https://megares.meglab.org), which incorporates previously published resistance sequences for antimicrobial drugs, while also expanding to include published sequences for metal and biocide resistance determinants. In MEGARes 2.0, the nodes of the acyclic hierarchical ontology include four antimicrobial compound types, 57 classes, 220 mechanisms of resistance, and 1,345 gene groups that classify the 7,868 accessions. In addition, we present an updated version of AmrPlusPlus (AMR ++ version 2.0), which improves accuracy of classifications, as well as expanding scalability and usability.


Asunto(s)
Antiinfecciosos/farmacología , Bases de Datos Genéticas , Bases de Datos Farmacéuticas , Farmacorresistencia Microbiana , Genes Bacterianos , Metagenómica/métodos , Programas Informáticos , Antiinfecciosos/química , Bacterias/efectos de los fármacos , Bacterias/genética , Desinfectantes/química , Desinfectantes/farmacología , Metagenoma , Metales/química , Metales/farmacología
12.
BMC Bioinformatics ; 22(1): 445, 2021 Sep 18.
Artículo en Inglés | MEDLINE | ID: mdl-34537012

RESUMEN

BACKGROUND: Identification of motifs and quantification of their occurrences are important for the study of genetic diseases, gene evolution, transcription sites, and other biological mechanisms. Exact formulae for estimating count distributions of motifs under Markovian assumptions have high computational complexity and are impractical to be used on large motif sets. Approximated formulae, e.g. based on compound Poisson, are faster, but reliable p value calculation remains challenging. Here, we introduce 'motif_prob', a fast implementation of an exact formula for motif count distribution through progressive approximation with arbitrary precision. Our implementation speeds up the exact calculation, usually impractical, making it feasible and posit to substitute currently employed heuristics. RESULTS: We implement motif_prob in both Perl and C+ + languages, using an efficient error-bound iterative process for the exact formula, providing comparison with state-of-the-art tools (e.g. MoSDi) in terms of precision, run time benchmarks, along with a real-world use case on bacterial motif characterization. Our software is able to process a million of motifs (13-31 bases) over genome lengths of 5 million bases within the minute on a regular laptop, and the run times for both the Perl and C+ + code are several orders of magnitude smaller (50-1000× faster) than MoSDi, even when using their fast compound Poisson approximation (60-120× faster). In the real-world use cases, we first show the consistency of motif_prob with MoSDi, and then how the p-value quantification is crucial for enrichment quantification when bacteria have different GC content, using motifs found in antimicrobial resistance genes. The software and the code sources are available under the MIT license at https://github.com/DataIntellSystLab/motif_prob . CONCLUSIONS: The motif_prob software is a multi-platform and efficient open source solution for calculating exact frequency distributions of motifs. It can be integrated with motif discovery/characterization tools for quantifying enrichment and deviation from expected frequency ranges with exact p values, without loss in data processing efficiency.


Asunto(s)
Algoritmos , Programas Informáticos
13.
Bioinformatics ; 36(16): 4399-4405, 2020 08 15.
Artículo en Inglés | MEDLINE | ID: mdl-32277811

RESUMEN

MOTIVATION: Oxford Nanopore technologies (ONT) add miniaturization and real time to high-throughput sequencing. All available software for ONT data analytics run on cloud/clusters or personal computers. Instead, a linchpin to true portability is software that works on mobile devices of internet connections. Smartphones' and tablets' chipset/memory/operating systems differ from desktop computers, but software can be recompiled. We sought to understand how portable current ONT analysis methods are. RESULTS: Several tools, from base-calling to genome assembly, were ported and benchmarked on an Android smartphone. Out of 23 programs, 11 succeeded. Recompilation failures included lack of standard headers and unsupported instruction sets. Only DSK, BCALM2 and Kraken were able to process files up to 16 GB, with linearly scaling CPU-times. However, peak CPU temperatures were high. In conclusion, the portability scenario is not favorable. Given the fast market growth, attention of developers to ARM chipsets and Android/iOS is warranted, as well as initiatives to implement mobile-specific libraries. AVAILABILITY AND IMPLEMENTATION: The source code is freely available at: https://github.com/marco-oliva/portable-nanopore-analytics.


Asunto(s)
Nanoporos , Benchmarking , Secuenciación de Nucleótidos de Alto Rendimiento , Análisis de Secuencia de ADN , Programas Informáticos
14.
Bioinformatics ; 36(3): 682-689, 2020 02 01.
Artículo en Inglés | MEDLINE | ID: mdl-31504206

RESUMEN

MOTIVATION: Optical mapping data is used in many core genomics applications, including structural variation detection, scaffolding assembled contigs and mis-assembly detection. However, the pervasiveness of spurious and deleted cut sites in the raw data, which are called Rmaps, make assembly and alignment of them challenging. Although there exists another method to error correct Rmap data, named cOMet, it is unable to scale to even moderately large sized genomes. The challenge faced in error correction is in determining pairs of Rmaps that originate from the same region of the same genome. RESULTS: We create an efficient method for determining pairs of Rmaps that contain significant overlaps between them. Our method relies on the novel and nontrivial adaption and application of spaced seeds in the context of optical mapping, which allows for spurious and deleted cut sites to be accounted for. We apply our method to detecting and correcting these errors. The resulting error correction method, referred to as Elmeri, improves upon the results of state-of-the-art correction methods but in a fraction of the time. More specifically, cOMet required 9.9 CPU days to error correct Rmap data generated from the human genome, whereas Elmeri required less than 15 CPU hours and improved the quality of the Rmaps by more than four times compared to cOMet. AVAILABILITY AND IMPLEMENTATION: Elmeri is publicly available under GNU Affero General Public License at https://github.com/LeenaSalmela/Elmeri. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Asunto(s)
Genómica , Programas Informáticos , Algoritmos , Genoma Humano , Humanos , Mapeo Restrictivo , Análisis de Secuencia de ADN
15.
Bioinformatics ; 35(14): i51-i60, 2019 07 15.
Artículo en Inglés | MEDLINE | ID: mdl-31510647

RESUMEN

MOTIVATION: There exist several large genomic and metagenomic data collection efforts, including GenomeTrakr and MetaSub, which are routinely updated with new data. To analyze such datasets, memory-efficient methods to construct and store the colored de Bruijn graph were developed. Yet, a problem that has not been considered is constructing the colored de Bruijn graph in a scalable manner that allows new data to be added without reconstruction. This problem is important for large public datasets as scalability is needed but also the ability to update the construction is also needed. RESULTS: We create a method for constructing the colored de Bruijn graph for large datasets that is based on partitioning the data into smaller datasets, building the colored de Bruijn graph using a FM-index based representation, and succinctly merging these representations to build a single graph. The last step, merging succinctly, is the algorithmic challenge which we solve in this article. We refer to the resulting method as VariMerge. This construction method also allows the graph to be updated with new data. We validate our approach and show it produces a three-fold reduction in working space when constructing a colored de Bruijn graph for 8000 strains. Lastly, we compare VariMerge to other competing methods-including Vari, Rainbowfish, Mantis, Bloom Filter Trie, the method of Almodaresi et al. and Multi-BRWT-and illustrate that VariMerge is the only method that is capable of building the colored de Bruijn graph for 16 000 strains in a manner that allows it to be updated. Competing methods either did not scale to this large of a dataset or do not allow for additions without reconstruction. AVAILABILITY AND IMPLEMENTATION: VariMerge is available at https://github.com/cosmo-team/cosmo/tree/VARI-merge under GPLv3 license. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Asunto(s)
Algoritmos , Análisis de Secuencia de ADN , Programas Informáticos , Genómica , Metagenoma
16.
Bioinformatics ; 35(18): 3250-3256, 2019 09 15.
Artículo en Inglés | MEDLINE | ID: mdl-30698651

RESUMEN

MOTIVATION: Optical maps are high-resolution restriction maps (Rmaps) that give a unique numeric representation to a genome. Used in concert with sequence reads, they provide a useful tool for genome assembly and for discovering structural variations and rearrangements. Although they have been a regular feature of modern genome assembly projects, optical maps have been mainly used in post-processing step and not in the genome assembly process itself. Several methods have been proposed for pairwise alignment of single molecule optical maps-called Rmaps, or for aligning optical maps to assembled reads. However, the problem of aligning an Rmap to a graph representing the sequence data of the same genome has not been studied before. Such an alignment provides a mapping between two sets of data: optical maps and sequence data which will facilitate the usage of optical maps in the sequence assembly step itself. RESULTS: We define the problem of aligning an Rmap to a de Bruijn graph and present the first algorithm for solving this problem which is based on a seed-and-extend approach. We demonstrate that our method is capable of aligning 73% of Rmaps generated from the Escherichia coli genome to the de Bruijn graph constructed from short reads generated from the same genome. We validate the alignments and show that our method achieves an accuracy of 99.6%. We also show that our method scales to larger genomes. In particular, we show that 76% of Rmaps can be aligned to the de Bruijn graph in the case of human data. AVAILABILITY AND IMPLEMENTATION: The software for aligning optical maps to de Bruijn graph, omGraph is written in C++ and is publicly available under GNU General Public License at https://github.com/kingufl/omGraph. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Asunto(s)
Algoritmos , Programas Informáticos , Genoma , Mapeo Restrictivo , Análisis de Secuencia de ADN
17.
BMC Genomics ; 20(Suppl 6): 434, 2019 Jun 13.
Artículo en Inglés | MEDLINE | ID: mdl-31189471

RESUMEN

BACKGROUND: Biological networks describes the mechanisms which govern cellular functions. Temporal networks show how these networks evolve over time. Studying the temporal progression of network topologies is of utmost importance since it uncovers how a network evolves and how it resists to external stimuli and internal variations. Two temporal networks have co-evolving subnetworks if the evolving topologies of these subnetworks remain similar to each other as the network topology evolves over a period of time. In this paper, we consider the problem of identifying co-evolving subnetworks given a pair of temporal networks, which aim to capture the evolution of molecules and their interactions over time. Although this problem shares some characteristics of the well-known network alignment problems, it differs from existing network alignment formulations as it seeks a mapping of the two network topologies that is invariant to temporal evolution of the given networks. This is a computationally challenging problem as it requires capturing not only similar topologies between two networks but also their similar evolution patterns. RESULTS: We present an efficient algorithm, Tempo, for solving identifying co-evolving subnetworks with two given temporal networks. We formally prove the correctness of our method. We experimentally demonstrate that Tempo scales efficiently with the size of network as well as the number of time points, and generates statistically significant alignments-even when evolution rates of given networks are high. Our results on a human aging dataset demonstrate that Tempo identifies novel genes contributing to the progression of Alzheimer's, Huntington's and Type II diabetes, while existing methods fail to do so. CONCLUSIONS: Studying temporal networks in general and human aging specifically using Tempo enables us to identify age related genes from non age related genes successfully. More importantly, Tempo takes the network alignment problem one huge step forward by moving beyond the classical static network models.


Asunto(s)
Algoritmos , Evolución Molecular , Redes Reguladoras de Genes , Redes y Vías Metabólicas , Adulto , Anciano , Anciano de 80 o más Años , Envejecimiento , Enfermedad de Alzheimer/genética , Enfermedad de Alzheimer/metabolismo , Encéfalo/metabolismo , Biología Computacional/métodos , Diabetes Mellitus Tipo 2/genética , Diabetes Mellitus Tipo 2/metabolismo , Humanos , Enfermedad de Huntington/genética , Enfermedad de Huntington/metabolismo , Persona de Mediana Edad , Mapeo de Interacción de Proteínas , Adulto Joven
18.
Bioinformatics ; 34(24): 4189-4195, 2018 12 15.
Artículo en Inglés | MEDLINE | ID: mdl-29939217

RESUMEN

Motivation: The de Bruijn graph is fundamental to the analysis of next generation sequencing data and so, as datasets of DNA reads grow rapidly, it becomes more important to represent de Bruijn graphs compactly while still supporting fast assembly. Previous implementations of compact de Bruijn graphs have not supported node or edge deletion, however, which is important for pruning spurious elements from the graph. Results: Belazzougui et al. (2016b) recently proposed a compact and fully dynamic representation, which supports exact membership queries and insertions and deletions of both nodes and edges. In this paper, we give a practical implementation of their data structure, supporting exact membership queries and fully dynamic edge operations, as well as limited support for dynamic node operations. We demonstrate experimentally that its performance is comparable to that of state-of-the-art implementations based on Bloom filters. Availability and implementation: Our source-code is publicly available at https://github.com/csirac/dynamicDBG under an open-source license.


Asunto(s)
Algoritmos , Secuenciación de Nucleótidos de Alto Rendimiento/métodos , Análisis de Secuencia de ADN/métodos , Programas Informáticos , Biología Computacional , Eliminación de Secuencia
19.
Appl Environ Microbiol ; 85(13)2019 07 01.
Artículo en Inglés | MEDLINE | ID: mdl-31053578

RESUMEN

The effectiveness of antibiotics has been challenged by the increasing frequency of antimicrobial resistance (AMR), which has emerged as a major threat to global health. Despite its negative impact on the development of AMR, there are few effective strategies for reducing AMR in food-producing animals. Using whole-genome sequencing and comparative genomics of 36 multidrug-resistant (MDR) Escherichia coli strains isolated from beef cattle with no previous exposure to antibiotics, we obtained results suggesting that the occurrence of MDR E. coli also arises in animals with no antibiotic selective pressure. Extended-spectrum-ß-lactamase-producing E. coli strains with enhanced virulence capacities for toxin production and adherence have evolved, which implies important ramifications for animal and human health. Gene exchanges by conjugative plasmids and insertion elements have driven widespread antibiotic resistance in clinically relevant pathogens. Phylogenetic relatedness of E. coli strains from various geographic locations and hosts, such as animals, environmental sources, and humans, suggests that transmission of MDR E. coli strains occurs intercontinentally without host barriers.IMPORTANCE Multidrug-resistant (MDR) Escherichia coli isolates pose global threats to public health due to the decreasing availability of treatment options. To better understand the characteristics of MDR E. coli isolated from food-producing animals with no antibiotic exposure, we employed genomic comparison, high-resolution phylogenetics, and functional characterization. Our findings highlight the potential capacity of MDR E. coli to cause severe disease and suggest that these strains are widespread intercontinentally. This study underlines the occurrence of MDR E. coli in food-producing animals raised without antibiotic use, which has alarming, critical ramifications within animal and human medical practice.


Asunto(s)
Farmacorresistencia Bacteriana Múltiple/genética , Escherichia coli/genética , beta-Lactamasas/genética , Animales , Bovinos , Enfermedades de los Bovinos/microbiología , Escherichia coli/efectos de los fármacos , Infecciones por Escherichia coli/microbiología , Infecciones por Escherichia coli/veterinaria , Genómica , Filogenia , Secuenciación Completa del Genoma/veterinaria , beta-Lactamasas/metabolismo
20.
Nucleic Acids Res ; 45(D1): D574-D580, 2017 01 04.
Artículo en Inglés | MEDLINE | ID: mdl-27899569

RESUMEN

Antimicrobial resistance has become an imminent concern for public health. As methods for detection and characterization of antimicrobial resistance move from targeted culture and polymerase chain reaction to high throughput metagenomics, appropriate resources for the analysis of large-scale data are required. Currently, antimicrobial resistance databases are tailored to smaller-scale, functional profiling of genes using highly descriptive annotations. Such characteristics do not facilitate the analysis of large-scale, ecological sequence datasets such as those produced with the use of metagenomics for surveillance. In order to overcome these limitations, we present MEGARes (https://megares.meglab.org), a hand-curated antimicrobial resistance database and annotation structure that provides a foundation for the development of high throughput acyclical classifiers and hierarchical statistical analysis of big data. MEGARes can be browsed as a stand-alone resource through the website or can be easily integrated into sequence analysis pipelines through download. Also via the website, we provide documentation for AmrPlusPlus, a user-friendly Galaxy pipeline for the analysis of high throughput sequencing data that is pre-packaged for use with the MEGARes database.


Asunto(s)
Bases de Datos Genéticas , Farmacorresistencia Microbiana , Secuenciación de Nucleótidos de Alto Rendimiento , Biología Computacional/métodos , Metagenoma , Metagenómica/métodos , Navegador Web
SELECCIÓN DE REFERENCIAS
DETALLE DE LA BÚSQUEDA