Results 1 - 14 of 14
1.
BMC Bioinformatics ; 22(Suppl 10): 378, 2021 Jul 22.
Article in English | MEDLINE | ID: mdl-34294039

ABSTRACT

BACKGROUND: Due to the complexity of microbial communities, de novo assembly of next-generation sequencing data is commonly unable to produce complete microbial genomes. Metagenome assembly binning has therefore become an essential step that groups the fragmented contigs into clusters representing microbial genomes, based on the contigs' nucleotide compositions and read depths. These features work well for long contigs but are not stable for short ones. Contigs can be linked by sequence overlap (assembly graph) or by the paired-end reads aligned to them (PE graph), and linked contigs have a high chance of being derived from the same cluster. RESULTS: We developed METAMVGL, a multi-view graph-based metagenomic contig binning algorithm that integrates both the assembly and PE graphs. It can rescue short contigs and correct binning errors from dead ends. METAMVGL learns the two graphs' weights automatically and predicts the contig labels in a uniform multi-view label propagation framework. In experiments, we observed that METAMVGL made use of significantly more high-confidence edges from the combined graph and linked dead ends to the main graph. It also outperformed many state-of-the-art contig binning algorithms, including MaxBin2, MetaBAT2, MyCC, CONCOCT, SolidBin and GraphBin, on metagenomic sequencing data from simulation, two mock communities and Sharon infant fecal samples. CONCLUSIONS: Our findings demonstrate that METAMVGL markedly improves short contig binning and outperforms the other existing contig binning tools on metagenomic sequencing data from simulation, mock communities and infant fecal samples.


Subject(s)
Metagenome , Microbiota , Algorithms , High-Throughput Nucleotide Sequencing , Humans , Metagenome/genetics , Metagenomics , Microbiota/genetics , Sequence Analysis, DNA , Software
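
The idea in the METAMVGL abstract above, merging two contig graphs and spreading bin labels along their edges, can be illustrated with a minimal Python sketch. This is not the METAMVGL implementation: the function names are hypothetical, the two graph weights are fixed by hand (the abstract says METAMVGL learns them automatically), and a simple weighted neighbour vote stands in for the multi-view label propagation framework.

```python
from collections import Counter, defaultdict


def combine_graphs(assembly_edges, pe_edges, w_assembly=0.5, w_pe=0.5):
    """Merge two undirected edge lists into one weighted edge list.

    The weights are hand-set assumptions; they are not learned here.
    """
    combined = defaultdict(float)
    for a, b in assembly_edges:
        combined[tuple(sorted((a, b)))] += w_assembly
    for a, b in pe_edges:
        combined[tuple(sorted((a, b)))] += w_pe
    return [(a, b, w) for (a, b), w in combined.items()]


def propagate_labels(edges, seed_labels, max_rounds=10):
    """Give each unlabelled contig the label with the largest edge-weight vote."""
    neighbours = defaultdict(list)
    for a, b, w in edges:
        neighbours[a].append((b, w))
        neighbours[b].append((a, w))

    labels = dict(seed_labels)
    for _ in range(max_rounds):
        updates = {}
        for node in neighbours:
            if node in labels:
                continue  # labels, once assigned, stay fixed in this sketch
            votes = Counter()
            for other, w in neighbours[node]:
                if other in labels:
                    votes[labels[other]] += w
            if votes:
                # iterate labels in sorted order so ties break deterministically
                updates[node] = max(sorted(votes), key=lambda lab: votes[lab])
        if not updates:
            break
        labels.update(updates)
    return labels


# Toy example: c5 is reachable only through a paired-end edge.
edges = combine_graphs(
    assembly_edges=[("c1", "c2"), ("c3", "c4")],
    pe_edges=[("c2", "c5")],
)
bins = propagate_labels(edges, {"c1": "binA", "c3": "binB"})
```

In the toy example, contig `c5` has no assembly-graph edge at all and receives a label only because the paired-end edge connects it to `c2`, which is the kind of dead-end rescue the abstract describes.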
2.
Mol Cell Proteomics ; 18(8 suppl 1): S183-S192, 2019 08 09.
Article in English | MEDLINE | ID: mdl-31142575

ABSTRACT

Matching metagenomic and/or metatranscriptomic data, currently often under-used, can be a useful reference for metaproteomic tandem mass spectra (MS/MS) data analysis. Here we developed a software pipeline for identification of peptides and proteins from metaproteomic MS/MS data using proteins derived from matching metagenomic (and metatranscriptomic) data as the search database, based on two novel approaches, Graph2Pro (published) and Var2Pep (new). Graph2Pro retains and uses uncertainties of metagenome assembly for reference-based MS/MS data analysis. Var2Pep considers the variations found in metagenomic/metatranscriptomic sequencing reads that are not retained in the assemblies (contigs). The new software pipeline provides a one-stop application of both tools, and it supports the use of metagenome assemblies from commonly used assemblers including MegaHit and metaSPAdes. When tested on two collections of multi-omic microbiome data sets, our pipeline significantly improved the identification rate of the metaproteomic MS/MS spectra, by about twofold compared to conventional contig- or read-based approaches (Var2Pep alone identified 5.6% to 24.1% more unique peptides, depending on the data set). We also showed that identified variant peptides are important for functional profiling of microbiomes. All results suggest that it is important to take assembly uncertainties and genomic variants into consideration to facilitate metaproteomic MS/MS data interpretation.


Subject(s)
Algorithms , Microbiota/genetics , Proteogenomics/methods , Seawater/microbiology , Wastewater/microbiology , Databases, Protein , Genetic Variation , Peptides/genetics , Tandem Mass Spectrometry
3.
BMC Bioinformatics ; 21(Suppl 12): 306, 2020 Jul 24.
Article in English | MEDLINE | ID: mdl-32703258

ABSTRACT

BACKGROUND: Graph-based representation of genome assemblies has recently been used in different contexts - from improved reconstruction of plasmid sequences and refined analysis of metagenomic data to read error correction and reference-free haplotype reconstruction. While many of these applications rely heavily on the alignment of long nucleotide sequences to assembly graphs, the first general-purpose software tools for finding such alignments have been released only recently, and their deficiencies and limitations are yet to be discovered. Moreover, existing tools cannot align amino acid sequences, which could prove useful in various contexts - in particular the analysis of metagenomic sequencing data. RESULTS: In this work we present SPAligner (Saint-Petersburg Aligner), a novel tool for aligning long diverged nucleotide and amino acid sequences to assembly graphs. We demonstrate that SPAligner is an efficient solution for mapping third-generation sequencing reads onto assembly graphs of various complexity and also show how it can facilitate the identification of known genes in complex metagenomic datasets. CONCLUSIONS: Our work will help accelerate the development of graph-based approaches to the problem of aligning sequences to genome assemblies. SPAligner is implemented as part of the SPAdes tools library and is available on GitHub.


Subject(s)
Algorithms , Genetic Variation , Sequence Alignment , Base Sequence , Haplotypes/genetics , Humans , Software , Statistics as Topic , beta-Lactamases/chemistry
4.
J Comput Biol ; 31(5): 381-395, 2024 05.
Article in English | MEDLINE | ID: mdl-38687333

ABSTRACT

Minimizers and convolutional neural networks (CNNs) are two quite distinct popular techniques that have both been employed to analyze categorical biological sequences. At face value, the methods seem entirely dissimilar. Minimizers use min-wise hashing on a rolling window to extract a single important k-mer feature per window. CNNs start with a wide array of randomly initialized convolutional filters, paired with a pooling operation, and then multiple additional neural layers to learn both the filters themselves and how they can be used to classify the sequence. In this study, our main result is a careful mathematical analysis of hash function properties showing that for sequences over a categorical alphabet, random Gaussian initialization of convolutional filters with max-pooling is equivalent to choosing a minimizer ordering such that selected k-mers are (in Hamming distance) far from the k-mers within the sequence but close to other minimizers. In empirical experiments, we find that this property manifests as decreased density in repetitive regions, both in simulation and on real human telomeres. We additionally train from scratch a CNN embedding of synthetic short-reads from the SARS-CoV-2 genome into 3D Euclidean space that locally recapitulates the linear sequence distance of the read origins, a modest step toward building a deep learning assembler, although it is at present too slow to be practical. In total, this article provides a partial explanation for the effectiveness of CNNs in categorical sequence analysis.


Subject(s)
COVID-19 , Neural Networks, Computer , SARS-CoV-2 , Humans , COVID-19/virology , SARS-CoV-2/genetics , Algorithms , Telomere/genetics , Computational Biology/methods , Genomics/methods , Deep Learning , Genome, Human
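
The minimizer scheme summarized in the abstract above (one k-mer chosen per rolling window by a hash ordering) can be sketched in a few lines of Python. This is a generic illustration, not the paper's code: the function names, the use of BLAKE2b as the ordering, and the leftmost tie-breaking rule are all assumptions made for the example.

```python
import hashlib


def kmer_hash(kmer):
    """Map a k-mer to a 64-bit integer; BLAKE2b is an arbitrary choice of ordering."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minimizers(seq, k, w, order=kmer_hash):
    """Return sorted (position, k-mer) pairs selected as minimizers.

    Every window of w consecutive k-mers contributes its smallest k-mer
    under `order`; ties go to the leftmost position.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        best = min(window, key=lambda i: (order(kmers[i]), i))
        selected.add((best, kmers[best]))
    return sorted(selected)


# With a lexicographic ordering the result is easy to check by hand.
picked = minimizers("ACGTACGT", k=3, w=3, order=lambda kmer: kmer)
# picked == [(0, 'ACG'), (1, 'CGT'), (4, 'ACG')]

# Density: the fraction of k-mer positions that end up selected.
density = len(picked) / (len("ACGTACGT") - 3 + 1)
```

Swapping the `order` argument changes which k-mers are selected and therefore the density, which is the quantity the abstract reports as decreasing in repetitive regions under the ordering it analyzes.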
5.
Microb Genom ; 10(2)2024 Feb.
Article in English | MEDLINE | ID: mdl-38376388

ABSTRACT

Accurate reconstruction of Escherichia coli antibiotic resistance gene (ARG) plasmids from Illumina sequencing data has proven to be a challenge with current bioinformatic tools. In this work, we present an improved method to reconstruct E. coli plasmids using short reads. We developed plasmidEC, an ensemble classifier that identifies plasmid-derived contigs by combining the output of three different binary classification tools. We showed that plasmidEC is especially suited to classify contigs derived from ARG plasmids with a high recall of 0.941. Additionally, we optimized gplas, a graph-based tool that bins plasmid-predicted contigs into distinct plasmid predictions. Gplas2 is more effective at recovering plasmids with large sequencing coverage variations and can be combined with the output of any binary classifier. The combination of plasmidEC with gplas2 showed a high completeness (median=0.818) and F1-Score (median=0.812) when reconstructing ARG plasmids and exceeded the binning capacity of the reference-based method MOB-suite. In the absence of long-read data, our method offers an excellent alternative to reconstruct ARG plasmids in E. coli.


Subject(s)
Escherichia coli , High-Throughput Nucleotide Sequencing , Escherichia coli/genetics , Anti-Bacterial Agents/pharmacology , Drug Resistance, Microbial , Plasmids/genetics
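
The abstract above describes plasmidEC as combining the output of three binary classification tools, and reports recall as its headline metric. The sketch below shows one plausible way to combine such outputs, a majority vote, together with a recall calculation. The voting rule, the tool names `tool_a` to `tool_c`, and the function names are assumptions for illustration; the abstract does not state how the outputs are combined.

```python
from collections import Counter


def ensemble_classify(predictions, min_votes=2):
    """Label a contig 'plasmid' when at least `min_votes` tools say so.

    predictions: dict of tool name -> dict of contig -> 'plasmid' | 'chromosome'
    """
    contigs = set().union(*(calls.keys() for calls in predictions.values()))
    result = {}
    for contig in sorted(contigs):
        votes = Counter(
            calls[contig] for calls in predictions.values() if contig in calls
        )
        result[contig] = "plasmid" if votes["plasmid"] >= min_votes else "chromosome"
    return result


def recall(predicted, truth, positive="plasmid"):
    """Fraction of truly positive contigs that were predicted positive."""
    positives = [c for c, label in truth.items() if label == positive]
    if not positives:
        return 0.0
    hits = sum(1 for c in positives if predicted.get(c) == positive)
    return hits / len(positives)


# Hypothetical outputs of three tools on two contigs.
calls = {
    "tool_a": {"c1": "plasmid", "c2": "chromosome"},
    "tool_b": {"c1": "plasmid", "c2": "plasmid"},
    "tool_c": {"c1": "chromosome", "c2": "chromosome"},
}
consensus = ensemble_classify(calls)
# consensus == {'c1': 'plasmid', 'c2': 'chromosome'}
```

A vote of this kind needs only a binary label from each tool, so any classifier that emits plasmid/chromosome calls can be slotted in.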
6.
bioRxiv ; 2024 Jun 20.
Article in English | MEDLINE | ID: mdl-38529499

ABSTRACT

Haplotype information is crucial for biomedical and population genetics research. However, current strategies to produce de-novo haplotype-resolved assemblies often require either difficult-to-acquire parental data or an intermediate haplotype-collapsed assembly. Here, we present Graphasing, a workflow which synthesizes the global phase signal of Strand-seq with assembly graph topology to produce chromosome-scale de-novo haplotypes for diploid genomes. Graphasing readily integrates with any assembly workflow that both outputs an assembly graph and has a haplotype assembly mode. Graphasing performs comparably to trio-phasing in contiguity, phasing accuracy, and assembly quality, outperforms Hi-C in phasing accuracy, and generates human assemblies with over 18 chromosome-spanning haplotypes.

7.
Gene ; 849: 146904, 2023 Jan 15.
Article in English | MEDLINE | ID: mdl-36150535

ABSTRACT

Unlike chloroplast genomes (ptDNA), plant mitochondrial genomes (mtDNA) are much more plastic in structure and size but maintain a conserved and essential gene set related to oxidative phosphorylation. Moreover, plant mitochondrial genes and mtDNA are good markers for phylogenetic, evolutionary, and comparative analyses. The two best-known species in the genus Theobroma L. (Malvaceae s.l.) are T. cacao and T. grandiflorum. Besides their economic value, both species also show considerable biotechnological potential through their other derived products, thus adding further economic value for the agroindustry. Here, we assembled and compared the mtDNA of Theobroma cacao and T. grandiflorum to generate a new genomics resource and unravel evolutionary trends. Graph-based analyses revealed that both mtDNAs exhibit multiple alternative arrangements, confirming the dynamism commonly observed in plant mtDNA. The disentangled assembly graph revealed potential predominant circular molecules. The master circle molecules span 543,794 bp for T. cacao and 501,598 bp for T. grandiflorum, with 98.9% average sequence identity. Both mtDNAs contain the same set of 39 plant mitochondrial genes, commonly found in other rosid mitogenomes. The main features are a duplicated copy of atp4, the absence of rpl6, rps2, rps8, and rps11, and the presence of two chimeric open reading frames. Moreover, we detected few ptDNA integrations, mainly represented by tRNAs, and no viral sequences. Phylogenomic analyses indicate that Theobroma spp. are nested in the Malvaceae family. The main mtDNA differences are related to distinct structural rearrangements and exclusive regions associated with relics of transposable elements, supporting the hypothesis of dynamic mitochondrial genome maintenance and divergent evolutionary paths and pressures after species differentiation.


Subject(s)
Cacao , Genome, Mitochondrial , Cacao/genetics , Genome, Mitochondrial/genetics , Phylogeny , DNA Transposable Elements , Plastics , DNA, Mitochondrial
8.
Front Microbiol ; 14: 1267695, 2023.
Article in English | MEDLINE | ID: mdl-37869681

ABSTRACT

Identification of plasmids from sequencing data is an important and challenging problem related to the spread of antimicrobial resistance and other One-Health issues. We provide a new architecture for identifying plasmid contigs in fragmented genome assemblies built from short-read data. We employ graph neural networks (GNNs) and the assembly graph to propagate information from nearby nodes, which leads to more accurate classification, especially for short contigs that are difficult to classify based on sequence features or database searches alone. We trained plASgraph2 on a data set of samples from the ESKAPEE group of pathogens. plASgraph2 either outperforms or performs on par with a wide range of state-of-the-art methods on testing sets of independent ESKAPEE samples and samples from related pathogens. On the one hand, our study provides a new, accurate and easy-to-use tool for contig classification in bacterial isolates; on the other hand, it serves as a proof of concept for the use of GNNs in genomics. Our software is available at https://github.com/cchauve/plasgraph2 and the training and testing data sets are available at https://github.com/fmfi-compbio/plasgraph2-datasets.

9.
Comput Struct Biotechnol J ; 21: 2394-2404, 2023.
Article in English | MEDLINE | ID: mdl-37066122

ABSTRACT

De novo assembly of next-generation metagenomic reads is widely used to provide taxonomic and functional information on the genomes in a microbial community. As strains are functionally specific, recovery of strain-resolved genomes is important but remains a challenge. Unitigs and assembly graphs are intermediate products generated during the assembly of reads into contigs, and they provide higher-resolution information on sequence connections. In this study, we propose a new approach, UGMAGrefiner (a unitig-level assembly graph-based metagenome-assembled genome refiner), which uses the connection and coverage information from unitig-level assembly graphs to recruit unbinned unitigs to MAGs, adjust binning results, and infer unitigs shared by multiple MAGs. In two simulated datasets (Simdata and CAMI data) and one real dataset (GD02), it outperforms two state-of-the-art assembly graph-based binning refinement tools in improving MAG quality by stably increasing the completeness of genomes. UGMAGrefiner can identify genome-specific clusters of genomes with below 99% average nucleotide identity for homologous sequences. For MAGs mixed with genome clusters of 99% similarity, it could distinguish 8 out of 9 genomes in Simdata and 8 out of 12 genomes in CAMI data. In GD02 data, it could identify 16 new unitig clusters representing genome-specific regions of mixed genomes and 4 unitig clusters representing new genomes from a total of 135 MAGs for further functional analysis. UGMAGrefiner provides an efficient way to obtain more complete MAGs and study genome-specific functions. It will be useful for improving taxonomic and functional information on genomes after de novo assembly.

10.
Genome Biol ; 22(1): 214, 2021 07 26.
Article in English | MEDLINE | ID: mdl-34311761

ABSTRACT

We introduce STrain Resolution ON assembly Graphs (STRONG), which identifies strains de novo from multiple metagenome samples. STRONG performs coassembly and binning into metagenome-assembled genomes (MAGs), and stores the coassembly graph prior to variant simplification. This enables the subgraphs, and their per-sample unitig coverages, to be extracted for individual single-copy core genes (SCGs) in each MAG. A Bayesian algorithm, BayesPaths, determines the number of strains present, their haplotypes or sequences on the SCGs, and their abundances. STRONG is validated using synthetic communities and, for a real anaerobic digester time series, generates haplotypes that match those observed from long Nanopore reads.


Subject(s)
Algorithms , Genome, Bacterial , Metagenome , Microbial Consortia/genetics , Software , Bayes Theorem , Contig Mapping , Haplotypes , Metagenomics/methods , Sequence Analysis, DNA
11.
Genome Biol ; 21(1): 241, 2020 09 10.
Article in English | MEDLINE | ID: mdl-32912315

ABSTRACT

GetOrganelle is a state-of-the-art toolkit to accurately assemble organelle genomes from whole genome sequencing data. It recruits organelle-associated reads using a modified "baiting and iterative mapping" approach, conducts de novo assembly, filters and disentangles the assembly graph, and produces all possible configurations of circular organelle genomes. For 50 published plant datasets, we are able to reassemble the circular plastomes from 47 datasets using GetOrganelle. GetOrganelle assemblies are more accurate than published and/or NOVOPlasty-reassembled plastomes as assessed by mapping. We also assemble complete mitochondrial genomes using GetOrganelle. GetOrganelle is freely released under a GPL-3 license ( https://github.com/Kinggerm/GetOrganelle ).


Subject(s)
Genome, Mitochondrial , Genome, Plant , Genome, Plastid , Genomics/methods , Software
12.
Genome Biol ; 21(1): 68, 2020 03 14.
Article in English | MEDLINE | ID: mdl-32171299

ABSTRACT

Hybrid genome assembly has emerged as an important technique in bacterial genomics, but cost and labor requirements limit large-scale application. We present Ultraplexing, a method to improve per-sample sequencing cost and hands-on time of Nanopore sequencing for hybrid assembly by at least 50% compared to molecular barcoding while maintaining high assembly quality. Ultraplexing requires the availability of Illumina data and uses inter-sample genetic variability to assign reads to isolates, which obviates the need for molecular barcoding. Thus, Ultraplexing can enable significant sequencing and labor cost reductions in large-scale bacterial genome projects.


Subject(s)
Genome, Bacterial , Nanopore Sequencing/methods , High-Throughput Nucleotide Sequencing , Humans , Plasmids/genetics , Staphylococcus aureus/genetics , Staphylococcus aureus/isolation & purification
13.
J Comput Biol ; 27(3): 317-329, 2020 03.
Article in English | MEDLINE | ID: mdl-32058803

ABSTRACT

Many problems in applied machine learning deal with graphs (also called networks), including social networks, security, web data mining, protein function prediction, and genome informatics. The kernel paradigm beautifully decouples the learning algorithm from the underlying geometric space, which renders graph kernels important for the aforementioned applications. In this article, we give a new graph kernel, which we call graph traversal edit distance (GTED). We introduce the GTED problem and give the first polynomial time algorithm for it. Informally, the GTED is the minimum edit distance between two strings formed by the edge labels of respective Eulerian traversals of the two graphs. Also, GTED is motivated by and provides the first mathematical formalism for sequence co-assembly and de novo variation detection in bioinformatics. We demonstrate that GTED admits a polynomial time algorithm using a linear program in the graph product space that is guaranteed to yield an integer solution. To the best of our knowledge, this is the first approach to this problem. We also give a linear programming relaxation algorithm for a lower bound on GTED. We use GTED as a graph kernel and evaluate it by computing the accuracy of a support vector machine (SVM) classifier on a few data sets in the literature. Our results suggest that our kernel outperforms many of the common graph kernels in the tested data sets. As a second set of experiments, we successfully cluster viral genomes using GTED on their assembly graphs obtained from de novo assembly of next-generation sequencing reads.


Subject(s)
Computational Biology/methods , Programming, Linear , Algorithms , Animals , Data Mining , High-Throughput Nucleotide Sequencing , Humans , Support Vector Machine
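
The GTED abstract above defines the distance informally as the minimum edit distance between two strings spelled out by Eulerian traversals of the two graphs. The building block of that definition, the edit distance between two fixed strings, can be sketched with the standard dynamic-programming recurrence. This Python sketch covers only that inner string distance; it does not search over traversals and it is not the linear-programming algorithm the paper describes. The function name is an assumption.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    # prev[j] holds the distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # match or substitute
            ))
        prev = curr
    return prev[-1]


# Two strings that could be spelled by traversals of two small graphs.
d = edit_distance("ACGTT", "ACTTG")
# d == 2 (delete the G, then append a G)
```

GTED, as the abstract defines it, would take the minimum of this quantity over all pairs of Eulerian traversals, which is why a dedicated polynomial-time formulation matters: enumerating traversals directly is impractical for anything but tiny graphs.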
14.
PeerJ ; 4: e2681, 2016.
Article in English | MEDLINE | ID: mdl-27843717

ABSTRACT

The "Graphical Fragment Assembly" (GFA) is an emerging format for the representation of sequence assembly graphs, which can be adopted by both de Bruijn graph- and string graph-based assemblers. Here we present RGFA, an implementation of the proposed GFA specification in Ruby. It allows the user to conveniently parse, edit and write GFA files. Complex operations such as the separation of the implicit instances of repeats and the merging of linear paths can be performed. A typical application of RGFA is the editing of a graph, to finish the assembly of a sequence, using information not available to the assembler. We illustrate a use case, in which the assembly of a repetitive metagenomic fosmid insert was completed using a script based on RGFA. Furthermore, we show how the API provided by RGFA can be employed to design complex graph editing algorithms. As an example, we developed a detection algorithm for CRISPRs in a de Bruijn graph. Finally, RGFA can be used for comparing assembly graphs, e.g., to document the changes in a graph after applying a GUI editor. A program, GFAdiff, is provided, which compares the information in two graphs and generates a report or a Ruby script documenting the transformation steps between the graphs.
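
To make the parse and write operations mentioned in the abstract above concrete, here is a minimal Python sketch that reads and writes the two most common GFA 1 record types: `S` (segment) and `L` (link), both tab-separated. It is not RGFA and does not mirror its Ruby API; the function names are assumptions, and optional tags, containments and paths are ignored.

```python
def parse_gfa(text):
    """Parse GFA 1 text into (segments, links).

    segments: dict of segment name -> sequence
    links: list of (from, from_orient, to, to_orient, overlap) tuples
    """
    segments = {}
    links = []
    for line in text.splitlines():
        fields = line.split("\t")
        record_type = fields[0]
        if record_type == "S" and len(fields) >= 3:
            segments[fields[1]] = fields[2]
        elif record_type == "L" and len(fields) >= 6:
            links.append(tuple(fields[1:6]))
        # other record types (H, C, P, ...) are skipped in this sketch
    return segments, links


def to_gfa(segments, links):
    """Serialize segments and links back to GFA 1 text."""
    lines = ["H\tVN:Z:1.0"]
    lines += [f"S\t{name}\t{seq}" for name, seq in segments.items()]
    lines += ["L\t" + "\t".join(link) for link in links]
    return "\n".join(lines) + "\n"


example = "H\tVN:Z:1.0\nS\ts1\tACGT\nS\ts2\tGTTA\nL\ts1\t+\ts2\t-\t2M\n"
segments, links = parse_gfa(example)
# segments == {'s1': 'ACGT', 's2': 'GTTA'}
# links == [('s1', '+', 's2', '-', '2M')]
```

Holding the graph as plain dictionaries and tuples keeps edits simple (add or remove an entry, then call `to_gfa`), at the cost of discarding any optional fields present in the input.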
