Results 1 - 13 of 13
1.
Cell ; 184(13): 3376-3393.e17, 2021 06 24.
Article in English | MEDLINE | ID: mdl-34043940

ABSTRACT

We present a global atlas of 4,728 metagenomic samples from mass-transit systems in 60 cities over 3 years, representing the first systematic, worldwide catalog of the urban microbial ecosystem. This atlas provides an annotated, geospatial profile of microbial strains, functional characteristics, antimicrobial resistance (AMR) markers, and genetic elements, including 10,928 viruses, 1,302 bacteria, 2 archaea, and 838,532 CRISPR arrays not found in reference databases. We identified 4,246 known species of urban microorganisms and a consistent set of 31 species found in 97% of samples that were distinct from human commensal organisms. Profiles of AMR genes varied widely in type and density across cities. Cities showed distinct microbial taxonomic signatures that were driven by climate and geographic differences. These results constitute a high-resolution global metagenomic atlas that enables discovery of organisms and genes, highlights potential public health and forensic applications, and provides a culture-independent view of AMR burden in cities.


Subjects
Drug Resistance, Bacterial/genetics; Metagenomics; Microbiota/genetics; Urban Population; Biodiversity; Databases, Genetic; Humans
2.
Nature ; 607(7917): 111-118, 2022 07.
Article in English | MEDLINE | ID: mdl-35732736

ABSTRACT

Natural microbial communities are phylogenetically and metabolically diverse. In addition to underexplored organismal groups, this diversity encompasses a rich discovery potential for ecologically and biotechnologically relevant enzymes and biochemical compounds. However, studying this diversity to identify genomic pathways for the synthesis of such compounds and assigning them to their respective hosts remains challenging. The biosynthetic potential of microorganisms in the open ocean remains largely uncharted owing to limitations in the analysis of genome-resolved data at the global scale. Here we investigated the diversity and novelty of biosynthetic gene clusters in the ocean by integrating around 10,000 microbial genomes from cultivated and single cells with more than 25,000 newly reconstructed draft genomes from more than 1,000 seawater samples. These efforts revealed approximately 40,000 putative, mostly new biosynthetic gene clusters, several of which were found in previously unsuspected phylogenetic groups. Among these groups, we identified a lineage rich in biosynthetic gene clusters ('Candidatus Eudoremicrobiaceae') that belongs to an uncultivated bacterial phylum and includes some of the most biosynthetically diverse microorganisms in this environment. From these, we characterized the phospeptin and pythonamide pathways, revealing cases of unusual bioactive compound structure and enzymology, respectively. Together, this research demonstrates how microbiomics-driven strategies can enable the investigation of previously undescribed enzymes and natural products in underexplored microbial groups and environments.


Subjects
Biosynthetic Pathways; Microbiota; Oceans and Seas; Bacteria/classification; Bacteria/genetics; Biosynthetic Pathways/genetics; Genomics; Microbiota/genetics; Multigene Family/genetics; Phylogeny
3.
Genome Res ; 33(7): 1208-1217, 2023 07.
Article in English | MEDLINE | ID: mdl-37072187

ABSTRACT

Sequence-to-graph alignment is crucial for applications such as variant genotyping, read error correction, and genome assembly. We propose a novel seeding approach that relies on long inexact matches rather than short exact matches, and show that it yields a better time-accuracy trade-off in settings with up to a [Formula: see text] mutation rate. We use sketches of a subset of graph nodes, which are more robust to indels, and store them in a k-nearest neighbor index to avoid the curse of dimensionality. Our approach contrasts with existing methods and highlights the important role that sketching into vector space can play in bioinformatics applications. We show that our method scales to graphs with 1 billion nodes and has quasi-logarithmic query time for queries with an edit distance of [Formula: see text] For such queries, longer sketch-based seeds yield a [Formula: see text] increase in recall compared with exact seeds. Our approach can be incorporated into other aligners, providing a novel direction for sequence-to-graph alignment.


Subjects
Algorithms; Computational Biology; Computational Biology/methods; Sequence Alignment; Sequence Analysis, DNA/methods
4.
Genome Res ; 2022 May 24.
Article in English | MEDLINE | ID: mdl-35609994

ABSTRACT

Sequencing data are rapidly accumulating in public repositories. Making this resource accessible for interactive analysis at scale requires efficient approaches for its storage and indexing. There have recently been remarkable advances in building compressed representations of annotated (or colored) de Bruijn graphs for efficiently indexing k-mer sets. However, approaches for representing quantitative attributes such as gene expression or genome positions in a general manner have remained underexplored. In this work, we propose counting de Bruijn graphs, a notion generalizing annotated de Bruijn graphs by supplementing each node-label relation with one or many attributes (e.g., a k-mer count or its positions). Counting de Bruijn graphs index k-mer abundances from 2652 human RNA-seq samples in over eightfold smaller representations compared with state-of-the-art bioinformatics tools and are faster to construct and query. Furthermore, counting de Bruijn graphs with positional annotations losslessly represent entire reads in indexes on average 27% smaller than the input compressed with gzip for human Illumina RNA-seq and 57% smaller for Pacific Biosciences (PacBio) HiFi sequencing of viral samples. A complete searchable index of all viral PacBio SMRT reads from NCBI's Sequence Read Archive (SRA) (152,884 samples, 875 Gbp) comprises only 178 GB. Finally, on the full RefSeq collection, we generate a lossless and fully queryable index that is 4.6-fold smaller than the MegaBLAST index. The techniques proposed in this work naturally complement existing methods and tools using de Bruijn graphs, and significantly broaden their applicability: from indexing k-mer counts and genome positions to implementing novel sequence alignment algorithms on top of highly compressed graph-based sequence indexes.
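
As a rough illustration of the notion described in this abstract (each node-label relation carrying a count), the following Python sketch builds a toy, uncompressed index from k-mers to per-sample counts. It is not the authors' implementation: the function names `kmers`, `build_counting_index`, and `query_counts`, the sample labels, and the plain dictionary layout are all assumptions made for illustration, and none of the compression the abstract reports is attempted.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """Yield every k-mer (substring of length k) of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_counting_index(samples, k):
    """Map each k-mer (a graph node) to {sample label: count}.

    samples: dict mapping a hypothetical sample label to a list of reads.
    """
    index = defaultdict(dict)
    for label, reads in samples.items():
        counts = Counter(km for read in reads for km in kmers(read, k))
        for km, count in counts.items():
            index[km][label] = count
    return dict(index)


def query_counts(index, query, k):
    """For each label, list the count of every k-mer along the query."""
    labels = sorted({label for row in index.values() for label in row})
    return {
        label: [index.get(km, {}).get(label, 0) for km in kmers(query, k)]
        for label in labels
    }


# Toy input: two hypothetical samples with one read each.
samples = {"s1": ["ACGTACGT"], "s2": ["ACGTT"]}
index = build_counting_index(samples, k=4)
profile = query_counts(index, "ACGTA", k=4)
```

Here `profile` holds one abundance vector per sample along the query, which is the kind of quantitative attribute a plain presence/absence annotation cannot express.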

5.
Bioinformatics ; 40(Suppl 1): i337-i346, 2024 06 28.
Article in English | MEDLINE | ID: mdl-38940164

ABSTRACT

MOTIVATION: Exponential growth in sequencing databases has motivated scalable de Bruijn graph-based (DBG) indexing for searching these data, using annotations to label nodes with sample IDs. Low-depth sequencing samples correspond to fragmented subgraphs, complicating finding the long contiguous walks required for alignment queries. Aligners that target single-labelled subgraphs reduce alignment lengths due to fragmentation, leading to low recall for long reads. While some (e.g. label-free) aligners partially overcome fragmentation by combining information from multiple samples, biologically irrelevant combinations in such approaches can inflate the search space or reduce accuracy. RESULTS: We introduce a new scoring model, 'multi-label alignment' (MLA), for annotated DBGs. MLA leverages two new operations: To promote biologically relevant sample combinations, 'Label Change' incorporates more informative global sample similarity into local scores. To improve connectivity, 'Node Length Change' dynamically adjusts the DBG node length during traversal. Our fast, approximate, yet accurate MLA implementation has two key steps: a single-label seed-chain-extend aligner (SCA) and a multi-label chainer (MLC). SCA uses a traditional scoring model adapting recent chaining improvements to assembly graphs and provides a curated pool of alignments. MLC extracts seed anchors from SCA's alignments, produces multi-label chains using MLA scoring, then finally forms multi-label alignments. We show via substantial improvements in taxonomic classification accuracy that MLA produces biologically relevant alignments, decreasing average weighted UniFrac errors by 63.1%-66.8% and covering 45.5%-47.4% (median) more long-read query characters than state-of-the-art aligners. MLA's runtimes are competitive with label-combining alignment and substantially faster than single-label alignment.
AVAILABILITY AND IMPLEMENTATION: The data, scripts, and instructions for generating our results are available at https://github.com/ratschlab/mla.


Subjects
Algorithms; Sequence Alignment; Sequence Alignment/methods; Software; Computational Biology/methods; Sequence Analysis, DNA/methods; Databases, Genetic
6.
Bioinformatics ; 37(Suppl_1): i169-i176, 2021 07 12.
Article in English | MEDLINE | ID: mdl-34252940

ABSTRACT

MOTIVATION: Since the amount of published biological sequencing data is growing exponentially, efficient methods for storing and indexing this data are more needed than ever to truly benefit from this invaluable resource for biomedical research. Labeled de Bruijn graphs are a frequently-used approach for representing large sets of sequencing data. While significant progress has been made to succinctly represent the graph itself, efficient methods for storing labels on such graphs are still rapidly evolving. RESULTS: In this article, we present RowDiff, a new technique for compacting graph labels by leveraging expected similarities in annotations of vertices adjacent in the graph. RowDiff can be constructed in linear time relative to the number of vertices and labels in the graph, and in space proportional to the graph size. In addition, construction can be efficiently parallelized and distributed, making the technique applicable to graphs with trillions of nodes. RowDiff can be viewed as an intermediary sparsification step of the original annotation matrix and can thus naturally be combined with existing generic schemes for compressed binary matrices. Experiments on 10 000 RNA-seq datasets show that RowDiff combined with multi-BRWT results in a 30% reduction in annotation footprint over Mantis-MST, the previously known most compact annotation representation. Experiments on the sparser Fungi subset of the RefSeq collection show that applying RowDiff sparsification reduces the size of individual annotation columns stored as compressed bit vectors by an average factor of 42. When combining RowDiff with a multi-BRWT representation, the resulting annotation is 26 times smaller than Mantis-MST. AVAILABILITY AND IMPLEMENTATION: RowDiff is implemented in C++ within the MetaGraph framework. The source code and the data used in the experiments are publicly available at https://github.com/ratschlab/row_diff.


Subjects
Algorithms; Biomedical Research; Software
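
The core idea in the RowDiff abstract above, exploiting the similarity between annotations of adjacent vertices, can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the MetaGraph implementation: each annotation row is modeled as a set of label IDs, each node has at most one designated successor, and a hypothetical set of anchor nodes keeps full rows so that decoding always terminates. The function names are invented for this sketch.

```python
def rowdiff_encode(rows, successor, anchors):
    """Replace each non-anchor row by its symmetric difference with the
    row of its designated successor; anchors keep their full row."""
    encoded = {}
    for node, row in rows.items():
        if node in anchors or node not in successor:
            encoded[node] = set(row)
        else:
            encoded[node] = set(row) ^ set(rows[successor[node]])
    return encoded


def rowdiff_decode(encoded, successor, anchors, node):
    """Rebuild one row by XOR-ing stored diffs along the successor path
    until an anchor (or a node without successor) is reached."""
    result = set()
    while True:
        result ^= encoded[node]
        if node in anchors or node not in successor:
            return result
        node = successor[node]


# Toy path 0 -> 1 -> 2 -> 3 with similar annotations on neighbors.
rows = {0: {1, 2}, 1: {1, 2}, 2: {1, 2, 3}, 3: {1, 3}}
successor = {0: 1, 1: 2, 2: 3}
anchors = {3}

encoded = rowdiff_encode(rows, successor, anchors)
stored_before = sum(len(r) for r in rows.values())
stored_after = sum(len(r) for r in encoded.values())
```

In this toy example the transformed rows hold fewer set bits than the originals, which is why such a sparsified matrix can then be handed to a generic compressed binary matrix scheme, as the abstract describes. The sketch assumes every successor path reaches an anchor; a path that cycles without one would not terminate.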
7.
Bioinformatics ; 35(3): 407-414, 2019 02 01.
Article in English | MEDLINE | ID: mdl-30020403

ABSTRACT

Motivation: Technological advancements in high-throughput DNA sequencing have led to an exponential growth of sequencing data being produced and stored as a byproduct of biomedical research. Despite its public availability, a majority of this data remains hard to query for the research community due to a lack of efficient data representation and indexing solutions. One of the available techniques to represent read data is a condensed form as an assembly graph. Such a representation contains all sequence information but does not store contextual information and metadata. Results: We present two new approaches for a compressed representation of a graph coloring: a lossless compression scheme based on a novel application of wavelet tries as well as a highly accurate lossy compression based on a set of Bloom filters. Both strategies retain a coloring even when adding to the underlying graph topology. We present construction and merge procedures for both methods and evaluate their performance on a wide range of different datasets. By dropping the requirement of a fully lossless compression and using the topological information of the underlying graph, we can reduce memory requirements by up to three orders of magnitude. Representing individual colors as independently stored modules, our approaches can be efficiently parallelized and provide strategies for dynamic use. These properties allow for an easy upscaling to the problem sizes common to the biomedical domain. Availability and implementation: We provide prototype implementations in C++, summaries of our experiments as well as links to all datasets publicly at https://github.com/ratschlab/graph_annotation. Supplementary information: Supplementary data are available at Bioinformatics online.


Subjects
Computational Biology; Data Compression; Software; Algorithms; Color; Genomics; High-Throughput Nucleotide Sequencing
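
The lossy scheme mentioned in the abstract above stores each color as an independent Bloom filter. The Python sketch below illustrates only that general idea, under assumptions of our own: the `BloomFilter` class, its SHA-256 based hashing, the filter sizes, and the helper `colors_of` are hypothetical, and the graph-topology-based correction the abstract refers to is not attempted.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, some false positives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def colors_of(filters, kmer):
    """Return every color whose filter reports the k-mer as present."""
    return {color for color, bf in filters.items() if kmer in bf}


# One independently stored filter per color (hypothetical sample names).
filters = {"sampleA": BloomFilter(1024, 3), "sampleB": BloomFilter(1024, 3)}
filters["sampleA"].add("ACGT")
filters["sampleB"].add("ACGT")
filters["sampleB"].add("CGTT")
```

Because each color lives in its own filter, filters can be built, merged, or queried independently, which matches the modular, parallelizable design the abstract highlights. A query may report extra colors (false positives) but never misses a color that was added.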
8.
BMC Biol ; 15(1): 100, 2017 10 30.
Article in English | MEDLINE | ID: mdl-29084520

ABSTRACT

BACKGROUND: Internal tagging of proteins by inserting small functional peptides into surface accessible permissive sites has proven to be an indispensable tool for basic and applied science. Permissive sites are typically identified by transposon mutagenesis on a case-by-case basis, limiting scalability and their exploitation as a system-wide protein engineering tool. METHODS: We developed an approach for predicting permissive stretches (PSs) in proteins based on the identification of length-variable regions (regions containing indels) in homologous proteins. RESULTS: We verify that a protein's primary structure information alone is sufficient to identify PSs. Identified PSs are predicted to be predominantly surface accessible; hence, the position of inserted peptides is likely suitable for diverse applications. We demonstrate the viability of this approach by inserting a Tobacco etch virus protease recognition site (TEV-tag) into several PSs in a wide range of proteins, from small monomeric enzymes (adenylate kinase) to large multi-subunit molecular machines (ATP synthase) and verify their functionality after insertion. We apply this method to engineer conditional protein knockdowns directly in the Escherichia coli chromosome and generate a cell-free platform with enhanced nucleotide stability. CONCLUSIONS: Functional internally tagged proteins can be rationally designed and directly chromosomally implemented. Critical for the successful design of protein knockdowns was the incorporation of surface accessibility and secondary structure predictions, as well as the design of an improved TEV-tag that enables efficient hydrolysis when inserted into the middle of a protein. This versatile and portable approach can likely be adapted for other applications, and broadly adopted. We provide guidelines for the design of internally tagged proteins in order to empower scientists with little or no protein engineering expertise to internally tag their target proteins.


Subjects
Endopeptidases/genetics; Escherichia coli Proteins/genetics; Escherichia coli/genetics; Genetic Engineering/methods; Endopeptidases/metabolism; Escherichia coli/metabolism; Genetic Engineering/instrumentation
9.
Nucleic Acids Res ; 41(17): e169, 2013 Sep.
Article in English | MEDLINE | ID: mdl-23921633

ABSTRACT

High-throughput sequencing technologies have allowed for the cataloguing of variation in personal human genomes. In this manuscript, we present alu-detect, a tool that combines read-pair and split-read information to detect novel Alus and their precise breakpoints directly from either whole-genome or whole-exome sequencing data while also identifying insertions directly in the vicinity of existing Alus. To set the parameters of our method, we use simulation of a faux reference, which allows us to compute the precision and recall of various parameter settings using real sequencing data. Applying our method to 100 bp paired Illumina data from seven individuals, including two trios, we detected on average 1519 novel Alus per sample. Based on the faux-reference simulation, we estimate that our method has 97% precision and 85% recall. We identify 808 novel Alus not previously described in other studies. We also demonstrate the use of alu-detect to study the local sequence and global location preferences for novel Alu insertions.


Subjects
Alu Elements; High-Throughput Nucleotide Sequencing/methods; Sequence Analysis, DNA/methods; Algorithms; Chromosome Breakpoints; Exome; Genome, Human; Genome-Wide Association Study; Humans; Polymerase Chain Reaction; Software
10.
Turk Neurosurg ; 32(5): 720-726, 2022.
Article in English | MEDLINE | ID: mdl-35179729

ABSTRACT

AIM: To report our experience of mechanical thrombectomy using the SOFIA™ catheter, in terms of its effectiveness and safety. MATERIAL AND METHODS: Acute ischemic stroke patients with large vessel occlusions who underwent mechanical thrombectomy, with the SOFIA™ aspiration catheter as the first-line approach, were retrospectively identified. For all patients, the data, including reperfusion success (modified Thrombolysis in Cerebral Infarction [mTICI]), procedural details, clinical status at the baseline and post-discharge at 90 days, and complications, were analysed. RESULTS: During the study period (January 2017-July 2020), 73 patients underwent endovascular thrombectomy. The mean age and the baseline National Institutes of Health Stroke scores were 72 (41-83) and 16 (12-25), respectively. Successful reperfusion (mTICI≥2b-3) was obtained in 80.8% (n=59) of the patients. Using ADAPT, a first-pass effect was achieved in 63.01% (n=46) of the patients. Rescue stent retriever (SRV) had to be utilized in 36.98% (n=27) of the patients; all presented with a favourable clinical outcome (modified Rankin score ≤0-2) at 90 days. The complication rate in the study was 13.7% (n=10). CONCLUSION: The contact aspiration approach with SOFIA™ catheters as a first-line device appears to be fast, safe, and effective. Our results were comparable to the findings of other series. In the case of insufficient response on contact aspiration, we could easily modify the SOFIA™ catheter approach for an additional stent retriever rescue treatment.


Subjects
Brain Ischemia; Endovascular Procedures; Ischemic Stroke; Stroke; Thrombosis; Aftercare; Brain Ischemia/complications; Catheters/adverse effects; Cerebral Infarction/complications; Endovascular Procedures/methods; Humans; Patient Discharge; Retrospective Studies; Stents/adverse effects; Stroke/complications; Stroke/surgery; Treatment Outcome
11.
J Comput Biol ; 27(4): 626-639, 2020 04.
Article in English | MEDLINE | ID: mdl-31891531

ABSTRACT

High-throughput DNA sequencing data are accumulating in public repositories, and efficient approaches for storing and indexing such data are in high demand. In recent research, several graph data structures have been proposed to represent large sets of sequencing data and to allow for efficient querying of sequences. In particular, the concept of labeled de Bruijn graphs has been explored by several groups. Although there has been good progress toward representing the sequence graph in small space, methods for storing a set of labels on top of such graphs are still not sufficiently explored. It is also currently not clear how characteristics of the input data, such as the sparsity and correlations of labels, can help to inform the choice of method to compress the graph labeling. In this study, we present a new compression approach, Multi-binary relation wavelet tree (BRWT), which is adaptive to different kinds of input data. We show an up to 29% improvement in compression performance over the basic BRWT method, and up to a 68% improvement over the current state-of-the-art for de Bruijn graph label compression. To put our results into perspective, we present a systematic analysis of five different state-of-the-art annotation compression schemes, evaluate key metrics on both artificial and real-world data, and discuss how different data characteristics influence the compression performance. We show that the improvements of our new method can be robustly reproduced for different representative real-world data sets.


Subjects
Genome/genetics; High-Throughput Nucleotide Sequencing/methods; Sequence Analysis, DNA/methods; Software; Algorithms; Computational Biology; Data Compression; Humans; Molecular Sequence Annotation/methods
12.
Mob Genet Elements ; 4(5): 1-7, 2014 Oct.
Article in English | MEDLINE | ID: mdl-26442170

ABSTRACT

Repetitive elements generally, and Alu inserts specifically, are a large contributor to the recent evolution of the human genome. By assembling the sequences of novel Alu inserts using their respective subfamily consensus sequences as references, we found an exponential decay in the Alu subfamily call enrichment with increased number of sequence variants (Pearson correlation [Formula: see text], [Formula: see text]). By mapping the sequences of these inserts to a human reference genome, we infer the reference Alu sources of a subset of the novel Alus, of which 85% were previously shown to be active. We also evaluate relationships between the loci of the novel inserts and their inferred sources.
