Search | Secretaria de Estado da Saúde

Petabase-scale sequence alignment catalyses viral discovery.

Edgar, Robert C; Taylor, Brie; Lin, Victor; Altman, Tomer; Barbera, Pierre; Meleshko, Dmitry; Lohr, Dan; Novakovsky, Gherman; Buchfink, Benjamin; Al-Shayeb, Basem; Banfield, Jillian F; de la Peña, Marcos; Korobeynikov, Anton; Chikhi, Rayan; Babaian, Artem.

Nature ; 602(7895): 142-147, 2022 02.

Article in English | MEDLINE | ID: mdl-35082445

ABSTRACT

Public databases contain a planetary collection of nucleic acid sequences, but their systematic exploration has been inhibited by a lack of efficient methods for searching this corpus, which (at the time of writing) exceeds 20 petabases and is growing exponentially¹. Here we developed a cloud computing infrastructure, Serratus, to enable ultra-high-throughput sequence alignment at the petabase scale. We searched 5.7 million biologically diverse samples (10.2 petabases) for the hallmark gene RNA-dependent RNA polymerase and identified well over 10⁵ novel RNA viruses, thereby expanding the number of known species by roughly an order of magnitude. We characterized novel viruses related to coronaviruses, hepatitis delta virus and huge phages, respectively, and analysed their environmental reservoirs. To catalyse the ongoing revolution of viral discovery, we established a free and comprehensive database of these data and tools. Expanding the known sequence diversity of viruses can reveal the evolutionary origins of emerging pathogens and improve pathogen surveillance for the anticipation and mitigation of future pandemics.

Subjects

Cloud Computing , Genetic Databases , RNA Viruses/genetics , RNA Viruses/isolation & purification , Sequence Alignment/methods , Virology/methods , Virome/genetics , Animals , Archives , Bacteriophages/enzymology , Bacteriophages/genetics , Biodiversity , Coronavirus/classification , Coronavirus/enzymology , Coronavirus/genetics , Molecular Evolution , Hepatitis Delta Virus/enzymology , Hepatitis Delta Virus/genetics , Humans , Molecular Models , RNA Viruses/classification , RNA Viruses/enzymology , RNA-Dependent RNA Polymerase/chemistry , RNA-Dependent RNA Polymerase/genetics , Software

Phylogenetic Analysis of SARS-CoV-2 Data Is Difficult.

Morel, Benoit; Barbera, Pierre; Czech, Lucas; Bettisworth, Ben; Hübner, Lukas; Lutteropp, Sarah; Serdari, Dora; Kostaki, Evangelia-Georgia; Mamais, Ioannis; Kozlov, Alexey M; Pavlidis, Pavlos; Paraskevis, Dimitrios; Stamatakis, Alexandros.

Mol Biol Evol ; 38(5): 1777-1791, 2021 05 04.

Article in English | MEDLINE | ID: mdl-33316067

ABSTRACT

Numerous studies covering some aspects of SARS-CoV-2 data analyses are being published on a daily basis, including a regularly updated phylogeny on nextstrain.org. Here, we review the difficulties of inferring reliable phylogenies by example of a data snapshot comprising a quality-filtered subset of 8,736 out of all 16,453 virus sequences available on May 5, 2020 from gisaid.org. We find that it is difficult to infer a reliable phylogeny on these data due to the large number of sequences in conjunction with the low number of mutations. We further find that rooting the inferred phylogeny with some degree of confidence either via the bat and pangolin outgroups or by applying novel computational methods on the ingroup phylogeny does not appear to be credible. Finally, an automatic classification of the current sequences into subclasses using the mPTP tool for molecular species delimitation is also, as might be expected, not possible, as the sequences are too closely related. We conclude that, although the application of phylogenetic methods to disentangle the evolution and spread of COVID-19 provides some insight, results of phylogenetic analyses, in particular those conducted under the default settings of current phylogenetic inference tools, as well as downstream analyses on the inferred phylogenies, should be considered and interpreted with extreme caution.

Subjects

COVID-19/genetics , Molecular Evolution , Viral Genome , Mutation , Phylogeny , SARS-CoV-2/genetics , Humans

Genesis and Gappa: processing, analyzing and visualizing phylogenetic (placement) data.

Czech, Lucas; Barbera, Pierre; Stamatakis, Alexandros.

Bioinformatics ; 36(10): 3263-3265, 2020 05 01.

Article in English | MEDLINE | ID: mdl-32016344

ABSTRACT

SUMMARY: We present genesis, a library for working with phylogenetic data, and gappa, an accompanying command-line tool for conducting typical analyses on such data. The tools target phylogenetic trees and phylogenetic placements, sequences, taxonomies and other relevant data types, offer high-level simplicity as well as low-level customizability, and are computationally efficient, well-tested and field-proven. AVAILABILITY AND IMPLEMENTATION: Both genesis and gappa are written in modern C++11, and are freely available under GPLv3 at http://github.com/lczech/genesis and http://github.com/lczech/gappa. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subjects

Libraries , Software , Gene Library , Phylogeny

Methods for automatic reference trees and multilevel phylogenetic placement.

Czech, Lucas; Barbera, Pierre; Stamatakis, Alexandros.

Bioinformatics ; 35(7): 1151-1158, 2019 04 01.

Article in English | MEDLINE | ID: mdl-30169747

ABSTRACT

MOTIVATION: In most metagenomic sequencing studies, the initial analysis step consists in assessing the evolutionary provenance of the sequences. Phylogenetic (or Evolutionary) Placement methods can be employed to determine the evolutionary position of sequences with respect to a given reference phylogeny. These placement methods do however face certain limitations: The manual selection of reference sequences is labor-intensive; the computational effort to infer reference phylogenies is substantially larger than for methods that rely on sequence similarity; the number of taxa in the reference phylogeny should be small enough to allow for visually inspecting the results. RESULTS: We present algorithms to overcome the above limitations. First, we introduce a method to automatically construct representative sequences from databases to infer reference phylogenies. Second, we present an approach for conducting large-scale phylogenetic placements on nested phylogenies. Third, we describe a preprocessing pipeline that allows for handling huge sequence datasets. Our experiments on empirical data show that our methods substantially accelerate the workflow and yield highly accurate placement results. AVAILABILITY AND IMPLEMENTATION: Freely available under GPLv3 at http://github.com/lczech/gappa. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subjects

Algorithms , Computational Biology , Molecular Evolution , Phylogeny , Software , Computational Biology/methods , Metagenome , Workflow

EPA-ng: Massively Parallel Evolutionary Placement of Genetic Sequences.

Barbera, Pierre; Kozlov, Alexey M; Czech, Lucas; Morel, Benoit; Darriba, Diego; Flouri, Tomás; Stamatakis, Alexandros.

Syst Biol ; 68(2): 365-369, 2019 03 01.

Article in English | MEDLINE | ID: mdl-30165689

ABSTRACT

Next generation sequencing (NGS) technologies have led to a ubiquity of molecular sequence data. This data avalanche is particularly challenging in metagenetics, which focuses on taxonomic identification of sequences obtained from diverse microbial environments. Phylogenetic placement methods determine how these sequences fit into an evolutionary context. Previous implementations of phylogenetic placement algorithms, such as the evolutionary placement algorithm (EPA) included in RAxML, or PPLACER, are being increasingly used for this purpose. However, due to the steady progress in NGS technologies, the current implementations face substantial scalability limitations. Herein, we present EPA-NG, a complete reimplementation of the EPA that is substantially faster, offers a distributed memory parallelization, and integrates concepts from both, RAxML-EPA and PPLACER. EPA-NG can be executed on standard shared memory, as well as on distributed memory systems (e.g., computing clusters). To demonstrate the scalability of EPA-NG, we placed 1 billion metagenetic reads from the Tara Oceans Project onto a reference tree with 3748 taxa in just under 7 h, using 2048 cores. Our performance assessment shows that EPA-NG outperforms RAxML-EPA and PPLACER by up to a factor of 30 in sequential execution mode, while attaining comparable parallel efficiency on shared memory systems. We further show that the distributed memory parallelization of EPA-NG scales well up to 2048 cores. EPA-NG is available under the AGPLv3 license: https://github.com/Pbdas/epa-ng.

Subjects

Algorithms , Classification/methods , Phylogeny , DNA Sequence Analysis , Software

Metagenomic Analysis Using Phylogenetic Placement-A Review of the First Decade.

Czech, Lucas; Stamatakis, Alexandros; Dunthorn, Micah; Barbera, Pierre.

Front Bioinform ; 2: 871393, 2022.

Article in English | MEDLINE | ID: mdl-36304302

ABSTRACT

Phylogenetic placement refers to a family of tools and methods to analyze, visualize, and interpret the tsunami of metagenomic sequencing data generated by high-throughput sequencing. Compared to alternative (e.g., similarity-based) methods, it puts metabarcoding sequences into a phylogenetic context using a set of known reference sequences and taking evolutionary history into account. Thereby, one can increase the accuracy of metagenomic surveys and eliminate the requirement for having exact or close matches with existing sequence databases. Phylogenetic placement constitutes a valuable analysis tool per se, but also entails a plethora of downstream tools to interpret its results. A common use case is to analyze species communities obtained from metagenomic sequencing, for example via taxonomic assignment, diversity quantification, sample comparison, and identification of correlations with environmental variables. In this review, we provide an overview of the methods developed during the first 10 years. In particular, the goals of this review are 1) to motivate the usage of phylogenetic placement and illustrate some of its use cases, 2) to outline the full workflow, from raw sequences to publishable figures, including best practices, 3) to introduce the most common tools and methods and their capabilities, 4) to point out common placement pitfalls and misconceptions, 5) to showcase typical placement-based analyses, and how they can help to analyze, visualize, and interpret phylogenetic placement data.

SCRAPP: A tool to assess the diversity of microbial samples from phylogenetic placements.

Barbera, Pierre; Czech, Lucas; Lutteropp, Sarah; Stamatakis, Alexandros.

Mol Ecol Resour ; 21(1): 340-349, 2021 Jan.

Article in English | MEDLINE | ID: mdl-32996237

ABSTRACT

Microbial ecology research is currently driven by the continuously decreasing cost of DNA sequencing and the improving accuracy of data analysis methods. One such analysis method is phylogenetic placement, which establishes the phylogenetic identity of the anonymous environmental sequences in a sample by means of a given phylogenetic reference tree. However, assessing the diversity of a sample remains challenging, as traditional methods do not scale well with the increasing data volumes and/or do not leverage the phylogenetic placement information. Here, we present scrapp, a highly parallel and scalable tool that uses a molecular species delimitation algorithm to quantify the diversity distribution over the reference phylogeny for a given phylogenetic placement of the sample. scrapp employs a novel approach to cluster phylogenetic placements, called placement space clustering, to efficiently perform dimensionality reduction, so as to scale on large data volumes. Furthermore, it uses the phylogeny-aware molecular species delimitation method mPTP to quantify diversity. We evaluated scrapp using both simulated and empirical data sets. We use simulated data to verify our approach. Tests on an empirical data set show that scrapp-derived metrics can classify samples by their diversity-correlated features equally well or better than existing, commonly used approaches. scrapp is available at https://github.com/pbdas/scrapp.

Subjects

Algorithms , Microbiota , Phylogeny , Software , DNA Sequence Analysis

Long-read metabarcoding of the eukaryotic rDNA operon to phylogenetically and taxonomically resolve environmental diversity.

Jamy, Mahwash; Foster, Rachel; Barbera, Pierre; Czech, Lucas; Kozlov, Alexey; Stamatakis, Alexandros; Bending, Gary; Hilton, Sally; Bass, David; Burki, Fabien.

Mol Ecol Resour ; 20(2): 429-443, 2020 Mar.

Article in English | MEDLINE | ID: mdl-31705734

ABSTRACT

High-throughput DNA metabarcoding of amplicon sizes below 500 bp has revolutionized the analysis of environmental microbial diversity. However, these short regions contain limited phylogenetic signal, which makes it impractical to use environmental DNA in full phylogenetic inferences. This lesser phylogenetic resolution of short amplicons may be overcome by new long-read sequencing technologies. To test this idea, we amplified soil DNA and used PacBio Circular Consensus Sequencing (CCS) to obtain an ~4500-bp region spanning most of the eukaryotic small subunit (18S) and large subunit (28S) ribosomal DNA genes. We first treated the CCS reads with a novel curation workflow, generating 650 high-quality operational taxonomic units (OTUs) containing the physically linked 18S and 28S regions. To assign taxonomy to these OTUs, we developed a phylogeny-aware approach based on the 18S region that showed greater accuracy and sensitivity than similarity-based methods. The taxonomically annotated OTUs were then combined with available 18S and 28S reference sequences to infer a well-resolved phylogeny spanning all major groups of eukaryotes, allowing us to accurately derive the evolutionary origin of environmental diversity. A total of 1,019 sequences were included, of which a majority (58%) corresponded to the new long environmental OTUs. The long reads also allowed us to directly investigate the relationships among environmental sequences themselves, which represents a key advantage over the placement of short reads on a reference phylogeny. Together, our results show that long amplicons can be treated in a full phylogenetic framework to provide greater taxonomic resolution and a robust evolutionary perspective to environmental DNA.

Subjects

Eukaryota/classification , Eukaryota/genetics , Eukaryota/isolation & purification , Phylogeny , Biodiversity , Taxonomic DNA Barcoding , Environmental DNA/genetics , Ribosomal DNA/genetics , Operon , 18S Ribosomal RNA/genetics , 28S Ribosomal RNA/genetics , Soil/parasitology
