RESUMO
Public databases contain a planetary collection of nucleic acid sequences, but their systematic exploration has been inhibited by a lack of efficient methods for searching this corpus, which (at the time of writing) exceeds 20 petabases and is growing exponentially1. Here we developed a cloud computing infrastructure, Serratus, to enable ultra-high-throughput sequence alignment at the petabase scale. We searched 5.7 million biologically diverse samples (10.2 petabases) for the hallmark gene RNA-dependent RNA polymerase and identified well over 105 novel RNA viruses, thereby expanding the number of known species by roughly an order of magnitude. We characterized novel viruses related to coronaviruses, hepatitis delta virus and huge phages, respectively, and analysed their environmental reservoirs. To catalyse the ongoing revolution of viral discovery, we established a free and comprehensive database of these data and tools. Expanding the known sequence diversity of viruses can reveal the evolutionary origins of emerging pathogens and improve pathogen surveillance for the anticipation and mitigation of future pandemics.
Assuntos
Computação em Nuvem , Bases de Dados Genéticas , Vírus de RNA/genética , Vírus de RNA/isolamento & purificação , Alinhamento de Sequência/métodos , Virologia/métodos , Viroma/genética , Animais , Arquivos , Bacteriófagos/enzimologia , Bacteriófagos/genética , Biodiversidade , Coronavirus/classificação , Coronavirus/enzimologia , Coronavirus/genética , Evolução Molecular , Vírus Delta da Hepatite/enzimologia , Vírus Delta da Hepatite/genética , Humanos , Modelos Moleculares , Vírus de RNA/classificação , Vírus de RNA/enzimologia , RNA Polimerase Dependente de RNA/química , RNA Polimerase Dependente de RNA/genética , SoftwareRESUMO
We are at the beginning of a genomic revolution in which all known species are planned to be sequenced. Accessing such data for comparative analyses is crucial in this new age of data-driven biology. Here, we introduce an improved version of DIAMOND that greatly exceeds previous search performances and harnesses supercomputing to perform tree-of-life scale protein alignments in hours, while matching the sensitivity of the gold standard BLASTP.
Assuntos
Biologia Computacional/métodos , Proteínas/química , Alinhamento de Sequência , AlgoritmosRESUMO
The alignment of sequencing reads against a protein reference database is a major computational bottleneck in metagenomics and data-intensive evolutionary projects. Although recent tools offer improved performance over the gold standard BLASTX, they exhibit only a modest speedup or low sensitivity. We introduce DIAMOND, an open-source algorithm based on double indexing that is 20,000 times faster than BLASTX on short reads and has a similar degree of sensitivity.
Assuntos
Metagenômica/métodos , Alinhamento de Sequência/métodos , Software , Algoritmos , Sequência de Bases , Humanos , Microbiota/genética , Sensibilidade e Especificidade , Análise de Sequência de DNARESUMO
Sequence alignments are the foundations of life science research, but most innovation so far focuses on optimal alignments, while information derived from suboptimal solutions is ignored. We argue that one optimal alignment per pairwise sequence comparison is a reasonable approximation when dealing with very similar sequences but is insufficient when exploring the biodiversity of the protein universe at tree-of-life scale. To overcome this limitation, we introduce pairwise alignment-safety to uncover the amino acid positions robustly shared across all suboptimal solutions. We implement EMERALD, a software library for alignment-safety inference, and apply it to 400k sequences from the SwissProt database.
Assuntos
Algoritmos , Software , Animais , Sequência de Aminoácidos , Alinhamento de Sequência , Proteínas/genética , Proteínas/química , AvesRESUMO
BACKGROUND: Short-read sequencing technologies have long been the work-horse of microbiome analysis. Continuing technological advances are making the application of long-read sequencing to metagenomic samples increasingly feasible. RESULTS: We demonstrate that whole bacterial chromosomes can be obtained from an enriched community, by application of MinION sequencing to a sample from an EBPR bioreactor, producing 6 Gb of sequence that assembles into multiple closed bacterial chromosomes. We provide a simple pipeline for processing such data, which includes a new approach to correcting erroneous frame-shifts. CONCLUSIONS: Advances in long-read sequencing technology and corresponding algorithms will allow the routine extraction of whole chromosomes from environmental samples, providing a more detailed picture of individual members of a microbiome.