ABSTRACT
To quantify known and unknown microorganisms at species-level resolution using shotgun sequencing data, we developed a method that establishes metagenomic operational taxonomic units (mOTUs) based on single-copy phylogenetic marker genes. Applied to 252 human fecal samples, the method revealed that on average 43% of the species abundance and 58% of the richness cannot be captured by current reference genome-based methods. An implementation of the method is available at http://www.bork.embl.de/software/mOTU/.
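As an illustration of the two summary statistics reported above, the following minimal sketch computes, for a single sample, the share of total abundance and the share of richness contributed by mOTUs that lack a reference genome. The data layout, the mOTU names, the numbers, and the function name `unreferenced_shares` are hypothetical and are not taken from the mOTU software.

```python
from typing import Dict, Tuple


def unreferenced_shares(
    abundance: Dict[str, float],
    has_reference: Dict[str, bool],
) -> Tuple[float, float]:
    """Return (abundance share, richness share) of mOTUs without a reference genome.

    Only mOTUs with abundance > 0 are treated as present in the sample.
    """
    present = {m: a for m, a in abundance.items() if a > 0}
    if not present:
        raise ValueError("sample contains no detected mOTUs")
    # mOTUs that are present but have no sequenced reference genome
    unref = {m: a for m, a in present.items() if not has_reference[m]}
    abundance_share = sum(unref.values()) / sum(present.values())
    richness_share = len(unref) / len(present)
    return abundance_share, richness_share


# Toy sample with made-up abundances (hypothetical, for illustration only)
sample = {"mOTU_a": 40.0, "mOTU_b": 30.0, "mOTU_c": 20.0, "mOTU_d": 10.0, "mOTU_e": 0.0}
ref = {"mOTU_a": True, "mOTU_b": False, "mOTU_c": True, "mOTU_d": False, "mOTU_e": False}
shares = unreferenced_shares(sample, ref)
```

In a study with many samples, such per-sample shares would presumably be averaged across samples to obtain figures comparable to the percentages quoted in the abstract.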
Subject(s)
Metagenomics, Microbiota, Sequence Alignment/methods, Algorithms, Calibration, Cluster Analysis, Computational Biology/methods, Ribosomal DNA/genetics, Genetic Linkage, Genetic Markers, Genome, Humans, Intestines/microbiology, Phylogeny, 16S Ribosomal RNA/genetics, DNA Sequence Analysis/methods
ABSTRACT
SUMMARY: New sequence data useful for phylogenetic and evolutionary analyses continues to be added to public databases. The construction of multiple sequence alignments and inference of huge phylogenies comprising large taxonomic groups are expensive tasks, both in terms of man hours and computational resources. Therefore, maintaining comprehensive phylogenies, based on representative and up-to-date molecular sequences, is challenging. PUmPER is a framework that can perpetually construct multi-gene alignments (with PHLAWD) and phylogenetic trees (with ExaML or RAxML-Light) for a given NCBI taxonomic group. When sufficient numbers of new gene sequences for the selected taxonomic group have accumulated in GenBank, PUmPER automatically extends the alignment and infers extended phylogenetic trees by using previously inferred smaller trees as starting topologies. Using our framework, large phylogenetic trees can be perpetually updated without human intervention. Importantly, resulting phylogenies are not statistically significantly worse than trees inferred from scratch. AVAILABILITY AND IMPLEMENTATION: PUmPER can run in stand-alone mode on a single server, or offload the computationally expensive phylogenetic searches to a parallel computing cluster. Source code, documentation, and tutorials are available at https://github.com/fizquierdo/perpetually-updated-trees. CONTACT: Fernando.Izquierdo@h-its.org SUPPLEMENTARY INFORMATION: Supplementary Material is available at Bioinformatics online.
Subject(s)
Phylogeny, Sequence Alignment/methods, Genetic Databases, Embryophyta/genetics, Software
ABSTRACT
Verification in phylogenetics is an extremely difficult subject. Phylogenetic analysis deals with the reconstruction of evolutionary histories of species, and as long as mankind is not able to travel in time, it will not be possible to verify deep evolutionary histories reconstructed with modern computational methods. Here, we focus on two more tangible issues related to verification in phylogenetics: (i) the inference of support values on trees, which provide some notion of the 'correctness' of the tree within narrow limits; and, more importantly, (ii) issues pertaining to program verification, especially with respect to codes that rely heavily on floating-point arithmetic. Program verification represents a largely underestimated problem in computational science that can have fatal effects on scientific conclusions.
Subject(s)
Computer Simulation, Phylogeny, Genetic Databases, Molecular Evolution
ABSTRACT
BACKGROUND: The rapid accumulation of molecular sequence data, driven by novel wet-lab sequencing technologies, poses new challenges for large-scale maximum likelihood-based phylogenetic analyses on trees with more than 30,000 taxa and several genes. The three main computational challenges are: numerical stability, the scalability of search algorithms, and the high memory requirements for computing the likelihood. RESULTS: We introduce methods for solving these three key problems and provide respective proof-of-concept implementations in RAxML. The mechanisms presented here are not RAxML-specific and can thus be applied to any likelihood-based (Bayesian or maximum likelihood) tree inference program. We develop a new search strategy that can reduce the time required for tree inferences by more than 50% while yielding equally good trees (in the statistical sense) for well-chosen starting trees. We present an adaptation of the Subtree Equality Vector technique for phylogenomic datasets with missing data (already available in RAxML v728) that can reduce execution times and memory requirements by up to 50%. Finally, we discuss issues pertaining to the numerical stability of the Γ model of rate heterogeneity on very large trees and argue in favor of rate heterogeneity models that use a single rate or rate category for each site to resolve these problems. CONCLUSIONS: We address three major issues pertaining to large scale tree reconstruction under maximum likelihood and propose respective solutions. Respective proof-of-concept/production-level implementations of our ideas are made available as open-source code.
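The numerical stability problem mentioned above can be illustrated in generic terms: multiplying many small probabilities underflows double precision, whereas accumulating their logarithms does not. The following is a minimal sketch of that general floating-point issue only; it is not RAxML's scaling scheme, and the factor 1e-3 and the count of 30,000 are made-up values chosen for illustration.

```python
import math


def naive_product(factors):
    """Multiply the factors directly; prone to underflow for long products."""
    result = 1.0
    for f in factors:
        result *= f
    return result


def log_sum(factors):
    """Accumulate the logarithm of the product instead of the product itself."""
    return sum(math.log(f) for f in factors)


# Hypothetical setting: one small factor per taxon on a 30,000-taxon tree
factors = [1e-3] * 30000
naive = naive_product(factors)   # the true value, 1e-90000, is not representable
log_value = log_sum(factors)     # remains an ordinary finite float
```

Here the direct product collapses to zero, so any subsequent logarithm or comparison of likelihoods would fail, while the log-space value stays finite and comparable.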
Subject(s)
Algorithms, Likelihood Functions, Phylogeny, Bayes Theorem, Genetic Models, Molecular Sequence Data, Probability
ABSTRACT
Insects are the most speciose group of animals, but the phylogenetic relationships of many major lineages remain unresolved. We inferred the phylogeny of insects from 1478 protein-coding genes. Phylogenomic analyses of nucleotide and amino acid sequences, with site-specific nucleotide or domain-specific amino acid substitution models, produced statistically robust and congruent results resolving previously controversial phylogenetic relationships. We dated the origin of insects to the Early Ordovician [~479 million years ago (Ma)], of insect flight to the Early Devonian (~406 Ma), of major extant lineages to the Mississippian (~345 Ma), and the major diversification of holometabolous insects to the Early Cretaceous. Our phylogenomic study provides a comprehensive, reliable scaffold for future comparative analyses of evolutionary innovations among insects.