Phylogeny-aware identification and correction of taxonomically mislabeled sequences.

Kozlov, Alexey M; Zhang, Jiajie; Yilmaz, Pelin; Glöckner, Frank Oliver; Stamatakis, Alexandros

Affiliation

Kozlov AM; The Exelixis Lab, Scientific Computing Group, Heidelberg Institute for Theoretical Studies, Schloss-Wolfsbrunnenweg 35, 69118 Heidelberg, Germany; Alexey.Kozlov@h-its.org.
Zhang J; The Exelixis Lab, Scientific Computing Group, Heidelberg Institute for Theoretical Studies, Schloss-Wolfsbrunnenweg 35, 69118 Heidelberg, Germany.
Yilmaz P; Microbial Genomics and Bioinformatics Research Group, Max Planck Institute for Marine Microbiology, 28359 Bremen, Germany.
Glöckner FO; Microbial Genomics and Bioinformatics Research Group, Max Planck Institute for Marine Microbiology, 28359 Bremen, Germany; Jacobs University Bremen gGmbH, Campus Ring 1, 28759 Bremen, Germany.
Stamatakis A; The Exelixis Lab, Scientific Computing Group, Heidelberg Institute for Theoretical Studies, Schloss-Wolfsbrunnenweg 35, 69118 Heidelberg, Germany; Karlsruhe Institute of Technology, Institute for Theoretical Informatics, Postfach 6980, 76128 Karlsruhe, Germany.

Nucleic Acids Res; 44(11): 5022-33, 2016 06 20.

Article in En | MEDLINE | ID: mdl-27166378

ABSTRACT

Molecular sequences in public databases are mostly annotated by the submitting authors without further validation. This procedure can generate erroneous taxonomic sequence labels. Mislabeled sequences are hard to identify, and they can induce downstream errors because new sequences are typically annotated using existing ones. Furthermore, taxonomic mislabelings in reference sequence databases can bias metagenetic studies which rely on the taxonomy. Despite significant efforts to improve the quality of taxonomic annotations, the curation rate is low because of the labor-intensive manual curation process. Here, we present SATIVA, a phylogeny-aware method to automatically identify taxonomically mislabeled sequences ('mislabels') using statistical models of evolution. We use the Evolutionary Placement Algorithm (EPA) to detect and score sequences whose taxonomic annotation is not supported by the underlying phylogenetic signal, and automatically propose a corrected taxonomic classification for those. Using simulated data, we show that our method attains high accuracy for identification (96.9% sensitivity/91.7% precision) as well as correction (94.9% sensitivity/89.9% precision) of mislabels. Furthermore, an analysis of four widely used microbial 16S reference databases (Greengenes, LTP, RDP and SILVA) indicates that they currently contain between 0.2% and 2.5% mislabels. Finally, we use SATIVA to perform an in-depth evaluation of alternative taxonomies for Cyanobacteria. SATIVA is freely available at https://github.com/amkozlov/sativa.
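The sensitivity and precision figures quoted above follow the standard definitions for a detection task. The sketch below is illustrative only and is not part of SATIVA: the function name `identification_metrics` and the sequence IDs are hypothetical, and the toy counts are not taken from the paper. It assumes sensitivity is the fraction of true mislabels that were flagged, and precision is the fraction of flagged sequences that are true mislabels.

```python
def identification_metrics(true_mislabels, flagged):
    """Return (sensitivity, precision) for a set of flagged sequence IDs.

    true_mislabels: IDs known to be mislabeled (e.g. in a simulation).
    flagged: IDs a detection method reported as mislabeled.
    """
    true_mislabels = set(true_mislabels)
    flagged = set(flagged)

    tp = len(true_mislabels & flagged)   # mislabels correctly flagged
    fn = len(true_mislabels - flagged)   # mislabels that were missed
    fp = len(flagged - true_mislabels)   # correct labels wrongly flagged

    # Guard against empty inputs to avoid division by zero.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return sensitivity, precision


# Toy example with made-up sequence IDs (not data from the study).
truth = {"seq1", "seq2", "seq3", "seq4"}
reported = {"seq1", "seq2", "seq3", "seq9"}
sens, prec = identification_metrics(truth, reported)
print(f"sensitivity={sens:.2f} precision={prec:.2f}")
```

In a simulation setting such as the one the abstract describes, the set of true mislabels is known by construction, which is what makes both quantities computable.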

Subjects

Computational Biology/methods; DNA Barcoding, Taxonomic/standards; Genomics/methods; Molecular Sequence Annotation/standards; Phylogeny; Bacteria/genetics; Databases, Nucleic Acid; RNA, Ribosomal, 16S; Reproducibility of Results; Sequence Analysis, DNA; Software; Web Browser

Full text: 1; Database: MEDLINE; Main subject: Phylogeny / Computational Biology / Genomics / DNA Barcoding, Taxonomic / Molecular Sequence Annotation; Type of study: Diagnostic_studies / Prognostic_studies / Risk_factors_studies; Language: En; Publication year: 2016; Document type: Article