Kmacs: the k-mismatch average common substring approach to alignment-free sequence comparison.

Leimeister, Chris-Andre; Morgenstern, Burkhard


Affiliation

Leimeister CA; Department of Bioinformatics, Institute of Microbiology and Genetics, University of Göttingen, Goldschmidtstr. 1, 37073 Göttingen, Germany and Laboratoire Statistique et Génome, Université d'Évry Val d'Essonne, UMR CNRS 8071, USC INRA, 23 Boulevard de France, 91037 Évry, France.
Morgenstern B; Department of Bioinformatics, Institute of Microbiology and Genetics, University of Göttingen, Goldschmidtstr. 1, 37073 Göttingen, Germany and Laboratoire Statistique et Génome, Université d'Évry Val d'Essonne, UMR CNRS 8071, USC INRA, 23 Boulevard de France, 91037 Évry, France.

Bioinformatics; 30(14): 2000-8, 2014 Jul 15.

Article in English | MEDLINE | ID: mdl-24828656

ABSTRACT

MOTIVATION: Alignment-based methods for sequence analysis have various limitations if large datasets are to be analysed. Therefore, alignment-free approaches have become popular in recent years. One of the best known alignment-free methods is the average common substring approach that defines a distance measure on sequences based on the average length of longest common words between them. Herein, we generalize this approach by considering longest common substrings with k mismatches. We present a greedy heuristic to approximate the length of such k-mismatch substrings, and we describe kmacs, an efficient implementation of this idea based on generalized enhanced suffix arrays. RESULTS: To evaluate the performance of our approach, we applied it to phylogeny reconstruction using a large number of DNA and protein sequence sets. In most cases, phylogenetic trees calculated with kmacs were more accurate than trees produced with established alignment-free methods that are based on exact word matches. Especially on protein sequences, our method seems to be superior. On simulated protein families, kmacs even outperformed a classical approach to phylogeny reconstruction using multiple alignment and maximum likelihood. AVAILABILITY AND IMPLEMENTATION: kmacs is implemented in C++, and the source code is freely available at http://kmacs.gobics.de/.
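The greedy heuristic described in the abstract can be illustrated with a minimal sketch. It is not the kmacs implementation: kmacs is written in C++ and uses generalized enhanced suffix arrays, whereas this sketch uses a naive quadratic scan, and all function names are hypothetical. It assumes the heuristic works as follows: for each position in the first sequence, find the longest exact match in the second sequence, then extend that match without gaps until the (k+1)-th mismatch would occur. It also assumes that a position with no exact match contributes length 0. Only the average length is computed; the conversion to a distance measure is omitted.

```python
def longest_exact_match(s1, i, s2):
    """Return (length, j) of the longest exact match between s1[i:] and any s2[j:].

    Naive scan over all start positions j in s2; returns (0, -1) if nothing matches.
    """
    best_len, best_j = 0, -1
    for j in range(len(s2)):
        n = 0
        while i + n < len(s1) and j + n < len(s2) and s1[i + n] == s2[j + n]:
            n += 1
        if n > best_len:
            best_len, best_j = n, j
    return best_len, best_j


def k_mismatch_length(s1, i, s2, k):
    """Greedy approximation of the longest k-mismatch common substring at s1[i:].

    Starts from the longest exact match and extends it without gaps,
    tolerating up to k mismatches (assumed behaviour, see lead-in).
    """
    length, j = longest_exact_match(s1, i, s2)
    if j < 0:
        return 0  # assumption: no exact seed means no contribution
    mismatches = 0
    while i + length < len(s1) and j + length < len(s2):
        if s1[i + length] != s2[j + length]:
            if mismatches == k:
                break  # the next mismatch would be number k + 1
            mismatches += 1
        length += 1
    return length


def average_k_mismatch_length(s1, s2, k):
    """Average of the greedy k-mismatch lengths over all positions of s1."""
    if not s1:
        return 0.0
    total = sum(k_mismatch_length(s1, i, s2, k) for i in range(len(s1)))
    return total / len(s1)


if __name__ == "__main__":
    print(k_mismatch_length("ACGTAC", 0, "ACGAAC", 0))
    print(k_mismatch_length("ACGTAC", 0, "ACGAAC", 1))
```

Because the extension always starts from the longest exact match, the result is an approximation: a shorter exact seed elsewhere in the second sequence could yield a longer k-mismatch substring, which is why the abstract calls the method a heuristic.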

Subjects

Phylogeny; Sequence Analysis, DNA/methods; Sequence Analysis, Protein/methods; Algorithms; Animals; Genome, Bacterial; Genome, Mitochondrial; Primates; Roseobacter/genetics; Sequence Alignment

Full text: 1 Database: MEDLINE Main subject: Phylogeny / Sequence Analysis, DNA / Sequence Analysis, Protein Limits: Animals Language: En Publication year: 2014 Document type: Article
