Spaced words and kmacs: fast alignment-free sequence comparison based on inexact word matches.

Horwege, Sebastian; Lindner, Sebastian; Boden, Marcus; Hatje, Klas; Kollmar, Martin; Leimeister, Chris-André; Morgenstern, Burkhard

Affiliation

Horwege S; University of Göttingen, Institute of Microbiology and Genetics, Department of Bioinformatics, Goldschmidtstraße 1, 37073 Göttingen, Germany. Sebastian.Horwege@stud.uni-goettingen.de
Lindner S; University of Göttingen, Institute of Microbiology and Genetics, Department of Bioinformatics, Goldschmidtstraße 1, 37073 Göttingen, Germany.
Boden M; University of Göttingen, Institute of Microbiology and Genetics, Department of Bioinformatics, Goldschmidtstraße 1, 37073 Göttingen, Germany.
Hatje K; Max-Planck-Institute for Biophysical Chemistry, Department of NMR-based Structural Biology, Group Systems Biology of Motor Proteins, Am Fassberg 11, 37077 Göttingen, Germany.
Kollmar M; Max-Planck-Institute for Biophysical Chemistry, Department of NMR-based Structural Biology, Group Systems Biology of Motor Proteins, Am Fassberg 11, 37077 Göttingen, Germany.
Leimeister CA; University of Göttingen, Institute of Microbiology and Genetics, Department of Bioinformatics, Goldschmidtstraße 1, 37073 Göttingen, Germany.
Morgenstern B; University of Göttingen, Institute of Microbiology and Genetics, Department of Bioinformatics, Goldschmidtstraße 1, 37073 Göttingen, Germany; Université d'Évry Val d'Essonne, Laboratoire Statistique et Génome, UMR CNRS 8071, USC INRA, 23 Boulevard de France, 91037 Évry, France.

Nucleic Acids Res; 42(Web Server issue): W7-11, 2014 Jul.

Article in English | MEDLINE | ID: mdl-24829447

ABSTRACT

In this article, we present a user-friendly web interface for two alignment-free sequence-comparison methods that we recently developed. Most alignment-free methods rely on exact word matches to estimate pairwise similarities or distances between the input sequences. By contrast, our new algorithms are based on inexact word matches. The first of these approaches uses the relative frequencies of so-called spaced words in the input sequences, i.e. words containing 'don't care' or 'wildcard' symbols at certain pre-defined positions. Various distance measures can then be defined on sequences based on their different spaced-word composition. Our second approach defines the distance between two sequences by estimating, for each position in the first sequence, the length of the longest substring at this position that also occurs in the second sequence with up to k mismatches. Both approaches take a set of deoxyribonucleic acid (DNA) or protein sequences as input and return a matrix of pairwise distance values that can be used as a starting point for clustering algorithms or distance-based phylogeny reconstruction. The two alignment-free programmes are accessible through a web interface at the 'Göttingen Bioinformatics Compute Server (GOBICS)': http://spaced.gobics.de and http://kmacs.gobics.de, and the source code can be downloaded.
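
The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the '1'/'0' pattern encoding (1 = match position, 0 = don't-care position) and the choice of Euclidean distance are assumptions made for illustration, since the abstract only says that "various distance measures" can be used. The second part finds the longest k-mismatch substring by brute force; it computes only the average length and leaves out the conversion into a distance, which the abstract does not specify.

```python
from collections import Counter
from math import sqrt


def spaced_word_freqs(seq, pattern):
    """Relative frequencies of spaced words in seq.

    pattern is a string of '1' (match position) and '0' (don't-care
    position); this encoding is an assumption made for the sketch.
    """
    keep = [i for i, c in enumerate(pattern) if c == "1"]
    n_windows = len(seq) - len(pattern) + 1
    # Each window contributes the characters at the match positions only.
    counts = Counter(
        "".join(seq[start + i] for i in keep) for start in range(n_windows)
    )
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def euclidean_distance(f, g):
    """Euclidean distance between two frequency dictionaries (one possible measure)."""
    words = set(f) | set(g)
    return sqrt(sum((f.get(w, 0.0) - g.get(w, 0.0)) ** 2 for w in words))


def distance_matrix(seqs, pattern):
    """Pairwise spaced-word distances for a list of sequences."""
    freqs = [spaced_word_freqs(s, pattern) for s in seqs]
    return [[euclidean_distance(a, b) for b in freqs] for a in freqs]


def longest_k_mismatch(s1, i, s2, k):
    """Length of the longest substring of s1 starting at position i that
    occurs somewhere in s2 with at most k mismatches (brute force)."""
    best = 0
    for j in range(len(s2)):
        mismatches = 0
        length = 0
        while i + length < len(s1) and j + length < len(s2):
            if s1[i + length] != s2[j + length]:
                mismatches += 1
                if mismatches > k:
                    break
            length += 1
        best = max(best, length)
    return best


def average_k_mismatch_length(s1, s2, k):
    """Average of longest_k_mismatch over all positions of s1."""
    return sum(longest_k_mismatch(s1, i, s2, k) for i in range(len(s1))) / len(s1)
```

For example, `distance_matrix(["ACGTACGT", "ACGTACGT", "TTTTTTTT"], "1101")` gives a symmetric matrix with zero distance between the two identical sequences. The brute-force search takes quadratic time per position and is meant only to make the definition concrete.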

Subjects

Phylogeny; Sequence Analysis, DNA/methods; Sequence Analysis, Protein/methods; Software; Algorithms; Internet; Sequence Alignment; User-Computer Interface

Full text: 1 | Database: MEDLINE | Main subject: Phylogeny / Software / Sequence Analysis, DNA / Sequence Analysis, Protein | Language: English | Year of publication: 2014 | Document type: Article
