Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 12 de 12
Filtrar
1.
Bioinformatics ; 35(2): 211-218, 2019 01 15.
Artigo em Inglês | MEDLINE | ID: mdl-29992260

RESUMO

Motivation: Most methods for pairwise and multiple genome alignment use fast local homology search tools to identify anchor points, i.e. high-scoring local alignments of the input sequences. Sequence segments between those anchor points are then aligned with slower, more sensitive methods. Finding suitable anchor points is therefore crucial for genome sequence comparison; speed and sensitivity of genome alignment depend on the underlying anchoring methods. Results: In this article, we use filtered spaced word matches to generate anchor points for genome alignment. For a given binary pattern representing match and don't-care positions, we first search for spaced-word matches, i.e. ungapped local pairwise alignments with matching nucleotides at the match positions of the pattern and possible mismatches at the don't-care positions. Those spaced-word matches that have similarity scores above some threshold value are then extended using a standard X-drop algorithm; the resulting local alignments are used as anchor points. To evaluate this approach, we used the popular multiple-genome-alignment pipeline Mugsy and replaced the exact word matches that Mugsy uses as anchor points with our spaced-word-based anchor points. For closely related genome sequences, the two anchoring procedures lead to multiple alignments of similar quality. For distantly related genomes, however, alignments calculated with our filtered-spaced-word matches are superior to alignments produced with the original Mugsy program where exact word matches are used to find anchor points. Availability and implementation: http://spacedanchor.gobics.de. Supplementary information: Supplementary data are available at Bioinformatics online.


Assuntos
Algoritmos , Genoma , Alinhamento de Sequência/métodos , Software , Biologia Computacional
2.
BMC Bioinformatics ; 20(Suppl 20): 638, 2019 Dec 17.
Artigo em Inglês | MEDLINE | ID: mdl-31842735

RESUMO

BACKGROUND: In many fields of biomedical research, it is important to estimate phylogenetic distances between taxa based on low-coverage sequencing reads. Major applications are, for example, phylogeny reconstruction, species identification from small sequencing samples, or bacterial strain typing in medical diagnostics. RESULTS: We adapted our previously developed software program Filtered Spaced-Word Matches (FSWM) for alignment-free phylogeny reconstruction to take unassembled reads as input; we call this implementation Read-SpaM. CONCLUSIONS: Test runs on simulated reads from semi-artificial and real-world bacterial genomes show that our approach can estimate phylogenetic distances with high accuracy, even for large evolutionary distances and for very low sequencing coverage.


Assuntos
Genoma Bacteriano , Alinhamento de Sequência , Análise de Sequência de DNA/métodos , Software , Algoritmos , Sequência de Bases , Escherichia coli/genética , Filogenia
3.
Bioinformatics ; 33(7): 971-979, 2017 04 01.
Artigo em Inglês | MEDLINE | ID: mdl-28073754

RESUMO

Motivation: Word-based or 'alignment-free' algorithms are increasingly used for phylogeny reconstruction and genome comparison, since they are much faster than traditional approaches that are based on full sequence alignments. Existing alignment-free programs, however, are less accurate than alignment-based methods. Results: We propose Filtered Spaced Word Matches (FSWM) , a fast alignment-free approach to estimate phylogenetic distances between large genomic sequences. For a pre-defined binary pattern of match and don't-care positions, FSWM rapidly identifies spaced word-matches between input sequences, i.e. gap-free local alignments with matching nucleotides at the match positions and with mismatches allowed at the don't-care positions. We then estimate the number of nucleotide substitutions per site by considering the nucleotides aligned at the don't-care positions of the identified spaced-word matches. To reduce the noise from spurious random matches, we use a filtering procedure where we discard all spaced-word matches for which the overall similarity between the aligned segments is below a threshold. We show that our approach can accurately estimate substitution frequencies even for distantly related sequences that cannot be analyzed with existing alignment-free methods; phylogenetic trees constructed with FSWM distances are of high quality. A program run on a pair of eukaryotic genomes of a few hundred Mb each takes a few minutes. Availability and Implementation: The program source code for FSWM including a documentation, as well as the software that we used to generate artificial genome sequences are freely available at http://fswm.gobics.de/. Contact: chris.leimeister@stud.uni-goettingen.de. Supplementary information: Supplementary data are available at Bioinformatics online.


Assuntos
Algoritmos , Filogenia , Sequência de Bases , Simulação por Computador , Genoma Bacteriano , Genoma de Planta , Genômica/métodos , Alinhamento de Sequência , Análise de Sequência de DNA , Homologia de Sequência do Ácido Nucleico , Software , Fatores de Tempo
4.
PLoS Comput Biol ; 12(10): e1005107, 2016 Oct.
Artigo em Inglês | MEDLINE | ID: mdl-27760124

RESUMO

Many algorithms for sequence analysis rely on word matching or word statistics. Often, these approaches can be improved if binary patterns representing match and don't-care positions are used as a filter, such that only those positions of words are considered that correspond to the match positions of the patterns. The performance of these approaches, however, depends on the underlying patterns. Herein, we show that the overlap complexity of a pattern set that was introduced by Ilie and Ilie is closely related to the variance of the number of matches between two evolutionarily related sequences with respect to this pattern set. We propose a modified hill-climbing algorithm to optimize pattern sets for database searching, read mapping and alignment-free sequence comparison of nucleic-acid sequences; our implementation of this algorithm is called rasbhari. Depending on the application at hand, rasbhari can either minimize the overlap complexity of pattern sets, maximize their sensitivity in database searching or minimize the variance of the number of pattern-based matches in alignment-free sequence comparison. We show that, for database searching, rasbhari generates pattern sets with slightly higher sensitivity than existing approaches. In our Spaced Words approach to alignment-free sequence comparison, pattern sets calculated with rasbhari led to more accurate estimates of phylogenetic distances than the randomly generated pattern sets that we previously used. Finally, we used rasbhari to generate patterns for short read classification with CLARK-S. Here too, the sensitivity of the results could be improved, compared to the default patterns of the program. We integrated rasbhari into Spaced Words; the source code of rasbhari is freely available at http://rasbhari.gobics.de/.


Assuntos
Algoritmos , DNA/genética , Sistemas de Gerenciamento de Base de Dados , Bases de Dados Genéticas , Análise de Sequência de DNA/métodos , Software , DNA/química , Análise Mutacional de DNA/métodos , Mineração de Dados/métodos , Aprendizado de Máquina , Reconhecimento Automatizado de Padrão/métodos , Alinhamento de Sequência/métodos
5.
Nucleic Acids Res ; 42(Web Server issue): W7-11, 2014 Jul.
Artigo em Inglês | MEDLINE | ID: mdl-24829447

RESUMO

In this article, we present a user-friendly web interface for two alignment-free sequence-comparison methods that we recently developed. Most alignment-free methods rely on exact word matches to estimate pairwise similarities or distances between the input sequences. By contrast, our new algorithms are based on inexact word matches. The first of these approaches uses the relative frequencies of so-called spaced words in the input sequences, i.e. words containing 'don't care' or 'wildcard' symbols at certain pre-defined positions. Various distance measures can then be defined on sequences based on their different spaced-word composition. Our second approach defines the distance between two sequences by estimating for each position in the first sequence the length of the longest substring at this position that also occurs in the second sequence with up to k mismatches. Both approaches take a set of deoxyribonucleic acid (DNA) or protein sequences as input and return a matrix of pairwise distance values that can be used as a starting point for clustering algorithms or distance-based phylogeny reconstruction. The two alignment-free programmes are accessible through a web interface at 'Göttingen Bioinformatics Compute Server (GOBICS)': http://spaced.gobics.de http://kmacs.gobics.de and the source codes can be downloaded.


Assuntos
Filogenia , Análise de Sequência de DNA/métodos , Análise de Sequência de Proteína/métodos , Software , Algoritmos , Internet , Alinhamento de Sequência , Interface Usuário-Computador
6.
Bioinformatics ; 30(14): 2000-8, 2014 Jul 15.
Artigo em Inglês | MEDLINE | ID: mdl-24828656

RESUMO

MOTIVATION: Alignment-based methods for sequence analysis have various limitations if large datasets are to be analysed. Therefore, alignment-free approaches have become popular in recent years. One of the best known alignment-free methods is the average common substring approach that defines a distance measure on sequences based on the average length of longest common words between them. Herein, we generalize this approach by considering longest common substrings with k mismatches. We present a greedy heuristic to approximate the length of such k-mismatch substrings, and we describe kmacs, an efficient implementation of this idea based on generalized enhanced suffix arrays. RESULTS: To evaluate the performance of our approach, we applied it to phylogeny reconstruction using a large number of DNA and protein sequence sets. In most cases, phylogenetic trees calculated with kmacs were more accurate than trees produced with established alignment-free methods that are based on exact word matches. Especially on protein sequences, our method seems to be superior. On simulated protein families, kmacs even outperformed a classical approach to phylogeny reconstruction using multiple alignment and maximum likelihood. AVAILABILITY AND IMPLEMENTATION: kmacs is implemented in C++, and the source code is freely available at http://kmacs.gobics.de/.


Assuntos
Filogenia , Análise de Sequência de DNA/métodos , Análise de Sequência de Proteína/métodos , Algoritmos , Animais , Genoma Bacteriano , Genoma Mitocondrial , Primatas , Roseobacter/genética , Alinhamento de Sequência
7.
Bioinformatics ; 30(14): 1991-9, 2014 Jul 15.
Artigo em Inglês | MEDLINE | ID: mdl-24700317

RESUMO

MOTIVATION: Alignment-free methods for sequence comparison are increasingly used for genome analysis and phylogeny reconstruction; they circumvent various difficulties of traditional alignment-based approaches. In particular, alignment-free methods are much faster than pairwise or multiple alignments. They are, however, less accurate than methods based on sequence alignment. Most alignment-free approaches work by comparing the word composition of sequences. A well-known problem with these methods is that neighbouring word matches are far from independent. RESULTS: To reduce the statistical dependency between adjacent word matches, we propose to use 'spaced words', defined by patterns of 'match' and 'don't care' positions, for alignment-free sequence comparison. We describe a fast implementation of this approach using recursive hashing and bit operations, and we show that further improvements can be achieved by using multiple patterns instead of single patterns. To evaluate our approach, we use spaced-word frequencies as a basis for fast phylogeny reconstruction. Using real-world and simulated sequence data, we demonstrate that our multiple-pattern approach produces better phylogenies than approaches relying on contiguous words. AVAILABILITY AND IMPLEMENTATION: Our program is freely available at http://spaced.gobics.de/.


Assuntos
Filogenia , Análise de Sequência de DNA/métodos , Análise de Sequência de Proteína/métodos , Algoritmos , Animais , Genoma Mitocondrial , Genoma de Planta , Genômica/métodos , Primatas , Alinhamento de Sequência
8.
NAR Genom Bioinform ; 2(1): lqz013, 2020 Mar.
Artigo em Inglês | MEDLINE | ID: mdl-33575565

RESUMO

Word-based or 'alignment-free' methods for phylogeny inference have become popular in recent years. These methods are much faster than traditional, alignment-based approaches, but they are generally less accurate. Most alignment-free methods calculate 'pairwise' distances between nucleic-acid or protein sequences; these distance values can then be used as input for tree-reconstruction programs such as neighbor-joining. In this paper, we propose the first word-based phylogeny approach that is based on 'multiple' sequence comparison and 'maximum likelihood'. Our algorithm first samples small, gap-free alignments involving four taxa each. For each of these alignments, it then calculates a quartet tree and, finally, the program 'Quartet MaxCut' is used to infer a super tree for the full set of input taxa from the calculated quartet trees. Experimental results show that trees produced with our approach are of high quality.

9.
Gigascience ; 8(3)2019 03 01.
Artigo em Inglês | MEDLINE | ID: mdl-30535314

RESUMO

Word-based or 'alignment-free' sequence comparison has become an active research area in bioinformatics. While previous word-frequency approaches calculated rough measures of sequence similarity or dissimilarity, some new alignment-free methods are able to accurately estimate phylogenetic distances between genomic sequences. One of these approaches is Filtered Spaced Word Matches. Here, we extend this approach to estimate evolutionary distances between complete or incomplete proteomes; our implementation of this approach is called Prot-SpaM. We compare the performance of Prot-SpaM to other alignment-free methods on simulated sequences and on various groups of eukaryotic and prokaryotic taxa. Prot-SpaM can be used to calculate high-quality phylogenetic trees for dozens of whole-proteome sequences in a matter of seconds or minutes and often outperforms other alignment-free approaches. The source code of our software is available through Github: https://github.com/jschellh/ProtSpaM.


Assuntos
Filogenia , Proteoma/química , Alinhamento de Sequência/métodos , Software , Sequência de Aminoácidos , Animais , Bactérias/classificação , Bases de Dados de Proteínas , Plantas/classificação
10.
Genome Biol ; 20(1): 144, 2019 07 25.
Artigo em Inglês | MEDLINE | ID: mdl-31345254

RESUMO

BACKGROUND: Alignment-free (AF) sequence comparison is attracting persistent interest driven by data-intensive applications. Hence, many AF procedures have been proposed in recent years, but a lack of a clearly defined benchmarking consensus hampers their performance assessment. RESULTS: Here, we present a community resource (http://afproject.org) to establish standards for comparing alignment-free approaches across different areas of sequence-based research. We characterize 74 AF methods available in 24 software tools for five research applications, namely, protein sequence classification, gene tree inference, regulatory element detection, genome-based phylogenetic inference, and reconstruction of species trees under horizontal gene transfer and recombination events. CONCLUSION: The interactive web service allows researchers to explore the performance of alignment-free tools relevant to their data types and analytical goals. It also allows method developers to assess their own algorithms and compare them with current state-of-the-art tools, accelerating the development of new, more accurate AF solutions.


Assuntos
Análise de Sequência , Benchmarking , Transferência Genética Horizontal , Internet , Filogenia , Sequências Reguladoras de Ácido Nucleico , Alinhamento de Sequência , Análise de Sequência de Proteína , Software
11.
Algorithms Mol Biol ; 12: 27, 2017.
Artigo em Inglês | MEDLINE | ID: mdl-29238399

RESUMO

BACKGROUND: Various approaches to alignment-free sequence comparison are based on the length of exact or inexact word matches between pairs of input sequences. Haubold et al. (J Comput Biol 16:1487-1500, 2009) showed how the average number of substitutions per position between two DNA sequences can be estimated based on the average length of exact common substrings. RESULTS: In this paper, we study the length distribution of k-mismatch common substrings between two sequences. We show that the number of substitutions per position can be accurately estimated from the position of a local maximum in the length distribution of their k-mismatch common substrings.

12.
Algorithms Mol Biol ; 10: 5, 2015.
Artigo em Inglês | MEDLINE | ID: mdl-25685176

RESUMO

Alignment-free methods are increasingly used to calculate evolutionary distances between DNA and protein sequences as a basis of phylogeny reconstruction. Most of these methods, however, use heuristic distance functions that are not based on any explicit model of molecular evolution. Herein, we propose a simple estimator d N of the evolutionary distance between two DNA sequences that is calculated from the number N of (spaced) word matches between them. We show that this distance function is more accurate than other distance measures that are used by alignment-free methods. In addition, we calculate the variance of the normalized number N of (spaced) word matches. We show that the variance of N is smaller for spaced words than for contiguous words, and that the variance is further reduced if our spaced-words approach is used with multiple patterns of 'match positions' and 'don't care positions'. Our software is available online and as downloadable source code at: http://spaced.gobics.de/.

SELEÇÃO DE REFERÊNCIAS
DETALHE DA PESQUISA