1.
PLoS One ; 15(2): e0228070, 2020.
Article in English | MEDLINE | ID: mdl-32040534

ABSTRACT

We study the number N_k of length-k word matches between pairs of evolutionarily related DNA sequences, as a function of k. We show that the Jukes-Cantor distance between two genome sequences, i.e. the number of substitutions per site that occurred since they evolved from their last common ancestor, can be estimated from the slope of a function F that depends on N_k and that is affine-linear within a certain range of k. Integers k_min and k_max can be calculated depending on the length of the input sequences, such that the slope of F in the relevant range can be estimated from the values F(k_min) and F(k_max). This approach can be generalized to so-called Spaced-word Matches (SpaM), where mismatches are allowed at positions specified by a user-defined binary pattern. Based on these theoretical results, we implemented a prototype software program for alignment-free sequence comparison called Slope-SpaM. Test runs on real and simulated sequence data show that Slope-SpaM can accurately estimate phylogenetic distances for distances up to around 0.5 substitutions per position. The statistical stability of our results is improved if spaced words are used instead of contiguous words. Unlike previous alignment-free methods that are based on the number of (spaced) word matches, Slope-SpaM produces accurate results, even if sequences share only local homologies.
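
As an illustration of the quantity this abstract starts from, the following minimal Python sketch counts the length-k word matches N_k between two sequences for several values of k. The function name `count_word_matches` and the toy sequences are assumptions made for this example. The function F and the bounds on k are defined in the paper and are not reproduced here, so this is not the Slope-SpaM estimator itself.

```python
from collections import Counter


def count_word_matches(seq_a: str, seq_b: str, k: int) -> int:
    """Return N_k: the number of position pairs (i, j) such that
    seq_a[i:i+k] == seq_b[j:j+k] (contiguous words only)."""
    # Count how often each length-k word occurs in each sequence.
    words_a = Counter(seq_a[i:i + k] for i in range(len(seq_a) - k + 1))
    words_b = Counter(seq_b[j:j + k] for j in range(len(seq_b) - k + 1))
    # A word seen n times in seq_a and m times in seq_b gives n * m matches.
    return sum(n * words_b[word] for word, n in words_a.items())


if __name__ == "__main__":
    # Hypothetical toy sequences that differ at a single position.
    a = "ACGTACGTTGCA"
    b = "ACGTTCGTTGCA"
    for k in range(2, 7):
        print(k, count_word_matches(a, b, k))
```

Counting through word frequencies avoids comparing every pair of positions directly, which matters once the inputs are genome-sized.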

2.
BMC Bioinformatics ; 20(Suppl 20): 638, 2019 Dec 17.
Article in English | MEDLINE | ID: mdl-31842735

ABSTRACT

BACKGROUND: In many fields of biomedical research, it is important to estimate phylogenetic distances between taxa based on low-coverage sequencing reads. Major applications are, for example, phylogeny reconstruction, species identification from small sequencing samples, or bacterial strain typing in medical diagnostics. RESULTS: We adapted our previously developed software program Filtered Spaced-Word Matches (FSWM) for alignment-free phylogeny reconstruction to take unassembled reads as input; we call this implementation Read-SpaM. CONCLUSIONS: Test runs on simulated reads from semi-artificial and real-world bacterial genomes show that our approach can estimate phylogenetic distances with high accuracy, even for large evolutionary distances and for very low sequencing coverage.


Subjects
Bacterial Genome , Sequence Alignment , DNA Sequence Analysis/methods , Software , Algorithms , Base Sequence , Escherichia coli/genetics , Phylogeny
3.
Genome Biol ; 20(1): 144, 2019 07 25.
Article in English | MEDLINE | ID: mdl-31345254

ABSTRACT

BACKGROUND: Alignment-free (AF) sequence comparison is attracting persistent interest driven by data-intensive applications. Hence, many AF procedures have been proposed in recent years, but a lack of a clearly defined benchmarking consensus hampers their performance assessment. RESULTS: Here, we present a community resource (http://afproject.org) to establish standards for comparing alignment-free approaches across different areas of sequence-based research. We characterize 74 AF methods available in 24 software tools for five research applications, namely, protein sequence classification, gene tree inference, regulatory element detection, genome-based phylogenetic inference, and reconstruction of species trees under horizontal gene transfer and recombination events. CONCLUSION: The interactive web service allows researchers to explore the performance of alignment-free tools relevant to their data types and analytical goals. It also allows method developers to assess their own algorithms and compare them with current state-of-the-art tools, accelerating the development of new, more accurate AF solutions.


Subjects
Sequence Analysis , Benchmarking , Horizontal Gene Transfer , Internet , Phylogeny , Nucleic Acid Regulatory Sequences , Sequence Alignment , Protein Sequence Analysis , Software
4.
Bioinformatics ; 35(2): 211-218, 2019 01 15.
Article in English | MEDLINE | ID: mdl-29992260

ABSTRACT

Motivation: Most methods for pairwise and multiple genome alignment use fast local homology search tools to identify anchor points, i.e. high-scoring local alignments of the input sequences. Sequence segments between those anchor points are then aligned with slower, more sensitive methods. Finding suitable anchor points is therefore crucial for genome sequence comparison; speed and sensitivity of genome alignment depend on the underlying anchoring methods. Results: In this article, we use filtered spaced word matches to generate anchor points for genome alignment. For a given binary pattern representing match and don't-care positions, we first search for spaced-word matches, i.e. ungapped local pairwise alignments with matching nucleotides at the match positions of the pattern and possible mismatches at the don't-care positions. Those spaced-word matches that have similarity scores above some threshold value are then extended using a standard X-drop algorithm; the resulting local alignments are used as anchor points. To evaluate this approach, we used the popular multiple-genome-alignment pipeline Mugsy and replaced the exact word matches that Mugsy uses as anchor points with our spaced-word-based anchor points. For closely related genome sequences, the two anchoring procedures lead to multiple alignments of similar quality. For distantly related genomes, however, alignments calculated with our filtered-spaced-word matches are superior to alignments produced with the original Mugsy program where exact word matches are used to find anchor points. Availability and implementation: http://spacedanchor.gobics.de. Supplementary information: Supplementary data are available at Bioinformatics online.


Subjects
Algorithms , Genome , Sequence Alignment/methods , Software , Computational Biology
5.
Gigascience ; 8(3)2019 03 01.
Article in English | MEDLINE | ID: mdl-30535314

ABSTRACT

Word-based or 'alignment-free' sequence comparison has become an active research area in bioinformatics. While previous word-frequency approaches calculated rough measures of sequence similarity or dissimilarity, some new alignment-free methods are able to accurately estimate phylogenetic distances between genomic sequences. One of these approaches is Filtered Spaced Word Matches. Here, we extend this approach to estimate evolutionary distances between complete or incomplete proteomes; our implementation of this approach is called Prot-SpaM. We compare the performance of Prot-SpaM to other alignment-free methods on simulated sequences and on various groups of eukaryotic and prokaryotic taxa. Prot-SpaM can be used to calculate high-quality phylogenetic trees for dozens of whole-proteome sequences in a matter of seconds or minutes and often outperforms other alignment-free approaches. The source code of our software is available through Github: https://github.com/jschellh/ProtSpaM.


Subjects
Phylogeny , Proteome/chemistry , Sequence Alignment/methods , Software , Amino Acid Sequence , Animals , Bacteria/classification , Protein Databases , Plants/classification
6.
Algorithms Mol Biol ; 12: 27, 2017.
Article in English | MEDLINE | ID: mdl-29238399

ABSTRACT

Background: Various approaches to alignment-free sequence comparison are based on the length of exact or inexact word matches between pairs of input sequences. Haubold et al. (J Comput Biol 16:1487-1500, 2009) showed how the average number of substitutions per position between two DNA sequences can be estimated based on the average length of exact common substrings. Results: In this paper, we study the length distribution of k-mismatch common substrings between two sequences. We show that the number of substitutions per position can be accurately estimated from the position of a local maximum in the length distribution of their k-mismatch common substrings.

7.
Bioinformatics ; 33(7): 971-979, 2017 04 01.
Article in English | MEDLINE | ID: mdl-28073754

ABSTRACT

Motivation: Word-based or 'alignment-free' algorithms are increasingly used for phylogeny reconstruction and genome comparison, since they are much faster than traditional approaches that are based on full sequence alignments. Existing alignment-free programs, however, are less accurate than alignment-based methods. Results: We propose Filtered Spaced Word Matches (FSWM), a fast alignment-free approach to estimate phylogenetic distances between large genomic sequences. For a pre-defined binary pattern of match and don't-care positions, FSWM rapidly identifies spaced-word matches between input sequences, i.e. gap-free local alignments with matching nucleotides at the match positions and with mismatches allowed at the don't-care positions. We then estimate the number of nucleotide substitutions per site by considering the nucleotides aligned at the don't-care positions of the identified spaced-word matches. To reduce the noise from spurious random matches, we use a filtering procedure where we discard all spaced-word matches for which the overall similarity between the aligned segments is below a threshold. We show that our approach can accurately estimate substitution frequencies even for distantly related sequences that cannot be analyzed with existing alignment-free methods; phylogenetic trees constructed with FSWM distances are of high quality. A program run on a pair of eukaryotic genomes of a few hundred Mb each takes a few minutes. Availability and Implementation: The program source code for FSWM, including documentation, as well as the software that we used to generate artificial genome sequences, are freely available at http://fswm.gobics.de/. Contact: chris.leimeister@stud.uni-goettingen.de. Supplementary information: Supplementary data are available at Bioinformatics online.
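
The steps described in this abstract (find spaced-word matches for a binary pattern, filter them, then look at the don't-care positions) can be sketched in Python as below. This is a simplified illustration and not the FSWM program. The function names are hypothetical, the filter is a plain identity fraction standing in for the similarity threshold the abstract mentions, and converting the mismatch fraction with the Jukes-Cantor formula d = -3/4 ln(1 - 4p/3) is an assumption of this example.

```python
import math
from collections import defaultdict


def spaced_word_matches(seq_a, seq_b, pattern):
    """Yield (i, j) such that seq_a and seq_b agree at every match ('1')
    position of the pattern when its windows start at i and j."""
    match_pos = [p for p, c in enumerate(pattern) if c == "1"]
    span = len(pattern)
    # Index seq_b windows by the characters at the match positions.
    index = defaultdict(list)
    for j in range(len(seq_b) - span + 1):
        index["".join(seq_b[j + p] for p in match_pos)].append(j)
    for i in range(len(seq_a) - span + 1):
        key = "".join(seq_a[i + p] for p in match_pos)
        for j in index.get(key, ()):
            yield i, j


def estimate_distance(seq_a, seq_b, pattern, min_identity=0.0):
    """Estimate substitutions per site from the don't-care ('0') positions.

    min_identity is an assumed stand-in filter: matches whose overall
    identity falls below it are discarded as likely random matches.
    Returns None when no don't-care position survives the filter.
    """
    dont_care = [p for p, c in enumerate(pattern) if c == "0"]
    span = len(pattern)
    mismatches = compared = 0
    for i, j in spaced_word_matches(seq_a, seq_b, pattern):
        same = sum(seq_a[i + p] == seq_b[j + p] for p in range(span))
        if same / span < min_identity:
            continue  # discard low-similarity match
        for p in dont_care:
            compared += 1
            mismatches += seq_a[i + p] != seq_b[j + p]
    if compared == 0:
        return None
    p_hat = mismatches / compared
    if p_hat >= 0.75:
        return math.inf  # Jukes-Cantor correction is undefined here
    return -0.75 * math.log(1.0 - 4.0 * p_hat / 3.0)
```

For example, `estimate_distance("ACGT", "ACTT", "1001")` finds one spaced-word match with one mismatch among two don't-care positions, so the mismatch fraction is 0.5 before correction.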


Subjects
Algorithms , Phylogeny , Base Sequence , Computer Simulation , Bacterial Genome , Plant Genome , Genomics/methods , Sequence Alignment , DNA Sequence Analysis , Nucleic Acid Sequence Homology , Software , Time Factors
8.
PLoS Comput Biol ; 12(10): e1005107, 2016 Oct.
Article in English | MEDLINE | ID: mdl-27760124

ABSTRACT

Many algorithms for sequence analysis rely on word matching or word statistics. Often, these approaches can be improved if binary patterns representing match and don't-care positions are used as a filter, such that only those positions of words are considered that correspond to the match positions of the patterns. The performance of these approaches, however, depends on the underlying patterns. Herein, we show that the overlap complexity of a pattern set that was introduced by Ilie and Ilie is closely related to the variance of the number of matches between two evolutionarily related sequences with respect to this pattern set. We propose a modified hill-climbing algorithm to optimize pattern sets for database searching, read mapping and alignment-free sequence comparison of nucleic-acid sequences; our implementation of this algorithm is called rasbhari. Depending on the application at hand, rasbhari can either minimize the overlap complexity of pattern sets, maximize their sensitivity in database searching or minimize the variance of the number of pattern-based matches in alignment-free sequence comparison. We show that, for database searching, rasbhari generates pattern sets with slightly higher sensitivity than existing approaches. In our Spaced Words approach to alignment-free sequence comparison, pattern sets calculated with rasbhari led to more accurate estimates of phylogenetic distances than the randomly generated pattern sets that we previously used. Finally, we used rasbhari to generate patterns for short read classification with CLARK-S. Here too, the sensitivity of the results could be improved, compared to the default patterns of the program. We integrated rasbhari into Spaced Words; the source code of rasbhari is freely available at http://rasbhari.gobics.de/.


Subjects
Algorithms , DNA/genetics , Database Management Systems , Genetic Databases , DNA Sequence Analysis/methods , Software , DNA/chemistry , DNA Mutational Analysis/methods , Data Mining/methods , Machine Learning , Automated Pattern Recognition/methods , Sequence Alignment/methods
9.
Nat Commun ; 6: 7822, 2015 Jul 28.
Article in English | MEDLINE | ID: mdl-26215380

ABSTRACT

Genetic screens are powerful tools to identify the genes required for a given biological process. However, for technical reasons, comprehensive screens have been restricted to very few model organisms. Therefore, although deep sequencing is revealing the genes of ever more insect species, the functional studies predominantly focus on candidate genes previously identified in Drosophila, which is biasing research towards conserved gene functions. RNAi screens in other organisms promise to reduce this bias. Here we present the results of the iBeetle screen, a large-scale, unbiased RNAi screen in the red flour beetle, Tribolium castaneum, which identifies gene functions in embryonic and postembryonic development, physiology and cell biology. The utility of Tribolium as a screening platform is demonstrated by the identification of genes involved in insect epithelial adhesion. This work transcends the restrictions of the candidate gene approach and opens fields of research not accessible in Drosophila.


Subjects
Embryonic Development/genetics , Insect Proteins/genetics , Biological Metamorphosis/genetics , Oogenesis/genetics , RNA Interference , Tribolium/genetics , Animals , Beetles/embryology , Beetles/genetics , Beetles/physiology , High-Throughput Nucleotide Sequencing , Larva/genetics , Pupa/genetics , Tribolium/embryology , Tribolium/physiology
10.
Metabolomics ; 11(3): 764-777, 2015.
Article in English | MEDLINE | ID: mdl-25972773

ABSTRACT

A central aim in the evaluation of non-targeted metabolomics data is the detection of intensity patterns that differ between experimental conditions as well as the identification of the underlying metabolites and their association with metabolic pathways. In this context, the identification of metabolites based on non-targeted mass spectrometry data is a major bottleneck. In many applications, this identification needs to be guided by expert knowledge, and interactive tools for exploratory data analysis can significantly support this process. Additionally, the integration of data from other omics platforms, such as DNA microarray-based transcriptomics, can provide valuable hints and thereby facilitate the identification of metabolites via the reconstruction of related metabolic pathways. We here introduce the MarVis-Pathway tool, which allows the user to identify metabolites by annotation of pathways from cross-omics data. The analysis is supported by an extensive framework for pathway enrichment and meta-analysis. The tool allows the mapping of data set features by ID, name, and accurate mass, and can incorporate information from adduct and isotope correction of mass spectrometry data. MarVis-Pathway was integrated in the MarVis-Suite (http://marvis.gobics.de), which features seamless, highly interactive filtering, combination, clustering, and visualization of omics data sets. The functionality of the new software tool is illustrated using combined mass spectrometry and DNA microarray data. This application confirms jasmonate biosynthesis as an important metabolic pathway that is upregulated during the wound response of Arabidopsis plants.

11.
Algorithms Mol Biol ; 10: 5, 2015.
Article in English | MEDLINE | ID: mdl-25685176

ABSTRACT

Alignment-free methods are increasingly used to calculate evolutionary distances between DNA and protein sequences as a basis of phylogeny reconstruction. Most of these methods, however, use heuristic distance functions that are not based on any explicit model of molecular evolution. Herein, we propose a simple estimator d_N of the evolutionary distance between two DNA sequences that is calculated from the number N of (spaced) word matches between them. We show that this distance function is more accurate than other distance measures that are used by alignment-free methods. In addition, we calculate the variance of the normalized number N of (spaced) word matches. We show that the variance of N is smaller for spaced words than for contiguous words, and that the variance is further reduced if our spaced-words approach is used with multiple patterns of 'match positions' and 'don't care positions'. Our software is available online and as downloadable source code at: http://spaced.gobics.de/.

12.
Nucleic Acids Res ; 42(Web Server issue): W7-11, 2014 Jul.
Article in English | MEDLINE | ID: mdl-24829447

ABSTRACT

In this article, we present a user-friendly web interface for two alignment-free sequence-comparison methods that we recently developed. Most alignment-free methods rely on exact word matches to estimate pairwise similarities or distances between the input sequences. By contrast, our new algorithms are based on inexact word matches. The first of these approaches uses the relative frequencies of so-called spaced words in the input sequences, i.e. words containing 'don't care' or 'wildcard' symbols at certain pre-defined positions. Various distance measures can then be defined on sequences based on their different spaced-word composition. Our second approach defines the distance between two sequences by estimating for each position in the first sequence the length of the longest substring at this position that also occurs in the second sequence with up to k mismatches. Both approaches take a set of deoxyribonucleic acid (DNA) or protein sequences as input and return a matrix of pairwise distance values that can be used as a starting point for clustering algorithms or distance-based phylogeny reconstruction. The two alignment-free programmes are accessible through a web interface at 'Göttingen Bioinformatics Compute Server (GOBICS)': http://spaced.gobics.de and http://kmacs.gobics.de, and the source code can be downloaded.


Subjects
Phylogeny , DNA Sequence Analysis/methods , Protein Sequence Analysis/methods , Software , Algorithms , Internet , Sequence Alignment , User-Computer Interface
13.
Bioinformatics ; 30(14): 2000-8, 2014 Jul 15.
Article in English | MEDLINE | ID: mdl-24828656

ABSTRACT

MOTIVATION: Alignment-based methods for sequence analysis have various limitations if large datasets are to be analysed. Therefore, alignment-free approaches have become popular in recent years. One of the best known alignment-free methods is the average common substring approach that defines a distance measure on sequences based on the average length of longest common words between them. Herein, we generalize this approach by considering longest common substrings with k mismatches. We present a greedy heuristic to approximate the length of such k-mismatch substrings, and we describe kmacs, an efficient implementation of this idea based on generalized enhanced suffix arrays. RESULTS: To evaluate the performance of our approach, we applied it to phylogeny reconstruction using a large number of DNA and protein sequence sets. In most cases, phylogenetic trees calculated with kmacs were more accurate than trees produced with established alignment-free methods that are based on exact word matches. Especially on protein sequences, our method seems to be superior. On simulated protein families, kmacs even outperformed a classical approach to phylogeny reconstruction using multiple alignment and maximum likelihood. AVAILABILITY AND IMPLEMENTATION: kmacs is implemented in C++, and the source code is freely available at http://kmacs.gobics.de/.
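
The greedy idea named in this abstract can be illustrated with a short Python sketch: for a position in the first sequence, take the longest exact match in the second sequence, then extend it without gaps until more than k mismatches would be needed. This is an assumption-laden illustration, not the kmacs implementation. It uses a naive quadratic search instead of generalized enhanced suffix arrays, and the function names are hypothetical.

```python
def longest_exact_match(seq_a, i, seq_b):
    """Return (length, j) for the longest substring of seq_a starting at i
    that also occurs in seq_b, with j its start in seq_b (-1 if none)."""
    best_len, best_j = 0, -1
    for j in range(len(seq_b)):
        length = 0
        while (i + length < len(seq_a) and j + length < len(seq_b)
               and seq_a[i + length] == seq_b[j + length]):
            length += 1
        if length > best_len:
            best_len, best_j = length, j
    return best_len, best_j


def k_mismatch_length(seq_a, i, seq_b, k):
    """Greedy approximation: extend the longest exact match at position i
    to the right, allowing up to k mismatches and no gaps."""
    length, j = longest_exact_match(seq_a, i, seq_b)
    if j < 0:
        return 0
    mismatches = 0
    while i + length < len(seq_a) and j + length < len(seq_b):
        if seq_a[i + length] != seq_b[j + length]:
            if mismatches == k:
                break  # the next mismatch would exceed the budget
            mismatches += 1
        length += 1
    return length


def average_k_mismatch_length(seq_a, seq_b, k):
    """Average the greedy lengths over all positions of seq_a."""
    total = sum(k_mismatch_length(seq_a, i, seq_b, k) for i in range(len(seq_a)))
    return total / len(seq_a)
```

Because the extension starts from the longest exact match, the result is only an approximation: with `seq_a = "ACGTACGT"`, `seq_b = "ACGAACGT"` and `k = 1`, the sketch returns 4 for position 0, although aligning both sequences from their first characters would give a 1-mismatch substring of length 8.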


Subjects
Phylogeny , DNA Sequence Analysis/methods , Protein Sequence Analysis/methods , Algorithms , Animals , Bacterial Genome , Mitochondrial Genome , Primates , Roseobacter/genetics , Sequence Alignment
14.
Bioinformatics ; 30(14): 1991-9, 2014 Jul 15.
Article in English | MEDLINE | ID: mdl-24700317

ABSTRACT

MOTIVATION: Alignment-free methods for sequence comparison are increasingly used for genome analysis and phylogeny reconstruction; they circumvent various difficulties of traditional alignment-based approaches. In particular, alignment-free methods are much faster than pairwise or multiple alignments. They are, however, less accurate than methods based on sequence alignment. Most alignment-free approaches work by comparing the word composition of sequences. A well-known problem with these methods is that neighbouring word matches are far from independent. RESULTS: To reduce the statistical dependency between adjacent word matches, we propose to use 'spaced words', defined by patterns of 'match' and 'don't care' positions, for alignment-free sequence comparison. We describe a fast implementation of this approach using recursive hashing and bit operations, and we show that further improvements can be achieved by using multiple patterns instead of single patterns. To evaluate our approach, we use spaced-word frequencies as a basis for fast phylogeny reconstruction. Using real-world and simulated sequence data, we demonstrate that our multiple-pattern approach produces better phylogenies than approaches relying on contiguous words. AVAILABILITY AND IMPLEMENTATION: Our program is freely available at http://spaced.gobics.de/.
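
To make the notion of 'spaced words' concrete, here is a minimal Python sketch that reads a sequence through a pattern of match ('1') and don't-care ('0') positions and compares two sequences by their spaced-word frequencies. The function names are hypothetical, and the Euclidean distance is only one possible choice assumed for this example. The recursive hashing, bit operations and multiple-pattern strategy mentioned in the abstract are not implemented here.

```python
import math
from collections import Counter


def spaced_word_frequencies(seq, pattern):
    """Return the relative frequency of each spaced word in seq.

    A spaced word keeps only the characters at the match ('1') positions
    of the pattern; characters at don't-care ('0') positions are skipped.
    """
    match_pos = [p for p, c in enumerate(pattern) if c == "1"]
    span = len(pattern)
    counts = Counter(
        "".join(seq[i + p] for p in match_pos)
        for i in range(len(seq) - span + 1)
    )
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()} if total else {}


def euclidean_distance(freq_a, freq_b):
    """Euclidean distance between two frequency dictionaries
    (an assumed example measure, not necessarily the one used in the paper)."""
    words = set(freq_a) | set(freq_b)
    return math.sqrt(
        sum((freq_a.get(w, 0.0) - freq_b.get(w, 0.0)) ** 2 for w in words)
    )
```

With the pattern "101", the windows "ACG" and "AAG" both yield the spaced word "AG", so a substitution at the don't-care position does not change the word that is counted.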


Subjects
Phylogeny , DNA Sequence Analysis/methods , Protein Sequence Analysis/methods , Algorithms , Animals , Mitochondrial Genome , Plant Genome , Genomics/methods , Primates , Sequence Alignment
15.
PLoS One ; 9(2): e89297, 2014.
Article in English | MEDLINE | ID: mdl-24586671

ABSTRACT

A major challenge in current systems biology is the combination and integrative analysis of large data sets obtained from different high-throughput omics platforms, such as mass spectrometry based Metabolomics and Proteomics or DNA microarray or RNA-seq-based Transcriptomics. Especially in the case of non-targeted Metabolomics experiments, where it is often impossible to unambiguously map ion features from mass spectrometry analysis to metabolites, the integration of more reliable omics technologies is highly desirable. A popular method for the knowledge-based interpretation of single data sets is the (Gene) Set Enrichment Analysis. In order to combine the results from different analyses, we introduce a methodical framework for the meta-analysis of p-values obtained from Pathway Enrichment Analysis (Set Enrichment Analysis based on pathways) of multiple dependent or independent data sets from different omics platforms. For dependent data sets, e.g. obtained from the same biological samples, the framework utilizes a covariance estimation procedure based on the nonsignificant pathways in single data set enrichment analysis. The framework is evaluated and applied in the joint analysis of Metabolomics mass spectrometry and Transcriptomics DNA microarray data in the context of plant wounding. In extensive studies of simulated data set dependence, the introduced correlation could be fully reconstructed by means of the covariance estimation based on pathway enrichment. By restricting the range of p-values of pathways considered in the estimation, the overestimation of correlation, which is introduced by the significant pathways, could be reduced. When applying the proposed methods to the real data sets, the meta-analysis was shown not only to be a powerful tool to investigate the correlation between different data sets and summarize the results of multiple analyses but also to distinguish experiment-specific key pathways.


Subjects
Oligonucleotide Array Sequence Analysis , Genetic Databases , Humans , Metabolomics , Systems Biology/methods
16.
Methods Mol Biol ; 1079: 191-202, 2014.
Article in English | MEDLINE | ID: mdl-24170403

ABSTRACT

DIALIGN is a software tool for multiple sequence alignment that combines global and local alignment features. It composes multiple alignments from local pairwise sequence similarities. This approach is particularly useful to discover conserved functional regions in sequences that share only local homologies but are otherwise unrelated. An anchoring option allows the use of external information and expert knowledge in addition to primary-sequence similarity alone. The latest version of DIALIGN optionally uses matches to the PFAM database to detect weak homologies. Various versions of the program are available through Göttingen Bioinformatics Compute Server (GOBICS) at http://www.gobics.de/department/software.


Subjects
Computational Biology/methods , Sequence Alignment/methods , Software , Internet , Proteins/chemistry , User-Computer Interface
17.
Nucleic Acids Res ; 41(Web Server issue): W3-7, 2013 Jul.
Article in English | MEDLINE | ID: mdl-23620293

ABSTRACT

DIALIGN is an established tool for multiple sequence alignment that is particularly useful to detect local homologies in sequences with low overall similarity. In recent years, various versions of the program have been developed, some of which are fully automated, whereas others are able to accept user-specified external information. In this article, we review some versions of the program that are available through 'Göttingen Bioinformatics Compute Server'. In addition to previously described implementations, we present a new release of DIALIGN called 'DIALIGN-PFAM', which uses hits to the PFAM database for improved protein alignment. Our software is available through http://dialign.gobics.de/.


Subjects
Sequence Alignment/methods , Software , Algorithms , Internet , DNA Sequence Analysis , Protein Sequence Analysis
18.
Nucleic Acids Res ; 40(Web Server issue): W193-8, 2012 Jul.
Article in English | MEDLINE | ID: mdl-22600739

ABSTRACT

jpHMM is a very accurate and widely used tool for recombination detection in genomic sequences of HIV-1. Here, we present an extension of jpHMM to analyze recombinations in viruses with circular genomes such as the hepatitis B virus (HBV). Sequence analysis of circular genomes is usually performed on linearized sequences using linear models. Since linear models are unable to model dependencies between nucleotides at the 5'- and 3'-end of a sequence, this can result in inaccurate predictions of recombination breakpoints and thus in incorrect classification of viruses with circular genomes. The proposed circular jpHMM takes into account the circularity of the genome and is not biased against recombination breakpoints close to the 5'- or 3'-end of the linearized version of the circular genome. It can be applied automatically to any query sequence without assuming a specific origin for the sequence coordinates. We apply the method to genomic sequences of HBV and visualize its output in a circular form. jpHMM is available online at http://jphmm.gobics.de for download and as a web server for HIV-1 and HBV sequences.


Subjects
Viral Genome , Hepatitis B virus/genetics , Genetic Recombination , Software , Genomics/methods , Internet , Markov Chains , Sequence Alignment
19.
BMC Bioinformatics ; 12: 425, 2011 Oct 31.
Article in English | MEDLINE | ID: mdl-22040322

ABSTRACT

BACKGROUND: Long-term sample storage, tracing of data flow and data export for subsequent analyses are of great importance in genetics studies. Therefore, molecular labs need a proper information system to handle an increasing amount of data from different projects. RESULTS: We have developed a molecular labs information management system (MolabIS). It was implemented as a web-based system allowing the users to capture original data at each step of their workflow. MolabIS provides essential functionality for managing information on individuals, tracking samples and storage locations, capturing raw files, importing final data from external files, searching results, accessing and modifying data. Further important features are options to generate ready-to-print reports and convert sequence and microsatellite data into various data formats, which can be used as input files in subsequent analyses. Moreover, MolabIS also provides a tool for data migration. CONCLUSIONS: MolabIS is designed for small-to-medium sized labs conducting Sanger sequencing and microsatellite genotyping to store and efficiently handle a relatively large amount of data. MolabIS not only helps to avoid time-consuming tasks but also ensures the availability of data for further analyses. The software is packaged as a virtual appliance which can run on different platforms (e.g. Linux, Windows). MolabIS can be distributed to a wide range of molecular genetics labs since it was developed according to a general data model. Released under GPL, MolabIS is freely available at http://www.molabis.org.


Subjects
Database Management Systems , Genetic Databases , Animals , Genotype , Information Management , Internet , Management Information Systems , Microsatellite Repeats
20.
BMC Bioinformatics ; 12: 93, 2011 Apr 11.
Article in English | MEDLINE | ID: mdl-21481263

ABSTRACT

BACKGROUND: Methods of determining whether or not any particular HIV-1 sequence stems, completely or in part, from some unknown HIV-1 subtype are important for the design of vaccines and molecular detection systems, as well as for epidemiological monitoring. Nevertheless, only a single algorithm, the Branching Index (BI), has been developed for this task so far. Moving along the genome of a query sequence in a sliding window, the BI computes a ratio quantifying how closely the query sequence clusters with a subtype clade. In its current version, however, the BI does not provide predicted boundaries of unknown fragments. RESULTS: We have developed Unknown Subtype Finder (USF), an algorithm based on a probabilistic model, which automatically determines which parts of an input sequence originate from a subtype yet unknown. The underlying model is based on a simple profile hidden Markov model (pHMM) for each known subtype and an additional pHMM for an unknown subtype. The emission probabilities of the latter are estimated using the emission frequencies of the known subtypes by means of a (position-wise) probabilistic model for the emergence of new subtypes. We have applied USF to SIV and HIV-1 sequences formerly classified as having emerged from an unknown subtype. Moreover, we have evaluated its performance on artificial HIV-1 recombinants and non-recombinant HIV-1 sequences. The results have been compared with the corresponding results of the BI. CONCLUSIONS: Our results demonstrate that USF is suitable for detecting segments in HIV-1 sequences stemming from yet unknown subtypes. Comparing USF with the BI shows that our algorithm performs as well as the BI or better.


Subjects
Algorithms , Computational Biology/methods , HIV-1/genetics , Computer Simulation , Genetic Variation , Genetic Models