Results 1 - 20 of 72
1.
PLoS Comput Biol ; 18(8): e1010303, 2022 08.
Article in English | MEDLINE | ID: mdl-35939516

ABSTRACT

Most methods for phylogenetic tree reconstruction are based on sequence alignments; they infer phylogenies from substitutions that may have occurred at the aligned sequence positions. Gaps in alignments are usually not employed as phylogenetic signal. In this paper, we explore an alignment-free approach that uses insertions and deletions (indels) as an additional source of information for phylogeny inference. For a set of four or more input sequences, we generate so-called quartet blocks of four putative homologous segments each. For pairs of such quartet blocks involving the same four sequences, we compare the distances between the two blocks in these sequences, to obtain hints about indels that may have happened between the blocks since the respective four sequences have evolved from their last common ancestor. A prototype implementation that we call Gap-SpaM is presented to infer phylogenetic trees from these data, using a quartet-tree approach or, alternatively, under the maximum-parsimony paradigm. This approach should not be regarded as an alternative to established methods, but rather as a complementary source of phylogenetic information. Interestingly, however, our software is able to produce phylogenetic trees from putative indels alone that are comparable to trees obtained with existing alignment-free methods.
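
The comparison of distances between two quartet blocks can be illustrated with a minimal Python sketch. It is not the Gap-SpaM implementation: the input format (a mapping from sequence name to block start position) and the function name `indel_split` are assumptions made for illustration. The idea shown is that when two of the four sequences share one inter-block distance and the other two share a different one, the pair of blocks hints at an indel that supports the corresponding 2|2 split of the quartet.

```python
from collections import defaultdict


def indel_split(pos_block1, pos_block2):
    """Return the 2|2 split supported by a pair of quartet blocks, or None.

    pos_block1 and pos_block2 map each sequence name to the start position
    of that block's segment in the sequence (hypothetical input format).
    """
    # Group the sequences by the distance between the two blocks.
    groups = defaultdict(list)
    for name in pos_block1:
        groups[pos_block2[name] - pos_block1[name]].append(name)

    # Informative case: exactly two distinct distances, two sequences each.
    sizes = sorted(len(g) for g in groups.values())
    if len(pos_block1) == 4 and sizes == [2, 2]:
        return frozenset(frozenset(g) for g in groups.values())
    return None


# Made-up positions: A and B are 60 apart, C and D are 63 apart.
b1 = {"A": 100, "B": 250, "C": 40, "D": 900}
b2 = {"A": 160, "B": 310, "C": 103, "D": 963}
split = indel_split(b1, b2)
```

Splits collected in this way over many block pairs could then be passed to a quartet-tree method or encoded as characters for maximum parsimony, as the abstract describes.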


Subjects
INDEL Mutation, Software, Algorithms, INDEL Mutation/genetics, Phylogeny, Sequence Alignment
2.
BMC Bioinformatics ; 22(1): 64, 2021 Feb 11.
Article in English | MEDLINE | ID: mdl-33573603

ABSTRACT

BACKGROUND: The advancement of SMRT technology has opened new opportunities for genome analysis with its longer read length and low GC bias. Alignment of the reads to their appropriate positions in the respective reference genome is the first but costliest step of any analysis pipeline based on SMRT sequencing. However, the state-of-the-art aligners often fail to identify distant homologies due to a lack of conserved regions, caused by frequent genetic duplication and recombination. Therefore, we developed a novel alignment-free method of sequence mapping that is fast and accurate. RESULTS: We present a new mapper called S-conLSH that uses Spaced context based Locality Sensitive Hashing. With multiple spaced patterns, S-conLSH facilitates a gapped mapping of noisy long reads to the corresponding target locations of a reference genome. We have examined the performance of the proposed method on 5 different real and simulated datasets. S-conLSH is at least 2 times faster than the recently developed method lordFAST. It achieves a sensitivity of 99%, without using any traditional base-to-base alignment, on human simulated sequence data. By default, S-conLSH provides an alignment-free mapping in PAF format. However, it has an option to generate aligned output as a SAM file if this is required for downstream processing. CONCLUSIONS: S-conLSH is one of the first alignment-free reference genome mapping tools achieving a high level of sensitivity. The spaced context is especially suitable for extracting distant similarities. The variable-length spaced seeds or patterns add flexibility to the proposed algorithm by introducing gapped mapping of the noisy long reads. Therefore, S-conLSH may be considered a promising direction for alignment-free sequence analysis.


Subjects
Algorithms, High-Throughput Nucleotide Sequencing, Humans, Sequence Alignment, DNA Sequence Analysis, Software
3.
Bioinformatics ; 35(2): 211-218, 2019 01 15.
Article in English | MEDLINE | ID: mdl-29992260

ABSTRACT

Motivation: Most methods for pairwise and multiple genome alignment use fast local homology search tools to identify anchor points, i.e. high-scoring local alignments of the input sequences. Sequence segments between those anchor points are then aligned with slower, more sensitive methods. Finding suitable anchor points is therefore crucial for genome sequence comparison; speed and sensitivity of genome alignment depend on the underlying anchoring methods. Results: In this article, we use filtered spaced word matches to generate anchor points for genome alignment. For a given binary pattern representing match and don't-care positions, we first search for spaced-word matches, i.e. ungapped local pairwise alignments with matching nucleotides at the match positions of the pattern and possible mismatches at the don't-care positions. Those spaced-word matches that have similarity scores above some threshold value are then extended using a standard X-drop algorithm; the resulting local alignments are used as anchor points. To evaluate this approach, we used the popular multiple-genome-alignment pipeline Mugsy and replaced the exact word matches that Mugsy uses as anchor points with our spaced-word-based anchor points. For closely related genome sequences, the two anchoring procedures lead to multiple alignments of similar quality. For distantly related genomes, however, alignments calculated with our filtered-spaced-word matches are superior to alignments produced with the original Mugsy program where exact word matches are used to find anchor points. Availability and implementation: http://spacedanchor.gobics.de. Supplementary information: Supplementary data are available at Bioinformatics online.
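
The extension step mentioned in the abstract can be sketched as follows. This is a simplified, ungapped, right-only X-drop extension written for illustration; the scoring values (+1 for a match, -1 for a mismatch) and the function name are assumptions and are not taken from the software described in the article.

```python
def xdrop_extend_right(s, t, i, j, x_drop, match=1, mismatch=-1):
    """Extend an ungapped alignment to the right, starting at s[i] and t[j].

    The extension stops as soon as the running score falls more than
    x_drop below the best score seen so far. Returns (best_score, best_length).
    """
    score = best = 0
    best_len = 0
    k = 0
    while i + k < len(s) and j + k < len(t):
        score += match if s[i + k] == t[j + k] else mismatch
        k += 1
        if score > best:
            best, best_len = score, k
        elif best - score > x_drop:
            break
    return best, best_len


# Seven matching bases followed by mismatches; the extension is cut
# back to the best-scoring prefix.
result = xdrop_extend_right("ACGTACGTTTTT", "ACGTACGAAAAA", 0, 0, x_drop=2)
```

A full anchoring step would extend in both directions from each spaced-word match that passes the score threshold and keep the resulting local alignment as an anchor point.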


Subjects
Algorithms, Genome, Sequence Alignment/methods, Software, Computational Biology
4.
BMC Bioinformatics ; 20(Suppl 20): 638, 2019 Dec 17.
Article in English | MEDLINE | ID: mdl-31842735

ABSTRACT

BACKGROUND: In many fields of biomedical research, it is important to estimate phylogenetic distances between taxa based on low-coverage sequencing reads. Major applications are, for example, phylogeny reconstruction, species identification from small sequencing samples, or bacterial strain typing in medical diagnostics. RESULTS: We adapted our previously developed software program Filtered Spaced-Word Matches (FSWM) for alignment-free phylogeny reconstruction to take unassembled reads as input; we call this implementation Read-SpaM. CONCLUSIONS: Test runs on simulated reads from semi-artificial and real-world bacterial genomes show that our approach can estimate phylogenetic distances with high accuracy, even for large evolutionary distances and for very low sequencing coverage.


Subjects
Bacterial Genome, Sequence Alignment, DNA Sequence Analysis/methods, Software, Algorithms, Base Sequence, Escherichia coli/genetics, Phylogeny
5.
Bioinformatics ; 33(7): 971-979, 2017 04 01.
Article in English | MEDLINE | ID: mdl-28073754

ABSTRACT

Motivation: Word-based or 'alignment-free' algorithms are increasingly used for phylogeny reconstruction and genome comparison, since they are much faster than traditional approaches that are based on full sequence alignments. Existing alignment-free programs, however, are less accurate than alignment-based methods. Results: We propose Filtered Spaced Word Matches (FSWM), a fast alignment-free approach to estimate phylogenetic distances between large genomic sequences. For a pre-defined binary pattern of match and don't-care positions, FSWM rapidly identifies spaced-word matches between input sequences, i.e. gap-free local alignments with matching nucleotides at the match positions and with mismatches allowed at the don't-care positions. We then estimate the number of nucleotide substitutions per site by considering the nucleotides aligned at the don't-care positions of the identified spaced-word matches. To reduce the noise from spurious random matches, we use a filtering procedure where we discard all spaced-word matches for which the overall similarity between the aligned segments is below a threshold. We show that our approach can accurately estimate substitution frequencies even for distantly related sequences that cannot be analyzed with existing alignment-free methods; phylogenetic trees constructed with FSWM distances are of high quality. A program run on a pair of eukaryotic genomes of a few hundred Mb each takes a few minutes. Availability and Implementation: The source code for FSWM, including documentation, and the software that we used to generate artificial genome sequences are freely available at http://fswm.gobics.de/. Contact: chris.leimeister@stud.uni-goettingen.de. Supplementary information: Supplementary data are available at Bioinformatics online.
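
The procedure outlined in the abstract (find spaced-word matches, filter them by similarity, estimate substitutions from the don't-care positions) can be sketched in a few lines of Python. The sketch rests on several assumptions that are not stated in the abstract: a +1/-1 score over the don't-care positions stands in for whatever similarity score the program uses, the threshold `min_score=0` is an arbitrary choice, and the Jukes-Cantor formula is used to turn the mismatch fraction into substitutions per site. The function name `fswm_distance` is hypothetical.

```python
import math
from collections import defaultdict


def fswm_distance(s, t, pattern, min_score=0):
    """Estimate substitutions per site from filtered spaced-word matches.

    pattern is a string of '1' (match position) and '0' (don't-care).
    Returns None if no spaced-word match survives the filter.
    """
    care = [k for k, c in enumerate(pattern) if c == "1"]
    dont_care = [k for k, c in enumerate(pattern) if c == "0"]
    span = len(pattern)

    # Index every window of s by its characters at the match positions.
    index = defaultdict(list)
    for i in range(len(s) - span + 1):
        index[tuple(s[i + k] for k in care)].append(i)

    mismatches = compared = 0
    for j in range(len(t) - span + 1):
        word = tuple(t[j + k] for k in care)
        for i in index.get(word, ()):
            diffs = sum(s[i + k] != t[j + k] for k in dont_care)
            # Assumed score: +1 per match, -1 per mismatch at don't-care positions.
            score = len(dont_care) - 2 * diffs
            if score >= min_score:  # discard low-similarity (likely random) matches
                mismatches += diffs
                compared += len(dont_care)

    if compared == 0:
        return None
    p = mismatches / compared
    if p >= 0.75:
        return math.inf  # Jukes-Cantor correction is undefined here
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# One window per sequence; the match positions (A..T) agree and one of
# the two don't-care positions differs, so p = 0.5.
d = fswm_distance("ACGT", "AGGT", "1001")
```

On real genomes a pattern with many more match and don't-care positions would be needed for the filter to separate homologous from random matches; the short pattern here only keeps the example easy to follow by hand.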


Subjects
Algorithms, Phylogeny, Base Sequence, Computer Simulation, Bacterial Genome, Plant Genome, Genomics/methods, Sequence Alignment, DNA Sequence Analysis, Nucleic Acid Sequence Homology, Software, Time Factors
6.
PLoS Comput Biol ; 12(10): e1005107, 2016 Oct.
Article in English | MEDLINE | ID: mdl-27760124

ABSTRACT

Many algorithms for sequence analysis rely on word matching or word statistics. Often, these approaches can be improved if binary patterns representing match and don't-care positions are used as a filter, such that only those positions of words are considered that correspond to the match positions of the patterns. The performance of these approaches, however, depends on the underlying patterns. Herein, we show that the overlap complexity of a pattern set that was introduced by Ilie and Ilie is closely related to the variance of the number of matches between two evolutionarily related sequences with respect to this pattern set. We propose a modified hill-climbing algorithm to optimize pattern sets for database searching, read mapping and alignment-free sequence comparison of nucleic-acid sequences; our implementation of this algorithm is called rasbhari. Depending on the application at hand, rasbhari can either minimize the overlap complexity of pattern sets, maximize their sensitivity in database searching or minimize the variance of the number of pattern-based matches in alignment-free sequence comparison. We show that, for database searching, rasbhari generates pattern sets with slightly higher sensitivity than existing approaches. In our Spaced Words approach to alignment-free sequence comparison, pattern sets calculated with rasbhari led to more accurate estimates of phylogenetic distances than the randomly generated pattern sets that we previously used. Finally, we used rasbhari to generate patterns for short read classification with CLARK-S. Here too, the sensitivity of the results could be improved, compared to the default patterns of the program. We integrated rasbhari into Spaced Words; the source code of rasbhari is freely available at http://rasbhari.gobics.de/.
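
For readers unfamiliar with overlap complexity, the following Python sketch computes it under the definition commonly attributed to Ilie and Ilie: for every shift of one pattern against another, count the match positions that coincide and add 2 raised to that count; the value for a pattern set is the sum over all unordered pairs, including each pattern with itself. This definition is an assumption here, since the abstract does not spell it out, and the code is not taken from rasbhari.

```python
from itertools import combinations_with_replacement


def pair_overlap(p, q):
    """Sum of 2**(number of coinciding '1' positions) over all shifts of q against p."""
    total = 0
    for shift in range(-(len(q) - 1), len(p)):
        shared = sum(
            1
            for k, c in enumerate(q)
            if c == "1" and 0 <= k + shift < len(p) and p[k + shift] == "1"
        )
        total += 2 ** shared
    return total


def overlap_complexity(patterns):
    """Overlap complexity of a pattern set: sum over all unordered pairs."""
    return sum(
        pair_overlap(p, q)
        for p, q in combinations_with_replacement(patterns, 2)
    )


oc = overlap_complexity(["11", "101"])
```

A hill-climbing optimizer in the spirit of the abstract would repeatedly modify one pattern in the set (for example, by swapping a match and a don't-care position) and keep the change whenever this value decreases.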


Subjects
Algorithms, DNA/genetics, Database Management Systems, Genetic Databases, DNA Sequence Analysis/methods, Software, DNA/chemistry, DNA Mutational Analysis/methods, Data Mining/methods, Machine Learning, Automated Pattern Recognition/methods, Sequence Alignment/methods
7.
Nucleic Acids Res ; 42(Web Server issue): W7-11, 2014 Jul.
Article in English | MEDLINE | ID: mdl-24829447

ABSTRACT

In this article, we present a user-friendly web interface for two alignment-free sequence-comparison methods that we recently developed. Most alignment-free methods rely on exact word matches to estimate pairwise similarities or distances between the input sequences. By contrast, our new algorithms are based on inexact word matches. The first of these approaches uses the relative frequencies of so-called spaced words in the input sequences, i.e. words containing 'don't care' or 'wildcard' symbols at certain pre-defined positions. Various distance measures can then be defined on sequences based on their different spaced-word composition. Our second approach defines the distance between two sequences by estimating, for each position in the first sequence, the length of the longest substring at this position that also occurs in the second sequence with up to k mismatches. Both approaches take a set of deoxyribonucleic acid (DNA) or protein sequences as input and return a matrix of pairwise distance values that can be used as a starting point for clustering algorithms or distance-based phylogeny reconstruction. The two alignment-free programmes are accessible through a web interface at 'Göttingen Bioinformatics Compute Server (GOBICS)', at http://spaced.gobics.de and http://kmacs.gobics.de, and the source code can be downloaded.


Subjects
Phylogeny, DNA Sequence Analysis/methods, Protein Sequence Analysis/methods, Software, Algorithms, Internet, Sequence Alignment, User-Computer Interface
8.
Bioinformatics ; 30(14): 2000-8, 2014 Jul 15.
Article in English | MEDLINE | ID: mdl-24828656

ABSTRACT

MOTIVATION: Alignment-based methods for sequence analysis have various limitations if large datasets are to be analysed. Therefore, alignment-free approaches have become popular in recent years. One of the best known alignment-free methods is the average common substring approach that defines a distance measure on sequences based on the average length of longest common words between them. Herein, we generalize this approach by considering longest common substrings with k mismatches. We present a greedy heuristic to approximate the length of such k-mismatch substrings, and we describe kmacs, an efficient implementation of this idea based on generalized enhanced suffix arrays. RESULTS: To evaluate the performance of our approach, we applied it to phylogeny reconstruction using a large number of DNA and protein sequence sets. In most cases, phylogenetic trees calculated with kmacs were more accurate than trees produced with established alignment-free methods that are based on exact word matches. Especially on protein sequences, our method seems to be superior. On simulated protein families, kmacs even outperformed a classical approach to phylogeny reconstruction using multiple alignment and maximum likelihood. AVAILABILITY AND IMPLEMENTATION: kmacs is implemented in C++, and the source code is freely available at http://kmacs.gobics.de/.
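
The greedy heuristic can be sketched as a two-step procedure: for each position in the first sequence, pick the position in the second sequence with the longest exact match, then extend that same pair of positions until the allowed number of mismatches is exhausted. The Python sketch below uses a naive quadratic search in place of the generalized enhanced suffix arrays mentioned in the abstract, and its function names are hypothetical; it illustrates the idea only and is not the kmacs code.

```python
def common_prefix_len(s, i, t, j, k=0):
    """Length of the longest common prefix of s[i:] and t[j:] with at most k mismatches."""
    length = mismatches = 0
    while i + length < len(s) and j + length < len(t):
        if s[i + length] != t[j + length]:
            if mismatches == k:
                break
            mismatches += 1
        length += 1
    return length


def kmismatch_lengths(s, t, k):
    """Greedy estimate of the longest k-mismatch match starting at each position of s."""
    lengths = []
    for i in range(len(s)):
        # Step 1: position in t with the longest exact match (first one on ties).
        j = max(range(len(t)), key=lambda pos: common_prefix_len(s, i, t, pos))
        # Step 2: extend the same pair of positions, allowing up to k mismatches.
        lengths.append(common_prefix_len(s, i, t, j, k))
    return lengths


exact = kmismatch_lengths("ACGTAC", "ACGAAC", 0)
with_one_mismatch = kmismatch_lengths("ACGTAC", "ACGAAC", 1)
```

The per-position lengths would then be averaged and turned into a distance, in the manner of the average common substring approach that the abstract generalizes.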


Subjects
Phylogeny, DNA Sequence Analysis/methods, Protein Sequence Analysis/methods, Algorithms, Animals, Bacterial Genome, Mitochondrial Genome, Primates, Roseobacter/genetics, Sequence Alignment
9.
Bioinformatics ; 30(14): 1991-9, 2014 Jul 15.
Article in English | MEDLINE | ID: mdl-24700317

ABSTRACT

MOTIVATION: Alignment-free methods for sequence comparison are increasingly used for genome analysis and phylogeny reconstruction; they circumvent various difficulties of traditional alignment-based approaches. In particular, alignment-free methods are much faster than pairwise or multiple alignments. They are, however, less accurate than methods based on sequence alignment. Most alignment-free approaches work by comparing the word composition of sequences. A well-known problem with these methods is that neighbouring word matches are far from independent. RESULTS: To reduce the statistical dependency between adjacent word matches, we propose to use 'spaced words', defined by patterns of 'match' and 'don't care' positions, for alignment-free sequence comparison. We describe a fast implementation of this approach using recursive hashing and bit operations, and we show that further improvements can be achieved by using multiple patterns instead of single patterns. To evaluate our approach, we use spaced-word frequencies as a basis for fast phylogeny reconstruction. Using real-world and simulated sequence data, we demonstrate that our multiple-pattern approach produces better phylogenies than approaches relying on contiguous words. AVAILABILITY AND IMPLEMENTATION: Our program is freely available at http://spaced.gobics.de/.
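
A minimal Python sketch of spaced-word composition follows. It counts spaced words naively rather than with the recursive hashing and bit operations the abstract mentions, and it uses the Euclidean distance between relative frequencies as one possible distance measure; both choices, and the function names, are assumptions made for illustration. Sequences are assumed to be at least as long as the longest pattern.

```python
import math
from collections import Counter


def spaced_word_counts(seq, pattern):
    """Count spaced words: characters of each window at the '1' positions of pattern."""
    care = [k for k, c in enumerate(pattern) if c == "1"]
    return Counter(
        "".join(seq[i + k] for k in care)
        for i in range(len(seq) - len(pattern) + 1)
    )


def spaced_word_distance(s, t, patterns):
    """Euclidean distance between relative spaced-word frequencies, over all patterns."""
    total = 0.0
    for pattern in patterns:
        cs = spaced_word_counts(s, pattern)
        ct = spaced_word_counts(t, pattern)
        ns, nt = sum(cs.values()), sum(ct.values())
        for word in cs.keys() | ct.keys():
            total += (cs[word] / ns - ct[word] / nt) ** 2
    return math.sqrt(total)


counts = spaced_word_counts("ACGTA", "101")
dist = spaced_word_distance("ACGTACGT", "ACGTTCGA", ["101", "1101"])
```

Passing several patterns, as in the last line, mirrors the multiple-pattern variant that the abstract reports as more accurate than a single pattern.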


Subjects
Phylogeny, DNA Sequence Analysis/methods, Protein Sequence Analysis/methods, Algorithms, Animals, Mitochondrial Genome, Plant Genome, Genomics/methods, Primates, Sequence Alignment
10.
Nucleic Acids Res ; 41(Web Server issue): W3-7, 2013 Jul.
Article in English | MEDLINE | ID: mdl-23620293

ABSTRACT

DIALIGN is an established tool for multiple sequence alignment that is particularly useful to detect local homologies in sequences with low overall similarity. In recent years, various versions of the program have been developed, some of which are fully automated, whereas others are able to accept user-specified external information. In this article, we review some versions of the program that are available through 'Göttingen Bioinformatics Compute Server'. In addition to previously described implementations, we present a new release of DIALIGN called 'DIALIGN-PFAM', which uses hits to the PFAM database for improved protein alignment. Our software is available through http://dialign.gobics.de/.


Subjects
Sequence Alignment/methods, Software, Algorithms, Internet, DNA Sequence Analysis, Protein Sequence Analysis
11.
Plant Cell ; 23(4): 1556-72, 2011 Apr.
Article in English | MEDLINE | ID: mdl-21487095

ABSTRACT

In the postgenomic era, accurate prediction tools are essential for identification of the proteomes of cell organelles. Prediction methods have been developed for peroxisome-targeted proteins in animals and fungi but are missing specifically for plants. For development of a predictor for plant proteins carrying peroxisome targeting signals type 1 (PTS1), we assembled more than 2500 homologous plant sequences, mainly from EST databases. We applied a discriminative machine learning approach to derive two different prediction methods, both of which showed high prediction accuracy and recognized specific targeting-enhancing patterns in the regions upstream of the PTS1 tripeptides. Upon application of these methods to the Arabidopsis thaliana genome, 392 gene models were predicted to be peroxisome targeted. These predictions were extensively tested in vivo, resulting in a high experimental verification rate of Arabidopsis proteins previously not known to be peroxisomal. The prediction methods were able to correctly infer novel PTS1 tripeptides, which even included novel residues. Twenty-three newly predicted PTS1 tripeptides were experimentally confirmed, and a high variability of the plant PTS1 motif was discovered. These prediction methods will be instrumental in identifying low-abundance and stress-inducible peroxisomal proteins and defining the entire peroxisomal proteome of Arabidopsis and agronomically important crop plants.


Subjects
Arabidopsis Proteins/metabolism, Arabidopsis/metabolism, Artificial Intelligence, Computational Biology/methods, Peroxisomes/metabolism, Protein Sorting Signals, Amino Acid Sequence, Arabidopsis/genetics, Arabidopsis Proteins/chemistry, Protein Databases, Plant Genome/genetics, Biological Models, Molecular Sequence Data, Peptides, Protein Transport, Reproducibility of Results, Subcellular Fractions/metabolism
12.
Nucleic Acids Res ; 40(Web Server issue): W193-8, 2012 Jul.
Article in English | MEDLINE | ID: mdl-22600739

ABSTRACT

jpHMM is a very accurate and widely used tool for recombination detection in genomic sequences of HIV-1. Here, we present an extension of jpHMM to analyze recombinations in viruses with circular genomes such as the hepatitis B virus (HBV). Sequence analysis of circular genomes is usually performed on linearized sequences using linear models. Since linear models are unable to model dependencies between nucleotides at the 5'- and 3'-end of a sequence, this can result in inaccurate predictions of recombination breakpoints and thus in incorrect classification of viruses with circular genomes. The proposed circular jpHMM takes into account the circularity of the genome and is not biased against recombination breakpoints close to the 5'- or 3'-end of the linearized version of the circular genome. It can be applied automatically to any query sequence without assuming a specific origin for the sequence coordinates. We apply the method to genomic sequences of HBV and visualize its output in a circular form. jpHMM is available online at http://jphmm.gobics.de for download and as a web server for HIV-1 and HBV sequences.


Subjects
Viral Genome, Hepatitis B virus/genetics, Genetic Recombination, Software, Genomics/methods, Internet, Markov Chains, Sequence Alignment
13.
Mol Cell Proteomics ; 10(6): M110.003350, 2011 Jun.
Article in English | MEDLINE | ID: mdl-21444828

ABSTRACT

We describe a method to identify and analyze translationally regulative 5'UTRs (5'TRU) in Saccharomyces cerevisiae. Two-dimensional analyses of ³⁵S-methionine metabolically labeled cells revealed 13 genes and proteins whose protein biosynthesis is post-transcriptionally up-regulated on amino acid starvation. The 5'UTRs of the respective mRNAs were further investigated. A plasmid-based reporter-testing system was developed to analyze their capability to influence translation dependent on amino acid availability. Most of the 13 candidate 5'UTRs are able to enhance translation independently of amino acids. Two 5'UTRs generally repressed translation, and the 5'UTRs of ENO1, FBA1, and TPI1 specifically up-regulated translation when cells were starved for amino acids. The TPI1-5'UTR exhibited the strongest effect in the testing system, which is consistent with elevated Tpi1p levels in amino acid starved cells. Bioinformatic analyses support the view that an unstructured A-rich 5' leader is beneficial for efficient translation when amino acids are scarce. Accordingly, the TPI1-5'UTR was shown to contain an A-rich tract in proximity to the mRNA initiation codon, required for its amino acid dependent regulatory function.


Subjects
5' Untranslated Regions, Amino Acids/metabolism, Fungal Gene Expression Regulation, Messenger RNA/genetics, Saccharomyces cerevisiae/genetics, Amitrole (Herbicide)/metabolism, Base Sequence, Reporter Genes, Molecular Sequence Data, Nucleic Acid Conformation, Proteome/genetics, Proteome/metabolism, Saccharomyces cerevisiae/growth & development, Saccharomyces cerevisiae Proteins/genetics, Saccharomyces cerevisiae Proteins/metabolism, Physiological Stress, Up-Regulation, beta-Galactosidase/biosynthesis, beta-Galactosidase/genetics
14.
Nucleic Acids Res ; 38(Web Server issue): W19-22, 2010 Jul.
Article in English | MEDLINE | ID: mdl-20497995

ABSTRACT

We introduce web interfaces for two recent extensions of the multiple-alignment program DIALIGN. DIALIGN-TX combines the greedy heuristic previously used in DIALIGN with a more traditional 'progressive' approach for improved performance on locally and globally related sequence sets. In addition, we offer a version of DIALIGN that uses predicted protein secondary structures together with primary sequence information to construct multiple protein alignments. Both programs are available through 'Göttingen Bioinformatics Compute Server' (GOBICS).


Subjects
Protein Secondary Structure, Sequence Alignment/methods, Protein Sequence Analysis, Software, Internet
15.
BMC Bioinformatics ; 12: 425, 2011 Oct 31.
Article in English | MEDLINE | ID: mdl-22040322

ABSTRACT

BACKGROUND: Long-term sample storage, tracing of data flow and data export for subsequent analyses are of great importance in genetics studies. Therefore, molecular labs need a proper information system to handle an increasing amount of data from different projects. RESULTS: We have developed a molecular labs information management system (MolabIS). It was implemented as a web-based system allowing users to capture original data at each step of their workflow. MolabIS provides essential functionality for managing information on individuals, tracking samples and storage locations, capturing raw files, importing final data from external files, searching results, and accessing and modifying data. Further important features are options to generate ready-to-print reports and to convert sequence and microsatellite data into various data formats, which can be used as input files in subsequent analyses. Moreover, MolabIS also provides a tool for data migration. CONCLUSIONS: MolabIS is designed for small-to-medium sized labs conducting Sanger sequencing and microsatellite genotyping to store and efficiently handle a relatively large amount of data. MolabIS not only helps to avoid time-consuming tasks but also ensures the availability of data for further analyses. The software is packaged as a virtual appliance which can run on different platforms (e.g. Linux, Windows). MolabIS can be distributed to a wide range of molecular genetics labs since it was developed according to a general data model. Released under GPL, MolabIS is freely available at http://www.molabis.org.


Subjects
Database Management Systems, Genetic Databases, Animals, Genotype, Information Management, Internet, Management Information Systems, Microsatellite Repeats
16.
BMC Bioinformatics ; 12: 93, 2011 Apr 11.
Article in English | MEDLINE | ID: mdl-21481263

ABSTRACT

BACKGROUND: Methods of determining whether or not any particular HIV-1 sequence stems, completely or in part, from some unknown HIV-1 subtype are important for the design of vaccines and molecular detection systems, as well as for epidemiological monitoring. Nevertheless, only a single algorithm, the Branching Index (BI), has been developed for this task so far. Moving along the genome of a query sequence in a sliding window, the BI computes a ratio quantifying how closely the query sequence clusters with a subtype clade. In its current version, however, the BI does not provide predicted boundaries of unknown fragments. RESULTS: We have developed Unknown Subtype Finder (USF), an algorithm based on a probabilistic model, which automatically determines which parts of an input sequence originate from a yet unknown subtype. The underlying model is based on a simple profile hidden Markov model (pHMM) for each known subtype and an additional pHMM for an unknown subtype. The emission probabilities of the latter are estimated using the emission frequencies of the known subtypes by means of a (position-wise) probabilistic model for the emergence of new subtypes. We have applied USF to SIV and HIV-1 sequences formerly classified as having emerged from an unknown subtype. Moreover, we have evaluated its performance on artificial HIV-1 recombinants and non-recombinant HIV-1 sequences. The results have been compared with the corresponding results of the BI. CONCLUSIONS: Our results demonstrate that USF is suitable for detecting segments in HIV-1 sequences stemming from yet unknown subtypes. Comparing USF with the BI shows that our algorithm performs as well as the BI or better.


Subjects
Algorithms, Computational Biology/methods, HIV-1/genetics, Computer Simulation, Genetic Variation, Genetic Models
17.
Bioinformatics ; 26(8): 1015-21, 2010 Apr 15.
Article in English | MEDLINE | ID: mdl-20189940

ABSTRACT

MOTIVATION: Multiple sequence alignments can be constructed on the basis of pairwise local sequence similarities. This approach is rather flexible and can combine the advantages of global and local alignment methods. The restriction to pairwise alignments as building blocks, however, can lead to misalignments since weak homologies may be missed if only pairs of sequences are compared. RESULTS: Herein, we propose a graph-theoretical approach to find local multiple sequence similarities. Starting with pairwise alignments produced by DIALIGN, we use a min-cut algorithm to find potential (partial) alignment columns that we use to construct a final multiple alignment. On real and simulated benchmark data, our approach consistently outperforms the standard version of DIALIGN where local pairwise alignments are greedily incorporated into a multiple alignment. AVAILABILITY: The prototype is freely available under GNU Public Licence from E.C.


Subjects
Algorithms, Genomics/methods, Sequence Alignment/methods, Genetic Databases, DNA Sequence Analysis
18.
Bioinformatics ; 26(11): 1409-15, 2010 Jun 01.
Article in English | MEDLINE | ID: mdl-20400454

ABSTRACT

MOTIVATION: Existing coalescent models and phylogenetic tools based on them are not designed for studying the genealogy of sequences like those of HIV, since recombinants with multiple cross-over points between the parental strains frequently arise in HIV. Hence, ambiguous cases in the classification of HIV sequences into subtypes and circulating recombinant forms (CRFs) have been treated with ad hoc methods, for lack of tools based on a comprehensive coalescent model accounting for complex recombination patterns. RESULTS: We developed the program ARGUS that scores classifications of sequences into subtypes and recombinant forms. It reconstructs ancestral recombination graphs (ARGs) that reflect the genealogy of the input sequences given a classification hypothesis. An ARG with maximal probability is approximated using a Markov chain Monte Carlo approach. ARGUS was able to distinguish the correct classification with a low error rate from plausible alternative classifications in simulation studies with realistic parameters. We applied our algorithm to decide between two recently debated alternatives in the classification of CRF02 of HIV-1 and find that CRF02 is indeed a recombinant of Subtypes A and G. AVAILABILITY: ARGUS is implemented in C++ and the source code is available at http://gobics.de/software.


Subjects
Algorithms, HIV/classification, HIV/genetics, HIV-1/classification, Markov Chains, Phylogeny, DNA Sequence Analysis
19.
Nucleic Acids Res ; 37(Web Server issue): W185-8, 2009 Jul.
Article in English | MEDLINE | ID: mdl-19491309

ABSTRACT

In the absence of whole genome sequences for many organisms, the use of expressed sequence tags (EST) offers an affordable approach for researchers conducting phylogenetic analyses to gain insight about the evolutionary history of organisms. Reliable alignments for phylogenomic analyses are based on orthologous gene sequences from different taxa. So far, researchers have not sufficiently tackled the problem of the completely automated construction of such datasets. Existing software tools are either semi-automated, covering only part of the necessary data processing, or implemented as a pipeline, requiring the installation and configuration of a cascade of external tools, which may be time-consuming and hard to manage. To simplify data set construction for phylogenomic studies, we set up a web server that uses our recently developed OrthoSelect approach. To the best of our knowledge, our web server is the first web-based EST analysis pipeline that allows the detection of orthologous gene sequences in EST libraries and outputs orthologous gene alignments. Additionally, OrthoSelect provides the user with an extensive results section that lists and visualizes all important results, such as annotations, data matrices for each gene/taxon and orthologous gene alignments. The web server is available at http://orthoselect.gobics.de.


Subjects
Expressed Sequence Tags/chemistry, Phylogeny, Sequence Alignment, Software, Genes, Genomics, Internet, User-Computer Interface
20.
Nucleic Acids Res ; 37(Web Server issue): W647-51, 2009 Jul.
Article in English | MEDLINE | ID: mdl-19443440

ABSTRACT

Previously, we developed jumping profile hidden Markov model (jpHMM), a new method to detect recombinations in HIV-1 genomes. The jpHMM predicts recombination breakpoints in a query sequence and assigns to each position of the sequence one of the major HIV-1 subtypes. Since incorrect subtype assignment or recombination prediction may lead to wrong conclusions in epidemiological or vaccine research, information about the reliability of the predicted parental subtypes and breakpoint positions is valuable. For this reason, we extended the output of jpHMM to include such information in terms of 'uncertainty' regions in the recombination prediction and an interval estimate of the breakpoint. Both types of information are computed based on the posterior probabilities of the subtypes at each query sequence position. Our results show that this extension strongly improves the reliability of the jpHMM recombination prediction. The jpHMM is available online at http://jphmm.gobics.de/.
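
One simple way to derive 'uncertainty' regions from per-position posterior probabilities is sketched below in Python. The rule used here (a position is uncertain when no subtype reaches a given posterior probability) and the threshold value of 0.99 are assumptions chosen for illustration; the abstract does not state the criterion used by jpHMM, and the input format and function name are hypothetical.

```python
def uncertainty_regions(posteriors, threshold=0.99):
    """Return (start, end) index pairs of maximal runs of uncertain positions.

    posteriors is a list with one dict per sequence position, mapping each
    subtype to its posterior probability at that position.
    """
    regions = []
    start = None
    for pos, probs in enumerate(posteriors):
        uncertain = max(probs.values()) < threshold
        if uncertain and start is None:
            start = pos  # a new uncertain run begins
        elif not uncertain and start is not None:
            regions.append((start, pos - 1))  # the run ended at the previous position
            start = None
    if start is not None:
        regions.append((start, len(posteriors) - 1))
    return regions


# Made-up posteriors around a breakpoint between subtypes A and B.
example = [
    {"A": 1.0, "B": 0.0},
    {"A": 0.6, "B": 0.4},
    {"A": 0.3, "B": 0.7},
    {"A": 0.0, "B": 1.0},
]
regions = uncertainty_regions(example)
```

In this toy input the two middle positions form a single uncertain region, which could equally be read as a rough interval estimate for the breakpoint between the two subtypes.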


Subjects
HIV-1/classification, HIV-1/genetics, Genetic Recombination, Software, Base Sequence, DNA Breaks, Internet, Markov Chains, Phylogeny, Reproducibility of Results, Sequence Alignment