Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 16 de 16
Filtrar
1.
Bioinformatics ; 37(19): 3277-3284, 2021 Oct 11.
Artigo em Inglês | MEDLINE | ID: mdl-33970217

RESUMO

MOTIVATION: The reconstruction of possible histories given a sample of genetic data in the presence of recombination and recurrent mutation is a challenging problem, but can provide key insights into the evolution of a population. We present KwARG, which implements a parsimony-based greedy heuristic algorithm for finding plausible genealogical histories (ancestral recombination graphs) that are minimal or near-minimal in the number of posited recombination and mutation events. RESULTS: Given an input dataset of aligned sequences, KwARG outputs a list of possible candidate solutions, each comprising a list of mutation and recombination events that could have generated the dataset; the relative proportion of recombinations and recurrent mutations in a solution can be controlled via specifying a set of 'cost' parameters. We demonstrate that the algorithm performs well when compared against existing methods. AVAILABILITY AND IMPLEMENTATION: The software is available at https://github.com/a-ignatieva/kwarg. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

2.
BMC Bioinformatics ; 16: 108, 2015 Apr 01.
Artigo em Inglês | MEDLINE | ID: mdl-25888064

RESUMO

BACKGROUND: A standard procedure in many areas of bioinformatics is to use a single multiple sequence alignment (MSA) as the basis for various types of analysis. However, downstream results may be highly sensitive to the alignment used, and neglecting the uncertainty in the alignment can lead to significant bias in the resulting inference. In recent years, a number of approaches have been developed for probabilistic sampling of alignments, rather than simply generating a single optimum. However, this type of probabilistic information is currently not widely used in the context of downstream inference, since most existing algorithms are set up to make use of a single alignment. RESULTS: In this work we present a framework for representing a set of sampled alignments as a directed acyclic graph (DAG) whose nodes are alignment columns; each path through this DAG then represents a valid alignment. Since the probabilities of individual columns can be estimated from empirical frequencies, this approach enables sample-based estimation of posterior alignment probabilities. Moreover, due to conditional independencies between columns, the graph structure encodes a much larger set of alignments than the original set of sampled MSAs, such that the effective sample size is greatly increased. CONCLUSIONS: The alignment DAG provides a natural way to represent a distribution in the space of MSAs, and allows for existing algorithms to be efficiently scaled up to operate on large sets of alignments. As an example, we show how this can be used to compute marginal probabilities for tree topologies, averaging over a very large number of MSAs. This framework can also be used to generate a statistically meaningful summary alignment; example applications show that this summary alignment is consistently more accurate than the majority of the alignment samples, leading to improvements in downstream tree inference. Implementations of the methods described in this article are available at http://statalign.github.io/WeaveAlign .


Assuntos
Algoritmos , Biologia Computacional/métodos , Gráficos por Computador , Modelos Estatísticos , Alinhamento de Sequência/métodos , Software , Simulação por Computador , Humanos , Incerteza
3.
Methods Mol Biol ; 1097: 143-62, 2014.
Artigo em Inglês | MEDLINE | ID: mdl-24639159

RESUMO

Stochastic context-free grammars (SCFGs) were first established in the context of natural language modelling, and only later found their applications in RNA secondary structure prediction. In this chapter, we discuss the basic SCFG algorithms (CYK and inside-outside algorithms) in an application-centered manner and use the pfold grammar as a case study to show how the algorithms can be adapted to a grammar in a nonstandard form. We extend our discussion to the use of grammars with additional information (such as evolutionary information) to improve the quality of predictions. Finally, we provide a brief survey of programs that use stochastic context-free grammars for RNA secondary structure prediction and modelling.


Assuntos
Biologia Computacional/métodos , Conformação de Ácido Nucleico , RNA/química , Software , Algoritmos , Dobramento de RNA , Análise de Sequência de RNA , Termodinâmica
4.
Bioinformatics ; 29(6): 704-10, 2013 Mar 15.
Artigo em Inglês | MEDLINE | ID: mdl-23396120

RESUMO

MOTIVATION: Many computational methods for RNA secondary structure prediction, and, in particular, for the prediction of a consensus structure of an alignment of RNA sequences, have been developed. Most methods, however, ignore biophysical factors, such as the kinetics of RNA folding; no current implementation considers both evolutionary information and folding kinetics, thus losing information that, when considered, might lead to better predictions. RESULTS: We present an iterative algorithm, Oxfold, in the framework of stochastic context-free grammars, that emulates the kinetics of RNA folding in a simplified way, in combination with a molecular evolution model. This method improves considerably on existing grammatical models that do not consider folding kinetics. Additionally, the model compares favourably to non-kinetic thermodynamic models.


Assuntos
Algoritmos , Dobramento de RNA , RNA/química , Teorema de Bayes , Evolução Molecular , Cinética , Modelos Moleculares , Alinhamento de Sequência , Análise de Sequência de RNA/métodos , Processos Estocásticos , Termodinâmica
5.
BMC Bioinformatics ; 13: 260, 2012 Oct 09.
Artigo em Inglês | MEDLINE | ID: mdl-23043260

RESUMO

BACKGROUND: RNA secondary structure prediction, or folding, is a classic problem in bioinformatics: given a sequence of nucleotides, the aim is to predict the base pairs formed in its three dimensional conformation. The inverse problem of designing a sequence folding into a particular target structure has only more recently received notable interest. With a growing appreciation and understanding of the functional and structural properties of RNA motifs, and a growing interest in utilising biomolecules in nano-scale designs, the interest in the inverse RNA folding problem is bound to increase. However, whereas the RNA folding problem from an algorithmic viewpoint has an elegant and efficient solution, the inverse RNA folding problem appears to be hard. RESULTS: In this paper we present a genetic algorithm approach to solve the inverse folding problem. The main aims of the development was to address the hitherto mostly ignored extension of solving the inverse folding problem, the multi-target inverse folding problem, while simultaneously designing a method with superior performance when measured on the quality of designed sequences. The genetic algorithm has been implemented as a Python program called Frnakenstein. It was benchmarked against four existing methods and several data sets totalling 769 real and predicted single structure targets, and on 292 two structure targets. It performed as well as or better at finding sequences which folded in silico into the target structure than all existing methods, without the heavy bias towards CG base pairs that was observed for all other top performing methods. On the two structure targets it also performed well, generating a perfect design for about 80% of the targets. CONCLUSIONS: Our method illustrates that successful designs for the inverse RNA folding problem does not necessarily have to rely on heavy biases in base pair and unpaired base distributions. The design problem seems to become more difficult on larger structures when the target structures are real structures, while no deterioration was observed for predicted structures. Design for two structure targets is considerably more difficult, but far from impossible, demonstrating the feasibility of automated design of artificial riboswitches. The Python implementation is available at http://www.stats.ox.ac.uk/research/genome/software/frnakenstein.


Assuntos
Algoritmos , Biologia Computacional/métodos , Dobramento de RNA/genética , RNA/química , RNA/genética , Software , Pareamento de Bases , Sequência de Bases , Simulação por Computador , Riboswitch
6.
BMC Bioinformatics ; 13: 78, 2012 May 04.
Artigo em Inglês | MEDLINE | ID: mdl-22559985

RESUMO

BACKGROUND: Stochastic Context-Free Grammars (SCFGs) were applied successfully to RNA secondary structure prediction in the early 90s, and used in combination with comparative methods in the late 90s. The set of SCFGs potentially useful for RNA secondary structure prediction is very large, but a few intuitively designed grammars have remained dominant. In this paper we investigate two automatic search techniques for effective grammars - exhaustive search for very compact grammars and an evolutionary algorithm to find larger grammars. We also examine whether grammar ambiguity is as problematic to structure prediction as has been previously suggested. RESULTS: These search techniques were applied to predict RNA secondary structure on a maximal data set and revealed new and interesting grammars, though none are dramatically better than classic grammars. In general, results showed that many grammars with quite different structure could have very similar predictive ability. Many ambiguous grammars were found which were at least as effective as the best current unambiguous grammars. CONCLUSIONS: Overall the method of evolving SCFGs for RNA secondary structure prediction proved effective in finding many grammars that had strong predictive accuracy, as good or slightly better than those designed manually. Furthermore, several of the best grammars found were ambiguous, demonstrating that such grammars should not be disregarded.


Assuntos
Algoritmos , Biologia Computacional/métodos , Conformação de Ácido Nucleico , RNA/química , Análise de Sequência de RNA , Processos Estocásticos
7.
Mol Biol Evol ; 28(6): 1777-84, 2011 Jun.
Artigo em Inglês | MEDLINE | ID: mdl-21212152

RESUMO

One of the key objectives of comparative genomics is the characterization of the forces that shape genomes over the course of evolution. In the last decades, evidence has been accumulated that for vertebrate genomes also epigenetic modifications have to be considered in this context. Especially, the elevated mutation frequency of 5-methylcytosine (5mC) is assumed to facilitate the depletion of CpG dinucleotides in species that exhibit global DNA methylation. For instance, the underrepresentation of CpG dinucleotides in many mammalian genomes is attributed to this effect, which is only neutralized in so-called CpG islands (CGIs) that are preferentially unmethylated and thus partially protected from rapid CpG decay. For primate-specific CpG-rich transposable elements from the ALU family, it is unclear whether their elevated CpG frequency is caused by their small age or by the absence of DNA methylation. In consequence, these elements are often misclassified in CGI annotations. We present a method for the estimation of germ line methylation from pairwise ancestral-descendant alignments. The approach is validated in a simulation study and tested on DNA repeats from the AluSx family. We conclude that a predicted unmethylated state in the germ line is highly correlated with epigenetic activity of the respective genomic region. Thus, CpG-rich repeats can be facilitated as in silico probes for the epigenetic potential of their genomic neighborhood.


Assuntos
Metilação de DNA/genética , Células Germinativas/metabolismo , Modelos Genéticos , Elementos Alu/genética , Células Sanguíneas/metabolismo , Simulação por Computador , Ilhas de CpG/genética , Epigenômica , Humanos , Reprodutibilidade dos Testes , Sensibilidade e Especificidade
8.
BMC Bioinformatics ; 11: 188, 2010 Apr 14.
Artigo em Inglês | MEDLINE | ID: mdl-20398246

RESUMO

BACKGROUND: The dynamic motions of many proteins are central to their function. It therefore follows that the dynamic requirements of a protein are evolutionary constrained. In order to assess and quantify this, one needs to compare the dynamic motions of different proteins. Comparing the dynamics of distinct proteins may also provide insight into how protein motions are modified by variations in sequence and, consequently, by structure. The optimal way of comparing complex molecular motions is, however, far from trivial. The majority of comparative molecular dynamics studies performed to date relied upon prior sequence or structural alignment to define which residues were equivalent in 3-dimensional space. RESULTS: Here we discuss an alternative methodology for comparative molecular dynamics that does not require any prior alignment information. We show it is possible to align proteins based solely on their dynamics and that we can use these dynamics-based alignments to quantify the dynamic similarity of proteins. Our method was tested on 10 representative members of the PDZ domain family. CONCLUSIONS: As a result of creating pair-wise dynamics-based alignments of PDZ domains, we have found evolutionarily conserved patterns in their backbone dynamics. The dynamic similarity of PDZ domains is highly correlated with their structural similarity as calculated with Dali. However, significant differences in their dynamics can be detected indicating that sequence has a more refined role to play in protein dynamics than just dictating the overall fold. We suggest that the method should be generally applicable.


Assuntos
Proteínas/química , Alinhamento de Sequência/métodos , Bases de Dados de Proteínas , Modelos Moleculares , Simulação de Dinâmica Molecular , Estrutura Terciária de Proteína , Análise de Sequência de Proteína/métodos
9.
BMC Evol Biol ; 9: 217, 2009 Aug 28.
Artigo em Inglês | MEDLINE | ID: mdl-19715598

RESUMO

BACKGROUND: We have previously combined statistical alignment and phylogenetic footprinting to detect conserved functional elements without assuming a fixed alignment. Considering a probability-weighted distribution of alignments removes sensitivity to alignment errors, properly accommodates regions of alignment uncertainty, and increases the accuracy of functional element prediction. Our method utilized standard dynamic programming hidden markov model algorithms to analyze up to four sequences. RESULTS: We present a novel approach, implemented in the software package BigFoot, for performing phylogenetic footprinting on greater numbers of sequences. We have developed a Markov chain Monte Carlo (MCMC) approach which samples both sequence alignments and locations of slowly evolving regions. We implement our method as an extension of the existing StatAlign software package and test it on well-annotated regions controlling the expression of the even-skipped gene in Drosophila and the alpha-globin gene in vertebrates. The results exhibit how adding additional sequences to the analysis has the potential to improve the accuracy of functional predictions, and demonstrate how BigFoot outperforms existing alignment-based phylogenetic footprinting techniques. CONCLUSION: BigFoot extends a combined alignment and phylogenetic footprinting approach to analyze larger amounts of sequence data using MCMC. Our approach is robust to alignment error and uncertainty and can be applied to a variety of biological datasets. The source code and documentation are publicly available for download from http://www.stats.ox.ac.uk/~satija/BigFoot/


Assuntos
Algoritmos , Teorema de Bayes , Modelos Genéticos , Evolução Molecular , Cadeias de Markov
10.
Stat Methods Med Res ; 18(5): 453-85, 2009 Oct.
Artigo em Inglês | MEDLINE | ID: mdl-19221170

RESUMO

Comparison of sequences that have descended from a common ancestor based on an explicit stochastic model of substitutions, insertions and deletions has risen to prominence in the last decade. Making statements about the positions of insertions-deletions (abbr. indels) is central in sequence and genome analysis and is called alignment. This statistical approach is harder conceptually and computationally, than competing approaches based on choosing an alignment according to some optimality criteria. But it has major practical advantages in terms of testing evolutionary hypotheses and parameter estimation. Basic dynamic approaches can allow the analysis of up to 4-5 sequences. MCMC techniques can bring this to about 10-15 sequences. Beyond this, different or heuristic approaches must be used. Besides the computational challenges, increasing realism in the underlying models is presently being addressed. A recent development that has been especially fruitful is combining statistical alignment with the problem of sequence annotation, making statements about the function of each nucleotide/amino acid. So far gene finding, protein secondary structure prediction and regulatory signal detection has been tackled within this framework. Much progress can be reported, but clearly major challenges remain if this approach is to be central in the analyses of large incoming sequence data sets.


Assuntos
Evolução Molecular , Genômica , Modelos Genéticos , Modelos Estatísticos , Análise de Sequência de DNA/métodos , Algoritmos , Genoma , Humanos , Filogenia , Alinhamento de Sequência , Processos Estocásticos
11.
Mol Biol Evol ; 26(1): 209-16, 2009 Jan.
Artigo em Inglês | MEDLINE | ID: mdl-18948299

RESUMO

Noncoding RNAs (ncRNAs) are transcripts that do not code for protein but rather function as RNA in catalytic, regulatory, or structural roles in the cell. ncRNAs are involved in universally conserved biological processes, including protein synthesis and gene regulation, and have more specific roles, such as in X-chromosome inactivation in eutherian mammals. In this paper, we propose and investigate a hypothesis for patterns of sequence selection in structurally conserved ncRNAs. Previous attempts at defining RNA selection compared rates of evolution between paired and unpaired bases with largely inconclusive results. Our approach focuses only on paired bases in ncRNAs with conserved structure. By analogy to the different properties of codon positions based on the genetic code, we use a well-developed energy model for RNA structure to classify stem positions into structural classes and argue that they are under different selective constraints. We validate the hypothesis on several RNA families and use simulated data to verify the evolutionary origin of signals. Our class labeling is shown to be a better model of ncRNA evolution than the tradition of treating stem positions equally. As well as providing a better understanding of RNA evolution, the evolutionary footprint we identify can easily be incorporated into gene finders to improve their specificity.


Assuntos
Evolução Molecular , RNA não Traduzido/genética , Animais , Sequência de Bases , Galinhas , Cães , Humanos , Camundongos , Coelhos , Alinhamento de Sequência
12.
Bioinformatics ; 24(20): 2403-4, 2008 Oct 15.
Artigo em Inglês | MEDLINE | ID: mdl-18753153

RESUMO

MOTIVATION: Bayesian analysis is one of the most popular methods in phylogenetic inference. The most commonly used methods fix a single multiple alignment and consider only substitutions as phylogenetically informative mutations, though alignments and phylogenies should be inferred jointly as insertions and deletions also carry informative signals. Methods addressing these issues have been developed only recently and there has not been so far a user-friendly program with a graphical interface that implements these methods. RESULTS: We have developed an extendable software package in the Java programming language that samples from the joint posterior distribution of phylogenies, alignments and evolutionary parameters by applying the Markov chain Monte Carlo method. The package also offers tools for efficient on-the-fly summarization of the results. It has a graphical interface to configure, start and supervise the analysis, to track the status of the Markov chain and to save the results. The background model for insertions and deletions can be combined with any substitution model. It is easy to add new substitution models to the software package as plugins. The samples from the Markov chain can be summarized in several ways, and new postprocessing plugins may also be installed.


Assuntos
Evolução Molecular , Alinhamento de Sequência/métodos , Software , Teorema de Bayes , Cadeias de Markov , Filogenia
13.
Genetics ; 177(4): 2151-60, 2007 Dec.
Artigo em Inglês | MEDLINE | ID: mdl-17947442

RESUMO

Coalescent theory deals with the dynamics of how sampled genetic material has spread through a population from a single ancestor over many generations and is ubiquitous in contemporary molecular population genetics. Inherent in most applications is a continuous-time approximation that is derived under the assumption that sample size is small relative to the actual population size. In effect, this precludes multiple and simultaneous coalescent events that take place in the history of large samples. If sequences do not recombine, the number of sequences ancestral to a large sample is reduced sufficiently after relatively few generations such that use of the continuous-time approximation is justified. However, in tracing the history of large chromosomal segments, a large recombination rate per generation will consistently maintain a large number of ancestors. This can create a major disparity between discrete-time and continuous-time models and we analyze its importance, illustrated with model parameters typical of the human genome. The presence of gene conversion exacerbates the disparity and could seriously undermine applications of coalescent theory to complete genomes. However, we show that multiple and simultaneous coalescent events influence global quantities, such as total number of ancestors, but have negligible effect on local quantities, such as linkage disequilibrium. Reassuringly, most applications of the coalescent model with recombination (including association mapping) focus on local quantities.


Assuntos
Genética Populacional/métodos , Modelos Genéticos , Conversão Gênica , Genoma Humano , Humanos , Recombinação Genética , Tempo
14.
Artigo em Inglês | MEDLINE | ID: mdl-17048462

RESUMO

Given a set D of input sequences, a genealogy for D can be constructed backward in time using such evolutionary events as mutation, coalescent, and recombination. An ancestral configuration (AC) can be regarded as the multiset of all sequences present at a particular point in time in a possible genealogy for D. The complexity of computing the likelihood of observing D depends heavily on the total number of distinct ACs of D and, therefore, it is of interest to estimate that number. For D consisting of binary sequences of finite length, we consider the problem of enumerating exactly all distinct ACs. We assume that the root sequence type is known and that the mutation process is governed by the infinite-sites model. When there is no recombination, we construct a general method of obtaining closed-form formulas for the total number of ACs. The enumeration problem becomes much more complicated when recombination is involved. In that case, we devise a method of enumeration based on counting contingency tables and construct a dynamic programming algorithm for the approach. Last, we describe a method of counting the number of ACs that can appear in genealogies with less than or equal to a given number R of recombinations. Of particular interest is the case in which R is close to the minimum number of recombinations for D.


Assuntos
Evolução Biológica , Mapeamento Cromossômico/métodos , Evolução Molecular , Genética Populacional , Modelos Genéticos , Linhagem , Análise de Sequência de DNA/métodos , Variação Genética/genética , Modelos Estatísticos , Filogenia , Tamanho da Amostra , Alinhamento de Sequência/métodos
15.
Nucleic Acids Res ; 33(Web Server issue): W650-3, 2005 Jul 01.
Artigo em Inglês | MEDLINE | ID: mdl-15980555

RESUMO

Foldalign is a Sankoff-based algorithm for making structural alignments of RNA sequences. Here, we present a web server for making pairwise alignments between two RNA sequences, using the recently updated version of foldalign. The server can be used to scan two sequences for a common structural RNA motif of limited size, or the entire sequences can be aligned locally or globally. The web server offers a graphical interface, which makes it simple to make alignments and manually browse the results. The web server can be accessed at http://foldalign.kvl.dk.


Assuntos
RNA/química , Alinhamento de Sequência/métodos , Análise de Sequência de RNA/métodos , Software , Algoritmos , Gráficos por Computador , Internet , Conformação de Ácido Nucleico , Interface Usuário-Computador
16.
Bioinformatics ; 21(9): 1815-24, 2005 May 01.
Artigo em Inglês | MEDLINE | ID: mdl-15657094

RESUMO

MOTIVATION: Searching for non-coding RNA (ncRNA) genes and structural RNA elements (eleRNA) are major challenges in gene finding today as these often are conserved in structure rather than in sequence. Even though the number of available methods is growing, it is still of interest to pairwise detect two genes with low sequence similarity, where the genes are part of a larger genomic region. RESULTS: Here we present such an approach for pairwise local alignment which is based on foldalign and the Sankoff algorithm for simultaneous structural alignment of multiple sequences. We include the ability to conduct mutual scans of two sequences of arbitrary length while searching for common local structural motifs of some maximum length. This drastically reduces the complexity of the algorithm. The scoring scheme includes structural parameters corresponding to those available for free energy as well as for substitution matrices similar to RIBOSUM. The new foldalign implementation is tested on a dataset where the ncRNAs and eleRNAs have sequence similarity <40% and where the ncRNAs and eleRNAs are energetically indistinguishable from the surrounding genomic sequence context. The method is tested in two ways: (1) its ability to find the common structure between the genes only and (2) its ability to locate ncRNAs and eleRNAs in a genomic context. In case (1), it makes sense to compare with methods like Dynalign, and the performances are very similar, but foldalign is substantially faster. The structure prediction performance for a family is typically around 0.7 using Matthews correlation coefficient. In case (2), the algorithm is successful at locating RNA families with an average sensitivity of 0.8 and a positive predictive value of 0.9 using a BLAST-like hit selection scheme. AVAILABILITY: The program is available online at http://foldalign.kvl.dk/


Assuntos
Algoritmos , RNA não Traduzido/genética , Alinhamento de Sequência/métodos , Análise de Sequência de RNA/métodos , Software , Sequência Conservada , Conformação de Ácido Nucleico , Homologia de Sequência do Ácido Nucleico
SELEÇÃO DE REFERÊNCIAS
DETALHE DA PESQUISA