1.
Methods Mol Biol ; 2231: 147-162, 2021.
Article in English | MEDLINE | ID: mdl-33289892

ABSTRACT

Large-scale multigene datasets used in phylogenomics and comparative genomics often contain sequence errors inherited from source genomes and transcriptomes. These errors typically manifest as stretches of non-homologous characters and derive from sequencing, assembly, and/or annotation errors. The lack of automatic tools to detect and remove sequence errors leads to the propagation of these errors in large-scale datasets. PREQUAL is a command line tool that identifies and masks regions with non-homologous adjacent characters in sets of unaligned homologous sequences. PREQUAL uses a full probabilistic approach based on pair hidden Markov models. On the front end, PREQUAL is user-friendly and simple to use while also allowing full customization to adjust filtering sensitivity. It is primarily aimed at amino acid sequences but can handle protein-coding nucleotide sequences. PREQUAL is computationally efficient and shows high sensitivity and accuracy. In this chapter, we briefly introduce the motivation for PREQUAL and its underlying methodology, followed by a description of basic and advanced usage, and conclude with some notes and recommendations. PREQUAL fills an important gap in the current bioinformatics tool kit for phylogenomics, contributing toward increased accuracy and reproducibility in future studies.


Subjects
Computational Biology/methods; Genomics/methods; Markov Chains; Sequence Analysis, DNA/methods; Software; Algorithms; Models, Statistical; Phylogeny; Reproducibility of Results; Sensitivity and Specificity; Sequence Homology
2.
Syst Biol ; 69(5): 863-883, 2020 09 01.
Article in English | MEDLINE | ID: mdl-31985800

ABSTRACT

In recent years, there has been controversy over whether multidimensional data such as geometric morphometric data or information on gene expression can be used for estimating phylogenies. This study uses simulations of evolution in multidimensional phenotype spaces to address this question and to identify specific factors that are important for answering it. Most of the simulations use phylogenies with four taxa, so that there are just three possible unrooted trees and the effect of different combinations of branch lengths can be studied systematically. In a comparison of methods, squared-change parsimony performed about as well as maximum likelihood, and both methods outperformed Wagner and Euclidean parsimony, neighbor-joining and UPGMA. Under an evolutionary model of isotropic Brownian motion, phylogeny can be estimated reliably if dimensionality is high, even with relatively unfavorable combinations of branch lengths. By contrast, if there is phenotypic integration such that most variation is concentrated in one or a few dimensions, the reliability of phylogenetic estimates is severely reduced. Evolutionary models with stabilizing selection also produce highly unreliable estimates, which are little better than picking a phylogenetic tree at random. To examine how these results apply to phylogenies with more than four taxa, we conducted further simulations with up to eight taxa, which indicated that the effects of dimensionality and phenotypic integration extend to more than four taxa, and that convergence among internal nodes may produce additional complications specifically for greater numbers of taxa. Overall, the simulations suggest that multidimensional data, under evolutionary models that are plausible for biological data, do not produce reliable estimates of phylogeny. [Brownian motion; gene expression data; geometric morphometrics; morphological integration; squared-change parsimony; phylogeny; shape; stabilizing selection.]


Subjects
Classification/methods; Models, Biological; Phylogeny; Computer Simulation
3.
Mol Biol Evol ; 36(10): 2340-2351, 2019 10 01.
Article in English | MEDLINE | ID: mdl-31209473

ABSTRACT

Multiple sequence alignment (MSA) is ubiquitous in evolution and bioinformatics. MSAs are usually taken to be a known and fixed quantity on which to perform downstream analysis despite extensive evidence that MSA accuracy and uncertainty affect results. These errors are known to cause a wide range of problems for downstream evolutionary inference, ranging from false inference of positive selection to long branch attraction artifacts. The most popular approach to dealing with this problem is to remove (filter) specific columns in the MSA that are thought to be prone to error. Although popular, this approach has had mixed success and several studies have even suggested that filtering might be detrimental to phylogenetic studies. We present a graph-based clustering method to address MSA uncertainty and error in the software Divvier (available at https://github.com/simonwhelan/Divvier), which uses a probabilistic model to identify clusters of characters that have strong statistical evidence of shared homology. These clusters can then be used to either filter characters from the MSA (partial filtering) or represent each of the clusters in a new column (divvying). We validate Divvier through its performance on real and simulated benchmarks, finding that Divvier substantially outperforms existing filtering software by retaining more true pairwise homology calls and removing more false positive pairwise homologies. We also find that Divvier, in contrast to other filtering tools, can alleviate long branch attraction artifacts induced by MSA and reduces the variation in tree estimates caused by MSA uncertainty.


Subjects
Sequence Alignment/methods; Sequence Homology; Animals; Birds/genetics
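
The idea of clustering the characters of one alignment column by pairwise homology support can be illustrated with a minimal sketch. This is not Divvier's algorithm: the abstract does not describe the clustering procedure, so the sketch simply assumes that pairwise homology probabilities are already available and links characters whose probability reaches a hypothetical threshold (connected components via union-find). The names `cluster_column`, `pair_prob` and `threshold` are illustrative only.

```python
def cluster_column(residues, pair_prob, threshold=0.8):
    """Group the characters of one column into clusters.

    residues:  list of character labels (here, one per sequence)
    pair_prob: dict mapping (label_a, label_b) to a homology probability
               (assumed to come from some probabilistic model)
    threshold: hypothetical cut-off for "strong" evidence of homology
    """
    parent = {r: r for r in residues}

    def find(x):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Link every pair whose support reaches the threshold.
    for (a, b), p in pair_prob.items():
        if p >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for r in residues:
        clusters.setdefault(find(r), []).append(r)
    return sorted(clusters.values())
```

Under this sketch, each returned cluster could either be written to its own column (in the spirit of divvying) or used to decide which characters to drop (in the spirit of partial filtering).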
4.
Mol Biol Evol ; 36(4): 679-690, 2019 04 01.
Article in English | MEDLINE | ID: mdl-30668757

ABSTRACT

Substitutions between chemically distant amino acids are known to occur less frequently than those between more similar amino acids. This knowledge, however, is not reflected in most codon substitution models, which treat all nonsynonymous changes as if they were equivalent in terms of impact on the protein. A variety of methods for integrating chemical distances into models have been proposed, with a common approach being to divide substitutions into radical or conservative categories. Nevertheless, it remains unclear whether the resulting models describe sequence evolution better than their simpler counterparts. We propose a parametric codon model that distinguishes between radical and conservative substitutions, allowing us to assess if radical substitutions are preferentially removed by selection. Applying our new model to a range of phylogenomic data, we find differentiating between radical and conservative substitutions provides significantly better fit for large populations, but see no equivalent improvement for smaller populations. Comparing codon and amino acid models using these same data shows that alignments from large populations tend to select phylogenetic models containing information about amino acid exchangeabilities, whereas the structure of the genetic code is more important for smaller populations. Our results suggest selection against radical substitutions is, on average, more pronounced in large populations than smaller ones. The reduced observable effect of selection in smaller populations may be due to stronger genetic drift making it more challenging to detect preferences. Our results imply an important connection between the life history of a phylogenetic group and the model that best describes its evolution.


Subjects
Amino Acid Substitution; Amino Acids/chemistry; Evolution, Molecular; Models, Genetic; Selection, Genetic; Amino Acids/genetics; Animals; Population Density
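
The division of amino acid replacements into radical and conservative categories can be sketched as follows. The abstract does not say which chemical partition the authors use, so the charge/polarity grouping below is an assumption chosen purely for illustration, and the function names are hypothetical.

```python
# Hypothetical partition of the 20 amino acids by charge and polarity.
# The paper's own partition is not given in the abstract.
PROPERTY_CLASS = {
    **dict.fromkeys("RHK", "positive"),
    **dict.fromkeys("DE", "negative"),
    **dict.fromkeys("STNQCY", "polar"),
    **dict.fromkeys("GAVLIMFWP", "nonpolar"),
}


def classify_substitution(a, b):
    """Label a replacement as identical, conservative (same class) or radical."""
    if a == b:
        return "identical"
    same_class = PROPERTY_CLASS[a] == PROPERTY_CLASS[b]
    return "conservative" if same_class else "radical"


def count_changes(seq1, seq2):
    """Tally the categories over two aligned protein sequences, skipping gaps."""
    counts = {"identical": 0, "conservative": 0, "radical": 0}
    for a, b in zip(seq1, seq2):
        if a in PROPERTY_CLASS and b in PROPERTY_CLASS:
            counts[classify_substitution(a, b)] += 1
    return counts
```

A codon model of the kind described would attach separate selection parameters to the two categories; the sketch only shows the classification step.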
5.
Bioinformatics ; 34(22): 3929-3930, 2018 11 15.
Article in English | MEDLINE | ID: mdl-29868763

ABSTRACT

Summary: Phylogenomic datasets invariably contain undetected stretches of non-homologous characters due to poor-quality sequences or erroneous gene models. The large-scale multi-gene nature of these datasets renders detailed manual curation of sequences impractical or impossible, but few tools exist that can automate this task. To address this issue, we developed a new method that takes as input a set of unaligned homologous sequences and uses an explicit probabilistic approach to identify and mask regions with non-homologous adjacent characters. These regions are defined as sharing no statistical support for homology with any other sequence in the set, which can result from e.g. sequencing errors or gene prediction errors creating frameshifts. Our methodology is implemented in the program PREQUAL, which is a fast and accurate tool for high-throughput filtering of sequences. The program is primarily aimed at amino acid sequences, although it can handle protein coding DNA sequences as well. It is fully customizable to allow fine-tuning of the filtering sensitivity. Availability and implementation: The program PREQUAL is written in C/C++ and available under the GNU GPL v3.0 at https://github.com/simonwhelan/prequal. Supplementary information: Supplementary data are available at Bioinformatics online.


Subjects
Amino Acid Sequence; Sequence Homology; Software; Computational Biology; Phylogeny
6.
Methods Mol Biol ; 1525: 349-377, 2017.
Article in English | MEDLINE | ID: mdl-27896728

ABSTRACT

Molecular evolution can reveal the relationship between sets of homologous sequences and the patterns of change that occur during their evolution. An important aspect of these studies is the inference of a phylogenetic tree, which explicitly describes evolutionary relationships between homologous sequences. This chapter provides an introduction to evolutionary trees and how to infer them from sequence data using some commonly used inferential methodology. It focuses on statistical methods for inferring trees and how to assess the confidence one should have in any resulting tree, with a particular emphasis on the underlying assumptions of the methods and how they might affect the tree estimate. There is also some discussion of the underlying algorithms used to perform tree search and recommendations regarding the performance of different algorithms. Finally, there are a few practical guidelines, including how to combine multiple software packages to improve inference, and a comparison between Bayesian and Maximum likelihood phylogenetics.


Subjects
Algorithms; Computational Biology/methods; Bayes Theorem; Evolution, Molecular; Likelihood Functions; Models, Genetic; Phylogeny
7.
Syst Biol ; 66(2): 218-231, 2017 Mar 01.
Article in English | MEDLINE | ID: mdl-27633353

ABSTRACT

Phylogenetic tree inference is a critical component of many systematic and evolutionary studies. The majority of these studies are based on the two-step process of multiple sequence alignment followed by tree inference, despite persistent evidence that the alignment step can lead to biased results. Here we present a two-part study that first presents PaHMM-Tree, a novel neighbor joining-based method that estimates pairwise distances without assuming a single alignment. We then use simulations to benchmark its performance against a wide range of other phylogenetic tree inference methods, including the first comparison of alignment-free distance-based methods against more conventional tree estimation methods. Our new method for calculating pairwise distances based on statistical alignment provides distance estimates that are as accurate as those obtained using standard methods based on the true alignment. Pairwise distance estimates based on the two-step process tend to be substantially less accurate. This improved performance carries through to tree inference, where PaHMM-Tree provides more accurate tree estimates than all of the pairwise distance methods assessed. For close to moderately divergent sequence data we find that the two-step methods using statistical inference, where information from all sequences is included in the estimation procedure, tend to perform better than PaHMM-Tree, particularly full statistical alignment, which simultaneously estimates both the tree and the alignment. For deep divergences we find the alignment step becomes so prone to error that our distance-based PaHMM-Tree outperforms all other methods of tree inference. Finally, we find that the accuracy of alignment-free methods tends to decline faster than standard two-step methods in the presence of alignment uncertainty, and identify no conditions where alignment-free methods are equal to or more accurate than standard phylogenetic methods even in the presence of substantial alignment error. [Alignment-free; distance-based phylogenetics; pair Hidden Markov Models; phylogenetic inference; statistical alignment.]


Subjects
Classification/methods; Models, Genetic; Phylogeny; Sequence Alignment; Algorithms; Benchmarking; Biological Evolution; Evolution, Molecular
8.
Genome Biol Evol ; 7(8): 2102-16, 2015 Jul 01.
Article in English | MEDLINE | ID: mdl-26139831

ABSTRACT

Evolutionary studies usually use a two-step process to investigate sequence data. Step one estimates a multiple sequence alignment (MSA) and step two applies phylogenetic methods to ask evolutionary questions of that MSA. Modern phylogenetic methods infer evolutionary parameters using maximum likelihood or Bayesian inference, mediated by a probabilistic substitution model that describes sequence change over a tree. The statistical properties of these methods mean that more data directly translates to an increased confidence in downstream results, providing the substitution model is adequate and the MSA is correct. Many studies have investigated the robustness of phylogenetic methods in the presence of substitution model misspecification, but few have examined the statistical properties of those methods when the MSA is unknown. This simulation study examines the statistical properties of the complete two-step process when inferring sequence divergence and the phylogenetic tree topology. Both nucleotide and amino acid analyses are negatively affected by the alignment step, both through inaccurate guide tree estimates and through overfitting to that guide tree. For many alignment tools these effects become more pronounced when additional sequences are added to the analysis. Nucleotide sequences are particularly susceptible, with MSA errors leading to statistical support for long-branch attraction artifacts, which are usually associated with gross substitution model misspecification. Amino acid MSAs are more robust, but do tend to arbitrarily resolve multifurcations in favor of the guide tree. No inference strategies produce consistently accurate estimates of divergence between sequences, although amino acid MSAs are again more accurate than their nucleotide counterparts. We conclude with some practical suggestions about how to limit the effect of MSA uncertainty on evolutionary inference.


Subjects
Phylogeny; Sequence Alignment/methods; Artifacts; Models, Statistical; Uncertainty
9.
Mol Biol Evol ; 32(9): 2456-68, 2015 Sep.
Article in English | MEDLINE | ID: mdl-25944916

ABSTRACT

Recent developments in the analysis of amino acid covariation are leading to breakthroughs in protein structure prediction, protein design, and prediction of the interactome. It is assumed that observed patterns of covariation are caused by molecular coevolution, where substitutions at one site affect the evolutionary forces acting at neighboring sites. Our theoretical and empirical results cast doubt on this assumption. We demonstrate that the strongest coevolutionary signal is a decrease in evolutionary rate and that unfeasibly long times are required to produce coordinated substitutions. We find that covarying substitutions are mostly found on different branches of the phylogenetic tree, indicating that they are independent events that may or may not be attributable to coevolution. These observations undermine the hypothesis that molecular coevolution is the primary cause of the covariation signal. In contrast, we find that the pairs of residues with the strongest covariation signal tend to have low evolutionary rates, and that it is this low rate that gives rise to the covariation signal. Slowly evolving residue pairs are disproportionately located in the protein's core, which explains covariation methods' ability to detect pairs of residues that are close in three dimensions. These observations lead us to propose the "coevolution paradox": The strength of coevolution required to cause coordinated changes means the evolutionary rate is so low that such changes are highly unlikely to occur. As modern covariation methods may lead to breakthroughs in structural genomics, it is critical to recognize their biases and limitations.


Subjects
Evolution, Molecular; Markov Chains; Models, Genetic; Mutation Rate; Phylogeny; Protein Folding; Proteins/genetics
10.
Bioinformatics ; 31(14): 2391-3, 2015 Jul 15.
Article in English | MEDLINE | ID: mdl-25725494

ABSTRACT

Phylogenetic models are an important tool in molecular evolution allowing us to study the pattern and rate of sequence change. The recent influx of new sequence data in the biosciences means that to address evolutionary questions, we need a means for rapid and easy model development and implementation. Here we present GeLL, a Java library that lets users use text to quickly and efficiently define novel forms of discrete data and create new substitution models that describe how those data change on a phylogeny. GeLL allows users to define general substitution models and data structures in a way that is not possible in other existing libraries, including mixture models and non-reversible models. Classes are provided for calculating likelihoods, optimizing model parameters and branch lengths, ancestral reconstruction and sequence simulation. AVAILABILITY AND IMPLEMENTATION: http://phylo.bio.ku.edu/GeLL under a GPL v3 license.


Subjects
Phylogeny; Software; Likelihood Functions; Models, Genetic
11.
Syst Biol ; 64(1): 42-55, 2015 Jan.
Article in English | MEDLINE | ID: mdl-25209223

ABSTRACT

Molecular phylogenetics is a powerful tool for inferring both the process and pattern of evolution from genomic sequence data. Statistical approaches, such as maximum likelihood and Bayesian inference, are now established as the preferred methods of inference. The choice of models that a researcher uses for inference is of critical importance, and there are established methods for model selection conditioned on a particular type of data, such as nucleotides, amino acids, or codons. A major limitation of existing model selection approaches is that they can only compare models acting upon a single type of data. Here, we extend model selection to allow comparisons between models describing different types of data by introducing the idea of adapter functions, which project aggregated models onto the originally observed sequence data. These projections are implemented in the program ModelOMatic and used to perform model selection on 3722 families from the PANDIT database, 68 genes from an arthropod phylogenomic data set, and 248 genes from a vertebrate phylogenomic data set. For the PANDIT and arthropod data, we find that amino acid models are selected for the overwhelming majority of alignments; with progressively smaller numbers of alignments selecting codon and nucleotide models, and no families selecting RY-based models. In contrast, nearly all alignments from the vertebrate data set select codon-based models. The sequence divergence, the number of sequences, and the degree of selection acting upon the protein sequences may contribute to explaining this variation in model selection. Our ModelOMatic program is fast, with most families from PANDIT taking fewer than 150 s to complete, and should therefore be easily incorporated into existing phylogenetic pipelines. ModelOMatic is available at https://code.google.com/p/modelomatic/.


Subjects
Classification/methods; Models, Biological; Phylogeny; Amino Acids/genetics; Animals; Codon/genetics; Nucleotides/genetics; Software
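
Once likelihoods from different data types have been projected onto the same observed sequences, ranking the candidate models is a routine information-criterion calculation. The sketch below assumes the Akaike information criterion (AIC); the abstract does not name the criterion ModelOMatic uses, and the adapter functions themselves are not reproduced here. The function names and the dictionary layout are illustrative assumptions.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 lnL (lower is better)."""
    return 2.0 * n_params - 2.0 * log_likelihood


def select_model(candidates):
    """Rank models that share the same observed data.

    candidates: dict mapping a model name to (lnL, number of free parameters).
                The lnL values are assumed to be already comparable, i.e.
                projected onto the original sequence data.
    Returns the best model name and each model's AIC difference from it.
    """
    scores = {name: aic(lnl, k) for name, (lnl, k) in candidates.items()}
    best = min(scores, key=scores.get)
    deltas = {name: score - scores[best] for name, score in scores.items()}
    return best, deltas
```

For example, calling `select_model` with made-up entries such as `{"nucleotide": (-5200.0, 10), "codon": (-5150.0, 12)}` returns the name with the lowest AIC together with the AIC differences.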
12.
Mol Phylogenet Evol ; 76: 102-9, 2014 Jul.
Article in English | MEDLINE | ID: mdl-24631855

ABSTRACT

Deep coalescence and the nongenealogical pattern of descent caused by recombination have emerged as a common problem for phylogenetic inference at the species level. Here we use computer simulations to assess whether AFLP-based phylogenies are robust to the uncertainties introduced by these factors. Our results indicate that phylogenetic signal can prevail even in the face of extensive deep coalescence, allowing recovery of the correct species tree topology. The impact of recombination on tree accuracy was related to total tree depth and species effective population size. The correct tree topology could be recovered under many simulation settings due to a trade-off between the conflicting signals resulting from intra-locus recombination and the benefits of the joint consideration of unlinked loci that better matched overall the true species tree. Errors in tree topology were not only determined by deep coalescence, but also by the timing of divergence and the tree-building errors arising from an insufficient number of characters. DNA sequences generally outperformed AFLPs under every simulated scenario, but this difference in performance was nearly negligible when a sufficient number of AFLP characters were sampled. Our simulations suggest that the impact of deep coalescence and intra-locus recombination on the reliability of AFLP trees could be minimal for effective population sizes equal to or lower than 10,000 (typical of many vertebrates and tree plants) given tree depths above 0.02 substitutions per site.


Subjects
Amplified Fragment Length Polymorphism Analysis/methods; Phylogeny; Recombination, Genetic; Animals; Base Sequence; Computer Simulation; Models, Genetic; Reproducibility of Results; Sequence Analysis, DNA
13.
Genome Biol Evol ; 6(1): 65-75, 2014 Jan.
Article in English | MEDLINE | ID: mdl-24391153

ABSTRACT

Phylogenetic inference is widely used to investigate the relationships between homologous sequences. RNA molecules have played a key role in these studies because they are present throughout life and tend to evolve slowly. Phylogenetic inference has been shown to be dependent on the substitution model used. A wide range of models have been developed to describe RNA evolution, either with 16 states describing all possible canonical base pairs or with 7 states where the 10 mismatched nucleotides are reduced to a single state. Formal model selection has become a standard practice for choosing an inferential model and works well for comparing models of a specific type, such as comparisons within nucleotide models or within amino acid models. Model selection cannot function across different sized state spaces because the likelihoods are conditioned on different data. Here, we introduce statistical state-space projection methods that allow the direct comparison of likelihoods between nucleotide models and 7-state and 16-state RNA models. To demonstrate the general applicability of our new methods, we extract 287 RNA families from genomic alignments and perform model selection. We find that in 281/287 families, RNA models are selected in preference to nucleotide models, with simple 7-state RNA models selected for more conserved families with shorter stems and more complex 16-state RNA models selected for more divergent families with longer stems. Other factors, such as the function of the RNA molecule or the GC-content, have limited impact on model selection. Our models and model selection methods are freely available in the open-source PHASE 3.0 software.


Subjects
Evolution, Molecular; Models, Genetic; RNA, Untranslated/genetics; Software; Animals; Base Composition; Base Pair Mismatch; Humans; Phylogeny; RNA, Untranslated/chemistry
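
The relationship between the 16-state and 7-state paired-site alphabets can be made concrete with a short sketch. It assumes that the six pairs kept as separate states are the Watson-Crick pairs plus the G-U wobble pairs, which is consistent with the abstract's count of 10 mismatches collapsed into one state; the names below are illustrative and are not taken from PHASE.

```python
from itertools import product

BASES = "ACGU"

# Assumed set of pairs kept as distinct states: Watson-Crick plus G-U wobble.
CANONICAL = {"AU", "UA", "GC", "CG", "GU", "UG"}

# All 16 ordered pairs of bases at two paired stem positions.
ALL_PAIRS = ["".join(p) for p in product(BASES, repeat=2)]


def to_seven_state(pair):
    """Project a 16-state pair onto the 7-state alphabet ('MM' = any mismatch)."""
    return pair if pair in CANONICAL else "MM"
```

Because the two alphabets describe the same sites with different numbers of states, their likelihoods are not directly comparable, which is the problem the state-space projection in the abstract is designed to address.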
14.
Mol Biol Evol ; 30(3): 642-53, 2013 Mar.
Article in English | MEDLINE | ID: mdl-23144040

ABSTRACT

Multiple sequence alignment (MSA) is the heart of comparative sequence analysis. Recent studies demonstrate that MSA algorithms can produce different outcomes when analyzing genomes, including phylogenetic tree inference and the detection of adaptive evolution. These studies also suggest that the difference between MSA algorithms is of a similar order to the uncertainty within an algorithm and suggest integrating across this uncertainty. In this study, we examine further the problem of disagreements between MSA algorithms and how they affect downstream analyses. We also investigate whether integrating across alignment uncertainty affects downstream analyses. We address these questions by analyzing 200 chordate gene families, with properties reflecting those used in large-scale genomic analyses. We find that newly developed distance metrics reveal two significantly different classes of MSA methods (MSAMs). The similarity-based class includes progressive aligners and consistency aligners, representing many methodological innovations for sequence alignment, whereas the evolution-based class includes phylogenetically aware alignment and statistical alignment. We proceed to show that the class of an MSAM has a substantial impact on downstream analyses. For phylogenetic inference, tree estimates and their branch lengths appear highly dependent on the class of aligner used. The number of families, and the sites within those families, inferred to have undergone adaptive evolution depend on the class of aligner used. Similarity-based aligners tend to identify more adaptive evolution. We also develop and test methods for incorporating MSA uncertainty when detecting adaptive evolution but find that although accounting for MSA uncertainty does affect downstream analyses, it appears less important than the class of aligner chosen. Our results demonstrate the critical role that MSA methodology plays in downstream analysis, highlighting that the class of aligner chosen in an analysis has a demonstrable effect on its outcome.


Subjects
Algorithms; Models, Genetic; Sequence Alignment/methods; Adaptation, Biological/genetics; Bayes Theorem; Evolution, Molecular; Genome, Human; Humans; Likelihood Functions; Markov Chains; Monte Carlo Method; Phylogeny; Selection, Genetic; Sequence Analysis, DNA/methods
15.
Front Plant Sci ; 3: 1, 2012.
Article in English | MEDLINE | ID: mdl-22645563

ABSTRACT

Cation transport is a critical process in all organisms and is essential for mineral nutrition, ion stress tolerance, and signal transduction. Transporters that are members of the Ca(2+)/cation antiporter (CaCA) superfamily are involved in the transport of Ca(2+) and/or other cations using the counter exchange of another ion such as H(+) or Na(+). The CaCA superfamily has been previously divided into five transporter families: the YRBG, Na(+)/Ca(2+) exchanger (NCX), Na(+)/Ca(2+), K(+) exchanger (NCKX), H(+)/cation exchanger (CAX), and cation/Ca(2+) exchanger (CCX) families, which include the well-characterized NCX and CAX transporters. To examine the evolution of CaCA transporters within higher plants and the green plant lineage, CaCA genes were identified from the genomes of sequenced flowering plants, a bryophyte, lycophyte, and freshwater and marine algae, and compared with those from non-plant species. We found evidence of the expansion and increased diversity of flowering plant genes within the CAX and CCX families. Genes related to the NCX family are present in land plants, though they encode distinct MHX homologs which probably have an altered transport function. In contrast, the NCX and NCKX genes which are absent in land plants have been retained in many species of algae, especially the marine algae, indicating that these organisms may share "animal-like" characteristics of Ca(2+) homeostasis and signaling. A group of genes encoding novel CAX-like proteins containing an EF-hand domain were identified from plants and selected algae but appeared to be lacking in any other species. Lack of functional data for most of the CaCA proteins makes it impossible to reliably predict substrate specificity and function for many of the groups or individual proteins. The abundance and diversity of CaCA genes throughout all branches of life indicates the importance of this class of cation transporter, and that many transporters with novel functions are waiting to be discovered.

16.
Protein Sci ; 21(6): 769-85, 2012 Jun.
Article in English | MEDLINE | ID: mdl-22528593

ABSTRACT

The interface of protein structural biology, protein biophysics, molecular evolution, and molecular population genetics forms the foundations for a mechanistic understanding of many aspects of protein biochemistry. Current efforts in interdisciplinary protein modeling are in their infancy and the state of the art of such models is described. Beyond the relationship between amino acid substitution and static protein structure, protein function, and corresponding organismal fitness, other considerations are also discussed. More complex mutational processes such as insertion and deletion and domain rearrangements and even circular permutations should be evaluated. The role of intrinsically disordered proteins is still controversial, but may be increasingly important to consider. Protein geometry and protein dynamics as a deviation from static considerations of protein structure are also important. Protein expression level is known to be a major determinant of evolutionary rate and several considerations including selection at the mRNA level and the role of interaction specificity are discussed. Lastly, the relationship between modeling and needed high-throughput experimental data as well as experimental examination of protein evolution using ancestral sequence resurrection and in vitro biochemistry are presented, towards an aim of ultimately generating better models for biological inference and prediction.


Subjects
Evolution, Molecular; Proteins/chemistry; Proteins/genetics; Amino Acid Sequence; Animals; Humans; Models, Molecular; Molecular Sequence Data; Protein Conformation; Protein Folding; RNA, Messenger/genetics; Sequence Alignment
17.
Bioinformatics ; 28(4): 495-502, 2012 Feb 15.
Article in English | MEDLINE | ID: mdl-22199391

ABSTRACT

MOTIVATION: Multiple sequence alignment (MSA) is a core method in bioinformatics. The accuracy of such alignments may influence the success of downstream analyses such as phylogenetic inference, protein structure prediction, and functional prediction. The importance of MSA has led to the proliferation of MSA methods, with different objective functions and heuristics to search for the optimal MSA. Different methods of inferring MSAs produce different results in all but the most trivial cases. By measuring the differences between inferred alignments, we may be able to develop an understanding of how these differences (i) relate to the objective functions and heuristics used in MSA methods, and (ii) affect downstream analyses. RESULTS: We introduce four metrics to compare MSAs, which include the position in a sequence where a gap occurs or the location on a phylogenetic tree where an insertion or deletion (indel) event occurs. We use both real and synthetic data to explore the information given by these metrics and demonstrate how the different metrics in combination can yield more information about MSA methods and the differences between them. AVAILABILITY: MetAl is a free software implementation of these metrics in Haskell. Source and binaries for Windows, Linux and Mac OS X are available from http://kumiho.smith.man.ac.uk/whelan/software/metal/.


Subjects
Phylogeny , Sequence Alignment/methods , Software , Computers , INDEL Mutation , Proteins/chemistry , Proteins/genetics
18.
Bioinformatics ; 28(1): 48-55, 2012 Jan 01.
Article in English | MEDLINE | ID: mdl-22039210

ABSTRACT

MOTIVATION: Recent large-scale studies of individuals within a population have demonstrated that there is widespread variation in copy number in many gene families. In addition, there is increasing evidence that the variation in gene copy number can give rise to substantial phenotypic effects. In some cases, these variations have been shown to be adaptive. These observations show that a full understanding of the evolution of biological function requires an understanding of gene gain and gene loss. Accurate, robust evolutionary models of gain and loss events are, therefore, required. RESULTS: We have developed weighted parsimony and maximum likelihood methods for inferring gain and loss events. To test these methods, we have used Markov models of gain and loss to simulate data with known properties. We examine three models: a simple birth-death model, a single rate model and a birth-death innovation model with parameters estimated from Drosophila genome data. We find that for all simulations maximum likelihood-based methods are very accurate for reconstructing the number of duplication events on the phylogenetic tree, and that maximum likelihood and weighted parsimony have similar accuracy for reconstructing the ancestral state. Our implementations are robust to different model parameters and provide accurate inferences of ancestral states and the number of gain and loss events. For ancestral reconstruction, we recommend weighted parsimony because it has similar accuracy to maximum likelihood, but is much faster. For inferring the number of individual gene loss or gain events, maximum likelihood is noticeably more accurate, albeit at greater computational cost. AVAILABILITY: www.bioinf.manchester.ac.uk/dupliphy CONTACT: simon.lovell@manchester.ac.uk; simon.whelan@manchester.ac.uk SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
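
As a rough illustration of weighted parsimony for gene gain and loss, the Python sketch below applies the standard Sankoff dynamic programme to gene copy numbers on a small rooted binary tree. It is not the DupliPHY implementation: the tree encoding (nested tuples), the function names, the cap `max_copies`, and the per-copy `gain_cost` and `loss_cost` weights are all assumptions made for the example.

```python
from typing import Dict, List, Tuple, Union

# A tree is either a leaf name or a pair of subtrees (hypothetical encoding).
Tree = Union[str, Tuple["Tree", "Tree"]]


def step_cost(parent: int, child: int, gain_cost: float, loss_cost: float) -> float:
    """Weighted cost of changing copy number from parent to child on one branch."""
    if child >= parent:
        return (child - parent) * gain_cost
    return (parent - child) * loss_cost


def sankoff_costs(tree: Tree, leaf_counts: Dict[str, int], max_copies: int,
                  gain_cost: float = 1.0, loss_cost: float = 1.0) -> List[float]:
    """Return costs[s]: minimum weighted cost of the subtree if its root has s copies."""
    states = range(max_copies + 1)
    if isinstance(tree, str):
        # A leaf is only compatible with its observed copy number.
        return [0.0 if s == leaf_counts[tree] else float("inf") for s in states]
    child_tables = [sankoff_costs(child, leaf_counts, max_copies, gain_cost, loss_cost)
                    for child in tree]
    costs = []
    for s in states:
        total = 0.0
        for table in child_tables:
            # Best child state given the parent state s, including the branch cost.
            total += min(table[t] + step_cost(s, t, gain_cost, loss_cost) for t in states)
        costs.append(total)
    return costs


# Three taxa: A and B carry two copies of the gene family, C carries one.
tree: Tree = (("A", "B"), "C")
counts = {"A": 2, "B": 2, "C": 1}
equal_weights = sankoff_costs(tree, counts, max_copies=3)
costly_gains = sankoff_costs(tree, counts, max_copies=3, gain_cost=2.0, loss_cost=1.0)
```

With equal weights, one gain on the branch leading to A and B or one loss on the branch leading to C explains the data equally well. Making gains more expensive than losses favours a two-copy ancestor with a single loss, which shows how the weights steer the ancestral reconstruction.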


Subjects
Drosophila/genetics , Evolution, Molecular , Models, Genetic , Animals , Computer Simulation , Drosophila/classification , Genome, Insect , Likelihood Functions , Markov Chains
19.
Syst Biol ; 61(2): 228-39, 2012 Mar.
Article in English | MEDLINE | ID: mdl-22076302

ABSTRACT

Phylogenetic trees are important in many areas of biological research, ranging from systematic studies to the methods used for genome annotation. Finding the best scoring tree under any optimality criterion is an NP-hard problem, which necessitates the use of heuristics for tree-search. Although tree-search plays a major role in obtaining a tree estimate, there remains a limited understanding of its characteristics and how the elements of the statistical inferential procedure interact with the algorithms used. This study begins to answer some of these questions through a detailed examination of maximum likelihood tree-search on a wide range of real genome-scale data sets. We examine all 10,395 trees for each of the 106 genes of an eight-taxa yeast phylogenomic data set, then apply different tree-search algorithms to investigate their performance. We extend our findings by examining two larger genome-scale data sets and a large disparate data set that has been previously used to benchmark the performance of tree-search programs. We identify several broad trends occurring during tree-search that provide an insight into the performance of heuristics and may, in the future, aid their development. These trends include a tendency for the true maximum likelihood (best) tree to also be the shortest tree in terms of branch lengths, a weak tendency for tree-search to recover the best tree, and a tendency for tree-search to encounter fewer local optima in genes that have a high information content. When examining current heuristics for tree-search, we find that nearest-neighbor-interchange performs poorly, and frequently finds trees that are significantly different from the best tree. In contrast, subtree-pruning-and-regrafting tends to perform well, nearly always finding trees that are not significantly different to the best tree. Finally, we demonstrate that the precise implementation of a tree-search strategy, including when and where parameters are optimized, can change the character of tree-search, and that good strategies for tree-search may combine existing tree-search programs.
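
The figure of 10,395 trees for eight taxa matches the standard count of unrooted, fully resolved trees on n labelled taxa, (2n-5)!! = 3 x 5 x ... x (2n-5). The short Python sketch below reproduces that arithmetic; the function name is ours and does not come from the paper.

```python
def num_unrooted_binary_trees(n_taxa: int) -> int:
    """Count unrooted, fully resolved trees on n labelled taxa: (2n - 5)!!."""
    if n_taxa < 3:
        raise ValueError("need at least three taxa for an unrooted binary tree")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5 (empty product when n = 3).
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


eight_taxon_trees = num_unrooted_binary_trees(8)  # 3 * 5 * 7 * 9 * 11
```

The count grows super-exponentially with the number of taxa, which is why exhaustive evaluation is feasible for eight taxa but heuristics are needed for larger data sets.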


Subjects
Phylogeny , Yeasts/genetics , Algorithms , Classification/methods , Computational Biology , Genome , Likelihood Functions , Models, Biological , Yeasts/classification
20.
Mol Biol Evol ; 28(1): 449-58, 2011 Jan.
Article in English | MEDLINE | ID: mdl-20724379

ABSTRACT

There is widespread evidence of lineage-specific rate variation, known as heterotachy, during protein evolution. Changes in the structural and functional constraints acting on a protein can lead to heterotachy, and it is plausible that such changes, known as covarion shifts, may affect many amino acids at once. Several previous attempts to model heterotachy have used covarion models, where the sequence undergoes covarion drift, whereby each site may switch independently among a set of discrete classes having different substitution rates. However, such independent switching may not capture biologically important events where the selective forces acting on a protein affect many sites at once. We describe a new class of models that allow the rates of substitution and switching to vary among branches of a phylogenetic tree. Such models are better able to handle covarion shifts. We apply these models to a set of genes occurring in nonphotosynthetic bacteria, cyanobacteria, and the plastids of green and red algae. We find that 4/5 genes show evidence of some form of rate switching and that 3/5 genes show evidence that the relative switching rate differs among taxonomic groups. We conclude that covarion shifts may be frequent during the deep evolution of plastid genes and that our methodology may provide a powerful new tool for investigating such shifts in other systems.


Subjects
Biological Evolution , Genetic Variation , Models, Genetic , Phylogeny , Plastids/genetics , Algorithms , Base Sequence , Chlorophyta/cytology , Chlorophyta/genetics , Computer Simulation , Molecular Sequence Data , Proteins/chemistry , Proteins/genetics , Rhodophyta/cytology , Rhodophyta/genetics , Sequence Alignment
...