Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 20 de 34
Filtrar
1.
Syst Biol ; 69(5): 863-883, 2020 09 01.
Artículo en Inglés | MEDLINE | ID: mdl-31985800

RESUMEN

In recent years, there has been controversy whether multidimensional data such as geometric morphometric data or information on gene expression can be used for estimating phylogenies. This study uses simulations of evolution in multidimensional phenotype spaces to address this question and to identify specific factors that are important for answering it. Most of the simulations use phylogenies with four taxa, so that there are just three possible unrooted trees and the effect of different combinations of branch lengths can be studied systematically. In a comparison of methods, squared-change parsimony performed similarly well as maximum likelihood, and both methods outperformed Wagner and Euclidean parsimony, neighbor-joining and UPGMA. Under an evolutionary model of isotropic Brownian motion, phylogeny can be estimated reliably if dimensionality is high, even with relatively unfavorable combinations of branch lengths. By contrast, if there is phenotypic integration such that most variation is concentrated in one or a few dimensions, the reliability of phylogenetic estimates is severely reduced. Evolutionary models with stabilizing selection also produce highly unreliable estimates, which are little better than picking a phylogenetic tree at random. To examine how these results apply to phylogenies with more than four taxa, we conducted further simulations with up to eight taxa, which indicated that the effects of dimensionality and phenotypic integration extend to more than four taxa, and that convergence among internal nodes may produce additional complications specifically for greater numbers of taxa. Overall, the simulations suggest that multidimensional data, under evolutionary models that are plausible for biological data, do not produce reliable estimates of phylogeny. [Brownian motion; gene expression data; geometric morphometrics; morphological integration; squared-change parsimony; phylogeny; shape; stabilizing selection.].


Asunto(s)
Clasificación/métodos , Modelos Biológicos , Filogenia , Simulación por Computador
2.
Mol Biol Evol ; 36(4): 679-690, 2019 04 01.
Artículo en Inglés | MEDLINE | ID: mdl-30668757

RESUMEN

Substitutions between chemically distant amino acids are known to occur less frequently than those between more similar amino acids. This knowledge, however, is not reflected in most codon substitution models, which treat all nonsynonymous changes as if they were equivalent in terms of impact on the protein. A variety of methods for integrating chemical distances into models have been proposed, with a common approach being to divide substitutions into radical or conservative categories. Nevertheless, it remains unclear whether the resulting models describe sequence evolution better than their simpler counterparts. We propose a parametric codon model that distinguishes between radical and conservative substitutions, allowing us to assess if radical substitutions are preferentially removed by selection. Applying our new model to a range of phylogenomic data, we find differentiating between radical and conservative substitutions provides significantly better fit for large populations, but see no equivalent improvement for smaller populations. Comparing codon and amino acid models using these same data shows that alignments from large populations tend to select phylogenetic models containing information about amino acid exchangeabilities, whereas the structure of the genetic code is more important for smaller populations. Our results suggest selection against radical substitutions is, on average, more pronounced in large populations than smaller ones. The reduced observable effect of selection in smaller populations may be due to stronger genetic drift making it more challenging to detect preferences. Our results imply an important connection between the life history of a phylogenetic group and the model that best describes its evolution.


Asunto(s)
Sustitución de Aminoácidos , Aminoácidos/química , Evolución Molecular , Modelos Genéticos , Selección Genética , Aminoácidos/genética , Animales , Densidad de Población
3.
Mol Biol Evol ; 36(10): 2340-2351, 2019 10 01.
Artículo en Inglés | MEDLINE | ID: mdl-31209473

RESUMEN

Multiple sequence alignment (MSA) is ubiquitous in evolution and bioinformatics. MSAs are usually taken to be a known and fixed quantity on which to perform downstream analysis despite extensive evidence that MSA accuracy and uncertainty affect results. These errors are known to cause a wide range of problems for downstream evolutionary inference, ranging from false inference of positive selection to long branch attraction artifacts. The most popular approach to dealing with this problem is to remove (filter) specific columns in the MSA that are thought to be prone to error. Although popular, this approach has had mixed success and several studies have even suggested that filtering might be detrimental to phylogenetic studies. We present a graph-based clustering method to address MSA uncertainty and error in the software Divvier (available at https://github.com/simonwhelan/Divvier), which uses a probabilistic model to identify clusters of characters that have strong statistical evidence of shared homology. These clusters can then be used to either filter characters from the MSA (partial filtering) or represent each of the clusters in a new column (divvying). We validate Divvier through its performance on real and simulated benchmarks, finding Divvier substantially outperforms existing filtering software by retaining more true pairwise homologies calls and removing more false positive pairwise homologies. We also find that Divvier, in contrast to other filtering tools, can alleviate long branch attraction artifacts induced by MSA and reduces the variation in tree estimates caused by MSA uncertainty.


Asunto(s)
Alineación de Secuencia/métodos , Homología de Secuencia , Animales , Aves/genética
4.
Bioinformatics ; 34(22): 3929-3930, 2018 11 15.
Artículo en Inglés | MEDLINE | ID: mdl-29868763

RESUMEN

Summary: Phylogenomic datasets invariably contain undetected stretches of non-homologous characters due to poor-quality sequences or erroneous gene models. The large-scale multi-gene nature of these datasets renders impractical or impossible detailed manual curation of sequences, but few tools exist that can automate this task. To address this issue, we developed a new method that takes as input a set of unaligned homologous sequences and uses an explicit probabilistic approach to identify and mask regions with non-homologous adjacent characters. These regions are defined as sharing no statistical support for homology with any other sequence in the set, which can result from e.g. sequencing errors or gene prediction errors creating frameshifts. Our methodology is implemented in the program PREQUAL, which is a fast and accurate tool for high-throughput filtering of sequences. The program is primarily aimed at amino acid sequences, although it can handle protein coding DNA sequences as well. It is fully customizable to allow fine-tuning of the filtering sensitivity. Availability and implementation: The program PREQUAL is written in C/C++ and available through a GNU GPL v3.0 at https://github.com/simonwhelan/prequal. Supplementary information: Supplementary data are available at Bioinformatics online.


Asunto(s)
Secuencia de Aminoácidos , Homología de Secuencia , Programas Informáticos , Biología Computacional , Filogenia
5.
Syst Biol ; 66(2): 218-231, 2017 Mar 01.
Artículo en Inglés | MEDLINE | ID: mdl-27633353

RESUMEN

Phylogenetic tree inference is a critical component of many systematic and evolutionary studies. The majority of these studies are based on the two-step process of multiple sequence alignment followed by tree inference, despite persistent evidence that the alignment step can lead to biased results. Here we present a two-part study that first presents PaHMM-Tree, a novel neighbor joining-based method that estimates pairwise distances without assuming a single alignment. We then use simulations to benchmark its performance against a wide-range of other phylogenetic tree inference methods, including the first comparison of alignment-free distance-based methods against more conventional tree estimation methods. Our new method for calculating pairwise distances based on statistical alignment provides distance estimates that are as accurate as those obtained using standard methods based on the true alignment. Pairwise distance estimates based on the two-step process tend to be substantially less accurate. This improved performance carries through to tree inference, where PaHMM-Tree provides more accurate tree estimates than all of the pairwise distance methods assessed. For close to moderately divergent sequence data we find that the two-step methods using statistical inference, where information from all sequences is included in the estimation procedure, tend to perform better than PaHMM-Tree, particularly full statistical alignment, which simultaneously estimates both the tree and the alignment. For deep divergences we find the alignment step becomes so prone to error that our distance-based PaHMM-Tree outperforms all other methods of tree inference. Finally, we find that the accuracy of alignment-free methods tends to decline faster than standard two-step methods in the presence of alignment uncertainty, and identify no conditions where alignment-free methods are equal to or more accurate than standard phylogenetic methods even in the presence of substantial alignment error. [Alignment-free; distance-based phylogenetics; pair Hidden Markov Models; phylogenetic inference; statistical alignment.].


Asunto(s)
Clasificación/métodos , Modelos Genéticos , Filogenia , Alineación de Secuencia , Algoritmos , Benchmarking , Evolución Biológica , Evolución Molecular
6.
Mol Biol Evol ; 32(9): 2456-68, 2015 Sep.
Artículo en Inglés | MEDLINE | ID: mdl-25944916

RESUMEN

Recent developments in the analysis of amino acid covariation are leading to breakthroughs in protein structure prediction, protein design, and prediction of the interactome. It is assumed that observed patterns of covariation are caused by molecular coevolution, where substitutions at one site affect the evolutionary forces acting at neighboring sites. Our theoretical and empirical results cast doubt on this assumption. We demonstrate that the strongest coevolutionary signal is a decrease in evolutionary rate and that unfeasibly long times are required to produce coordinated substitutions. We find that covarying substitutions are mostly found on different branches of the phylogenetic tree, indicating that they are independent events that may or may not be attributable to coevolution. These observations undermine the hypothesis that molecular coevolution is the primary cause of the covariation signal. In contrast, we find that the pairs of residues with the strongest covariation signal tend to have low evolutionary rates, and that it is this low rate that gives rise to the covariation signal. Slowly evolving residue pairs are disproportionately located in the protein's core, which explains covariation methods' ability to detect pairs of residues that are close in three dimensions. These observations lead us to propose the "coevolution paradox": The strength of coevolution required to cause coordinated changes means the evolutionary rate is so low that such changes are highly unlikely to occur. As modern covariation methods may lead to breakthroughs in structural genomics, it is critical to recognize their biases and limitations.


Asunto(s)
Evolución Molecular , Cadenas de Markov , Modelos Genéticos , Tasa de Mutación , Filogenia , Pliegue de Proteína , Proteínas/genética
7.
Bioinformatics ; 31(14): 2391-3, 2015 Jul 15.
Artículo en Inglés | MEDLINE | ID: mdl-25725494

RESUMEN

UNLABELLED: Phylogenetic models are an important tool in molecular evolution allowing us to study the pattern and rate of sequence change. The recent influx of new sequence data in the biosciences means that to address evolutionary questions, we need a means for rapid and easy model development and implementation. Here we present GeLL, a Java library that lets users use text to quickly and efficiently define novel forms of discrete data and create new substitution models that describe how those data change on a phylogeny. GeLL allows users to define general substitution models and data structures in a way that is not possible in other existing libraries, including mixture models and non-reversible models. Classes are provided for calculating likelihoods, optimizing model parameters and branch lengths, ancestral reconstruction and sequence simulation. AVAILABILITY AND IMPLEMENTATION: http://phylo.bio.ku.edu/GeLL under a GPL v3 license.


Asunto(s)
Filogenia , Programas Informáticos , Funciones de Verosimilitud , Modelos Genéticos
8.
Syst Biol ; 64(1): 42-55, 2015 Jan.
Artículo en Inglés | MEDLINE | ID: mdl-25209223

RESUMEN

Molecular phylogenetics is a powerful tool for inferring both the process and pattern of evolution from genomic sequence data. Statistical approaches, such as maximum likelihood and Bayesian inference, are now established as the preferred methods of inference. The choice of models that a researcher uses for inference is of critical importance, and there are established methods for model selection conditioned on a particular type of data, such as nucleotides, amino acids, or codons. A major limitation of existing model selection approaches is that they can only compare models acting upon a single type of data. Here, we extend model selection to allow comparisons between models describing different types of data by introducing the idea of adapter functions, which project aggregated models onto the originally observed sequence data. These projections are implemented in the program ModelOMatic and used to perform model selection on 3722 families from the PANDIT database, 68 genes from an arthropod phylogenomic data set, and 248 genes from a vertebrate phylogenomic data set. For the PANDIT and arthropod data, we find that amino acid models are selected for the overwhelming majority of alignments; with progressively smaller numbers of alignments selecting codon and nucleotide models, and no families selecting RY-based models. In contrast, nearly all alignments from the vertebrate data set select codon-based models. The sequence divergence, the number of sequences, and the degree of selection acting upon the protein sequences may contribute to explaining this variation in model selection. Our ModelOMatic program is fast, with most families from PANDIT taking fewer than 150 s to complete, and should therefore be easily incorporated into existing phylogenetic pipelines. ModelOMatic is available at https://code.google.com/p/modelomatic/.


Asunto(s)
Clasificación/métodos , Modelos Biológicos , Filogenia , Aminoácidos/genética , Animales , Codón/genética , Nucleótidos/genética , Programas Informáticos
9.
Mol Biol Evol ; 30(3): 642-53, 2013 Mar.
Artículo en Inglés | MEDLINE | ID: mdl-23144040

RESUMEN

Multiple sequence alignment (MSA) is the heart of comparative sequence analysis. Recent studies demonstrate that MSA algorithms can produce different outcomes when analyzing genomes, including phylogenetic tree inference and the detection of adaptive evolution. These studies also suggest that the difference between MSA algorithms is of a similar order to the uncertainty within an algorithm and suggest integrating across this uncertainty. In this study, we examine further the problem of disagreements between MSA algorithms and how they affect downstream analyses. We also investigate whether integrating across alignment uncertainty affects downstream analyses. We address these questions by analyzing 200 chordate gene families, with properties reflecting those used in large-scale genomic analyses. We find that newly developed distance metrics reveal two significantly different classes of MSA methods (MSAMs). The similarity-based class includes progressive aligners and consistency aligners, representing many methodological innovations for sequence alignment, whereas the evolution-based class includes phylogenetically aware alignment and statistical alignment. We proceed to show that the class of an MSAM has a substantial impact on downstream analyses. For phylogenetic inference, tree estimates and their branch lengths appear highly dependent on the class of aligner used. The number of families, and the sites within those families, inferred to have undergone adaptive evolution depend on the class of aligner used. Similarity-based aligners tend to identify more adaptive evolution. We also develop and test methods for incorporating MSA uncertainty when detecting adaptive evolution but find that although accounting for MSA uncertainty does affect downstream analyses, it appears less important than the class of aligner chosen. Our results demonstrate the critical role that MSA methodology has on downstream analysis, highlighting that the class of aligner chosen in an analysis has a demonstrable effect on its outcome.


Asunto(s)
Algoritmos , Modelos Genéticos , Alineación de Secuencia/métodos , Adaptación Biológica/genética , Teorema de Bayes , Evolución Molecular , Genoma Humano , Humanos , Funciones de Verosimilitud , Cadenas de Markov , Método de Montecarlo , Filogenia , Selección Genética , Análisis de Secuencia de ADN/métodos
10.
Mol Phylogenet Evol ; 76: 102-9, 2014 Jul.
Artículo en Inglés | MEDLINE | ID: mdl-24631855

RESUMEN

Deep coalescence and the nongenealogical pattern of descent caused by recombination have emerged as a common problem for phylogenetic inference at the species level. Here we use computer simulations to assess whether AFLP-based phylogenies are robust to the uncertainties introduced by these factors. Our results indicate that phylogenetic signal can prevail even in the face of extensive deep coalescence allowing recovering the correct species tree topology. The impact of recombination on tree accuracy was related to total tree depth and species effective population size. The correct tree topology could be recovered upon many simulation settings due to a trade-off between the conflicting signals resulting from intra-locus recombination and the benefits of the joint consideration of unlinked loci that better matched overall the true species tree. Errors in tree topology were not only determined by deep coalescence, but also by the timing of divergence and the tree-building errors arising from an insufficient number of characters. DNA sequences generally outperformed AFLPs upon any simulated scenario, but this difference in performance was nearly negligible when a sufficient number of AFLP characters were sampled. Our simulations suggest that the impact of deep coalescence and intra-locus recombination on the reliability of AFLP trees could be minimal for effective population sizes equal to or lower than 10,000 (typical of many vertebrates and tree plants) given tree depths above 0.02 substitutions per site.


Asunto(s)
Análisis del Polimorfismo de Longitud de Fragmentos Amplificados/métodos , Filogenia , Recombinación Genética , Animales , Secuencia de Bases , Simulación por Computador , Modelos Genéticos , Reproducibilidad de los Resultados , Análisis de Secuencia de ADN
11.
Bioinformatics ; 28(4): 495-502, 2012 Feb 15.
Artículo en Inglés | MEDLINE | ID: mdl-22199391

RESUMEN

MOTIVATION: Multiple sequence alignment (MSA) is a core method in bioinformatics. The accuracy of such alignments may influence the success of downstream analyses such as phylogenetic inference, protein structure prediction, and functional prediction. The importance of MSA has lead to the proliferation of MSA methods, with different objective functions and heuristics to search for the optimal MSA. Different methods of inferring MSAs produce different results in all but the most trivial cases. By measuring the differences between inferred alignments, we may be able to develop an understanding of how these differences (i) relate to the objective functions and heuristics used in MSA methods, and (ii) affect downstream analyses. RESULTS: We introduce four metrics to compare MSAs, which include the position in a sequence where a gap occurs or the location on a phylogenetic tree where an insertion or deletion (indel) event occurs. We use both real and synthetic data to explore the information given by these metrics and demonstrate how the different metrics in combination can yield more information about MSA methods and the differences between them. AVAILABILITY: MetAl is a free software implementation of these metrics in Haskell. Source and binaries for Windows, Linux and Mac OS X are available from http://kumiho.smith.man.ac.uk/whelan/software/metal/.


Asunto(s)
Filogenia , Alineación de Secuencia/métodos , Programas Informáticos , Computadores , Mutación INDEL , Proteínas/química , Proteínas/genética
12.
Bioinformatics ; 28(1): 48-55, 2012 Jan 01.
Artículo en Inglés | MEDLINE | ID: mdl-22039210

RESUMEN

MOTIVATION: Recent large-scale studies of individuals within a population have demonstrated that there is widespread variation in copy number in many gene families. In addition, there is increasing evidence that the variation in gene copy number can give rise to substantial phenotypic effects. In some cases, these variations have been shown to be adaptive. These observations show that a full understanding of the evolution of biological function requires an understanding of gene gain and gene loss. Accurate, robust evolutionary models of gain and loss events are, therefore, required. RESULTS: We have developed weighted parsimony and maximum likelihood methods for inferring gain and loss events. To test these methods, we have used Markov models of gain and loss to simulate data with known properties. We examine three models: a simple birth-death model, a single rate model and a birth-death innovation model with parameters estimated from Drosophila genome data. We find that for all simulations maximum likelihood-based methods are very accurate for reconstructing the number of duplication events on the phylogenetic tree, and that maximum likelihood and weighted parsimony have similar accuracy for reconstructing the ancestral state. Our implementations are robust to different model parameters and provide accurate inferences of ancestral states and the number of gain and loss events. For ancestral reconstruction, we recommend weighted parsimony because it has similar accuracy to maximum likelihood, but is much faster. For inferring the number of individual gene loss or gain events, maximum likelihood is noticeably more accurate, albeit at greater computational cost. AVAILABILITY: www.bioinf.manchester.ac.uk/dupliphy CONTACT: simon.lovell@manchester.ac.uk; simon.whelan@manchester.ac.uk SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Asunto(s)
Drosophila/genética , Evolución Molecular , Modelos Genéticos , Animales , Simulación por Computador , Drosophila/clasificación , Genoma de los Insectos , Funciones de Verosimilitud , Cadenas de Markov
13.
Syst Biol ; 61(2): 228-39, 2012 Mar.
Artículo en Inglés | MEDLINE | ID: mdl-22076302

RESUMEN

Phylogenetic trees are important in many areas of biological research, ranging from systematic studies to the methods used for genome annotation. Finding the best scoring tree under any optimality criterion is an NP-hard problem, which necessitates the use of heuristics for tree-search. Although tree-search plays a major role in obtaining a tree estimate, there remains a limited understanding of its characteristics and how the elements of the statistical inferential procedure interact with the algorithms used. This study begins to answer some of these questions through a detailed examination of maximum likelihood tree-search on a wide range of real genome-scale data sets. We examine all 10,395 trees for each of the 106 genes of an eight-taxa yeast phylogenomic data set, then apply different tree-search algorithms to investigate their performance. We extend our findings by examining two larger genome-scale data sets and a large disparate data set that has been previously used to benchmark the performance of tree-search programs. We identify several broad trends occurring during tree-search that provide an insight into the performance of heuristics and may, in the future, aid their development. These trends include a tendency for the true maximum likelihood (best) tree to also be the shortest tree in terms of branch lengths, a weak tendency for tree-search to recover the best tree, and a tendency for tree-search to encounter fewer local optima in genes that have a high information content. When examining current heuristics for tree-search, we find that nearest-neighbor-interchange performs poorly, and frequently finds trees that are significantly different from the best tree. In contrast, subtree-pruning-and-regrafting tends to perform well, nearly always finding trees that are not significantly different to the best tree. Finally, we demonstrate that the precise implementation of a tree-search strategy, including when and where parameters are optimized, can change the character of tree-search, and that good strategies for tree-search may combine existing tree-search programs.


Asunto(s)
Filogenia , Levaduras/genética , Algoritmos , Clasificación/métodos , Biología Computacional , Genoma , Funciones de Verosimilitud , Modelos Biológicos , Levaduras/clasificación
14.
Mol Biol Evol ; 28(1): 449-58, 2011 Jan.
Artículo en Inglés | MEDLINE | ID: mdl-20724379

RESUMEN

There is widespread evidence of lineage-specific rate variation, known as heterotachy, during protein evolution. Changes in the structural and functional constraints acting on a protein can lead to heterotachy, and it is plausible that such changes, known as covarion shifts, may affect many amino acids at once. Several previous attempts to model heterotachy have used covarion models, where the sequence undergoes covarion drift, whereby each site may switch independently among a set of discrete classes having different substitution rates. However, such independent switching may not capture biologically important events where the selective forces acting on a protein affect many sites at once. We describe a new class of models that allow the rates of substitution and switching to vary among branches of a phylogenetic tree. Such models are better able to handle covarion shifts. We apply these models to a set of genes occurring in nonphotosynthetic bacteria, cyanobacteria, and the plastids of green and red algae. We find that 4/5 genes show evidence of some form of rate switching and that 3/5 genes show evidence that the relative switching rate differs among taxonomic groups. We conclude that covarion shifts may be frequent during the deep evolution of plastid genes and that our methodology may provide a powerful new tool for investigating such shifts in other systems.


Asunto(s)
Evolución Biológica , Variación Genética , Modelos Genéticos , Filogenia , Plastidios/genética , Algoritmos , Secuencia de Bases , Chlorophyta/citología , Chlorophyta/genética , Simulación por Computador , Datos de Secuencia Molecular , Proteínas/química , Proteínas/genética , Rhodophyta/citología , Rhodophyta/genética , Alineación de Secuencia
15.
Mol Biol Evol ; 27(12): 2674-7, 2010 Dec.
Artículo en Inglés | MEDLINE | ID: mdl-20584772

RESUMEN

Phylogenetic tree-search is a major aspect of many evolutionary studies. Several tree rearrangement algorithms are available for tree-search, but it is hard to draw general conclusions about their relative performance because many effects are data set specific and can be highly dependent on individual implementations (e.g., RAxML or phyml). Using only the structure of the rearrangements proposed by the Nearest Neighbor Interchange (NNI) algorithm, we show tree-search can prematurely terminate if it encounters multifurcating trees. We validate the relevance of this result by demonstrating that in real data the majority of possible bifurcating trees potentially encountered during tree-search are actually multifurcations, which suggests NNI would be expected to perform poorly. We also show that the star-decomposition algorithm is a special case of two other popular tree-search algorithms, subtree pruning and regrafting (SPR) and tree bisection and reconnection (TBR), which means that these two algorithms can efficiently escape when they encounter multifurcations. We caution against the use of the NNI algorithm and for most applications we recommend the use of more robust tree-search algorithms, such as SPR and TBR.


Asunto(s)
Evolución Molecular , Modelos Genéticos , Filogenia , Alineación de Secuencia/estadística & datos numéricos , Algoritmos , Funciones de Verosimilitud
16.
Dev Dyn ; 239(12): 3312-23, 2010 Dec.
Artículo en Inglés | MEDLINE | ID: mdl-20981828

RESUMEN

The 6-O-endosulfatase enzymes (Sulfs) edit the final sulfation pattern and function of heparan sulfate (HS) by removal of 6-O-sulfate groups from the chain. To date, two mammalian sulf genes have been identified that regulate many signalling pathways during embryonic development. In zebrafish a sulf1 ortholog and duplicate copies of the mammalian sulf2 gene, sulf2a and sulf2, have been identified, which contain conserved motifs characteristic of vertebrate sulf genes. Zebrafish sulf1 and sulf2a are broadly expressed in the central nervous system (CNS) and non-neuronal tissue including heart, somite boundaries, olfactory system, and otic vesicle, whereas sulf2 expression is almost entirely restricted to the CNS. The duplicate copies of sulf2 have distinct expression patterns, which together mirror that of mouse sulf2, suggesting duplication in the teleost lineage has been followed by subfunctionalisation, whereby both genes need to be preserved by selection to ensure the ancestral gene's expression profile and function is maintained.


Asunto(s)
Proteínas de Pez Cebra/metabolismo , Animales , Sistema Nervioso Central/embriología , Biología Computacional , Embrión no Mamífero/metabolismo , Corazón/embriología , Heparitina Sulfato/metabolismo , Hibridación in Situ , Vías Olfatorias/embriología , Proteoglicanos/metabolismo , Proteínas de Pez Cebra/genética
17.
Methods Mol Biol ; 2231: 147-162, 2021.
Artículo en Inglés | MEDLINE | ID: mdl-33289892

RESUMEN

Large-scale multigene datasets used in phylogenomics and comparative genomics often contain sequence errors inherited from source genomes and transcriptomes. These errors typically manifest as stretches of non-homologous characters and derive from sequencing, assembly, and/or annotation errors. The lack of automatic tools to detect and remove sequence errors leads to the propagation of these errors in large-scale datasets. PREQUAL is a command line tool that identifies and masks regions with non-homologous adjacent characters in sets of unaligned homologous sequences. PREQUAL uses a full probabilistic approach based on pair hidden Markov models. On the front end, PREQUAL is user-friendly and simple to use while also allowing full customization to adjust filtering sensitivity. It is primarily aimed at amino acid sequences but can handle protein-coding nucleotide sequences. PREQUAL is computationally efficient and shows high sensitivity and accuracy. In this chapter, we briefly introduce the motivation for PREQUAL and its underlying methodology, followed by a description of basic and advanced usage, and conclude with some notes and recommendations. PREQUAL fills an important gap in the current bioinformatics tool kit for phylogenomics, contributing toward increased accuracy and reproducibility in future studies.


Asunto(s)
Biología Computacional/métodos , Genómica/métodos , Cadenas de Markov , Análisis de Secuencia de ADN/métodos , Programas Informáticos , Algoritmos , Modelos Estadísticos , Filogenia , Reproducibilidad de los Resultados , Sensibilidad y Especificidad , Homología de Secuencia
18.
Mol Biol Evol ; 25(8): 1683-94, 2008 Aug.
Artículo en Inglés | MEDLINE | ID: mdl-18502771

RESUMEN

Models of nucleotide substitution make many simplifying assumptions about the evolutionary process, including that the same process acts on all sites in an alignment and on all branches on the phylogenetic tree. Many studies have shown that in reality the substitution process is heterogeneous and that this variability can introduce systematic errors into many forms of phylogenetic analyses. I propose a new rigorous approach for describing heterogeneity called a temporal hidden Markov model (THMM), which can distinguish between among site (spatial) heterogeneity and among lineage (temporal) heterogeneity. Several versions of the THMM are applied to 16 sets of aligned sequences to quantitatively assess the different forms of heterogeneity acting within them. The most general THMM provides the best fit in all the data sets examined, providing strong evidence of pervasive heterogeneity during evolution. Investigating individual forms of heterogeneity provides further insights. In agreement with previous studies, spatial rate heterogeneity (rates across sites [RAS]) is inferred to be the single most prevalent form of heterogeneity. Interestingly, RAS appears so dominant that failure to independently include it in the THMM masks other forms of heterogeneity, particularly temporal heterogeneity. Incorporating RAS into the THMM reveals substantial temporal and spatial heterogeneity in nucleotide composition and bias toward transition substitution in all alignments examined, although the relative importance of different forms of heterogeneity varies between data sets. Furthermore, the improvements in model fit observed by adding complexity to the model suggest that the THMMs used in this study do not capture all the evolutionary heterogeneity occurring in the data. These observations all indicate that current tests may consistently underestimate the degree of temporal heterogeneity occurring in data. Finally, there is a weak link between the amount of heterogeneity detected and the level of divergence between the sequences, suggesting that variability in the evolutionary process will be a particular problem for deep phylogeny.


Asunto(s)
Evolución Molecular , Heterogeneidad Genética , Modelos Genéticos , Filogenia , Secuencia de Bases , Funciones de Verosimilitud , Datos de Secuencia Molecular , Alineación de Secuencia
19.
Bioinformatics ; 24(1): 11-7, 2008 Jan 01.
Artículo en Inglés | MEDLINE | ID: mdl-18006548

RESUMEN

MOTIVATION: Alternative splicing has the potential to generate a wide range of protein isoforms. For many computational applications and for experimental research, it is important to be able to concentrate on the isoform that retains the core biological function. For many genes this is far from clear. RESULTS: We have combined five methods into a pipeline that allows us to detect the principal variant for a gene. Most of the methods were based on conservation between species, at the level of both gene and protein. The five methods used were the conservation of exonic structure, the detection of non-neutral evolution, the conservation of functional residues, the existence of a known protein structure and the abundance of vertebrate orthologues. The pipeline was able to determine a principal isoform for 83% of a set of well-annotated genes with multiple variants.


Asunto(s)
Empalme Alternativo/genética , Evolución Molecular , Perfilación de la Expresión Génica/métodos , Isoformas de Proteínas/genética , Alineación de Secuencia/métodos , Análisis de Secuencia/métodos , Algoritmos , Homología de Secuencia de Ácido Nucleico
20.
Methods Mol Biol ; 452: 287-309, 2008.
Artículo en Inglés | MEDLINE | ID: mdl-18566770

RESUMEN

Molecular phylogenetics examines how biological sequences evolve and the historical relationships between them. An important aspect of many such studies is the estimation of a phylogenetic tree, which explicitly describes evolutionary relationships between the sequences. This chapter provides an introduction to evolutionary trees and some commonly used inferential methodology, focusing on the assumptions made and how they affect an analysis. Detailed discussion is also provided about some common algorithms used for phylogenetic tree estimation. Finally, there are a few practical guidelines, including how to combine multiple software packages to improve inference, and a comparison between Bayesian and maximum likelihood phylogenetics.


Asunto(s)
Biología Computacional/métodos , Evolución Molecular , Filogenia , Programas Informáticos
SELECCIÓN DE REFERENCIAS
DETALLE DE LA BÚSQUEDA