Results 1 - 20 of 25
1.
J Hered ; 107(3): 248-56, 2016 May.
Article in English | MEDLINE | ID: mdl-26704140

ABSTRACT

The renewable source of highly reduced carbon provided by plant triacylglycerols (TAGs) fills an ever increasing demand for food, biodiesel, and industrial chemicals. Each of these uses requires different compositions of fatty acid proportions in seed oils. Identifying the genes responsible for variation in seed oil composition in nature provides targets for bioengineering fatty acid proportions optimized for various industrial and nutrition goals. Here, we characterized the seed oil composition of 391 world-wide, wild accessions of Arabidopsis thaliana, and performed a genome-wide association study (GWAS) of the 9 major fatty acids in the seed oil and 4 composite measures of the fatty acids. Four to 19 regions of interest were associated with the seed oil composition traits. Thirty-four of the genes in these regions are involved in lipid metabolism or transport, with 14 specific to fatty acid synthesis or breakdown. Eight of the genes encode transcription factors. We have identified genes significantly associated with variation in fatty acid proportions that can be used as a resource across the Brassicaceae. Two-thirds of the regions identified contain candidate genes that have never been implicated in lipid metabolism and represent potential new targets for bioengineering.


Subject(s)
Arabidopsis/genetics , Fatty Acids/chemistry , Plant Genes , Plant Oils/chemistry , Arabidopsis/chemistry , Chromosome Mapping , Genetic Association Studies , Lipid Metabolism , Single Nucleotide Polymorphism , Seeds/chemistry
2.
J Hered ; 107(3): 257-65, 2016 May.
Article in English | MEDLINE | ID: mdl-26865732

ABSTRACT

Seed oil melting point is an adaptive, quantitative trait determined by the relative proportions of the fatty acids that compose the oil. Micro- and macro-evolutionary evidence suggests selection has changed the melting point of seed oils to covary with germination temperatures because of a trade-off between total energy stores and the rate of energy acquisition during germination under competition. The seed oil compositions of 391 natural accessions of Arabidopsis thaliana, grown under common-garden conditions, were used to assess whether seed oil melting point within a species varied with germination temperature. In support of the adaptive explanation, long-term monthly spring and fall field temperatures of the accession collection sites significantly predicted their seed oil melting points. In addition, a genome-wide association study (GWAS) was performed to determine which genes were most likely responsible for the natural variation in seed oil melting point. The GWAS found a single highly significant association within the coding region of FAD2, which encodes a fatty acid desaturase central to the oil biosynthesis pathway. In a separate analysis of 15 a priori oil synthesis candidate genes, 2 (FAD2 and FATB) were located near significant SNPs associated with seed oil melting point. These results comport with others' molecular work showing that lines with alterations in these genes affect seed oil melting point as expected. Our results suggest natural selection has acted on a small number of loci to alter a quantitative trait in response to local environmental conditions.


Subject(s)
Arabidopsis/genetics , Fatty Acids/chemistry , Seeds/chemistry , Transition Temperature , Arabidopsis/chemistry , Arabidopsis Proteins/genetics , Fatty Acid Desaturases/genetics , Genetic Association Studies , Germination , Single Nucleotide Polymorphism , Thiolester Hydrolases/genetics
3.
Bioinformatics ; 28(12): i274-82, 2012 Jun 15.
Article in English | MEDLINE | ID: mdl-22689772

ABSTRACT

MOTIVATION: While phylogenetic analyses of datasets containing 1000-5000 sequences are challenging for existing methods, the estimation of substantially larger phylogenies poses a problem of much greater complexity and scale. METHODS: We present DACTAL, a method for phylogeny estimation that produces trees from unaligned sequence datasets without ever needing to estimate an alignment on the entire dataset. DACTAL combines iteration with a novel divide-and-conquer approach, so that each iteration begins with a tree produced in the prior iteration, decomposes the taxon set into overlapping subsets, estimates trees on each subset, and then combines the smaller trees into a tree on the full taxon set using a new supertree method. We prove that DACTAL is guaranteed to produce the true tree under certain conditions. We compare DACTAL to SATé and maximum likelihood trees on estimated alignments using simulated and real datasets with 1000-27 643 taxa. RESULTS: Our studies show that on average DACTAL yields more accurate trees than the two-phase methods we studied on very large datasets that are difficult to align, and has approximately the same accuracy on the easier datasets. The comparison to SATé shows that both have the same accuracy, but that DACTAL achieves this accuracy in a fraction of the time. Furthermore, DACTAL can analyze larger datasets than SATé, including a dataset with almost 28 000 sequences. AVAILABILITY: DACTAL source code and results of dataset analyses are available at www.cs.utexas.edu/users/phylo/software/dactal.


Subject(s)
Phylogeny , Sequence Alignment , Software , Algorithms , Computer Simulation , Likelihood Functions
4.
Syst Biol ; 61(2): 214-27, 2012 Mar.
Article in English | MEDLINE | ID: mdl-21934137

ABSTRACT

Many research groups are estimating trees containing anywhere from a few thousands to hundreds of thousands of species, toward the eventual goal of the estimation of a Tree of Life, containing perhaps as many as several million leaves. These phylogenetic estimations present enormous computational challenges, and current computational methods are likely to fail to run even on data sets in the low end of this range. One approach to estimate a large species tree is to use phylogenetic estimation methods (such as maximum likelihood) on a supermatrix produced by concatenating multiple sequence alignments for a collection of markers; however, the most accurate of these phylogenetic estimation methods are extremely computationally intensive for data sets with more than a few thousand sequences. Supertree methods, which assemble phylogenetic trees from a collection of trees on subsets of the taxa, are important tools for phylogeny estimation where phylogenetic analyses based upon maximum likelihood (ML) are infeasible. In this paper, we introduce SuperFine, a meta-method that utilizes a novel two-step procedure in order to improve the accuracy and scalability of supertree methods. Our study, using both simulated and empirical data, shows that SuperFine-boosted supertree methods produce more accurate trees than standard supertree methods, and run quickly on very large data sets with thousands of sequences. Furthermore, SuperFine-boosted matrix representation with parsimony (MRP, the most well-known supertree method) approaches the accuracy of ML methods on supermatrix data sets under realistic conditions.


Subject(s)
Phylogeny , Algorithms , Classification/methods , Computational Biology , Computer Simulation , Likelihood Functions , Biological Models
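The abstract above names matrix representation with parsimony (MRP) without spelling out the encoding. As a minimal sketch of the standard matrix-representation step only (not of SuperFine itself, and not of the parsimony search that follows), the code below turns each clade of each source tree into one character column: `1` for taxa inside the clade, `0` for taxa in that source tree but outside the clade, and `?` for taxa missing from that source tree. The nested-tuple tree format and the names `leaves`, `clades`, and `mrp_matrix` are assumptions made for illustration.

```python
def leaves(tree):
    """Return the leaf labels of a rooted tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def clades(tree):
    """Return the leaf sets of all internal nodes except the root."""
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(walk(child) for child in node))
        found.add(below)
        return below

    found.discard(walk(tree))  # the root clade carries no information
    return found


def mrp_matrix(source_trees):
    """Encode source trees as an MRP character matrix (taxon -> string of 0/1/?)."""
    taxa = sorted(set().union(*(leaves(t) for t in source_trees)))
    rows = {taxon: [] for taxon in taxa}
    for tree in source_trees:
        present = leaves(tree)
        # Sort clades so that the column order is reproducible.
        for clade in sorted(clades(tree), key=sorted):
            for taxon in taxa:
                if taxon not in present:
                    rows[taxon].append("?")
                elif taxon in clade:
                    rows[taxon].append("1")
                else:
                    rows[taxon].append("0")
    return {taxon: "".join(states) for taxon, states in rows.items()}


if __name__ == "__main__":
    # Two hypothetical source trees on overlapping taxon sets.
    sources = [(("A", "B"), ("C", "D")), (("A", "E"), "C")]
    for taxon, states in mrp_matrix(sources).items():
        print(taxon, states)
```

In an MRP analysis the resulting matrix would then be passed to a maximum parsimony program; that search is deliberately left out of this sketch.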
5.
Syst Biol ; 61(1): 90-106, 2012 Jan.
Article in English | MEDLINE | ID: mdl-22139466

ABSTRACT

Highly accurate estimation of phylogenetic trees for large data sets is difficult, in part because multiple sequence alignments must be accurate for phylogeny estimation methods to be accurate. Coestimation of alignments and trees has been attempted but currently only SATé estimates reasonably accurate trees and alignments for large data sets in practical time frames (Liu K., Raghavan S., Nelesen S., Linder C.R., Warnow T. 2009b. Rapid and accurate large-scale coestimation of sequence alignments and phylogenetic trees. Science. 324:1561-1564). Here, we present a modification to the original SATé algorithm that improves upon SATé (which we now call SATé-I) in terms of speed and of phylogenetic and alignment accuracy. SATé-II uses a different divide-and-conquer strategy than SATé-I and so produces smaller more closely related subsets than SATé-I; as a result, SATé-II produces more accurate alignments and trees, can analyze larger data sets, and runs more efficiently than SATé-I. Generally, SATé is a metamethod that takes an existing multiple sequence alignment method as an input parameter and boosts the quality of that alignment method. SATé-II-boosted alignment methods are significantly more accurate than their unboosted versions, and trees based upon these improved alignments are more accurate than trees based upon the original alignments. Because SATé-I used maximum likelihood (ML) methods that treat gaps as missing data to estimate trees and because we found a correlation between the quality of tree/alignment pairs and ML scores, we explored the degree to which SATé's performance depends on using ML with gaps treated as missing data to determine the best tree/alignment pair. We present two lines of evidence that using ML with gaps treated as missing data to optimize the alignment and tree produces very poor results. 
First, we show that the optimization problem where a set of unaligned DNA sequences is given and the output is the tree and alignment of those sequences that maximize likelihood under the Jukes-Cantor model is uninformative in the worst possible sense. For all inputs, all trees optimize the likelihood score. Second, we show that a greedy heuristic that uses GTR+Gamma ML to optimize the alignment and the tree can produce very poor alignments and trees. Therefore, the excellent performance of SATé-II and SATé-I is not because ML is used as an optimization criterion for choosing the best tree/alignment pair but rather due to the particular divide-and-conquer realignment techniques employed.


Subject(s)
Phylogeny , Sequence Alignment/methods , Software , Algorithms , Automation , Computer Simulation , DNA , Molecular Evolution , Likelihood Functions
6.
Mol Phylogenet Evol ; 48(3): 1013-26, 2008 Sep.
Article in English | MEDLINE | ID: mdl-18620872

ABSTRACT

Angiosperm systematics has progressed to the point where it is now expected that multiple, independent markers be used in phylogenetic studies. Universal primers for amplifying informative regions of the chloroplast genome are readily available, but in the faster-evolving nuclear genome it is challenging to discover priming sites that are conserved across distantly related taxa. With goals including the identification of informative markers in rosids, and perhaps other angiosperms, we screened 141 nuclear primer combinations for phylogenetic utility in two distinct groups of rosids at different taxonomic levels: Psiguria (Cucurbitaceae) and Geraniaceae. We discovered three phylogenetically informative regions in Psiguria and two in Geraniaceae, but none that were useful in both groups. Extending beyond rosids, we combined our findings with those of another recent effort testing these primer pairs in Asteraceae, Brassicaceae, and Orchidaceae. From this comparison, we identified 32 primer combinations that amplified regions in representative species of at least two of the five distantly related angiosperm families, giving some prior indication about phylogenetic usefulness of these markers in other flowering plants. This reduced set of primer pairs for amplifying low-copy nuclear markers along with a recommended experimental strategy provide a framework for identifying phylogenetically informative regions in angiosperms.


Subject(s)
Magnoliopsida/genetics , Biological Evolution , Cell Nucleus/metabolism , DNA Primers/chemistry , Chloroplast DNA/genetics , Plant DNA/genetics , Molecular Evolution , Plant Genes , Plant Genome , Genomics , Phylogeny , Species Specificity
7.
J Comput Biol ; 12(6): 796-811, 2005.
Article in English | MEDLINE | ID: mdl-16108717

ABSTRACT

We present new methods for reconstructing reticulate evolution of species due to events such as horizontal transfer or hybrid speciation; both methods are based upon extensions of Wayne Maddison's approach in his seminal 1997 paper. Our first method is a polynomial time algorithm for constructing phylogenetic networks from two gene trees contained inside the network. We allow the network to have an arbitrary number of reticulations, but we limit the reticulation in the network so that the cycles in the network are node-disjoint ("galled"). Our second method is a polynomial time algorithm for constructing networks with one reticulation, where we allow for errors in the estimated gene trees. Using simulations, we demonstrate improved performance of this method over both NeighborNet and Maddison's method.


Subject(s)
Molecular Evolution , Genetic Models , Phylogeny , Computer Simulation , Gene Frequency , Genetic Variation , Statistical Models , Mutation , Genetic Selection
8.
Bioinformatics ; 20 Suppl 1: i355-62, 2004 Aug 04.
Article in English | MEDLINE | ID: mdl-15262820

ABSTRACT

MOTIVATION: For the purpose of identifying evolutionary reticulation events in flowering plants, we determine a large number of paired, conserved DNA oligomers that may be used as primers to amplify orthologous DNA regions using the polymerase chain reaction (PCR). RESULTS: We develop an initial candidate set by comparing the Arabidopsis and rice genomes using MoBIoS (Molecular Biological Information System). MoBIoS is a metric-space database management system targeting life science data. Through the use of metric-space indexing techniques, two genomes can be compared in O(m log n) time, where m and n are the lengths of the genomes, versus O(mn) for BLAST-based analysis. The filtering of low-complexity regions may also be accomplished by directly assessing the uniqueness of the region. We describe mSQL, a SQL extension being developed for MoBIoS that encapsulates the algorithmic details in a common database programming language, shielding end-users from esoteric programming. AVAILABILITY: Available upon request from authors.


Subject(s)
Arabidopsis/genetics , Chromosome Mapping/methods , Conserved Sequence/genetics , DNA Primers/genetics , Plant Genome/genetics , Oryza/genetics , Polymerase Chain Reaction/methods , DNA Sequence Analysis/methods , Nucleic Acid Sequence Homology , Software
9.
Genetics ; 163(1): 277-86, 2003 Jan.
Article in English | MEDLINE | ID: mdl-12586715

ABSTRACT

The extent to which genetic background can influence allelic fitness is poorly understood, despite having important evolutionary consequences. Using experimental populations of Arabidopsis thaliana and map-based population genetic data, we examined a multigeneration response to selection in populations with differentiated genetic backgrounds. Replicated experimental populations of A. thaliana with genetic backgrounds derived from ecotypes Landsberg and Niederzenz were subjected to strong viability and fertility selection by growing individuals from each population at high density for three generations in a growth chamber. Patterns of genome-wide selection were evaluated by examining deviations from expected frequencies of mapped molecular markers. Estimates of selection coefficients for individual genomic regions ranged from near 0 to 0.685. Genomic regions demonstrating the strongest response to selection most often were selected similarly in both genetic backgrounds. The selection response of several weakly selected regions, however, appeared to be sensitive to genetic background, but only one region showed evidence of positive selection in one background and negative selection in another. These results are most consistent with models of adaptive evolution in which allelic fitnesses are not strongly influenced by genetic background and only infrequently change in sign due to variation at other loci.


Subject(s)
Arabidopsis/genetics , Genetic Selection , Genetic Linkage , Genetic Markers , Geography , Heterozygote
10.
Ecol Evol ; 5(1): 164-71, 2015 Jan.
Article in English | MEDLINE | ID: mdl-25628873

ABSTRACT

Early seedling emergence can increase plant fitness under competition. Seed oil composition (the types and relative amounts of fatty acids in the oils) may play an important role in determining emergence timing and early growth rate in oilseeds. Saturated fatty acids provide more energy per carbon atom than unsaturated fatty acids but have substantially higher melting points (when chain length is held constant). This characteristic forms the basis of an adaptive hypothesis that lower melting point seeds (lower proportion of saturated fatty acids) should be favored under colder germination temperatures due to earlier germination and faster growth before photosynthesis, while at warmer germination temperatures, seeds with a higher amount of energy (higher proportion of saturated fatty acids) should be favored. To assess the effects of seed oil melting point on timing of seedling emergence and fitness, high- and low-melting point lines from a recombinant inbred cross of Arabidopsis thaliana were competed in a fully factorial experiment at warm and cold temperatures with two different density treatments. Emergence timing between these lines was not significantly different at either temperature, which aligned with warm temperature predictions, but not cold temperature predictions. Under all conditions, plants competing against high-melting point lines had lower fitness relative to those against low-melting point lines, which matched expectations for undifferentiated emergence times.

11.
Am Nat ; 156(4): 442-458, 2000 Oct.
Article in English | MEDLINE | ID: mdl-29592140

ABSTRACT

Structural, energetic, biochemical, and ecological information suggests that germination temperature is an important selective agent causing seed oils of higher-latitude plants to have proportionately more unsaturated fatty acids than lower-latitude plants. Germination temperature is predicted to select relative proportions of saturated and unsaturated fatty acids in seed oils that optimize the total energy stores in a seed and the rate of energy production during germination. Saturated fatty acids store more energy per carbon than unsaturated fatty acids; however, unsaturated fatty acids have much lower melting points than saturated fatty acids. Thus, seeds with lower proportions of saturated fatty acids in their oils should be able to germinate earlier and grow more rapidly at low temperatures even though they store less total energy than seeds with a higher proportion of saturated fatty acids. Seeds that germinate earlier and grow more rapidly should have a competitive advantage. At higher germination temperatures, seeds with higher proportions of saturated fatty acids will be selectively favored because their oils will provide more energy, without a penalty in the rate of energy acquisition. Macroevolutionary biogeographical evidence from a broad spectrum of seed plants and the genus Helianthus support the theory, as do microevolutionary biogeography and seed germination performance within species of Helianthus.

12.
Article in English | MEDLINE | ID: mdl-17048405

ABSTRACT

Phylogenetic networks model the evolutionary history of sets of organisms when events such as hybrid speciation and horizontal gene transfer occur. In spite of their widely acknowledged importance in evolutionary biology, phylogenetic networks have so far been studied mostly for specific data sets. We present a general definition of phylogenetic networks in terms of directed acyclic graphs (DAGs) and a set of conditions. Further, we distinguish between model networks and reconstructible ones and characterize the effect of extinction and taxon sampling on the reconstructibility of the network. Simulation studies are a standard technique for assessing the performance of phylogenetic methods. A main step in such studies entails quantifying the topological error between the model and inferred phylogenies. While many measures of tree topological accuracy have been proposed, none exist for phylogenetic networks. Previously, we proposed the first such measure, which applied only to a restricted class of networks. In this paper, we extend that measure to apply to all networks, and prove that it is a metric on the space of phylogenetic networks. Our results allow for the systematic study of existing network methods, and for the design of new accurate ones.


Subject(s)
Computational Biology/methods , Genetic Models , Phylogeny , Algorithms , Molecular Evolution , Horizontal Gene Transfer/genetics , Genetic Recombination/genetics
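The abstract above defines phylogenetic networks as directed acyclic graphs (DAGs) plus a set of conditions that it does not list. The sketch below illustrates only the DAG part, under assumptions of its own: a network is an edge list of hypothetical `(parent, child)` pairs, acyclicity is checked with Kahn's topological-sort algorithm, and a node with two or more parents is treated as a reticulation (hybridization or horizontal-transfer) node. It is not the paper's definition or its error measure.

```python
from collections import defaultdict, deque


def is_acyclic(edges):
    """Return True if the directed graph given as (parent, child) pairs has no cycle."""
    children = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1
        nodes.update((parent, child))

    # Kahn's algorithm: repeatedly remove nodes that have no remaining parents.
    queue = deque(node for node in nodes if indegree[node] == 0)
    removed = 0
    while queue:
        node = queue.popleft()
        removed += 1
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    return removed == len(nodes)


def reticulation_nodes(edges):
    """Return the sorted list of nodes that have more than one parent."""
    parents = defaultdict(set)
    for parent, child in edges:
        parents[child].add(parent)
    return sorted(node for node, ps in parents.items() if len(ps) >= 2)


if __name__ == "__main__":
    # Hypothetical network: leaf b descends from hybrid node h, whose parents are x and y.
    network = [("r", "x"), ("r", "y"), ("x", "a"), ("x", "h"),
               ("y", "h"), ("y", "c"), ("h", "b")]
    print(is_acyclic(network), reticulation_nodes(network))
```

A tree is the special case in which `reticulation_nodes` returns an empty list.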
13.
Am J Bot ; 91: 1700-1708, 2004 Oct.
Article in English | MEDLINE | ID: mdl-18677414

ABSTRACT

Until recently, rigorously reconstructing the many hybrid speciation events in plants has not been practical because of the limited number of molecular markers available for plant phylogenetic reconstruction and the lack of good, biologically based methods for inferring reticulation (network) events. This situation should change rapidly with the development of multiple nuclear markers for phylogenetic reconstruction and new methods for reconstructing reticulate evolution. These developments will necessitate a much greater incorporation of population genetics into phylogenetic reconstruction than has been common. Population genetic events such as gene duplication coupled with lineage sorting and meiotic and sexual recombination have always had the potential to affect phylogenetic inference. For tree reconstruction, these problems are usually minimized by using uniparental markers and nuclear markers that undergo rapid concerted evolution. Because reconstruction of reticulate speciation events will require nuclear markers that lack these characteristics, effects of population genetics on phylogenetic inference will need to be addressed directly. Current models and methods that allow hybrid speciation to be detected and reconstructed are discussed, with a focus on how lineage sorting and meiotic and sexual recombination affect network reconstruction. Approaches that would allow inference of phylogenetic networks in their presence are suggested.

14.
G3 (Bethesda) ; 4(8): 1465-78, 2014 Jun 05.
Article in English | MEDLINE | ID: mdl-24902604

ABSTRACT

In the natural world, genotype expression is influenced by an organism's environment. Identifying and understanding the genes underlying phenotypes in different environments is important for making advances in fields ranging from evolution to medicine to agriculture. With the availability of genome-wide genetic-marker datasets, it is possible to look for genes that interact with the environment. Using the model organism, Arabidopsis thaliana, we looked for genes underlying phenotypes as well as genotype-by-environment interactions in four germination traits under two light and two nutrient conditions. We then performed genome-wide association tests to identify candidate genes underlying the observed phenotypes and genotype-by-environment interactions. Of the four germination traits examined, only two showed significant genotype-by-environment interactions. While genome-wide association analyses did not identify any markers or genes explicitly linked to genotype-by-environment interactions, we did identify a total of 55 markers and 71 genes associated with germination differences. Of the 71 genes, four--ZIGA4, PS1, TOR, and TT12--appear to be strong candidates for further study of germination variation under different environments.


Subject(s)
Arabidopsis/genetics , Germination/genetics , Arabidopsis/drug effects , Arabidopsis/physiology , Arabidopsis/radiation effects , Fertilizers , Gene-Environment Interaction , Plant Genes , Genome-Wide Association Study , Germination/drug effects , Germination/radiation effects , Light , Phenotype , Single Nucleotide Polymorphism
15.
PLoS One ; 6(11): e27731, 2011.
Article in English | MEDLINE | ID: mdl-22132132

ABSTRACT

Statistical methods for phylogeny estimation, especially maximum likelihood (ML), offer high accuracy with excellent theoretical properties. However, RAxML, the current leading method for large-scale ML estimation, can require weeks or longer when used on datasets with thousands of molecular sequences. Faster methods for ML estimation, among them FastTree, have also been developed, but their relative performance to RAxML is not yet fully understood. In this study, we explore the performance with respect to ML score, running time, and topological accuracy, of FastTree and RAxML on thousands of alignments (based on both simulated and biological nucleotide datasets) with up to 27,634 sequences. We find that when RAxML and FastTree are constrained to the same running time, FastTree produces topologically much more accurate trees in almost all cases. We also find that when RAxML is allowed to run to completion, it provides an advantage over FastTree in terms of the ML score, but does not produce substantially more accurate tree topologies. Interestingly, the relative accuracy of trees computed using FastTree and RAxML depends in part on the accuracy of the sequence alignment and dataset size, so that FastTree can be more accurate than RAxML on large datasets with relatively inaccurate alignments. Finally, the running times of RAxML and FastTree are dramatically different, so that when run to completion, RAxML can take several orders of magnitude longer than FastTree to complete. Thus, our study shows that very large phylogenies can be estimated very quickly using FastTree, with little (and in some cases no) degradation in tree accuracy, as compared to RAxML.


Subject(s)
Computational Biology/methods , Phylogeny , Software , Base Sequence , Computer Simulation , Genetic Databases , Humans , Likelihood Functions , Sequence Alignment
16.
Algorithms Mol Biol ; 6: 7, 2011 Apr 19.
Article in English | MEDLINE | ID: mdl-21504600

ABSTRACT

BACKGROUND: Supertree methods represent one of the major ways by which the Tree of Life can be estimated, but despite many recent algorithmic innovations, matrix representation with parsimony (MRP) remains the main algorithmic supertree method. RESULTS: We evaluated the performance of several supertree methods based upon the Quartets MaxCut (QMC) method of Snir and Rao and showed that two of these methods usually outperform MRP and five other supertree methods that we studied, under many realistic model conditions. However, the QMC-based methods have scalability issues that may limit their utility on large datasets. We also observed that taxon sampling impacted supertree accuracy, with poor results obtained when all of the source trees were only sparsely sampled. Finally, we showed that the popular optimality criterion of minimizing the total topological distance of the supertree to the source trees is only weakly correlated with supertree topological accuracy. Therefore evaluating supertree methods on biological datasets is problematic. CONCLUSIONS: Our results show that supertree methods that improve upon MRP are possible, and that an effort should be made to produce scalable and robust implementations of the most accurate supertree methods. Also, because topological accuracy depends upon taxon sampling strategies, attempts to construct very large phylogenetic trees using supertree methods should consider the selection of source tree datasets, as well as supertree methods. Finally, since supertree topological error is only weakly correlated with the supertree's topological distance to its source trees, development and testing of supertree methods presents methodological challenges.
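
The optimality criterion mentioned above, the total topological distance of a supertree to its source trees, can be sketched as follows. The abstract does not say which distance is used; this sketch assumes a rooted, clade-based Robinson-Foulds distance, with the supertree first restricted to each source tree's taxa. The nested-tuple tree format and all function names are assumptions made for illustration.

```python
def leaves(tree):
    """Return the leaf labels of a rooted tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def restrict(tree, keep):
    """Prune the tree to the leaves in `keep`, suppressing nodes left with one child."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kids = [kid for kid in (restrict(child, keep) for child in tree) if kid is not None]
    if not kids:
        return None
    return kids[0] if len(kids) == 1 else tuple(kids)


def clades(tree):
    """Return the leaf sets of all internal nodes except the root."""
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(walk(child) for child in node))
        found.add(below)
        return below

    found.discard(walk(tree))
    return found


def rf_distance(tree_a, tree_b):
    """Count the clades that appear in exactly one of two trees on the same taxa."""
    return len(clades(tree_a) ^ clades(tree_b))


def total_distance(supertree, source_trees):
    """Sum the distances from the supertree, restricted to each source's taxa, to that source."""
    return sum(rf_distance(restrict(supertree, leaves(src)), src) for src in source_trees)


if __name__ == "__main__":
    # Hypothetical supertree and two source trees on subsets of its taxa.
    supertree = ((("A", "B"), "C"), ("D", "E"))
    sources = [(("A", "B"), "D"), (("A", "C"), "B")]
    print(total_distance(supertree, sources))
```

A low total under this criterion says only that the supertree agrees with its inputs; as the abstract reports, that agreement is only weakly correlated with accuracy relative to the true tree.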

17.
PLoS Curr ; 2: RRN1195, 2010 Nov 18.
Article in English | MEDLINE | ID: mdl-21113335

ABSTRACT

We have assembled a collection of web pages that contain benchmark datasets and software tools to enable the evaluation of the accuracy and scalability of computational methods for estimating evolutionary relationships. They provide a resource to the scientific community for development of new alignment and tree inference methods on very difficult datasets. The datasets are intended to help address three problems: multiple sequence alignment, phylogeny estimation given aligned sequences, and supertree estimation. Datasets from our work include empirical datasets with carefully curated alignments suitable for testing alignment and phylogenetic methods for large-scale systematics studies. Links to other empirical datasets, lacking curated alignments, are also provided. We also include simulated datasets with properties typical of large-scale systematics studies, including high rates of substitutions and indels, and we include the true alignment and tree for each simulated dataset. Finally, we provide links to software tools for generating simulated datasets, and for evaluating the accuracy of alignments and trees estimated on these datasets. We welcome contributions to the benchmark datasets from other researchers.

18.
PLoS Curr ; 2: RRN1198, 2010 Nov 19.
Article in English | MEDLINE | ID: mdl-21113338

ABSTRACT

Over the last decade, dramatic advances have been made in developing methods for large-scale phylogeny estimation, so that it is now feasible for investigators with moderate computational resources to obtain reasonable solutions to maximum likelihood and maximum parsimony, even for datasets with a few thousand sequences. There has also been progress on developing methods for multiple sequence alignment, so that greater alignment accuracy (and subsequent improvement in phylogenetic accuracy) is now possible through automated methods. However, these methods have not been tested under conditions that reflect properties of datasets confronted by large-scale phylogenetic estimation projects. In this paper we report on a study that compares several alignment methods on a benchmark collection of nucleotide sequence datasets of up to 78,132 sequences. We show that as the number of sequences increases, the number of alignment methods that can analyze the datasets decreases. Furthermore, the most accurate alignment methods are unable to analyze the very largest datasets we studied, so that only moderately accurate alignment methods can be used on the largest datasets. As a result, alignments computed for large datasets have relatively large error rates, and maximum likelihood phylogenies computed on these alignments also have high error rates. Therefore, the estimation of highly accurate multiple sequence alignments is a major challenge for Tree of Life projects, and more generally for large-scale systematics studies.

19.
Algorithms Mol Biol ; 5: 8, 2010 Jan 04.
Article in English | MEDLINE | ID: mdl-20047664

ABSTRACT

BACKGROUND: Supertree methods comprise one approach to reconstructing large molecular phylogenies given multi-marker datasets: trees are estimated on each marker and then combined into a tree (the "supertree") on the entire set of taxa. Supertrees can be constructed using various algorithmic techniques, with the most common being matrix representation with parsimony (MRP). When the data allow, the competing approach is a combined analysis (also known as a "supermatrix" or "total evidence" approach) whereby the different sequence data matrices for each of the different subsets of taxa are concatenated into a single supermatrix, and a tree is estimated on that supermatrix. RESULTS: In this paper, we describe an extensive simulation study we performed comparing two supertree methods, MRP and weighted MRP, to combined analysis methods on large model trees. A key contribution of this study is our novel simulation methodology (Super-Method Input Data Generator, or SMIDGen) that better reflects biological processes and the practices of systematists than earlier simulations. We show that combined analysis based upon maximum likelihood outperforms MRP and weighted MRP, giving especially big improvements when the largest subtree does not contain most of the taxa. CONCLUSIONS: This study demonstrates that MRP and weighted MRP produce distinctly less accurate trees than combined analyses for a given base method (maximum parsimony or maximum likelihood). Since there are situations in which combined analyses are not feasible, there is a clear need for better supertree methods. The source tree and combined datasets used in this study can be used to test other supertree and combined analysis methods.

20.
Article in English | MEDLINE | ID: mdl-19179695

ABSTRACT

Several methods have been developed for simultaneous estimation of alignment and tree, of which POY is the most popular. In a 2007 paper published in Systematic Biology, Ogden and Rosenberg reported on a simulation study in which they compared POY to estimating the alignment using ClustalW and then analyzing the resultant alignment using maximum parsimony. They found that ClustalW+MP outperformed POY with respect to alignment and phylogenetic tree accuracy, and they concluded that simultaneous estimation techniques are not competitive with two-phase techniques. Our paper presents a simulation study in which we focus on the NP-hard optimization problem that POY addresses: minimizing treelength. Our study considers the impact of the gap penalty and suggests that the poor performance observed for POY by Ogden and Rosenberg is due to the simple gap penalties they used to score alignment/tree pairs. Our study suggests that optimizing under an affine gap penalty might produce alignments that are better than ClustalW alignments, and competitive with those produced by the best current alignment methods. We also show that optimizing under this affine gap penalty produces trees whose topological accuracy is better than ClustalW+MP, and competitive with the current best two-phase methods.


Subject(s)
Molecular Evolution , Markov Chains , Phylogeny , Sequence Alignment , Systems Biology/methods , Computational Biology , Computer Simulation , Genetic Models , Statistical Models , Software
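The abstract above contrasts simple and affine gap penalties. The sketch below shows the difference on a single aligned row; it is not POY's treelength computation, and the penalty values are hypothetical. A simple penalty charges each gap character the same amount, whereas an affine penalty charges an opening cost once per gap run plus an extension cost per gap character. Conventions differ on whether the first character of a run also pays the extension cost; this sketch assumes it does.

```python
import re


def gap_runs(row):
    """Return the lengths of the maximal runs of '-' in one aligned sequence."""
    return [len(match.group()) for match in re.finditer(r"-+", row)]


def simple_gap_cost(row, per_gap=1.0):
    """Simple penalty: every gap character costs the same, however gaps are grouped."""
    return per_gap * sum(gap_runs(row))


def affine_gap_cost(row, gap_open=4.0, gap_extend=1.0):
    """Affine penalty: each run of length n costs gap_open + gap_extend * n."""
    return sum(gap_open + gap_extend * length for length in gap_runs(row))


if __name__ == "__main__":
    one_long_gap = "AC---GT"      # a single indel event of length 3
    three_short_gaps = "AC-G-T-"  # three separate single-site indels
    for row in (one_long_gap, three_short_gaps):
        print(row, simple_gap_cost(row), affine_gap_cost(row))
```

Under the simple penalty both rows cost the same, while the affine penalty favors the single long gap, which is the kind of behavior the abstract credits for the improved alignments.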