ABSTRACT
There are two main life cycles in plants: annual and perennial [1,2]. These life cycles are associated with different traits that determine ecosystem function [3,4]. Although life cycles are textbook examples of plant adaptation to different environments, we lack comprehensive knowledge regarding their global distributional patterns. Here we assembled an extensive database of plant life cycle assignments of 235,000 plant species coupled with millions of georeferenced datapoints to map the worldwide biogeography of these plant species. We found that annual plants are half as common as initially thought [5-8], accounting for only 6% of plant species. Our analyses indicate that annuals are favoured in hot and dry regions. However, a more accurate model shows that the prevalence of annual species is driven by temperature and precipitation in the driest quarter (rather than yearly means), explaining, for example, why some Mediterranean systems have more annuals than desert systems. Furthermore, this pattern remains consistent among different families, indicating convergent evolution. Finally, we demonstrate that increasing climate variability and anthropogenic disturbance increase annual favourability. Considering future climate change, we predict an increase in annual prevalence for 69% of the world's ecoregions by 2060. Overall, our analyses raise concerns for ecosystem services provided by perennial plants, as ongoing changes are leading to a higher proportion of annual plants globally.
Subject(s)
Ecosystem, Geographic Mapping, Phylogeography, Plant Physiological Phenomena, Plants, Acclimatization, Biological Evolution, Climate Change/statistics & numerical data, Factual Databases, Desert Climate, Human Activities, Mediterranean Region, Plants/classification, Rain, Temperature
ABSTRACT
The computational search for the maximum-likelihood phylogenetic tree is an NP-hard problem. As such, current tree search algorithms might return a tree that is a local optimum rather than the global one. Here, we introduce a paradigm shift for predicting the maximum-likelihood tree, by approximating long-term gains of likelihood rather than maximizing likelihood gain at each step of the search. Our proposed approach harnesses the power of reinforcement learning to learn an optimal search strategy, aiming at the global optimum of the search space. We show that when analyzing empirical data containing dozens of sequences, the log-likelihood improvement from the starting tree obtained by the reinforcement learning-based agent was 0.969 or higher relative to that achieved by current state-of-the-art techniques. Notably, this performance is attained without the need to perform costly likelihood optimizations apart from the training process, thus potentially allowing for an exponential reduction in runtime. We exemplify this for data sets containing 15 sequences of length 18,000 bp and demonstrate that the reinforcement learning-based method is roughly three times faster than the state-of-the-art software. This study illustrates the potential of reinforcement learning in addressing the challenges of phylogenetic tree reconstruction.
Subject(s)
Algorithms, Phylogeny, Likelihood Functions, Genetic Models, Computational Biology/methods, Software
ABSTRACT
MOTIVATION: Currently used methods for estimating branch support in phylogenetic analyses often rely on the classic Felsenstein's bootstrap, parametric tests, or their approximations. As these branch support scores are widely used in phylogenetic analyses, having accurate, fast, and interpretable scores is of high importance. RESULTS: Here, we employed a data-driven approach to estimate branch support values with a probabilistic interpretation. To this end, we simulated thousands of realistic phylogenetic trees and the corresponding multiple sequence alignments. Each of the obtained alignments was used to infer the phylogeny using state-of-the-art phylogenetic inference software, which was then compared to the true tree. Using these extensive data, we trained machine-learning algorithms to estimate branch support values for each bipartition within the maximum-likelihood trees obtained by each software. Our results demonstrate that our model provides fast and more accurate probability-based branch support values than commonly used procedures. We demonstrate the applicability of our approach on empirical datasets. AVAILABILITY AND IMPLEMENTATION: The data supporting this work are available in the Figshare repository at https://doi.org/10.6084/m9.figshare.25050554.v1, and the underlying code is accessible via GitHub at https://github.com/noaeker/bootstrap_repo.
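The labelling step described above (comparing each bipartition of an inferred tree with the true tree) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the nested-tuple tree encoding, the function names, and the five-taxon example trees are all assumptions made here for the sketch.

```python
from typing import Dict, FrozenSet, Set, Tuple, Union

# Hypothetical encoding: a tree is a leaf name or a tuple of subtrees.
Tree = Union[str, Tuple["Tree", ...]]


def leaves(tree: Tree) -> FrozenSet[str]:
    """Return the set of leaf names under a (sub)tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def bipartitions(tree: Tree) -> Set[FrozenSet[str]]:
    """Collect the non-trivial bipartitions induced by internal branches.

    Each split is stored as the side that does not contain the
    alphabetically smallest taxon, so identical splits compare equal.
    """
    all_taxa = leaves(tree)
    anchor = min(all_taxa)
    splits: Set[FrozenSet[str]] = set()

    def visit(node: Tree) -> None:
        if isinstance(node, str):
            return
        clade = leaves(node)
        side = all_taxa - clade if anchor in clade else clade
        # Skip trivial splits (fewer than two taxa on either side).
        if 1 < len(side) < len(all_taxa) - 1:
            splits.add(side)
        for child in node:
            visit(child)

    visit(tree)
    return splits


def label_branches(inferred: Tree, true: Tree) -> Dict[FrozenSet[str], bool]:
    """Mark each bipartition of the inferred tree as present in the true tree or not."""
    true_splits = bipartitions(true)
    return {split: split in true_splits for split in bipartitions(inferred)}


# Hypothetical example trees on five taxa.
true_tree = ((("A", "B"), "C"), ("D", "E"))
inferred_tree = ((("A", "C"), "B"), ("D", "E"))
labels = label_branches(inferred_tree, true_tree)
for split, is_correct in sorted(labels.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(split), int(is_correct))
```

Labels of this kind (1 if the branch exists in the true tree, 0 otherwise) are the sort of target a classifier could be trained on; the features and learning algorithm used in the study are not specified in the abstract and are not reproduced here.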
Subject(s)
Algorithms, Machine Learning, Phylogeny, Software, Sequence Alignment/methods, Computational Biology/methods, Likelihood Functions
ABSTRACT
MOTIVATION: Insertions and deletions (indels) of short DNA segments, along with substitutions, are the most frequent molecular evolutionary events. Indels were shown to affect numerous macro-evolutionary processes. Because indels may span multiple positions, their impact is a product of both their rate and their length distribution. An accurate inference of indel-length distribution is important for multiple evolutionary and bioinformatics applications, most notably for alignment software. Previous studies counted the number of continuous gap characters in alignments to determine the best-fitting length distribution. However, gap-counting methods are not statistically rigorous, as gap blocks are not synonymous with indels. Furthermore, such methods rely on alignments that regularly contain errors and are biased by the assumption made by alignment methods that indel lengths follow a geometric distribution. RESULTS: We aimed to determine which indel-length distribution best characterizes alignments using statistically rigorous methodologies. To this end, we reduced the alignment bias using a machine-learning algorithm and applied an Approximate Bayesian Computation methodology for model selection. Moreover, we developed a novel method to test if current indel models provide an adequate representation of the evolutionary process. We found that the best-fitting model varies among alignments, with a Zipf length distribution fitting the vast majority of them. AVAILABILITY AND IMPLEMENTATION: The data underlying this article are available on GitHub, at https://github.com/elyawy/SpartaSim and https://github.com/elyawy/SpartaPipeline.
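To make the contrast between a geometric and a Zipf length distribution concrete, the sketch below scores a list of indel lengths under both. It swaps in a plain maximum-likelihood comparison for illustration; the study itself uses Approximate Bayesian Computation, which is not reproduced here. The truncation at `max_len`, the parameter grid, and the example lengths are all hypothetical.

```python
import math
from typing import Sequence


def zipf_log_likelihood(lengths: Sequence[int], a: float, max_len: int) -> float:
    """Log-likelihood under a truncated Zipf law: P(k) proportional to k**(-a), 1 <= k <= max_len."""
    norm = sum(k ** (-a) for k in range(1, max_len + 1))
    return sum(-a * math.log(k) - math.log(norm) for k in lengths)


def geometric_log_likelihood(lengths: Sequence[int], p: float) -> float:
    """Log-likelihood under a geometric law: P(k) = (1 - p)**(k - 1) * p, k >= 1."""
    return sum((k - 1) * math.log(1 - p) + math.log(p) for k in lengths)


def fit_geometric(lengths: Sequence[int]) -> float:
    """Closed-form maximum-likelihood estimate of p (one over the mean length)."""
    return len(lengths) / sum(lengths)


def fit_zipf(lengths: Sequence[int], max_len: int) -> float:
    """Grid-search estimate of the Zipf exponent (grid bounds are an assumption)."""
    grid = [1.0 + 0.01 * i for i in range(1, 300)]
    return max(grid, key=lambda a: zipf_log_likelihood(lengths, a, max_len))


# Hypothetical indel lengths: mostly short, with a few long events.
example_lengths = [1, 1, 1, 1, 1, 1, 2, 2, 3, 5, 12, 40]
max_len = 50  # assumed upper bound on indel length

p_hat = fit_geometric(example_lengths)  # requires at least one length > 1
a_hat = fit_zipf(example_lengths, max_len)
print("geometric:", p_hat, geometric_log_likelihood(example_lengths, p_hat))
print("zipf:     ", a_hat, zipf_log_likelihood(example_lengths, a_hat, max_len))
```

Both models have a single free parameter here, so the two log-likelihoods can be compared directly; a heavier-tailed sample tends to favour the Zipf law, since the geometric law assigns exponentially small probability to long indels.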
Subject(s)
Algorithms, Software, Bayes Theorem, Sequence Alignment, INDEL Mutation, Molecular Evolution
ABSTRACT
MOTIVATION: In recent years, full-genome sequences have become increasingly available and as a result many modern phylogenetic analyses are based on very long sequences, often with over 100 000 sites. Phylogenetic reconstructions of large-scale alignments are challenging for likelihood-based phylogenetic inference programs and usually require using a powerful computer cluster. Current tools for alignment trimming prior to phylogenetic analysis do not promise a significant reduction in the alignment size and are claimed to have a negative effect on the accuracy of the obtained tree. RESULTS: Here, we propose an artificial-intelligence-based approach, which provides a means to select the optimal subset of sites and a formula by which one can compute the log-likelihood of the entire data based on this subset. Our approach is based on training a regularized Lasso-regression model that optimizes the log-likelihood prediction accuracy while putting a constraint on the number of sites used for the approximation. We show that computing the likelihood based on 5% of the sites already provides an accurate approximation of the tree likelihood based on the entire data. Furthermore, we show that using this Lasso-based approximation during a tree search decreased running time substantially while retaining the same tree-search performance. AVAILABILITY AND IMPLEMENTATION: The code was implemented in Python version 3.8 and is available through GitHub (https://github.com/noaeker/lasso_positions_sampling). The datasets used in this paper were retrieved from Zhou et al. (2018) as described in section 3. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
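The abstract does not state the regression objective. As a hedged illustration only, a standard Lasso formulation of the idea could be written as below, where every symbol is an assumption introduced here: $\ell_{t,j}$ is the log-likelihood of site $j$ on training tree $t$, $\ell_t = \sum_{j=1}^{n} \ell_{t,j}$ is the full-data log-likelihood of that tree, $T$ is the number of training trees, and $\lambda$ controls sparsity.

```latex
\[
(\hat{\beta}_0, \hat{\beta}) \;=\;
\operatorname*{arg\,min}_{\beta_0,\,\beta}\;
\frac{1}{2T} \sum_{t=1}^{T}
\Bigl( \ell_t - \beta_0 - \sum_{j=1}^{n} \beta_j \, \ell_{t,j} \Bigr)^{2}
\;+\; \lambda \sum_{j=1}^{n} \lvert \beta_j \rvert
\]
\[
\ell(\text{new tree}) \;\approx\; \hat{\beta}_0 + \sum_{j \,:\, \hat{\beta}_j \neq 0} \hat{\beta}_j \, \ell_j(\text{new tree})
\]
```

Under this reading, the sites with non-zero coefficients form the selected subset, and the second expression plays the role of the formula for recovering the full log-likelihood from that subset. The $\ell_1$ penalty acts as a soft stand-in for the constraint on the number of sites mentioned in the abstract.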
Subject(s)
Artificial Intelligence, Software, Likelihood Functions, Phylogeny
ABSTRACT
Changes in chromosome numbers, including polyploidy and dysploidy events, play a key role in eukaryote evolution as they could expedite reproductive isolation and have the potential to foster phenotypic diversification. Deciphering the pattern of chromosome-number change within a phylogeny currently relies on probabilistic evolutionary models. All currently available models assume time homogeneity, such that the transition rates are identical throughout the phylogeny. Here, we develop heterogeneous models of chromosome-number evolution that allow multiple transition regimes to operate in distinct parts of the phylogeny. The partition of the phylogeny into distinct transition regimes may be specified by the researcher or, alternatively, identified using a sequential testing approach. Once the number and locations of shifts in the transition pattern are determined, a second search phase identifies regimes with similar transition dynamics, which could indicate convergent evolution. Using simulations, we study the performance of the developed model to detect shifts in patterns of chromosome-number evolution and demonstrate its applicability by analyzing the evolution of chromosome numbers within the Cyperaceae plant family. The developed model extends the capabilities of probabilistic models of chromosome-number evolution and should be particularly helpful for the analyses of large phylogenies that include multiple distinct subclades.
Subject(s)
Chromosomes, Cyperaceae, Phylogeny, Cyperaceae/genetics, Polyploidy, Plants/genetics, Molecular Evolution
ABSTRACT
Insertions and deletions (indels) are common molecular evolutionary events. However, probabilistic models for indel evolution are under-developed due to their computational complexity. Here, we introduce several improvements to indel modeling: 1) While previous models for indel evolution assumed that the rates and length distributions of insertions and deletions are equal, here we propose a richer model that explicitly distinguishes between the two; 2) we introduce numerous summary statistics that allow approximate Bayesian computation-based parameter estimation; 3) we develop a method to correct for biases introduced by alignment programs, when inferring indel parameters from empirical data sets; and 4) using a model-selection scheme, we test whether the richer model better fits biological data compared with the simpler model. Our analyses suggest that both our inference scheme and the model-selection procedure achieve high accuracy on simulated data. We further demonstrate that our proposed richer model better fits a large number of empirical data sets and that, for the majority of these data sets, the deletion rate is higher than the insertion rate.
Subject(s)
Molecular Evolution, INDEL Mutation, Bayes Theorem, Statistical Models, Phylogeny
ABSTRACT
Detecting the signature of selection in coding sequences and associating it with shifts in phenotypic states can unveil genes underlying complex traits. Of the various signatures of selection exhibited at the molecular level, changes in the pattern of selection at protein-coding genes have been of main interest. To this end, phylogenetic branch-site codon models are routinely applied to detect changes in selective patterns along specific branches of the phylogeny. Many of these methods rely on a prespecified partition of the phylogeny into branch categories, thus treating the course of trait evolution as fully resolved and assuming that phenotypic transitions have occurred only at speciation events. Here, we present TraitRELAX, a new phylogenetic model that alleviates these strong assumptions by explicitly accounting for the uncertainty in the evolution of both trait and coding sequences. This joint statistical framework enables the detection of changes in selection intensity upon repeated trait transitions. We evaluated the performance of TraitRELAX using simulations and then applied it to two case studies. Using TraitRELAX, we found an intensification of selection in the primate SEMG2 gene in polygynandrous species compared to species of other mating forms, as well as changes in the intensity of purifying selection operating on sixteen bacterial genes upon transitioning from a free-living to an endosymbiotic lifestyle. [Evolutionary selection; intensification; γ-proteobacteria; genotype-phenotype; relaxation; SEMG2.]
Subject(s)
Molecular Evolution, Phenotype, Genetic Selection, Animals, Codon, Genetic Models, Phylogeny, Primates/genetics
ABSTRACT
Statistical criteria have long been the standard for selecting the best model for phylogenetic reconstruction and downstream statistical inference. Although model selection is regarded as a fundamental step in phylogenetics, existing methods for this task require substantial computational resources and long processing times, are not always feasible, and sometimes depend on preliminary assumptions that do not hold for sequence data. Moreover, although these methods are dedicated to revealing the processes that underlie the sequence data, they do not always produce the most accurate trees. Notably, phylogeny reconstruction consists of two related tasks, topology reconstruction and branch-length estimation. It was previously shown that in many cases the most complex model, GTR+I+G, leads to topologies that are as accurate as those obtained using existing model selection criteria, but overestimates branch lengths. Here, we present ModelTeller, a computational methodology for phylogenetic model selection, devised within the machine-learning framework, optimized to predict the most accurate nucleotide substitution model for branch-length estimation. We demonstrate that ModelTeller leads to more accurate branch-length inference than current model selection criteria on data sets simulated under realistic processes. ModelTeller relies on a readily implemented machine-learning model, and thus prediction from features extracted from the sequence data results in a substantial decrease in running time compared with existing strategies. By harnessing the machine-learning framework, we distinguish between features that mostly contribute to branch-length optimization, concerning the extent of sequence divergence, and features that are related to estimates of the model parameters that are important for the selection made by current criteria.
Subject(s)
Machine Learning, Genetic Models, Phylogeny
ABSTRACT
Preventing and controlling epidemics caused by vector-borne viruses are particularly challenging due to their diverse pool of hosts and highly adaptive nature. Many vector-borne viruses belong to the Flavivirus genus, whose members vary greatly in host range and specificity. Members of the Flavivirus genus can be categorized into four main groups: insect-specific viruses that are maintained solely in arthropod populations, mosquito-borne viruses and tick-borne viruses that are transmitted to vertebrate hosts by mosquitoes or ticks via blood feeding, and those with no known vector. The mosquito-borne group encompasses the yellow fever, dengue, and West Nile viruses, all of which are globally spread and cause severe morbidity in humans. The Flavivirus genus is genetically diverse, and its members are subject to different host-specific and vector-specific selective constraints, which do not always align. Thus, understanding the underlying genetic differences that led to the diversity in host range within this genus is an important aspect in deciphering the mechanisms that drive host compatibility and can aid in the constant arms race against viral threats. Here, we review the phylogenetic relationships between members of the genus, their infection bottlenecks, and phenotypic and genomic differences. We further discuss methods that utilize these differences for prediction of host shifts in flaviviruses and can contribute to viral surveillance efforts.
Subject(s)
Culicidae, Flavivirus, Animals, Culicidae/genetics, Flavivirus/genetics, Host Specificity, Humans, Mosquito Vectors/genetics, Phylogeny
ABSTRACT
The role of plant-pollinator interactions in the rapid radiation of the angiosperms has long fascinated evolutionary biologists. Studies have brought evidence for pollinator-driven diversification of various plant lineages, particularly plants with specialized flowers and concealed rewards. By contrast, little is known about how this crucial interaction has shaped macroevolutionary patterns of floral visitors. In particular, there is currently no empirical evidence that floral host association has increased diversification in bees, the most prominent group of floral visitors that essentially rely on angiosperm pollen. In this study, we examine how floral host preference influenced diversification in eucerine bees (Apidae, Eucerini), which exhibit large variations in their floral associations. We combine quantitative pollen analyses with a recently proposed phylogenetic hypothesis, and use a state speciation and extinction probabilistic approach. Using this framework, we provide the first evidence that multiple evolutionary transitions from host plants with accessible pollen to restricted pollen from 'bee-flowers' have significantly increased the diversification of a bee clade. We suggest that exploiting host plants with restricted pollen has allowed the exploitation of a new ecological niche for eucerine bees and contributed both to their colonization of vast regions of the world and their rapid diversification.
Subject(s)
Flowers, Pollination, Animals, Bees, Biological Evolution, Phylogeny, Pollen
ABSTRACT
Chromosome number is a central feature of eukaryote genomes. Deciphering patterns of chromosome-number change along a phylogeny is central to the inference of whole genome duplications and ancestral chromosome numbers. ChromEvol is a probabilistic inference tool that allows the evaluation of several models of chromosome-number evolution and their fit to the data. However, fitting a model does not necessarily mean that the model describes the empirical data adequately. This vulnerability may lead to incorrect conclusions when model assumptions are not met by real data. Here, we present a model adequacy test for likelihood models of chromosome-number evolution. The procedure allows us to determine whether the model can generate data with characteristics similar to those of the observed data. We demonstrate that using inadequate models can lead to inflated errors in several inference tasks. Applying the developed method to 200 angiosperm genera, we find that in many of these, the best-fitting model provides a poor fit to the data. The inadequacy rate increases in large clades or in those in which hybridizations are present. The developed model adequacy test can help researchers to identify phylogenies whose underlying evolutionary patterns deviate substantially from current modelling assumptions and should guide future methods development.
Subject(s)
Molecular Evolution, Magnoliopsida, Chromosomes, Genetic Models, Statistical Models, Phylogeny
ABSTRACT
If particular traits consistently affect rates of speciation and extinction, broad macroevolutionary patterns can be interpreted as consequences of selection at high levels of the biological hierarchy. Identifying traits associated with diversification rates is difficult because of the wide variety of characters under consideration and the statistical challenges of testing for associations from comparative phylogenetic data. Ploidy (diploid vs polyploid states) and breeding system (self-incompatible vs self-compatible states) are both thought to be drivers of differential diversification in angiosperms. We fit 29 diversification models to extensive trait and phylogenetic data in Solanaceae and investigate how speciation and extinction rate differences are associated with ploidy, breeding system, and the interaction between these traits. We show that diversification patterns in Solanaceae are better explained by breeding system and an additional unobserved factor, rather than by ploidy. We also find that the most common evolutionary pathway to polyploidy in Solanaceae occurs via direct breakdown of self-incompatibility by whole genome duplication, rather than indirectly via breakdown followed by polyploidization. Comparing multiple stochastic diversification models that include complex trait interactions alongside hidden states enhances our understanding of the macroevolutionary patterns in plant phylogenies.
Subject(s)
Biodiversity, Phylogeny, Plant Breeding, Ploidies, Bayes Theorem, Biological Models, Polyploidy, Heritable Quantitative Trait
ABSTRACT
Understanding species adaptation at the molecular level has been a central goal of evolutionary biology and genomics research. This important task becomes increasingly relevant with the constant rise in the availability of both genotypic and phenotypic data. The TraitRateProp web server offers a unique perspective into this task by allowing the detection of associations between sequence evolution rate and whole-organism phenotypes. By analyzing sequences and phenotypes of extant species in the context of their phylogeny, it identifies sequence sites in a gene/protein whose evolutionary rate is associated with shifts in the phenotype. To this end, it considers alternative histories of whole-organism phenotypic changes, which result in the extant phenotypic states. Its joint likelihood framework that combines models of sequence and phenotype evolution allows testing whether an association between these processes exists. In addition to predicting sequence sites most likely to be associated with the phenotypic trait, the server can optionally integrate structural 3D information. This integration allows a visual detection of trait-associated sequence sites that are juxtaposed in 3D space, thereby suggesting a common functional role. We used TraitRateProp to study the shifts in sequence evolution rate of the RPS8 protein upon transitions into heterotrophy in Orchidaceae. TraitRateProp is available at http://traitrate.tau.ac.il/prop.
Subject(s)
Molecular Evolution, Sequence Analysis, Software, Algorithms, Internet, Orchidaceae/genetics, Phenotype, Phylogeny, Ribosomal Proteins/chemistry, Ribosomal Proteins/genetics
ABSTRACT
Recent years have seen a constant rise in the availability of trait data, including morphological features, ecological preferences, and life history characteristics. These phenotypic data provide means to associate genomic regions with phenotypic attributes, thus allowing the identification of phenotypic traits associated with the rate of genome and sequence evolution. However, inference methodologies that analyze sequence and phenotypic data in a unified statistical framework are still scarce. Here, we present TraitRateProp, a probabilistic method that allows testing whether the rate of sequence evolution is associated with a binary phenotypic character trait. The method further allows the detection of specific sequence sites whose evolutionary rate is most noticeably affected following the character transition, suggesting a shift in functional/structural constraints. TraitRateProp is first evaluated in simulations and then applied to study the evolutionary process of plastid plant genomes upon a transition to a heterotrophic lifestyle. To this end, we analyze 20 plastid genes across 85 orchid species, spanning different lifestyles and representing different genera in this large family of flowering plants. Our results indicate higher evolutionary rates following repeated transitions to a heterotrophic lifestyle in all but four of the loci analyzed. [Evolutionary models; evolutionary rate; genotype-phenotype; orchids; plastome; rate shift.].
Subject(s)
Classification/methods, Molecular Evolution, Plant Genome/genetics, Genetic Models, Computer Simulation, Plastid Genome/genetics, Phenotype
ABSTRACT
The adaptation of the CRISPR-Cas9 system as a genome editing technique has generated much excitement in recent years owing to its ability to manipulate targeted genes and genomic regions that are complementary to a programmed single guide RNA (sgRNA). However, the efficacy of a specific sgRNA is not uniquely defined by exact sequence homology to the target site, thus unintended off-targets might additionally be cleaved. Current methods for sgRNA design are mainly concerned with predicting off-targets for a given sgRNA using basic sequence features and employ elementary rules for ranking possible sgRNAs. Here, we introduce CRISTA (CRISPR Target Assessment), a novel algorithm within the machine learning framework that determines the propensity of a genomic site to be cleaved by a given sgRNA. We show that the predictions made with CRISTA are more accurate than other available methodologies. We further demonstrate that the occurrence of bulges is not a rare phenomenon and should be accounted for in the prediction process. Beyond predicting cleavage efficiencies, the learning process provides inferences regarding patterns that underlie the mechanism of action of the CRISPR-Cas9 system. We discover that attributes that describe the spatial structure and rigidity of the entire genomic site as well as those surrounding the PAM region are a major component of the prediction capabilities.
Subject(s)
CRISPR-Cas Systems/genetics, Computational Biology/methods, Gene Editing/methods, Machine Learning, Algorithms, Humans, Kinetoplastida Guide RNA/genetics, ROC Curve
ABSTRACT
The degree of evolutionary conservation of an amino acid in a protein or a nucleic acid in DNA/RNA reflects a balance between its natural tendency to mutate and the overall need to retain the structural integrity and function of the macromolecule. The ConSurf web server (http://consurf.tau.ac.il), established over 15 years ago, analyses the evolutionary pattern of the amino/nucleic acids of the macromolecule to reveal regions that are important for structure and/or function. Starting from a query sequence or structure, the server automatically collects homologues, infers their multiple sequence alignment and reconstructs a phylogenetic tree that reflects their evolutionary relations. These data are then used, within a probabilistic framework, to estimate the evolutionary rates of each sequence position. Here we introduce several new features into ConSurf, including automatic selection of the best evolutionary model used to infer the rates, the ability to homology-model query proteins, prediction of the secondary structure of query RNA molecules from sequence, the ability to view the biological assembly of a query (in addition to the single chain), mapping of the conservation grades onto 2D RNA models and an advanced view of the phylogenetic tree that enables interactively rerunning ConSurf with the taxa of a sub-tree.
Subject(s)
Biological Evolution, DNA/chemistry, Statistical Models, Proteins/chemistry, RNA/chemistry, User-Computer Interface, Algorithms, Amino Acid Sequence, Animals, Base Sequence, Computer Graphics, Conserved Sequence, DNA/genetics, Escherichia coli/genetics, Escherichia coli/metabolism, Humans, Internet, Nucleic Acid Conformation, Phylogeny, Plants/genetics, Plants/metabolism, Protein Domains, Protein Secondary Structure, Proteins/genetics, RNA/genetics, Saccharomyces cerevisiae/genetics, Saccharomyces cerevisiae/metabolism, Sequence Alignment, Amino Acid Sequence Homology, Nucleic Acid Sequence Homology
ABSTRACT
Hybridization and genome doubling (allopolyploidy) have led to evolutionary novelties as well as to the origin of new clades and species. Despite the importance of allopolyploidization, the dynamics of postpolyploid diploidization (PPD) at the genome level has been only sparsely studied. The Microlepidieae (MICR) is a crucifer tribe of 17 genera and c. 56 species endemic to Australia and New Zealand. Our phylogenetic and cytogenomic analyses revealed that MICR originated via an intertribal hybridization between ancestors of Crucihimalayeae (n = 8; maternal genome) and Smelowskieae (n = 7; paternal genome), both native to the Northern Hemisphere. The reconstructed ancestral allopolyploid genome (n = 15) originated probably in northeastern Asia or western North America during the Late Miocene (c. 10.6-7 million years ago) and reached the Australian mainland via long-distance dispersal. In Australia, the allotetraploid genome diverged into at least three main subclades exhibiting different levels of PPD and diversity: 1.25-fold descending dysploidy (DD) of n = 15 → n = 12 (autopolyploidy → 24) in perennial Arabidella (3 species), 1.5-fold DD of n = 15 → n = 10 in the perennial Pachycladon (11 spp.) and 2.1-3.75-fold DD of n = 15 → n = 7-4 in the largely annual crown-group genera (42 spp. in 15 genera). These results are among the first to demonstrate multispeed genome evolution in taxa descending from a common allopolyploid ancestor. It is suggested that clade-specific PPD can operate at different rates and efficacies and can be tentatively linked to life histories and the extent of taxonomic diversity.
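The fold values quoted for descending dysploidy follow from dividing the ancestral chromosome number by the derived one. The short check below is only an arithmetic illustration; the function name is hypothetical and the numbers are those given in the abstract.

```python
def dysploidy_fold(ancestral_n: int, derived_n: int) -> float:
    """Fold reduction in base chromosome number (ancestral n divided by derived n)."""
    return ancestral_n / derived_n


ANCESTRAL_N = 15  # reconstructed allopolyploid ancestor (n = 8 + n = 7)

# Derived chromosome numbers reported for the three subclades.
derived = {
    "Arabidella": [12],
    "Pachycladon": [10],
    "crown-group genera": [7, 4],
}

for clade, numbers in derived.items():
    folds = [round(dysploidy_fold(ANCESTRAL_N, n), 2) for n in numbers]
    print(clade, folds)
```

The results match the 1.25-fold, 1.5-fold and 2.1-3.75-fold ranges stated above, with 15/7 rounding to 2.1.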