ABSTRACT
Accurate and timely detection of recombinant lineages is crucial for interpreting genetic variation, reconstructing epidemic spread, identifying selection and variants of interest, and accurately performing phylogenetic analyses [1-4]. During the SARS-CoV-2 pandemic, genomic data generation has exceeded the capacities of existing analysis platforms, thereby crippling real-time analysis of viral evolution [5]. Here, we use a new phylogenomic method to search a nearly comprehensive SARS-CoV-2 phylogeny for recombinant lineages. In a 1.6 million sample tree from May 2021, we identify 589 recombination events, which indicate that around 2.7% of sequenced SARS-CoV-2 genomes have detectable recombinant ancestry. Recombination breakpoints are inferred to occur disproportionately in the 3' portion of the genome that contains the spike protein. Our results highlight the need for timely analyses of recombination for pinpointing the emergence of recombinant lineages with the potential to increase transmissibility or virulence of the virus. We anticipate that this approach will empower comprehensive real-time tracking of viral recombination during the SARS-CoV-2 pandemic and beyond.
Subjects
COVID-19, Viral Genome, Pandemics, Phylogeny, Genetic Recombination, SARS-CoV-2, COVID-19/epidemiology, COVID-19/transmission, COVID-19/virology, Viral Genome/genetics, Humans, Mutation, Genetic Recombination/genetics, SARS-CoV-2/genetics, SARS-CoV-2/pathogenicity, Genetic Selection/genetics, Coronavirus Spike Glycoprotein/genetics, Virulence/genetics
ABSTRACT
As phylogenomic datasets have grown in size, researchers have developed new ways to measure biological variation and to assess statistical support for specific branches. Larger datasets have more sites and loci, and therefore less sampling variance. While we can more accurately measure the mean signal in these datasets, lower sampling variance is often reflected in uniformly high measures of branch support (such as the bootstrap and posterior probability), limiting their utility. Larger datasets have also revealed substantial biological variation in the topologies found across individual loci, such that the single species tree inferred by most phylogenetic methods represents a limited summary of the data for many purposes. In contrast to measures of statistical support, the degree of underlying topological variation among loci should be approximately constant regardless of the size of the dataset. "Concordance factors" and similar statistics have therefore become increasingly important tools in phylogenetics. In this review, we explain why concordance factors should be thought of as descriptors of topological variation rather than as measures of statistical support, and argue that they provide important information about the predictive power of the species tree not contained in measures of support. We review a growing suite of statistics for measuring concordance, compare them in a common framework that reveals their interrelationships, and demonstrate how to calculate them using an example from birds. We also discuss how measures of topological variation might change in the future as we move beyond estimating a single "tree of life" towards estimating the myriad evolutionary histories underlying genomic variation.
ABSTRACT
Profile mixture models capture distinct biochemical constraints on the amino acid substitution process at different sites in proteins. These models feature a mixture of time-reversible models with a common matrix of exchangeabilities and distinct sets of equilibrium amino acid frequencies known as profiles. Combining the exchangeability matrix with each profile generates the matrix of instantaneous rates of amino acid exchange for that profile. Currently, empirically estimated exchangeability matrices (e.g. the LG matrix) are widely used for phylogenetic inference under profile mixture models. However, these were estimated using a single profile and are unlikely to be optimal for profile mixture models. Here, we describe the GTRpmix model that allows maximum likelihood estimation of a common exchangeability matrix under any profile mixture model. We show that exchangeability matrices estimated under profile mixture models differ from the LG matrix, dramatically improving model fit and topological estimation accuracy for empirical test cases. Because the GTRpmix model is computationally expensive, we provide two exchangeability matrices estimated from large concatenated phylogenomic supermatrices to be used for phylogenetic analyses. One, called Eukaryotic Linked Mixture (ELM), is designed for phylogenetic analysis of proteins encoded by nuclear genomes of eukaryotes, and the other, Eukaryotic and Archaeal Linked mixture (EAL), for reconstructing relationships between eukaryotes and Archaea. These matrices, combined with profile mixture models, fit data better and have improved topology estimation relative to the LG matrix combined with the same mixture models. Starting with version 2.3.1, IQ-TREE2 allows users to estimate linked exchangeabilities (i.e. amino acid exchange rates) under profile mixture models.
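The way one shared exchangeability matrix is combined with each profile can be illustrated with a minimal sketch. The 4-state alphabet, the toy numbers, and the names `rate_matrix`, `EXCHANGE`, and `PROFILES` are assumptions made for illustration only; they come neither from the abstract nor from IQ-TREE, and real profile mixture models use 20 amino acid states. The rescaling to one expected substitution per unit time is a common convention that is assumed here.

```python
def rate_matrix(exchange, profile):
    """Return Q with Q[i][j] = S[i][j] * pi[j] for i != j and rows summing to zero.

    `exchange` is a symmetric matrix S of exchangeabilities (hypothetical values),
    `profile` is a vector pi of equilibrium frequencies that sums to 1.
    """
    n = len(profile)
    q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                q[i][j] = exchange[i][j] * profile[j]
        # The diagonal is still 0.0 here, so this is minus the off-diagonal sum.
        q[i][i] = -sum(q[i])
    # Assumed convention: rescale so the expected rate at equilibrium equals 1.
    scale = -sum(profile[i] * q[i][i] for i in range(n))
    return [[value / scale for value in row] for row in q]


# Toy 4-state example: one shared exchangeability matrix, two profiles.
EXCHANGE = [
    [0.0, 1.0, 2.0, 1.0],
    [1.0, 0.0, 1.0, 2.0],
    [2.0, 1.0, 0.0, 1.0],
    [1.0, 2.0, 1.0, 0.0],
]
PROFILES = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]

matrices = [rate_matrix(EXCHANGE, profile) for profile in PROFILES]
for profile, q in zip(PROFILES, matrices):
    print(profile, [round(value, 3) for value in q[0]])
```

Each profile yields a different rate matrix even though the exchangeabilities are shared, which is the sense in which the exchangeabilities are "linked" across the mixture.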
Subjects
Genetic Models, Phylogeny, Archaea/genetics, Likelihood Functions, Amino Acid Substitution, Molecular Evolution, Eukaryota/genetics
ABSTRACT
Hundreds or thousands of loci are now routinely used in modern phylogenomic studies. Concatenation approaches to tree inference assume that there is a single topology for the entire dataset, but different loci may have different evolutionary histories due to incomplete lineage sorting (ILS), introgression, and/or horizontal gene transfer; even single loci may not be treelike due to recombination. To overcome this shortcoming, we introduce an implementation of a multi-tree mixture model that we call mixtures across sites and trees (MAST). This model extends a prior implementation by Boussau et al. (2009) by allowing users to estimate the weight of each of a set of pre-specified bifurcating trees in a single alignment. The MAST model allows each tree to have its own weight, topology, branch lengths, substitution model, nucleotide or amino acid frequencies, and model of rate heterogeneity across sites. We implemented the MAST model in a maximum-likelihood framework in the popular phylogenetic software, IQ-TREE. Simulations show that we can accurately recover the true model parameters, including branch lengths and tree weights for a given set of tree topologies, under a wide range of biologically realistic scenarios. We also show that we can use standard statistical inference approaches to reject a single-tree model when data are simulated under multiple trees (and vice versa). We applied the MAST model to multiple primate datasets and found that it can recover the signal of ILS in the Great Apes, as well as the asymmetry in minor trees caused by introgression among several macaque species. When applied to a dataset of 4 Platyrrhine species for which standard concatenated maximum likelihood (ML) and gene tree approaches disagree, we observe that MAST gives the highest weight (i.e., the largest proportion of sites) to the tree also supported by gene tree approaches. 
These results suggest that the MAST model is able to analyze a concatenated alignment using ML while avoiding some of the biases that come with assuming there is only a single tree. We discuss how the MAST model can be extended in the future.
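The idea of a tree weight as "the proportion of sites" assigned to each tree can be illustrated with a generic finite-mixture sketch. This is not the IQ-TREE implementation: the per-site likelihood values are invented, the function names are hypothetical, and the expectation-maximisation (EM) update is a standard mixture-model technique assumed here for illustration, since the abstract does not say how the weights are optimised.

```python
import math


def mixture_log_likelihood(site_likelihoods, weights):
    """Sum over sites of log(sum_k w_k * L_k(site))."""
    return sum(
        math.log(sum(w * lik for w, lik in zip(weights, per_tree)))
        for per_tree in site_likelihoods
    )


def em_update(site_likelihoods, weights):
    """One EM step: new weight k = mean posterior probability of tree k over sites."""
    new_weights = [0.0] * len(weights)
    for per_tree in site_likelihoods:
        joint = [w * lik for w, lik in zip(weights, per_tree)]
        total = sum(joint)
        for k, value in enumerate(joint):
            new_weights[k] += value / total
    return [value / len(site_likelihoods) for value in new_weights]


# Hypothetical per-site likelihoods under two fixed trees (rows = sites).
SITE_LIKELIHOODS = [[0.9, 0.1], [0.9, 0.1], [0.9, 0.1], [0.1, 0.9]]

start = [0.5, 0.5]
updated = em_update(SITE_LIKELIHOODS, start)
print(updated)
print(mixture_log_likelihood(SITE_LIKELIHOODS, start),
      mixture_log_likelihood(SITE_LIKELIHOODS, updated))
```

In this toy input three of four sites favour the first tree, so the update moves weight toward it and the mixture log-likelihood increases.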
Subjects
Classification, Phylogeny, Classification/methods, Genetic Models, Computer Simulation, Software, Animals
ABSTRACT
MOTIVATION: Site concordance factors (sCFs) have become a widely used way to summarize discordance in phylogenomic datasets. However, the original version of sCFs was calculated by sampling a quartet of tip taxa and then applying parsimony-based criteria for discordance. This approach has the potential to be strongly affected by multiple hits at a site (homoplasy), especially when substitution rates are high or taxa are not closely related. RESULTS: Here, we introduce a new method for calculating sCFs. The updated version uses likelihood to generate probability distributions of ancestral states at internal nodes of the phylogeny. By sampling from the states at internal nodes adjacent to a given branch, this approach substantially reduces, but does not abolish, the effects of homoplasy and taxon sampling. AVAILABILITY AND IMPLEMENTATION: Updated sCFs are implemented in IQ-TREE 2.2.2. The software is freely available at https://github.com/iqtree/iqtree2/releases. SUPPLEMENTARY INFORMATION: Supplementary information is available at Bioinformatics online.
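The original, parsimony-based criterion for a single sampled quartet can be sketched as follows. This is a simplified illustration rather than the IQ-TREE code: the function name and the toy sequences are hypothetical, and the rule used here (a site counts only if it splits the quartet two against two) is an assumption about how "parsimony-based criteria" are applied. The likelihood-based update described above would replace the tip states with states sampled at internal nodes, which this sketch does not attempt.

```python
def quartet_scf(seq_a, seq_b, seq_c, seq_d):
    """Percentage of decisive sites supporting ab|cd for one quartet of tips.

    A site is treated as decisive when it shows exactly two states, each in
    two taxa. Sites containing gaps or ambiguity codes are skipped.
    """
    valid = set("ACGT")
    counts = [0, 0, 0]  # support for ab|cd, ac|bd, ad|bc
    for a, b, c, d in zip(seq_a, seq_b, seq_c, seq_d):
        if not {a, b, c, d} <= valid:
            continue
        if a == b and c == d and a != c:
            counts[0] += 1
        elif a == c and b == d and a != b:
            counts[1] += 1
        elif a == d and b == c and a != b:
            counts[2] += 1
    decisive = sum(counts)
    if decisive == 0:
        return None  # no decisive sites for this quartet
    return 100.0 * counts[0] / decisive


# Hypothetical aligned sequences for four tips around a branch ab|cd.
scf = quartet_scf("AACGA-T", "AATGCAT", "GCCTAAT", "GCTTCAT")
print(scf)
```

A homoplastic change on a long branch can turn a concordant pattern into a discordant one under this rule, which is the weakness the abstract attributes to the original version.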
Subjects
Software, Phylogeny, Probability
ABSTRACT
MOTIVATION: Neighbour-Joining is one of the most widely used distance-based phylogenetic inference methods. However, current implementations do not scale well for datasets with more than 10,000 sequences. Given the increasing pace of generating new sequence data, particularly in outbreaks of emerging diseases, and the already enormous existing databases of sequence data for which Neighbour-Joining is a useful approach, new implementations of existing methods are warranted. RESULTS: Here, we present DecentTree, which provides highly optimized and parallel implementations of Neighbour-Joining and several of its variants. DecentTree is designed as a stand-alone application and a header-only library easily integrated with other phylogenetic software (e.g. it is integral in the popular IQ-TREE software). We show that DecentTree achieves similar or improved performance over existing software (BIONJ, Quicktree, FastME, and RapidNJ), especially for handling very large alignments. For example, DecentTree is up to 6-fold faster than the fastest existing Neighbour-Joining software (e.g. RapidNJ) when generating a tree of 64,000 SARS-CoV-2 genomes. AVAILABILITY AND IMPLEMENTATION: DecentTree is open source and freely available at https://github.com/iqtree/decenttree. All code and data used in this analysis are available on Github (https://github.com/asdcid/Comparison-of-neighbour-joining-software).
Subjects
COVID-19, Humans, Phylogeny, SARS-CoV-2/genetics, Genomics, Gene Library
ABSTRACT
Phylogenetics has been foundational to SARS-CoV-2 research and public health policy, assisting in genomic surveillance, contact tracing, and assessing emergence and spread of new variants. However, phylogenetic analyses of SARS-CoV-2 have often relied on tools designed for de novo phylogenetic inference, in which all data are collected before any analysis is performed and the phylogeny is inferred once from scratch. SARS-CoV-2 data sets do not fit this mold. There are currently over 14 million sequenced SARS-CoV-2 genomes in online databases, with tens of thousands of new genomes added every day. Continuous data collection, combined with the public health relevance of SARS-CoV-2, invites an "online" approach to phylogenetics, in which new samples are added to existing phylogenetic trees every day. The extremely dense sampling of SARS-CoV-2 genomes also invites a comparison between likelihood and parsimony approaches to phylogenetic inference. Maximum likelihood (ML) and pseudo-ML methods may be more accurate when there are multiple changes at a single site on a single branch, but this accuracy comes at a large computational cost, and the dense sampling of SARS-CoV-2 genomes means that these instances will be extremely rare because each internal branch is expected to be extremely short. Therefore, it may be that approaches based on maximum parsimony (MP) are sufficiently accurate for reconstructing phylogenies of SARS-CoV-2, and their simplicity means that they can be applied to much larger data sets. Here, we evaluate the performance of de novo and online phylogenetic approaches, as well as ML, pseudo-ML, and MP frameworks for inferring large and dense SARS-CoV-2 phylogenies. Overall, we find that online phylogenetics produces similar phylogenetic trees to de novo analyses for SARS-CoV-2, and that MP optimization with UShER and matOptimize produces equivalent SARS-CoV-2 phylogenies to some of the most popular ML and pseudo-ML inference tools. 
MP optimization with UShER and matOptimize is thousands of times faster than presently available implementations of ML, and online phylogenetics is faster than de novo inference. Our results therefore suggest that parsimony-based methods like UShER and matOptimize represent an accurate and more practical alternative to established ML implementations for large SARS-CoV-2 phylogenies and could be successfully applied to other similar data sets with particularly dense sampling and short branch lengths.
Subjects
COVID-19, SARS-CoV-2, Humans, Phylogeny, Probability, Genomics
ABSTRACT
Sequence simulators play an important role in phylogenetics. Simulated data has many applications, such as evaluating the performance of different methods, hypothesis testing with parametric bootstraps, and, more recently, generating data for training machine-learning applications. Many sequence simulation programmes exist, but the most feature-rich programmes tend to be rather slow, and the fastest programmes tend to be feature-poor. Here, we introduce AliSim, a new tool that can efficiently simulate biologically realistic alignments under a large range of complex evolutionary models. To achieve high performance across a wide range of simulation conditions, AliSim implements an adaptive approach that combines the commonly used rate matrix and probability matrix approaches. AliSim takes 1.4 h and 1.3 GB RAM to simulate alignments with one million sequences or sites, whereas popular software Seq-Gen, Dawg, and INDELible require 2-5 h and 50-500 GB of RAM. We provide AliSim as an extension of the IQ-TREE software version 2.2, freely available at www.iqtree.org, and a comprehensive user tutorial at http://www.iqtree.org/doc/AliSim.
Subjects
Molecular Evolution, Genetic Models, Genomics, Phylogeny, Software
ABSTRACT
MOTIVATION: Phylogenetic tree optimization is necessary for precise analysis of evolutionary and transmission dynamics, but existing tools are inadequate for handling the scale and pace of data produced during the coronavirus disease 2019 (COVID-19) pandemic. One transformative approach, online phylogenetics, aims to incrementally add samples to an ever-growing phylogeny, but there are no previously existing approaches that can efficiently optimize this vast phylogeny under the time constraints of the pandemic. RESULTS: Here, we present matOptimize, a fast and memory-efficient phylogenetic tree optimization tool based on parsimony that can be parallelized across multiple CPU threads and nodes, and provides orders of magnitude improvement in runtime and peak memory usage compared to existing state-of-the-art methods. We have developed this method particularly to address the pressing need during the COVID-19 pandemic for daily maintenance and optimization of a comprehensive SARS-CoV-2 phylogeny. matOptimize is currently helping refine on a daily basis possibly the largest-ever phylogenetic tree, containing millions of SARS-CoV-2 sequences. AVAILABILITY AND IMPLEMENTATION: The matOptimize code is freely available as part of the UShER package (https://github.com/yatisht/usher) and can also be installed via bioconda (https://bioconda.github.io/recipes/usher/README.html). All scripts we used to perform the experiments in this manuscript are available at https://github.com/yceh/matOptimize-experiments. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subjects
COVID-19, SARS-CoV-2, Humans, Phylogeny, SARS-CoV-2/genetics, Pandemics, Software
ABSTRACT
Using time-reversible Markov models is a very common practice in phylogenetic analysis, because although we expect many of their assumptions to be violated by empirical data, they provide high computational efficiency. However, these models lack the ability to infer the root placement of the estimated phylogeny. In order to compensate for the inability of these models to root the tree, many researchers use external information such as using outgroup taxa or additional assumptions such as molecular clocks. In this study, we investigate the utility of nonreversible models to root empirical phylogenies and introduce a new bootstrap measure, the rootstrap, which provides information on the statistical support for any given root position. [Bootstrap; nonreversible models; phylogenetic inference; root estimation.].
Subjects
Mammals, Genetic Models, Animals, Phylogeny
ABSTRACT
Amino acid substitution models are a key component in phylogenetic analyses of protein sequences. All commonly used amino acid models available to date are time-reversible, an assumption designed for computational convenience but not for biological reality. Another significant downside to time-reversible models is that they do not allow inference of rooted trees without outgroups. In this article, we introduce a maximum likelihood approach, nQMaker, an extension of the recently published QMaker method, that allows the estimation of time nonreversible amino acid substitution models and rooted phylogenetic trees from a set of protein sequence alignments. We show that the nonreversible models estimated with nQMaker are a much better fit to empirical alignments than pre-existing reversible models, across a wide range of data sets including mammals, birds, plants, fungi, and other taxa, and that the improvements in model fit scale with the size of the data set. Notably, for the recently published plant and bird trees, these nonreversible models correctly recovered the commonly estimated root placements with very high statistical support without the need to use an outgroup. We provide nQMaker as an easy-to-use feature in the IQ-TREE software (http://www.iqtree.org), allowing users to estimate nonreversible models and rooted phylogenies from their own protein data sets. The data sets and scripts used in this article are available at https://doi.org/10.5061/dryad.3tx95x6hx. [amino acid sequence analyses; amino acid substitution models; maximum likelihood model estimation; nonreversible models; phylogenetic inference; reversible models.].
Subjects
Genetic Models, Software, Amino Acid Substitution, Animals, Molecular Evolution, Likelihood Functions, Mammals, Phylogeny, Proteins
ABSTRACT
Our understanding of the evolutionary history of primates is undergoing continual revision due to ongoing genome sequencing efforts. Bolstered by growing fossil evidence, these data have led to increased acceptance of once controversial hypotheses regarding phylogenetic relationships, hybridization and introgression, and the biogeographical history of primate groups. Among these findings is a pattern of recent introgression between species within all major primate groups examined to date, though little is known about introgression deeper in time. To address this and other phylogenetic questions, here, we present new reference genome assemblies for 3 Old World monkey (OWM) species: Colobus angolensis ssp. palliatus (the black and white colobus), Macaca nemestrina (southern pig-tailed macaque), and Mandrillus leucophaeus (the drill). We combine these data with 23 additional primate genomes to estimate both the species tree and individual gene trees using thousands of loci. While our species tree is largely consistent with previous phylogenetic hypotheses, the gene trees reveal high levels of genealogical discordance associated with multiple primate radiations. We use strongly asymmetric patterns of gene tree discordance around specific branches to identify multiple instances of introgression between ancestral primate lineages. In addition, we exploit recent fossil evidence to perform fossil-calibrated molecular dating analyses across the tree. Taken together, our genome-wide data help to resolve multiple contentious sets of relationships among primates, while also providing insight into the biological processes and technical artifacts that led to the disagreements in the first place.
Subjects
Genetic Introgression/genetics, Primates/genetics, Animals, Biological Evolution, Cercopithecidae/genetics, Computational Biology/methods, Genetic Databases, Fossils, Gene Flow/genetics, Genome/genetics, Genetic Models, Phylogeny, DNA Sequence Analysis/methods
ABSTRACT
The SARS-CoV-2 pandemic has led to unprecedented, nearly real-time genetic tracing due to the rapid community sequencing response. Researchers immediately leveraged these data to infer the evolutionary relationships among viral samples and to study key biological questions, including whether host viral genome editing and recombination are features of SARS-CoV-2 evolution. This global sequencing effort is inherently decentralized and must rely on data collected by many labs using a wide variety of molecular and bioinformatic techniques. There is thus a strong possibility that systematic errors associated with lab- or protocol-specific practices affect some sequences in the repositories. We find that some recurrent mutations in reported SARS-CoV-2 genome sequences have been observed predominantly or exclusively by single labs, co-localize with commonly used primer binding sites, and are more likely to affect the protein-coding sequences than other similarly recurrent mutations. We show that their inclusion can affect phylogenetic inference on scales relevant to local lineage tracing, and make it appear as though there has been an excess of recurrent mutation or recombination among viral lineages. We suggest how samples can be screened and problematic variants removed, and we plan to regularly inform the scientific community with our updated results as more SARS-CoV-2 genome sequences are shared (https://virological.org/t/issues-with-sars-cov-2-sequencing-data/473 and https://virological.org/t/masking-strategies-for-sars-cov-2-alignments/480). We also develop tools for comparing and visualizing differences among very large phylogenies and we show that consistent clade- and tree-based comparisons can be made between phylogenies produced by different groups. These will facilitate evolutionary inferences and comparisons among phylogenies produced for a wide array of purposes. 
Building on the SARS-CoV-2 Genome Browser at UCSC, we present a toolkit to compare, analyze and combine SARS-CoV-2 phylogenies, find and remove potential sequencing errors and establish a widely shared, stable clade structure for a more accurate scientific inference and discourse.
Subjects
Viral Genome/genetics, Phylogeny, SARS-CoV-2/genetics, Algorithms, COVID-19, Computational Biology, Molecular Evolution, Humans, Viral RNA/genetics, Sequence Alignment, Whole Genome Sequencing
ABSTRACT
Amino acid substitution models play a crucial role in phylogenetic analyses. Maximum likelihood (ML) methods have been proposed to estimate amino acid substitution models; however, they are typically complicated and slow. In this article, we propose QMaker, a new ML method to estimate a general time-reversible Q matrix from a large protein data set consisting of multiple sequence alignments. QMaker combines an efficient ML tree search algorithm, a model selection for handling the model heterogeneity among alignments, and the consideration of rate mixture models among sites. We provide QMaker as a user-friendly function in the IQ-TREE software package (http://www.iqtree.org) supporting the use of multiple CPU cores so that biologists can easily estimate amino acid substitution models from their own protein alignments. We used QMaker to estimate new empirical general amino acid substitution models from the current Pfam database as well as five clade-specific models for mammals, birds, insects, yeasts, and plants. Our results show that the new models considerably improve the fit between model and data and in some cases influence the inference of phylogenetic tree topologies. [Amino acid replacement matrices; amino acid substitution models; maximum likelihood estimation; phylogenetic inferences.].
Subjects
Molecular Evolution, Genetic Models, Animals, Likelihood Functions, Phylogeny, Proteins/genetics, Sequence Alignment
ABSTRACT
We implement two measures for quantifying genealogical concordance in phylogenomic data sets: the gene concordance factor (gCF) and the novel site concordance factor (sCF). For every branch of a reference tree, gCF is defined as the percentage of "decisive" gene trees containing that branch. This measure is already in wide usage, but here we introduce a package that calculates it while accounting for variable taxon coverage among gene trees. sCF is a new measure defined as the percentage of decisive sites supporting a branch in the reference tree. gCF and sCF complement classical measures of branch support in phylogenetics by providing a full description of underlying disagreement among loci and sites. An easy to use implementation and tutorial is freely available in the IQ-TREE software package (http://www.iqtree.org/doc/Concordance-Factor, last accessed May 13, 2020).
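The gCF definition above can be turned into a small worked sketch. It is not the IQ-TREE implementation: the data layout (each gene tree as a taxon set plus a set of bipartitions), the function name `gcf`, and the toy trees are all hypothetical. The abstract does not define "decisive"; the sketch assumes that a gene tree is decisive for a branch when it contains at least one taxon from each of the four clades surrounding that branch, which is one way of accounting for variable taxon coverage.

```python
def gcf(branch, gene_trees):
    """Percentage of decisive gene trees that contain the reference branch.

    `branch` is a tuple of four frozensets (A, B, C, D): the clades around a
    reference branch separating A+B from C+D. Each gene tree is a pair
    (taxa, splits) where `splits` holds one side of each bipartition.
    """
    a, b, c, d = branch
    decisive = 0
    concordant = 0
    for taxa, splits in gene_trees:
        # Assumed rule: decisive only if every surrounding clade is sampled.
        if not all(clade & taxa for clade in (a, b, c, d)):
            continue
        decisive += 1
        # Restrict the reference bipartition to the taxa in this gene tree.
        side = (a | b) & taxa
        if side in splits or (taxa - side) in splits:
            concordant += 1
    if decisive == 0:
        return float("nan")
    return 100.0 * concordant / decisive


# Hypothetical reference branch ab|cd and four gene trees.
BRANCH = (frozenset("a"), frozenset("b"), frozenset("c"), frozenset("d"))
GENE_TREES = [
    (frozenset("abcd"), {frozenset("ab")}),  # concordant
    (frozenset("abcd"), {frozenset("ac")}),  # discordant
    (frozenset("abc"), set()),               # lacks clade D, so not decisive
    (frozenset("abcd"), {frozenset("cd")}),  # concordant (other side stored)
]

value = gcf(BRANCH, GENE_TREES)
print(round(value, 2))
```

The third tree is dropped from the denominator rather than counted as discordant, so missing taxa do not by themselves lower the gCF.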
Subjects
Datasets as Topic, Genetic Techniques, Phylogeny, Software
ABSTRACT
Evolution leaves heterogeneous patterns of nucleotide variation across the genome, with different loci subject to varying degrees of mutation, selection, and drift. In phylogenetics, the potential impacts of partitioning sequence data for the assignment of substitution models are well appreciated. In contrast, the treatment of branch lengths has received far less attention. In this study, we examined the effects of linking and unlinking branch-length parameters across loci or subsets of loci. By analyzing a range of empirical data sets, we find consistent support for a model in which branch lengths are proportionate between subsets of loci: gene trees share the same pattern of branch lengths, but form subsets that vary in their overall tree lengths. These models had substantially better statistical support than models that assume identical branch lengths across gene trees, or those in which genes form subsets with distinct branch-length patterns. We show using simulations and empirical data that the complexity of the branch-length model with the highest support depends on the length of the sequence alignment and on the numbers of taxa and loci in the data set. Our findings suggest that models in which branch lengths are proportionate between subsets have the highest statistical support under the conditions that are most commonly seen in practice. The results of our study have implications for model selection, computational efficiency, and experimental design in phylogenomics.
Subjects
Genetic Models, Phylogeny, Computer Simulation
ABSTRACT
IQ-TREE (http://www.iqtree.org, last accessed February 6, 2020) is a user-friendly and widely used software package for phylogenetic inference using maximum likelihood. Since the release of version 1 in 2014, we have continuously expanded IQ-TREE to integrate a plethora of new models of sequence evolution and efficient computational approaches of phylogenetic inference to deal with genomic data. Here, we describe notable features of IQ-TREE version 2 and highlight the key advantages over other software.
Subjects
Molecular Evolution, Genomics, Genetic Models, Phylogeny, Software
ABSTRACT
For the last 100 years, it has been uncontroversial to state that the plant germline is set aside late in development, but there is surprisingly little evidence to support this view. In contrast, much evolutionary theory and several recent empirical studies seem to suggest the opposite: that the germlines of some and perhaps most plants may be set aside early in development. But is this really the case? How much does it matter? How can we reconcile the new evidence with existing knowledge of plant development? And is there a way to reliably establish the timing of germline segregation in both model and nonmodel plants? Answering these questions is vital to understanding one of the most fundamental aspects of plant development and evolution.
Subjects
Plant Germ Cells, Plants, Cell Differentiation, Cell Lineage, Plant Development
ABSTRACT
Somatic mutations can have important effects on the life history, ecology, and evolution of plants, but the rate at which they accumulate is poorly understood and difficult to measure directly. Here, we develop a method to measure somatic mutations in individual plants and use it to estimate the somatic mutation rate in a large, long-lived, phenotypically mosaic Eucalyptus melliodora tree. Despite being 100 times larger than Arabidopsis, this tree has a per-generation mutation rate only ten times greater, which suggests that this species may have evolved mechanisms to reduce the mutation rate per unit of growth. This adds to a growing body of evidence that illuminates the correlated evolutionary shifts in mutation rate and life history in plants.
Subjects
Arabidopsis/physiology, Mutation Rate, Phylogeny, Plant Physiological Phenomena
ABSTRACT
Ultraconserved elements (UCEs) are popular markers for phylogenomic studies. They are relatively simple to collect from distantly related organisms, and contain sufficient information to infer relationships at almost all taxonomic levels. Most studies of UCEs use partitioning to account for variation in rates and patterns of molecular evolution among sites, for example by estimating an independent model of molecular evolution for each UCE. However, rates and patterns of molecular evolution vary substantially within as well as between UCEs, suggesting that there may be opportunities to improve how UCEs are partitioned for phylogenetic inference. We propose and evaluate new partitioning methods for phylogenomic studies of UCEs: Sliding-Window Site Characteristics (SWSC), and UCE Site Position (UCESP). The first method uses site characteristics such as entropy, multinomial likelihood, and GC content to generate partitions that account for heterogeneity in rates and patterns of molecular evolution within each UCE. The second method groups together nucleotides that are found in similar physical locations within the UCEs. We examined the new methods with seven published data sets from a variety of taxa. We demonstrate that the UCESP method generates partitions that are worse than other strategies used to partition UCE data sets (e.g., one partition per UCE). The SWSC method, particularly when based on site entropies, generates partitions that account for within-UCE heterogeneity and leads to large increases in the model fit. All of the methods, code, and data used in this study are available from https://github.com/Tagliacollo/PartitionUCE. Simplified code for implementing the best method, the SWSC-EN, is available from https://github.com/Tagliacollo/PFinderUCE-SWSC-EN.
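The use of site entropies to partition within a UCE can be illustrated with a rough sketch. It should not be read as the published SWSC-EN code: the abstract does not give the algorithm, so the choice made here (split each locus into a left flank, a core, and a right flank by exhaustively minimising the within-segment sum of squared entropy deviations) is an assumption, and the function names, the `min_len` parameter, and the toy alignment are hypothetical.

```python
import math
from collections import Counter


def site_entropies(alignment):
    """Shannon entropy (bits) of each alignment column, ignoring non-ACGT symbols."""
    entropies = []
    for column in zip(*alignment):
        bases = [base for base in column if base in "ACGT"]
        total = len(bases)
        if total == 0:
            entropies.append(0.0)
            continue
        counts = Counter(bases)
        entropies.append(-sum((n / total) * math.log2(n / total) for n in counts.values()))
    return entropies


def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((value - mean) ** 2 for value in values)


def best_three_way_split(entropies, min_len=2):
    """Return (i, j) so that [0:i], [i:j], [j:] have minimal total SSE.

    Assumes len(entropies) >= 3 * min_len; brute force over all window pairs.
    """
    n = len(entropies)
    best = None
    for i in range(min_len, n - 2 * min_len + 1):
        for j in range(i + min_len, n - min_len + 1):
            cost = sse(entropies[:i]) + sse(entropies[i:j]) + sse(entropies[j:])
            if best is None or cost < best[0]:
                best = (cost, i, j)
    return best[1], best[2]


# Toy locus: variable flanks (columns 0-2 and 6-8) around a conserved core.
ALIGNMENT = ["AAAAAAAAA", "CCCAAACCC", "GGGAAAGGG"]
ENTROPIES = site_entropies(ALIGNMENT)
print(best_three_way_split(ENTROPIES))
```

The exhaustive search is quadratic in locus length, which is tolerable for short UCE loci but would need a smarter search for long alignments.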