Results 1 - 20 of 116
1.
ISME J ; 18(1)2024 Jan 08.
Article in English | MEDLINE | ID: mdl-39001714

ABSTRACT

In recent years, phylogenetic reconciliation has emerged as a promising approach for studying microbial ecology and evolution. The core idea is to model how gene trees evolve along a species tree and to explain differences between them via evolutionary events including gene duplications, transfers, and losses. Here, we describe how phylogenetic reconciliation provides a natural framework for studying genome evolution and highlight recent applications including ancestral gene content inference, the rooting of species trees, and the insights into metabolic evolution and ecological transitions they yield. Reconciliation analyses have elucidated the evolution of diverse microbial lineages, from Chlamydiae to Asgard archaea, shedding light on ecological adaptation, host-microbe interactions, and symbiotic relationships. However, there are many opportunities for broader application of the approach in microbiology. Continuing improvements to make reconciliation models more realistic and scalable, and integration of ecological metadata such as habitat, pH, temperature, and oxygen use offer enormous potential for understanding the rich tapestry of microbial life.


Subjects
Archaea , Phylogeny , Archaea/genetics , Archaea/classification , Bacteria/genetics , Bacteria/classification , Molecular Evolution , Bacterial Genome , Symbiosis , Ecology
2.
Mol Phylogenet Evol ; 197: 108091, 2024 Aug.
Article in English | MEDLINE | ID: mdl-38719080

ABSTRACT

Cryptic diversity poses a great obstacle in our attempts to assess the current biodiversity crisis and may hamper conservation efforts. The gekkonid genus Mediodactylus, a well-known case of hidden species and genetic diversity, has been taxonomically reclassified several times during the last decade. Focusing on the Mediterranean populations, a recent study within the M. kotschyi species complex using classic mtDNA/nuDNA markers suggested the existence of five distinct species, some being endemic and some possibly threatened, yet their relationships have not been fully resolved. Here, we generated genome-wide SNPs (using ddRADseq) and applied molecular species delimitation approaches and population genomic analyses to further disentangle these relationships. The most extensive nuclear dataset so far, encompassing 2,360 loci and ∼699,000 bp from across the genome of Mediodactylus gecko, enabled us to resolve previously obscure phylogenetic relationships among the five, recently elevated, Mediodactylus species and to support the hypothesis that the taxon includes several new, undescribed species. Population genomic analyses within each of the proposed species showed strong genetic structure and high levels of genetic differentiation among populations.


Subjects
Lizards , Phylogeny , Phylogeography , Animals , Mediterranean Region , Lizards/genetics , Lizards/classification , Single Nucleotide Polymorphism , Genetic Variation , Population Genetics , Mitochondrial DNA/genetics , DNA Sequence Analysis
3.
Forensic Sci Int Genet ; 71: 103060, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38796876

ABSTRACT

In the Battle of Crete during the World War II occupation of Greece, the German forces faced substantial civilian resistance. In retaliation for the numerous German losses, a series of mass executions took place at many locations in Crete, a common practice reported from Greece and elsewhere. In Adele, a village in the regional unit of Rethymnon, 18 male civilians were executed and buried in a burial pit at the Sarakina site. In this study, the first one conducted for a conflict that occurred in Greece, we identified for humanitarian purposes the 18 skulls of the Sarakina victims, following a request from the local community of Adele. The molecular identification of historical human remains via ancient DNA approaches and low coverage whole genome sequencing has only recently been introduced. Here, we performed genome skimming on the living relatives of the victims, as well as high throughput historical DNA analysis on the skulls to infer the kinship degrees among the victims via genetic relatedness analyses. We also conducted targeted anthropological analysis to successfully complete the identification of all Sarakina victims. We demonstrate that our methodological approach constitutes a potentially highly informative forensic tool to identify war victims. It can hence be applied to analogous studies on degraded DNA, thus paving the way for systematic war victim identification in Greece and beyond.


Subjects
DNA Fingerprinting , Ancient DNA , World War II , Humans , Ancient DNA/analysis , Male , Greece , Skull , Human Genome , Forensic Anthropology , Whole Genome Sequencing
4.
Nature ; 629(8013): 851-860, 2024 May.
Article in English | MEDLINE | ID: mdl-38560995

ABSTRACT

Despite tremendous efforts in the past decades, relationships among main avian lineages remain heavily debated without a clear resolution. Discrepancies have been attributed to diversity of species sampled, phylogenetic method and the choice of genomic regions [1-3]. Here we address these issues by analysing the genomes of 363 bird species [4] (218 taxonomic families, 92% of total). Using intergenic regions and coalescent methods, we present a well-supported tree but also a marked degree of discordance. The tree confirms that Neoaves experienced rapid radiation at or near the Cretaceous-Palaeogene boundary. Sufficient loci rather than extensive taxon sampling were more effective in resolving difficult nodes. Remaining recalcitrant nodes involve species that are a challenge to model due to either extreme DNA composition, variable substitution rates, incomplete lineage sorting or complex evolutionary events such as ancient hybridization. Assessment of the effects of different genomic partitions showed high heterogeneity across the genome. We discovered sharp increases in effective population size, substitution rates and relative brain size following the Cretaceous-Palaeogene extinction event, supporting the hypothesis that emerging ecological opportunities catalysed the diversification of modern birds. The resulting phylogenetic estimate offers fresh insights into the rapid radiation of modern birds and provides a taxon-rich backbone tree for future comparative studies.


Subjects
Birds , Molecular Evolution , Genome , Phylogeny , Animals , Birds/genetics , Birds/classification , Birds/anatomy & histology , Brain/anatomy & histology , Biological Extinction , Genome/genetics , Genomics , Population Density , Male , Female
5.
Bioinformatics ; 40(4)2024 03 29.
Article in English | MEDLINE | ID: mdl-38514421

ABSTRACT

MOTIVATION: Genomes are a rich source of information on the pattern and process of evolution across biological scales. How best to make use of that information is an active area of research in phylogenetics. Ideally, phylogenetic methods should not only model substitutions along gene trees, which explain differences between homologous gene sequences, but also the processes that generate the gene trees themselves along a shared species tree. To conduct accurate inferences, one needs to account for uncertainty at both levels, that is, in gene trees estimated from inherently short sequences and in their diverse evolutionary histories along a shared species tree. RESULTS: We present AleRax, a software that can infer reconciled gene trees together with a shared species tree using a simple, yet powerful, probabilistic model of gene duplication, transfer, and loss. A key feature of AleRax is its ability to account for uncertainty in the gene tree and its reconciliation by using an efficient approximation to calculate the joint phylogenetic-reconciliation likelihood and sample reconciled gene trees accordingly. Simulations and analyses of empirical data show that AleRax is one order of magnitude faster than competing gene tree inference tools while attaining the same accuracy. It is consistently more robust than species tree inference methods such as SpeciesRax and ASTRAL-Pro 2 under gene tree uncertainty. Finally, AleRax can process multiple gene families in parallel thereby allowing users to compare competing phylogenetic hypotheses and estimate model parameters, such as duplication, transfer, and loss probabilities for genome-scale datasets with hundreds of taxa. AVAILABILITY AND IMPLEMENTATION: GNU GPL at https://github.com/BenoitMorel/AleRax and data are made available at https://cme.h-its.org/exelixis/material/alerax_data.tar.gz.


Subjects
Algorithms , Gene Duplication , Phylogeny , Software , Statistical Models , Molecular Evolution
6.
Mol Biol Evol ; 41(1)2024 Jan 03.
Article in English | MEDLINE | ID: mdl-38124381

ABSTRACT

MOTIVATION: Simulating multiple sequence alignments (MSAs) using probabilistic models of sequence evolution plays an important role in the evaluation of phylogenetic inference tools and is crucial to the development of novel learning-based approaches for phylogenetic reconstruction, for instance, neural networks. These models and the resulting simulated data need to be as realistic as possible to be indicative of the performance of the developed tools on empirical data and to ensure that neural networks trained on simulations perform well on empirical data. Over the years, numerous models of evolution have been published with the goal to represent as faithfully as possible the sequence evolution process and thus simulate empirical-like data. In this study, we simulated DNA and protein MSAs under increasingly complex models of evolution with and without insertion/deletion (indel) events using a state-of-the-art sequence simulator. We assessed their realism by quantifying how accurately supervised learning methods are able to predict whether a given MSA is simulated or empirical. RESULTS: Our results show that we can distinguish between empirical and simulated MSAs with high accuracy using two distinct and independently developed classification approaches across all tested models of sequence evolution. Our findings suggest that the current state-of-the-art models fail to accurately replicate several aspects of empirical MSAs, including site-wise rates as well as amino acid and nucleotide composition.


Subjects
Computer Neural Networks , Proteins , Phylogeny , Sequence Alignment , Proteins/genetics , DNA/genetics , Software
7.
Mol Biol Evol ; 40(10)2023 10 04.
Article in English | MEDLINE | ID: mdl-37804116

ABSTRACT

Phylogenetic inferences under the maximum likelihood criterion deploy heuristic tree search strategies to explore the vast search space. Depending on the input dataset, searches from different starting trees might all converge to a single tree topology. Often, though, distinct searches infer multiple topologies with large log-likelihood score differences or yield topologically highly distinct, yet almost equally likely, trees. Recently, Haag et al. introduced an approach to quantify, and implemented machine learning methods to predict, the dataset difficulty with respect to phylogenetic inference. Easy multiple sequence alignments (MSAs) exhibit a single likelihood peak on their likelihood surface, associated with a single tree topology to which most, if not all, independent searches rapidly converge. As difficulty increases, multiple locally optimal likelihood peaks emerge, yet from highly distinct topologies. To make use of this information, we introduce and implement an adaptive tree search heuristic in RAxML-NG, which modifies the thoroughness of the tree search strategy as a function of the predicted difficulty. Our adaptive strategy is based upon three observations. First, on easy datasets, searches converge rapidly and can hence be terminated at an earlier stage. Second, overanalyzing difficult datasets is hopeless, and thus it suffices to quickly infer only one of the numerous almost equally likely topologies to reduce overall execution time. Third, more extensive searches are justified and required on datasets with intermediate difficulty. While the likelihood surface exhibits multiple locally optimal peaks in this case, a small proportion of them is significantly better. Our experimental results for the adaptive heuristic on 9,515 empirical and 5,000 simulated datasets with varying difficulty exhibit substantial speedups, especially on easy and difficult datasets (53% of total MSAs), where we observe average speedups of more than 10×. Further, approximately 94% of the inferred trees using the adaptive strategy are statistically indistinguishable from the trees inferred under the standard strategy (RAxML-NG).


Subjects
Algorithms , Phylogeny , Likelihood Functions , Sequence Alignment
8.
Bioinform Adv ; 3(1): vbad124, 2023.
Article in English | MEDLINE | ID: mdl-37750068

ABSTRACT

Summary: Maximum likelihood (ML) is a widely used phylogenetic inference method. ML implementations heavily rely on numerical optimization routines that use internal numerical thresholds to determine convergence. We systematically analyze the impact of these threshold settings on the log-likelihood and runtimes for ML tree inferences with RAxML-NG, IQ-TREE, and FastTree on empirical datasets. We provide empirical evidence that we can substantially accelerate tree inferences with RAxML-NG and IQ-TREE by changing the default values of two such numerical thresholds. At the same time, altering these settings does not significantly impact the quality of the inferred trees. We further show that increasing both thresholds accelerates the RAxML-NG bootstrap without influencing the resulting support values. For RAxML-NG, increasing the likelihood thresholds ϵ_LnL and ϵ_brlen to 10 and 10^3, respectively, results in an average tree inference speedup of 1.9 ± 0.6 on Data collection 1, 1.8 ± 1.1 on Data collection 2, and 1.9 ± 0.8 on Data collection 2 for the RAxML-NG bootstrap compared to the runtime under the current default setting. Increasing the likelihood threshold ϵ_LnL to 10 in IQ-TREE results in an average tree inference speedup of 1.3 ± 0.4 on Data collection 1 and 1.3 ± 0.9 on Data collection 2. Availability and implementation: All MSAs we used for our analyses, as well as all results, are available for download at https://cme.h-its.org/exelixis/material/freeLunch_data.tar.gz. Our data generation scripts are available at https://github.com/tschuelia/ml-numerical-analysis.
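
To make the role of a log-likelihood convergence threshold concrete, the following is a toy sketch only, not the optimizer of RAxML-NG, IQ-TREE, or FastTree. It assumes a single branch length between two sequences under the Jukes-Cantor (JC69) model and a simple ternary search; the function names and the parameter `eps_lnl` are hypothetical stand-ins for a threshold such as ϵ_LnL.

```python
import math

def jc69_lnl(t, n_sites, n_diff):
    """JC69 log-likelihood (up to an additive constant) of branch length t
    for two aligned sequences with n_diff differing sites out of n_sites."""
    p_diff = 0.75 * (1.0 - math.exp(-4.0 * t / 3.0))  # P(site differs)
    p_same = 1.0 - p_diff                              # P(site identical)
    return n_diff * math.log(p_diff / 3.0) + (n_sites - n_diff) * math.log(p_same)

def optimize_branch(n_sites, n_diff, eps_lnl, lo=1e-6, hi=5.0, max_rounds=200):
    """Ternary search on t; stop once the log-likelihood at the interval
    midpoint changes by less than eps_lnl between consecutive rounds."""
    prev = jc69_lnl((lo + hi) / 2.0, n_sites, n_diff)
    for rounds in range(1, max_rounds + 1):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        # Discard the third of the interval that cannot contain the maximum.
        if jc69_lnl(m1, n_sites, n_diff) < jc69_lnl(m2, n_sites, n_diff):
            lo = m1
        else:
            hi = m2
        cur = jc69_lnl((lo + hi) / 2.0, n_sites, n_diff)
        if abs(cur - prev) < eps_lnl:  # convergence test controlled by the threshold
            return (lo + hi) / 2.0, rounds
        prev = cur
    return (lo + hi) / 2.0, max_rounds

# A looser threshold can never need more rounds than a tighter one here,
# because both runs follow the same search trajectory.
t_loose, rounds_loose = optimize_branch(1000, 100, eps_lnl=10.0)
t_tight, rounds_tight = optimize_branch(1000, 100, eps_lnl=1e-6)
```

The trade-off studied in the abstract is whether the earlier stop under the looser threshold changes the final result in a way that matters; in this toy setting that corresponds to comparing `t_loose` with `t_tight`.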

9.
Genome Biol Evol ; 15(7)2023 07 03.
Article in English | MEDLINE | ID: mdl-37463417

ABSTRACT

ALE and GeneRax are tools for probabilistic gene tree-species tree reconciliation. Based on a common underlying statistical model of how gene trees evolve along species trees, these methods rely on gene vs. species tree discordance to infer gene duplication, transfer, and loss events, map gene family origins, and root species trees. Published analyses have used these methods to root species trees of Archaea, Bacteria, and several eukaryotic groups, as well as to infer ancestral gene repertoires. However, it was recently suggested that reconciliation-based estimates of duplication and transfer events using the ALE/GeneRax model were unreliable, with potential implications for species tree rooting. Here, we assess these criticisms and find that the methods are accurate when applied to simulated data and in generally good agreement with alternative methodological approaches on empirical data. In particular, ALE recovers variation in gene duplication and transfer frequencies across lineages that is consistent with the known biology of studied clades. In plants and opisthokonts, ALE recovers the consensus species tree root; in Bacteria, where there is less certainty about the root position, ALE agrees with alternative approaches on the most likely root region. Overall, ALE and related approaches are promising tools for studying genome evolution.


Subjects
Algorithms , Molecular Evolution , Phylogeny , Gene Duplication , Bacteria/genetics , Eukaryota , Genetic Models
10.
J Eukaryot Microbiol ; 70(5): e12990, 2023.
Article in English | MEDLINE | ID: mdl-37448139

ABSTRACT

Taxonomic assignment of operational taxonomic units (OTUs) is an important bioinformatics step in analyzing environmental sequencing data. Pairwise alignment and phylogenetic-placement methods represent two alternative approaches to taxonomic assignments, but their results can differ. Here we used available colpodean ciliate OTUs from forest soils to compare the taxonomic assignments of VSEARCH (which performs pairwise alignments) and EPA-ng (which performs phylogenetic placements). We showed that when there are differences in taxonomic assignments between pairwise alignments and phylogenetic placements at the subtaxon level, there is a low pairwise similarity of the OTUs to the reference database. We then showcase how the output of EPA-ng can be further evaluated using GAPPA to assess the taxonomic assignments when there exist multiple equally likely placements of an OTU, by taking into account the sum over the likelihood weights of the OTU placements within a subtaxon, and the branch distances between equally likely placement locations. We also inferred the evolutionary and ecological characteristics of the colpodean OTUs using their placements within subtaxa. This study demonstrates how to fully analyze the output of EPA-ng, by using GAPPA in conjunction with knowledge of the taxonomic diversity of the clade of interest.


Subjects
Environmental DNA , Phylogeny
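
The idea of summing likelihood weights over all placements of an OTU that fall within the same subtaxon can be sketched as follows. This is a minimal illustration and not GAPPA's interface: it assumes the placements have already been parsed into (subtaxon label, likelihood weight ratio) pairs, and the function names, the labels "A" and "B", and the `min_total` cutoff are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

def accumulate_lwr(placements: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Sum the likelihood weight ratios of one OTU's placements per subtaxon."""
    totals: Dict[str, float] = defaultdict(float)
    for subtaxon, lwr in placements:
        totals[subtaxon] += lwr
    return dict(totals)

def assign_subtaxon(placements: Iterable[Tuple[str, float]],
                    min_total: float = 0.5) -> Optional[str]:
    """Return the subtaxon with the largest accumulated weight, or None if
    that weight is below min_total (an illustrative cutoff, not a recommendation)."""
    totals = accumulate_lwr(placements)
    if not totals:
        return None
    best = max(totals, key=totals.get)
    return best if totals[best] >= min_total else None

# Two placements on different branches of subtaxon "A" and one in "B":
# no single branch dominates, but the accumulated weight favours "A".
example = [("A", 0.3), ("B", 0.4), ("A", 0.3)]
label = assign_subtaxon(example)
```

The example shows why accumulation matters: the single most likely branch lies in "B", yet the placements within "A" jointly carry more weight.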
11.
Stat Comput ; 33(4): 80, 2023.
Article in English | MEDLINE | ID: mdl-37216155

ABSTRACT

The prediction of knockout tournaments represents an area of large public interest and active academic as well as industrial research. Here, we show how one can leverage the computational analogies between calculating the phylogenetic likelihood score used in the area of molecular evolution to efficiently calculate, instead of approximate via simulations, the exact per-team tournament win probabilities, given a pairwise win probability matrix between all teams. We implement and make available our method as open-source code and show that it is two orders of magnitude faster than simulations and two or more orders of magnitude faster than calculating the exact per-team win probabilities naïvely, without taking into account the substantial computational savings induced by the tournament tree structure. Furthermore, we showcase novel prediction approaches that now become feasible due to this order of magnitude improvement in calculating tournament win probabilities. We demonstrate how to quantify prediction uncertainty by calculating 100,000 distinct tournament win probabilities for a tournament with 16 teams under slight variations of a reasonable pairwise win probability matrix within one minute on a standard laptop. We also conduct an analogous analysis for a tournament with 64 teams. Supplementary Information: The online version contains supplementary material available at 10.1007/s11222-023-10246-y.
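
The exact computation described above can be sketched as a bottom-up pass over the bracket, in the spirit of a post-order likelihood traversal of a tree. This is a minimal sketch and not the authors' implementation: it assumes a balanced single-elimination bracket whose number of teams is a power of two, with teams listed in bracket order, and the function name `win_probabilities` is hypothetical.

```python
from typing import List

def win_probabilities(p: List[List[float]]) -> List[float]:
    """p[i][j] is the probability that team i beats team j.
    Returns, for each team, the exact probability of winning the tournament."""
    n = len(p)
    if n == 0 or n & (n - 1) != 0:
        raise ValueError("number of teams must be a power of two")
    # reach[i]: probability that team i wins the sub-bracket it currently sits in.
    reach = [1.0] * n
    size = 1  # number of teams in each sub-bracket merged in this round
    while size < n:
        merged = [0.0] * n
        for start in range(0, n, 2 * size):
            left = range(start, start + size)
            right = range(start + size, start + 2 * size)
            for i in left:
                # Team i must survive its half, then beat whoever survives the other half.
                merged[i] = reach[i] * sum(reach[j] * p[i][j] for j in right)
            for j in right:
                merged[j] = reach[j] * sum(reach[i] * p[j][i] for i in left)
        reach = merged
        size *= 2
    return reach

# Four evenly matched teams: every pairwise win probability is 0.5.
even = [[0.5] * 4 for _ in range(4)]
probs = win_probabilities(even)
```

Each round combines the partial results of two sub-brackets, so every pair of teams is considered once; repeating the call under perturbed matrices gives the kind of uncertainty analysis the abstract mentions.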

12.
Syst Biol ; 72(1): 242-248, 2023 05 19.
Article in English | MEDLINE | ID: mdl-36705582

ABSTRACT

Computing ancestral ranges via the Dispersion Extinction and Cladogenesis (DEC) model of biogeography is characterized by an exponential number of states relative to the number of regions considered. This is because the DEC model requires computing a large matrix exponential, which typically accounts for up to 80% of overall runtime. Therefore, the kinds of biogeographical analyses that can be conducted under the DEC model are limited by the number of regions under consideration. In this work, we present a completely redesigned efficient version of the popular tool Lagrange which is up to 49 times faster with multithreading enabled, and is also 26 times faster when using only one thread. We call this new version Lagrange-NG (Lagrange-Next Generation). The increased computational efficiency allows Lagrange-NG to analyze datasets with a large number of regions in a reasonable amount of time, up to 12 regions in approximately 18 min. We achieve these speedups using a relatively new method of computing the matrix exponential based on Krylov subspaces. In order to validate the correctness of Lagrange-NG, we also introduce a novel metric on range distributions for trees so that researchers can assess the difference between any two range inferences. Finally, Lagrange-NG exhibits substantially higher adherence to coding quality standards. It improves a respective software quality indicator as implemented in the SoftWipe tool from average (5.5; Lagrange) to high (7.8; Lagrange-NG). Lagrange-NG is freely available under GPL2. [Biogeography; Phylogenetics; DEC Model.].


Subjects
Software , Phylogeny
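
The exponential growth of the state space can be made concrete with a short calculation. The sketch below assumes that every subset of regions is an admissible range (so there are 2^n states, or 2^n - 1 without the empty range) and that the rate matrix is stored densely in double precision; actual tools may restrict the maximum range size or use other storage, and the function name is hypothetical.

```python
from typing import Tuple

def dec_state_space(num_regions: int, include_empty: bool = True) -> Tuple[int, int, int]:
    """Return (number of states, rate-matrix entries, bytes for a dense
    double-precision matrix) when every subset of regions is a valid range."""
    states = 2 ** num_regions - (0 if include_empty else 1)
    entries = states * states
    return states, entries, entries * 8  # 8 bytes per double

# Print how the dense matrix grows as regions are added.
for n in (4, 8, 12):
    states, entries, n_bytes = dec_state_space(n)
    print(f"{n:2d} regions: {states:5d} states, {entries:9d} entries, "
          f"{n_bytes / 2**20:.2f} MiB")
```

Under these assumptions, 12 regions already give 4,096 states and a dense 4,096 × 4,096 matrix, which illustrates why the matrix exponential dominates the runtime.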
13.
Bioinformatics ; 39(1)2023 01 01.
Article in English | MEDLINE | ID: mdl-36576010

ABSTRACT

MOTIVATION: Missing data and incomplete lineage sorting (ILS) are two major obstacles to accurate species tree inference. Gene tree summary methods such as ASTRAL and ASTRID have been developed to account for ILS. However, they can be severely affected by high levels of missing data. RESULTS: We present Asteroid, a novel algorithm that infers an unrooted species tree from a set of unrooted gene trees. We show on both empirical and simulated datasets that Asteroid is substantially more accurate than ASTRAL and ASTRID for very high proportions (>80%) of missing data. Asteroid is several orders of magnitude faster than ASTRAL for datasets that contain thousands of genes. It offers advanced features such as parallelization, support value computation and support for multi-copy and multifurcating gene trees. AVAILABILITY AND IMPLEMENTATION: Asteroid is freely available at https://github.com/BenoitMorel/Asteroid. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Genetic Speciation , Genomics , Phylogeny , Computer Simulation , Algorithms , Genetic Models
14.
Curr Biol ; 33(1): 41-57.e15, 2023 01 09.
Article in English | MEDLINE | ID: mdl-36493775

ABSTRACT

We present a spatiotemporal picture of human genetic diversity in Anatolia, Iran, Levant, South Caucasus, and the Aegean, a broad region that experienced the earliest Neolithic transition and the emergence of complex hierarchical societies. Combining 35 new ancient shotgun genomes with 382 ancient and 23 present-day published genomes, we found that genetic diversity within each region steadily increased through the Holocene. We further observed that the inferred sources of gene flow shifted in time. In the first half of the Holocene, Southwest Asian and the East Mediterranean populations homogenized among themselves. Starting with the Bronze Age, however, regional populations diverged from each other, most likely driven by gene flow from external sources, which we term "the expanding mobility model." Interestingly, this increase in inter-regional divergence can be captured by outgroup-f3-based genetic distances, but not by the commonly used FST statistic, due to the sensitivity of FST, but not outgroup-f3, to within-population diversity. Finally, we report a temporal trend of increasing male bias in admixture events through the Holocene.


Subjects
Human Genome , Racial Groups , Humans , Male , Ancient History , Iran , Gene Flow , Human Migration , Population Genetics
15.
Mol Biol Evol ; 39(12)2022 12 05.
Article in English | MEDLINE | ID: mdl-36395091

ABSTRACT

Phylogenetic analyses under the Maximum-Likelihood (ML) model are time- and resource-intensive. To adequately capture the vastness of tree space, one needs to infer multiple independent trees. On some datasets, multiple tree inferences converge to similar tree topologies, on others to multiple, topologically highly distinct yet statistically indistinguishable topologies. At present, no method exists to quantify and predict this behavior. We introduce a method to quantify the degree of difficulty for analyzing a dataset and present Pythia, a Random Forest Regressor that accurately predicts this difficulty. Pythia predicts the degree of difficulty of analyzing a dataset prior to initiating ML-based tree inferences. Pythia can be used to increase user awareness with respect to the amount of signal and uncertainty to be expected in phylogenetic analyses, and hence inform an appropriate (post-)analysis setup. Further, it can be used to select appropriate search algorithms for easy-, intermediate-, and hard-to-analyze datasets.


Subjects
Genetic Models , Phylogeny , Likelihood Functions , Random Forest
16.
Front Bioinform ; 2: 871393, 2022.
Article in English | MEDLINE | ID: mdl-36304302

ABSTRACT

Phylogenetic placement refers to a family of tools and methods to analyze, visualize, and interpret the tsunami of metagenomic sequencing data generated by high-throughput sequencing. Compared to alternative (e.g., similarity-based) methods, it puts metabarcoding sequences into a phylogenetic context using a set of known reference sequences and taking evolutionary history into account. Thereby, one can increase the accuracy of metagenomic surveys and eliminate the requirement for having exact or close matches with existing sequence databases. Phylogenetic placement constitutes a valuable analysis tool per se, but also entails a plethora of downstream tools to interpret its results. A common use case is to analyze species communities obtained from metagenomic sequencing, for example via taxonomic assignment, diversity quantification, sample comparison, and identification of correlations with environmental variables. In this review, we provide an overview of the methods developed during the first 10 years. In particular, the goals of this review are 1) to motivate the usage of phylogenetic placement and illustrate some of its use cases, 2) to outline the full workflow, from raw sequences to publishable figures, including best practices, 3) to introduce the most common tools and methods and their capabilities, 4) to point out common placement pitfalls and misconceptions, 5) to showcase typical placement-based analyses, and how they can help to analyze, visualize, and interpret phylogenetic placement data.

17.
Bioinformatics ; 38(15): 3725-3733, 2022 08 02.
Article in English | MEDLINE | ID: mdl-35713506

ABSTRACT

MOTIVATION: Phylogenetic networks can represent non-treelike evolutionary scenarios. Current, actively developed approaches for phylogenetic network inference jointly account for non-treelike evolution and incomplete lineage sorting (ILS). Unfortunately, this induces a very high computational complexity and current tools can only analyze small datasets. RESULTS: We present NetRAX, a tool for maximum likelihood (ML) inference of phylogenetic networks in the absence of ILS. Our tool leverages state-of-the-art methods for efficiently computing the phylogenetic likelihood function on trees, and extends them to phylogenetic networks via the notion of 'displayed trees'. NetRAX can infer ML phylogenetic networks from partitioned multiple sequence alignments and returns the inferred networks in Extended Newick format. On simulated data, our results show a very low relative difference in Bayesian Information Criterion (BIC) score and a near-zero unrooted softwired cluster distance to the true, simulated networks. With NetRAX, a network inference on a partitioned alignment with 8000 sites, 30 taxa and 3 reticulations completes within a few minutes on a standard laptop. AVAILABILITY AND IMPLEMENTATION: Our implementation is available under the GNU General Public License v3.0 at https://github.com/lutteropp/NetRAX. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Algorithms , Phylogeny , Bayes Theorem , Sequence Alignment , Likelihood Functions
18.
Bioinformatics ; 38(Suppl 1): i118-i124, 2022 06 24.
Article in English | MEDLINE | ID: mdl-35758778

ABSTRACT

MOTIVATION: In recent years, full-genome sequences have become increasingly available and as a result many modern phylogenetic analyses are based on very long sequences, often with over 100 000 sites. Phylogenetic reconstructions of large-scale alignments are challenging for likelihood-based phylogenetic inference programs and usually require using a powerful computer cluster. Current tools for alignment trimming prior to phylogenetic analysis do not promise a significant reduction in the alignment size and are claimed to have a negative effect on the accuracy of the obtained tree. RESULTS: Here, we propose an artificial-intelligence-based approach, which provides means to select the optimal subset of sites and a formula by which one can compute the log-likelihood of the entire data based on this subset. Our approach is based on training a regularized Lasso-regression model that optimizes the log-likelihood prediction accuracy while putting a constraint on the number of sites used for the approximation. We show that computing the likelihood based on 5% of the sites already provides accurate approximation of the tree likelihood based on the entire data. Furthermore, we show that using this Lasso-based approximation during a tree search decreased running-time substantially while retaining the same tree-search performance. AVAILABILITY AND IMPLEMENTATION: The code was implemented in Python version 3.8 and is available through GitHub (https://github.com/noaeker/lasso_positions_sampling). The datasets used in this paper were retrieved from Zhou et al. (2018) as described in section 3. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Artificial Intelligence , Software , Likelihood Functions , Phylogeny
19.
Genome Biol ; 23(1): 37, 2022 01 26.
Article in English | MEDLINE | ID: mdl-35081992

ABSTRACT

We introduce CellPhy, a maximum likelihood framework for inferring phylogenetic trees from somatic single-cell single-nucleotide variants. CellPhy leverages a finite-site Markov genotype model with 16 diploid states and considers amplification error and allelic dropout. We implement CellPhy into RAxML-NG, a widely used phylogenetic inference package that provides statistical confidence measurements and scales well on large datasets with hundreds or thousands of cells. Comprehensive simulations suggest that CellPhy is more robust to single-cell genomics errors and outperforms state-of-the-art methods under realistic scenarios, both in accuracy and speed. CellPhy is freely available at https://github.com/amkozlov/cellphy.


Subjects
Algorithms , Software , Genomics/methods , Genotype , Phylogeny
20.
Mol Biol Evol ; 39(2)2022 02 03.
Article in English | MEDLINE | ID: mdl-35021210

ABSTRACT

Species tree inference from gene family trees is becoming increasingly popular because it can account for discordance between the species tree and the corresponding gene family trees. In particular, methods that can account for multiple-copy gene families exhibit potential to leverage paralogy as informative signal. At present, there does not exist any widely adopted inference method for this purpose. Here, we present SpeciesRax, the first maximum likelihood method that can infer a rooted species tree from a set of gene family trees and can account for gene duplication, loss, and transfer events. By explicitly modeling events by which gene trees can depart from the species tree, SpeciesRax leverages the phylogenetic rooting signal in gene trees. SpeciesRax infers species tree branch lengths in units of expected substitutions per site and branch support values via paralogy-aware quartets extracted from the gene family trees. Using both empirical and simulated data sets we show that SpeciesRax is at least as accurate as the best competing methods while being one order of magnitude faster on large data sets at the same time. We used SpeciesRax to infer a biologically plausible rooted phylogeny of the vertebrates comprising 188 species from 31,612 gene families in 1 h using 40 cores. SpeciesRax is available under GNU GPL at https://github.com/BenoitMorel/GeneRax and on BioConda.


Subjects
Algorithms , Gene Duplication , Genetic Models , Pedigree , Phylogeny