Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 20 de 89
Filtrar
1.
Proc Natl Acad Sci U S A ; 121(15): e2319506121, 2024 Apr 09.
Artigo em Inglês | MEDLINE | ID: mdl-38557186

RESUMO

Genomes are typically mosaics of regions with different evolutionary histories. When speciation events are closely spaced in time, recombination makes the regions sharing the same history small, and the evolutionary history changes rapidly as we move along the genome. When examining rapid radiations such as the early diversification of Neoaves 66 Mya, typically no consistent history is observed across segments exceeding kilobases of the genome. Here, we report an exception. We found that a 21-Mb region in avian genomes, mapped to chicken chromosome 4, shows an extremely strong and discordance-free signal for a history different from that of the inferred species tree. Such a strong discordance-free signal, indicative of suppressed recombination across many millions of base pairs, is not observed elsewhere in the genome for any deep avian relationships. Although long regions with suppressed recombination have been documented in recently diverged species, our results pertain to relationships dating circa 65 Mya. We provide evidence that this strong signal may be due to an ancient rearrangement that blocked recombination and remained polymorphic for several million years prior to fixation. We show that the presence of this region has misled previous phylogenomic efforts with lower taxon sampling, showing the interplay between taxon and locus sampling. We predict that similar ancient rearrangements may confound phylogenetic analyses in other clades, pointing to a need for new analytical models that incorporate the possibility of such events.


Assuntos
Evolução Biológica , Genoma , Animais , Filogenia , Genoma/genética , Aves , Recombinação Genética
2.
Bioinformatics ; 40(4)2024 Mar 29.
Artigo em Inglês | MEDLINE | ID: mdl-38492564

RESUMO

MOTIVATION: Taxonomic classification of short reads and taxonomic profiling of metagenomic samples are well-studied yet challenging problems. The presence of species belonging to groups without close representation in a reference dataset is particularly challenging. While k-mer-based methods have performed well in terms of running time and accuracy, they tend to have reduced accuracy for such novel species. Thus, there is a growing need for methods that combine the scalability of k-mers with increased sensitivity. RESULTS: Here, we show that using locality-sensitive hashing (LSH) can increase the sensitivity of the k-mer-based search. Our method, which combines LSH with several heuristics techniques including soft lowest common ancestor labeling and voting, is more accurate than alternatives in both taxonomic classification of individual reads and abundance profiling. AVAILABILITY AND IMPLEMENTATION: CONSULT-II is implemented in C++, and the software, together with reference libraries, is publicly available on GitHub https://github.com/bo1929/CONSULT-II.


Assuntos
Algoritmos , Software , Análise de Sequência de DNA/métodos , Metagenômica/métodos , Metagenoma
3.
Bioinformatics ; 40(6)2024 Jun 03.
Artigo em Inglês | MEDLINE | ID: mdl-38870525

RESUMO

MOTIVATION: Phylogenetic placement of a query sequence on a backbone tree is increasingly used across biomedical sciences to identify the content of a sample from its DNA content. The accuracy of such analyses depends on the density of the backbone tree, making it crucial that placement methods scale to very large trees. Moreover, a new paradigm has been recently proposed to place sequences on the species tree using single-gene data. The goal is to better characterize the samples and to enable combined analyses of marker-gene (e.g., 16S rRNA gene amplicon) and genome-wide data. The recent method DEPP enables performing such analyses using metric learning. However, metric learning is hampered by a need to compute and save a quadratically growing matrix of pairwise distances during training. Thus, the training phase of DEPP does not scale to more than roughly 10 000 backbone species, a problem that we faced when trying to use our recently released Greengenes2 (GG2) reference tree containing 331 270 species. RESULTS: This paper explores divide-and-conquer for training ensembles of DEPP models, culminating in a method called C-DEPP. While divide-and-conquer has been extensively used in phylogenetics, applying divide-and-conquer to data-hungry machine-learning methods needs nuance. C-DEPP uses carefully crafted techniques to enable quasi-linear scaling while maintaining accuracy. C-DEPP enables placing 20 million 16S fragments on the GG2 reference tree in 41 h of computation. AVAILABILITY AND IMPLEMENTATION: The dataset and C-DEPP software are freely available at https://github.com/yueyujiang/dataset_cdepp/.


Assuntos
Filogenia , Algoritmos , RNA Ribossômico 16S/genética , Software , Biologia Computacional/métodos , Aprendizado de Máquina , Análise de Sequência de DNA/métodos
4.
Syst Biol ; 2024 Jul 05.
Artigo em Inglês | MEDLINE | ID: mdl-38970346

RESUMO

Dating phylogenetic trees to obtain branch lengths in time unit is essential for many downstream applications but has remained challenging. Dating requires inferring substitution rates that can change across the tree. While we can assume to have information about a small subset of nodes from the fossil record or sampling times (for fast-evolving organisms), inferring the ages of the other nodes essentially requires extrapolation and interpolation. Assuming a distribution of branch rates, we can formulate dating as a constrained maximum likelihood (ML) estimation problem. While ML dating methods exist, their accuracy degrades in the face of model misspecification where the assumed parametric statistical distribution of branch rates vastly differs from the true distribution. Notably, most existing methods assume rigid, often unimodal, branch rate distributions. A second challenge is that the likelihood function involves an integral over the continuous domain of the rates and often leads to difficult non-convex optimization problems. To tackle these two challenges, we propose a new method called Molecular Dating using Categorical-models (MD-Cat). MD-Cat uses a categorical model of rates inspired by non-parametric statistics and can approximate a large family of models by discretizing the rate distribution into k categories. Under this model, we can use the Expectation- Maximization (EM) algorithm to co-estimate rate categories and branch lengths in time units. Our model has fewer assumptions about the true distribution of branch rates than parametric models such as Gamma or LogNormal distribution. Our results on two simulated and real datasets of Angiosperms and HIV and a wide selection of rate distributions show that MD-Cat is often more accurate than the alternatives, especially on datasets with exponential or multimodal rate distributions.

5.
Bioinformatics ; 39(39 Suppl 1): i185-i193, 2023 06 30.
Artigo em Inglês | MEDLINE | ID: mdl-37387151

RESUMO

MOTIVATION: Branch lengths and topology of a species tree are essential in most downstream analyses, including estimation of diversification dates, characterization of selection, understanding adaptation, and comparative genomics. Modern phylogenomic analyses often use methods that account for the heterogeneity of evolutionary histories across the genome due to processes such as incomplete lineage sorting. However, these methods typically do not generate branch lengths in units that are usable by downstream applications, forcing phylogenomic analyses to resort to alternative shortcuts such as estimating branch lengths by concatenating gene alignments into a supermatrix. Yet, concatenation and other available approaches for estimating branch lengths fail to address heterogeneity across the genome. RESULTS: In this article, we derive expected values of gene tree branch lengths in substitution units under an extension of the multispecies coalescent (MSC) model that allows substitutions with varying rates across the species tree. We present CASTLES, a new technique for estimating branch lengths on the species tree from estimated gene trees that uses these expected values, and our study shows that CASTLES improves on the most accurate prior methods with respect to both speed and accuracy. AVAILABILITY AND IMPLEMENTATION: CASTLES is available at https://github.com/ytabatabaee/CASTLES.


Assuntos
Evolução Biológica , Neoplasias Epiteliais e Glandulares , Humanos , Filogenia , Movimento Celular , Genômica
6.
Bioinformatics ; 39(1)2023 01 01.
Artigo em Inglês | MEDLINE | ID: mdl-36637196

RESUMO

MOTIVATION: The phylogenetic signal of structural variation informs a more comprehensive understanding of evolution. As (near-)complete genome assembly becomes more commonplace, the next methodological challenge for inferring genome rearrangement trees is the identification of syntenic blocks of orthologous sequences. In this article, we studied 94 reference quality genomes of primarily Mycobacterium tuberculosis (Mtb) isolates as a benchmark to evaluate these methods. The clonal nature of Mtb evolution, the manageable genome sizes, along with substantial levels of structural variation make this an ideal benchmarking dataset. RESULTS: We tested several methods for detecting homology and obtaining syntenic blocks and two methods for inferring phylogenies from them, then compared the resulting trees to the standard method's tree, inferred from nucleotide substitutions. We found that, not only the choice of methods, but also their parameters can impact results, and that the tree inference method had less impact than the block determination method. Interestingly, a rearrangement tree based on blocks from the Cactus whole-genome aligner was fully compatible with the highly supported branches of the substitution-based tree, enabling the combination of the two into a high-resolution supertree. Overall, our results indicate that accurate trees can be inferred using genome rearrangements, but the choice of the methods for inferring homology requires care. AVAILABILITY AND IMPLEMENTATION: Analysis scripts and code written for this study are available at https://gitlab.com/LPCDRP/rearrangement-homology.pub and https://gitlab.com/LPCDRP/syntement. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Mycobacterium tuberculosis , Filogenia , Mycobacterium tuberculosis/genética , Genoma , Sintenia
7.
Syst Biol ; 72(1): 17-34, 2023 05 19.
Artigo em Inglês | MEDLINE | ID: mdl-35485976

RESUMO

Placing new sequences onto reference phylogenies is increasingly used for analyzing environmental samples, especially microbiomes. Existing placement methods assume that query sequences have evolved under specific models directly on the reference phylogeny. For example, they assume single-gene data (e.g., 16S rRNA amplicons) have evolved under the GTR model on a gene tree. Placement, however, often has a more ambitious goal: extending a (genome-wide) species tree given data from individual genes without knowing the evolutionary model. Addressing this challenging problem requires new directions. Here, we introduce Deep-learning Enabled Phylogenetic Placement (DEPP), an algorithm that learns to extend species trees using single genes without prespecified models. In simulations and on real data, we show that DEPP can match the accuracy of model-based methods without any prior knowledge of the model. We also show that DEPP can update the multilocus microbial tree-of-life with single genes with high accuracy. We further demonstrate that DEPP can combine 16S and metagenomic data onto a single tree, enabling community structure analyses that take advantage of both sources of data. [Deep learning; gene tree discordance; metagenomics; microbiome analyses; neural networks; phylogenetic placement.].


Assuntos
Aprendizado Profundo , Microbiota , Filogenia , RNA Ribossômico 16S/genética , Algoritmos , Microbiota/genética
8.
Mol Biol Evol ; 39(12)2022 12 05.
Artigo em Inglês | MEDLINE | ID: mdl-36201617

RESUMO

Phylogenomic analyses routinely estimate species trees using methods that account for gene tree discordance. However, the most scalable species tree inference methods, which summarize independently inferred gene trees to obtain a species tree, are sensitive to hard-to-avoid errors introduced in the gene tree estimation step. This dilemma has created much debate on the merits of concatenation versus summary methods and practical obstacles to using summary methods more widely and to the exclusion of concatenation. The most successful attempt at making summary methods resilient to noisy gene trees has been contracting low support branches from the gene trees. Unfortunately, this approach requires arbitrary thresholds and poses new challenges. Here, we introduce threshold-free weighting schemes for the quartet-based species tree inference, the metric used in the popular method ASTRAL. By reducing the impact of quartets with low support or long terminal branches (or both), weighting provides stronger theoretical guarantees and better empirical performance than the unweighted ASTRAL. Our simulations show that weighting improves accuracy across many conditions and reduces the gap with concatenation in conditions with low gene tree discordance and high noise. On empirical data, weighting improves congruence with concatenation and increases support. Together, our results show that weighting, enabled by a new optimization algorithm we introduce, improves the utility of summary methods and can reduce the incongruence often observed across analytical pipelines.


Assuntos
Algoritmos , Modelos Genéticos , Incerteza , Simulação por Computador , Filogenia
9.
Bioinformatics ; 38(Suppl 1): i413-i421, 2022 06 24.
Artigo em Inglês | MEDLINE | ID: mdl-35758818

RESUMO

MOTIVATION: Phylogenomics faces a dilemma: on the one hand, most accurate species and gene tree estimation methods are those that co-estimate them; on the other hand, these co-estimation methods do not scale to moderately large numbers of species. The summary-based methods, which first infer gene trees independently and then combine them, are much more scalable but are prone to gene tree estimation error, which is inevitable when inferring trees from limited-length data. Gene tree estimation error is not just random noise and can create biases such as long-branch attraction. RESULTS: We introduce a scalable likelihood-based approach to co-estimation under the multi-species coalescent model. The method, called quartet co-estimation (QuCo), takes as input independently inferred distributions over gene trees and computes the most likely species tree topology and internal branch length for each quartet, marginalizing over gene tree topologies and ignoring branch lengths by making several simplifying assumptions. It then updates the gene tree posterior probabilities based on the species tree. The focus on gene tree topologies and the heuristic division to quartets enables fast likelihood calculations. We benchmark our method with extensive simulations for quartet trees in zones known to produce biased species trees and further with larger trees. We also run QuCo on a biological dataset of bees. Our results show better accuracy than the summary-based approach ASTRAL run on estimated gene trees. AVAILABILITY AND IMPLEMENTATION: QuCo is available on https://github.com/maryamrabiee/quco. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Especiação Genética , Modelos Genéticos , Animais , Funções Verossimilhança , Filogenia
10.
Bioinformatics ; 38(6): 1532-1541, 2022 03 04.
Artigo em Inglês | MEDLINE | ID: mdl-34978565

RESUMO

MOTIVATION: As genome-wide reconstruction of phylogenetic trees becomes more widespread, limitations of available data are being appreciated more than ever before. One issue is that phylogenomic datasets are riddled with missing data, and gene trees, in particular, almost always lack representatives from some species otherwise available in the dataset. Since many downstream applications of gene trees require or can benefit from access to complete gene trees, it will be beneficial to algorithmically complete gene trees. Also, gene trees are often unrooted, and rooting them is useful for downstream applications. While completing and rooting a gene tree with respect to a given species tree has been studied, those problems are not studied in depth when we lack such a reference species tree. RESULTS: We study completion of gene trees without a need for a reference species tree. We formulate an optimization problem to complete the gene trees while minimizing their quartet distance to the given set of gene trees. We extend a seminal algorithm by Brodal et al. to solve this problem in quasi-linear time. In simulated studies and on a large empirical data, we show that completion of gene trees using other gene trees is relatively accurate and, unlike the case where a species tree is available, is unbiased. AVAILABILITY AND IMPLEMENTATION: Our method, tripVote, is available at https://github.com/uym2/tripVote. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Algoritmos , Modelos Genéticos , Filogenia
11.
Bioinformatics ; 38(21): 4949-4950, 2022 10 31.
Artigo em Inglês | MEDLINE | ID: mdl-36094339

RESUMO

MOTIVATION: Species tree inference from multi-copy gene trees has long been a challenge in phylogenomics. The recent method ASTRAL-Pro has made strides by enabling multi-copy gene family trees as input and has been quickly adopted. Yet, its scalability, especially memory usage, needs to improve to accommodate the ever-growing dataset size. RESULTS: We present ASTRAL-Pro 2, an ultrafast and memory efficient version of ASTRAL-Pro that adopts a placement-based optimization algorithm for significantly better scalability without sacrificing accuracy. AVAILABILITY AND IMPLEMENTATION: The source code and binary files are publicly available at https://github.com/chaoszhang/ASTER; data are available at https://github.com/chaoszhang/A-Pro2_data. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Algoritmos , Software , Linhagem , Simulação por Computador , Filogenia
12.
Nature ; 551(7681): 457-463, 2017 11 23.
Artigo em Inglês | MEDLINE | ID: mdl-29088705

RESUMO

Our growing awareness of the microbial world's importance and diversity contrasts starkly with our limited understanding of its fundamental structure. Despite recent advances in DNA sequencing, a lack of standardized protocols and common analytical frameworks impedes comparisons among studies, hindering the development of global inferences about microbial life on Earth. Here we present a meta-analysis of microbial community samples collected by hundreds of researchers for the Earth Microbiome Project. Coordinated protocols and new analytical methods, particularly the use of exact sequences instead of clustered operational taxonomic units, enable bacterial and archaeal ribosomal RNA gene sequences to be followed across multiple studies and allow us to explore patterns of diversity at an unprecedented scale. The result is both a reference database giving global context to DNA sequence data and a framework for incorporating data from future studies, fostering increasingly complete characterization of Earth's microbial diversity.


Assuntos
Biodiversidade , Planeta Terra , Microbiota/genética , Animais , Archaea/genética , Archaea/isolamento & purificação , Bactérias/genética , Bactérias/isolamento & purificação , Ecologia/métodos , Dosagem de Genes , Mapeamento Geográfico , Humanos , Plantas/microbiologia , RNA Ribossômico 16S/análise , RNA Ribossômico 16S/genética
13.
Mol Biol Evol ; 38(3): 1151-1167, 2021 03 09.
Artigo em Inglês | MEDLINE | ID: mdl-32886788

RESUMO

Phylogenetic trees inferred from sequence data often have branch lengths measured in the expected number of substitutions and therefore, do not have divergence times estimated. These trees give an incomplete view of evolutionary histories since many applications of phylogenies require time trees. Many methods have been developed to convert the inferred branch lengths from substitution unit to time unit using calibration points, but none is universally accepted as they are challenged in both scalability and accuracy under complex models. Here, we introduce a new method that formulates dating as a nonconvex optimization problem where the variance of log-transformed rate multipliers is minimized across the tree. On simulated and real data, we show that our method, wLogDate, is often more accurate than alternatives and is more robust to various model assumptions.


Assuntos
Técnicas Genéticas , Filogenia , Algoritmos , Ebolavirus , Vírus da Influenza A Subtipo H1N1 , Taxa de Mutação
14.
Bioinformatics ; 36(24): 5623-5631, 2021 Apr 05.
Artigo em Inglês | MEDLINE | ID: mdl-33555318

RESUMO

MOTIVATION: Species delimitation, the process of deciding how to group a set of organisms into units called species, is one of the most challenging problems in computational evolutionary biology. While many methods exist for species delimitation, most based on the coalescent theory, few are scalable to very large datasets, and methods that scale tend to be not accurate. Species delimitation is closely related to species tree inference from discordant gene trees, a problem that has enjoyed rapid advances in recent years. RESULTS: In this article, we build on the accuracy and scalability of recent quartet-based methods for species tree estimation and propose a new method called SODA for species delimitation. SODA relies heavily on a recently developed method for testing zero branch length in species trees. In extensive simulations, we show that SODA can easily scale to very large datasets while maintaining high accuracy. AVAILABILITY AND IMPLEMENTATION: The code and data presented here are available on https://github.com/maryamrabiee/SODA. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

15.
PLoS Comput Biol ; 17(11): e1009449, 2021 11.
Artigo em Inglês | MEDLINE | ID: mdl-34780468

RESUMO

The cost of sequencing the genome is dropping at a much faster rate compared to assembling and finishing the genome. The use of lightly sampled genomes (genome-skims) could be transformative for genomic ecology, and results using k-mers have shown the advantage of this approach in identification and phylogenetic placement of eukaryotic species. Here, we revisit the basic question of estimating genomic parameters such as genome length, coverage, and repeat structure, focusing specifically on estimating the k-mer repeat spectrum. We show using a mix of theoretical and empirical analysis that there are fundamental limitations to estimating the k-mer spectra due to ill-conditioned systems, and that has implications for other genomic parameters. We get around this problem using a novel constrained optimization approach (Spline Linear Programming), where the constraints are learned empirically. On reads simulated at 1X coverage from 66 genomes, our method, REPeat SPECTra Estimation (RESPECT), had 2.2% error in length estimation compared to 27% error previously achieved. In shotgun sequenced read samples with contaminants, RESPECT length estimates had median error 4%, in contrast to other methods that had median error 80%. Together, the results suggest that low-pass genomic sequencing can yield reliable estimates of the length and repeat content of the genome. The RESPECT software will be publicly available at https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_shahab-2Dsarmashghi_RESPECT.git&d=DwIGAw&c=-35OiAkTchMrZOngvJPOeA&r=ZozViWvD1E8PorCkfwYKYQMVKFoEcqLFm4Tg49XnPcA&m=f-xS8GMHKckknkc7Xpp8FJYw_ltUwz5frOw1a5pJ81EpdTOK8xhbYmrN4ZxniM96&s=717o8hLR1JmHFpRPSWG6xdUQTikyUjicjkipjFsKG4w&e=.


Assuntos
Algoritmos , Genoma , Genômica/estatística & dados numéricos , Sequências Repetitivas de Ácido Nucleico , Software , Animais , Biologia Computacional , Simulação por Computador , Bases de Dados Genéticas/estatística & dados numéricos , Humanos , Invertebrados/classificação , Invertebrados/genética , Análise dos Mínimos Quadrados , Modelos Lineares , Mamíferos/classificação , Mamíferos/genética , Modelos Genéticos , Filogenia , Plantas/classificação , Plantas/genética , Vertebrados/classificação , Vertebrados/genética
16.
Mol Biol Evol ; 37(11): 3292-3307, 2020 11 01.
Artigo em Inglês | MEDLINE | ID: mdl-32886770

RESUMO

Phylogenetic inference from genome-wide data (phylogenomics) has revolutionized the study of evolution because it enables accounting for discordance among evolutionary histories across the genome. To this end, summary methods have been developed to allow accurate and scalable inference of species trees from gene trees. However, most of these methods, including the widely used ASTRAL, can only handle single-copy gene trees and do not attempt to model gene duplication and gene loss. As a result, most phylogenomic studies have focused on single-copy genes and have discarded large parts of the data. Here, we first propose a measure of quartet similarity between single-copy and multicopy trees that accounts for orthology and paralogy. We then introduce a method called ASTRAL-Pro (ASTRAL for PaRalogs and Orthologs) to find the species tree that optimizes our quartet similarity measure using dynamic programing. By studying its performance on an extensive collection of simulated data sets and on real data sets, we show that ASTRAL-Pro is more accurate than alternative methods.


Assuntos
Técnicas Genéticas , Filogenia , Algoritmos , Plantas/genética , Leveduras/genética
17.
Nat Methods ; 15(4): 279-282, 2018 04.
Artigo em Inglês | MEDLINE | ID: mdl-29457793

RESUMO

Most approaches that capture signatures of selective sweeps in population genomics data do not identify the specific mutation favored by selection. We present iSAFE (for "integrated selection of allele favored by evolution"), a method that enables researchers to accurately pinpoint the favored mutation in a large region (∼5 Mbp) by using a statistic derived solely from population genetics signals. iSAFE does not require knowledge of demography, the phenotype under selection, or functional annotations of mutations.


Assuntos
Genômica , Técnicas de Amplificação de Ácido Nucleico , Hibridização de Ácido Nucleico/métodos , Alelos , Evolução Biológica , Haplótipos , Humanos , Mutação
18.
Bioinformatics ; 36(Suppl_1): i335-i343, 2020 07 01.
Artigo em Inglês | MEDLINE | ID: mdl-32657414

RESUMO

MOTIVATION: Consider a simple computational problem. The inputs are (i) the set of mixed reads generated from a sample that combines two organisms and (ii) separate sets of reads for several reference genomes of known origins. The goal is to find the two organisms that constitute the mixed sample. When constituents are absent from the reference set, we seek to phylogenetically position them with respect to the underlying tree of the reference species. This simple yet fundamental problem (which we call phylogenetic double-placement) has enjoyed surprisingly little attention in the literature. As genome skimming (low-pass sequencing of genomes at low coverage, precluding assembly) becomes more prevalent, this problem finds wide-ranging applications in areas as varied as biodiversity research, food production and provenance, and evolutionary reconstruction. RESULTS: We introduce a model that relates distances between a mixed sample and reference species to the distances between constituents and reference species. Our model is based on Jaccard indices computed between each sample represented as k-mer sets. The model, built on several assumptions and approximations, allows us to formalize the phylogenetic double-placement problem as a non-convex optimization problem that decomposes mixture distances and performs phylogenetic placement simultaneously. Using a variety of techniques, we are able to solve this optimization problem numerically. We test the resulting method, called MIxed Sample Analysis tool (MISA), on a varied set of simulated and biological datasets. Despite all the assumptions used, the method performs remarkably well in practice. AVAILABILITY AND IMPLEMENTATION: The software and data are available at https://github.com/balabanmetin/misa and https://github.com/balabanmetin/misa-data. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Algoritmos , Software , Evolução Biológica , Genoma , Sequenciamento de Nucleotídeos em Larga Escala , Filogenia , Análise de Sequência de DNA
19.
Syst Biol ; 69(2): 384-391, 2020 03 01.
Artigo em Inglês | MEDLINE | ID: mdl-31290974

RESUMO

Phylogenomic analyses have increasingly adopted species tree reconstruction using methods that account for gene tree discordance using pipelines that require both human effort and computational resources. As the number of available genomes continues to increase, a new problem is facing researchers. Once more species become available, they have to repeat the whole process from the beginning because updating species trees is currently not possible. However, the de novo inference can be prohibitively costly in human effort or machine time. In this article, we introduce INSTRAL, a method that extends ASTRAL to enable phylogenetic placement. INSTRAL is designed to place a new species on an existing species tree after sequences from the new species have already been added to gene trees; thus, INSTRAL is complementary to existing placement methods that update gene trees. [ASTRAL; ILS; phylogenetic placement; species tree reconstruction.].


Assuntos
Classificação/métodos , Filogenia , Análise de Sequência de DNA/métodos , Software
20.
Syst Biol ; 69(3): 566-578, 2020 05 01.
Artigo em Inglês | MEDLINE | ID: mdl-31545363

RESUMO

Placing a new species on an existing phylogeny has increasing relevance to several applications. Placement can be used to update phylogenies in a scalable fashion and can help identify unknown query samples using (meta-)barcoding, skimming, or metagenomic data. Maximum likelihood (ML) methods of phylogenetic placement exist, but these methods are not scalable to reference trees with many thousands of leaves, limiting their ability to enjoy benefits of dense taxon sampling in modern reference libraries. They also rely on assembled sequences for the reference set and aligned sequences for the query. Thus, ML methods cannot analyze data sets where the reference consists of unassembled reads, a scenario relevant to emerging applications of genome skimming for sample identification. We introduce APPLES, a distance-based method for phylogenetic placement. Compared to ML, APPLES is an order of magnitude faster and more memory efficient, and unlike ML, it is able to place on large backbone trees (tested for up to 200,000 leaves). We show that using dense references improves accuracy substantially so that APPLES on dense trees is more accurate than ML on sparser trees, where it can run. Finally, APPLES can accurately identify samples without assembled reference or aligned queries using kmer-based distances, a scenario that ML cannot handle. APPLES is available publically at github.com/balabanmetin/apples.


Assuntos
Classificação/métodos , Filogenia , Software , Algoritmos , Sequência de Bases
SELEÇÃO DE REFERÊNCIAS
DETALHE DA PESQUISA