1.
Algorithms Mol Biol ; 19(1): 18, 2024 Apr 29.
Article in English | MEDLINE | ID: mdl-38685065

ABSTRACT

Copy number aberrations (CNAs) are ubiquitous in many types of cancer. Inferring CNAs from cancer genomic data could help shed light on the initiation, progression, and potential treatment of cancer. While such data have traditionally been available via "bulk sequencing," the more recently introduced techniques for single-cell DNA sequencing (scDNAseq) provide the type of data that makes CNA inference possible at single-cell resolution. We introduce a new birth-death evolutionary model of CNAs and a Bayesian method, NestedBD, for the inference of evolutionary trees (topologies and branch lengths with relative mutation rates) from single-cell data. We evaluated NestedBD's performance using simulated data sets, benchmarking its accuracy against traditional phylogenetic tools as well as state-of-the-art methods. The results show that NestedBD infers more accurate topologies and branch lengths, and that the birth-death model can improve the accuracy of copy number estimation. When applied to biological data sets, NestedBD infers plausible evolutionary histories of two colorectal cancer samples. NestedBD is available at https://github.com/Androstane/NestedBD.
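To make the idea of a birth-death model of copy number concrete, the following is a minimal sketch of a generic linear birth-death process simulated along a single branch. It is not NestedBD's model or code: the function name, the per-copy rates, and the choice to treat zero copies as an absorbing state are all assumptions made for illustration.

```python
import random


def simulate_copy_number(start, branch_length, birth_rate, death_rate, rng):
    """Gillespie simulation of a linear birth-death process along one branch.

    Hypothetical helper: each existing copy is gained at birth_rate and lost
    at death_rate; once the copy number reaches 0 it stays there.
    """
    copies, elapsed = start, 0.0
    while copies > 0:
        total_rate = copies * (birth_rate + death_rate)
        if total_rate == 0:
            break  # no events can occur when both rates are zero
        elapsed += rng.expovariate(total_rate)  # waiting time to the next event
        if elapsed > branch_length:
            break  # the next event would fall beyond the end of the branch
        if rng.random() < birth_rate / (birth_rate + death_rate):
            copies += 1  # gain of one copy
        else:
            copies -= 1  # loss of one copy
    return copies


# Simulate the copy number of one genomic bin at the end of 1000 branches.
rng = random.Random(0)
samples = [simulate_copy_number(2, 0.5, 1.0, 1.0, rng) for _ in range(1000)]
```

A Bayesian method would evaluate the likelihood of observed copy numbers under such a process rather than simulate it; the simulation only illustrates the event structure.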

2.
Nature ; 2024 Apr 01.
Article in English | MEDLINE | ID: mdl-38560995

ABSTRACT

Despite tremendous efforts in the past decades, relationships among main avian lineages remain heavily debated without a clear resolution. Discrepancies have been attributed to diversity of species sampled, phylogenetic method and the choice of genomic regions [1-3]. Here we address these issues by analysing the genomes of 363 bird species [4] (218 taxonomic families, 92% of total). Using intergenic regions and coalescent methods, we present a well-supported tree but also a marked degree of discordance. The tree confirms that Neoaves experienced rapid radiation at or near the Cretaceous-Palaeogene boundary. Sufficient loci rather than extensive taxon sampling were more effective in resolving difficult nodes. Remaining recalcitrant nodes involve species that are a challenge to model due to either extreme DNA composition, variable substitution rates, incomplete lineage sorting or complex evolutionary events such as ancient hybridization. Assessment of the effects of different genomic partitions showed high heterogeneity across the genome. We discovered sharp increases in effective population size, substitution rates and relative brain size following the Cretaceous-Palaeogene extinction event, supporting the hypothesis that emerging ecological opportunities catalysed the diversification of modern birds. The resulting phylogenetic estimate offers fresh insights into the rapid radiation of modern birds and provides a taxon-rich backbone tree for future comparative studies.

3.
Nat Commun ; 14(1): 8262, 2023 Dec 13.
Article in English | MEDLINE | ID: mdl-38092737

ABSTRACT

Cancers develop and progress as mutations accumulate, and with the advent of single-cell DNA and RNA sequencing, researchers can observe these mutations and their transcriptomic effects and predict proteomic changes with remarkable temporal and spatial precision. However, to connect genomic mutations with their transcriptomic and proteomic consequences, cells with either only DNA data or only RNA data must be mapped to a common domain. For this purpose, we present MaCroDNA, a method that uses maximum weighted bipartite matching of per-gene read counts from single-cell DNA and RNA-seq data. Using ground truth information from colorectal cancer data, we demonstrate the advantage of MaCroDNA over existing methods in accuracy and speed. Exemplifying the utility of single-cell data integration in cancer research, we suggest, based on results derived using MaCroDNA, that genomic mutations of large effect size increasingly contribute to differential expression between cells as Barrett's esophagus progresses to esophageal cancer, reaffirming the findings of the previous studies.


Subjects
Adenocarcinoma , Barrett Esophagus , Esophageal Neoplasms , Humans , Adenocarcinoma/genetics , RNA/genetics , Proteomics , Barrett Esophagus/genetics , Esophageal Neoplasms/pathology , DNA
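The matching step named in the abstract above (maximum weighted bipartite matching of per-gene read counts) can be sketched as follows. This is an illustration under stated assumptions, not MaCroDNA's implementation: the abstract does not specify the edge weights, so Pearson correlation is assumed here; the function names and toy counts are hypothetical; and the matching is found by brute force, which is only feasible for a handful of cells.

```python
from itertools import permutations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length count vectors (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def max_weight_matching(dna_cells, rna_cells):
    """Brute-force maximum weighted bipartite matching of DNA cells to RNA cells.

    Assumes len(dna_cells) <= len(rna_cells); every DNA cell receives a distinct RNA cell.
    """
    weights = [[pearson(d, r) for r in rna_cells] for d in dna_cells]
    best_score, best_perm = float("-inf"), None
    for perm in permutations(range(len(rna_cells)), len(dna_cells)):
        score = sum(weights[i][j] for i, j in enumerate(perm))
        if score > best_score:
            best_score, best_perm = score, perm
    return list(enumerate(best_perm)), best_score


# Toy per-gene read counts (hypothetical values): rows are cells, columns are genes.
dna = [[10, 2, 8, 1], [1, 9, 2, 7], [5, 5, 5, 9]]
rna = [[2, 18, 3, 15], [4, 6, 5, 11], [21, 3, 17, 2]]
pairs, score = max_weight_matching(dna, rna)  # pairs of (DNA cell index, RNA cell index)
```

For realistic cell numbers, a polynomial-time assignment solver (for example the Hungarian algorithm) would replace the exhaustive search.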
4.
Genome Biol Evol ; 15(6), 2023 06 01.
Article in English | MEDLINE | ID: mdl-37243541

ABSTRACT

The evolutionary histories of individual loci in a genome can be estimated independently, but this approach is error-prone due to the limited amount of sequence data available for each gene. This has led to the development of a diverse array of gene tree error correction methods that reduce the distance to the species tree. We investigate the performance of two representatives of these methods: TRACTION and TreeFix. We found that gene tree error correction frequently increases the level of error in gene tree topologies by "correcting" them to be closer to the species tree, even when the true gene and species trees are discordant. We confirm that full Bayesian inference of the gene trees under the multispecies coalescent model is more accurate than independent inference. Future gene tree correction approaches and methods should incorporate an adequately realistic model of evolution instead of relying on oversimplified heuristics.


Subjects
Genome , Models, Genetic , Phylogeny , Bayes Theorem
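The abstract above measures error as topological distance between trees. As a hedged illustration, the sketch below computes a rooted Robinson-Foulds distance on trees written as nested tuples and shows, on a hypothetical four-taxon example, how replacing an estimated gene tree with the species tree can move it further from a discordant true gene tree. The tree encoding, function names, and example trees are assumptions; the abstract does not state which distance the study used.

```python
def clusters(tree):
    """Return the non-trivial clades (frozensets of leaf names) of a rooted tree."""
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(leaves(child) for child in node))
        found.add(below)
        return below

    root = leaves(tree)
    found.discard(root)  # the root clade is shared by every tree on the same taxa
    return found


def rf_distance(tree_a, tree_b):
    """Rooted Robinson-Foulds distance: number of clades present in exactly one tree."""
    return len(clusters(tree_a) ^ clusters(tree_b))


# Hypothetical example: the true gene tree is discordant with the species tree.
species_tree = (("A", "B"), ("C", "D"))
true_gene_tree = (("A", "C"), ("B", "D"))
estimated_gene_tree = ((("A", "C"), "B"), "D")
corrected_gene_tree = species_tree  # a "correction" that simply adopts the species tree

error_before = rf_distance(estimated_gene_tree, true_gene_tree)
error_after = rf_distance(corrected_gene_tree, true_gene_tree)
```

In this toy case the corrected tree is identical to the species tree, yet further from the true gene tree than the uncorrected estimate was.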
5.
Mol Phylogenet Evol ; 181: 107724, 2023 04.
Article in English | MEDLINE | ID: mdl-36720421

ABSTRACT

Accurate inference of population parameters plays a pivotal role in unravelling evolutionary histories. While recombination has been universally accepted as a fundamental process in the evolution of sexually reproducing organisms, it remains challenging to model it exactly. Thus, existing coalescent-based approaches make different assumptions or approximations to facilitate phylogenetic inference, which can potentially bring about biases in estimates of evolutionary parameters when recombination is present. In this article, we evaluate the performance of population parameter estimation using three methods (StarBEAST2, SNAPP, and diCal2) that represent three different types of inference. We performed whole-genome simulations in which recombination rates, mutation rates, and levels of incomplete lineage sorting were varied. We show that StarBEAST2 using short or medium-sized loci is robust to realistic rates of recombination, which is in agreement with previous studies. SNAPP, as expected, is generally unaffected by recombination events. Most surprisingly, diCal2, a method that is designed to explicitly account for recombination, performs considerably worse than other methods under comparison.


Subjects
Genome , Mutation Rate , Phylogeny , Recombination, Genetic , Models, Genetic , Computer Simulation
6.
PLoS Comput Biol ; 18(6): e1010216, 2022 06.
Article in English | MEDLINE | ID: mdl-35675326

ABSTRACT

Phylogenomic studies of prokaryotic taxa often assume conserved marker genes are homologous across their length. However, processes such as horizontal gene transfer or gene duplication and loss may disrupt this homology by recombining only parts of genes, causing gene fission or fusion. We show using simulation that it is necessary to delineate homology groups in a set of bacterial genomes without relying on gene annotations to define the boundaries of homologous regions. To solve this problem, we have developed a graph-based algorithm to partition a set of bacterial genomes into Maximal Homologous Groups of sequences (MHGs) where each MHG is a maximal set of maximum-length sequences which are homologous across the entire sequence alignment. We applied our algorithm to a dataset of 19 Enterobacteriaceae species and found that MHGs cover much greater proportions of genomes than markers and, relatedly, are less biased in terms of the functions of the genes they cover. We zoomed in on the correlation between each individual marker and its overlapping MHGs, and show that few phylogenetic splits supported by the markers are supported by the MHGs while many marker-supported splits are contradicted by the MHGs. A comparison of the species tree inferred from marker genes with the species tree inferred from MHGs suggests that the increased bias and lack of genome coverage by markers cause incorrect inferences as to the overall relationship between bacterial taxa.


Subjects
Genome, Bacterial , Prokaryotic Cells , Gene Transfer, Horizontal , Genome, Bacterial/genetics , Phylogeny , Sequence Alignment
7.
Bioinformatics ; 38(Suppl 1): i195-i202, 2022 06 24.
Article in English | MEDLINE | ID: mdl-35758771

ABSTRACT

MOTIVATION: Single-nucleotide variants (SNVs) are the most common variations in the human genome. Recently developed methods for SNV detection from single-cell DNA sequencing data, such as SCIΦ and scVILP, leverage the evolutionary history of the cells to overcome the technical errors associated with single-cell sequencing protocols. Despite being accurate, these methods are not scalable to the extensive genomic breadth of single-cell whole-genome (scWGS) and whole-exome sequencing (scWES) data. RESULTS: Here, we report on a new scalable method, Phylovar, which extends the phylogeny-guided variant calling approach to sequencing datasets containing millions of loci. Through benchmarking on simulated datasets under different settings, we show that Phylovar outperforms SCIΦ in terms of running time while being more accurate than Monovar (which is not phylogeny-aware) in terms of SNV detection. Furthermore, we applied Phylovar to two real biological datasets: an scWES triple-negative breast cancer dataset consisting of 32 cells and 3375 loci, and an scWGS dataset of neuron cells from a normal human brain containing 16 cells and approximately 2.5 million loci. For the cancer data, Phylovar detected somatic SNVs with high or moderate functional impact that were also supported by a bulk sequencing dataset, and for the neuron dataset, Phylovar identified 5745 SNVs with non-synonymous effects, some of which were associated with neurodegenerative diseases. AVAILABILITY AND IMPLEMENTATION: Phylovar is implemented in Python and is publicly available at https://github.com/NakhlehLab/Phylovar.


Subjects
High-Throughput Nucleotide Sequencing , Nucleotides , Genome, Human , High-Throughput Nucleotide Sequencing/methods , Humans , Phylogeny , Sequence Analysis, DNA
8.
Bioinformatics ; 38(10): 2912-2914, 2022 05 13.
Article in English | MEDLINE | ID: mdl-35561189

ABSTRACT

SUMMARY: We report on a new single-cell DNA sequence simulator, SimSCSnTree, which generates an evolutionary tree of cells and evolves single nucleotide variants (SNVs) and copy number aberrations (CNAs) along its branches. Data generated by the simulator can be used to benchmark tools for single-cell genomic analyses, particularly in cancer where SNVs and CNAs are ubiquitous. AVAILABILITY AND IMPLEMENTATION: SimSCSnTree is now on BioConda and also is freely available for download at https://github.com/compbiofan/SimSCSnTree.git with detailed documentation.


Subjects
Genome , Genomics , Base Sequence , DNA Copy Number Variations , High-Throughput Nucleotide Sequencing , Sequence Analysis, DNA , Single-Cell Analysis , Software
9.
Nat Commun ; 13(1): 1728, 2022 04 01.
Article in English | MEDLINE | ID: mdl-35365602

ABSTRACT

Deep Learning (DL) has recently enabled unprecedented advances in one of the grand challenges in computational biology: the half-century-old problem of protein structure prediction. In this paper we discuss recent advances, limitations, and future perspectives of DL on five broad areas: protein structure prediction, protein function prediction, genome engineering, systems biology and data integration, and phylogenetic inference. We discuss each application area and cover the main bottlenecks of DL approaches, such as training data, problem scope, and the ability to leverage existing DL architectures in new contexts. To conclude, we provide a summary of the subject-specific and general challenges for DL across the biosciences.


Subjects
Deep Learning , Computational Biology , Phylogeny , Proteins , Systems Biology
10.
Syst Biol ; 71(3): 706-720, 2022 04 19.
Article in English | MEDLINE | ID: mdl-34605924

ABSTRACT

Phylogenetic networks provide a powerful framework for modeling and analyzing reticulate evolutionary histories. While polyploidy has been shown to be prevalent not only in plants but also in other groups of eukaryotic species, most work done thus far on phylogenetic network inference assumes diploid hybridization. These inference methods have been applied, with varying degrees of success, to data sets with polyploid species, even though polyploidy violates the mathematical assumptions underlying these methods. Statistical methods were developed recently for handling specific types of polyploids, as were parsimony methods that handle polyploidy more generally but exclude processes such as incomplete lineage sorting. In this article, we introduce a new method for inferring most parsimonious phylogenetic networks on data that include polyploid species. Taking gene tree topologies as input, the method seeks a phylogenetic network that minimizes deep coalescences while accounting for polyploidy. We demonstrate the performance of the method on both simulated and biological data. The inference method as well as a method for evaluating evolutionary hypotheses in the form of phylogenetic networks are implemented and publicly available in the PhyloNet software package. [Incomplete lineage sorting; minimizing deep coalescences; multilabeled trees; multispecies network coalescent; phylogenetic networks; polyploidy.]


Subjects
Hybridization, Genetic , Polyploidy , Biological Evolution , Humans , Phylogeny
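The "minimizing deep coalescences" criterion mentioned in the abstract above can be illustrated in its simplest setting. The sketch below counts extra lineages for a gene tree reconciled inside a species tree; it covers only the tree case with one allele per species and does not handle networks or polyploidy, which are the paper's actual contribution. The nested-tuple encoding and function names are assumptions, and this is not PhyloNet's code.

```python
def leaf_set(node):
    """Leaf names below a node of a nested-tuple tree."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaf_set(child) for child in node))


def lineages_leaving(gene_node, species_clade):
    """Count maximal gene-tree clades whose leaves all lie inside species_clade."""
    if leaf_set(gene_node) <= species_clade:
        return 1
    if isinstance(gene_node, str):
        return 0
    return sum(lineages_leaving(child, species_clade) for child in gene_node)


def extra_lineages(species_tree, gene_tree):
    """Sum, over internal non-root species-tree branches, of (gene lineages - 1).

    Assumes both trees have the same leaf labels, one allele per species.
    """
    total = 0

    def visit(node, is_root):
        nonlocal total
        if isinstance(node, str):
            return  # a single allele contributes no extra lineage on a leaf branch
        if not is_root:
            total += lineages_leaving(gene_tree, leaf_set(node)) - 1
        for child in node:
            visit(child, False)

    visit(species_tree, True)
    return total


def mdc_score(species_tree, gene_trees):
    """Parsimony score of a candidate species tree: total extra lineages over all loci."""
    return sum(extra_lineages(species_tree, g) for g in gene_trees)


# Hypothetical four-taxon example.
candidate = (("A", "B"), ("C", "D"))
loci = [(("A", "B"), ("C", "D")), (("A", "C"), ("B", "D"))]
score = mdc_score(candidate, loci)
```

A parsimony search would evaluate such a score for many candidate topologies and keep the one with the smallest total.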
11.
Syst Biol ; 71(2): 367-381, 2022 02 10.
Article in English | MEDLINE | ID: mdl-34245291

ABSTRACT

Many recent phylogenetic methods have focused on accurately inferring species trees when there is gene tree discordance due to incomplete lineage sorting (ILS). For almost all of these methods, and for phylogenetic methods in general, the data for each locus are assumed to consist of orthologous, single-copy sequences. Loci that are present in more than a single copy in any of the studied genomes are excluded from the data. These steps greatly reduce the number of loci available for analysis. The question we seek to answer in this study is: what happens if one runs such species tree inference methods on data where paralogy is present, in addition to or without ILS being present? Through simulation studies and analyses of two large biological data sets, we show that running such methods on data with paralogs can still provide accurate results. We use multiple different methods, some of which are based directly on the multispecies coalescent model, and some of which have been proven to be statistically consistent under it. We also treat the paralogous loci in multiple ways: from explicitly denoting them as paralogs, to randomly selecting one copy per species. In all cases, the inferred species trees are as accurate as equivalent analyses using single-copy orthologs. Our results have significant implications for the use of ILS-aware phylogenomic analyses, demonstrating that they do not have to be restricted to single-copy loci. This will greatly increase the amount of data that can be used for phylogenetic inference. [Gene duplication and loss; incomplete lineage sorting; multispecies coalescent; orthology; paralogy.]


Subjects
Gene Duplication , Models, Genetic , Computer Simulation , Genome , Phylogeny
12.
Genome Res ; 31(11): 2107-2119, 2021 11.
Article in English | MEDLINE | ID: mdl-34426513

ABSTRACT

Coalescent methods are proven and powerful tools for population genetics, phylogenetics, epidemiology, and other fields. A promising avenue for the analysis of large genomic alignments, which are increasingly common, is coalescent hidden Markov model (coalHMM) methods, but these methods have lacked general usability and flexibility. We introduce a novel method for automatically learning a coalHMM and inferring the posterior distributions of evolutionary parameters using black-box variational inference, with the transition rates between local genealogies derived empirically by simulation. This derivation enables our method to work directly with three or four taxa and through a divide-and-conquer approach with more taxa. Using a simulated data set resembling a human-chimp-gorilla scenario, we show that our method has accuracy comparable to or better than that of previous coalHMM methods. Both species divergence times and population sizes were accurately inferred. The method also infers local genealogies, and we report on their accuracy. Furthermore, we discuss a potential direction for scaling the method to larger data sets through a divide-and-conquer approach. This accuracy means our method is useful now, and by deriving transition rates by simulation, it is flexible enough to enable future implementations of various population models.


Subjects
Genetics, Population , Models, Genetic , Animals , Computer Simulation , Humans , Population Density , Recombination, Genetic
13.
PLoS Genet ; 17(8): e1009701, 2021 08.
Article in English | MEDLINE | ID: mdl-34407067

ABSTRACT

Trait evolution among a set of species, a central theme in evolutionary biology, has long been understood and analyzed with respect to a species tree. However, the field of phylogenomics, which has been propelled by advances in sequencing technologies, has ushered in the era of species/gene tree incongruence and, consequently, a more nuanced understanding of trait evolution. For a trait whose states are incongruent with the branching patterns in the species tree, the same state could have arisen independently in different species (homoplasy) or followed the branching patterns of gene trees, incongruent with the species tree (hemiplasy). Another evolutionary process whose extent and significance are better revealed by phylogenomic studies is gene flow between different species. In this work, we present a phylogenomic method for assessing the role of hybridization and introgression in the evolution of polymorphic or monomorphic binary traits. We apply the method to simulated evolutionary scenarios to demonstrate the interplay between the parameters of the evolutionary history and the role of introgression in a binary trait's evolution (which we call xenoplasy). Very importantly, we demonstrate, including on a biological data set, that inferring a species tree and using it for trait evolution analysis in the presence of gene flow could lead to misleading hypotheses about trait evolution.


Subjects
Computational Biology/methods , Genetic Introgression/genetics , Quantitative Trait Loci , Evolution, Molecular , Genetic Speciation , Models, Genetic , Phenotype , Phylogeny
14.
Genome Biol ; 21(1): 208, 2020 08 17.
Article in English | MEDLINE | ID: mdl-32807205

ABSTRACT

Copy number aberrations (CNAs), which are pathogenic copy number variations (CNVs), play an important role in the initiation and progression of cancer. Single-cell DNA-sequencing (scDNAseq) technologies produce data that are ideal for inferring CNAs. In this review, we survey eight methods that have been developed for detecting CNAs in scDNAseq data, and categorize them according to the steps of a seven-step pipeline that they employ. Furthermore, we review models and methods for evolutionary analyses of CNAs from scDNAseq data and highlight advances and future research directions for computational methods for CNA detection from scDNAseq data.


Subjects
Base Sequence , Computational Biology/methods , DNA Copy Number Variations , Sequence Analysis, DNA/methods , Chromosome Aberrations , DNA , High-Throughput Nucleotide Sequencing , Humans , Neoplasms/genetics
15.
PLoS Comput Biol ; 16(7): e1008012, 2020 07.
Article in English | MEDLINE | ID: mdl-32658894

ABSTRACT

Single-cell DNA sequencing technologies are enabling the study of mutations and their evolutionary trajectories in cancer. Somatic copy number aberrations (CNAs) have been implicated in the development and progression of various types of cancer. A wide array of methods for CNA detection has been either developed specifically for or adapted to single-cell DNA sequencing data. Understanding the strengths and limitations that are unique to each of these methods is very important for obtaining accurate copy number profiles from single-cell DNA sequencing data. We benchmarked three widely used methods (Ginkgo, HMMcopy, and CopyNumber) on simulated as well as real datasets. To facilitate this, we developed a novel simulator of single-cell genome evolution in the presence of CNAs. Furthermore, to assess performance on empirical data where the ground truth is unknown, we introduce a phylogeny-based measure for identifying potentially erroneous inferences. While single-cell DNA sequencing is very promising for elucidating and understanding CNAs, our findings show that even the best existing method does not exceed 80% accuracy. New methods that significantly improve upon the accuracy of these three methods are needed. Furthermore, with the large datasets being generated, the methods must be computationally efficient.


Subjects
DNA Copy Number Variations , Genome, Human , Sequence Analysis, DNA/methods , Single-Cell Analysis/methods , Algorithms , Chromosome Aberrations , Computational Biology , Computer Simulation , Gene Dosage , Humans , Mutation , Neoplasms/genetics , Ploidies , Poisson Distribution , ROC Curve , Reproducibility of Results , Software
16.
BMC Genomics ; 21(Suppl 2): 219, 2020 Apr 16.
Article in English | MEDLINE | ID: mdl-32299348

ABSTRACT

BACKGROUND: Multi-locus species phylogeny inference is based on models of sequence evolution on gene trees as well as models of gene tree evolution within the branches of species phylogenies. Almost all statistical methods for this inference task assume a common mechanism across all loci as captured by a single value of each branch length of the species phylogeny. RESULTS: In this paper, we pursue a "no common mechanism" (NCM) model, where every gene tree evolves according to its own parameters of the species phylogeny. Based on this model, we derive an analytically integrated likelihood of both species trees and networks given the gene trees of multiple loci under an NCM model. We demonstrate the performance of inference under this integrated likelihood on both simulated and biological data. CONCLUSIONS: The model presented here will afford opportunities for exploring connections among various criteria for estimating species phylogenies from multiple, independent loci. Moreover, further development of this model could potentially result in more efficient methods for searching the space of species phylogenies by focusing solely on the topology of the phylogeny.


Subjects
Evolution, Molecular , Genomics/methods , Animals , Computer Simulation , Culicidae/genetics , Genetic Speciation , Likelihood Functions , Models, Genetic , Neural Networks, Computer , Phylogeny , Probability , Wills/statistics & numerical data
17.
Mol Biol Evol ; 37(6): 1809-1818, 2020 06 01.
Article in English | MEDLINE | ID: mdl-32077947

ABSTRACT

Species tree inference from multilocus data has emerged as a powerful paradigm in the postgenomic era, both in terms of the accuracy of the species tree it produces as well as in terms of elucidating the processes that shaped the evolutionary history. Bayesian methods for species tree inference are desirable in this area as they have been shown not only to yield accurate estimates, but also to naturally provide measures of confidence in those estimates. However, the heavy computational requirements of Bayesian inference have limited the applicability of such methods to very small data sets. In this article, we show that the computational efficiency of Bayesian inference under the multispecies coalescent can be improved in practice by restricting the space of the gene trees explored during the random walk, without sacrificing accuracy as measured by various metrics. The idea is to first infer constraints on the trees of the individual loci in the form of unresolved gene trees, and then to restrict the sampler to consider only resolutions of the constrained trees. We demonstrate the improvements gained by such an approach on both simulated and biological data.


Subjects
Models, Genetic , Phylogeny , Bayes Theorem , Markov Chains , Monte Carlo Method
18.
Genome Res ; 29(11): 1847-1859, 2019 11.
Article in English | MEDLINE | ID: mdl-31628257

ABSTRACT

Accumulation and selection of somatic mutations in a Darwinian framework result in intra-tumor heterogeneity (ITH) that poses significant challenges to the diagnosis and clinical therapy of cancer. Identification of the tumor cell populations (clones) and reconstruction of their evolutionary relationship can elucidate this heterogeneity. Recently developed single-cell DNA sequencing (SCS) technologies promise to resolve ITH to a single-cell level. However, technical errors in SCS data sets, including false-positives (FP) and false-negatives (FN) due to allelic dropout, and cell doublets, significantly complicate these tasks. Here, we propose a nonparametric Bayesian method that reconstructs the clonal populations as clusters of single cells, genotypes of each clone, and the evolutionary relationship between the clones. It employs a tree-structured Chinese restaurant process as the prior on the number and composition of clonal populations. The evolution of the clonal populations is modeled by a clonal phylogeny and a finite-site model of evolution to account for potential mutation recurrence and losses. We probabilistically account for FP and FN errors, and cell doublets are modeled by employing a Beta-binomial distribution. We develop a Gibbs sampling algorithm comprising partial reversible-jump and partial Metropolis-Hastings updates to explore the joint posterior space of all parameters. The performance of our method on synthetic and experimental data sets suggests that joint reconstruction of tumor clones and clonal phylogeny under a finite-site model of evolution leads to more accurate inferences. Our method is the first to enable this joint reconstruction in a fully Bayesian framework, thus providing measures of support of the inferences it makes.


Subjects
Clone Cells , Genotype , Neoplasms/genetics , Single-Cell Analysis/methods , Bayes Theorem , Humans , Phylogeny , Point Mutation
19.
Bioinformatics ; 35(14): i370-i378, 2019 07 15.
Article in English | MEDLINE | ID: mdl-31510688

ABSTRACT

MOTIVATION: Reticulate evolutionary histories, such as those arising in the presence of hybridization, are best modeled as phylogenetic networks. Recently developed methods allow for statistical inference of phylogenetic networks while also accounting for other processes, such as incomplete lineage sorting. However, these methods can only handle a small number of loci from a handful of genomes. RESULTS: In this article, we introduce a novel two-step method for scalable inference of phylogenetic networks from the sequence alignments of multiple, unlinked loci. The method infers networks on subproblems and then merges them into a network on the full set of taxa. To reduce the number of trinets to infer, we formulate a Hitting Set version of the problem of finding a small number of subsets, and implement a simple heuristic to solve it. We studied their performance, in terms of both running time and accuracy, on simulated as well as on biological datasets. The two-step method accurately infers phylogenetic networks at a scale that is infeasible with existing methods. The results are a significant and promising step towards accurate, large-scale phylogenetic network inference. AVAILABILITY AND IMPLEMENTATION: We implemented the algorithms in the publicly available software package PhyloNet (https://bioinfocs.rice.edu/PhyloNet). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Algorithms , Phylogeny , Evolution, Molecular , Genome , Sequence Alignment , Software
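The abstract above mentions a Hitting Set formulation solved with a simple heuristic but does not describe it. As a generic illustration only, the sketch below shows the standard greedy heuristic for hitting set: repeatedly pick the element that occurs in the most sets not yet hit. The function name and toy input are hypothetical, and how elements and sets map onto taxa and trinets in the paper is not assumed here.

```python
def greedy_hitting_set(collection):
    """Greedily pick elements until every non-empty set contains a picked element."""
    remaining = [set(s) for s in collection if s]
    chosen = []
    while remaining:
        counts = {}
        for s in remaining:
            for element in s:
                counts[element] = counts.get(element, 0) + 1
        # Most frequent element first; ties go to the smallest element for determinism.
        best = max(sorted(counts), key=counts.get)
        chosen.append(best)
        remaining = [s for s in remaining if best not in s]
    return chosen


# Toy input (hypothetical): four sets arranged in a cycle.
sets_to_hit = [{"a", "b"}, {"b", "c"}, {"c", "d"}, {"d", "a"}]
hitting_set = greedy_hitting_set(sets_to_hit)
```

The greedy choice does not guarantee a minimum hitting set in general, which is why it is described as a heuristic.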