Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 20 de 79
Filtrar
Mais filtros

Base de dados
País/Região como assunto
Tipo de documento
Intervalo de ano de publicação
1.
Nature ; 609(7929): 994-997, 2022 09.
Artigo em Inglês | MEDLINE | ID: mdl-35952714

RESUMO

Accurate and timely detection of recombinant lineages is crucial for interpreting genetic variation, reconstructing epidemic spread, identifying selection and variants of interest, and accurately performing phylogenetic analyses1-4. During the SARS-CoV-2 pandemic, genomic data generation has exceeded the capacities of existing analysis platforms, thereby crippling real-time analysis of viral evolution5. Here, we use a new phylogenomic method to search a nearly comprehensive SARS-CoV-2 phylogeny for recombinant lineages. In a 1.6 million sample tree from May 2021, we identify 589 recombination events, which indicate that around 2.7% of sequenced SARS-CoV-2 genomes have detectable recombinant ancestry. Recombination breakpoints are inferred to occur disproportionately in the 3' portion of the genome that contains the spike protein. Our results highlight the need for timely analyses of recombination for pinpointing the emergence of recombinant lineages with the potential to increase transmissibility or virulence of the virus. We anticipate that this approach will empower comprehensive real-time tracking of viral recombination during the SARS-CoV-2 pandemic and beyond.


Assuntos
COVID-19 , Genoma Viral , Pandemias , Filogenia , Recombinação Genética , SARS-CoV-2 , COVID-19/epidemiologia , COVID-19/transmissão , COVID-19/virologia , Genoma Viral/genética , Humanos , Mutação , Recombinação Genética/genética , SARS-CoV-2/genética , SARS-CoV-2/patogenicidade , Seleção Genética/genética , Glicoproteína da Espícula de Coronavírus/genética , Virulência/genética
2.
Syst Biol ; 2024 Feb 29.
Artigo em Inglês | MEDLINE | ID: mdl-38421146

RESUMO

Hundreds or thousands of loci are now routinely used in modern phylogenomic studies. Concatenation approaches to tree inference assume that there is a single topology for the entire dataset, but different loci may have different evolutionary histories due to incomplete lineage sorting, introgression, and/or horizontal gene transfer; even single loci may not be treelike due to recombination. To overcome this shortcoming, we introduce an implementation of a multi-tree mixture model that we call MAST. This model extends a prior implementation by Boussau et al. (2009) by allowing users to estimate the weight of each of a set of pre-specified bifurcating trees in a single alignment. The MAST model allows each tree to have its own weight, topology, branch lengths, substitution model, nucleotide or amino acid frequencies, and model of rate heterogeneity across sites. We implemented the MAST model in a maximum-likelihood framework in the popular phylogenetic software, IQ-TREE. Simulations show that we can accurately recover the true model parameters, including branch lengths and tree weights for a given set of tree topologies, under a wide range of biologically realistic scenarios. We also show that we can use standard statistical inference approaches to reject a single-tree model when data are simulated under multiple trees (and vice versa). We applied the MAST model to multiple primate datasets and found that it can recover the signal of incomplete lineage sorting in the Great Apes, as well as the asymmetry in minor trees caused by introgression among several macaque species. When applied to a dataset of four Platyrrhine species for which standard concatenated maximum likelihood and gene tree approaches disagree, we observe that MAST gives the highest weight (i.e. the largest proportion of sites) to the tree also supported by gene tree approaches. These results suggest that the MAST model is able to analyse a concatenated alignment using maximum likelihood, while avoiding some of the biases that come with assuming there is only a single tree. We discuss how the MAST model can be extended in the future.

3.
Bioinformatics ; 39(1)2023 01 01.
Artigo em Inglês | MEDLINE | ID: mdl-36383168

RESUMO

MOTIVATION: Site concordance factors (sCFs) have become a widely used way to summarize discordance in phylogenomic datasets. However, the original version of sCFs was calculated by sampling a quartet of tip taxa and then applying parsimony-based criteria for discordance. This approach has the potential to be strongly affected by multiple hits at a site (homoplasy), especially when substitution rates are high or taxa are not closely related. RESULTS: Here, we introduce a new method for calculating sCFs. The updated version uses likelihood to generate probability distributions of ancestral states at internal nodes of the phylogeny. By sampling from the states at internal nodes adjacent to a given branch, this approach substantially reduces-but does not abolish-the effects of homoplasy and taxon sampling. AVAILABILITY AND IMPLEMENTATION: Updated sCFs are implemented in IQ-TREE 2.2.2. The software is freely available at https://github.com/iqtree/iqtree2/releases. SUPPLEMENTARY INFORMATION: Supplementary information is available at Bioinformatics online.


Assuntos
Software , Filogenia , Probabilidade
4.
Bioinformatics ; 39(9)2023 09 02.
Artigo em Inglês | MEDLINE | ID: mdl-37651445

RESUMO

MOTIVATION: Neighbour-Joining is one of the most widely used distance-based phylogenetic inference methods. However, current implementations do not scale well for datasets with more than 10 000 sequences. Given the increasing pace of generating new sequence data, particularly in outbreaks of emerging diseases, and the already enormous existing databases of sequence data for which Neighbour-Joining is a useful approach, new implementations of existing methods are warranted. RESULTS: Here, we present DecentTree, which provides highly optimized and parallel implementations of Neighbour-Joining and several of its variants. DecentTree is designed as a stand-alone application and a header-only library easily integrated with other phylogenetic software (e.g. it is integral in the popular IQ-TREE software). We show that DecentTree shows similar or improved performance over existing software (BIONJ, Quicktree, FastME, and RapidNJ), especially for handling very large alignments. For example, DecentTree is up to 6-fold faster than the fastest existing Neighbour-Joining software (e.g. RapidNJ) when generating a tree of 64 000 SARS-CoV-2 genomes. AVAILABILITY AND IMPLEMENTATION: DecentTree is open source and freely available at https://github.com/iqtree/decenttree. All code and data used in this analysis are available on Github (https://github.com/asdcid/Comparison-of-neighbour-joining-software).


Assuntos
COVID-19 , Humanos , Filogenia , SARS-CoV-2/genética , Genômica , Biblioteca Gênica
5.
Syst Biol ; 72(5): 1039-1051, 2023 11 01.
Artigo em Inglês | MEDLINE | ID: mdl-37232476

RESUMO

Phylogenetics has been foundational to SARS-CoV-2 research and public health policy, assisting in genomic surveillance, contact tracing, and assessing emergence and spread of new variants. However, phylogenetic analyses of SARS-CoV-2 have often relied on tools designed for de novo phylogenetic inference, in which all data are collected before any analysis is performed and the phylogeny is inferred once from scratch. SARS-CoV-2 data sets do not fit this mold. There are currently over 14 million sequenced SARS-CoV-2 genomes in online databases, with tens of thousands of new genomes added every day. Continuous data collection, combined with the public health relevance of SARS-CoV-2, invites an "online" approach to phylogenetics, in which new samples are added to existing phylogenetic trees every day. The extremely dense sampling of SARS-CoV-2 genomes also invites a comparison between likelihood and parsimony approaches to phylogenetic inference. Maximum likelihood (ML) and pseudo-ML methods may be more accurate when there are multiple changes at a single site on a single branch, but this accuracy comes at a large computational cost, and the dense sampling of SARS-CoV-2 genomes means that these instances will be extremely rare because each internal branch is expected to be extremely short. Therefore, it may be that approaches based on maximum parsimony (MP) are sufficiently accurate for reconstructing phylogenies of SARS-CoV-2, and their simplicity means that they can be applied to much larger data sets. Here, we evaluate the performance of de novo and online phylogenetic approaches, as well as ML, pseudo-ML, and MP frameworks for inferring large and dense SARS-CoV-2 phylogenies. Overall, we find that online phylogenetics produces similar phylogenetic trees to de novo analyses for SARS-CoV-2, and that MP optimization with UShER and matOptimize produces equivalent SARS-CoV-2 phylogenies to some of the most popular ML and pseudo-ML inference tools. MP optimization with UShER and matOptimize is thousands of times faster than presently available implementations of ML and online phylogenetics is faster than de novo inference. Our results therefore suggest that parsimony-based methods like UShER and matOptimize represent an accurate and more practical alternative to established ML implementations for large SARS-CoV-2 phylogenies and could be successfully applied to other similar data sets with particularly dense sampling and short branch lengths.


Assuntos
COVID-19 , SARS-CoV-2 , Humanos , Filogenia , Probabilidade , Genômica
6.
Mol Biol Evol ; 39(5)2022 05 03.
Artigo em Inglês | MEDLINE | ID: mdl-35511713

RESUMO

Sequence simulators play an important role in phylogenetics. Simulated data has many applications, such as evaluating the performance of different methods, hypothesis testing with parametric bootstraps, and, more recently, generating data for training machine-learning applications. Many sequence simulation programmes exist, but the most feature-rich programmes tend to be rather slow, and the fastest programmes tend to be feature-poor. Here, we introduce AliSim, a new tool that can efficiently simulate biologically realistic alignments under a large range of complex evolutionary models. To achieve high performance across a wide range of simulation conditions, AliSim implements an adaptive approach that combines the commonly used rate matrix and probability matrix approaches. AliSim takes 1.4 h and 1.3 GB RAM to simulate alignments with one million sequences or sites, whereas popular software Seq-Gen, Dawg, and INDELible require 2-5 h and 50-500 GB of RAM. We provide AliSim as an extension of the IQ-TREE software version 2.2, freely available at www.iqtree.org, and a comprehensive user tutorial at http://www.iqtree.org/doc/AliSim.


Assuntos
Evolução Molecular , Modelos Genéticos , Genômica , Filogenia , Software
7.
Bioinformatics ; 38(15): 3734-3740, 2022 08 02.
Artigo em Inglês | MEDLINE | ID: mdl-35731204

RESUMO

MOTIVATION: Phylogenetic tree optimization is necessary for precise analysis of evolutionary and transmission dynamics, but existing tools are inadequate for handling the scale and pace of data produced during the coronavirus disease 2019 (COVID-19) pandemic. One transformative approach, online phylogenetics, aims to incrementally add samples to an ever-growing phylogeny, but there are no previously existing approaches that can efficiently optimize this vast phylogeny under the time constraints of the pandemic. RESULTS: Here, we present matOptimize, a fast and memory-efficient phylogenetic tree optimization tool based on parsimony that can be parallelized across multiple CPU threads and nodes, and provides orders of magnitude improvement in runtime and peak memory usage compared to existing state-of-the-art methods. We have developed this method particularly to address the pressing need during the COVID-19 pandemic for daily maintenance and optimization of a comprehensive SARS-CoV-2 phylogeny. matOptimize is currently helping refine on a daily basis possibly the largest-ever phylogenetic tree, containing millions of SARS-CoV-2 sequences. AVAILABILITY AND IMPLEMENTATION: The matOptimize code is freely available as part of the UShER package (https://github.com/yatisht/usher) and can also be installed via bioconda (https://bioconda.github.io/recipes/usher/README.html). All scripts we used to perform the experiments in this manuscript are available at https://github.com/yceh/matOptimize-experiments. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
COVID-19 , SARS-CoV-2 , Humanos , Filogenia , SARS-CoV-2/genética , Pandemias , Software
8.
Syst Biol ; 71(4): 959-972, 2022 06 16.
Artigo em Inglês | MEDLINE | ID: mdl-34387349

RESUMO

Using time-reversible Markov models is a very common practice in phylogenetic analysis, because although we expect many of their assumptions to be violated by empirical data, they provide high computational efficiency. However, these models lack the ability to infer the root placement of the estimated phylogeny. In order to compensate for the inability of these models to root the tree, many researchers use external information such as using outgroup taxa or additional assumptions such as molecular clocks. In this study, we investigate the utility of nonreversible models to root empirical phylogenies and introduce a new bootstrap measure, the rootstrap, which provides information on the statistical support for any given root position. [Bootstrap; nonreversible models; phylogenetic inference; root estimation.].


Assuntos
Mamíferos , Modelos Genéticos , Animais , Filogenia
9.
Syst Biol ; 71(5): 1110-1123, 2022 08 10.
Artigo em Inglês | MEDLINE | ID: mdl-35139203

RESUMO

Amino acid substitution models are a key component in phylogenetic analyses of protein sequences. All commonly used amino acid models available to date are time-reversible, an assumption designed for computational convenience but not for biological reality. Another significant downside to time-reversible models is that they do not allow inference of rooted trees without outgroups. In this article, we introduce a maximum likelihood approach nQMaker, an extension of the recently published QMaker method, that allows the estimation of time nonreversible amino acid substitution models and rooted phylogenetic trees from a set of protein sequence alignments. We show that the nonreversible models estimated with nQMaker are a much better fit to empirical alignments than pre-existing reversible models, across a wide range of data sets including mammals, birds, plants, fungi, and other taxa, and that the improvements in model fit scale with the size of the data set. Notably, for the recently published plant and bird trees, these nonreversible models correctly recovered the commonly estimated root placements with very high-statistical support without the need to use an outgroup. We provide nQMaker as an easy-to-use feature in the IQ-TREE software (http://www.iqtree.org), allowing users to estimate nonreversible models and rooted phylogenies from their own protein data sets. The data sets and scripts used in this article are available at https://doi.org/10.5061/dryad.3tx95x6hx. [amino acid sequence analyses; amino acid substitution models; maximum likelihood model estimation; nonreversible models; phylogenetic inference; reversible models.].


Assuntos
Modelos Genéticos , Software , Substituição de Aminoácidos , Animais , Evolução Molecular , Funções Verossimilhança , Mamíferos , Filogenia , Proteínas
10.
PLoS Biol ; 18(12): e3000954, 2020 12.
Artigo em Inglês | MEDLINE | ID: mdl-33270638

RESUMO

Our understanding of the evolutionary history of primates is undergoing continual revision due to ongoing genome sequencing efforts. Bolstered by growing fossil evidence, these data have led to increased acceptance of once controversial hypotheses regarding phylogenetic relationships, hybridization and introgression, and the biogeographical history of primate groups. Among these findings is a pattern of recent introgression between species within all major primate groups examined to date, though little is known about introgression deeper in time. To address this and other phylogenetic questions, here, we present new reference genome assemblies for 3 Old World monkey (OWM) species: Colobus angolensis ssp. palliatus (the black and white colobus), Macaca nemestrina (southern pig-tailed macaque), and Mandrillus leucophaeus (the drill). We combine these data with 23 additional primate genomes to estimate both the species tree and individual gene trees using thousands of loci. While our species tree is largely consistent with previous phylogenetic hypotheses, the gene trees reveal high levels of genealogical discordance associated with multiple primate radiations. We use strongly asymmetric patterns of gene tree discordance around specific branches to identify multiple instances of introgression between ancestral primate lineages. In addition, we exploit recent fossil evidence to perform fossil-calibrated molecular dating analyses across the tree. Taken together, our genome-wide data help to resolve multiple contentious sets of relationships among primates, while also providing insight into the biological processes and technical artifacts that led to the disagreements in the first place.


Assuntos
Introgressão Genética/genética , Primatas/genética , Animais , Evolução Biológica , Cercopithecidae/genética , Biologia Computacional/métodos , Bases de Dados Genéticas , Fósseis , Fluxo Gênico/genética , Genoma/genética , Modelos Genéticos , Filogenia , Análise de Sequência de DNA/métodos
11.
PLoS Genet ; 16(11): e1009175, 2020 11.
Artigo em Inglês | MEDLINE | ID: mdl-33206635

RESUMO

The SARS-CoV-2 pandemic has led to unprecedented, nearly real-time genetic tracing due to the rapid community sequencing response. Researchers immediately leveraged these data to infer the evolutionary relationships among viral samples and to study key biological questions, including whether host viral genome editing and recombination are features of SARS-CoV-2 evolution. This global sequencing effort is inherently decentralized and must rely on data collected by many labs using a wide variety of molecular and bioinformatic techniques. There is thus a strong possibility that systematic errors associated with lab-or protocol-specific practices affect some sequences in the repositories. We find that some recurrent mutations in reported SARS-CoV-2 genome sequences have been observed predominantly or exclusively by single labs, co-localize with commonly used primer binding sites and are more likely to affect the protein-coding sequences than other similarly recurrent mutations. We show that their inclusion can affect phylogenetic inference on scales relevant to local lineage tracing, and make it appear as though there has been an excess of recurrent mutation or recombination among viral lineages. We suggest how samples can be screened and problematic variants removed, and we plan to regularly inform the scientific community with our updated results as more SARS-CoV-2 genome sequences are shared (https://virological.org/t/issues-with-sars-cov-2-sequencing-data/473 and https://virological.org/t/masking-strategies-for-sars-cov-2-alignments/480). We also develop tools for comparing and visualizing differences among very large phylogenies and we show that consistent clade- and tree-based comparisons can be made between phylogenies produced by different groups. These will facilitate evolutionary inferences and comparisons among phylogenies produced for a wide array of purposes. Building on the SARS-CoV-2 Genome Browser at UCSC, we present a toolkit to compare, analyze and combine SARS-CoV-2 phylogenies, find and remove potential sequencing errors and establish a widely shared, stable clade structure for a more accurate scientific inference and discourse.


Assuntos
Genoma Viral/genética , Filogenia , SARS-CoV-2/genética , Algoritmos , COVID-19 , Biologia Computacional , Evolução Molecular , Humanos , RNA Viral/genética , Alinhamento de Sequência , Sequenciamento Completo do Genoma
12.
Syst Biol ; 70(5): 1046-1060, 2021 08 11.
Artigo em Inglês | MEDLINE | ID: mdl-33616668

RESUMO

Amino acid substitution models play a crucial role in phylogenetic analyses. Maximum likelihood (ML) methods have been proposed to estimate amino acid substitution models; however, they are typically complicated and slow. In this article, we propose QMaker, a new ML method to estimate a general time-reversible $Q$ matrix from a large protein data set consisting of multiple sequence alignments. QMaker combines an efficient ML tree search algorithm, a model selection for handling the model heterogeneity among alignments, and the consideration of rate mixture models among sites. We provide QMaker as a user-friendly function in the IQ-TREE software package (http://www.iqtree.org) supporting the use of multiple CPU cores so that biologists can easily estimate amino acid substitution models from their own protein alignments. We used QMaker to estimate new empirical general amino acid substitution models from the current Pfam database as well as five clade-specific models for mammals, birds, insects, yeasts, and plants. Our results show that the new models considerably improve the fit between model and data and in some cases influence the inference of phylogenetic tree topologies.[Amino acid replacement matrices; amino acid substitution models; maximum likelihood estimation; phylogenetic inferences.].


Assuntos
Evolução Molecular , Modelos Genéticos , Animais , Funções Verossimilhança , Filogenia , Proteínas/genética , Alinhamento de Sequência
13.
Mol Biol Evol ; 37(9): 2727-2733, 2020 09 01.
Artigo em Inglês | MEDLINE | ID: mdl-32365179

RESUMO

We implement two measures for quantifying genealogical concordance in phylogenomic data sets: the gene concordance factor (gCF) and the novel site concordance factor (sCF). For every branch of a reference tree, gCF is defined as the percentage of "decisive" gene trees containing that branch. This measure is already in wide usage, but here we introduce a package that calculates it while accounting for variable taxon coverage among gene trees. sCF is a new measure defined as the percentage of decisive sites supporting a branch in the reference tree. gCF and sCF complement classical measures of branch support in phylogenetics by providing a full description of underlying disagreement among loci and sites. An easy to use implementation and tutorial is freely available in the IQ-TREE software package (http://www.iqtree.org/doc/Concordance-Factor, last accessed May 13, 2020).


Assuntos
Conjuntos de Dados como Assunto , Técnicas Genéticas , Filogenia , Software
14.
Mol Biol Evol ; 37(5): 1530-1534, 2020 05 01.
Artigo em Inglês | MEDLINE | ID: mdl-32011700

RESUMO

IQ-TREE (http://www.iqtree.org, last accessed February 6, 2020) is a user-friendly and widely used software package for phylogenetic inference using maximum likelihood. Since the release of version 1 in 2014, we have continuously expanded IQ-TREE to integrate a plethora of new models of sequence evolution and efficient computational approaches of phylogenetic inference to deal with genomic data. Here, we describe notable features of IQ-TREE version 2 and highlight the key advantages over other software.


Assuntos
Evolução Molecular , Genômica , Modelos Genéticos , Filogenia , Software
15.
Mol Biol Evol ; 37(4): 1202-1210, 2020 04 01.
Artigo em Inglês | MEDLINE | ID: mdl-31825512

RESUMO

Evolution leaves heterogeneous patterns of nucleotide variation across the genome, with different loci subject to varying degrees of mutation, selection, and drift. In phylogenetics, the potential impacts of partitioning sequence data for the assignment of substitution models are well appreciated. In contrast, the treatment of branch lengths has received far less attention. In this study, we examined the effects of linking and unlinking branch-length parameters across loci or subsets of loci. By analyzing a range of empirical data sets, we find consistent support for a model in which branch lengths are proportionate between subsets of loci: gene trees share the same pattern of branch lengths, but form subsets that vary in their overall tree lengths. These models had substantially better statistical support than models that assume identical branch lengths across gene trees, or those in which genes form subsets with distinct branch-length patterns. We show using simulations and empirical data that the complexity of the branch-length model with the highest support depends on the length of the sequence alignment and on the numbers of taxa and loci in the data set. Our findings suggest that models in which branch lengths are proportionate between subsets have the highest statistical support under the conditions that are most commonly seen in practice. The results of our study have implications for model selection, computational efficiency, and experimental design in phylogenomics.


Assuntos
Modelos Genéticos , Filogenia , Simulação por Computador
16.
PLoS Biol ; 16(5): e2005439, 2018 05.
Artigo em Inglês | MEDLINE | ID: mdl-29768400

RESUMO

For the last 100 years, it has been uncontroversial to state that the plant germline is set aside late in development, but there is surprisingly little evidence to support this view. In contrast, much evolutionary theory and several recent empirical studies seem to suggest the opposite-that the germlines of some and perhaps most plants may be set aside early in development. But is this really the case? How much does it matter? How can we reconcile the new evidence with existing knowledge of plant development? And is there a way to reliably establish the timing of germline segregation in both model and nonmodel plants? Answering these questions is vital to understanding one of the most fundamental aspects of plant development and evolution.


Assuntos
Células Germinativas Vegetais , Plantas , Diferenciação Celular , Linhagem da Célula , Desenvolvimento Vegetal
17.
Proc Biol Sci ; 287(1922): 20192364, 2020 03 11.
Artigo em Inglês | MEDLINE | ID: mdl-32156194

RESUMO

Somatic mutations can have important effects on the life history, ecology, and evolution of plants, but the rate at which they accumulate is poorly understood and difficult to measure directly. Here, we develop a method to measure somatic mutations in individual plants and use it to estimate the somatic mutation rate in a large, long-lived, phenotypically mosaic Eucalyptus melliodora tree. Despite being 100 times larger than Arabidopsis, this tree has a per-generation mutation rate only ten times greater, which suggests that this species may have evolved mechanisms to reduce the mutation rate per unit of growth. This adds to a growing body of evidence that illuminates the correlated evolutionary shifts in mutation rate and life history in plants.


Assuntos
Arabidopsis/fisiologia , Taxa de Mutação , Filogenia , Fenômenos Fisiológicos Vegetais
18.
Mol Biol Evol ; 35(7): 1798-1811, 2018 07 01.
Artigo em Inglês | MEDLINE | ID: mdl-29659989

RESUMO

Ultraconserved (UCEs) are popular markers for phylogenomic studies. They are relatively simple to collect from distantly-related organisms, and contain sufficient information to infer relationships at almost all taxonomic levels. Most studies of UCEs use partitioning to account for variation in rates and patterns of molecular evolution among sites, for example by estimating an independent model of molecular evolution for each UCE. However, rates and patterns of molecular evolution vary substantially within as well as between UCEs, suggesting that there may be opportunities to improve how UCEs are partitioned for phylogenetic inference. We propose and evaluate new partitioning methods for phylogenomic studies of UCEs: Sliding-Window Site Characteristics (SWSC), and UCE Site Position (UCESP). The first method uses site characteristics such as entropy, multinomial likelihood, and GC content to generate partitions that account for heterogeneity in rates and patterns of molecular evolution within each UCE. The second method groups together nucleotides that are found in similar physical locations within the UCEs. We examined the new methods with seven published data sets from a variety of taxa. We demonstrate the UCESP method generates partitions that are worse than other strategies used to partition UCE data sets (e.g., one partition per UCE). The SWSC method, particularly when based on site entropies, generates partitions that account for within-UCE heterogeneity and leads to large increases in the model fit. All of the methods, code, and data used in this study, are available from https://github.com/Tagliacollo/PartitionUCE. Simplified code for implementing the best method, the SWSC-EN, is available from https://github.com/Tagliacollo/PFinderUCE-SWSC-EN.


Assuntos
Sequência Conservada , Marcadores Genéticos , Técnicas Genéticas , Sequência de Bases , Software
19.
BMC Genomics ; 19(1): 977, 2018 Dec 29.
Artigo em Inglês | MEDLINE | ID: mdl-30594129

RESUMO

BACKGROUND: Chloroplasts are organelles that conduct photosynthesis in plant and algal cells. The information chloroplast genome contained is widely used in agriculture and studies of evolution and ecology. Correctly assembling chloroplast genomes can be challenging because the chloroplast genome contains a pair of long inverted repeats (10-30 kb). Typically, it is simply assumed that the gross structure of the chloroplast genome matches the most commonly observed structure of two single-copy regions separated by a pair of inverted repeats. The advent of long-read sequencing technologies should remove the need to make this assumption by providing sufficient information to completely span the inverted repeat regions. Yet, long-reads tend to have higher error rates than short-reads, and relatively little is known about the best way to combine long- and short-reads to obtain the most accurate chloroplast genome assemblies. Using Eucalyptus pauciflora, the snow gum, as a test case, we evaluated the effect of multiple parameters, such as different coverage of long-(Oxford nanopore) and short-(Illumina) reads, different long-read lengths, different assembly pipelines, with a view to determining the most accurate and efficient approach to chloroplast genome assembly. RESULTS: Hybrid assemblies combining at least 20x coverage of both long-reads and short-reads generated a single contig spanning the entire chloroplast genome with few or no detectable errors. Short-read-only assemblies generated three contigs (the long single copy, short single copy and inverted repeat regions) of the chloroplast genome. These contigs contained few single-base errors but tended to exclude several bases at the beginning or end of each contig. Long-read-only assemblies tended to create multiple contigs with a much higher single-base error rate. The chloroplast genome of Eucalyptus pauciflora is 159,942 bp, contains 131 genes of known function. CONCLUSIONS: Our results suggest that very accurate assemblies of chloroplast genomes can be achieved using a combination of at least 20x coverage of long- and short-reads respectively, provided that the long-reads contain at least ~5x coverage of reads longer than the inverted repeat region. We show that further increases in coverage give little or no improvement in accuracy, and that hybrid assemblies are more accurate than long-read-only or short-read-only assemblies.


Assuntos
Cloroplastos/genética , Genoma de Cloroplastos , Sequências Repetidas Invertidas , Eucalyptus/genética , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Análise de Sequência de DNA/métodos
20.
Mol Biol Evol ; 34(4): 1016-1020, 2017 04 01.
Artigo em Inglês | MEDLINE | ID: mdl-28087773

RESUMO

Bayesian inference using Markov chain Monte Carlo (MCMC) has become one of the primary methods used to infer phylogenies from sequence data. Assessing convergence is a crucial component of these analyses, as it establishes the reliability of the posterior distribution estimates of the tree topology and model parameters sampled from the MCMC. Numerous tests and visualizations have been developed for this purpose, but many of the most popular methods are implemented in ways that make them inconvenient to use for large data sets. RWTY is an R package that implements established and new methods for diagnosing phylogenetic MCMC convergence in a single convenient interface.


Assuntos
Análise de Sequência de DNA/métodos , Algoritmos , Teorema de Bayes , Humanos , Cadeias de Markov , Modelos Genéticos , Método de Monte Carlo , Filogenia , Reprodutibilidade dos Testes , Software
SELEÇÃO DE REFERÊNCIAS
DETALHE DA PESQUISA