Results 1 - 20 of 88
1.
Cell ; 187(4): 814-830.e23, 2024 Feb 15.
Article in English | MEDLINE | ID: mdl-38364788

ABSTRACT

Myelin, the insulating sheath that surrounds neuronal axons, is produced by oligodendrocytes in the central nervous system (CNS). This evolutionary innovation, which first appears in jawed vertebrates, enabled rapid transmission of nerve impulses, more complex brains, and greater morphological diversity. Here, we report that RNA-level expression of RNLTR12-int, a retrotransposon of retroviral origin, is essential for myelination. We show that RNLTR12-int-encoded RNA binds to the transcription factor SOX10 to regulate transcription of myelin basic protein (Mbp, the major constituent of myelin) in rodents. RNLTR12-int-like sequences (which we name RetroMyelin) are found in all jawed vertebrates, and we further demonstrate their function in regulating myelination in two different vertebrate classes (zebrafish and frogs). Our study therefore suggests that retroviral endogenization played a prominent role in the emergence of vertebrate myelin.


Subjects
Myelin Sheath , Retroelements , Animals , Gene Expression , Myelin Sheath/metabolism , Oligodendroglia/metabolism , Retroelements/genetics , RNA/metabolism , Zebrafish/genetics , Anura
2.
Nature ; 600(7889): 506-511, 2021 Dec.
Article in English | MEDLINE | ID: mdl-34649268

ABSTRACT

The evolution of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) virus leads to new variants that warrant timely epidemiological characterization. Here we use the dense genomic surveillance data generated by the COVID-19 Genomics UK Consortium to reconstruct the dynamics of 71 different lineages in each of 315 English local authorities between September 2020 and June 2021. This analysis reveals a series of subepidemics that peaked in early autumn 2020, followed by a jump in transmissibility of the B.1.1.7/Alpha lineage. The Alpha variant grew when other lineages declined during the second national lockdown and regionally tiered restrictions between November and December 2020. A third more stringent national lockdown suppressed the Alpha variant and eliminated nearly all other lineages in early 2021. Yet a series of variants (most of which contained the spike E484K mutation) defied these trends and persisted at moderately increasing proportions. However, by accounting for sustained introductions, we found that the transmissibility of these variants is unlikely to have exceeded the transmissibility of the Alpha variant. Finally, B.1.617.2/Delta was repeatedly introduced in England and grew rapidly in early summer 2021, constituting approximately 98% of sampled SARS-CoV-2 genomes on 26 June 2021.


Subjects
COVID-19/epidemiology , COVID-19/virology , Genome, Viral/genetics , Genomics , SARS-CoV-2/genetics , Amino Acid Substitution , COVID-19/transmission , England/epidemiology , Epidemiological Monitoring , Humans , Molecular Epidemiology , Mutation , Quarantine/statistics & numerical data , SARS-CoV-2/classification , Spatio-Temporal Analysis , Spike Glycoprotein, Coronavirus/genetics
3.
Bioinformatics ; 40(9), 2024 Sep 02.
Article in English | MEDLINE | ID: mdl-39226177

ABSTRACT

MOTIVATION: Tracking SARS-CoV-2 variants through genomic sequencing has been an important part of the global response to the pandemic and remains a useful tool for surveillance of the virus. As well as whole-genome sequencing of clinical samples, this surveillance effort has been aided by amplicon sequencing of wastewater samples, which proved effective in real case studies. Because of its relevance to public healthcare decisions, testing and benchmarking wastewater sequencing analysis methods is also crucial, which necessitates a simulator. Although metagenomic simulators exist, none is fit for the purpose of simulating the metagenomes produced through amplicon sequencing of wastewater. RESULTS: Our new simulation tool, SWAMPy (Simulating SARS-CoV-2 Wastewater Amplicon Metagenomes with Python), is intended to provide realistic simulated SARS-CoV-2 wastewater sequencing datasets with which other programs that rely on this type of data can be evaluated and improved. Our tool is suitable for simulating Illumina short-read RT-PCR amplified metagenomes. AVAILABILITY AND IMPLEMENTATION: The code for this project is available at https://github.com/goldman-gp-ebi/SWAMPy. It can be installed on any Unix-based operating system and is available under the GPL-v3 license.


Subjects
COVID-19 , Metagenome , SARS-CoV-2 , Wastewater , Wastewater/virology , SARS-CoV-2/genetics , SARS-CoV-2/isolation & purification , COVID-19/virology , COVID-19/diagnosis , Metagenomics/methods , Software , Humans , Genome, Viral , High-Throughput Nucleotide Sequencing/methods
4.
Syst Biol ; 72(5): 1119-1135, 2023 Nov 01.
Article in English | MEDLINE | ID: mdl-37366056

ABSTRACT

Inference of deep phylogenies has almost exclusively used protein rather than DNA sequences based on the perception that protein sequences are less prone to homoplasy and saturation or to issues of compositional heterogeneity than DNA sequences. Here, we analyze a model of codon evolution under an idealized genetic code and demonstrate that those perceptions may be misconceptions. We conduct a simulation study to assess the utility of protein versus DNA sequences for inferring deep phylogenies, with protein-coding data generated under models of heterogeneous substitution processes across sites in the sequence and among lineages on the tree, and then analyzed using nucleotide, amino acid, and codon models. Analysis of DNA sequences under nucleotide-substitution models (possibly with the third codon positions excluded) recovered the correct tree at least as often as analysis of the corresponding protein sequences under modern amino acid models. We also applied the different data-analysis strategies to an empirical dataset to infer the metazoan phylogeny. Our results from both simulated and real data suggest that DNA sequences may be as useful as proteins for inferring deep phylogenies and should not be excluded from such analyses. Analysis of DNA data under nucleotide models has a major computational advantage over protein-data analysis, potentially making it feasible to use advanced models that account for among-site and among-lineage heterogeneity in the nucleotide-substitution process in inference of deep phylogenies.


Subjects
Models, Genetic , Nucleotides , Animals , Phylogeny , Base Sequence , Codon , Amino Acids/genetics , Evolution, Molecular
5.
PLoS Genet ; 17(3): e1009221, 2021 Mar.
Article in English | MEDLINE | ID: mdl-33651813

ABSTRACT

Many complex genomic rearrangements arise through template switch errors, which occur in DNA replication when there is a transient polymerase switch to an alternate template nearby in three-dimensional space. While typically investigated at kilobase-to-megabase scales, the genomic and evolutionary consequences of this mutational process are not well characterised at smaller scales, where they are often interpreted as clusters of independent substitutions, insertions and deletions. Here we present an improved statistical approach using pair hidden Markov models, and use it to detect and describe short-range template switches underlying clusters of mutations in the multi-way alignment of hominid genomes. Using robust statistics derived from evolutionary genomic simulations, we show that template switch events have been widespread in the evolution of the great apes' genomes and provide a parsimonious explanation for the presence of many complex mutation clusters in their phylogenetic context. Larger-scale mechanisms of genome rearrangement are typically associated with structural features around breakpoints, and accordingly we show that atypical patterns of secondary structure formation and DNA bending are present at the initial template switch loci. Our methods improve on previous non-probabilistic approaches for computational detection of template switch mutations, allowing the statistical significance of events to be assessed. By specifying realistic evolutionary parameters based on the genomes and taxa involved, our methods can be readily adapted to other intra- or inter-species comparisons.


Subjects
DNA Replication , Genome , Hominidae/genetics , Markov Chains , Models, Genetic , Templates, Genetic , Algorithms , Animals , Genomics/methods , Humans , Poly A-U , Quantitative Trait Loci
6.
PLoS Comput Biol ; 18(4): e1010056, 2022 Apr.
Article in English | MEDLINE | ID: mdl-35486906

ABSTRACT

Sequence simulators are fundamental tools in bioinformatics, as they allow us to test data processing and inference tools, and are an essential component of some inference methods. The ongoing surge in available sequence data is, however, testing the limits of our bioinformatics software. One example is the large number of SARS-CoV-2 genomes available, which are beyond the processing power of many methods, and simulating such large datasets is also proving difficult. Here, we present a new algorithm and software for efficiently simulating sequence evolution along extremely large trees (e.g. >100,000 tips) when the branches of the tree are short, as is typical in genomic epidemiology. Our algorithm is based on the Gillespie approach, and it implements an efficient multi-layered search tree structure that provides high computational efficiency by taking advantage of the fact that only a small proportion of the genome is likely to mutate at each branch of the considered phylogeny. Our open source software allows easy integration with other Python packages as well as a variety of evolutionary models, including indel models and new hypermutability models that we developed to more realistically represent SARS-CoV-2 genome evolution.


Subjects
COVID-19 , Pandemics , Algorithms , COVID-19/epidemiology , Computer Simulation , Evolution, Molecular , Humans , Phylogeny , SARS-CoV-2/genetics , Software
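
The Gillespie approach named in the abstract above can be pictured with a minimal Python sketch. It evolves one sequence along a single branch under a JC69-like model with equal rates at every site, which is an assumption made here for brevity. The function name `gillespie_branch` and its parameters are hypothetical, and the sketch does not reproduce the authors' multi-layered search tree or their indel and hypermutability models.

```python
import random

BASES = "ACGT"


def gillespie_branch(seq, branch_length, rate_per_site=1.0, rng=None):
    """Evolve `seq` along one branch with a Gillespie-style event loop.

    Assumes equal substitution rates at all sites (JC69-like), so the
    total event rate is constant: rate_per_site * len(seq).
    Returns the child sequence and the list of substitution events.
    """
    rng = rng or random.Random()
    seq = list(seq)
    total_rate = rate_per_site * len(seq)
    events = []  # (time, position, old_base, new_base)
    if total_rate <= 0:
        return "".join(seq), events
    t = 0.0
    while True:
        # Waiting time to the next substitution anywhere in the sequence.
        t += rng.expovariate(total_rate)
        if t > branch_length:
            break
        # All sites share the same rate, so the site is chosen uniformly.
        pos = rng.randrange(len(seq))
        new = rng.choice([b for b in BASES if b != seq[pos]])
        events.append((t, pos, seq[pos], new))
        seq[pos] = new
    return "".join(seq), events


parent = "ACGT" * 25
child, events = gillespie_branch(parent, 0.05, rng=random.Random(7))
print(len(events), "substitution events on this branch")
```

Because the expected number of events is the total rate times the branch length, short branches touch only a few sites, which is the property the abstract says the authors' data structure takes advantage of.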
8.
Proc Natl Acad Sci U S A ; 117(11): 5977-5986, 2020 Mar 17.
Article in English | MEDLINE | ID: mdl-32123117

ABSTRACT

Understanding the molecular basis of adaptation to the environment is a central question in evolutionary biology, yet linking detected signatures of positive selection to molecular mechanisms remains challenging. Here we demonstrate that combining sequence-based phylogenetic methods with structural information assists in making such mechanistic interpretations on a genomic scale. Our integrative analysis shows that positively selected sites tend to colocalize on protein structures and that positively selected clusters are found in functionally important regions of proteins, indicating that positive selection can contravene the well-known principle of evolutionary conservation of functionally important regions. This unexpected finding, along with our discovery that positive selection acts on structural clusters, opens previously unexplored strategies for the development of better models of protein evolution. Remarkably, proteins where we detect the strongest evidence of clustering belong to just two functional groups: components of immune response and metabolic enzymes. This gives a coherent picture of pathogens and xenobiotics as important drivers of adaptive evolution of mammals.


Subjects
Adaptation, Physiological , Evolution, Molecular , Mammals/genetics , Mammals/physiology , Selection, Genetic , Animals , Environment , Enzymes/chemistry , Genomics , Immunity , Mammals/immunology , Models, Molecular , Phylogeny , Protein Conformation , Proteins/chemistry
9.
PLoS Genet ; 16(11): e1009175, 2020 Nov.
Article in English | MEDLINE | ID: mdl-33206635

ABSTRACT

The SARS-CoV-2 pandemic has led to unprecedented, nearly real-time genetic tracing due to the rapid community sequencing response. Researchers immediately leveraged these data to infer the evolutionary relationships among viral samples and to study key biological questions, including whether host viral genome editing and recombination are features of SARS-CoV-2 evolution. This global sequencing effort is inherently decentralized and must rely on data collected by many labs using a wide variety of molecular and bioinformatic techniques. There is thus a strong possibility that systematic errors associated with lab- or protocol-specific practices affect some sequences in the repositories. We find that some recurrent mutations in reported SARS-CoV-2 genome sequences have been observed predominantly or exclusively by single labs, co-localize with commonly used primer binding sites and are more likely to affect the protein-coding sequences than other similarly recurrent mutations. We show that their inclusion can affect phylogenetic inference on scales relevant to local lineage tracing, and make it appear as though there has been an excess of recurrent mutation or recombination among viral lineages. We suggest how samples can be screened and problematic variants removed, and we plan to regularly inform the scientific community with our updated results as more SARS-CoV-2 genome sequences are shared (https://virological.org/t/issues-with-sars-cov-2-sequencing-data/473 and https://virological.org/t/masking-strategies-for-sars-cov-2-alignments/480). We also develop tools for comparing and visualizing differences among very large phylogenies and we show that consistent clade- and tree-based comparisons can be made between phylogenies produced by different groups. These will facilitate evolutionary inferences and comparisons among phylogenies produced for a wide array of purposes.
Building on the SARS-CoV-2 Genome Browser at UCSC, we present a toolkit to compare, analyze and combine SARS-CoV-2 phylogenies, find and remove potential sequencing errors and establish a widely shared, stable clade structure for a more accurate scientific inference and discourse.


Subjects
Genome, Viral/genetics , Phylogeny , SARS-CoV-2/genetics , Algorithms , COVID-19 , Computational Biology , Evolution, Molecular , Humans , RNA, Viral/genetics , Sequence Alignment , Whole Genome Sequencing
10.
Mol Biol Evol ; 38(12): 5819-5824, 2021 Dec 09.
Article in English | MEDLINE | ID: mdl-34469548

ABSTRACT

The vast scale of SARS-CoV-2 sequencing data has made it increasingly challenging to comprehensively analyze all available data using existing tools and file formats. To address this, we present a database of SARS-CoV-2 phylogenetic trees inferred with unrestricted public sequences, which we update daily to incorporate new sequences. Our database uses the recently proposed mutation-annotated tree (MAT) format to efficiently encode the tree with branches labeled with parsimony-inferred mutations, as well as Nextstrain clade and Pango lineage labels at clade roots. As of June 9, 2021, our SARS-CoV-2 MAT consists of 834,521 sequences and provides a comprehensive view of the virus' evolutionary history using public data. We also present matUtils, a command-line utility for rapidly querying, interpreting, and manipulating the MATs. Our daily-updated SARS-CoV-2 MAT database and matUtils software are available at http://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/ and https://github.com/yatisht/usher, respectively.


Subjects
Evolution, Molecular , Phylogeny , SARS-CoV-2 , COVID-19/virology , Humans , Mutation , SARS-CoV-2/genetics , Software
11.
Syst Biol ; 70(1): 21-32, 2021 Jan 01.
Article in English | MEDLINE | ID: mdl-32353118

ABSTRACT

How can we best learn the history of a protein's evolution? Ideally, a model of sequence evolution should capture both the process that generates genetic variation and the functional constraints determining which changes are fixed. However, in practical terms the most suitable approach may simply be the one that combines the convenience of easily available input data with the ability to return useful parameter estimates. For example, we might be interested in a measure of the strength of selection (typically obtained using a codon model) or an ancestral structure (obtained using structural modeling based on inferred amino acid sequence and side chain configuration). But what if data in the relevant state-space are not readily available? We show that it is possible to obtain accurate estimates of the outputs of interest using an established method for handling missing data. Encoding observed characters in an alignment as ambiguous representations of characters in a larger state-space allows the application of models with the desired features to data that lack the resolution that is normally required. This strategy is viable because the evolutionary path taken through the observed space contains information about states that were likely visited in the "unseen" state-space. To illustrate this, we consider two examples with amino acid sequences as input. We show that ω, a parameter describing the relative strength of selection on nonsynonymous and synonymous changes, can be estimated in an unbiased manner using an adapted version of a standard 61-state codon model. Using simulated and empirical data, we find that ancestral amino acid side chain configuration can be inferred by applying a 55-state empirical model to 20-state amino acid data. Where feasible, combining inputs from both ambiguity-coded and fully resolved data improves accuracy.
Adding structural information to as few as 12.5% of the sequences in an amino acid alignment results in remarkable ancestral reconstruction performance compared to a benchmark that considers the full rotamer state information. These examples show that our methods permit the recovery of evolutionary information from sequences where it has previously been inaccessible. [Ancestral reconstruction; natural selection; protein structure; state-spaces; substitution models.].


Subjects
Evolution, Molecular , Selection, Genetic , Amino Acid Sequence , Models, Genetic , Phylogeny , Proteins
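
The ambiguity coding described in the abstract above, where an observed amino acid stands for a set of states in a larger state-space, can be sketched in Python for the 61-state codon case. The sketch assumes the standard genetic code and the usual convention that a tip's partial-likelihood vector holds 1 for every compatible state and 0 otherwise. The names `CODON_TO_AA`, `SENSE_CODONS` and `tip_vector` are hypothetical and are not taken from the authors' software.

```python
from itertools import product

# Standard genetic code in the conventional TCAG ordering
# (first base varies slowest, third base fastest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TO_AA = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}
# The 61 sense codons form the state-space of a standard codon model.
SENSE_CODONS = sorted(c for c, aa in CODON_TO_AA.items() if aa != "*")


def tip_vector(amino_acid):
    """Return a 61-entry vector marking the codons compatible with a residue.

    Gaps and unknown residues are treated as fully ambiguous (all ones),
    the same way missing data are usually handled at the tips of a tree.
    """
    if amino_acid in ("-", "X", "?"):
        return [1.0] * len(SENSE_CODONS)
    vec = [1.0 if CODON_TO_AA[c] == amino_acid else 0.0 for c in SENSE_CODONS]
    if not any(vec):
        raise ValueError(f"unknown amino acid: {amino_acid!r}")
    return vec


for aa in "MLW":
    compatible = [c for c, v in zip(SENSE_CODONS, tip_vector(aa)) if v]
    print(aa, compatible)
```

A likelihood calculation under a codon model could then start from these vectors in place of fully resolved codon observations.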
12.
PLoS Comput Biol ; 17(1): e1008561, 2021 Jan.
Article in English | MEDLINE | ID: mdl-33406072

ABSTRACT

Phylogeographic inference allows reconstruction of past geographical spread of pathogens or living organisms by integrating genetic and geographic data. A popular model in continuous phylogeography (with location data provided in the form of latitude and longitude coordinates) describes spread as a Brownian motion (Brownian Motion Phylogeography, BMP) in continuous space and time, akin to similar models of continuous trait evolution. Here, we show that reconstructions using this model can be strongly affected by sampling biases, such as the lack of sampling from certain areas. As an attempt to reduce the effects of sampling bias on BMP, we consider the addition of sequence-free samples from under-sampled areas. While this approach alleviates the effects of sampling bias, in most scenarios this will not be a viable option due to the need for prior knowledge of an outbreak's spatial distribution. We therefore consider an alternative model, the spatial Λ-Fleming-Viot process (ΛFV), which has recently gained popularity in population genetics. Despite the ΛFV's robustness to sampling biases, we find that the different assumptions of the ΛFV and BMP models result in different applicabilities, with the ΛFV being more appropriate for scenarios of endemic spread, and BMP being more appropriate for recent outbreaks or colonizations.


Subjects
Genetics, Population/methods , Models, Genetic , Phylogeography/methods , Selection Bias , Bayes Theorem , Computational Biology , Disease Outbreaks/statistics & numerical data , Flavivirus/genetics , Flavivirus Infections/epidemiology , Flavivirus Infections/virology , Humans , Markov Chains
14.
BMC Bioinformatics ; 22(1): 285, 2021 May 28.
Article in English | MEDLINE | ID: mdl-34049487

ABSTRACT

BACKGROUND: Many important applications in bioinformatics, including sequence alignment and protein family profiling, employ sequence weighting schemes to mitigate the effects of non-independence of homologous sequences and under- or over-representation of certain taxa in a dataset. These schemes aim to assign high weights to sequences that are 'novel' compared to the others in the same dataset, and low weights to sequences that are over-represented. RESULTS: We formalise this principle by rigorously defining the evolutionary 'novelty' of a sequence within an alignment. This results in new sequence weights that we call 'phylogenetic novelty scores'. These scores have various desirable properties, and we showcase their use by considering, as an example application, the inference of character frequencies at an alignment column (important, for example, in protein family profiling). We give computationally efficient algorithms for calculating our scores and, using simulations, show that they are versatile and can improve the accuracy of character frequency estimation compared to existing sequence weighting schemes. CONCLUSIONS: Our phylogenetic novelty scores can be useful when an evolutionarily meaningful system for adjusting for uneven taxon sampling is desired. They have numerous possible applications, including estimation of evolutionary conservation scores and sequence logos, identification of targets in conservation biology, and improving and measuring sequence alignment accuracy.


Subjects
Algorithms , Computational Biology , Phylogeny , Sequence Alignment
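
For context on the kind of existing sequence weighting scheme the abstract above refers to, here is a minimal Python sketch of one classical example, the position-based weights of Henikoff and Henikoff. It is not the phylogenetic novelty score introduced in the paper, which the abstract does not specify in enough detail to reproduce. The function name `position_based_weights` is hypothetical, and treating gap characters as ordinary symbols is a simplifying assumption.

```python
from collections import Counter


def position_based_weights(alignment):
    """Henikoff-style position-based sequence weights, normalised to sum to 1.

    In each column, a sequence receives 1 / (k * n), where k is the number
    of distinct characters in the column and n is how many sequences share
    this sequence's character. Gaps are counted like any other character
    (a simplifying assumption of this sketch).
    """
    if not alignment:
        raise ValueError("alignment is empty")
    n_cols = len(alignment[0])
    if any(len(seq) != n_cols for seq in alignment):
        raise ValueError("all sequences must have the same aligned length")

    weights = [0.0] * len(alignment)
    for col in range(n_cols):
        column = [seq[col] for seq in alignment]
        counts = Counter(column)
        k = len(counts)
        for i, ch in enumerate(column):
            weights[i] += 1.0 / (k * counts[ch])

    total = sum(weights)
    return [w / total for w in weights]


# Two identical sequences share their weight; the distinct one gets more.
weights = position_based_weights(["AAAA", "AAAA", "CCCC"])
print(weights)
```

In this toy alignment the duplicated sequence is down-weighted relative to the distinct one, which is the behaviour the abstract describes as the aim of such schemes.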
15.
J Proteome Res ; 20(8): 4212-4215, 2021 Aug 06.
Article in English | MEDLINE | ID: mdl-34180678

ABSTRACT

In the absence of effective treatment, COVID-19 is likely to remain a global disease burden. Compounding this threat is the near certainty that novel coronaviruses with pandemic potential will emerge in years to come. Pan-coronavirus drugs (agents active against both SARS-CoV-2 and other coronaviruses) would address both threats. A strategy to develop such broad-spectrum inhibitors is to pharmacologically target binding sites on SARS-CoV-2 proteins that are highly conserved in other known coronaviruses, the assumption being that any selective pressure to keep a site conserved across past viruses will apply to future ones. Here we systematically mapped druggable binding pockets on the experimental structure of 15 SARS-CoV-2 proteins and analyzed their variation across 27 α- and β-coronaviruses and across thousands of SARS-CoV-2 samples from COVID-19 patients. We find that the two most conserved druggable sites are a pocket overlapping the RNA binding site of the helicase nsp13 and the catalytic site of the RNA-dependent RNA polymerase nsp12, both components of the viral replication-transcription complex. We present the data on a public web portal (https://www.thesgc.org/SARSCoV2_pocketome/), where users can interactively navigate individual protein structures and view the genetic variability of drug-binding pockets in 3D.


Subjects
COVID-19 , SARS-CoV-2 , Antiviral Agents/pharmacology , Antiviral Agents/therapeutic use , Humans , Pandemics , RNA-Dependent RNA Polymerase/genetics
16.
Mol Biol Evol ; 36(9): 2086-2103, 2019 Sep 01.
Article in English | MEDLINE | ID: mdl-31114882

ABSTRACT

Few models of sequence evolution incorporate parameters describing protein structure, despite its high conservation, essential functional role and increasing availability. We present a structurally aware empirical substitution model for amino acid sequence evolution in which proteins are expressed using an expanded alphabet that relays both amino acid identity and structural information. Each character specifies an amino acid as well as information about the rotamer configuration of its side-chain: the discrete geometric pattern of permitted side-chain atomic positions, as defined by the dihedral angles between covalently linked atoms. By assigning rotamer states in 251,194 protein structures and identifying 4,508,390 substitutions between closely related sequences, we generate a 55-state "Dayhoff-like" model that shows that the evolutionary properties of amino acids depend strongly upon side-chain geometry. The model performs as well as or better than traditional 20-state models for divergence time estimation, tree inference, and ancestral state reconstruction. We conclude that not only is rotamer configuration a valuable source of information for phylogenetic studies, but that modeling the concomitant evolution of sequence and structure may have important implications for understanding protein folding and function.


Subjects
Evolution, Molecular , Models, Biological , Protein Conformation , Amino Acid Substitution , Markov Chains
17.
Genome Res ; 27(6): 1039-1049, 2017 Jun.
Article in English | MEDLINE | ID: mdl-28385709

ABSTRACT

Resequencing efforts are uncovering the extent of genetic variation in humans and provide data to study the evolutionary processes shaping our genome. One recurring puzzle in both intra- and inter-species studies is the high frequency of complex mutations comprising multiple nearby base substitutions or insertion-deletions. We devised a generalized mutation model of template switching during replication that extends existing models of genome rearrangement and used this to study the role of template switch events in the origin of short mutation clusters. Applied to the human genome, our model detects thousands of template switch events during the evolution of human and chimp from their common ancestor and hundreds of events between two independently sequenced human genomes. Although many of these are consistent with a template switch mechanism previously proposed for bacteria, our model also identifies new types of mutations that create short inversions, some flanked by paired inverted repeats. The local template switch process can create numerous complex mutation patterns, including hairpin loop structures, and explains multinucleotide mutations and compensatory substitutions without invoking positive selection, speculative mechanisms, or implausible coincidence. Clustered sequence differences are challenging for current mapping and variant calling methods, and we show that many erroneous variant annotations exist in human reference data. Local template switch events may have been neglected as an explanation for complex mutations because of biases in commonly used analyses. Incorporation of our model into reference-based analysis pipelines and comparisons of de novo assembled genomes will lead to improved understanding of genome variation and evolution.


Subjects
Genome, Human , INDEL Mutation , Inverted Repeat Sequences , Models, Genetic , Polymorphism, Single Nucleotide , Animals , Base Sequence , Biological Evolution , High-Throughput Nucleotide Sequencing , Humans , Pan troglodytes , Sequence Alignment
18.
Nature ; 513(7518): 422-425, 2014 Sep 18.
Article in English | MEDLINE | ID: mdl-25043003

ABSTRACT

The somatic mutations present in the genome of a cell accumulate over the lifetime of a multicellular organism. These mutations can provide insights into the developmental lineage tree, the number of divisions that each cell has undergone and the mutational processes that have been operative. Here we describe whole genomes of clonal lines derived from multiple tissues of healthy mice. Using somatic base substitutions, we reconstructed the early cell divisions of each animal, demonstrating the contributions of embryonic cells to adult tissues. Differences were observed between tissues in the numbers and types of mutations accumulated by each cell, which likely reflect differences in the number of cell divisions they have undergone and varying contributions of different mutational processes. If somatic mutation rates are similar to those in mice, the results indicate that precise insights into development and mutagenesis of normal human cells will be possible.


Subjects
Cell Lineage/genetics , Clone Cells/cytology , Clone Cells/metabolism , Genome/genetics , Mutagenesis/genetics , Mutation/genetics , Animals , Biological Clocks/genetics , Cell Division , Cells, Cultured , Embryo, Mammalian/cytology , Humans , Male , Mice , Mice, Inbred C57BL , Mutation Rate , Organoids/cytology , Organoids/metabolism , Phylogeny , Sequence Analysis, DNA , Tail/cytology
19.
Mol Biol Evol ; 35(7): 1783-1797, 2018 Jul 01.
Article in English | MEDLINE | ID: mdl-29618097

ABSTRACT

Accurate reconstruction of ancestral states is a critical evolutionary analysis when studying ancient proteins and comparing biochemical properties between parental or extinct species and their extant relatives. It relies on multiple sequence alignment (MSA) which may introduce biases, and it remains unknown how MSA methodological approaches impact ancestral sequence reconstruction (ASR). Here, we investigate how MSA methodology modulates ASR using a simulation study of various evolutionary scenarios. We evaluate the accuracy of ancestral protein sequence reconstruction for simulated data and compare reconstruction outcomes using different alignment methods. Our results reveal biases introduced not only by aligner algorithms and assumptions, but also tree topology and the rate of insertions and deletions. Under many conditions we find no substantial differences between the MSAs. However, increasing the difficulty for the aligners can significantly impact ASR. The MAFFT consistency aligners and PRANK variants exhibit the best performance, whereas FSA displays limited performance. We also discover a bias towards reconstructed sequences longer than the true ancestors, deriving from a preference for inferring insertions, in almost all MSA methodological approaches. In addition, we find measures of MSA quality generally correlate highly with reconstruction accuracy. Thus, we show MSA methodological differences can affect the quality of reconstructions and propose MSA methods should be selected with care to accurately determine ancestral states with confidence.


Subjects
Genetic Techniques , Sequence Alignment
20.
Nature ; 494(7435): 77-80, 2013 Feb 07.
Article in English | MEDLINE | ID: mdl-23354052

ABSTRACT

Digital production, transmission and storage have revolutionized how we access and use information but have also made archiving an increasingly complex task that requires active, continuing maintenance of digital media. This challenge has focused some interest on DNA as an attractive target for information storage because of its capacity for high-density information encoding, longevity under easily achieved conditions and proven track record as an information bearer. Previous DNA-based information storage approaches have encoded only trivial amounts of information or were not amenable to scaling-up, and used no robust error-correction and lacked examination of their cost-efficiency for large-scale information archival. Here we describe a scalable method that can reliably store more information than has been handled before. We encoded computer files totalling 739 kilobytes of hard-disk storage and with an estimated Shannon information of 5.2 × 10^6 bits into a DNA code, synthesized this DNA, sequenced it and reconstructed the original files with 100% accuracy. Theoretical analysis indicates that our DNA-based storage scheme could be scaled far beyond current global information volumes and offers a realistic technology for large-scale, long-term and infrequently accessed digital archiving. In fact, current trends in technological advances are reducing DNA synthesis costs at a pace that should make our scheme cost-effective for sub-50-year archiving within a decade.


Subjects
Archives , DNA/chemistry , DNA/chemical synthesis , Information Management/methods , Base Sequence , Computers , DNA/economics , Information Management/economics , Molecular Sequence Data , Sequence Analysis, DNA/economics , Synthetic Biology/economics , Synthetic Biology/methods
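
The abstract above does not spell out how files were turned into a DNA code, so the following Python sketch is only an illustration of the general idea, under assumptions made here: each byte is written as six base-3 digits, and each digit selects one of the three nucleotides that differ from the previous one, so the same nucleotide never appears twice in a row. The functions `encode` and `decode`, the fixed six-digit byte representation and the virtual starting nucleotide are all hypothetical choices, and the sketch includes no error correction.

```python
BASES = "ACGT"
TRITS_PER_BYTE = 6  # 3**6 = 729 >= 256, so six base-3 digits cover one byte


def byte_to_trits(value):
    """Write one byte as six base-3 digits, most significant first."""
    trits = []
    for _ in range(TRITS_PER_BYTE):
        value, remainder = divmod(value, 3)
        trits.append(remainder)
    return trits[::-1]


def encode(data, start="A"):
    """Map bytes to nucleotides so that no nucleotide repeats consecutively."""
    prev = start  # virtual nucleotide preceding the first real one
    out = []
    for byte in data:
        for trit in byte_to_trits(byte):
            choices = [b for b in BASES if b != prev]
            prev = choices[trit]
            out.append(prev)
    return "".join(out)


def decode(dna, start="A"):
    """Invert `encode`; raises ValueError on malformed input."""
    if len(dna) % TRITS_PER_BYTE:
        raise ValueError("length must be a multiple of six nucleotides")
    prev = start
    trits = []
    for nt in dna:
        choices = [b for b in BASES if b != prev]
        trits.append(choices.index(nt))  # fails if nt repeats prev
        prev = nt
    out = bytearray()
    for i in range(0, len(trits), TRITS_PER_BYTE):
        value = 0
        for trit in trits[i:i + TRITS_PER_BYTE]:
            value = value * 3 + trit
        out.append(value)  # fails if the six digits exceed 255
    return bytes(out)


message = b"DNA storage"
dna = encode(message)
print(dna)
```

Decoding the printed string with `decode(dna)` returns the original bytes, and the no-repeat rule can be checked by comparing each nucleotide with its successor.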