Results 1 - 20 of 85
1.
Cell; 187(4): 814-830.e23, 2024 Feb 15.
Article in English | MEDLINE | ID: mdl-38364788

ABSTRACT

Myelin, the insulating sheath that surrounds neuronal axons, is produced by oligodendrocytes in the central nervous system (CNS). This evolutionary innovation, which first appears in jawed vertebrates, enabled rapid transmission of nerve impulses, more complex brains, and greater morphological diversity. Here, we report that RNA-level expression of RNLTR12-int, a retrotransposon of retroviral origin, is essential for myelination. We show that RNLTR12-int-encoded RNA binds to the transcription factor SOX10 to regulate transcription of myelin basic protein (Mbp, the major constituent of myelin) in rodents. RNLTR12-int-like sequences (which we name RetroMyelin) are found in all jawed vertebrates, and we further demonstrate their function in regulating myelination in two different vertebrate classes (zebrafish and frogs). Our study therefore suggests that retroviral endogenization played a prominent role in the emergence of vertebrate myelin.


Subject(s)
Myelin Sheath; Retroelements; Animals; Gene Expression; Myelin Sheath/metabolism; Oligodendroglia/metabolism; Retroelements/genetics; RNA/metabolism; Zebrafish/genetics; Anura
2.
Nature; 600(7889): 506-511, 2021 12.
Article in English | MEDLINE | ID: mdl-34649268

ABSTRACT

The evolution of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) virus leads to new variants that warrant timely epidemiological characterization. Here we use the dense genomic surveillance data generated by the COVID-19 Genomics UK Consortium to reconstruct the dynamics of 71 different lineages in each of 315 English local authorities between September 2020 and June 2021. This analysis reveals a series of subepidemics that peaked in early autumn 2020, followed by a jump in transmissibility of the B.1.1.7/Alpha lineage. The Alpha variant grew when other lineages declined during the second national lockdown and regionally tiered restrictions between November and December 2020. A third more stringent national lockdown suppressed the Alpha variant and eliminated nearly all other lineages in early 2021. Yet a series of variants (most of which contained the spike E484K mutation) defied these trends and persisted at moderately increasing proportions. However, by accounting for sustained introductions, we found that the transmissibility of these variants is unlikely to have exceeded the transmissibility of the Alpha variant. Finally, B.1.617.2/Delta was repeatedly introduced in England and grew rapidly in early summer 2021, constituting approximately 98% of sampled SARS-CoV-2 genomes on 26 June 2021.


Subject(s)
COVID-19/epidemiology; COVID-19/virology; Genome, Viral/genetics; Genomics; SARS-CoV-2/genetics; Amino Acid Substitution; COVID-19/transmission; England/epidemiology; Epidemiological Monitoring; Humans; Molecular Epidemiology; Mutation; Quarantine/statistics & numerical data; SARS-CoV-2/classification; Spatio-Temporal Analysis; Spike Glycoprotein, Coronavirus/genetics
3.
Syst Biol; 72(5): 1119-1135, 2023 11 01.
Article in English | MEDLINE | ID: mdl-37366056

ABSTRACT

Inference of deep phylogenies has almost exclusively used protein rather than DNA sequences based on the perception that protein sequences are less prone to homoplasy and saturation or to issues of compositional heterogeneity than DNA sequences. Here, we analyze a model of codon evolution under an idealized genetic code and demonstrate that those perceptions may be misconceptions. We conduct a simulation study to assess the utility of protein versus DNA sequences for inferring deep phylogenies, with protein-coding data generated under models of heterogeneous substitution processes across sites in the sequence and among lineages on the tree, and then analyzed using nucleotide, amino acid, and codon models. Analysis of DNA sequences under nucleotide-substitution models (possibly with the third codon positions excluded) recovered the correct tree at least as often as analysis of the corresponding protein sequences under modern amino acid models. We also applied the different data-analysis strategies to an empirical dataset to infer the metazoan phylogeny. Our results from both simulated and real data suggest that DNA sequences may be as useful as proteins for inferring deep phylogenies and should not be excluded from such analyses. Analysis of DNA data under nucleotide models has a major computational advantage over protein-data analysis, potentially making it feasible to use advanced models that account for among-site and among-lineage heterogeneity in the nucleotide-substitution process in inference of deep phylogenies.


Subject(s)
Models, Genetic; Nucleotides; Animals; Phylogeny; Base Sequence; Codon; Amino Acids/genetics; Evolution, Molecular
4.
PLoS Genet; 17(3): e1009221, 2021 03.
Article in English | MEDLINE | ID: mdl-33651813

ABSTRACT

Many complex genomic rearrangements arise through template switch errors, which occur in DNA replication when there is a transient polymerase switch to an alternate template nearby in three-dimensional space. While typically investigated at kilobase-to-megabase scales, the genomic and evolutionary consequences of this mutational process are not well characterised at smaller scales, where they are often interpreted as clusters of independent substitutions, insertions and deletions. Here we present an improved statistical approach using pair hidden Markov models, and use it to detect and describe short-range template switches underlying clusters of mutations in the multi-way alignment of hominid genomes. Using robust statistics derived from evolutionary genomic simulations, we show that template switch events have been widespread in the evolution of the great apes' genomes and provide a parsimonious explanation for the presence of many complex mutation clusters in their phylogenetic context. Larger-scale mechanisms of genome rearrangement are typically associated with structural features around breakpoints, and accordingly we show that atypical patterns of secondary structure formation and DNA bending are present at the initial template switch loci. Our methods improve on previous non-probabilistic approaches for computational detection of template switch mutations, allowing the statistical significance of events to be assessed. By specifying realistic evolutionary parameters based on the genomes and taxa involved, our methods can be readily adapted to other intra- or inter-species comparisons.


Subject(s)
DNA Replication; Genome; Hominidae/genetics; Markov Chains; Models, Genetic; Templates, Genetic; Algorithms; Animals; Genomics/methods; Humans; Poly A-U; Quantitative Trait Loci
5.
PLoS Comput Biol; 18(4): e1010056, 2022 04.
Article in English | MEDLINE | ID: mdl-35486906

ABSTRACT

Sequence simulators are fundamental tools in bioinformatics, as they allow us to test data processing and inference tools, and are an essential component of some inference methods. The ongoing surge in available sequence data is, however, testing the limits of our bioinformatics software. One example is the large number of SARS-CoV-2 genomes available, which are beyond the processing power of many methods, and simulating such large datasets is also proving difficult. Here, we present a new algorithm and software for efficiently simulating sequence evolution along extremely large trees (e.g. > 100,000 tips) when the branches of the tree are short, as is typical in genomic epidemiology. Our algorithm is based on the Gillespie approach, and it implements an efficient multi-layered search tree structure that provides high computational efficiency by taking advantage of the fact that only a small proportion of the genome is likely to mutate at each branch of the considered phylogeny. Our open source software allows easy integration with other Python packages as well as a variety of evolutionary models, including indel models and new hypermutability models that we developed to more realistically represent SARS-CoV-2 genome evolution.
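
The Gillespie idea mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' software: it assumes a toy model in which every site mutates at the same rate (one expected substitution per site per unit branch length) and every base change is equally likely, and the function name `simulate_branch` is hypothetical. With equal rates a mutating site can be drawn uniformly; with unequal rates some lookup structure is needed, which is the role the abstract assigns to its search tree.

```python
import random

NUCLEOTIDES = "ACGT"


def simulate_branch(sequence, branch_length, rng):
    """Evolve `sequence` along one branch with a Gillespie-style event loop.

    Toy assumption: each site has rate 1, so the total event rate is
    len(sequence). Returns the evolved sequence and a list of
    (site, old_base, new_base) events in the order they occurred.
    """
    seq = list(sequence)
    events = []
    total_rate = float(len(seq))
    if total_rate == 0:
        return sequence, events
    # Waiting time to the first event is exponential with the total rate.
    t = rng.expovariate(total_rate)
    while t < branch_length:
        site = rng.randrange(len(seq))  # uniform because all rates are equal
        old = seq[site]
        new = rng.choice([b for b in NUCLEOTIDES if b != old])
        seq[site] = new
        events.append((site, old, new))
        t += rng.expovariate(total_rate)
    return "".join(seq), events


rng = random.Random(1)
genome = "ACGT" * 250  # 1,000 sites
child, events = simulate_branch(genome, 0.002, rng)
print(len(events), "substitution events on this branch")
```

On short branches the event loop runs only a handful of times, so most of the genome is never visited inside the loop.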


Subject(s)
COVID-19; Pandemics; Algorithms; COVID-19/epidemiology; Computer Simulation; Evolution, Molecular; Humans; Phylogeny; SARS-CoV-2/genetics; Software
7.
Proc Natl Acad Sci U S A; 117(11): 5977-5986, 2020 03 17.
Article in English | MEDLINE | ID: mdl-32123117

ABSTRACT

Understanding the molecular basis of adaptation to the environment is a central question in evolutionary biology, yet linking detected signatures of positive selection to molecular mechanisms remains challenging. Here we demonstrate that combining sequence-based phylogenetic methods with structural information assists in making such mechanistic interpretations on a genomic scale. Our integrative analysis shows that positively selected sites tend to colocalize on protein structures and that positively selected clusters are found in functionally important regions of proteins, indicating that positive selection can contravene the well-known principle of evolutionary conservation of functionally important regions. This unexpected finding, along with our discovery that positive selection acts on structural clusters, opens previously unexplored strategies for the development of better models of protein evolution. Remarkably, proteins where we detect the strongest evidence of clustering belong to just two functional groups: Components of immune response and metabolic enzymes. This gives a coherent picture of pathogens and xenobiotics as important drivers of adaptive evolution of mammals.


Subject(s)
Adaptation, Physiological; Evolution, Molecular; Mammals/genetics; Mammals/physiology; Selection, Genetic; Animals; Environment; Enzymes/chemistry; Genomics; Immunity; Mammals/immunology; Models, Molecular; Phylogeny; Protein Conformation; Proteins/chemistry
8.
PLoS Genet; 16(11): e1009175, 2020 11.
Article in English | MEDLINE | ID: mdl-33206635

ABSTRACT

The SARS-CoV-2 pandemic has led to unprecedented, nearly real-time genetic tracing due to the rapid community sequencing response. Researchers immediately leveraged these data to infer the evolutionary relationships among viral samples and to study key biological questions, including whether host viral genome editing and recombination are features of SARS-CoV-2 evolution. This global sequencing effort is inherently decentralized and must rely on data collected by many labs using a wide variety of molecular and bioinformatic techniques. There is thus a strong possibility that systematic errors associated with lab- or protocol-specific practices affect some sequences in the repositories. We find that some recurrent mutations in reported SARS-CoV-2 genome sequences have been observed predominantly or exclusively by single labs, co-localize with commonly used primer binding sites and are more likely to affect the protein-coding sequences than other similarly recurrent mutations. We show that their inclusion can affect phylogenetic inference on scales relevant to local lineage tracing, and make it appear as though there has been an excess of recurrent mutation or recombination among viral lineages. We suggest how samples can be screened and problematic variants removed, and we plan to regularly inform the scientific community with our updated results as more SARS-CoV-2 genome sequences are shared (https://virological.org/t/issues-with-sars-cov-2-sequencing-data/473 and https://virological.org/t/masking-strategies-for-sars-cov-2-alignments/480). We also develop tools for comparing and visualizing differences among very large phylogenies and we show that consistent clade- and tree-based comparisons can be made between phylogenies produced by different groups. These will facilitate evolutionary inferences and comparisons among phylogenies produced for a wide array of purposes.
Building on the SARS-CoV-2 Genome Browser at UCSC, we present a toolkit to compare, analyze and combine SARS-CoV-2 phylogenies, find and remove potential sequencing errors and establish a widely shared, stable clade structure for a more accurate scientific inference and discourse.


Subject(s)
Genome, Viral/genetics; Phylogeny; SARS-CoV-2/genetics; Algorithms; COVID-19; Computational Biology; Evolution, Molecular; Humans; RNA, Viral/genetics; Sequence Alignment; Whole Genome Sequencing
9.
Mol Biol Evol; 38(12): 5819-5824, 2021 12 09.
Article in English | MEDLINE | ID: mdl-34469548

ABSTRACT

The vast scale of SARS-CoV-2 sequencing data has made it increasingly challenging to comprehensively analyze all available data using existing tools and file formats. To address this, we present a database of SARS-CoV-2 phylogenetic trees inferred with unrestricted public sequences, which we update daily to incorporate new sequences. Our database uses the recently proposed mutation-annotated tree (MAT) format to efficiently encode the tree with branches labeled with parsimony-inferred mutations, as well as Nextstrain clade and Pango lineage labels at clade roots. As of June 9, 2021, our SARS-CoV-2 MAT consists of 834,521 sequences and provides a comprehensive view of the virus' evolutionary history using public data. We also present matUtils, a command-line utility for rapidly querying, interpreting, and manipulating the MATs. Our daily-updated SARS-CoV-2 MAT database and matUtils software are available at http://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/ and https://github.com/yatisht/usher, respectively.
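
The general idea of a tree whose branches carry mutations can be sketched in a few lines. This is a toy in-memory structure written for illustration only: it is neither the MAT file format nor the matUtils interface, and the names `Node`, `mutations_to` and `apply_mutations`, the mutation strings, and the 8-base reference are all hypothetical. A sample's sequence is recovered by collecting the mutations on the path from the root and applying them to the reference, so full sequences never need to be stored.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    mutations: list = field(default_factory=list)  # e.g. ["A1G"]: ref, 1-based pos, alt
    children: list = field(default_factory=list)


def mutations_to(node, target, path=()):
    """Collect mutations on the path from `node` down to the node named `target`."""
    path = path + tuple(node.mutations)
    if node.name == target:
        return list(path)
    for child in node.children:
        found = mutations_to(child, target, path)
        if found is not None:
            return found
    return None  # target is not in this subtree


def apply_mutations(reference, mutations):
    """Apply mutations such as 'A1G' to the reference, in path order."""
    seq = list(reference)
    for m in mutations:
        position, alt = int(m[1:-1]), m[-1]
        seq[position - 1] = alt
    return "".join(seq)


reference = "ACGTACGT"
root = Node("root", children=[
    Node("n1", ["A1G"], [Node("s1", ["C2T"])]),
    Node("s2", ["G3A"]),
])
print(apply_mutations(reference, mutations_to(root, "s1")))
```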


Subject(s)
Evolution, Molecular; Phylogeny; SARS-CoV-2; COVID-19/virology; Humans; Mutation; SARS-CoV-2/genetics; Software
10.
Syst Biol; 70(1): 21-32, 2021 01 01.
Article in English | MEDLINE | ID: mdl-32353118

ABSTRACT

How can we best learn the history of a protein's evolution? Ideally, a model of sequence evolution should capture both the process that generates genetic variation and the functional constraints determining which changes are fixed. However, in practical terms the most suitable approach may simply be the one that combines the convenience of easily available input data with the ability to return useful parameter estimates. For example, we might be interested in a measure of the strength of selection (typically obtained using a codon model) or an ancestral structure (obtained using structural modeling based on inferred amino acid sequence and side chain configuration). But what if data in the relevant state-space are not readily available? We show that it is possible to obtain accurate estimates of the outputs of interest using an established method for handling missing data. Encoding observed characters in an alignment as ambiguous representations of characters in a larger state-space allows the application of models with the desired features to data that lack the resolution that is normally required. This strategy is viable because the evolutionary path taken through the observed space contains information about states that were likely visited in the "unseen" state-space. To illustrate this, we consider two examples with amino acid sequences as input. We show that ω, a parameter describing the relative strength of selection on nonsynonymous and synonymous changes, can be estimated in an unbiased manner using an adapted version of a standard 61-state codon model. Using simulated and empirical data, we find that ancestral amino acid side chain configuration can be inferred by applying a 55-state empirical model to 20-state amino acid data. Where feasible, combining inputs from both ambiguity-coded and fully resolved data improves accuracy.
Adding structural information to as few as 12.5% of the sequences in an amino acid alignment results in remarkable ancestral reconstruction performance compared to a benchmark that considers the full rotamer state information. These examples show that our methods permit the recovery of evolutionary information from sequences where it has previously been inaccessible. [Ancestral reconstruction; natural selection; protein structure; state-spaces; substitution models.].
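
The ambiguity-coding step described in this abstract can be made concrete with a small sketch. It assumes the standard genetic code and shows only the encoding, not the authors' model or software; the names `compatible_codons` and `tip_partial_likelihood` are hypothetical. An observed amino acid is represented as the set of sense codons that translate to it, which at a tree tip becomes a vector over the 61 codon states holding 1 for every compatible codon and 0 elsewhere.

```python
from itertools import product

# Standard genetic code in the usual TCAG ordering (third base varies fastest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
# The 61 states of a codon model: every codon except the three stops.
SENSE_CODONS = [codon for codon, aa in CODON_TABLE.items() if aa != "*"]


def compatible_codons(amino_acid):
    """All sense codons that translate to the observed amino acid."""
    return sorted(c for c in SENSE_CODONS if CODON_TABLE[c] == amino_acid)


def tip_partial_likelihood(amino_acid):
    """61-entry tip vector: 1.0 for each codon compatible with the observation."""
    return [1.0 if CODON_TABLE[c] == amino_acid else 0.0 for c in SENSE_CODONS]


print(compatible_codons("L"))
print(sum(tip_partial_likelihood("L")))
```

Methionine and tryptophan each map to a single codon, so their tip vectors are fully resolved, whereas leucine, serine and arginine each spread over six codon states.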


Subject(s)
Evolution, Molecular; Selection, Genetic; Amino Acid Sequence; Models, Genetic; Phylogeny; Proteins
11.
PLoS Comput Biol; 17(1): e1008561, 2021 01.
Article in English | MEDLINE | ID: mdl-33406072

ABSTRACT

Phylogeographic inference allows reconstruction of past geographical spread of pathogens or living organisms by integrating genetic and geographic data. A popular model in continuous phylogeography (with location data provided in the form of latitude and longitude coordinates) describes spread as a Brownian motion (Brownian Motion Phylogeography, BMP) in continuous space and time, akin to similar models of continuous trait evolution. Here, we show that reconstructions using this model can be strongly affected by sampling biases, such as the lack of sampling from certain areas. As an attempt to reduce the effects of sampling bias on BMP, we consider the addition of sequence-free samples from under-sampled areas. While this approach alleviates the effects of sampling bias, in most scenarios this will not be a viable option due to the need for prior knowledge of an outbreak's spatial distribution. We therefore consider an alternative model, the spatial Λ-Fleming-Viot process (ΛFV), which has recently gained popularity in population genetics. Despite the ΛFV's robustness to sampling biases, we find that the different assumptions of the ΛFV and BMP models result in different applicabilities, with the ΛFV being more appropriate for scenarios of endemic spread, and BMP being more appropriate for recent outbreaks or colonizations.


Subject(s)
Genetics, Population/methods; Models, Genetic; Phylogeography/methods; Selection Bias; Bayes Theorem; Computational Biology; Disease Outbreaks/statistics & numerical data; Flavivirus/genetics; Flavivirus Infections/epidemiology; Flavivirus Infections/virology; Humans; Markov Chains
13.
BMC Bioinformatics; 22(1): 285, 2021 May 28.
Article in English | MEDLINE | ID: mdl-34049487

ABSTRACT

BACKGROUND: Many important applications in bioinformatics, including sequence alignment and protein family profiling, employ sequence weighting schemes to mitigate the effects of non-independence of homologous sequences and under- or over-representation of certain taxa in a dataset. These schemes aim to assign high weights to sequences that are 'novel' compared to the others in the same dataset, and low weights to sequences that are over-represented. RESULTS: We formalise this principle by rigorously defining the evolutionary 'novelty' of a sequence within an alignment. This results in new sequence weights that we call 'phylogenetic novelty scores'. These scores have various desirable properties, and we showcase their use by considering, as an example application, the inference of character frequencies at an alignment column (important, for example, in protein family profiling). We give computationally efficient algorithms for calculating our scores and, using simulations, show that they are versatile and can improve the accuracy of character frequency estimation compared to existing sequence weighting schemes. CONCLUSIONS: Our phylogenetic novelty scores can be useful when an evolutionarily meaningful system for adjusting for uneven taxon sampling is desired. They have numerous possible applications, including estimation of evolutionary conservation scores and sequence logos, identification of targets in conservation biology, and improving and measuring sequence alignment accuracy.
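
To show what a sequence weighting scheme does in practice, the sketch below implements one classic existing scheme, position-based weighting in the style of Henikoff and Henikoff. It is not the phylogenetic novelty score introduced in this abstract, which is defined on a phylogeny; the function name `position_based_weights` and the three-sequence alignment are illustrative assumptions, and gap characters are treated like any other character. In each column a sequence receives 1 / (k × n), where k is the number of distinct characters in the column and n is the number of sequences sharing its character.

```python
from collections import Counter


def position_based_weights(alignment):
    """Position-based sequence weights, normalised to sum to 1.

    `alignment` is a list of equal-length strings, one per sequence.
    """
    weights = [0.0] * len(alignment)
    for column in zip(*alignment):
        counts = Counter(column)
        distinct = len(counts)
        for i, character in enumerate(column):
            weights[i] += 1.0 / (distinct * counts[character])
    total = sum(weights)
    return [w / total for w in weights]


# Two identical sequences share weight; the distinct one counts as much as both.
print(position_based_weights(["AAAA", "AAAA", "CCCC"]))
```

The duplicated sequence is down-weighted because it adds no new information, which is the behaviour the abstract describes as giving low weight to over-represented sequences.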


Subject(s)
Algorithms; Computational Biology; Phylogeny; Sequence Alignment
14.
J Proteome Res; 20(8): 4212-4215, 2021 08 06.
Article in English | MEDLINE | ID: mdl-34180678

ABSTRACT

In the absence of effective treatment, COVID-19 is likely to remain a global disease burden. Compounding this threat is the near certainty that novel coronaviruses with pandemic potential will emerge in years to come. Pan-coronavirus drugs (agents active against both SARS-CoV-2 and other coronaviruses) would address both threats. A strategy to develop such broad-spectrum inhibitors is to pharmacologically target binding sites on SARS-CoV-2 proteins that are highly conserved in other known coronaviruses, the assumption being that any selective pressure to keep a site conserved across past viruses will apply to future ones. Here we systematically mapped druggable binding pockets on the experimental structure of 15 SARS-CoV-2 proteins and analyzed their variation across 27 α- and β-coronaviruses and across thousands of SARS-CoV-2 samples from COVID-19 patients. We find that the two most conserved druggable sites are a pocket overlapping the RNA binding site of the helicase nsp13 and the catalytic site of the RNA-dependent RNA polymerase nsp12, both components of the viral replication-transcription complex. We present the data on a public web portal (https://www.thesgc.org/SARSCoV2_pocketome/), where users can interactively navigate individual protein structures and view the genetic variability of drug-binding pockets in 3D.


Subject(s)
COVID-19; SARS-CoV-2; Antiviral Agents/pharmacology; Antiviral Agents/therapeutic use; Humans; Pandemics; RNA-Dependent RNA Polymerase/genetics
15.
Mol Biol Evol; 36(9): 2086-2103, 2019 09 01.
Article in English | MEDLINE | ID: mdl-31114882

ABSTRACT

Few models of sequence evolution incorporate parameters describing protein structure, despite its high conservation, essential functional role and increasing availability. We present a structurally aware empirical substitution model for amino acid sequence evolution in which proteins are expressed using an expanded alphabet that relays both amino acid identity and structural information. Each character specifies an amino acid as well as information about the rotamer configuration of its side-chain: the discrete geometric pattern of permitted side-chain atomic positions, as defined by the dihedral angles between covalently linked atoms. By assigning rotamer states in 251,194 protein structures and identifying 4,508,390 substitutions between closely related sequences, we generate a 55-state "Dayhoff-like" model that shows that the evolutionary properties of amino acids depend strongly upon side-chain geometry. The model performs as well as or better than traditional 20-state models for divergence time estimation, tree inference, and ancestral state reconstruction. We conclude that not only is rotamer configuration a valuable source of information for phylogenetic studies, but that modeling the concomitant evolution of sequence and structure may have important implications for understanding protein folding and function.


Subject(s)
Evolution, Molecular; Models, Biological; Protein Conformation; Amino Acid Substitution; Markov Chains
16.
Genome Res; 27(6): 1039-1049, 2017 06.
Article in English | MEDLINE | ID: mdl-28385709

ABSTRACT

Resequencing efforts are uncovering the extent of genetic variation in humans and provide data to study the evolutionary processes shaping our genome. One recurring puzzle in both intra- and inter-species studies is the high frequency of complex mutations comprising multiple nearby base substitutions or insertion-deletions. We devised a generalized mutation model of template switching during replication that extends existing models of genome rearrangement and used this to study the role of template switch events in the origin of short mutation clusters. Applied to the human genome, our model detects thousands of template switch events during the evolution of human and chimp from their common ancestor and hundreds of events between two independently sequenced human genomes. Although many of these are consistent with a template switch mechanism previously proposed for bacteria, our model also identifies new types of mutations that create short inversions, some flanked by paired inverted repeats. The local template switch process can create numerous complex mutation patterns, including hairpin loop structures, and explains multinucleotide mutations and compensatory substitutions without invoking positive selection, speculative mechanisms, or implausible coincidence. Clustered sequence differences are challenging for current mapping and variant calling methods, and we show that many erroneous variant annotations exist in human reference data. Local template switch events may have been neglected as an explanation for complex mutations because of biases in commonly used analyses. Incorporation of our model into reference-based analysis pipelines and comparisons of de novo assembled genomes will lead to improved understanding of genome variation and evolution.


Subject(s)
Genome, Human; INDEL Mutation; Inverted Repeat Sequences; Models, Genetic; Polymorphism, Single Nucleotide; Animals; Base Sequence; Biological Evolution; High-Throughput Nucleotide Sequencing; Humans; Pan troglodytes; Sequence Alignment
17.
Nature; 513(7518): 422-425, 2014 Sep 18.
Article in English | MEDLINE | ID: mdl-25043003

ABSTRACT

The somatic mutations present in the genome of a cell accumulate over the lifetime of a multicellular organism. These mutations can provide insights into the developmental lineage tree, the number of divisions that each cell has undergone and the mutational processes that have been operative. Here we describe whole genomes of clonal lines derived from multiple tissues of healthy mice. Using somatic base substitutions, we reconstructed the early cell divisions of each animal, demonstrating the contributions of embryonic cells to adult tissues. Differences were observed between tissues in the numbers and types of mutations accumulated by each cell, which likely reflect differences in the number of cell divisions they have undergone and varying contributions of different mutational processes. If somatic mutation rates are similar to those in mice, the results indicate that precise insights into development and mutagenesis of normal human cells will be possible.


Subject(s)
Cell Lineage/genetics; Clone Cells/cytology; Clone Cells/metabolism; Genome/genetics; Mutagenesis/genetics; Mutation/genetics; Animals; Biological Clocks/genetics; Cell Division; Cells, Cultured; Embryo, Mammalian/cytology; Humans; Male; Mice; Mice, Inbred C57BL; Mutation Rate; Organoids/cytology; Organoids/metabolism; Phylogeny; Sequence Analysis, DNA; Tail/cytology
18.
Mol Biol Evol; 35(7): 1783-1797, 2018 07 01.
Article in English | MEDLINE | ID: mdl-29618097

ABSTRACT

Accurate reconstruction of ancestral states is a critical evolutionary analysis when studying ancient proteins and comparing biochemical properties between parental or extinct species and their extant relatives. It relies on multiple sequence alignment (MSA) which may introduce biases, and it remains unknown how MSA methodological approaches impact ancestral sequence reconstruction (ASR). Here, we investigate how MSA methodology modulates ASR using a simulation study of various evolutionary scenarios. We evaluate the accuracy of ancestral protein sequence reconstruction for simulated data and compare reconstruction outcomes using different alignment methods. Our results reveal biases introduced not only by aligner algorithms and assumptions, but also tree topology and the rate of insertions and deletions. Under many conditions we find no substantial differences between the MSAs. However, increasing the difficulty for the aligners can significantly impact ASR. The MAFFT consistency aligners and PRANK variants exhibit the best performance, whereas FSA displays limited performance. We also discover a bias towards reconstructed sequences longer than the true ancestors, deriving from a preference for inferring insertions, in almost all MSA methodological approaches. In addition, we find measures of MSA quality generally correlate highly with reconstruction accuracy. Thus, we show MSA methodological differences can affect the quality of reconstructions and propose MSA methods should be selected with care to accurately determine ancestral states with confidence.


Subject(s)
Genetic Techniques; Sequence Alignment
19.
Nature; 494(7435): 77-80, 2013 Feb 07.
Article in English | MEDLINE | ID: mdl-23354052

ABSTRACT

Digital production, transmission and storage have revolutionized how we access and use information but have also made archiving an increasingly complex task that requires active, continuing maintenance of digital media. This challenge has focused some interest on DNA as an attractive target for information storage because of its capacity for high-density information encoding, longevity under easily achieved conditions and proven track record as an information bearer. Previous DNA-based information storage approaches have encoded only trivial amounts of information or were not amenable to scaling-up, and used no robust error-correction and lacked examination of their cost-efficiency for large-scale information archival. Here we describe a scalable method that can reliably store more information than has been handled before. We encoded computer files totalling 739 kilobytes of hard-disk storage and with an estimated Shannon information of 5.2 × 10^6 bits into a DNA code, synthesized this DNA, sequenced it and reconstructed the original files with 100% accuracy. Theoretical analysis indicates that our DNA-based storage scheme could be scaled far beyond current global information volumes and offers a realistic technology for large-scale, long-term and infrequently accessed digital archiving. In fact, current trends in technological advances are reducing DNA synthesis costs at a pace that should make our scheme cost-effective for sub-50-year archiving within a decade.
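
The abstract does not spell out how bytes become bases, so the following is only a toy illustration of one way to do it, under assumptions made here rather than details taken from the abstract. Each byte is written as six base-3 digits (3^6 = 729 covers all 256 byte values), and each digit selects one of the three nucleotides that differ from the previous one, so no base is ever repeated consecutively. All function names are hypothetical, and the sketch includes no error correction, indexing, or fragment overlap.

```python
NUCLEOTIDES = "ACGT"
TRITS_PER_BYTE = 6  # 3**6 = 729 >= 256, so six base-3 digits hold one byte


def bytes_to_trits(data):
    """Write each byte as six base-3 digits, most significant first."""
    trits = []
    for value in data:
        digits = []
        for _ in range(TRITS_PER_BYTE):
            value, digit = divmod(value, 3)
            digits.append(digit)
        trits.extend(reversed(digits))
    return trits


def trits_to_dna(trits, start="A"):
    """Each digit picks one of the three bases that differ from the previous base."""
    previous, bases = start, []
    for trit in trits:
        previous = [b for b in NUCLEOTIDES if b != previous][trit]
        bases.append(previous)
    return "".join(bases)


def dna_to_trits(dna, start="A"):
    """Invert trits_to_dna by replaying the same choice lists."""
    previous, trits = start, []
    for base in dna:
        trits.append([b for b in NUCLEOTIDES if b != previous].index(base))
        previous = base
    return trits


def trits_to_bytes(trits):
    """Invert bytes_to_trits."""
    out = bytearray()
    for i in range(0, len(trits), TRITS_PER_BYTE):
        value = 0
        for trit in trits[i:i + TRITS_PER_BYTE]:
            value = value * 3 + trit
        out.append(value)
    return bytes(out)


dna = trits_to_dna(bytes_to_trits(b"DNA"))
print(dna, trits_to_bytes(dna_to_trits(dna)))
```

Avoiding repeated bases is a design choice aimed at sequences that are easier to synthesize and read; the fixed six-digit width wastes some capacity compared with a variable-length code.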


Subject(s)
Archives; DNA/chemistry; DNA/chemical synthesis; Information Management/methods; Base Sequence; Computers; DNA/economics; Information Management/economics; Molecular Sequence Data; Sequence Analysis, DNA/economics; Synthetic Biology/economics; Synthetic Biology/methods
20.
Syst Biol; 66(5): 769-785, 2017 Sep 01.
Article in English | MEDLINE | ID: mdl-28595363

ABSTRACT

The accumulation of genome-scale molecular data sets for nonmodel taxa brings us ever closer to resolving the tree of life of all living organisms. However, despite the depth of data available, a number of studies that each used thousands of genes have reported conflicting results. The focus of phylogenomic projects must thus shift to more careful experimental design. Even though we still have a limited understanding of what are the best predictors of the phylogenetic informativeness of a gene, there is wide agreement that one key factor is its evolutionary rate; but there is no consensus as to whether the rates derived as optimal in various analytical, empirical, and simulation approaches have any general applicability. We here use simulations to infer optimal rates in a set of realistic phylogenetic scenarios with varying tree sizes, numbers of terminals, and tree shapes. Furthermore, we study the relationship between the optimal rate and rate variation among sites and among lineages. Finally, we examine how well the predictions made by a range of experimental design methods correlate with the observed performance in our simulations. We find that the optimal level of divergence is surprisingly robust to differences in taxon sampling and even to among-site and among-lineage rate variation as often encountered in empirical data sets. This finding encourages the use of methods that rely on a single optimal rate to predict a gene's utility.
Focusing on correct recovery either of the most basal node in the phylogeny or of the entire topology, the optimal rate is about 0.45 substitutions from root to tip in average Yule trees and about 0.2 in difficult trees with short basal and long-apical branches, but all rates leading to divergence levels between about 0.1 and 0.5 perform reasonably well. Testing the performance of six methods that can be used to predict a gene's utility against our simulation results, we find that the probability of resolution, signal-noise analysis, and Fisher information are good predictors of phylogenetic informativeness, but they require specification of at least part of a model tree. Likelihood quartet mapping also shows very good performance but only requires sequence alignments and is thus applicable without making assumptions about the phylogeny. Despite them being the most commonly used methods for experimental design, geometric quartet mapping and the integration of phylogenetic informativeness curves perform rather poorly in our comparison. Instead of derived predictors of phylogenetic informativeness, we suggest that the number of sites in a gene that evolve at near-optimal rates (as inferred here) could be used directly to prioritize genes for phylogenetic inference. In combination with measures of model fit, especially with respect to compositional biases and among-site and among-lineage rate variation, such an approach has the potential to greatly improve marker choice and should be tested on empirical data.


Subject(s)
Classification/methods; Models, Biological; Phylogeny; Biological Evolution; Computer Simulation; Genome/genetics; Probability; Time