1.
Microb Genom ; 10(2)2024 Feb.
Article in English | MEDLINE | ID: mdl-38358325

ABSTRACT

The COVID-19 pandemic has seen large-scale pathogen genomic sequencing efforts, becoming part of the toolbox for surveillance and epidemic research. This resulted in an unprecedented level of data sharing to open repositories, which has actively supported the identification of SARS-CoV-2 structure, molecular interactions, mutations and variants, and facilitated vaccine development and drug reuse studies and design. The European COVID-19 Data Platform was launched to support this data sharing, and has resulted in the deposition of several million SARS-CoV-2 raw reads. In this paper we describe (1) open data sharing, (2) tools for submission, analysis, visualisation and data claiming (e.g. ORCiD), (3) the systematic analysis of these datasets, at scale via the SARS-CoV-2 Data Hubs as well as (4) lessons learnt. This paper describes a component of the Platform, the SARS-CoV-2 Data Hubs, which enable the extension and set up of infrastructure that we intend to use more widely in the future for pathogen surveillance and pandemic preparedness.


Subject(s)
COVID-19 , SARS-CoV-2 , Humans , SARS-CoV-2/genetics , Pandemics , COVID-19/epidemiology , Genomics , Information Dissemination
2.
Nat Ecol Evol ; 8(3): 519-535, 2024 Mar.
Article in English | MEDLINE | ID: mdl-38216617

ABSTRACT

Polyploidy or whole-genome duplication (WGD) is a major event that drastically reshapes genome architecture and is often assumed to be causally associated with organismal innovations and radiations. The 2R hypothesis suggests that two WGD events (1R and 2R) occurred during early vertebrate evolution. However, the timing of the 2R event relative to the divergence of gnathostomes (jawed vertebrates) and cyclostomes (jawless hagfishes and lampreys) is unresolved and whether these WGD events underlie vertebrate phenotypic diversification remains elusive. Here we present the genome of the inshore hagfish, Eptatretus burgeri. Through comparative analysis with lamprey and gnathostome genomes, we reconstruct the early events in cyclostome genome evolution, leveraging insights into the ancestral vertebrate genome. Genome-wide synteny and phylogenetic analyses support a scenario in which 1R occurred in the vertebrate stem-lineage during the early Cambrian, and 2R occurred in the gnathostome stem-lineage, maximally in the late Cambrian-earliest Ordovician, after its divergence from cyclostomes. We find that the genome of stem-cyclostomes experienced an additional independent genome triplication. Functional genomic and morphospace analyses demonstrate that WGD events generally contribute to developmental evolution with similar changes in the regulatory genome of both vertebrate groups. However, appreciable morphological diversification occurred only in the gnathostome but not in the cyclostome lineage, calling into question the general expectation that WGDs lead to leaps of bodyplan complexity.


Subject(s)
Hagfishes , Animals , Phylogeny , Hagfishes/genetics , Gene Duplication , Vertebrates/genetics , Genome , Lampreys/genetics
3.
Nucleic Acids Res ; 52(D1): D92-D97, 2024 Jan 05.
Article in English | MEDLINE | ID: mdl-37956313

ABSTRACT

The European Nucleotide Archive (ENA; https://www.ebi.ac.uk/ena) is maintained by the European Molecular Biology Laboratory's European Bioinformatics Institute (EMBL-EBI). The ENA is one of the three members of the International Nucleotide Sequence Database Collaboration (INSDC). It serves the bioinformatics community worldwide via the submission, processing, archiving and dissemination of sequence data. The ENA supports data types ranging from raw reads, through alignments and assemblies to functional annotation. The data is enriched with contextual information relating to samples and experimental configurations. In this article, we describe recent progress and improvements to ENA services. In particular, we focus upon three areas of work in 2023: FAIRness of ENA data, pandemic preparedness and foundational technology. For FAIRness, we have introduced minimal requirements for spatiotemporal annotation, created a metadata-based classification system, incorporated third party metadata curations with archived records, and developed a new rapid visualisation platform, the ENA Notebooks. For foundational enhancements, we have improved the INSDC data exchange and synchronisation pipelines, and invested in site reliability engineering for ENA infrastructure. In order to support genomic surveillance efforts, we have continued to provide ENA services in support of SARS-CoV-2 data mobilisation and have adapted these for broader pathogen surveillance efforts.


Subject(s)
Genomics , Nucleotides , Computational Biology , Databases, Nucleic Acid , Internet , Reproducibility of Results , Europe
4.
Microb Genom ; 9(12)2023 Dec.
Article in English | MEDLINE | ID: mdl-38085797

ABSTRACT

Fast, efficient public health actions require well-organized and coordinated systems that can supply timely and accurate knowledge. Public databases of pathogen genomic data, such as the International Nucleotide Sequence Database Collaboration (INSDC), have become essential tools for efficient public health decisions. However, these international resources began primarily for academic purposes, rather than for surveillance or interventions. Now, queries need to access not only the whole genomes of multiple pathogens but also make connections using robust contextual metadata to identify issues of public health relevance. Databases that over time developed a patchwork of submission formats and requirements need to be consistently organized and coordinated internationally to allow effective searches. To help resolve these issues, we propose a common pathogen data structure called the Pathogen Data Object Model (DOM) that will formalize the minimum pieces of sequence data and contextual data necessary for general public health uses, while recognizing that submitters will likely withhold a wide range of non-public contextual data. Further, we propose contributors use the Pathogen DOM for all pathogen submissions (bacterial, viral, fungal, and parasites), which will simplify data submissions and provide a consistent and transparent data structure for downstream data analyses. We also highlight how improved submission tools can support the Pathogen DOM, offering users additional easy-to-use methods to ensure this structure is followed.


Subject(s)
Nucleotides , Public Health , Base Sequence , Genomics/methods , Databases, Nucleic Acid
5.
Nucleic Acids Res ; 51(D1): D121-D125, 2023 01 06.
Article in English | MEDLINE | ID: mdl-36399492

ABSTRACT

The European Nucleotide Archive (ENA; https://www.ebi.ac.uk/ena), maintained by the European Molecular Biology Laboratory's European Bioinformatics Institute (EMBL-EBI), offers those producing data an open and supported platform for the management, archiving, publication, and dissemination of data; and to the scientific community as a whole, it offers a globally comprehensive data set through a host of data discovery and retrieval tools. Here, we describe recent updates to the ENA's submission and retrieval services as well as focused efforts to improve connectivity, reusability, and interoperability of ENA data and metadata.


Subject(s)
Databases, Nucleic Acid , Academies and Institutes , Computational Biology , Internet , Software , Datasets as Topic
6.
Nucleic Acids Res ; 50(D1): D765-D770, 2022 01 07.
Article in English | MEDLINE | ID: mdl-34634797

ABSTRACT

The COVID-19 pandemic has seen unprecedented use of SARS-CoV-2 genome sequencing for epidemiological tracking and identification of emerging variants. Understanding the potential impact of these variants on the infectivity of the virus and the efficacy of emerging therapeutics and vaccines has become a cornerstone of the fight against the disease. To support the maximal use of genomic information for SARS-CoV-2 research, we launched the Ensembl COVID-19 browser; the first virus to be encompassed within the Ensembl platform. This resource incorporates a new Ensembl gene set, multiple variant sets, and annotation from several relevant resources aligned to the reference SARS-CoV-2 assembly. Since the first release in May 2020, the content has been regularly updated using our new rapid release workflow, and tools such as the Ensembl Variant Effect Predictor have been integrated. The Ensembl COVID-19 browser is freely available at https://covid-19.ensembl.org.


Subject(s)
COVID-19/virology , Databases, Genetic , SARS-CoV-2/genetics , Web Browser , Coronaviridae/genetics , Genetic Variation , Genome, Viral , Humans , Molecular Sequence Annotation
7.
Nucleic Acids Res ; 50(D1): D106-D110, 2022 01 07.
Article in English | MEDLINE | ID: mdl-34850158

ABSTRACT

The European Nucleotide Archive (ENA, https://www.ebi.ac.uk/ena), maintained at the European Molecular Biology Laboratory's European Bioinformatics Institute (EMBL-EBI) provides freely accessible services, both for deposition of, and access to, open nucleotide sequencing data. Open scientific data are of paramount importance to the scientific community and contribute daily to the acceleration of scientific advance. Here, we outline the major updates to ENA's services and infrastructure that have been delivered over the past year.


Subject(s)
Computational Biology , Databases, Nucleic Acid , Nucleotides/genetics , Software , High-Throughput Nucleotide Sequencing , Humans , Internet , Molecular Sequence Annotation , Nucleotides/classification
8.
Nucleic Acids Res ; 50(D1): D988-D995, 2022 01 07.
Article in English | MEDLINE | ID: mdl-34791404

ABSTRACT

Ensembl (https://www.ensembl.org) is unique in its flexible infrastructure for access to genomic data and annotation. It has been designed to efficiently deliver annotation at scale for all eukaryotic life, and it also provides deep comprehensive annotation for key species. Genomes representing a greater diversity of species are increasingly being sequenced. In response, we have focussed our recent efforts on expediting the annotation of new assemblies. Here, we report the release of the greatest annual number of newly annotated genomes in the history of Ensembl via our dedicated Ensembl Rapid Release platform (http://rapid.ensembl.org). We have also developed a new method to generate comparative analyses at scale for these assemblies and, for the first time, we have annotated non-vertebrate eukaryotes. Meanwhile, we continually improve, extend and update the annotation for our high-value reference vertebrate genomes and report the details here. We have a range of specific software tools for specific tasks, such as the Ensembl Variant Effect Predictor (VEP) and the newly developed interface for the Variant Recoder. All Ensembl data, software and tools are freely available for download and are accessible programmatically.


Subject(s)
Databases, Genetic , Genome/genetics , Molecular Sequence Annotation , Software , Animals , Computational Biology/classification , Humans
9.
Nucleic Acids Res ; 49(W1): W619-W623, 2021 07 02.
Article in English | MEDLINE | ID: mdl-34048576

ABSTRACT

The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) pandemic will be remembered as one of the defining events of the 21st century. The rapid global outbreak has had significant impacts on human society and is already responsible for millions of deaths. Understanding and tackling the impact of the virus has required a worldwide mobilisation and coordination of scientific research. The COVID-19 Data Portal (https://www.covid19dataportal.org/) was first released as part of the European COVID-19 Data Platform, on April 20th 2020 to facilitate rapid and open data sharing and analysis, to accelerate global SARS-CoV-2 and COVID-19 research. The COVID-19 Data Portal has fortnightly feature releases to continue to add new data types, search options, visualisations and improvements based on user feedback and research. The open datasets and intuitive suite of search, identification and download services, represent a truly FAIR (Findable, Accessible, Interoperable and Reusable) resource that enables researchers to easily identify and quickly obtain the key datasets needed for their COVID-19 research.


Subject(s)
Biomedical Research , COVID-19 , Databases, Factual , Datasets as Topic , Information Dissemination , Open Access Publishing , SARS-CoV-2 , COVID-19/epidemiology , COVID-19/genetics , COVID-19/virology , Databases, Bibliographic , Disease Outbreaks , Humans , Pandemics , SARS-CoV-2/chemistry , SARS-CoV-2/genetics , SARS-CoV-2/metabolism , SARS-CoV-2/ultrastructure , Time Factors , Viral Proteins/chemistry , Viral Proteins/genetics
10.
Nucleic Acids Res ; 49(D1): D884-D891, 2021 01 08.
Article in English | MEDLINE | ID: mdl-33137190

ABSTRACT

The Ensembl project (https://www.ensembl.org) annotates genomes and disseminates genomic data for vertebrate species. We create detailed and comprehensive annotation of gene structures, regulatory elements and variants, and enable comparative genomics by inferring the evolutionary history of genes and genomes. Our integrated genomic data are made available in a variety of ways, including genome browsers, search interfaces, specialist tools such as the Ensembl Variant Effect Predictor, download files and programmatic interfaces. Here, we present recent Ensembl developments including two new website portals. Ensembl Rapid Release (http://rapid.ensembl.org) is designed to provide core tools and services for genomes as soon as possible and has been deployed to support large biodiversity sequencing projects. Our SARS-CoV-2 genome browser (https://covid-19.ensembl.org) integrates our own annotation with publicly available genomic data from numerous sources to facilitate the use of genomics in the international scientific response to the COVID-19 pandemic. We also report on other updates to our annotation resources, tools and services. All Ensembl data and software are freely available without restriction.


Subject(s)
Computational Biology/methods , Databases, Nucleic Acid , Genomics/methods , SARS-CoV-2/genetics , Vertebrates/genetics , Animals , COVID-19/epidemiology , COVID-19/virology , Humans , Internet , Molecular Sequence Annotation/methods , Pandemics , Vertebrates/classification
11.
Nucleic Acids Res ; 48(D1): D682-D688, 2020 01 08.
Article in English | MEDLINE | ID: mdl-31691826

ABSTRACT

Ensembl (https://www.ensembl.org) is a system for generating and distributing genome annotation such as genes, variation, regulation and comparative genomics across the vertebrate subphylum and key model organisms. The Ensembl annotation pipeline is capable of integrating experimental and reference data from multiple providers into a single integrated resource. Here, we present 94 newly annotated and re-annotated genomes, bringing the total number of genomes offered by Ensembl to 227. This represents the single largest expansion of the resource since its inception. We also detail our continued efforts to improve human annotation, developments in our epigenome analysis and display, a new tool for imputing causal genes from genome-wide association studies and visualisation of variation within a 3D protein model. Finally, we present information on our new website. Both software and data are made available without restriction via our website, online tools platform and programmatic interfaces (available under an Apache 2.0 license), with data updates made available four times a year.


Subject(s)
Computational Biology/methods , Databases, Genetic , Epigenome , Molecular Sequence Annotation , Algorithms , Animals , Computer Graphics , Databases, Protein , Genetic Variation , Genome-Wide Association Study , Genomics , Histones/metabolism , Humans , Imaging, Three-Dimensional , Internet , Ligands , Search Engine , Software , Species Specificity , Transcriptome , User-Computer Interface , Web Browser
12.
Nucleic Acids Res ; 48(D1): D689-D695, 2020 01 08.
Article in English | MEDLINE | ID: mdl-31598706

ABSTRACT

Ensembl Genomes (http://www.ensemblgenomes.org) is an integrating resource for genome-scale data from non-vertebrate species, complementing the resources for vertebrate genomics developed in the context of the Ensembl project (http://www.ensembl.org). Together, the two resources provide a consistent set of interfaces to genomic data across the tree of life, including reference genome sequence, gene models, transcriptional data, genetic variation and comparative analysis. Data may be accessed via our website, online tools platform and programmatic interfaces, with updates made four times per year (in synchrony with Ensembl). Here, we provide an overview of Ensembl Genomes, with a focus on recent developments. These include the continued growth, more robust and reproducible sets of orthologues and paralogues, and enriched views of gene expression and gene function in plants. Finally, we report on our continued deeper integration with the Ensembl project, which forms a key part of our future strategy for dealing with the increasing quantity of available genome-scale data across the tree of life.


Subject(s)
Computational Biology/methods , Databases, Genetic , Genetic Variation , Genome, Bacterial , Genome, Fungal , Genome, Plant , Algorithms , Animals , Caenorhabditis elegans/genetics , Genomics , Internet , Molecular Sequence Annotation , Phenotype , Plants/genetics , Reference Values , Software , User-Computer Interface
13.
Nucleic Acids Res ; 47(D1): D745-D751, 2019 01 08.
Article in English | MEDLINE | ID: mdl-30407521

ABSTRACT

The Ensembl project (https://www.ensembl.org) makes key genomic data sets available to the entire scientific community without restrictions. Ensembl seeks to be a fundamental resource driving scientific progress by creating, maintaining and updating reference genome annotation and comparative genomics resources. This year we describe our new and expanded gene, variant and comparative annotation capabilities, which led to a 50% increase in the number of vertebrate genomes we support. We have also doubled the number of available human variants and added regulatory regions for many mouse cell types and developmental stages. Our data sets and tools are available via the Ensembl website as well as through a RESTful web service, a Perl application programming interface and as data files for download.


Subject(s)
Databases, Genetic , Genome/genetics , Genomics , Vertebrates/genetics , Animals , Computational Biology/trends , Humans , Mice , Molecular Sequence Annotation , Software
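
Note on entry 13: the abstract mentions programmatic access through a RESTful web service. As a minimal sketch of what such access can look like, the snippet below builds a request URL for the public Ensembl REST server's symbol lookup endpoint (`/lookup/symbol/:species/:symbol`). The helper names `lookup_symbol_url` and `fetch_json` are hypothetical and are not part of any Ensembl client library; the species and gene symbol in the usage note are illustrative only.

```python
import json
import urllib.request
from urllib.parse import quote, urlencode

# Base address of the public Ensembl REST server.
ENSEMBL_REST = "https://rest.ensembl.org"


def lookup_symbol_url(species: str, symbol: str, expand: bool = False) -> str:
    """Build the URL for a gene-symbol lookup (hypothetical helper)."""
    path = f"/lookup/symbol/{quote(species)}/{quote(symbol)}"
    # The REST server accepts the response format as a query parameter.
    params = {"content-type": "application/json"}
    if expand:
        # Ask the server to include transcripts in the response.
        params["expand"] = "1"
    return f"{ENSEMBL_REST}{path}?{urlencode(params)}"


def fetch_json(url: str, timeout: float = 30.0) -> dict:
    """Fetch a URL and decode the JSON body (requires network access)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

For example, `fetch_json(lookup_symbol_url("homo_sapiens", "BRCA2"))` would request the record for that symbol; the fields returned depend on the server and release, so none are assumed here.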
14.
Genome Res ; 28(4): 448-459, 2018 04.
Article in English | MEDLINE | ID: mdl-29563166

ABSTRACT

Understanding the mechanisms driving lineage-specific evolution in both primates and rodents has been hindered by the lack of sister clades with a similar phylogenetic structure having high-quality genome assemblies. Here, we have created chromosome-level assemblies of the Mus caroli and Mus pahari genomes. Together with the Mus musculus and Rattus norvegicus genomes, this set of rodent genomes is similar in divergence times to the Hominidae (human-chimpanzee-gorilla-orangutan). By comparing the evolutionary dynamics between the Muridae and Hominidae, we identified punctate events of chromosome reshuffling that shaped the ancestral karyotype of Mus musculus and Mus caroli between 3 and 6 million yr ago, but that are absent in the Hominidae. Hominidae show between four- and sevenfold lower rates of nucleotide change and feature turnover in both neutral and functional sequences, suggesting an underlying coherence to the Muridae acceleration. Our system of matched, high-quality genome assemblies revealed how specific classes of repeats can play lineage-specific roles in related species. Recent LINE activity has remodeled protein-coding loci to a greater extent across the Muridae than the Hominidae, with functional consequences at the species level such as reproductive isolation. Furthermore, we charted a Muridae-specific retrotransposon expansion at unprecedented resolution, revealing how a single nucleotide mutation transformed a specific SINE element into an active CTCF binding site carrier specifically in Mus caroli, which resulted in thousands of novel, species-specific CTCF binding sites. Our results show that the comparison of matched phylogenetic sets of genomes will be an increasingly powerful strategy for understanding mammalian biology.


Subject(s)
Evolution, Molecular , Genome/genetics , Muridae/genetics , Phylogeny , Animals , Binding Sites , CCCTC-Binding Factor/genetics , Chromosomes/genetics , Karyotyping/methods , Long Interspersed Nucleotide Elements/genetics , Mice , Retroelements/genetics , Species Specificity
15.
Nucleic Acids Res ; 46(D1): D754-D761, 2018 01 04.
Article in English | MEDLINE | ID: mdl-29155950

ABSTRACT

The Ensembl project has been aggregating, processing, integrating and redistributing genomic datasets since the initial releases of the draft human genome, with the aim of accelerating genomics research through rapid open distribution of public data. Large amounts of raw data are thus transformed into knowledge, which is made available via a multitude of channels, in particular our browser (http://www.ensembl.org). Over time, we have expanded in multiple directions. First, our resources describe multiple fields of genomics, in particular gene annotation, comparative genomics, genetics and epigenomics. Second, we cover a growing number of genome assemblies; Ensembl Release 90 contains exactly 100. Third, our databases feed simultaneously into an array of services designed around different use cases, ranging from quick browsing to genome-wide bioinformatic analysis. We present here the latest developments of the Ensembl project, with a focus on managing an increasing number of assemblies, supporting efforts in genome interpretation and improving our browser.


Subject(s)
Databases, Genetic , Datasets as Topic , Genome , Information Dissemination , Animals , Epigenomics , Genome, Human , Genome-Wide Association Study , Genomics , High-Throughput Nucleotide Sequencing , Humans , Molecular Sequence Annotation , Vertebrates/genetics , Web Browser
16.
Nucleic Acids Res ; 45(D1): D635-D642, 2017 01 04.
Article in English | MEDLINE | ID: mdl-27899575

ABSTRACT

Ensembl (www.ensembl.org) is a database and genome browser for enabling research on vertebrate genomes. We import, analyse, curate and integrate a diverse collection of large-scale reference data to create a more comprehensive view of genome biology than would be possible from any individual dataset. Our extensive data resources include evidence-based gene and regulatory region annotation, genome variation and gene trees. An accompanying suite of tools, infrastructure and programmatic access methods ensure uniform data analysis and distribution for all supported species. Together, these provide a comprehensive solution for large-scale and targeted genomics applications alike. Among many other developments over the past year, we have improved our resources for gene regulation and comparative genomics, and added CRISPR/Cas9 target sites. We released new browser functionality and tools, including improved filtering and prioritization of genome variation, Manhattan plot visualization for linkage disequilibrium and eQTL data, and an ontology search for phenotypes, traits and disease. We have also enhanced data discovery and access with a track hub registry and a selection of new REST end points. All Ensembl data are freely released to the scientific community and our source code is available via the open source Apache 2.0 license.


Subject(s)
Computational Biology/methods , Databases, Genetic , Genomics/methods , Search Engine , Software , Web Browser , Animals , Data Mining , Evolution, Molecular , Gene Expression Regulation , Genetic Variation , Genome, Human , Humans , Molecular Sequence Annotation , Species Specificity , Vertebrates
17.
Bioinformatics ; 32(7): 1109-11, 2016 04 01.
Article in English | MEDLINE | ID: mdl-26794317

ABSTRACT

Transposon insertion sequencing is a high-throughput technique for assaying large libraries of otherwise isogenic transposon mutants providing insight into gene essentiality, gene function and genetic interactions. We previously developed the Transposon Directed Insertion Sequencing (TraDIS) protocol for this purpose, which utilizes shearing of genomic DNA followed by specific PCR amplification of transposon-containing fragments and Illumina sequencing. Here we describe an optimized high-yield library preparation and sequencing protocol for TraDIS experiments and a novel software pipeline for analysis of the resulting data. The Bio-Tradis analysis pipeline is implemented as an extensible Perl library which can either be used as is, or as a basis for the development of more advanced analysis tools. This article can serve as a general reference for the application of the TraDIS methodology. AVAILABILITY AND IMPLEMENTATION: The optimized sequencing protocol is included as supplementary information. The Bio-Tradis analysis pipeline is available under a GPL license at https://github.com/sanger-pathogens/Bio-Tradis. CONTACT: parkhill@sanger.ac.uk. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
DNA Transposable Elements , Gene Library , Software , High-Throughput Nucleotide Sequencing
18.
Nucleic Acids Res ; 44(D1): D710-6, 2016 Jan 04.
Article in English | MEDLINE | ID: mdl-26687719

ABSTRACT

The Ensembl project (http://www.ensembl.org) is a system for genome annotation, analysis, storage and dissemination designed to facilitate the access of genomic annotation from chordates and key model organisms. It provides access to data from 87 species across our main and early access Pre! websites. This year we introduced three newly annotated species and released numerous updates across our supported species with a concentration on data for the latest genome assemblies of human, mouse, zebrafish and rat. We also provided two data updates for the previous human assembly, GRCh37, through a dedicated website (http://grch37.ensembl.org). Our tools, in particular the VEP, have been improved significantly through integration of additional third party data. REST is now capable of larger-scale analysis and our regulatory data BioMart can deliver faster results. The website is now capable of displaying long-range interactions such as those found in cis-regulated datasets. Finally we have launched a website optimized for mobile devices providing views of genes, variants and phenotypes. Our data is made available without restriction and all code is available from our GitHub organization site (http://github.com/Ensembl) under an Apache 2.0 license.


Subject(s)
Databases, Genetic , Genomics , Molecular Sequence Annotation , Animals , Genes , Genetic Variation , Humans , Internet , Mice , Proteins/genetics , Rats , Regulatory Sequences, Nucleic Acid , Software
19.
Bioinformatics ; 31(22): 3691-3, 2015 Nov 15.
Article in English | MEDLINE | ID: mdl-26198102

ABSTRACT

A typical prokaryote population sequencing study can now consist of hundreds or thousands of isolates. Interrogating these datasets can provide detailed insights into the genetic structure of prokaryotic genomes. We introduce Roary, a tool that rapidly builds large-scale pan genomes, identifying the core and accessory genes. Roary makes construction of the pan genome of thousands of prokaryote samples possible on a standard desktop without compromising on the accuracy of results. Using a single CPU Roary can produce a pan genome consisting of 1000 isolates in 4.5 hours using 13 GB of RAM, with further speedups possible using multiple processors. AVAILABILITY AND IMPLEMENTATION: Roary is implemented in Perl and is freely available under an open source GPLv3 license from http://sanger-pathogens.github.io/Roary. CONTACT: roary@sanger.ac.uk. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Genome, Bacterial , Prokaryotic Cells/metabolism , Software , Computer Simulation , Databases, Genetic , Salmonella typhi/genetics
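
Note on entry 19: the abstract distinguishes core genes from accessory genes in a pan genome. The sketch below illustrates only that final distinction, given a gene presence/absence table that is assumed to exist already. It is not Roary's algorithm, which also has to cluster sequences into gene families. The function name `split_pan_genome`, the `core_fraction` threshold and its default value are assumptions made for illustration.

```python
def split_pan_genome(
    presence: dict[str, set[str]],
    n_isolates: int,
    core_fraction: float = 0.99,  # assumed cut-off, chosen for illustration
) -> tuple[list[str], list[str]]:
    """Split gene families into core and accessory lists.

    `presence` maps each gene family to the set of isolates carrying it.
    A gene counts as core when the share of isolates carrying it reaches
    `core_fraction`; every other gene is accessory.
    """
    if n_isolates <= 0:
        raise ValueError("n_isolates must be positive")
    core, accessory = [], []
    # Sort gene names so the output order is deterministic.
    for gene, isolates in sorted(presence.items()):
        if len(isolates) / n_isolates >= core_fraction:
            core.append(gene)
        else:
            accessory.append(gene)
    return core, accessory
```

With three isolates `a`, `b`, `c`, a gene found in all three lands in the core list, while a gene found only in `a` lands in the accessory list.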
20.
Syst Biol ; 60(6): 833-44, 2011 Dec.
Article in English | MEDLINE | ID: mdl-21804093

ABSTRACT

Current phylogenetic methods attempt to account for evolutionary rate variation across characters in a matrix. This is generally achieved by the use of sophisticated evolutionary models, combined with dense sampling of large numbers of characters. However, systematic biases and superimposed substitutions make this task very difficult. Model adequacy can sometimes be achieved at the cost of adding large numbers of free parameters, with each parameter being optimized according to some criterion, resulting in increased computation times and large variances in the model estimates. In this study, we develop a simple approach that estimates the relative evolutionary rate of each homologous character. The method that we describe uses the similarity between characters as a proxy for evolutionary rate. In this article, we work on the premise that if the character-state distribution of a homologous character is similar to many other characters, then this character is likely to be relatively slowly evolving. If the character-state distribution of a homologous character is not similar to many or any of the rest of the characters in a data set, then it is likely to be the result of rapid evolution. We show that in some test cases, at least, the premise can hold and the inferences are robust. Importantly, the method does not use a "starting tree" to make the inference and therefore is tree independent. We demonstrate that this approach can work as well as a maximum likelihood (ML) approach, though the ML method needs to have a known phylogeny, or at least a very good estimate of that phylogeny. We then demonstrate some uses for this method of analysis, including the improvement in phylogeny reconstruction for both deep-level and recent relationships and overcoming systematic biases such as base composition bias. Furthermore, we compare this approach to two well-established methods for reweighting or removing characters. These other methods are tree-based and we show that they can be systematically biased. We feel this method can be useful for phylogeny reconstruction, understanding evolutionary rate variation, and for understanding selection variation on different characters.


Subject(s)
Biological Evolution , Classification/methods , Phylogeny , Animals , Bias , Computer Simulation , Humans , Primates/classification , Primates/genetics
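
Note on entry 20: the abstract's premise is that a character whose state distribution resembles many other characters is probably slowly evolving. The sketch below shows one way such a tree-independent similarity score could be computed. The specific measure is an assumption made for illustration and should not be read as the authors' published formula: each alignment column is treated as a partition of the taxa by state, and a column is scored by how often the groups of other columns fit inside its own groups. All function names are hypothetical.

```python
from collections import defaultdict


def partition(column: str) -> list[set[int]]:
    """Group taxon indices by the character state they show in this column."""
    groups: dict[str, set[int]] = defaultdict(set)
    for taxon, state in enumerate(column):
        groups[state].add(taxon)
    return list(groups.values())


def agreement(p_i: list[set[int]], p_j: list[set[int]]) -> float:
    """Fraction of groups in p_j that sit entirely inside some group of p_i."""
    hits = sum(any(block <= ref for ref in p_i) for block in p_j)
    return hits / len(p_j)


def similarity_scores(columns: list[str]) -> list[float]:
    """Mean agreement of each column with all others.

    Under the premise in the abstract, a higher score suggests a more
    slowly evolving character and a lower score a faster one.
    """
    if len(columns) < 2:
        raise ValueError("need at least two characters to compare")
    parts = [partition(c) for c in columns]
    n = len(parts)
    scores = []
    for i in range(n):
        others = [agreement(parts[i], parts[j]) for j in range(n) if j != i]
        scores.append(sum(others) / len(others))
    return scores
```

For the toy columns `["AAGG", "AAGG", "ACGT"]` over four taxa, the two congruent columns score 1.0 and the column in which every taxon has a different state scores 0.0. No tree is used at any point, which matches the tree-independence the abstract emphasises.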