ABSTRACT
The Ensembl project has been aggregating, processing, integrating and redistributing genomic datasets since the initial releases of the draft human genome, with the aim of accelerating genomics research through rapid open distribution of public data. Large amounts of raw data are thus transformed into knowledge, which is made available via a multitude of channels, in particular our browser (http://www.ensembl.org). Over time, we have expanded in multiple directions. First, our resources describe multiple fields of genomics, in particular gene annotation, comparative genomics, genetics and epigenomics. Second, we cover a growing number of genome assemblies; Ensembl Release 90 contains exactly 100. Third, our databases feed simultaneously into an array of services designed around different use cases, ranging from quick browsing to genome-wide bioinformatic analysis. We present here the latest developments of the Ensembl project, with a focus on managing an increasing number of assemblies, supporting efforts in genome interpretation and improving our browser.
Subjects
Genetic Databases , Datasets as Topic , Genome , Information Dissemination , Animals , Epigenomics , Human Genome , Genome-Wide Association Study , Genomics , High-Throughput Nucleotide Sequencing , Humans , Molecular Sequence Annotation , Vertebrates/genetics , Web Browser
ABSTRACT
Ensembl (www.ensembl.org) is a database and genome browser for enabling research on vertebrate genomes. We import, analyse, curate and integrate a diverse collection of large-scale reference data to create a more comprehensive view of genome biology than would be possible from any individual dataset. Our extensive data resources include evidence-based gene and regulatory region annotation, genome variation and gene trees. An accompanying suite of tools, infrastructure and programmatic access methods ensure uniform data analysis and distribution for all supported species. Together, these provide a comprehensive solution for large-scale and targeted genomics applications alike. Among many other developments over the past year, we have improved our resources for gene regulation and comparative genomics, and added CRISPR/Cas9 target sites. We released new browser functionality and tools, including improved filtering and prioritization of genome variation, Manhattan plot visualization for linkage disequilibrium and eQTL data, and an ontology search for phenotypes, traits and disease. We have also enhanced data discovery and access with a track hub registry and a selection of new REST end points. All Ensembl data are freely released to the scientific community and our source code is available via the open source Apache 2.0 license.
Subjects
Computational Biology/methods , Genetic Databases , Genomics/methods , Search Engine , Software , Web Browser , Animals , Data Mining , Molecular Evolution , Gene Expression Regulation , Genetic Variation , Human Genome , Humans , Molecular Sequence Annotation , Species Specificity , Vertebrates
ABSTRACT
Gramene (http://www.gramene.org) is an online resource for comparative functional genomics in crops and model plant species. Its two main frameworks are genomes (collaboration with Ensembl Plants) and pathways (The Plant Reactome and archival BioCyc databases). Since our last NAR update, the database website adopted a new Drupal management platform. The genomes section features 39 fully assembled reference genomes that are integrated using ontology-based annotation and comparative analyses, and accessed through both visual and programmatic interfaces. Additional community data, such as genetic variation, expression and methylation, are also mapped for a subset of genomes. The Plant Reactome pathway portal (http://plantreactome.gramene.org) provides a reference resource for analyzing plant metabolic and regulatory pathways. In addition to ~200 curated rice reference pathways, the portal hosts gene homology-based pathway projections for 33 plant species. Both the genome and pathway browsers interface with the EMBL-EBI's Expression Atlas to enable the projection of baseline and differential expression data from curated expression studies in plants. Gramene's archive website (http://archive.gramene.org) continues to provide previously reported resources on comparative maps, markers and QTL. To further aid our users, we have also introduced a live monthly educational webinar series and a Gramene YouTube channel carrying video tutorials.
Subjects
Genetic Databases , Plant Genome , Plants/metabolism , Gene Expression , Genetic Variation , Genomics , Internet , Metabolic Networks and Pathways , Molecular Sequence Annotation , Plants/genetics
ABSTRACT
Ensembl Genomes (http://www.ensemblgenomes.org) is an integrating resource for genome-scale data from non-vertebrate species, complementing the resources for vertebrate genomics developed in the context of the Ensembl project (http://www.ensembl.org). Together, the two resources provide a consistent set of programmatic and interactive interfaces to a rich range of data including reference sequence, gene models, transcriptional data, genetic variation and comparative analysis. This paper provides an update to the previous publications about the resource, with a focus on recent developments. These include the development of new analyses and views to represent polyploid genomes (of which bread wheat is the primary exemplar); and the continued up-scaling of the resource, which now includes over 23 000 bacterial genomes, 400 fungal genomes and 100 protist genomes, in addition to 55 genomes from invertebrate metazoa and 39 genomes from plants. This dramatic increase in the number of included genomes is one part of a broader effort to automate the integration of archival data (genome sequence, but also associated RNA sequence data and variant calls) within the context of reference genomes and make it available through the Ensembl user interfaces.
Subjects
Genetic Databases , Bacterial Genome , Fungal Genome , Plant Genome , Invertebrates/genetics , Animals , Diploidy , Eukaryota/genetics , Genetic Variation , Genome , Polyploidy , Sequence Alignment
ABSTRACT
Gramene (http://www.gramene.org) is a curated online resource for comparative functional genomics in crops and model plant species, currently hosting 27 fully and 10 partially sequenced reference genomes in its build number 38. Its strength derives from the application of a phylogenetic framework for genome comparison and the use of ontologies to integrate structural and functional annotation data. Whole-genome alignments complemented by phylogenetic gene family trees help infer syntenic and orthologous relationships. Genetic variation data, sequences and genome mappings available for 10 species, including Arabidopsis, rice and maize, help infer putative variant effects on genes and transcripts. The pathways section also hosts 10 species-specific metabolic pathways databases developed in-house or by our collaborators using Pathway Tools software, which facilitates searches for pathway, reaction and metabolite annotations, and allows analyses of user-defined expression datasets. Recently, we released a Plant Reactome portal featuring 133 curated rice pathways. This portal will be expanded for Arabidopsis, maize and other plant species. We continue to provide genetic and QTL maps and marker datasets developed by crop researchers. The project provides a unique community platform to support scientific research in plant genomics including studies in evolution, genetics, plant breeding, molecular biology, biochemistry and systems biology.
Subjects
Genetic Databases , Plant Genome , Genomics , Agricultural Crops/genetics , Genetic Variation , Internet , Metabolic Networks and Pathways/genetics , Molecular Sequence Annotation , Plants/genetics , Plants/metabolism
ABSTRACT
Ensembl Genomes (http://www.ensemblgenomes.org) is an integrating resource for genome-scale data from non-vertebrate species. The project exploits and extends technologies for genome annotation, analysis and dissemination, developed in the context of the vertebrate-focused Ensembl project, and provides a complementary set of resources for non-vertebrate species through a consistent set of programmatic and interactive interfaces. These provide access to data including reference sequence, gene models, transcriptional data, polymorphisms and comparative analysis. This article provides an update to the previous publications about the resource, with a focus on recent developments. These include the addition of important new genomes (and related data sets) including crop plants, vectors of human disease and eukaryotic pathogens. In addition, the resource has scaled up its representation of bacterial genomes, and now includes the genomes of over 9000 bacteria. Specific extensions to the web and programmatic interfaces have been developed to support users in navigating these large data sets. Looking forward, analytic tools to allow targeted selection of data for visualization and download are likely to become increasingly important in future as the number of available genomes increases within all domains of life, and some of the challenges faced in representing bacterial data are likely to become commonplace for eukaryotes in future.
Subjects
Genetic Databases , Genome , Animals , Edible Grain/genetics , Bacterial Genome , Fungal Genome , Plant Genome , Genomics , Internet , Molecular Sequence Annotation , Software
ABSTRACT
Recent developments in DNA sequencing have enabled the large and complex genomes of many crop species to be determined for the first time, even those previously intractable due to their polyploid nature. Indeed, over the course of the last 2 years, the genome sequences of several commercially important cereals, notably barley and bread wheat, have become available, as well as those of related wild species. While still incomplete, comparison with other, more completely assembled species suggests that coverage of genic regions is likely to be high. Ensembl Plants (http://plants.ensembl.org) is an integrative resource organizing, analyzing and visualizing genome-scale information for important crop and model plants. Available data include reference genome sequence, variant loci, gene models and functional annotation. For variant loci, individual and population genotypes, linkage information and, where available, phenotypic information are shown. Comparative analyses are performed on DNA and protein sequence alignments. The resulting genome alignments and gene trees, representing the implied evolutionary history of the gene family, are made available for visualization and analysis. Driven by the case of bread wheat, specific extensions to the analysis pipelines and web interface have recently been developed to support polyploid genomes. Data in Ensembl Plants is accessible through a genome browser incorporating various specialist interfaces for different data types, and through a variety of additional methods for programmatic access and data mining. These interfaces are consistent with those offered through the Ensembl interface for the genomes of non-plant species, including those of plant pathogens, pests and pollinators, facilitating the study of the plant in its environment.
Subjects
Plant Genome/genetics , Genomics , Hordeum/genetics , Transcriptome , Triticum/genetics , Edible Grain/genetics , Genetic Variation , Genotype , Information Storage and Retrieval , Internet , User-Computer Interface
ABSTRACT
BACKGROUND: Vibrio vulnificus is the leading cause of reported death from consumption of seafood in the United States. Despite several decades of research on molecular pathogenesis, much remains to be learned about the mechanisms of virulence of this opportunistic bacterial pathogen. The two complete and annotated genomic DNA sequences of V. vulnificus belong to strains of clade 2, which is the predominant clade among clinical strains. Clade 2 strains generally possess higher virulence potential in animal models of disease compared with clade 1, which predominates among environmental strains. SOLiD sequencing of four V. vulnificus strains representing different clades (1 and 2) and biotypes (1 and 2) was used for comparative genomic analysis. RESULTS: Greater than 4,100,000 bases were sequenced of each strain, yielding approximately 100-fold coverage for each of the four genomes. Although the read lengths of SOLiD genomic sequencing were only 35 nt, we were able to make significant conclusions about the unique and shared sequences among the genomes, including identification of single nucleotide polymorphisms. Comparative analysis of the newly sequenced genomes to the existing reference genomes enabled the identification of 3,459 core V. vulnificus genes shared among all six strains and 80 clade 2-specific genes. We identified 523,161 SNPs among the six genomes. CONCLUSIONS: We were able to glean much information about the genomic content of each strain using next generation sequencing. Flp pili, GGDEF proteins, and genomic island XII were identified as possible virulence factors because of their presence in virulent sequenced strains. Genomic comparisons also point toward the involvement of sialic acid catabolism in pathogenesis.
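The relationship between read length, read count and fold coverage mentioned in the RESULTS can be sketched with the standard expected-coverage formula C = N × L / G. The following is only an illustration: the 35 nt read length and 100-fold target come from the abstract, whereas the genome size of 5,000,000 bp is a hypothetical round number chosen for the example and is not stated in the source.

```python
def expected_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Expected fold coverage C = N * L / G (assumes reads are spread uniformly)."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return num_reads * read_length / genome_size


def reads_needed(target_coverage: int, read_length: int, genome_size: int) -> int:
    """Smallest whole number of reads whose expected coverage reaches the target."""
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    total_bases = target_coverage * genome_size
    # Ceiling division using integers only, to avoid floating-point rounding.
    return -(-total_bases // read_length)


# Hypothetical genome size (assumption, not from the abstract).
ASSUMED_GENOME_SIZE = 5_000_000
READ_LENGTH = 35        # nt, as reported for the SOLiD reads
TARGET_COVERAGE = 100   # approximate fold coverage reported

n_reads = reads_needed(TARGET_COVERAGE, READ_LENGTH, ASSUMED_GENOME_SIZE)
print(n_reads, expected_coverage(n_reads, READ_LENGTH, ASSUMED_GENOME_SIZE))
```

Under these assumptions, short 35 nt reads require on the order of ten million reads per genome to reach 100-fold coverage, which is why read count rather than read length drives the depth reported here.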
Subjects
Bacterial Genes/genetics , Genomics/methods , DNA Sequence Analysis/methods , Vibrio vulnificus/genetics , Vibrio vulnificus/pathogenicity , Animals , Base Sequence , Genotype , Mice , Open Reading Frames/genetics , Phenotype , Phylogeny , Plasmids/genetics , Single Nucleotide Polymorphism/genetics , Reference Standards , Vibrio vulnificus/classification , Virulence/genetics
ABSTRACT
The genome sequences of many important Triticeae species, including bread wheat (Triticum aestivum L.) and barley (Hordeum vulgare L.), remained uncharacterized for a long time because of their high repeat content, large sizes, and polyploidy. As a result of improvements in sequencing technologies and novel analysis strategies, several of these have recently been deciphered. These efforts have generated new insights into Triticeae biology and genome organization and have important implications for downstream usage by breeders, experimental biologists, and comparative genomicists. transPLANT () is an EU-funded project aimed at constructing hardware, software, and data infrastructure for genome-scale research in the life sciences. Since the Triticeae data are intrinsically complex, heterogeneous, and distributed, the transPLANT consortium has undertaken efforts to develop common data formats and tools that enable the exchange and integration of data from distributed resources. Here we present an overview of the individual Triticeae genome resources hosted by transPLANT partners, introduce the objectives of transPLANT, and outline common developments and interfaces supporting integrated data access.
Subjects
Plant Genome , Genomics/methods , Poaceae/genetics , Molecular Evolution , Hordeum/genetics , Polyploidy , Triticum/genetics
ABSTRACT
Genome sequencing with next-generation sequence (NGS) technologies can now be applied to organisms pivotal to addressing fundamental biological questions, but with genomes previously considered intractable or too expensive to undertake. However, for species with large and complex genomes, extensive genetic and physical map resources have, until now, been required to direct the sequencing effort and sequence assembly. As these resources are unavailable for most species, assembling high-quality genome sequences from NGS data remains challenging. We describe a strategy that uses NGS, fluorescence in situ hybridization, and whole-genome mapping to assemble a high-quality genome sequence for Amborella trichopoda, a nonmodel species crucial to understanding flowering plant evolution. These methods are applicable to many other organisms with limited genomic resources.
Subjects
Contig Mapping/methods , Plant Genome , DNA Sequence Analysis/methods , Tracheophyta/genetics , Fluorescence In Situ Hybridization
ABSTRACT
Multilevel analysis of transcription is facilitated by a new array design that includes modules for assessment of differential expression, isoform usage, and allelic imbalance in Drosophila. The ~2.5 million feature chip incorporates a large number of controls, and it contains 18,769 3' expression probe sets and 61,919 exon probe sets with probe sequences from Drosophila melanogaster and 60,118 SNP probe sets focused on Drosophila simulans. An experiment in D. simulans identified genes differentially expressed between males and females (34% in the 3' expression module; 32% in the exon module). These proportions are consistent with previous reports, and there was good agreement (κ = 0.63) between the modules. Alternative isoform usage between the sexes was identified for 164 genes. The SNP module was verified with resequencing data. Concordance between resequencing and the chip design was greater than 99%. The design also proved apt in separating alleles based upon hybridization intensity. Concordance between the highest hybridization signals and the expected alleles in the genotype was greater than 96%. Intriguingly, allelic imbalance was detected for 37% of 6579 probe sets examined that contained heterozygous SNP loci. The large number of probes and multiple probe sets per gene in the 3' expression and exon modules allow the array to be used in D. melanogaster and in closely related species. The SNP module can be used for allele specific expression and genotyping of D. simulans.
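The agreement statistic κ quoted above is presumably Cohen's kappa, which corrects the observed agreement between two sets of calls for the agreement expected by chance: κ = (p_o − p_e) / (1 − p_e). The sketch below shows the calculation for two binary call vectors (1 = called differentially expressed, 0 = not called). The toy vectors are invented for illustration and are not data from the study, and the function name is hypothetical.

```python
from typing import Sequence


def cohens_kappa(calls_a: Sequence[int], calls_b: Sequence[int]) -> float:
    """Cohen's kappa for two equal-length vectors of binary (0/1) calls."""
    if len(calls_a) != len(calls_b) or not calls_a:
        raise ValueError("call vectors must be non-empty and of equal length")
    n = len(calls_a)

    # Observed agreement: fraction of genes given the same call by both modules.
    p_observed = sum(a == b for a, b in zip(calls_a, calls_b)) / n

    # Chance agreement: probability both say 1 plus probability both say 0,
    # treating the two modules as independent with their own call rates.
    rate_a = sum(calls_a) / n
    rate_b = sum(calls_b) / n
    p_expected = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)

    if p_expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_observed - p_expected) / (1 - p_expected)


# Toy example (invented data): calls for eight genes from two array modules.
expression_module = [1, 1, 0, 0, 1, 0, 1, 0]
exon_module = [1, 0, 0, 0, 1, 0, 1, 1]
print(cohens_kappa(expression_module, exon_module))
```

In the toy data the modules agree on 6 of 8 genes (p_o = 0.75) while chance alone would give p_e = 0.5, so κ is half of the maximum possible improvement over chance.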
ABSTRACT
High-throughput DNA sequencing can identify organisms and describe population structures in many environmental and clinical samples. Current technologies generate millions of reads in a single run, requiring extensive computational strategies to organize, analyze and interpret those sequences. A series of bioinformatics tools for high-throughput sequencing analysis, including pre-processing, clustering, database matching and classification, have been compiled into a pipeline called PANGEA. The PANGEA pipeline was written in Perl and can be run on Mac OS X, Windows or Linux. With PANGEA, sequences obtained directly from the sequencer can be processed quickly to provide the files needed for sequence identification by BLAST and for comparison of microbial communities. Two different sets of bacterial 16S rRNA sequences were used to show the efficiency of this workflow. The first set of 16S rRNA sequences is derived from various soils from Hawaii Volcanoes National Park. The second set is derived from stool samples collected from diabetes-resistant and diabetes-prone rats. The workflow described here allows the investigator to quickly assess libraries of sequences on personal computers with customized databases. PANGEA is provided for users as individual scripts for each step in the process or as a single script where all processes, except the chi-square step, are joined into one program called the 'backbone'.
Subjects
Bacteria/classification , Computational Biology/methods , Feces/microbiology , 16S Ribosomal RNA/genetics , DNA Sequence Analysis/methods , Software , Soil Microbiology , Animals , Bacteria/genetics , Bacteria/isolation & purification , Bacterial DNA/genetics , Diabetes Mellitus/microbiology , Ecosystem , Molecular Sequence Data , Rats , Workflow
ABSTRACT
OBJECTIVE: To explore a quantitative trait locus (QTL) on human chromosome 1q affecting BMI, adiposity, and fat-free mass phenotypes in the Quebec Family Study cohort. RESEARCH METHODS AND PROCEDURES: Non-parametric sibpair and variance component linkage analyses and family-based association studies were performed with a dense set of chromosome 1q43 microsatellites and single-nucleotide polymorphism markers in 885 adult individuals. RESULTS: Linkage was observed between marker D1S184 and BMI (p = 0.0004) and with body fat mass or percentage body fat (p ≤ 0.0003), but no linkage was detected with fat-free mass. Furthermore, significant linkages (p < 0.0001) were achieved with subsamples of sibpairs at both ends of phenotype distributions. Association studies with quantitative transmission disequilibrium tests refined the linkage to a region overlapping the regulator of G-protein signaling 7 (RGS7) gene and extending to immediate upstream gene loci. DISCUSSION: The present study indicates that the QTL on chromosome 1q43 specifically affects total adiposity and provides a genetic mapping framework for the dissection of this adiposity locus.
Subjects
Adipose Tissue/metabolism , Body Composition/genetics , Human Chromosomes Pair 1/genetics , Genetic Linkage , Obesity/genetics , Quantitative Trait Loci , Adolescent , Adult , Aged , Body Mass Index , Cohort Studies , Female , Gene Frequency , Genotype , Health Surveys , Humans , Male , Microsatellite Repeats , Middle Aged , Single Nucleotide Polymorphism , Quebec
ABSTRACT
This paper presents the 12th update of the human obesity gene map, which incorporates published results up to the end of October 2005. Evidence from single-gene mutation obesity cases, Mendelian disorders exhibiting obesity as a clinical feature, transgenic and knockout murine models relevant to obesity, quantitative trait loci (QTL) from animal cross-breeding experiments, association studies with candidate genes, and linkages from genome scans is reviewed. As of October 2005, 176 human obesity cases due to single-gene mutations in 11 different genes have been reported, 50 loci related to Mendelian syndromes relevant to human obesity have been mapped to a genomic region, and causal genes or strong candidates have been identified for most of these syndromes. There are 244 genes that, when mutated or expressed as transgenes in the mouse, result in phenotypes that affect body weight and adiposity. The number of QTLs reported from animal models currently reaches 408. The number of human obesity QTLs derived from genome scans continues to grow, and we now have 253 QTLs for obesity-related phenotypes from 61 genome-wide scans. A total of 52 genomic regions harbor QTLs supported by two or more studies. The number of studies reporting associations between DNA sequence variation in specific genes and obesity phenotypes has also increased considerably, with 426 findings of positive associations with 127 candidate genes. A promising observation is that 22 genes are each supported by at least five positive studies. The obesity gene map shows putative loci on all chromosomes except Y. The electronic version of the map with links to useful publications and relevant sites can be found at http://obesitygene.pbrc.edu.
Subjects
Chromosome Mapping , Obesity/genetics , Animals , Chromosome Mapping/methods , Human Chromosomes , Genetic Linkage , Humans , Mice , Knockout Mice , Transgenic Mice , Quantitative Trait Loci
ABSTRACT
This paper presents the eleventh update of the human obesity gene map, which incorporates published results up to the end of October 2004. Evidence from single-gene mutation obesity cases, Mendelian disorders exhibiting obesity as a clinical feature, transgenic and knockout murine models relevant to obesity, quantitative trait loci (QTLs) from animal cross-breeding experiments, association studies with candidate genes, and linkages from genome scans is reviewed. As of October 2004, 173 human obesity cases due to single-gene mutations in 10 different genes have been reported, 49 loci related to Mendelian syndromes relevant to human obesity have been mapped to a genomic region, and causal genes or strong candidates have been identified for most of these syndromes. There are 166 genes that, when mutated or expressed as transgenes in the mouse, result in phenotypes that affect body weight and adiposity. The number of QTLs reported from animal models currently reaches 221. The number of human obesity QTLs derived from genome scans continues to grow, and we now have 204 QTLs for obesity-related phenotypes from 50 genome-wide scans. A total of 38 genomic regions harbor QTLs replicated among two to four studies. The number of studies reporting associations between DNA sequence variation in specific genes and obesity phenotypes has also increased considerably, with 358 findings of positive associations with 113 candidate genes. Among them, 18 genes are supported by at least five positive studies. The obesity gene map shows putative loci on all chromosomes except Y. Overall, >600 genes, markers, and chromosomal regions have been associated or linked with human obesity phenotypes. The electronic version of the map with links to useful publications and genomic and other relevant sites can be found at http://obesitygene.pbrc.edu.
Subjects
Obesity/genetics , Adipose Tissue/chemistry , Animals , Body Mass Index , Chromosome Mapping , Genetic Crosses , Inborn Genetic Diseases/genetics , Genetic Linkage/genetics , Genetic Variation , Humans , Mice , Knockout Mice , Transgenic Mice , Mutation , Papio , Phenotype , Single Nucleotide Polymorphism/genetics , Quantitative Trait Loci/genetics , Messenger RNA/analysis , Syndrome
ABSTRACT
This is the tenth update of the human obesity gene map, incorporating published results up to the end of October 2003 and continuing the previous format. Evidence from single-gene mutation obesity cases, Mendelian disorders exhibiting obesity as a clinical feature, quantitative trait loci (QTLs) from human genome-wide scans and animal crossbreeding experiments, and association and linkage studies with candidate genes and other markers is reviewed. Transgenic and knockout murine models relevant to obesity are also incorporated (N = 55). As of October 2003, 41 Mendelian syndromes relevant to human obesity have been mapped to a genomic region, and causal genes or strong candidates have been identified for most of these syndromes. QTLs reported from animal models currently number 183. There are 208 human QTLs for obesity phenotypes from genome-wide scans and candidate regions in targeted studies. A total of 35 genomic regions harbor QTLs replicated among two to five studies. Attempts to relate DNA sequence variation in specific genes to obesity phenotypes continue to grow, with 272 studies reporting positive associations with 90 candidate genes. Fifteen such candidate genes are supported by at least five positive studies. The obesity gene map shows putative loci on all chromosomes except Y. Overall, more than 430 genes, markers, and chromosomal regions have been associated or linked with human obesity phenotypes. The electronic version of the map with links to useful sites can be found at http://obesitygene.pbrc.edu.