ABSTRACT
The laboratory mouse is the premier animal model for studying human disease, and thousands of mutants have been identified or produced, most recently through gene-specific mutagenesis approaches. High-throughput strategies by the International Knockout Mouse Consortium (IKMC) are producing mutants for all protein-coding genes. Generating a knockout line involves huge monetary and time costs, so capturing the data describing each mutant and archiving the line for distribution to future researchers are both critical. The European Mouse Mutant Archive (EMMA) is a leading international network infrastructure for archiving and worldwide provision of mouse mutant strains. It operates in collaboration with the other members of the Federation of International Mouse Resources (FIMRe), EMMA being the European component. Additionally, EMMA is one of four repositories involved in the IKMC, and therefore the current figure of 1700 archived lines will rise markedly. The EMMA database gathers and curates extensive data on each line and presents it through a user-friendly website. A BioMart interface allows advanced searching, including integrated querying with other resources, e.g. Ensembl. Other resources are able to display EMMA data by accessing our Distributed Annotation System server. EMMA database access is publicly available at http://www.emmanet.org.
Subject(s)
Computational Biology/methods , Genetic Databases , Nucleic Acid Databases , Animals , Chromosomes , Computational Biology/trends , Protein Databases , Information Storage and Retrieval/methods , Internet , Mice , Inbred C57BL Mice , Knockout Mice , Genetic Models , Tertiary Protein Structure , Software , User-Computer Interface
ABSTRACT
BACKGROUND: The Ensembl project produces updates to its comparative genomics resources with each of its several releases per year. During each release cycle approximately two weeks are allocated to generate all the genomic alignments and the protein homology predictions. The number of calculations required for this task grows approximately quadratically with the number of species. We currently support 50 species in Ensembl and we expect the number to continue to grow in the future. RESULTS: We present eHive, a new fault tolerant distributed processing system initially designed to support comparative genomic analysis, based on blackboard systems, network distributed autonomous agents, dataflow graphs and block-branch diagrams. In the eHive system a MySQL database serves as the central blackboard and the autonomous agent, a Perl script, queries the system and runs jobs as required. The system allows us to define dataflow and branching rules to suit all our production pipelines. We describe the implementation of three pipelines: (1) pairwise whole genome alignments, (2) multiple whole genome alignments and (3) gene trees with protein homology inference. Finally, we show the efficiency of the system in real case scenarios. CONCLUSIONS: eHive allows us to produce computationally demanding results in a reliable and efficient way with minimal supervision and high throughput. Further documentation is available at: http://www.ensembl.org/info/docs/eHive/.
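The blackboard design described above can be illustrated with a minimal sketch. It is not eHive code: the abstract describes a MySQL blackboard and a Perl agent, whereas this example assumes an in-memory SQLite database and a Python worker so that it runs anywhere. The table layout, the `DATAFLOW` rule and all function names are hypothetical. It also shows why the workload grows roughly quadratically: one pairwise alignment job is seeded per unordered species pair, i.e. n(n-1)/2 jobs for n species.

```python
import sqlite3
from itertools import combinations


def create_blackboard():
    """Create the shared job table (SQLite stands in for MySQL here)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE job ("
        " job_id INTEGER PRIMARY KEY,"
        " analysis TEXT NOT NULL,"
        " input TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'READY')"
    )
    return db


def seed_pairwise_jobs(db, species):
    """Post one 'align' job per unordered species pair: n*(n-1)/2 jobs."""
    for a, b in combinations(species, 2):
        db.execute(
            "INSERT INTO job (analysis, input) VALUES ('align', ?)",
            (f"{a}|{b}",),
        )
    db.commit()


# Hypothetical dataflow rule: a finished 'align' job creates a 'filter' job.
DATAFLOW = {"align": "filter"}


def claim_job(db):
    """Take the oldest READY job from the blackboard, or return None."""
    row = db.execute(
        "SELECT job_id, analysis, input FROM job"
        " WHERE status = 'READY' ORDER BY job_id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    db.execute("UPDATE job SET status = 'CLAIMED' WHERE job_id = ?", (row[0],))
    db.commit()
    return row


def run_worker(db, runnables):
    """Autonomous agent loop: claim, run, record status, apply dataflow."""
    processed = 0
    while (job := claim_job(db)) is not None:
        job_id, analysis, payload = job
        try:
            output = runnables[analysis](payload)
            status = "DONE"
        except Exception:
            # A failed job is recorded and the worker carries on.
            output, status = None, "FAILED"
        db.execute("UPDATE job SET status = ? WHERE job_id = ?", (status, job_id))
        next_analysis = DATAFLOW.get(analysis)
        if status == "DONE" and next_analysis is not None:
            db.execute(
                "INSERT INTO job (analysis, input) VALUES (?, ?)",
                (next_analysis, output),
            )
        db.commit()
        processed += 1
    return processed


if __name__ == "__main__":
    board = create_blackboard()
    seed_pairwise_jobs(board, ["human", "mouse", "rat", "dog"])
    handlers = {"align": lambda pair: pair, "filter": lambda pair: pair}
    print(run_worker(board, handlers), "jobs processed")
```

In this sketch the fault tolerance is only that a raised exception marks a single job as FAILED without stopping the worker. Several workers sharing one database would also need an atomic claim step, which is omitted here.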
Subject(s)
Genome , Genomics/methods , Software , Genetic Databases
ABSTRACT
TreeFam (http://www.treefam.org) was developed to provide curated phylogenetic trees for all animal gene families, as well as orthologue and paralogue assignments. Release 4.0 of TreeFam contains curated trees for 1314 families and automatically generated trees for another 14,351 families. We have expanded TreeFam to include 25 fully sequenced animal genomes, as well as four genomes from plant and fungal outgroup species. We have also introduced more accurate approaches for automatically grouping genes into families, for building phylogenetic trees, and for inferring orthologues and paralogues. The user interface for viewing phylogenetic trees and family information has been improved. Furthermore, a new perl API lets users easily extract data from the TreeFam mysql database.
Subject(s)
Genetic Databases , Phylogeny , Animals , Genomics , Internet , Software , User-Computer Interface
ABSTRACT
We have developed a comprehensive gene orientated phylogenetic resource, EnsemblCompara GeneTrees, based on a computational pipeline to handle clustering, multiple alignment, and tree generation, including the handling of large gene families. We developed two novel non-sequence-based metrics of gene tree correctness and benchmarked a number of tree methods. The TreeBeST method from TreeFam shows the best performance in our hands. We also compared this phylogenetic approach to clustering approaches for ortholog prediction, showing a large increase in coverage using the phylogenetic approach. All data are made available in a number of formats and will be kept up to date with the Ensembl project.
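As a rough illustration of how ortholog and paralog calls can be read from a gene tree, the sketch below applies the standard definition: two genes are orthologs if their last common ancestor node is a speciation, and paralogs if it is a duplication. It assumes the internal nodes are already labelled with their event type. It does not reproduce TreeBeST or the EnsemblCompara pipeline, and the tuple-based tree encoding and gene names are hypothetical.

```python
from itertools import product

# A node is either ("leaf", gene_name) or (event, left, right),
# where event is "speciation" or "duplication".


def leaves(node):
    """Return the gene names under a node."""
    if node[0] == "leaf":
        return [node[1]]
    return leaves(node[1]) + leaves(node[2])


def classify_pairs(node, out=None):
    """Label every gene pair by the event at its last common ancestor."""
    if out is None:
        out = {}
    if node[0] == "leaf":
        return out
    event, left, right = node
    label = "ortholog" if event == "speciation" else "paralog"
    # Genes on opposite sides of this node have this node as their LCA.
    for a, b in product(leaves(left), leaves(right)):
        out[frozenset((a, b))] = label
    classify_pairs(left, out)
    classify_pairs(right, out)
    return out


# Toy tree: an ancient duplication followed by a human/mouse speciation
# in each of the two resulting copies.
toy_tree = (
    "duplication",
    ("speciation", ("leaf", "human_A"), ("leaf", "mouse_A")),
    ("speciation", ("leaf", "human_B"), ("leaf", "mouse_B")),
)

if __name__ == "__main__":
    for pair, label in sorted(
        classify_pairs(toy_tree).items(), key=lambda kv: sorted(kv[0])
    ):
        print(sorted(pair), label)
```

In the toy tree, `human_A` and `mouse_A` are orthologs, while `human_A` and `mouse_B` are paralogs because they are separated by the duplication at the root.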
Subject(s)
Algorithms , Computational Biology/methods , Gene Duplication , Phylogeny , Vertebrates/classification , Animals , Humans , Biological Models , Multigene Family , Sequence Homology , Software , Synteny , Vertebrates/genetics
ABSTRACT
A key component of the ongoing ENCODE project involves rigorous comparative sequence analyses for the initially targeted 1% of the human genome. Here, we present orthologous sequence generation, alignment, and evolutionary constraint analyses of 23 mammalian species for all ENCODE targets. Alignments were generated using four different methods; comparisons of these methods reveal large-scale consistency but substantial differences in terms of small genomic rearrangements, sensitivity (sequence coverage), and specificity (alignment accuracy). We describe the quantitative and qualitative trade-offs concomitant with alignment method choice and the levels of technical error that need to be accounted for in applications that require multisequence alignments. Using the generated alignments, we identified constrained regions using three different methods. While the different constraint-detecting methods are in general agreement, there are important discrepancies relating to both the underlying alignments and the specific algorithms. However, by integrating the results across the alignments and constraint-detecting methods, we produced constraint annotations that were found to be robust based on multiple independent measures. Analyses of these annotations illustrate that most classes of experimentally annotated functional elements are enriched for constrained sequences; however, large portions of each class (with the exception of protein-coding sequences) do not overlap constrained regions. The latter elements might not be under primary sequence constraint, might not be constrained across all mammals, or might have expendable molecular functions. Conversely, 40% of the constrained sequences do not overlap any of the functional elements that have been experimentally identified. Together, these findings demonstrate and quantify how many genomic functional elements await basic molecular characterization.
Subject(s)
Molecular Evolution , Human Genome , Mammals/genetics , Open Reading Frames , Phylogeny , Sequence Alignment , Animals , Human Genome Project , Humans
ABSTRACT
The increasing number of complete and nearly complete metazoan genome sequences provides a significant amount of material for large-scale comparative genomic analysis. Finding new effective methods to analyse such enormous datasets has been the object of intense research. Three main areas in comparative genomics have recently shown important developments: whole-genome alignment, gene prediction and regulatory-region prediction. Each of these areas improves the methods of deciphering long genomic sequences and uncovering what lies hidden in them.
Subject(s)
Genomics , Animals , Molecular Evolution , Genomics/statistics & numerical data , Humans , Internet , Sequence Alignment/statistics & numerical data , Software
ABSTRACT
We investigated the serological, epidemiological and molecular aspects of human T-cell lymphotropic virus type I and II (HTLV-I/II) infection in the Amerindian populations of French Guiana by testing 847 sera. No HTLV-II antibodies were detected, but five individuals (0.59%) were seropositive for HTLV-I. Analysis of the nucleotide sequences of 522 bp of the env gene and the complete LTR showed that all of the strains from French Guiana belonged to the cosmopolitan subtype A. The similarities were greater between Amerindian and Creole strains than between Amerindian and Noir-Marron strains or between Creole and Noir-Marron strains. Phylogenetic analysis showed two clusters: one of strains from Amerindians and Creoles, which belong to the transcontinental subgroup, and the other of strains from Noirs-Marrons, belonging to the West African subgroup. Our results suggest that the Amerindian HTLV-I strains are of African origin.
Subject(s)
HTLV-I Antibodies/blood , HTLV-I Infections/ethnology , HTLV-II Infections/ethnology , South American Indians , Adolescent , Adult , Aged , Aged 80 and over , Child , Preschool Child , Female , French Guiana/epidemiology , env Genes , HTLV-I Infections/virology , HTLV-II Antibodies/blood , HTLV-II Infections/virology , Human T-lymphotropic virus 1/genetics , Human T-lymphotropic virus 1/isolation & purification , Human T-lymphotropic virus 2/genetics , Human T-lymphotropic virus 2/isolation & purification , Humans , Male , Middle Aged , Molecular Epidemiology , Molecular Sequence Data , Phylogeny , DNA Sequence Analysis , Seroepidemiologic Studies , Terminal Repeat Sequences/genetics
ABSTRACT
A collection of 90,000 human cDNA clones generated to increase the fraction of "full-length" cDNAs available was analyzed by sequence alignment on the human genome assembly. Five hundred fifty-two gene models not found in LocusLink, with coding regions of at least 300 bp, were defined by using this collection. The exon composition proposed for novel genes showed an average of 4.7 exons per gene. In 20% of the cases, at least half of the exons predicted for new genes coincided with evolutionarily conserved regions defined by sequence comparisons with the pufferfish Tetraodon nigroviridis. Among this subset, CpG islands were observed at the 5' end of 75%. In-frame stop codons upstream of the initiator ATG were present in 49% of the new genes, and 16% contained a coding region comprising at least 50% of the cDNA sequence. This cDNA resource also provided candidate small protein-coding genes, usually not included in genome annotations. In addition, analysis of a sample from this cDNA collection indicates that approximately 380 gene models described in LocusLink could be extended at their 5' end by at least one new exon. Finally, this cDNA resource provided experimental support for annotations based exclusively on predictions, thus representing a resource that substantially improves the human genome annotation.
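Two of the per-clone criteria reported above (an in-frame stop codon upstream of the initiator ATG, and the fraction of the cDNA occupied by the coding region) are simple to compute once the ATG position and the coding-region length are known. The sketch below is an illustration under that assumption, not the authors' procedure. The function names and example sequences are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_upstream_inframe_stop(cdna, atg_pos):
    """True if a stop codon lies upstream of the ATG in the same frame.

    cdna is the sense-strand sequence; atg_pos is the 0-based index of
    the 'A' of the initiator ATG.
    """
    cdna = cdna.upper()
    if cdna[atg_pos:atg_pos + 3] != "ATG":
        raise ValueError("atg_pos does not point at an ATG")
    # Step back one codon at a time, staying in the frame of the ATG.
    for i in range(atg_pos - 3, -1, -3):
        if cdna[i:i + 3] in STOP_CODONS:
            return True
    return False


def coding_fraction(cdna, cds_length):
    """Fraction of the cDNA covered by a coding region of cds_length bases."""
    if not cdna:
        raise ValueError("empty cDNA sequence")
    return cds_length / len(cdna)


if __name__ == "__main__":
    # TAA at index 2 is in frame with the ATG at index 8.
    print(has_upstream_inframe_stop("CCTAAGGGATGGCC", 8))
    # TAA at index 0 is out of frame with the ATG at index 4.
    print(has_upstream_inframe_stop("TAAGATGCC", 4))
```

An upstream in-frame stop is commonly taken as evidence that the open reading frame does not extend further 5', which is why it is useful when judging whether a clone is full-length at its 5' end.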
Subject(s)
5' Untranslated Regions/genetics , Complementary DNA/genetics , Human Genome , Adult , Amino Acid Sequence/genetics , Animals , Tumor Cell Line , Complementary DNA/classification , Neoplasm DNA/classification , Neoplasm DNA/genetics , HeLa Cells/chemistry , HeLa Cells/metabolism , Humans , Jurkat Cells/chemistry , Jurkat Cells/metabolism , Mice , Genetic Models , Molecular Sequence Data , Open Reading Frames/genetics , Organ Specificity/genetics , Proteins/chemistry , Proteins/genetics , Sequence Alignment/classification , Sequence Alignment/methods , Nucleic Acid Sequence Homology , Tetraodontiformes/genetics
ABSTRACT
Chromosome 14 is one of five acrocentric chromosomes in the human genome. These chromosomes are characterized by a heterochromatic short arm that contains essentially ribosomal RNA genes, and a euchromatic long arm in which most, if not all, of the protein-coding genes are located. The finished sequence of human chromosome 14 comprises 87,410,661 base pairs, representing 100% of its euchromatic portion, in a single continuous segment covering the entire long arm with no gaps. Two loci of crucial importance for the immune system, as well as more than 60 disease genes, have been localized so far on chromosome 14. We identified 1,050 genes and gene fragments, and 393 pseudogenes. On the basis of comparisons with other vertebrate genomes, we estimate that more than 96% of the chromosome 14 genes have been annotated. From an analysis of the CpG island occurrences, we estimate that 70% of these annotated genes are complete at their 5' end.
Subject(s)
Human Chromosome Pair 14/genetics , Physical Chromosome Mapping , DNA Sequence Analysis , 5' Untranslated Regions/genetics , Animals , Base Composition , Artificial Chromosomes/genetics , CpG Islands/genetics , Mitochondrial DNA/genetics , Ribosomal DNA/genetics , Genes/genetics , Genomics , Humans , Immunity/genetics , Mice , Microsatellite Repeats/genetics , Molecular Sequence Data , Open Reading Frames/genetics , Pseudogenes/genetics , Reproducibility of Results , Synteny/genetics
ABSTRACT
Ensembl (http://www.ensembl.org/) is a bioinformatics project to organize biological information around the sequences of large genomes. It is a comprehensive source of stable automatic annotation of individual genomes, and of the synteny and orthology relationships between them. It is also a framework for integration of any biological data that can be mapped onto features derived from the genomic sequence. Ensembl is available as an interactive Web site, a set of flat files, and as a complete, portable open source software system for handling genomes. All data are provided without restriction, and code is freely available. Ensembl's aims are to continue to "widen" this biological integration to include other model organisms relevant to understanding human biology as they become available; to "deepen" this integration to provide an ever more seamless linkage between equivalent components in different species; and to provide further classification of functional elements in the genome that have been previously elusive.