RESUMEN
Although sequencing of the human genome will soon be completed, gene identification and annotation remains a challenge. Early estimates suggested that there might be 60,000-100,000 (ref. 1) human genes, but recent analyses of the available data from EST sequencing projects have estimated as few as 45,000 (ref. 2) or as many as 140, 000 (ref. 3) distinct genes. The Chromosome 22 Sequencing Consortium estimated a minimum of 45,000 genes based on their annotation of the complete chromosome, although their data suggests there may be additional genes. The nearly 2,000,000 human ESTs in dbEST provide an important resource for gene identification and genome annotation, but these single-pass sequences must be carefully analysed to remove contaminating sequences, including those from genomic DNA, spurious transcription, and vector and bacterial sequences. We have developed a highly refined and rigorously tested protocol for cleaning, clustering and assembling EST sequences to produce high-fidelity consensus sequences for the represented genes (F.L. et al., manuscript submitted) and used this to create the TIGR Gene Indices-databases of expressed genes for human, mouse, rat and other species (http://www.tigr.org/tdb/tgi.html). Using highly refined and tested algorithms for EST analysis, we have arrived at two independent estimates indicating the human genome contains approximately 120,000 genes.
Asunto(s)
Etiquetas de Secuencia Expresada , Genes , Genoma Humano , Algoritmos , Cromosomas Humanos Par 22/genética , Biología Computacional , Secuencia de Consenso/genética , Bases de Datos Factuales , Humanos , Internet , Mapeo Físico de Cromosoma , Reproducibilidad de los Resultados , Programas InformáticosRESUMEN
Although the list of completed genome sequencing projects has expanded rapidly, sequencing and analysis of expressed sequence tags (ESTs) remain a primary tool for discovery of novel genes in many eukaryotes and a key element in genome annotation. The TIGR Gene Indices (http://www.tigr.org/tdb/tgi) are a collection of 77 species-specific databases that use a highly refined protocol to analyze gene and EST sequences in an attempt to identify and characterize expressed transcripts and to present them on the Web in a user-friendly, consistent fashion. A Gene Index database is constructed for each selected organism by first clustering, then assembling EST and annotated cDNA and gene sequences from GenBank. This process produces a set of unique, high-fidelity virtual transcripts, or tentative consensus (TC) sequences. The TC sequences can be used to provide putative genes with functional annotation, to link the transcripts to genetic and physical maps, to provide links to orthologous and paralogous genes, and as a resource for comparative and functional genomic analysis.
Asunto(s)
Bases de Datos Genéticas , Etiquetas de Secuencia Expresada/química , Genómica , Animales , Secuencia de Bases , Secuencia de Consenso , Bases de Datos Genéticas/tendencias , Células Eucariotas/metabolismo , Genoma , Humanos , Internet , Análisis de Secuencia de ADN , Programas InformáticosRESUMEN
The vast body of Expressed Sequence Tag (EST) data in the public databases provide an important resource for comparative and functional genomics studies and an invaluable tool for the annotation of genomic sequences. We have developed a rigorous protocol for reconstructing the sequences of transcribed genes from EST and gene sequence fragments. A key element in developing this protocol has been the evaluation of a number of sequence assembly programs to determine which most faithfully reproduce transcript sequences from EST data. The TIGR Gene Indices constructed using this protocol for human, mouse, rat and a variety of other plant and animal models have demonstrated their utility in a variety of applications and are freely available to the scientific research community.
Asunto(s)
Etiquetas de Secuencia Expresada , Análisis de Secuencia de ADN/métodos , Algoritmos , Animales , Secuencia de Consenso , Bases de Datos Factuales , Humanos , Familia de Multigenes , RatasRESUMEN
While genome sequencing projects are advancing rapidly, EST sequencing and analysis remains a primary research tool for the identification and categorization of gene sequences in a wide variety of species and an important resource for annotation of genomic sequence. The TIGR Gene Indices (http://www.tigr.org/tdb/tgi. shtml) are a collection of species-specific databases that use a highly refined protocol to analyze EST sequences in an attempt to identify the genes represented by that data and to provide additional information regarding those genes. Gene Indices are constructed by first clustering, then assembling EST and annotated gene sequences from GenBank for the targeted species. This process produces a set of unique, high-fidelity virtual transcripts, or Tentative Consensus (TC) sequences. The TC sequences can be used to provide putative genes with functional annotation, to link the transcripts to mapping and genomic sequence data, to provide links between orthologous and paralogous genes and as a resource for comparative sequence analysis.
Asunto(s)
Bases de Datos Factuales , Etiquetas de Secuencia Expresada , Animales , Secuencia de Bases , Genes/genética , Humanos , Internet , Datos de Secuencia Molecular , Alineación de Secuencia , Homología de Secuencia de Ácido Nucleico , Especificidad de la EspecieRESUMEN
Expressed sequence tag (EST) projects have produced extremely valuable resources for identifying genes affecting phenotypes of interest. A large-scale EST sequencing project for rainbow trout was initiated to identify and functionally annotate as many unique transcripts as possible. Over 45,000 5' ESTs were obtained by sequencing clones from a single normalized library constructed using mRNA from six tissues. The production of this sequence data and creation of a rainbow trout Gene Index eliminating redundancy and providing annotation for these sequences will facilitate research in this species.
Asunto(s)
ADN Complementario/genética , Bases de Datos Genéticas/tendencias , Biblioteca de Genes , Genes/genética , Oncorhynchus mykiss/genética , Análisis de Secuencia de ADN/veterinaria , Animales , Arabidopsis/genética , Bagres/genética , Bovinos , Pollos/genética , Análisis por Conglomerados , ADN de Plantas/genética , Bases de Datos Genéticas/estadística & datos numéricos , Etiquetas de Secuencia Expresada , Genes/fisiología , Genes de Plantas/genética , Genes de Plantas/fisiología , Humanos , Ratones , Datos de Secuencia Molecular , Ratas , Análisis de Secuencia de ADN/estadística & datos numéricos , Porcinos/genética , Pez Cebra/genéticaRESUMEN
Microarray expression analysis is providing unprecedented data on gene expression in humans and mammalian model systems. Although such studies provide a tremendous resource for understanding human disease states, one of the significant challenges is cross-referencing the data derived from different species, across diverse expression analysis platforms, in order to properly derive inferences regarding gene expression and disease state. To address this problem, we have developed RESOURCERER, a microarray-resource annotation and cross-reference database built using the analysis of expressed sequence tags (ESTs) and gene sequences provided by the TIGR Gene Index (TGI) and TIGR Orthologous Gene Alignment (TOGA) databases [now called Eukaryotic Gene Orthologs (EGO)].
Asunto(s)
Bases de Datos Genéticas , Perfilación de la Expresión Génica , Análisis de Secuencia por Matrices de Oligonucleótidos , Animales , Etiquetas de Secuencia Expresada , Humanos , Internet , Ratones , Ratas , Integración de SistemasRESUMEN
An essential component of functional genomics studies is the sequence of DNA expressed in tissues of interest. To provide a resource of bovine-specific expressed sequence data and facilitate this powerful approach in cattle research, four normalized cDNA libraries were produced and arrayed for high-throughput sequencing. The libraries were made with RNA pooled from multiple tissues to increase efficiency of normalization and maximize the number of independent genes for which sequence data were obtained. Target tissues included those with highest likelihood to have impact on production parameters of animal health, growth, reproductive efficiency, and carcass merit. Success of normalization and inter- and intralibrary redundancy were assessed by collecting 6000-23,000 sequences from each of the libraries (68,520 total sequences deposited in GenBank). Sequence comparison and assembly of these sequences was performed in combination with 56,500 other bovine EST sequences present in the GenBank dbEST database to construct a cattle Gene Index (available from The Institute for Genomic Research at http://www.tigr.org/tdb/tgi.shtml). The 124,381 bovine ESTs present in GenBank at the time of the analysis form 16,740 assemblies that are listed and annotated on the Web site. Analysis of individual library sequence data indicates that the pooled-tissue approach was highly effective in preparing libraries for efficient deep sequencing.
Asunto(s)
Biblioteca de Genes , Análisis de Secuencia por Matrices de Oligonucleótidos/métodos , Animales , Bovinos , Bases de Datos Factuales , Etiquetas de Secuencia Expresada , Femenino , Feto , Perfilación de la Expresión Génica/métodos , Especificidad de Órganos/genética , EmbarazoRESUMEN
Approximately 80% of the maize genome comprises highly repetitive sequences interspersed with single-copy, gene-rich sequences, and standard genome sequencing strategies are not readily adaptable to this type of genome. Methodologies that enrich for genic sequences might more rapidly generate useful results from complex genomes. Equivalent numbers of clones from maize selected by techniques called methylation filtering and High C0t selection were sequenced to generate approximately 200,000 reads (approximately 132 megabases), which were assembled into contigs. Combination of the two techniques resulted in a sixfold reduction in the effective genome size and a fourfold increase in the gene identification rate in comparison to a nonenriched library.