ABSTRACT
A collection of 90,000 human cDNA clones, generated to increase the fraction of "full-length" cDNAs available, was analyzed by sequence alignment to the human genome assembly. Using this collection, 552 gene models not found in LocusLink, with coding regions of at least 300 bp, were defined. The exon composition proposed for these novel genes showed an average of 4.7 exons per gene. In 20% of the cases, at least half of the exons predicted for new genes coincided with evolutionarily conserved regions defined by sequence comparisons with the pufferfish Tetraodon nigroviridis. Among this subset, CpG islands were observed at the 5' end of 75% of the genes. In-frame stop codons upstream of the initiator ATG were present in 49% of the new genes, and 16% contained a coding region comprising at least 50% of the cDNA sequence. This cDNA resource also provided candidate small protein-coding genes, which are usually not included in genome annotations. In addition, analysis of a sample from this cDNA collection indicates that approximately 380 gene models described in LocusLink could be extended at their 5' end by at least one new exon. Finally, this cDNA resource provided experimental support for annotations based exclusively on predictions, and thus substantially improves the human genome annotation.
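One of the criteria mentioned above, an in-frame stop codon upstream of the initiator ATG, is commonly read as evidence that the 5' end of the open reading frame is complete. The abstract does not describe how the check was performed, so the following is only a minimal sketch under assumptions: the function name `has_upstream_inframe_stop`, the 0-based `atg_pos` argument, and the example sequences are all hypothetical and not taken from the source.

```python
# Hypothetical illustration; not the authors' pipeline.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_upstream_inframe_stop(cdna: str, atg_pos: int) -> bool:
    """Return True if a stop codon lies upstream of, and in the same
    reading frame as, the ATG starting at atg_pos (0-based index)."""
    seq = cdna.upper()
    if seq[atg_pos:atg_pos + 3] != "ATG":
        raise ValueError("atg_pos does not point at an ATG codon")
    # Step backwards from the ATG one codon (3 nt) at a time,
    # which keeps every inspected triplet in the ATG's frame.
    for i in range(atg_pos - 3, -1, -3):
        if seq[i:i + 3] in STOP_CODONS:
            return True
    return False


# Assumed toy sequences: TAA sits two codons upstream of the ATG, in frame.
in_frame = has_upstream_inframe_stop("TAAGCCATGAAATGA", 6)
# Here the TAA starts at index 0 but the ATG is at index 7, so it is out of frame.
out_of_frame = has_upstream_inframe_stop("TAAGGCCATG", 7)
```

In this sketch only triplets at offsets that are multiples of three from the ATG are inspected, so a stop codon in another frame is ignored.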
Subject(s)
5' Untranslated Regions/genetics, DNA, Complementary/genetics, Genome, Human, Adult, Amino Acid Sequence/genetics, Animals, Cell Line, Tumor, DNA, Complementary/classification, DNA, Neoplasm/classification, DNA, Neoplasm/genetics, HeLa Cells/chemistry, HeLa Cells/metabolism, Humans, Jurkat Cells/chemistry, Jurkat Cells/metabolism, Mice, Models, Genetic, Molecular Sequence Data, Open Reading Frames/genetics, Organ Specificity/genetics, Proteins/chemistry, Proteins/genetics, Sequence Alignment/classification, Sequence Alignment/methods, Sequence Homology, Nucleic Acid, Tetraodontiformes/genetics
ABSTRACT
Chromosome 14 is one of five acrocentric chromosomes in the human genome. These chromosomes are characterized by a heterochromatic short arm that contains essentially ribosomal RNA genes, and a euchromatic long arm in which most, if not all, of the protein-coding genes are located. The finished sequence of human chromosome 14 comprises 87,410,661 base pairs, representing 100% of its euchromatic portion, in a single continuous segment covering the entire long arm with no gaps. Two loci of crucial importance for the immune system, as well as more than 60 disease genes, have been localized so far on chromosome 14. We identified 1,050 genes and gene fragments, and 393 pseudogenes. On the basis of comparisons with other vertebrate genomes, we estimate that more than 96% of the chromosome 14 genes have been annotated. From an analysis of the CpG island occurrences, we estimate that 70% of these annotated genes are complete at their 5' end.