Loose ends: almost one in five human genes still have unresolved coding status.

Abascal, Federico; Juan, David; Jungreis, Irwin; Kellis, Manolis; Martinez, Laura; Rigau, Maria; Rodriguez, Jose Manuel; Vazquez, Jesus; Tress, Michael L

Affiliations

Abascal F; Wellcome Trust Sanger Institute, Hinxton CB10 1SA, Cambridgeshire, UK.
Juan D; Comparative Genomics Lab, Instituto de Biologia Evolutiva, Universitat Pompeu Fabra, Barcelona, Spain.
Jungreis I; MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA and Broad Institute of MIT and Harvard, Cambridge, MA, USA.
Martinez L; Bioinformatics Unit, Spanish National Cancer Research Centre, Madrid, Spain.
Rigau M; Computational Biology Life Sciences Group, Barcelona Supercomputing Center, Barcelona, Spain.
Rodriguez JM; Cardiovascular Proteomics Laboratory, Centro Nacional de Investigaciones Cardiovasculares, Madrid, Spain.
Vazquez J; Cardiovascular Proteomics Laboratory, Centro Nacional de Investigaciones Cardiovasculares, Madrid, Spain.
Tress ML; Bioinformatics Unit, Spanish National Cancer Research Centre, Madrid, Spain.

Nucleic Acids Res; 46(14): 7070-7084, 2018 Aug 21.

Article in English | MEDLINE | ID: mdl-29982784

ABSTRACT

Seventeen years after the sequencing of the human genome, the human proteome is still under revision. One in eight of the 22 210 coding genes listed by the Ensembl/GENCODE, RefSeq and UniProtKB reference databases are annotated differently across the three sets. We have carried out an in-depth investigation of the 2764 genes classified as coding by one or more sets of manual curators and not coding by others. Data from large-scale genetic variation analyses suggest that most are not under protein-like purifying selection and so are unlikely to code for functional proteins. A further 1470 genes annotated as coding in all three reference sets have characteristics that are typical of non-coding genes or pseudogenes. These potential non-coding genes also appear to be undergoing neutral evolution and have considerably less supporting transcript and protein evidence than other coding genes. We believe that the three reference databases currently overestimate the number of human coding genes by at least 2000, complicating and adding noise to large-scale biomedical experiments. Determining which potential non-coding genes do not code for proteins is a difficult but vitally important task since the human reference proteome is a fundamental pillar of most basic research and supports almost all large-scale biomedical projects.
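The proportions quoted in the title and abstract can be checked with a few lines of arithmetic. The sketch below uses only the three counts given in the abstract; the variable and function names are illustrative, and reading the title's "almost one in five" as the sum of the two groups (2764 disputed genes plus 1470 atypical genes) is an assumption, not something the record states explicitly.

```python
# Counts taken from the abstract.
TOTAL_CODING_GENES = 22210   # coding genes listed across the three reference sets
DISPUTED = 2764              # coding in some reference sets, not coding in others
ATYPICAL = 1470              # coding in all three sets, but with non-coding-like features


def share(count, total=TOTAL_CODING_GENES):
    """Return count as a fraction of the total number of listed coding genes."""
    return count / total


disputed_share = share(DISPUTED)
# Assumption: the title's figure combines the disputed and atypical groups.
combined_share = share(DISPUTED + ATYPICAL)

print(f"Disputed genes: {disputed_share:.1%} (one in eight is 12.5%)")
print(f"Disputed plus atypical genes: {combined_share:.1%} (one in five is 20.0%)")
```

Under that assumption, the disputed genes come to roughly 12.4% of the total and the combined group to roughly 19.1%, which matches the "one in eight" and "almost one in five" wording.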

Subjects

Genes; Antibodies; DNA Copy Number Variations; Genetic Variation; Genome, Human; Humans; Molecular Sequence Annotation; Proteins/genetics; Proteins/immunology; Proteins/metabolism; Pseudogenes

Full text: 1 | Collections: 01-international | Database: MEDLINE | Main subject: Genes | Limits: Humans | Language: En | Journal: Nucleic Acids Res | Publication year: 2018 | Document type: Article | Affiliation country: United Kingdom
