Identification of sample annotation errors in gene expression datasets.

Lohr, Miriam; Hellwig, Birte; Edlund, Karolina; Mattsson, Johanna S M; Botling, Johan; Schmidt, Marcus; Hengstler, Jan G; Micke, Patrick; Rahnenführer, Jörg

Affiliation

Lohr M; Department of Statistics, TU Dortmund University, Vogelpothsweg 87, 44227, Dortmund, Germany.
Hellwig B; Department of Statistics, TU Dortmund University, Vogelpothsweg 87, 44227, Dortmund, Germany.
Edlund K; Leibniz Research Centre for Working Environment and Human Factors (IfADo) at Dortmund TU, Dortmund, Germany.
Mattsson JS; Department of Immunology, Genetics and Pathology, Uppsala University, Uppsala, Sweden.
Botling J; Department of Immunology, Genetics and Pathology, Uppsala University, Uppsala, Sweden.
Schmidt M; Department of Obstetrics and Gynecology, University Hospital, Mainz, Germany.
Hengstler JG; Leibniz Research Centre for Working Environment and Human Factors (IfADo) at Dortmund TU, Dortmund, Germany.
Micke P; Department of Immunology, Genetics and Pathology, Uppsala University, Uppsala, Sweden.
Rahnenführer J; Department of Statistics, TU Dortmund University, Vogelpothsweg 87, 44227, Dortmund, Germany. rahnenfuehrer@statistik.tu-dortmund.de.

Arch Toxicol; 89(12): 2265-72, 2015 Dec.

Article in En | MEDLINE | ID: mdl-26608184

ABSTRACT

The comprehensive transcriptomic analysis of clinically annotated human tissue has found widespread use in oncology, cell biology, immunology, and toxicology. In cancer research, microarray-based gene expression profiling has successfully been applied to subclassify disease entities, predict therapy response, and identify cellular mechanisms. Public accessibility of raw data, together with corresponding information on clinicopathological parameters, offers the opportunity to reuse previously analyzed data and to gain statistical power by combining multiple datasets. However, results and conclusions depend on the reliability of the available information. Here, we propose gene expression-based methods for identifying sample misannotations in public transcriptomic datasets. Sample mix-up can be detected by a classifier that differentiates between samples from male and female patients. Correlation analysis identifies multiple measurements of material from the same sample. The analysis of 45 datasets (including 4913 patients) revealed that erroneous sample annotation, affecting 40% of the analyzed datasets, may be a more widespread phenomenon than previously thought. Removal of erroneously labelled samples may influence the results of the statistical evaluation in some datasets. Our methods may help to identify individual datasets that contain numerous discrepancies and could be routinely included in the statistical analysis of clinical gene expression data.
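
The two checks named in the abstract can be illustrated with a minimal sketch. The abstract does not specify the classifier, the genes, or any thresholds, so everything concrete here is an assumption made for illustration: the marker genes XIST (expressed in female tissue) and RPS4Y1 (Y-linked), the zero cutoff on their log-expression difference, the 0.99 correlation threshold, and all function names.

```python
import numpy as np


def predict_sex(xist, rps4y1):
    """Label samples 'female' or 'male' from two marker genes (log-scale values).

    Assumption: XIST is high in female samples and RPS4Y1 is high in male
    samples, so the sign of their difference separates the two groups.
    """
    xist = np.asarray(xist, dtype=float)
    rps4y1 = np.asarray(rps4y1, dtype=float)
    score = xist - rps4y1
    return np.where(score > 0, "female", "male")


def flag_sex_mismatches(annotated, predicted):
    """Return indices of samples whose annotated sex differs from the prediction."""
    return [i for i, (a, p) in enumerate(zip(annotated, predicted)) if a != p]


def find_duplicate_pairs(expr, threshold=0.99):
    """Return sample index pairs (i, j), i < j, with Pearson correlation > threshold.

    expr is a genes x samples matrix. The threshold is a hypothetical cutoff;
    a real analysis would calibrate it per dataset and platform.
    """
    corr = np.corrcoef(np.asarray(expr, dtype=float), rowvar=False)
    # Keep only the strict upper triangle so each pair is reported once.
    i, j = np.where(np.triu(corr, k=1) > threshold)
    return list(zip(i.tolist(), j.tolist()))


# Toy usage: the third sample is annotated male but looks female.
predicted = predict_sex([10, 3, 9], [3, 10, 4])
suspects = flag_sex_mismatches(["female", "male", "male"], predicted)
```

A fixed cutoff on the marker difference is a simplification. Probe baselines differ between platforms, so in practice the cutoff would need to be chosen per dataset, and samples near the boundary are better reported as ambiguous than forced into a class.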

Subjects

Databases, Genetic/standards; Gene Expression Profiling/methods; Oligonucleotide Array Sequence Analysis/methods; Female; Humans; Male; Reproducibility of Results; Transcriptome

Keywords

Gene expression; Male-female classifier; Microarray; Misannotation; Quality control

Full text: 1 Collections: 01-international Database: MEDLINE Main subject: Oligonucleotide Array Sequence Analysis / Gene Expression Profiling / Databases, Genetic Study type: Diagnostic_studies / Prognostic_studies Limits: Female / Humans / Male Language: En Year of publication: 2015 Document type: Article