Measuring the effect of inter-study variability on estimating prediction error.

Ma, Shuyi; Sung, Jaeyun; Magis, Andrew T; Wang, Yuliang; Geman, Donald; Price, Nathan D

Affiliation

Ma S; Institute for Systems Biology, Seattle, Washington, United States of America; Department of Chemical and Biomolecular Engineering, University of Illinois, Urbana, Illinois, United States of America.
Sung J; Institute for Systems Biology, Seattle, Washington, United States of America; Asia Pacific Center for Theoretical Physics, Pohang, Gyeongbuk, Republic of Korea.
Magis AT; Institute for Systems Biology, Seattle, Washington, United States of America; Center for Biophysics and Computational Biology, University of Illinois, Urbana, Illinois, United States of America.
Wang Y; Institute for Systems Biology, Seattle, Washington, United States of America; Sage Bionetworks, Seattle, Washington, United States of America.
Geman D; Institute for Computational Medicine & Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, Maryland, United States of America.
Price ND; Institute for Systems Biology, Seattle, Washington, United States of America; Department of Chemical and Biomolecular Engineering, University of Illinois, Urbana, Illinois, United States of America; Center for Biophysics and Computational Biology, University of Illinois, Urbana, Illinois, United Sta

PLoS One ; 9(10): e110840, 2014.

Article in English | MEDLINE | ID: mdl-25330348

ABSTRACT

BACKGROUND:

The biomarker discovery field is replete with molecular signatures that have not translated into the clinic despite ostensibly promising performance in predicting disease phenotypes. One widely cited reason is lack of classification consistency, largely due to failure to maintain performance from study to study. This failure is widely attributed to variability in data collected for the same phenotype among disparate studies, due to technical factors unrelated to phenotypes (e.g., laboratory settings resulting in "batch-effects") and non-phenotype-associated biological variation in the underlying populations. These sources of variability persist in new data collection technologies.

METHODS:

Here we quantify the impact of these combined "study-effects" on a disease signature's predictive performance by comparing two types of validation methods: ordinary randomized cross-validation (RCV), which extracts random subsets of samples for testing, and inter-study validation (ISV), which excludes an entire study for testing. Whereas RCV hardwires an assumption of training and testing on identically distributed data, this key property is lost in ISV, yielding systematic decreases in performance estimates relative to RCV. Measuring the RCV-ISV difference as a function of the number of studies quantifies the influence of study-effects on performance.
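The difference between the two splitting schemes can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `rcv_splits` and `isv_splits`, the fold count, and the toy study labels are all assumptions made for the example.

```python
import random
from collections import defaultdict


def rcv_splits(n_samples, n_folds=5, seed=0):
    """Randomized cross-validation: shuffle all samples, then cut into folds.

    Study membership is ignored, so train and test samples are drawn
    from the same pooled distribution.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for test in folds:
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test


def isv_splits(study_ids):
    """Inter-study validation: hold out every sample of one study per fold."""
    by_study = defaultdict(list)
    for i, study in enumerate(study_ids):
        by_study[study].append(i)
    for study, test in by_study.items():
        train = [i for i, s in enumerate(study_ids) if s != study]
        yield study, train, test


# Hypothetical labels: the study each of 8 samples came from.
study_ids = ["A", "A", "A", "B", "B", "C", "C", "C"]
rcv = list(rcv_splits(len(study_ids), n_folds=4))
isv = list(isv_splits(study_ids))
```

In the RCV folds a test sample usually has same-study neighbours in the training set, whereas in the ISV folds it never does. A classifier evaluated on both sets of folds would therefore see study-specific variation only in the ISV case.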

RESULTS:

As a case study, we gathered publicly available gene expression data from 1,470 microarray samples of 6 lung phenotypes from 26 independent experimental studies and 769 RNA-seq samples of 2 lung phenotypes from 4 independent studies. We find that the RCV-ISV performance discrepancy is greater in phenotypes with few studies, and that the ISV performance converges toward RCV performance as data from additional studies are incorporated into classification.

CONCLUSIONS:

We show that by examining how fast ISV performance approaches RCV as the number of studies is increased, one can estimate when "sufficient" diversity has been achieved for learning a molecular signature likely to translate without significant loss of accuracy to new clinical settings.
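One way to picture this stopping criterion is sketched below. The abstract does not state a numeric threshold, so the `tolerance` parameter, the function names, and the accuracy values are all assumptions; the numbers are made up for illustration and are not results from the paper.

```python
def rcv_isv_gap(rcv_scores, isv_scores):
    """Return the RCV minus ISV difference for each number of studies."""
    if rcv_scores.keys() != isv_scores.keys():
        raise ValueError("both estimates need the same study counts")
    return {n: rcv_scores[n] - isv_scores[n] for n in sorted(rcv_scores)}


def studies_needed(gap_by_n, tolerance):
    """Smallest study count from which the gap stays within tolerance.

    Walks down from the largest study count and stops at the first
    count whose gap exceeds the tolerance. Returns None if even the
    largest count is outside the tolerance.
    """
    answer = None
    for n in sorted(gap_by_n, reverse=True):
        if gap_by_n[n] <= tolerance:
            answer = n
        else:
            break
    return answer


# Made-up accuracies keyed by number of studies (illustration only).
rcv_scores = {2: 0.90, 3: 0.90, 4: 0.91, 5: 0.91}
isv_scores = {2: 0.70, 3: 0.80, 4: 0.88, 5: 0.90}

gaps = rcv_isv_gap(rcv_scores, isv_scores)
enough = studies_needed(gaps, tolerance=0.05)
```

Requiring the gap to stay small for all larger study counts, rather than dip below the tolerance once, is a design choice of this sketch meant to guard against a single lucky estimate.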

Subjects

Biomarkers, Tumor/biosynthesis; Gene Expression Profiling/methods; Gene Expression Regulation, Neoplastic; Oligonucleotide Array Sequence Analysis/methods; Adenocarcinoma/genetics; Adenocarcinoma/pathology; Adenocarcinoma of Lung; Carcinoma, Squamous Cell/genetics; Carcinoma, Squamous Cell/pathology; Humans; Lung Neoplasms/genetics; Lung Neoplasms/pathology; Phenotype; Pulmonary Disease, Chronic Obstructive/genetics; Pulmonary Disease, Chronic Obstructive/pathology; Sequence Analysis, RNA; Support Vector Machine

Full text: 1 Database: MEDLINE Main subject: Biomarkers, Tumor / Gene Expression Regulation, Neoplastic / Oligonucleotide Array Sequence Analysis / Gene Expression Profiling Language: En Publication year: 2014 Document type: Article
