Robustifying genomic classifiers to batch effects via ensemble learning.
Zhang, Yuqing; Patil, Prasad; Johnson, W Evan; Parmigiani, Giovanni.
Affiliation
  • Zhang Y; Clinical Bioinformatics, Gilead Sciences, Inc., Foster City, CA 94404, USA.
  • Patil P; Department of Biostatistics, Boston University School of Public Health, Boston, MA 02118, USA.
  • Johnson WE; Department of Biostatistics, Boston University School of Public Health, Boston, MA 02118, USA.
  • Parmigiani G; Division of Computational Biomedicine, Boston University School of Medicine, Boston, MA 02118, USA.
Bioinformatics ; 37(11): 1521-1527, 2021 07 12.
Article in English | MEDLINE | ID: mdl-33245114
ABSTRACT
MOTIVATION:

Genomic data are often produced in batches due to practical restrictions, which may lead to unwanted variation in data caused by discrepancies across batches. Such 'batch effects' often have a negative impact on downstream biological analysis and need careful consideration. In practice, batch effects are usually addressed by specifically designed software, which merges the data from different batches, then estimates batch effects and removes them from the data. Here, we focus on classification and prediction problems, and propose a different strategy based on ensemble learning. We first develop prediction models within each batch, then integrate them through ensemble weighting methods.
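The ensemble strategy described above (one model per batch, then a weighted combination) can be illustrated with a minimal sketch. The abstract does not say which learners or weighting schemes are used, so everything here is an assumption made for illustration: the simulated data, the nearest-centroid learner, the cross-batch accuracy weights, and all function names are hypothetical and are not taken from the authors' repository.

```python
import math
import random

def make_batch(rng, n, shift, n_features=5):
    """Simulate one batch. Class-1 samples have mean 1 on every feature,
    class-0 samples have mean 0, and every sample in the batch receives the
    same additive shift (a toy stand-in for a batch effect)."""
    X, y = [], []
    for i in range(n):
        label = i % 2
        X.append([rng.gauss(label, 1.0) + shift for _ in range(n_features)])
        y.append(label)
    return X, y

def fit_centroids(X, y):
    """Per-batch learner (assumed): the mean feature vector of each class."""
    cents = {}
    for c in (0, 1):
        rows = [x for x, lab in zip(X, y) if lab == c]
        cents[c] = [sum(col) / len(rows) for col in zip(*rows)]
    return cents

def score(cents, x):
    """Score in (0, 1); above 0.5 means x is closer to the class-1 centroid."""
    d0 = sum((a - b) ** 2 for a, b in zip(x, cents[0]))
    d1 = sum((a - b) ** 2 for a, b in zip(x, cents[1]))
    z = max(-50.0, min(50.0, d0 - d1))  # clip to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def accuracy(cents, X, y):
    hits = sum((score(cents, x) > 0.5) == bool(lab) for x, lab in zip(X, y))
    return hits / len(y)

def ensemble_weights(models, batches):
    """Assumed weighting scheme: each batch's model is weighted by its mean
    accuracy on the other batches, normalized so the weights sum to 1."""
    raw = []
    for i, m in enumerate(models):
        accs = [accuracy(m, X, y) for j, (X, y) in enumerate(batches) if j != i]
        raw.append(sum(accs) / len(accs))
    total = sum(raw)
    return [r / total for r in raw]

def ensemble_score(models, weights, x):
    """Weighted average of the per-batch model scores."""
    return sum(w * score(m, x) for m, w in zip(models, weights))

rng = random.Random(0)
# Three simulated batches with different shifts; no merging or adjustment is done.
batches = [make_batch(rng, 40, shift) for shift in (0.0, 0.5, -0.5)]
models = [fit_centroids(X, y) for X, y in batches]
weights = ensemble_weights(models, batches)

# An independent validation set with its own shift.
X_val, y_val = make_batch(rng, 40, 0.2)
val_scores = [ensemble_score(models, weights, x) for x in X_val]
```

The point of the sketch is only the structure: no batch is adjusted or merged, each model sees a single batch, and heterogeneity is handled at the prediction-combination step.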

RESULTS:

We provide a systematic comparison between these two strategies using studies targeting diverse populations infected with tuberculosis. In one study, we simulated increasing levels of heterogeneity across random subsets of the study, which we treated as simulated batches. We then used the two methods to develop a genomic classifier for the binary indicator of disease status. We evaluated the accuracy of prediction in another independent study targeting a different population cohort. We observed that in independent validation, while merging followed by batch adjustment provides better discrimination at low levels of heterogeneity, our ensemble learning strategy achieves more robust performance, especially at high severity of batch effects. These observations provide practical guidelines for handling batch effects in the development and evaluation of genomic classifiers.

AVAILABILITY AND IMPLEMENTATION:

The data underlying this article are available in the article and in its online supplementary material. Processed data are available in the GitHub repository with implementation code, at https://github.com/zhangyuqing/bea_ensemble.

SUPPLEMENTARY INFORMATION:

Supplementary data are available at Bioinformatics online.

Full text: 1 Collections: 01-international Database: MEDLINE Main subject: Genome / Genomics Type of study: Guideline / Prognostic_studies Limits: Humans Language: En Journal: Bioinformatics Journal subject: MEDICAL INFORMATICS Year of publication: 2021 Document type: Article Country of affiliation: United States
