A comparative study of evaluating missing value imputation methods in label-free proteomics.

Jin, Liang; Bi, Yingtao; Hu, Chenqi; Qu, Jun; Shen, Shichen; Wang, Xue; Tian, Yu

Affiliations

Jin L; Drug Metabolism and Pharmacokinetics, AbbVie Bioresearch Center, Worcester, MA, 01605, USA.
Bi Y; Discovery and Exploratory Statistics, AbbVie Bioresearch Center, Worcester, MA, 01605, USA.
Hu C; Drug Metabolism and Pharmacokinetics, AbbVie Bioresearch Center, Worcester, MA, 01605, USA.
Qu J; Department of Pharmaceutical Science, SUNY at Buffalo, Buffalo, NY, 14228, USA.
Shen S; Center of Excellence in Bioinformatics & Life Science, Buffalo, NY, 14203, USA.
Wang X; Department of Pharmaceutical Science, SUNY at Buffalo, Buffalo, NY, 14228, USA.
Tian Y; Center of Excellence in Bioinformatics & Life Science, Buffalo, NY, 14203, USA.

Sci Rep; 11(1): 1760, 2021 Jan 19.

Article in English | MEDLINE | ID: mdl-33469060

ABSTRACT

The presence of missing values (MVs) in label-free quantitative proteomics greatly reduces the completeness of data. Imputation has been widely utilized to handle MVs, and selection of the proper method is critical for the accuracy and reliability of imputation. Here we present a comparative study that evaluates the performance of seven popular imputation methods with a large-scale benchmark dataset and an immune cell dataset. Simulated MVs were incorporated into the complete part of each dataset with different combinations of MV rates and missing not at random (MNAR) rates. Normalized root mean square error (NRMSE) was applied to evaluate the accuracy of protein abundances and intergroup protein ratios after imputation. Detection of true positives (TPs) and the false altered-protein discovery rate (FADR) between groups were also compared using the benchmark dataset. Furthermore, the accuracy of handling real MVs was assessed by comparing enriched pathways and signature genes of cell activation after imputing the immune cell dataset. We observed that the accuracy of imputation is primarily affected by the MNAR rate rather than the MV rate, and that downstream analysis can be strongly affected by the choice of imputation method. A random forest-based imputation method consistently outperformed other popular methods by achieving the lowest NRMSE, a high number of TPs with an average FADR below 5%, and the best detection of relevant pathways and signature genes, highlighting it as the most suitable method for label-free proteomics.
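The evaluation scheme described in the abstract (masking known values at a chosen MV rate and MNAR rate, imputing them, then scoring with NRMSE) can be sketched as follows. This is an illustrative sketch only, not the authors' pipeline: the function names `simulate_missing` and `nrmse` are hypothetical, MNAR is approximated by masking the lowest-abundance values, the NRMSE is normalized by the variance of the true masked values (one common definition, which may differ from the one used in the study), and column-mean imputation stands in for the methods that were actually compared.

```python
import numpy as np


def simulate_missing(x, mv_rate, mnar_rate, rng):
    """Return a boolean mask (True = simulated MV) over a complete matrix x.

    Assumption: MNAR is modeled by masking the lowest abundances, and the
    remaining MVs are drawn completely at random.
    """
    n_total = int(round(mv_rate * x.size))     # total number of simulated MVs
    n_mnar = int(round(mnar_rate * n_total))   # share that is abundance-dependent
    n_mcar = n_total - n_mnar                  # share that is random

    flat = x.ravel()
    mask = np.zeros(flat.size, dtype=bool)

    # MNAR part: mask the n_mnar smallest values.
    order = np.argsort(flat, kind="stable")
    mask[order[:n_mnar]] = True

    # Random part: mask uniformly among the values still observed.
    remaining = np.flatnonzero(~mask)
    mask[rng.choice(remaining, size=n_mcar, replace=False)] = True
    return mask.reshape(x.shape)


def nrmse(x_true, x_imputed, mask):
    """NRMSE over the masked entries, normalized by the variance of the truth."""
    diff = x_true[mask] - x_imputed[mask]
    return float(np.sqrt(np.mean(diff ** 2) / np.var(x_true[mask])))


# Toy demonstration on synthetic log-scale abundances (not real data).
rng = np.random.default_rng(0)
abundances = rng.normal(loc=20.0, scale=2.0, size=(200, 6))
mv_mask = simulate_missing(abundances, mv_rate=0.2, mnar_rate=0.5, rng=rng)

# Placeholder imputation: replace each MV with its column mean of observed values.
observed = np.where(mv_mask, np.nan, abundances)
column_means = np.nanmean(observed, axis=0)
imputed = np.where(mv_mask, column_means, observed)

print("NRMSE of column-mean imputation:", nrmse(abundances, imputed, mv_mask))
```

Repeating the last step over a grid of `mv_rate` and `mnar_rate` values, with each imputation method substituted for the column-mean placeholder, gives the kind of comparison the abstract refers to.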

Subjects

Escherichia coli Proteins/analysis; Neoplasm Proteins/analysis; Proteome/analysis; Proteomics/methods; Saccharomyces cerevisiae Proteins/analysis; Algorithms; Data Analysis; Datasets as Topic; Electronic Data Processing; Escherichia coli/metabolism; Humans; Saccharomyces cerevisiae/metabolism

Full text: 1 | Database: MEDLINE | Main subject: Proteome / Escherichia coli Proteins / Saccharomyces cerevisiae Proteins / Proteomics / Neoplasm Proteins | Limits: Humans | Language: En | Year of publication: 2021 | Document type: Article
