SEAOP: a statistical ensemble approach for outlier detection in quantitative proteomics data.

Huang, Jinze; Zhao, Yang; Meng, Bo; Lu, Ao; Wei, Yaoguang; Dong, Lianhua; Fang, Xiang; An, Dong; Dai, Xinhua

Affiliation

Huang J; College of Information and Electrical Engineering, China Agricultural University, Beijing, 100083, China.
Zhao Y; Technology Innovation Center of Mass Spectrometry for State Market Regulation, Center for Advanced Measurement Science, National Institute of Metrology, Beijing 100029, China.
Meng B; Technology Innovation Center of Mass Spectrometry for State Market Regulation, Center for Advanced Measurement Science, National Institute of Metrology, Beijing 100029, China.
Lu A; College of Information and Electrical Engineering, China Agricultural University, Beijing, 100083, China.
Wei Y; College of Information and Electrical Engineering, China Agricultural University, Beijing, 100083, China.
Dong L; Technology Innovation Center of Mass Spectrometry for State Market Regulation, Center for Advanced Measurement Science, National Institute of Metrology, Beijing 100029, China.
Fang X; Technology Innovation Center of Mass Spectrometry for State Market Regulation, Center for Advanced Measurement Science, National Institute of Metrology, Beijing 100029, China.
An D; College of Information and Electrical Engineering, China Agricultural University, Beijing, 100083, China.
Dai X; Technology Innovation Center of Mass Spectrometry for State Market Regulation, Center for Advanced Measurement Science, National Institute of Metrology, Beijing 100029, China.

Brief Bioinform; 25(3), 2024 Mar 27.

Article in En | MEDLINE | ID: mdl-38557674

ABSTRACT

Quality control in quantitative proteomics is a persistent challenge, particularly in identifying and managing outliers. Unsupervised learning models, which rely on data structure rather than predefined labels, offer potential solutions. However, without clear labels, their effectiveness might be compromised. Single models are susceptible to the randomness of parameters and initialization, which can result in a high rate of false positives. Ensemble models, on the other hand, have shown capabilities in effectively mitigating the impacts of such randomness and assisting in accurately detecting true outliers. Therefore, we introduced SEAOP, a Python toolbox that utilizes an ensemble mechanism by integrating multi-round data management and a statistics-based decision pipeline with multiple models. Specifically, SEAOP uses multi-round resampling to create diverse sub-data spaces and employs outlier detection methods to identify candidate outliers in each space. Candidates are then aggregated as confirmed outliers via a chi-square test, adhering to a 95% confidence level, to ensure the precision of the unsupervised approaches. Additionally, SEAOP introduces a visualization strategy, specifically designed to intuitively and effectively display the distribution of both outlier and non-outlier samples. Optimal hyperparameter models of SEAOP for outlier detection were identified by using a gradient-simulated standard dataset and Mann-Kendall trend test. The performance of the SEAOP toolbox was evaluated using three experimental datasets, confirming its reliability and accuracy in handling quantitative proteomics.
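The resample, detect, and aggregate steps described in the abstract can be illustrated with a minimal sketch. This is not the SEAOP implementation or its API. Every name below (`flag_candidates`, `ensemble_outliers`, `null_rate`) is hypothetical, a modified z-score on the median and MAD stands in for the unspecified detection models, and the chi-square setup (a one-degree-of-freedom goodness-of-fit test of each sample's flag count against an assumed chance rate of 0.5) is an assumption, since the abstract does not state how the test is constructed.

```python
import random
import statistics

# Chi-square critical value for 1 degree of freedom at the 95% confidence level.
CHI2_CRIT_95_DF1 = 3.841


def flag_candidates(values, cutoff=3.5):
    """Stand-in detector (assumption): modified z-score from median and MAD."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        # Degenerate subsample: no spread to measure against, flag nothing.
        return [False] * len(values)
    return [0.6745 * abs(v - med) / mad > cutoff for v in values]


def ensemble_outliers(values, rounds=200, fraction=0.8, null_rate=0.5, seed=0):
    """Return indices confirmed as outliers across many resampling rounds."""
    rng = random.Random(seed)
    n = len(values)
    size = min(n, max(3, int(fraction * n)))
    drawn = [0] * n    # how often each sample appeared in a sub-data space
    flagged = [0] * n  # how often it was a candidate outlier there

    for _ in range(rounds):
        idx = rng.sample(range(n), size)  # resample without replacement
        flags = flag_candidates([values[i] for i in idx])
        for i, is_flagged in zip(idx, flags):
            drawn[i] += 1
            flagged[i] += is_flagged

    confirmed = []
    for i in range(n):
        if drawn[i] == 0:
            continue
        # Goodness-of-fit test: observed flagged / not-flagged counts versus
        # the counts expected if the sample were flagged at the null rate.
        exp_flag = drawn[i] * null_rate
        exp_ok = drawn[i] - exp_flag
        obs_ok = drawn[i] - flagged[i]
        chi2 = ((flagged[i] - exp_flag) ** 2 / exp_flag
                + (obs_ok - exp_ok) ** 2 / exp_ok)
        # Confirm only samples flagged significantly MORE often than the null rate.
        if flagged[i] > exp_flag and chi2 > CHI2_CRIT_95_DF1:
            confirmed.append(i)
    return confirmed
```

Calling `ensemble_outliers(intensities)` on a list of per-sample summary values returns the indices of samples that were flagged consistently across rounds. A sample flagged in only a few sub-data spaces is not confirmed, which is the false-positive control that the ensemble step is meant to provide.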

Subjects

Data Management; Proteomics; Reproducibility of Results; Quality Control; Data Interpretation, Statistical

Keywords

Python; ensemble; outlier detection; proteomics; quality control

Full text: 1 Database: MEDLINE Main subject: Proteomics / Data Management Language: En Publication year: 2024 Document type: Article