Leveraging the partition selection bias to achieve a high-quality clustering of mass spectra.

Silva, André R F; Lima, Diogo B; Kurt, Louise U; Dupré, Mathieu; Chamot-Rooke, Julia; Santos, Marlon D M; Nicolau, Carolina Alves; Valente, Richard Hemmi; Barbosa, Valmir C; Carvalho, Paulo C

Affiliation

Silva ARF; Laboratory of Structural and Computational Proteomics, Carlos Chagas Institute, Fiocruz Paraná, Brazil. Electronic address: andrerfsilva@gmail.com.
Lima DB; Department of Chemical Biology, Leibniz - Forschungsinstitut für Molekulare Pharmakologie (FMP), Berlin, Germany.
Kurt LU; Laboratory of Structural and Computational Proteomics, Carlos Chagas Institute, Fiocruz Paraná, Brazil.
Dupré M; Mass Spectrometry for Biology Unit, CNRS USR 2000, Institut Pasteur, Paris, France.
Chamot-Rooke J; Mass Spectrometry for Biology Unit, CNRS USR 2000, Institut Pasteur, Paris, France.
Santos MDM; Laboratory of Structural and Computational Proteomics, Carlos Chagas Institute, Fiocruz Paraná, Brazil.
Nicolau CA; Laboratory of Toxinology, Oswaldo Cruz Institute, Fiocruz, Rio de Janeiro, Rio de Janeiro, Brazil; Centre de Recherche en Cancérologie et Immunologie Nantes-Angers (CRCINA), Team SOAP, INSERM U1232, Nantes, France.
Valente RH; Laboratory of Toxinology, Oswaldo Cruz Institute, Fiocruz, Rio de Janeiro, Rio de Janeiro, Brazil.
Barbosa VC; Systems Engineering and Computer Science Program, Federal University of Rio de Janeiro, Rio de Janeiro, Brazil. Electronic address: valmir@cos.ufrj.br.
Carvalho PC; Laboratory of Structural and Computational Proteomics, Carlos Chagas Institute, Fiocruz Paraná, Brazil. Electronic address: paulo@pcarvalho.com.

J Proteomics; 245: 104282, 2021 Aug 15.

Article in English | MEDLINE | ID: mdl-34089898

ABSTRACT

In proteomics, the identification of peptides from mass spectral data can be mathematically described as the partitioning of mass spectra into clusters (i.e., groups of spectra derived from the same peptide). The way partitions are validated is just as important as the clustering itself; validation has evolved side by side with the clustering algorithms and has given rise to many partition assessment measures. An assessment measure is said to have a selection bias if, and only if, the probability that a randomly chosen partition scores a high value depends on the number of clusters in the partition. In the context of clustering mass spectra, this might mislead the validation process into favoring clustering algorithms that generate too many (or too few) spectral clusters, regardless of the underlying peptide sequence. A selection bias toward the number of peptides is desirable for proteomics, as it estimates the number of peptides in a complex protein mixture. Here, we introduce an assessment measure that is purposely biased toward the number of peptide ion species. We also introduce a partition assessment framework for proteomics, called the Partition Assessment Tool, and demonstrate its importance by evaluating the performance of eight clustering algorithms on seven proteomics datasets while discussing the trade-offs involved.
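The definition of selection bias given above can be illustrated with a small simulation. The sketch below is not the measure introduced in the article; it uses cluster purity, a common assessment measure, as a stand-in, and it assumes that a "randomly chosen partition" means assigning each spectrum a uniformly random cluster label. The ground truth (120 spectra from 6 peptides) and the function names are hypothetical.

```python
import random
from collections import Counter


def purity(labels_true, labels_pred):
    """Fraction of items that share the majority true label of their cluster."""
    clusters = {}
    for true_label, cluster_id in zip(labels_true, labels_pred):
        clusters.setdefault(cluster_id, Counter())[true_label] += 1
    majority_total = sum(max(counts.values()) for counts in clusters.values())
    return majority_total / len(labels_true)


def mean_purity_of_random_partitions(labels_true, k, trials=200, seed=0):
    """Average purity of random partitions with at most k clusters.

    Assumption: a random partition is drawn by giving every item a uniformly
    random label in range(k), so some of the k clusters may end up empty.
    """
    rng = random.Random(seed)
    n = len(labels_true)
    total = 0.0
    for _ in range(trials):
        labels_pred = [rng.randrange(k) for _ in range(n)]
        total += purity(labels_true, labels_pred)
    return total / trials


# Hypothetical ground truth: 120 spectra, 20 from each of 6 peptides.
labels_true = [i % 6 for i in range(120)]

if __name__ == "__main__":
    for k in (2, 6, 30, 120):
        score = mean_purity_of_random_partitions(labels_true, k)
        print(f"k={k:>3}  mean purity of random partitions = {score:.3f}")
```

Under these assumptions, the mean score of random partitions rises as the number of clusters grows, even though none of the partitions uses any information about the peptides. A measure that behaves this way would tend to reward algorithms that over-split the spectra, which is the kind of selection bias the abstract describes.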

SIGNIFICANCE:

Clustering algorithms are widely adopted in proteomics for several tasks, such as speeding up search engines, generating consensus mass spectra, and aiding in the classification of proteomic profiles. Choosing which algorithm is most fit for the task at hand is not simple, as each algorithm has advantages and disadvantages; furthermore, specifying clustering parameters is also a necessary and fundamental step. For example, one must decide whether to generate "pure clusters" or fewer clusters at the cost of accepting noise. With this as motivation, we verify the performance of several widely adopted algorithms on proteomic datasets and introduce a theoretical framework for drawing conclusions on which approach is suitable for the task at hand.

Subject(s)

Proteomics; Software; Algorithms; Cluster Analysis; Databases, Protein; Selection Bias; Tandem Mass Spectrometry

Keywords

Clustering; Partition assessment tool; Tandem mass spectra

Full text: 1 Collection: 01-international Database: MEDLINE Main subject: Software / Proteomics Type of study: Prognostic_studies Language: En Journal: J Proteomics Journal subject: BIOCHEMISTRY Year: 2021 Document type: Article