Ten quick tips for sequence-based prediction of protein properties using machine learning.

Hou, Qingzhen; Waury, Katharina; Gogishvili, Dea; Feenstra, K Anton

Affiliation

Hou Q; Department of Biostatistics, School of Public Health, Cheeloo College of Medicine, Shandong University, Shandong, P. R. China.
Waury K; National Institute of Health Data Science of China, Shandong University, Shandong, P. R. China.
Gogishvili D; Department of Computer Science, Bioinformatics Group, Vrije Universiteit Amsterdam, Amsterdam, the Netherlands.
Feenstra KA; Department of Computer Science, Bioinformatics Group, Vrije Universiteit Amsterdam, Amsterdam, the Netherlands.

PLoS Comput Biol; 18(12): e1010669, 2022 Dec.

Article in English | MEDLINE | ID: mdl-36454728

ABSTRACT

The ubiquitous availability of genome sequencing data explains the popularity of machine learning-based methods for the prediction of protein properties from their amino acid sequences. Over the years, while revising our own work, reading submitted manuscripts as well as published papers, we have noticed several recurring issues, which make some reported findings hard to understand and replicate. We suspect this may be due to biologists being unfamiliar with machine learning methodology, or conversely, machine learning experts may miss some of the knowledge needed to correctly apply their methods to proteins. Here, we aim to bridge this gap for developers of such methods. The most striking issues are linked to a lack of clarity: how were annotations of interest obtained; which benchmark metrics were used; how are positives and negatives defined. Others relate to a lack of rigor: If you sneak in structural information, your method is not sequence-based; if you compare your own model to "state-of-the-art," take the best methods; if you want to conclude that some method is better than another, obtain a significance estimate to support this claim. These, and other issues, we will cover in detail. These points may have seemed obvious to the authors during writing; however, they are not always clear-cut to the readers. We also expect many of these tips to hold for other machine learning-based applications in biology. Therefore, many computational biologists who develop methods in this particular subject will benefit from a concise overview of what to avoid and what to do instead.

Subject(s)

Benchmarking; Machine Learning; Amino Acid Sequence; Chromosome Mapping; Knowledge

Full text: 1 | Collection: 01-international | Database: MEDLINE | Main subject: Benchmarking / Machine Learning | Type of study: Prognostic_studies / Risk_factors_studies | Language: En | Journal: PLoS Comput Biol | Journal subject: BIOLOGY / MEDICAL INFORMATICS | Year: 2022 | Document type: Article | Country of publication: United States
