Ten quick tips for sequence-based prediction of protein properties using machine learning.

Hou, Qingzhen; Waury, Katharina; Gogishvili, Dea; Feenstra, K Anton

Affiliation

Hou Q; Department of Biostatistics, School of Public Health, Cheeloo College of Medicine, Shandong University, Shandong, P. R. China.
Waury K; National Institute of Health Data Science of China, Shandong University, Shandong, P. R. China.
Gogishvili D; Department of Computer Science, Bioinformatics Group, Vrije Universiteit Amsterdam, Amsterdam, the Netherlands.
Feenstra KA; Department of Computer Science, Bioinformatics Group, Vrije Universiteit Amsterdam, Amsterdam, the Netherlands.

PLoS Comput Biol; 18(12): e1010669, December 2022.

Article in English | MEDLINE | ID: mdl-36454728

ABSTRACT

The ubiquitous availability of genome sequencing data explains the popularity of machine learning-based methods for the prediction of protein properties from their amino acid sequences. Over the years, while revising our own work, reading submitted manuscripts as well as published papers, we have noticed several recurring issues, which make some reported findings hard to understand and replicate. We suspect this may be due to biologists being unfamiliar with machine learning methodology, or conversely, machine learning experts may miss some of the knowledge needed to correctly apply their methods to proteins. Here, we aim to bridge this gap for developers of such methods. The most striking issues are linked to a lack of clarity: how were annotations of interest obtained; which benchmark metrics were used; how are positives and negatives defined? Others relate to a lack of rigor: if you sneak in structural information, your method is not sequence-based; if you compare your own model to "state-of-the-art," take the best methods; if you want to conclude that some method is better than another, obtain a significance estimate to support this claim. These, and other issues, we will cover in detail. These points may have seemed obvious to the authors during writing; however, they are not always clear-cut to the readers. We also expect many of these tips to hold for other machine learning-based applications in biology. Therefore, many computational biologists who develop methods in this particular subject will benefit from a concise overview of what to avoid and what to do instead.
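As an illustration of the point about obtaining a significance estimate before claiming that one method beats another, the following is a minimal sketch of a paired bootstrap comparison. It is not taken from the article: the function names `mcc` and `paired_bootstrap`, the choice of the Matthews correlation coefficient as the metric, and the number of resamples are all assumptions made for this example. It also assumes that each test item is an independent unit; for per-residue predictions, resampling whole proteins rather than single residues would likely be more appropriate.

```python
import random


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0/1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = ((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5
    # Convention: return 0 when the coefficient is undefined.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def paired_bootstrap(y_true, pred_a, pred_b, metric=mcc,
                     n_resamples=2000, seed=0):
    """Compare two methods on the same test set by paired resampling.

    Returns the observed metric difference (A minus B) and the fraction
    of resamples in which A does not beat B, which serves as a rough
    one-sided p-value estimate (hypothetical helper, not from the paper).
    """
    rng = random.Random(seed)
    n = len(y_true)
    observed = metric(y_true, pred_a) - metric(y_true, pred_b)
    not_better = 0
    for _ in range(n_resamples):
        # Draw the same resampled items for both methods (paired design).
        idx = [rng.randrange(n) for _ in range(n)]
        yt = [y_true[i] for i in idx]
        a = [pred_a[i] for i in idx]
        b = [pred_b[i] for i in idx]
        if metric(yt, a) - metric(yt, b) <= 0:
            not_better += 1
    return observed, not_better / n_resamples
```

A call such as `paired_bootstrap(labels, predictions_a, predictions_b)` returns the observed difference together with the estimated p-value; a small p-value suggests that the advantage of method A is unlikely to be an artifact of the particular test set.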

Subjects

Benchmarking; Machine Learning; Amino Acid Sequence; Chromosome Mapping; Knowledge

Full text: 1 Database: MEDLINE Main subject: Benchmarking / Machine Learning Type of study: Prognostic_studies / Risk_factors_studies Language: En Journal: PLoS Comput Biol Journal subject: BIOLOGY / MEDICAL INFORMATICS Year of publication: 2022 Document type: Article