Critiquing Protein Family Classification Models Using Sufficient Input Subsets.

Carter, Brandon; Bileschi, Maxwell; Smith, Jamie; Sanderson, Theo; Bryant, Drew; Belanger, David; Colwell, Lucy J

Affiliation

Carter B; MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, Massachusetts, USA.
Bileschi M; Google Research, Mountain View, California, USA.
Smith J; Google Research, Mountain View, California, USA.
Sanderson T; Google Research, Mountain View, California, USA.
Bryant D; Google Research, Mountain View, California, USA.
Belanger D; Google Research, Mountain View, California, USA.
Colwell LJ; Google Research, Mountain View, California, USA.

J Comput Biol; 27(8): 1219-1231, 2020 Aug.

Article in English | MEDLINE | ID: mdl-31874057

ABSTRACT

In many application domains, neural networks are highly accurate and have been deployed at large scale. However, users often do not have good tools for understanding how these models arrive at their predictions. This has hindered adoption in fields such as the life and medical sciences, where researchers require that models base their decisions on underlying biological phenomena rather than peculiarities of the dataset. We propose a set of methods for critiquing deep learning models and demonstrate their application for protein family classification, a task for which high-accuracy models have considerable potential impact. Our methods extend the Sufficient Input Subsets (SIS) technique, which we use to identify subsets of features in each protein sequence that are alone sufficient for classification. Our suite of tools analyzes these subsets to shed light on the decision-making criteria employed by models trained on this task. These tools show that while deep models may perform classification for biologically relevant reasons, their behavior varies considerably across the choice of network architecture and parameter initialization. While the techniques that we develop are specific to the protein sequence classification task, the approach taken generalizes to a broad set of scientific contexts in which model interpretability is essential.
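The idea of a sufficient input subset can be illustrated with a minimal sketch. It assumes the backward-selection formulation of SIS: positions are masked one at a time, always choosing the one whose removal hurts confidence least, and the subset is then rebuilt from the most important positions until a confidence threshold is met. The classifier `toy_confidence`, the mask token `X`, the motif positions, and the threshold are all hypothetical stand-ins, not the models or settings used in the article.

```python
from typing import Callable, List, Sequence

MASK = "X"  # assumed mask token; not taken from the article


def toy_confidence(seq: Sequence[str]) -> float:
    """Hypothetical classifier: confidence rises with two motif residues."""
    score = 0.0
    if seq[1] == "C":
        score += 0.5
    if seq[4] == "H":
        score += 0.5
    return score


def backward_select(seq: Sequence[str],
                    f: Callable[[Sequence[str]], float]) -> List[int]:
    """Return positions in the order they were masked (least important first)."""
    current = list(seq)
    remaining = list(range(len(seq)))
    order: List[int] = []

    def conf_without(i: int) -> float:
        # Confidence of the current sequence with position i also masked.
        trial = current.copy()
        trial[i] = MASK
        return f(trial)

    while remaining:
        # Mask the position whose removal leaves confidence highest.
        best = max(remaining, key=conf_without)
        current[best] = MASK
        remaining.remove(best)
        order.append(best)
    return order


def sufficient_subset(seq: Sequence[str],
                      f: Callable[[Sequence[str]], float],
                      threshold: float) -> List[int]:
    """Positions that alone push f to the threshold; [] if the full input does not."""
    if f(list(seq)) < threshold:
        return []
    order = backward_select(seq, f)
    masked = [MASK] * len(seq)
    kept: List[int] = []
    # Restore positions from most to least important until f is confident.
    for i in reversed(order):
        masked[i] = seq[i]
        kept.append(i)
        if f(masked) >= threshold:
            break
    return sorted(kept)


subset = sufficient_subset("ACDEHK", toy_confidence, threshold=0.9)
```

In this toy setting the returned positions are the two motif residues, which is the kind of per-sequence evidence the abstract describes analyzing across architectures and initializations.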

Subject(s)

Computational Biology; Models, Biological; Multigene Family/genetics; Proteins/classification; Deep Learning; Humans; Machine Learning; Neural Networks, Computer; Proteins/genetics

Keywords

interpretability; machine learning; model selection; neural networks; protein classification; protein domain

Full text: 1 | Collection: 01-international | Database: MEDLINE | Main subject: Proteins / Multigene Family / Computational Biology / Models, Biological | Study type: Prognostic_studies | Limit: Humans | Language: En | Journal: J Comput Biol | Journal subject: MOLECULAR BIOLOGY / MEDICAL INFORMATICS | Year: 2020 | Document type: Article | Affiliation country: United States