Interpretability Is in the Mind of the Beholder: A Causal Framework for Human-Interpretable Representation Learning.

Marconato, Emanuele; Passerini, Andrea; Teso, Stefano

Affiliation

Marconato E; Dipartimento di Ingegneria e Scienza dell'Informazione, University of Trento, 38123 Trento, Italy.
Passerini A; Dipartimento di Informatica, University of Pisa, 56126 Pisa, Italy.
Teso S; Dipartimento di Ingegneria e Scienza dell'Informazione, University of Trento, 38123 Trento, Italy.

Entropy (Basel); 25(12), 2023 Nov 22.

Article in English | MEDLINE | ID: mdl-38136454

ABSTRACT

Research on Explainable Artificial Intelligence has recently started exploring the idea of producing explanations that, rather than being expressed in terms of low-level features, are encoded in terms of interpretable concepts learned from data. How to reliably acquire such concepts is, however, still fundamentally unclear. An agreed-upon notion of concept interpretability is missing, with the result that concepts used by both post hoc explainers and concept-based neural networks are acquired through a variety of mutually incompatible strategies. Critically, most of these neglect the human side of the problem: a representation is understandable only insofar as it can be understood by the human at the receiving end. The key challenge in human-interpretable representation learning (HRL) is how to model and operationalize this human element. In this work, we propose a mathematical framework for acquiring interpretable representations suitable for both post hoc explainers and concept-based neural networks. Our formalization of HRL builds on recent advances in causal representation learning and explicitly models a human stakeholder as an external observer. This allows us to derive a principled notion of alignment between the machine's representation and the vocabulary of concepts understood by the human. In doing so, we link alignment and interpretability through a simple and intuitive name transfer game, and clarify the relationship between alignment and a well-known property of representations, namely disentanglement. We also show that alignment is linked to the issue of undesirable correlations among concepts, also known as concept leakage, and to content-style separation, all through a general information-theoretic reformulation of these properties. Our conceptualization aims to bridge the gap between the human and algorithmic sides of interpretability and establish a stepping stone for new research on human-interpretable representations.

Keywords

alignment; causal abstractions; causal representation learning; concept leakage; disentanglement; explainable AI

Full text: 1 Database: MEDLINE Language: En Journal: Entropy (Basel) Publication year: 2023 Document type: Article Country of affiliation: Italy
