A quantitative uncertainty metric controls error in neural network-driven chemical discovery.
Janet, Jon Paul; Duan, Chenru; Yang, Tzuhsiung; Nandy, Aditya; Kulik, Heather J.
Affiliations
  • Janet JP; Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA. Email: hjkulik@mit.edu; Tel: +1-617-253-4584.
  • Duan C; Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA. Email: hjkulik@mit.edu; Tel: +1-617-253-4584.
  • Yang T; Department of Chemistry, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.
  • Nandy A; Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA. Email: hjkulik@mit.edu; Tel: +1-617-253-4584.
  • Kulik HJ; Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA. Email: hjkulik@mit.edu; Tel: +1-617-253-4584.
Chem Sci; 10(34): 7913-7922, 2019 Sep 14.
Article in En | MEDLINE | ID: mdl-31588334
ABSTRACT
Machine learning (ML) models, such as artificial neural networks, have emerged as a complement to high-throughput screening, enabling characterization of new compounds in seconds instead of hours. The promise of ML models to enable large-scale chemical space exploration can only be realized if it is straightforward to identify when molecules and materials are outside the model's domain of applicability. Established uncertainty metrics for neural network models are either costly to obtain (e.g., ensemble models) or rely on feature engineering (e.g., feature space distances), and each has limitations in estimating prediction errors for chemical space exploration. We introduce the distance to available data in the latent space of a neural network ML model as a low-cost, quantitative uncertainty metric that works for both inorganic and organic chemistry. The calibrated performance of this approach exceeds widely used uncertainty metrics and is readily applied to models of increasing complexity at no additional cost. Tightening latent distance cutoffs systematically drives down predicted model errors below training errors, thus enabling predictive error control in chemical discovery or identification of useful data points for active learning.
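The core idea of the abstract can be illustrated with a minimal sketch: compute a query point's representation in the network's latent space (here, the final hidden-layer activation), then take its minimum Euclidean distance to the precomputed latent representations of the training set. All names, network sizes, and weights below are hypothetical stand-ins, not the authors' actual model; the paper's calibration of distance cutoffs to error levels is not reproduced here.

```python
import numpy as np

# Hypothetical two-layer network; random weights stand in for a trained model.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 16))   # input features -> hidden layer
W2 = rng.normal(size=(16, 4))   # hidden layer -> latent space
w_out = rng.normal(size=4)      # latent space -> scalar property prediction

def latent(x):
    """Map a feature vector to its latent-space representation."""
    return np.tanh(np.tanh(x @ W1) @ W2)

def predict(x):
    """Scalar property prediction from the latent representation."""
    return latent(x) @ w_out

# Precompute latent representations of the (synthetic) training set once.
X_train = rng.normal(size=(100, 8))
train_latents = np.array([latent(x) for x in X_train])

def latent_distance(x):
    """Minimum Euclidean distance from x's latent point to any training latent."""
    return np.linalg.norm(train_latents - latent(x), axis=1).min()

# A query far from the training data in latent space gets a large distance,
# flagging it as outside the model's domain of applicability.
x_new = rng.normal(size=8)
print(predict(x_new), latent_distance(x_new))
```

In this scheme, tightening the distance cutoff corresponds to trusting only predictions whose latent distance falls below a threshold calibrated against held-out errors, which is what lets predicted model errors be driven down systematically.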

Full text: 1 Collections: 01-international Database: MEDLINE Study type: Prognostic_studies Language: En Journal: Chem Sci Publication year: 2019 Document type: Article
