Predicting Reaction Yields via Supervised Learning.

Zuranski, Andrzej M; Martinez Alvarado, Jesus I; Shields, Benjamin J; Doyle, Abigail G

Numerous disciplines, such as image recognition and language translation, have been revolutionized by using machine learning (ML) to leverage big data. In organic synthesis, providing accurate chemical reactivity predictions with supervised ML could assist chemists with reaction prediction, optimization, and mechanistic interrogation.To apply supervised ML to chemical reactions, one needs to define the object of prediction (e.g., yield, enantioselectivity, solubility, or a recommendation) and represent reactions with descriptive data. Our group's effort has focused on representing chemical reactions using DFT-derived physical features of the reacting molecules and conditions, which serve as features for building supervised ML models.In this Account, we present a review and perspective on three studies conducted by our group where ML models have been employed to predict reaction yield. First, we focus on a small reaction data set where 16 phosphine ligands were evaluated in a single Ni-catalyzed Suzuki-Miyaura cross-coupling reaction, and the reaction yield was modeled with linear regression. In this setting, where the regression complexity is strongly limited by the amount of available data, we emphasize the importance of identifying single features that are directly relevant to reactivity. Next, we focus on models trained on two larger data sets obtained with high-throughput experimentation (HTE). With hundreds to thousands of reactions available, more complex models can be explored, for example, models that algorithmically perform feature selection from a broad set of candidate features. We examine how a variety of ML algorithms model these data sets and how well these models generalize to out-of-sample substrates. Specifically, we compare the ML models that use DFT-based featurization to a baseline model that is obtained with features that carry no physical information, that is, random features, and to a naive non-ML model that averages yields of reactions that share the same conditions and substrate combinations. We find that for only one of the two data sets, DFT-based featurization leads to a significant, although moderate, out-of-sample prediction improvement. The source of this improvement was further isolated to specific features which allowed us to formulate a testable mechanistic hypothesis that was validated experimentally. Finally, we offer remarks on supervised ML model building on HTE data sets focusing on algorithmic improvements in model training.Statistical methods in chemistry have a rich history, but only recently has ML gained widespread attention in reaction development. As the untapped potential of ML is explored, novel tools are likely to arise from future research. Our studies suggest that supervised ML can lead to improved predictions of reaction yield over simpler modeling methods and facilitate mechanistic understanding of reaction dynamics. However, further research and development is required to establish ML as an indispensable tool in reactivity modeling.