ABSTRACT
Infusing "chemical wisdom" should improve the data-driven approaches that rely exclusively on historical synthetic data for automatic retrosynthesis planning. For this purpose, we designed a chemistry-informed molecular graph (CIMG) to describe chemical reactions. The key information most relevant to chemical reactions is integrated in the CIMG: NMR chemical shifts as vertex features, bond dissociation energies as edge features, and solvent/catalyst information as global features. For any given target compound, a product CIMG is generated and used by a graph neural network (GNN) model to choose the reaction template(s) leading to this product. A reactant CIMG is then inferred and used in two GNN models to select an appropriate catalyst and solvent, respectively. Finally, a fourth GNN model compares the two CIMG descriptors to check the plausibility of the proposed reaction. A reaction vector is obtained for every molecule during the training of these models. The chemical wisdom of reaction propensity contained in the pretrained reaction vectors is exploited to autocategorize molecules/reactions and to accelerate Monte Carlo tree search (MCTS) for multistep retrosynthesis planning. Full synthetic routes with recommended catalysts/solvents are predicted efficiently using this CIMG-based approach.
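As a rough illustration of the graph layout described above, the following Python sketch stores per-atom NMR shifts as vertex features, per-bond dissociation energies as edge features, and solvent/catalyst labels as global features. The class name, field names, and all numeric values are hypothetical placeholders chosen for illustration; they are not taken from the paper and do not reflect the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class CIMG:
    """Hypothetical container for a chemistry-informed molecular graph."""
    atoms: List[str]                  # element symbol per vertex
    nmr_shifts: List[float]           # vertex feature: one chemical shift per atom
    bonds: List[Tuple[int, int]]      # edges as pairs of atom indices
    bde: List[float]                  # edge feature: one dissociation energy per bond
    global_features: Dict[str, str] = field(default_factory=dict)  # solvent/catalyst

    def __post_init__(self) -> None:
        # Every vertex and every edge must carry exactly one feature value.
        if len(self.atoms) != len(self.nmr_shifts):
            raise ValueError("need one NMR shift per atom")
        if len(self.bonds) != len(self.bde):
            raise ValueError("need one bond dissociation energy per bond")
        for i, j in self.bonds:
            if not (0 <= i < len(self.atoms) and 0 <= j < len(self.atoms)):
                raise ValueError("bond refers to a missing atom index")

    def neighbors(self, atom: int) -> List[int]:
        """Return the sorted indices of atoms bonded to `atom`."""
        found = []
        for i, j in self.bonds:
            if i == atom:
                found.append(j)
            elif j == atom:
                found.append(i)
        return sorted(found)


# Toy three-atom chain; the numbers are placeholders, not measured values.
toy = CIMG(
    atoms=["C", "C", "O"],
    nmr_shifts=[1.0, 2.0, 0.0],
    bonds=[(0, 1), (1, 2)],
    bde=[10.0, 20.0],
    global_features={"solvent": "placeholder", "catalyst": "placeholder"},
)
```

A GNN would consume such a structure after converting the lists to tensors; that step depends on the chosen framework and is left out here.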
Subject(s)
Machine Learning; Neural Networks, Computer; Catalysis; Synthetic Chemistry Techniques; Monte Carlo Method; Solvents
ABSTRACT
Label-free data mining can efficiently feed large amounts of data from the vast scientific literature into artificial intelligence (AI) processing systems. Here, we demonstrate an unsupervised syntactic distance analysis (SDA) approach that is capable of mining chemical substances, functions, properties, and operations without annotation. This SDA approach was evaluated in several areas of research from the physical sciences and achieved performance in information mining comparable to that of supervised learning, as shown by its satisfactory scores of 0.62-0.72, 0.60-0.82, and 0.86-0.95 in precision, recall, and accuracy, respectively. We also showcase how our approach can assist robotic chemists programmed to perform research focused on double-perovskite colloidal nanocrystals, gold colloidal nanocrystals, oxygen evolution reaction catalysts, and enzyme-like catalysts by designing materials, formulations, and synthesis parameters based on data mined from 1.1 million literature references.
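For reference, the three reported scores follow the standard definitions based on true/false positive and negative counts. The sketch below assumes a binary labelling of mined items as correct or incorrect; the function name and the example counts are hypothetical and are not results from the study.

```python
from typing import Dict


def classification_scores(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Compute precision, recall, and accuracy from confusion-matrix counts."""
    total = tp + fp + fn + tn
    # Guard against empty denominators by returning 0.0 in those cases.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return {"precision": precision, "recall": recall, "accuracy": accuracy}


# Hypothetical counts, used only to show the call.
scores = classification_scores(tp=8, fp=2, fn=2, tn=8)
```

Accuracy counts true negatives while precision and recall do not, which is one way accuracy can be noticeably higher than the other two scores when most candidate items are negatives.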