ABSTRACT
The past decade has seen a number of impressive developments in predictive chemistry and reaction informatics driven by machine learning applications to computer-aided synthesis planning. While many of these developments have been made even with relatively small, bespoke data sets, advancing the role of AI in the field at scale will require significant improvements in the reporting of reaction data. Currently, the majority of publicly available data is reported in an unstructured format and is heavily imbalanced toward high-yielding reactions, which limits the types of models that can be successfully trained. In this Perspective, we analyze several data curation and sharing initiatives that have seen success in chemistry and molecular biology. We discuss several factors that have contributed to their success and how lessons from these case studies can be applied to reaction data. Finally, we spotlight the Open Reaction Database and summarize key actions the community can take toward making reaction data more findable, accessible, interoperable, and reusable (FAIR), including the use of mandates from funding agencies and publishers.
Subject(s)
Data Curation; Informatics; Databases, Factual; Information Dissemination
ABSTRACT
Chemical reaction data in journal articles, patents, and even electronic laboratory notebooks are currently stored in various formats, often unstructured, which presents a significant barrier to downstream applications, including the training of machine-learning models. We present the Open Reaction Database (ORD), an open-access schema and infrastructure for structuring and sharing organic reaction data, including a centralized data repository. The ORD schema supports conventional and emerging technologies, from benchtop reactions to automated high-throughput experiments and flow chemistry. The data, schema, supporting code, and web-based user interfaces are all publicly available on GitHub. Our vision is that a consistent data representation and infrastructure to support data sharing will enable downstream applications that will greatly improve the state of the art with respect to computer-aided synthesis planning, reaction prediction, and other predictive chemistry tasks.
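To make the idea of a structured reaction record concrete, here is a minimal sketch in Python. The class and field names (`ReactionInput`, `ReactionRecord`, `yield_percent`, and so on) are hypothetical and chosen only for illustration; they are not the actual ORD schema, which should be consulted directly on GitHub.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class ReactionInput:
    # Hypothetical fields for illustration; not the ORD schema.
    smiles: str
    role: str  # e.g. "reactant", "solvent", "catalyst"
    amount_mmol: Optional[float] = None


@dataclass
class ReactionRecord:
    # Hypothetical fields for illustration; not the ORD schema.
    inputs: List[ReactionInput]
    product_smiles: str
    temperature_celsius: Optional[float] = None
    yield_percent: Optional[float] = None  # low yields are recorded too


def to_json(record: ReactionRecord) -> str:
    """Serialize a record to a JSON string with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)


def from_json(text: str) -> ReactionRecord:
    """Rebuild a record from the JSON produced by to_json."""
    raw = json.loads(text)
    raw["inputs"] = [ReactionInput(**item) for item in raw["inputs"]]
    return ReactionRecord(**raw)
```

A record built this way can be written out and read back without loss, which is the property that free-text experimental procedures lack.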
ABSTRACT
Molecular similarity has been effectively applied to many problems in cheminformatics and computational drug discovery, but modern methods can be prohibitively expensive for large-scale applications. The SCISSORS method rapidly approximates measures of pairwise molecular similarity such as ROCS and LINGO Tanimotos, acting as a filter to quickly reduce the size of a problem. We report an in-depth analysis of SCISSORS performance, including a mapping of the SCISSORS error distribution, benchmarking, and investigation of several algorithmic modifications. We show that SCISSORS can accurately predict multiconformer similarity and suggest a method for estimating optimal SCISSORS parameters in a data set-specific manner. These results are a useful resource for researchers seeking to incorporate SCISSORS into molecular similarity applications.
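The filtering role described above can be sketched generically: a cheap approximate score discards most candidates, and an expensive score is computed only for the survivors. The sketch below does not implement SCISSORS, ROCS, or the published LINGO method. The function names are hypothetical, and the character 4-gram multiset Tanimoto is only a simple stand-in for a string-based similarity measure.

```python
from collections import Counter


def ngram_counts(smiles, n=4):
    """Count overlapping length-n substrings of a SMILES string."""
    return Counter(smiles[i:i + n] for i in range(len(smiles) - n + 1))


def multiset_tanimoto(a, b):
    """Tanimoto coefficient on two Counters: |intersection| / |union|."""
    intersection = sum((a & b).values())  # element-wise minimum of counts
    union = sum((a | b).values())         # element-wise maximum of counts
    return intersection / union if union else 0.0


def filter_then_score(query, library, cheap, exact, threshold):
    """Keep items whose cheap score meets the threshold, then rank by exact score.

    `cheap` and `exact` are caller-supplied functions (query, item) -> float.
    """
    survivors = [item for item in library if cheap(query, item) >= threshold]
    return sorted(((exact(query, item), item) for item in survivors), reverse=True)
```

The threshold trades recall for speed: a lower value passes more candidates to the expensive scorer, so it would need to be tuned against the error of whichever approximation is used.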