ABSTRACT
Sustainability is here to stay. As businesses migrate away from fossil fuels and toward renewable sources, chemistry will play a crucial role in bringing the economy to a point of net-zero emissions. In fact, chemistry has always been at the forefront of developing new or enhanced materials to fulfill societal demands, resulting in goods with appropriate physical or chemical qualities. Today, the main focus is on developing goods and materials that have a less negative impact on the environment, which may include (but is not limited to) leaving behind smaller carbon footprints. Integrating data and AI can speed up the discovery of new eco-friendly materials, predict environmental impact factors for early assessment of new technological integration, enhance plant design and management, and optimize processes to reduce costs and improve efficiency, all of which contribute to a more rapid transition to a sustainable system. In this perspective, we outline how AI technologies have been employed so far: first, to estimate sustainability metrics and, second, to design more sustainable chemical processes.
ABSTRACT
Computer-aided synthesis design, automation, and analytics assisted by machine learning are promising resources in the researcher's toolkit. Each component may relieve the chemist of routine tasks, provide valuable insights from data, and enable more informed experimental design. Herein, we highlight selected works in the field and discuss the different approaches and the problems to which they may apply. We emphasize that there are currently few tools with a low barrier to entry for non-experts, which may limit widespread integration into the researcher's workflow.
ABSTRACT
The RXN for Chemistry project, initiated by IBM Research Europe - Zurich in 2017, aimed to develop a series of digital assets using machine learning techniques to promote the use of data-driven methodologies in synthetic organic chemistry. This research adopts an innovative concept by treating chemical reaction data as language records and casting the prediction of a synthetic organic chemistry reaction as a translation task between precursor and product languages. Over the years, the IBM Research team has successfully developed language models for various applications, including forward reaction prediction, retrosynthesis, reaction classification, atom-mapping, procedure extraction from text, and inference of experimental protocols, together with their use in programming commercial automation hardware to implement an autonomous chemical laboratory. Furthermore, the project has recently incorporated biochemical data in training models for greener and more sustainable chemical reactions. The remarkable ease of constructing prediction models and continually enhancing them through data augmentation with minimal human intervention has led to the widespread adoption of language model technologies, facilitating the digitalization of chemistry in diverse industrial sectors such as pharmaceuticals and chemical manufacturing. This manuscript provides a concise overview of the scientific components that contributed to the prestigious Sandmeyer Award in 2022.
ABSTRACT
CP2K is an open source electronic structure and molecular dynamics software package to perform atomistic simulations of solid-state, liquid, molecular, and biological systems. It is especially aimed at massively parallel and linear-scaling electronic structure methods and state-of-the-art ab initio molecular dynamics simulations. Excellent performance for electronic structure calculations is achieved using novel algorithms implemented for modern high-performance computing systems. This review revisits the main capabilities of CP2K to perform efficient and accurate electronic structure simulations. The emphasis is put on density functional theory and multiple post-Hartree-Fock methods using the Gaussian and plane wave approach and its augmented all-electron extension.
ABSTRACT
The synthesis of organic compounds, which is central to many areas such as drug discovery, material synthesis and biomolecular chemistry, requires chemists to have years of knowledge and experience. The development of technologies with the potential to learn and support experts in the design of synthetic routes is a half-century-old challenge with an interesting revival in the last decade. In fact, the renewed interest in artificial intelligence (AI), driven mainly by data availability, is profoundly changing the landscape of computer-aided chemical reaction prediction and retrosynthetic analysis. In this article, we briefly review different approaches to predict forward reactions and retrosynthesis, with a strong focus on data-driven ones. While data-driven technologies still need to demonstrate their full potential compared to expert rule-based systems in synthetic chemistry, the acceleration experienced in the last decade is a convincing sign that where we use software today, there will be AI tomorrow. This revolution will help and empower bench chemists, driving the transformation of chemistry towards a high-tech business over the next decades.
ABSTRACT
The present work suggests the use of a mixed water-based electrolyte containing sulfuric and phosphoric acid for both negative and positive electrolytes of a vanadium redox flow battery. Computational and experimental investigations reveal insights into the possible interactions between the vanadium ions in all oxidation states and sulphate, bisulphate, and dihydrogen phosphate ions and phosphoric acid. In situ cycling experiments and ion-specific electrochemical impedance measurements confirmed a significant lowering of the charge-transfer resistance for the reduction of V(III) ions and, consequently, an increase of the voltaic efficiency associated with the negative side of the battery. This increase of performance is attributable to the complexation of this oxidation state by phosphoric acid. So far, mixed acids have mainly been discussed with the focus on V(V) solubility. In this work, we rationalize the impact of the mixed acids on the electrochemical efficiency, opening new strategies for improving the cycling performance with ionic additives.
ABSTRACT
The implementation and validation of the adaptive buffered force (AdBF) quantum-mechanics/molecular-mechanics (QM/MM) method in two popular packages, CP2K and AMBER, are presented. The implementations build on the existing QM/MM functionality in each code, extending it to allow for redefinition of the QM and MM regions during the simulation and reducing QM-MM interface errors by discarding forces near the boundary according to the buffered force-mixing approach. New adaptive thermostats, needed by force-mixing methods, are also implemented. Different variants of the method are benchmarked by simulating the structure of bulk water, water autoprotolysis in the presence of zinc, and dimethyl-phosphate hydrolysis, using various semiempirical Hamiltonians and density functional theory as the QM model. It is shown that with suitable parameters, based on force convergence tests, the AdBF QM/MM scheme can provide an accurate approximation of the structure in the dynamical QM region matching the corresponding fully QM simulations, as well as reproducing the correct energetics in all cases. Adaptive unbuffered force-mixing and adaptive conventional QM/MM methods also provide reasonable results for some systems, but are more likely to suffer from instabilities and inaccuracies.
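The force-selection step described in this abstract can be illustrated with a minimal sketch. It assumes that the QM calculation is run on the core QM atoms plus a surrounding buffer, and that only the QM forces on core atoms are kept while every other atom keeps its MM force. The function name `mix_forces` and the data layout are hypothetical and do not reflect the CP2K or AMBER implementations.

```python
from typing import Dict, Iterable, List, Sequence, Tuple

Vector = Tuple[float, float, float]


def mix_forces(
    mm_forces: Sequence[Vector],
    qm_forces: Dict[int, Vector],
    core_atoms: Iterable[int],
) -> List[Vector]:
    """Return per-atom forces for a buffered force-mixing step (illustrative only).

    mm_forces  : MM force on every atom in the system.
    qm_forces  : QM force on each atom of the extended region (core + buffer),
                 keyed by atom index. This layout is an assumption of the sketch.
    core_atoms : indices of atoms whose dynamics should follow QM forces.
    """
    mixed = list(mm_forces)           # start from MM forces for all atoms
    for i in core_atoms:
        mixed[i] = qm_forces[i]       # core atoms take the QM force
    # QM forces on buffer atoms are never copied, so they are discarded.
    return mixed


# Toy system: atom 0 is core, atom 1 is buffer, atoms 2 and 3 are pure MM.
mm = [(1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
qm = {0: (10.0, 0.0, 0.0), 1: (20.0, 0.0, 0.0)}
mixed = mix_forces(mm, qm, core_atoms=[0])
```

In the toy system, the buffer atom (index 1) keeps its MM force even though a QM force was computed for it. Forces combined this way are in general not the gradient of a single energy function, which is consistent with the abstract's remark that force-mixing methods need dedicated thermostats.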
Subjects
Software, Computer Simulation, Hydrolysis, Molecular Structure, Organophosphorus Compounds/chemistry, Quantum Theory, Thermometers, Water/chemistry, Zinc/chemistry
ABSTRACT
Aminoglycosides containing a 2,3-trans carbamate group easily undergo anomerization from the 1,2-trans glycoside to the 1,2-cis isomer under mild acidic conditions. The N-substituent of the carbamate has a significant effect on the anomerization reaction; in particular, an N-acetyl group facilitated rapid and complete α-anomerization. The differences in reactivity due to the various N-substituents were supported by the results of DFT calculations; the orientation of the acetyl carbonyl group close to the anomeric position was found to contribute significantly to the directing of the anomerization reaction. By exploiting this reaction, oligoaminoglycosides with multiple 1,2-cis glycosidic bonds were generated from 1,2-trans glycosides in a one-step process.
Subjects
Aminoglycosides/chemistry, Glucosamine/chemistry, Aminoglycosides/chemical synthesis, Cyclization, Isomerism
ABSTRACT
Recent advances in language modeling have had a tremendous impact on how we handle sequential data in science. Language architectures have emerged as a hotbed of innovation and creativity in natural language processing over the last decade, and have since gained prominence in modeling proteins and chemical processes, elucidating structural relationships from textual/sequential data. Surprisingly, some of these relationships refer to three-dimensional structural features, raising important questions on the dimensionality of the information encoded within sequential data. Here, we demonstrate that applying a language model architecture, without supervision, to a language representation of bio-catalyzed chemical reactions can capture the signal underlying the atomic interactions at the substrate-binding site. This allows us to identify the three-dimensional binding site position in unknown protein sequences. The language representation comprises a reaction-simplified molecular-input line-entry system (SMILES) for substrate and products, and amino acid sequence information for the enzyme. This approach can recover, with no supervision, 52.13% of the binding site when considering co-crystallized substrate-enzyme structures as ground truth, vastly outperforming other attention-based models.
ABSTRACT
Reaching optimal reaction conditions is crucial to achieve high yields, minimal by-products, and environmentally sustainable chemical reactions. With the recent rise of artificial intelligence, there has been a shift from traditional Edisonian trial-and-error optimization to data-driven and automated approaches, which offer significant advantages. Here, we showcase the capabilities of an integrated platform; we conducted simultaneous optimizations of four different terminal alkynes and two reaction routes using an automation platform combined with a Bayesian optimization platform. Remarkably, we achieved a conversion rate of over 80% for all four substrates in 23 experiments, covering ca. 0.2% of the combinatorial space. Further analysis allowed us to identify the influence of different reaction parameters on the reaction outcomes, demonstrating the potential for expedited reaction condition optimization and the prospect of more efficient chemical processes in the future.
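As a rough check of the coverage figure quoted in this abstract, the sketch below back-calculates the size of the combinatorial space. It assumes that the 23 experiments are the total count and that "ca. 0.2%" means a fraction of about 0.002; the abstract does not state the space size, so the result is only an order-of-magnitude estimate. The function name is hypothetical.

```python
def implied_space_size(n_experiments: int, fraction_covered: float) -> float:
    """Estimate the number of candidate conditions from the covered fraction."""
    if not 0.0 < fraction_covered <= 1.0:
        raise ValueError("fraction_covered must be in (0, 1]")
    return n_experiments / fraction_covered


# Assumption: 23 experiments in total, covering about 0.2% of the space.
estimated_size = round(implied_space_size(23, 0.002))
print(estimated_size)
```

Under these assumptions the space holds on the order of ten thousand candidate conditions, which conveys why exhaustive screening would be impractical.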
ABSTRACT
A mild chlorination reaction of alcohols was developed using the classical thionyl chloride reagent but with added catalytic titanium(IV) chloride. These reactions proceeded rapidly to afford chlorination products in excellent yields and with preference for retention of configuration. Stereoselectivities were high for a variety of chiral cyclic secondary substrates including sterically hindered systems. Chlorosulfites were first generated in situ and converted to alkyl chlorides by the action of titanium tetrachloride which is thought to chelate the chlorosulfite leaving group and deliver the halogen nucleophile from the front face. To better understand this novel reaction pathway, an ab initio study was undertaken at the DFT level of theory using two different computational approaches. This computational evidence suggests that while the reaction proceeds through a carbocation intermediate, this charged species likely retains pyramidal geometry existing as a conformational isomer stabilized through hyperconjugation (hyperconjomers). These carbocations are then essentially "frozen" in their original configurations at the time of nucleophilic capture.
Subjects
Alcohols/chemistry, Cations/chemistry, Titanium/chemistry, Catalysis, Halogenation, Indicators and Reagents/chemistry, Kinetics, Molecular Structure, Quantum Theory, Stereoisomerism
ABSTRACT
We developed a new coarse-grained (CG) model for water to study nucleation of droplets from the vapor phase. The resulting potential has a more flexible functional form and a longer-range cutoff compared to other CG potentials available for water. This allowed us to extend the range of applicability of coarse-grained techniques to nucleation phenomena. By improving the description of the interactions between water molecules in the gas phase, we obtained a CG model that gives results similar to those of the all-atom (AA) TIP4P model but at a lower computational cost. In this work, we present the validation of the potential and its application to the study of nucleation of water droplets from the supersaturated vapor phase via molecular-dynamics simulations. The computed nucleation rates at T = 320 K and 350 K at different supersaturations, ranging from 5 to 15, compare very well with AA TIP4P simulations and show the right dependence on the temperature compared with available experimental data. To help comparison with the experiments, we explored in detail the different ways to control the temperature and their effects on nucleation.
ABSTRACT
The quest for generating novel chemistry knowledge is critical in scientific advancement, and machine learning (ML) has emerged as an asset in this pursuit. Through interpolation among learned patterns, ML can tackle tasks that were previously deemed demanding to machines. This distinctive capacity of ML provides invaluable aid to bench chemists in their daily work. However, current ML tools are typically designed to prioritize experiments with the highest likelihood of success, i.e., higher predictive confidence. In this perspective, we build on current trends that suggest a future in which ML could be just as beneficial in exploring uncharted search spaces through simulated curiosity. We discuss how low and 'negative' data can catalyse one-/few-shot learning, and how the broader use of curious ML and novelty detection algorithms can propel the next wave of chemical discoveries. We anticipate that ML for curiosity-driven research will help the community overcome potentially biased assumptions and uncover unexpected findings in the chemical sciences at an accelerated pace.
ABSTRACT
Over the past four years, several research groups demonstrated the combination of domain-specific language representation with recent NLP architectures to accelerate innovation in a wide range of scientific fields. Chemistry is a great example. Among the various chemical challenges addressed with language models, retrosynthesis demonstrates some of the most distinctive successes and limitations. Single-step retrosynthesis, the task of identifying reactions able to decompose a complex molecule into simpler structures, can be cast as a translation problem, in which a text-based representation of the target molecule is converted into a sequence of possible precursors. A common issue is a lack of diversity in the proposed disconnection strategies. The suggested precursors typically fall in the same reaction family, which limits the exploration of the chemical space. We present a retrosynthesis Transformer model that increases the diversity of the predictions by prepending a classification token to the language representation of the target molecule. At inference, the use of these prompt tokens allows us to steer the model towards different kinds of disconnection strategies. We show that the diversity of the predictions improves consistently, which enables recursive synthesis tools to circumvent dead ends and, consequently, to suggest synthesis pathways for more complex molecules.
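The prompting idea in this abstract can be sketched as follows: the target SMILES is split into tokens and one model input is built per class token, so that each input asks for a different kind of disconnection. The token format `<cls_i>`, the simplified tokenizer regex, and the function names are hypothetical choices for illustration; the trained Transformer itself is not part of the sketch.

```python
import re
from typing import List

# Simplified SMILES tokenizer (an assumption of this sketch): bracket atoms,
# two-letter halogens, two-digit ring closures, then any single character.
_SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize(smiles: str) -> List[str]:
    """Split a SMILES string into tokens and check that nothing was lost."""
    tokens = _SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"could not tokenize {smiles!r}")
    return tokens


def build_prompts(target_smiles: str, n_classes: int) -> List[str]:
    """Build one space-separated model input per hypothetical class token."""
    body = " ".join(tokenize(target_smiles))
    return [f"<cls_{i}> {body}" for i in range(n_classes)]


# Example target: paracetamol, prompted with three hypothetical classes.
prompts = build_prompts("CC(=O)Nc1ccc(O)cc1", n_classes=3)
for prompt in prompts:
    print(prompt)
```

Each prompt would be passed separately to the single-step model, and the resulting precursor sets pooled, so that predictions are not all drawn from the reaction family the model favors by default.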
ABSTRACT
The need for more efficient catalytic processes is ever-growing, and so are the costs associated with experimentally searching chemical space to find new promising catalysts. Despite the consolidated use of density functional theory (DFT) and other atomistic models for virtually screening molecules based on their simulated performance, data-driven approaches are rising as indispensable tools for designing and improving catalytic processes. Here, we present a deep learning model capable of generating new catalyst-ligand candidates by self-learning meaningful structural features solely from their language representation and computed binding energies. We train a recurrent neural network-based Variational Autoencoder (VAE) to compress the molecular representation of the catalyst into a lower dimensional latent space, in which a feed-forward neural network predicts the corresponding binding energy to be used as the optimization function. The outcome of the optimization in the latent space is then reconstructed back into the original molecular representation. These trained models achieve state-of-the-art predictive performance in catalyst binding-energy prediction and catalyst design, with a mean absolute error of 2.42 kcal mol⁻¹ and an ability to generate 84% valid and novel catalysts.
ABSTRACT
Synthesis protocol exploration is paramount in catalyst discovery, yet keeping pace with rapid literature advances is increasingly time intensive. Automated synthesis protocol analysis is attractive for swiftly identifying opportunities and informing predictive models; however, such applications in heterogeneous catalysis remain limited. In this proof-of-concept, we introduce a transformer model for this task, exemplified using single-atom heterogeneous catalysts (SACs), a rapidly expanding catalyst family. Our model adeptly converts SAC protocols into action sequences, and we use this output to facilitate statistical inference of their synthesis trends and applications, potentially expediting literature review and analysis. We demonstrate the model's adaptability across distinct heterogeneous catalyst families, underscoring its versatility. Finally, our study highlights a critical issue: the lack of standardization in reporting protocols hampers machine-reading capabilities. Embracing digital advances in catalysis demands a shift in data reporting norms, and to this end, we offer guidelines for writing protocols, significantly improving machine-readability. We release our model as an open-source web application, inviting a fresh approach to accelerate heterogeneous catalysis synthesis planning.
ABSTRACT
Data-driven approaches to retrosynthesis are limited in user interaction, diversity of their predictions, and recommendation of unintuitive disconnection strategies. Herein, we extend the notions of prompt-based inference in natural language processing to the task of chemical language modeling. We show that by using a prompt describing the disconnection site in a molecule we can steer the model to propose a broader set of precursors, thereby overcoming training data biases in retrosynthetic recommendations and achieving a 39% performance improvement over the baseline. For the first time, the use of a disconnection prompt empowers chemists by giving them greater control over the disconnection predictions, which results in more diverse and creative recommendations. In addition, in place of a human-in-the-loop strategy, we propose a two-stage schema consisting of automatic identification of disconnection sites, followed by prediction of reactant sets, thereby achieving a considerable improvement in class diversity compared with the baseline. The approach is effective in mitigating prediction biases derived from training data. This provides a wider variety of usable building blocks and improves the end user's digital experience. We demonstrate its application to different chemistry domains, from traditional to enzymatic reactions, in which substrate specificity is critical.
ABSTRACT
The world is on the verge of a new industrial revolution, and language models are poised to play a pivotal role in this transformative era. Their ability to offer intelligent insights and forecasts has made them a valuable asset for businesses seeking a competitive advantage. The chemical industry, in particular, can benefit significantly from harnessing their power. Since as early as 2016, language models have been applied to tasks such as predicting reaction outcomes or retrosynthetic routes. While such models have demonstrated impressive abilities, the lack of publicly available data sets with universal coverage is often the limiting factor for achieving even higher accuracies. This makes it imperative for organizations to incorporate proprietary data sets into their model training processes to improve their performance. So far, however, these data sets frequently remain untapped as there are no established criteria for model customization. In this work, we report a successful methodology for retraining language models on reaction outcome prediction and single-step retrosynthesis tasks, using proprietary, nonpublic data sets. We report a considerable boost in accuracy by combining patent and proprietary data in a multidomain learning formulation. This exercise, inspired by a real-world use case, enables us to formulate guidelines that can be adopted in different corporate settings to customize chemical language models easily.
ABSTRACT
Data-driven synthesis planning has seen remarkable successes in recent years by virtue of modern approaches of artificial intelligence that efficiently exploit vast databases with experimental data on chemical reactions. However, this success story is intimately connected to the availability of existing experimental data. It may well occur in retrosynthetic and synthesis design tasks that predictions in individual steps of a reaction cascade are affected by large uncertainties. In such cases, it will, in general, not be easily possible to provide missing data from autonomously conducted experiments on demand. However, first-principles calculations can, in principle, provide missing data to enhance the confidence of an individual prediction or for model retraining. Here, we demonstrate the feasibility of such an ansatz and examine resource requirements for conducting autonomous first-principles calculations on demand.