Results 1 - 20 of 81
1.
Nat Prod Rep ; 2024 Aug 16.
Article in English | MEDLINE | ID: mdl-39148455

ABSTRACT

Artificial intelligence (AI) is accelerating how we conduct science, from folding proteins with AlphaFold and summarizing literature findings with large language models, to annotating genomes and prioritizing newly generated molecules for screening using specialized software. However, the application of AI to emulate human cognition in natural product research and its subsequent impact has so far been limited. One reason for this limited impact is that available natural product data is multimodal, unbalanced, unstandardized, and scattered across many data repositories. This makes natural product data challenging to use with existing deep learning architectures that consume fairly standardized, often non-relational, data. It also prevents models from learning overarching patterns in natural product science. In this Viewpoint, we address this challenge and support ongoing initiatives aimed at democratizing natural product data by collating our collective knowledge into a knowledge graph. By doing so, we believe there will be an opportunity to use such a knowledge graph to develop AI models that can truly mimic natural product scientists' decision-making.

2.
Digit Discov ; 2024 Jul 31.
Article in English | MEDLINE | ID: mdl-39157760

ABSTRACT

The popularity of data-driven approaches and machine learning (ML) techniques in the field of organic chemistry and its various subfields has increased the value of structured reaction data. Most data in chemistry is represented by unstructured text, and despite the vastness of the organic chemistry literature (papers, patents), conversion from unstructured text to structured data remains a largely manual endeavor. Software tools for this task would facilitate downstream applications such as reaction prediction and condition recommendation. In this study, we fine-tune a large language model (LLM) to extract reaction information from organic synthesis procedure text into structured data following the Open Reaction Database (ORD) schema, a comprehensive data structure designed for organic reactions. The fine-tuned model produces syntactically correct ORD records with an average accuracy of 91.25% for ORD "messages" (e.g., full compound, workups, or condition definitions) and 92.25% for individual data fields (e.g., compound identifiers, mass quantities), with the ability to recognize compound-referencing tokens and to infer reaction roles. We investigate its failure modes and evaluate performance on specific subtasks such as reaction role classification.

3.
J Chem Inf Model ; 64(14): 5521-5534, 2024 Jul 22.
Article in English | MEDLINE | ID: mdl-38950894

ABSTRACT

Information extraction from chemistry literature is vital for constructing up-to-date reaction databases for data-driven chemistry. Complete extraction requires combining information across text, tables, and figures, whereas prior work has mainly investigated extracting reactions from single modalities. In this paper, we present OpenChemIE to address this complex challenge and enable the extraction of reaction data at the document level. OpenChemIE approaches the problem in two steps: extracting relevant information from individual modalities and then integrating the results to obtain a final list of reactions. For the first step, we employ specialized neural models that each address a specific task for chemistry information extraction, such as parsing molecules or reactions from text or figures. We then integrate the information from these modules using chemistry-informed algorithms, allowing for the extraction of fine-grained reaction data from reaction condition and substrate scope investigations. Our machine learning models attain state-of-the-art performance when evaluated individually, and we meticulously annotate a challenging dataset of reaction schemes with R-groups to evaluate our pipeline as a whole, achieving an F1 score of 69.5%. Additionally, the reaction extraction results of OpenChemIE attain an accuracy score of 64.3% when directly compared against the Reaxys chemical database. OpenChemIE is most suited for information extraction on organic chemistry literature, where molecules are generally depicted as planar graphs or written in text and can be consolidated into a SMILES format. We provide OpenChemIE freely to the public as an open-source package, as well as through a web interface.
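
As a reminder of how an F1 score such as the 69.5% reported above is typically computed, a minimal sketch follows. It assumes that predicted and gold reactions are compared as exact-match records; the `f1_score` helper and the toy reaction strings are hypothetical and are not taken from OpenChemIE or its evaluation code.

```python
def f1_score(predicted, gold):
    """Return (precision, recall, F1) for two collections of hashable records."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical toy data: three annotated reactions, four extracted ones.
gold = {"CCO>>CC=O", "CCBr>>CCO", "c1ccccc1>>c1ccccc1Br"}
predicted = {"CCO>>CC=O", "CCBr>>CCO", "CCCl>>CCO", "CC>>CCO"}

precision, recall, f1 = f1_score(predicted, gold)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```

Exact matching is the strictest choice; a softer matching rule would only change how the true positives are counted, not the formula.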


Subject(s)
Machine Learning , Data Mining/methods , Chemical Databases , Algorithms , Cheminformatics/methods
4.
Angew Chem Int Ed Engl ; : e202411296, 2024 Jul 12.
Article in English | MEDLINE | ID: mdl-38995205

ABSTRACT

Mechanistic understanding of organic reactions can facilitate reaction development, impurity prediction, and in principle, reaction discovery. While several machine learning models have sought to address the task of predicting reaction products, their extension to predicting reaction mechanisms has been impeded by the lack of a corresponding mechanistic dataset. In this study, we construct such a dataset by imputing intermediates between experimentally reported reactants and products using expert reaction templates and train several machine learning models on the resulting dataset of 5,184,184 elementary steps. We explore the performance and capabilities of these models, focusing on their ability to predict reaction pathways and recapitulate the roles of catalysts and reagents. Additionally, we demonstrate the potential of mechanistic models in predicting impurities, often overlooked by conventional models. We conclude by evaluating the generalizability of mechanistic models to new reaction types, revealing challenges related to dataset diversity, consecutive predictions, and violations of atom conservation.

5.
Nat Comput Sci ; 4(6): 440-450, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38886590

ABSTRACT

Small molecules exhibiting desirable property profiles are often discovered through an iterative process of designing, synthesizing and testing sets of molecules. The selection of molecules to synthesize from all possible candidates is a complex decision-making process that typically relies on expert chemist intuition. Here we propose a quantitative decision-making framework, SPARROW, that prioritizes molecules for evaluation by balancing expected information gain and synthetic cost. SPARROW integrates molecular design, property prediction and retrosynthetic planning to balance the utility of testing a molecule with the cost of batch synthesis. We demonstrate, through three case studies, that the developed algorithm captures the non-additive costs inherent to batch synthesis, leverages common reaction steps and intermediates, and scales to hundreds of molecules.
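
The non-additive cost of batch synthesis mentioned above can be illustrated with a small sketch. It is not SPARROW's formulation: the candidate names, utilities, step costs, and the brute-force search are all hypothetical, chosen only to show that a step shared by two routes is paid for once.

```python
from itertools import combinations

# Hypothetical toy data: each candidate has an expected utility and a route,
# given as the set of reaction steps needed to make it.
candidates = {
    "mol_A": {"utility": 5.0, "steps": {"r1", "r2"}},
    "mol_B": {"utility": 4.0, "steps": {"r1", "r3"}},
    "mol_C": {"utility": 3.0, "steps": {"r4", "r5", "r6"}},
}
step_cost = {"r1": 2.0, "r2": 1.0, "r3": 1.0, "r4": 2.0, "r5": 2.0, "r6": 2.0}


def batch_cost(selection):
    """Cost of a batch: every distinct step is counted once, even if shared."""
    steps = set().union(*(candidates[name]["steps"] for name in selection))
    return sum(step_cost[step] for step in steps)


def best_batch(cost_weight=1.0):
    """Score every non-empty subset; only viable for a handful of candidates."""
    best, best_score = (), 0.0
    names = sorted(candidates)
    for size in range(1, len(names) + 1):
        for selection in combinations(names, size):
            utility = sum(candidates[name]["utility"] for name in selection)
            score = utility - cost_weight * batch_cost(selection)
            if score > best_score:
                best, best_score = selection, score
    return best, best_score


selection, score = best_batch()
print(selection, score)
```

Here `mol_A` and `mol_B` share step `r1`, so making both costs less than the sum of their individual route costs. Exhaustive enumeration grows exponentially, which is why a framework meant to scale to hundreds of molecules needs a proper optimization formulation instead.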

6.
J Am Chem Soc ; 146(23): 16052-16061, 2024 Jun 12.
Article in English | MEDLINE | ID: mdl-38822795

ABSTRACT

The application of machine learning models to the prediction of reaction outcomes currently needs large and/or highly featurized data sets. We show that a chemistry-aware model, NERF, which mimics the bonding changes that occur during reactions, allows for highly accurate predictions of the outcomes of Diels-Alder reactions using a relatively small training set, with no pretraining and no additional features. We establish a diverse data set of 9537 intramolecular, hetero-, aromatic, and inverse electron demand Diels-Alder reactions. This data set is used to train a NERF model, and the performance is compared against state-of-the-art classification and generative machine learning models across low- and high-data regimes, with and without pretraining. The predictive accuracy (regio- and site selectivity in the major product) achieved by NERF exceeds 90% when as little as 40% of the data set is used for training. Another high-performing model, Chemformer, requires a larger training data set (>45%) and pretraining to reach 90% Top-1 accuracy. Accurate predictions of less-represented reaction subclasses, such as those involving heteroatomic or aromatic substrates, require higher percentages of training data. We also show how NERF can use small amounts of additional training data to quickly learn new systems and improve its overall understanding of reactivity. Synthetic chemists stand to benefit as this model can be rapidly expanded and tailored to areas of chemistry corresponding to the low-data regime.

7.
J Am Chem Soc ; 146(22): 15070-15084, 2024 Jun 05.
Article in English | MEDLINE | ID: mdl-38768950

ABSTRACT

Despite the increased use of computational tools to supplement medicinal chemists' expertise and intuition in drug design, predicting synthetic yields in medicinal chemistry endeavors remains an unsolved challenge. Existing design workflows could profoundly benefit from reaction yield prediction, as precious material waste could be reduced, and a greater number of relevant compounds could be delivered to advance the design, make, test, analyze (DMTA) cycle. In this work, we detail the evaluation of AbbVie's medicinal chemistry library data set to build machine learning models for the prediction of Suzuki coupling reaction yields. The combination of density functional theory (DFT)-derived features and Morgan fingerprints was identified to perform better than one-hot encoded baseline modeling, furnishing encouraging results. Overall, we observe modest generalization to unseen reactant structures within the 15-year retrospective library data set. Additionally, we compare predictions made by the model to those made by expert medicinal chemists, finding that the model can often predict both reaction success and reaction yields with greater accuracy. Finally, we demonstrate the application of this approach to suggest structurally and electronically similar building blocks to replace those predicted or observed to be unsuccessful prior to or after synthesis, respectively. The yield prediction model was used to select similar monomers predicted to have higher yields, resulting in greater synthesis efficiency of relevant drug-like molecules.


Subject(s)
Drug Design , Small Molecule Libraries , Small Molecule Libraries/chemistry , Small Molecule Libraries/chemical synthesis , Machine Learning , Density Functional Theory , Molecular Structure , Pharmaceutical Chemistry/methods
8.
Nat Rev Chem ; 8(5): 300-301, 2024 May.
Article in English | MEDLINE | ID: mdl-38605148
9.
J Chem Inf Model ; 64(8): 2948-2954, 2024 Apr 22.
Article in English | MEDLINE | ID: mdl-38488634

ABSTRACT

SMARTS is a widely used language in cheminformatics for defining substructural queries for database lookups, reaction templates for chemical transformations, and other applications. As an extension to SMILES, many SMARTS patterns can represent the same query. Despite this, no canonicalization algorithm invariant of the line notation sequence or atomic numbering is publicly available. Here, we introduce RDCanon, an open-source Python package that can be used to standardize SMARTS queries. RDCanon is designed to ensure that the sequence of atomic queries remains consistent for all graphs representing the same substructure query and to ensure a canonical sequence of primitives within each individual atom query; furthermore, the algorithm can be applied to canonicalize the order of reactants, agents, and products and their atom map numbers in reaction SMARTS templates. As part of its canonicalization algorithm, RDCanon provides a mechanism in which the canonicalized SMARTS is optimized for speed against specific molecular databases. Several case studies are provided to showcase improved efficiency in substructure matching and retrosynthetic analysis.
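
To make the idea of "a canonical sequence of primitives within each individual atom query" concrete, here is a deliberately simplified sketch. It is not RDCanon's algorithm and does not use its API. It relies only on the fact that `;` is the low-precedence AND operator in SMARTS, so the parts it separates can be reordered without changing the query. The helper name `normalize_atom_query` is hypothetical, and recursive SMARTS and atom map numbers are assumed to be absent.

```python
def normalize_atom_query(atom_smarts):
    """Sort the ';'-separated parts of one bracketed SMARTS atom query.

    Simplifying assumptions for this sketch: a single bracketed atom,
    no recursive SMARTS ("$(...)") and no atom map number (":n").
    """
    if not (atom_smarts.startswith("[") and atom_smarts.endswith("]")):
        raise ValueError("expected a single bracketed atom query")
    body = atom_smarts[1:-1]
    if "$(" in body or ":" in body:
        raise ValueError("recursive SMARTS and atom maps are out of scope here")
    # ';' is commutative, so plain lexicographic sorting of its operands
    # gives one fixed spelling for every ordering of the same parts.
    return "[" + ";".join(sorted(body.split(";"))) + "]"


variants = ["[R;C;H2]", "[H2;R;C]", "[C;H2;R]"]
normalized = {normalize_atom_query(v) for v in variants}
print(normalized)
```

All three spellings collapse to a single string. Lexicographic order is only one possible convention; it makes no attempt at the database-specific speed optimization the abstract describes, and full canonicalization also has to order the atoms of the whole query graph.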


Subject(s)
Algorithms , Software , Programming Languages , Cheminformatics/methods , Chemical Databases
10.
Anal Chem ; 96(8): 3419-3428, 2024 Feb 27.
Article in English | MEDLINE | ID: mdl-38349970

ABSTRACT

The accurate prediction of tandem mass spectra from molecular structures has the potential to unlock new metabolomic discoveries by augmenting the community's libraries of experimental reference standards. Cheminformatic spectrum prediction strategies use a "bond-breaking" framework to iteratively simulate mass spectrum fragmentations, but these methods are (a) slow due to the need to exhaustively and combinatorially break molecules and (b) inaccurate as they often rely upon heuristics to predict the intensity of each resulting fragment; neural network alternatives mitigate computational cost but are black-box and not inherently more accurate. We introduce a physically grounded neural approach that learns to predict each breakage event and score the most relevant subset of molecular fragments quickly and accurately. We evaluate our model by predicting spectra from both public and private standard libraries, demonstrating that our hybrid approach offers state-of-the-art prediction accuracy, improved metabolite identification from a database of candidates, and higher interpretability when compared to previous breakage methods and black-box neural networks. The grounding of our approach in physical fragmentation events shows especially great promise for elucidating natural product molecules with more complex scaffolds.

11.
Nat Chem Biol ; 20(2): 170-179, 2024 Feb.
Article in English | MEDLINE | ID: mdl-37919549

ABSTRACT

Small molecules that induce protein-protein associations represent powerful tools to modulate cell circuitry. We sought to develop a platform for the direct discovery of compounds able to induce association of any two preselected proteins, using the E3 ligase von Hippel-Lindau (VHL) and bromodomains as test systems. Leveraging the screening power of DNA-encoded libraries (DELs), we synthesized ~1 million DNA-encoded compounds that possess a VHL-targeting ligand, a variety of connectors and a diversity element generated by split-and-pool combinatorial chemistry. By screening our DEL against bromodomains in the presence and absence of VHL, we could identify VHL-bound molecules that simultaneously bind bromodomains. For highly barcode-enriched library members, ternary complex formation leading to bromodomain degradation was confirmed in cells. Furthermore, a ternary complex crystal structure was obtained for our most enriched library member with BRD4BD1 and a VHL complex. Our work provides a foundation for adapting DEL screening to the discovery of proximity-inducing small molecules.


Subject(s)
Nuclear Proteins , Von Hippel-Lindau Tumor Suppressor Protein , Von Hippel-Lindau Tumor Suppressor Protein/chemistry , Von Hippel-Lindau Tumor Suppressor Protein/metabolism , Nuclear Proteins/metabolism , Transcription Factors , Ubiquitin-Protein Ligases/metabolism , DNA
12.
J Chem Inf Model ; 64(7): 2421-2431, 2024 Apr 08.
Article in English | MEDLINE | ID: mdl-37725368

ABSTRACT

Chemical formula annotation for tandem mass spectrometry (MS/MS) data is the first step toward structurally elucidating unknown metabolites. While great strides have been made toward solving this problem, the current state-of-the-art method depends on time-intensive, proprietary, and expert-parametrized fragmentation tree construction and scoring. In this work, we extend our previous spectrum Transformer methodology into an energy-based modeling framework, MIST-CF: Metabolite Inference with Spectrum Transformers for Chemical Formula prediction, for learning to rank chemical formula and adduct assignments given an unannotated MS/MS spectrum. Importantly, MIST-CF learns in a data-dependent fashion using a Formula Transformer neural network architecture and circumvents the need for fragmentation tree construction. We train and evaluate our model on a large open-access database, showing an absolute improvement of 10% top 1 accuracy over other neural network architectures. We further validate our approach on the CASMI2022 challenge data set, achieving nearly equivalent performance to the winning entry within the positive mode category without any manual curation or postprocessing of our results. These results demonstrate an exciting strategy to more powerfully leverage MS2 fragment peaks for predicting MS1 precursor chemical formulas with data-driven learning.
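
For context, the simplest baseline for formula annotation ranks candidates purely by how well their theoretical mass matches the MS1 precursor. The sketch below shows that mass-only ranking. It is not MIST-CF, which learns its ranking from the MS2 fragment peaks. It assumes a singly protonated [M+H]+ adduct and only C, H, N and O; the function names and the candidate list are hypothetical.

```python
import re

# Monoisotopic masses (u) of the most abundant isotopes, rounded to 8 decimals.
MONOISOTOPIC_MASS = {"C": 12.0, "H": 1.00782503, "N": 14.00307400, "O": 15.99491462}
PROTON_MASS = 1.00727647  # mass shift for an [M+H]+ adduct


def formula_mass(formula):
    """Monoisotopic mass of a simple formula such as 'C8H10N4O2'."""
    total = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += MONOISOTOPIC_MASS[element] * (int(count) if count else 1)
    return total


def ppm_error(formula, observed_mz):
    """Signed mass error in parts per million, assuming an [M+H]+ ion."""
    theoretical_mz = formula_mass(formula) + PROTON_MASS
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def rank_by_mass(candidate_formulas, observed_mz):
    """Order candidates from smallest to largest absolute ppm error."""
    return sorted(candidate_formulas, key=lambda f: abs(ppm_error(f, observed_mz)))


# Hypothetical precursor close to protonated caffeine (C8H10N4O2).
observed_mz = 195.0877
ranked = rank_by_mass(["C9H14N2O3", "C10H14N2O2", "C8H10N4O2"], observed_mz)
print(ranked)
```

Mass alone often leaves several formulas within instrument tolerance, which is why additional evidence from the fragment peaks is needed to separate them.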


Subject(s)
Computer Neural Networks , Tandem Mass Spectrometry , Tandem Mass Spectrometry/methods , Factual Databases
14.
ACS Macro Lett ; 12(11): 1517-1522, 2023 Nov 21.
Article in English | MEDLINE | ID: mdl-37889173

ABSTRACT

We unveil a unified view on the effect of side chains on the glass transition temperatures (Tg) in polymer melts by using molecular dynamics simulations, density functional theory calculations, and available experimental data. We use acrylates as a model system and evaluate the effect of n-alkyl side chains on Tg. We find that backbone dihedral angle fluctuations follow established patterns due to sterics, as expected. However, we also find that the dihedral angle orthogonal to the backbone, which normally is neglected when discussing the effect on Tg, introduces a secondary rotational degree of freedom which strongly impacts Tg. These results are in agreement with experiments and generalize to multiple other polymer systems, as is demonstrated using available experimental data. Conversely, n-alkyl pendant groups attached to the side group reduce Tg. Our work establishes a coherent framework that unifies previously established trends, emphasizing the polarity and size effects of n-alkyl chains on Tg.

15.
Nature ; 620(7972): 47-60, 2023 Aug.
Article in English | MEDLINE | ID: mdl-37532811

ABSTRACT

Artificial intelligence (AI) is being increasingly integrated into scientific discovery to augment and accelerate research, helping scientists to generate hypotheses, design experiments, collect and interpret large datasets, and gain insights that might not have been possible using traditional scientific methods alone. Here we examine breakthroughs over the past decade that include self-supervised learning, which allows models to be trained on vast amounts of unlabelled data, and geometric deep learning, which leverages knowledge about the structure of scientific data to enhance model accuracy and efficiency. Generative AI methods can create designs, such as small-molecule drugs and proteins, by analysing diverse data modalities, including images and sequences. We discuss how these methods can help scientists throughout the scientific process and the central issues that remain despite such advances. Both developers and users of AI tools need a better understanding of when such approaches need improvement, and challenges posed by poor data quality and stewardship remain. These issues cut across scientific disciplines and require developing foundational algorithmic approaches that can contribute to scientific understanding or acquire it autonomously, making them critical areas of focus for AI innovation.


Subject(s)
Artificial Intelligence , Research Design , Artificial Intelligence/standards , Artificial Intelligence/trends , Datasets as Topic , Deep Learning , Research Design/standards , Research Design/trends , Unsupervised Machine Learning
16.
Chemistry ; 29(60): e202301957, 2023 Oct 26.
Article in English | MEDLINE | ID: mdl-37526059

ABSTRACT

Molecular quantum mechanical modeling, accelerated by machine learning, has opened the door to in silico high-throughput screening campaigns of complex properties, such as the activation energies of chemical reactions and the absorption/emission spectra of materials and molecules. Here, we present an overview of the main principles, concepts, and design considerations involved in such hybrid computational quantum chemistry/machine learning screening workflows, with a special emphasis on some recent examples of their successful application. We end with a brief outlook of further advances that will benefit the field.

18.
Nat Commun ; 14(1): 4930, 2023 Aug 15.
Article in English | MEDLINE | ID: mdl-37582753

ABSTRACT

Diversity-oriented synthesis (DOS) is a powerful strategy to prepare molecules with underrepresented features in commercial screening collections, resulting in the elucidation of novel biological mechanisms. In parallel to the development of DOS, DNA-encoded libraries (DELs) have emerged as an effective, efficient screening strategy to identify protein binders. Despite recent advancements in this field, most DEL syntheses are limited by the presence of sensitive DNA-based constructs. Here, we describe the design, synthesis, and validation experiments performed for a 3.7 million-member DEL, generated using diverse skeleton architectures with varying exit vectors and derived from DOS, to achieve structural diversity beyond what is possible by varying appendages alone. We also show screening results for three diverse protein targets. We will make this DEL available to the academic scientific community to increase access to novel structural features and accelerate early-phase drug discovery.


Subject(s)
Drug Discovery , Small Molecule Libraries , Small Molecule Libraries/chemistry , Drug Discovery/methods , Gene Library , DNA/genetics , DNA/chemistry
19.
J Chem Inf Model ; 63(14): 4253-4265, 2023 Jul 24.
Article in English | MEDLINE | ID: mdl-37405398

ABSTRACT

The past decade has seen a number of impressive developments in predictive chemistry and reaction informatics driven by machine learning applications to computer-aided synthesis planning. While many of these developments have been made even with relatively small, bespoke data sets, in order to advance the role of AI in the field at scale, there must be significant improvements in the reporting of reaction data. Currently, the majority of publicly available data is reported in an unstructured format and heavily imbalanced toward high-yielding reactions, which influences the types of models that can be successfully trained. In this Perspective, we analyze several data curation and sharing initiatives that have seen success in chemistry and molecular biology. We discuss several factors that have contributed to their success and how we can take lessons from these case studies and apply them to reaction data. Finally, we spotlight the Open Reaction Database and summarize key actions the community can take toward making reaction data more findable, accessible, interoperable, and reusable (FAIR), including the use of mandates from funding agencies and publishers.


Subject(s)
Data Curation , Informatics , Factual Databases , Information Dissemination
20.
J Chem Inf Model ; 63(13): 4030-4041, 2023 Jul 10.
Article in English | MEDLINE | ID: mdl-37368970

ABSTRACT

Reaction diagram parsing is the task of extracting reaction schemes from a diagram in the chemistry literature. The reaction diagrams can be arbitrarily complex; thus, robustly parsing them into structured data is an open challenge. In this paper, we present RxnScribe, a machine learning model for parsing reaction diagrams of varying styles. We formulate this structured prediction task with a sequence generation approach, which condenses the traditional pipeline into an end-to-end model. We train RxnScribe on a dataset of 1378 diagrams and evaluate it with cross validation, achieving an 80.0% soft match F1 score, with significant improvements over previous models. Our code and data are publicly available at https://github.com/thomas0809/RxnScribe.


Subject(s)
Machine Learning