Results 1 - 20 of 79
1.
J Am Chem Soc ; 2024 Jun 01.
Article in English | MEDLINE | ID: mdl-38822795

ABSTRACT

The application of machine learning models to the prediction of reaction outcomes currently needs large and/or highly featurized data sets. We show that a chemistry-aware model, NERF, which mimics the bonding changes that occur during reactions, allows for highly accurate predictions of the outcomes of Diels-Alder reactions using a relatively small training set, with no pretraining and no additional features. We establish a diverse data set of 9537 intramolecular, hetero-, aromatic, and inverse electron demand Diels-Alder reactions. This data set is used to train a NERF model, and the performance is compared against state-of-the-art classification and generative machine learning models across low- and high-data regimes, with and without pretraining. The predictive accuracy (regio- and site selectivity in the major product) achieved by NERF exceeds 90% when as little as 40% of the data set is used for training. Another high-performing model, Chemformer, requires a larger training data set (>45%) and pretraining to reach 90% Top-1 accuracy. Accurate predictions of less-represented reaction subclasses, such as those involving heteroatomic or aromatic substrates, require higher percentages of training data. We also show how NERF can use small amounts of additional training data to quickly learn new systems and improve its overall understanding of reactivity. Synthetic chemists stand to benefit as this model can be rapidly expanded and tailored to areas of chemistry corresponding to the low-data regime.
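
The abstract above reports Top-1 accuracy without defining it. As a minimal sketch, assuming the usual definition (the true major product must be the model's first-ranked prediction), the metric could be computed as follows. The function name `top_k_accuracy` and the product labels are hypothetical placeholders, not taken from the paper.

```python
from typing import Sequence


def top_k_accuracy(
    ranked_predictions: Sequence[Sequence[str]],
    truths: Sequence[str],
    k: int = 1,
) -> float:
    """Fraction of examples whose true label is among the first k ranked predictions."""
    if len(ranked_predictions) != len(truths):
        raise ValueError("predictions and truths must have the same length")
    if not truths:
        raise ValueError("at least one example is required")
    hits = sum(
        truth in preds[:k] for preds, truth in zip(ranked_predictions, truths)
    )
    return hits / len(truths)


# Hypothetical ranked product labels for three reactions (best guess first).
predictions = [
    ["endo-A", "exo-A"],
    ["exo-B", "endo-B"],
    ["endo-C", "exo-C"],
]
truths = ["endo-A", "endo-B", "endo-C"]

top1 = top_k_accuracy(predictions, truths, k=1)  # 2 of 3 correct at rank 1
top2 = top_k_accuracy(predictions, truths, k=2)  # all 3 found within the top 2
```

In practice the labels would more likely be canonicalized product structures than free-text names, so that equivalent representations of the same product compare equal.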

2.
J Am Chem Soc ; 2024 May 20.
Article in English | MEDLINE | ID: mdl-38768950

ABSTRACT

Despite the increased use of computational tools to supplement medicinal chemists' expertise and intuition in drug design, predicting synthetic yields in medicinal chemistry endeavors remains an unsolved challenge. Existing design workflows could profoundly benefit from reaction yield prediction, as precious material waste could be reduced, and a greater number of relevant compounds could be delivered to advance the design, make, test, analyze (DMTA) cycle. In this work, we detail the evaluation of AbbVie's medicinal chemistry library data set to build machine learning models for the prediction of Suzuki coupling reaction yields. The combination of density functional theory (DFT)-derived features and Morgan fingerprints was identified to perform better than one-hot encoded baseline modeling, furnishing encouraging results. Overall, we observe modest generalization to unseen reactant structures within the 15-year retrospective library data set. Additionally, we compare predictions made by the model to those made by expert medicinal chemists, finding that the model can often predict both reaction success and reaction yields with greater accuracy. Finally, we demonstrate the application of this approach to suggest structurally and electronically similar building blocks to replace those predicted or observed to be unsuccessful prior to or after synthesis, respectively. The yield prediction model was used to select similar monomers predicted to have higher yields, resulting in greater synthesis efficiency of relevant drug-like molecules.

3.
Nat Rev Chem ; 8(5): 300-301, 2024 May.
Article in English | MEDLINE | ID: mdl-38605148
4.
J Chem Inf Model ; 64(8): 2948-2954, 2024 Apr 22.
Article in English | MEDLINE | ID: mdl-38488634

ABSTRACT

SMARTS is a widely used language in cheminformatics for defining substructural queries for database lookups, reaction templates for chemical transformations, and other applications. As an extension to SMILES, many SMARTS patterns can represent the same query. Despite this, no canonicalization algorithm invariant of the line notation sequence or atomic numbering is publicly available. Here, we introduce RDCanon, an open-source Python package that can be used to standardize SMARTS queries. RDCanon is designed to ensure that the sequence of atomic queries remains consistent for all graphs representing the same substructure query and to ensure a canonical sequence of primitives within each individual atom query; furthermore, the algorithm can be applied to canonicalize the order of reactants, agents, and products and their atom map numbers in reaction SMARTS templates. As part of its canonicalization algorithm, RDCanon provides a mechanism in which the canonicalized SMARTS is optimized for speed against specific molecular databases. Several case studies are provided to showcase improved efficiency in substructure matching and retrosynthetic analysis.


Subject(s)
Algorithms; Software; Programming Languages; Cheminformatics/methods; Databases, Chemical
5.
Anal Chem ; 96(8): 3419-3428, 2024 Feb 27.
Article in English | MEDLINE | ID: mdl-38349970

ABSTRACT

The accurate prediction of tandem mass spectra from molecular structures has the potential to unlock new metabolomic discoveries by augmenting the community's libraries of experimental reference standards. Cheminformatic spectrum prediction strategies use a "bond-breaking" framework to iteratively simulate mass spectrum fragmentations, but these methods are (a) slow due to the need to exhaustively and combinatorially break molecules and (b) inaccurate as they often rely upon heuristics to predict the intensity of each resulting fragment; neural network alternatives mitigate computational cost but are black-box and not inherently more accurate. We introduce a physically grounded neural approach that learns to predict each breakage event and score the most relevant subset of molecular fragments quickly and accurately. We evaluate our model by predicting spectra from both public and private standard libraries, demonstrating that our hybrid approach offers state-of-the-art prediction accuracy, improved metabolite identification from a database of candidates, and higher interpretability when compared to previous breakage methods and black-box neural networks. The grounding of our approach in physical fragmentation events shows especially great promise for elucidating natural product molecules with more complex scaffolds.

6.
J Chem Inf Model ; 64(7): 2421-2431, 2024 Apr 08.
Article in English | MEDLINE | ID: mdl-37725368

ABSTRACT

Chemical formula annotation for tandem mass spectrometry (MS/MS) data is the first step toward structurally elucidating unknown metabolites. While great strides have been made toward solving this problem, the current state-of-the-art method depends on time-intensive, proprietary, and expert-parametrized fragmentation tree construction and scoring. In this work, we extend our previous spectrum Transformer methodology into an energy-based modeling framework, MIST-CF: Metabolite Inference with Spectrum Transformers for Chemical Formula prediction, for learning to rank chemical formula and adduct assignments given an unannotated MS/MS spectrum. Importantly, MIST-CF learns in a data-dependent fashion using a Formula Transformer neural network architecture and circumvents the need for fragmentation tree construction. We train and evaluate our model on a large open-access database, showing an absolute improvement of 10% top 1 accuracy over other neural network architectures. We further validate our approach on the CASMI2022 challenge data set, achieving nearly equivalent performance to the winning entry within the positive mode category without any manual curation or postprocessing of our results. These results demonstrate an exciting strategy to more powerfully leverage MS2 fragment peaks for predicting MS1 precursor chemical formulas with data-driven learning.


Subject(s)
Neural Networks, Computer; Tandem Mass Spectrometry; Tandem Mass Spectrometry/methods; Databases, Factual
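
To make the formula-ranking task in the abstract above concrete, the sketch below ranks a few candidate formulas for one precursor peak using only the MS1 mass error. This is a simple baseline shown for illustration, not the MIST-CF method, which learns a ranking from MS2 fragment peaks. The candidate list, the observed m/z value, and the assumption of an [M+H]+ adduct are all hypothetical.

```python
from typing import Dict

# Monoisotopic masses (Da) of the most abundant isotope of each element.
MONOISOTOPIC_MASS = {
    "C": 12.0,
    "H": 1.00782503,
    "N": 14.00307400,
    "O": 15.99491462,
}
PROTON_MASS = 1.00727647  # mass added when forming an [M+H]+ ion


def formula_mass(formula: Dict[str, int]) -> float:
    """Monoisotopic mass of a neutral formula given as element -> count."""
    return sum(MONOISOTOPIC_MASS[element] * count for element, count in formula.items())


def ppm_error(observed_mz: float, formula: Dict[str, int]) -> float:
    """Signed mass error in parts per million, assuming an [M+H]+ adduct."""
    theoretical_mz = formula_mass(formula) + PROTON_MASS
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


# Hypothetical candidates with nominal mass 180 and a hypothetical observed m/z.
candidates = {
    "C6H12O6": {"C": 6, "H": 12, "O": 6},
    "C7H16O5": {"C": 7, "H": 16, "O": 5},
    "C5H12N2O5": {"C": 5, "H": 12, "N": 2, "O": 5},
}
observed_mz = 181.0707

# Rank candidates by absolute mass error, smallest first.
ranked = sorted(candidates, key=lambda name: abs(ppm_error(observed_mz, candidates[name])))
```

Mass error alone often leaves several candidates within instrument tolerance, which is why methods that also use the fragment peaks are of interest.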
7.
Nat Chem Biol ; 20(2): 170-179, 2024 Feb.
Article in English | MEDLINE | ID: mdl-37919549

ABSTRACT

Small molecules that induce protein-protein associations represent powerful tools to modulate cell circuitry. We sought to develop a platform for the direct discovery of compounds able to induce association of any two preselected proteins, using the E3 ligase von Hippel-Lindau (VHL) and bromodomains as test systems. Leveraging the screening power of DNA-encoded libraries (DELs), we synthesized ~1 million DNA-encoded compounds that possess a VHL-targeting ligand, a variety of connectors and a diversity element generated by split-and-pool combinatorial chemistry. By screening our DEL against bromodomains in the presence and absence of VHL, we could identify VHL-bound molecules that simultaneously bind bromodomains. For highly barcode-enriched library members, ternary complex formation leading to bromodomain degradation was confirmed in cells. Furthermore, a ternary complex crystal structure was obtained for our most enriched library member with BRD4BD1 and a VHL complex. Our work provides a foundation for adapting DEL screening to the discovery of proximity-inducing small molecules.


Subject(s)
Nuclear Proteins; Von Hippel-Lindau Tumor Suppressor Protein; Von Hippel-Lindau Tumor Suppressor Protein/chemistry; Von Hippel-Lindau Tumor Suppressor Protein/metabolism; Nuclear Proteins/metabolism; Transcription Factors; Ubiquitin-Protein Ligases/metabolism; DNA
9.
ACS Macro Lett ; 12(11): 1517-1522, 2023 Nov 21.
Article in English | MEDLINE | ID: mdl-37889173

ABSTRACT

We unveil a unified view on the effect of side chains on the glass transition temperatures (Tg) in polymer melts by using molecular dynamics simulations, density functional theory calculations, and available experimental data. We use acrylates as a model system and evaluate the effect of n-alkyl side chains on Tg. We find that backbone dihedral angle fluctuations follow established patterns due to sterics, as expected. However, we also find that the dihedral angle orthogonal to the backbone, which normally is neglected when discussing the effect on Tg, introduces a secondary rotational degree of freedom which strongly impacts Tg. These results are in agreement with experiments and generalize to multiple other polymer systems, as is demonstrated using available experimental data. Conversely, n-alkyl pendant groups attached to the side group reduce Tg. Our work establishes a coherent framework that unifies previously established trends, emphasizing the polarity and size effects of n-alkyl chains on Tg.

10.
Chemistry ; 29(60): e202301957, 2023 Oct 26.
Article in English | MEDLINE | ID: mdl-37526059

ABSTRACT

Molecular quantum mechanical modeling, accelerated by machine learning, has opened the door to high-throughput in silico screening campaigns of complex properties, such as the activation energies of chemical reactions and absorption/emission spectra of materials and molecules. Here, we present an overview of the main principles, concepts, and design considerations involved in such hybrid computational quantum chemistry/machine learning screening workflows, with a special emphasis on some recent examples of their successful application. We end with a brief outlook of further advances that will benefit the field.

11.
Nature ; 620(7972): 47-60, 2023 Aug.
Article in English | MEDLINE | ID: mdl-37532811

ABSTRACT

Artificial intelligence (AI) is being increasingly integrated into scientific discovery to augment and accelerate research, helping scientists to generate hypotheses, design experiments, collect and interpret large datasets, and gain insights that might not have been possible using traditional scientific methods alone. Here we examine breakthroughs over the past decade that include self-supervised learning, which allows models to be trained on vast amounts of unlabelled data, and geometric deep learning, which leverages knowledge about the structure of scientific data to enhance model accuracy and efficiency. Generative AI methods can create designs, such as small-molecule drugs and proteins, by analysing diverse data modalities, including images and sequences. We discuss how these methods can help scientists throughout the scientific process and the central issues that remain despite such advances. Both developers and users of AI tools need a better understanding of when such approaches need improvement, and challenges posed by poor data quality and stewardship remain. These issues cut across scientific disciplines and require developing foundational algorithmic approaches that can contribute to scientific understanding or acquire it autonomously, making them critical areas of focus for AI innovation.


Subject(s)
Artificial Intelligence; Research Design; Artificial Intelligence/standards; Artificial Intelligence/trends; Datasets as Topic; Deep Learning; Research Design/standards; Research Design/trends; Unsupervised Machine Learning
12.
Nat Commun ; 14(1): 4930, 2023 Aug 15.
Article in English | MEDLINE | ID: mdl-37582753

ABSTRACT

Diversity-oriented synthesis (DOS) is a powerful strategy to prepare molecules with underrepresented features in commercial screening collections, resulting in the elucidation of novel biological mechanisms. In parallel to the development of DOS, DNA-encoded libraries (DELs) have emerged as an effective, efficient screening strategy to identify protein binders. Despite recent advancements in this field, most DEL syntheses are limited by the presence of sensitive DNA-based constructs. Here, we describe the design, synthesis, and validation experiments performed for a 3.7 million-member DEL, generated using diverse skeleton architectures with varying exit vectors and derived from DOS, to achieve structural diversity beyond what is possible by varying appendages alone. We also show screening results for three diverse protein targets. We will make this DEL available to the academic scientific community to increase access to novel structural features and accelerate early-phase drug discovery.


Subject(s)
Drug Discovery; Small Molecule Libraries; Small Molecule Libraries/chemistry; Drug Discovery/methods; Gene Library; DNA/genetics; DNA/chemistry
14.
J Chem Inf Model ; 63(14): 4253-4265, 2023 Jul 24.
Article in English | MEDLINE | ID: mdl-37405398

ABSTRACT

The past decade has seen a number of impressive developments in predictive chemistry and reaction informatics driven by machine learning applications to computer-aided synthesis planning. While many of these developments have been made even with relatively small, bespoke data sets, in order to advance the role of AI in the field at scale, there must be significant improvements in the reporting of reaction data. Currently, the majority of publicly available data is reported in an unstructured format and heavily imbalanced toward high-yielding reactions, which influences the types of models that can be successfully trained. In this Perspective, we analyze several data curation and sharing initiatives that have seen success in chemistry and molecular biology. We discuss several factors that have contributed to their success and how we can take lessons from these case studies and apply them to reaction data. Finally, we spotlight the Open Reaction Database and summarize key actions the community can take toward making reaction data more findable, accessible, interoperable, and reusable (FAIR), including the use of mandates from funding agencies and publishers.


Subject(s)
Data Curation; Informatics; Databases, Factual; Information Dissemination
15.
bioRxiv ; 2023 Jul 15.
Article in English | MEDLINE | ID: mdl-37502883

ABSTRACT

Liquid-handling robots are often limited by proprietary programming interfaces that are only compatible with a single type of robot and operating system, restricting method sharing and slowing development. Here we present PyLabRobot, an open-source, cross-platform Python interface capable of programming diverse liquid-handling robots, including Hamilton STARs, Tecan EVOs, and Opentrons OT-2s. PyLabRobot provides a universal set of commands and representations for deck layout and labware, enabling the control of diverse accessory devices. The interface is extensible and can work with any robot that manipulates liquids within a Cartesian coordinate system. We validated the system through unit tests and several application demonstrations, including a browser-based simulator, a position calibration tool, and a path-teaching tool for complex movements. PyLabRobot provides a flexible, open, and collaborative programming environment for laboratory automation.

16.
J Chem Inf Model ; 63(13): 4030-4041, 2023 Jul 10.
Article in English | MEDLINE | ID: mdl-37368970

ABSTRACT

Reaction diagram parsing is the task of extracting reaction schemes from a diagram in the chemistry literature. The reaction diagrams can be arbitrarily complex; thus, robustly parsing them into structured data is an open challenge. In this paper, we present RxnScribe, a machine learning model for parsing reaction diagrams of varying styles. We formulate this structured prediction task with a sequence generation approach, which condenses the traditional pipeline into an end-to-end model. We train RxnScribe on a dataset of 1378 diagrams and evaluate it with cross validation, achieving an 80.0% soft match F1 score, with significant improvements over previous models. Our code and data are publicly available at https://github.com/thomas0809/RxnScribe.


Subject(s)
Machine Learning
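
The RxnScribe abstract above reports a "soft match" F1 score but does not define the matching criterion. The sketch below shows only the generic F1 arithmetic, assuming that some criterion has already decided how many predicted reactions match a ground-truth reaction. The function name and the counts are hypothetical.

```python
def f1_score(num_matched: int, num_predicted: int, num_gold: int) -> float:
    """Harmonic mean of precision and recall, computed from match counts."""
    if num_predicted == 0 or num_gold == 0:
        return 0.0
    precision = num_matched / num_predicted  # matched share of what was predicted
    recall = num_matched / num_gold          # matched share of the ground truth
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 10 predicted reactions, 10 annotated, 8 judged to match.
score = f1_score(num_matched=8, num_predicted=10, num_gold=10)
```

With these counts precision and recall are both 0.8, so the F1 score is also about 0.8. A looser matching criterion raises `num_matched` and therefore the score, which is why the criterion must be stated alongside any reported value.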
19.
Patterns (N Y) ; 4(2): 100678, 2023 Feb 10.
Article in English | MEDLINE | ID: mdl-36873904

ABSTRACT

Molecular discovery is a multi-objective optimization problem that requires identifying a molecule or set of molecules that balance multiple, often competing, properties. Multi-objective molecular design is commonly addressed by combining properties of interest into a single objective function using scalarization, which imposes assumptions about relative importance and uncovers little about the trade-offs between objectives. In contrast to scalarization, Pareto optimization does not require knowledge of relative importance and reveals the trade-offs between objectives. However, it introduces additional considerations in algorithm design. In this review, we describe pool-based and de novo generative approaches to multi-objective molecular discovery with a focus on Pareto optimization algorithms. We show how pool-based molecular discovery is a relatively direct extension of multi-objective Bayesian optimization and how the plethora of different generative models extend from single-objective to multi-objective optimization in similar ways using non-dominated sorting in the reward function (reinforcement learning) or to select molecules for retraining (distribution learning) or propagation (genetic algorithms). Finally, we discuss some remaining challenges and opportunities in the field, emphasizing the opportunity to adopt Bayesian optimization techniques into multi-objective de novo design.
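
The contrast the abstract above draws between scalarization and Pareto optimization can be sketched in a few lines. The example assumes two objectives that are both maximized. The candidate scores, the weights, and the function names are hypothetical and are not drawn from the review.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points: Sequence[Sequence[float]]) -> List[int]:
    """Indices of the points that no other point dominates."""
    return [
        i
        for i, p in enumerate(points)
        if not any(dominates(q, p) for j, q in enumerate(points) if j != i)
    ]


def scalarize(point: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of objectives; the weights encode assumed relative importance."""
    return sum(w * x for w, x in zip(weights, point))


# Hypothetical (potency, solubility) scores for five candidate molecules.
candidates = [(0.9, 0.2), (0.6, 0.6), (0.2, 0.9), (0.5, 0.5), (0.1, 0.1)]

# Pareto optimization keeps every non-dominated trade-off.
front = pareto_front(candidates)

# Scalarization returns a single winner that depends on the chosen weights.
best_equal = max(
    range(len(candidates)), key=lambda i: scalarize(candidates[i], (0.5, 0.5))
)
best_potency = max(
    range(len(candidates)), key=lambda i: scalarize(candidates[i], (0.9, 0.1))
)
```

Here the front contains three candidates, while each weight choice selects only one of them. This pairwise check costs quadratic time in the number of candidates, which is fine for a small pool but would be replaced by a faster non-dominated sorting routine at scale.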

20.
J Chem Inf Model ; 63(7): 1925-1934, 2023 Apr 10.
Article in English | MEDLINE | ID: mdl-36971363

ABSTRACT

Molecular structure recognition is the task of translating a molecular image into its graph structure. Significant variation in drawing styles and conventions exhibited in chemical literature poses a significant challenge for automating this task. In this paper, we propose MolScribe, a novel image-to-graph generation model that explicitly predicts atoms and bonds, along with their geometric layouts, to construct the molecular structure. Our model flexibly incorporates symbolic chemistry constraints to recognize chirality and expand abbreviated structures. We further develop data augmentation strategies to enhance the model robustness against domain shifts. In experiments on both synthetic and realistic molecular images, MolScribe significantly outperforms previous models, achieving 76-93% accuracy on public benchmarks. Chemists can also easily verify MolScribe's prediction, informed by its confidence estimation and atom-level alignment with the input image. MolScribe is publicly available through Python and web interfaces: https://github.com/thomas0809/MolScribe.


Subject(s)
Benchmarking; Molecular Structure