Results 1 - 20 of 77
1.
J Am Chem Soc ; 146(23): 16052-16061, 2024 Jun 12.
Article in English | MEDLINE | ID: mdl-38822795

ABSTRACT

The application of machine learning models to the prediction of reaction outcomes currently needs large and/or highly featurized data sets. We show that a chemistry-aware model, NERF, which mimics the bonding changes that occur during reactions, allows for highly accurate predictions of the outcomes of Diels-Alder reactions using a relatively small training set, with no pretraining and no additional features. We establish a diverse data set of 9537 intramolecular, hetero-, aromatic, and inverse electron demand Diels-Alder reactions. This data set is used to train a NERF model, and the performance is compared against state-of-the-art classification and generative machine learning models across low- and high-data regimes, with and without pretraining. The predictive accuracy (regio- and site selectivity in the major product) achieved by NERF exceeds 90% when as little as 40% of the data set is used for training. Another high-performing model, Chemformer, requires a larger training data set (>45%) and pretraining to reach 90% Top-1 accuracy. Accurate predictions of less-represented reaction subclasses, such as those involving heteroatomic or aromatic substrates, require higher percentages of training data. We also show how NERF can use small amounts of additional training data to quickly learn new systems and improve its overall understanding of reactivity. Synthetic chemists stand to benefit as this model can be rapidly expanded and tailored to areas of chemistry corresponding to the low-data regime.
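
The Top-1 accuracy quoted in this abstract can be illustrated with a minimal sketch. It assumes each model returns a ranked list of candidate products per reaction and that a prediction counts as correct when the recorded major product appears among the first k candidates. The function name and the product labels are hypothetical and are not taken from the NERF or Chemformer code.

```python
from typing import Sequence


def top_k_accuracy(
    ranked_predictions: Sequence[Sequence[str]],
    truths: Sequence[str],
    k: int = 1,
) -> float:
    """Fraction of reactions whose recorded product is among the first k predictions."""
    if len(ranked_predictions) != len(truths):
        raise ValueError("need exactly one ranked prediction list per reaction")
    if not truths:
        raise ValueError("need at least one reaction")
    hits = sum(
        1 for preds, truth in zip(ranked_predictions, truths) if truth in preds[:k]
    )
    return hits / len(truths)


# Hypothetical product labels for four test reactions (illustration only).
predictions = [["P1", "P2"], ["P4", "P3"], ["P5", "P6"], ["P8", "P7"]]
truths = ["P1", "P3", "P5", "P9"]

top1 = top_k_accuracy(predictions, truths, k=1)  # 2 of 4 reactions correct
top2 = top_k_accuracy(predictions, truths, k=2)  # 3 of 4 reactions correct
```

In practice the product labels would be canonicalized structure strings, so that two notations of the same product compare equal.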

2.
Nat Comput Sci ; 2024 Jun 17.
Article in English | MEDLINE | ID: mdl-38886590

ABSTRACT

Small molecules exhibiting desirable property profiles are often discovered through an iterative process of designing, synthesizing and testing sets of molecules. The selection of molecules to synthesize from all possible candidates is a complex decision-making process that typically relies on expert chemist intuition. Here we propose a quantitative decision-making framework, SPARROW, that prioritizes molecules for evaluation by balancing expected information gain and synthetic cost. SPARROW integrates molecular design, property prediction and retrosynthetic planning to balance the utility of testing a molecule with the cost of batch synthesis. We demonstrate, through three case studies, that the developed algorithm captures the non-additive costs inherent to batch synthesis, leverages common reaction steps and intermediates, and scales to hundreds of molecules.
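
The non-additive batch cost mentioned in this abstract can be made concrete with a toy sketch. This is not the SPARROW algorithm: it is a brute-force search over three hypothetical candidates, assuming each candidate has one fixed route, each reaction step has a fixed cost, and a step shared by several selected molecules is paid for only once. All names, utilities, and costs are invented for illustration.

```python
from itertools import combinations

# Hypothetical candidates: expected utility and the reaction steps each route needs.
candidates = {
    "mol_A": {"utility": 5.0, "steps": {"r1", "r2"}},
    "mol_B": {"utility": 4.0, "steps": {"r1", "r3"}},
    "mol_C": {"utility": 3.0, "steps": {"r4", "r5", "r6"}},
}
# Hypothetical cost of running each reaction step once.
step_cost = {"r1": 2.0, "r2": 1.0, "r3": 1.0, "r4": 2.0, "r5": 2.0, "r6": 2.0}


def batch_cost(selection):
    """Cost of a batch: every distinct step is counted once, even if shared."""
    steps = set().union(*(candidates[name]["steps"] for name in selection))
    return sum(step_cost[step] for step in steps)


def score(selection, cost_weight=1.0):
    """Total expected utility minus weighted batch cost."""
    utility = sum(candidates[name]["utility"] for name in selection)
    return utility - cost_weight * batch_cost(selection)


def best_batch(cost_weight=1.0):
    """Exhaustively score every subset; feasible only for a handful of candidates."""
    names = sorted(candidates)
    subsets = [
        subset
        for size in range(len(names) + 1)
        for subset in combinations(names, size)
    ]
    return max(subsets, key=lambda subset: score(subset, cost_weight))


selected = best_batch()
shared_cost = batch_cost(("mol_A", "mol_B"))  # step r1 is paid for once
additive_cost = batch_cost(("mol_A",)) + batch_cost(("mol_B",))
```

Here mol_A and mol_B share step r1, so making them together costs less than the sum of their individual costs, which is why the pair is selected. Exhaustive enumeration grows exponentially with the number of candidates, so a method that scales to hundreds of molecules needs a different optimization strategy.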

3.
J Am Chem Soc ; 146(22): 15070-15084, 2024 Jun 05.
Article in English | MEDLINE | ID: mdl-38768950

ABSTRACT

Despite the increased use of computational tools to supplement medicinal chemists' expertise and intuition in drug design, predicting synthetic yields in medicinal chemistry endeavors remains an unsolved challenge. Existing design workflows could profoundly benefit from reaction yield prediction, as precious material waste could be reduced, and a greater number of relevant compounds could be delivered to advance the design, make, test, analyze (DMTA) cycle. In this work, we detail the evaluation of AbbVie's medicinal chemistry library data set to build machine learning models for the prediction of Suzuki coupling reaction yields. The combination of density functional theory (DFT)-derived features and Morgan fingerprints was identified to perform better than one-hot encoded baseline modeling, furnishing encouraging results. Overall, we observe modest generalization to unseen reactant structures within the 15-year retrospective library data set. Additionally, we compare predictions made by the model to those made by expert medicinal chemists, finding that the model can often predict both reaction success and reaction yields with greater accuracy. Finally, we demonstrate the application of this approach to suggest structurally and electronically similar building blocks to replace those predicted or observed to be unsuccessful prior to or after synthesis, respectively. The yield prediction model was used to select similar monomers predicted to have higher yields, resulting in greater synthesis efficiency of relevant drug-like molecules.


Subjects
Drug Design , Small Molecule Libraries , Small Molecule Libraries/chemistry , Small Molecule Libraries/chemical synthesis , Machine Learning , Density Functional Theory , Molecular Structure , Chemistry, Pharmaceutical/methods
4.
Nat Rev Chem ; 8(5): 300-301, 2024 May.
Article in English | MEDLINE | ID: mdl-38605148
5.
J Chem Inf Model ; 64(8): 2948-2954, 2024 Apr 22.
Article in English | MEDLINE | ID: mdl-38488634

ABSTRACT

SMARTS is a widely used language in cheminformatics for defining substructural queries for database lookups, reaction templates for chemical transformations, and other applications. As an extension to SMILES, many SMARTS patterns can represent the same query. Despite this, no canonicalization algorithm invariant of the line notation sequence or atomic numbering is publicly available. Here, we introduce RDCanon, an open-source Python package that can be used to standardize SMARTS queries. RDCanon is designed to ensure that the sequence of atomic queries remains consistent for all graphs representing the same substructure query and to ensure a canonical sequence of primitives within each individual atom query; furthermore, the algorithm can be applied to canonicalize the order of reactants, agents, and products and their atom map numbers in reaction SMARTS templates. As part of its canonicalization algorithm, RDCanon provides a mechanism in which the canonicalized SMARTS is optimized for speed against specific molecular databases. Several case studies are provided to showcase improved efficiency in substructure matching and retrosynthetic analysis.
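
The idea of a canonical sequence of primitives within a single atom query can be illustrated with a deliberately small sketch. It is not RDCanon's algorithm and does not use its API: it only sorts primitives joined by the low-precedence AND operator `;` inside one bracketed atom, which is safe because AND is commutative. It assumes the query contains no recursive SMARTS (`$(...)`) and no atom-map number. The function name is hypothetical.

```python
def canonical_atom_query(atom_query: str) -> str:
    """Sort the ';'-joined primitives of one bracketed SMARTS atom query."""
    if not (atom_query.startswith("[") and atom_query.endswith("]")):
        raise ValueError("expected a single bracketed atom query")
    body = atom_query[1:-1]
    # Assumption of this sketch: recursive SMARTS and atom maps are not handled.
    if "$(" in body or ":" in body:
        raise ValueError("recursive SMARTS and atom maps are outside this sketch")
    primitives = sorted(part for part in body.split(";") if part)
    return "[" + ";".join(primitives) + "]"


# Three spellings of the same query: a carbon (#6) in a ring (R) with four connections (X4).
variants = ["[R;#6;X4]", "[X4;R;#6]", "[#6;X4;R]"]
canonical_forms = {canonical_atom_query(variant) for variant in variants}
```

All three spellings collapse to one string, so they can be compared or deduplicated by plain string equality. Canonicalizing a whole pattern additionally requires a canonical ordering of the atoms in the query graph, which this sketch does not attempt.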


Subjects
Algorithms , Software , Programming Languages , Cheminformatics/methods , Databases, Chemical
6.
Anal Chem ; 96(8): 3419-3428, 2024 Feb 27.
Article in English | MEDLINE | ID: mdl-38349970

ABSTRACT

The accurate prediction of tandem mass spectra from molecular structures has the potential to unlock new metabolomic discoveries by augmenting the community's libraries of experimental reference standards. Cheminformatic spectrum prediction strategies use a "bond-breaking" framework to iteratively simulate mass spectrum fragmentations, but these methods are (a) slow due to the need to exhaustively and combinatorially break molecules and (b) inaccurate as they often rely upon heuristics to predict the intensity of each resulting fragment; neural network alternatives mitigate computational cost but are black-box and not inherently more accurate. We introduce a physically grounded neural approach that learns to predict each breakage event and score the most relevant subset of molecular fragments quickly and accurately. We evaluate our model by predicting spectra from both public and private standard libraries, demonstrating that our hybrid approach offers state-of-the-art prediction accuracy, improved metabolite identification from a database of candidates, and higher interpretability when compared to previous breakage methods and black-box neural networks. The grounding of our approach in physical fragmentation events shows especially great promise for elucidating natural product molecules with more complex scaffolds.

7.
J Chem Inf Model ; 64(7): 2421-2431, 2024 Apr 08.
Article in English | MEDLINE | ID: mdl-37725368

ABSTRACT

Chemical formula annotation for tandem mass spectrometry (MS/MS) data is the first step toward structurally elucidating unknown metabolites. While great strides have been made toward solving this problem, the current state-of-the-art method depends on time-intensive, proprietary, and expert-parametrized fragmentation tree construction and scoring. In this work, we extend our previous spectrum Transformer methodology into an energy-based modeling framework, MIST-CF: Metabolite Inference with Spectrum Transformers for Chemical Formula prediction, for learning to rank chemical formula and adduct assignments given an unannotated MS/MS spectrum. Importantly, MIST-CF learns in a data-dependent fashion using a Formula Transformer neural network architecture and circumvents the need for fragmentation tree construction. We train and evaluate our model on a large open-access database, showing an absolute improvement of 10% top 1 accuracy over other neural network architectures. We further validate our approach on the CASMI2022 challenge data set, achieving nearly equivalent performance to the winning entry within the positive mode category without any manual curation or postprocessing of our results. These results demonstrate an exciting strategy to more powerfully leverage MS2 fragment peaks for predicting MS1 precursor chemical formulas with data-driven learning.


Subjects
Neural Networks, Computer , Tandem Mass Spectrometry , Tandem Mass Spectrometry/methods , Databases, Factual
8.
Nat Chem Biol ; 20(2): 170-179, 2024 Feb.
Article in English | MEDLINE | ID: mdl-37919549

ABSTRACT

Small molecules that induce protein-protein associations represent powerful tools to modulate cell circuitry. We sought to develop a platform for the direct discovery of compounds able to induce association of any two preselected proteins, using the E3 ligase von Hippel-Lindau (VHL) and bromodomains as test systems. Leveraging the screening power of DNA-encoded libraries (DELs), we synthesized ~1 million DNA-encoded compounds that possess a VHL-targeting ligand, a variety of connectors and a diversity element generated by split-and-pool combinatorial chemistry. By screening our DEL against bromodomains in the presence and absence of VHL, we could identify VHL-bound molecules that simultaneously bind bromodomains. For highly barcode-enriched library members, ternary complex formation leading to bromodomain degradation was confirmed in cells. Furthermore, a ternary complex crystal structure was obtained for our most enriched library member with BRD4BD1 and a VHL complex. Our work provides a foundation for adapting DEL screening to the discovery of proximity-inducing small molecules.


Subjects
Nuclear Proteins , Von Hippel-Lindau Tumor Suppressor Protein , Von Hippel-Lindau Tumor Suppressor Protein/chemistry , Von Hippel-Lindau Tumor Suppressor Protein/metabolism , Nuclear Proteins/metabolism , Transcription Factors , Ubiquitin-Protein Ligases/metabolism , DNA
10.
ACS Macro Lett ; 12(11): 1517-1522, 2023 Nov 21.
Article in English | MEDLINE | ID: mdl-37889173

ABSTRACT

We unveil a unified view on the effect of side chains on the glass transition temperatures (Tg) in polymer melts by using molecular dynamics simulations, density functional theory calculations, and available experimental data. We use acrylates as a model system and evaluate the effect of n-alkyl side chains on Tg. We find that backbone dihedral angle fluctuations follow established patterns due to sterics, as expected. However, we also find that the dihedral angle orthogonal to the backbone, which normally is neglected when discussing the effect on Tg, introduces a secondary rotational degree of freedom which strongly impacts Tg. These results are in agreement with experiments and generalize to multiple other polymer systems, as is demonstrated using available experimental data. Conversely, n-alkyl pendant groups attached to the side group reduce Tg. Our work establishes a coherent framework that unifies previously established trends, emphasizing the polarity and size effects of n-alkyl chains on Tg.

11.
Nature ; 620(7972): 47-60, 2023 Aug.
Article in English | MEDLINE | ID: mdl-37532811

ABSTRACT

Artificial intelligence (AI) is being increasingly integrated into scientific discovery to augment and accelerate research, helping scientists to generate hypotheses, design experiments, collect and interpret large datasets, and gain insights that might not have been possible using traditional scientific methods alone. Here we examine breakthroughs over the past decade that include self-supervised learning, which allows models to be trained on vast amounts of unlabelled data, and geometric deep learning, which leverages knowledge about the structure of scientific data to enhance model accuracy and efficiency. Generative AI methods can create designs, such as small-molecule drugs and proteins, by analysing diverse data modalities, including images and sequences. We discuss how these methods can help scientists throughout the scientific process and the central issues that remain despite such advances. Both developers and users of AI tools need a better understanding of when such approaches need improvement, and challenges posed by poor data quality and stewardship remain. These issues cut across scientific disciplines and require developing foundational algorithmic approaches that can contribute to scientific understanding or acquire it autonomously, making them critical areas of focus for AI innovation.


Subjects
Artificial Intelligence , Research Design , Artificial Intelligence/standards , Artificial Intelligence/trends , Datasets as Topic , Deep Learning , Research Design/standards , Research Design/trends , Unsupervised Machine Learning
13.
Chemistry ; 29(60): e202301957, 2023 Oct 26.
Article in English | MEDLINE | ID: mdl-37526059

ABSTRACT

Molecular quantum mechanical modeling, accelerated by machine learning, has opened the door to high-throughput screening campaigns of complex properties, such as the activation energies of chemical reactions and absorption/emission spectra of materials and molecules, in silico. Here, we present an overview of the main principles, concepts, and design considerations involved in such hybrid computational quantum chemistry/machine learning screening workflows, with a special emphasis on some recent examples of their successful application. We end with a brief outlook of further advances that will benefit the field.

14.
Nat Commun ; 14(1): 4930, 2023 Aug 15.
Article in English | MEDLINE | ID: mdl-37582753

ABSTRACT

Diversity-oriented synthesis (DOS) is a powerful strategy to prepare molecules with underrepresented features in commercial screening collections, resulting in the elucidation of novel biological mechanisms. In parallel to the development of DOS, DNA-encoded libraries (DELs) have emerged as an effective, efficient screening strategy to identify protein binders. Despite recent advancements in this field, most DEL syntheses are limited by the presence of sensitive DNA-based constructs. Here, we describe the design, synthesis, and validation experiments performed for a 3.7 million-member DEL, generated using diverse skeleton architectures with varying exit vectors and derived from DOS, to achieve structural diversity beyond what is possible by varying appendages alone. We also show screening results for three diverse protein targets. We will make this DEL available to the academic scientific community to increase access to novel structural features and accelerate early-phase drug discovery.


Subjects
Drug Discovery , Small Molecule Libraries , Small Molecule Libraries/chemistry , Drug Discovery/methods , Gene Library , DNA/genetics , DNA/chemistry
15.
J Chem Inf Model ; 63(14): 4253-4265, 2023 Jul 24.
Article in English | MEDLINE | ID: mdl-37405398

ABSTRACT

The past decade has seen a number of impressive developments in predictive chemistry and reaction informatics driven by machine learning applications to computer-aided synthesis planning. While many of these developments have been made even with relatively small, bespoke data sets, in order to advance the role of AI in the field at scale, there must be significant improvements in the reporting of reaction data. Currently, the majority of publicly available data is reported in an unstructured format and heavily imbalanced toward high-yielding reactions, which influences the types of models that can be successfully trained. In this Perspective, we analyze several data curation and sharing initiatives that have seen success in chemistry and molecular biology. We discuss several factors that have contributed to their success and how we can take lessons from these case studies and apply them to reaction data. Finally, we spotlight the Open Reaction Database and summarize key actions the community can take toward making reaction data more findable, accessible, interoperable, and reusable (FAIR), including the use of mandates from funding agencies and publishers.


Subjects
Data Curation , Informatics , Databases, Factual , Information Dissemination
16.
J Chem Inf Model ; 63(13): 4030-4041, 2023 Jul 10.
Article in English | MEDLINE | ID: mdl-37368970

ABSTRACT

Reaction diagram parsing is the task of extracting reaction schemes from a diagram in the chemistry literature. The reaction diagrams can be arbitrarily complex; thus, robustly parsing them into structured data is an open challenge. In this paper, we present RxnScribe, a machine learning model for parsing reaction diagrams of varying styles. We formulate this structured prediction task with a sequence generation approach, which condenses the traditional pipeline into an end-to-end model. We train RxnScribe on a dataset of 1378 diagrams and evaluate it with cross validation, achieving an 80.0% soft match F1 score, with significant improvements over previous models. Our code and data are publicly available at https://github.com/thomas0809/RxnScribe.
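
For readers unfamiliar with the F1 score reported in this abstract, a minimal sketch follows. It assumes predicted reactions have already been matched against the ground truth and tallied as true positives, false positives, and false negatives. The soft-match criterion that decides whether a predicted reaction counts as correct is defined in the paper and is not reproduced here. The counts are hypothetical.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Harmonic mean of precision and recall, computed from match counts."""
    if true_positives == 0:
        return 0.0  # no correct predictions: precision and recall are zero or undefined
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# Hypothetical tally: 8 reactions parsed correctly, 2 spurious, 2 missed.
example = f1_score(true_positives=8, false_positives=2, false_negatives=2)
```

With these counts precision and recall are both 0.8, so the F1 score is also approximately 0.8 (up to floating-point rounding).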


Subjects
Machine Learning
19.
Patterns (N Y) ; 4(2): 100678, 2023 Feb 10.
Article in English | MEDLINE | ID: mdl-36873904

ABSTRACT

Molecular discovery is a multi-objective optimization problem that requires identifying a molecule or set of molecules that balance multiple, often competing, properties. Multi-objective molecular design is commonly addressed by combining properties of interest into a single objective function using scalarization, which imposes assumptions about relative importance and uncovers little about the trade-offs between objectives. In contrast to scalarization, Pareto optimization does not require knowledge of relative importance and reveals the trade-offs between objectives. However, it introduces additional considerations in algorithm design. In this review, we describe pool-based and de novo generative approaches to multi-objective molecular discovery with a focus on Pareto optimization algorithms. We show how pool-based molecular discovery is a relatively direct extension of multi-objective Bayesian optimization and how the plethora of different generative models extend from single-objective to multi-objective optimization in similar ways using non-dominated sorting in the reward function (reinforcement learning) or to select molecules for retraining (distribution learning) or propagation (genetic algorithms). Finally, we discuss some remaining challenges and opportunities in the field, emphasizing the opportunity to adopt Bayesian optimization techniques into multi-objective de novo design.
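
The non-dominated sorting mentioned in this abstract can be sketched in a few lines. The example assumes every objective is to be maximized and uses a simple quadratic-time procedure rather than any specific algorithm from the review. The molecule names and property values are hypothetical.

```python
def dominates(a, b):
    """True if a is at least as good as b in every objective and strictly better in one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def non_dominated_sort(points):
    """Split named points into successive Pareto fronts (front 0 is the Pareto front)."""
    remaining = dict(points)
    fronts = []
    while remaining:
        # A point belongs to the current front if no other remaining point dominates it.
        front = sorted(
            name
            for name, values in remaining.items()
            if not any(
                dominates(other_values, values)
                for other_name, other_values in remaining.items()
                if other_name != name
            )
        )
        fronts.append(front)
        for name in front:
            del remaining[name]
    return fronts


# Hypothetical molecules scored on two properties, both to be maximized.
molecules = {
    "m1": (0.9, 0.2),
    "m2": (0.6, 0.6),
    "m3": (0.2, 0.9),
    "m4": (0.5, 0.5),
    "m5": (0.1, 0.1),
}
fronts = non_dominated_sort(molecules)
```

The first front (m1, m2, m3) exposes the trade-off between the two properties without assigning them relative weights. A weighted-sum scalarization would instead return a single molecule that depends on the chosen weights.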

20.
J Chem Inf Model ; 63(7): 1925-1934, 2023 Apr 10.
Article in English | MEDLINE | ID: mdl-36971363

ABSTRACT

Molecular structure recognition is the task of translating a molecular image into its graph structure. Significant variation in drawing styles and conventions exhibited in chemical literature poses a significant challenge for automating this task. In this paper, we propose MolScribe, a novel image-to-graph generation model that explicitly predicts atoms and bonds, along with their geometric layouts, to construct the molecular structure. Our model flexibly incorporates symbolic chemistry constraints to recognize chirality and expand abbreviated structures. We further develop data augmentation strategies to enhance the model robustness against domain shifts. In experiments on both synthetic and realistic molecular images, MolScribe significantly outperforms previous models, achieving 76-93% accuracy on public benchmarks. Chemists can also easily verify MolScribe's prediction, informed by its confidence estimation and atom-level alignment with the input image. MolScribe is publicly available through Python and web interfaces: https://github.com/thomas0809/MolScribe.


Subjects
Benchmarking , Molecular Structure