Results 1 - 20 of 78
1.
Nature ; 620(7972): 47-60, 2023 Aug.
Article in English | MEDLINE | ID: mdl-37532811

ABSTRACT

Artificial intelligence (AI) is being increasingly integrated into scientific discovery to augment and accelerate research, helping scientists to generate hypotheses, design experiments, collect and interpret large datasets, and gain insights that might not have been possible using traditional scientific methods alone. Here we examine breakthroughs over the past decade that include self-supervised learning, which allows models to be trained on vast amounts of unlabelled data, and geometric deep learning, which leverages knowledge about the structure of scientific data to enhance model accuracy and efficiency. Generative AI methods can create designs, such as small-molecule drugs and proteins, by analysing diverse data modalities, including images and sequences. We discuss how these methods can help scientists throughout the scientific process and the central issues that remain despite such advances. Both developers and users of AI tools need a better understanding of when such approaches need improvement, and challenges posed by poor data quality and stewardship remain. These issues cut across scientific disciplines and require developing foundational algorithmic approaches that can contribute to scientific understanding or acquire it autonomously, making them critical areas of focus for AI innovation.


Subjects
Artificial Intelligence , Research Design , Artificial Intelligence/standards , Artificial Intelligence/trends , Datasets as Topic , Deep Learning , Research Design/standards , Research Design/trends , Unsupervised Machine Learning
2.
Nat Chem Biol ; 20(2): 170-179, 2024 Feb.
Article in English | MEDLINE | ID: mdl-37919549

ABSTRACT

Small molecules that induce protein-protein associations represent powerful tools to modulate cell circuitry. We sought to develop a platform for the direct discovery of compounds able to induce association of any two preselected proteins, using the E3 ligase von Hippel-Lindau (VHL) and bromodomains as test systems. Leveraging the screening power of DNA-encoded libraries (DELs), we synthesized ~1 million DNA-encoded compounds that possess a VHL-targeting ligand, a variety of connectors and a diversity element generated by split-and-pool combinatorial chemistry. By screening our DEL against bromodomains in the presence and absence of VHL, we could identify VHL-bound molecules that simultaneously bind bromodomains. For highly barcode-enriched library members, ternary complex formation leading to bromodomain degradation was confirmed in cells. Furthermore, a ternary complex crystal structure was obtained for our most enriched library member with BRD4(BD1) and a VHL complex. Our work provides a foundation for adapting DEL screening to the discovery of proximity-inducing small molecules.


Subjects
Nuclear Proteins , Von Hippel-Lindau Tumor Suppressor Protein , Von Hippel-Lindau Tumor Suppressor Protein/chemistry , Von Hippel-Lindau Tumor Suppressor Protein/metabolism , Nuclear Proteins/metabolism , Transcription Factors , Ubiquitin-Protein Ligases/metabolism , DNA
3.
J Am Chem Soc ; 146(23): 16052-16061, 2024 Jun 12.
Article in English | MEDLINE | ID: mdl-38822795

ABSTRACT

The application of machine learning models to the prediction of reaction outcomes currently needs large and/or highly featurized data sets. We show that a chemistry-aware model, NERF, which mimics the bonding changes that occur during reactions, allows for highly accurate predictions of the outcomes of Diels-Alder reactions using a relatively small training set, with no pretraining and no additional features. We establish a diverse data set of 9537 intramolecular, hetero-, aromatic, and inverse electron demand Diels-Alder reactions. This data set is used to train a NERF model, and the performance is compared against state-of-the-art classification and generative machine learning models across low- and high-data regimes, with and without pretraining. The predictive accuracy (regio- and site selectivity in the major product) achieved by NERF exceeds 90% when as little as 40% of the data set is used for training. Another high-performing model, Chemformer, requires a larger training data set (>45%) and pretraining to reach 90% Top-1 accuracy. Accurate predictions of less-represented reaction subclasses, such as those involving heteroatomic or aromatic substrates, require higher percentages of training data. We also show how NERF can use small amounts of additional training data to quickly learn new systems and improve its overall understanding of reactivity. Synthetic chemists stand to benefit as this model can be rapidly expanded and tailored to areas of chemistry corresponding to the low-data regime.

4.
J Am Chem Soc ; 146(22): 15070-15084, 2024 Jun 05.
Article in English | MEDLINE | ID: mdl-38768950

ABSTRACT

Despite the increased use of computational tools to supplement medicinal chemists' expertise and intuition in drug design, predicting synthetic yields in medicinal chemistry endeavors remains an unsolved challenge. Existing design workflows could profoundly benefit from reaction yield prediction, as precious material waste could be reduced, and a greater number of relevant compounds could be delivered to advance the design, make, test, analyze (DMTA) cycle. In this work, we detail the evaluation of AbbVie's medicinal chemistry library data set to build machine learning models for the prediction of Suzuki coupling reaction yields. The combination of density functional theory (DFT)-derived features and Morgan fingerprints was identified to perform better than one-hot encoded baseline modeling, furnishing encouraging results. Overall, we observe modest generalization to unseen reactant structures within the 15-year retrospective library data set. Additionally, we compare predictions made by the model to those made by expert medicinal chemists, finding that the model can often predict both reaction success and reaction yields with greater accuracy. Finally, we demonstrate the application of this approach to suggest structurally and electronically similar building blocks to replace those predicted or observed to be unsuccessful prior to or after synthesis, respectively. The yield prediction model was used to select similar monomers predicted to have higher yields, resulting in greater synthesis efficiency of relevant drug-like molecules.


Subjects
Drug Design , Small Molecule Libraries , Small Molecule Libraries/chemistry , Small Molecule Libraries/chemical synthesis , Machine Learning , Density Functional Theory , Molecular Structure , Chemistry, Pharmaceutical/methods
5.
Anal Chem ; 96(8): 3419-3428, 2024 Feb 27.
Article in English | MEDLINE | ID: mdl-38349970

ABSTRACT

The accurate prediction of tandem mass spectra from molecular structures has the potential to unlock new metabolomic discoveries by augmenting the community's libraries of experimental reference standards. Cheminformatic spectrum prediction strategies use a "bond-breaking" framework to iteratively simulate mass spectrum fragmentations, but these methods are (a) slow due to the need to exhaustively and combinatorially break molecules and (b) inaccurate as they often rely upon heuristics to predict the intensity of each resulting fragment; neural network alternatives mitigate computational cost but are black-box and not inherently more accurate. We introduce a physically grounded neural approach that learns to predict each breakage event and score the most relevant subset of molecular fragments quickly and accurately. We evaluate our model by predicting spectra from both public and private standard libraries, demonstrating that our hybrid approach offers state-of-the-art prediction accuracy, improved metabolite identification from a database of candidates, and higher interpretability when compared to previous breakage methods and black-box neural networks. The grounding of our approach in physical fragmentation events shows especially great promise for elucidating natural product molecules with more complex scaffolds.
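
The combinatorial cost of the "bond-breaking" framework mentioned in this abstract can be illustrated with a toy sketch. This is not the authors' method: the molecule is a hypothetical heavy-atom graph of ethanol (C-C-O), the function names are invented for illustration, and fragment intensities are not modeled at all.

```python
from itertools import combinations


def connected_components(atoms, bonds):
    """Return the connected components of an undirected graph as frozensets."""
    neighbors = {atom: set() for atom in atoms}
    for u, v in bonds:
        neighbors[u].add(v)
        neighbors[v].add(u)
    seen, components = set(), []
    for start in atoms:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbors[node] - component)
        seen |= component
        components.append(frozenset(component))
    return components


def enumerate_fragments(atoms, bonds, max_breaks):
    """Exhaustively break up to max_breaks bonds and collect distinct fragments."""
    fragments = set()
    for k in range(1, max_breaks + 1):
        for broken in combinations(bonds, k):
            remaining = [bond for bond in bonds if bond not in broken]
            fragments.update(connected_components(atoms, remaining))
    return fragments


# Hypothetical toy molecule: heavy atoms of ethanol, C0-C1-O2.
atoms = ["C0", "C1", "O2"]
bonds = [("C0", "C1"), ("C1", "O2")]
fragments = enumerate_fragments(atoms, bonds, max_breaks=2)
```

Because every subset of up to `max_breaks` bonds is visited, the number of cases grows combinatorially with molecule size, which is the slowness the abstract attributes to exhaustive bond breaking.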

6.
J Chem Inf Model ; 64(8): 2948-2954, 2024 Apr 22.
Article in English | MEDLINE | ID: mdl-38488634

ABSTRACT

SMARTS is a widely used language in cheminformatics for defining substructural queries for database lookups, reaction templates for chemical transformations, and other applications. As an extension to SMILES, many SMARTS patterns can represent the same query. Despite this, no canonicalization algorithm invariant of the line notation sequence or atomic numbering is publicly available. Here, we introduce RDCanon, an open-source Python package that can be used to standardize SMARTS queries. RDCanon is designed to ensure that the sequence of atomic queries remains consistent for all graphs representing the same substructure query and to ensure a canonical sequence of primitives within each individual atom query; furthermore, the algorithm can be applied to canonicalize the order of reactants, agents, and products and their atom map numbers in reaction SMARTS templates. As part of its canonicalization algorithm, RDCanon provides a mechanism in which the canonicalized SMARTS is optimized for speed against specific molecular databases. Several case studies are provided to showcase improved efficiency in substructure matching and retrosynthetic analysis.


Subjects
Algorithms , Software , Programming Languages , Cheminformatics/methods , Databases, Chemical
7.
J Chem Inf Model ; 64(7): 2421-2431, 2024 Apr 08.
Article in English | MEDLINE | ID: mdl-37725368

ABSTRACT

Chemical formula annotation for tandem mass spectrometry (MS/MS) data is the first step toward structurally elucidating unknown metabolites. While great strides have been made toward solving this problem, the current state-of-the-art method depends on time-intensive, proprietary, and expert-parametrized fragmentation tree construction and scoring. In this work, we extend our previous spectrum Transformer methodology into an energy-based modeling framework, MIST-CF: Metabolite Inference with Spectrum Transformers for Chemical Formula prediction, for learning to rank chemical formula and adduct assignments given an unannotated MS/MS spectrum. Importantly, MIST-CF learns in a data-dependent fashion using a Formula Transformer neural network architecture and circumvents the need for fragmentation tree construction. We train and evaluate our model on a large open-access database, showing an absolute improvement of 10% top 1 accuracy over other neural network architectures. We further validate our approach on the CASMI2022 challenge data set, achieving nearly equivalent performance to the winning entry within the positive mode category without any manual curation or postprocessing of our results. These results demonstrate an exciting strategy to more powerfully leverage MS2 fragment peaks for predicting MS1 precursor chemical formulas with data-driven learning.


Subjects
Neural Networks, Computer , Tandem Mass Spectrometry , Tandem Mass Spectrometry/methods , Databases, Factual
8.
J Chem Inf Model ; 2024 Jul 01.
Article in English | MEDLINE | ID: mdl-38950894

ABSTRACT

Information extraction from chemistry literature is vital for constructing up-to-date reaction databases for data-driven chemistry. Complete extraction requires combining information across text, tables, and figures, whereas prior work has mainly investigated extracting reactions from single modalities. In this paper, we present OpenChemIE to address this complex challenge and enable the extraction of reaction data at the document level. OpenChemIE approaches the problem in two steps: extracting relevant information from individual modalities and then integrating the results to obtain a final list of reactions. For the first step, we employ specialized neural models that each address a specific task for chemistry information extraction, such as parsing molecules or reactions from text or figures. We then integrate the information from these modules using chemistry-informed algorithms, allowing for the extraction of fine-grained reaction data from reaction condition and substrate scope investigations. Our machine learning models attain state-of-the-art performance when evaluated individually, and we meticulously annotate a challenging dataset of reaction schemes with R-groups to evaluate our pipeline as a whole, achieving an F1 score of 69.5%. Additionally, the reaction extraction results of OpenChemIE attain an accuracy score of 64.3% when directly compared against the Reaxys chemical database. OpenChemIE is most suited for information extraction on organic chemistry literature, where molecules are generally depicted as planar graphs or written in text and can be consolidated into a SMILES format. We provide OpenChemIE freely to the public as an open-source package, as well as through a web interface.

10.
Chemistry ; 29(28): e202300387, 2023 May 16.
Article in English | MEDLINE | ID: mdl-36787246

ABSTRACT

Bioorthogonal click chemistry has become an indispensable part of the biochemist's toolbox. Despite the wide variety of applications that have been developed in recent years, only a limited number of bioorthogonal click reactions have been discovered so far, most of them based on (substituted) azides. In this work, we present a computational workflow to discover new candidate reactions with promising kinetic and thermodynamic properties for bioorthogonal click applications. Sampling only around 0.05 % of an overall search space of over 10,000,000 dipolar cycloadditions, we develop a machine learning model able to predict DFT-computed activation and reaction energies within ∼2-3 kcal/mol across the entire space. Applying this model to screen the full search space through iterative rounds of learning, we identify a broad pool of candidate reactions with rich structural diversity, which can be used as a starting point or source of inspiration for future experimental development of both azide-based and non-azide-based bioorthogonal click reactions.

11.
Chemistry ; 29(60): e202301957, 2023 Oct 26.
Article in English | MEDLINE | ID: mdl-37526059

ABSTRACT

Molecular quantum mechanical modeling, accelerated by machine learning, has opened the door to high-throughput in silico screening campaigns of complex properties, such as the activation energies of chemical reactions and absorption/emission spectra of materials and molecules. Here, we present an overview of the main principles, concepts, and design considerations involved in such hybrid computational quantum chemistry/machine learning screening workflows, with a special emphasis on some recent examples of their successful application. We end with a brief outlook of further advances that will benefit the field.

12.
PLoS Comput Biol ; 18(2): e1009853, 2022 Feb.
Article in English | MEDLINE | ID: mdl-35143485

ABSTRACT

Biocatalysis is a promising approach to sustainably synthesize pharmaceuticals, complex natural products, and commodity chemicals at scale. However, the adoption of biocatalysis is limited by our ability to select enzymes that will catalyze their natural chemical transformation on non-natural substrates. While machine learning and in silico directed evolution are well-posed for this predictive modeling challenge, efforts to date have primarily aimed to increase activity against a single known substrate, rather than to identify enzymes capable of acting on new substrates of interest. To address this need, we curate 6 different high-quality enzyme family screens from the literature that each measure multiple enzymes against multiple substrates. We compare machine learning-based compound-protein interaction (CPI) modeling approaches from the literature used for predicting drug-target interactions. Surprisingly, comparing these interaction-based models against collections of independent (single task) enzyme-only or substrate-only models reveals that current CPI approaches are incapable of learning interactions between compounds and proteins in the current family level data regime. We further validate this observation by demonstrating that our no-interaction baseline can outperform CPI-based models from the literature used to guide the discovery of kinase inhibitors. Given the high performance of non-interaction based models, we introduce a new structure-based strategy for pooling residue representations across a protein sequence. Altogether, this work motivates a principled path forward in order to build and evaluate meaningful predictive models for biocatalysis and other drug discovery applications.


Subjects
Machine Learning , Proteins , Amino Acid Sequence , Drug Discovery , Proteins/chemistry , Substrate Specificity
13.
J Chem Inf Model ; 63(14): 4253-4265, 2023 Jul 24.
Article in English | MEDLINE | ID: mdl-37405398

ABSTRACT

The past decade has seen a number of impressive developments in predictive chemistry and reaction informatics driven by machine learning applications to computer-aided synthesis planning. While many of these developments have been made even with relatively small, bespoke data sets, in order to advance the role of AI in the field at scale, there must be significant improvements in the reporting of reaction data. Currently, the majority of publicly available data is reported in an unstructured format and heavily imbalanced toward high-yielding reactions, which influences the types of models that can be successfully trained. In this Perspective, we analyze several data curation and sharing initiatives that have seen success in chemistry and molecular biology. We discuss several factors that have contributed to their success and how we can take lessons from these case studies and apply them to reaction data. Finally, we spotlight the Open Reaction Database and summarize key actions the community can take toward making reaction data more findable, accessible, interoperable, and reusable (FAIR), including the use of mandates from funding agencies and publishers.


Subjects
Data Curation , Informatics , Databases, Factual , Information Dissemination
14.
J Chem Inf Model ; 63(13): 4030-4041, 2023 Jul 10.
Article in English | MEDLINE | ID: mdl-37368970

ABSTRACT

Reaction diagram parsing is the task of extracting reaction schemes from a diagram in the chemistry literature. The reaction diagrams can be arbitrarily complex; thus, robustly parsing them into structured data is an open challenge. In this paper, we present RxnScribe, a machine learning model for parsing reaction diagrams of varying styles. We formulate this structured prediction task with a sequence generation approach, which condenses the traditional pipeline into an end-to-end model. We train RxnScribe on a dataset of 1378 diagrams and evaluate it with cross validation, achieving an 80.0% soft match F1 score, with significant improvements over previous models. Our code and data are publicly available at https://github.com/thomas0809/RxnScribe.


Subjects
Machine Learning
15.
J Chem Inf Model ; 63(7): 1925-1934, 2023 Apr 10.
Article in English | MEDLINE | ID: mdl-36971363

ABSTRACT

Molecular structure recognition is the task of translating a molecular image into its graph structure. Significant variation in drawing styles and conventions exhibited in chemical literature poses a significant challenge for automating this task. In this paper, we propose MolScribe, a novel image-to-graph generation model that explicitly predicts atoms and bonds, along with their geometric layouts, to construct the molecular structure. Our model flexibly incorporates symbolic chemistry constraints to recognize chirality and expand abbreviated structures. We further develop data augmentation strategies to enhance the model robustness against domain shifts. In experiments on both synthetic and realistic molecular images, MolScribe significantly outperforms previous models, achieving 76-93% accuracy on public benchmarks. Chemists can also easily verify MolScribe's prediction, informed by its confidence estimation and atom-level alignment with the input image. MolScribe is publicly available through Python and web interfaces: https://github.com/thomas0809/MolScribe.


Subjects
Benchmarking , Molecular Structure
16.
J Chem Inf Model ; 62(15): 3503-3513, 2022 Aug 08.
Article in English | MEDLINE | ID: mdl-35881916

ABSTRACT

Synthesis planning and reaction outcome prediction are two fundamental problems in computer-aided organic chemistry for which a variety of data-driven approaches have emerged. Natural language approaches that model each problem as a SMILES-to-SMILES translation lead to a simple end-to-end formulation, reduce the need for data preprocessing, and enable the use of well-optimized machine translation model architectures. However, SMILES representations are not efficient for capturing information about molecular structures, as evidenced by the success of SMILES augmentation to boost empirical performance. Here, we describe a novel Graph2SMILES model that combines the power of Transformer models for text generation with the permutation invariance of molecular graph encoders that mitigates the need for input data augmentation. In our encoder, a directed message passing neural network (D-MPNN) captures local chemical environments, and the global attention encoder allows for long-range and intermolecular interactions, enhanced by graph-aware positional embedding. As an end-to-end architecture, Graph2SMILES can be used as a drop-in replacement for the Transformer in any task involving molecule(s)-to-molecule(s) transformations, which we empirically demonstrate leads to improved performance on existing benchmarks for both retrosynthesis and reaction outcome prediction.
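
The permutation invariance that this abstract credits to graph encoders can be illustrated with a toy sketch. It is not the Graph2SMILES encoder or a D-MPNN: the descriptor below is a deliberately simple, hypothetical stand-in (a multiset of element and degree pairs), and the two atom orderings correspond to the ethanol SMILES strings `CCO` and `OCC`.

```python
from collections import Counter


def invariant_descriptor(elements, bonds):
    """Multiset of (element, degree) pairs; unchanged by atom renumbering."""
    degree = Counter()
    for u, v in bonds:
        degree[u] += 1
        degree[v] += 1
    return Counter((elements[i], degree[i]) for i in range(len(elements)))


# Heavy atoms of ethanol under two atom orderings (as in "CCO" and "OCC").
elements_a, bonds_a = ["C", "C", "O"], [(0, 1), (1, 2)]
elements_b, bonds_b = ["O", "C", "C"], [(0, 1), (1, 2)]

descriptor_a = invariant_descriptor(elements_a, bonds_a)
descriptor_b = invariant_descriptor(elements_b, bonds_b)
same_molecule = descriptor_a == descriptor_b
```

A sequence model sees `CCO` and `OCC` as different inputs unless it is trained on augmented SMILES, whereas any function of the graph that ignores atom numbering gives both orderings the same representation.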


Subjects
Neural Networks, Computer , Molecular Structure
17.
J Chem Inf Model ; 62(10): 2316-2331, 2022 May 23.
Article in English | MEDLINE | ID: mdl-35535861

ABSTRACT

DNA-encoded library (DEL) screening and quantitative structure-activity relationship (QSAR) modeling are two techniques used in drug discovery to find novel small molecules that bind a protein target. Applying QSAR modeling to DEL selection data can facilitate the selection of compounds for off-DNA synthesis and evaluation. Such a combined approach has been done recently by training binary classifiers to learn DEL enrichments of aggregated "disynthons" in order to accommodate the sparse and noisy nature of DEL data. However, a binary classification model cannot distinguish between different levels of enrichment, and information is potentially lost during disynthon aggregation. Here, we demonstrate a regression approach to learning DEL enrichments of individual molecules, using a custom negative-log-likelihood loss function that effectively denoises DEL data and introduces opportunities for visualization of learned structure-activity relationships. Our approach explicitly models the Poisson statistics of the sequencing process used in the DEL experimental workflow under a frequentist view. We illustrate this approach on a DEL dataset of 108,528 compounds screened against carbonic anhydrase (CAIX), and a dataset of 5,655,000 compounds screened against soluble epoxide hydrolase (sEH) and SIRT2. Due to the treatment of uncertainty in the data through the negative-log-likelihood loss used during training, the models can ignore low-confidence outliers. While our approach does not demonstrate a benefit for extrapolation to novel structures, we expect our denoising and visualization pipeline to be useful in identifying structure-activity trends and highly enriched pharmacophores in DEL data. Further, this approach to uncertainty-aware regression modeling is applicable to other sparse or noisy datasets where the nature of stochasticity is known or can be modeled; in particular, the Poisson enrichment ratio metric we use can apply to other settings that compare sequencing count data between two experimental conditions.
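
As a rough illustration of why a Poisson negative-log-likelihood loss can discount low-confidence observations, the sketch below uses the textbook Poisson likelihood. The abstract does not give the authors' exact loss, so this is an assumption-laden stand-in, and the function names are hypothetical.

```python
import math


def poisson_nll(count, rate):
    """Negative log-likelihood of `count` under a Poisson with mean `rate`."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    # -log P(count | rate) = rate - count * log(rate) + log(count!)
    return rate - count * math.log(rate) + math.lgamma(count + 1)


def penalty_for_doubling(count):
    """Extra loss incurred when the predicted rate is twice the observed count."""
    return poisson_nll(count, 2 * count) - poisson_nll(count, count)


low_count_penalty = penalty_for_doubling(2)      # sparse, noisy observation
high_count_penalty = penalty_for_doubling(200)   # well-sampled observation
```

The penalty for the same relative error works out to `count * (1 - ln 2)`, so a compound observed with only a few sequencing reads constrains the model far less than one observed with hundreds.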


Subjects
DNA , Small Molecule Libraries , DNA/chemistry , Drug Discovery/methods , Machine Learning , Small Molecule Libraries/chemistry , Small Molecule Libraries/pharmacology , Uncertainty
18.
J Chem Inf Model ; 62(9): 2035-2045, 2022 May 09.
Article in English | MEDLINE | ID: mdl-34115937

ABSTRACT

Access to structured chemical reaction data is of key importance for chemists in performing bench experiments and in modern applications like computer-aided drug design. Existing reaction databases are generally populated by human curators through manual abstraction from published literature (e.g., patents and journals), which is time consuming and labor intensive, especially with the exponential growth of chemical literature in recent years. In this study, we focus on developing automated methods for extracting reactions from chemical literature. We consider journal publications as the target source of information, which are more comprehensive and better represent the latest developments in chemistry compared to patents; however, they are less formulaic in their descriptions of reactions. To implement the reaction extraction system, we first devised a chemical reaction schema, primarily including a central product, and a set of associated reaction roles such as reactants, catalyst, solvent, and so on. We formulate the task as a structure prediction problem and solve it with a two-stage deep learning framework consisting of product extraction and reaction role labeling. Both models are built upon Transformer-based encoders, which are adaptively pretrained using domain and task-relevant unlabeled data. Our models are shown to be both effective and data efficient, achieving an F1 score of 76.2% in product extraction and 78.7% in role extraction, with only hundreds of annotated reactions.
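
For readers unfamiliar with the F1 scores quoted in this abstract, the sketch below computes F1 as the harmonic mean of precision and recall over sets of extracted items. It assumes exact-match comparison; the abstract does not describe the authors' matching criteria, and the example labels are hypothetical.

```python
def f1_score(predicted, gold):
    """F1 between a set of predicted items and a set of gold-standard items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical extraction result: 2 of 3 predictions are correct,
# and 2 of the 4 gold-standard products were found.
predicted_products = {"a", "b", "c"}
gold_products = {"b", "c", "d", "e"}
score = f1_score(predicted_products, gold_products)
```

Here precision is 2/3 and recall is 1/2, so F1 is 4/7, which sits below the arithmetic mean of the two because the harmonic mean penalizes imbalance.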


Subjects
Databases, Factual , Humans
19.
J Chem Inf Model ; 62(19): 4660-4671, 2022 Oct 10.
Article in English | MEDLINE | ID: mdl-36112568

ABSTRACT

In molecular discovery and drug design, structure-property relationships and activity landscapes are often qualitatively or quantitatively analyzed to guide the navigation of chemical space. The roughness (or smoothness) of these molecular property landscapes is one of their most studied geometric attributes, as it can characterize the presence of activity cliffs, with rougher landscapes generally expected to pose tougher optimization challenges. Here, we introduce a general, quantitative measure for describing the roughness of molecular property landscapes. The proposed roughness index (ROGI) is loosely inspired by the concept of fractal dimension and strongly correlates with the out-of-sample error achieved by machine learning models on numerous regression tasks.


Subjects
Drug Design , Machine Learning
20.
J Chem Inf Model ; 62(16): 3854-3862, 2022 Aug 22.
Article in English | MEDLINE | ID: mdl-35938299

ABSTRACT

High-throughput virtual screening is an indispensable technique utilized in the discovery of small molecules. In cases where the library of molecules is exceedingly large, the cost of an exhaustive virtual screen may be prohibitive. Model-guided optimization has been employed to lower these costs through dramatic increases in sample efficiency compared to random selection. However, these techniques introduce new costs to the workflow through the surrogate model training and inference steps. In this study, we propose an extension to the framework of model-guided optimization that mitigates inference costs using a technique we refer to as design space pruning (DSP), which irreversibly removes poor-performing candidates from consideration. We study the application of DSP to a variety of optimization tasks and observe significant reductions in overhead costs while exhibiting similar performance to the baseline optimization. DSP represents an attractive extension of model-guided optimization that can limit overhead costs in optimization settings where these costs are non-negligible relative to objective costs, such as docking.
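
The idea of pruning the design space to cut inference cost can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' algorithm: the surrogate model is a hypothetical scoring function, the pruning rule (keep a fixed top fraction each round) is assumed, and all names are invented.

```python
def prune_design_space(candidates, predict, keep_fraction):
    """Irreversibly drop the candidates with the worst predicted scores."""
    ranked = sorted(candidates, key=predict, reverse=True)
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:n_keep]


def count_inference_calls(pool_size, rounds, keep_fraction):
    """Surrogate-model evaluations needed when the pool shrinks every round."""
    calls = 0
    for _ in range(rounds):
        calls += pool_size  # score every candidate still in the pool
        pool_size = max(1, int(pool_size * keep_fraction))
    return calls


# Hypothetical surrogate: predicted score peaks at candidate 7.
kept = prune_design_space(range(10), lambda x: -(x - 7) ** 2, keep_fraction=0.5)

calls_without_pruning = count_inference_calls(1000, rounds=3, keep_fraction=1.0)
calls_with_pruning = count_inference_calls(1000, rounds=3, keep_fraction=0.5)
```

In this toy setting, halving the pool each round reduces surrogate evaluations from 3000 to 1750 over three rounds. The trade-off is that pruned candidates can never be revisited, so an inaccurate early surrogate may discard good molecules.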


Subjects
High-Throughput Screening Assays , Workflow