ABSTRACT
We have analyzed 40 different databases ranging in size from a few thousand to nearly 100 million molecules, comprising a total of over 210 million structures, for their tautomeric conflicts. A tautomeric conflict is defined as an occurrence of two or more structures within a data set identified by the tautomeric rules applied as being tautomers of each other. We tested a total of 119 detailed tautomeric transform rules expressed as SMIRKS, out of which 79 yielded at least one conflict. These transformations include three types of tautomerism: prototropic, ring-chain, and valence tautomerism. The databases analyzed spanned a wide variety of types including large aggregating databases, drug collections, and structure collections based on experimental data. All databases analyzed showed intra-database tautomeric conflicts. The conflict rates as percentage of the database were typically in the few tenths of a percent range, which for the largest databases amounts to >100,000 cases per database.
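The conflict definition above can be illustrated with a minimal Python sketch. It assumes each structure has already been assigned a tautomer-invariant key by some rule set. The key values, the structure identifiers, and both function names are hypothetical, and the abstract does not say exactly how the reported percentages were counted, so the rate below is only one plausible reading.

```python
from collections import defaultdict


def find_tautomeric_conflicts(records):
    """Return {tautomer_key: [structure ids]} for keys shared by 2+ structures.

    `records` is an iterable of (structure_id, tautomer_key) pairs. The key is
    assumed to be identical for all structures that the applied rules treat
    as tautomers of each other.
    """
    groups = defaultdict(set)
    for structure_id, tautomer_key in records:
        groups[tautomer_key].add(structure_id)
    return {key: sorted(ids) for key, ids in groups.items() if len(ids) >= 2}


def conflict_rate(records):
    """Percentage of distinct structures involved in at least one conflict."""
    records = list(records)
    all_ids = {structure_id for structure_id, _ in records}
    if not all_ids:
        return 0.0
    conflicts = find_tautomeric_conflicts(records)
    involved = {sid for ids in conflicts.values() for sid in ids}
    return 100.0 * len(involved) / len(all_ids)


# Toy data: two entries share a key and therefore form one conflict.
example = [("A1", "k1"), ("A2", "k1"), ("B1", "k2"), ("C1", "k3")]
print(find_tautomeric_conflicts(example))
print(conflict_rate(example))
```

Grouping by a precomputed key keeps the search linear in the number of structures, which matters at the database sizes quoted above. How the key is generated is the hard part and is outside this sketch.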
Subjects
Databases, Chemical; Small Molecule Libraries; Small Molecule Libraries/chemistry; Isomerism
ABSTRACT
Although the size of virtual libraries of synthesizable compounds is growing rapidly, we are still enumerating only tiny fractions of the drug-like chemical universe. Our capability to mine these newly generated libraries also lags their growth. That is why fragment-based approaches that utilize on-demand virtual combinatorial libraries are gaining popularity in drug discovery. These à la carte libraries utilize synthetic blocks found to be effective binders in parts of target protein pockets and a variety of reliable chemistries to connect them. There is, however, no data on the potential impact of the chemistries used for making on-demand libraries on the hit rates during virtual screening. There are also no rules to guide in the selection of these synthetic methods for production of custom libraries. We have used the SAVI (Synthetically Accessible Virtual Inventory) library, constructed using 53 reliable reaction types (transforms), to evaluate the impact of these chemistries on docking hit rates for 40 well-characterized protein pockets. The data shows that the virtual hit rates differ significantly for different chemistries with cross coupling reactions such as Sonogashira, Suzuki-Miyaura, Hiyama and Liebeskind-Srogl coupling producing the highest hit rates. Virtual hit rates appear to depend not only on the property of the formed chemical bond but also on the diversity of available building blocks and the scope of the reaction. The data identifies reactions that deserve wider use through increasing the number of corresponding building blocks and suggests the reactions that are more effective for pockets with certain physical and hydrogen bond-forming properties.
Subjects
Molecular Docking Simulation; Protein Binding; Proteins; Small Molecule Libraries; Small Molecule Libraries/chemistry; Small Molecule Libraries/pharmacology; Proteins/chemistry; Proteins/metabolism; Binding Sites; Drug Discovery/methods; Ligands; Drug Design; Humans
ABSTRACT
Germline antibodies, the initial set of antibodies produced by the immune system, are critical for host defense, and information about their binding properties can be useful for designing vaccines, understanding the origins of autoantibodies, and developing monoclonal antibodies. Numerous studies have found that germline antibodies are polyreactive with malleable, flexible binding pockets. While insightful, it remains unclear how broadly this model applies, as there are many families of antibodies that have not yet been studied. In addition, the methods used to obtain germline antibodies typically rely on assumptions and do not work well for many antibodies. Herein, we present a distinct approach for isolating germline antibodies that involves immunizing activation-induced cytidine deaminase (AID) knockout mice. This strategy amplifies antigen-specific B cells, but somatic hypermutation does not occur because AID is absent. Using synthetic haptens, glycoproteins, and whole cells, we obtained germline antibodies to an assortment of clinically important tumor-associated carbohydrate antigens, including Lewis Y, the Tn antigen, sialyl Lewis C, and Lewis X (CD15/SSEA-1). Through glycan microarray profiling and cell binding, we demonstrate that all but one of these germline antibodies had high selectivity for their glycan targets. Using molecular dynamics simulations, we provide insights into the structural basis of glycan recognition. The results have important implications for designing carbohydrate-based vaccines, developing anti-glycan monoclonal antibodies, and understanding antibody evolution within the immune system.
Subjects
Antibodies, Monoclonal; Antigens, Tumor-Associated, Carbohydrate; Animals; Antibodies, Monoclonal/chemistry; Biomarkers, Tumor; Carbohydrates; Germ Cells; Mice; Mice, Knockout; Polysaccharides/chemistry
ABSTRACT
Designing new medicines more cheaply and quickly is tightly linked to the quest of exploring chemical space more widely and efficiently. Chemical space is monumentally large, but recent advances in computer software and hardware have enabled researchers to navigate virtual chemical spaces containing billions of chemical structures. This review specifically concerns collections of many millions or even billions of enumerated chemical structures as well as even larger chemical spaces that are not fully enumerated. We present examples of chemical libraries and spaces and the means used to construct them, and we discuss new technologies for searching huge libraries and for searching combinatorially in chemical space. We also cover space navigation techniques and consider new approaches to de novo drug design and the impact of the "autonomous laboratory" on synthesis of designed compounds. Finally, we summarize some other challenges and opportunities for the future.
Subjects
Drug Discovery; Small Molecule Libraries; Drug Design; Drug Discovery/methods; Small Molecule Libraries/chemistry; Small Molecule Libraries/pharmacology
ABSTRACT
MOTIVATION: Identification of new molecules promising for treatment of HIV infection and HIV-associated disorders remains an important task in order to provide safer and more effective therapies. Utilization of prior knowledge by application of computer-aided drug discovery approaches reduces time and financial expenses and increases the chances of positive results in anti-HIV R&D. To provide the scientific community with a tool for assessing potential agents for treatment of HIV infection and its comorbidities, we have created a freely available web resource for prediction of relevant biological activities based on the structural formulae of drug-like molecules. RESULTS: Over 50,000 experimental records for antiretroviral agents from the ChEMBL database were extracted for creating the training sets. After careful examination, about seven thousand molecules inhibiting five HIV-1 proteins were used to develop regression and classification models with the GUSAR software. The average values of R² = 0.95 and Q² = 0.72 in the validation procedure demonstrated the reasonable accuracy and predictivity of the obtained (Q)SAR models. Prediction of 81 biological activities associated with the treatment of HIV-associated comorbidities with 92% mean accuracy was realized using the PASS program. AVAILABILITY AND IMPLEMENTATION: Freely available on the web at http://www.way2drug.com/hiv/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
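For readers unfamiliar with the two statistics quoted above, the following Python sketch shows the usual coefficient-of-determination formula. The abstract does not state how its validation was set up, so this is an assumption: in common (Q)SAR usage R² is computed from fitted values on the training set, while Q² applies the same formula to predictions for compounds held out of fitting (for example by cross-validation). The function name and the numbers are illustrative only.

```python
def coefficient_of_determination(observed, predicted):
    """Return 1 - SS_res / SS_tot for paired observed and predicted values.

    With fitted values on the training set this is usually reported as R^2;
    with predictions for held-out compounds it is usually reported as Q^2.
    """
    if len(observed) != len(predicted) or not observed:
        raise ValueError("need two non-empty sequences of equal length")
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Illustrative values only, not data from the study.
obs = [1.0, 2.0, 3.0, 4.0]
fit = [1.1, 1.9, 3.2, 3.8]
print(round(coefficient_of_determination(obs, fit), 3))
```

A Q² well below R², as in the figures reported above, is the typical pattern when a model fits its training data more closely than it predicts unseen compounds.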
Subjects
HIV Infections; HIV; Prednisolone; Software; Viral Proteins; Computer Simulation; HIV/genetics; HIV Infections/drug therapy; Prednisolone/analogs & derivatives; Proteins; Structure-Activity Relationship
ABSTRACT
Computational methods to predict molecular properties regarding safety and toxicology represent alternative approaches to expedite drug development, screen environmental chemicals, and thus significantly reduce associated time and costs. There is a strong need and interest in the development of computational methods that yield reliable predictions of toxicity, and many approaches, including the recently introduced deep neural networks, have been leveraged towards this goal. Herein, we report on the collection, curation, and integration of data from the public data sets that were the source of the ChemIDplus database for systemic acute toxicity. These efforts generated the largest publicly available such data set comprising > 80,000 compounds measured against a total of 59 acute systemic toxicity end points. This data was used for developing multiple single- and multitask models utilizing random forest, deep neural networks, convolutional, and graph convolutional neural network approaches. For the first time, we also reported the consensus models based on different multitask approaches. To the best of our knowledge, prediction models for 36 of the 59 end points have never been published before. Furthermore, our results demonstrated a significantly better performance of the consensus model obtained from three multitask learning approaches that particularly predicted the 29 smaller tasks (less than 300 compounds) better than other models developed in the study. The curated data set and the developed models have been made publicly available at https://github.com/ncats/ld50-multitask, https://predictor.ncats.io/, and https://cactus.nci.nih.gov/download/acute-toxicity-db (data set only) to support regulatory and research applications.
Subjects
Deep Learning; Consensus; Databases, Factual; Neural Networks, Computer
ABSTRACT
We report a database of tautomeric structures that contains 2819 tautomeric tuples extracted from 171 publications. Each tautomeric entry has been annotated with experimental conditions reported in the respective publication, plus bibliographic details, structural identifiers (e.g., NCI/CADD identifiers FICTS, FICuS, uuuuu, and Standard InChI), and chemical information (e.g., SMILES, molecular weight). The majority of tautomeric tuples found were pairs; the remaining 10% were triples, quadruples, or quintuples, amounting to a total number of structures of 5977. The types of tautomerism were mainly prototropic tautomerism (79%), followed by ring-chain (13%) and valence tautomerism (8%). The experimental conditions reported in the publications included about 50 pure solvents and 9 solvent mixtures with 26 unique spectroscopic or nonspectroscopic methods. 1H and 13C NMR were the most frequently used methods. A total of 77 different tautomeric transform rules (SMIRKS) are covered by at least one example tuple in the database. This database is freely available as a spreadsheet at https://cactus.nci.nih.gov/download/tautomer/.
Subjects
Isomerism; Databases, Factual; Magnetic Resonance Spectroscopy
ABSTRACT
We have adopted and extended the CHMTRN language and used it for the knowledge base of a computer program to generate a large database of synthetically accessible, drug-like chemical structures, the Synthetically Accessible Virtual Inventory (SAVI) Database. CHMTRN is a powerful language originally developed in the LHASA (Logic and Heuristics Applied to Synthetic Analysis) project at Harvard University and used together with the chemical pattern description language, PATRAN, to describe chemical retro-reactions. The languages have proven to be useful beyond the design of retrosynthetic routes and have the potential for much wider use in chemistry; this paper describes CHMTRN and PATRAN as now reimplemented for the forward-synthetic SAVI project but able to describe both forward and retro-reactions.
Subjects
Combinatorial Chemistry Techniques; Software; Databases, Factual; Humans
ABSTRACT
We have collected 86 different transforms of tautomeric interconversions. Out of those, 54 are for prototropic (non-ring-chain) tautomerism, 21 for ring-chain tautomerism, and 11 for valence tautomerism. The majority of these rules have been extracted from experimental literature. Twenty rules, covering the most well-known types of tautomerism such as keto-enol tautomerism, were taken from the default handling of tautomerism by the chemoinformatics toolkit CACTVS. The rules were analyzed against nine different databases totaling over 400 million (non-unique) structures as to their occurrence rates, mutual overlap in coverage, and recapitulation of the rules' enumerated tautomer sets by InChI V.1.05, both in InChI's Standard and a Nonstandard version with the increased tautomer-handling options 15T and KET turned on. These results and the background of this study are discussed in the context of the IUPAC InChI Project tasked with the redesign of handling of tautomerism for an InChI version 2. Applying the rules presented in this paper would approximately triple the number of compounds in typical small-molecule databases that would be affected by tautomeric interconversion by InChI V2. A web tool has been created to test these rules at https://cactus.nci.nih.gov/tautomerizer.
Subjects
Cheminformatics; Databases, Factual
ABSTRACT
Due to its antiangiogenic and anti-immunomodulatory activity, thalidomide continues to be of clinical interest despite its teratogenic actions, and efforts to synthesize safer, clinically active thalidomide analogs are continually underway. In this study, a cohort of 27 chemically diverse thalidomide analogs was evaluated for antiangiogenic activity in an ex vivo rat aorta ring assay. The protein cereblon has been identified as the target for thalidomide, and in silico pharmacophore analysis and molecular docking with a crystal structure of human cereblon were used to investigate the cereblon binding abilities of the thalidomide analogs. The results suggest that not all antiangiogenic thalidomide analogs can bind cereblon, and multiple targets and mechanisms of action may be involved.
Subjects
Adaptor Proteins, Signal Transducing/metabolism; Angiogenesis Inhibitors/pharmacology; Aorta/drug effects; Molecular Docking Simulation; Neovascularization, Physiologic/drug effects; Thalidomide/analogs & derivatives; Thalidomide/pharmacology; Ubiquitin-Protein Ligases/metabolism; Angiogenesis Inhibitors/chemistry; Animals; Computer Simulation; Humans; Male; Rats; Rats, Sprague-Dawley
ABSTRACT
A lot of high quality data on the biological activity of chemical compounds are required throughout the whole drug discovery process: from development of computational models of the structure-activity relationship to experimental testing of lead compounds and their validation in clinics. Currently, a large amount of such data is available from databases, scientific publications, and patents. Biological data are characterized by incompleteness, uncertainty, and low reproducibility. Despite the existence of free and commercially available databases of biological activities of compounds, they usually lack unambiguous information about peculiarities of biological assays. On the other hand, scientific papers are the primary source of new data disclosed to the scientific community for the first time. In this study, we have developed and validated a data-mining approach for extraction of text fragments containing description of bioassays. We have used this approach to evaluate compounds and their biological activity reported in scientific publications. We have found that categorization of papers into relevant and irrelevant may be performed based on the machine-learning analysis of the abstracts. Text fragments extracted from the full texts of publications allow their further partitioning into several classes according to the peculiarities of bioassays. We demonstrate the applicability of our approach to the comparison of the endpoint values of biological activity and cytotoxicity of reference compounds.
Subjects
Data Mining/methods; Drug Discovery/methods; Databases, Factual; HIV Infections/drug therapy; HIV Reverse Transcriptase/antagonists & inhibitors; HIV-1/drug effects; HIV-1/enzymology; Humans; PubMed; Reverse Transcriptase Inhibitors/pharmacology
ABSTRACT
Despite the achievements of antiretroviral therapy, discovery of new anti-HIV medicines remains an essential task because the existing drugs do not provide a complete cure for the infected patients, exhibit severe adverse effects, and lead to the appearance of resistant strains. To predict the interaction of drug-like compounds with multiple targets for HIV treatment, ligand-based drug design approach is widely applied. In this study, we evaluated the possibilities and limitations of (Q)SAR analysis aimed at the discovery of novel antiretroviral agents inhibiting the vital HIV enzymes. Local (Q)SAR models are based on the analysis of structure-activity relationships for molecules from the same chemical class, which significantly restrict their applicability domain. In contrast, global (Q)SAR models exploit data from heterogeneous sets of drug-like compounds, which allows their application to databases containing diverse structures. We compared the information for HIV-1 integrase, protease and reverse transcriptase inhibitors available in the EBI ChEMBL, NIAID HIV/OI/TB Therapeutics, and Clarivate Analytics Integrity databases as the sources for (Q)SAR training sets. Using the PASS and GUSAR software, we developed and validated a variety of (Q)SAR models, which can be further used for virtual screening of new antiretrovirals in the SAVI library. The developed models are implemented in the freely available web resource AntiHIV-Pred.
Subjects
Anti-HIV Agents/pharmacology; HIV-1/metabolism; Quantitative Structure-Activity Relationship; Viral Proteins/antagonists & inhibitors; Anti-HIV Agents/chemistry; Databases as Topic; HIV-1/drug effects; Humans; Inhibitory Concentration 50; Regression Analysis; Reproducibility of Results; Viral Proteins/metabolism
ABSTRACT
In this review, we address a fundamental question: What is the range of conformational energies seen in ligands in protein-ligand crystal structures? This value is important biophysically, for better understanding the protein-ligand binding process; and practically, for providing a parameter to be used in many computational drug design methods such as docking and pharmacophore searches. We synthesize a selection of previously reported conflicting results from computational studies of this issue and conclude that high ligand conformational energies really are present in some crystal structures. The main source of disagreement between different analyses appears to be due to divergent treatments of electrostatics and solvation. At the same time, however, for many ligands, a high conformational energy is in error, due to either crystal structure inaccuracies or incorrect determination of the reference state. Aside from simple chemistry mistakes, we argue that crystal structure error may mainly be because of the heuristic weighting of ligand stereochemical restraints relative to the fit of the structure to the electron density. This problem cannot be fixed with improvements to electron density fitting or with simple ligand geometry checks, though better metrics are needed for evaluating ligand and binding site chemistry in addition to geometry during structure refinement. The ultimate solution for accurately determining ligand conformational energies lies in ultrahigh-resolution crystal structures that can be refined without restraints.
Subjects
Protein Conformation; Proteins/chemistry; Thermodynamics; Animals; Binding Sites; Crystallography, X-Ray; Drug Design; Humans; Ligands; Molecular Docking Simulation; Protein Binding; Proteins/agonists; Proteins/antagonists & inhibitors; Solubility; Static Electricity
ABSTRACT
Severe adverse drug reactions (ADRs) are the fourth leading cause of fatality in the U.S. with more than 100,000 deaths per year. As up to 30% of all ADRs are believed to be caused by drug-drug interactions (DDIs), typically mediated by cytochrome P450s, possibilities to predict DDIs from existing knowledge are important. We collected data from public sources on 1485, 2628, 4371, and 27,966 possible DDIs mediated by four cytochrome P450 isoforms 1A2, 2C9, 2D6, and 3A4 for 55, 73, 94, and 237 drugs, respectively. For each of these data sets, we developed and validated QSAR models for the prediction of DDIs. As a unique feature of our approach, the interacting drug pairs were represented as binary chemical mixtures in a 1:1 ratio. We used two types of chemical descriptors: quantitative neighborhoods of atoms (QNA) and simplex descriptors. Radial basis functions with self-consistent regression (RBF-SCR) and random forest (RF) were utilized to build QSAR models predicting the likelihood of DDIs for any pair of drug molecules. Our models showed balanced accuracy of 72-79% for the external test sets with a coverage of 81.36-100% when a conservative threshold for the model's applicability domain was applied. We generated virtually all possible binary combinations of marketed drugs and employed our models to identify drug pairs predicted to be instances of DDI. More than 4500 of these predicted DDIs that were not found in our training sets were confirmed by data from the DrugBank database.
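The evaluation quantities mentioned above (balanced accuracy, applicability-domain coverage, and exhaustive pairing of drugs) can be sketched in Python as follows. This is a generic illustration under stated assumptions, not the authors' pipeline: the function names, the 0/1 label encoding, and the boolean in-domain flags are all hypothetical.

```python
from itertools import combinations


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (1 = DDI, 0 = none)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


def coverage(in_domain_flags):
    """Percentage of test items that fall inside the applicability domain."""
    return 100.0 * sum(1 for flag in in_domain_flags if flag) / len(in_domain_flags)


def all_drug_pairs(drugs):
    """Every unordered pair of distinct drugs, in sorted order."""
    return list(combinations(sorted(set(drugs)), 2))


# Toy labels: 2 interacting pairs and 4 non-interacting pairs.
print(balanced_accuracy([1, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 1]))
print(coverage([True, True, False, True]))
print(all_drug_pairs(["warfarin", "fluconazole", "omeprazole"]))
```

Balanced accuracy is a natural choice here because interacting pairs are typically far rarer than non-interacting ones, so plain accuracy would reward a model that predicts "no interaction" everywhere.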
Subjects
Algorithms; Cytochrome P-450 Enzyme System/chemistry; Cytochrome P-450 Enzyme System/metabolism; Drug Interactions; Models, Molecular; Quantitative Structure-Activity Relationship; Databases, Factual; Drug-Related Side Effects and Adverse Reactions; Humans; Models, Biological
ABSTRACT
We investigated how many cases of the same chemical sold as different products (at possibly different prices) occurred in a prototypical large aggregated database and simultaneously tested the tautomerism definitions in the chemoinformatics toolkit CACTVS. We applied the standard CACTVS tautomeric transforms plus a set of recently developed ring-chain transforms to the Aldrich Market Select (AMS) database of 6 million screening samples and building blocks. In 30,000 cases, two or more AMS products were found to be just different tautomeric forms of the same compound. We purchased and analyzed 166 such tautomer pairs and triplets by 1H and 13C NMR to determine whether the CACTVS transforms accurately predicted what is the same "stuff in the bottle". Essentially all prototropic transforms with examples in the AMS were confirmed. Some of the ring-chain transforms were found to be too "aggressive", i.e., to equate structures with one another that were different compounds.
Subjects
Databases, Factual; Informatics/methods; Organic Chemicals/chemistry; Databases, Factual/economics; Isomerism
ABSTRACT
Warfarin, an important anticoagulant drug, can exist in solution in 40 distinct tautomeric forms through both prototropic tautomerism and ring-chain tautomerism. We have investigated all warfarin tautomers with computational and NMR approaches. Relative energies calculated at the B3LYP/6-311++G(d,p) level of theory indicate that the 4-hydroxycoumarin cyclic hemiketal tautomer is the most stable tautomer in aqueous solution, followed by the 4-hydroxycoumarin open-chain tautomer. This is in agreement with our NMR experiments where the spectral assignments indicate that warfarin exists mainly as a mixture of cyclic hemiketal diastereomers, with an open-chain tautomer as a minor component. We present a diagram of the interconversion of warfarin created taking into account the calculated equilibrium constants (pK(T)) for all tautomeric reactions. These findings help with gaining further understanding of proton transfer and ring closure tautomerization processes. We also discuss the results in the context of chemoinformatics rules for handling tautomerism.
Subjects
Anticoagulants/chemistry; Molecular Dynamics Simulation; Quantum Theory; Warfarin/chemistry; Magnetic Resonance Spectroscopy; Molecular Structure; Stereoisomerism
ABSTRACT
Large-scale databases are important sources of training sets for various QSAR modeling approaches. Generally, these databases contain information extracted from different sources. This variety of sources can produce inconsistency in the data, defined as sometimes widely diverging activity results for the same compound against the same target. Because such inconsistency can reduce the accuracy of predictive models built from these data, we are addressing the question of how best to use data from publicly and commercially accessible databases to create accurate and predictive QSAR models. We investigate the suitability of commercially and publicly available databases to QSAR modeling of antiviral activity (HIV-1 reverse transcriptase (RT) inhibition). We present several methods for the creation of modeling (i.e., training and test) sets from two, either commercially or freely available, databases: Thomson Reuters Integrity and ChEMBL. We found that the typical predictivities of QSAR models obtained using these different modeling set compilation methods differ significantly from each other. The best results were obtained using training sets compiled for compounds tested using only one method and material (i.e., a specific type of biological assay). Compound sets aggregated by target only typically yielded poorly predictive models. We discuss the possibility of "mix-and-matching" assay data across aggregating databases such as ChEMBL and Integrity and their current severe limitations for this purpose. One of them is the general lack of complete and semantic/computer-parsable descriptions of assay methodology carried by these databases that would allow one to determine mix-and-matchability of result sets at the assay level.
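The notion of inconsistency defined above (widely diverging results for the same compound against the same target) can be made concrete with a short Python sketch. It assumes activities are on a logarithmic scale such as pIC50 and uses an arbitrary spread threshold of one log unit. The record layout, the threshold, and the function name are illustrative assumptions, not details taken from the study.

```python
from collections import defaultdict


def find_inconsistent_measurements(records, max_spread=1.0):
    """Return {(compound, target): spread} where repeated values diverge.

    `records` holds (compound_id, target_id, activity) tuples. A pair is
    reported when it has at least two measurements whose max - min exceeds
    `max_spread` (same units as the activity values).
    """
    values = defaultdict(list)
    for compound_id, target_id, activity in records:
        values[(compound_id, target_id)].append(activity)
    flagged = {}
    for key, measurements in values.items():
        spread = max(measurements) - min(measurements)
        if len(measurements) >= 2 and spread > max_spread:
            flagged[key] = spread
    return flagged


# Toy pIC50-like values for a single target.
example = [
    ("cpd1", "RT", 6.0),
    ("cpd1", "RT", 8.5),
    ("cpd2", "RT", 7.0),
    ("cpd2", "RT", 7.25),
    ("cpd3", "RT", 5.0),
]
print(find_inconsistent_measurements(example))
```

A check like this only detects divergence. Deciding whether two results are comparable at all would additionally require the assay-level metadata whose absence the abstract points out.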
Subjects
Databases, Pharmaceutical; HIV Reverse Transcriptase/antagonists & inhibitors; HIV-1/enzymology; Models, Statistical; Quantitative Structure-Activity Relationship; Reverse Transcriptase Inhibitors/chemistry; Reverse Transcriptase Inhibitors/pharmacology; Algorithms; Drug Discovery; Drug Resistance, Viral; HIV-1/drug effects
ABSTRACT
A compound exhibits (prototropic) tautomerism if it can be represented by two or more structures that are related by a formal intramolecular movement of a hydrogen atom from one heavy atom position to another. When the movement of the proton is accompanied by the opening or closing of a ring it is called ring-chain tautomerism. This type of tautomerism is well observed in carbohydrates, but it also occurs in other molecules such as warfarin. In this work, we present an approach that allows for the generation of all ring-chain tautomers of a given chemical structure. Based on Baldwin's Rules estimating the likelihood of ring closure reactions to occur, we have defined a set of transform rules covering the majority of ring-chain tautomerism cases. The rules automatically detect substructures in a given compound that can undergo a ring-chain tautomeric transformation. Each transformation is encoded in SMIRKS line notation. All work was implemented in the chemoinformatics toolkit CACTVS. We report on the application of our ring-chain tautomerism rules to a large database of commercially available screening samples in order to identify ring-chain tautomers.
Subjects
Molecular Conformation; Cyclization; Databases, Chemical
ABSTRACT
Many of the structures in PubChem are annotated with activities determined in high-throughput screening (HTS) assays. Because of the nature of these assays, the activity data are typically strongly imbalanced, with a small number of active compounds contrasting with a very large number of inactive compounds. We have used several such imbalanced PubChem HTS assays to test and develop strategies to efficiently build robust QSAR models from imbalanced data sets. Different descriptor types [Quantitative Neighborhoods of Atoms (QNA) and "biological" descriptors] were used to generate a variety of QSAR models in the program GUSAR. The models obtained were compared using external test and validation sets. We also report on our efforts to incorporate the most predictive of our models in the publicly available NCI/CADD Group Web services (http://cactus.nci.nih.gov/chemical/apps/cap).