ABSTRACT
SUMMARY: Knowledge graphs are increasingly used in biomedical research to link large amounts of heterogeneous data and facilitate reasoning across diverse knowledge sources. Wider adoption and exploration of knowledge graphs in the biomedical research community is limited by two requirements: users must understand the underlying graph structure in terms of entity types and relationships, represented as nodes and edges, respectively, and they must learn specialized query languages for graph mining and exploration. We have developed a user-friendly interface dubbed ExEmPLAR (Extracting, Exploring, and Embedding Pathways Leading to Actionable Research) to aid reasoning over biomedical knowledge graphs and assist with data-driven research and hypothesis generation. We explain the key functionalities of ExEmPLAR and demonstrate its use with a case study considering the relationship of Trypanosoma cruzi, the etiological agent of Chagas disease, to frequently associated cardiovascular conditions. AVAILABILITY AND IMPLEMENTATION: ExEmPLAR is freely accessible at https://www.exemplar.mml.unc.edu/. For code and instructions for using the application, see: https://github.com/beasleyjonm/AOP-COP-Path-Extractor.
Subjects
Biomedical Research; Pattern Recognition, Automated
ABSTRACT
Heparan sulfate (HS), a sulfated polysaccharide abundant in the extracellular matrix, plays pivotal roles in various physiological and pathological processes by interacting with proteins. Investigating the binding selectivity of HS oligosaccharides to target proteins is essential, but the exhaustive inclusion of all possible oligosaccharides in microarray experiments is impractical. To address this challenge, we present a hybrid pipeline that integrates microarray and in silico techniques to design oligosaccharides with desired protein affinity. Using fibroblast growth factor 2 (FGF2) as a model protein, we assembled an in-house dataset of HS oligosaccharides on microarrays and developed two structural representations: a standard representation with all atoms explicit and a simplified representation with disaccharide units as "quasi-atoms." Predictive Quantitative Structure-Activity Relationship (QSAR) models for FGF2 affinity were developed using the Random Forest (RF) algorithm. The resulting models, considering the applicability domain, demonstrated high predictivity, with a correct classification rate of 0.81-0.80 and improved positive predictive values (PPV) up to 0.95. Virtual screening of 40 new oligosaccharides using the simplified model identified 15 computational hits, 11 of which were experimentally validated for high FGF2 affinity. This hybrid approach marks a significant step toward the targeted design of oligosaccharides with desired protein interactions, providing a foundation for broader applications in glycobiology.
Subjects
Heparitin Sulfate; Microarray Analysis; Models, Molecular; Humans; Fibroblast Growth Factor 2/chemistry; Fibroblast Growth Factor 2/metabolism; Heparitin Sulfate/chemistry; Heparitin Sulfate/metabolism; Oligosaccharides/chemistry; Oligosaccharides/metabolism; Protein Binding; Quantitative Structure-Activity Relationship
ABSTRACT
There have been significant advances in the flexibility and power of in vitro cell-free translation systems. The increasing ability to incorporate noncanonical amino acids and complement translation with recombinant enzymes has enabled cell-free production of peptide-based natural products (NPs) and NP-like molecules. We anticipate that many more such compounds and analogs might be accessed in this way. To assess the peptide NP space that is directly accessible to current cell-free technologies, we developed a peptide parsing algorithm that breaks down peptide NPs into building blocks based on ribosomal translation logic. Using the resultant data set, we broadly analyze the biophysical properties of these privileged compounds and perform a retrobiosynthetic analysis to predict which peptide NPs could be directly synthesized in augmented cell-free translation reactions. We then tested these predictions by preparing a library of highly modified peptide NPs. Two macrocyclases, PatG and PCY1, were used to effect the head-to-tail macrocyclization of candidate NPs. This retrobiosynthetic analysis identified a collection of high-priority building blocks that are enriched throughout peptide NPs, yet had not previously been tested in cell-free translation. To expand the cell-free toolbox into this space, we established, optimized, and characterized the flexizyme-enabled ribosomal incorporation of piperazic acids. Overall, these results demonstrate the feasibility of cell-free translation for peptide NP total synthesis while expanding the limits of the technology. This work provides a novel computational tool for exploration of peptide NP chemical space that could be expanded in the future to allow design of ribosomal biosynthetic pathways for NPs and NP-like molecules.
Subjects
Biological Products; Biological Products/chemistry; Cheminformatics; Peptides/chemistry; Peptide Biosynthesis; Amino Acids
ABSTRACT
Deep learning methods that predict protein-ligand binding have recently been used for structure-based virtual screening. Many such models have been trained using protein-ligand complexes with known crystal structures and activities from the PDBbind data set. However, because PDBbind only includes 20K complexes, models typically fail to generalize to new targets, and model performance is on par with models trained with only ligand information. Conversely, the ChEMBL database contains a wealth of chemical activity information but includes no information about binding poses. We introduce BigBind, a data set that maps ChEMBL activity data to proteins from the CrossDocked data set. BigBind comprises 583K ligand activities and includes 3D structures of the protein binding pockets. Additionally, we augmented the data by adding an equal number of putative inactives for each target. Using these data, we developed Banana (basic neural network for binding affinity), a neural network-based model to classify active from inactive compounds, defined by a 10 µM cutoff. Our model achieved an AUC of 0.72 on BigBind's test set, while a ligand-only model achieved an AUC of 0.59. Furthermore, Banana achieved competitive performance on the LIT-PCBA benchmark (median EF1% 1.81) while running 16,000 times faster than molecular docking with Gnina. We suggest that Banana, as well as other models trained on this data set, will significantly improve the outcomes of prospective virtual screening tasks.
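For readers unfamiliar with the EF1% metric quoted above, the following is a minimal sketch of how an enrichment factor is commonly computed: the hit rate among the top-scored fraction of a library divided by the hit rate in the whole library. The function name, the rounding of the cutoff, and the toy data are assumptions made for illustration; they are not taken from the BigBind or Banana code.

```python
def enrichment_factor(scores, labels, fraction=0.01):
    """Hit rate in the top-scored fraction divided by the overall hit rate.

    scores: predicted activity scores (higher means more likely active).
    labels: 1 for active, 0 for inactive, in the same order as scores.
    fraction: portion of the ranked list to inspect (0.01 corresponds to EF1%).
    """
    if len(scores) != len(labels) or not scores:
        raise ValueError("scores and labels must be non-empty and equally long")
    total_actives = sum(labels)
    if total_actives == 0:
        raise ValueError("at least one active compound is required")
    # Assumption: the cutoff is rounded to the nearest compound, minimum one.
    n_top = max(1, int(round(len(scores) * fraction)))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top_actives = sum(label for _, label in ranked[:n_top])
    return (top_actives / n_top) / (total_actives / len(labels))


# Toy library of 100 compounds, 10 of them active, 5 of which rank in the top 10.
toy_scores = [100 - i for i in range(100)]
toy_labels = [1 if i in {0, 1, 2, 3, 4, 50, 60, 70, 80, 90} else 0 for i in range(100)]
print(enrichment_factor(toy_scores, toy_labels, fraction=0.10))
```

An enrichment factor of 1 corresponds to random selection, so values above 1 indicate that actives are concentrated near the top of the ranking.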
Subjects
Proteins; Ubiquitin-Protein Ligases; Molecular Docking Simulation; Ligands; Prospective Studies; Proteins/chemistry; Protein Binding; Ubiquitin-Protein Ligases/metabolism
ABSTRACT
We introduce STOPLIGHT, a web portal to assist medicinal chemists in prioritizing hits from screening campaigns and selecting compounds for optimization. STOPLIGHT incorporates services to assess 6 physicochemical and structural properties, 6 assay liabilities, and 11 pharmacokinetic properties for any small molecule represented by its SMILES string. We briefly describe each service and illustrate the utility of this portal with a case study. The STOPLIGHT portal provides a user-friendly tool to guide hit selection in early drug discovery campaigns, whereby compounds with unfavorable properties can be quickly recognized and eliminated.
Subjects
Drug Discovery; Drug Discovery/methods; Software; Drug Evaluation, Preclinical/methods; Internet; Small Molecule Libraries/chemistry
ABSTRACT
In the ligand prediction category of CASP15, the challenge was to predict the positions and conformations of small molecules binding to proteins that were provided as amino acid sequences or as models generated by the AlphaFold2 program. For most targets, we used our template-based ligand docking program ClusPro ligTBM, also implemented as a public server available at https://ligtbm.cluspro.org/. Since many targets had multiple chains and a number of ligands, several templates and some manual interventions were required. In a few cases, no templates were found, and we had to resort to direct docking with the Glide program. Nevertheless, ligTBM was shown to be a very useful tool, and by any ranking criteria, our group was ranked among the top five best-performing teams. In fact, all the best groups used template-based docking methods. Thus, it appears that the AlphaFold2-generated models, despite the high accuracy of the predicted backbone, have local differences from the x-ray structure that make the use of direct docking methods more challenging. The results of CASP15 confirm that this limitation can frequently be overcome by homology-based docking.
Subjects
Proteins; Software; Protein Conformation; Molecular Docking Simulation; Ligands; Proteins/chemistry; Protein Binding; Binding Sites
ABSTRACT
SUMMARY: In response to the COVID-19 pandemic, we established COVID-KOP, a new knowledgebase integrating the existing Reasoning Over Biomedical Objects linked in Knowledge Oriented Pathways (ROBOKOP) biomedical knowledge graph with information from recent biomedical literature on COVID-19 annotated in the CORD-19 collection. COVID-KOP can be used effectively to generate new hypotheses concerning repurposing of known drugs and clinical drug candidates against COVID-19 by establishing respective confirmatory pathways of drug action. AVAILABILITY AND IMPLEMENTATION: COVID-KOP is freely accessible at https://covidkop.renci.org/. For code and instructions for the original ROBOKOP, see: https://github.com/NCATS-Gamma/robokop.
Subjects
COVID-19; Databases, Factual; Humans; Knowledge Bases; Pandemics; SARS-CoV-2
ABSTRACT
Exogenous metal particles and ions from implant devices are known to cause severe toxic events with symptoms ranging from adverse local tissue reactions to systemic toxicities, potentially leading to the development of cancers, heart conditions, and neurological disorders. Toxicity mechanisms, also known as Adverse Outcome Pathways (AOPs), that explain these metal-induced toxicities are severely understudied. Therefore, we deployed in silico structure- and knowledge-based approaches to identify proteome-level perturbations caused by metals and pathways that link these events to human diseases. We captured 177 structure-based, 347 knowledge-based, and 402 imputed metal-gene/protein relationships for chromium, cobalt, molybdenum, nickel, and titanium. We prioritized 72 proteins hypothesized to directly contact implant surfaces and contribute to adverse outcomes. Results of this exploratory analysis were formalized as structured AOPs. We considered three case studies reflecting the following possible situations: (i) the metal-protein-disease relationship was previously known; (ii) the metal-protein, protein-disease, and metal-disease relationships were individually known but were not linked (as a unified AOP); and (iii) one of three relationships was unknown and was imputed by our methods. These situations were illustrated by case studies on nickel-induced allergy/hypersensitivity, cobalt-induced heart failure, and titanium-induced periprosthetic osteolysis, respectively. All workflows, data, and results are freely available in https://github.com/DnlRKorn/Knowledge_Based_Metallomics/. An interactive view of select data is available at the ROBOKOP Neo4j Browser at http://robokopkg.renci.org/browser/.
Subjects
Adverse Outcome Pathways; Nickel; Humans; Nickel/adverse effects; Titanium/toxicity; Metals/toxicity; Cobalt; Chromium
ABSTRACT
COVID-19 has resulted in huge numbers of infections and deaths worldwide and brought the most severe disruptions to societies and economies since the Great Depression. Massive experimental and computational research effort to understand and characterize the disease and rapidly develop diagnostics, vaccines, and drugs has emerged in response to this devastating pandemic, and more than 130 000 COVID-19-related research papers have been published in peer-reviewed journals or deposited in preprint servers. Much of the research effort has focused on the discovery of novel drug candidates or repurposing of existing drugs against COVID-19, and many such projects have been either exclusively computational or computer-aided experimental studies. Herein, we provide an expert overview of the key computational methods and their applications for the discovery of COVID-19 small-molecule therapeutics that have been reported in the research literature. We further outline that, after the first year of the COVID-19 pandemic, it appears that drug repurposing has not produced rapid and global solutions. However, several known drugs have been used in the clinic to cure COVID-19 patients, and a few repurposed drugs continue to be considered in clinical trials, along with several novel clinical candidates. We posit that truly impactful computational tools must deliver actionable, experimentally testable hypotheses enabling the discovery of novel drugs and drug combinations, and that open science and rapid sharing of research results are critical to accelerate the development of novel, much needed therapeutics for COVID-19.
Subjects
COVID-19 Drug Treatment; Computer Simulation; Drug Design; Drug Discovery/methods; Drug Repositioning; Antiviral Agents/therapeutic use; COVID-19/virology; Clinical Trials as Topic; Humans; Pandemics; SARS-CoV-2/drug effects
ABSTRACT
Safety assessment is an essential component of the regulatory acceptance of industrial chemicals. Previously, we have developed a model to predict the skin sensitization potential of chemicals for two assays, the human patch test and murine local lymph node assay, and implemented this model in a web portal. Here, we report on the substantially revised and expanded freely available web tool, Pred-Skin version 3.0. This up-to-date version of Pred-Skin incorporates multiple quantitative structure-activity relationship (QSAR) models developed with in vitro, in chemico, and mouse and human in vivo data, integrated into a consensus naïve Bayes model that predicts human effects. Individual QSAR models were generated using skin sensitization data derived from human repeat insult patch tests, human maximization tests, and mouse local lymph node assays. In addition, data for three validated alternative methods, the direct peptide reactivity assay, KeratinoSens, and the human cell line activation test, were employed as well. Models were developed using open-source tools and rigorously validated according to the best practices of QSAR modeling. Predictions obtained from these models were then used to build a naïve Bayes model for predicting human skin sensitization with the following external prediction accuracy: correct classification rate (89%), sensitivity (94%), positive predicted value (91%), specificity (84%), and negative predicted value (89%). As an additional assessment of model performance, we identified 11 cosmetic ingredients that are known to cause skin sensitization but were not included in our training set, and nine of them were accurately predicted as sensitizers by our models. Pred-Skin can be used as a reliable alternative to animal tests for predicting human skin sensitization.
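The five statistics reported above all derive from a binary confusion matrix. The sketch below shows the usual definitions, assuming that the correct classification rate (CCR) is the mean of sensitivity and specificity; that assumption is consistent with the reported figures, since (94% + 84%) / 2 = 89%. The function name and the counts in the example are hypothetical and are not the Pred-Skin validation data.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary classification statistics from confusion-matrix counts.

    tp, fp, tn, fn: true positives, false positives, true negatives, false negatives.
    """
    sensitivity = tp / (tp + fn)   # fraction of true sensitizers that were detected
    specificity = tn / (tn + fp)   # fraction of true non-sensitizers that were detected
    ppv = tp / (tp + fp)           # positive predictive value
    npv = tn / (tn + fn)           # negative predictive value
    # Assumption: CCR is taken as balanced accuracy (mean of the two rates above).
    ccr = (sensitivity + specificity) / 2
    return {
        "ccr": ccr,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
    }


# Hypothetical counts for a 100-compound external set.
toy_metrics = classification_metrics(tp=40, fp=5, tn=45, fn=10)
for name, value in toy_metrics.items():
    print(f"{name}: {value:.2f}")
```

Averaging sensitivity and specificity keeps the summary statistic meaningful when the external set contains unequal numbers of sensitizers and non-sensitizers.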
Subjects
Cosmetics/adverse effects; Skin Tests; Skin/drug effects; Animals; Bayes Theorem; Cosmetics/chemistry; Humans; Mice; Quantitative Structure-Activity Relationship
ABSTRACT
Deep learning models have demonstrated outstanding results in many data-rich areas of research, such as computer vision and natural language processing. Currently, there is a rise of deep learning in computational chemistry and materials informatics, where deep learning could be effectively applied in modeling the relationship between chemical structures and their properties. With the immense growth of chemical and materials data, deep learning models can begin to outperform conventional machine learning techniques such as random forest, support vector machines, and nearest neighbor. Herein, we introduce OpenChem, a PyTorch-based deep learning toolkit for computational chemistry and drug design. OpenChem offers easy and fast model development, modular software design, and several data preprocessing modules. It is freely available via the GitHub repository.
Subjects
Deep Learning; Computational Chemistry; Drug Design; Machine Learning; Support Vector Machine
ABSTRACT
Many laboratories working in the field of drug discovery use the ZINC database to identify and then acquire commercially available chemicals. However, finding the best deal for a given compound is often time-intensive and laborious, as the process involves searching for all vendors selling the desired compound, comparing prices, and interacting with the preferred vendor. To streamline this process, we have developed ZINC Express, a web application that simplifies the online purchase of chemicals annotated in the ZINC database. For any compound with a known ZINC ID, ZINC Express finds a list of vendors offering that compound and for each such vendor returns the available package quantities, the price of each package, and the price per milligram along with a link to that vendor. We expect that ZINC Express will be of use to both computational and experimental researchers. ZINC Express is freely accessible online at https://zincexpress.mml.unc.edu/.
Subjects
Commerce; Drug Discovery; Databases, Factual; Zinc
ABSTRACT
The COVID-19 pandemic has catalyzed a widespread effort to identify drug candidates and biological targets of relevance to SARS-CoV-2 infection, which resulted in large numbers of publications on this subject. We have built the COVID-19 Knowledge Extractor (COKE), a web application to extract, curate, and annotate essential drug-target relationships from the research literature on COVID-19. SciBiteAI ontological tagging of the COVID Open Research Data set (CORD-19), a repository of COVID-19 scientific publications, was employed to identify drug-target relationships. Entity identifiers were resolved through lookup routines using UniProt and DrugBank. A custom algorithm was used to identify co-occurrences of the target protein and drug terms, and confidence scores were calculated for each entity pair. COKE processing of the current CORD-19 database identified about 3000 drug-protein pairs, including 29 unique proteins and 500 investigational, experimental, and approved drugs. Some of these drugs are presently undergoing clinical trials for COVID-19. The COKE repository and web application can serve as a useful resource for drug repurposing against SARS-CoV-2. COKE is freely available at https://coke.mml.unc.edu/, and the code is available at https://github.com/DnlRKorn/CoKE.
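The abstract does not specify the custom co-occurrence algorithm or how its confidence scores are calculated. Purely as an illustration of the general idea, the sketch below counts how often a tagged drug and a tagged protein appear in the same sentence. The input layout, the function name, and the placeholder identifiers are all assumptions; none of this is taken from the COKE code, and no confidence score is attempted.

```python
from collections import Counter
from itertools import product


def cooccurrence_counts(tagged_sentences):
    """Count sentence-level co-occurrences of drug and protein identifiers.

    tagged_sentences: iterable of (drugs, proteins) pairs, where each element
    is a set of identifiers already tagged in one sentence (assumed input).
    Returns a Counter keyed by (drug, protein).
    """
    counts = Counter()
    for drugs, proteins in tagged_sentences:
        # Every drug mentioned alongside every protein counts as one co-occurrence.
        for pair in product(sorted(drugs), sorted(proteins)):
            counts[pair] += 1
    return counts


# Placeholder identifiers; real input would come from an ontology tagger.
toy_sentences = [
    ({"drug_A"}, {"protein_X"}),
    ({"drug_A", "drug_B"}, {"protein_Y"}),
    ({"drug_A"}, {"protein_X"}),
    (set(), {"protein_Z"}),  # no drug mentioned, so no pair is produced
]
toy_counts = cooccurrence_counts(toy_sentences)
for pair, n in toy_counts.most_common():
    print(pair, n)
```

Raw counts like these would normally be only a starting point; any scoring scheme layered on top is a separate design decision that the abstract leaves open.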
Subjects
COVID-19; Pharmaceutical Preparations; Antiviral Agents; Drug Repositioning; Humans; Pandemics; SARS-CoV-2
ABSTRACT
Computational methods to predict molecular properties regarding safety and toxicology represent alternative approaches to expedite drug development, screen environmental chemicals, and thus significantly reduce associated time and costs. There is a strong need and interest in the development of computational methods that yield reliable predictions of toxicity, and many approaches, including the recently introduced deep neural networks, have been leveraged towards this goal. Herein, we report on the collection, curation, and integration of data from the public data sets that were the source of the ChemIDplus database for systemic acute toxicity. These efforts generated the largest publicly available such data set comprising > 80,000 compounds measured against a total of 59 acute systemic toxicity end points. This data was used for developing multiple single- and multitask models utilizing random forest, deep neural networks, convolutional, and graph convolutional neural network approaches. For the first time, we also reported the consensus models based on different multitask approaches. To the best of our knowledge, prediction models for 36 of the 59 end points have never been published before. Furthermore, our results demonstrated a significantly better performance of the consensus model obtained from three multitask learning approaches that particularly predicted the 29 smaller tasks (less than 300 compounds) better than other models developed in the study. The curated data set and the developed models have been made publicly available at https://github.com/ncats/ld50-multitask, https://predictor.ncats.io/, and https://cactus.nci.nih.gov/download/acute-toxicity-db (data set only) to support regulatory and research applications.
Subjects
Deep Learning; Consensus; Databases, Factual; Neural Networks, Computer
ABSTRACT
We aimed to develop and validate a new graph embedding algorithm for embedding drug-disease-target networks to generate novel drug repurposing hypotheses. Our model denotes drugs, diseases and targets as subjects, predicates and objects, respectively. Each entity is represented by a multidimensional vector, and the predicate is regarded as a translation vector from the subject vector to the object vector. These vectors are optimized so that when a subject-predicate-object triple represents a known drug-disease-target relationship, the sum of the subject and predicate vectors is close to the object vector; otherwise, the summed vector is distant from the object vector. The DTINet dataset was utilized to test this algorithm and discover unknown links between drugs and diseases. In cross-validation experiments, this new algorithm outperformed the original DTINet model. The MRR (Mean Reciprocal Rank) values of our models were around 0.80 while those of the original model were about 0.70. In addition, we have identified and verified several pairs of new therapeutic relations as well as adverse effect relations that were not recorded in the original DTINet dataset. This approach showed excellent performance, and the predicted drug-disease and drug-side-effect relationships were found to be consistent with literature reports. This novel method can be used to analyze diverse types of emerging biomedical and healthcare-related knowledge graphs (KG).
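The translation idea described above (subject vector plus predicate vector should land near the object vector) and the MRR metric can be sketched in a few lines. This is a toy illustration only: the Euclidean distance, the function names, and the two-dimensional vectors are assumptions, and no training step is shown. It is not the authors' implementation.

```python
import math


def translation_distance(subject, predicate, obj):
    """Euclidean distance between (subject + predicate) and the object vector.

    A small distance means the triple is scored as plausible.
    The choice of the L2 norm is an assumption made for this sketch.
    """
    return math.sqrt(sum((s + p - o) ** 2 for s, p, o in zip(subject, predicate, obj)))


def rank_of_true_object(subject, predicate, candidates, true_name):
    """1-based rank of the true object among candidates ordered by distance."""
    ordered = sorted(
        candidates,
        key=lambda name: translation_distance(subject, predicate, candidates[name]),
    )
    return ordered.index(true_name) + 1


def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all test triples; 1.0 means every answer ranked first."""
    return sum(1.0 / r for r in ranks) / len(ranks)


# Hypothetical two-dimensional embeddings.
toy_subject = [1.0, 0.0]
toy_predicate = [0.0, 1.0]
toy_candidates = {"a": [1.0, 1.0], "b": [3.0, 3.0], "c": [0.0, 0.0]}

print(rank_of_true_object(toy_subject, toy_predicate, toy_candidates, "a"))
print(mean_reciprocal_rank([1, 2, 4]))
```

In this framing, an MRR near 0.80 means that the correct entity typically appears at or very close to the top of the ranked candidate list.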
Subjects
Drug Repositioning; Pharmaceutical Preparations; Algorithms; Humans; Knowledge; Pattern Recognition, Automated
ABSTRACT
Prediction of chemical bioactivity and physical properties has been one of the most important applications of statistical and, more recently, machine learning and artificial intelligence methods in chemical sciences. This field of research, broadly known as quantitative structure-activity relationships (QSAR) modeling, has developed many important algorithms and has found a broad range of applications in physical organic and medicinal chemistry in the past 55+ years. This Perspective summarizes recent technological advances in QSAR modeling, but it also highlights the applicability of algorithms, modeling methods, and validation practices developed in QSAR to a wide range of research areas outside of traditional QSAR boundaries, including synthesis planning, nanotechnology, materials science, biomaterials, and clinical informatics. As modern research methods generate rapidly increasing amounts of data, the knowledge of robust data-driven modeling methods professed within the QSAR field can become essential for scientists working both within and outside of chemical research. We hope that this contribution highlighting the generalizable components of QSAR modeling will serve to address this challenge.
Subjects
Chemistry, Pharmaceutical/methods; Drug-Related Side Effects and Adverse Reactions/metabolism; Pharmaceutical Preparations/chemistry; Algorithms; Animals; Artificial Intelligence; Databases, Factual; Drug Design; History, 20th Century; History, 21st Century; Humans; Models, Molecular; Quantitative Structure-Activity Relationship; Quantum Theory; Reproducibility of Results
ABSTRACT
Correction for 'QSAR without borders' by Eugene N. Muratov et al., Chem. Soc. Rev., 2020, DOI: 10.1039/d0cs00098a.
ABSTRACT
New Approach Methodologies (NAMs) that employ artificial intelligence (AI) for predicting adverse effects of chemicals have generated optimistic expectations as alternatives to animal testing. However, the major underappreciated challenge in developing robust and predictive AI models is the impact of the quality of the input data on the model accuracy. Indeed, poor data reproducibility and quality have been frequently cited as factors contributing to the crisis in biomedical research, as well as similar shortcomings in the fields of toxicology and chemistry. In this article, we review the most recent efforts to improve confidence in the robustness of toxicological data and investigate the impact that data curation has on the confidence in model predictions. We also present two case studies demonstrating the effect of data curation on the performance of AI models for predicting skin sensitisation and skin irritation. We show that, whereas models generated with uncurated data had a 7-24% higher correct classification rate (CCR), the perceived performance was, in fact, inflated owing to the high number of duplicates in the training set. We assert that data curation is a critical step in building computational models, to help ensure that reliable predictions of chemical toxicity are achieved through use of the models.
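One curation step implied by the finding above is the removal of duplicate structures before a data set is split, so that the same compound cannot appear in both training and evaluation data. The sketch below assumes each record already carries a standardized structure key (for example, a canonical identifier string) and a binary label. The function name, the record layout, and the decision to drop structures with conflicting labels are assumptions for illustration, not the authors' protocol.

```python
from collections import defaultdict


def deduplicate(records):
    """Collapse duplicate structures to one record each.

    records: iterable of (structure_key, label) pairs.
    Structures whose duplicates disagree on the label are dropped entirely
    (an assumption of this sketch; other policies are possible).
    Returns a list of (structure_key, label) in first-seen order.
    """
    labels_by_key = defaultdict(set)
    for key, label in records:
        labels_by_key[key].add(label)
    return [
        (key, next(iter(labels)))
        for key, labels in labels_by_key.items()
        if len(labels) == 1
    ]


# Hypothetical records: "A" is duplicated consistently, "C" has conflicting labels.
toy_records = [("A", 1), ("A", 1), ("B", 0), ("C", 1), ("C", 0)]
toy_curated = deduplicate(toy_records)
print(toy_curated)
```

Running the deduplication before any train/test or cross-validation split is what prevents the inflated performance estimates described in the abstract.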
Subjects
Animal Testing Alternatives; Artificial Intelligence; Animals; Computer Simulation; Data Accuracy; Reproducibility of Results
ABSTRACT
MOTIVATION: Non-ribosomal peptide synthetases (NRPSs) are modular enzymatic machines that catalyze the ribosome-independent production of structurally complex small peptides, many of which have important clinical applications as antibiotics, antifungals and anti-cancer agents. Several groups have tried to expand natural product diversity by intermixing different NRPS modules to create synthetic peptides. This approach has not been as successful as anticipated, suggesting that these modules are not fully interchangeable. RESULTS: We explored whether Inter-Modular Linkers (IMLs) impact the ability of NRPS modules to communicate during the synthesis of NRPs. We developed a parser to extract 39 804 IMLs from both well-annotated and putative NRPS biosynthetic gene clusters from 39 232 bacterial genomes and established the first IMLs database. We analyzed these IMLs and identified a striking relationship between IMLs and the amino acid substrates of their adjacent modules. More than 92% of the identified IMLs connect modules that activate a particular pair of substrates, suggesting that significant specificity is embedded within these sequences. We therefore propose that incorporating the correct IML is critical when attempting combinatorial biosynthesis of novel NRPS. AVAILABILITY AND IMPLEMENTATION: The IMLs database as well as the NRPS-Parser have been made available on the web at https://nrps-linker.unc.edu. The entire source code of the project is hosted in a GitHub repository (https://github.com/SWFarag/nrps-linker). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subjects
Ribosomes; Anti-Bacterial Agents; Biological Products; Peptide Synthases; Peptides
ABSTRACT
SUMMARY: Knowledge graphs (KGs) are quickly becoming a commonplace tool for storing relationships between entities from which higher-level reasoning can be conducted. KGs are typically stored in a graph-database format, and graph-database queries can be used to answer questions of interest that have been posed by users such as biomedical researchers. For simple queries, the inclusion of direct connections in the KG and the storage and analysis of query results are straightforward; however, for complex queries, these capabilities become exponentially more challenging with each increase in complexity of the query. For instance, one relatively complex query can yield a KG with hundreds of thousands of query results. Thus, the ability to efficiently query, store, rank and explore sub-graphs of a complex KG represents a major challenge to any effort designed to exploit the use of KGs for applications in biomedical research and other domains. We present Reasoning Over Biomedical Objects linked in Knowledge Oriented Pathways (ROBOKOP) as an abstraction layer and user interface to more easily query KGs and store, rank and explore query results. AVAILABILITY AND IMPLEMENTATION: An instance of the ROBOKOP UI for exploration of the ROBOKOP Knowledge Graph can be found at http://robokop.renci.org. The ROBOKOP Knowledge Graph can be accessed at http://robokopkg.renci.org. Code and instructions for building and deploying ROBOKOP are available under the MIT open software license from https://github.com/NCATS-Gamma/robokop. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.