ABSTRACT
ChEMBL (https://www.ebi.ac.uk/chembl/) is a manually curated, high-quality, large-scale, open, FAIR and Global Core Biodata Resource of bioactive molecules with drug-like properties, previously described in the 2012, 2014, 2017 and 2019 Nucleic Acids Research Database Issues. Since its introduction in 2009, ChEMBL's content has changed dramatically in size and diversity of data types. Through incorporation of multiple new datasets from depositors since the 2019 update, ChEMBL now contains slightly more bioactivity data from deposited datasets than from data extracted from the literature. In collaboration with the EUbOPEN consortium, chemical probe data is now regularly deposited into ChEMBL. Release 27 made curated data available for compounds screened for potential anti-SARS-CoV-2 activity from several large-scale drug repurposing screens. In addition, new patent bioactivity data have been added to the latest ChEMBL releases, and various new features have been incorporated, including a Natural Product likeness score, updated flags for Natural Products, a new flag for Chemical Probes, and the initial annotation of the action type for ~270 000 bioactivity measurements.
Subject(s)
Drug Discovery , Databases, Factual , Time Factors
ABSTRACT
The European Molecular Biology Laboratory's European Bioinformatics Institute (EMBL-EBI) is one of the world's leading sources of public biomolecular data. Based at the Wellcome Genome Campus in Hinxton, UK, EMBL-EBI is one of six sites of the European Molecular Biology Laboratory (EMBL), Europe's only intergovernmental life sciences organisation. This overview summarises the latest developments in the services provided by EMBL-EBI data resources to scientific communities globally. These developments aim to ensure EMBL-EBI resources meet the current and future needs of these scientific communities, accelerating the impact of open biological data for all.
Subject(s)
Academies and Institutes , Computational Biology , Computational Biology/organization & administration , Computational Biology/trends , Academies and Institutes/organization & administration , Academies and Institutes/trends , Databases, Nucleic Acid , Europe
ABSTRACT
Low success rates during drug development are due, in part, to the difficulty of defining drug mechanism-of-action and molecular markers of therapeutic activity. Here, we integrated 199,219 drug sensitivity measurements for 397 unique anti-cancer drugs with genome-wide CRISPR loss-of-function screens in 484 cell lines to systematically investigate cellular drug mechanism-of-action. We observed an enrichment for positive associations between the profile of drug sensitivity and knockout of a drug's nominal target, and by leveraging protein-protein networks, we identified pathways underpinning drug sensitivity. This revealed an unappreciated positive association between mitochondrial E3 ubiquitin-protein ligase MARCH5 dependency and sensitivity to MCL1 inhibitors in breast cancer cell lines. We also estimated drug on-target and off-target activity, informing on specificity, potency and toxicity. Linking drug and gene dependency together with genomic data sets uncovered contexts in which molecular networks when perturbed mediate cancer cell loss-of-fitness and thereby provide independent and orthogonal evidence of biomarkers for drug development. This study illustrates how integrating cell line drug sensitivity with CRISPR loss-of-function screens can elucidate mechanism-of-action to advance drug development.
Subject(s)
Antineoplastic Agents/pharmacology , CRISPR-Cas Systems , Drug Development/methods , Drug Screening Assays, Antitumor/methods , Gene Regulatory Networks/drug effects , Genetic Fitness/drug effects , Protein Interaction Maps/drug effects , Antineoplastic Agents/toxicity , Biomarkers/metabolism , Cell Line, Tumor , Gene Knockout Techniques , Gene Regulatory Networks/genetics , Genetic Fitness/genetics , Genomics , Humans , Linear Models , Membrane Proteins/genetics , Membrane Proteins/metabolism , Myeloid Cell Leukemia Sequence 1 Protein/antagonists & inhibitors , Pharmaceutical Preparations/metabolism , Software , Ubiquitin-Protein Ligases/genetics , Ubiquitin-Protein Ligases/metabolism
ABSTRACT
The safety of marketed drugs is an ongoing concern, with some of the more frequently prescribed medicines resulting in serious or life-threatening adverse effects in some patients. Safety-related information for approved drugs has been curated to include the assignment of toxicity class(es) based on their withdrawn status and/or black box warning information described on medicinal product labels. The ChEMBL resource contains a wide range of bioactivity data types, from early "Discovery" stage preclinical data for individual compounds through to postclinical data on marketed drugs; the inclusion of the curated drug safety data set within this framework can support a wide range of safety-related drug discovery questions. The curated drug safety data set will be made freely available through ChEMBL and updated in future database releases.
Subject(s)
Pharmaceutical Preparations/chemistry , Data Curation , Drug Approval , Drug-Related Side Effects and Adverse Reactions , Humans , Models, Molecular
ABSTRACT
ChEMBL is a large, open-access bioactivity database (https://www.ebi.ac.uk/chembl), previously described in the 2012, 2014 and 2017 Nucleic Acids Research Database Issues. In the last two years, several important improvements have been made to the database and are described here. These include more robust capture and representation of assay details; a new data deposition system, allowing updating of data sets and deposition of supplementary data; and a completely redesigned web interface, with enhanced search and filtering capabilities.
Subject(s)
Databases, Pharmaceutical , Drug Discovery , Biological Assay , Periodicals as Topic , User-Computer Interface
ABSTRACT
Methods that survey protein surfaces for binding hotspots can help to evaluate target tractability and guide exploration of potential ligand binding regions. Fragment Hotspot Maps builds upon interaction data mined from the CSD (Cambridge Structural Database) and exploits the idea of identifying hotspots using small chemical fragments, which is now widely used to design new drug leads. Prior to this publication, Fragment Hotspot Maps was only publicly available through a web application. To increase the accessibility of this algorithm we present the Hotspots API (application programming interface), a toolkit that offers programmatic access to the core Fragment Hotspot Maps algorithm, thereby facilitating the interpretation and application of the analysis. To demonstrate the package's utility, we present a workflow which automatically derives protein hydrogen-bond constraints for molecular docking with GOLD. The Hotspots API is available from https://github.com/prcurran/hotspots under the MIT license and is dependent upon the commercial CSD Python API.
Subject(s)
Drug Design , Software , Databases, Factual , Molecular Docking Simulation , Proteins
ABSTRACT
ChEMBL is an open large-scale bioactivity database (https://www.ebi.ac.uk/chembl), previously described in the 2012 and 2014 Nucleic Acids Research Database Issues. Since then, alongside the continued extraction of data from the medicinal chemistry literature, new sources of bioactivity data have also been added to the database. These include: deposited data sets from neglected disease screening; crop protection data; drug metabolism and disposition data and bioactivity data from patents. A number of improvements and new features have also been incorporated. These include the annotation of assays and targets using ontologies, the inclusion of targets and indications for clinical candidates, addition of metabolic pathways for drugs and calculation of structural alerts. The ChEMBL data can be accessed via a web-interface, RDF distribution, data downloads and RESTful web-services.
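As an illustration of the RESTful access route mentioned above, the following minimal sketch builds a query URL for the ChEMBL web services. The base URL, the `activity` resource name, and the `target_chembl_id` and `limit` filter parameters are assumptions based on the publicly documented ChEMBL web services rather than details given in this abstract, and `chembl_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

# Assumed base URL of the ChEMBL web services (not stated in the abstract).
CHEMBL_API_BASE = "https://www.ebi.ac.uk/chembl/api/data"


def chembl_url(resource: str, **filters: str) -> str:
    """Build a JSON query URL for a ChEMBL resource (hypothetical helper).

    `resource` is a resource name such as "activity" or "molecule";
    `filters` are query parameters, sorted for a reproducible URL.
    """
    url = f"{CHEMBL_API_BASE}/{resource}.json"
    query = urlencode(sorted(filters.items()))
    return f"{url}?{query}" if query else url


if __name__ == "__main__":
    # Example: first five activity records for an assumed target identifier.
    print(chembl_url("activity", target_chembl_id="CHEMBL240", limit="5"))
```

The resulting URL could then be fetched with any HTTP client, for example `urllib.request.urlopen`; the sketch stops at URL construction so that it runs without network access.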
Subject(s)
Databases, Chemical , Databases, Nucleic Acid , Search Engine , Computational Biology/methods , Crop Protection , Drug Discovery , Gene Ontology , Humans , Molecular Sequence Annotation , Pharmacology/methods , User-Computer Interface , Web Browser
ABSTRACT
Providing a better understanding of what makes a compound a successful drug candidate is crucial for reducing the high attrition rates in drug discovery. Analyses of the differences between active compounds, clinical candidates and drugs require high-quality datasets. However, most datasets of drug discovery programs are not openly available. This work introduces a dataset of compound-target pairs extracted from the open-source bioactivity database ChEMBL (release 32). Compound-target pairs in the dataset either have at least one measured activity or are part of the manually curated set of known interactions in ChEMBL. Known interactions between drugs or clinical candidates and targets are specifically annotated to facilitate analyses of differences between drugs, clinical candidates, and other active compounds. In total, the dataset comprises 614,594 compound-target pairs, 5,109 (3,932) of which are known interactions between drugs (clinical candidates) and targets. The extraction is performed in an automated manner and fully reproducible. We are providing not only the datasets but also the code to rerun the analyses with other ChEMBL releases.
Subject(s)
Drug Discovery , Humans , Pharmaceutical Preparations , Databases, Pharmaceutical
ABSTRACT
Published compounds from ChEMBL version 32 are used to seek evidence for the occurrence of "natural selection" in drug discovery. Three measures of natural product (NP) character were applied, to compare time- and target-matched compounds reaching the clinic (clinical compounds in phase 1-3 development and approved drugs) with background compounds (reference compounds). Pseudo-NPs (PNPs), containing NP fragments combined in ways inaccessible by nature, are increasing over time, reaching 67% of clinical compounds first disclosed since 2010. PNPs are 54% more likely to be found in post-2008 clinical versus reference compounds. The majority of target classes show increased clinical compound NP character versus their reference compounds. Only 176 NP fragments appear in >1000 clinical compounds published since 2008, yet these make up on average 63% of the clinical compounds' core scaffolds. There is untapped potential awaiting exploitation, by applying nature's building blocks ("natural intelligence") to drug design.
Subject(s)
Biological Products , Drug Discovery , Small Molecule Libraries , Biological Products/chemistry , Biological Products/pharmacology , Humans , Small Molecule Libraries/chemistry , Drug Design
ABSTRACT
The Knowledge Management Center (KMC) for the Illuminating the Druggable Genome (IDG) project aims to aggregate, update, and articulate protein-centric data knowledge for the entire human proteome, with emphasis on the understudied proteins from the three IDG protein families. KMC collates and analyzes data from over 70 resources to compile the Target Central Resource Database (TCRD), which is the web-based informatics platform (Pharos). These data include experimental, computational, and text-mined information on protein structures, compound interactions, and disease and phenotype associations. Based on this knowledge, proteins are classified into different Target Development Levels (TDLs) for identification of understudied targets. Additional work by the KMC focuses on enriching target knowledge and producing DrugCentral and other data visualization tools for expanding investigation of understudied targets.
Subject(s)
Genome , Knowledge Management , Humans , Proteome , Databases, Factual , Informatics
ABSTRACT
The Structural Genomics Consortium is an international open science research organization with a focus on accelerating early-stage drug discovery, namely hit discovery and optimization. We, as many others, believe that artificial intelligence (AI) is poised to be a main accelerator in the field. The question is then how to best benefit from recent advances in AI and how to generate, format and disseminate data to enable future breakthroughs in AI-guided drug discovery. We present here the recommendations of a working group composed of experts from both the public and private sectors. Robust data management requires precise ontologies and standardized vocabulary while a centralized database architecture across laboratories facilitates data integration into high-value datasets. Lab automation and opening electronic lab notebooks to data mining push the boundaries of data sharing and data modeling. Important considerations for building robust machine-learning models include transparent and reproducible data processing, choosing the most relevant data representation, defining the right training and test sets, and estimating prediction uncertainty. Beyond data-sharing, cloud-based computing can be harnessed to build and disseminate machine-learning models. Important vectors of acceleration for hit and chemical probe discovery will be (1) the real-time integration of experimental data generation and modeling workflows within design-make-test-analyze (DMTA) cycles openly, and at scale and (2) the adoption of a mindset where data scientists and experimentalists work as a unified team, and where data science is incorporated into the experimental design.
Subject(s)
Data Science , Drug Discovery , Machine Learning , Drug Discovery/methods , Data Science/methods , Humans , Artificial Intelligence , Information Dissemination/methods , Data Mining/methods , Cloud Computing , Databases, Factual
ABSTRACT
The patent literature is a potentially valuable source of bioactivity data. In this article we describe a process to prioritise 3.7 million life science relevant patents obtained from the SureChEMBL database (https://www.surechembl.org/), according to how likely they were to contain bioactivity data for potent small molecules on less-studied targets, based on the classification developed by the Illuminating the Druggable Genome (IDG) project. The overall goal was to select a smaller number of patents that could be manually curated and incorporated into the ChEMBL database. Using relatively simple annotation and filtering pipelines, we have been able to identify a substantial number of patents containing quantitative bioactivity data for understudied targets that had not previously been reported in the peer-reviewed medicinal chemistry literature. We quantify the added value of such methods in terms of the numbers of targets that are so identified, and provide some specific illustrative examples. Our work underlines the potential value in searching the patent corpus in addition to the more traditional peer-reviewed literature. The small molecules found in these patents, together with their measured activity against the targets, are now accessible via the ChEMBL database.
Subject(s)
Chemistry, Pharmaceutical , Drug Discovery , Drug Discovery/methods , Databases, Factual
ABSTRACT
Efforts to tackle malaria must continue for a disease that threatens half of the global population. Parasite resistance to current therapies requires new chemotypes that are able to demonstrate effectiveness and safety. Previously, we developed a machine-learning-based approach to predict compound antimalarial activity, which was trained on the compound collections of several organizations. The resulting prediction platform, MAIP, was made freely available to the scientific community and offers a solution to prioritize molecules of interest in virtual screening and hit-to-lead optimization. Here, we experimentally validate MAIP and demonstrate how the approach was used in combination with a robust compound selection workflow and a recently introduced innovative high-throughput screening (HTS) cascade to select and purchase compounds from a public library for subsequent experimental screening. We observed a 12-fold enrichment compared with a randomly selected set of molecules, and the eight hits we ultimately selected exhibit good potency and absorption, distribution, metabolism, and excretion (ADME) profiles.
ABSTRACT
Advancing age is the greatest risk factor for developing multiple age-related diseases. Therapeutic approaches targeting the underlying pathways of ageing, rather than individual diseases, may be an effective way to treat and prevent age-related morbidity while reducing the burden of polypharmacy. We harness the Open Targets Genetics Portal to perform a systematic analysis of nearly 1,400 genome-wide association studies (GWAS) mapped to 34 age-related diseases and traits, identifying genetic signals that are shared between two or more of these traits. Using locus-to-gene (L2G) mapping, we identify 995 targets with shared genetic links to age-related diseases and traits, which are enriched in mechanisms of ageing and include known ageing and longevity-related genes. Of these 995 genes, 128 are the target of an approved or investigational drug, 526 have experimental evidence of binding pockets or are predicted to be tractable, and 341 have no existing tractability evidence, representing underexplored genes which may reveal novel biological insights and therapeutic opportunities. We present these candidate targets for exploration and prioritisation in a web application.
Subject(s)
Aging , Genome-Wide Association Study , Multimorbidity , Longevity , Phenotype , Aging/genetics , Humans
ABSTRACT
We conduct a large-scale meta-analysis of heart failure genome-wide association studies (GWAS) consisting of over 90,000 heart failure cases and more than 1 million control individuals of European ancestry to uncover novel genetic determinants for heart failure. Using the GWAS results and blood protein quantitative loci, we perform Mendelian randomization and colocalization analyses on human proteins to provide putative causal evidence for the role of druggable proteins in the genesis of heart failure. We identify 39 genome-wide significant heart failure risk variants, of which 18 are previously unreported. Using a combination of Mendelian randomization proteomics and genetic cis-only colocalization analyses, we identify 10 additional putatively causal genes for heart failure. Findings from GWAS and Mendelian randomization-proteomics identify seven (CAMK2D, PRKD1, PRKD3, MAPK3, TNFSF12, APOC3 and NAE1) proteins as potential targets for interventions to be used in primary prevention of heart failure.
Subject(s)
Genome-Wide Association Study , Heart Failure , Humans , Mendelian Randomization Analysis , Proteomics , Heart Failure/drug therapy , Heart Failure/genetics
ABSTRACT
One aspirational goal of computational chemistry is to predict potent and drug-like binders for any protein, such that only those that bind are synthesized. In this Roadmap, we describe the launch of Critical Assessment of Computational Hit-finding Experiments (CACHE), a public benchmarking project to compare and improve small molecule hit-finding algorithms through cycles of prediction and experimental testing. Participants will predict small molecule binders for new and biologically relevant protein targets representing different prediction scenarios. Predicted compounds will be tested rigorously in an experimental hub, and all predicted binders as well as all experimental screening data, including the chemical structures of experimentally tested compounds, will be made publicly available, and not subject to any intellectual property restrictions. The ability of a range of computational approaches to find novel binders will be evaluated, compared, and openly published. CACHE will launch 3 new benchmarking exercises every year. The outcomes will be better prediction methods, new small molecule binders for target proteins of importance for fundamental biology or drug discovery, and a major technological step towards achieving the goal of Target 2035, a global initiative to identify pharmacological probes for all human proteins.
ABSTRACT
Twenty years after the publication of the first draft of the human genome, our knowledge of the human proteome is still fragmented. The challenge of translating the wealth of new knowledge from genomics into new medicines is that proteins, and not genes, are the primary executers of biological function. Therefore, much of how biology works in health and disease must be understood through the lens of protein function. Accordingly, a subset of human proteins has been at the heart of research interests of scientists over the centuries, and we have accumulated varying degrees of knowledge about approximately 65% of the human proteome. Nevertheless, a large proportion of proteins in the human proteome (~35%) remains uncharacterized, and less than 5% of the human proteome has been successfully targeted for drug discovery. This highlights the profound disconnect between our abilities to obtain genetic information and subsequent development of effective medicines. Target 2035 is an international federation of biomedical scientists from the public and private sectors, which aims to address this gap by developing and applying new technologies to create by year 2035 chemogenomic libraries, chemical probes, and/or biological probes for the entire human proteome.
ABSTRACT
Development of adaptive immunity after COVID-19 and after vaccination against SARS-CoV-2 is predicated on recognition of viral peptides, presented on HLA class II molecules, by CD4+ T-cells. We capitalised on extensive high-resolution HLA data on twenty five human race/ethnic populations to investigate the role of HLA polymorphism on SARS-CoV-2 immunogenicity at the population and individual level. Within populations, we identify wide inter-individual variability in predicted peptide presentation from structural, non-structural and accessory SARS-CoV-2 proteins, according to individual HLA genotype. However, we find similar potential for anti-SARS-CoV-2 cellular immunity at the population level suggesting that HLA polymorphism is unlikely to account for observed disparities in clinical outcomes after COVID-19 among different race/ethnic groups. Our findings provide important insight on the potential role of HLA polymorphism on development of protective immunity after SARS-CoV-2 infection and after vaccination and a firm basis for further experimental studies in this field.
Subject(s)
COVID-19/immunology , Histocompatibility Antigens Class II/genetics , Immunity, Cellular , SARS-CoV-2/immunology , Antigen Presentation , CD4-Positive T-Lymphocytes/immunology , COVID-19/genetics , Genotype , Histocompatibility Antigens Class II/immunology , Humans , Peptides/immunology , Polymorphism, Genetic , Proteome/immunology , Viral Proteins/immunology
ABSTRACT
Physicochemical descriptors commonly used to define "drug-likeness" and ligand efficiency measures are assessed for their ability to differentiate marketed drugs from compounds reported to bind to their efficacious target or targets. Using ChEMBL version 26, a data set of 643 drugs acting on 271 targets was assembled, comprising 1104 drug-target pairs having ≥100 published compounds per target. Taking into account changes in their physicochemical properties over time, drugs are analyzed according to their target class, therapy area, and route of administration. Recent drugs, approved in 2010-2020, display no overall differences in molecular weight, lipophilicity, hydrogen bonding, or polar surface area from their target comparator compounds. Drugs are differentiated from target comparators by higher potency, ligand efficiency (LE), lipophilic ligand efficiency (LLE), and lower carboaromaticity. Overall, 96% of drugs have LE or LLE values, or both, greater than the median values of their target comparator compounds.
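The two efficiency measures named above can be sketched as follows, using their commonly quoted definitions: LE as the binding free energy per heavy atom, and LLE as potency minus lipophilicity. The function names, the default temperature, and the example values are illustrative assumptions; the abstract does not state the exact formulas or inputs the authors used.

```python
import math
from statistics import median

GAS_CONSTANT_KCAL = 1.987204e-3  # kcal / (mol K)


def ligand_efficiency(p_activity: float, heavy_atoms: int,
                      temperature_k: float = 298.15) -> float:
    """LE = -dG / N_heavy, with dG approximated as -RT ln(10) * pActivity.

    Returns kcal/mol per heavy atom (about 1.36 * pActivity / N_heavy).
    """
    if heavy_atoms <= 0:
        raise ValueError("heavy_atoms must be positive")
    rt_ln10 = GAS_CONSTANT_KCAL * temperature_k * math.log(10)
    return rt_ln10 * p_activity / heavy_atoms


def lipophilic_ligand_efficiency(p_activity: float, log_p: float) -> float:
    """LLE = pActivity - logP (unitless)."""
    return p_activity - log_p


def exceeds_comparator_median(drug_value: float, comparators: list[float]) -> bool:
    """True if a drug's value is above the median of its target comparators."""
    return drug_value > median(comparators)


if __name__ == "__main__":
    # Hypothetical compound: pIC50 = 8.0, 25 heavy atoms, logP = 3.0.
    le = ligand_efficiency(8.0, 25)
    lle = lipophilic_ligand_efficiency(8.0, 3.0)
    print(f"LE = {le:.3f}, LLE = {lle:.1f}")
    # Hypothetical comparator LE values for the same target.
    print(exceeds_comparator_median(le, [0.30, 0.35, 0.41]))
```

The last helper mirrors the comparison described in the abstract, where each drug's LE or LLE is set against the median value of the compounds published for the same target.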
Subject(s)
Ligands , Pharmaceutical Preparations/chemistry , Databases, Chemical , Drug Administration Routes , Hydrogen Bonding , Hydrophobic and Hydrophilic Interactions , Molecular Weight , Pharmaceutical Preparations/metabolism
ABSTRACT
Proteolysis-targeting chimeras (PROTACs) are an emerging drug modality that may offer new opportunities to circumvent some of the limitations associated with traditional small-molecule therapeutics. By analogy with the concept of the 'druggable genome', the question arises as to which potential drug targets PROTAC-mediated protein degradation might be most applicable to. Here, we present a systematic approach to the assessment of the PROTAC tractability (PROTACtability) of protein targets using a series of criteria based on data and information from a diverse range of relevant publicly available resources. Our approach could support decision-making on whether or not a particular target may be amenable to modulation using a PROTAC. Using our approach, we identified 1,067 proteins of the human proteome that have not yet been described in the literature as PROTAC targets and that offer potential opportunities for future PROTAC-based efforts.