Abstract
Federated multipartner machine learning has been touted as an appealing and efficient method to increase the effective training data volume and thereby the predictivity of models, particularly when the generation of training data is resource-intensive. In the landmark MELLODDY project, indeed, each of ten pharmaceutical companies realized aggregated improvements on its own classification or regression models through federated learning. To this end, they leveraged a novel implementation extending multitask learning across partners, on a platform audited for privacy and security. The experiments involved an unprecedented cross-pharma data set of 2.6+ billion confidential experimental activity data points, documenting 21+ million physical small molecules and 40+ thousand assays in on-target and secondary pharmacodynamics and pharmacokinetics. Appropriate complementary metrics were developed to evaluate the predictive performance in the federated setting. In addition to predictive performance increases in labeled space, the results point toward an extended applicability domain in federated learning. Increases in collective training data volume, including by means of auxiliary data resulting from single concentration high-throughput and imaging assays, continued to boost predictive performance, albeit with a saturating return. Markedly higher improvements were observed for the pharmacokinetics and safety panel assay-based task subsets.
Subjects
Benchmarking, Quantitative Structure-Activity Relationship, Biological Assay, Machine Learning
Abstract
Machine learning models that predict the bioactivity of chemical compounds are now standard tools for cheminformaticians and computational medicinal chemists. Multi-task and federated learning are promising machine learning approaches that allow privacy-preserving use of large amounts of data from diverse sources, which is crucial for achieving good generalization and high-performance results. Using large, real-world data sets from six pharmaceutical companies, we investigate different strategies for averaging weighted task loss functions to train multi-task bioactivity classification models. The weighting strategies should be suitable for federated learning and ensure that learning efforts are well distributed even if the data are diverse. Comparing several approaches using weights that depend on the number of sub-tasks per assay, task size, and class balance, respectively, we find that a simple sub-task weighting approach leads to robust model performance for all investigated data sets and is especially suited for federated learning.
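As an illustration only, the idea of averaging weighted task losses can be sketched as follows. The abstract does not give the formulas, so this assumes one plausible reading of "sub-task weighting": each task receives weight 1/n, where n is the number of sub-tasks derived from the same assay. The function names (`subtask_weights`, `bce`, `weighted_multitask_loss`) and the assay labels are hypothetical and not taken from the paper.

```python
import math
from collections import Counter


def subtask_weights(assay_of_task):
    """Assumed scheme: weight each task by 1 / (number of sub-tasks of its assay)."""
    counts = Counter(assay_of_task)
    return [1.0 / counts[assay] for assay in assay_of_task]


def bce(y_true, p_pred, eps=1e-12):
    """Mean binary cross-entropy over the labeled compounds of a single task."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


def weighted_multitask_loss(task_losses, weights):
    """Weighted average of per-task losses, normalized by the sum of the weights."""
    return sum(w * l for w, l in zip(weights, task_losses)) / sum(weights)


# Hypothetical example: assay_A was split into three threshold sub-tasks, assay_B into one.
assay_of_task = ["assay_A", "assay_A", "assay_A", "assay_B"]
weights = subtask_weights(assay_of_task)
task_losses = [
    bce([1, 0], [0.9, 0.2]),
    bce([1, 1], [0.8, 0.6]),
    bce([0, 0], [0.1, 0.3]),
    bce([1, 0], [0.7, 0.4]),
]
loss = weighted_multitask_loss(task_losses, weights)
```

Under this assumed scheme, an assay contributes the same total weight regardless of how many thresholds it was split into. Because the weights depend only on local task metadata, each partner could compute them without seeing other partners' data.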
Subjects
Drug Discovery/methods, Machine Learning, Drug Design, Humans, Small Molecule Libraries/chemistry, Small Molecule Libraries/pharmacology
Abstract
ADME (Absorption, Distribution, Metabolism, Excretion) properties are key parameters to judge whether a drug candidate exhibits a desired pharmacokinetic (PK) profile. In this study, we tested multi-task machine learning (ML) models to predict ADME and animal PK endpoints, trained on in-house data generated at Boehringer Ingelheim. Models were evaluated both at the design stage of a compound (i.e., no experimental data of test compounds available) and at the testing stage, when a particular assay would be conducted (i.e., experimental data of earlier conducted assays may be available). Using realistic time-splits, we found a clear performance benefit of multi-task graph-based neural network models over single-task models, which was even stronger when experimental data of earlier assays were available. In an attempt to explain the success of multi-task models, we found that especially the endpoints with the largest numbers of data points (physicochemical endpoints, clearance in microsomes) are responsible for increased predictivity in more complex ADME and PK endpoints. In summary, our study provides insight into how data for multiple ADME/PK endpoints in a pharmaceutical company can be best leveraged to optimize predictivity of ML models.
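A time-split of the kind mentioned above can be sketched in a few lines. This is a generic illustration, not the study's actual protocol: the record layout (compound ID, measurement date, value), the cutoff date, and the function name `time_split` are all assumptions made for the example.

```python
from datetime import date


def time_split(records, cutoff):
    """Split records chronologically: train on data before the cutoff, test on the rest.

    Each record is assumed to be a (compound_id, measurement_date, value) tuple.
    """
    train = [r for r in records if r[1] < cutoff]
    test = [r for r in records if r[1] >= cutoff]
    return train, test


# Hypothetical measurements for a single endpoint.
records = [
    ("cpd-1", date(2019, 3, 1), 0.42),
    ("cpd-2", date(2020, 7, 15), 1.10),
    ("cpd-3", date(2021, 1, 9), 0.87),
    ("cpd-4", date(2021, 11, 30), 0.15),
]
train, test = time_split(records, cutoff=date(2021, 1, 1))
```

Compared with a random split, a chronological split keeps later compounds out of training, which mimics predicting for compounds that have not yet been made or measured.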
Subjects
Machine Learning, Animals, Pharmaceutical Preparations/metabolism, Pharmaceutical Preparations/chemistry, Humans, Pharmacokinetics, Computer Neural Networks, Biological Models
Abstract
Knowledge about interrelationships between different proteins is crucial in fundamental research for the elucidation of protein networks and pathways. Furthermore, it is especially critical in chemical biology to identify further key regulators of a disease and to take advantage of polypharmacology effects. Here, we present a new concept that combines a scaffold-based analysis of bioactivity data with a subsequent screening to identify novel inhibitors for a protein target of interest. The initial scaffold-based analysis revealed a flavone-like scaffold that can be found in ligands of different unrelated proteins, indicating a similarity in ligand binding. This similarity was further investigated by testing compounds on bromodomain-containing protein 4 (BRD4) that were similar to known ligands of the other identified protein targets. Several new BRD4 inhibitors were identified and validated as hits by orthogonal assays and X-ray crystallography. The most important discovery was an unexpected relationship between BRD4 and peroxisome proliferator-activated receptor gamma (PPARγ). Both proteins share binding site similarities near a common hydrophobic subpocket, which should allow the design of a polypharmacology-based ligand targeting both proteins. Such dual BRD4-PPARγ modulators open up new therapeutic opportunities, because both are important drug targets for cancer therapy and many other important diseases. In addition, a complex structure of sulfasalazine was obtained that involves two bromodomains and could be a potential starting point for the design of a bivalent BRD4 inhibitor.
Subjects
Cell Cycle Proteins/metabolism, PPAR gamma/metabolism, Small Molecule Libraries/metabolism, Transcription Factors/metabolism, Binding Sites, X-Ray Crystallography, Flavones/chemistry, Flavones/metabolism, Humans, Molecular Docking Simulation, Molecular Structure, Polypharmacology, Protein Binding, Small Molecule Libraries/chemistry
Abstract
With the increase in applications of machine learning methods in drug design and related fields, the challenge of designing sound test sets becomes more and more prominent. The goal is to obtain a realistic split of chemical structures (compounds) between training, validation, and test sets such that the performance on the test set is meaningful for inferring the performance in a prospective application. This challenge is interesting and relevant on its own, but it is even more complex in a federated machine learning approach, where multiple partners jointly train a model under privacy-preserving conditions and chemical structures must not be shared between the participating parties. In this work we discuss three methods that split a data set and are applicable in a federated privacy-preserving setting: (a) locality-sensitive hashing (LSH), (b) sphere exclusion clustering, and (c) scaffold-based binning (scaffold network). To evaluate these splitting methods we consider the following quality criteria (compared to random splitting): bias in prediction performance, classification label and data imbalance, and similarity distance between the test and training set compounds. The main findings of the paper are that (a) both sphere exclusion clustering and scaffold-based binning result in high-quality splitting of the data sets, and (b) in terms of compute costs, sphere exclusion clustering is very expensive in a federated privacy-preserving setting.
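To make the federated constraint concrete, here is a minimal sketch of how a scaffold-based fold assignment could be computed locally by each partner without exchanging structures. It is not the paper's implementation and does not reproduce its scaffold network procedure. It assumes that each compound already has a scaffold identifier (for example, a canonical scaffold SMILES computed in-house) and that all partners agree on a secret key; `assign_fold`, `split_compounds`, and the key are hypothetical.

```python
import hashlib
import hmac


def assign_fold(scaffold_id: str, shared_key: bytes, n_folds: int = 5) -> int:
    """Map a scaffold identifier to a fold index using a keyed hash (HMAC-SHA256).

    The mapping is deterministic, so identical scaffolds land in the same fold
    at every partner, while only the fold index ever needs to be used downstream.
    """
    digest = hmac.new(shared_key, scaffold_id.encode("utf-8"), hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big") % n_folds


def split_compounds(compound_scaffolds, shared_key, n_folds=5):
    """Group compound IDs into folds by the fold index of their scaffold."""
    folds = {i: [] for i in range(n_folds)}
    for compound_id, scaffold_id in compound_scaffolds.items():
        folds[assign_fold(scaffold_id, shared_key, n_folds)].append(compound_id)
    return folds


# Hypothetical data held separately by two partners.
key = b"example-shared-secret"
partner_a = {"a1": "c1ccccc1", "a2": "c1ccncc1"}
partner_b = {"b1": "c1ccccc1"}
folds_a = split_compounds(partner_a, key)
folds_b = split_compounds(partner_b, key)
```

Because the fold depends only on the scaffold and the key, compounds sharing a scaffold are never spread across training and test folds, even when they sit at different partners. Fold sizes are not balanced by construction, which is one reason the quality criteria listed in the abstract matter.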
Abstract
Protein ligand interaction fingerprints are a powerful approach for the analysis and assessment of docking poses to improve docking performance in virtual screening. In this study, a novel interaction fingerprint approach (PADIF, protein per atom score contributions derived interaction fingerprint) is presented which was specifically designed for utilising the GOLD scoring functions' atom contributions together with a specific scoring scheme. This allows the incorporation of known protein-ligand complex structures for a target-specific scoring. Unlike many other methods, this approach uses weighting factors reflecting the relative frequency of a specific interaction in the references and penalizes destabilizing interactions. In addition, and for the first time, an exhaustive validation study was performed that assesses the performance of PADIF and two other interaction fingerprints in virtual screening. Here, PADIF shows superior results, and some rules of thumb for a successful use of interaction fingerprints could be identified.
Abstract
A common issue during drug design and development is the discovery of novel scaffolds for protein targets. On the one hand the chemical space of purchasable compounds is rather limited; on the other hand artificially generated molecules suffer from a grave lack of accessibility in practice. Therefore, we generated a novel virtual library of small molecules which are synthesizable from purchasable educts, called CHIPMUNK (CHemically feasible In silico Public Molecular UNiverse Knowledge base). Altogether, CHIPMUNK covers over 95 million compounds and encompasses regions of the chemical space that are not covered by existing databases. The coverage of CHIPMUNK exceeds the chemical space spanned by the Lipinski rule of five to foster the exploration of novel and difficult target classes. The analysis of the generated property space reveals that CHIPMUNK is well suited for the design of protein-protein interaction inhibitors (PPIIs). Furthermore, a recently developed structural clustering algorithm (StruClus) for big data was used to partition the sub-libraries into meaningful subsets and assist scientists to process the large amount of data. These clustered subsets also contain the target space based on ChEMBL data which was included during clustering.
Subjects
Proteins/chemistry, Small Molecule Libraries/chemistry, Small Molecule Libraries/pharmacology, Algorithms, Pharmaceutical Chemistry, Cluster Analysis, Drug Design, Protein Binding/drug effects, Proteins/antagonists & inhibitors, Small Molecule Libraries/chemical synthesis
Abstract
The ever-increasing volume of bioactivity data produced today allows exhaustive data mining and knowledge discovery approaches that change chemical biology research. A wealth of chemoinformatics tools, web services, and applications therefore exists to support the careful evaluation and analysis of experimental data and to draw conclusions that can influence the further development of chemical probes and potential lead structures. This review focuses on open-source approaches that can be handled by scientists who are not familiar with computational methods and have no expert knowledge in chemoinformatics and modeling. Our aim is to present an easily manageable toolbox to support everyday laboratory work. This includes, among other things, the available bioactivity and related molecule databases as well as tools to handle and analyze in-house data.
Subjects
Computational Biology/methods, Data Mining/methods, Drug Discovery/methods, Animals, Factual Databases, Humans
Abstract
The era of big data is influencing the way rational drug discovery and the development of bioactive molecules are performed, and versatile tools are needed to assist in molecular design workflows. Scaffold Hunter is a flexible visual analytics framework for the analysis of chemical compound data that combines techniques from several fields, such as data mining and information visualization. The framework allows high-dimensional chemical compound data to be analyzed interactively, combining intuitive visualizations with automated analysis methods, including versatile clustering methods. Originally designed to analyze the scaffold tree, Scaffold Hunter is continuously revised and extended. We describe recent extensions that significantly increase its applicability to a variety of tasks.