Search | Nursing VHL Search Portal

1.

Perplexity-Based Molecule Ranking and Bias Estimation of Chemical Language Models.

Moret, Michael; Grisoni, Francesca; Katzberger, Paul; Schneider, Gisbert.

J Chem Inf Model ; 62(5): 1199-1206, 2022 03 14.

Article in English | MEDLINE | ID: mdl-35191696

ABSTRACT

Chemical language models (CLMs) can be employed to design molecules with desired properties. CLMs generate new chemical structures in the form of textual representations, such as the simplified molecular input line entry system (SMILES) strings. However, the quality of these de novo generated molecules is difficult to assess a priori. In this study, we apply the perplexity metric to determine the degree to which the molecules generated by a CLM match the desired design objectives. This model-intrinsic score allows identifying and ranking the most promising molecular designs based on the probabilities learned by the CLM. Using perplexity to compare "greedy" (beam search) with "explorative" (multinomial sampling) methods for SMILES generation, certain advantages of multinomial sampling become apparent. Additionally, perplexity scoring is performed to identify undesired model biases introduced during model training and allows the development of a new ranking system to remove those undesired biases.

Subject(s)

Language , Models, Chemical , Probability
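
The perplexity score described in the abstract above can be illustrated with a minimal sketch. It assumes the per-token probabilities assigned by a trained CLM are already available; the example SMILES strings and probability values are hypothetical and are not taken from the paper. Lower perplexity means the model considers the string more likely.

```python
import math


def perplexity(token_probs):
    """Perplexity of one sequence, given the probability the model
    assigned to each of its tokens (assumed inputs, not model output)."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    # Mean negative log-likelihood per token, then exponentiate.
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)


# Hypothetical per-token probabilities for two generated SMILES strings.
candidates = {
    "CCO": [0.9, 0.8, 0.7],
    "C1CC1N": [0.5, 0.4, 0.6, 0.3, 0.5, 0.2],
}

# Rank designs from most to least likely under the (hypothetical) model.
ranked = sorted(candidates, key=lambda s: perplexity(candidates[s]))
```

Because the score is normalized by sequence length, strings of different lengths can be compared on the same scale, which is what makes it usable for ranking.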

2.

Exposing the Limitations of Molecular Machine Learning with Activity Cliffs.

van Tilborg, Derek; Alenicheva, Alisa; Grisoni, Francesca.

J Chem Inf Model ; 62(23): 5938-5951, 2022 Dec 12.

Article in English | MEDLINE | ID: mdl-36456532

ABSTRACT

Machine learning has become a crucial tool in drug discovery and chemistry at large, e.g., to predict molecular properties, such as bioactivity, with high accuracy. However, activity cliffs (pairs of molecules that are highly similar in their structure but exhibit large differences in potency) have received limited attention for their effect on model performance. Not only are these edge cases informative for molecule discovery and optimization but also models that are well equipped to accurately predict the potency of activity cliffs have increased potential for prospective applications. Our work aims to fill the current knowledge gap on best-practice machine learning methods in the presence of activity cliffs. We benchmarked a total of 24 machine and deep learning approaches on curated bioactivity data from 30 macromolecular targets for their performance on activity cliff compounds. While all methods struggled in the presence of activity cliffs, machine learning approaches based on molecular descriptors outperformed more complex deep learning methods. Our findings highlight large case-by-case differences in performance, advocating for (a) the inclusion of dedicated "activity-cliff-centered" metrics during model development and evaluation and (b) the development of novel algorithms to better predict the properties of activity cliffs. To this end, the methods, metrics, and results of this study have been encapsulated into an open-access benchmarking platform named MoleculeACE (Activity Cliff Estimation, available on GitHub at: https://github.com/molML/MoleculeACE). MoleculeACE is designed to steer the community toward addressing the pressing but overlooked limitation of molecular machine learning models posed by activity cliffs.

Subject(s)

Drug Discovery , Machine Learning , Structure-Activity Relationship , Models, Molecular , Algorithms
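
The definition of an activity cliff given in the abstract above (high structural similarity, large potency difference) can be sketched as a pairwise filter. This is an illustration only: fingerprints are modeled as plain Python sets of "on" bits, and the similarity and fold-difference thresholds are assumed values, not the criteria used in the paper or in MoleculeACE.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto similarity between two sets of fingerprint bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def find_activity_cliffs(compounds, min_similarity=0.9, min_fold=10.0):
    """Return name pairs that are structurally similar but differ in potency.

    compounds: dict mapping name -> (set of fingerprint bits, potency in nM).
    Both thresholds are illustrative assumptions; potencies must be positive.
    """
    cliffs = []
    for (n1, (fp1, k1)), (n2, (fp2, k2)) in combinations(compounds.items(), 2):
        fold = max(k1, k2) / min(k1, k2)
        if tanimoto(fp1, fp2) >= min_similarity and fold >= min_fold:
            cliffs.append((n1, n2))
    return cliffs


# Hypothetical toy data: A and B share 10 of 11 bits but differ 100-fold.
toy = {
    "A": (set(range(10)), 5.0),
    "B": (set(range(11)), 500.0),
    "C": ({20, 21}, 5000.0),
}
cliff_pairs = find_activity_cliffs(toy)
```

In practice the bit sets would come from a cheminformatics toolkit, and several similarity measures could be combined; the sketch only shows the shape of the check.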

3.

Beam Search for Automated Design and Scoring of Novel ROR Ligands with Machine Intelligence*.

Moret, Michael; Helmstädter, Moritz; Grisoni, Francesca; Schneider, Gisbert; Merk, Daniel.

Angew Chem Int Ed Engl ; 60(35): 19477-19482, 2021 08 23.

Article in English | MEDLINE | ID: mdl-34165856

ABSTRACT

Chemical language models enable de novo drug design without the requirement for explicit molecular construction rules. While such models have been applied to generate novel compounds with desired bioactivity, the actual prioritization and selection of the most promising computational designs remains challenging. Herein, we leveraged the probabilities learnt by chemical language models with the beam search algorithm as a model-intrinsic technique for automated molecule design and scoring. Prospective application of this method yielded novel inverse agonists of retinoic acid receptor-related orphan receptors (RORs). Each design was synthesizable in three reaction steps and presented low-micromolar to nanomolar potency towards RORγ. This model-intrinsic sampling technique eliminates the strict need for external compound scoring functions, thereby further extending the applicability of generative artificial intelligence to data-driven drug discovery.

Subject(s)

Automation , Biological Products/pharmacology , Drug Design , Receptors, Retinoic Acid/agonists , Algorithms , Biological Products/chemical synthesis , Biological Products/chemistry , Humans , Ligands , Molecular Structure
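
The beam search idea mentioned in the abstract above can be sketched generically. This is a textbook beam search, not the authors' implementation: `toy_model` is a hypothetical stand-in for a trained chemical language model, and the end-of-sequence token `$` is an assumed convention.

```python
import math


def beam_search(next_token_probs, width=2, max_len=10, end="$"):
    """Keep the `width` most probable partial sequences at every step.

    next_token_probs: callable mapping a prefix string to a dict
    {token: probability}. Returns finished sequences with their
    cumulative log-probability, best first.
    """
    beams = [("", 0.0)]  # (sequence so far, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        expanded = []
        for seq, logp in beams:
            for tok, p in next_token_probs(seq).items():
                if p <= 0:
                    continue
                cand = (seq + tok, logp + math.log(p))
                # Sequences that emit the end token stop growing.
                (finished if tok == end else expanded).append(cand)
        if not expanded:
            break
        beams = sorted(expanded, key=lambda b: b[1], reverse=True)[:width]
    return sorted(finished, key=lambda b: b[1], reverse=True)


def toy_model(prefix):
    """Hypothetical next-token distribution standing in for a trained CLM."""
    table = {
        "": {"C": 0.6, "N": 0.4},
        "C": {"C": 0.5, "O": 0.3, "$": 0.2},
        "N": {"C": 0.9, "$": 0.1},
    }
    return table.get(prefix, {"$": 1.0})


results = beam_search(toy_model, width=2)
```

The cumulative log-probability doubles as a model-intrinsic score, so the same pass both generates candidates and orders them.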

4.

Lipid discovery for mRNA delivery guided by machine learning.

van der Meel, Roy; Grisoni, Francesca; Mulder, Willem J M.

Nat Mater ; 2024 Jul 02.

Article in English | MEDLINE | ID: mdl-38956348

5.

NURA: A curated dataset of nuclear receptor modulators.

Valsecchi, Cecile; Grisoni, Francesca; Motta, Stefano; Bonati, Laura; Ballabio, Davide.

Toxicol Appl Pharmacol ; 407: 115244, 2020 11 15.

Article in English | MEDLINE | ID: mdl-32961130

ABSTRACT

Nuclear receptors (NRs) are key regulators of human health and constitute a relevant target for medicinal chemistry applications as well as for toxicological risk assessment. Several open databases dedicated to small molecules that modulate NRs exist; however, depending on their final aim (i.e., adverse effect assessment or drug design), these databases contain a different amount and type of annotated molecules, along with a different distribution of experimental bioactivity values. Stemming from these considerations, in this work we aim to provide a unified dataset, NURA (NUclear Receptor Activity) dataset, collecting curated information on small molecules that modulate NRs, to be intended for both pharmacological and toxicological applications. NURA contains bioactivity annotations for 15,247 molecules and 11 selected NRs, and it was obtained by integrating and curating data from toxicological and pharmacological databases (i.e., Tox21, ChEMBL, NR-DBIND and BindingDB). Our results show that NURA dataset is a useful tool to bridge the gap between toxicology- and medicinal-chemistry-related databases, as it is enriched in terms of number of molecules, structural diversity and covered atomic scaffolds compared to the single sources. To the best of our knowledge, NURA dataset is the most exhaustive collection of small molecules annotated for their modulation of the chosen nuclear receptors. NURA dataset is intended to support decision-making in pharmacology and toxicology, as well as to contribute to data-driven applications, such as machine learning. The dataset and the data curation pipeline can be downloaded free of charge on Zenodo at the following DOI: https://doi.org/10.5281/zenodo.3991561.

Subject(s)

Databases, Factual , Receptors, Cytoplasmic and Nuclear/drug effects , Chemistry, Pharmaceutical/methods , Computer Simulation , Data Collection , Data Interpretation, Statistical , Drug Evaluation, Preclinical , Humans , In Vitro Techniques , Models, Molecular , Small Molecule Libraries , Software , Toxicology/methods

6.

Consensus versus Individual QSARs in Classification: Comparison on a Large-Scale Case Study.

Valsecchi, Cecile; Grisoni, Francesca; Consonni, Viviana; Ballabio, Davide.

J Chem Inf Model ; 60(3): 1215-1223, 2020 03 23.

Article in English | MEDLINE | ID: mdl-32073844

ABSTRACT

Consensus strategies have been widely applied in many different scientific fields, based on the assumption that the fusion of several sources of information increases the outcome reliability. Despite the widespread application of consensus approaches, their advantages in quantitative structure-activity relationship (QSAR) modeling have not been thoroughly evaluated, mainly due to the lack of appropriate large-scale data sets. In this study, we evaluated the advantages and drawbacks of consensus approaches compared to single classification QSAR models. To this end, we used a data set of three properties (androgen receptor binding, agonism, and antagonism) for approximately 4000 molecules with predictions performed by more than 20 QSAR models, made available in a large-scale collaborative project. The individual QSAR models were compared with two consensus approaches, majority voting and the Bayes consensus with discrete probability distributions, in both protective and nonprotective forms. Consensus strategies proved to be more accurate and to better cover the analyzed chemical space than individual QSARs on average, thus motivating their widespread application for property prediction. Scripts and data to reproduce the results of this study are available for download.

Subject(s)

Quantitative Structure-Activity Relationship , Bayes Theorem , Consensus , Reproducibility of Results
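
Majority voting, one of the two consensus approaches named in the abstract above, can be sketched as follows. The `min_agreement` threshold and the handling of missing predictions are assumptions made for illustration; they are not the protective or nonprotective variants defined in the paper.

```python
from collections import Counter


def majority_vote(predictions, min_agreement=0.5):
    """Return the most frequent label if its share of the votes exceeds
    `min_agreement`; otherwise return None (no consensus).

    predictions: list of class labels, with None for models that gave
    no prediction for this molecule (an assumed convention).
    """
    votes = [p for p in predictions if p is not None]
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) > min_agreement else None


# Hypothetical predictions from four individual QSAR models.
consensus = majority_vote(["active", "inactive", "active", None])
```

Raising `min_agreement` trades coverage for reliability: fewer molecules receive a consensus label, but each label is backed by a larger share of the models.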

7.

Bidirectional Molecule Generation with Recurrent Neural Networks.

Grisoni, Francesca; Moret, Michael; Lingwood, Robin; Schneider, Gisbert.

J Chem Inf Model ; 60(3): 1175-1183, 2020 03 23.

Article in English | MEDLINE | ID: mdl-31904964

ABSTRACT

Recurrent neural networks (RNNs) are able to generate de novo molecular designs using simplified molecular input line entry system (SMILES) string representations of the chemical structure. RNN-based structure generation is usually performed unidirectionally, by growing SMILES strings from left to right. However, there is no natural start or end of a small molecule, and SMILES strings are intrinsically nonunivocal representations of molecular graphs. These properties motivate bidirectional structure generation. Here, bidirectional generative RNNs for SMILES-based molecule design are introduced. To this end, two established bidirectional methods were implemented, and a new method for SMILES string generation and data augmentation is introduced: the bidirectional molecule design by alternate learning (BIMODAL). These three bidirectional strategies were compared to the unidirectional forward RNN approach for SMILES string generation, in terms of the (i) novelty, (ii) scaffold diversity, and (iii) chemical-biological relevance of the computer-generated molecules. The results positively advocate bidirectional strategies for SMILES-based molecular de novo design, with BIMODAL showing superior results to the unidirectional forward RNN for most of the criteria in the tested conditions. The code of the methods and the pretrained models can be found at URL https://github.com/ETHmodlab/BIMODAL.

Subject(s)

Neural Networks, Computer

8.

Verification of Chromatographic Profile of Primary Essential Oil of Pinus sylvestris L. Combined with Chemometric Analysis.

Allenspach, Martina; Valder, Claudia; Flamm, Daniela; Grisoni, Francesca; Steuer, Christian.

Molecules ; 25(13)2020 Jun 28.

Article in English | MEDLINE | ID: mdl-32605289

ABSTRACT

Chromatographic profiles of primary essential oils (EO) deliver valuable authentic information about composition and compound pattern. Primary EOs obtained from Pinus sylvestris L. (PS) from different global origins were analyzed using gas chromatography coupled to a flame ionization detector (GC-FID) and identified by GC hyphenated to mass spectrometer (GC-MS). A primary EO of PS was characterized by a distinct sesquiterpene pattern followed by a diterpene profile containing diterpenoids of the labdane, pimarane or abietane type. Based on their sesquiterpene compound patterns, primary EOs of PS were separated into their geographical origin using component analysis. Furthermore, differentiation of closely related pine EOs by partial least squares discriminant analysis (PLS-DA) proved the existence of a primary EO of PS. The developed and validated PLS-DA model is suitable as a screening tool to assess the correct chemotaxonomic identification of primary pine EOs, as it classified all pine EOs correctly.

Subject(s)

Oils, Volatile/analysis , Pinus sylvestris/chemistry , Discriminant Analysis , Diterpenes/analysis , Diterpenes/chemistry , Gas Chromatography-Mass Spectrometry , Molecular Structure , Plant Oils/analysis , Sesquiterpenes/analysis , Sesquiterpenes/chemistry

9.

Machine Learning Consensus To Predict the Binding to the Androgen Receptor within the CoMPARA Project.

Grisoni, Francesca; Consonni, Viviana; Ballabio, Davide.

J Chem Inf Model ; 59(5): 1839-1848, 2019 05 28.

Article in English | MEDLINE | ID: mdl-30668916

ABSTRACT

The nuclear androgen receptor (AR) is one of the most relevant biological targets of Endocrine Disrupting Chemicals (EDCs), which produce adverse effects by interfering with hormonal regulation and endocrine system functioning. This paper describes novel in silico models to identify organic AR modulators in the context of the Collaborative Modeling Project of Androgen Receptor Activity (CoMPARA), coordinated by the National Center of Computational Toxicology (U.S. Environmental Protection Agency). The collaborative project involved 35 international research groups to prioritize the experimental tests of approximately 40k compounds, based on the predictions provided by each participant. In this paper, we describe our machine learning approach to predict the binding to AR, which is based on a consensus of a multivariate Bernoulli Naive Bayes, a Random Forest, and N-Nearest Neighbor classification models. The approach was developed in compliance with the Organization of Economic Cooperation and Development (OECD) principles, trained on 1,687 ToxCast molecules classified according to 11 in vitro assays, and further validated on a set of 3,882 external compounds. The models provided robust and reliable predictions and were used to gather novel data-driven insights on the structural features related to AR binding, agonism, and antagonism.

Subject(s)

Androgen Receptor Antagonists/pharmacology , Androgens/pharmacology , Endocrine Disruptors/pharmacology , Machine Learning , Receptors, Androgen/metabolism , Androgen Receptor Antagonists/chemistry , Androgens/chemistry , Drug Discovery , Endocrine Disruptors/chemistry , Humans , Molecular Docking Simulation , Protein Binding , Software

10.

De novo Molecular Design with Generative Long Short-term Memory.

Grisoni, Francesca; Schneider, Gisbert.

Chimia (Aarau) ; 73(12): 1006-1011, 2019 Dec 18.

Article in English | MEDLINE | ID: mdl-31883552

ABSTRACT

Drug discovery benefits from computational models aiding the identification of new chemical matter with bespoke properties. The field of de novo drug design has been particularly revitalized by adaptation of generative machine learning models from the field of natural language processing. These deep neural network models are trained on recognizing molecular structures and generate new molecular entities without relying on pre-determined sets of molecular building blocks and chemical transformations for virtual molecule construction. Implicit representation of chemical knowledge provides an alternative to formulating the molecular design task in terms of the established, explicit chemical vocabulary. Here, we review de novo molecular design approaches from the field of 'artificial intelligence', focusing on instances of deep generative models, and highlight the prospective application of long short-term memory models to hit and lead finding in medicinal chemistry.

Subject(s)

Memory, Short-Term , Drug Design , Machine Learning , Neural Networks, Computer , Prospective Studies

11.

Correction to "Exposing the Limitations of Molecular Machine Learning with Activity Cliffs".

van Tilborg, Derek; Alenicheva, Alisa; Grisoni, Francesca.

J Chem Inf Model ; 63(7): 2266, 2023 Apr 10.

Article in English | MEDLINE | ID: mdl-36995229

12.

Correction to "Exposing the Limitations of Molecular Machine Learning with Activity Cliffs".

van Tilborg, Derek; Alenicheva, Alisa; Grisoni, Francesca.

J Chem Inf Model ; 2023 Oct 18.

Article in English | MEDLINE | ID: mdl-37851546

13.

Beware of Unreliable Q²! A Comparative Study of Regression Metrics for Predictivity Assessment of QSAR Models.

Todeschini, Roberto; Ballabio, Davide; Grisoni, Francesca.

J Chem Inf Model ; 56(10): 1905-1913, 2016 10 24.

Article in English | MEDLINE | ID: mdl-27633067

ABSTRACT

Validation is an essential step of QSAR modeling, and it can be performed either by internal validation techniques (e.g., cross-validation, bootstrap) or by an external set of test objects, that is, objects not used for model development and/or optimization. The evaluation of model predictive ability is then completed by comparing experimental and predicted values of test molecules. When dealing with quantitative QSAR models, validation results are generally expressed in terms of Q² metrics. In this work, four fundamental mathematical principles, which should be respected by any Q² metric, are introduced. Then, the behavior of five different metrics (Q²F1, Q²F2, Q²F3, Q²CCC, and Q²Rm) is compared and critically discussed. The conclusions highlight that only the Q²F3 metric satisfies all the stated conditions, while the remaining metrics show different theoretical flaws.

Subject(s)

Algorithms , Quantitative Structure-Activity Relationship , Computer Simulation , Models, Chemical
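
Three of the external-validation metrics compared in the abstract above (Q²F1, Q²F2 and Q²F3) can be sketched with their commonly cited definitions. The formulas below are recalled from the general QSAR literature rather than quoted from this record, and the function and variable names are illustrative; Q²CCC and Q²Rm are omitted.

```python
def q2_metrics(y_train, y_test, y_pred):
    """Q2F1, Q2F2 and Q2F3 for an external test set.

    y_train: experimental responses of the training set
    y_test:  experimental responses of the external test set
    y_pred:  model predictions for the external test set
    """
    n_tr, n_ext = len(y_train), len(y_test)
    mean_tr = sum(y_train) / n_tr
    mean_ext = sum(y_test) / n_ext
    # Predictive residual sum of squares on the external set.
    press = sum((y - p) ** 2 for y, p in zip(y_test, y_pred))
    # Test-set deviations from the training mean (F1) or the test mean (F2).
    ss_ext_tr = sum((y - mean_tr) ** 2 for y in y_test)
    ss_ext_ext = sum((y - mean_ext) ** 2 for y in y_test)
    # Training-set total sum of squares (F3 reference term).
    ss_tr = sum((y - mean_tr) ** 2 for y in y_train)
    return {
        "F1": 1 - press / ss_ext_tr,
        "F2": 1 - press / ss_ext_ext,
        "F3": 1 - (press / n_ext) / (ss_tr / n_tr),
    }


# Hypothetical data: the model predicts the training mean for every test object.
scores = q2_metrics([1, 2, 3, 4, 5], [2, 4], [3, 3])
```

In this toy case F1 and F2 are both 0 while F3 is 0.5. Only F3 scales the error by the training-set variance, so its reference term does not change with how the external objects happen to be distributed.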

14.

Expert QSAR system for predicting the bioconcentration factor under the REACH regulation.

Grisoni, Francesca; Consonni, Viviana; Vighi, Marco; Villa, Sara; Todeschini, Roberto.

Environ Res ; 148: 507-512, 2016 07.

Article in English | MEDLINE | ID: mdl-27152714

ABSTRACT

Expert systems are a rational integration of several models that generally aim to exploit their advantages and overcome their drawbacks. This work is founded on our previously published Quantitative Structure-Activity Relationship (QSAR) classification scheme, which detects compounds whose Bioconcentration Factor (BCF) is (1) well predicted by the octanol-water partition coefficient (KOW), (2) underestimated by KOW or (3) overestimated by KOW. The classification scheme served as the starting point to identify and combine the best BCF model for each class among three VEGA models and one KOW-based equation. The rationalized model integration showed stability and surprising performance on unknown data when compared with benchmark BCF models. Model simplicity, transparency and mechanistic interpretation were fostered in order to allow for its application and acceptance within the REACH framework.

Subject(s)

Models, Theoretical , Quantitative Structure-Activity Relationship , 1-Octanol/chemistry , European Union , Government Regulation , Hazardous Substances/chemistry , Water/chemistry

15.

In Silico Prediction of Cytochrome P450-Drug Interaction: QSARs for CYP3A4 and CYP2C9.

Nembri, Serena; Grisoni, Francesca; Consonni, Viviana; Todeschini, Roberto.

Int J Mol Sci ; 17(6)2016 Jun 09.

Article in English | MEDLINE | ID: mdl-27294921

ABSTRACT

Cytochromes P450 (CYP) are the main actors in the oxidation of xenobiotics and play a crucial role in drug safety, persistence, bioactivation, and drug-drug/food-drug interaction. This work aims to develop Quantitative Structure-Activity Relationship (QSAR) models to predict the drug interaction with two of the most important CYP isoforms, namely 2C9 and 3A4. The presented models are calibrated on 9122 drug-like compounds, using three different modelling approaches and two types of molecular description (classical molecular descriptors and binary fingerprints). For each isoform, three classification models are presented, based on a different approach and with different advantages: (1) a very simple and interpretable classification tree; (2) a local (k-Nearest Neighbor) model based on classical descriptors; and (3) a model based on a recently proposed local classifier (N-Nearest Neighbor) on binary fingerprints. The salient features of the work are (1) the thorough model validation and the applicability domain assessment; (2) the descriptor interpretation, which highlighted the crucial aspects of P450-drug interaction; and (3) the consensus aggregation of models, which largely increased the prediction accuracy.

Subject(s)

Cytochrome P-450 CYP2C9 Inhibitors/pharmacology , Cytochrome P-450 CYP2C9/chemistry , Cytochrome P-450 CYP3A Inhibitors/pharmacology , Cytochrome P-450 CYP3A/chemistry , Quantitative Structure-Activity Relationship , Animals , Computer Simulation , Cytochrome P-450 CYP2C9/metabolism , Cytochrome P-450 CYP2C9 Inhibitors/chemistry , Cytochrome P-450 CYP3A/metabolism , Cytochrome P-450 CYP3A Inhibitors/chemistry , Humans , Protein Binding

16.

Deep learning for low-data drug discovery: Hurdles and opportunities.

van Tilborg, Derek; Brinkmann, Helena; Criscuolo, Emanuele; Rossen, Luke; Özçelik, Riza; Grisoni, Francesca.

Curr Opin Struct Biol ; 86: 102818, 2024 06.

Article in English | MEDLINE | ID: mdl-38669740

ABSTRACT

Deep learning is becoming increasingly relevant in drug discovery, from de novo design to protein structure prediction and synthesis planning. However, it is often challenged by the small data regimes typical of certain drug discovery tasks. In such scenarios, deep learning approaches, which are notoriously 'data-hungry', might fail to live up to their promise. Developing novel approaches to leverage the power of deep learning in low-data scenarios is sparking great attention, and future developments are expected to propel the field further. This mini-review provides an overview of recent low-data-learning approaches in drug discovery, analyzing their hurdles and advantages. Finally, we venture to provide a forecast of future research directions in low-data learning for drug discovery.

Subject(s)

Deep Learning , Drug Discovery , Drug Discovery/methods , Humans , Proteins/chemistry , Proteins/metabolism

17.

Effectiveness of molecular fingerprints for exploring the chemical space of natural products.

Boldini, Davide; Ballabio, Davide; Consonni, Viviana; Todeschini, Roberto; Grisoni, Francesca; Sieber, Stephan A.

J Cheminform ; 16(1): 35, 2024 Mar 25.

Article in English | MEDLINE | ID: mdl-38528548

ABSTRACT

Natural products are a diverse class of compounds with promising biological properties, such as high potency and excellent selectivity. However, they have different structural motifs than typical drug-like compounds, e.g., a wider range of molecular weight, multiple stereocenters and higher fraction of sp3-hybridized carbons. This makes the encoding of natural products via molecular fingerprints difficult, thus restricting their use in cheminformatics studies. To tackle this issue, we explored over 30 years of research to systematically evaluate which molecular fingerprint provides the best performance on the natural product chemical space. We considered 20 molecular fingerprints from four different sources, which we then benchmarked on over 100,000 unique natural products from the COCONUT (COlleCtion of Open Natural prodUcTs) and CMNPD (Comprehensive Marine Natural Products Database) databases. Our analysis focused on the correlation between different fingerprints and their classification performance on 12 bioactivity prediction datasets. Our results show that different encodings can provide fundamentally different views of the natural product chemical space, leading to substantial differences in pairwise similarity and performance. While Extended Connectivity Fingerprints are the de facto option for encoding drug-like compounds, other fingerprints matched or outperformed them for bioactivity prediction of natural products. These results highlight the need to evaluate multiple fingerprinting algorithms for optimal performance and suggest new areas of research. Finally, we provide an open-source Python package for computing all molecular fingerprints considered in the study, as well as data and scripts necessary to reproduce the results, at https://github.com/dahvida/NP_Fingerprints.

18.

Baricitinib and tofacitinib off-target profile, with a focus on Alzheimer's disease.

Faquetti, Maria L; Slappendel, Laura; Bigonne, Hélène; Grisoni, Francesca; Schneider, Petra; Aichinger, Georg; Schneider, Gisbert; Sturla, Shana J; Burden, Andrea M.

Alzheimers Dement (N Y) ; 10(1): e12445, 2024.

Article in English | MEDLINE | ID: mdl-38528988

ABSTRACT

INTRODUCTION: Janus kinase (JAK) inhibitors were recently identified as promising drug candidates for repurposing in Alzheimer's disease (AD) due to their capacity to suppress inflammation via modulation of JAK/STAT signaling pathways. Besides interaction with primary therapeutic targets, JAK inhibitor drugs frequently interact with unintended, often unknown, biological off-targets, leading to associated effects. Nevertheless, the relevance of JAK inhibitors' off-target interactions in the context of AD remains unclear. METHODS: Putative off-targets of baricitinib and tofacitinib were predicted using a machine learning (ML) approach. After screening scientific literature, off-targets were filtered based on their relevance to AD. Targets that had not been previously identified as off-targets of baricitinib or tofacitinib were subsequently tested using biochemical or cell-based assays. From those, active concentrations were compared to bioavailable concentrations in the brain predicted by physiologically based pharmacokinetic (PBPK) modeling. RESULTS: With the aid of ML and in vitro activity assays, we identified two enzymes previously unknown to be inhibited by baricitinib, namely casein kinase 2 subunit alpha 2 (CK2-α2) and dual leucine zipper kinase (MAP3K12), both with binding constant (Kd) values of 5.8 µM. Predicted maximum concentrations of baricitinib in brain tissue using PBPK modeling range from 1.3 to 23 nM, which is two to three orders of magnitude below the corresponding binding constant. CONCLUSION: In this study, we extended the list of baricitinib off-targets that are potentially relevant for AD progression and predicted drug distribution in the brain. The results suggest a low likelihood of successful repurposing in AD due to low brain permeability, even at the maximum recommended daily dose. While additional research is needed to evaluate the potential impact of the off-target interaction on AD, the combined approach of ML-based target prediction, in vitro confirmation, and PBPK modeling may help prioritize drugs with a high likelihood of being effectively repurposed for AD. Highlights: This study explored JAK inhibitors' off-targets in AD using a multidisciplinary approach. We combined machine learning, in vitro tests, and PBPK modelling to predict and validate new off-target interactions of tofacitinib and baricitinib in AD. Previously unknown inhibition of two enzymes (CK2-α2 and MAP3K12) by baricitinib was confirmed using in vitro experiments. Our PBPK model indicates that baricitinib's low brain permeability limits AD repurposing. The proposed multidisciplinary approach optimizes drug repurposing efforts in AD research.

19.

Chemical language models for de novo drug design: Challenges and opportunities.

Grisoni, Francesca.

Curr Opin Struct Biol ; 79: 102527, 2023 04.

Article in English | MEDLINE | ID: mdl-36738564

ABSTRACT

Generative deep learning is accelerating de novo drug design, by allowing the generation of molecules with desired properties on demand. Chemical language models - which generate new molecules in the form of strings using deep learning - have been particularly successful in this endeavour. Thanks to advances in natural language processing methods and interdisciplinary collaborations, chemical language models are expected to become increasingly relevant in drug discovery. This minireview provides an overview of the current state-of-the-art of chemical language models for de novo design, and analyses current limitations, challenges, and advantages. Finally, a perspective on future opportunities is provided.

Subject(s)

Drug Design , Drug Discovery , Models, Chemical

20.

Practical guidelines for the use of gradient boosting for molecular property prediction.

Boldini, Davide; Grisoni, Francesca; Kuhn, Daniel; Friedrich, Lukas; Sieber, Stephan A.

J Cheminform ; 15(1): 73, 2023 Aug 28.

Article in English | MEDLINE | ID: mdl-37641120

ABSTRACT

Decision tree ensembles are among the most robust, high-performing and computationally efficient machine learning approaches for quantitative structure-activity relationship (QSAR) modeling. Among them, gradient boosting has recently garnered particular attention, for its performance in data science competitions, virtual screening campaigns, and bioactivity prediction. However, different variants of gradient boosting exist, the most popular being XGBoost, LightGBM and CatBoost. Our study provides the first comprehensive comparison of these approaches for QSAR. To this end, we trained 157,590 gradient boosting models, which were evaluated on 16 datasets and 94 endpoints, comprising 1.4 million compounds in total. Our results show that XGBoost generally achieves the best predictive performance, while LightGBM requires the least training time, especially for larger datasets. In terms of feature importance, the models surprisingly rank molecular features differently, reflecting differences in regularization techniques and decision tree structures. Thus, expert knowledge must always be employed when evaluating data-driven explanations of bioactivity. Furthermore, our results show that the relevance of each hyperparameter varies greatly across datasets and that it is crucial to optimize as many hyperparameters as possible to maximize the predictive performance. In conclusion, our study provides the first set of guidelines for cheminformatics practitioners to effectively train, optimize and evaluate gradient boosting models for virtual screening and QSAR applications.
