
MELLODDY: Cross-pharma Federated Learning at Unprecedented Scale Unlocks Benefits in QSAR without Compromising Proprietary Information.

Heyndrickx, Wouter; Mervin, Lewis; Morawietz, Tobias; Sturm, Noé; Friedrich, Lukas; Zalewski, Adam; Pentina, Anastasia; Humbeck, Lina; Oldenhof, Martijn; Niwayama, Ritsuya; Schmidtke, Peter; Fechner, Nikolas; Simm, Jaak; Arany, Adam; Drizard, Nicolas; Jabal, Rama; Afanasyeva, Arina; Loeb, Regis; Verma, Shlok; Harnqvist, Simon; Holmes, Matthew; Pejo, Balazs; Telenczuk, Maria; Holway, Nicholas; Dieckmann, Arne; Rieke, Nicola; Zumsande, Friederike; Clevert, Djork-Arné; Krug, Michael; Luscombe, Christopher; Green, Darren; Ertl, Peter; Antal, Peter; Marcus, David; Do Huu, Nicolas; Fuji, Hideyoshi; Pickett, Stephen; Acs, Gergely; Boniface, Eric; Beck, Bernd; Sun, Yax; Gohier, Arnaud; Rippmann, Friedrich; Engkvist, Ola; Göller, Andreas H; Moreau, Yves; Galtier, Mathieu N; Schuffenhauer, Ansgar; Ceulemans, Hugo.

J Chem Inf Model ; 64(7): 2331-2344, 2024 Apr 08.

Article En | MEDLINE | ID: mdl-37642660

Federated multipartner machine learning has been touted as an appealing and efficient method to increase the effective training data volume and thereby the predictivity of models, particularly when the generation of training data is resource-intensive. In the landmark MELLODDY project, indeed, each of ten pharmaceutical companies realized aggregated improvements on its own classification or regression models through federated learning. To this end, they leveraged a novel implementation extending multitask learning across partners, on a platform audited for privacy and security. The experiments involved an unprecedented cross-pharma data set of 2.6+ billion confidential experimental activity data points, documenting 21+ million physical small molecules and 40+ thousand assays in on-target and secondary pharmacodynamics and pharmacokinetics. Appropriate complementary metrics were developed to evaluate the predictive performance in the federated setting. In addition to predictive performance increases in labeled space, the results point toward an extended applicability domain in federated learning. Increases in collective training data volume, including by means of auxiliary data resulting from single concentration high-throughput and imaging assays, continued to boost predictive performance, albeit with a saturating return. Markedly higher improvements were observed for the pharmacokinetics and safety panel assay-based task subsets.

Benchmarking , Quantitative Structure-Activity Relationship , Biological Assay , Machine Learning

Don't Overweight Weights: Evaluation of Weighting Strategies for Multi-Task Bioactivity Classification Models.

Humbeck, Lina; Morawietz, Tobias; Sturm, Noe; Zalewski, Adam; Harnqvist, Simon; Heyndrickx, Wouter; Holmes, Matthew; Beck, Bernd.

Molecules ; 26(22), 2021 Nov 18.

Article En | MEDLINE | ID: mdl-34834051

Machine learning models that predict the bioactivity of chemical compounds are now among the standard tools of cheminformaticians and computational medicinal chemists. Multi-task and federated learning are promising machine learning approaches that allow privacy-preserving usage of large amounts of data from diverse sources, which is crucial for achieving good generalization and high-performance results. Using large, real-world data sets from six pharmaceutical companies, here we investigate different strategies for averaging weighted task loss functions to train multi-task bioactivity classification models. The weighting strategies should be suitable for federated learning and ensure that learning efforts are well distributed even if data are diverse. Comparing several approaches using weights that depend on the number of sub-tasks per assay, task size, and class balance, respectively, we find that a simple sub-task weighting approach leads to robust model performance for all investigated data sets and is especially suited for federated learning.

Drug Discovery/methods , Machine Learning , Drug Design , Humans , Small Molecule Libraries/chemistry , Small Molecule Libraries/pharmacology
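
The idea of averaging weighted task losses described in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the weighting formulas, so the scheme shown here (weight 1/k for each of the k sub-tasks derived from the same assay) is an assumption, and all function names are hypothetical.

```python
import numpy as np

def masked_bce_per_task(probs, labels, mask):
    """Mean binary cross-entropy per task, using observed entries only.

    probs, labels, mask: arrays of shape (n_compounds, n_tasks);
    mask is 1 where a label was measured and 0 where it is missing.
    """
    eps = 1e-7
    p = np.clip(probs, eps, 1 - eps)
    bce = -(labels * np.log(p) + (1 - labels) * np.log(1 - p))
    counts = mask.sum(axis=0)
    # Tasks with no observed labels contribute a loss of 0.
    return (bce * mask).sum(axis=0) / np.maximum(counts, 1)

def subtask_weights(assay_ids):
    """Assumed scheme: each of the k sub-tasks of one assay gets weight 1/k,
    so every assay contributes equally regardless of how many sub-tasks it has."""
    _, inverse, counts = np.unique(
        np.asarray(assay_ids), return_inverse=True, return_counts=True
    )
    return 1.0 / counts[inverse]

def weighted_multitask_loss(probs, labels, mask, weights):
    """Weighted average of the per-task losses."""
    per_task = masked_bce_per_task(probs, labels, mask)
    return float((weights * per_task).sum() / weights.sum())

# Toy example: four tasks, the first three are sub-tasks of the same assay.
assay_ids = [0, 0, 0, 1]
weights = subtask_weights(assay_ids)
probs = np.full((5, 4), 0.5)
labels = np.zeros((5, 4))
mask = np.ones((5, 4))
loss = weighted_multitask_loss(probs, labels, mask, weights)
```

Weights based on task size or class balance, which the abstract also mentions, would replace only `subtask_weights`; the averaging step stays the same.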

Variables Influencing Differences in Sequence Conservation in the Fission Yeast Schizosaccharomyces pombe.

Harnqvist, Simon Emanuel; Grace, Cooper Alastair; Jeffares, Daniel Charlton.

J Mol Evol ; 89(9-10): 601-610, 2021 Dec.

Article En | MEDLINE | ID: mdl-34436628

Which variables determine the constraints on gene sequence evolution is one of the most central questions in molecular evolution. In the fission yeast Schizosaccharomyces pombe, an important model organism, the variables influencing the rate of sequence evolution have yet to be determined. Previous studies in other single-celled organisms have generally found gene expression levels to be most significant, with numerous other variables such as gene length and functional importance identified as having a smaller impact. Using publicly available data, we applied partial least squares regression, principal components regression, and partial correlations to determine the variables most strongly associated with sequence evolution constraints. We identify centrality in the protein-protein interactions network, amino acid composition, and cellular location as the most important determinants of sequence conservation. However, each factor only explains a small amount of variance, and there are numerous variables having a significant or heterogeneous influence. Our models explain more than half of the variance in dN, raising the possibility that future refined models could quantify the role of stochastics in evolutionary rate variation.

Schizosaccharomyces , Evolution, Molecular , Gene Expression , Schizosaccharomyces/genetics
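
Of the methods named in the abstract above, partial correlation is the simplest to show in code. The sketch below uses the standard residual-based definition: regress both variables on the covariates, then correlate the residuals. It is not the authors' analysis; the function name and the synthetic data are hypothetical and serve only as illustration.

```python
import numpy as np

def partial_correlation(x, y, covariates):
    """Correlation between x and y after removing the linear effect
    of the covariates from both (residual method)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Design matrix with an intercept column plus the covariates.
    Z = np.column_stack([np.ones(len(x)), covariates])
    # Residuals of x and y after least-squares regression on Z.
    rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return float(rx @ ry / np.sqrt((rx @ rx) * (ry @ ry)))

# Synthetic illustration: x and y are related only through a shared covariate z.
rng = np.random.default_rng(0)
z = rng.normal(size=5000)
x = z + rng.normal(size=5000)
y = z + rng.normal(size=5000)
raw = float(np.corrcoef(x, y)[0, 1])   # inflated by the shared covariate
partial = partial_correlation(x, y, z)  # close to zero once z is controlled for
```

In a setting like the one the abstract describes, `x` might be a candidate predictor, `y` a rate measure such as dN, and `covariates` the remaining variables; that mapping is an assumption and is not taken from the paper.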