Splitting chemical structure data sets for federated privacy-preserving machine learning.

Simm, Jaak; Humbeck, Lina; Zalewski, Adam; Sturm, Noe; Heyndrickx, Wouter; Moreau, Yves; Beck, Bernd; Schuffenhauer, Ansgar

Affiliation

Simm J; KU Leuven, ESAT-STADIUS, Kasteelpark Arenberg 10, 3001, Heverlee, Belgium.
Humbeck L; Medicinal Chemistry Department, Boehringer Ingelheim Pharma GmbH & Co. KG, Birkendorfer Str. 65, 88397, Biberach an der Riss, Germany.
Zalewski A; Amgen Research (Munich) GmbH, Staffelseestraße 2, 81477, Munich, Germany.
Sturm N; Novartis Institutes for BioMedical Research, Novartis Campus, CH-4002, Basel, Switzerland.
Heyndrickx W; Janssen Pharmaceutica N.V., Janssen Pharmaceutica, Turnhoutseweg 30, 2340, Beerse, Belgium.
Moreau Y; KU Leuven, ESAT-STADIUS, Kasteelpark Arenberg 10, 3001, Heverlee, Belgium.
Beck B; Medicinal Chemistry Department, Boehringer Ingelheim Pharma GmbH & Co. KG, Birkendorfer Str. 65, 88397, Biberach an der Riss, Germany.
Schuffenhauer A; Novartis Institutes for BioMedical Research, Novartis Campus, CH-4002, Basel, Switzerland. ansgar.schuffenhauer@novartis.com.

J Cheminform ; 13(1): 96, 2021 Dec 07.

Article in English | MEDLINE | ID: mdl-34876230

ABSTRACT

As machine learning methods are increasingly applied in drug design and related fields, the challenge of designing sound test sets becomes more and more prominent. The goal is to obtain a realistic split of chemical structures (compounds) between training, validation and test sets, such that performance on the test set is a meaningful indicator of performance in a prospective application. This challenge is interesting and relevant in its own right, but it becomes even more complex in a federated machine learning approach, where multiple partners jointly train a model under privacy-preserving conditions and chemical structures must not be shared between the participating parties. In this work we discuss three methods that split a data set and are applicable in a federated privacy-preserving setting: (a) locality-sensitive hashing (LSH), (b) sphere exclusion clustering, and (c) scaffold-based binning (scaffold network). To evaluate these splitting methods we consider the following quality criteria, each compared to random splitting: bias in prediction performance, classification label and data imbalance, and similarity distance between the test and training set compounds. The main findings of the paper are that (a) both sphere exclusion clustering and scaffold-based binning result in high-quality splitting of the data sets, and (b) in terms of compute costs, sphere exclusion clustering is very expensive in a federated privacy-preserving setting.
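
To illustrate why a hashing-based split suits the federated setting, the following is a minimal sketch of LSH-style fold assignment, not the authors' implementation. It assumes that each compound is already encoded as a binary fingerprint (represented here as a set of on-bit indices) and that all partners agree in advance on a shared list of bit positions. The function names `lsh_signature` and `lsh_fold`, the bit positions, and the number of folds are hypothetical choices made for illustration.

```python
import hashlib


def lsh_signature(on_bits, lsh_bits):
    """Read out the agreed bit positions of a fingerprint as a 0/1 string.

    on_bits:  set of indices of the bits that are set in the fingerprint
    lsh_bits: bit positions shared by all partners (assumed, illustrative)
    """
    return "".join("1" if b in on_bits else "0" for b in lsh_bits)


def lsh_fold(on_bits, lsh_bits, n_folds):
    """Map a fingerprint to a fold id in range(n_folds).

    Compounds that agree on all selected bits get the same signature and
    therefore land in the same fold. A cryptographic hash is used only to
    spread signatures evenly over folds in a reproducible way.
    """
    signature = lsh_signature(on_bits, lsh_bits)
    digest = hashlib.sha256(signature.encode("ascii")).digest()
    return int.from_bytes(digest[:8], "big") % n_folds


# Each partner runs this locally; no structures or fingerprints are exchanged.
shared_bits = [1, 5, 9, 12]            # agreed beforehand (hypothetical values)
partner_a_compound = {1, 9, 100}       # on-bits of a compound held by partner A
partner_b_compound = {1, 9, 200, 300}  # a similar compound held by partner B

fold_a = lsh_fold(partner_a_compound, shared_bits, n_folds=5)
fold_b = lsh_fold(partner_b_compound, shared_bits, n_folds=5)
print(fold_a == fold_b)  # both compounds share the signature "1010"
```

Because the fold depends only on the compound's own fingerprint and the shared bit list, each partner can assign folds independently and identical structures end up in the same fold everywhere. Sphere exclusion clustering, by contrast, needs distance comparisons between compounds, which is consistent with the higher compute cost the abstract reports for the federated privacy-preserving setting.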

Keywords

ChemFold; Cross-validation; Federated machine learning; Leader follower clustering; Locality-sensitive hashing; Scaffold network; Scaffold tree; Sphere exclusion clustering; Train-test-split

Full text: 1 | Database: MEDLINE | Type of study: Prognostic_studies | Language: En | Journal: J Cheminform | Publication year: 2021 | Document type: Article | Country of affiliation: Belgium
