A BERT-based pretraining model for extracting molecular structural information from a SMILES sequence.
Zheng, Xiaofan; Tomiura, Yoichi.
Affiliation
  • Zheng X; Graduate School of Information Science and Electrical Engineering, Department of Informatics, Kyushu University, Fukuoka, Japan.
  • Tomiura Y; Graduate School of Information Science and Electrical Engineering, Department of Informatics, Kyushu University, Fukuoka, Japan. tom@inf.kyushu-u.ac.jp.
J Cheminform ; 16(1): 71, 2024 Jun 19.
Article in En | MEDLINE | ID: mdl-38898528
ABSTRACT
Obtaining desired molecular properties, among the many possible properties and their combinations, is a costly process when done through theory or experiment. Using machine learning to analyze molecular structural features and predict molecular properties is a potentially efficient alternative for accelerating property prediction. In this study, we analyze molecular properties through the molecular structure from the perspective of machine learning. We use SMILES sequences as input to an artificial neural network to extract molecular structural features and predict molecular properties. A SMILES sequence comprises symbols representing a molecular structure. To address the problem that a SMILES sequence differs from actual molecular structural data, we propose a pretraining model for SMILES sequences based on the BERT model, which is widely used in natural language processing, so that the model learns to extract the molecular structural information contained in a SMILES sequence. In our experiments, we first pretrained the proposed model with 100,000 SMILES sequences and then used the pretrained model to predict molecular properties on 22 data sets as well as the odor characteristics of molecules (98 types of odor descriptor). The experimental results show that our proposed pretraining model effectively improves the performance of molecular property prediction.
SCIENTIFIC CONTRIBUTION
The proposed 2-encoder pretraining exploits two observations: symbols in a SMILES sequence depend less on their context than words in a natural-language sentence do, and one compound corresponds to multiple SMILES sequences. The model pretrained with the 2-encoder shows higher robustness in molecular property prediction tasks than BERT, which is adept at natural language.
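As a concrete illustration of the two ideas behind the 2-encoder pretraining, the minimal sketch below shows (1) that one compound maps to many SMILES sequences and (2) how BERT-style masked-token targets can be built from a tokenized SMILES sequence. This is not the authors' released code: the regex tokenizer, the RDKit-based enumeration, and the 15% masking rate are illustrative assumptions.

```python
# Minimal sketch, assuming RDKit is installed (pip install rdkit).
import random
import re

from rdkit import Chem

# A commonly used regex tokenizer for SMILES: bracket atoms, two-letter
# halogens, aromatic atoms, bonds, ring-closure digits, etc.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p"
    r"|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize(smiles: str) -> list[str]:
    return SMILES_TOKEN.findall(smiles)

def random_smiles(smiles: str, n: int = 5) -> list[str]:
    """One compound, many sequences: enumerate alternative SMILES for the
    same molecule via RDKit's randomized atom ordering."""
    mol = Chem.MolFromSmiles(smiles)
    return [Chem.MolToSmiles(mol, doRandom=True) for _ in range(n)]

def mask_tokens(tokens: list[str], rate: float = 0.15) -> tuple[list[str], dict[int, str]]:
    """BERT-style masking: hide ~15% of tokens; the pretraining objective
    is to recover the hidden tokens (labels) from the masked sequence."""
    masked, labels = [], {}
    for i, tok in enumerate(tokens):
        if random.random() < rate:
            labels[i] = tok
            masked.append("[MASK]")
        else:
            masked.append(tok)
    return masked, labels

if __name__ == "__main__":
    aspirin = "CC(=O)Oc1ccccc1C(=O)O"
    print(random_smiles(aspirin))  # several distinct strings, one molecule
    masked, labels = mask_tokens(tokenize(aspirin))
    print(masked, labels)
```

In a full pipeline, the masked sequences would be fed to a transformer encoder and the randomized SMILES variants would supply same-compound pairs for the second encoder's objective; the sketch only prepares the data.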
Full text: 1 Collection: 01-internacional Database: MEDLINE Language: En Journal: J Cheminform Year: 2024 Document type: Article