SMILES Pair Encoding: A Data-Driven Substructure Tokenization Algorithm for Deep Learning.
J Chem Inf Model; 61(4): 1560-1569, 2021 04 26.
Article in En | MEDLINE | ID: mdl-33715361
ABSTRACT
Simplified molecular input line entry system (SMILES)-based deep learning models are emerging as an important research topic in cheminformatics. In this study, we introduce SMILES pair encoding (SPE), a data-driven tokenization algorithm. SPE first learns a vocabulary of high-frequency SMILES substrings from a large chemical dataset (e.g., ChEMBL) and then tokenizes SMILES based on the learned vocabulary for the actual training of deep learning models. SPE augments the widely used atom-level tokenization by adding human-readable and chemically explainable SMILES substrings as tokens. Case studies show that SPE can achieve superior performance on both molecular generation and quantitative structure-activity relationship (QSAR) prediction tasks. In particular, the SPE-based generative models outperformed the atom-level tokenization model in terms of novelty, diversity, and ability to resemble the training set distribution. The performance of SPE-based QSAR prediction models was evaluated using 24 benchmark datasets, where SPE consistently either matched or outperformed atom-level and k-mer tokenization. Therefore, SPE could be a promising tokenization method for SMILES-based deep learning models. An open-source Python package, SmilesPE, was developed to implement this algorithm and is freely available at https://github.com/XinhaoLi74/SmilesPE.
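The two-stage procedure described in the abstract (learn frequent substrings, then tokenize with them) can be illustrated with a minimal byte-pair-encoding-style sketch in Python. This is not the SmilesPE package's API or the authors' implementation: the regular expression, the function names (`atom_tokenize`, `learn_merges`, `spe_tokenize`), the `min_frequency` parameter, and the toy corpus are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed atom-level pattern: bracket atoms, two-letter halogens, two-digit
# ring closures, organic-subset atoms, bond/branch symbols, ring digits.
ATOM_PATTERN = re.compile(
    r"\[[^\]]+\]|Br|Cl|%\d{2}|[BCNOSPFIbcnosp]|[=#$:/\\.\-+()~*@]|\d"
)


def atom_tokenize(smiles):
    """Split a SMILES string into atom-level tokens."""
    tokens = ATOM_PATTERN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"Unrecognized characters in SMILES: {smiles!r}")
    return tokens


def apply_merge(tokens, pair):
    """Fuse every adjacent occurrence of `pair` into a single token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_merges(corpus, num_merges, min_frequency=2):
    """Learn an ordered list of token-pair merges from a SMILES corpus."""
    sequences = [atom_tokenize(s) for s in corpus]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for seq in sequences:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break
        # Most frequent pair; ties are broken lexicographically so the
        # result is deterministic.
        best, freq = max(pair_counts.items(), key=lambda kv: (kv[1], kv[0]))
        if freq < min_frequency:
            break
        merges.append(best)
        sequences = [apply_merge(seq, best) for seq in sequences]
    return merges


def spe_tokenize(smiles, merges):
    """Tokenize a SMILES string by replaying the learned merges in order."""
    tokens = atom_tokenize(smiles)
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens


if __name__ == "__main__":
    toy_corpus = ["CCO", "CCN", "CCC", "c1ccccc1", "CC(=O)O"]  # hypothetical data
    learned = learn_merges(toy_corpus, num_merges=5)
    print(learned)
    print(spe_tokenize("CC(=O)Oc1ccccc1C(=O)O", learned))
```

Because merges only concatenate adjacent tokens, joining the output tokens always reproduces the input SMILES, and the token sequence is never longer than the atom-level one. In practice the vocabulary would be learned from a large dataset such as ChEMBL rather than a handful of strings.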
Collection: 01-internacional
Database: MEDLINE
Main subject: Deep Learning
Type of study: Prognostic_studies
Limits: Humans
Language: En
Journal: J Chem Inf Model
Journal subject: Medical Informatics / Chemistry
Year: 2021
Type: Article
Affiliation country: United States