SMILES Pair Encoding: A Data-Driven Substructure Tokenization Algorithm for Deep Learning.

Li, Xinhao; Fourches, Denis

Affiliation

Li X; Department of Chemistry, Bioinformatics Research Center, North Carolina State University, Raleigh, North Carolina 27695, United States.
Fourches D; Department of Chemistry, Bioinformatics Research Center, North Carolina State University, Raleigh, North Carolina 27695, United States.

J Chem Inf Model ; 61(4): 1560-1569, 2021 04 26.

Article in En | MEDLINE | ID: mdl-33715361

ABSTRACT

Simplified molecular input line entry system (SMILES)-based deep learning models are slowly emerging as an important research topic in cheminformatics. In this study, we introduce SMILES pair encoding (SPE), a data-driven tokenization algorithm. SPE first learns a vocabulary of high-frequency SMILES substrings from a large chemical dataset (e.g., ChEMBL) and then tokenizes SMILES based on the learned vocabulary for the actual training of deep learning models. SPE augments the widely used atom-level tokenization by adding human-readable and chemically explainable SMILES substrings as tokens. Case studies show that SPE can achieve superior performance on both molecular generation and quantitative structure-activity relationship (QSAR) prediction tasks. In particular, the SPE-based generative models outperformed the atom-level tokenization model in terms of novelty, diversity, and ability to resemble the training set distribution. The performance of SPE-based QSAR prediction models was evaluated on 24 benchmark datasets, where SPE consistently matched or outperformed atom-level and k-mer tokenization. Therefore, SPE could be a promising tokenization method for SMILES-based deep learning models. An open-source Python package, SmilesPE, was developed to implement this algorithm and is freely available at https://github.com/XinhaoLi74/SmilesPE.
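
The two stages described in the abstract (learning a vocabulary of frequent substrings, then tokenizing with it) can be illustrated with a minimal sketch. It assumes a byte-pair-encoding-style procedure that starts from atom-level tokens and repeatedly merges the most frequent adjacent pair. The regular expression, the function names (`atom_tokenize`, `apply_merge`, `learn_merges`, `spe_tokenize`), and the `min_frequency` parameter are illustrative assumptions. None of this is the SmilesPE package API, and the published algorithm may differ in its details.

```python
import re
from collections import Counter

# Simplified atom-level pattern (an assumption, not the paper's tokenizer):
# bracket atoms, the two-letter halogens Br/Cl, %nn ring closures,
# and otherwise any single character.
ATOM_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def atom_tokenize(smiles):
    """Split a SMILES string into atom-level tokens."""
    return ATOM_PATTERN.findall(smiles)


def apply_merge(tokens, pair):
    """Fuse every adjacent occurrence of `pair` into a single token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_merges(corpus, num_merges, min_frequency=2):
    """Learn an ordered list of token-pair merges from a SMILES corpus."""
    sequences = [atom_tokenize(s) for s in corpus]
    merges = []
    for _ in range(num_merges):
        # Count adjacent token pairs across the whole corpus.
        pair_counts = Counter()
        for seq in sequences:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break
        best_pair, frequency = pair_counts.most_common(1)[0]
        if frequency < min_frequency:
            break
        merges.append(best_pair)
        sequences = [apply_merge(seq, best_pair) for seq in sequences]
    return merges


def spe_tokenize(smiles, merges):
    """Tokenize a SMILES string by replaying the learned merges in order."""
    tokens = atom_tokenize(smiles)
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens


if __name__ == "__main__":
    toy_corpus = ["CCO", "CCO", "CCN"]  # toy data, not ChEMBL
    learned = learn_merges(toy_corpus, num_merges=1)
    print(learned)
    print(spe_tokenize("CCO", learned))
```

Because merges only concatenate neighbouring tokens, joining the output tokens always reproduces the input string, and each learned token stays a readable SMILES substring.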

Subject(s)

Deep Learning; Algorithms; Cheminformatics; Humans; Quantitative Structure-Activity Relationship


Full text: 1 | Collection: 01-internacional | Database: MEDLINE | Main subject: Deep Learning | Type of study: Prognostic_studies | Limits: Humans | Language: En | Journal: J Chem Inf Model | Journal subject: Medical Informatics / Chemistry | Year: 2021 | Type: Article | Affiliation country: United States
