Learning to SMILES: BAN-based strategies to improve latent representation learning from molecules.

Wu, Cheng-Kun; Zhang, Xiao-Chen; Yang, Zhi-Jiang; Lu, Ai-Ping; Hou, Ting-Jun; Cao, Dong-Sheng

Affiliation

Wu CK; State Key Laboratory of High-Performance Computing, College of Computer, National University of Defense Technology, China.
Zhang XC; The College of Computer, National University of Defense Technology, China.
Yang ZJ; Xiangya School of Pharmaceutical Sciences, Central South University, Hunan, China.
Lu AP; Institute for Advancing Translational Medicine in Bone and Joint Diseases, School of Chinese Medicine, Hong Kong Baptist University, Hong Kong.
Hou TJ; College of Pharmaceutical Sciences, Zhejiang University, China.
Cao DS; Xiangya School of Pharmaceutical Sciences, Central South University, China.

Brief Bioinform; 22(6), 2021 Nov 05.

Article in English | MEDLINE | ID: mdl-34427296

ABSTRACT

Computational methods have become indispensable tools to accelerate the drug discovery process and alleviate the excessive dependence on time-consuming and labor-intensive experiments. Traditional feature-engineering approaches rely heavily on expert knowledge to devise useful features, which can be costly and sometimes biased. Emerging deep learning (DL) methods offer a data-driven way to automatically learn expressive representations from complex raw data. Inspired by this, researchers have attempted to apply various deep neural network models to simplified molecular input line entry specification (SMILES) strings, which contain all the composition and structure information of molecules. However, current models usually suffer from the scarcity of labeled data. This results in a low generalization ability of SMILES-based DL models, which prevents them from competing with the state-of-the-art computational methods. In this study, we utilized the BiLSTM (bidirectional long short-term memory) attention network (BAN), in which we employed a novel multi-step attention mechanism to facilitate the extraction of key features from the SMILES strings. Meanwhile, SMILES enumeration was utilized as a data augmentation method in the training phase to substantially increase the amount of labeled data and raise the probability of mining more patterns from complex SMILES. We again took advantage of SMILES enumeration in the prediction phase to rectify model prediction bias and provide a more accurate prediction. Combined with the BAN model, our strategies can greatly improve the performance of latent features learned from SMILES strings. In 11 canonical absorption, distribution, metabolism, excretion and toxicity-related tasks, our method outperformed the state-of-the-art approaches.
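The two uses of SMILES enumeration described in the abstract (augmenting the training set, then aggregating over variants at prediction time) can be illustrated with a minimal sketch. Everything named here is an assumption for illustration and not the authors' implementation: `toy_enumerate` is a hand-written lookup standing in for a real enumerator, `toy_predict` is a stand-in for a trained model, and averaging is only one plausible way to combine per-variant predictions, since the abstract does not state the aggregation rule.

```python
from statistics import mean
from typing import Callable, List, Tuple

# Hypothetical hand-written lookup: three valid SMILES spellings of ethanol.
# A real pipeline would generate variants with a cheminformatics toolkit.
TOY_VARIANTS = {"CCO": ["CCO", "OCC", "C(O)C"]}


def toy_enumerate(smiles: str) -> List[str]:
    """Return equivalent SMILES strings for one molecule (toy stand-in)."""
    return TOY_VARIANTS.get(smiles, [smiles])


def augment_training_set(
    records: List[Tuple[str, int]],
    enumerate_fn: Callable[[str], List[str]] = toy_enumerate,
) -> List[Tuple[str, int]]:
    """Training phase: every enumerated variant inherits the molecule's label."""
    return [(v, label) for smi, label in records for v in enumerate_fn(smi)]


def predict_with_enumeration(
    smiles: str,
    predict_fn: Callable[[str], float],
    enumerate_fn: Callable[[str], List[str]] = toy_enumerate,
) -> float:
    """Prediction phase: score every variant, then average.

    Averaging is an assumption; the abstract does not name the rule.
    """
    return mean(predict_fn(v) for v in enumerate_fn(smiles))


def toy_predict(smiles: str) -> float:
    """Stand-in for a trained sequence model; deliberately order-sensitive."""
    return float(smiles.index("O"))


augmented = augment_training_set([("CCO", 1)])
consensus = predict_with_enumeration("CCO", toy_predict)
```

The toy model gives different scores to different spellings of the same molecule, which is the kind of string-dependent bias that aggregating over enumerated variants is meant to smooth out.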

Subjects

Cheminformatics/methods; Deep Learning; Drug Discovery/methods; Software; Algorithms; Drug Development; Research Design

Keywords

SMILES; attention mechanism; data augmentation; deep learning; drug discovery

Full text: 1 Database: MEDLINE Main subject: Software / Drug Discovery / Deep Learning / Cheminformatics Language: En Publication year: 2021 Document type: Article
