Absorption Distribution Metabolism Excretion and Toxicity Property Prediction Utilizing a Pre-Trained Natural Language Processing Model and Its Applications in Early-Stage Drug Development.
Jung, Woojin; Goo, Sungwoo; Hwang, Taewook; Lee, Hyunjung; Kim, Young-Kuk; Chae, Jung-Woo; Yun, Hwi-Yeol; Jung, Sangkeun.
Affiliations
  • Jung W; College of Pharmacy, Chungnam National University, Daejeon 34134, Republic of Korea.
  • Goo S; Department of Bio-AI convergence, Chungnam National University, Daejeon 34134, Republic of Korea.
  • Hwang T; Department of Bio-AI convergence, Chungnam National University, Daejeon 34134, Republic of Korea.
  • Lee H; Computer Science and Engineering, Chungnam National University, Daejeon 34134, Republic of Korea.
  • Kim YK; Department of Bio-AI convergence, Chungnam National University, Daejeon 34134, Republic of Korea.
  • Chae JW; Department of Bio-AI convergence, Chungnam National University, Daejeon 34134, Republic of Korea.
  • Yun HY; Computer Science and Engineering, Chungnam National University, Daejeon 34134, Republic of Korea.
  • Jung S; College of Pharmacy, Chungnam National University, Daejeon 34134, Republic of Korea.
Pharmaceuticals (Basel); 17(3), 2024 Mar 17.
Article in English | MEDLINE | ID: mdl-38543168
ABSTRACT
Machine learning techniques are extensively employed in drug discovery, with a significant focus on developing quantitative structure-activity relationship (QSAR) models that interpret the structural information of potential drugs. In this study, the pre-trained natural language processing (NLP) model ChemBERTa was applied to the drug discovery process. We proposed and evaluated four core model architectures: deep neural network (DNN), encoder, concatenation (concat), and pipe. The DNN model takes physicochemical properties as input, while the encoder model applies NLP techniques to simplified molecular input line entry system (SMILES) strings. The latter two models, concat and pipe, incorporate both SMILES and physicochemical properties, combining them in parallel and sequentially, respectively. We collected 5238 entries from DrugBank, including their physicochemical properties and absorption, distribution, metabolism, excretion, and toxicity (ADMET) features. Model performance was assessed by the area under the receiver operating characteristic curve (AUROC), with the DNN, encoder, concat, and pipe models achieving 62.4%, 76.0%, 74.9%, and 68.2%, respectively. In a separate test with 84 experimental microsomal stability datasets, the AUROC scores for external data were 78% for DNN, 44% for the encoder, and 50% for concat, indicating that the DNN model had superior predictive capabilities for new data. This suggests that models based on structural information may require further optimization or alternative tokenization strategies. The application of NLP techniques to pharmaceutical challenges has demonstrated promising results, while highlighting the need for more extensive data to improve model generalization.
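For readers less familiar with the evaluation metric, the following is a minimal sketch of how an AUROC value can be computed from binary labels and predicted scores, using the pairwise (Mann-Whitney) formulation. The function name `auroc` and the toy labels and scores are illustrative assumptions and are not taken from the study; the authors' actual evaluation code is not described in the abstract.

```python
def auroc(labels, scores):
    """Pairwise AUROC: the fraction of (positive, negative) pairs in which
    the positive example receives the higher score; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data (not from the paper): 1 = positive class, 0 = negative.
toy_labels = [1, 1, 0, 0, 1, 0]
toy_scores = [0.9, 0.4, 0.35, 0.8, 0.7, 0.1]
print(f"AUROC = {auroc(toy_labels, toy_scores):.3f}")  # 7 of 9 pairs ranked correctly -> 0.778
```

Under this definition a value of 0.5 corresponds to random ranking, which is why external-test scores of 44% and 50% indicate essentially no discriminative ability on the new data, whereas 78% reflects a usable ranking.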

Full text: 1 Collections: 01-international Database: MEDLINE Language: En Publication year: 2024 Document type: Article
