Search | VHL Regional Portal

ChatMol: interactive molecular discovery with natural language.

Zeng, Zheni; Yin, Bangchen; Wang, Shipeng; Liu, Jiarui; Yang, Cheng; Yao, Haishen; Sun, Xingzhi; Sun, Maosong; Xie, Guotong; Liu, Zhiyuan.

Bioinformatics ; 40(9), 2024 Sep 02.

Article in English | MEDLINE | ID: mdl-39222004

ABSTRACT

MOTIVATION: Natural language is poised to become a key medium for human-machine interactions in the era of large language models. In the field of biochemistry, tasks such as property prediction and molecule mining are critically important yet technically challenging. Bridging molecular expressions in natural language and chemical language can significantly enhance the interpretability and ease of these tasks. Moreover, it can integrate chemical knowledge from various sources, leading to a deeper understanding of molecules. RESULTS: Recognizing these advantages, we introduce the concept of conversational molecular design, a novel task that utilizes natural language to describe and edit target molecules. To better accomplish this task, we develop ChatMol, a knowledgeable and versatile generative pretrained model. This model is enhanced by incorporating experimental property information, molecular spatial knowledge, and the associations between natural and chemical languages. Several typical solutions including large language models (e.g. ChatGPT) are evaluated, demonstrating the difficulty of conversational molecular design and the effectiveness of our knowledge enhancement approach. Case observations and analysis offer insights and directions for further exploration of natural-language interaction in molecular discovery. AVAILABILITY AND IMPLEMENTATION: Code and data are available at https://github.com/Ellenzzn/ChatMol/tree/main.

Subject(s)

Natural Language Processing , Humans , Software , Computational Biology/methods

Descriptor-augmented machine learning for enzyme-chemical interaction predictions.

Han, Yilei; Zhang, Haoye; Zeng, Zheni; Liu, Zhiyuan; Lu, Diannan; Liu, Zheng.

Synth Syst Biotechnol ; 9(2): 259-268, 2024 Jun.

Article in English | MEDLINE | ID: mdl-38450325

ABSTRACT

Descriptors play a pivotal role in enzyme design for the greener synthesis of biochemicals, as they can characterize enzymes and chemicals from physicochemical and evolutionary perspectives. This study examined the effects of various descriptors on the performance of a Random Forest model used to predict enzyme-chemical relationships. We curated activity data for seven specific enzyme families from the literature and developed a pipeline for evaluating machine learning model performance using 10-fold cross-validation. The influence of protein and chemical descriptors was assessed in three scenarios: predicting the activity of unknown relations between known enzymes and known chemicals (new relationship evaluation), predicting the activity of novel enzymes on known chemicals (new enzyme evaluation), and predicting the activity of new chemicals on known enzymes (new chemical evaluation). The results showed that protein descriptors significantly enhanced the classification performance of the model in the new enzyme evaluation in three out of the seven datasets with the greatest number of enzymes, whereas chemical descriptors appeared to have no effect. A variety of sequence-based and structure-based protein descriptors were constructed, among which the esm-2 descriptor achieved the best results. Using enzyme families as labels showed that descriptors could cluster proteins well, which could explain the contributions of descriptors to the machine learning model. Conversely, in the new chemical evaluation, chemical descriptors yielded significant improvement in four out of the seven datasets, while protein descriptors appeared to have no effect. We attempted to evaluate the generalization ability of the model by correlating the statistics of the datasets with the performance of the models. The results showed that datasets with higher sequence similarity were more likely to get better results in the new enzyme evaluation and datasets with more enzymes were more likely to benefit from the protein descriptor strategy. This work provides guidance for the development of machine learning models for specific enzyme families.

Transcription between human-readable synthetic descriptions and machine-executable instructions: an application of the latest pre-training technology.

Zeng, Zheni; Nie, Yi-Chen; Ding, Ning; Ding, Qian-Jun; Ye, Wei-Ting; Yang, Cheng; Sun, Maosong; E, Weinan; Zhu, Rong; Liu, Zhiyuan.

Chem Sci ; 14(35): 9360-9373, 2023 Sep 13.

Article in English | MEDLINE | ID: mdl-37712039

ABSTRACT

AI has been widely applied in scientific scenarios, such as robots performing chemical synthetic actions to free researchers from monotonous experimental procedures. However, there exists a gap between human-readable natural language descriptions and machine-executable instructions: the former are typically found in numerous chemical articles, while the latter are currently compiled manually by experts. We apply the latest technology of pre-trained models and achieve automatic transcription between descriptions and instructions. We design a concise and comprehensive schema of instructions and construct an open-source human-annotated dataset consisting of 3950 description-instruction pairs, with 9.2 operations in each instruction on average. We further propose knowledgeable pre-trained transcription models enhanced by multi-grained chemical knowledge. The performance of recent popular models and products showing great capability in automatic writing (e.g., ChatGPT) has also been explored. Experiments prove that our system improves the instruction compilation efficiency of researchers by at least 42%, and can generate fluent academic paragraphs of synthetic descriptions when given instructions, showing the great potential of pre-trained models in improving human productivity.

A deep-learning system bridging molecule structure and biomedical text with comprehension comparable to human professionals.

Zeng, Zheni; Yao, Yuan; Liu, Zhiyuan; Sun, Maosong.

Nat Commun ; 13(1): 862, 2022 Feb 14.

Article in English | MEDLINE | ID: mdl-35165275

ABSTRACT

To accelerate the biomedical research process, deep-learning systems are developed to automatically acquire knowledge about molecule entities by reading large-scale biomedical data. Inspired by how humans learn deep molecule knowledge from versatile reading on both molecule structure and biomedical text information, we propose a knowledgeable machine reading system that bridges both types of information in a unified deep-learning framework for comprehensive biomedical research assistance. We solve the problem that existing machine reading models can only process different types of data separately, and thus achieve a comprehensive and thorough understanding of molecule entities. By grasping meta-knowledge in an unsupervised fashion within and across different information sources, our system can facilitate various real-world biomedical applications, including molecular property prediction, biomedical relation extraction and so on. Experimental results show that our system even surpasses human professionals in the capability of molecular property comprehension, and also reveal its promising potential in facilitating automatic drug discovery and documentation in the future.

Subject(s)

Data Mining , Deep Learning , Drug Discovery/methods , Natural Language Processing , Reading , Algorithms , Biomedical Research , Electronic Data Processing , Humans , Information Storage and Retrieval , Molecular Structure

Knowledge Transfer via Pre-training for Recommendation: A Review and Prospect.

Zeng, Zheni; Xiao, Chaojun; Yao, Yuan; Xie, Ruobing; Liu, Zhiyuan; Lin, Fen; Lin, Leyu; Sun, Maosong.

Front Big Data ; 4: 602071, 2021.

Article in English | MEDLINE | ID: mdl-33817631

ABSTRACT

Recommender systems aim to provide item recommendations for users and are usually faced with data sparsity problems (e.g., cold start) in real-world scenarios. Recently pre-trained models have shown their effectiveness in knowledge transfer between domains and tasks, which can potentially alleviate the data sparsity problem in recommender systems. In this survey, we first provide a review of recommender systems with pre-training. In addition, we show the benefits of pre-training to recommender systems through experiments. Finally, we discuss several promising directions for future research of recommender systems with pre-training. The source code of our experiments will be available to facilitate future research.
