Fine-tuning large language models for chemical text mining.
Zhang, Wei; Wang, Qinggong; Kong, Xiangtai; Xiong, Jiacheng; Ni, Shengkun; Cao, Duanhua; Niu, Buying; Chen, Mingan; Li, Yameng; Zhang, Runze; Wang, Yitian; Zhang, Lehan; Li, Xutong; Xiong, Zhaoping; Shi, Qian; Huang, Ziming; Fu, Zunyun; Zheng, Mingyue.
Affiliations
  • Zhang W; Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China. myzheng@simm.ac.cn; fuzunyun@simm.ac.cn.
  • Wang Q; University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing 100049, China.
  • Kong X; Nanjing University of Chinese Medicine, 138 Xianlin Road, Nanjing 210023, China.
  • Xiong J; Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China. myzheng@simm.ac.cn; fuzunyun@simm.ac.cn.
  • Ni S; University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing 100049, China.
  • Cao D; Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China. myzheng@simm.ac.cn; fuzunyun@simm.ac.cn.
  • Niu B; University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing 100049, China.
  • Chen M; Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China. myzheng@simm.ac.cn; fuzunyun@simm.ac.cn.
  • Li Y; University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing 100049, China.
  • Zhang R; Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China. myzheng@simm.ac.cn; fuzunyun@simm.ac.cn.
  • Wang Y; Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang 310058, China.
  • Zhang L; Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China. myzheng@simm.ac.cn; fuzunyun@simm.ac.cn.
  • Li X; University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing 100049, China.
  • Xiong Z; Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China. myzheng@simm.ac.cn; fuzunyun@simm.ac.cn.
  • Shi Q; School of Physical Science and Technology, ShanghaiTech University, Shanghai 201210, China.
  • Huang Z; Lingang Laboratory, Shanghai 200031, China.
  • Fu Z; ProtonUnfold Technology Co., Ltd, Suzhou, China.
  • Zheng M; Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China. myzheng@simm.ac.cn; fuzunyun@simm.ac.cn.
Chem Sci; 15(27): 10600-10611, 2024 Jul 10.
Article in English | MEDLINE | ID: mdl-38994403
ABSTRACT
Extracting knowledge from complex and diverse chemical texts is a pivotal task for both experimental and computational chemists, yet it remains extremely challenging because of the complexity of chemical language and of the scientific literature. This study explored the power of fine-tuned large language models (LLMs) on five intricate chemical text mining tasks: compound entity recognition, reaction role labelling, metal-organic framework (MOF) synthesis information extraction, nuclear magnetic resonance (NMR) data extraction, and the conversion of reaction paragraphs to action sequences. The fine-tuned LLMs demonstrated impressive performance and significantly reduced the need for repetitive and extensive prompt engineering experiments. For comparison, we guided ChatGPT (GPT-3.5-turbo) and GPT-4 with prompt engineering, and fine-tuned GPT-3.5-turbo as well as open-source LLMs such as Mistral, Llama3, Llama2, T5, and BART. The results showed that the fine-tuned ChatGPT models excelled in all tasks, achieving exact accuracy from 69% to 95% with minimal annotated data. They even outperformed task-adaptive pre-training and fine-tuning models built on a significantly larger amount of in-domain data. Notably, fine-tuned Mistral and Llama3 showed competitive abilities. Given their versatility, robustness, and low-code capability, leveraging fine-tuned LLMs as flexible and effective toolkits for automated data acquisition could revolutionize chemical knowledge extraction.
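As a rough illustration of the workflow the abstract describes (fine-tuning GPT-3.5-turbo on a small annotated set for an extraction task), the sketch below uses the standard OpenAI fine-tuning API for the compound entity recognition task. It is a minimal sketch only: the system instruction, training pairs, and file name are hypothetical placeholders, not the prompts or data used in the paper.

```python
# Minimal sketch of fine-tuning GPT-3.5-turbo for compound entity
# recognition via the OpenAI fine-tuning API. All prompts and examples
# below are hypothetical placeholders, not taken from the paper.
import json
import time

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM = "Extract all chemical compound names from the paragraph as a JSON list."

# Hypothetical (paragraph, entities) pairs standing in for an annotated set.
# A real job needs at least 10 examples; the paper used far more per task.
examples = [
    ("The mixture of benzaldehyde and acetone was stirred at 25 C.",
     ["benzaldehyde", "acetone"]),
    ("ZIF-8 was synthesized from zinc nitrate and 2-methylimidazole.",
     ["ZIF-8", "zinc nitrate", "2-methylimidazole"]),
]

# Each JSONL line holds one chat-formatted training example.
with open("train.jsonl", "w") as f:
    for paragraph, entities in examples:
        f.write(json.dumps({"messages": [
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": paragraph},
            {"role": "assistant", "content": json.dumps(entities)},
        ]}) + "\n")

# Upload the training file and start the fine-tuning job.
train_file = client.files.create(file=open("train.jsonl", "rb"),
                                 purpose="fine-tune")
job = client.fine_tuning.jobs.create(training_file=train_file.id,
                                     model="gpt-3.5-turbo")

# Poll until the job reaches a terminal state, then query the new model.
while (job := client.fine_tuning.jobs.retrieve(job.id)).status not in (
        "succeeded", "failed", "cancelled"):
    time.sleep(30)

if job.status == "succeeded":
    reply = client.chat.completions.create(
        model=job.fine_tuned_model,
        messages=[{"role": "system", "content": SYSTEM},
                  {"role": "user", "content":
                   "Toluene was added dropwise to the THF solution."}])
    print(reply.choices[0].message.content)
```

The same pattern extends to the other four tasks by swapping the instruction and the target output format (e.g., emitting a JSON object of reaction roles or an action sequence instead of an entity list).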

Full text: 1 Databases: MEDLINE Language: English Journal: Chem Sci Year: 2024 Document type: Article
