Multi-modal chemical information reconstruction from images and texts for exploring the near-drug space.
Wang, Jie; Shen, Zihao; Liao, Yichen; Yuan, Zhen; Li, Shiliang; He, Gaoqi; Lan, Man; Qian, Xuhong; Zhang, Kai; Li, Honglin.
Affiliation
  • Wang J; Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science & Technology, Shanghai 200237, China.
  • Shen Z; Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science & Technology, Shanghai 200237, China.
  • Liao Y; Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science & Technology, Shanghai 200237, China.
  • Yuan Z; Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science & Technology, Shanghai 200237, China.
  • Li S; Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science & Technology, Shanghai 200237, China.
  • He G; School of Computer Science and Technology, East China Normal University, Shanghai 200062, China.
  • Lan M; School of Computer Science and Technology, East China Normal University, Shanghai 200062, China.
  • Qian X; Innovation Center for AI and Drug Discovery, East China Normal University, Shanghai 200062, China.
  • Zhang K; School of Computer Science and Technology, East China Normal University, Shanghai 200062, China.
  • Li H; Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science & Technology, Shanghai 200237, China.
Brief Bioinform; 23(6), 2022 Nov 19.
Article in English | MEDLINE | ID: mdl-36252922
ABSTRACT
Identifying new chemical compounds with the desired structural diversity and biological properties is essential to drug discovery, yet constructing a candidate space whose members have 'near-drug' properties remains challenging. In this work, we propose a multimodal chemical information reconstruction system that automatically processes, extracts and aligns heterogeneous information from the text descriptions and structural images in chemical patents. Our key innovation is a heterogeneous data generator that produces cross-modality training data in the form of text descriptions and Markush structure images, from which a two-branch model with image- and text-processing units learns both to recognize heterogeneous chemical entities and to capture their correspondence. Specifically, we collected chemical structures from the ChEMBL database and chemical patents from the European Patent Office and the US Patent and Trademark Office, using the keywords 'A61P, compound, structure' for the years 2010 to 2020, and generated heterogeneous chemical information datasets with 210K structural images and 7818 annotated text snippets. From the reconstructed results and substituent replacement rules, structural libraries containing large numbers of near-drug compounds can be generated automatically. In quantitative evaluations, our model correctly reconstructs 97% of the molecular images into a structured format and achieves an F1-score of around 97-98% in chemical entity recognition. These results demonstrate the effectiveness of the model for automatic information extraction from chemical patents, which could be transformed into a user-friendly, structured molecular database that enriches the near-drug space and supports intelligent retrieval of chemical knowledge.
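
The abstract mentions generating structural libraries from reconstructed results and substituent replacement rules. As a rough illustration only, the sketch below enumerates every combination of substituents on a Markush-style scaffold written as a SMILES template. The function name `enumerate_markush`, the `{R1}`/`{R2}` placeholder syntax, the scaffold, and the substituent lists are all hypothetical choices made for this example; they are not taken from the paper, and the code does no chemical validation of the strings it produces.

```python
from itertools import product


def enumerate_markush(scaffold, substituents):
    """Expand a scaffold template into every substituent combination.

    scaffold     -- SMILES-like template with named placeholders, e.g. "{R1}c1ccc({R2})cc1"
    substituents -- dict mapping each placeholder name to a list of fragment strings
    Both the template syntax and the fragments are illustrative assumptions.
    """
    names = sorted(substituents)  # fixed order so the output is deterministic
    results = []
    # Cartesian product over the candidate fragments of every R-group
    for combo in product(*(substituents[name] for name in names)):
        results.append(scaffold.format(**dict(zip(names, combo))))
    return results


# Hypothetical para-disubstituted benzene scaffold with two variable positions
scaffold = "{R1}c1ccc({R2})cc1"
substituents = {
    "R1": ["C", "CC"],      # methyl, ethyl
    "R2": ["O", "N", "F"],  # hydroxyl, amino, fluoro
}

library = enumerate_markush(scaffold, substituents)
for smiles in library:
    print(smiles)
```

Two choices for R1 and three for R2 give six candidate structures. A real pipeline would presumably also check each generated structure with a cheminformatics toolkit and apply the patent's own constraints, which this sketch omits.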

Full text: 1 Database: MEDLINE Main subject: Data Mining / Chemical Compound Databases Study type: Prognostic_studies Language: En Publication year: 2022 Document type: Article