MolLM: a unified language model for integrating biomedical text with 2D and 3D molecular representations.

Tang, Xiangru; Tran, Andrew; Tan, Jeffrey; Gerstein, Mark B

Affiliation

Tang X; Department of Biomedical Informatics & Data Science, Yale University, New Haven, CT 06520, USA.
Tran A; Department of Biomedical Informatics & Data Science, Yale University, New Haven, CT 06520, USA.
Tan J; Department of Biomedical Informatics & Data Science, Yale University, New Haven, CT 06520, USA.
Gerstein MB; Department of Biomedical Informatics & Data Science, Yale University, New Haven, CT 06520, USA.

Bioinformatics ; 40(Supplement_1): i357-i368, 2024 Jun 28.

Article in English | MEDLINE | ID: mdl-38940177

ABSTRACT

MOTIVATION:

The current paradigm of deep learning models for the joint representation of molecules and text primarily relies on 1D or 2D molecular formats, neglecting significant 3D structural information that offers valuable physical insight. This narrow focus inhibits the models' versatility and adaptability across a wide range of modalities. Conversely, the limited research focusing on explicit 3D representation tends to overlook textual data within the biomedical domain.

RESULTS:

We present a unified pre-trained language model, MolLM, that concurrently captures 2D and 3D molecular information alongside biomedical text. MolLM consists of a text Transformer encoder and a molecular Transformer encoder, the latter designed to encode both 2D and 3D molecular structures. To support MolLM's self-supervised pre-training, we constructed 160K molecule-text pairings. Using contrastive learning as the supervisory signal, MolLM demonstrates robust molecular representation capabilities across four downstream tasks: cross-modal molecule and text matching, property prediction, captioning, and text-prompted molecular editing. Through ablation, we demonstrate that the inclusion of explicit 3D representations improves performance in these downstream tasks.

AVAILABILITY AND IMPLEMENTATION:

Our code, data, pre-trained model weights, and examples of using our model are all available at https://github.com/gersteinlab/MolLM. In particular, we provide Jupyter Notebooks offering step-by-step guidance on how to use MolLM to extract embeddings for both molecules and text.
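The abstract names contrastive learning over molecule-text pairs as the training signal but does not spell out the objective. As a rough illustration only, the sketch below shows one common form of such an objective, a symmetric InfoNCE-style loss over a batch of paired embeddings, together with nearest-neighbour retrieval of the kind used for cross-modal matching. It is not taken from the MolLM code base: the function names, the temperature value, and the toy two-dimensional vectors are all assumptions made for the example, and the actual MolLM loss may differ.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(v - row_max) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(mol_embs, text_embs, temperature=0.1):
    """Hypothetical symmetric InfoNCE-style loss for paired embeddings.

    mol_embs[i] and text_embs[i] are assumed to describe the same molecule;
    every other pairing in the batch is treated as a negative.
    """
    mol = [l2_normalize(v) for v in mol_embs]
    txt = [l2_normalize(v) for v in text_embs]
    # Cosine similarity matrix, scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(m, t)) / temperature for t in txt]
        for m in mol
    ]
    logits_transposed = [list(col) for col in zip(*logits)]
    mol_to_text = cross_entropy_diagonal(logits)
    text_to_mol = cross_entropy_diagonal(logits_transposed)
    return 0.5 * (mol_to_text + text_to_mol)


def retrieve(query_emb, candidate_embs):
    """Return the index of the candidate most similar to the query."""
    query = l2_normalize(query_emb)
    scores = [
        sum(a * b for a, b in zip(query, l2_normalize(c)))
        for c in candidate_embs
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy embeddings (illustrative values, not model outputs).
molecules = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
shuffled_texts = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = symmetric_contrastive_loss(molecules, aligned_texts)
shuffled_loss = symmetric_contrastive_loss(molecules, shuffled_texts)
best_match = retrieve([1.0, 0.1], shuffled_texts)
```

In this toy setup the loss is lower when each molecule embedding sits next to its own text embedding than when the pairs are shuffled, which is the behaviour a contrastive objective is meant to reward. The same similarity scores can then be reused for retrieval, as `retrieve` does.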

Subjects

Natural Language Processing; Deep Learning; Computational Biology/methods

Full text: 1 Database: MEDLINE Main subject: Natural Language Processing Language: En Publication year: 2024 Document type: Article