Search | BVS IEC

Complementary multi-modality molecular self-supervised learning via non-overlapping masking for property prediction.

Shen, Ao; Yuan, Mingzhi; Ma, Yingfan; Du, Jie; Wang, Manning.

Brief Bioinform; 25(4), 2024 May 23.

Article in English | MEDLINE | ID: mdl-38801702

ABSTRACT

Self-supervised learning plays an important role in molecular representation learning because labeled molecular data are limited in many tasks, such as chemical property prediction and virtual screening. However, most existing molecular pre-training methods focus on a single modality of molecular data, and the complementary information in two important modalities, SMILES and graph, is not fully explored. In this study, we propose an effective multi-modality self-supervised learning framework for molecular SMILES and graph data. Specifically, SMILES data and graph data are first tokenized so that they can be processed by a unified Transformer-based backbone network, which is trained with a masked reconstruction strategy. In addition, we introduce a specialized non-overlapping masking strategy to encourage fine-grained interaction between the two modalities. Experimental results show that our framework achieves state-of-the-art performance on a series of molecular property prediction tasks, and a detailed ablation study demonstrates the efficacy of both the multi-modality framework and the masking strategy.
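The abstract does not spell out how the non-overlapping masks are drawn, so the following is only a minimal sketch of one plausible reading: each atom is assumed to map to one SMILES token and one graph node, and an atom masked in one modality is kept visible in the other. The function name `non_overlapping_masks` and the `mask_ratio` parameter are hypothetical and are not taken from the paper.

```python
import random


def non_overlapping_masks(num_atoms, mask_ratio=0.25, seed=None):
    """Pick two disjoint sets of atom indices to mask.

    Assumption (not from the paper): atom i corresponds to one SMILES
    token and one graph node, so a shared index space can be used.
    """
    if not 0 <= mask_ratio <= 0.5:
        # Above 0.5 the two masks could not both fit without overlapping.
        raise ValueError("mask_ratio must be between 0 and 0.5")
    rng = random.Random(seed)
    order = list(range(num_atoms))
    rng.shuffle(order)
    k = int(num_atoms * mask_ratio)
    smiles_masked = set(order[:k])        # hidden in the SMILES view
    graph_masked = set(order[k:2 * k])    # hidden in the graph view
    return smiles_masked, graph_masked


# Example: 12 atoms, 25% masked per modality, so 3 indices in each set.
smiles_mask, graph_mask = non_overlapping_masks(12, mask_ratio=0.25, seed=0)
print(sorted(smiles_mask), sorted(graph_mask))
```

Under this reading, every masked SMILES token has a visible counterpart in the graph and vice versa, which is one way the reconstruction task could push the model to use cross-modality information.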

Subjects

Supervised Machine Learning, Algorithms, Computational Biology/methods

ProteinMAE: masked autoencoder for protein surface self-supervised learning.

Yuan, Mingzhi; Shen, Ao; Fu, Kexue; Guan, Jiaming; Ma, Yingfan; Qiao, Qin; Wang, Manning.

Bioinformatics; 39(12), 2023 Dec 01.

Article in English | MEDLINE | ID: mdl-38019955

ABSTRACT

SUMMARY: The biological functions of proteins are determined by the chemical and geometric properties of their surfaces. Recently, with the rapid progress of deep learning, a series of learning-based surface descriptors have been proposed and have achieved impressive performance in many tasks, such as protein design and protein-protein interaction prediction. However, they are still limited by label scarcity, since labels are typically obtained through wet-lab experiments. Inspired by the success of self-supervised learning in natural language processing and computer vision, we introduce ProteinMAE, a self-supervised framework designed specifically for protein surface representation to mitigate label scarcity. Specifically, we propose an efficient network and pretrain it by self-supervised learning on a large amount of accessible unlabeled protein data. We then use the pretrained weights as initialization and fine-tune the network on downstream tasks. To demonstrate the effectiveness of our method, we conduct experiments on three downstream tasks: binding site identification on protein surfaces, ligand-binding protein pocket classification, and protein-protein interaction prediction. The experiments show that our method not only improves the network's performance on all downstream tasks but also achieves performance competitive with state-of-the-art methods. Moreover, our proposed network has a significant advantage in computational cost, requiring less than a tenth of the memory of previous methods. AVAILABILITY AND IMPLEMENTATION: https://github.com/phdymz/ProteinMAE.

Subjects

Membrane Proteins, Natural Language Processing, Binding Sites, Protein Domains, Supervised Machine Learning
