Results 1 - 6 of 6

1.
J Chem Inf Model; 64(16): 6338-6349, 2024 Aug 26.
Article in English | MEDLINE | ID: mdl-39110130

ABSTRACT

Fine-tuning pretrained protein language models (PLMs) has emerged as a prominent strategy for enhancing downstream prediction tasks, often outperforming traditional supervised learning approaches. Parameter-efficient fine-tuning, a powerful and widely applied technique in natural language processing, could likewise enhance the performance of PLMs; however, its direct transfer to life science tasks is nontrivial because of differences in training strategies and data forms. To address this gap, we introduce SES-Adapter, a simple, efficient, and scalable adapter method for enhancing the representation learning of PLMs. SES-Adapter combines PLM embeddings with structural sequence embeddings to create structure-aware representations. We show that the proposed method is compatible with different PLM architectures and across diverse tasks. Extensive evaluations are conducted on 2 types of folding structures with notable quality differences, 9 state-of-the-art baselines, and 9 benchmark data sets across distinct downstream tasks. Results show that, compared to vanilla PLMs, SES-Adapter improves downstream task performance by a maximum of 11% and an average of 3%, accelerates convergence by a maximum of 1034% and an average of 362%, and roughly doubles training efficiency. Moreover, positive optimization is observed even with low-quality predicted structures. The source code for SES-Adapter is available at https://github.com/tyang816/SES-Adapter.
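As a rough illustration of the adapter idea, the sketch below fuses per-residue PLM embeddings with an embedded structural-token sequence via cross-attention; the class name, dimensions, and fusion mechanism are illustrative assumptions, not the published SES-Adapter implementation.

```python
import torch
import torch.nn as nn

class StructureAwareAdapter(nn.Module):
    """Hypothetical sketch: fuse per-residue PLM embeddings with a
    discretized structural sequence (e.g., a 3Di-style alphabet) via
    cross-attention. All dimensions and layer choices are illustrative."""

    def __init__(self, plm_dim=1280, struct_vocab=26, struct_dim=128, n_heads=8):
        super().__init__()
        self.struct_embed = nn.Embedding(struct_vocab, struct_dim)
        self.struct_proj = nn.Linear(struct_dim, plm_dim)
        # PLM embeddings attend to the projected structural embeddings.
        self.cross_attn = nn.MultiheadAttention(plm_dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(plm_dim)

    def forward(self, plm_emb, struct_tokens):
        # plm_emb: (batch, length, plm_dim) from a frozen PLM
        # struct_tokens: (batch, length) integer structural-alphabet codes
        s = self.struct_proj(self.struct_embed(struct_tokens))
        fused, _ = self.cross_attn(query=plm_emb, key=s, value=s)
        # Residual connection keeps the vanilla PLM representation intact.
        return self.norm(plm_emb + fused)

# Toy usage with random tensors standing in for real PLM/structure outputs.
adapter = StructureAwareAdapter()
x = torch.randn(2, 100, 1280)
tok = torch.randint(0, 26, (2, 100))
print(adapter(x, tok).shape)  # torch.Size([2, 100, 1280])
```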


Subjects
Molecular Models, Proteins, Proteins/chemistry, Protein Conformation, Natural Language Processing
2.
J Cheminform; 16(1): 92, 2024 Aug 02.
Article in English | MEDLINE | ID: mdl-39095917

ABSTRACT

Protein language models (PLMs) play a dominant role in protein representation learning. Most existing PLMs regard proteins as sequences of the 20 natural amino acids. The problem with this representation is that it divides a protein into individual amino acid tokens, ignoring the fact that certain residues often occur together; it is therefore inappropriate to view amino acids as isolated tokens, and a PLM should instead recognize frequently co-occurring combinations of amino acids as single tokens. In this study, we use the byte-pair encoding (BPE) and unigram language model algorithms to construct advanced residue vocabularies for protein sequence tokenization, and we show that PLMs pre-trained with these advanced vocabularies outperform those trained with simple vocabularies on downstream tasks. Furthermore, we introduce PETA, a comprehensive benchmark for systematically evaluating PLMs. We find that vocabularies comprising 50 and 200 elements achieve optimal performance. Our code, model weights, and datasets are available at https://github.com/ginnm/ProteinPretraining. SCIENTIFIC CONTRIBUTION: This study introduces advanced protein sequence tokenization analysis, leveraging byte-pair encoding and unigram language models. By recognizing frequently occurring combinations of amino acids as single tokens, the proposed method enhances the performance of PLMs on downstream tasks. Additionally, we present PETA, a new comprehensive benchmark for the systematic evaluation of PLMs, demonstrating that vocabularies of 50 and 200 elements offer optimal performance.
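For illustration, a 50-token BPE vocabulary can be trained directly on raw protein sequences with the HuggingFace tokenizers library; the toy corpus and special tokens below are placeholders, not the PETA training setup.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

# Placeholder corpus; PETA-style training uses large protein collections.
sequences = [
    "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
    "MADEEKLPPGWEKRMSRSSGRVYYFNHITNASQWERPSG",
]

# BPE over the 20-letter amino acid alphabet: frequent residue
# combinations are merged into single tokens up to vocab_size.
tokenizer = Tokenizer(BPE(unk_token="<unk>"))
trainer = BpeTrainer(vocab_size=50, special_tokens=["<unk>", "<pad>"])
tokenizer.train_from_iterator(sequences, trainer=trainer)

enc = tokenizer.encode("MKTAYIAKQR")
print(enc.tokens)  # multi-residue tokens instead of single amino acids
```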

3.
J Chem Inf Model; 64(9): 3650-3661, 2024 May 13.
Article in English | MEDLINE | ID: mdl-38630581

ABSTRACT

Protein engineering faces the challenge of finding optimal mutants within a massive pool of candidates. In this study, we introduce ProtLGN, a deep-learning-based, data-efficient fitness prediction tool to steer protein engineering. Our method builds a lightweight graph neural network over protein structures that analyzes the microenvironment of amino acids in the wild-type protein and reconstructs the distribution of amino acid sequences more likely to pass natural selection. This distribution serves as general guidance for scoring variants toward arbitrary properties at any order of mutation. Our solution undergoes extensive wet-lab experimental validation spanning diverse physicochemical properties of various proteins, including fluorescence intensity, antigen-antibody affinity, thermostability, and DNA cleavage activity. More than 40% of ProtLGN-designed single-site mutants outperform their wild-type counterparts across all studied proteins and targeted properties. More importantly, our model can bypass negative epistatic effects to combine single mutation sites into deep mutants with up to seven mutation sites in a single round, with significantly improved physicochemical properties. This observation provides compelling evidence of the structure-based model's potential to guide deep mutations in protein engineering. Overall, our approach emerges as a versatile tool for protein engineering, benefiting both the computational and bioengineering communities.
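One common way to turn such a reconstructed distribution into mutant scores is the log-odds ratio between mutant and wild-type residues at each mutated position; the sketch below uses that convention with random stand-in predictions and is not necessarily ProtLGN's exact scoring rule.

```python
import torch

AA = "ACDEFGHIKLMNPQRSTVWY"
AA_IDX = {a: i for i, a in enumerate(AA)}

def score_mutant(log_probs, mutations):
    """Score a (possibly multi-site) mutant as the sum over mutated
    positions of log p(mutant aa) - log p(wild-type aa).

    log_probs: (length, 20) per-position log-probabilities, here assumed
               to come from some structure-conditioned network.
    mutations: list of (wt_aa, zero_based_position, mut_aa) tuples.
    """
    score = 0.0
    for wt, pos, mut in mutations:
        score += (log_probs[pos, AA_IDX[mut]] - log_probs[pos, AA_IDX[wt]]).item()
    return score

# Toy example: random "predictions" in place of a trained model's output.
L = 120
log_probs = torch.log_softmax(torch.randn(L, 20), dim=-1)
print(score_mutant(log_probs, [("A", 10, "V"), ("K", 57, "R")]))
```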


Subjects
Neural Networks (Computer), Protein Engineering, Protein Engineering/methods, Mutation, Proteins/chemistry, Proteins/genetics, Proteins/metabolism, Molecular Models, Protein Conformation, Deep Learning
4.
Comput Biol Med; 172: 108290, 2024 Apr.
Article in English | MEDLINE | ID: mdl-38503097

ABSTRACT

Generative large language models (LLMs) have achieved significant success in various natural language processing tasks, including question answering (QA) and dialogue systems. However, most models are trained on English data and generalize poorly when answering in Chinese. This limitation is especially evident in specialized domains such as traditional Chinese medical QA, where performance suffers from the absence of fine-tuning and high-quality datasets. To address this, we introduce MedChatZH, a dialogue model optimized for Chinese medical QA based on a transformer decoder with the LLaMA architecture. We continue pre-training on a curated corpus of Chinese medical books and then fine-tune on a carefully selected medical instruction dataset; the resulting model outperforms several Chinese dialogue baselines on a real-world medical dialogue dataset. Our model, code, and dataset are publicly available on GitHub (https://github.com/tyang816/MedChatZH) to encourage further research in traditional Chinese medicine and LLMs.
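The instruction fine-tuning stage can be sketched with the HuggingFace transformers Trainer; the checkpoint path, dataset file, column names, prompt format, and hyperparameters below are placeholders rather than MedChatZH's released configuration.

```python
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)
from datasets import load_dataset

base = "path/to/llama-base"  # placeholder checkpoint, not the released one
tokenizer = AutoTokenizer.from_pretrained(base)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # LLaMA tokenizers lack a pad token
model = AutoModelForCausalLM.from_pretrained(base)

def tokenize(ex):
    # Assumed instruction/response columns; the prompt format is illustrative.
    return tokenizer(f"问：{ex['instruction']}\n答：{ex['output']}",
                     truncation=True, max_length=1024)

data = load_dataset("json", data_files="medical_instructions.json")["train"]
data = data.map(tokenize, remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="medchatzh-sft",
                           per_device_train_batch_size=2,
                           num_train_epochs=2, learning_rate=2e-5),
    train_dataset=data,
    # Causal-LM collator pads batches and copies input_ids into labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```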


Subjects
Medical Education, Traditional Chinese Medicine, Language, Referral and Consultation, Natural Language Processing, Artificial Intelligence
5.
J Cheminform; 15(1): 91, 2023 Oct 04.
Article in English | MEDLINE | ID: mdl-37794460

ABSTRACT

In recent years, drug design has been revolutionized by the application of deep learning techniques, and molecule generation is a crucial aspect of this transformation. However, most current deep learning approaches do not explicitly consider or apply a scaffold-hopping strategy when performing molecular generation. In this work, we propose ScaffoldGVAE, a variational autoencoder based on multi-view graph neural networks, for scaffold generation and scaffold hopping of drug molecules. The model integrates several important components, including node-central and edge-central message passing, side-chain embedding, and a Gaussian mixture distribution over scaffolds. To assess the efficacy of our model, we conduct a comprehensive evaluation against baseline models using seven general metrics and four scaffold-hopping-specific metrics for generative models. The results demonstrate that ScaffoldGVAE can explore unseen chemical space and generate novel molecules distinct from known compounds. In particular, the scaffold-hopped molecules generated by our model are validated by evaluation with GraphDTA, LeDock, and MM/GBSA. A case study on generating LRRK2 inhibitors for the treatment of Parkinson's disease further demonstrates the effectiveness of ScaffoldGVAE in generating novel compounds through scaffold hopping. This approach can also be applied to other protein targets of various diseases, thereby contributing to the future development of new drugs. Source code and data are available at https://github.com/ecust-hc/ScaffoldGVAE.
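A simple way to verify that a generated molecule actually hops scaffolds is to compare Bemis-Murcko scaffolds of the seed and the candidate with RDKit; this generic check, with illustrative SMILES, is an assumption and not ScaffoldGVAE's internal scaffold/side-chain decomposition.

```python
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold

def murcko_smiles(smiles):
    # Strip side chains, keeping the Bemis-Murcko ring framework.
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(MurckoScaffold.GetScaffoldForMol(mol))

seed = "c1ccc2[nH]ccc2c1CCN"       # indole core with a side chain (illustrative)
generated = "c1ccc2occc2c1CCN"     # benzofuran analogue (illustrative)

# Same side chain, different core: a scaffold hop under this definition.
print(murcko_smiles(seed), murcko_smiles(generated),
      murcko_smiles(seed) != murcko_smiles(generated))
```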

6.
J Cheminform; 15(1): 12, 2023 Feb 03.
Article in English | MEDLINE | ID: mdl-36737798

ABSTRACT

Deep learning has been widely used for protein engineering. However, it is limited by the lack of sufficient experimental data for training an accurate model to predict the functional fitness of high-order mutants. Here, we develop SESNet, a supervised deep-learning model that predicts the fitness of protein mutants by leveraging both sequence and structure information with an attention mechanism. Our model integrates the local evolutionary context from homologous sequences, the global evolutionary context encoding rich semantics from the universal protein sequence space, and structure information accounting for the microenvironment around each residue. We show that SESNet outperforms state-of-the-art models in predicting the sequence-function relationship on 26 deep mutational scanning datasets. More importantly, we propose a data augmentation strategy that pre-trains our model on data from unsupervised models. After pre-training, our model achieves strikingly high accuracy in predicting the fitness of protein mutants, especially higher-order variants (> 4 mutation sites), when fine-tuned with only a small number of experimental mutation measurements (< 50). This strategy is of great practical value, as the required experimental effort, producing a few tens of mutation measurements for a given protein, is generally affordable for an ordinary biochemistry group and applicable to almost any protein.
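The two-stage strategy can be sketched generically: pre-train the fitness head on many variants pseudo-labelled by an unsupervised scorer, then fine-tune on fewer than 50 measured mutants. The model, dimensions, and random data below are placeholders, not SESNet itself.

```python
import torch
import torch.nn as nn

# Placeholder fitness head on top of fixed per-variant embeddings.
model = nn.Sequential(nn.Linear(1280, 256), nn.ReLU(), nn.Linear(256, 1))
loss_fn = nn.MSELoss()

def fit(x, y, epochs, lr):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(x).squeeze(-1), y)
        loss.backward()
        opt.step()

# Stage 1: many variants pseudo-labelled by an unsupervised scorer
# (e.g., a zero-shot PLM likelihood); random tensors stand in here.
x_pseudo, y_pseudo = torch.randn(5000, 1280), torch.randn(5000)
fit(x_pseudo, y_pseudo, epochs=50, lr=1e-3)

# Stage 2: fine-tune on a handful of real measurements (< 50).
x_exp, y_exp = torch.randn(40, 1280), torch.randn(40)
fit(x_exp, y_exp, epochs=200, lr=1e-4)
```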
