Interpretable feature extraction and dimensionality reduction in ESM2 for protein localization prediction.
Brief Bioinform; 25(2). 2024 Jan 22.
Article in English
| MEDLINE
| ID: mdl-38279650
ABSTRACT
As the application of large language models (LLMs) has broadened into the realm of biological prediction, leveraging their capacity for self-supervised learning to create feature representations of amino acid sequences, these models have set a new benchmark in tackling downstream challenges such as subcellular localization. However, previous studies have primarily focused on either the structural design of models or differing strategies for fine-tuning, largely overlooking investigations into the nature of the features derived from LLMs. In this research, we propose different ESM2 representation extraction strategies, considering both the character type and the position within the ESM2 input sequence. Using model dimensionality reduction, predictive analysis and interpretability techniques, we illuminate potential associations between diverse feature types and specific subcellular localizations. In particular, predictions for the mitochondrion and Golgi apparatus favor segment features closer to the N-terminus, and phosphorylation site-based features can mirror phosphorylation properties. We also evaluate the prediction performance and interpretability robustness of Random Forest and Deep Neural Networks with varied feature inputs. This work offers novel insights into maximizing LLMs' utility, understanding their mechanisms, and extracting biological domain knowledge. Furthermore, we have made the code, feature extraction API, and all relevant materials available at https://github.com/yujuan-zhang/feature-representation-for-LLMs.
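The abstract describes extracting representations that depend on position within the input sequence (e.g. segments closer to the N-terminus). A minimal, hypothetical sketch of that idea is shown below: per-residue embedding vectors (here a toy matrix standing in for real ESM2 output) are mean-pooled either over the whole sequence or over only an N-terminal window. The function name and the toy data are illustrative assumptions, not the authors' actual API.

```python
# Hypothetical sketch: position-based pooling of per-residue embeddings.
# The matrix below is a toy stand-in; in the real pipeline these vectors
# would come from an ESM2 forward pass over the amino acid sequence.

def mean_pool(embeddings, start=0, end=None):
    """Average per-residue vectors over positions [start, end)."""
    window = embeddings[start:end]
    dim = len(window[0])
    return [sum(vec[d] for vec in window) / len(window) for d in range(dim)]

# Toy 4-residue sequence with 3-dimensional per-residue embeddings.
emb = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [0.0, 4.0, 1.0],
    [2.0, 2.0, 1.0],
]

full_feature = mean_pool(emb)          # whole-sequence representation
n_term_feature = mean_pool(emb, 0, 2)  # N-terminal segment (first 2 residues)

print(full_feature)    # [1.5, 2.0, 1.0]
print(n_term_feature)  # [2.0, 1.0, 1.0]
```

Either pooled vector can then be fed to a downstream classifier (the paper evaluates Random Forest and Deep Neural Networks) to compare how segment position affects localization prediction.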
Keywords
Full text:
1
Collections:
01-internacional
Database:
MEDLINE
Main subject:
Neural Networks, Computer
/
Computational Biology
Study type:
Prognostic_studies
/
Risk_factors_studies
Language:
En
Journal:
Brief Bioinform
Journal subject:
BIOLOGY
/
MEDICAL INFORMATICS
Year of publication:
2024
Document type:
Article
Country of affiliation:
China
Country of publication:
United Kingdom