Search | BVS Regional Portal

Leveraging protein language model embeddings and logistic regression for efficient and accurate in-silico acidophilic proteins classification.

Susanty, Meredita; Mursalim, Muhammad Khaerul Naim; Hertadi, Rukman; Purwarianti, Ayu; LE Rajab, Tati.

Comput Biol Chem ; 112: 108163, 2024 Oct.

Article in English | MEDLINE | ID: mdl-39098138

ABSTRACT

The increasing demand for eco-friendly technologies in biotechnology necessitates effective and sustainable catalysts. Acidophilic proteins, which function optimally in highly acidic environments, hold immense promise for various applications, including food production, biofuels, and bioremediation. However, limited knowledge about these proteins hinders their exploration. This study addresses this gap by employing in silico methods that use computational tools and machine learning. We propose a novel approach to predict acidophilic proteins using protein language models (PLMs), accelerating discovery without extensive lab work. Our investigation highlights the potential of PLMs in understanding and harnessing acidophilic proteins for scientific and industrial advancements. We introduce the ACE model, which combines a simple Logistic Regression model with embeddings derived from protein sequences processed by the ProtT5 PLM. This model achieves high performance on an independent test set, with an accuracy of 0.91, an F1-score of 0.93, and a Matthews correlation coefficient of 0.76. To our knowledge, this is the first application of pre-trained PLM embeddings for acidophilic protein classification. The ACE model serves as a powerful tool for exploring protein acidophilicity, paving the way for future advancements in protein design and engineering.
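The idea of feeding fixed-length sequence embeddings into a logistic regression classifier can be illustrated with a minimal sketch. This is not the authors' ACE implementation: the random 8-dimensional vectors below are hypothetical stand-ins for ProtT5 embeddings (which are much higher-dimensional), and the names `train_logistic_regression`, `predict`, and `make_toy_embeddings` are invented for this illustration.

```python
import math
import random


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_logistic_regression(X, y, lr=0.1, epochs=200):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, n_features = len(X), len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # derivative of the log loss w.r.t. the logit
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(X, w, b, threshold=0.5):
    """Return 1 (e.g. acidophilic) or 0 for each embedding vector."""
    return [
        int(sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= threshold)
        for xi in X
    ]


# HYPOTHETICAL toy data: two Gaussian clusters standing in for the
# embeddings of the positive and negative protein classes.
random.seed(0)


def make_toy_embeddings(n, center, dim=8):
    return [[random.gauss(center, 1.0) for _ in range(dim)] for _ in range(n)]


X = make_toy_embeddings(50, 1.0) + make_toy_embeddings(50, -1.0)
y = [1] * 50 + [0] * 50

w, b = train_logistic_regression(X, y)
preds = predict(X, w, b)
train_accuracy = sum(p == t for p, t in zip(preds, y)) / len(y)
print(f"training accuracy on toy data: {train_accuracy:.2f}")
```

In practice the embeddings would come from a pre-trained PLM and the classifier would be evaluated on a held-out set rather than on its own training data, as the abstract describes.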

Subjects

Proteins; Proteins/chemistry; Logistic Models; Machine Learning; Computer Simulation

Classifying alkaliphilic proteins using embeddings from protein language model.

Susanty, Meredita; Naim Mursalim, Muhammad Khaerul; Hertadi, Rukman; Purwarianti, Ayu; Rajab, Tati LE.

Comput Biol Med ; 173: 108385, 2024 May.

Article in English | MEDLINE | ID: mdl-38547659

ABSTRACT

Alkaliphilic proteins have great potential as biocatalysts in biotechnology, especially for enzyme engineering. Extensive research has focused on exploring the enzymatic potential of alkaliphiles and characterizing alkaliphilic proteins. However, the current method for identifying these proteins requires wet-lab experiments and is time-consuming, labor-intensive, and expensive. Therefore, the development of a computational method for alkaliphilic protein identification would be invaluable for protein engineering and design. In this study, we present a novel approach that uses embeddings from a protein language model called ESM-2(3B) in a deep learning framework to classify alkaliphilic and non-alkaliphilic proteins. To our knowledge, this is the first attempt to employ embeddings from a pre-trained protein language model to classify alkaliphilic proteins. A reliable dataset comprising 1,002 alkaliphilic and 1,866 non-alkaliphilic proteins was constructed for training and testing the proposed model. The proposed model, dubbed ALPACA, achieves performance scores of 0.88, 0.84, and 0.75 for accuracy, F1-score, and Matthews correlation coefficient, respectively, on an independent dataset. ALPACA is likely to serve as a valuable resource for exploring protein alkalinity and its role in protein design and engineering.
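The three reported metrics (accuracy, F1-score, and Matthews correlation coefficient) are all functions of the binary confusion matrix. The sketch below shows the standard definitions on a small made-up label set; the labels are hypothetical and do not reproduce the paper's results, and the function names are chosen only for this illustration.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)


def f1_score(tp, tn, fp, fn):
    # Harmonic mean of precision and recall; tn is unused by definition.
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp, tn, fp, fn):
    # Matthews correlation coefficient; defined as 0 when any marginal is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# HYPOTHETICAL labels: 4 positives, 6 negatives, with one error of each kind.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

counts = confusion_counts(y_true, y_pred)
print("accuracy:", round(accuracy(*counts), 3))
print("F1-score:", round(f1_score(*counts), 3))
print("MCC:", round(mcc(*counts), 3))
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, which makes it informative on imbalanced datasets such as one with roughly twice as many negatives as positives.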

Subjects

Camelids, New World; Animals; Proteins; Language

BiCaps-DBP: Predicting DNA-binding proteins from protein sequences using Bi-LSTM and a 1D-capsule network.

Mursalim, Muhammad K N; Mengko, Tati L E R; Hertadi, Rukman; Purwarianti, Ayu; Susanty, Meredita.

Comput Biol Med ; 163: 107241, 2023 Sep.

Article in English | MEDLINE | ID: mdl-37437362

ABSTRACT

Predicting DNA-binding proteins (DBPs) based solely on primary sequences is one of the most challenging problems in genome annotation. DBPs play a crucial role in various biological processes, including DNA replication, transcription, repair, and splicing. Some DBPs are essential in pharmaceutical research on various human cancers and autoimmune diseases. Existing experimental methods for identifying DBPs are time-consuming and costly. Therefore, developing a rapid and accurate computational technique is necessary to address the issue. This study introduces BiCaps-DBP, a deep learning-based method that improves DBP prediction performance by combining bidirectional long short-term memory with a 1D-capsule network. This study uses three training and independent datasets to evaluate the proposed model's generalizability and robustness. Based on three independent datasets, BiCaps-DBP achieved 1.05%, 5.79% and 0.40% higher accuracies than an existing predictor for PDB2272, PDB186 and PDB20000, respectively. These outcomes indicate that the proposed method is a promising DBP predictor.

Subjects

DNA-Binding Proteins; Genome; Humans; DNA-Binding Proteins/genetics; DNA-Binding Proteins/metabolism; Amino Acid Sequence

Temporal convolutional network for a Fast DNA mutation detection in breast cancer data.

Wisesty, Untari Novia; Mengko, Tati Rajab; Purwarianti, Ayu; Pancoro, Adi.

PLoS One ; 18(5): e0285981, 2023.

Article in English | MEDLINE | ID: mdl-37228159

ABSTRACT

Early detection of breast cancer can be achieved through mutation detection in DNA sequences, which can be obtained from patient blood samples. Mutation detection can be performed using alignment and machine learning techniques. However, alignment techniques require reference sequences, and machine learning techniques still cannot predict the mutation index and require supporting tools. Therefore, in this research, a Temporal Convolutional Network (TCN) model is proposed to detect the mutation type and index faster, without reference sequences or supporting tools. The architecture of the proposed TCN model is specifically designed for sequential labeling tasks on DNA sequence data. This allows the mutation type of each nucleotide in the sequence to be detected and, if the nucleotide has a mutation, its index to be obtained. The proposed model also uses 2-mers and 3-mers mapping techniques to improve detection performance. In the tests carried out, the proposed TCN model achieves a highest F1-score of 0.9443 on the COSMIC dataset and 0.9629 on the RSCM dataset. Additionally, the proposed TCN model can detect the mutation index six times faster than a BiLSTM model. Furthermore, the proposed model can detect mutation types and indices based on the patient's DNA sequence, without the need for reference sequences or other additional tools.
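A k-mer mapping of the kind mentioned in the abstract can be sketched as follows. The abstract does not say how the 2-mers and 3-mers are built or indexed, so the overlapping stride-1 windows, the lexicographic integer ids, and the `-1` code for windows containing non-ACGT symbols are all assumptions made for illustration; `build_kmer_vocab` and `kmer_encode` are hypothetical names.

```python
from itertools import product


def build_kmer_vocab(k, alphabet="ACGT"):
    """Map every possible k-mer over the alphabet to an integer id."""
    return {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}


def kmer_encode(seq, k, vocab):
    """Encode a DNA string as ids of its overlapping k-mers (stride 1).

    Windows containing symbols outside the vocabulary (e.g. 'N') map to -1.
    """
    seq = seq.upper()
    return [vocab.get(seq[i:i + k], -1) for i in range(len(seq) - k + 1)]


vocab2 = build_kmer_vocab(2)  # 16 possible 2-mers
vocab3 = build_kmer_vocab(3)  # 64 possible 3-mers

dna = "ACGT"
print(kmer_encode(dna, 2, vocab2))  # ids for AC, CG, GT
print(kmer_encode(dna, 3, vocab3))  # ids for ACG, CGT
```

Each position in the encoded sequence then carries local context from its neighbors, which is one plausible reason such a mapping could help a per-nucleotide sequence labeling model.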

Subjects

Breast Neoplasms; Humans; Female; Breast Neoplasms/diagnosis; Breast Neoplasms/genetics; DNA; Machine Learning; Mutation; Nucleotides
