Pesquisa | BVS - MINISTÉRIO DA SAÚDE

1.

MCNN_MC: Computational Prediction of Mitochondrial Carriers and Investigation of Bongkrekic Acid Toxicity Using Protein Language Models and Convolutional Neural Networks.

Malik, Muhammad Shahid; Chang, Yan-Yun; Liu, Yu-Chen; Le, Van The; Ou, Yu-Yen.

J Chem Inf Model ; 2024 Aug 12.

Artigo em Inglês | MEDLINE | ID: mdl-39133248

RESUMO

Mitochondrial carriers (MCs) are essential proteins that transport metabolites across mitochondrial membranes and play a critical role in cellular metabolism. ADP/ATP (adenosine diphosphate/adenosine triphosphate) is one of the most important carriers as it contributes to cellular energy production and is susceptible to the powerful toxin bongkrekic acid. This toxin has claimed several lives; for example, a recent foodborne outbreak in Taipei, Taiwan, has caused four deaths and sickened 30 people. The issue of bongkrekic acid poisoning has been a long-standing problem in Indonesia, with reports as early as 1895 detailing numerous deaths from contaminated coconut fermented cakes. In bioinformatics, significant advances have been made in understanding biological processes through computational methods; however, no established computational method has been developed for identifying mitochondrial carriers. We propose a computational bioinformatics approach for predicting MCs from a broader class of secondary active transporters with a focus on the ADP/ATP carrier and its interaction with bongkrekic acid. The proposed model combines protein language models (PLMs) with multiwindow scanning convolutional neural networks (mCNNs). While PLM embeddings capture contextual information within proteins, mCNN scans multiple windows to identify potential binding sites and extract local features. Our results show 96.66% sensitivity, 95.76% specificity, 96.12% accuracy, 91.83% Matthews correlation coefficient (MCC), 94.63% F1-Score, and 98.55% area under the curve (AUC). The results demonstrate the effectiveness of the proposed approach in predicting MCs and elucidating their functions, particularly in the context of bongkrekic acid toxicity. This study presents a valuable approach for identifying novel mitochondrial complexes, characterizing their functional roles, and understanding mitochondrial toxicology mechanisms. Our findings, that utilize computational methods to improve our understanding of cellular processes and drug-target interactions, contribute to the development of therapeutic strategies for mitochondrial disorders, reducing the devastating effects of bongkrekic acid poisoning.

2.

Deciphering the Language of Protein-DNA Interactions: A Deep Learning Approach Combining Contextual Embeddings and Multi-Scale Sequence Modeling.

Liu, Yu-Chen; Lin, Yi-Jing; Chang, Yan-Yun; Chuang, Cheng-Che; Ou, Yu-Yen.

J Mol Biol ; 436(22): 168769, 2024 Aug 29.

Artigo em Inglês | MEDLINE | ID: mdl-39214282

RESUMO

Deciphering the mechanisms governing protein-DNA interactions is crucial for understanding key cellular processes and disease pathways. In this work, we present a powerful deep learning approach that significantly advances the computational prediction of DNA-interacting residues from protein sequences. Our method leverages the rich contextual representations learned by pre-trained protein language models, such as ProtTrans, to capture intrinsic biochemical properties and sequence motifs indicative of DNA binding sites. We then integrate these contextual embeddings with a multi-window convolutional neural network architecture, which scans across the sequence at varying window sizes to effectively identify both local and global binding patterns. Comprehensive evaluation on curated benchmark datasets demonstrates the remarkable performance of our approach, achieving an area under the ROC curve (AUC) of 0.89 - a substantial improvement over previous state-of-the-art sequence-based predictors. This showcases the immense potential of pairing advanced representation learning and deep neural network designs for uncovering the complex syntax governing protein-DNA interactions directly from primary sequences. Our work not only provides a robust computational tool for characterizing DNA-binding mechanisms, but also highlights the transformative opportunities at the intersection of language modeling, deep learning, and protein sequence analysis. The publicly available code and data further facilitate broader adoption and continued development of these techniques for accelerating mechanistic insights into vital biological processes and disease pathways. In addition, the code and data for this work are available at https://github.com/B1607/DIRP.

3.

ProtTrans and multi-window scanning convolutional neural networks for the prediction of protein-peptide interaction sites.

Le, Van-The; Zhan, Zi-Jun; Vu, Thi-Thu-Phuong; Malik, Muhammad-Shahid; Ou, Yu-Yen.

J Mol Graph Model ; 130: 108777, 2024 Jul.

Artigo em Inglês | MEDLINE | ID: mdl-38642500

RESUMO

This study delves into the prediction of protein-peptide interactions using advanced machine learning techniques, comparing models such as sequence-based, standard CNNs, and traditional classifiers. Leveraging pre-trained language models and multi-view window scanning CNNs, our approach yields significant improvements, with ProtTrans standing out based on 2.1 billion protein sequences and 393 billion amino acids. The integrated model demonstrates remarkable performance, achieving an AUC of 0.856 and 0.823 on the PepBCL Set_1 and Set_2 datasets, respectively. Additionally, it attains a Precision of 0.564 in PepBCL Set 1 and 0.527 in PepBCL Set 2, surpassing the performance of previous methods. Beyond this, we explore the application of this model in cancer therapy, particularly in identifying peptide interactions for selective targeting of cancer cells, and other fields. The findings of this study contribute to bioinformatics, providing valuable insights for drug discovery and therapeutic development.

Assuntos

Biologia Computacional , Redes Neurais de Computação , Peptídeos , Proteínas , Peptídeos/química , Proteínas/química , Biologia Computacional/métodos , Humanos , Aprendizado de Máquina , Ligação Proteica , Sítios de Ligação , Algoritmos , Bases de Dados de Proteínas

4.

DeepPLM_mCNN: An approach for enhancing ion channel and ion transporter recognition by multi-window CNN based on features from pre-trained language models.

Le, Van-The; Malik, Muhammad-Shahid; Tseng, Yi-Hsuan; Lee, Yu-Cheng; Huang, Cheng-I; Ou, Yu-Yen.

Comput Biol Chem ; 110: 108055, 2024 Jun.

Artigo em Inglês | MEDLINE | ID: mdl-38555810

RESUMO

Accurate classification of membrane proteins like ion channels and transporters is critical for elucidating cellular processes and drug development. We present DeepPLM_mCNN, a novel framework combining Pretrained Language Models (PLMs) and multi-window convolutional neural networks (mCNNs) for effective classification of membrane proteins into ion channels and ion transporters. Our approach extracts informative features from protein sequences by utilizing various PLMs, including TAPE, ProtT5_XL_U50, ESM-1b, ESM-2_480, and ESM-2_1280. These PLM-derived features are then input into a mCNN architecture to learn conserved motifs important for classification. When evaluated on ion transporters, our best performing model utilizing ProtT5 achieved 90% sensitivity, 95.8% specificity, and 95.4% overall accuracy. For ion channels, we obtained 88.3% sensitivity, 95.7% specificity, and 95.2% overall accuracy using ESM-1b features. Our proposed DeepPLM_mCNN framework demonstrates significant improvements over previous methods on unseen test data. This study illustrates the potential of combining PLMs and deep learning for accurate computational identification of membrane proteins from sequence data alone. Our findings have important implications for membrane protein research and drug development targeting ion channels and transporters. The data and source codes in this study are publicly available at the following link: https://github.com/s1129108/DeepPLM_mCNN.

Assuntos

Canais Iônicos , Redes Neurais de Computação , Canais Iônicos/metabolismo , Canais Iônicos/química , Aprendizado Profundo , Transporte de Íons

5.

Integrating Pre-Trained protein language model and multiple window scanning deep learning networks for accurate identification of secondary active transporters in membrane proteins.

Shahid Malik, Muhammad; Ou, Yu-Yen.

Methods ; 220: 11-20, 2023 12.

Artigo em Inglês | MEDLINE | ID: mdl-37871661

RESUMO

Secondary active transporters play pivotal roles in regulating ion and molecule transport across cell membranes, with implications in diseases like cancer. However, studying transporters via biochemical experiments poses challenges. We propose an effective computational approach to identify secondary active transporters from membrane protein sequences using pre-trained language models and deep learning neural networks. Our dataset comprised 290 secondary active transporters and 5,420 other membrane proteins from UniProt. Three types of features were extracted - one-hot encodings, position-specific scoring matrix profiles, and contextual embeddings from the ProtTrans language model. A multi-window convolutional neural network architecture scanned the ProtTrans embeddings using varying window sizes to capture multi-scale sequence patterns. The proposed model combining ProtTrans embeddings and multi-window convolutional neural networks achieved 86% sensitivity, 99% specificity and 98% overall accuracy in identifying secondary active transporters, outperforming conventional machine learning approaches. This work demonstrates the promise of integrating pre-trained language models like ProtTrans with multi-scale deep neural networks to effectively interpret transporter sequences for functional analysis. Our approach enables more accurate computational identification of secondary active transporters, advancing membrane protein research.

Assuntos

Aprendizado Profundo , Proteínas de Membrana , Redes Neurais de Computação , Aprendizado de Máquina , Sequência de Aminoácidos

6.

Recent advances in features generation for membrane protein sequences: From multiple sequence alignment to pre-trained language models.

Ou, Yu-Yen; Ho, Quang-Thai; Chang, Heng-Ta.

Proteomics ; 23(23-24): e2200494, 2023 Dec.

Artigo em Inglês | MEDLINE | ID: mdl-37863817

RESUMO

Membrane proteins play a crucial role in various cellular processes and are essential components of cell membranes. Computational methods have emerged as a powerful tool for studying membrane proteins due to their complex structures and properties that make them difficult to analyze experimentally. Traditional features for protein sequence analysis based on amino acid types, composition, and pair composition have limitations in capturing higher-order sequence patterns. Recently, multiple sequence alignment (MSA) and pre-trained language models (PLMs) have been used to generate features from protein sequences. However, the significant computational resources required for MSA-based features generation can be a major bottleneck for many applications. Several methods and tools have been developed to accelerate the generation of MSAs and reduce their computational cost, including heuristics and approximate algorithms. Additionally, the use of PLMs such as BERT has shown great potential in generating informative embeddings for protein sequence analysis. In this review, we provide an overview of traditional and more recent methods for generating features from protein sequences, with a particular focus on MSAs and PLMs. We highlight the advantages and limitations of these approaches and discuss the methods and tools developed to address the computational challenges associated with features generation. Overall, the advancements in computational methods and tools provide a promising avenue for gaining deeper insights into the function and properties of membrane proteins, which can have significant implications in drug discovery and personalized medicine.

Assuntos

Algoritmos , Proteínas de Membrana , Animais , Cavalos , Alinhamento de Sequência , Sequência de Aminoácidos , Análise de Sequência de Proteína , Biologia Computacional/métodos

7.

Disto-TRP: An approach for identifying transient receptor potential (TRP) channels using structural information generated by AlphaFold.

Muazzam Ali Shah, Syed; Ou, Yu-Yen.

Gene ; 871: 147435, 2023 Jun 30.

Artigo em Inglês | MEDLINE | ID: mdl-37075925

RESUMO

The ability to predict 3D protein structures computationally has significantly advanced biological research. The AlphaFold protein structure database, developed by DeepMind, has provided a wealth of predicted protein structures and has the potential to bring about revolutionary changes in the field of life sciences. However, directly determining the function of proteins from their structures remains a challenging task. The Distogram from AlphaFold is used in this study as a novel feature set to identify transient receptor potential (TRP) channels. Distograms feature vectors and pre-trained language model (BERT) features were combined to improve prediction performance for transient receptor potential (TRP) channels. The method proposed in this study demonstrated promising performance on many evaluation metrics. For five-fold cross-validation, the method achieved a Sensitivity (SN) of 87.00%, Specificity (SP) of 93.61%, Accuracy (ACC) of 93.39%, and a Matthews correlation coefficient (MCC) of 0.52. Additionally, on an independent dataset, the method obtained 100.00% SN, 95.54% SP, 95.73% ACC, and an MCC of 0.69. The results demonstrate the potential for using structural information to predict protein function. In the future, it is hoped that such structural information will be incorporated into artificial intelligence networks to explore more useful and valuable functional information in the biological field.

Assuntos

Canais de Potencial de Receptor Transitório , Canais de Potencial de Receptor Transitório/química , Canais de Potencial de Receptor Transitório/metabolismo , Inteligência Artificial , Bases de Dados de Proteínas

8.

MFPS_CNN: Multi-filter Pattern Scanning from Position-specific Scoring Matrix with Convolutional Neural Network for Efficient Prediction of Ion Transporters.

Nguyen, Trinh-Trung-Duong; Ho, Quang-Thai; Tarn, Yu-Chun; Ou, Yu-Yen.

Mol Inform ; 41(9): e2100271, 2022 09.

Artigo em Inglês | MEDLINE | ID: mdl-35322557

RESUMO

In cellular transportation mechanisms, the movement of ions across the cell membrane and its proper control are important for cells, especially for life processes. Ion transporters/pumps and ion channel proteins work as border guards controlling the incessant traffic of ions across cell membranes. We revisited the study of classification of transporters and ion channels from membrane proteins with a more efficient deep learning approach. Specifically, we applied multi-window scanning filters of convolutional neural networks on almost full-length position-specific scoring matrices for extracting useful information. In this way, we were able to retain important evolutionary information of the proteins. Our experiment results show that a convolutional neural network with a minimum number of convolutional layers can be enough to extract the conserved information of proteins which leads to higher performance. Our best prediction models were obtained after examining different data imbalanced handling techniques, and different protein encoding methods. We also showed that our models were superior to traditional deep learning approaches on the same datasets as well as other machine learning classification algorithms.

Assuntos

Algoritmos , Redes Neurais de Computação , Íons , Proteínas de Membrana , Matrizes de Pontuação de Posição Específica

9.

Using multiple convolutional window scanning of convolutional neural network for an efficient prediction of ATP-binding sites in transport proteins.

Nguyen, Trinh-Trung-Duong; Chen, Syun; Ho, Quang-Thai; Ou, Yu-Yen.

Proteins ; 90(7): 1486-1492, 2022 07.

Artigo em Inglês | MEDLINE | ID: mdl-35246878

RESUMO

Protein multiple sequence alignment information has long been important features to know about functions of proteins inferred from related sequences with known functions. It is therefore one of the underlying ideas of Alpha fold 2, a breakthrough study and model for the prediction of three-dimensional structures of proteins from their primary sequence. Our study used protein multiple sequence alignment information in the form of position-specific scoring matrices as input. We also refined the use of a convolutional neural network, a well-known deep-learning architecture with impressive achievement on image and image-like data. Specifically, we revisited the study of prediction of adenosine triphosphate (ATP)-binding sites with more efficient convolutional neural networks. We applied multiple convolutional window scanning filters of a convolutional neural network on position-specific scoring matrices for as much as useful information as possible. Furthermore, only the most specific motifs are retained at each feature map output through the one-max pooling layer before going to the next layer. We assumed that this way could help us retain the most conserved motifs which are discriminative information for prediction. Our experiment results show that a convolutional neural network with not too many convolutional layers can be enough to extract the conserved information of proteins, which leads to higher performance. Our best prediction models were obtained after examining them with different hyper-parameters. Our experiment results showed that our models were superior to traditional use of convolutional neural networks on the same datasets as well as other machine-learning classification algorithms.

Assuntos

Trifosfato de Adenosina , Proteínas de Transporte , Algoritmos , Sítios de Ligação , Aprendizado de Máquina , Redes Neurais de Computação , Proteínas/química

10.

An Extensive Examination of Discovering 5-Methylcytosine Sites in Genome-Wide DNA Promoters Using Machine Learning Based Approaches.

Nguyen, Trinh-Trung-Duong; Tran, The-Anh; Le, Nguyen-Quoc-Khanh; Pham, Dinh-Minh; Ou, Yu-Yen.

IEEE/ACM Trans Comput Biol Bioinform ; 19(1): 87-94, 2022.

Artigo em Inglês | MEDLINE | ID: mdl-34014828

RESUMO

It is well-known that the major reason for the rapid proliferation of cancer cells are the hypomethylation of the whole cancer genome and the hypermethylation of the promoter of particular tumor suppressor genes. Locating 5-methylcytosine (5mC) sites in promoters is therefore a crucial step in further understanding of the relationship between promoter methylation and the regulation of mRNA gene expression. High throughput identification of DNA 5mC in wet lab is still time-consuming and labor-extensive. Thus, finding the 5mC site of genome-wide DNA promoters is still an important task. We compared the effectiveness of the most popular and strong machine learning techniques namely XGBoost, Random Forest, Deep Forest, and Deep Feedforward Neural Network in predicting the 5mC sites of genome-wide DNA promoters. A feature extraction method based on k-mers embeddings learned from a language model were also applied. Overall, the performance of all the surveyed models surpassed deep learning models of the latest studies on the same dataset employing other encoding scheme. Furthermore, the best model achieved AUC scores of 0.962 on both cross-validation and independent test data. We concluded that our approach was efficient for identifying 5mC sites of promoters with high performance.

Assuntos

5-Metilcitosina , Aprendizado de Máquina , DNA , Metilação de DNA/genética , Regiões Promotoras Genéticas/genética

11.

mCNN-ETC: identifying electron transporters and their functional families by using multiple windows scanning techniques in convolutional neural networks with evolutionary information of protein sequences.

Ho, Quang-Thai; Le, Nguyen Quoc Khanh; Ou, Yu-Yen.

Brief Bioinform ; 23(1)2022 01 17.

Artigo em Inglês | MEDLINE | ID: mdl-34472594

RESUMO

In the past decade, convolutional neural networks (CNNs) have been used as powerful tools by scientists to solve visual data tasks. However, many efforts of convolutional neural networks in solving protein function prediction and extracting useful information from protein sequences have certain limitations. In this research, we propose a new method to improve the weaknesses of the previous method. mCNN-ETC is a deep learning model which can transform the protein evolutionary information into image-like data composed of 20 channels, which correspond to the 20 amino acids in the protein sequence. We constructed CNN layers with different scanning windows in parallel to enhance the useful pattern detection ability of the proposed model. Then we filtered specific patterns through the 1-max pooling layer before inputting them into the prediction layer. This research attempts to solve a basic problem in biology in terms of application: predicting electron transporters and classifying their corresponding complexes. The performance result reached an accuracy of 97.41%, which was nearly 6% higher than its predecessor. We have also published a web server on http://bio219.bioinfo.yzu.edu.tw, which can be used for research purposes free of charge.

Assuntos

Elétrons , Redes Neurais de Computação , Sequência de Aminoácidos , Evolução Biológica , Humanos , Proteínas/química

12.

Use Chou's 5-Steps Rule With Different Word Embedding Types to Boost Performance of Electron Transport Protein Prediction Model.

Nguyen, Trinh-Trung-Duong; Ho, Quang-Thai; Le, Nguyen-Quoc-Khanh; Phan, Van-Dinh; Ou, Yu-Yen.

IEEE/ACM Trans Comput Biol Bioinform ; 19(2): 1235-1244, 2022.

Artigo em Inglês | MEDLINE | ID: mdl-32750894

RESUMO

Living organisms receive necessary energy substances directly from cellular respiration. The completion of electron storage and transportation requires the process of cellular respiration with the aid of electron transport chains. Therefore, the work of deciphering electron transport proteins is inevitably needed. The identification of these proteins with high performance has a prompt dependence on the choice of methods for feature extraction and machine learning algorithm. In this study, protein sequences served as natural language sentences comprising words. The nominated word embedding-based feature sets, hinged on the word embedding modulation and protein motif frequencies, were useful for feature choosing. Five word embedding types and a variety of conjoint features were examined for such feature selection. The support vector machine algorithm consequentially was employed to perform classification. The performance statistics within the 5-fold cross-validation including average accuracy, specificity, sensitivity, as well as MCC rates surpass 0.95. Such metrics in the independent test are 96.82, 97.16, 95.76 percent, and 0.9, respectively. Compared to state-of-the-art predictors, the proposed method can generate more preferable performance above all metrics indicating the effectiveness of the proposed method in determining electron transport proteins. Furthermore, this study reveals insights about the applicability of various word embeddings for understanding surveyed sequences.

Assuntos

Proteínas de Transporte , Biologia Computacional , Biologia Computacional/métodos , Transporte de Elétrons , Elétrons , Máquina de Vetores de Suporte

13.

Using k-mer embeddings learned from a Skip-gram based neural network for building a cross-species DNA N6-methyladenine site prediction model.

Nguyen, Trinh Trung Duong; Trinh, Van Ngu; Le, Nguyen Quoc Khanh; Ou, Yu-Yen.

Plant Mol Biol ; 107(6): 533-542, 2021 Dec.

Artigo em Inglês | MEDLINE | ID: mdl-34843033

RESUMO

KEY MESSAGE: This study used k-mer embeddings as effective feature to identify DNA N6-Methyladenine sites in plant genomes and obtained improved performance without substantial effort in feature extraction, combination and selection. Identification of DNA N6-methyladenine sites has been a very active topic of computational biology due to the unavailability of suitable methods to identify them accurately, especially in plants. Substantial results were obtained with a great effort put in extracting, heuristic searching, or fusing a diverse types of features, not to mention a feature selection step. In this study, we regarded DNA sequences as textual information and employed natural language processing techniques to decipher hidden biological meanings from those sequences. In other words, we considered DNA, the human life book, as a book corpus for training DNA language models. K-mer embeddings then were generated from these language models to be used in machine learning prediction models. Skip-gram neural networks were the base of the language models and ensemble tree-based algorithms were the machine learning algorithms for prediction models. We trained the prediction model on Rosaceae genome dataset and performed a comprehensive test on 3 plant genome datasets. Our proposed method shows promising performance with AUC performance approaching an ideal value on Rosaceae dataset (0.99), a high score on Rice dataset (0.95) and improved performance on Rice dataset while enjoying an elegant, yet efficient feature extraction process.

Assuntos

Adenina/análogos & derivados , Algoritmos , Modelos Biológicos , Redes Neurais de Computação , Adenina/metabolismo , Sequência de Bases , DNA de Plantas/genética , Bases de Dados Genéticas , Nucleotídeos/genética , Plantas/genética , Curva ROC , Inquéritos e Questionários

14.

Identification of efflux proteins based on contextual representations with deep bidirectional transformer encoders.

Taju, Semmy Wellem; Shah, Syed Muazzam Ali; Ou, Yu-Yen.

Anal Biochem ; 633: 114416, 2021 11 15.

Artigo em Inglês | MEDLINE | ID: mdl-34656612

RESUMO

Efflux proteins are the transport proteins expressed in the plasma membrane, which are involved in the movement of unwanted toxic substances through specific efflux pumps. Several studies based on computational approaches have been proposed to predict transport proteins and thereby to understand the mechanism of the movement of ions across cell membranes. However, few methods were developed to identify efflux proteins. This paper presents an approach based on the contextualized word embeddings from Bidirectional Encoder Representations from Transformers (BERT) with the Support Vector Machine (SVM) classifier. BERT is the most effective pre-trained language model that performs exceptionally well on several Natural Language Processing (NLP) tasks. Therefore, the contextualized representations from BERT were implemented to incorporate multiple interpretations of identical amino acids in the sequence. A dataset of efflux proteins with annotations was first established. The feature vectors were extracted by transferring protein data through the hidden layers of the pre-trained model. Our proposed method was trained on complete training datasets to identify efflux proteins and achieved the accuracies of 94.15% and 87.13% in the independent tests on membrane and transport datasets, respectively. This study opens a research avenue for the implementation of contextualized word embeddings in Bioinformatics and Computational Biology.

Assuntos

Proteínas de Transporte/análise , Biologia Computacional , Processamento de Linguagem Natural , Máquina de Vetores de Suporte

15.

TRP-BERT: Discrimination of transient receptor potential (TRP) channels using contextual representations from deep bidirectional transformer based on BERT.

Ali Shah, Syed Muazzam; Ou, Yu-Yen.

Comput Biol Med ; 137: 104821, 2021 10.

Artigo em Inglês | MEDLINE | ID: mdl-34508974

RESUMO

Transient receptor potential (TRP) channels are non-selective cation channels that act as ion channels and are primarily found on the plasma membrane of numerous animal cells. These channels are involved in the physiology and pathophysiology of a wide variety of biological processes, including inhibition and progression of cancer, pain initiation, inflammation, regulation of pressure, thermoregulation, secretion of salivary fluid, and homeostasis of Ca2+ and Mg2+. Increasing evidences indicate that mutations in the gene encoding TRP channels play an essential role in a broad array of diseases. Therefore, these channels are becoming popular as potential drug targets for several diseases. The diversified role of these channels demands a prediction model to classify TRP channels from other channel proteins (non-TRP channels). Therefore, we presented an approach based on the Support Vector Machine (SVM) classifier and contextualized word embeddings from Bidirectional Encoder Representations from Transformers (BERT) to represent protein sequences. BERT is a deeply bidirectional language model and a neural network approach to Natural Language Processing (NLP) that achieves outstanding performance on various NLP tasks. We apply BERT to generate contextualized representations for every single amino acid in a protein sequence. Interestingly, these representations are context-sensitive and vary for the same amino acid appearing in different positions in the sequence. Our proposed method showed 80.00% sensitivity, 96.03% specificity, 95.47% accuracy, and a 0.56 Matthews correlation coefficient (MCC) for an independent test set. We suggest that our proposed method could effectively classify TRP channels from non-TRP channels and assist biologists in identifying new potential TRP channels.

Assuntos

Canais de Potencial de Receptor Transitório , Sequência de Aminoácidos , Animais , Biologia Computacional , Processamento de Linguagem Natural , Redes Neurais de Computação , Máquina de Vetores de Suporte , Canais de Potencial de Receptor Transitório/genética

16.

Addressing data imbalance problems in ligand-binding site prediction using a variational autoencoder and a convolutional neural network.

Nguyen, Trinh-Trung-Duong; Nguyen, Duc-Khanh; Ou, Yu-Yen.

Brief Bioinform ; 22(6)2021 11 05.

Artigo em Inglês | MEDLINE | ID: mdl-34322702

RESUMO

Since 2015, a fast growing number of deep learning-based methods have been proposed for protein-ligand binding site prediction and many have achieved promising performance. These methods, however, neglect the imbalanced nature of binding site prediction problems. Traditional data-based approaches for handling data imbalance employ linear interpolation of minority class samples. Such approaches may not be fully exploited by deep neural networks on downstream tasks. We present a novel technique for balancing input classes by developing a deep neural network-based variational autoencoder (VAE) that aims to learn important attributes of the minority classes concerning nonlinear combinations. After learning, the trained VAE was used to generate new minority class samples that were later added to the original data to create a balanced dataset. Finally, a convolutional neural network was used for classification, for which we assumed that the nonlinearity could be fully integrated. As a case study, we applied our method to the identification of FAD- and FMN-binding sites of electron transport proteins. Compared with the best classifiers that use traditional machine learning algorithms, our models obtained a great improvement on sensitivity while maintaining similar or higher levels of accuracy and specificity. We also demonstrate that our method is better than other data imbalance handling techniques, such as SMOTE, ADASYN, and class weight adjustment. Additionally, our models also outperform existing predictors in predicting the same binding types. Our method is general and can be applied to other data types for prediction problems with moderate-to-heavy data imbalances.

Assuntos

Redes Neurais de Computação , Algoritmos , Aprendizado Profundo , Ligantes

17.

ActTRANS: Functional classification in active transport proteins based on transfer learning and contextual representations.

Taju, Semmy Wellem; Shah, Syed Muazzam Ali; Ou, Yu-Yen.

Comput Biol Chem ; 93: 107537, 2021 Aug.

Artigo em Inglês | MEDLINE | ID: mdl-34217007

RESUMO

MOTIVATION: Primary and secondary active transport are two types of active transport that involve using energy to move the substances. Active transport mechanisms do use proteins to assist in transport and play essential roles to regulate the traffic of ions or small molecules across a cell membrane against the concentration gradient. In this study, the two main types of proteins involved in such transport are classified from transmembrane transport proteins. We propose a Support Vector Machine (SVM) with contextualized word embeddings from Bidirectional Encoder Representations from Transformers (BERT) to represent protein sequences. BERT is a powerful model in transfer learning, a deep learning language representation model developed by Google and one of the highest performing pre-trained model for Natural Language Processing (NLP) tasks. The idea of transfer learning with pre-trained model from BERT is applied to extract fixed feature vectors from the hidden layers and learn contextual relations between amino acids in the protein sequence. Therefore, the contextualized word representations of proteins are introduced to effectively model complex structures of amino acids in the sequence and the variations of these amino acids in the context. By generating context information, we capture multiple meanings for the same amino acid to reveal the importance of specific residues in the protein sequence. RESULTS: The performance of the proposed method is evaluated using five-fold cross-validation and independent test. The proposed method achieves an accuracy of 85.44 %, 88.74 % and 92.84 % for Class-1, Class-2, and Class-3, respectively. Experimental results show that this approach can outperform from other feature extraction methods using context information, effectively classify two types of active transport and improve the overall performance.

Assuntos

Proteínas de Transporte/metabolismo , Processamento de Linguagem Natural , Máquina de Vetores de Suporte , Sequência de Aminoácidos , Transporte Biológico Ativo , Proteínas de Transporte/química

18.

DeepSIRT: A deep neural network for identification of sirtuin targets and their subcellular localizations.

Shah, Syed Muazzam Ali; Taju, Semmy Wellem; Dlamini, Bongani Brian; Ou, Yu-Yen.

Comput Biol Chem ; 93: 107514, 2021 Aug.

Artigo em Inglês | MEDLINE | ID: mdl-34058657

RESUMO

Sirtuins are a family of proteins that play a key role in regulating a wide range of cellular processes including DNA regulation, metabolism, aging/longevity, cell survival, apoptosis, and stress resistance. Sirtuins are protein deacetylases and include in the class III family of histone deacetylase enzymes (HDACs). The class III HDACs contains seven members of the sirtuin family from SIRT1 to SIRT7. The seven members of the sirtuin family have various substrates and are present in nearly all subcellular localizations including the nucleus, cytoplasm, and mitochondria. In this study, a deep neural network approach using one-dimensional Convolutional Neural Networks (CNN) was proposed to build a prediction model that can accurately identify the outcome of the sirtuin protein by targeting their subcellular localizations. Therefore, the function and localization of sirtuin targets were analyzed and annotated to compartmentalize into distinct subcellular localizations. We further reduced the sequence similarity between protein sequences and three feature extraction methods were applied in datasets. Finally, the proposed method has been tested and compared with various machine-learning algorithms. The proposed method is validated on two independent datasets and showed an average of up to 85.77 % sensitivity, 97.32 % specificity, and 0.82 MCC for seven members of the sirtuin family of proteins.

Assuntos

Aprendizado Profundo , Redes Neurais de Computação , Sirtuínas/análise , Humanos

19.

A transformer architecture based on BERT and 2D convolutional neural network to identify DNA enhancers from sequence information.

Le, Nguyen Quoc Khanh; Ho, Quang-Thai; Nguyen, Trinh-Trung-Duong; Ou, Yu-Yen.

Brief Bioinform ; 22(5)2021 09 02.

Artigo em Inglês | MEDLINE | ID: mdl-33539511

RESUMO

Recently, language representation models have drawn a lot of attention in the natural language processing field due to their remarkable results. Among them, bidirectional encoder representations from transformers (BERT) has proven to be a simple, yet powerful language model that achieved novel state-of-the-art performance. BERT adopted the concept of contextualized word embedding to capture the semantics and context of the words in which they appeared. In this study, we present a novel technique by incorporating BERT-based multilingual model in bioinformatics to represent the information of DNA sequences. We treated DNA sequences as natural sentences and then used BERT models to transform them into fixed-length numerical matrices. As a case study, we applied our method to DNA enhancer prediction, which is a well-known and challenging problem in this field. We then observed that our BERT-based features improved more than 5-10% in terms of sensitivity, specificity, accuracy and Matthews correlation coefficient compared to the current state-of-the-art features in bioinformatics. Moreover, advanced experiments show that deep learning (as represented by 2D convolutional neural networks; CNN) holds potential in learning BERT features better than other traditional machine learning techniques. In conclusion, we suggest that BERT and 2D CNNs could open a new avenue in biological modeling using sequence information.

Assuntos

Biologia Computacional/métodos , DNA/genética , Aprendizado Profundo , Elementos Facilitadores Genéticos , Modelos Biológicos , Processamento de Linguagem Natural , Simulação por Computador , Confiabilidade dos Dados , Humanos , Multilinguismo , Semântica , Sensibilidade e Especificidade , Transcrição Gênica

20.

GT-Finder: Classify the family of glucose transporters with pre-trained BERT language models.

Ali Shah, Syed Muazzam; Taju, Semmy Wellem; Ho, Quang-Thai; Nguyen, Trinh-Trung-Duong; Ou, Yu-Yen.

Comput Biol Med ; 131: 104259, 2021 04.

Artigo em Inglês | MEDLINE | ID: mdl-33581474

RESUMO

Recently, language representation models have drawn a lot of attention in the field of natural language processing (NLP) due to their remarkable results. Among them, BERT (Bidirectional Encoder Representations from Transformers) has proven to be a simple, yet powerful language model that has achieved novel state-of-the-art performance. BERT adopted the concept of contextualized word embeddings to capture the semantics and context in which words appear. We utilized pre-trained BERT models to extract features from protein sequences for discriminating three families of glucose transporters: the major facilitator superfamily of glucose transporters (GLUTs), the sodium-glucose linked transporters (SGLTs), and the sugars will eventually be exported transporters (SWEETs). We treated protein sequences as sentences and transformed them into fixed-length meaningful vectors where a 768- or 1024-dimensional vector represents each amino acid. We observed that BERT-Base and BERT-Large models improved the performance by more than 4% in terms of average sensitivity and Matthews correlation coefficient (MCC), indicating the efficiency of this approach. We also developed a bidirectional transformer-based protein model (TransportersBERT) for comparison with existing pre-trained BERT models.

Assuntos

Proteínas Facilitadoras de Transporte de Glucose , Processamento de Linguagem Natural , Glucose , Idioma , Semântica

RESUMO

RESUMO

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

ENVIAR RESULTADO:

SELEÇÃO DE REFERÊNCIAS

DETALHE DA PESQUISA