Results 1 - 18 of 18
1.
Brief Bioinform ; 22(6)2021 11 05.
Article in English | MEDLINE | ID: mdl-34322702

ABSTRACT

Since 2015, a fast growing number of deep learning-based methods have been proposed for protein-ligand binding site prediction and many have achieved promising performance. These methods, however, neglect the imbalanced nature of binding site prediction problems. Traditional data-based approaches for handling data imbalance employ linear interpolation of minority class samples. Such approaches may not be fully exploited by deep neural networks on downstream tasks. We present a novel technique for balancing input classes by developing a deep neural network-based variational autoencoder (VAE) that aims to learn important attributes of the minority classes concerning nonlinear combinations. After learning, the trained VAE was used to generate new minority class samples that were later added to the original data to create a balanced dataset. Finally, a convolutional neural network was used for classification, for which we assumed that the nonlinearity could be fully integrated. As a case study, we applied our method to the identification of FAD- and FMN-binding sites of electron transport proteins. Compared with the best classifiers that use traditional machine learning algorithms, our models obtained a great improvement on sensitivity while maintaining similar or higher levels of accuracy and specificity. We also demonstrate that our method is better than other data imbalance handling techniques, such as SMOTE, ADASYN, and class weight adjustment. Additionally, our models also outperform existing predictors in predicting the same binding types. Our method is general and can be applied to other data types for prediction problems with moderate-to-heavy data imbalances.
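As a point of comparison for the abstract above, the "linear interpolation of minority class samples" used by traditional data-level methods such as SMOTE can be sketched as follows. This is a minimal illustration and not the authors' code: the function name `interpolate_minority`, the neighbour count `k`, and the toy data are all assumptions made for the example.

```python
import random

def interpolate_minority(samples, n_new, k=2, seed=0):
    """Create n_new synthetic points by linear interpolation between a
    minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(samples)
        # Rank the other samples by squared Euclidean distance to the base point.
        others = sorted(
            (s for s in samples if s is not base),
            key=lambda s: sum((a - b) ** 2 for a, b in zip(base, s)),
        )
        neighbour = rng.choice(others[:k])
        t = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([a + t * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Hypothetical minority-class feature vectors (corners of the unit square).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = interpolate_minority(minority, n_new=5)
```

Every synthetic point lies on a straight segment between two existing samples, which is the limitation the abstract contrasts with nonlinear generation by a VAE.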


Subjects
Neural Networks, Computer , Algorithms , Deep Learning , Ligands
2.
Brief Bioinform ; 22(5)2021 09 02.
Article in English | MEDLINE | ID: mdl-33539511

ABSTRACT

Recently, language representation models have drawn a lot of attention in the natural language processing field due to their remarkable results. Among them, bidirectional encoder representations from transformers (BERT) has proven to be a simple, yet powerful language model that achieved novel state-of-the-art performance. BERT adopted the concept of contextualized word embedding to capture the semantics and context of the words in which they appeared. In this study, we present a novel technique by incorporating BERT-based multilingual model in bioinformatics to represent the information of DNA sequences. We treated DNA sequences as natural sentences and then used BERT models to transform them into fixed-length numerical matrices. As a case study, we applied our method to DNA enhancer prediction, which is a well-known and challenging problem in this field. We then observed that our BERT-based features improved more than 5-10% in terms of sensitivity, specificity, accuracy and Matthews correlation coefficient compared to the current state-of-the-art features in bioinformatics. Moreover, advanced experiments show that deep learning (as represented by 2D convolutional neural networks; CNN) holds potential in learning BERT features better than other traditional machine learning techniques. In conclusion, we suggest that BERT and 2D CNNs could open a new avenue in biological modeling using sequence information.
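To make "treating DNA sequences as natural sentences" concrete, one common way to turn a sequence into word-like tokens is overlapping k-mer splitting. The abstract does not state how tokens were formed, so the k-mer choice, the function name `dna_to_sentence`, and the value `k=3` below are assumptions for illustration only.

```python
def dna_to_sentence(sequence, k=3):
    """Split a DNA sequence into overlapping k-mers that act as 'words'."""
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

# A hypothetical 8-nucleotide sequence becomes a 6-word "sentence".
tokens = dna_to_sentence("ATGCGTAC", k=3)
sentence = " ".join(tokens)
```

A sentence built this way could then be passed to any tokenizer-plus-encoder pipeline that returns one vector per token.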


Subjects
Computational Biology/methods , DNA/genetics , Deep Learning , Enhancer Elements, Genetic , Models, Biological , Natural Language Processing , Computer Simulation , Data Accuracy , Humans , Multilingualism , Semantics , Sensitivity and Specificity , Transcription, Genetic
3.
Methods ; 207: 90-96, 2022 11.
Article in English | MEDLINE | ID: mdl-36174933

ABSTRACT

Adaptor proteins (APs) are a family of proteins that aid in intracellular membrane trafficking, and their impairments or defects are closely related to various disorders. Traditional methods to identify and classify APs require time and complex techniques, which were then advanced by machine learning and computational approaches to facilitate the AP recognition task. However, most studies focused on recognizing individual members of the AP family, or on separating APs in general from non-APs, and lacked a comprehensive strategy to distinguish the complexes of AP subtypes. Herein, we propose a novel method for a new task, discriminating the AP complexes within the AP family, using an interpretable deep neural network architecture on sequence-based encoding features. This work also introduces a benchmark data set of AP complexes originating from the UniProt and GeneOntology databases. To assess the robustness of our proposed method, we compared our performance to various machine learning algorithms and feature extraction strategies. Furthermore, the interpretation of the model's prediction performance was implemented using t-distributed stochastic neighbor embedding (t-SNE), uniform manifold approximation and projection (UMAP), and SHapley Additive exPlanations (SHAP) analysis to show the distribution of AP complexes on optimal features. The promising performance of our architecture can assist scientists not only in distinguishing AP complexes but also in analyzing protein sequences in general. Moreover, we have made our work publicly available on GitHub at https://github.com/khanhlee/adaptor-dnn.


Subjects
Deep Learning , Neural Networks, Computer , Machine Learning , Algorithms , Amino Acid Sequence , Proteins
4.
Proteins ; 90(7): 1486-1492, 2022 07.
Article in English | MEDLINE | ID: mdl-35246878

ABSTRACT

Protein multiple sequence alignment information has long been an important feature for inferring the functions of proteins from related sequences with known functions. It is therefore one of the underlying ideas of AlphaFold 2, a breakthrough study and model for the prediction of three-dimensional structures of proteins from their primary sequence. Our study used protein multiple sequence alignment information in the form of position-specific scoring matrices as input. We also refined the use of a convolutional neural network, a well-known deep-learning architecture with impressive achievements on image and image-like data. Specifically, we revisited the prediction of adenosine triphosphate (ATP)-binding sites with more efficient convolutional neural networks. We applied multiple convolutional window scanning filters of a convolutional neural network to position-specific scoring matrices to capture as much useful information as possible. Furthermore, only the most specific motifs are retained at each feature map output through the one-max pooling layer before going to the next layer. We assumed that this approach could help us retain the most conserved motifs, which are discriminative information for prediction. Our experimental results show that a convolutional neural network with relatively few convolutional layers can be enough to extract the conserved information of proteins, which leads to higher performance. Our best prediction models were obtained after examining them with different hyper-parameters. Our experimental results showed that our models were superior to the traditional use of convolutional neural networks on the same datasets as well as to other machine-learning classification algorithms.
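The combination of multi-window scanning filters and one-max pooling described above can be sketched in plain Python. This is a toy illustration under stated assumptions, not the published model: the helper `conv_one_max`, the 3-column toy matrix (real PSSMs have 20 columns), and the hand-written kernels are all hypothetical.

```python
def conv_one_max(pssm, kernel):
    """Slide `kernel` (w rows x n_cols) down `pssm` (L rows x n_cols)
    and keep only the single largest response (one-max pooling)."""
    w = len(kernel)
    n_cols = len(kernel[0])
    responses = []
    for start in range(len(pssm) - w + 1):
        window = pssm[start:start + w]
        score = sum(
            window[r][c] * kernel[r][c]
            for r in range(w)
            for c in range(n_cols)
        )
        responses.append(score)
    return max(responses)

# Toy PSSM: 5 sequence positions x 3 columns.
pssm = [
    [1, 0, 0],
    [0, 2, 0],
    [0, 0, 3],
    [1, 1, 1],
    [0, 0, 0],
]
# One hypothetical filter per window size (2 and 3 positions wide).
kernels = {
    2: [[1, 0, 0], [0, 1, 0]],
    3: [[1, 0, 0], [0, 1, 0], [0, 0, 1]],
}
features = [conv_one_max(pssm, kernels[w]) for w in sorted(kernels)]
```

Each window size contributes one number per filter, so the feature vector length depends on the number of filters rather than on the sequence length.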


Subjects
Adenosine Triphosphate , Carrier Proteins , Algorithms , Binding Sites , Machine Learning , Neural Networks, Computer , Proteins/chemistry
5.
Plant Mol Biol ; 107(6): 533-542, 2021 Dec.
Article in English | MEDLINE | ID: mdl-34843033

ABSTRACT

KEY MESSAGE: This study used k-mer embeddings as effective features to identify DNA N6-methyladenine sites in plant genomes and obtained improved performance without substantial effort in feature extraction, combination and selection. Identification of DNA N6-methyladenine sites has been a very active topic of computational biology due to the unavailability of suitable methods to identify them accurately, especially in plants. Substantial results were obtained with great effort put into extracting, heuristically searching, or fusing diverse types of features, not to mention a feature selection step. In this study, we regarded DNA sequences as textual information and employed natural language processing techniques to decipher hidden biological meanings from those sequences. In other words, we considered DNA, the human life book, as a book corpus for training DNA language models. K-mer embeddings were then generated from these language models to be used in machine learning prediction models. Skip-gram neural networks were the base of the language models and ensemble tree-based algorithms were the machine learning algorithms for prediction models. We trained the prediction model on the Rosaceae genome dataset and performed a comprehensive test on 3 plant genome datasets. Our proposed method shows promising performance with AUC performance approaching an ideal value on Rosaceae dataset (0.99), a high score on Rice dataset (0.95) and improved performance on Rice dataset while enjoying an elegant, yet efficient feature extraction process.
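A skip-gram language model is trained on (center word, context word) pairs drawn from a sliding window. The sketch below shows how such pairs could be generated from a k-mer "sentence"; the function name `skipgram_pairs`, the window size, and the example k-mers are assumptions for illustration, and the actual embedding training step is omitted.

```python
def skipgram_pairs(tokens, window=1):
    """Return (center, context) pairs for every token and each neighbour
    within `window` positions on either side."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

# Hypothetical k-mer sentence derived from the sequence "ACGTA" with k=3.
kmers = ["ACG", "CGT", "GTA"]
pairs = skipgram_pairs(kmers, window=1)
```

The learned vector for each k-mer then serves as its embedding, and a sequence can be represented by combining the embeddings of its k-mers before it is passed to a tree-based classifier.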


Subjects
Adenine/analogs & derivatives , Algorithms , Models, Biological , Neural Networks, Computer , Adenine/metabolism , Base Sequence , DNA, Plant/genetics , Databases, Genetic , Nucleotides/genetics , Plants/genetics , ROC Curve , Surveys and Questionnaires
6.
Bioinformatics ; 34(18): 3111-3117, 2018 09 15.
Article in English | MEDLINE | ID: mdl-29668844

ABSTRACT

Motivation: Efflux proteins play a key role in pumping xenobiotics out of the cells. The prediction of efflux family proteins involved in the transport process of compounds is crucial for understanding family structures, functions and energy dependencies. Many methods have been proposed to classify efflux pump transporters without considering the pump specificity of efflux protein families. In other words, efflux proteins protect cells from extrusion of foreign chemicals. Moreover, almost all efflux protein families have the same structure based on the analysis of significant motifs. The motif sequences consisting of the same number of residues will have high degrees of residue similarity and thus will affect the classification process. Consequently, it is challenging but vital to recognize the structures and determine energy dependencies of efflux protein families. In order to efficiently identify efflux protein families while considering pump specificity, we developed a 2D convolutional neural network (2D CNN) model called DeepEfflux. DeepEfflux tried to capture the motifs of sequences around hidden target residues to use as hidden features of families. In addition, the 2D CNN model uses a position-specific scoring matrix (PSSM) as an input. Three different datasets, each for one family of efflux protein, were fed into DeepEfflux, and then a 5-fold cross validation approach was used to evaluate the training performance. Results: The model evaluation results show that DeepEfflux outperforms traditional machine learning algorithms. Furthermore, the accuracy of 96.02%, 94.89% and 90.34% for classes A, B and C, respectively, in the independent test results shows that our model can perform well and can be used as a reliable tool for identifying families of efflux proteins in transporters. Availability and implementation: The online version of deepefflux is available at http://deepefflux.irit.fr. The source code of deepefflux is available both on the deepefflux website and at http://140.138.155.216/deepefflux/. Supplementary information: Supplementary data are available at Bioinformatics online.
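The 5-fold cross-validation protocol mentioned in this and several other abstracts partitions the training data into five disjoint folds and uses each fold once for validation. A minimal standard-library sketch follows; the generator name `k_fold_indices`, the seed, and the sample count are assumptions for illustration, and it does not stratify by class as a production pipeline usually would.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds, round-robin.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# Ten hypothetical samples split into five folds of two samples each.
splits = list(k_fold_indices(10, k=5))
```

A model is trained on each `train` list and scored on the matching `test` list; the reported cross-validation figure is typically the average over the folds.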


Subjects
Proteins/chemistry , Algorithms , Machine Learning , Neural Networks, Computer , Position-Specific Scoring Matrices , Protein Transport , Proteins/metabolism , Software
7.
Anal Biochem ; 577: 73-81, 2019 07 15.
Article in English | MEDLINE | ID: mdl-31022378

ABSTRACT

Membrane transport proteins and their substrate specificities play crucial roles in various cellular functions. Identifying the substrate specificities of membrane transport proteins is closely related to protein-target interaction prediction, drug design, membrane recruitment, and dysregulation analysis, thus being an important problem for bioinformatics researchers. In this study, we applied word embedding approach, the main cause for natural language processing breakout in recent years, to protein sequences of transporters. We defined each protein sequence based on the word embeddings and frequencies of its biological words. The protein features were then fed into machine learning models for prediction. We also varied the lengths of protein sequence's constituent biological words to find the optimal length which generated the most discriminative feature set. Compared to four other feature types created from protein sequences, our proposed features can help prediction models yield superior performance. Our best models reach an average area under the curve of 0.96 and 0.99, respectively on the 5-fold cross validation and the independent test. With this result, our study can help biologists identify transporters based on substrate specificities as well as provides a basis for further research that enriches a field of applying natural language processing techniques in bioinformatics.


Subjects
Computational Biology/methods , Membrane Transport Proteins/chemistry , Amino Acid Sequence , Humans , Natural Language Processing , Substrate Specificity , Support Vector Machine
8.
Mol Inform ; 41(9): e2100271, 2022 09.
Article in English | MEDLINE | ID: mdl-35322557

ABSTRACT

In cellular transportation mechanisms, the movement of ions across the cell membrane and its proper control are important for cells, especially for life processes. Ion transporters/pumps and ion channel proteins work as border guards controlling the incessant traffic of ions across cell membranes. We revisited the study of classification of transporters and ion channels from membrane proteins with a more efficient deep learning approach. Specifically, we applied multi-window scanning filters of convolutional neural networks on almost full-length position-specific scoring matrices for extracting useful information. In this way, we were able to retain important evolutionary information of the proteins. Our experiment results show that a convolutional neural network with a minimum number of convolutional layers can be enough to extract the conserved information of proteins which leads to higher performance. Our best prediction models were obtained after examining different data imbalanced handling techniques, and different protein encoding methods. We also showed that our models were superior to traditional deep learning approaches on the same datasets as well as other machine learning classification algorithms.


Subjects
Algorithms , Neural Networks, Computer , Ions , Membrane Proteins , Position-Specific Scoring Matrices
9.
IEEE/ACM Trans Comput Biol Bioinform ; 19(2): 1235-1244, 2022.
Article in English | MEDLINE | ID: mdl-32750894

ABSTRACT

Living organisms receive necessary energy substances directly from cellular respiration. The completion of electron storage and transportation requires the process of cellular respiration with the aid of electron transport chains. Therefore, the work of deciphering electron transport proteins is inevitably needed. The identification of these proteins with high performance has a prompt dependence on the choice of methods for feature extraction and machine learning algorithm. In this study, protein sequences served as natural language sentences comprising words. The nominated word embedding-based feature sets, hinged on the word embedding modulation and protein motif frequencies, were useful for feature choosing. Five word embedding types and a variety of conjoint features were examined for such feature selection. The support vector machine algorithm consequentially was employed to perform classification. The performance statistics within the 5-fold cross-validation including average accuracy, specificity, sensitivity, as well as MCC rates surpass 0.95. Such metrics in the independent test are 96.82, 97.16, 95.76 percent, and 0.9, respectively. Compared to state-of-the-art predictors, the proposed method can generate more preferable performance above all metrics indicating the effectiveness of the proposed method in determining electron transport proteins. Furthermore, this study reveals insights about the applicability of various word embeddings for understanding surveyed sequences.


Subjects
Carrier Proteins , Computational Biology , Computational Biology/methods , Electron Transport , Electrons , Support Vector Machine
10.
Article in English | MEDLINE | ID: mdl-34014828

ABSTRACT

It is well known that the major reasons for the rapid proliferation of cancer cells are the hypomethylation of the whole cancer genome and the hypermethylation of the promoters of particular tumor suppressor genes. Locating 5-methylcytosine (5mC) sites in promoters is therefore a crucial step in further understanding the relationship between promoter methylation and the regulation of mRNA gene expression. High-throughput identification of DNA 5mC in the wet lab is still time-consuming and labor-intensive. Thus, finding the 5mC sites of genome-wide DNA promoters is still an important task. We compared the effectiveness of the most popular and strong machine learning techniques, namely XGBoost, Random Forest, Deep Forest, and Deep Feedforward Neural Network, in predicting the 5mC sites of genome-wide DNA promoters. A feature extraction method based on k-mer embeddings learned from a language model was also applied. Overall, the performance of all the surveyed models surpassed that of the deep learning models of the latest studies on the same dataset employing other encoding schemes. Furthermore, the best model achieved AUC scores of 0.962 on both cross-validation and independent test data. We concluded that our approach was efficient for identifying 5mC sites of promoters with high performance.


Subjects
5-Methylcytosine , Machine Learning , DNA , DNA Methylation/genetics , Promoter Regions, Genetic/genetics
11.
Gene ; 787: 145643, 2021 Jun 30.
Article in English | MEDLINE | ID: mdl-33848577

ABSTRACT

Krüppel-like factors (KLF) refer to a group of conserved zinc finger-containing transcription factors that are involved in various physiological and biological processes, including cell proliferation, differentiation, development, and apoptosis. Some bioinformatics methods such as sequence similarity searches, multiple sequence alignment, phylogenetic reconstruction, and gene synteny analysis have also been proposed to broaden our knowledge of KLF proteins. In this study, we proposed a novel computational approach by using machine learning on features calculated from primary sequences. In detail, our XGBoost-based model is efficient in identifying KLF proteins, with accuracy of 96.4% and MCC of 0.704. It also shows promising performance when tested on an independent dataset. Therefore, our model could serve as a useful tool to identify new KLF proteins and provide necessary information for biologists and researchers studying KLF proteins. Our machine learning source codes as well as datasets are freely available at https://github.com/khanhlee/KLF-XGB.


Subjects
Computational Biology , Kruppel-Like Transcription Factors/chemistry , Algorithms , Amino Acid Sequence , Animals , Computational Biology/methods , Databases, Protein , Humans , Kruppel-Like Transcription Factors/analysis , Kruppel-Like Transcription Factors/genetics , Machine Learning , Models, Biological
12.
Comput Biol Med ; 131: 104258, 2021 04.
Article in English | MEDLINE | ID: mdl-33601085

ABSTRACT

The electron transport chain is a series of protein complexes embedded in the process of cellular respiration, which is an important process to transfer electrons and other macromolecules throughout the cell. Identifying Flavin Adenine Dinucleotide (FAD) binding sites in the electron transport chain is vital since it helps biological researchers precisely understand how electrons are produced and transported in cells. This study distills and analyzes the contextualized word embedding from pre-trained BERT models to explore similarities in natural language and protein sequences. Thereby, we propose a new approach based on Pre-training of Bidirectional Encoder Representations from Transformers (BERT), Position-specific Scoring Matrix profiles (PSSM), and the Amino Acid Index database (AAIndex) to predict FAD-binding sites from transport proteins that were recently found in nature. Our proposed approach achieves 85.14% accuracy and improves accuracy by 11%, with a Matthews correlation coefficient of 0.39 compared to the previous method on the same independent set. We also deploy a web server that identifies FAD-binding sites in electron transporters, available for academics at http://140.138.155.216/fadbert/.


Subjects
Amino Acids , Flavin-Adenine Dinucleotide , Amino Acid Sequence , Binding Sites , Electric Power Supplies , Flavin-Adenine Dinucleotide/metabolism
13.
Comput Biol Med ; 130: 104212, 2021 03.
Article in English | MEDLINE | ID: mdl-33454535

ABSTRACT

Glycosylation is a dynamic enzymatic process that attaches glycan to proteins or other organic molecules such as lipoproteins. Research has shown that such a process in ion channel proteins plays a fundamental role in modulating ion channel functions. This study used a computational method to predict N-linked glycosylation sites, the most common type, in ion channel proteins. From segments of ion channel proteins centered around N-linked glycosylation sites, the amino acid embedding vectors of each residue were concatenated to create features for prediction. We experimented with two different models for converting amino acids to their corresponding embeddings: one was fed with ion channel sequences and the other with a large dataset composed of more than one million protein sequences. The latter model stemmed from the idea of transfer learning technique and emerged as a more efficient feature extractor. Our best model was obtained from this transfer learning approach and a hyperparameter tuning process with a random search on 5-fold cross-validation data. It achieved an accuracy, specificity, sensitivity, and Matthews correlation coefficient of 93.4%, 92.8%, 98.6%, and 0.726, respectively. Corresponding scores on an independent test were 92.9%, 92.2%, 99%, and 0.717. These results outperform the position-specific scoring matrix features that are predominantly employed in post-translational modification site predictions. Furthermore, compared to N-GlyDE, GlycoEP, SPRINT-Gly, the most recent N-linked glycosylation site predictors, our model yields higher scores on the above 4 metrics, thus further demonstrating the efficiency of our approach.


Subjects
Amino Acids , Machine Learning , Amino Acid Sequence , Glycosylation , Ion Channels
14.
Comput Biol Med ; 131: 104259, 2021 04.
Article in English | MEDLINE | ID: mdl-33581474

ABSTRACT

Recently, language representation models have drawn a lot of attention in the field of natural language processing (NLP) due to their remarkable results. Among them, BERT (Bidirectional Encoder Representations from Transformers) has proven to be a simple, yet powerful language model that has achieved novel state-of-the-art performance. BERT adopted the concept of contextualized word embeddings to capture the semantics and context in which words appear. We utilized pre-trained BERT models to extract features from protein sequences for discriminating three families of glucose transporters: the major facilitator superfamily of glucose transporters (GLUTs), the sodium-glucose linked transporters (SGLTs), and the sugars will eventually be exported transporters (SWEETs). We treated protein sequences as sentences and transformed them into fixed-length meaningful vectors where a 768- or 1024-dimensional vector represents each amino acid. We observed that BERT-Base and BERT-Large models improved the performance by more than 4% in terms of average sensitivity and Matthews correlation coefficient (MCC), indicating the efficiency of this approach. We also developed a bidirectional transformer-based protein model (TransportersBERT) for comparison with existing pre-trained BERT models.


Subjects
Glucose Transport Proteins, Facilitative , Natural Language Processing , Glucose , Language , Semantics
15.
Mol Inform ; 39(10): e2000033, 2020 10.
Article in English | MEDLINE | ID: mdl-32598045

ABSTRACT

We herein proposed a novel approach based on the language representation learning method to categorize electron complex proteins into 5 types. The idea stems from the shared characteristics of human language and protein sequence language; thus, advanced natural language processing techniques were used for extracting useful features. Specifically, we employed transfer learning and word embedding techniques to analyze electron complex sequences and create efficient feature sets before using a support vector machine algorithm to classify them. During the 5-fold cross-validation processes, seven types of sequence-based features were analyzed to find the optimal features. On average, our final classification models achieved accuracy, specificity, sensitivity, and MCC of 96%, 96.1%, 95.3%, and 0.86, respectively, on cross-validation data. For the independent test data, the corresponding performance scores are 95.3%, 92.6%, 94%, and 0.87. We concluded that, using features extracted with these representation learning methods, the prediction performance of a simple machine learning algorithm is on par with the existing deep neural network method on the task of categorizing electron complexes, while enjoying a much faster way of generating features. Furthermore, the results also showed that the combination of features learned from the representation learning methods and sequence motif counts helps yield better performance.


Subjects
Computational Biology/methods , Multiprotein Complexes/classification , Multiprotein Complexes/metabolism , Amino Acid Sequence , Electron Transport , Humans , Natural Language Processing , Support Vector Machine , Word Processing
16.
BMC Med Genomics ; 13(Suppl 10): 155, 2020 10 22.
Article in English | MEDLINE | ID: mdl-33087125

ABSTRACT

BACKGROUND: Cytokines are a class of small proteins that act as chemical messengers and play a significant role in essential cellular processes including immunity regulation, hematopoiesis, and inflammation. As one important family of cytokines, tumor necrosis factors are associated with the regulation of various biological processes such as proliferation and differentiation of cells, apoptosis, lipid metabolism, and coagulation. The implication of these cytokines can also be seen in various diseases such as insulin resistance, autoimmune diseases, and cancer. Considering the interdependence between this kind of cytokine and others, classifying tumor necrosis factors from other cytokines is a challenge for biological scientists. METHODS: In this research, we employed a word embedding technique to create hybrid features, which proved to efficiently identify tumor necrosis factors given cytokine sequences. We segmented each protein sequence into protein words and created a corresponding word embedding for each word. Then, a word embedding-based vector for each sequence was created and input into machine learning classification models. When extracting feature sets, we not only diversified the segmentation sizes of the protein sequence but also tried different combinations of split grams to find the best features, which generated the optimal prediction. Furthermore, our methodology follows a well-defined procedure to build a reliable classification tool. RESULTS: With our proposed hybrid features, prediction models obtain more promising performance compared to seven prominent sequence-based feature kinds. Results from 10 independent runs on the surveyed dataset show that, on average, our optimal models obtain an area under the curve of 0.984 and 0.998 on 5-fold cross-validation and independent test, respectively. CONCLUSIONS: These results show that biologists can use our model to identify tumor necrosis factors from other cytokines efficiently. Moreover, this study proves that natural language processing techniques can be applied reasonably to help biologists solve bioinformatics problems efficiently.


Subjects
Computational Biology , Machine Learning , Tumor Necrosis Factors/metabolism , Amino Acid Sequence , Humans , Natural Language Processing , Tumor Necrosis Factors/chemistry
17.
J Mol Graph Model ; 92: 86-93, 2019 11.
Article in English | MEDLINE | ID: mdl-31344547

ABSTRACT

Membrane proteins, the most important drug targets, account for around 30% of total proteins encoded by the genome of living organisms. An important role of these proteins is to bind adenosine triphosphate (ATP), facilitating crucial biological processes such as metabolism and cell signaling. There are several reports elucidating ATP-binding sites within proteins. However, such studies on membrane proteins are limited. Our prediction tool, DeepATP, combines evolutionary information in the form of Position Specific Scoring Matrix and two-dimensional Convolutional Neural Network to predict ATP-binding sites in membrane proteins with an MCC of 0.89 and an AUC of 99%. Compared to recently published ATP-binding site predictors and classifiers that use traditional machine learning algorithms, our approach performs significantly better. We suggest this method as a reliable tool for biologists for ATP-binding site prediction in membrane proteins.


Subjects
Adenosine Triphosphate/chemistry , Binding Sites , Membrane Proteins/chemistry , Models, Theoretical , Neural Networks, Computer , Adenosine Triphosphate/metabolism , Algorithms , Amino Acid Motifs , Amino Acid Sequence , Computational Biology/methods , Databases, Protein , Machine Learning , Membrane Proteins/metabolism , Position-Specific Scoring Matrices , ROC Curve , Reproducibility of Results , Web Browser
18.
J Mol Graph Model ; 73: 166-178, 2017 05.
Article in English | MEDLINE | ID: mdl-28285094

ABSTRACT

The electron transport proteins have an important role in storing and transferring electrons in cellular respiration, which is the most proficient process through which cells gather energy from consumed food. According to their molecular functions, the electron transport chain components can be grouped into five complexes with several different electron carriers and functions. Therefore, identifying the molecular functions in the electron transport chain is vital for helping biologists understand the electron transport chain process and energy production in cells. This work includes two phases: discriminating electron transport proteins from transport proteins, and classifying the five complexes of electron transport proteins. In the first phase, the PSSM with AAIndex feature set was successful in identifying electron transport proteins among transport proteins, achieving sensitivity of 73.2%, specificity of 94.1%, and accuracy of 91.3%, with MCC of 0.64 on the independent data set. In the second phase, our method yields a precise model for identifying the five complexes with different molecular functions in electron transport proteins. The PSSM with AAIndex properties in the five complexes achieved MCC of 0.51, 0.47, 0.42, 0.74, and 1.00 on the independent data set, respectively. We suggest that our study could be a powerful model for determining which molecular function of electron transport proteins a new protein belongs to.


Subjects
Carrier Proteins/metabolism , Algorithms , Amino Acids/chemistry , Electron Transport