Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 20 de 404
Filtrar
Más filtros

Bases de datos
País/Región como asunto
Tipo del documento
Intervalo de año de publicación
1.
Brief Bioinform ; 21(3): 1047-1057, 2020 05 21.
Artículo en Inglés | MEDLINE | ID: mdl-31067315

RESUMEN

With the explosive growth of biological sequences generated in the post-genomic era, one of the most challenging problems in bioinformatics and computational biology is to computationally characterize sequences, structures and functions in an efficient, accurate and high-throughput manner. A number of online web servers and stand-alone tools have been developed to address this to date; however, all these tools have their limitations and drawbacks in terms of their effectiveness, user-friendliness and capacity. Here, we present iLearn, a comprehensive and versatile Python-based toolkit, integrating the functionality of feature extraction, clustering, normalization, selection, dimensionality reduction, predictor construction, best descriptor/model selection, ensemble learning and results visualization for DNA, RNA and protein sequences. iLearn was designed for users that only want to upload their data set and select the functions they need calculated from it, while all necessary procedures and optimal settings are completed automatically by the software. iLearn includes a variety of descriptors for DNA, RNA and proteins, and four feature output formats are supported so as to facilitate direct output usage or communication with other computational tools. In total, iLearn encompasses 16 different types of feature clustering, selection, normalization and dimensionality reduction algorithms, and five commonly used machine-learning algorithms, thereby greatly facilitating feature analysis and predictor construction. iLearn is made freely available via an online web server and a stand-alone toolkit.


Asunto(s)
ADN/química , Aprendizaje Automático , Proteínas/química , ARN/química , Análisis de Secuencia/métodos , Algoritmos , Internet
2.
Brief Bioinform ; 20(2): 638-658, 2019 03 25.
Artículo en Inglés | MEDLINE | ID: mdl-29897410

RESUMEN

Regulation of proteolysis plays a critical role in a myriad of important cellular processes. The key to better understanding the mechanisms that control this process is to identify the specific substrates that each protease targets. To address this, we have developed iProt-Sub, a powerful bioinformatics tool for the accurate prediction of protease-specific substrates and their cleavage sites. Importantly, iProt-Sub represents a significantly advanced version of its successful predecessor, PROSPER. It provides optimized cleavage site prediction models with better prediction performance and coverage for more species-specific proteases (4 major protease families and 38 different proteases). iProt-Sub integrates heterogeneous sequence and structural features and uses a two-step feature selection procedure to further remove redundant and irrelevant features in an effort to improve the cleavage site prediction accuracy. Features used by iProt-Sub are encoded by 11 different sequence encoding schemes, including local amino acid sequence profile, secondary structure, solvent accessibility and native disorder, which will allow a more accurate representation of the protease specificity of approximately 38 proteases and training of the prediction models. Benchmarking experiments using cross-validation and independent tests showed that iProt-Sub is able to achieve a better performance than several existing generic tools. We anticipate that iProt-Sub will be a powerful tool for proteome-wide prediction of protease-specific substrates and their cleavage sites, and will facilitate hypothesis-driven functional interrogation of protease-specific substrate cleavage and proteolytic events.


Asunto(s)
Biología Computacional , Péptido Hidrolasas/metabolismo , Aprendizaje Automático , Proteolisis , Proteoma , Especificidad por Sustrato
3.
Brief Bioinform ; 20(6): 2185-2199, 2019 11 27.
Artículo en Inglés | MEDLINE | ID: mdl-30351377

RESUMEN

As a newly discovered post-translational modification (PTM), lysine malonylation (Kmal) regulates a myriad of cellular processes from prokaryotes to eukaryotes and has important implications in human diseases. Despite its functional significance, computational methods to accurately identify malonylation sites are still lacking and urgently needed. In particular, there is currently no comprehensive analysis and assessment of different features and machine learning (ML) methods that are required for constructing the necessary prediction models. Here, we review, analyze and compare 11 different feature encoding methods, with the goal of extracting key patterns and characteristics from residue sequences of Kmal sites. We identify optimized feature sets, with which four commonly used ML methods (random forest, support vector machines, K-nearest neighbor and logistic regression) and one recently proposed [Light Gradient Boosting Machine (LightGBM)] are trained on data from three species, namely, Escherichia coli, Mus musculus and Homo sapiens, and compared using randomized 10-fold cross-validation tests. We show that integration of the single method-based models through ensemble learning further improves the prediction performance and model robustness on the independent test. When compared to the existing state-of-the-art predictor, MaloPred, the optimal ensemble models were more accurate for all three species (AUC: 0.930, 0.923 and 0.944 for E. coli, M. musculus and H. sapiens, respectively). Using the ensemble models, we developed an accessible online predictor, kmal-sp, available at http://kmalsp.erc.monash.edu/. We hope that this comprehensive survey and the proposed strategy for building more accurate models can serve as a useful guide for inspiring future developments of computational methods for PTM site prediction, expedite the discovery of new malonylation and other PTM types and facilitate hypothesis-driven experimental validation of novel malonylated substrates and malonylation sites.


Asunto(s)
Biología Computacional , Lisina/metabolismo , Aprendizaje Automático , Malonatos/metabolismo , Animales , Humanos
4.
Brief Bioinform ; 20(6): 2267-2290, 2019 11 27.
Artículo en Inglés | MEDLINE | ID: mdl-30285084

RESUMEN

Lysine post-translational modifications (PTMs) play a crucial role in regulating diverse functions and biological processes of proteins. However, because of the large volumes of sequencing data generated from genome-sequencing projects, systematic identification of different types of lysine PTM substrates and PTM sites in the entire proteome remains a major challenge. In recent years, a number of computational methods for lysine PTM identification have been developed. These methods show high diversity in their core algorithms, features extracted and feature selection techniques and evaluation strategies. There is therefore an urgent need to revisit these methods and summarize their methodologies, to improve and further develop computational techniques to identify and characterize lysine PTMs from the large amounts of sequence data. With this goal in mind, we first provide a comprehensive survey on a large collection of 49 state-of-the-art approaches for lysine PTM prediction. We cover a variety of important aspects that are crucial for the development of successful predictors, including operating algorithms, sequence and structural features, feature selection, model performance evaluation and software utility. We further provide our thoughts on potential strategies to improve the model performance. Second, in order to examine the feasibility of using deep learning for lysine PTM prediction, we propose a novel computational framework, termed MUscADEL (Multiple Scalable Accurate Deep Learner for lysine PTMs), using deep, bidirectional, long short-term memory recurrent neural networks for accurate and systematic mapping of eight major types of lysine PTMs in the human and mouse proteomes. Extensive benchmarking tests show that MUscADEL outperforms current methods for lysine PTM characterization, demonstrating the potential and power of deep learning techniques in protein PTM prediction. The web server of MUscADEL, together with all the data sets assembled in this study, is freely available at http://muscadel.erc.monash.edu/. We anticipate this comprehensive review and the application of deep learning will provide practical guide and useful insights into PTM prediction and inspire future bioinformatics studies in the related fields.


Asunto(s)
Biología Computacional/métodos , Lisina/metabolismo , Aprendizaje Automático , Procesamiento Proteico-Postraduccional , Algoritmos , Estudios de Factibilidad
5.
Brief Bioinform ; 20(6): 2150-2166, 2019 11 27.
Artículo en Inglés | MEDLINE | ID: mdl-30184176

RESUMEN

The roles of proteolytic cleavage have been intensively investigated and discussed during the past two decades. This irreversible chemical process has been frequently reported to influence a number of crucial biological processes (BPs), such as cell cycle, protein regulation and inflammation. A number of advanced studies have been published aiming at deciphering the mechanisms of proteolytic cleavage. Given its significance and the large number of functionally enriched substrates targeted by specific proteases, many computational approaches have been established for accurate prediction of protease-specific substrates and their cleavage sites. Consequently, there is an urgent need to systematically assess the state-of-the-art computational approaches for protease-specific cleavage site prediction to further advance the existing methodologies and to improve the prediction performance. With this goal in mind, in this article, we carefully evaluated a total of 19 computational methods (including 8 scoring function-based methods and 11 machine learning-based methods) in terms of their underlying algorithm, calculated features, performance evaluation and software usability. Then, extensive independent tests were performed to assess the robustness and scalability of the reviewed methods using our carefully prepared independent test data sets with 3641 cleavage sites (specific to 10 proteases). The comparative experimental results demonstrate that PROSPERous is the most accurate generic method for predicting eight protease-specific cleavage sites, while GPS-CCD and LabCaS outperformed other predictors for calpain-specific cleavage sites. Based on our review, we then outlined some potential ways to improve the prediction performance and ease the computational burden by applying ensemble learning, deep learning, positive unlabeled learning and parallel and distributed computing techniques. We anticipate that our study will serve as a practical and useful guide for interested readers to further advance next-generation bioinformatics tools for protease-specific cleavage site prediction.


Asunto(s)
Benchmarking , Biología Computacional , Péptido Hidrolasas/metabolismo , Investigación , Algoritmos , Aprendizaje Automático , Especificidad por Sustrato
6.
Genomics ; 112(1): 837-847, 2020 01.
Artículo en Inglés | MEDLINE | ID: mdl-31150762

RESUMEN

BACKGROUND: Glioma is the most lethal nervous system cancer. Recent studies have made great efforts to study the occurrence and development of glioma, but the molecular mechanisms are still unclear. This study was designed to reveal the molecular mechanisms of glioma based on protein-protein interaction network combined with machine learning methods. Key differentially expressed genes (DEGs) were screened and selected by using the protein-protein interaction (PPI) networks. RESULTS: As a result, 19 genes between grade I and grade II, 21 genes between grade II and grade III, and 20 genes between grade III and grade IV. Then, five machine learning methods were employed to predict the gliomas stages based on the selected key genes. After comparison, Complement Naive Bayes classifier was employed to build the prediction model for grade II-III with accuracy 72.8%. And Random forest was employed to build the prediction model for grade I-II and grade III-VI with accuracy 97.1% and 83.2%, respectively. Finally, the selected genes were analyzed by PPI networks, Gene Ontology (GO) terms and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways, and the results improve our understanding of the biological functions of select DEGs involved in glioma growth. We expect that the key genes expressed have a guiding significance for the occurrence of gliomas or, at the very least, that they are useful for tumor researchers. CONCLUSION: Machine learning combined with PPI networks, GO and KEGG analyses of selected DEGs improve our understanding of the biological functions involved in glioma growth.


Asunto(s)
Neoplasias Encefálicas/genética , Neoplasias Encefálicas/metabolismo , Glioma/genética , Glioma/metabolismo , Aprendizaje Automático , Mapeo de Interacción de Proteínas , Neoplasias Encefálicas/diagnóstico , Expresión Génica , Ontología de Genes , Glioma/diagnóstico , Estadificación de Neoplasias
7.
Mol Genet Genomics ; 295(2): 261-274, 2020 Mar.
Artículo en Inglés | MEDLINE | ID: mdl-31894399

RESUMEN

Facing the explosive growth of biological sequences unearthed in the post-genomic age, one of the most important but also most difficult problems in computational biology is how to express a biological sequence with a discrete model or a vector, but still keep it with considerable sequence-order information or its special pattern. To deal with such a challenging problem, the ideas of "pseudo amino acid components" and "pseudo K-tuple nucleotide composition" have been proposed. The ideas and their approaches have further stimulated the birth for "distorted key theory", "wenxing diagram", and substantially strengthening the power in treating the multi-label systems, as well as the establishment of the famous "5-steps rule". All these logic developments are quite natural that are very useful not only for theoretical scientists but also for experimental scientists in conducting genetics/genomics analysis and drug development. Presented in this review paper are also their future perspectives; i.e., their impacts will become even more significant and propounding.


Asunto(s)
Desarrollo de Medicamentos/tendencias , Genoma Humano/genética , Genómica/tendencias , Algoritmos , Biología Computacional/tendencias , Humanos , Programas Informáticos
8.
Bioinformatics ; 35(3): 398-406, 2019 02 01.
Artículo en Inglés | MEDLINE | ID: mdl-30010789

RESUMEN

Motivation: A cell contains numerous protein molecules. One of the fundamental goals in cell biology is to determine their subcellular locations, which can provide useful clues about their functions. Knowledge of protein subcellular localization is also indispensable for prioritizing and selecting the right targets for drug development. With the avalanche of protein sequences emerging in the post-genomic age, it is highly desired to develop computational tools for timely and effectively identifying their subcellular localization based on the sequence information alone. Recently, a predictor called 'pLoc-mAnimal' was developed for identifying the subcellular localization of animal proteins. Its performance is overwhelmingly better than that of the other predictors for the same purpose, particularly in dealing with the multi-label systems in which some proteins, called 'multiplex proteins', may simultaneously occur in two or more subcellular locations. Although it is indeed a very powerful predictor, more efforts are definitely needed to further improve it. This is because pLoc-mAnimal was trained by an extremely skewed dataset in which some subset (subcellular location) was about 128 times the size of the other subsets. Accordingly, such an uneven training dataset will inevitably cause a biased consequence. Results: To alleviate such biased consequence, we have developed a new and bias-reducing predictor called pLoc_bal-mAnimal by quasi-balancing the training dataset. Cross-validation tests on exactly the same experiment-confirmed dataset have indicated that the proposed new predictor is remarkably superior to pLoc-mAnimal, the existing state-of-the-art predictor, in identifying the subcellular localization of animal proteins. Availability and implementation: To maximize the convenience for the vast majority of experimental scientists, a user-friendly web-server for the new predictor has been established at http://www.jci-bioinfo.cn/pLoc_bal-mAnimal/, by which users can easily get their desired results without the need to go through the complicated mathematics. Supplementary information: Supplementary data are available at Bioinformatics online.


Asunto(s)
Biología Computacional , Bases de Datos de Proteínas , Proteínas , Secuencia de Aminoácidos , Animales , Transporte de Proteínas , Fracciones Subcelulares
9.
Bioinformatics ; 35(17): 2957-2965, 2019 09 01.
Artículo en Inglés | MEDLINE | ID: mdl-30649179

RESUMEN

MOTIVATION: Promoters are short DNA consensus sequences that are localized proximal to the transcription start sites of genes, allowing transcription initiation of particular genes. However, the precise prediction of promoters remains a challenging task because individual promoters often differ from the consensus at one or more positions. RESULTS: In this study, we present a new multi-layer computational approach, called MULTiPly, for recognizing promoters and their specific types. MULTiPly took into account the sequences themselves, including both local information such as k-tuple nucleotide composition, dinucleotide-based auto covariance and global information of the entire samples based on bi-profile Bayes and k-nearest neighbour feature encodings. Specifically, the F-score feature selection method was applied to identify the best unique type of feature prediction results, in combination with other types of features that were subsequently added to further improve the prediction performance of MULTiPly. Benchmarking experiments on the benchmark dataset and comparisons with five state-of-the-art tools show that MULTiPly can achieve a better prediction performance on 5-fold cross-validation and jackknife tests. Moreover, the superiority of MULTiPly was also validated on a newly constructed independent test dataset. MULTiPly is expected to be used as a useful tool that will facilitate the discovery of both general and specific types of promoters in the post-genomic era. AVAILABILITY AND IMPLEMENTATION: The MULTiPly webserver and curated datasets are freely available at http://flagshipnt.erc.monash.edu/MULTiPly/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Asunto(s)
Genómica , Regiones Promotoras Genéticas , Programas Informáticos , Teorema de Bayes , Sitio de Iniciación de la Transcripción
10.
Bioinformatics ; 35(12): 2017-2028, 2019 06 01.
Artículo en Inglés | MEDLINE | ID: mdl-30388198

RESUMEN

MOTIVATION: Type III secreted effectors (T3SEs) can be injected into host cell cytoplasm via type III secretion systems (T3SSs) to modulate interactions between Gram-negative bacterial pathogens and their hosts. Due to their relevance in pathogen-host interactions, significant computational efforts have been put toward identification of T3SEs and these in turn have stimulated new T3SE discoveries. However, as T3SEs with new characteristics are discovered, these existing computational tools reveal important limitations: (i) most of the trained machine learning models are based on the N-terminus (or incorporating also the C-terminus) instead of the proteins' complete sequences, and (ii) the underlying models (trained with classic algorithms) employed only few features, most of which were extracted based on sequence-information alone. To achieve better T3SE prediction, we must identify more powerful, informative features and investigate how to effectively integrate these into a comprehensive model. RESULTS: In this work, we present Bastion3, a two-layer ensemble predictor developed to accurately identify type III secreted effectors from protein sequence data. In contrast with existing methods that employ single models with few features, Bastion3 explores a wide range of features, from various types, trains single models based on these features and finally integrates these models through ensemble learning. We trained the models using a new gradient boosting machine, LightGBM and further boosted the models' performances through a novel genetic algorithm (GA) based two-step parameter optimization strategy. Our benchmark test demonstrates that Bastion3 achieves a much better performance compared to commonly used methods, with an ACC value of 0.959, F-value of 0.958, MCC value of 0.917 and AUC value of 0.956, which comprehensively outperformed all other toolkits by more than 5.6% in ACC value, 5.7% in F-value, 12.4% in MCC value and 5.8% in AUC value. Based on our proposed two-layer ensemble model, we further developed a user-friendly online toolkit, maximizing convenience for experimental scientists toward T3SE prediction. With its design to ease future discoveries of novel T3SEs and improved performance, Bastion3 is poised to become a widely used, state-of-the-art toolkit for T3SE prediction. AVAILABILITY AND IMPLEMENTATION: http://bastion3.erc.monash.edu/. CONTACT: selkrig@embl.de or wyztli@163.com or or trevor.lithgow@monash.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Asunto(s)
Aprendizaje Automático , Algoritmos , Secuencia de Aminoácidos , Proteínas Bacterianas , Biología Computacional , Bacterias Gramnegativas , Programas Informáticos
11.
Anal Biochem ; 588: 113477, 2020 01 01.
Artículo en Inglés | MEDLINE | ID: mdl-31654612

RESUMEN

Proteases are a type of enzymes, which perform the process of proteolysis. Proteolysis normally refers to protein and peptide degradation which is crucial for the survival, growth and wellbeing of a cell. Moreover, proteases have a strong association with therapeutics and drug development. The proteases are classified into five different types according to their nature and physiochemical characteristics. Mostly the methods used to differentiate protease from other proteins and identify their class requires a clinical test which is usually time-consuming and operator dependent. Herein, we report a classifier named iProtease-PseAAC (2L) for identifying proteases and their classes. The predictor is developed employing the flow of 5-step rule, initiating from the collection of benchmark dataset and terminating at the development of predictor. Rigorous verification and validation tests are performed and metrics are collected to calculate the authenticity of the trained model. The self-consistency validation gives the 98.32% accuracy, for cross-validation the accuracy is 90.71% and jackknife gives 96.07% accuracy. The average accuracy for level-2 i.e. protease classification is 95.77%. Based on the above-mentioned results, it is concluded that iProtease-PseAAC (2L) has the great ability to identify the proteases and their classes using a given protein sequence.


Asunto(s)
Algoritmos , Biología Computacional/métodos , Péptido Hidrolasas/clasificación , Proteínas/clasificación , Programas Informáticos , Bases de Datos de Proteínas
12.
Curr Genomics ; 21(7): 536-545, 2020 Nov.
Artículo en Inglés | MEDLINE | ID: mdl-33214770

RESUMEN

INTRODUCTION: Hydroxylation is one of the most important post-translational modifications (PTM) in cellular functions and is linked to various diseases. The addition of one of the hydroxyl groups (OH) to the lysine sites produces hydroxylysine when undergoes chemical modification. METHODS: The method which is used in this study for identifying hydroxylysine sites based on powerful mathematical and statistical methodology incorporating the sequence-order effect and composition of each object within protein sequences. This predictor is called "iHyd-LysSite (EPSV)" (identifying hydroxylysine sites by extracting enhanced position and sequence variant technique). The prediction of hydroxylysine sites by experimental methods is difficult, laborious and highly expensive. In silico technique is an alternative approach to identify hydroxylysine sites in proteins. RESULTS: The experimental results require that the predictive model should have high sensitivity and specificity values and must be more accurate. The self-consistency, independent, 10-fold cross-validation and jackknife tests are performed for validation purposes. These tests are resulted by using three renowned classifiers, Neural Networks (NN), Random Forest (RF) and Support Vector Machine (SVM) with the demanding prediction rate. The overall predictive outcomes are extraordinarily superior to the results obtained by previous predictors. The proposed model contributed an excellent prediction rate in the system for NN, RF, and SVM classifiers. The sensitivity and specificity results using all these classifiers for jackknife test are 96.08%, 94.99%, 98.16% and 97.52%, 98.52%, 80.95%. CONCLUSION: The results obtained by the proposed tool show that this method may meet the future demand of hydroxylysine sites with a better prediction rate over the existing methods.

13.
Genomics ; 2019 Aug 30.
Artículo en Inglés | MEDLINE | ID: mdl-31476433

RESUMEN

This article has been withdrawn at the request of the Editor-in-Chief. After a thorough investigation, the Editor has concluded that the acceptance of this article was partly based upon the positive advice of two illegitimate reviewer reports. The reports were submitted from email accounts which were provided to the journal as suggested reviewers during the submission of the article. Although purportedly real reviewer accounts, the Editor has concluded that these were not of appropriate, independent reviewers. Also, the article duplicates significant parts of a paper that had already appeared in Current Medicinal Chemistry 26 (2019) 4918-4943 https://doi.org/10.2174/0929867326666190507082559. This represents a clear violation of the fundamentals of peer review, our publishing policies, and publishing ethics standards. Apologies are offered to the readers of the journal that this was not detected during the submission process. The full Elsevier Policy on Article Withdrawal can be found at https://www.elsevier.com/about/our-business/policies/article-withdrawal.

14.
Genomics ; 2019 Sep 05.
Artículo en Inglés | MEDLINE | ID: mdl-31494196

RESUMEN

During the last three decades or so, many efforts have been made to study the protein cleavage sites by some disease-causing enzyme, such as HIV (human immunodeficiency virus) protease and SARS (severe acute respiratory syndrome) coronavirus main proteinase. It has become increasingly clear via this minireview that the motivation driving the aforementioned studies is quite wise, and that the results acquired through these studies are very rewarding, particularly for developing peptide drugs.

15.
Genomics ; 2019 Aug 28.
Artículo en Inglés | MEDLINE | ID: mdl-31472241

RESUMEN

This article has been withdrawn at the request of the Editor-in-Chief. After a thorough investigation, the Editor has concluded that the acceptance of this article was partly based upon the positive advice of two illegitimate reviewer reports. The reports were submitted from email accounts which were provided to the journal as suggested reviewers during the submission of the article. Although purportedly real reviewer accounts, the Editor has concluded that these were not of appropriate, independent reviewers. Also, the article duplicates significant parts of a paper that had already appeared in Current Medicinal Chemistry 26 (2019) 4918-4943 https://doi.org/10.2174/0929867326666190507082559. This represents a clear violation of the fundamentals of peer review, our publishing policies, and publishing ethics standards. Apologies are offered to the readers of the journal that this was not detected during the submission process. The full Elsevier Policy on Article Withdrawal can be found at http://www.elsevier.com/locate/withdrawalpolicy.

16.
Genomics ; 111(6): 1274-1282, 2019 12.
Artículo en Inglés | MEDLINE | ID: mdl-30179658

RESUMEN

A cell contains numerous protein molecules. One of the fundamental goals in molecular cell biology is to determine their subcellular locations since this information is extremely important to both basic research and drug development. In this paper, we report a novel and very powerful predictor called "pLoc_bal-mHum" for predicting the subcellular localization of human proteins based on their sequence information alone. Cross-validation tests on exactly the same experiment-confirmed dataset have indicated that the new predictor is remarkably superior to the existing state-of-the-art predictor in identifying the subcellular localization of human proteins. To maximize the convenience for the majority of experimental scientists, a user-friendly web-server for the new predictor has been established at http://www.jci-bioinfo.cn/pLoc_bal-mHum/, by which users can easily get their desired results without the need to go through the detailed mathematics.


Asunto(s)
Proteínas/análisis , Análisis de Secuencia de Proteína/métodos , Programas Informáticos , Células/química , Humanos
17.
Genomics ; 111(4): 886-892, 2019 07.
Artículo en Inglés | MEDLINE | ID: mdl-29842950

RESUMEN

Knowledge of protein subcellular localization is vitally important for both basic research and drug development. With the avalanche of protein sequences emerging in the post-genomic age, it is highly desired to develop computational tools for timely and effectively identifying their subcellular localization purely based on the sequence information alone. Recently, a predictor called "pLoc-mGpos" was developed for identifying the subcellular localization of Gram-positive bacterial proteins. Its performance is overwhelmingly better than that of the other predictors for the same purpose, particularly in dealing with multi-label systems in which some proteins, called "multiplex proteins", may simultaneously occur in two or more subcellular locations. Although it is indeed a very powerful predictor, more efforts are definitely needed to further improve it. This is because pLoc-mGpos was trained by an extremely skewed dataset in which some subset (subcellular location) was over 11 times the size of the other subsets. Accordingly, it cannot avoid the bias consequence caused by such an uneven training dataset. To alleviate such bias consequence, we have developed a new and bias-reducing predictor called pLoc_bal-mGpos by quasi-balancing the training dataset. Rigorous target jackknife tests on exactly the same experiment-confirmed dataset have indicated that the proposed new predictor is remarkably superior to pLoc-mGpos, the existing state-of-the-art predictor in identifying the subcellular localization of Gram-positive bacterial proteins. To maximize the convenience for most experimental scientists, a user-friendly web-server for the new predictor has been established at http://www.jci-bioinfo.cn/pLoc_bal-mGpos/, by which users can easily get their desired results without the need to go through the detailed mathematics.


Asunto(s)
Proteínas Bacterianas/genética , Bacterias Grampositivas/genética , Análisis de Secuencia de Proteína/métodos , Programas Informáticos , Proteínas Bacterianas/metabolismo , Membrana Celular/metabolismo , Citoplasma/metabolismo , Transporte de Proteínas
18.
Genomics ; 111(1): 96-102, 2019 01.
Artículo en Inglés | MEDLINE | ID: mdl-29360500

RESUMEN

N6-methyladenine (6mA) is one kind of post-replication modification (PTM or PTRM) occurring in a wide range of DNA sequences. Accurate identification of its sites will be very helpful for revealing the biological functions of 6mA, but it is time-consuming and expensive to determine them by experiments alone. Unfortunately, so far, no bioinformatics tool is available to do so. To fill in such an empty area, we have proposed a novel predictor called iDNA6mA-PseKNC that is established by incorporating nucleotide physicochemical properties into Pseudo K-tuple Nucleotide Composition (PseKNC). It has been observed via rigorous cross-validations that the predictor's sensitivity (Sn), specificity (Sp), accuracy (Acc), and stability (MCC) are 93%, 100%, 96%, and 0.93, respectively. For the convenience of most experimental scientists, a user-friendly web server for iDNA6mA-PseKNC has been established at http://lin-group.cn/server/iDNA6mA-PseKNC, by which users can easily obtain their desired results without the need to go through the complicated mathematical equations involved.


Asunto(s)
Adenosina/análogos & derivados , Biología Computacional , Nucleótidos/química , Adenosina/análisis , Adenosina/química , Algoritmos , Animales , Secuencia de Bases , ADN/química , Exactitud de los Datos , Bases de Datos Genéticas , Genoma Bacteriano , Genoma de los Helmintos , Genoma de Planta , Sensibilidad y Especificidad , Programas Informáticos , Validación de Programas de Computación
19.
Genomics ; 111(6): 1785-1793, 2019 12.
Artículo en Inglés | MEDLINE | ID: mdl-30529532

RESUMEN

The promoter is a regulatory DNA region about 81-1000 base pairs long, usually located near the transcription start site (TSS) along upstream of a given gene. By combining a certain protein called transcription factor, the promoter provides the starting point for regulated gene transcription, and hence plays a vitally important role in gene transcriptional regulation. With explosive growth of DNA sequences in the post-genomic age, it has become an urgent challenge to develop computational method for effectively identifying promoters because the information thus obtained is very useful for both basic research and drug development. Although some prediction methods were developed in this regard, most of them were limited at merely identifying whether a query DNA sequence being of a promoter or not. However, based on their strength-distinct levels for transcriptional activation and expression, promoter should be divided into two categories: strong and weak types. Here a new two-layer predictor, called "iPSW(2L)-PseKNC", was developed by fusing the physicochemical properties of nucleotides and their nucleotide density into PseKNC (pseudo K-tuple nucleotide composition). Its 1st-layer serves to predict whether a query DNA sequence sample is of promoter or not, while its 2nd-layer is able to predict the strength of promoters. It has been observed through rigorous cross-validations that the 1st-layer sub-predictor is remarkably superior to the existing state-of-the-art predictors in identifying the promoters and non-promoters, and that the 2nd-layer sub-predictor can do what is beyond the reach of the existing predictors. Moreover, the web-server for iPSW(2L)-PseKNC has been established at http://www.jci-bioinfo.cn/iPSW(2L)-PseKNC, by which the majority of experimental scientists can easily get the results they need.


Asunto(s)
Secuencia de Bases , Regiones Promotoras Genéticas , Análisis de Secuencia de ADN , Programas Informáticos , Sitio de Iniciación de la Transcripción , Activación Transcripcional
20.
BMC Bioinformatics ; 20(1): 112, 2019 Mar 06.
Artículo en Inglés | MEDLINE | ID: mdl-30841845

RESUMEN

BACKGROUND: As an important type of post-translational modification (PTM), protein glycosylation plays a crucial role in protein stability and protein function. The abundance and ubiquity of protein glycosylation across three domains of life involving Eukarya, Bacteria and Archaea demonstrate its roles in regulating a variety of signalling and metabolic pathways. Mutations on and in the proximity of glycosylation sites are highly associated with human diseases. Accordingly, accurate prediction of glycosylation can complement laboratory-based methods and greatly benefit experimental efforts for characterization and understanding of functional roles of glycosylation. For this purpose, a number of supervised-learning approaches have been proposed to identify glycosylation sites, demonstrating a promising predictive performance. To train a conventional supervised-learning model, both reliable positive and negative samples are required. However, in practice, a large portion of negative samples (i.e. non-glycosylation sites) are mislabelled due to the limitation of current experimental technologies. Moreover, supervised algorithms often fail to take advantage of large volumes of unlabelled data, which can aid in model learning in conjunction with positive samples (i.e. experimentally verified glycosylation sites). RESULTS: In this study, we propose a positive unlabelled (PU) learning-based method, PA2DE (V2.0), based on the AlphaMax algorithm for protein glycosylation site prediction. The predictive performance of this proposed method was evaluated by a range of glycosylation data collected over a ten-year period based on an interval of three years. Experiments using both benchmarking and independent tests show that our method outperformed the representative supervised-learning algorithms (including support vector machines and random forests) and one-class learners, as well as currently available prediction methods in terms of F1 score, accuracy and AUC measures. In addition, we developed an online web server as an implementation of the optimized model (available at http://glycomine.erc.monash.edu/Lab/GlycoMine_PU/ ) to facilitate community-wide efforts for accurate prediction of protein glycosylation sites. CONCLUSION: The proposed PU learning approach achieved a competitive predictive performance compared with currently available methods. This PU learning schema may also be effectively employed and applied to address the prediction problems of other important types of protein PTM site and functional sites.


Asunto(s)
Biología Computacional/métodos , Proteoma/metabolismo , Coloración y Etiquetado , Bases de Datos de Proteínas , Glicosilación , Humanos , Procesamiento Proteico-Postraduccional , Curva ROC , Máquina de Vectores de Soporte
SELECCIÓN DE REFERENCIAS
DETALLE DE LA BÚSQUEDA