Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 20 de 52
Filtrar
1.
Brief Bioinform ; 24(1)2023 01 19.
Artigo em Inglês | MEDLINE | ID: mdl-36617463

RESUMO

DNA and RNA sequencing technologies have revolutionized biology and biomedical sciences, sequencing full genomes and transcriptomes at very high speeds and reasonably low costs. RNA sequencing (RNA-Seq) enables transcript identification and quantification, but once sequencing has concluded researchers can be easily overwhelmed with questions such as how to go from raw data to differential expression (DE), pathway analysis and interpretation. Several pipelines and procedures have been developed to this effect. Even though there is no unique way to perform RNA-Seq analysis, it usually follows these steps: 1) raw reads quality check, 2) alignment of reads to a reference genome, 3) aligned reads' summarization according to an annotation file, 4) DE analysis and 5) gene set analysis and/or functional enrichment analysis. Each step requires researchers to make decisions, and the wide variety of options and resulting large volumes of data often lead to interpretation challenges. There also seems to be insufficient guidance on how best to obtain relevant information and derive actionable knowledge from transcription experiments. In this paper, we explain RNA-Seq steps in detail and outline differences and similarities of different popular options, as well as advantages and disadvantages. We also discuss non-coding RNA analysis, multi-omics, meta-transcriptomics and the use of artificial intelligence methods complementing the arsenal of tools available to researchers. Lastly, we perform a complete analysis from raw reads to DE and functional enrichment analysis, visually illustrating how results are not absolute truths and how algorithmic decisions can greatly impact results and interpretation.


Assuntos
Inteligência Artificial , Perfilação da Expressão Gênica , Perfilação da Expressão Gênica/métodos , Transcriptoma , Análise de Sequência de RNA/métodos , Genoma , Sequenciamento de Nucleotídeos em Larga Escala/métodos , RNA/genética
2.
Brief Bioinform ; 23(1)2022 01 17.
Artigo em Inglês | MEDLINE | ID: mdl-34729589

RESUMO

Conventional supervised binary classification algorithms have been widely applied to address significant research questions using biological and biomedical data. This classification scheme requires two fully labeled classes of data (e.g. positive and negative samples) to train a classification model. However, in many bioinformatics applications, labeling data is laborious, and the negative samples might be potentially mislabeled due to the limited sensitivity of the experimental equipment. The positive unlabeled (PU) learning scheme was therefore proposed to enable the classifier to learn directly from limited positive samples and a large number of unlabeled samples (i.e. a mixture of positive or negative samples). To date, several PU learning algorithms have been developed to address various biological questions, such as sequence identification, functional site characterization and interaction prediction. In this paper, we revisit a collection of 29 state-of-the-art PU learning bioinformatic applications to address various biological questions. Various important aspects are extensively discussed, including PU learning methodology, biological application, classifier design and evaluation strategy. We also comment on the existing issues of PU learning and offer our perspectives for the future development of PU learning applications. We anticipate that our work serves as an instrumental guideline for a better understanding of the PU learning framework in bioinformatics and further developing next-generation PU learning frameworks for critical biological applications.


Assuntos
Algoritmos , Biologia Computacional , Biologia Computacional/métodos , Aprendizado de Máquina Supervisionado
3.
Clin Trials ; 21(1): 51-66, 2024 02.
Artigo em Inglês | MEDLINE | ID: mdl-37937606

RESUMO

Numerous successful gene-targeted therapies are arising for the treatment of a variety of rare diseases. At the same time, current treatment options for neurofibromatosis 1 and schwannomatosis are limited and do not directly address loss of gene/protein function. In addition, treatments have mostly focused on symptomatic tumors, but have failed to address multisystem involvement in these conditions. Gene-targeted therapies hold promise to address these limitations. However, despite intense interest over decades, multiple preclinical and clinical issues need to be resolved before they become a reality. The optimal approaches to gene-, mRNA-, or protein restoration and to delivery to the appropriate cell types remain elusive. Preclinical models that recapitulate manifestations of neurofibromatosis 1 and schwannomatosis need to be refined. The development of validated assays for measuring neurofibromin and merlin activity in animal and human tissues will be critical for early-stage trials, as will the selection of appropriate patients, based on their individual genotypes and risk/benefit balance. Once the safety of gene-targeted therapy for symptomatic tumors has been established, the possibility of addressing a wide range of symptoms, including non-tumor manifestations, should be explored. As preclinical efforts are underway, it will be essential to educate both clinicians and those affected by neurofibromatosis 1/schwannomatosis about the risks and benefits of gene-targeted therapy for these conditions.


Assuntos
Neurilemoma , Neurofibromatoses , Neurofibromatose 1 , Neurofibromatose 2 , Neoplasias Cutâneas , Animais , Humanos , Neurofibromatose 1/genética , Neurofibromatose 1/terapia , Neurofibromatose 2/diagnóstico , Neurofibromatose 2/genética , Neurofibromatose 2/patologia , Neurofibromatoses/genética , Neurofibromatoses/terapia , Neurofibromatoses/diagnóstico , Neurilemoma/genética , Neurilemoma/terapia , Neurilemoma/diagnóstico
4.
Brief Bioinform ; 22(5)2021 09 02.
Artigo em Inglês | MEDLINE | ID: mdl-33774670

RESUMO

Antimicrobial peptides (AMPs) are a unique and diverse group of molecules that play a crucial role in a myriad of biological processes and cellular functions. AMP-related studies have become increasingly popular in recent years due to antimicrobial resistance, which is becoming an emerging global concern. Systematic experimental identification of AMPs faces many difficulties due to the limitations of current methods. Given its significance, more than 30 computational methods have been developed for accurate prediction of AMPs. These approaches show high diversity in their data set size, data quality, core algorithms, feature extraction, feature selection techniques and evaluation strategies. Here, we provide a comprehensive survey on a variety of current approaches for AMP identification and point at the differences between these methods. In addition, we evaluate the predictive performance of the surveyed tools based on an independent test data set containing 1536 AMPs and 1536 non-AMPs. Furthermore, we construct six validation data sets based on six different common AMP databases and compare different computational methods based on these data sets. The results indicate that amPEPpy achieves the best predictive performance and outperforms the other compared methods. As the predictive performances are affected by the different data sets used by different methods, we additionally perform the 5-fold cross-validation test to benchmark different traditional machine learning methods on the same data set. These cross-validation results indicate that random forest, support vector machine and eXtreme Gradient Boosting achieve comparatively better performances than other machine learning methods and are often the algorithms of choice of multiple AMP prediction tools.


Assuntos
Algoritmos , Biologia Computacional/métodos , Aprendizado de Máquina , Proteínas Citotóxicas Formadoras de Poros/farmacologia , Bactérias/classificação , Bactérias/efeitos dos fármacos , Biofilmes/efeitos dos fármacos , Biofilmes/crescimento & desenvolvimento , Bases de Dados Factuais , Fungos/classificação , Fungos/efeitos dos fármacos , Proteínas Citotóxicas Formadoras de Poros/classificação , Proteínas Citotóxicas Formadoras de Poros/metabolismo , Máquina de Vetores de Suporte , Vírus/efeitos dos fármacos
5.
Brief Bioinform ; 22(3)2021 05 20.
Artigo em Inglês | MEDLINE | ID: mdl-32599617

RESUMO

Virulence factors (VFs) enable pathogens to infect their hosts. A wealth of individual, disease-focused studies has identified a wide variety of VFs, and the growing mass of bacterial genome sequence data provides an opportunity for computational methods aimed at predicting VFs. Despite their attractive advantages and performance improvements, the existing methods have some limitations and drawbacks. Firstly, as the characteristics and mechanisms of VFs are continually evolving with the emergence of antibiotic resistance, it is more and more difficult to identify novel VFs using existing tools that were previously developed based on the outdated data sets; secondly, few systematic feature engineering efforts have been made to examine the utility of different types of features for model performances, as the majority of tools only focused on extracting very few types of features. By addressing the aforementioned issues, the accuracy of VF predictors can likely be significantly improved. This, in turn, would be particularly useful in the context of genome wide predictions of VFs. In this work, we present a deep learning (DL)-based hybrid framework (termed DeepVF) that is utilizing the stacking strategy to achieve more accurate identification of VFs. Using an enlarged, up-to-date dataset, DeepVF comprehensively explores a wide range of heterogeneous features with popular machine learning algorithms. Specifically, four classical algorithms, including random forest, support vector machines, extreme gradient boosting and multilayer perceptron, and three DL algorithms, including convolutional neural networks, long short-term memory networks and deep neural networks are employed to train 62 baseline models using these features. In order to integrate their individual strengths, DeepVF effectively combines these baseline models to construct the final meta model using the stacking strategy. Extensive benchmarking experiments demonstrate the effectiveness of DeepVF: it achieves a more accurate and stable performance compared with baseline models on the benchmark dataset and clearly outperforms state-of-the-art VF predictors on the independent test. Using the proposed hybrid ensemble model, a user-friendly online predictor of DeepVF (http://deepvf.erc.monash.edu/) is implemented. Furthermore, its utility, from the user's viewpoint, is compared with that of existing toolkits. We believe that DeepVF will be exploited as a useful tool for screening and identifying potential VFs from protein-coding gene sequences in bacterial genomes.


Assuntos
Bactérias , Proteínas de Bactérias/genética , Bases de Dados de Proteínas , Aprendizado Profundo , Genoma Bacteriano , Fatores de Virulência/genética , Bactérias/genética , Bactérias/patogenicidade
6.
Brief Bioinform ; 22(4)2021 07 20.
Artigo em Inglês | MEDLINE | ID: mdl-33212503

RESUMO

Beta-lactamases (BLs) are enzymes localized in the periplasmic space of bacterial pathogens, where they confer resistance to beta-lactam antibiotics. Experimental identification of BLs is costly yet crucial to understand beta-lactam resistance mechanisms. To address this issue, we present DeepBL, a deep learning-based approach by incorporating sequence-derived features to enable high-throughput prediction of BLs. Specifically, DeepBL is implemented based on the Small VGGNet architecture and the TensorFlow deep learning library. Furthermore, the performance of DeepBL models is investigated in relation to the sequence redundancy level and negative sample selection in the benchmark dataset. The models are trained on datasets of varying sequence redundancy thresholds, and the model performance is evaluated by extensive benchmarking tests. Using the optimized DeepBL model, we perform proteome-wide screening for all reviewed bacterium protein sequences available from the UniProt database. These results are freely accessible at the DeepBL webserver at http://deepbl.erc.monash.edu.au/.


Assuntos
Biologia Computacional , Bases de Dados de Proteínas , Aprendizado Profundo , Proteoma , Software , beta-Lactamases/genética
7.
Nucleic Acids Res ; 49(D1): D651-D659, 2021 01 08.
Artigo em Inglês | MEDLINE | ID: mdl-33084862

RESUMO

Gram-negative bacteria utilize secretion systems to export substrates into their surrounding environment or directly into neighboring cells. These substrates are proteins that function to promote bacterial survival: by facilitating nutrient collection, disabling competitor species or, for pathogens, to disable host defenses. Following a rapid development of computational techniques, a growing number of substrates have been discovered and subsequently validated by wet lab experiments. To date, several online databases have been developed to catalogue these substrates but they have limited user options for in-depth analysis, and typically focus on a single type of secreted substrate. We therefore developed a universal platform, BastionHub, that incorporates extensive functional modules to facilitate substrate analysis and integrates the five major Gram-negative secreted substrate types (i.e. from types I-IV and VI secretion systems). To our knowledge, BastionHub is not only the most comprehensive online database available, it is also the first to incorporate substrates secreted by type I or type II secretion systems. By providing the most up-to-date details of secreted substrates and state-of-the-art prediction and visualized relationship analysis tools, BastionHub will be an important platform that can assist biologists in uncovering novel substrates and formulating new hypotheses. BastionHub is freely available at http://bastionhub.erc.monash.edu/.


Assuntos
Bases de Dados como Assunto , Bactérias Gram-Negativas/metabolismo , Curadoria de Dados , Anotação de Sequência Molecular , Especificidade por Substrato
8.
Brief Bioinform ; 21(4): 1119-1135, 2020 07 15.
Artigo em Inglês | MEDLINE | ID: mdl-31204427

RESUMO

Human leukocyte antigen class I (HLA-I) molecules are encoded by major histocompatibility complex (MHC) class I loci in humans. The binding and interaction between HLA-I molecules and intracellular peptides derived from a variety of proteolytic mechanisms play a crucial role in subsequent T-cell recognition of target cells and the specificity of the immune response. In this context, tools that predict the likelihood for a peptide to bind to specific HLA class I allotypes are important for selecting the most promising antigenic targets for immunotherapy. In this article, we comprehensively review a variety of currently available tools for predicting the binding of peptides to a selection of HLA-I allomorphs. Specifically, we compare their calculation methods for the prediction score, employed algorithms, evaluation strategies and software functionalities. In addition, we have evaluated the prediction performance of the reviewed tools based on an independent validation data set, containing 21 101 experimentally verified ligands across 19 HLA-I allotypes. The benchmarking results show that MixMHCpred 2.0.1 achieves the best performance for predicting peptides binding to most of the HLA-I allomorphs studied, while NetMHCpan 4.0 and NetMHCcons 1.1 outperform the other machine learning-based and consensus-based tools, respectively. Importantly, it should be noted that a peptide predicted with a higher binding score for a specific HLA allotype does not necessarily imply it will be immunogenic. That said, peptide-binding predictors are still very useful in that they can help to significantly reduce the large number of epitope candidates that need to be experimentally verified. Several other factors, including susceptibility to proteasome cleavage, peptide transport into the endoplasmic reticulum and T-cell receptor repertoire, also contribute to the immunogenicity of peptide antigens, and some of them can be considered by some predictors. Therefore, integrating features derived from these additional factors together with HLA-binding properties by using machine-learning algorithms may increase the prediction accuracy of immunogenic peptides. As such, we anticipate that this review and benchmarking survey will assist researchers in selecting appropriate prediction tools that best suit their purposes and provide useful guidelines for the development of improved antigen predictors in the future.


Assuntos
Biologia Computacional/métodos , Antígenos de Histocompatibilidade Classe I/metabolismo , Algoritmos , Conjuntos de Dados como Assunto , Antígenos de Histocompatibilidade Classe I/química , Humanos , Aprendizado de Máquina , Reprodutibilidade dos Testes
9.
Brief Bioinform ; 21(3): 1069-1079, 2020 05 21.
Artigo em Inglês | MEDLINE | ID: mdl-31161204

RESUMO

Post-translational modifications (PTMs) play very important roles in various cell signaling pathways and biological process. Due to PTMs' extremely important roles, many major PTMs have been studied, while the functional and mechanical characterization of major PTMs is well documented in several databases. However, most currently available databases mainly focus on protein sequences, while the real 3D structures of PTMs have been largely ignored. Therefore, studies of PTMs 3D structural signatures have been severely limited by the deficiency of the data. Here, we develop PRISMOID, a novel publicly available and free 3D structure database for a wide range of PTMs. PRISMOID represents an up-to-date and interactive online knowledge base with specific focus on 3D structural contexts of PTMs sites and mutations that occur on PTMs and in the close proximity of PTM sites with functional impact. The first version of PRISMOID encompasses 17 145 non-redundant modification sites on 3919 related protein 3D structure entries pertaining to 37 different types of PTMs. Our entry web page is organized in a comprehensive manner, including detailed PTM annotation on the 3D structure and biological information in terms of mutations affecting PTMs, secondary structure features and per-residue solvent accessibility features of PTM sites, domain context, predicted natively disordered regions and sequence alignments. In addition, high-definition JavaScript packages are employed to enhance information visualization in PRISMOID. PRISMOID equips a variety of interactive and customizable search options and data browsing functions; these capabilities allow users to access data via keyword, ID and advanced options combination search in an efficient and user-friendly way. A download page is also provided to enable users to download the SQL file, computational structural features and PTM sites' data. We anticipate PRISMOID will swiftly become an invaluable online resource, assisting both biologists and bioinformaticians to conduct experiments and develop applications supporting discovery efforts in the sequence-structural-functional relationship of PTMs and providing important insight into mutations and PTM sites interaction mechanisms. The PRISMOID database is freely accessible at http://prismoid.erc.monash.edu/. The database and web interface are implemented in MySQL, JSP, JavaScript and HTML with all major browsers supported.


Assuntos
Bases de Dados de Proteínas , Mutação , Processamento de Proteína Pós-Traducional , Proteínas/química , Conformação Proteica
10.
Brief Bioinform ; 21(3): 1047-1057, 2020 05 21.
Artigo em Inglês | MEDLINE | ID: mdl-31067315

RESUMO

With the explosive growth of biological sequences generated in the post-genomic era, one of the most challenging problems in bioinformatics and computational biology is to computationally characterize sequences, structures and functions in an efficient, accurate and high-throughput manner. A number of online web servers and stand-alone tools have been developed to address this to date; however, all these tools have their limitations and drawbacks in terms of their effectiveness, user-friendliness and capacity. Here, we present iLearn, a comprehensive and versatile Python-based toolkit, integrating the functionality of feature extraction, clustering, normalization, selection, dimensionality reduction, predictor construction, best descriptor/model selection, ensemble learning and results visualization for DNA, RNA and protein sequences. iLearn was designed for users that only want to upload their data set and select the functions they need calculated from it, while all necessary procedures and optimal settings are completed automatically by the software. iLearn includes a variety of descriptors for DNA, RNA and proteins, and four feature output formats are supported so as to facilitate direct output usage or communication with other computational tools. In total, iLearn encompasses 16 different types of feature clustering, selection, normalization and dimensionality reduction algorithms, and five commonly used machine-learning algorithms, thereby greatly facilitating feature analysis and predictor construction. iLearn is made freely available via an online web server and a stand-alone toolkit.


Assuntos
DNA/química , Aprendizado de Máquina , Proteínas/química , RNA/química , Análise de Sequência/métodos , Algoritmos , Internet
11.
Brief Bioinform ; 20(6): 2185-2199, 2019 11 27.
Artigo em Inglês | MEDLINE | ID: mdl-30351377

RESUMO

As a newly discovered post-translational modification (PTM), lysine malonylation (Kmal) regulates a myriad of cellular processes from prokaryotes to eukaryotes and has important implications in human diseases. Despite its functional significance, computational methods to accurately identify malonylation sites are still lacking and urgently needed. In particular, there is currently no comprehensive analysis and assessment of different features and machine learning (ML) methods that are required for constructing the necessary prediction models. Here, we review, analyze and compare 11 different feature encoding methods, with the goal of extracting key patterns and characteristics from residue sequences of Kmal sites. We identify optimized feature sets, with which four commonly used ML methods (random forest, support vector machines, K-nearest neighbor and logistic regression) and one recently proposed [Light Gradient Boosting Machine (LightGBM)] are trained on data from three species, namely, Escherichia coli, Mus musculus and Homo sapiens, and compared using randomized 10-fold cross-validation tests. We show that integration of the single method-based models through ensemble learning further improves the prediction performance and model robustness on the independent test. When compared to the existing state-of-the-art predictor, MaloPred, the optimal ensemble models were more accurate for all three species (AUC: 0.930, 0.923 and 0.944 for E. coli, M. musculus and H. sapiens, respectively). Using the ensemble models, we developed an accessible online predictor, kmal-sp, available at http://kmalsp.erc.monash.edu/. We hope that this comprehensive survey and the proposed strategy for building more accurate models can serve as a useful guide for inspiring future developments of computational methods for PTM site prediction, expedite the discovery of new malonylation and other PTM types and facilitate hypothesis-driven experimental validation of novel malonylated substrates and malonylation sites.


Assuntos
Biologia Computacional , Lisina/metabolismo , Aprendizado de Máquina , Malonatos/metabolismo , Animais , Humanos
12.
Brief Bioinform ; 20(3): 931-951, 2019 05 21.
Artigo em Inglês | MEDLINE | ID: mdl-29186295

RESUMO

In the course of infecting their hosts, pathogenic bacteria secrete numerous effectors, namely, bacterial proteins that pervert host cell biology. Many Gram-negative bacteria, including context-dependent human pathogens, use a type IV secretion system (T4SS) to translocate effectors directly into the cytosol of host cells. Various type IV secreted effectors (T4SEs) have been experimentally validated to play crucial roles in virulence by manipulating host cell gene expression and other processes. Consequently, the identification of novel effector proteins is an important step in increasing our understanding of host-pathogen interactions and bacterial pathogenesis. Here, we train and compare six machine learning models, namely, Naïve Bayes (NB), K-nearest neighbor (KNN), logistic regression (LR), random forest (RF), support vector machines (SVMs) and multilayer perceptron (MLP), for the identification of T4SEs using 10 types of selected features and 5-fold cross-validation. Our study shows that: (1) including different but complementary features generally enhance the predictive performance of T4SEs; (2) ensemble models, obtained by integrating individual single-feature models, exhibit a significantly improved predictive performance and (3) the 'majority voting strategy' led to a more stable and accurate classification performance when applied to predicting an ensemble learning model with distinct single features. We further developed a new method to effectively predict T4SEs, Bastion4 (Bacterial secretion effector predictor for T4SS), and we show our ensemble classifier clearly outperforms two recent prediction tools. In summary, we developed a state-of-the-art T4SE predictor by conducting a comprehensive performance evaluation of different machine learning algorithms along with a detailed analysis of single- and multi-feature selections.


Assuntos
Proteínas de Bactérias/metabolismo , Sistemas de Secreção Bacterianos , Aprendizado de Máquina , Algoritmos , Teorema de Bayes , Máquina de Vetores de Suporte
13.
Brief Bioinform ; 20(6): 2267-2290, 2019 11 27.
Artigo em Inglês | MEDLINE | ID: mdl-30285084

RESUMO

Lysine post-translational modifications (PTMs) play a crucial role in regulating diverse functions and biological processes of proteins. However, because of the large volumes of sequencing data generated from genome-sequencing projects, systematic identification of different types of lysine PTM substrates and PTM sites in the entire proteome remains a major challenge. In recent years, a number of computational methods for lysine PTM identification have been developed. These methods show high diversity in their core algorithms, features extracted and feature selection techniques and evaluation strategies. There is therefore an urgent need to revisit these methods and summarize their methodologies, to improve and further develop computational techniques to identify and characterize lysine PTMs from the large amounts of sequence data. With this goal in mind, we first provide a comprehensive survey on a large collection of 49 state-of-the-art approaches for lysine PTM prediction. We cover a variety of important aspects that are crucial for the development of successful predictors, including operating algorithms, sequence and structural features, feature selection, model performance evaluation and software utility. We further provide our thoughts on potential strategies to improve the model performance. Second, in order to examine the feasibility of using deep learning for lysine PTM prediction, we propose a novel computational framework, termed MUscADEL (Multiple Scalable Accurate Deep Learner for lysine PTMs), using deep, bidirectional, long short-term memory recurrent neural networks for accurate and systematic mapping of eight major types of lysine PTMs in the human and mouse proteomes. Extensive benchmarking tests show that MUscADEL outperforms current methods for lysine PTM characterization, demonstrating the potential and power of deep learning techniques in protein PTM prediction. The web server of MUscADEL, together with all the data sets assembled in this study, is freely available at http://muscadel.erc.monash.edu/. We anticipate this comprehensive review and the application of deep learning will provide practical guide and useful insights into PTM prediction and inspire future bioinformatics studies in the related fields.


Assuntos
Biologia Computacional/métodos , Lisina/metabolismo , Aprendizado de Máquina , Processamento de Proteína Pós-Traducional , Algoritmos , Estudos de Viabilidade
14.
Brief Bioinform ; 20(6): 2150-2166, 2019 11 27.
Artigo em Inglês | MEDLINE | ID: mdl-30184176

RESUMO

The roles of proteolytic cleavage have been intensively investigated and discussed during the past two decades. This irreversible chemical process has been frequently reported to influence a number of crucial biological processes (BPs), such as cell cycle, protein regulation and inflammation. A number of advanced studies have been published aiming at deciphering the mechanisms of proteolytic cleavage. Given its significance and the large number of functionally enriched substrates targeted by specific proteases, many computational approaches have been established for accurate prediction of protease-specific substrates and their cleavage sites. Consequently, there is an urgent need to systematically assess the state-of-the-art computational approaches for protease-specific cleavage site prediction to further advance the existing methodologies and to improve the prediction performance. With this goal in mind, in this article, we carefully evaluated a total of 19 computational methods (including 8 scoring function-based methods and 11 machine learning-based methods) in terms of their underlying algorithm, calculated features, performance evaluation and software usability. Then, extensive independent tests were performed to assess the robustness and scalability of the reviewed methods using our carefully prepared independent test data sets with 3641 cleavage sites (specific to 10 proteases). The comparative experimental results demonstrate that PROSPERous is the most accurate generic method for predicting eight protease-specific cleavage sites, while GPS-CCD and LabCaS outperformed other predictors for calpain-specific cleavage sites. Based on our review, we then outlined some potential ways to improve the prediction performance and ease the computational burden by applying ensemble learning, deep learning, positive unlabeled learning and parallel and distributed computing techniques. We anticipate that our study will serve as a practical and useful guide for interested readers to further advance next-generation bioinformatics tools for protease-specific cleavage site prediction.


Assuntos
Benchmarking , Biologia Computacional , Peptídeo Hidrolases/metabolismo , Pesquisa , Algoritmos , Aprendizado de Máquina , Especificidade por Substrato
15.
Bioinformatics ; 36(15): 4276-4282, 2020 08 01.
Artigo em Inglês | MEDLINE | ID: mdl-32426818

RESUMO

MOTIVATION: Different from traditional linear RNAs (containing 5' and 3' ends), circular RNAs (circRNAs) are a special type of RNAs that have a closed ring structure. Accumulating evidence has indicated that circRNAs can directly bind proteins and participate in a myriad of different biological processes. RESULTS: For identifying the interaction of circRNAs with 37 different types of circRNA-binding proteins (RBPs), we develop an ensemble neural network, termed PASSION, which is based on the concatenated artificial neural network (ANN) and hybrid deep neural network frameworks. Specifically, the input of the ANN is the optimal feature subset for each RBP, which has been selected from six types of feature encoding schemes through incremental feature selection and application of the XGBoost algorithm. In turn, the input of the hybrid deep neural network is a stacked codon-based scheme. Benchmarking experiments indicate that the ensemble neural network reaches the average best area under the curve (AUC) of 0.883 across the 37 circRNA datasets when compared with XGBoost, k-nearest neighbor, support vector machine, random forest, logistic regression and Naive Bayes. Moreover, each of the 37 RBP models is extensively tested by performing independent tests, with the varying sequence similarity thresholds of 0.8, 0.7, 0.6 and 0.5, respectively. The corresponding average AUC obtained are 0.883, 0.876, 0.868 and 0.883, respectively, highlighting the effectiveness and robustness of PASSION. Extensive benchmarking experiments demonstrate that PASSION achieves a competitive performance for identifying binding sites between circRNA and RBPs, when compared with several state-of-the-art methods. AVAILABILITY AND IMPLEMENTATION: A user-friendly web server of PASSION is publicly accessible at http://flagship.erc.monash.edu/PASSION/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
RNA Circular , Proteínas de Ligação a RNA , Teorema de Bayes , Sítios de Ligação , Redes Neurais de Computação , Proteínas de Ligação a RNA/metabolismo
16.
Bioinformatics ; 36(3): 704-712, 2020 02 01.
Artigo em Inglês | MEDLINE | ID: mdl-31393553

RESUMO

MOTIVATION: Gram-positive bacteria have developed secretion systems to transport proteins across their cell wall, a process that plays an important role during host infection. These secretion mechanisms have also been harnessed for therapeutic purposes in many biotechnology applications. Accordingly, the identification of features that select a protein for efficient secretion from these microorganisms has become an important task. Among all the secreted proteins, 'non-classical' secreted proteins are difficult to identify as they lack discernable signal peptide sequences and can make use of diverse secretion pathways. Currently, several computational methods have been developed to facilitate the discovery of such non-classical secreted proteins; however, the existing methods are based on either simulated or limited experimental datasets. In addition, they often employ basic features to train the models in a simple and coarse-grained manner. The availability of more experimentally validated datasets, advanced feature engineering techniques and novel machine learning approaches creates new opportunities for the development of improved predictors of 'non-classical' secreted proteins from sequence data. RESULTS: In this work, we first constructed a high-quality dataset of experimentally verified 'non-classical' secreted proteins, which we then used to create benchmark datasets. Using these benchmark datasets, we comprehensively analyzed a wide range of features and assessed their individual performance. Subsequently, we developed a two-layer Light Gradient Boosting Machine (LightGBM) ensemble model that integrates several single feature-based models into an overall prediction framework. At this stage, LightGBM, a gradient boosting machine, was used as a machine learning approach and the necessary parameter optimization was performed by a particle swarm optimization strategy. All single feature-based LightGBM models were then integrated into a unified ensemble model to further improve the predictive performance. Consequently, the final ensemble model achieved a superior performance with an accuracy of 0.900, an F-value of 0.903, Matthew's correlation coefficient of 0.803 and an area under the curve value of 0.963, and outperforming previous state-of-the-art predictors on the independent test. Based on our proposed optimal ensemble model, we further developed an accessible online predictor, PeNGaRoo, to serve users' demands. We believe this online web server, together with our proposed methodology, will expedite the discovery of non-classically secreted effector proteins in Gram-positive bacteria and further inspire the development of next-generation predictors. AVAILABILITY AND IMPLEMENTATION: http://pengaroo.erc.monash.edu/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Algoritmos , Aprendizado de Máquina , Biologia Computacional , Peptídeos , Proteínas
17.
Bioinformatics ; 36(4): 1057-1065, 2020 02 15.
Artigo em Inglês | MEDLINE | ID: mdl-31566664

RESUMO

MOTIVATION: Proteases are enzymes that cleave target substrate proteins by catalyzing the hydrolysis of peptide bonds between specific amino acids. While the functional proteolysis regulated by proteases plays a central role in the 'life and death' cellular processes, many of the corresponding substrates and their cleavage sites were not found yet. Availability of accurate predictors of the substrates and cleavage sites would facilitate understanding of proteases' functions and physiological roles. Deep learning is a promising approach for the development of accurate predictors of substrate cleavage events. RESULTS: We propose DeepCleave, the first deep learning-based predictor of protease-specific substrates and cleavage sites. DeepCleave uses protein substrate sequence data as input and employs convolutional neural networks with transfer learning to train accurate predictive models. High predictive performance of our models stems from the use of high-quality cleavage site features extracted from the substrate sequences through the deep learning process, and the application of transfer learning, multiple kernels and attention layer in the design of the deep network. Empirical tests against several related state-of-the-art methods demonstrate that DeepCleave outperforms these methods in predicting caspase and matrix metalloprotease substrate-cleavage sites. AVAILABILITY AND IMPLEMENTATION: The DeepCleave webserver and source code are freely available at http://deepcleave.erc.monash.edu/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Aprendizado Profundo , Caspases , Metaloproteases , Software , Especificidade por Substrato
18.
BMC Bioinformatics ; 21(1): 493, 2020 Oct 31.
Artigo em Inglês | MEDLINE | ID: mdl-33129275

RESUMO

BACKGROUND: Cytokines act by binding to specific receptors in the plasma membrane of target cells. Knowledge of cytokine-receptor interaction (CRI) is very important for understanding the pathogenesis of various human diseases-notably autoimmune, inflammatory and infectious diseases-and identifying potential therapeutic targets. Recently, machine learning algorithms have been used to predict CRIs. "Gold Standard" negative datasets are still lacking and strong biases in negative datasets can significantly affect the training of learning algorithms and their evaluation. To mitigate the unrepresentativeness and bias inherent in the negative sample selection (non-interacting proteins), we propose a clustering-based approach for representative negative sample selection. RESULTS: We used deep autoencoders to investigate the effect of different sampling approaches for non-interacting pairs on the training and the performance of machine learning classifiers. By using the anomaly detection capabilities of deep autoencoders we deduced the effects of different categories of negative samples on the training of learning algorithms. Random sampling for selecting non-interacting pairs results in either over- or under-representation of hard or easy to classify instances. When K-means based sampling of negative datasets is applied to mitigate the inadequacies of random sampling, random forest (RF) together with the combined feature set of atomic composition, physicochemical-2grams and two different representations of evolutionary information performs best. Average model performances based on leave-one-out cross validation (loocv) over ten different negative sample sets that each model was trained with, show that RF models significantly outperform the previous best CRI predictor in terms of accuracy (+ 5.1%), specificity (+ 13%), mcc (+ 0.1) and g-means value (+ 5.1). Evaluations using tenfold cv and training/testing splits confirm the competitive performance. CONCLUSIONS: A comparative analysis was performed to assess the effect of three different sampling methods (random, K-means and uniform sampling) on the training of learning algorithms using different evaluation methods. Models trained on K-means sampled datasets generally show a significantly improved performance compared to those trained on random selections-with RF seemingly benefiting most in our particular setting. Our findings on the sampling are highly relevant and apply to many applications of supervised learning approaches in bioinformatics.


Assuntos
Receptores de Citocinas/metabolismo , Algoritmos , Humanos , Aprendizado de Máquina , Matrizes de Pontuação de Posição Específica , Reprodutibilidade dos Testes
19.
Brief Bioinform ; 19(1): 148-161, 2018 01 01.
Artigo em Inglês | MEDLINE | ID: mdl-27777222

RESUMO

Bacterial effector proteins secreted by various protein secretion systems play crucial roles in host-pathogen interactions. In this context, computational tools capable of accurately predicting effector proteins of the various types of bacterial secretion systems are highly desirable. Existing computational approaches use different machine learning (ML) techniques and heterogeneous features derived from protein sequences and/or structural information. These predictors differ not only in terms of the used ML methods but also with respect to the used curated data sets, the features selection and their prediction performance. Here, we provide a comprehensive survey and benchmarking of currently available tools for the prediction of effector proteins of bacterial types III, IV and VI secretion systems (T3SS, T4SS and T6SS, respectively). We review core algorithms, feature selection techniques, tool availability and applicability and evaluate the prediction performance based on carefully curated independent test data sets. In an effort to improve predictive performance, we constructed three ensemble models based on ML algorithms by integrating the output of all individual predictors reviewed. Our benchmarks demonstrate that these ensemble models outperform all the reviewed tools for the prediction of effector proteins of T3SS and T4SS. The webserver of the proposed ensemble methods for T3SS and T4SS effector protein prediction is freely available at http://tbooster.erc.monash.edu/index.jsp. We anticipate that this survey will serve as a useful guide for interested users and that the new ensemble predictors will stimulate research into host-pathogen relationships and inspiration for the development of new bioinformatics tools for predicting effector proteins of T3SS, T4SS and T6SS.


Assuntos
Bactérias/metabolismo , Proteínas de Bactérias/metabolismo , Sistemas de Secreção Bacterianos/genética , Genoma Bacteriano , Matrizes de Pontuação de Posição Específica , Algoritmos , Sequência de Aminoácidos , Aminoácidos/metabolismo , Bactérias/crescimento & desenvolvimento , Proteínas de Bactérias/classificação , Proteínas de Bactérias/genética , Regulação Bacteriana da Expressão Gênica , Interações Hospedeiro-Patógeno , Humanos , Software
20.
Bioinformatics ; 35(17): 2957-2965, 2019 09 01.
Artigo em Inglês | MEDLINE | ID: mdl-30649179

RESUMO

MOTIVATION: Promoters are short DNA consensus sequences that are localized proximal to the transcription start sites of genes, allowing transcription initiation of particular genes. However, the precise prediction of promoters remains a challenging task because individual promoters often differ from the consensus at one or more positions. RESULTS: In this study, we present a new multi-layer computational approach, called MULTiPly, for recognizing promoters and their specific types. MULTiPly took into account the sequences themselves, including both local information such as k-tuple nucleotide composition, dinucleotide-based auto covariance and global information of the entire samples based on bi-profile Bayes and k-nearest neighbour feature encodings. Specifically, the F-score feature selection method was applied to identify the best unique type of feature prediction results, in combination with other types of features that were subsequently added to further improve the prediction performance of MULTiPly. Benchmarking experiments on the benchmark dataset and comparisons with five state-of-the-art tools show that MULTiPly can achieve a better prediction performance on 5-fold cross-validation and jackknife tests. Moreover, the superiority of MULTiPly was also validated on a newly constructed independent test dataset. MULTiPly is expected to be used as a useful tool that will facilitate the discovery of both general and specific types of promoters in the post-genomic era. AVAILABILITY AND IMPLEMENTATION: The MULTiPly webserver and curated datasets are freely available at http://flagshipnt.erc.monash.edu/MULTiPly/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Genômica , Regiões Promotoras Genéticas , Software , Teorema de Bayes , Sítio de Iniciação de Transcrição
SELEÇÃO DE REFERÊNCIAS
DETALHE DA PESQUISA