1.
Brief Bioinform ; 22(2): 2126-2140, 2021 03 22.
Article in English | MEDLINE | ID: mdl-32363397

ABSTRACT

Promoters are short consensus sequences of DNA, which are responsible for transcription activation or the repression of all genes. There are many types of promoters in bacteria with important roles in initiating gene transcription. Therefore, solving promoter-identification problems has important implications for improving the understanding of their functions. To this end, computational methods targeting promoter classification have been established; however, their performance remains unsatisfactory. In this study, we present a novel stacked-ensemble approach (termed SELECTOR) for both identifying promoters and classifying them by type. SELECTOR combined the composition of k-spaced nucleic acid pairs, parallel correlation pseudo-dinucleotide composition, position-specific trinucleotide propensity based on single-strand, and DNA strand features, and used five popular tree-based ensemble learning algorithms to build a stacked model. Both 5-fold cross-validation tests using benchmark datasets and independent tests using the newly collected independent test dataset showed that SELECTOR outperformed state-of-the-art methods in both general and specific types of promoter prediction in Escherichia coli. Furthermore, this novel framework provides essential interpretations that aid understanding of model success by leveraging the powerful Shapley Additive exPlanation algorithm, thereby highlighting the most important features relevant for predicting both general and specific types of promoters and overcoming the limitations of existing 'Black-box' approaches that are unable to reveal causal relationships from large amounts of initially encoded features.
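
As a rough illustration of the first feature type named in this abstract, the composition of k-spaced nucleic acid pairs can be sketched as the normalized frequency of each of the 16 nucleotide pairs separated by k positions. This is a minimal sketch under that commonly used definition; the function name `cksnap` and the handling of ambiguous bases are assumptions, not details taken from the SELECTOR paper.

```python
from itertools import product

NUCLEOTIDES = "ACGT"


def cksnap(sequence, k):
    """Return the frequencies of the 16 nucleotide pairs separated by k positions.

    Hypothetical helper: the pair order is AA, AC, AG, AT, CA, ..., TT.
    """
    pairs = ["".join(p) for p in product(NUCLEOTIDES, repeat=2)]
    counts = dict.fromkeys(pairs, 0)
    total = 0
    for i in range(len(sequence) - k - 1):
        pair = sequence[i] + sequence[i + k + 1]
        if pair in counts:  # assumption: skip pairs with ambiguous bases such as N
            counts[pair] += 1
            total += 1
    if total == 0:
        return [0.0] * len(pairs)
    return [counts[p] / total for p in pairs]
```

Concatenating `cksnap(seq, k)` for several values of k would give one fixed-length vector per sequence, which is the kind of input a tree-based learner expects.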


Subject(s)
Escherichia coli/genetics , Machine Learning , Promoter Regions, Genetic , Datasets as Topic , Genes, Bacterial , Reproducibility of Results
2.
Brief Bioinform ; 20(3): 931-951, 2019 05 21.
Article in English | MEDLINE | ID: mdl-29186295

ABSTRACT

In the course of infecting their hosts, pathogenic bacteria secrete numerous effectors, namely, bacterial proteins that pervert host cell biology. Many Gram-negative bacteria, including context-dependent human pathogens, use a type IV secretion system (T4SS) to translocate effectors directly into the cytosol of host cells. Various type IV secreted effectors (T4SEs) have been experimentally validated to play crucial roles in virulence by manipulating host cell gene expression and other processes. Consequently, the identification of novel effector proteins is an important step in increasing our understanding of host-pathogen interactions and bacterial pathogenesis. Here, we train and compare six machine learning models, namely, Naïve Bayes (NB), K-nearest neighbor (KNN), logistic regression (LR), random forest (RF), support vector machines (SVMs) and multilayer perceptron (MLP), for the identification of T4SEs using 10 types of selected features and 5-fold cross-validation. Our study shows that: (1) including different but complementary features generally enhances the predictive performance for T4SEs; (2) ensemble models, obtained by integrating individual single-feature models, exhibit a significantly improved predictive performance and (3) the 'majority voting strategy' led to a more stable and accurate classification performance when applied to an ensemble learning model with distinct single features. We further developed a new method to effectively predict T4SEs, Bastion4 (Bacterial secretion effector predictor for T4SS), and we show our ensemble classifier clearly outperforms two recent prediction tools. In summary, we developed a state-of-the-art T4SE predictor by conducting a comprehensive performance evaluation of different machine learning algorithms along with a detailed analysis of single- and multi-feature selections.
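
The majority voting strategy mentioned in this abstract can be sketched in a few lines. This is only an illustration of plain (unweighted) voting over the labels produced by several single-feature models; the function name `majority_vote` is an assumption, and the abstract does not say how Bastion4 breaks ties.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model label lists into one label per sample.

    predictions: list of lists, one inner list of labels per model,
    all of the same length. With an odd number of binary models no
    ties can occur; otherwise the label seen first wins (an assumption).
    """
    voted = []
    for labels in zip(*predictions):
        voted.append(Counter(labels).most_common(1)[0][0])
    return voted
```

For example, three models that each label the same three proteins could be combined with `majority_vote([model_a, model_b, model_c])`, where the three lists are hypothetical model outputs.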


Subject(s)
Bacterial Proteins/metabolism , Bacterial Secretion Systems , Machine Learning , Algorithms , Bayes Theorem , Support Vector Machine
3.
Bioinformatics ; 35(12): 2017-2028, 2019 06 01.
Article in English | MEDLINE | ID: mdl-30388198

ABSTRACT

MOTIVATION: Type III secreted effectors (T3SEs) can be injected into host cell cytoplasm via type III secretion systems (T3SSs) to modulate interactions between Gram-negative bacterial pathogens and their hosts. Due to their relevance in pathogen-host interactions, significant computational efforts have been put toward identification of T3SEs and these in turn have stimulated new T3SE discoveries. However, as T3SEs with new characteristics are discovered, these existing computational tools reveal important limitations: (i) most of the trained machine learning models are based on the N-terminus (or incorporating also the C-terminus) instead of the proteins' complete sequences, and (ii) the underlying models (trained with classic algorithms) employed only a few features, most of which were extracted based on sequence information alone. To achieve better T3SE prediction, we must identify more powerful, informative features and investigate how to effectively integrate these into a comprehensive model. RESULTS: In this work, we present Bastion3, a two-layer ensemble predictor developed to accurately identify type III secreted effectors from protein sequence data. In contrast with existing methods that employ single models with few features, Bastion3 explores a wide range of features, from various types, trains single models based on these features and finally integrates these models through ensemble learning. We trained the models using a new gradient boosting machine, LightGBM, and further boosted the models' performances through a novel genetic algorithm (GA) based two-step parameter optimization strategy. Our benchmark test demonstrates that Bastion3 achieves a much better performance compared to commonly used methods, with an ACC value of 0.959, F-value of 0.958, MCC value of 0.917 and AUC value of 0.956, which comprehensively outperformed all other toolkits by more than 5.6% in ACC value, 5.7% in F-value, 12.4% in MCC value and 5.8% in AUC value. Based on our proposed two-layer ensemble model, we further developed a user-friendly online toolkit, maximizing convenience for experimental scientists toward T3SE prediction. With its design to ease future discoveries of novel T3SEs and improved performance, Bastion3 is poised to become a widely used, state-of-the-art toolkit for T3SE prediction. AVAILABILITY AND IMPLEMENTATION: http://bastion3.erc.monash.edu/. CONTACT: selkrig@embl.de or wyztli@163.com or trevor.lithgow@monash.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
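
The ACC, F-value and MCC figures quoted in this abstract are standard functions of a binary confusion matrix. The sketch below shows the usual definitions; the function name `binary_metrics` and the zero-division conventions are assumptions, and the counts used in any call are made up for illustration rather than taken from the Bastion3 benchmark.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute accuracy, F-value and Matthews correlation coefficient.

    tp, fp, tn, fn: counts of true/false positives and negatives.
    Undefined ratios are reported as 0.0 (a convention chosen here).
    """
    total = tp + fp + tn + fn
    acc = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f_value = 2 * precision * recall / (precision + recall)
    else:
        f_value = 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "F": f_value, "MCC": mcc}
```

MCC uses all four cells of the confusion matrix, which is one reason it can move by a larger margin than ACC when two predictors are compared on the same data.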


Subject(s)
Machine Learning , Algorithms , Amino Acid Sequence , Bacterial Proteins , Computational Biology , Gram-Negative Bacteria , Software
4.
Bioinformatics ; 34(15): 2546-2555, 2018 08 01.
Article in English | MEDLINE | ID: mdl-29547915

ABSTRACT

Motivation: Many Gram-negative bacteria use type VI secretion systems (T6SS) to export effector proteins into adjacent target cells. These secreted effectors (T6SEs) play vital roles in the competitive survival in bacterial populations, as well as pathogenesis of bacteria. Although various computational analyses have been previously applied to identify effectors secreted by certain bacterial species, there is no universal method available to accurately predict T6SS effector proteins from the growing tide of bacterial genome sequence data. Results: We extracted a wide range of features from T6SE protein sequences and comprehensively analyzed the prediction performance of these features through unsupervised and supervised learning. By integrating these features, we subsequently developed a two-layer SVM-based ensemble model with fine-grain optimized parameters, to identify potential T6SEs. We further validated the predictive model using an independent dataset, which showed that the proposed model achieved an impressive performance in terms of ACC (0.943), F-value (0.946), MCC (0.892) and AUC (0.976). To demonstrate applicability, we employed this method to correctly identify two very recently validated T6SE proteins, which represent challenging prediction targets because they significantly differed from previously known T6SEs in terms of their sequence similarity and cellular function. Furthermore, a genome-wide prediction across 12 bacterial species, involving in total 54 212 protein sequences, was carried out to distinguish 94 putative T6SE candidates. We envisage both this information and our publicly accessible web server will facilitate future discoveries of novel T6SEs. Availability and implementation: http://bastion6.erc.monash.edu/. Supplementary information: Supplementary data are available at Bioinformatics online.


Subject(s)
Bacterial Proteins/metabolism , Gram-Negative Bacteria/metabolism , Sequence Analysis, Protein/methods , Software , Type VI Secretion Systems/metabolism , Amino Acid Sequence , Bacterial Proteins/chemistry , Computational Biology/methods , Internet , Machine Learning , Sequence Analysis, DNA/methods , Type VI Secretion Systems/chemistry
5.
BMC Bioinformatics ; 19(Suppl 1): 39, 2018 02 19.
Article in English | MEDLINE | ID: mdl-29504897

ABSTRACT

BACKGROUND: Since many proteins become functional only after they interact with their partner proteins and form protein complexes, it is essential to identify the sets of proteins that form complexes. Therefore, several computational methods have been proposed to predict complexes from the topology and structure of experimental protein-protein interaction (PPI) networks. These methods work well to predict complexes involving at least three proteins, but generally fail at identifying complexes involving only two different proteins, called heterodimeric complexes or heterodimers. There is however an urgent need for efficient methods to predict heterodimers, since the majority of known protein complexes are precisely heterodimers. RESULTS: In this paper, we use three promising kernel functions, Min kernel and two pairwise kernels, which are Metric Learning Pairwise Kernel (MLPK) and Tensor Product Pairwise Kernel (TPPK). We also consider the normalized form of Min kernel. Then, we combine Min kernel or its normalized form and one of the pairwise kernels by plugging. We applied kernels based on PPI, domain, phylogenetic profile, and subcellular localization properties to predicting heterodimers. Then, we evaluate our method by employing C-Support Vector Classification (C-SVC), carrying out 10-fold cross-validation, and calculating the average F-measures. The results suggest that the combination of normalized-Min-kernel and MLPK leads to the best F-measure and improved the performance of our previous work, which had been the best existing method so far. CONCLUSIONS: We propose new methods to predict heterodimers, using a machine learning-based approach. We train a support vector machine (SVM) to discriminate interacting vs non-interacting protein pairs, based on information extracted from PPI, domain, phylogenetic profiles and subcellular localization. We evaluate in detail new kernel functions to encode these data, and report prediction performance that outperforms the state-of-the-art.
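
The kernels named in this abstract have compact textbook definitions, and "plugging" a base kernel into a pairwise kernel can be sketched as passing one function to another. This is a minimal sketch under the usual definitions of the Min kernel, TPPK and MLPK; the function names and the use of plain lists of nonnegative feature values are assumptions, not the paper's implementation.

```python
import math


def min_kernel(x, y):
    """Min kernel: sum of element-wise minima of two nonnegative vectors."""
    return sum(min(a, b) for a, b in zip(x, y))


def normalized(kernel, x, y):
    """Normalized form: k(x, y) / sqrt(k(x, x) * k(y, y))."""
    denom = math.sqrt(kernel(x, x) * kernel(y, y))
    return kernel(x, y) / denom if denom else 0.0


def tppk(kernel, pair1, pair2):
    """Tensor Product Pairwise Kernel over two unordered protein pairs."""
    (a, b), (c, d) = pair1, pair2
    return kernel(a, c) * kernel(b, d) + kernel(a, d) * kernel(b, c)


def mlpk(kernel, pair1, pair2):
    """Metric Learning Pairwise Kernel over two unordered protein pairs."""
    (a, b), (c, d) = pair1, pair2
    return (kernel(a, c) - kernel(a, d) - kernel(b, c) + kernel(b, d)) ** 2
```

To plug the normalized Min kernel into MLPK, one could call `mlpk(lambda x, y: normalized(min_kernel, x, y), (a, b), (c, d))`, where `a` to `d` are hypothetical per-protein feature vectors. Both pairwise kernels give the same value when the two proteins inside a pair are swapped, which matters because a heterodimer has no natural ordering.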


Subject(s)
Algorithms , Multiprotein Complexes/chemistry , Dimerization , Multiprotein Complexes/classification , Phylogeny , Protein Domains , Protein Interaction Maps , Protein Multimerization , Support Vector Machine
6.
Brief Bioinform ; 17(2): 270-82, 2016 Mar.
Article in English | MEDLINE | ID: mdl-26177815

ABSTRACT

Coiled-coils refer to a bundle of helices coiled together like strands of a rope. It has been estimated that nearly 3% of protein-encoding regions of genes harbour coiled-coil domains (CCDs). Experimental studies have confirmed that CCDs play a fundamental role in subcellular infrastructure and controlling trafficking of eukaryotic cells. Given the importance of coiled-coils, multiple bioinformatics tools have been developed to facilitate the systematic and high-throughput prediction of CCDs in proteins. In this article, we review and compare 12 sequence-based bioinformatics approaches and tools for coiled-coil prediction. These approaches can be categorized into two classes: coiled-coil detection and coiled-coil oligomeric state prediction. We evaluated and compared these methods in terms of their input/output, algorithm, prediction performance, validation methods and software utility. All the independent testing data sets are available at http://lightning.med.monash.edu/coiledcoil/. In addition, we conducted a case study of nine human polyglutamine (PolyQ) disease-related proteins and predicted CCDs and oligomeric states using various predictors. Prediction results for CCDs were highly variable among different predictors. Only two peptides from two proteins were confirmed to be CCDs by majority voting. Both domains were predicted to form dimeric coiled-coils using oligomeric state prediction. We anticipate that this comprehensive analysis will be an insightful resource for structural biologists with limited prior experience in bioinformatics tools, and for bioinformaticians who are interested in designing novel approaches for coiled-coil and its oligomeric state prediction.


Subject(s)
Algorithms , Models, Chemical , Models, Molecular , Proteins/chemistry , Proteins/ultrastructure , Sequence Analysis, Protein/methods , Computer Simulation , Dimerization , Protein Conformation , Protein Domains , Software
7.
BMC Bioinformatics ; 17(1): 487, 2016 Nov 25.
Article in English | MEDLINE | ID: mdl-27887571

ABSTRACT

BACKGROUND: Dicer is necessary for the process of mature microRNA (miRNA) formation because the Dicer enzyme cleaves pre-miRNA correctly to generate miRNA with correct seed regions. Nonetheless, the mechanism underlying the selection of a Dicer cleavage site is still not fully understood. To date, several studies have been conducted to solve this problem, for example, a recent discovery indicates that the loop/bulge structure plays a central role in the selection of Dicer cleavage sites. In accordance with this breakthrough, a support vector machine (SVM)-based method called PHDCleav was developed to predict Dicer cleavage sites which outperforms other methods based on random forest and naive Bayes. PHDCleav, however, tests only whether a position in the shift window belongs to a loop/bulge structure. RESULT: In this paper, we used the length of loop/bulge structures (in addition to their presence or absence) to develop an improved method, LBSizeCleav, for predicting Dicer cleavage sites. To evaluate our method, we used 810 empirically validated sequences of human pre-miRNAs and performed fivefold cross-validation. In both 5p and 3p arms of pre-miRNAs, LBSizeCleav showed greater prediction accuracy than PHDCleav did. This result suggests that the length of loop/bulge structures is useful for prediction of Dicer cleavage sites. CONCLUSION: We developed a novel algorithm for feature space mapping based on the length of a loop/bulge for predicting Dicer cleavage sites. The better performance of our method indicates the usefulness of the length of loop/bulge structures for such predictions.


Subject(s)
Algorithms , MicroRNAs/genetics , RNA Precursors/genetics , Ribonuclease III/metabolism , Software , Support Vector Machine , Base Sequence , Bayes Theorem , DEAD-box RNA Helicases , Humans , Nucleic Acid Conformation , RNA Precursors/metabolism
8.
BMC Bioinformatics ; 17: 113, 2016 Mar 01.
Article in English | MEDLINE | ID: mdl-26932529

ABSTRACT

BACKGROUND: Drug discovery and design are important research fields in bioinformatics. Enumeration of chemical compounds is essential not only for this purpose, but also for analysis of chemical space and structure elucidation. In our previous study, we developed enumeration methods BfsSimEnum and BfsMulEnum for tree-like chemical compounds using a tree-structure to represent a chemical compound, which is limited to acyclic chemical compounds only. RESULTS: In this paper, we extend the methods, and develop BfsBenNaphEnum that can enumerate tree-like chemical compounds containing benzene rings and naphthalene rings, which include benzene isomers and naphthalene isomers such as ortho, meta, and para, by treating a benzene ring as an atom with valence six, instead of a ring of six carbon atoms, and treating a naphthalene ring as two benzene rings having a special bond. We compare our method with MOLGEN 5.0, which is a well-known general purpose structure generator, to enumerate chemical structures from a set of chemical formulas in terms of the number of enumerated structures and the computational time. The result suggests that our proposed method can reduce the computational time substantially. CONCLUSIONS: We propose the enumeration method BfsBenNaphEnum for tree-like chemical compounds containing benzene rings and naphthalene rings as cyclic structures. BfsBenNaphEnum was 50 to 5,000,000 times faster than MOLGEN 5.0 for instances with 8 to 14 carbon atoms in our experiments.


Subject(s)
Algorithms , Benzene/chemistry , Chemistry, Pharmaceutical/methods , Computational Biology/methods , Naphthalenes/chemistry , Stereoisomerism
9.
BMC Bioinformatics ; 16: 128, 2015 Apr 24.
Article in English | MEDLINE | ID: mdl-25907438

ABSTRACT

BACKGROUND: Many tree structures are found in nature and organisms. Such trees are believed to be constructed on the basis of certain rules. We have previously developed grammar-based compression methods for ordered and unordered single trees, based on bisection-type tree grammars. Here, these methods find construction rules for one single tree. On the other hand, specified construction rules can be utilized to generate multiple similar trees. RESULTS: Therefore, in this paper, we develop novel methods to discover common rules for the construction of multiple distinct trees, by improving and extending the previous methods using integer programming. We apply our proposed methods to several sets of glycans and RNA secondary structures, which play important roles in cellular systems, and can be regarded as tree structures. The results suggest that our method can be successfully applied to determining the minimum grammar and several common rules among glycans and RNAs. CONCLUSIONS: We propose integer programming-based methods MinSEOTGMul and MinSEUTGMul for the determination of the minimum grammars constructing multiple ordered and unordered trees, respectively. The proposed methods can provide clues for the determination of hierarchical structures contained in tree-structured biological data, beyond the extraction of frequent patterns.


Subject(s)
Algorithms , Data Compression/methods , Polysaccharides/chemistry , RNA/chemistry , Computational Biology/methods , Humans
10.
Methods ; 67(3): 380-5, 2014 Jun 01.
Article in English | MEDLINE | ID: mdl-24486717

ABSTRACT

In this paper, we study domain compositions of proteins via compression of whole proteins in an organism for the sake of obtaining the entropy that the individual contains. We suppose that a protein is a multiset of domains. Since gene duplication and fusion have occurred through evolutionary processes, the same domains and the same compositions of domains appear in multiple proteins, which enables us to compress a proteome by using references to proteins for duplicated and fused proteins. Such a network with references to at most two proteins is modeled as a directed hypergraph. We propose a heuristic approach by combining the Edmonds algorithm and an integer linear programming, and apply our procedure to 14 proteomes of Dictyostelium discoideum, Escherichia coli, Saccharomyces cerevisiae, Schizosaccharomyces pombe, Caenorhabditis elegans, Drosophila melanogaster, Arabidopsis thaliana, Oryza sativa, Danio rerio, Xenopus laevis, Gallus gallus, Mus musculus, Pan troglodytes, and Homo sapiens. The compressed size using both of duplication and fusion was smaller than that using only duplication, which suggests the importance of fusion events in evolution of a proteome.


Subject(s)
Protein Structure, Tertiary , Proteome , Proteomics/methods , Algorithms , Animals , Databases, Protein , Humans , Sequence Analysis, Protein
11.
BMC Bioinformatics ; 15 Suppl 2: S6, 2014.
Article in English | MEDLINE | ID: mdl-24564744

ABSTRACT

BACKGROUND: Protein complexes play important roles in biological systems such as gene regulatory networks and metabolic pathways. Most methods for predicting protein complexes try to find complexes of size greater than three. It is known, however, that smaller complexes account for a large part of all complexes in several species. In our previous work, we developed a method with several feature space mappings and the domain composition kernel for prediction of heterodimeric protein complexes, which outperforms existing methods. RESULTS: We propose methods for prediction of heterotrimeric protein complexes by extending techniques in the previous work on the basis of the idea that most heterotrimeric protein complexes are not likely to share the same protein with each other. We make use of the discriminant function in support vector machines (SVMs), and design novel feature space mappings for the second phase. As the second classifier, we examine SVMs and relevance vector machines (RVMs). We perform 10-fold cross-validation computational experiments. The results suggest that our proposed two-phase methods and SVM with the extended features outperform the existing method NWE, which was reported to outperform other existing methods such as MCL, MCODE, DPClus, CMC, COACH, RRW, and PPSampler for prediction of heterotrimeric protein complexes. CONCLUSIONS: We propose two-phase prediction methods with the extended features, the domain composition kernel, SVMs and RVMs. The two-phase method with the extended features and the domain composition kernel using SVM as the second classifier is particularly useful for prediction of heterotrimeric protein complexes.


Subject(s)
Multiprotein Complexes/analysis , Support Vector Machine , Discriminant Analysis , Protein Multimerization
12.
ScientificWorldJournal ; 2014: 240673, 2014.
Article in English | MEDLINE | ID: mdl-25093200

ABSTRACT

Proteins in living organisms express various important functions by interacting with other proteins and molecules. Therefore, many efforts have been made to investigate and predict protein-protein interactions (PPIs). Analysis of strengths of PPIs is also important because such strengths are involved in functionality of proteins. In this paper, we propose several feature space mappings from protein pairs using protein domain information to predict strengths of PPIs. Moreover, we perform computational experiments employing two machine learning methods, support vector regression (SVR) and relevance vector machine (RVM), for dataset obtained from biological experiments. The prediction results showed that both SVR and RVM with our proposed features outperformed the best existing method.


Subject(s)
Protein Interaction Mapping , Artificial Intelligence , Computational Biology , Protein Interaction Domains and Motifs
13.
ScientificWorldJournal ; 2013: 632030, 2013.
Article in English | MEDLINE | ID: mdl-23737722

ABSTRACT

Because every disease has its unique survival pattern, it is necessary to find a suitable model to simulate follow-up. DNA microarray is a useful technique to detect thousands of gene expressions at one time and is usually employed to classify different types of cancer. We propose combination methods of penalized regression models and nonnegative matrix factorization (NMF) for predicting survival. We tried L1- (lasso), L2- (ridge), and L1-L2 combined (elastic net) penalized regression for diffuse large B-cell lymphoma (DLBCL) patients' microarray data and found that the L1-L2 combined method predicts survival best, with the smallest logrank P value. Furthermore, 80% of selected genes have been reported to correlate with carcinogenesis or lymphoma. Through NMF we found that DLBCL patients can be clearly divided into 4 groups, which implies that DLBCL may have 4 subtypes with slightly different survival patterns. Next we excluded some patients who were indicated as hard to classify in NMF and ran the three penalized regression models again. We found that the performance of survival prediction improved, with lower logrank P values. Therefore, we conclude that after preselection of patients by NMF, penalized regression models can predict DLBCL patients' survival successfully.
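
The three penalties compared in this abstract differ only in which terms of one general penalty are switched on. The block below writes that penalty out in a common textbook form; the symbols (coefficient vector β over p genes, tuning parameters λ1 and λ2, and ℓ for whatever log-likelihood the survival model uses) are notation assumed here, not taken from the paper.

```latex
\[
\hat{\beta} \;=\; \arg\max_{\beta}\;\Bigl\{\, \ell(\beta) \;-\; \lambda_1 \sum_{j=1}^{p} \lvert \beta_j \rvert \;-\; \lambda_2 \sum_{j=1}^{p} \beta_j^{2} \Bigr\},
\qquad \lambda_1, \lambda_2 \ge 0 .
\]
```

Setting λ2 = 0 gives the L1 (lasso) penalty, setting λ1 = 0 gives the L2 (ridge) penalty, and keeping both positive gives the L1-L2 combined (elastic net) penalty. Only the L1 term can shrink coefficients exactly to zero, which is what allows a subset of genes to be selected.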


Subject(s)
Biomarkers, Tumor/analysis , Lymphoma, Large B-Cell, Diffuse/metabolism , Lymphoma, Large B-Cell, Diffuse/mortality , Neoplasm Proteins/analysis , Oligonucleotide Array Sequence Analysis/statistics & numerical data , Proportional Hazards Models , Survival Analysis , Adult , Aged , Algorithms , Female , Gene Expression Profiling/statistics & numerical data , Humans , Incidence , Male , Middle Aged , Prevalence , Risk Assessment/methods , Survival Rate
14.
BMC Bioinformatics ; 11 Suppl 11: S4, 2010 Dec 14.
Article in English | MEDLINE | ID: mdl-21172054

ABSTRACT

BACKGROUND: A bisection-type algorithm for the grammar-based compression of tree-structured data has been proposed recently. In this framework, an elementary ordered-tree grammar (EOTG) and an elementary unordered-tree grammar (EUTG) were defined, and an approximation algorithm was proposed. RESULTS: In this paper, we propose an integer programming-based method that finds the minimum context-free grammar (CFG) for a given string under the condition that at most two symbols appear on the right-hand side of each production rule. Next, we extend this method to find the minimum EOTG and EUTG grammars for given ordered and unordered trees, respectively. Then, we conduct computational experiments for the ordered and unordered artificial trees. Finally, we apply our methods to pattern extraction of glycan tree structures. CONCLUSIONS: We propose integer programming-based methods that find the minimum CFG, EOTG, and EUTG for given strings, ordered and unordered trees. Our proposed methods for trees are useful for extracting patterns of glycan tree structures.


Subject(s)
Algorithms , Data Compression/methods , Polysaccharides/chemistry , Computational Biology/methods
15.
Genome Inform ; 24: 218-29, 2010.
Article in English | MEDLINE | ID: mdl-22081602

ABSTRACT

For several decades, many methods have been developed for predicting organic synthesis paths. However these methods have non-polynomial computational time. In this paper, we propose a bottom-up dynamic programming algorithm to predict synthesis paths of target tree-structured compounds. In this approach, we transform the synthesis problem of tree-structured compounds to the generation problem of unordered trees by regarding tree-structured compounds and chemical reactions as unordered trees and rules, respectively. In order to represent rules corresponding to chemical reactions, we employ a subclass of NLC (Node Label Controlled) grammars. We also give some computational results on this algorithm.


Subject(s)
Computational Biology/methods , Algorithms , Chemistry, Organic/methods , Drug Design , Models, Theoretical , Software , Systems Biology , Time Factors
16.
J Bioinform Comput Biol ; 17(3): 1940007, 2019 06.
Article in English | MEDLINE | ID: mdl-31288636

ABSTRACT

Deep learning technologies are permeating every field from image and speech recognition to computational and systems biology. However, the application of convolutional neural networks (CNNs) to "omics" data poses some difficulties, such as the processing of complex network structures and their integration with transcriptome data. Here, we propose a CNN approach that combines spectral clustering information processing to classify lung cancer. The developed spectral-convolutional neural network based method achieves success in integrating protein interaction network data and gene expression profiles to classify lung cancer. The performed computational experiments suggest that in terms of accuracy the predictive performance of our proposed method was better than those of other machine learning methods such as SVM or Random Forest. Moreover, the computational results also indicate that the underlying protein network structure helps to enhance the predictions. Data and CNN code can be downloaded from the link: https://sites.google.com/site/nacherlab/analysis.


Subject(s)
Lung Neoplasms/genetics , Lung Neoplasms/metabolism , Neural Networks, Computer , Protein Interaction Maps , Transcriptome , Algorithms , Cluster Analysis , Humans , Machine Learning , Random Allocation , Reproducibility of Results , Support Vector Machine
17.
J Comput Biol ; 25(10): 1071-1090, 2018 10.
Article in English | MEDLINE | ID: mdl-30074414

ABSTRACT

Controlling complex networks through a small number of controller vertices is of great importance in wide-ranging research fields. Recently, a new approach based on the minimum feedback vertex set (MFVS) has been proposed to find such vertices in directed networks in which the target states are restricted to steady states. However, multiple MFVS configurations may exist and thus the selection of vertices may depend on algorithms and input data representations. Our attempts to address this ambiguity led us to adopt an existing approach that classifies vertices into three categories. This approach has been successfully applied to maximum matching-based and minimum dominating set-based controllability analysis frameworks. In this article, we present an algorithm as well as its implementation to compute and evaluate the critical, intermittent, and redundant vertices under the MFVS-based framework, where these three categories include vertices belonging to all MFVSs, some (but not all) MFVSs, and none of the MFVSs, respectively. The results of computational experiments using artificially generated networks and real-world biological networks suggest that the proposed algorithm is useful for identifying these three kinds of vertices for relatively large-scale networks, and that the fraction of critical and intermittent vertices is considerably small. Moreover, an analysis of the signal pathways indicates that critical and intermittent MFVSs tend to be enriched by essential genes.
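
The three vertex categories defined in this abstract can be illustrated by brute force on a very small directed graph: enumerate every minimum feedback vertex set, then check which vertices appear in all, some, or none of them. This is only a sketch of the definitions with exponential running time; it is not the algorithm proposed in the paper, and the function names are assumptions.

```python
from itertools import combinations


def is_acyclic(nodes, edges):
    """Kahn's algorithm on the subgraph induced by `nodes`."""
    indegree = {v: 0 for v in nodes}
    successors = {v: [] for v in nodes}
    for u, v in edges:
        if u in indegree and v in indegree:
            successors[u].append(v)
            indegree[v] += 1
    queue = [v for v in indegree if indegree[v] == 0]
    visited = 0
    while queue:
        u = queue.pop()
        visited += 1
        for w in successors[u]:
            indegree[w] -= 1
            if indegree[w] == 0:
                queue.append(w)
    return visited == len(indegree)


def classify_vertices(nodes, edges):
    """Split vertices into critical, intermittent and redundant sets."""
    nodes = list(nodes)
    minimum_sets = []
    # Try removal sets of increasing size; the first size that works is minimum.
    for size in range(len(nodes) + 1):
        minimum_sets = [
            set(removed)
            for removed in combinations(nodes, size)
            if is_acyclic(set(nodes) - set(removed), edges)
        ]
        if minimum_sets:
            break
    critical = set.intersection(*minimum_sets)   # in every minimum set
    in_some = set.union(*minimum_sets)           # in at least one minimum set
    return {
        "critical": critical,
        "intermittent": in_some - critical,
        "redundant": set(nodes) - in_some,
    }
```

On a graph with the two-vertex cycles 1-2, 2-3 and 4-5, vertex 2 breaks two cycles at once and so belongs to every minimum set, while either 4 or 5 can break the third cycle.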


Subject(s)
Algorithms , Gene Regulatory Networks , Signal Transduction , Animals , Computational Biology/methods , Computer Simulation , Humans , Models, Biological
18.
BMC Syst Biol ; 12(Suppl 1): 37, 2018 04 11.
Article in English | MEDLINE | ID: mdl-29671405

ABSTRACT

BACKGROUND: Current technology has demonstrated that mutation and deregulation of non-coding RNAs (ncRNAs) are associated with diverse human diseases and important biological processes. Therefore, developing a novel computational method for predicting potential ncRNA-disease associations could benefit pathologists in understanding the correlation between ncRNAs and disease diagnosis, treatment, and prevention. However, only a few studies have investigated these associations in pathogenesis. RESULTS: This study utilizes a disease-target-ncRNA tripartite network, and computes prediction scores between each disease-ncRNA pair by integrating biological information derived from pairwise similarity based upon sequence expressions with weights obtained from a multi-layer resource allocation technique. Our proposed algorithm was evaluated based on a 5-fold-cross-validation with optimal kernel parameter tuning. In addition, we achieved an average AUC that varies from 0.75 without link cut to 0.57 with link cut methods, which outperforms a previous method using the same evaluation methodology. Furthermore, the algorithm predicted 23 ncRNA-disease associations supported by other independent biological experimental studies. CONCLUSIONS: Taken together, these results demonstrate the capability and accuracy of predicting further biological significant associations between ncRNAs and diseases and highlight the importance of adding biological sequence information to enhance predictions.


Subject(s)
Computational Biology/methods , Disease/genetics , RNA, Untranslated/genetics , Algorithms , Databases, Genetic , Humans , Neoplasms/genetics
19.
PLoS One ; 13(4): e0195545, 2018.
Article in English | MEDLINE | ID: mdl-29698482

ABSTRACT

The prediction of protein complexes from protein-protein interactions (PPIs) is a well-studied problem in bioinformatics. However, the currently available PPI data is not enough to describe all known protein complexes. In this paper, we express the problem of determining the minimum number of (additional) required protein-protein interactions as a graph theoretic problem under the constraint that each complex constitutes a connected component in a PPI network. For this problem, we develop two computational methods: one is based on integer linear programming (ILPMinPPI) and the other one is based on an existing greedy-type approximation algorithm (GreedyMinPPI) originally developed in the context of communication and social networks. Since the former method is only applicable to datasets of small size, we apply the latter method to a combination of the CYC2008 protein complex dataset and each of eight PPI datasets (STRING, MINT, BioGRID, IntAct, DIP, BIND, WI-PHI, iRefIndex). The results show that the minimum number of additional required PPIs ranges from 51 (STRING) to 964 (BIND), and that even the four best PPI databases, STRING (51), BioGRID (67), WI-PHI (93) and iRefIndex (85), do not include enough PPIs to form all CYC2008 protein complexes. We also demonstrate that the proposed problem framework and our solutions can enhance the prediction accuracy of existing PPI prediction methods. ILPMinPPI can be freely downloaded from http://sunflower.kuicr.kyoto-u.ac.jp/~nakajima/.
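
To make the counting problem concrete, here is a minimal sketch that is neither ILPMinPPI nor GreedyMinPPI. It reads the constraint as requiring each complex to induce a connected subgraph. For each complex it counts the connected components among its members using only PPIs inside the complex; joining c components needs c - 1 new edges. Summing over complexes gives the exact minimum when complexes share no proteins and an upper bound otherwise, because one added PPI may serve several overlapping complexes. The function names and the toy data are assumptions.

```python
def components_within(members, ppis):
    """Count connected components of the subgraph induced by `members` (union-find)."""
    members = set(members)
    parent = {m: m for m in members}

    def find(x):
        # Follow parent links to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in ppis:
        if u in members and v in members:
            parent[find(u)] = find(v)
    return len({find(m) for m in members})


def additional_ppis_upper_bound(complexes, ppis):
    """Sum of (components - 1) over complexes: exact for disjoint complexes, else an upper bound."""
    return sum(components_within(c, ppis) - 1 for c in complexes if c)


# Hypothetical toy data: two overlapping complexes and one known interaction.
complexes = [{"A", "B", "C"}, {"C", "D"}]
ppis = [("A", "B")]
needed = additional_ppis_upper_bound(complexes, ppis)
print(needed)
```

Here each complex is missing one interaction, so the bound is 2.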


Subject(s)
Protein Interaction Mapping/methods , Proteins/chemistry , Proteins/metabolism , Algorithms , Computational Biology , Computer Simulation
20.
Sci Rep ; 7: 41031, 2017 01 23.
Article in English | MEDLINE | ID: mdl-28112271

ABSTRACT

Bacteria translocate effector molecules to host cells through highly evolved secretion systems. By definition, the function of these effector proteins is to manipulate host cell biology, and the sequence, structural and functional annotations of these effector proteins will provide a better understanding of how bacterial secretion systems promote bacterial survival and virulence. Here we developed a knowledgebase, termed SecretEPDB (Bacterial Secreted Effector Protein DataBase), for effector proteins of the type III secretion system (T3SS), type IV secretion system (T4SS) and type VI secretion system (T6SS). SecretEPDB provides enriched annotations of these three classes of effector proteins by manually extracting and integrating structural and functional information from currently available databases and the literature. The database is conservative and strictly curated to ensure that every effector protein entry is supported by experimental evidence that demonstrates it is secreted by a T3SS, T4SS or T6SS. The annotations of effector proteins documented in SecretEPDB are provided in terms of protein characteristics, protein function, protein secondary structure, Pfam domains, metabolic pathway and evolutionary details. It is our hope that this integrated knowledgebase will serve as a useful resource for biological investigation and the generation of new hypotheses for research efforts aimed at bacterial secretion systems.


Subject(s)
Bacteria/metabolism , Bacterial Proteins/metabolism , Databases, Factual , Type III Secretion Systems/metabolism , Type IV Secretion Systems/metabolism , Type VI Secretion Systems/metabolism , Virulence Factors/metabolism , Bacterial Proteins/chemistry , Bacterial Proteins/genetics , Evolution, Molecular , Host-Pathogen Interactions , Internet , Protein Structure, Secondary , Virulence Factors/chemistry , Virulence Factors/genetics