Results 1 - 20 of 36
1.
Brief Bioinform ; 22(2): 2126-2140, 2021 03 22.
Article in English | MEDLINE | ID: mdl-32363397

ABSTRACT

Promoters are short consensus sequences of DNA, which are responsible for transcription activation or the repression of all genes. There are many types of promoters in bacteria with important roles in initiating gene transcription. Therefore, solving promoter-identification problems has important implications for improving the understanding of their functions. To this end, computational methods targeting promoter classification have been established; however, their performance remains unsatisfactory. In this study, we present a novel stacked-ensemble approach (termed SELECTOR) for identifying both promoters and their respective classification. SELECTOR combines the composition of k-spaced nucleic acid pairs, parallel correlation pseudo-dinucleotide composition, position-specific trinucleotide propensity based on single-strand, and DNA strand features, and uses five popular tree-based ensemble learning algorithms to build a stacked model. Both 5-fold cross-validation tests using benchmark datasets and independent tests using the newly collected independent test dataset showed that SELECTOR outperformed state-of-the-art methods in both general and specific types of promoter prediction in Escherichia coli. Furthermore, this novel framework provides essential interpretations that aid understanding of model success by leveraging the powerful Shapley Additive exPlanation algorithm, thereby highlighting the most important features relevant for predicting both general and specific types of promoters and overcoming the limitations of existing 'black-box' approaches that are unable to reveal causal relationships from large amounts of initially encoded features.


Subjects
Escherichia coli/genetics , Machine Learning , Genetic Promoter Regions , Datasets as Topic , Bacterial Genes , Reproducibility of Results
2.
J Bioinform Comput Biol ; 17(3): 1940007, 2019 06.
Article in English | MEDLINE | ID: mdl-31288636

ABSTRACT

Deep learning technologies are permeating every field from image and speech recognition to computational and systems biology. However, the application of convolutional neural networks (CNNs) to "omics" data poses some difficulties, such as the processing of complex network structures and their integration with transcriptome data. Here, we propose a CNN approach that incorporates spectral clustering information processing to classify lung cancer. The developed spectral-convolutional neural network-based method succeeds in integrating protein interaction network data and gene expression profiles to classify lung cancer. The computational experiments suggest that, in terms of accuracy, the predictive performance of our proposed method was better than that of other machine learning methods such as SVM or Random Forest. Moreover, the computational results also indicate that the underlying protein network structure helps to enhance the predictions. Data and CNN code can be downloaded from the link: https://sites.google.com/site/nacherlab/analysis.


Subjects
Lung Neoplasms/genetics , Lung Neoplasms/metabolism , Computer Neural Networks , Protein Interaction Maps , Transcriptome , Algorithms , Cluster Analysis , Humans , Machine Learning , Random Allocation , Reproducibility of Results , Support Vector Machine
3.
Brief Bioinform ; 20(3): 931-951, 2019 05 21.
Article in English | MEDLINE | ID: mdl-29186295

ABSTRACT

In the course of infecting their hosts, pathogenic bacteria secrete numerous effectors, namely, bacterial proteins that pervert host cell biology. Many Gram-negative bacteria, including context-dependent human pathogens, use a type IV secretion system (T4SS) to translocate effectors directly into the cytosol of host cells. Various type IV secreted effectors (T4SEs) have been experimentally validated to play crucial roles in virulence by manipulating host cell gene expression and other processes. Consequently, the identification of novel effector proteins is an important step in increasing our understanding of host-pathogen interactions and bacterial pathogenesis. Here, we train and compare six machine learning models, namely, Naïve Bayes (NB), K-nearest neighbor (KNN), logistic regression (LR), random forest (RF), support vector machines (SVMs) and multilayer perceptron (MLP), for the identification of T4SEs using 10 types of selected features and 5-fold cross-validation. Our study shows that: (1) including different but complementary features generally enhances the predictive performance for T4SEs; (2) ensemble models, obtained by integrating individual single-feature models, exhibit a significantly improved predictive performance and (3) the 'majority voting strategy' led to a more stable and accurate classification performance when applied to predicting an ensemble learning model with distinct single features. We further developed a new method to effectively predict T4SEs, Bastion4 (Bacterial secretion effector predictor for T4SS), and we show that our ensemble classifier clearly outperforms two recent prediction tools. In summary, we developed a state-of-the-art T4SE predictor by conducting a comprehensive performance evaluation of different machine learning algorithms along with a detailed analysis of single- and multi-feature selections.
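
The 'majority voting strategy' mentioned in this abstract can be illustrated with a minimal Python sketch. It is not the Bastion4 implementation: the function names, the 0/1 label encoding and the tie-breaking rule (the smallest label wins) are assumptions made only for illustration.

```python
from collections import Counter


def majority_vote(votes):
    """Return the most frequent label among the votes for one sample.

    Ties are broken in favour of the smallest label (an assumption of
    this sketch, not something stated in the abstract).
    """
    counts = Counter(votes)
    top = max(counts.values())
    return min(label for label, count in counts.items() if count == top)


def ensemble_predict(per_model_predictions):
    """Combine per-model prediction lists into one list of voted labels.

    per_model_predictions[i][j] is the label that model i assigns to sample j.
    """
    return [majority_vote(votes) for votes in zip(*per_model_predictions)]


# Three hypothetical single-feature models voting on three proteins
# (1 = effector, 0 = non-effector).
predictions = [
    [1, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
]
voted = ensemble_predict(predictions)
```

Using an odd number of models avoids ties for binary labels, so the tie-breaking rule only matters when the number of voters is even.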


Subjects
Bacterial Proteins/metabolism , Bacterial Secretion Systems , Machine Learning , Algorithms , Bayes Theorem , Support Vector Machine
4.
Bioinformatics ; 35(12): 2017-2028, 2019 06 01.
Article in English | MEDLINE | ID: mdl-30388198

ABSTRACT

MOTIVATION: Type III secreted effectors (T3SEs) can be injected into host cell cytoplasm via type III secretion systems (T3SSs) to modulate interactions between Gram-negative bacterial pathogens and their hosts. Due to their relevance in pathogen-host interactions, significant computational efforts have been put toward identification of T3SEs and these in turn have stimulated new T3SE discoveries. However, as T3SEs with new characteristics are discovered, these existing computational tools reveal important limitations: (i) most of the trained machine learning models are based on the N-terminus (or incorporating also the C-terminus) instead of the proteins' complete sequences, and (ii) the underlying models (trained with classic algorithms) employed only few features, most of which were extracted based on sequence-information alone. To achieve better T3SE prediction, we must identify more powerful, informative features and investigate how to effectively integrate these into a comprehensive model. RESULTS: In this work, we present Bastion3, a two-layer ensemble predictor developed to accurately identify type III secreted effectors from protein sequence data. In contrast with existing methods that employ single models with few features, Bastion3 explores a wide range of features, from various types, trains single models based on these features and finally integrates these models through ensemble learning. We trained the models using a new gradient boosting machine, LightGBM and further boosted the models' performances through a novel genetic algorithm (GA) based two-step parameter optimization strategy. Our benchmark test demonstrates that Bastion3 achieves a much better performance compared to commonly used methods, with an ACC value of 0.959, F-value of 0.958, MCC value of 0.917 and AUC value of 0.956, which comprehensively outperformed all other toolkits by more than 5.6% in ACC value, 5.7% in F-value, 12.4% in MCC value and 5.8% in AUC value. 
Based on our proposed two-layer ensemble model, we further developed a user-friendly online toolkit, maximizing convenience for experimental scientists toward T3SE prediction. With its design to ease future discoveries of novel T3SEs and improved performance, Bastion3 is poised to become a widely used, state-of-the-art toolkit for T3SE prediction. AVAILABILITY AND IMPLEMENTATION: http://bastion3.erc.monash.edu/. CONTACT: selkrig@embl.de or wyztli@163.com or trevor.lithgow@monash.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
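
For readers unfamiliar with the reported metrics, the sketch below shows how ACC, F-value and MCC are commonly computed from the counts of a binary confusion matrix. It assumes that "F-value" means the F1 score, which the abstract does not state explicitly. The counts in the example are invented and are not the Bastion3 results. AUC is omitted because it requires ranked prediction scores rather than counts.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute accuracy, F1 and Matthews correlation coefficient.

    tp, fp, tn, fn are the confusion-matrix counts of a binary classifier.
    Undefined ratios (zero denominators) are reported as 0.0.
    """
    total = tp + fp + tn + fn
    acc = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "F": f1, "MCC": mcc}


# Invented counts for illustration only.
scores = binary_metrics(tp=45, fp=5, tn=45, fn=5)
```

MCC uses all four cells of the confusion matrix, which is why it can differ noticeably from ACC and F1 on imbalanced test sets.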


Subjects
Machine Learning , Algorithms , Amino Acid Sequence , Bacterial Proteins , Computational Biology , Gram-Negative Bacteria , Software
5.
J Comput Biol ; 25(10): 1071-1090, 2018 10.
Article in English | MEDLINE | ID: mdl-30074414

ABSTRACT

Controlling complex networks through a small number of controller vertices is of great importance in wide-ranging research fields. Recently, a new approach based on the minimum feedback vertex set (MFVS) has been proposed to find such vertices in directed networks in which the target states are restricted to steady states. However, multiple MFVS configurations may exist and thus the selection of vertices may depend on algorithms and input data representations. Our attempts to address this ambiguity led us to adopt an existing approach that classifies vertices into three categories. This approach has been successfully applied to maximum matching-based and minimum dominating set-based controllability analysis frameworks. In this article, we present an algorithm as well as its implementation to compute and evaluate the critical, intermittent, and redundant vertices under the MFVS-based framework, where these three categories include vertices belonging to all MFVSs, some (but not all) MFVSs, and none of the MFVSs, respectively. The results of computational experiments using artificially generated networks and real-world biological networks suggest that the proposed algorithm is useful for identifying these three kinds of vertices for relatively large-scale networks, and that the fraction of critical and intermittent vertices is considerably small. Moreover, an analysis of the signal pathways indicates that critical and intermittent MFVSs tend to be enriched by essential genes.
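
The three vertex categories defined in this abstract can be made concrete with a brute-force Python sketch for very small directed graphs. This is not the authors' algorithm, which targets relatively large networks: the sketch simply enumerates vertex subsets in order of increasing size, so its running time grows exponentially. The graph encoding (a dict mapping every vertex to its successor list) and the function names are assumptions.

```python
from itertools import combinations


def is_acyclic(graph, removed):
    """Kahn's algorithm on the graph with the removed vertices deleted."""
    nodes = [v for v in graph if v not in removed]
    indegree = {v: 0 for v in nodes}
    for u in nodes:
        for v in graph[u]:
            if v in indegree:
                indegree[v] += 1
    stack = [v for v in nodes if indegree[v] == 0]
    visited = 0
    while stack:
        u = stack.pop()
        visited += 1
        for v in graph[u]:
            if v in indegree:
                indegree[v] -= 1
                if indegree[v] == 0:
                    stack.append(v)
    # Every remaining vertex is visited exactly when no cycle is left.
    return visited == len(nodes)


def classify_vertices(graph):
    """Classify vertices as critical, intermittent or redundant w.r.t. MFVSs."""
    nodes = sorted(graph)
    for size in range(len(nodes) + 1):
        mfvs_list = [set(c) for c in combinations(nodes, size)
                     if is_acyclic(graph, set(c))]
        if mfvs_list:  # the first non-empty size is the minimum size
            break
    critical = set.intersection(*mfvs_list)
    in_some = set.union(*mfvs_list)
    return {
        "critical": critical,                  # in all MFVSs
        "intermittent": in_some - critical,    # in some but not all MFVSs
        "redundant": set(nodes) - in_some,     # in no MFVS
    }


# Toy network: b lies on two cycles (a<->b, b<->c); d<->e is a third cycle.
toy = {"a": ["b"], "b": ["a", "c"], "c": ["b"],
       "d": ["e"], "e": ["d", "f"], "f": []}
categories = classify_vertices(toy)
```

In the toy network the minimum feedback vertex sets are {b, d} and {b, e}, so b is critical, d and e are intermittent, and a, c and f are redundant.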


Subjects
Algorithms , Gene Regulatory Networks , Signal Transduction , Animals , Computational Biology/methods , Computer Simulation , Humans , Biological Models
6.
BMC Syst Biol ; 12(Suppl 1): 37, 2018 04 11.
Article in English | MEDLINE | ID: mdl-29671405

ABSTRACT

BACKGROUND: Current technology has demonstrated that mutation and deregulation of non-coding RNAs (ncRNAs) are associated with diverse human diseases and important biological processes. Therefore, developing a novel computational method for predicting potential ncRNA-disease associations could benefit pathologists in understanding the correlation between ncRNAs and disease diagnosis, treatment, and prevention. However, only a few studies have investigated these associations in pathogenesis. RESULTS: This study utilizes a disease-target-ncRNA tripartite network, and computes prediction scores between each disease-ncRNA pair by integrating biological information derived from pairwise similarity based upon sequence expressions with weights obtained from a multi-layer resource allocation technique. Our proposed algorithm was evaluated using 5-fold cross-validation with optimal kernel parameter tuning. In addition, we achieved an average AUC that varies from 0.75 without link cut to 0.57 with link cut methods, which outperforms a previous method using the same evaluation methodology. Furthermore, the algorithm predicted 23 ncRNA-disease associations supported by other independent biological experimental studies. CONCLUSIONS: Taken together, these results demonstrate the capability and accuracy of predicting further biologically significant associations between ncRNAs and diseases and highlight the importance of adding biological sequence information to enhance predictions.


Subjects
Computational Biology/methods , Disease/genetics , Untranslated RNA/genetics , Algorithms , Genetic Databases , Humans , Neoplasms/genetics
7.
PLoS One ; 13(4): e0195545, 2018.
Article in English | MEDLINE | ID: mdl-29698482

ABSTRACT

The prediction of protein complexes from protein-protein interactions (PPIs) is a well-studied problem in bioinformatics. However, the currently available PPI data is not enough to describe all known protein complexes. In this paper, we express the problem of determining the minimum number of (additional) required protein-protein interactions as a graph theoretic problem under the constraint that each complex constitutes a connected component in a PPI network. For this problem, we develop two computational methods: one is based on integer linear programming (ILPMinPPI) and the other one is based on an existing greedy-type approximation algorithm (GreedyMinPPI) originally developed in the context of communication and social networks. Since the former method is only applicable to datasets of small size, we apply the latter method to a combination of the CYC2008 protein complex dataset and each of eight PPI datasets (STRING, MINT, BioGRID, IntAct, DIP, BIND, WI-PHI, iRefIndex). The results show that the minimum number of additional required PPIs ranges from 51 (STRING) to 964 (BIND), and that even the four best PPI databases, STRING (51), BioGRID (67), WI-PHI (93) and iRefIndex (85), do not include enough PPIs to form all CYC2008 protein complexes. We also demonstrate that the proposed problem framework and our solutions can enhance the prediction accuracy of existing PPI prediction methods. ILPMinPPI can be freely downloaded from http://sunflower.kuicr.kyoto-u.ac.jp/~nakajima/.
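
The connectivity constraint described in this abstract can be sketched in Python as follows. The code is neither ILPMinPPI nor GreedyMinPPI. For each complex it counts the connected components of the subnetwork induced by the complex members, and a complex with c components needs c - 1 extra interactions to become connected. Summing this over complexes gives the exact minimum when complexes share no protein pairs, but only an upper bound when complexes overlap, since one added interaction may then serve several complexes. All names and the toy data are assumptions.

```python
def count_components(members, edges):
    """Count connected components of the subgraph induced by `members`."""
    parent = {p: p for p in members}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        if u in parent and v in parent:  # keep only edges inside the complex
            parent[find(u)] = find(v)
    return len({find(p) for p in members})


def missing_ppi_upper_bound(complexes, edges):
    """Sum of (components - 1) over all non-empty complexes."""
    return sum(count_components(set(c), edges) - 1 for c in complexes if c)


# Toy PPI network and complexes (hypothetical protein names).
ppi_edges = [("A", "B"), ("C", "D"), ("E", "F")]
toy_complexes = [["A", "B", "C"], ["E", "F"], ["G", "H", "I"]]
needed = missing_ppi_upper_bound(toy_complexes, ppi_edges)
```

Here the first complex needs one additional interaction (C is isolated within it), the second needs none, and the third needs two, giving a total of three.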


Subjects
Protein Interaction Mapping/methods , Proteins/chemistry , Proteins/metabolism , Algorithms , Computational Biology , Computer Simulation
8.
Bioinformatics ; 34(15): 2546-2555, 2018 08 01.
Article in English | MEDLINE | ID: mdl-29547915

ABSTRACT

Motivation: Many Gram-negative bacteria use type VI secretion systems (T6SS) to export effector proteins into adjacent target cells. These secreted effectors (T6SEs) play vital roles in the competitive survival in bacterial populations, as well as pathogenesis of bacteria. Although various computational analyses have been previously applied to identify effectors secreted by certain bacterial species, there is no universal method available to accurately predict T6SS effector proteins from the growing tide of bacterial genome sequence data. Results: We extracted a wide range of features from T6SE protein sequences and comprehensively analyzed the prediction performance of these features through unsupervised and supervised learning. By integrating these features, we subsequently developed a two-layer SVM-based ensemble model with fine-grain optimized parameters, to identify potential T6SEs. We further validated the predictive model using an independent dataset, which showed that the proposed model achieved an impressive performance in terms of ACC (0.943), F-value (0.946), MCC (0.892) and AUC (0.976). To demonstrate applicability, we employed this method to correctly identify two very recently validated T6SE proteins, which represent challenging prediction targets because they significantly differed from previously known T6SEs in terms of their sequence similarity and cellular function. Furthermore, a genome-wide prediction across 12 bacterial species, involving in total 54 212 protein sequences, was carried out to distinguish 94 putative T6SE candidates. We envisage both this information and our publicly accessible web server will facilitate future discoveries of novel T6SEs. Availability and implementation: http://bastion6.erc.monash.edu/. Supplementary information: Supplementary data are available at Bioinformatics online.


Subjects
Bacterial Proteins/metabolism , Gram-Negative Bacteria/metabolism , Protein Sequence Analysis/methods , Software , Type VI Secretion Systems/metabolism , Amino Acid Sequence , Bacterial Proteins/chemistry , Computational Biology/methods , Internet , Machine Learning , DNA Sequence Analysis/methods , Type VI Secretion Systems/chemistry
9.
BMC Bioinformatics ; 19(Suppl 1): 39, 2018 02 19.
Article in English | MEDLINE | ID: mdl-29504897

ABSTRACT

BACKGROUND: Since many proteins become functional only after they interact with their partner proteins and form protein complexes, it is essential to identify the sets of proteins that form complexes. Therefore, several computational methods have been proposed to predict complexes from the topology and structure of experimental protein-protein interaction (PPI) networks. These methods work well to predict complexes involving at least three proteins, but generally fail at identifying complexes involving only two different proteins, called heterodimeric complexes or heterodimers. There is, however, an urgent need for efficient methods to predict heterodimers, since the majority of known protein complexes are precisely heterodimers. RESULTS: In this paper, we use three promising kernel functions, Min kernel and two pairwise kernels, which are Metric Learning Pairwise Kernel (MLPK) and Tensor Product Pairwise Kernel (TPPK). We also consider the normalization forms of Min kernel. Then, we combine Min kernel or its normalization form and one of the pairwise kernels by plugging. We applied kernels based on PPI, domain, phylogenetic profile, and subcellular localization properties to predicting heterodimers. Then, we evaluate our method by employing C-Support Vector Classification (C-SVC), carrying out 10-fold cross-validation, and calculating the average F-measures. The results suggest that the combination of normalized-Min-kernel and MLPK leads to the best F-measure and improved the performance of our previous work, which had been the best existing method so far. CONCLUSIONS: We propose new methods to predict heterodimers, using a machine learning-based approach. We train a support vector machine (SVM) to discriminate interacting vs non-interacting protein pairs, based on information extracted from PPI, domain, phylogenetic profiles and subcellular localization. 
We evaluate in detail new kernel functions to encode these data, and report prediction performance that outperforms the state-of-the-art.
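
The kernels named in this abstract can be sketched in Python using their commonly cited textbook forms: the Min kernel sums element-wise minima of two non-negative feature vectors, TPPK is k(a,c)k(b,d) + k(a,d)k(b,c), and MLPK is (k(a,c) - k(a,d) - k(b,c) + k(b,d))^2 for protein pairs (a,b) and (c,d). "Plugging" is interpreted here as using the Min kernel as the base kernel k inside a pairwise kernel. The cosine-style normalization is an assumption, since the abstract does not say which normalization forms were used.

```python
import math


def min_kernel(x, y):
    """Min kernel: sum of element-wise minima of two non-negative vectors."""
    return sum(min(a, b) for a, b in zip(x, y))


def normalized(kernel):
    """Wrap a base kernel as k(x, y) / sqrt(k(x, x) * k(y, y)).

    This particular normalization is an assumption of the sketch.
    """
    def wrapped(x, y):
        denom = math.sqrt(kernel(x, x) * kernel(y, y))
        return kernel(x, y) / denom if denom else 0.0
    return wrapped


def tppk(k, pair1, pair2):
    """Tensor Product Pairwise Kernel built on a base kernel k."""
    (a, b), (c, d) = pair1, pair2
    return k(a, c) * k(b, d) + k(a, d) * k(b, c)


def mlpk(k, pair1, pair2):
    """Metric Learning Pairwise Kernel built on a base kernel k."""
    (a, b), (c, d) = pair1, pair2
    return (k(a, c) - k(a, d) - k(b, c) + k(b, d)) ** 2


# Hypothetical two-dimensional feature vectors for two proteins.
p, q = (1, 0), (0, 1)
base = normalized(min_kernel)
same_pair_tppk = tppk(base, (p, q), (p, q))
same_pair_mlpk = mlpk(base, (p, q), (p, q))
```

Both pairwise kernels are symmetric in the order of the proteins within a pair, which matters because a candidate heterodimer (a, b) is the same object as (b, a).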


Subjects
Algorithms , Multiprotein Complexes/chemistry , Dimerization , Multiprotein Complexes/classification , Phylogeny , Protein Domains , Protein Interaction Maps , Protein Multimerization , Support Vector Machine
10.
Sci Rep ; 7: 41031, 2017 01 23.
Article in English | MEDLINE | ID: mdl-28112271

ABSTRACT

Bacteria translocate effector molecules to host cells through highly evolved secretion systems. By definition, the function of these effector proteins is to manipulate host cell biology and the sequence, structural and functional annotations of these effector proteins will provide a better understanding of how bacterial secretion systems promote bacterial survival and virulence. Here we developed a knowledgebase, termed SecretEPDB (Bacterial Secreted Effector Protein DataBase), for effector proteins of type III secretion system (T3SS), type IV secretion system (T4SS) and type VI secretion system (T6SS). SecretEPDB provides enriched annotations of the aforementioned three classes of effector proteins by manually extracting and integrating structural and functional information from currently available databases and the literature. The database is conservative and strictly curated to ensure that every effector protein entry is supported by experimental evidence that demonstrates it is secreted by a T3SS, T4SS or T6SS. The annotations of effector proteins documented in SecretEPDB are provided in terms of protein characteristics, protein function, protein secondary structure, Pfam domains, metabolic pathway and evolutionary details. It is our hope that this integrated knowledgebase will serve as a useful resource for biological investigation and the generation of new hypotheses for research efforts aimed at bacterial secretion systems.


Subjects
Bacteria/metabolism , Bacterial Proteins/metabolism , Factual Databases , Type III Secretion Systems/metabolism , Type IV Secretion Systems/metabolism , Type VI Secretion Systems/metabolism , Virulence Factors/metabolism , Bacterial Proteins/chemistry , Bacterial Proteins/genetics , Molecular Evolution , Host-Pathogen Interactions , Internet , Secondary Protein Structure , Virulence Factors/chemistry , Virulence Factors/genetics
11.
BMC Bioinformatics ; 17(1): 487, 2016 Nov 25.
Article in English | MEDLINE | ID: mdl-27887571

ABSTRACT

BACKGROUND: Dicer is necessary for the process of mature microRNA (miRNA) formation because the Dicer enzyme cleaves pre-miRNA correctly to generate miRNA with correct seed regions. Nonetheless, the mechanism underlying the selection of a Dicer cleavage site is still not fully understood. To date, several studies have been conducted to solve this problem, for example, a recent discovery indicates that the loop/bulge structure plays a central role in the selection of Dicer cleavage sites. In accordance with this breakthrough, a support vector machine (SVM)-based method called PHDCleav was developed to predict Dicer cleavage sites which outperforms other methods based on random forest and naive Bayes. PHDCleav, however, tests only whether a position in the shift window belongs to a loop/bulge structure. RESULT: In this paper, we used the length of loop/bulge structures (in addition to their presence or absence) to develop an improved method, LBSizeCleav, for predicting Dicer cleavage sites. To evaluate our method, we used 810 empirically validated sequences of human pre-miRNAs and performed fivefold cross-validation. In both 5p and 3p arms of pre-miRNAs, LBSizeCleav showed greater prediction accuracy than PHDCleav did. This result suggests that the length of loop/bulge structures is useful for prediction of Dicer cleavage sites. CONCLUSION: We developed a novel algorithm for feature space mapping based on the length of a loop/bulge for predicting Dicer cleavage sites. The better performance of our method indicates the usefulness of the length of loop/bulge structures for such predictions.


Subjects
Algorithms , MicroRNAs/genetics , RNA Precursors/genetics , Ribonuclease III/metabolism , Software , Support Vector Machine , Base Sequence , Bayes Theorem , DEAD-box RNA Helicases , Humans , Nucleic Acid Conformation , RNA Precursors/metabolism
12.
J Comput Biol ; 23(8): 625-40, 2016 08.
Article in English | MEDLINE | ID: mdl-27348756

ABSTRACT

Enumeration of chemical structures satisfying given conditions is an important step in the discovery of new compounds and drugs, as well as the elucidation of the structure. One of the most frequently used conditions in the enumeration is the number of chemical elements that corresponds to the chemical formula. In this work, we propose a novel efficient enumeration algorithm, BfsStructEnum, which allows users to define desired cyclic structures and enumerates all nonredundant chemical compounds containing only defined structures as cyclic structures from a given chemical formula. To evaluate the performance, we confirm the number of enumerated structures of BfsStructEnum and MOLGEN 5.0, the latest version of a general-purpose structure generator. We also compare the computation time of BfsStructEnum with that of MOLGEN 5.0. The findings show that, given the same number of enumerated structures as MOLGEN 5.0, BfsStructEnum is significantly faster. By compressing a cyclic structure into a single node and representing chemical compounds by tree structures instead of normal graphs, the enumeration can be executed more efficiently.


Subjects
Algorithms , Pharmaceutical Chemistry/methods , Drug Design , Computational Biology
13.
Proc Math Phys Eng Sci ; 472(2187): 20150551, 2016 Mar.
Article in English | MEDLINE | ID: mdl-27118908

ABSTRACT

Numbers and numerical vectors account for a large portion of data. However, recently, the amount of string data generated has increased dramatically. Consequently, classifying string data is a common problem in many fields. The most widely used approach to this problem is to convert strings into numerical vectors using string kernels and subsequently apply a support vector machine that works in a numerical vector space. However, this non-one-to-one conversion involves a loss of information and makes it impossible to evaluate, using probability theory, the generalization error of a learning machine, considering that the given data to train and test the machine are strings generated according to probability laws. In this study, we approach this classification problem by constructing a classifier that works in a set of strings. To evaluate the generalization error of such a classifier theoretically, probability theory for strings is required. Therefore, we first extend a limit theorem for a consensus sequence of strings demonstrated by one of the authors and co-workers in a previous study. Using the obtained result, we then demonstrate that our learning machine classifies strings in an asymptotically optimal manner. Furthermore, we demonstrate the usefulness of our machine in practical data analysis by applying it to predicting protein-protein interactions using amino acid sequences and classifying RNAs by the secondary structure using nucleotide sequences.

14.
BMC Bioinformatics ; 17: 113, 2016 Mar 01.
Article in English | MEDLINE | ID: mdl-26932529

ABSTRACT

BACKGROUND: Drug discovery and design are important research fields in bioinformatics. Enumeration of chemical compounds is essential not only for the purpose, but also for analysis of chemical space and structure elucidation. In our previous study, we developed enumeration methods BfsSimEnum and BfsMulEnum for tree-like chemical compounds using a tree-structure to represent a chemical compound, which is limited to acyclic chemical compounds only. RESULTS: In this paper, we extend the methods, and develop BfsBenNaphEnum that can enumerate tree-like chemical compounds containing benzene rings and naphthalene rings, which include benzene isomers and naphthalene isomers such as ortho, meta, and para, by treating a benzene ring as an atom with valence six, instead of a ring of six carbon atoms, and treating a naphthalene ring as two benzene rings having a special bond. We compare our method with MOLGEN 5.0, which is a well-known general purpose structure generator, to enumerate chemical structures from a set of chemical formulas in terms of the number of enumerated structures and the computational time. The result suggests that our proposed method can reduce the computational time efficiently. CONCLUSIONS: We propose the enumeration method BfsBenNaphEnum for tree-like chemical compounds containing benzene rings and naphthalene rings as cyclic structures. BfsBenNaphEnum was from 50 times to 5,000,000 times faster than MOLGEN 5.0 for instances with 8 to 14 carbon atoms in our experiments.


Subjects
Algorithms , Benzene/chemistry , Pharmaceutical Chemistry/methods , Computational Biology/methods , Naphthalenes/chemistry , Stereoisomerism
15.
Biomark Med ; 10(6): 621-32, 2016 06.
Article in English | MEDLINE | ID: mdl-26947205

ABSTRACT

Many studies on biomarker discovery have been done by analyzing mutations in DNA sequences and differences in gene expression patterns. As a new branch of the latter approach, the concept of network biomarkers has been proposed, in which expression data of small subnetworks are used as markers. Furthermore, network biomarkers have been extended to dynamical network biomarkers, in which time series expression data of subnetworks are used as markers. On the other hand, methodologies from complex network analysis have also been applied to biomarker discovery. For example, various centrality measures and the concept of observability have been applied. In this article, we review these new approaches for biomarker discovery with a focus on the computational/methodological aspects.


Subjects
Biomarkers/metabolism , Biological Models , Algorithms , Humans , Protein Interaction Maps , Tissue Array Analysis
16.
Brief Bioinform ; 17(2): 270-82, 2016 Mar.
Article in English | MEDLINE | ID: mdl-26177815

ABSTRACT

Coiled-coils refer to a bundle of helices coiled together like strands of a rope. It has been estimated that nearly 3% of protein-encoding regions of genes harbour coiled-coil domains (CCDs). Experimental studies have confirmed that CCDs play a fundamental role in subcellular infrastructure and controlling trafficking of eukaryotic cells. Given the importance of coiled-coils, multiple bioinformatics tools have been developed to facilitate the systematic and high-throughput prediction of CCDs in proteins. In this article, we review and compare 12 sequence-based bioinformatics approaches and tools for coiled-coil prediction. These approaches can be categorized into two classes: coiled-coil detection and coiled-coil oligomeric state prediction. We evaluated and compared these methods in terms of their input/output, algorithm, prediction performance, validation methods and software utility. All the independent testing data sets are available at http://lightning.med.monash.edu/coiledcoil/. In addition, we conducted a case study of nine human polyglutamine (PolyQ) disease-related proteins and predicted CCDs and oligomeric states using various predictors. Prediction results for CCDs were highly variable among different predictors. Only two peptides from two proteins were confirmed to be CCDs by majority voting. Both domains were predicted to form dimeric coiled-coils using oligomeric state prediction. We anticipate that this comprehensive analysis will be an insightful resource for structural biologists with limited prior experience in bioinformatics tools, and for bioinformaticians who are interested in designing novel approaches for coiled-coil and its oligomeric state prediction.


Subjects
Algorithms , Chemical Models , Molecular Models , Proteins/chemistry , Proteins/ultrastructure , Protein Sequence Analysis/methods , Computer Simulation , Dimerization , Protein Conformation , Protein Domains , Software
17.
BMC Med Genomics ; 8 Suppl 2: S15, 2015.
Article in English | MEDLINE | ID: mdl-26044861

ABSTRACT

Enumeration of chemical compounds greatly assists designing and finding new drugs, and determining chemical structures from mass spectrometry. In our previous study, we developed efficient algorithms, BfsSimEnum and BfsMulEnum for enumerating tree-like chemical compounds without and with multiple bonds, respectively. For many instances, our previously proposed algorithms were able to enumerate chemical structures faster than other existing methods.


Subjects
Algorithms , Organic Chemicals/chemistry , Time Factors
18.
BMC Bioinformatics ; 16: 128, 2015 Apr 24.
Article in English | MEDLINE | ID: mdl-25907438

ABSTRACT

BACKGROUND: Many tree structures are found in nature and organisms. Such trees are believed to be constructed on the basis of certain rules. We have previously developed grammar-based compression methods for ordered and unordered single trees, based on bisection-type tree grammars. Here, these methods find construction rules for one single tree. On the other hand, specified construction rules can be utilized to generate multiple similar trees. RESULTS: Therefore, in this paper, we develop novel methods to discover common rules for the construction of multiple distinct trees, by improving and extending the previous methods using integer programming. We apply our proposed methods to several sets of glycans and RNA secondary structures, which play important roles in cellular systems, and can be regarded as tree structures. The results suggest that our method can be successfully applied to determining the minimum grammar and several common rules among glycans and RNAs. CONCLUSIONS: We propose integer programming-based methods MinSEOTGMul and MinSEUTGMul for the determination of the minimum grammars constructing multiple ordered and unordered trees, respectively. The proposed methods can provide clues for the determination of hierarchical structures contained in tree-structured biological data, beyond the extraction of frequent patterns.


Subjects
Algorithms , Data Compression/methods , Polysaccharides/chemistry , RNA/chemistry , Computational Biology/methods , Humans
19.
ScientificWorldJournal ; 2014: 240673, 2014.
Article in English | MEDLINE | ID: mdl-25093200

ABSTRACT

Proteins in living organisms express various important functions by interacting with other proteins and molecules. Therefore, many efforts have been made to investigate and predict protein-protein interactions (PPIs). Analysis of the strengths of PPIs is also important because such strengths are involved in the functionality of proteins. In this paper, we propose several feature space mappings from protein pairs using protein domain information to predict strengths of PPIs. Moreover, we perform computational experiments employing two machine learning methods, support vector regression (SVR) and relevance vector machine (RVM), on a dataset obtained from biological experiments. The prediction results showed that both SVR and RVM with our proposed features outperformed the best existing method.


Subjects
Protein Interaction Mapping , Artificial Intelligence , Computational Biology , Protein Interaction Domains and Motifs
20.
BMC Bioinformatics ; 15 Suppl 2: S6, 2014.
Article in English | MEDLINE | ID: mdl-24564744

ABSTRACT

BACKGROUND: Protein complexes play important roles in biological systems such as gene regulatory networks and metabolic pathways. Most methods for predicting protein complexes try to find protein complexes of size greater than three. It is known, however, that smaller protein complexes account for a large fraction of all complexes in several species. In our previous work, we developed a method with several feature space mappings and the domain composition kernel for prediction of heterodimeric protein complexes, which outperforms existing methods. RESULTS: We propose methods for prediction of heterotrimeric protein complexes by extending techniques in the previous work on the basis of the idea that most heterotrimeric protein complexes are not likely to share the same protein with each other. We make use of the discriminant function in support vector machines (SVMs), and design novel feature space mappings for the second phase. As the second classifier, we examine SVMs and relevance vector machines (RVMs). We perform 10-fold cross-validation computational experiments. The results suggest that our proposed two-phase methods and SVM with the extended features outperform the existing method NWE, which was reported to outperform other existing methods such as MCL, MCODE, DPClus, CMC, COACH, RRW, and PPSampler for prediction of heterotrimeric protein complexes. CONCLUSIONS: We propose two-phase prediction methods with the extended features, the domain composition kernel, SVMs and RVMs. The two-phase method with the extended features and the domain composition kernel using SVM as the second classifier is particularly useful for prediction of heterotrimeric protein complexes.


Subjects
Multiprotein Complexes/analysis , Support Vector Machine , Discriminant Analysis , Protein Multimerization