1.
Neural Netw ; 166: 51-69, 2023 Sep.
Article in English | MEDLINE | ID: mdl-37480769

ABSTRACT

This paper proposes a three-stage online deep learning model for time series based on the ensemble deep random vector functional link (edRVFL). The edRVFL stacks multiple randomized layers to enhance the single-layer RVFL's representation ability. Each hidden layer's representation is utilized for training an output layer, and the ensemble of all output layers forms the edRVFL's output. However, the original edRVFL is not designed for online learning, and the randomized nature of the features is harmful to extracting meaningful temporal features. In order to address the limitations and extend the edRVFL to an online learning mode, this paper proposes a dynamic edRVFL consisting of three online components, the online decomposition, the online training, and the online dynamic ensemble. First, an online decomposition is utilized as a feature engineering block for the edRVFL. Then, an online learning algorithm is designed to learn the edRVFL. Finally, an online dynamic ensemble method, which can measure the change in the distribution, is proposed for aggregating all layers' outputs. This paper evaluates and compares the proposed model with state-of-the-art methods on sixteen time series.


Subjects
Education, Distance; Neural Networks, Computer; Algorithms; Engineering; Time Factors
2.
Article in English | MEDLINE | ID: mdl-37022418

ABSTRACT

Alzheimer's disease (AD) is one of the best-known causes of dementia and is characterized by continuous deterioration in the cognitive skills of elderly people. It is a non-reversible disorder that can only be cured if detected at its early stage, known as mild cognitive impairment (MCI). The most common biomarkers to diagnose AD are structural atrophy and accumulation of plaques and tangles, which can be detected using magnetic resonance imaging (MRI) and positron emission tomography (PET) scans. Therefore, the present paper proposes wavelet transform-based multimodality fusion of MRI and PET scans to incorporate structural and metabolic information for the early detection of this life-taking neurodegenerative disease. Further, the deep learning model, ResNet-50, extracts the fused images' features. The random vector functional link (RVFL) with only one hidden layer is used to classify the extracted features. The weights and biases of the original RVFL network are optimized using an evolutionary algorithm to obtain optimum accuracy. All the experiments and comparisons are performed on the publicly available Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset to demonstrate the suggested algorithm's efficacy.

3.
IEEE J Biomed Health Inform ; 27(10): 4995-5003, 2023 Oct.
Article in English | MEDLINE | ID: mdl-36260567

ABSTRACT

As per the latest statistics, Alzheimer's disease (AD) is becoming a global burden over the coming decades. Identifying AD at its intermediate stage, mild cognitive impairment (MCI), remains challenging and requires credible biomarkers and robust learning approaches. Neuroimaging techniques like magnetic resonance imaging (MRI) and positron emission tomography (PET) are practical research approaches that reveal structural atrophies and metabolic variations. With the help of MRI and PET scans, metabolic and structural changes in AD patients can be seen even ten years before the disease's onset. This paper proposes a novel wavelet packet transform-based structural and metabolic image fusion approach using MRI and PET scans. An eight-layer trained CNN extracts features from multiple layers, and these features are fed to an ensemble of non-iterative random vector functional link (RVFL) models. The RVFL network incorporates the s-membership fuzzy function as an activation function that helps overcome outliers. Lastly, the outputs of all the customized RVFL classifiers are averaged and fed to the RVFL classifier to make the final decision. Experiments are performed on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset, and classification is made over CN vs. AD vs. MCI. The model performance obtained is decent enough to prove the effectiveness of the fusion-based ensemble approach.


Subjects
Alzheimer Disease; Cognitive Dysfunction; Humans; Alzheimer Disease/diagnosis; Neuroimaging/methods; Magnetic Resonance Imaging/methods; Neural Networks, Computer; Positron-Emission Tomography/methods
4.
Neural Netw ; 153: 496-517, 2022 Sep.
Article in English | MEDLINE | ID: mdl-35816861

ABSTRACT

Random Forest is an ensemble of decision trees based on the bagging and random subspace concepts. As suggested by Breiman, the strength of unstable learners and the diversity among them are the ensemble models' core strength. In this paper, we propose two approaches known as oblique and rotation double random forests. In the first approach, we propose rotation based double random forest. In rotation based double random forests, transformation or rotation of the feature space is generated at each node. At each node different random feature subspace is chosen for evaluation, hence the transformation at each node is different. Different transformations result in better diversity among the base learners and hence, better generalization performance. With the double random forest as base learner, the data at each node is transformed via two different transformations namely, principal component analysis and linear discriminant analysis. In the second approach, we propose oblique double random forest. Decision trees in random forest and double random forest are univariate, and this results in the generation of axis parallel split which fails to capture the geometric structure of the data. Also, the standard random forest may not grow sufficiently large decision trees resulting in suboptimal performance. To capture the geometric properties and to grow the decision trees of sufficient depth, we propose oblique double random forest. The oblique double random forest models are multivariate decision trees. At each non-leaf node, multisurface proximal support vector machine generates the optimal plane for better generalization performance. Also, different regularization techniques (Tikhonov regularization, axis-parallel split regularization, Null space regularization) are employed for tackling the small sample size problems in the decision trees of oblique double random forest. 
The proposed ensembles of decision trees produce larger trees than the standard ensembles of decision trees because bagging is used at each non-leaf node, which results in improved performance. The evaluation of the baseline models and the proposed oblique and rotation double random forest models is performed on 121 benchmark UCI datasets and real-world fisheries datasets. Both statistical analysis and the experimental results demonstrate the efficacy of the proposed oblique and rotation double random forest models compared to the baseline models on the benchmark datasets.


Subjects
Algorithms; Support Vector Machine; Principal Component Analysis; Rotation
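Illustrative sketch for the abstract above (not the authors' implementation): the difference between the axis-parallel split of a univariate tree and the oblique split of a multivariate tree can be shown in a few lines of Python. The function names are hypothetical, and the hyperplane weights are hand-picked for the example; in the paper the plane at each non-leaf node is generated by a multisurface proximal support vector machine.

```python
from typing import Sequence


def axis_parallel_split(x: Sequence[float], feature: int, threshold: float) -> bool:
    """Univariate test: go left if a single feature is at or below a threshold."""
    return x[feature] <= threshold


def oblique_split(x: Sequence[float], weights: Sequence[float], bias: float) -> bool:
    """Multivariate test: go left if the sample lies on or below a hyperplane."""
    return sum(w * v for w, v in zip(weights, x)) + bias <= 0.0


# Toy points separated by the diagonal x0 + x1 = 1 (hand-picked, hypothetical).
points = [(0.1, 0.2), (0.9, 0.8), (0.4, 0.3), (0.7, 0.6)]

# One oblique test with weights (1, 1) and bias -1 separates the two groups.
left = [oblique_split(p, weights=(1.0, 1.0), bias=-1.0) for p in points]
```

A single oblique test follows the diagonal boundary directly, whereas an axis-parallel tree can only approximate such a boundary with a staircase of univariate tests.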
5.
IEEE J Biomed Health Inform ; 26(10): 4996-5003, 2022 Oct.
Article in English | MEDLINE | ID: mdl-35737622

ABSTRACT

Deep learning for electroencephalogram-based classification is confronted with data scarcity, due to the time-consuming and expensive data collection procedure. Data augmentation has been shown to be an effective way to improve data efficiency. In addition, contrastive learning has recently been shown to hold great promise in learning effective representations without human supervision, which has the potential to improve the electroencephalogram-based recognition performance with limited labeled data. However, heavy data augmentation is a key ingredient of contrastive learning. In view of the limited number of sample-based data augmentation methods in electroencephalogram processing, three methods, performance-measure-based time warp, frequency noise addition and frequency masking, are proposed based on the characteristics of the electroencephalogram signal. These methods are parameter-learning-free, easy to implement, and can be applied to individual samples. In the experiment, the proposed data augmentation methods are evaluated on three electroencephalogram-based classification tasks, including situation awareness recognition, motor imagery classification and a brain-computer interface steady-state visually evoked potentials speller system. Results demonstrated that the convolutional models trained with the proposed data augmentation methods yielded significantly improved performance over baselines. Overall, this work provides more potential methods to cope with the problem of limited data and boost the classification performance in electroencephalogram processing.


Subjects
Brain-Computer Interfaces; Algorithms; Electroencephalography/methods; Evoked Potentials; Humans; Imagination/physiology
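Illustrative sketch for the abstract above: the abstract names frequency masking as one of the three augmentations but does not give its details, so the following is only a generic version of the idea, assuming a single-channel signal and a naive discrete Fourier transform. The function names and the choice of band are hypothetical.

```python
import cmath
import math
from typing import List


def dft(x: List[float]) -> List[complex]:
    """Naive O(n^2) discrete Fourier transform (adequate for short examples)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum: List[complex]) -> List[complex]:
    """Inverse of dft()."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def frequency_mask(signal: List[float], lo: int, hi: int) -> List[float]:
    """Zero the DFT bins lo..hi-1 (and their mirror bins) and return a real signal.

    Assumes 0 < lo <= hi <= len(signal) // 2.
    """
    spectrum = dft(signal)
    n = len(spectrum)
    for k in range(lo, hi):
        spectrum[k] = 0
        spectrum[(n - k) % n] = 0  # keep conjugate symmetry so the output stays real
    return [v.real for v in idft(spectrum)]


# A pure sinusoid occupying bin 3 of a 16-sample window.
n = 16
signal = [math.sin(2 * math.pi * 3 * t / n) for t in range(n)]
masked_out = frequency_mask(signal, 3, 4)    # removes the sinusoid's band
untouched = frequency_mask(signal, 5, 7)     # masks a band the signal does not use
```

In an augmentation pipeline the masked band would typically be drawn at random per sample; that policy is an assumption here, not something stated in the abstract.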
6.
IEEE J Biomed Health Inform ; 26(10): 4966-4975, 2022 Oct.
Article in English | MEDLINE | ID: mdl-35522640

ABSTRACT

Hospitals can predetermine the admission rate and facilitate resource allocation based on valid emergency requests and bed capacity estimation. The excess unoccupied beds can be determined with the help of forecasting the number of discharged patients. Extracting predictive features and mining the temporal patterns from historical observations are crucial for accurate and reliable forecasts. Machine learning algorithms have demonstrated the ability to learn temporal knowledge and make predictions for unseen inputs. This paper utilizes several machine learning algorithms to forecast the inpatient discharges of Singapore hospitals and compare them with statistical methods. A novel ensemble deep learning algorithm based on random vector functional links is established to predict inpatient discharges. The ensemble deep learning framework is optimized in a greedy layer-wise fashion. Several forecasting metrics and statistical tests are utilized to demonstrate the proposed method's superiority. The proposed algorithm statistically outperforms the benchmark with a ranking of 1.875. Finally, practical implications and future directions are discussed.


Subjects
Inpatients; Patient Discharge; Hospitals; Humans; Machine Learning; Singapore
7.
J Theor Biol ; 270(1): 56-62, 2011 Feb 07.
Article in English | MEDLINE | ID: mdl-21056045

ABSTRACT

Some creatures living in extremely low temperatures can produce special materials called "antifreeze proteins" (AFPs), which can prevent the cell and body fluids from freezing. AFPs are present in vertebrates, invertebrates, plants, bacteria, fungi, etc. Although AFPs have a common function, they show a high degree of diversity in sequences and structures. Therefore, sequence similarity based search methods often fail to predict AFPs from sequence databases. In this work, we report a random forest approach "AFP-Pred" for the prediction of antifreeze proteins from protein sequence. AFP-Pred was trained on a dataset containing 300 AFPs and 300 non-AFPs and tested on a dataset containing 181 AFPs and 9193 non-AFPs. AFP-Pred achieved 81.33% accuracy from training and 83.38% from testing. The performance of AFP-Pred was compared with BLAST and HMM. High prediction accuracy and successful prediction of hypothetical proteins suggest that AFP-Pred can be a useful approach to identify antifreeze proteins from sequence information, irrespective of their sequence similarity.


Subjects
Algorithms; Amino Acid Sequence/genetics; Antifreeze Proteins/analysis; Computational Biology/methods; Proteins/classification; Amino Acids/chemistry; Antifreeze Proteins/genetics; Artificial Intelligence; Chemical Phenomena; Protein Structure, Secondary/genetics; Protein Structure, Tertiary/genetics; Proteins/genetics; ROC Curve
8.
J Biomol Struct Dyn ; 28(3): 405-14, 2010 Dec.
Article in English | MEDLINE | ID: mdl-20919755

ABSTRACT

Knowledge of three-dimensional structure is essential to understand the function of a protein. Although the overall fold is determined by the whole of its sequence, a small group of residues, often called structural motifs, plays a crucial role in determining the protein fold and its stability. Identification of such structural motifs requires a sufficient number of sequence and structural homologs to define conservation and evolutionary information. Unfortunately, many structures in the protein structure databases have no homologous structures or sequences. In this work, we report an SVM method, SMpred, to identify structural motifs from a single protein structure without using sequence and structural homologs. The SMpred method was trained and tested using 132 protein domains containing 581 motifs. SMpred achieved 78.79% accuracy with 79.06% sensitivity and 78.53% specificity. The performance of SMpred was evaluated with MegaMotifBase using 188 proteins containing 1161 motifs. Out of 1161 motifs, SMpred correctly identified 1503 structural motifs reported in MegaMotifBase. Further, we showed that SMpred is a useful approach for the length-deviant superfamilies and single-member superfamilies. This result suggests the usefulness of our approach for facilitating the identification of structural motifs in protein structure in the absence of sequence and structural homologs. The dataset and executable for the SMpred algorithm are available at http://www3.ntu.edu.sg/home/EPNSugan/index_files/SMpred.htm.


Subjects
Amino Acid Motifs; Databases, Protein; Evolution, Molecular; Protein Conformation; Proteins/chemistry; Software; Amino Acid Sequence; Models, Molecular; Proteins/classification; Proteins/genetics; Sequence Alignment/methods
9.
Protein Pept Lett ; 17(12): 1473-9, 2010 Dec.
Article in English | MEDLINE | ID: mdl-20666727

ABSTRACT

Apoptosis is an essential process for controlling tissue homeostasis by regulating a physiological balance between cell proliferation and cell death. The subcellular locations of proteins performing the cell death are determined by mostly independent cellular mechanisms. The regular bioinformatics tools to predict the subcellular locations of such apoptotic proteins often fail. This work proposes a model for the sorting of proteins that are involved in apoptosis, allowing both the prediction of their subcellular locations and the identification of the molecular properties that contribute to it. We report a novel hybrid Genetic Algorithm (GA)/Support Vector Machine (SVM) approach to predict apoptotic protein sequences using 119 sequence-derived properties like frequency of amino acid groups, secondary structure, and physicochemical properties. GA is used for selecting a near-optimal subset of informative features that is most relevant for the classification. Jackknife cross-validation is applied to test the predictive capability of the proposed method on 317 apoptosis proteins. Our method achieved 85.80% accuracy using all 119 features and 89.91% accuracy for 25 features selected by GA. Our models were examined on a test dataset of 98 apoptosis proteins and obtained an overall accuracy of 90.34%. The results show that the proposed approach is promising; it is able to select small subsets of features and still improve the classification accuracy. Our model can contribute to the understanding of programmed cell death and drug discovery. The software and dataset are available at http://www.inb.uni-luebeck.de/tools-demos/apoptosis/GASVM.


Subjects
Apoptosis Regulatory Proteins/chemistry; Algorithms; Artificial Intelligence; Protein Transport
10.
Amino Acids ; 39(5): 1385-91, 2010 Nov.
Article in English | MEDLINE | ID: mdl-20411285

ABSTRACT

Real-world datasets commonly have issues with data imbalance. There are several approaches, such as weighting, sub-sampling, and data modeling, for handling these data. Learning in the presence of data imbalances presents a great challenge to machine learning. Techniques such as support-vector machines have excellent performance for balanced data, but may fail when applied to imbalanced datasets. In this paper, we propose a new undersampling technique for selecting instances from the majority class. The performance of this approach was evaluated on several real, imbalanced biological datasets. The ratios of negative to positive samples vary from ~9:1 to ~100:1. Useful classifiers have high sensitivity and specificity. Our results demonstrate that the proposed selection technique improves the sensitivity compared to the weighted support-vector machine and to available results in the literature for the same datasets.


Subjects
Algorithms; Amino Acids/chemistry; Catalytic Domain; Chemistry, Physical; Databases, Factual; Molecular Structure; Molecular Weight
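Illustrative sketch for the abstract above: the abstract does not describe how the proposed technique selects majority-class instances, so the following shows only plain random undersampling, the usual baseline such techniques are compared against. The function name, the `ratio` parameter, and the toy data are hypothetical.

```python
import random
from typing import Any, List, Sequence, Tuple


def random_undersample(
    samples: Sequence[Any],
    labels: Sequence[int],
    majority_label: int,
    ratio: float = 1.0,
    seed: int = 0,
) -> Tuple[List[Any], List[int]]:
    """Keep all minority samples and a random subset of the majority class.

    `ratio` is the desired majority:minority ratio after sampling.
    """
    rng = random.Random(seed)
    pairs = list(zip(samples, labels))
    minority = [p for p in pairs if p[1] != majority_label]
    majority = [p for p in pairs if p[1] == majority_label]
    keep = min(len(majority), int(round(ratio * len(minority))))
    reduced = rng.sample(majority, keep) + minority
    rng.shuffle(reduced)
    return [s for s, _ in reduced], [y for _, y in reduced]


# Toy dataset with a 9:1 negative-to-positive ratio (90 negatives, 10 positives).
toy_x = list(range(100))
toy_y = [0] * 90 + [1] * 10
new_x, new_y = random_undersample(toy_x, toy_y, majority_label=0)
```

Random selection discards majority samples blindly; an informed selection rule, as the abstract proposes, would replace the `rng.sample` call.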
11.
Amino Acids ; 39(3): 777-83, 2010 Aug.
Article in English | MEDLINE | ID: mdl-20186553

ABSTRACT

Lipocalins are functionally diverse proteins that are composed of 120-180 amino acid residues. Members of this family have several important biological functions including ligand transport, cryptic coloration, sensory transduction, endonuclease activity, stress response activity in plants, odorant binding, prostaglandin biosynthesis, cellular homeostasis regulation, immunity, immunotherapy and so on. Identification of lipocalins from protein sequence is challenging due to the poor sequence identity, which often falls below the twilight zone. So far, no specific method has been reported to identify lipocalins from primary sequence. In this paper, we report a support vector machine (SVM) approach, LipoPred, to predict lipocalins from protein sequence using sequence-derived properties. LipoPred was trained using a dataset consisting of 325 lipocalin proteins and 325 non-lipocalin proteins, and evaluated on an independent set of 140 lipocalin proteins and 21,447 non-lipocalin proteins. LipoPred achieved 88.61% accuracy with 89.26% sensitivity, 85.27% specificity and a Matthews correlation coefficient (MCC) of 0.74. When applied to the test dataset, LipoPred achieved 84.25% accuracy with 88.57% sensitivity, 84.22% specificity and an MCC of 0.16. LipoPred achieved a better performance rate when compared with the PSI-BLAST, HMM and SVM-Prot methods. Out of 218 lipocalins, LipoPred correctly predicted 194 proteins including 39 lipocalins that are non-homologous to any protein in the SWISSPROT database. This result shows that LipoPred is potentially useful for predicting the lipocalin proteins that have no sequence homologs in the sequence databases. Further, successful prediction of nine hypothetical lipocalin proteins and five new members of the lipocalin family shows that LipoPred can be efficiently used to identify and annotate new lipocalin proteins from sequence databases. The LipoPred software and dataset are available at http://www3.ntu.edu.sg/home/EPNSugan/index_files/lipopred.htm.


Subjects
Lipocalins/chemistry; Sequence Alignment/methods; Databases, Protein; Humans; Protein Structure, Tertiary; Sequence Alignment/instrumentation; Sequence Homology, Amino Acid
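Worked example for the abstract above: the test-set figures (84.25% accuracy but an MCC of only 0.16) look contradictory until the class imbalance of 140 positives against 21,447 negatives is taken into account. The sketch below recomputes the metrics from confusion counts. The counts themselves are an assumption, back-computed here from the reported sensitivity and specificity by rounding; they are not given in the abstract.

```python
import math
from typing import Dict


def binary_metrics(tp: int, fn: int, tn: int, fp: int) -> Dict[str, float]:
    """Accuracy, sensitivity, specificity and Matthews correlation coefficient."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


# Assumed counts: derived from the reported rates, not taken from the paper.
positives, negatives = 140, 21447
tp = round(0.8857 * positives)   # sensitivity 88.57%
tn = round(0.8422 * negatives)   # specificity 84.22%
metrics = binary_metrics(tp, positives - tp, tn, negatives - tn)
```

Under these assumed counts the false positives outnumber the true positives by more than an order of magnitude, which pulls the MCC down to roughly the reported 0.16 even though accuracy stays near 84%.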
12.
Protein Pept Lett ; 17(4): 423-30, 2010 Apr.
Article in English | MEDLINE | ID: mdl-20044918

ABSTRACT

X-ray crystallography is the most widely used method for protein 3-dimensional structure determination. Selecting target proteins that can yield high-quality crystals for X-ray crystallography is a challenging task. Prediction of protein crystallization propensity from sequence information is useful for the selection of target proteins for crystallization. Recently, support vector machines have been widely used to solve various biological problems. In this work, we present the SVMCRYS method, which uses a support vector machine to classify protein sequences as 'amenable to crystallization' or 'resistant to crystallization'. SVMCRYS was trained on a dataset containing 728 sequences that gave diffraction-quality crystals and 728 sequences where work had been stopped before crystals were obtained. The performance of the SVMCRYS method was compared with other sequence-based crystallization prediction methods such as SECRET, CRYSTALP, OB-Score, ParCrys and XtalPred using three different datasets. SVMCRYS achieved a better prediction rate with higher sensitivity and specificity. Our analysis suggests that SVMCRYS can be used to predict proteins which are amenable to crystallization and proteins which are difficult to crystallize. The SVMCRYS software, dataset and feature set can be obtained from http://www3.ntu.edu.sg/home/EPNSugan/index_files/svmcrys.htm.


Subjects
Algorithms; Amino Acid Sequence; Artificial Intelligence; Crystallography, X-Ray/methods; Proteins/chemistry; Databases, Protein; Nuclear Magnetic Resonance, Biomolecular; Proteins/metabolism; ROC Curve; Reproducibility of Results; Structure-Activity Relationship
13.
Biochem Biophys Res Commun ; 391(3): 1306-11, 2010 Jan 15.
Article in English | MEDLINE | ID: mdl-19995554

ABSTRACT

Eukaryotic protein secretion generally occurs via the classical secretory pathway that traverses the ER and Golgi apparatus. Secreted proteins usually contain a signal sequence with all the essential information required to target them for secretion. However, some proteins like fibroblast growth factors (FGF-1, FGF-2), interleukins (IL-1 alpha, IL-1 beta), galectins and thioredoxin are exported by an alternative pathway. This is known as leaderless or non-classical secretion and works without a signal sequence. Most computational methods for the identification of secretory proteins use the signal peptide as an indicator and are therefore not able to identify substrates of non-classical secretion. In this work, we report a random forest method, SPRED, to identify secretory proteins from protein sequences irrespective of N-terminal signal peptides, thus also allowing correct classification of non-classical secretory proteins. Training was performed on a dataset containing 600 extracellular proteins and 600 cytoplasmic and/or nuclear proteins. The algorithm was tested on 180 extracellular proteins and 1380 cytoplasmic and/or nuclear proteins. We obtained 85.92% accuracy from training and 82.18% accuracy from testing. Since SPRED does not use N-terminal signals, it can detect non-classical secreted proteins by filtering out those secreted proteins with an N-terminal signal using SignalP. SPRED predicted 15 out of 19 experimentally verified non-classical secretory proteins. By scanning the entire human proteome we identified 566 protein sequences potentially undergoing non-classical secretion. The dataset and standalone version of the SPRED software are available at http://www.inb.uni-luebeck.de/tools-demos/spred/spred.


Subjects
Artificial Intelligence; Genome, Human; Proteins/metabolism; Proteome; Sequence Analysis, Protein/methods; Animals; Humans; Proteins/chemistry; Proteins/genetics
14.
J Theor Biol ; 259(3): 533-40, 2009 Aug 07.
Article in English | MEDLINE | ID: mdl-19406131

ABSTRACT

We investigate the multiclass classification of cancer microarray samples. In contrast to classification of two cancer types from gene expression data, multiclass classification of more than two cancer types is a relatively hard and less studied problem. We used class-wise optimized genes with a corresponding one-versus-all support vector machine (OVA-SVM) classifier to maximize the utilization of selected genes. The final prediction was made by using probability scores from all classifiers. We used three different methods of estimating probability from the decision value. Among the three probability methods, Platt's approach was more consistent, whereas the isotonic approach performed better for datasets with unequal proportions of samples in different classes. Probability-based decisions not only give a true and fair comparison between different one-versus-all (OVA) classifiers but also make it possible to use them for any post analysis. Several ensemble experiments, an example of post analysis, of the three probability methods were implemented to study their effect in improving the classification accuracy. We observe that the ensembles did help in improving the predictive accuracy on cancer datasets, especially those involving unbalanced samples. A four-fold external stratified cross-validation experiment was performed on the six multiclass cancer datasets to obtain unbiased estimates of prediction accuracies. Analysis of class-wise frequently selected genes on two cancer datasets demonstrated that the approach was able to select important and relevant genes consistent with the literature. This study demonstrates successful implementation of the framework of class-wise feature selection and multiclass classification for prediction of cancer subtypes on six datasets.


Subjects
Gene Expression Profiling; Genes, Neoplasm; Neoplasms/classification; Neoplasms/genetics; Oligonucleotide Array Sequence Analysis; Pattern Recognition, Automated/methods; Animals; Computational Biology; Gene Expression; Humans; Probability
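Illustrative sketch for the abstract above (not the authors' code): Platt's approach maps an SVM decision value f to a probability with a sigmoid, P(y=1 | f) = 1 / (1 + exp(A*f + B)), and the multiclass prediction is the class whose one-versus-all classifier gives the highest probability. The sigmoid parameters A and B are normally fitted on held-out decision values; the values below, the class names, and the function names are hypothetical.

```python
import math
from typing import Dict, Tuple


def platt_probability(decision_value: float, a: float, b: float) -> float:
    """Platt's sigmoid: P(y=1 | f) = 1 / (1 + exp(a * f + b))."""
    return 1.0 / (1.0 + math.exp(a * decision_value + b))


def predict_ova(
    decision_values: Dict[str, float],
    sigmoid_params: Dict[str, Tuple[float, float]],
) -> Tuple[str, Dict[str, float]]:
    """Pick the class whose one-versus-all classifier is most confident."""
    probs = {
        cls: platt_probability(f, *sigmoid_params[cls])
        for cls, f in decision_values.items()
    }
    return max(probs, key=probs.get), probs


# Hypothetical decision values from three OVA classifiers for one sample,
# with hypothetical per-classifier sigmoid parameters (A, B).
decision_values = {"A": 0.8, "B": 0.3, "C": -1.2}
sigmoid_params = {"A": (-1.0, 0.0), "B": (-4.0, 0.0), "C": (-2.0, 0.0)}
label, probs = predict_ova(decision_values, sigmoid_params)
```

In this toy case classifier A has the largest raw decision value, yet class B wins once each score is calibrated with its own sigmoid. This is the sense in which probability scores make the outputs of different OVA classifiers comparable.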
15.
J Biomol Struct Dyn ; 26(6): 679-86, 2009 Jun.
Article in English | MEDLINE | ID: mdl-19385697

ABSTRACT

DNA-binding proteins (DNABPs) are important for various cellular processes, such as transcriptional regulation, recombination, replication, repair, and DNA modification. So far, various bioinformatics and machine learning techniques have been applied for identification of DNA-binding proteins from protein structure. Only a few methods are available for the identification of DNA-binding proteins from protein sequence. In this work, we report a random forest method, DNA-Prot, to identify DNA-binding proteins from protein sequence. Training was performed on a dataset containing 146 DNA-binding proteins and 250 non DNA-binding proteins. The algorithm was tested on a dataset containing 92 DNA-binding proteins and 100 non DNA-binding proteins. We obtained 80.31% accuracy from training and 84.37% accuracy from testing. Benchmarking analysis on an independent dataset of 823 DNA-binding proteins and 823 non DNA-binding proteins shows that our approach can distinguish DNA-binding proteins from non DNA-binding proteins with more than 80% accuracy. We also compared our method with the DNAbinder method on the test dataset and two independent datasets. Comparable performance was observed from both methods on the test dataset. In the benchmark dataset containing 823 DNA-binding proteins and 823 non DNA-binding proteins, we obtained significantly better performance from DNA-Prot with 81.83% accuracy, whereas DNAbinder achieved only 61.42% accuracy using amino acid composition and 63.5% using the PSSM profile. Similarly, DNA-Prot achieved a better performance rate on the benchmark dataset containing 88 DNA-binding proteins and 233 non DNA-binding proteins. This result shows that DNA-Prot can be efficiently used to identify DNA-binding proteins from sequence information. The dataset and standalone version of the DNA-Prot software can be obtained from http://www3.ntu.edu.sg/home/EPNSugan/index_files/dnaprot.htm.


Subjects
Algorithms; DNA-Binding Proteins/analysis; Databases, Protein; Amino Acids/metabolism; DNA-Binding Proteins/chemistry; DNA-Binding Proteins/metabolism; Hydrophobic and Hydrophilic Interactions; Reproducibility of Results
16.
Biochem Biophys Res Commun ; 384(2): 155-9, 2009 Jun 26.
Article in English | MEDLINE | ID: mdl-19394310

ABSTRACT

Identification of functionally important sites (FIS) in proteins is a critical problem and can have profound importance where protein structural information is limited. Machine learning techniques have been very useful in successful classification of many important biological problems. In this paper, we adopt the sparse kernel least squares classifiers (SKLSC) approach for classification and/or prediction of FIS using protein sequence derived features. The SKLSC algorithm was applied to 5435 FIS that have been extracted from 312 reliable alignments for a wide range of protein families. We obtained 68.28% sensitivity and 68.66% specificity for the training dataset and 65.34% sensitivity and 66.88% specificity for the testing dataset. Further, a large-scale benchmarking study using alignments of 101 protein families containing 1899 FIS showed that our method achieved an average of approximately 70% sensitivity in predicting different types of FIS, such as active sites and metal, ligand or protein binding sites. Our findings also indicate that active sites and metal binding sites are comparatively easier to predict than the ligand and protein binding sites. Despite moderate success, our results suggest the usefulness and potential of the SKLSC approach in prediction of FIS using only protein sequence derived information.


Subjects
Binding Sites; Proteins/chemistry; Sequence Analysis, Protein/methods; Amino Acid Sequence; Catalytic Domain; Least-Squares Analysis; Proteins/classification
17.
Bioinformatics ; 25(2): 204-10, 2009 Jan 15.
Article in English | MEDLINE | ID: mdl-19038986

ABSTRACT

MOTIVATION: So far, various bioinformatics and machine learning techniques have been applied for identification of sequence and functionally conserved residues in proteins. Although a few computational methods are available for the prediction of structurally conserved residues from protein structure, almost all methods require homologous structural information and structure-based alignments, which still prove to be a bottleneck in protein structure comparison studies. In this work, we developed a neural network approach for identification of structurally important residues from a single protein structure without using homologous structural information and structural alignment. RESULTS: A neural network ensemble (NNE) method that utilizes the negative correlation learning (NCL) approach was developed for identification of structurally conserved residues (SCRs) in proteins using features that represent amino acid conservation and composition, physico-chemical properties and structural properties. The NCL-NNE method was applied to 6042 SCRs that have been extracted from 496 protein domains. This method obtained high prediction sensitivity (92.8%) and quality (Matthews correlation coefficient of 0.852) in identification of SCRs. Further benchmarking using 60 protein domains containing 1657 SCRs that were not part of the training and testing datasets shows that the NCL-NNE can correctly predict SCRs with approximately 90% sensitivity. These results suggest the usefulness of NCL-NNE for facilitating the identification of SCRs utilizing information derived from a single protein structure. Therefore, this method could be extremely effective in large-scale benchmarking studies where reliable structural homologs and alignments are limited.


Subjects
Neural Networks, Computer; Proteins/chemistry; Computational Biology/methods; Conserved Sequence; Databases, Protein; Protein Conformation; Protein Structure, Tertiary; Proteins/genetics
18.
J Theor Biol ; 253(2): 375-80, 2008 Jul 21.
Article in English | MEDLINE | ID: mdl-18423492

ABSTRACT

Determination of protein structural class solely from sequence information is a challenging task. Several attempts to solve this problem using various methods can be found in the literature. We present a support vector machine (SVM) approach where a probability-based decision is used along with class-wise optimized feature sets. This approach has two characteristics that distinguish it from earlier attempts: (1) it uses class-wise optimized features and (2) decisions of different SVM classifiers are coupled with probability estimates to make the final prediction. The algorithm was tested on three datasets, containing 498 domains, 1092 domains and 5261 domains. Ten-fold external cross-validation was performed to assess the performance of the algorithm. A significantly high accuracy of 92.89% was obtained for the 498-dataset. We achieved 54.67% accuracy for the dataset with 1092 domains, which is better than the previously reported best accuracy of 53.8%. We obtained 59.43% prediction accuracy for the larger and less redundant 5261-dataset. We also investigated the advantage of using class-wise features over the union of these features (the conventional approach) in a one-vs.-all SVM framework. Our results clearly show the advantage of using class-wise optimized features. A brief analysis of the selected class-wise features indicates their biological significance.


Subjects
Protein Conformation; Sequence Analysis, Protein/methods; Algorithms; Animals; Computational Biology/methods; Databases, Protein; Pattern Recognition, Automated/methods
19.
Biochem Biophys Res Commun ; 367(3): 630-4, 2008 Mar 14.
Article in English | MEDLINE | ID: mdl-18206645

ABSTRACT

Identification of catalytic residues can provide valuable insights into protein function. With the increasing number of protein 3D structures solved by X-ray crystallography and NMR techniques, it is highly desirable to develop an efficient method to identify their catalytic sites. In this paper, we present an SVM method for the identification of catalytic residues using sequence and structural features. The algorithm was applied to the 2096 catalytic residues derived from the Catalytic Site Atlas database. We obtained an overall prediction accuracy of 88.6% from 10-fold cross-validation and 95.76% from the resubstitution test. Testing on the 254 catalytic residues shows that our method can correctly predict all 254 residues. This result suggests the usefulness of our approach for facilitating the identification of catalytic residues from protein structures.


Subjects
Algorithms; Catalytic Domain; Computational Biology/methods; Computer Simulation; Databases, Protein; Proteins/chemistry; Predictive Value of Tests; Protein Conformation; Reproducibility of Results
20.
Nucleic Acids Res ; 36(Database issue): D218-21, 2008 Jan.
Article in English | MEDLINE | ID: mdl-17933773

ABSTRACT

Structural motifs are important for the integrity of a protein fold and can be employed to design and rationalize protein engineering and folding experiments. Such conserved segments represent the conserved core of a family or superfamily and can be crucial for the recognition of potential new members in sequence and structure databases. We present a database, MegaMotifBase, that compiles a set of important structural segments or motifs for protein structures. Motifs are recognized on the basis of both sequence conservation and preservation of important structural features such as amino acid preference, solvent accessibility, secondary structural content, hydrogen-bonding pattern and residue packing. This database provides 3D orientation patterns of the identified motifs in terms of inter-motif distances and torsion angles. Important applications of structural motifs are also provided in several crucial areas such as similar sequence and structure search, multiple sequence alignment and homology modeling. MegaMotifBase can be a useful resource to gain knowledge about structure and functional relationship of proteins. The database can be accessed from the URL http://caps.ncbs.res.in/MegaMotifbase/index.html.


Subjects
Amino Acid Motifs; Databases, Protein; Proteins/classification; Amino Acid Sequence; Conserved Sequence; Internet; Proteins/chemistry; Sequence Analysis, Protein; Structural Homology, Protein