Results 1 - 20 of 29
1.
Neural Netw ; 166: 51-69, 2023 Sep.
Article in English | MEDLINE | ID: mdl-37480769

ABSTRACT

This paper proposes a three-stage online deep learning model for time series based on the ensemble deep random vector functional link (edRVFL). The edRVFL stacks multiple randomized layers to enhance the single-layer RVFL's representation ability. Each hidden layer's representation is utilized for training an output layer, and the ensemble of all output layers forms the edRVFL's output. However, the original edRVFL is not designed for online learning, and the randomized nature of the features is harmful to extracting meaningful temporal features. In order to address the limitations and extend the edRVFL to an online learning mode, this paper proposes a dynamic edRVFL consisting of three online components, the online decomposition, the online training, and the online dynamic ensemble. First, an online decomposition is utilized as a feature engineering block for the edRVFL. Then, an online learning algorithm is designed to learn the edRVFL. Finally, an online dynamic ensemble method, which can measure the change in the distribution, is proposed for aggregating all layers' outputs. This paper evaluates and compares the proposed model with state-of-the-art methods on sixteen time series.


Subjects
Education, Distance; Neural Networks, Computer; Algorithms; Engineering; Time Factors
2.
IEEE J Biomed Health Inform ; 27(10): 4995-5003, 2023 Oct.
Article in English | MEDLINE | ID: mdl-36260567

ABSTRACT

According to the latest statistics, Alzheimer's disease (AD) is expected to be a global burden over the coming decades. Identifying AD at its intermediate stage, mild cognitive impairment (MCI), remains challenging and calls for credible biomarkers and robust learning approaches. Neuroimaging techniques like magnetic resonance imaging (MRI) and positron emission tomography (PET) are practical research approaches that reveal structural atrophy and metabolic variations. With the help of MRI and PET scans, metabolic and structural changes in AD patients can be seen even ten years before the disease's onset. This paper proposes a novel wavelet packet transform-based structural and metabolic image fusion approach using MRI and PET scans. An eight-layer trained CNN extracts features from multiple layers, and these features are fed to an ensemble of non-iterative random vector functional link (RVFL) models. The RVFL network incorporates the s-membership fuzzy function as an activation function that helps overcome outliers. Lastly, the outputs of all the customized RVFL classifiers are averaged and fed to the RVFL classifier to make the final decision. Experiments are performed on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset, and classification is made over CN vs. AD vs. MCI. The performance obtained supports the effectiveness of the fusion-based ensemble approach.


Subjects
Alzheimer Disease; Cognitive Dysfunction; Humans; Alzheimer Disease/diagnosis; Neuroimaging/methods; Magnetic Resonance Imaging/methods; Neural Networks, Computer; Positron-Emission Tomography/methods
3.
Article in English | MEDLINE | ID: mdl-37022418

ABSTRACT

Alzheimer's disease (AD) is one of the best-known causes of dementia and is characterized by continuous deterioration in the cognitive skills of elderly people. It is an irreversible disorder that can only be treated if detected early, at the stage known as mild cognitive impairment (MCI). The most common biomarkers used to diagnose AD are structural atrophy and the accumulation of plaques and tangles, which can be detected using magnetic resonance imaging (MRI) and positron emission tomography (PET) scans. Therefore, the present paper proposes wavelet transform-based multimodality fusion of MRI and PET scans to incorporate structural and metabolic information for the early detection of this fatal neurodegenerative disease. Further, the deep learning model ResNet-50 extracts the features of the fused images. A random vector functional link (RVFL) network with only one hidden layer is used to classify the extracted features. The weights and biases of the original RVFL network are optimized using an evolutionary algorithm to obtain optimum accuracy. All the experiments and comparisons are performed on the publicly available Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset to demonstrate the suggested algorithm's efficacy.

4.
IEEE J Biomed Health Inform ; 26(10): 4996-5003, 2022 Oct.
Article in English | MEDLINE | ID: mdl-35737622

ABSTRACT

Deep learning for electroencephalogram-based classification is confronted with data scarcity, due to the time-consuming and expensive data collection procedure. Data augmentation has been shown to be an effective way to improve data efficiency. In addition, contrastive learning has recently been shown to hold great promise in learning effective representations without human supervision, which has the potential to improve electroencephalogram-based recognition performance with limited labeled data. However, heavy data augmentation is a key ingredient of contrastive learning. In view of the limited number of sample-based data augmentation methods in electroencephalogram processing, three methods, performance-measure-based time warp, frequency noise addition and frequency masking, are proposed based on the characteristics of the electroencephalogram signal. These methods require no parameter learning, are easy to implement, and can be applied to individual samples. In the experiment, the proposed data augmentation methods are evaluated on three electroencephalogram-based classification tasks, including situation awareness recognition, motor imagery classification and a brain-computer interface steady-state visually evoked potentials speller system. Results demonstrated that the convolutional models trained with the proposed data augmentation methods yielded significantly improved performance over baselines. Overall, this work provides more potential methods to cope with the problem of limited data and boost the classification performance in electroencephalogram processing.


Subjects
Brain-Computer Interfaces; Algorithms; Electroencephalography/methods; Evoked Potentials; Humans; Imagination/physiology
5.
IEEE J Biomed Health Inform ; 26(10): 4966-4975, 2022 Oct.
Article in English | MEDLINE | ID: mdl-35522640

ABSTRACT

Hospitals can predetermine the admission rate and facilitate resource allocation based on valid emergency requests and bed capacity estimation. The excess unoccupied beds can be determined with the help of forecasting the number of discharged patients. Extracting predictive features and mining the temporal patterns from historical observations are crucial for accurate and reliable forecasts. Machine learning algorithms have demonstrated the ability to learn temporal knowledge and make predictions for unseen inputs. This paper utilizes several machine learning algorithms to forecast the inpatient discharges of Singapore hospitals and compares them with statistical methods. A novel ensemble deep learning algorithm based on random vector functional links is established to predict inpatient discharges. The ensemble deep learning framework is optimized in a greedy layer-wise fashion. Several forecasting metrics and statistical tests are utilized to demonstrate the proposed method's superiority. The proposed algorithm statistically outperforms the benchmark with a ranking of 1.875. Finally, practical implications and future directions are discussed.


Subjects
Inpatients; Patient Discharge; Hospitals; Humans; Machine Learning; Singapore
6.
Neural Netw ; 153: 496-517, 2022 Sep.
Article in English | MEDLINE | ID: mdl-35816861

ABSTRACT

Random Forest is an ensemble of decision trees based on the bagging and random subspace concepts. As suggested by Breiman, the strength of unstable learners and the diversity among them are the ensemble models' core strength. In this paper, we propose two approaches known as oblique and rotation double random forests. In the first approach, we propose the rotation-based double random forest. In rotation-based double random forests, a transformation or rotation of the feature space is generated at each node. At each node a different random feature subspace is chosen for evaluation, hence the transformation at each node is different. Different transformations result in better diversity among the base learners and hence better generalization performance. With the double random forest as base learner, the data at each node is transformed via two different transformations, namely principal component analysis and linear discriminant analysis. In the second approach, we propose the oblique double random forest. Decision trees in random forest and double random forest are univariate, and this results in axis-parallel splits, which fail to capture the geometric structure of the data. Also, the standard random forest may not grow sufficiently large decision trees, resulting in suboptimal performance. To capture the geometric properties and to grow decision trees of sufficient depth, we propose the oblique double random forest. The oblique double random forest models are multivariate decision trees. At each non-leaf node, a multisurface proximal support vector machine generates the optimal plane for better generalization performance. Also, different regularization techniques (Tikhonov regularization, axis-parallel split regularization, null space regularization) are employed for tackling the small sample size problems in the decision trees of the oblique double random forest. The proposed ensembles of decision trees produce larger trees than the standard ensembles of decision trees, as bagging is used at each non-leaf node, which results in improved performance. The baseline models and the proposed oblique and rotation double random forest models are evaluated on 121 benchmark UCI datasets and real-world fisheries datasets. Both statistical analysis and the experimental results demonstrate the efficacy of the proposed oblique and rotation double random forest models compared to the baseline models on the benchmark datasets.
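
The difference between the univariate (axis-parallel) and multivariate (oblique) node tests described in this abstract can be illustrated with a minimal sketch. The function names, the toy data and the hand-picked hyperplane are assumptions made for illustration; in the paper the hyperplane at each node comes from a multisurface proximal support vector machine, which is not reproduced here.

```python
from typing import Sequence


def axis_parallel_goes_left(x: Sequence[float], feature: int, threshold: float) -> bool:
    """Univariate node test: compare one feature with a threshold."""
    return x[feature] <= threshold


def oblique_goes_left(x: Sequence[float], weights: Sequence[float], bias: float) -> bool:
    """Multivariate node test: side of the hyperplane w.x + b = 0."""
    return sum(w * v for w, v in zip(weights, x)) + bias <= 0.0


# Toy data (hypothetical): the two classes are separated by the diagonal x0 = x1.
points = [(0.0, 1.0), (1.0, 2.0), (1.0, 0.0), (2.0, 1.0)]
labels = [0, 0, 1, 1]  # class 0 lies above the diagonal, class 1 below it

# Hand-picked hyperplane x0 - x1 = 0 (an assumption, not a learned MPSVM plane).
oblique = [oblique_goes_left(p, weights=(1.0, -1.0), bias=0.0) for p in points]

# One possible axis-parallel split on feature 0; it cannot isolate class 0.
axis = [axis_parallel_goes_left(p, feature=0, threshold=0.5) for p in points]

print(oblique)  # class 0 goes left, class 1 goes right
print(axis)
```

In this toy set no single threshold on either feature separates the classes, because the value 1.0 occurs in both classes on both axes, whereas one oblique test does.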


Subjects
Algorithms; Support Vector Machine; Principal Component Analysis; Rotation
7.
J Theor Biol ; 270(1): 56-62, 2011 Feb 07.
Article in English | MEDLINE | ID: mdl-21056045

ABSTRACT

Some creatures living in extremely low temperatures can produce special materials called "antifreeze proteins" (AFPs), which can prevent the cell and body fluids from freezing. AFPs are present in vertebrates, invertebrates, plants, bacteria, fungi, etc. Although AFPs have a common function, they show a high degree of diversity in sequences and structures. Therefore, sequence similarity based search methods often fail to predict AFPs from sequence databases. In this work, we report a random forest approach, "AFP-Pred", for the prediction of antifreeze proteins from protein sequence. AFP-Pred was trained on a dataset containing 300 AFPs and 300 non-AFPs and tested on a dataset containing 181 AFPs and 9193 non-AFPs. AFP-Pred achieved 81.33% accuracy from training and 83.38% from testing. The performance of AFP-Pred was compared with BLAST and HMM. The high prediction accuracy and the successful prediction of hypothetical proteins suggest that AFP-Pred can be a useful approach to identify antifreeze proteins from sequence information, irrespective of their sequence similarity.


Subjects
Algorithms; Amino Acid Sequence/genetics; Antifreeze Proteins/analysis; Computational Biology/methods; Proteins/classification; Amino Acids/chemistry; Antifreeze Proteins/genetics; Artificial Intelligence; Chemical Phenomena; Protein Structure, Secondary/genetics; Protein Structure, Tertiary/genetics; Proteins/genetics; ROC Curve
8.
Biochem Biophys Res Commun ; 391(3): 1306-11, 2010 Jan 15.
Article in English | MEDLINE | ID: mdl-19995554

ABSTRACT

Eukaryotic protein secretion generally occurs via the classical secretory pathway that traverses the ER and Golgi apparatus. Secreted proteins usually contain a signal sequence with all the essential information required to target them for secretion. However, some proteins like fibroblast growth factors (FGF-1, FGF-2), interleukins (IL-1 alpha, IL-1 beta), galectins and thioredoxin are exported by an alternative pathway. This is known as leaderless or non-classical secretion and works without a signal sequence. Most computational methods for the identification of secretory proteins use the signal peptide as an indicator and are therefore not able to identify substrates of non-classical secretion. In this work, we report a random forest method, SPRED, to identify secretory proteins from protein sequences irrespective of N-terminal signal peptides, thus also allowing correct classification of non-classical secretory proteins. Training was performed on a dataset containing 600 extracellular proteins and 600 cytoplasmic and/or nuclear proteins. The algorithm was tested on 180 extracellular proteins and 1380 cytoplasmic and/or nuclear proteins. We obtained 85.92% accuracy from training and 82.18% accuracy from testing. Since SPRED does not use N-terminal signals, it can detect non-classical secreted proteins by filtering out those secreted proteins with an N-terminal signal by using SignalP. SPRED predicted 15 out of 19 experimentally verified non-classical secretory proteins. By scanning the entire human proteome we identified 566 protein sequences potentially undergoing non-classical secretion. The dataset and standalone version of the SPRED software are available at http://www.inb.uni-luebeck.de/tools-demos/spred/spred.


Subjects
Artificial Intelligence; Genome, Human; Proteins/metabolism; Proteome; Sequence Analysis, Protein/methods; Animals; Humans; Proteins/chemistry; Proteins/genetics
9.
Bioinformatics ; 25(2): 204-10, 2009 Jan 15.
Article in English | MEDLINE | ID: mdl-19038986

ABSTRACT

MOTIVATION: So far, various bioinformatics and machine learning techniques have been applied for identification of sequence and functionally conserved residues in proteins. Although a few computational methods are available for the prediction of structurally conserved residues from protein structure, almost all methods require homologous structural information and structure-based alignments, which still prove to be a bottleneck in protein structure comparison studies. In this work, we developed a neural network approach for identification of structurally important residues from a single protein structure without using homologous structural information and structural alignment. RESULTS: A neural network ensemble (NNE) method that utilizes a negative correlation learning (NCL) approach was developed for identification of structurally conserved residues (SCRs) in proteins using features that represent amino acid conservation and composition, physico-chemical properties and structural properties. The NCL-NNE method was applied to 6042 SCRs that have been extracted from 496 protein domains. This method obtained high prediction sensitivity (92.8%) and quality (Matthews correlation coefficient of 0.852) in identification of SCRs. Further benchmarking using 60 protein domains containing 1657 SCRs that were not part of the training and testing datasets shows that the NCL-NNE can correctly predict SCRs with approximately 90% sensitivity. These results suggest the usefulness of NCL-NNE for facilitating the identification of SCRs utilizing information derived from a single protein structure. Therefore, this method could be extremely effective in large-scale benchmarking studies where reliable structural homologs and alignments are limited.


Subjects
Neural Networks, Computer; Proteins/chemistry; Computational Biology/methods; Conserved Sequence; Databases, Protein; Protein Conformation; Protein Structure, Tertiary; Proteins/genetics
10.
Amino Acids ; 39(5): 1385-91, 2010 Nov.
Article in English | MEDLINE | ID: mdl-20411285

ABSTRACT

Real-world datasets commonly have issues with data imbalance. There are several approaches, such as weighting, sub-sampling, and data modeling, for handling these data. Learning in the presence of data imbalances presents a great challenge to machine learning. Techniques such as support-vector machines have excellent performance for balanced data, but may fail when applied to imbalanced datasets. In this paper, we propose a new undersampling technique for selecting instances from the majority class. The performance of this approach was evaluated on several real, imbalanced biological datasets. The ratios of negative to positive samples vary from ~9:1 to ~100:1. Useful classifiers have high sensitivity and specificity. Our results demonstrate that the proposed selection technique improves the sensitivity compared to a weighted support-vector machine and to available results in the literature for the same datasets.
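
As a point of reference for the idea of undersampling the majority class, here is a minimal sketch of plain random undersampling. It is a generic baseline and not the instance-selection technique proposed in the paper, whose selection rule the abstract does not describe. The function name, the `ratio` parameter and the toy data are assumptions for illustration.

```python
import random
from typing import Hashable, List, Sequence, Tuple


def random_undersample(
    samples: Sequence,
    labels: Sequence[Hashable],
    majority_label: Hashable,
    ratio: float = 1.0,
    seed: int = 0,
) -> Tuple[List, List]:
    """Keep all minority samples and a random subset of the majority class.

    `ratio` is the desired majority:minority size ratio after sampling.
    """
    rng = random.Random(seed)
    minority = [(s, y) for s, y in zip(samples, labels) if y != majority_label]
    majority = [(s, y) for s, y in zip(samples, labels) if y == majority_label]

    # Number of majority samples to keep, capped by what is available.
    keep = min(len(majority), max(1, round(ratio * len(minority))))
    combined = minority + rng.sample(majority, keep)
    rng.shuffle(combined)

    return [s for s, _ in combined], [y for _, y in combined]


# Toy dataset with a 9:1 negative-to-positive ratio (hypothetical values).
samples = list(range(100))
labels = [1] * 10 + [0] * 90

xs, ys = random_undersample(samples, labels, majority_label=0)
print(len(xs), ys.count(1), ys.count(0))
```

Random selection discards majority samples blindly; an informed selection rule such as the one the paper proposes would replace the `rng.sample` step.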


Subjects
Algorithms; Amino Acids/chemistry; Catalytic Domain; Chemistry, Physical; Databases, Factual; Molecular Structure; Molecular Weight
11.
Amino Acids ; 39(3): 777-83, 2010 Aug.
Article in English | MEDLINE | ID: mdl-20186553

ABSTRACT

Lipocalins are functionally diverse proteins that are composed of 120-180 amino acid residues. Members of this family have several important biological functions including ligand transport, cryptic coloration, sensory transduction, endonuclease activity, stress response activity in plants, odorant binding, prostaglandin biosynthesis, cellular homeostasis regulation, immunity, immunotherapy and so on. Identification of lipocalins from protein sequence is more challenging due to the poor sequence identity, which often falls below the twilight zone. So far, no specific method has been reported to identify lipocalins from primary sequence. In this paper, we report a support vector machine (SVM) approach to predict lipocalins from protein sequence using sequence-derived properties. LipoPred was trained using a dataset consisting of 325 lipocalin proteins and 325 non-lipocalin proteins, and evaluated on an independent set of 140 lipocalin proteins and 21,447 non-lipocalin proteins. LipoPred achieved 88.61% accuracy with 89.26% sensitivity, 85.27% specificity and a Matthews correlation coefficient (MCC) of 0.74. When applied to the test dataset, LipoPred achieved 84.25% accuracy with 88.57% sensitivity, 84.22% specificity and an MCC of 0.16. LipoPred achieved better performance when compared with the PSI-BLAST, HMM and SVM-Prot methods. Out of 218 lipocalins, LipoPred correctly predicted 194 proteins, including 39 lipocalins that are non-homologous to any protein in the SWISSPROT database. This result shows that LipoPred is potentially useful for predicting lipocalin proteins that have no sequence homologs in the sequence databases. Further, the successful prediction of nine hypothetical lipocalin proteins and five new members of the lipocalin family shows that LipoPred can be efficiently used to identify and annotate new lipocalin proteins from sequence databases. The LipoPred software and dataset are available at http://www3.ntu.edu.sg/home/EPNSugan/index_files/lipopred.htm.
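
The gap between the test-set accuracy (84.25%) and the test-set MCC (0.16) follows from the class imbalance of the independent set (140 positives against 21,447 negatives). The sketch below recomputes the metrics from a confusion matrix. The four counts are not given in the abstract; they are an assumption, back-calculated by rounding the reported sensitivity and specificity to whole proteins.

```python
import math
from typing import Dict


def binary_metrics(tp: int, fn: int, tn: int, fp: int) -> Dict[str, float]:
    """Accuracy, sensitivity, specificity, precision and MCC from a confusion matrix."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        # MCC is defined as 0 when any marginal total is zero.
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


# Assumed counts: 88.57% of 140 lipocalins and 84.22% of 21,447 non-lipocalins.
tp, fn = 124, 16
tn, fp = 18063, 3384

m = binary_metrics(tp, fn, tn, fp)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

Under these assumed counts, only 124 of the 3,508 proteins predicted as lipocalins are true positives, so precision is low even though sensitivity and specificity are both high. MCC takes all four cells of the confusion matrix into account, which is why it falls so far below the accuracy.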


Subjects
Lipocalins/chemistry; Sequence Alignment/methods; Databases, Protein; Humans; Protein Structure, Tertiary; Sequence Alignment/instrumentation; Sequence Homology, Amino Acid
12.
Nucleic Acids Res ; 36(Database issue): D218-21, 2008 Jan.
Article in English | MEDLINE | ID: mdl-17933773

ABSTRACT

Structural motifs are important for the integrity of a protein fold and can be employed to design and rationalize protein engineering and folding experiments. Such conserved segments represent the conserved core of a family or superfamily and can be crucial for the recognition of potential new members in sequence and structure databases. We present a database, MegaMotifBase, that compiles a set of important structural segments or motifs for protein structures. Motifs are recognized on the basis of both sequence conservation and preservation of important structural features such as amino acid preference, solvent accessibility, secondary structural content, hydrogen-bonding pattern and residue packing. This database provides 3D orientation patterns of the identified motifs in terms of inter-motif distances and torsion angles. Important applications of structural motifs are also provided in several crucial areas such as similar sequence and structure search, multiple sequence alignment and homology modeling. MegaMotifBase can be a useful resource to gain knowledge about structure and functional relationship of proteins. The database can be accessed from the URL http://caps.ncbs.res.in/MegaMotifbase/index.html.


Subjects
Amino Acid Motifs; Databases, Protein; Proteins/classification; Amino Acid Sequence; Conserved Sequence; Internet; Proteins/chemistry; Sequence Analysis, Protein; Structural Homology, Protein
13.
Biochem Biophys Res Commun ; 384(2): 155-9, 2009 Jun 26.
Article in English | MEDLINE | ID: mdl-19394310

ABSTRACT

Identification of functionally important sites (FIS) in proteins is a critical problem and can have profound importance where protein structural information is limited. Machine learning techniques have been very useful in successful classification of many important biological problems. In this paper, we adopt the sparse kernel least squares classifiers (SKLSC) approach for classification and/or prediction of FIS using protein sequence derived features. The SKLSC algorithm was applied to 5435 FIS that have been extracted from 312 reliable alignments for a wide range of protein families. We obtained 68.28% sensitivity and 68.66% specificity for the training dataset and 65.34% sensitivity and 66.88% specificity for the testing dataset. Further, a large-scale benchmarking study using alignments of 101 protein families containing 1899 FIS showed that our method achieved an average sensitivity of approximately 70% in predicting different types of FIS, such as active sites and metal, ligand or protein binding sites. Our findings also indicate that active sites and metal binding sites are comparatively easier to predict than ligand and protein binding sites. Despite moderate success, our results suggest the usefulness and potential of the SKLSC approach in prediction of FIS using only protein sequence derived information.


Subjects
Binding Sites; Proteins/chemistry; Sequence Analysis, Protein/methods; Amino Acid Sequence; Catalytic Domain; Least-Squares Analysis; Proteins/classification
14.
J Theor Biol ; 259(3): 533-40, 2009 Aug 07.
Article in English | MEDLINE | ID: mdl-19406131

ABSTRACT

We investigate the multiclass classification of cancer microarray samples. In contrast to classification of two cancer types from gene expression data, multiclass classification of more than two cancer types is a relatively hard and less studied problem. We used class-wise optimized genes with a corresponding one-versus-all support vector machine (OVA-SVM) classifier to maximize the utilization of selected genes. The final prediction was made by using probability scores from all classifiers. We used three different methods of estimating probability from the decision value. Among the three probability methods, Platt's approach was more consistent, whereas the isotonic approach performed better for datasets with unequal proportions of samples in different classes. A probability-based decision not only gives a true and fair comparison between different one-versus-all (OVA) classifiers but also makes it possible to use them for any post analysis. Several ensemble experiments, an example of post analysis, of the three probability methods were implemented to study their effect in improving the classification accuracy. We observe that the ensemble did help in improving the predictive accuracy on cancer datasets, especially those involving unbalanced samples. A four-fold external stratified cross-validation experiment was performed on the six multiclass cancer datasets to obtain unbiased estimates of prediction accuracies. Analysis of class-wise frequently selected genes on two cancer datasets demonstrated that the approach was able to select important and relevant genes consistent with the literature. This study demonstrates successful implementation of the framework of class-wise feature selection and multiclass classification for prediction of cancer subtypes on six datasets.
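
The point that probability scores give a fairer comparison between one-versus-all classifiers than raw decision values can be shown with a small sketch of Platt's sigmoid mapping. The class names, decision values and sigmoid parameters below are hypothetical. In practice the two parameters of each sigmoid are fitted on held-out decision values, and that fitting step is not shown.

```python
import math
from typing import Dict, Tuple


def platt_probability(decision_value: float, a: float, b: float) -> float:
    """Platt's sigmoid: P(y = 1 | f) = 1 / (1 + exp(a * f + b))."""
    return 1.0 / (1.0 + math.exp(a * decision_value + b))


def predict_class(
    decision_values: Dict[str, float],
    sigmoid_params: Dict[str, Tuple[float, float]],
) -> Tuple[str, Dict[str, float]]:
    """Map each OVA decision value to a probability and pick the most probable class."""
    probs = {
        cls: platt_probability(f, *sigmoid_params[cls])
        for cls, f in decision_values.items()
    }
    return max(probs, key=probs.get), probs


# Hypothetical decision values of three one-versus-all SVMs for one sample.
decision_values = {"type_A": 0.8, "type_B": 1.0, "type_C": -0.5}

# Hypothetical fitted (a, b) pairs; each classifier has its own scale.
sigmoid_params = {"type_A": (-4.0, 0.0), "type_B": (-1.0, 0.0), "type_C": (-2.0, 0.0)}

best, probs = predict_class(decision_values, sigmoid_params)
raw_best = max(decision_values, key=decision_values.get)
print(best, raw_best, probs)
```

Here the raw decision values favour `type_B`, but after calibration `type_A` is the more probable class, because the decision values of the individual classifiers are not on a common scale.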


Subjects
Gene Expression Profiling; Genes, Neoplasm; Neoplasms/classification; Neoplasms/genetics; Oligonucleotide Array Sequence Analysis; Pattern Recognition, Automated/methods; Animals; Computational Biology; Gene Expression; Humans; Probability
15.
J Biomol Struct Dyn ; 26(6): 679-86, 2009 Jun.
Article in English | MEDLINE | ID: mdl-19385697

ABSTRACT

DNA-binding proteins (DNABPs) are important for various cellular processes, such as transcriptional regulation, recombination, replication, repair, and DNA modification. So far, various bioinformatics and machine learning techniques have been applied for identification of DNA-binding proteins from protein structure. Only a few methods are available for the identification of DNA-binding proteins from protein sequence. In this work, we report a random forest method, DNA-Prot, to identify DNA-binding proteins from protein sequence. Training was performed on a dataset containing 146 DNA-binding proteins and 250 non DNA-binding proteins. The algorithm was tested on a dataset containing 92 DNA-binding proteins and 100 non DNA-binding proteins. We obtained 80.31% accuracy from training and 84.37% accuracy from testing. Benchmarking analysis on an independent dataset of 823 DNA-binding proteins and 823 non DNA-binding proteins shows that our approach can distinguish DNA-binding proteins from non DNA-binding proteins with more than 80% accuracy. We also compared our method with the DNAbinder method on the test dataset and two independent datasets. Comparable performance was observed from both methods on the test dataset. On the benchmark dataset containing 823 DNA-binding proteins and 823 non DNA-binding proteins, we obtained significantly better performance from DNA-Prot, with 81.83% accuracy, whereas DNAbinder achieved only 61.42% accuracy using amino acid composition and 63.5% using the PSSM profile. Similarly, DNA-Prot achieved better performance on the benchmark dataset containing 88 DNA-binding proteins and 233 non DNA-binding proteins. This result shows that DNA-Prot can be efficiently used to identify DNA-binding proteins from sequence information. The dataset and standalone version of the DNA-Prot software can be obtained from http://www3.ntu.edu.sg/home/EPNSugan/index_files/dnaprot.htm.


Subjects
Algorithms; DNA-Binding Proteins/analysis; Databases, Protein; Amino Acids/metabolism; DNA-Binding Proteins/chemistry; DNA-Binding Proteins/metabolism; Hydrophobic and Hydrophilic Interactions; Reproducibility of Results
16.
Biochem Biophys Res Commun ; 367(3): 630-4, 2008 Mar 14.
Article in English | MEDLINE | ID: mdl-18206645

ABSTRACT

Identification of catalytic residues can provide valuable insights into protein function. With the increasing number of protein 3D structures having been solved by X-ray crystallography and NMR techniques, it is highly desirable to develop an efficient method to identify their catalytic sites. In this paper, we present an SVM method for the identification of catalytic residues using sequence and structural features. The algorithm was applied to the 2096 catalytic residues derived from Catalytic Site Atlas database. We obtained overall prediction accuracy of 88.6% from 10-fold cross validation and 95.76% from resubstitution test. Testing on the 254 catalytic residues shows our method can correctly predict all 254 residues. This result suggests the usefulness of our approach for facilitating the identification of catalytic residues from protein structures.


Subjects
Algorithms; Catalytic Domain; Computational Biology/methods; Computer Simulation; Databases, Protein; Proteins/chemistry; Predictive Value of Tests; Protein Conformation; Reproducibility of Results
17.
Bioinformatics ; 23(5): 637-8, 2007 Mar 01.
Article in English | MEDLINE | ID: mdl-17237055

ABSTRACT

UNLABELLED: SMotif is a server that identifies important structural segments or motifs for a given protein structure(s) based on conservation of both sequential as well as important structural features such as solvent inaccessibility, secondary structural content, hydrogen bonding pattern and residue packing. This server also provides three-dimensional orientation patterns of the identified motifs in terms of inter-motif distances and torsion angles. These motifs may form the common core and therefore, can also be employed to design and rationalize protein engineering and folding experiments. AVAILABILITY: SMotif server is available via the URL http://caps.ncbs.res.in/SMotif/index.html. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Amino Acid Motifs; Software; Conserved Sequence; Databases, Protein; Hydrogen Bonding; Internet; Protein Folding; Protein Structure, Secondary; Proteins/chemistry; User-Computer Interface
18.
J Theor Biol ; 253(2): 375-80, 2008 Jul 21.
Article in English | MEDLINE | ID: mdl-18423492

ABSTRACT

Determination of protein structural class solely from sequence information is a challenging task. Several attempts to solve this problem using various methods can be found in the literature. We present a support vector machine (SVM) approach in which probability-based decisions are used along with class-wise optimized feature sets. This approach has two characteristics that distinguish it from earlier attempts: (1) it uses class-wise optimized features and (2) the decisions of different SVM classifiers are coupled with probability estimates to make the final prediction. The algorithm was tested on three datasets, containing 498 domains, 1092 domains and 5261 domains. Ten-fold external cross-validation was performed to assess the performance of the algorithm. A significantly high accuracy of 92.89% was obtained for the 498-dataset. We achieved 54.67% accuracy for the dataset with 1092 domains, which is better than the previously reported best accuracy of 53.8%. We obtained 59.43% prediction accuracy for the larger and less redundant 5261-dataset. We also investigated the advantage of using class-wise features over the union of these features (the conventional approach) in a one-vs.-all SVM framework. Our results clearly show the advantage of using class-wise optimized features. A brief analysis of the selected class-wise features indicates their biological significance.


Subjects
Protein Conformation; Sequence Analysis, Protein/methods; Algorithms; Animals; Computational Biology/methods; Databases, Protein; Pattern Recognition, Automated/methods
19.
BMC Bioinformatics ; 8: 351, 2007 Sep 19.
Article in English | MEDLINE | ID: mdl-17880712

ABSTRACT

BACKGROUND: Odorant binding proteins (OBPs) are believed to shuttle odorants from the environment to the underlying odorant receptors, for which they could potentially serve as odorant presenters. Although several sequence based search methods have been exploited for protein family prediction, less effort has been devoted to the prediction of OBPs from sequence data and this area is more challenging due to poor sequence identity between these proteins. RESULTS: In this paper, we propose a new algorithm that uses Regularized Least Squares Classifier (RLSC) in conjunction with multiple physicochemical properties of amino acids to predict odorant-binding proteins. The algorithm was applied to the dataset derived from Pfam and GenDiS database and we obtained overall prediction accuracy of 97.7% (94.5% and 98.4% for positive and negative classes respectively). CONCLUSION: Our study suggests that RLSC is potentially useful for predicting the odorant binding proteins from sequence-derived properties irrespective of sequence similarity. Our method predicts 92.8% of 56 odorant binding proteins non-homologous to any protein in the swissprot database and 97.1% of the 414 independent dataset proteins, suggesting the usefulness of RLSC method for facilitating the prediction of odorant binding proteins from sequence information.


Subjects
Algorithms; Artificial Intelligence; Models, Chemical; Pattern Recognition, Automated/methods; Protein Interaction Mapping/methods; Receptors, Odorant/chemistry; Sequence Analysis, Protein/methods; Amino Acid Sequence; Binding Sites; Computer Simulation; Molecular Sequence Data; Protein Binding
20.
BMC Bioinformatics ; 7: 95, 2006 Feb 27.
Article in English | MEDLINE | ID: mdl-16504159

ABSTRACT

BACKGROUND: In discriminant analysis of microarray data, usually a small number of samples are expressed by a large number of genes. It is not only difficult but also unnecessary to conduct the discriminant analysis with all the genes. Hence, gene selection is usually performed to select important genes. RESULTS: A gene selection method searches for an optimal or near optimal subset of genes with respect to a given evaluation criterion. In this paper, we propose a new evaluation criterion, named the leave-one-out calculation (LOOC) measure. A gene selection method, named the leave-one-out calculation sequential forward selection (LOOCSFS) algorithm, is then presented by combining the LOOC measure with the sequential forward selection scheme. Further, a novel gene selection algorithm, the gradient-based leave-one-out gene selection (GLGS) algorithm, is also proposed. Both of the gene selection algorithms originate from an efficient and exact calculation of the leave-one-out cross-validation error of the least squares support vector machine (LS-SVM). The proposed approaches are applied to two microarray datasets and compared to other well-known gene selection methods using codes available from the second author. CONCLUSION: The proposed gene selection approaches can provide gene subsets leading to more accurate classification results, while their computational complexity is comparable to the existing methods. The GLGS algorithm can also better scale to datasets with a very large number of genes.
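
The sequential forward selection scheme that LOOCSFS builds on can be sketched generically: start from an empty subset and repeatedly add the gene that most reduces an evaluation criterion. The toy criterion below is a hypothetical stand-in for the LOOC measure, which the abstract names but does not define. The function names and numbers are assumptions for illustration.

```python
from typing import Callable, List, Sequence


def sequential_forward_selection(
    n_features: int,
    criterion: Callable[[Sequence[int]], float],
    n_select: int,
) -> List[int]:
    """Greedily add the feature whose inclusion gives the lowest criterion value."""
    selected: List[int] = []
    remaining = list(range(n_features))
    while remaining and len(selected) < n_select:
        # Score every candidate subset formed by adding one more feature.
        best = min(remaining, key=lambda j: criterion(selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical error criterion: each gene lowers the error by its own gain,
# but genes 1 and 3 are redundant, so picking both incurs a penalty.
GAINS = [0.05, 0.30, 0.10, 0.25]


def toy_error(subset: Sequence[int]) -> float:
    error = 1.0 - sum(GAINS[j] for j in subset)
    if 1 in subset and 3 in subset:
        error += 0.2
    return error


selected = sequential_forward_selection(len(GAINS), toy_error, n_select=2)
print(selected)
```

Gene 3 has the second-largest individual gain, yet the greedy search skips it after choosing gene 1 because the criterion is evaluated on whole subsets. Scoring subsets rather than single genes is what lets a wrapper criterion such as a leave-one-out error account for redundancy.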


Subjects
Algorithms; Gene Expression Profiling/methods; Oligonucleotide Array Sequence Analysis/methods; Pattern Recognition, Automated/methods; Artificial Intelligence; Least-Squares Analysis; Reproducibility of Results; Sensitivity and Specificity