Results 1 - 14 of 14
1.
BMC Bioinformatics; 21(1): 137, 2020 Apr 09.
Article in English | MEDLINE | ID: mdl-32272894

ABSTRACT

BACKGROUND: Hinge-bending movements in proteins comprising two or more domains form a large class of functional movements. Hinge-bending regions demarcate protein domains and collectively control the domain movement. Consequently, the ability to recognise sequence features of hinge-bending regions and to predict them from sequence alone would benefit various areas of protein research. For example, an understanding of how the sequence features of these regions relate to dynamic properties in multi-domain proteins would aid in the rational design of linkers in therapeutic fusion proteins. RESULTS: The DynDom database of protein domain movements comprises sequences annotated to indicate whether each amino acid residue is located within a hinge-bending region or within an intradomain region. Using statistical methods and Kernel Logistic Regression (KLR) models, these data were used to determine sequence features that favour or disfavour hinge-bending regions. This is a difficult classification problem, as the number of negative cases (intradomain residues) is much larger than the number of positive cases (hinge residues). The statistical methods and the KLR models both show that cysteine has the lowest propensity for hinge-bending regions and proline the highest, even though proline is the most rigid amino acid. As hinge-bending regions have previously been shown to occur frequently at the terminal regions of secondary structures, the propensity for proline at these regions is likely due to its tendency to break secondary structures. The KLR models also indicate that isoleucine may act as a domain-capping residue. We found that a quadratic KLR model outperforms a linear KLR model and that performance improves up to very long window lengths (eighty residues), indicating long-range correlations.
CONCLUSION: In contrast to the only other approach, which focused solely on interdomain hinge-bending regions, our method provides a modest and statistically significant improvement over a random classifier. An explanation of the KLR results is that, in the prediction of hinge-bending regions, a long-range correlation is at play between a small number of amino acids that either favour or disfavour hinge-bending regions. The resulting sequence-based prediction tool, HingeSeek, is available to run through a webserver at hingeseek.cmp.uea.ac.uk.
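As a rough sketch of the kernel logistic regression models used in this record, the following minimal implementation fits a KLR classifier by gradient descent on the penalised log-loss. It is an illustration only, not the HingeSeek code: the RBF kernel, the synthetic two-class data and all hyperparameter values are assumptions.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=0.5):
    """Gaussian (RBF) kernel matrix between the rows of X and Y."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def train_klr(X, y, gamma=0.5, lam=1e-2, lr=0.05, iters=1000):
    """Fit kernel logistic regression, P(y=1|x) = sigmoid(sum_i a_i k(x_i, x)),
    by gradient descent on the penalised negative log-likelihood."""
    K = rbf_kernel(X, X, gamma)
    a = np.zeros(len(X))
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(K @ a)))
        # gradient of mean log-loss plus (lam/2) * a' K a
        grad = K @ (p - y) / len(X) + lam * (K @ a)
        a -= lr * grad
    return a

def predict_klr(a, X_train, X_new, gamma=0.5):
    """Class-1 probabilities for new inputs."""
    return 1.0 / (1.0 + np.exp(-(rbf_kernel(X_new, X_train, gamma) @ a)))
```

In the paper's setting the inputs would be windowed sequence features and the classes would be heavily imbalanced; here the data are balanced purely for brevity.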


Subjects
Proteins/chemistry, Area Under the Curve, Protein Databases, Logistic Models, Protein Domains, Protein Secondary Structure, Proteins/metabolism, ROC Curve, User-Computer Interface
2.
Bioinformatics; 30(22): 3189-96, 2014 Nov 15.
Article in English | MEDLINE | ID: mdl-25078396

ABSTRACT

MOTIVATION: A popular method for the classification of protein domain movements apportions them into two main types: those with a 'hinge' mechanism and those with a 'shear' mechanism. The intuitive assignment of domain movements to these classes has limited the number of domain movements that can be classified in this way. Furthermore, whether intended or not, the term 'shear' is often interpreted to mean a relative translation of the domains. RESULTS: The numbers of occurrences of four different types of residue contact changes between domains were optimally combined by logistic regression, using the training set of domain movements intuitively classified as hinge or shear, to produce a predictor for the two mechanisms. Applying this predictor gave, with a high degree of precision, a 10-fold increase in the number of classified examples over the number previously available. It is shown that a relative translation of domains is generally rare, and that there is no difference between hinge and shear mechanisms in this respect. However, the shear set contains significantly more examples of domains having a relative twisting movement than the hinge set. The angle of rotation is also shown to be a good discriminator between the two mechanisms. AVAILABILITY AND IMPLEMENTATION: Results are free to browse at http://www.cmp.uea.ac.uk/dyndom/interface/. CONTACT: sjh@cmp.uea.ac.uk. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
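The predictor described in this abstract is, at heart, a logistic combination of four contact-change counts. The sketch below fits such a model by plain gradient descent; the synthetic counts, class structure and all parameter values are made up for illustration and are not the paper's fitted model.

```python
import numpy as np

def fit_logistic(X, y, lr=0.05, iters=3000):
    """Plain logistic regression by gradient descent. Each row of X holds
    the counts of four contact-change types for one domain movement;
    y = 1 for 'shear', 0 for 'hinge' (encoding assumed here)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * (X.T @ (p - y)) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

def predict_shear(w, b, counts):
    """Probability that a movement uses a 'shear' mechanism."""
    return 1.0 / (1.0 + np.exp(-(counts @ w + b)))
```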


Subjects
Protein Tertiary Structure, Logistic Models, Motion (Physics), Rotation, Software
3.
Neural Netw; 21(2-3): 544-50, 2008.
Article in English | MEDLINE | ID: mdl-18262752

ABSTRACT

We organized a challenge for IJCNN 2007 to assess the added value of prior domain knowledge in machine learning. Most commercial data mining programs accept data pre-formatted in the form of a table, with each example being encoded as a linear feature vector. Is it worth spending time incorporating domain knowledge in feature construction or algorithm design, or can off-the-shelf programs working directly on simple low-level features do better than skilled data analysts? To answer these questions, we formatted five datasets using two data representations. The participants in the "prior knowledge" track used the raw data, with full knowledge of the meaning of the data representation. Conversely, the participants in the "agnostic learning" track used a pre-formatted data table, with no knowledge of the identity of the features. The results indicate that black-box methods using relatively unsophisticated features work quite well and rapidly approach the best attainable performance. The winners on the prior knowledge track used feature extraction strategies yielding a large number of low-level features. Incorporating prior knowledge in the form of generic coding/smoothing methods to exploit regularities in data is beneficial, but incorporating actual domain knowledge in feature construction is very time consuming and seldom leads to significant improvements. The AL vs. PK challenge web site remains open for post-challenge submissions: http://www.agnostic.inf.ethz.ch/.


Subjects
Artificial Intelligence, Knowledge, Learning/physiology, Computational Biology, Humans, Information Storage and Retrieval, Natural Language Processing, Automated Pattern Recognition, ROC Curve
4.
Bioinformatics; 22(19): 2348-55, 2006 Oct 01.
Article in English | MEDLINE | ID: mdl-16844704

ABSTRACT

MOTIVATION: Gene selection algorithms for cancer classification, based on the expression of a small number of biomarker genes, have been the subject of considerable research in recent years. Shevade and Keerthi propose a gene selection algorithm based on sparse logistic regression (SLogReg), incorporating a Laplace prior to promote sparsity in the model parameters, and provide a simple but efficient training procedure. The degree of sparsity obtained is determined by the value of a regularization parameter, which must be carefully tuned in order to optimize performance. This normally involves a model selection stage, based on a computationally intensive search for the minimizer of the cross-validation error. In this paper, we demonstrate that a simple Bayesian approach can be taken to eliminate this regularization parameter entirely, by integrating it out analytically using an uninformative Jeffreys prior. The improved algorithm (BLogReg) is then typically two or three orders of magnitude faster than the original algorithm, as there is no longer a need for a model selection step. The BLogReg algorithm is also free from selection bias in performance estimation, a common pitfall in the application of machine learning algorithms in cancer classification. RESULTS: The SLogReg, BLogReg and Relevance Vector Machine (RVM) gene selection algorithms are evaluated over the well-studied colon cancer and leukaemia benchmark datasets. The leave-one-out estimates of the probability of test error and cross-entropy of the BLogReg and SLogReg algorithms are very similar; however, the BLogReg algorithm is found to be considerably faster than the original SLogReg algorithm. Using nested cross-validation to avoid selection bias, performance estimation for SLogReg on the leukaemia dataset takes almost 48 h, whereas the corresponding result for BLogReg is obtained in only 1 min 24 s, making BLogReg by far the more practical algorithm.
BLogReg also demonstrates better estimates of conditional probability than the RVM, which are of great importance in medical applications, with similar computational expense. AVAILABILITY: A MATLAB implementation of the sparse logistic regression algorithm with Bayesian regularization (BLogReg) is available from http://theoval.cmp.uea.ac.uk/~gcc/cbl/blogreg/
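The key analytical step in this record, integrating out the regularisation parameter, can be checked numerically. Assuming the standard Laplace prior p(w|lam) = (lam/2)^d exp(-lam * sum|w_i|) and a Jeffreys hyperprior p(lam) proportional to 1/lam (forms consistent with the abstract, though not taken verbatim from the paper), the hyperparameter marginalises out to leave an effective penalty of d * log(sum|w_i|):

```python
import numpy as np
from math import lgamma, log

def neg_log_marginal_prior_numeric(w):
    """-log of integral over lam of p(w|lam) p(lam), with the Laplace prior
    p(w|lam) = (lam/2)^d exp(-lam * sum|w_i|) and Jeffreys hyperprior
    p(lam) ~ 1/lam, evaluated by brute-force trapezoidal quadrature."""
    d, s = len(w), float(np.abs(w).sum())
    lam = np.linspace(1e-8, 200.0, 400001)
    f = lam ** (d - 1) * np.exp(-lam * s)
    integral = np.sum((f[1:] + f[:-1]) / 2) * (lam[1] - lam[0])
    return -log(integral) + d * log(2)

def neg_log_marginal_prior_closed(w):
    """Analytic answer: the integral is Gamma(d) / s^d with s = sum|w_i|,
    so the effective penalty is d*log(s) (up to additive constants)."""
    d, s = len(w), float(np.abs(w).sum())
    return d * log(s) - lgamma(d) + d * log(2)
```

Agreement between the two functions is exactly the "no regularisation parameter left to tune" property that removes the model selection stage.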


Subjects
Tumor Biomarkers/analysis, Computer-Assisted Diagnosis/methods, Gene Expression Profiling/methods, Neoplasm Proteins/analysis, Neoplasms/diagnosis, Neoplasms/metabolism, Oligonucleotide Array Sequence Analysis/methods, Algorithms, Bayes Theorem, Humans, Logistic Models, Biological Models, Neoplasms/classification, Regression Analysis, Reproducibility of Results, Sensitivity and Specificity
5.
Neural Netw; 20(7): 832-41, 2007 Sep.
Article in English | MEDLINE | ID: mdl-17600674

ABSTRACT

Mika, Rätsch, Weston, Schölkopf and Müller [Mika, S., Rätsch, G., Weston, J., Schölkopf, B., & Müller, K.-R. (1999). Fisher discriminant analysis with kernels. In Neural networks for signal processing: Vol. IX (pp. 41-48). New York: IEEE Press] introduce a non-linear formulation of Fisher's linear discriminant, based on the now familiar "kernel trick", demonstrating state-of-the-art performance on a wide range of real-world benchmark datasets. In this paper, we extend an existing analytical expression for the leave-one-out cross-validation error [Cawley, G. C., & Talbot, N. L. C. (2003b). Efficient leave-one-out cross-validation of kernel Fisher discriminant classifiers. Pattern Recognition, 36(11), 2585-2592] such that the leave-one-out error can be re-estimated following a change in the value of the regularisation parameter with a computational complexity of only O(l²) operations, which is substantially less than the O(l³) operations required for the basic training algorithm. This allows the regularisation parameter to be tuned at an essentially negligible computational cost. This is achieved by performing the discriminant analysis in canonical form. The proposed method is therefore a useful component of a model selection strategy for this class of kernel machines that alternates between updates of the kernel and regularisation parameters. Results obtained on real-world and synthetic benchmark datasets indicate that the proposed method is competitive with model selection based on k-fold cross-validation in terms of generalisation, whilst being considerably faster.


Subjects
Algorithms, Artificial Intelligence, Linear Models, Neural Networks (Computer), Nonlinear Dynamics, Computer Simulation, Computing Methodologies, Discriminant Analysis
6.
Neural Netw; 20(4): 537-49, 2007 May.
Article in English | MEDLINE | ID: mdl-17531441

ABSTRACT

Artificial neural networks have proved an attractive approach to non-linear regression problems arising in environmental modelling, such as statistical downscaling, short-term forecasting of atmospheric pollutant concentrations and rainfall run-off modelling. However, environmental datasets are frequently very noisy and characterized by a noise process that may be heteroscedastic (having input dependent variance) and/or non-Gaussian. The aim of this paper is to review existing methodologies for estimating predictive uncertainty in such situations and, more importantly, to illustrate how a model of the predictive distribution may be exploited in assessing the possible impacts of climate change and to improve current decision making processes. The results of the WCCI-2006 predictive uncertainty in environmental modelling challenge are also reviewed, suggesting a number of areas where further research may provide significant benefits.


Subjects
Computer Simulation, Environment, Neural Networks (Computer), Uncertainty, Databases as Topic/statistics & numerical data, Decision Making, Statistical Models, Nonlinear Dynamics, Predictive Value of Tests
7.
IEEE Trans Neural Netw; 18(3): 935-7, 2007 May.
Article in English | MEDLINE | ID: mdl-17526361

ABSTRACT

J.-H. Chen and C.-S. Chen have recently proposed a nonlinear variant of Keller and Hunt's fuzzy perceptron algorithm, based on the now familiar "kernel trick." In this letter, we demonstrate experimentally that J.-H. Chen and C.-S. Chen's assertion that the fuzzy kernel perceptron (FKP) outperforms the support vector machine (SVM) cannot be sustained. A more thorough model comparison exercise, based on a much wider range of benchmark data sets, shows that the FKP algorithm is not competitive with the SVM.


Subjects
Algorithms, Artificial Intelligence, Decision Support Techniques, Fuzzy Logic, Information Storage and Retrieval/methods, Theoretical Models, Automated Pattern Recognition/methods, Computer Simulation, Neural Networks (Computer)
8.
IEEE Trans Neural Netw; 17(2): 471-81, 2006 Mar.
Article in English | MEDLINE | ID: mdl-16566473

ABSTRACT

Survival analysis is a branch of statistics concerned with the time elapsing before "failure," with diverse applications in medical statistics and the analysis of the reliability of electrical or mechanical components. We introduce a parametric accelerated life survival analysis model based on kernel learning methods that, at least in principle, is able to learn arbitrary dependencies between a vector of explanatory variables and the scale of the distribution of survival times. The proposed kernel survival analysis method is then used to model the growth domain of Clostridium botulinum, the food processing and storage conditions permitting the growth of this foodborne microbial pathogen, leading to the production of the neurotoxin responsible for botulism. A Bayesian training procedure, based on the evidence framework, is used for model selection and to provide a credible interval on model predictions. The kernel survival analysis models are found to be more accurate than models based on more traditional survival analysis techniques but also suggest a risk assessment of the foodborne botulism hazard would benefit from the collection of additional data.
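For concreteness, a parametric accelerated-life likelihood with right-censoring can be written in a few lines. The sketch below uses a Weibull AFT model, a standard textbook choice and an assumption here; the kernel method in this record would replace the linear predictor X @ beta with a learned nonlinear function of the explanatory variables.

```python
import numpy as np

def weibull_aft_nll(beta, sigma, X, t, observed):
    """Negative log-likelihood of a Weibull accelerated-life model:
    log T = X @ beta + sigma * W, with W a standard Gumbel (minimum)
    variable. Right-censored observations (observed == False) contribute
    the log survival function instead of the log density."""
    z = (np.log(t) - X @ beta) / sigma
    log_pdf = z - np.exp(z) - np.log(sigma * t)  # density of T at t
    log_sf = -np.exp(z)                          # log P(T > t)
    return -float(np.sum(np.where(observed, log_pdf, log_sf)))
```

Minimising this over beta and sigma fits the model; the covariates rescale ("accelerate") the survival-time distribution, which is exactly the structure the abstract describes.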


Subjects
Artificial Intelligence, Clostridium botulinum/cytology, Clostridium botulinum/growth & development, Food Microbiology, Biological Models, Survival Analysis, Bayes Theorem, Cell Proliferation, Cell Survival/physiology, Computer Simulation, Statistical Data Interpretation, Statistical Models, Population Growth, Survival Rate
9.
Neural Netw; 18(5-6): 674-83, 2005.
Article in English | MEDLINE | ID: mdl-16085387

ABSTRACT

We present here a simple technique that simplifies the construction of Bayesian treatments of a variety of sparse kernel learning algorithms. An incomplete Cholesky factorisation is employed to modify the dual parameter space, such that the Gaussian prior over the dual model parameters is whitened. The regularisation term then corresponds to the usual weight-decay regulariser, allowing the Bayesian analysis to proceed via the evidence framework of MacKay. There is, in addition, a useful by-product of the incomplete Cholesky factorisation algorithm: it also identifies a subset of the training data forming an approximate basis for the entire dataset in the kernel-induced feature space, resulting in a sparse model. Bayesian treatments of the kernel ridge regression (KRR) algorithm, with both constant and heteroscedastic (input dependent) variance structures, and kernel logistic regression (KLR) are provided as illustrative examples of the proposed method, which we hope will be more widely applicable.
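The incomplete Cholesky factorisation and its basis-selection by-product are easy to sketch. The following is a standard pivoted incomplete Cholesky (an illustration, not necessarily the authors' exact variant): the pivot order picks out training points whose feature-space images approximately span the rest.

```python
import numpy as np

def incomplete_cholesky(K, tol=1e-6, max_rank=None):
    """Pivoted incomplete Cholesky of a PSD kernel matrix, K ~ G @ G.T.
    The pivot indices identify training points forming an approximate
    basis for the dataset in feature space (the by-product noted above)."""
    n = K.shape[0]
    max_rank = max_rank or n
    d = np.diag(K).astype(float).copy()   # residual diagonal
    G = np.zeros((n, max_rank))
    pivots = []
    for j in range(max_rank):
        i = int(np.argmax(d))             # greedy pivot: largest residual
        if d[i] <= tol:
            break                         # remaining residual negligible
        pivots.append(i)
        G[:, j] = (K[:, i] - G[:, :j] @ G[i, :j]) / np.sqrt(d[i])
        d -= G[:, j] ** 2
    return G[:, :len(pivots)], pivots
```

Working in the coordinates defined by G whitens the Gaussian prior over the dual parameters, which is what lets the weight-decay form of the evidence framework apply directly.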


Subjects
Artificial Intelligence, Bayes Theorem, Algorithms, Statistical Data Interpretation, Linear Models, Logistic Models, Statistical Models, Normal Distribution
10.
Neural Netw; 17(10): 1467-75, 2004 Dec.
Article in English | MEDLINE | ID: mdl-15541948

ABSTRACT

Leave-one-out cross-validation has been shown to give an almost unbiased estimator of the generalisation properties of statistical models, and therefore provides a sensible criterion for model selection and comparison. In this paper we show that exact leave-one-out cross-validation of sparse Least-Squares Support Vector Machines (LS-SVMs) can be implemented with a computational complexity of only O(ln²) floating point operations, rather than the O(l²n²) operations of a naïve implementation, where l is the number of training patterns and n is the number of basis vectors. As a result, leave-one-out cross-validation becomes a practical proposition for model selection in large scale applications. For clarity the exposition concentrates on sparse least-squares support vector machines in the context of non-linear regression, but is equally applicable in a pattern recognition setting.
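The exact leave-one-out computation rests on an identity for the LS-SVM linear system: with A the bordered system matrix and alpha the dual coefficients, the leave-one-out residual for sample i is alpha_i / (A^{-1})_ii. Below is a minimal dense sketch of that identity (the record's contribution is making the computation efficient for sparse models, which this sketch does not attempt).

```python
import numpy as np

def lssvm_loo_residuals(K, y, gamma):
    """Exact leave-one-out residuals of an LS-SVM via the identity
    y_i - f_{-i}(x_i) = alpha_i / (A^{-1})_ii, where A is the LS-SVM
    system matrix [[0, 1'], [1, K + I/gamma]]. One matrix inverse
    replaces n separate refits."""
    n = len(y)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = K + np.eye(n) / gamma
    Ainv = np.linalg.inv(A)
    z = Ainv @ np.concatenate(([0.0], y))   # z = (bias, alpha)
    alpha = z[1:]
    return alpha / np.diag(Ainv)[1:]
```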


Subjects
Algorithms, Least-Squares Analysis, Statistical Models, Neural Networks (Computer), Mathematics, Nonlinear Dynamics, Automated Pattern Recognition/methods
11.
Neural Netw; 53: 69-80, 2014 May.
Article in English | MEDLINE | ID: mdl-24561452

ABSTRACT

Kernel learning methods, whether Bayesian or frequentist, typically involve multiple levels of inference, with the coefficients of the kernel expansion being determined at the first level and the kernel and regularisation parameters carefully tuned at the second level, a process known as model selection. Model selection for kernel machines is commonly performed via optimisation of a suitable model selection criterion, often based on cross-validation or theoretical performance bounds. However, if there are a large number of kernel parameters, as for instance in the case of automatic relevance determination (ARD), there is a substantial risk of over-fitting the model selection criterion, resulting in poor generalisation performance. In this paper we investigate the possibility of learning the kernel, for the Least-Squares Support Vector Machine (LS-SVM) classifier, at the first level of inference, i.e. parameter optimisation. The kernel parameters and the coefficients of the kernel expansion are jointly optimised at the first level of inference, minimising a training criterion with an additional regularisation term acting on the kernel parameters. The key advantage of this approach is that the values of only two regularisation parameters need be determined in model selection, substantially alleviating the problem of over-fitting the model selection criterion. The benefits of this approach are demonstrated using a suite of synthetic and real-world binary classification benchmark problems, where kernel learning at the first level of inference is shown to be statistically superior to the conventional approach, improves on our previous work (Cawley and Talbot, 2007) and is competitive with Multiple Kernel Learning approaches, but with reduced computational expense.


Subjects
Support Vector Machine, Bayes Theorem, Least-Squares Analysis, Theoretical Models
12.
PLoS One; 8(11): e81224, 2013.
Article in English | MEDLINE | ID: mdl-24260562

ABSTRACT

A new method for the classification of domain movements in proteins is described and applied to 1822 pairs of structures from the Protein Data Bank that represent a domain movement in two-domain proteins. The method is based on changes in contacts between residues from the two domains in moving from one conformation to the other. We argue that there are five types of elemental contact changes and that these relate to five model domain movements called: "free", "open-closed", "anchored", "sliding-twist", and "see-saw." A directed graph is introduced called the "Dynamic Contact Graph" which represents the contact changes in a domain movement. In many cases a graph, or part of a graph, provides a clear visual metaphor for the movement it represents and is a motif that can be easily recognised. The Dynamic Contact Graphs often comprise disconnected subgraphs, indicating independent regions which may play different roles in the domain movement. The Dynamic Contact Graph for each domain movement is decomposed into elemental Dynamic Contact Graphs, those that represent elemental contact changes, allowing us to count the number of instances of each type of elemental contact change in the domain movement. This naturally leads to sixteen classes into which the 1822 domain movements are classified.
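The contact-change bookkeeping underlying this method can be sketched with plain set operations. This is a simplified illustration, not the paper's five elemental types or its graph decomposition; the residue-pair encoding is assumed.

```python
def contact_changes(contacts_a, contacts_b):
    """Compare interdomain residue contacts between two conformations.
    Each contact is a (domain1_residue, domain2_residue) pair; returns the
    contacts formed, lost and maintained in moving from conformation A to
    conformation B, the raw material from which a contact-change graph is
    built."""
    formed = contacts_b - contacts_a
    lost = contacts_a - contacts_b
    kept = contacts_a & contacts_b
    return formed, lost, kept

def dynamic_contact_edges(contacts_a, contacts_b):
    """Label every interdomain contact seen in either conformation,
    giving a simple edge list from which graph motifs could be read off."""
    formed, lost, kept = contact_changes(contacts_a, contacts_b)
    labels = {}
    for group, tag in ((formed, 'formed'), (lost, 'lost'), (kept, 'kept')):
        for edge in group:
            labels[edge] = tag
    return labels
```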


Subjects
Algorithms, Molecular Dynamics Simulation, Proteins/chemistry, Protein Databases, Protein Interaction Domains and Motifs, Thermodynamics
13.
Genome Res; 16(3): 414-27, 2006 Mar.
Article in English | MEDLINE | ID: mdl-16424108

ABSTRACT

Establishing transcriptional regulatory networks by analysis of gene expression data and promoter sequences shows great promise. We developed a novel promoter classification method using a Relevance Vector Machine (RVM) and Bayesian statistical principles to identify discriminatory features in the promoter sequences of genes that can correctly classify transcriptional responses. The method was applied to microarray data obtained from Arabidopsis seedlings treated with glucose or abscisic acid (ABA). Of those genes showing >2.5-fold changes in expression level, approximately 70% were correctly predicted as being up- or down-regulated (under 10-fold cross-validation), based on the presence or absence of a small set of discriminative promoter motifs. Many of these motifs have known regulatory functions in sugar- and ABA-mediated gene expression. One promoter motif that was not known to be involved in glucose-responsive gene expression was identified as the strongest classifier of glucose-up-regulated gene expression. We show it confers glucose-responsive gene expression in conjunction with another promoter motif, thus validating the classification method. We were able to establish a detailed model of glucose and ABA transcriptional regulatory networks and their interactions, which will help us to understand the mechanisms linking metabolism with growth in Arabidopsis. This study shows that machine learning strategies coupled to Bayesian statistical methods hold significant promise for identifying functionally significant promoter sequences.


Subjects
Abscisic Acid/metabolism, Arabidopsis/genetics, Glucose/metabolism, Microarray Analysis/methods, Genetic Promoter Regions, Genetic Transcription, Abscisic Acid/genetics, Algorithms, Base Sequence, Computational Biology/methods, Glucose/genetics, Molecular Sequence Data, Seedlings/metabolism