Search | BVS Enfermería Search Portal

Prediction using step-wise L1, L2 regularization and feature selection for small data sets with large number of features.

Demir-Kavuk, Ozgur; Kamada, Mayumi; Akutsu, Tatsuya; Knapp, Ernst-Walter.

BMC Bioinformatics; 12: 412, 2011 Oct 25.

Article in English | MEDLINE | ID: mdl-22026913

ABSTRACT

BACKGROUND: Machine learning methods are now used for many biological prediction problems involving drugs, ligands or polypeptide segments of a protein. Building a prediction model requires a so-called training data set of molecules with measured target properties. For many such problems the size of the training data set is limited, because measurements have to be performed in a wet lab. Furthermore, the problems considered are often complex, so it is not clear which molecular descriptors (features) are suitable for establishing a strong correlation with the target property. In many applications all available descriptors are used. This can lead to difficult machine learning problems when thousands of descriptors are considered and only a few (e.g. fewer than one hundred) molecules are available for training.

RESULTS: The CoEPrA contest provides four data sets that are typical of biological regression problems (few molecules in the training data set and thousands of descriptors). We applied the same two-step training procedure to all four regression tasks. In the first stage, we used optimized L1 regularization to select the most relevant features. This reduced the initial set of more than 6,000 features to about 50. In the second stage, we used only the features selected in the first stage and applied a milder L2 regularization, which generally yielded a further improvement in prediction performance. Our linear model employed a soft loss function that minimizes the influence of outliers.

CONCLUSIONS: The proposed two-step method showed good results on all four CoEPrA regression tasks. Thus, it may be useful for many other biological prediction problems in which only a small number of molecules, each described by thousands of descriptors, is available for training.
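The two-step procedure described in this abstract (L1 regularization to select features, then a milder L2 regularization on the selected features) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a plain squared loss instead of the soft loss mentioned in the abstract, uses coordinate descent as the solver, and the function names and the parameters `l1` and `l2` are hypothetical.

```python
def soft_threshold(x, t):
    """Shrink x toward zero by t (proximal operator of the L1 penalty)."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def coordinate_descent(X, y, l1=0.0, l2=0.0, n_iter=200):
    """Minimize 0.5*sum((y - Xw)^2) + l1*sum(|w|) + 0.5*l2*sum(w^2).

    Assumption: squared loss, not the soft loss used in the paper.
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw, and w starts at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # an all-zero column carries no information
            # Correlation of feature j with the residual that excludes feature j.
            rho = sum(X[i][j] * residual[i] for i in range(n)) + col_sq[j] * w[j]
            new_wj = soft_threshold(rho, l1) / (col_sq[j] + l2)
            delta = new_wj - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= delta * X[i][j]
                w[j] = new_wj
    return w


def two_step_fit(X, y, l1, l2):
    """Step 1: L1 fit to select features. Step 2: L2 refit on the selection."""
    w_l1 = coordinate_descent(X, y, l1=l1)
    selected = [j for j, wj in enumerate(w_l1) if wj != 0.0]
    X_sel = [[row[j] for j in selected] for row in X]
    w_l2 = coordinate_descent(X_sel, y, l2=l2)
    return selected, w_l2


# Toy data (hypothetical): two orthogonal features, only the first is relevant.
X = [[1.0, 0.0], [0.0, 1.0]]
y = [3.0, 0.5]
print(two_step_fit(X, y, l1=1.0, l2=1.0))  # prints ([0], [1.5])
```

In practice both regularization strengths would be chosen by cross-validation on the training set, which is one plausible reading of the "optimized" L1 regularization mentioned in the abstract.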

Subject(s)

Artificial Intelligence, Computational Biology/methods, Animals, Genetic Databases, Humans, Internet, Principal Component Analysis, Regression Analysis

Exploring classification strategies with the CoEPrA 2006 contest.

Demir-Kavuk, Ozgur; Riedesel, Henning; Knapp, Ernst-Walter.

Bioinformatics; 26(5): 603-9, 2010 Mar 01.

Article in English | MEDLINE | ID: mdl-20097914

ABSTRACT

MOTIVATION: In silico methods that classify compounds as potential drugs binding to a specific target are becoming increasingly important for drug design. To build classifiers, training sets of drugs with known activities are needed. For many such classification problems, not only qualitative but also quantitative information on a specific property (e.g. binding affinity) is available. The latter can be used to build a regression scheme that predicts this property for new compounds. Predicting a compound property explicitly is generally more difficult than deciding whether the property lies below or above a given threshold value. Hence, an indirect classification based on regression may lead to poorer results than a direct classification scheme. In fact, researchers are initially interested only in classifying compounds as potential drugs. The activities of these compounds are subsequently measured in a wet lab.

RESULTS: We propose a novel approach that uses the available quantitative information directly for classification rather than first using a regression scheme. It uses a new type of loss function called weighted biased regression. Application of this method to four widely studied data sets of the CoEPrA contest (Comparative Evaluation of Prediction Algorithms, http://coepra.org) shows that it can outperform simple classification methods that do not make use of this additional quantitative information.

AVAILABILITY: A standalone application that builds a model for a submitted peptide training set is available at http://agknapp.chemie.fu-berlin.de/agknapp/index.php?menu=software&page=PeptideClassifier

Subject(s)

Algorithms, Peptides/chemistry, Binding Sites, Factual Databases, Drug Design, Ligands, Quantitative Structure-Activity Relationship, Regression Analysis

DemQSAR: predicting human volume of distribution and clearance of drugs.

Demir-Kavuk, Ozgur; Bentzien, Jörg; Muegge, Ingo; Knapp, Ernst-Walter.

J Comput Aided Mol Des; 25(12): 1121-33, 2011 Dec.

Article in English | MEDLINE | ID: mdl-22101402

ABSTRACT

In silico methods that characterize molecular compounds with respect to pharmacologically relevant properties can accelerate the identification of new drugs and reduce their development costs. Quantitative structure-activity/-property relationships (QSAR/QSPR) correlate the structure and physicochemical properties of molecular compounds with a specific functional activity or property under study. Typically a large number of molecular features are generated for the compounds. In many cases the number of generated features exceeds the number of compounds with known property values that are available for learning. Machine learning methods tend to overfit the training data in such situations, i.e. the method adjusts to very specific features of the training data that are not characteristic of the property considered. This problem can be alleviated by diminishing the influence of unimportant, redundant or even misleading features. A better strategy is to eliminate such features completely. Ideally, a molecular property can be described by a small number of features that are chemically interpretable. The purpose of the present contribution is to provide a predictive modeling approach that combines feature generation, feature selection, model building and control of overtraining in a single application called DemQSAR. DemQSAR is used to predict human volume of distribution (VD(ss)) and human clearance (CL). To control overtraining, quadratic and linear regularization terms were employed. A recursive feature selection approach is used to reduce the number of descriptors. The prediction performance is as good as the best predictions reported in the recent literature. The example presented here demonstrates that DemQSAR can generate a model that uses very few features while maintaining high predictive power. A standalone DemQSAR Java application for building models of any user-defined property, as well as a web interface for predicting human VD(ss) and CL, is available on the DemPRED webpage: http://agknapp.chemie.fu-berlin.de/dempred/
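The abstract does not say how its recursive feature selection ranks descriptors. As a rough illustration only, the sketch below shows one generic variant: backward elimination that refits a ridge (quadratic) regularized linear model and drops the feature with the smallest absolute weight. The function names, the ranking criterion and the parameter `l2` are assumptions, not DemQSAR's actual procedure, and the criterion is only meaningful when features are on comparable scales.

```python
def ridge_fit(X, y, l2, n_iter=500):
    """Coordinate-descent ridge: minimize 0.5*||y - Xw||^2 + 0.5*l2*||w||^2.

    Requires l2 > 0 so that every coordinate update is well defined.
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw, and w starts at zero
    col_sq = [sum(row[j] ** 2 for row in X) for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            rho = sum(X[i][j] * residual[i] for i in range(n)) + col_sq[j] * w[j]
            new_wj = rho / (col_sq[j] + l2)
            delta = new_wj - w[j]
            for i in range(n):
                residual[i] -= delta * X[i][j]
            w[j] = new_wj
    return w


def recursive_elimination(X, y, n_keep, l2=1.0):
    """Refit repeatedly, each time dropping the feature with the smallest |weight|.

    Hypothetical ranking criterion; returns the indices of the surviving features.
    """
    active = list(range(len(X[0])))
    while len(active) > n_keep:
        X_act = [[row[j] for j in active] for row in X]
        w = ridge_fit(X_act, y, l2)
        weakest = min(range(len(active)), key=lambda k: abs(w[k]))
        del active[weakest]
    return active


# Toy data (hypothetical): three orthogonal features with decreasing relevance.
X = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
y = [2.0, 0.1, -0.5]
print(recursive_elimination(X, y, n_keep=2))  # prints [0, 2]
```

Refitting after every removal matters when descriptors are correlated, because the weight of a remaining feature can change once a redundant partner has been dropped.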

Subject(s)

Pharmaceutical Preparations/chemistry, Pharmacokinetics, Quantitative Structure-Activity Relationship, Artificial Intelligence, Humans, Metabolic Clearance Rate, Biological Models

Understanding properties of cofactors in proteins: redox potentials of synthetic cytochromes b.

Gámiz-Hernández, Ana P; Kieseritzky, Gernot; Galstyan, Artur S; Demir-Kavuk, Ozgur; Knapp, Ernst-Walter.

Chemphyschem; 11(6): 1196-206, 2010 Apr 26.

Article in English | MEDLINE | ID: mdl-20411561

ABSTRACT

Haehnel et al. synthesized 399 different artificial cytochrome b (aCb) models. They consist of a template-assisted four-helix bundle with one embedded heme group. Their redox potentials were measured and cover the range from -148 to -89 mV. No crystal structures of these aCb are available. Therefore, we use the chemical composition and general structural principles to generate atomic coordinates of 31 of these aCb mutants, which are chosen to cover the whole interval of redox potentials. We start by modeling the coordinates of one aCb from scratch. Its structure remains stable after energy minimization and during a molecular dynamics simulation over 2 ns. Based on this structure, coordinates of the other 30 aCb mutants are modeled. The calculated redox potentials for these 31 aCb agree with the experimental values to within a root mean square deviation of 10 mV. Analysis of the dependence of the heme redox potential on the protein environment shows that the shifts in redox potential relative to the model systems in water are due to the low-dielectric medium of the protein and the protonation states of the heme propionic acid groups, which are influenced by the surrounding amino acids. Alternatively, we perform a blind prediction of the same redox potentials using an empirical approach based on a linear scoring function and reach a similar accuracy. Both methods are useful for understanding and predicting heme redox potentials. The modeled structures allow us to understand the detailed structural differences between aCb mutants that give rise to shifts in heme redox potential. The empirical prediction scheme, on the other hand, makes it possible to explore the correlation between sequence variations and aCb redox potentials more directly and on a much larger scale, because its simplicity makes it much faster.
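The agreement between calculated and measured redox potentials is reported as a root mean square deviation (RMSD). A minimal sketch of that metric follows; the function name and the example values are hypothetical and are not taken from the paper.

```python
import math


def rmsd(predicted, measured):
    """Root mean square deviation between two equally long value lists."""
    if len(predicted) != len(measured) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - m) ** 2 for p, m in zip(predicted, measured)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical redox potentials in mV, for illustration only.
calculated = [-140.0, -120.0, -95.0]
experimental = [-143.0, -117.0, -98.0]
print(rmsd(calculated, experimental))  # every deviation is 3 mV, so the RMSD is 3.0
```

Because the deviations are squared before averaging, a few large errors raise the RMSD more than many small ones would.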

Subject(s)

Coenzymes/chemistry, Cytochromes b/chemistry, Amino Acid Sequence, Amino Acid Substitution, Heme/chemistry, Molecular Dynamics Simulation, Molecular Sequence Data, Mutation, Oxidation-Reduction, Secondary Protein Structure, Tertiary Protein Structure, Static Electricity

Predicting protein complex geometries with linear scoring functions.

Demir-Kavuk, Ozgur; Krull, Florian; Chae, Myong-Ho; Knapp, Ernst-Walter.

Genome Inform; 24: 21-30, 2010.

Article in English | MEDLINE | ID: mdl-22081586

ABSTRACT

Protein-protein interactions play an important role in many cellular processes. However, experimental determination of a protein complex structure is difficult and time-consuming. Hence, there is a need for fast and accurate in silico protein docking methods. These methods generally consist of two stages: (i) a sampling algorithm that generates a large number of candidate complex geometries (decoys), and (ii) a scoring function that ranks these decoys such that near-native decoys are ranked higher than other decoys. We have recently developed a neural-network-based scoring function that performed better than other state-of-the-art scoring functions on a benchmark of 65 protein complexes. Here, we use similar ideas to develop a method based on linear scoring functions. We compare the linear scoring function of the present study with other knowledge-based scoring functions such as ZDOCK 3.0 and ZRANK, and with the previously developed neural network. Despite its simplicity, the linear scoring function performs as well as the state-of-the-art methods it is compared with, and its predictions are simple and fast to compute.
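The second docking stage, ranking decoys with a linear scoring function, can be sketched in a few lines. This is an illustration under stated assumptions rather than the scoring function of the study: the two features (an interface contact count and a clash count), the weights, and the convention that a higher score is better are all hypothetical.

```python
def linear_score(weights, features):
    """Weighted sum of a decoy's feature values."""
    if len(weights) != len(features):
        raise ValueError("weights and features must have the same length")
    return sum(w * f for w, f in zip(weights, features))


def rank_decoys(weights, decoys):
    """Return decoy names ordered from the highest score to the lowest."""
    return sorted(
        decoys,
        key=lambda name: linear_score(weights, decoys[name]),
        reverse=True,
    )


# Hypothetical features per decoy: [interface contacts, steric clashes].
weights = [1.0, -2.0]  # reward contacts, penalize clashes
decoys = {
    "decoy_a": [10.0, 1.0],  # score 8.0
    "decoy_b": [12.0, 4.0],  # score 4.0
    "decoy_c": [5.0, 0.0],   # score 5.0
}
print(rank_decoys(weights, decoys))  # prints ['decoy_a', 'decoy_c', 'decoy_b']
```

Scoring a decoy costs one multiplication and one addition per feature, which is consistent with the abstract's remark that predictions are fast to compute.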

Subject(s)

Protein Interaction Mapping/methods, Proteins/chemistry, Software, Algorithms, Computational Biology/methods, Protein Databases, Molecular Docking Simulation, Computer Neural Networks, Programming Languages, Reproducibility of Results
