1.
Mol Divers ; 25(3): 1439-1460, 2021 Aug.
Article in English | MEDLINE | ID: mdl-34159484

ABSTRACT

The accumulation of massive data in the plethora of Cheminformatics databases has made the role of big data and artificial intelligence (AI) indispensable in drug design. This has necessitated the development of newer algorithms and architectures to mine these databases and fulfil the specific needs of various drug discovery processes such as virtual drug screening, de novo molecule design and discovery in this big data era. The development of deep learning neural networks and their variants with the corresponding increase in chemical data has resulted in a paradigm shift in information mining pertaining to the chemical space. The present review summarizes the role of big data and AI techniques currently being implemented to satisfy the ever-increasing research demands in drug discovery pipelines.


Subject(s)
Artificial Intelligence , Big Data , Drug Discovery/methods , Algorithms , Databases, Factual , Deep Learning , Drug Design , Machine Learning , Reproducibility of Results , Workflow
2.
Genomics ; 112(5): 3571-3578, 2020 Sep.
Article in English | MEDLINE | ID: mdl-32320820

ABSTRACT

Single nucleotide polymorphisms (SNPs) are important molecular markers widely used in animal breeding programs to improve desirable genetic traits. The present study was therefore carried out to identify, annotate and analyze the SNPs related to four important traits of buffalo, viz. milk volume, age at first calving, post-partum cyclicity and feed conversion efficiency. We identified 246,495, 168,202, 74,136 and 194,747 genome-wide SNPs related to these traits, respectively, using the ddRAD sequencing technique on 85 samples of Murrah buffaloes. The distribution of these SNPs was highest (61.69%) in intron regions and lowest (1.78%) in exon regions. Within coding regions, the SNPs for the four traits were further classified as synonymous (4697) and non-synonymous (3827). Moreover, Gene Ontology (GO) terms of the identified genes were assigned to the various traits. These characterized SNPs will enhance knowledge of the cellular mechanisms for enhancing the productivity of water buffalo through molecular breeding.


Subject(s)
Buffaloes/genetics , Polymorphism, Single Nucleotide , Animals , Female , Milk , Molecular Sequence Annotation , Sequence Analysis, DNA
3.
BMC Bioinformatics ; 21(1): 493, 2020 Oct 31.
Article in English | MEDLINE | ID: mdl-33129275

ABSTRACT

BACKGROUND: Cytokines act by binding to specific receptors in the plasma membrane of target cells. Knowledge of cytokine-receptor interaction (CRI) is very important for understanding the pathogenesis of various human diseases, notably autoimmune, inflammatory and infectious diseases, and for identifying potential therapeutic targets. Recently, machine learning algorithms have been used to predict CRIs. However, "gold standard" negative datasets are still lacking, and strong biases in negative datasets can significantly affect the training of learning algorithms and their evaluation. To mitigate the unrepresentativeness and bias inherent in the selection of negative samples (non-interacting proteins), we propose a clustering-based approach for representative negative sample selection. RESULTS: We used deep autoencoders to investigate the effect of different sampling approaches for non-interacting pairs on the training and performance of machine learning classifiers. Using the anomaly detection capabilities of deep autoencoders, we deduced the effects of different categories of negative samples on the training of learning algorithms. Random sampling of non-interacting pairs results in either over- or under-representation of hard or easy to classify instances. When K-means based sampling of negative datasets is applied to mitigate the inadequacies of random sampling, random forest (RF) together with the combined feature set of atomic composition, physicochemical-2grams and two different representations of evolutionary information performs best. Average model performances based on leave-one-out cross validation (LOOCV) over the ten different negative sample sets that each model was trained with show that RF models significantly outperform the previous best CRI predictor in terms of accuracy (+5.1%), specificity (+13%), MCC (+0.1) and g-means value (+5.1). Evaluations using tenfold CV and training/testing splits confirm the competitive performance.
CONCLUSIONS: A comparative analysis was performed to assess the effect of three different sampling methods (random, K-means and uniform sampling) on the training of learning algorithms using different evaluation methods. Models trained on K-means sampled datasets generally show significantly improved performance compared to those trained on random selections, with RF seemingly benefiting most in our particular setting. Our findings on sampling are highly relevant and apply to many applications of supervised learning approaches in bioinformatics.
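The clustering-based selection of negative samples described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`kmeans`, `sample_negatives`), the toy 2-D feature vectors, and the rule of drawing an equal number of samples from each cluster are assumptions made for illustration, written in plain Python.

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k distinct data points
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels

def sample_negatives(negatives, k, per_cluster, seed=0):
    """Draw up to `per_cluster` negatives from every k-means cluster
    (hypothetical selection rule, assumed for this sketch)."""
    rng = random.Random(seed)
    labels = kmeans(negatives, k, seed=seed)
    chosen = []
    for c in range(k):
        members = [p for p, lab in zip(negatives, labels) if lab == c]
        chosen.extend(rng.sample(members, min(per_cluster, len(members))))
    return chosen

# Toy candidate negatives: a dense group and a small, distant group.
negatives = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
             (0.2, 0.8), (0.8, 0.2), (0.4, 0.6), (10, 10), (10, 11)]
print(sample_negatives(negatives, k=2, per_cluster=2))
```

Uniform random sampling of four points from this toy set would usually miss the small group; sampling per cluster keeps it represented, which is the kind of imbalance the abstract attributes to random selection.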


Subject(s)
Receptors, Cytokine/metabolism , Algorithms , Humans , Machine Learning , Position-Specific Scoring Matrices , Reproducibility of Results
4.
J Theor Biol ; 479: 37-47, 2019 Oct 21.
Article in English | MEDLINE | ID: mdl-31310757

ABSTRACT

Phospholipidosis is characterized by the presence of excessive accumulation of phospholipids in different tissue types (lungs, liver, eyes, kidneys etc.) caused by cationic amphiphilic drugs. Electron microscopy analysis has revealed the presence of lamellar inclusion bodies as the hallmark of phospholipidosis. Some phospholipidosis causing compounds can cause tissue specific inflammatory/retrogressive changes. Reliable and accurate in silico methods could facilitate early screening of phospholipidosis inducing compounds which can subsequently speed up the pharmaceutical drug discovery pipelines. In the present work, stacking ensembles are implemented for combining a number of different base learners to develop predictive models (a total of 256 trained machine learning models were tested) for phospholipidosis inducing compounds using a wide range of molecular descriptors (ChemMine, JOELib, Open babel and RDK descriptors) and structural alerts as input features. The best model consisting of stacked ensemble of machine learning algorithms with random forest as the second level learner outperformed other base and ensemble learners. JOELib descriptors along with structural alerts performed better than the other types of descriptor sets. The best ensemble model achieved an overall accuracy of 88.23%, sensitivity of 86.27%, specificity of 90.20%, mcc of 0.765, auc of 0.896 with 88.21 g-means. To assess the robustness and stability of the best ensemble model, it is further evaluated using stratified 10×10 fold cross validation and holdout testing sets (repeated 10 times) achieving 84.83% mean accuracy with 0.708 mean mcc and 88.46% mean accuracy with 0.771 mean mcc respectively. A comparison of different meta classifiers (Generalized linear regression, Gradient boosting machines, Random forest and Deep learning neural networks) in stacking ensemble revealed that random forest is the better choice for combining multiple classification models.
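The evaluation metrics quoted in this abstract (accuracy, sensitivity, specificity, MCC and g-means) are all functions of a binary confusion matrix, and a short sketch shows how they relate. The function name `binary_metrics` and the counts used below are assumptions: the abstract does not give the test-set size, and the counts (44 TP, 7 FN, 46 TN, 5 FP) were chosen only because they reproduce the reported percentages to rounding. AUC is omitted because it needs ranked prediction scores rather than counts.

```python
import math

def binary_metrics(tp, fn, tn, fp):
    """Standard metrics from a binary confusion matrix."""
    sensitivity = tp / (tp + fn)           # true positive rate
    specificity = tn / (tn + fp)           # true negative rate
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    # Matthews correlation coefficient.
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # g-means: geometric mean of sensitivity and specificity.
    g_mean = math.sqrt(sensitivity * specificity)
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "specificity": specificity, "mcc": mcc, "g_mean": g_mean}

# Hypothetical counts, NOT taken from the paper (see the note above).
for name, value in binary_metrics(tp=44, fn=7, tn=46, fp=5).items():
    print(f"{name}: {value:.4f}")
```

With these assumed counts the sketch gives a sensitivity of about 0.8627, a specificity of about 0.9020, an MCC of about 0.765 and a g-mean of about 0.8821, in line with the figures reported for the best ensemble.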


Subject(s)
Lipidoses/diagnosis , Models, Statistical , Phospholipids/metabolism , Area Under Curve , Drug Discovery , Humans , Lipidoses/chemically induced , Lipidoses/etiology , Machine Learning/standards , Sensitivity and Specificity
5.
J Theor Biol ; 444: 73-82, 2018 May 07.
Article in English | MEDLINE | ID: mdl-29462625

ABSTRACT

In yeast and in some mammals, recombination frequencies are high in some genomic locations, known as recombination hotspots, while locations where recombination is below average are known as coldspots. Knowledge of the hotspot regions gives clues for understanding the meiotic process and the possible effects of sequence variation in these regions. Moreover, accurate information about the hotspot and coldspot regions can reveal insights into genome evolution. In the present work, we used class-specific autoencoders for feature extraction and reduction. The deep features extracted from the autoencoders were then used to train three different classifiers, namely gradient boosting machines, random forest and deep learning neural networks, for predicting the hotspot and coldspot regions. A comparative performance analysis was carried out on deep features extracted from different sets of the training data using autoencoders, in order to select the best set of deep features. Learning algorithms trained on features extracted from the combined class-specific autoencoder outperformed the same algorithms trained with other sets of deep features. The combined class-specific autoencoder based feature extraction can therefore be applied to a growing range of biological problems to achieve superior prediction performance.


Subject(s)
Deep Learning , Recombination, Genetic/genetics , Algorithms , Base Sequence , Classification , Neural Networks, Computer , Saccharomyces cerevisiae/genetics
6.
Amino Acids ; 48(3): 751-762, 2016 Mar.
Article in English | MEDLINE | ID: mdl-26520112

ABSTRACT

The sequence parameters for halophilic adaptation are still not fully understood. To understand the molecular basis of protein hypersaline adaptation, a detailed analysis was carried out to investigate the likely association of protein sequence attributes with halophilic adaptation. A two-stage strategy is implemented. In the first stage, a supervised machine learning classifier is built, giving an overall accuracy of 86% on stratified tenfold cross validation and 90% on a blind testing set, which are better than previously reported results. The second stage consists of statistical analysis of sequence features and possible extraction of halophilic molecular signatures. The results of this study showed that halophilic proteins are characterized by lower average charge, lower K content, and lower S content. A statistically significant preference/avoidance list of sequence parameters is also reported, giving insights into the molecular basis of halophilic adaptation. D, Q, E, H, P, T, V are significantly preferred while N, C, I, K, M, F, S are significantly avoided. Among amino acid physicochemical groups, small, polar, charged, acidic and hydrophilic groups are preferred over other groups. The halophilic proteins also showed a preference for higher average flexibility and higher average polarity, and an avoidance of higher average positive charge, average bulkiness and average hydrophobicity. Some interesting trends observed in dipeptide counts are also reported. Further, a systematic statistical comparison is undertaken to gain insights into the sequence feature distribution in different residue structural states. The current analysis may make the mechanism of halophilic adaptation clearer, which can be further used for the rational design of halophilic proteins.


Subject(s)
Proteins/chemistry , Adaptation, Physiological , Amino Acid Sequence , Animals , Databases, Protein , Humans , Hydrophobic and Hydrophilic Interactions , Kinetics , Machine Learning , Proteins/metabolism , Sodium Chloride/chemistry , Sodium Chloride/metabolism
7.
J Theor Biol ; 390: 117-26, 2016 Feb 07.
Article in English | MEDLINE | ID: mdl-26656108

ABSTRACT

Piezophiles are organisms that can survive under extreme pressure conditions. However, the molecular basis of piezophilic adaptation is still poorly understood. Analysis of the protein sequence adjustments that have taken place during evolution can help reveal the sequence adaptation parameters responsible for protein functional and structural adaptation at such high pressures. In the current work, we used an SVM classifier to filter strong instances and generated human-interpretable rules from these strong instances using the PART algorithm. The generated rules were analyzed to gain insights into the molecular signature patterns present in piezophilic proteins. For a detailed comparative study, the experiments were performed on piezophilic groups from three different temperature ranges, namely psychrophilic-piezophilic, mesophilic-piezophilic and thermophilic-piezophilic. The best classification results were obtained as we moved up the temperature range from psychrophilic-piezophilic to thermophilic-piezophilic. Based on the physicochemical classification of amino acids and using feature ranking algorithms, hydrophilic and polar amino acid groups have higher discriminative ability for the psychrophilic-piezophilic and mesophilic-piezophilic groups, as do hydrophobic and nonpolar amino acids for the thermophilic-piezophilic group. We also observed an overrepresentation of polar, hydrophilic and small amino acid groups in the discriminatory rules of piezophiles from all three temperature ranges, along with aliphatic, nonpolar and hydrophobic groups in the mesophilic-piezophilic and thermophilic-piezophilic groups.


Subject(s)
Adaptation, Physiological/genetics , Archaeal Proteins/genetics , Bacterial Proteins/genetics , Pressure , Algorithms , Amino Acids/chemistry , Amino Acids/genetics , Archaea/classification , Archaea/genetics , Archaea/growth & development , Bacteria/classification , Bacteria/genetics , Bacteria/growth & development , Computational Biology/methods , Hydrophobic and Hydrophilic Interactions , Models, Genetic , Temperature
8.
Sci Rep ; 14(1): 5958, 2024 Mar 12.
Article in English | MEDLINE | ID: mdl-38472266

ABSTRACT

Fuzzy rough entropy, established within fuzzy rough set theory, has been effectively and efficiently applied to feature selection to handle uncertainty in real-valued datasets. Further, fuzzy rough mutual information has been presented by integrating information entropy with fuzzy rough sets to measure the importance of features. However, no method to date can simultaneously handle noise, uncertainty and vagueness due to both judgement and identification, which degrades the overall performance of learning algorithms as the number of mixed-valued conditional features increases. In the current study, these issues are tackled by presenting a novel intuitionistic fuzzy (IF) assisted mutual information concept along with an IF granular structure. Initially, a hybrid IF similarity relation is introduced. Based on this relation, an IF granular structure is introduced. Then, IF rough conditional and joint entropies are established. Further, mutual information based on these concepts is discussed. Next, mathematical theorems are proved to demonstrate the validity of the given notions. Thereafter, the significance of a feature subset is computed using this mutual information, and a corresponding feature selection method is suggested to delete irrelevant and redundant features. The current approach effectively handles noise and subsequent uncertainty in both nominal and mixed data (including both nominal and category variables). Moreover, comprehensive experiments are carried out on real-valued benchmark datasets to demonstrate the practical validity and effectiveness of the technique. Finally, an application of the proposed method is presented to improve the prediction of phospholipidosis-positive molecules. RF(h2o) produces the most effective results to date based on our proposed methodology, with sensitivity, accuracy, specificity, MCC, and AUC of 86.7%, 90.1%, 93.0%, 0.808, and 0.922 respectively.

9.
Sci Rep ; 14(1): 13568, 2024 Jun 12.
Article in English | MEDLINE | ID: mdl-38866851

ABSTRACT

The dimension and size of data are growing rapidly with the extensive application of computer science and lab-based engineering in daily life. The presence of vagueness, later uncertainty, redundancy, irrelevancy, and noise raises concerns in building effective learning models. Fuzzy rough sets and their extensions have been applied to deal with these issues through various data reduction approaches. However, constructing a model that can cope with all these issues simultaneously is a challenging task, and none of the studies to date has addressed all of them at once. This paper investigates a method based on the notions of intuitionistic fuzzy (IF) and rough sets to avoid these obstacles simultaneously by putting forward a data reduction technique. To accomplish this task, firstly, a novel IF similarity relation is introduced. Secondly, we establish an IF rough set model on the basis of this similarity relation. Thirdly, an IF granular structure is presented using the established similarity relation and the lower approximation. Next, mathematical theorems are used to validate the proposed notions. Then, the importance-degree of the IF granules is employed for eliminating redundant instances. Further, significance-degree-preserved dimensionality reduction is discussed. Hence, simultaneous instance and feature selection can be performed on large volumes of high-dimensional data to eliminate redundancy and irrelevancy in both dimension and size, where vagueness and later uncertainty are handled with rough and IF sets respectively, whilst noise is tackled with the IF granular structure. Thereafter, a comprehensive experiment is carried out on benchmark datasets to demonstrate the effectiveness of simultaneous feature and data point selection. Finally, a framework aided by our proposed methodology is discussed to enhance the regression performance for IC50 of antiviral peptides.

10.
Biol Futur ; 74(4): 489-506, 2023 Dec.
Article in English | MEDLINE | ID: mdl-37889451

ABSTRACT

Antiviral peptides (AVPs) open new possibilities as effective antiviral therapeutics in the current scenario of evolving drug-resistant viruses. The sequence and structure-activity relationships in AVPs are still largely unknown. AVPs and antimicrobial peptides (AMPs) share several common features, but as they target different life forms (living organisms and viruses), exploring the differential sequence features may facilitate the design of specific AVPs. The current work developed accurate prediction models for discriminating (a) AVPs from AMPs, (b) Coronaviridae AVPs from other virus family specific AVPs and (c) highly active AVPs (HAA) from lowly active AVPs (LAA). Further, explainable machine learning methods (using model-agnostic global interpretable methods) are utilized for exploring and interpreting the physicochemical spaces of AVPs, Coronaviridae AVPs and highly active AVPs. To further understand the association of physicochemical space distribution with pIC50 values, regression models were developed and analyzed using accumulated local effects and interaction strength analysis. An independent sample t-test is used to filter out the significant compositional differences between the smaller length HAA and longer length HAA groups. AVPs prefer a lower charge/length ratio and basic residues in comparison with AMPs. Coronaviridae family-specific AVPs have lower propensities for basic amino acids and charge, and a preference for aspartic acid. Further, basic residues are more prevalent in lowly active AVPs than in highly active AVPs. Sequence order effects captured in terms of average amino acid pair distances proved to be more constructive in deciphering the sequences of AVPs.


Subject(s)
Antiviral Agents , Peptides , Amino Acid Sequence , Antiviral Agents/pharmacology , Antiviral Agents/chemistry
11.
Med Biol Eng Comput ; 60(8): 2349-2357, 2022 Aug.
Article in English | MEDLINE | ID: mdl-35751828

ABSTRACT

Early identification of the risk factors associated with the development of diabetic foot ulcer (DFU) can be facilitated using machine learning techniques. The aim of this study is to determine the association of various clinical and biochemical risk factors with DFU and to develop a prediction model using different machine learning algorithms. Eighty subjects each with type 2 diabetes mellitus (T2DM) with DFU and T2DM without DFU were enrolled in this observational study. Clinical and laboratory data were analysed using different machine learning algorithms, namely support vector machines (SVM-Poly K), Naive Bayes (NB), K-nearest neighbour (KNN) and random forest (RF), and three ensemble learners, Stacking C, Bagging and AdaBoost, to construct prediction models for discriminating between the two groups (stage I classification) and for ulcer type classification (stage II classification). Ensemble learning performed better than individual classifiers in terms of various performance evaluation metrics. New risk factors such as ApoA1 and IL-10 for the development of DFU in diabetes mellitus were identified. IL-10 along with uric acid could discriminate the grades of ulcers according to their severity. A decision fusion strategy using the Stacking C algorithm resulted in enhanced prediction accuracy for both stages of classification and can be used as a complementary method for computational screening for DFU and its subtypes. [Figure caption: Current methodology for T2DM with DFU/T2DM without DFU and ulcer type classification.]


Subject(s)
Diabetes Mellitus, Type 2 , Diabetic Foot , Algorithms , Bayes Theorem , Diabetes Mellitus, Type 2/complications , Diabetes Mellitus, Type 2/diagnosis , Diabetic Foot/diagnosis , Humans , Interleukin-10 , Machine Learning , Risk Factors , Support Vector Machine
12.
Comput Biol Chem ; 95: 107588, 2021 Dec.
Article in English | MEDLINE | ID: mdl-34655913

ABSTRACT

The low efficacy of current antivirals in conjunction with the resistance of viruses against existing antiviral drugs has resulted in demand for the development of novel antiviral agents. Antiviral peptides (AVPs) are bioactive peptides with virucidal activity that can be developed into promising antiviral drugs. They are shorter length peptides with the ability to halt the progression of viral infections. The use of antiviral peptides in therapeutics has recently attracted the attention of the research community. The development and identification of AVPs is imperative for the discovery of novel therapeutics for viral infections. In the present work, a meta classifier (stacking) based approach is implemented for the prediction of IC50 (half maximal inhibitory concentration) and pIC50 (negative log of half maximal inhibitory concentration) values. The best prediction model, with evolutionary information and local alignment scores as features, achieved correlation coefficient values of 0.670 and 0.753 on the training and testing sets respectively for IC50. Further, the prediction of pIC50 reached correlation coefficient values of 0.797 and 0.789 for the training and testing sets respectively. For the development of machine learning models involved in the prediction of IC50, the use of pIC50 over IC50 is recommended as the target variable. Further, a systematic comparison of AVPs with high IC50 values and low IC50 values revealed that higher mean charge and tiny amino acids are preferred, and higher length and consecutive hydrophilic amino acids are avoided, in the former.
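Since the abstract recommends pIC50 rather than IC50 as the regression target, a small sketch of the transform may help. pIC50 is the negative base-10 logarithm of IC50 expressed in mol/L; the function name `pic50_from_ic50_um` and the assumption that IC50 values arrive in micromolar are illustrative, as the abstract does not state the unit used.

```python
import math

def pic50_from_ic50_um(ic50_um):
    """Convert an IC50 given in micromolar (assumed unit) to pIC50.

    pIC50 = -log10(IC50 in mol/L) = 6 - log10(IC50 in micromolar).
    """
    if ic50_um <= 0:
        raise ValueError("IC50 must be positive")
    return 6.0 - math.log10(ic50_um)

# Hypothetical IC50 values spanning four orders of magnitude.
for ic50 in (0.01, 1.0, 100.0):
    print(ic50, "uM ->", pic50_from_ic50_um(ic50))
```

The logarithm compresses values spread over several orders of magnitude into a narrow range, and a more potent peptide (lower IC50) gets a higher pIC50. That compression is one plausible reason a log-scaled target is easier to regress on, though the abstract itself reports only the resulting correlations.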


Subject(s)
Antiviral Agents/chemistry , Machine Learning , Peptides/chemistry , Algorithms , Computational Biology , Hydrophobic and Hydrophilic Interactions
13.
Comput Biol Chem ; 87: 107274, 2020 May 05.
Article in English | MEDLINE | ID: mdl-32416563

ABSTRACT

Growth hormone binding proteins (GHBPs) are soluble proteins that play an important role in the modulation of signaling pathways pertaining to growth hormones. GHBPs are selective and bind non-covalently with growth hormones, but their functions are still not fully understood. Identification and characterization of GHBPs are the preliminary steps for understanding their roles in various cellular processes. As wet lab based experimental methods involve high cost and labor, computational methods can help narrow down the search space of putative GHBPs. The performance of machine learning algorithms largely depends on the quality of the features they are given. Informative and non-redundant features generally result in enhanced performance, and feature selection algorithms are commonly used for this purpose. In the present work, a novel representation transfer learning approach is presented for the prediction of GHBPs. For their accurate prediction, deep autoencoder based features were extracted and an SMO-PolyK classifier was subsequently trained. The prediction model is evaluated by both leave-one-out cross validation (LOOCV) and a hold-out independent testing set. On LOOCV, the prediction model achieved 89.8% accuracy, with 89.4% sensitivity and 90.2% specificity, and an accuracy of 93.5%, sensitivity of 90.2% and specificity of 96.8% were attained on the hold-out testing set. Further, the prediction accuracy was compared between the full set of sequence-based features, the top performing sequence features extracted using a feature selection algorithm, deep autoencoder based features and generalized low rank model based features. Principal component analysis of the representative features along with t-SNE visualization demonstrated the effectiveness of deep features in the prediction of GHBPs. The present method is robust and accurate and may complement wet lab based methods for the identification of novel GHBPs.

14.
Front Vet Sci ; 7: 518, 2020.
Article in English | MEDLINE | ID: mdl-32984408

ABSTRACT

Machine learning algorithms were employed to predict feed conversion efficiency (FCE) in buffalo heifers, using blood parameters and average daily gain (ADG) as predictor variables. Isotonic regression outperformed the other machine learning algorithms used in the study. The best performance evaluation metrics were achieved by a model with additive regression as the meta learner and isotonic regression as the base learner, on both 10-fold cross-validation and leave-one-out cross-validation tests. Further, to understand the interactions of blood parameters and ADG with FCE, we created three separate partial least squares regression (PLSR) models using all 14 blood parameters and ADG as independent (explanatory) variables and FCE as the dependent variable: (i) including all FCE values, (ii) including only higher FCE values (negative RFI), and (iii) including only lower FCE values (positive RFI). Based on performance evaluation metrics, the PLSR model including only the higher FCE values was judged the best of the three. IGF1 and its interactions with the other blood parameters were found to be highly influential for higher FCE measures. The strength of the estimated interaction effects of the blood parameters in relation to FCE may facilitate understanding of the intricate dynamics of blood parameters for growth.

15.
Comput Biol Chem ; 80: 333-340, 2019 Jun.
Article in English | MEDLINE | ID: mdl-31078912

ABSTRACT

Adhesion is the foremost step in pathogenesis and biofilm formation and is facilitated by a special class of cell wall proteins known as adhesins. Formation of biofilms in catheters and other medical devices subsequently leads to infections. Compared to bacterial adhesins, there has been relatively little work on the characterization and identification of fungal adhesins. Understanding the sequence characteristics of fungal adhesins may facilitate a better understanding of their role in pathogenesis. Experimental methods for the investigation and characterization of fungal adhesins are labor intensive and expensive. Therefore, there is a need for fast and efficient computational methods for the identification and characterization of fungal adhesins. The aim of the current study is twofold: (i) to develop an accurate predictor for fungal adhesins, and (ii) to sieve out the prominent molecular signatures present in fungal adhesins. Of the many supervised learning algorithms implemented in the current study, voting ensembles resulted in enhanced prediction accuracy. The best voting ensemble, consisting of three support vector machines with three different kernels (PolyK, RBF, PuK), achieved an accuracy of 94.9% on leave-one-out cross validation and 98.0% on a blind testing set. A preference/avoidance list of molecular features as well as human-interpretable rules are also extracted, giving insights into the general sequence features of fungal adhesins. Fungal adhesins are characterized by high threonine and cysteine content and avoidance of phenylalanine and methionine. They also show avoidance of average hydrophilicity. The current analysis may facilitate understanding of the mechanism of fungal adhesin function, which may further help in designing methods for restricting adhesin-mediated pathogenesis.
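The vote-combination step of such an ensemble can be sketched in a few lines. This shows plain majority (hard) voting over per-classifier label lists; the abstract does not say which voting rule the authors used, and the function name `majority_vote` and the prediction lists below are hypothetical stand-ins for the outputs of the three SVMs.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine label lists from several classifiers by majority vote.

    `predictions` holds one list of labels per classifier, all of equal length.
    With an odd number of binary classifiers there are no ties.
    """
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]

# Hypothetical outputs of three classifiers on four proteins
# (1 = adhesin, 0 = non-adhesin); not data from the paper.
svm_polyk = [1, 0, 1, 0]
svm_rbf = [1, 1, 0, 0]
svm_puk = [0, 0, 1, 0]
print(majority_vote([svm_polyk, svm_rbf, svm_puk]))  # [1, 0, 1, 0]
```

Each classifier errs on a different sample in this toy case, so the combined vote can be right where any single member is wrong, which is the usual motivation for combining models built with different kernels.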


Subject(s)
Cell Adhesion Molecules/chemistry , Fungal Proteins/chemistry , Databases, Protein , Machine Learning
16.
Methods Mol Biol ; 1762: 21-30, 2018.
Article in English | MEDLINE | ID: mdl-29594765

ABSTRACT

Identification of drug targets and drug target interactions are important steps in the drug-discovery pipeline. Successful computational prediction methods can reduce the cost and time demanded by the experimental methods. Knowledge of putative drug targets and their interactions can be very useful for drug repurposing. Supervised machine learning methods have been very useful in drug target prediction and in prediction of drug target interactions. Here, we describe the details for developing prediction models using supervised learning techniques for human drug target prediction and their interactions.


Subject(s)
Computational Biology/methods , Supervised Machine Learning , Amino Acid Sequence , Drug Delivery Systems , Drug Discovery , Drug Interactions , Drug Repositioning , Humans , Molecular Targeted Therapy
17.
Comput Biol Chem ; 68: 29-38, 2017 Jun.
Article in English | MEDLINE | ID: mdl-28231526

ABSTRACT

β-lactamases provide one of the most successful means by which many gram-positive and gram-negative bacteria evade the therapeutic effects of the β-lactam class of antibiotics. On the basis of sequence identity, β-lactamases have been classified into four distinct classes: A, B, C and D. Classes A, C and D are the serine β-lactamases and class B is the metallo-β-lactamases. In the present study, we developed a two-stage cascade classification system. The first stage separates β-lactamases from non-β-lactamases and the second stage further classifies β-lactamases into the four β-lactamase classes. In the first-stage binary classification, we obtained an accuracy of 97.3% with a sensitivity of 89.1% and specificity of 98.0%, and for the second-stage multi-class classification, we obtained an accuracy of 87.3% for class A, 91.0% for class B, 96.3% for class C and 96.4% for class D. A systematic statistical analysis was carried out on the sieved-out, correctly predicted instances from the second-stage classifier, which revealed some interesting patterns. We analyzed different classes of β-lactamases on the basis of sequence and physicochemical property differences between them. Among amino acid compositions, H, W, Y and V showed significant differences between the different β-lactamase classes. Differences in average physicochemical properties are observed for isoelectric point, volume, flexibility, hydrophobicity, bulkiness and charge in one or more β-lactamase classes. The key differences in physicochemical property groups can be observed in small and aromatic groups. Among amino acid property group n-grams, all except charged n-grams are significant in one or more classes. Statistically significant differences in dipeptide counts among different β-lactamase classes are also reported.


Subject(s)
Chemical Phenomena , Molecular Evolution , beta-Lactamases/analysis , beta-Lactamases/classification , Algorithms , Amino Acid Sequence , Protein Databases , Machine Learning , beta-Lactamases/chemistry , beta-Lactamases/genetics
18.
Interdiscip Sci ; 9(2): 292-303, 2017 Jun.
Article in English | MEDLINE | ID: mdl-26879961

ABSTRACT

Cyclin-dependent kinase inhibitors (CDKIs) govern the regulation of cyclin-dependent kinases, which are responsible for controlling cell cycle progression. The members of the CDKI protein family play important roles in many processes, such as tumor suppression, apoptosis and transcriptional regulation. Sequence similarity-based search methods for annotating putative CDKIs do not yield optimal performance because of the sequence diversity of CDKIs. As a consequence, machine learning-based models have become viable choices for predicting CDKIs. In this work, we have developed a framework for handling class imbalance (which is encountered very frequently in biological datasets) in order to enhance the prediction of CDKIs through machine learning approaches. We designed our experiments to achieve the optimal performance of machine learning-based methods in predicting CDKIs by investigating dataset-related questions: (1) What is the optimal class distribution ratio in the training set? (2) Should we oversample or undersample? (3) At what ratio should positive and negative samples be oversampled or undersampled? (4) How should the best-performing classifier be selected? We addressed these questions by comparing the results from an imbalanced training set with those from training sets created at different resampling rates, using the synthetic minority over-sampling technique and an undersampling technique to obtain varied class distributions. The proposed framework resulted in 100% sensitivity, 93.7% specificity, 96.4% accuracy, 0.929 MCC and 0.981 AUC using simple sequence-based features on a leave-one-out cross-validation test. The generalization ability of the trained model was further tested on four separate blind testing sets. Our work supports the view that the performance of the algorithms can be enhanced by creating an optimal class distribution in the training set, in addition to fine-tuning the parameters of the algorithms. This optimal ratio of positive to negative samples in the training set is an important learning enhancement parameter for prediction models based on machine learning algorithms.
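
The oversampling step mentioned in the abstract above can be sketched as follows. This is a minimal, standard-library rendering of the general SMOTE idea (interpolating between a minority sample and one of its nearest minority-class neighbours), not the authors' implementation; the function name `smote`, the parameter defaults and the toy data are assumptions made for illustration.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority samples.

    Each synthetic point lies on the line segment between a randomly chosen
    minority sample and one of its k nearest minority-class neighbours.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


if __name__ == "__main__":
    # Hypothetical 2-D feature vectors for the minority (positive) class.
    positives = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    for point in smote(positives, n_synthetic=3, k=2):
        print(point)
```

Varying `n_synthetic` relative to the number of negative samples is one way to produce the different class distributions that the abstract describes comparing.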


Subject(s)
Algorithms , Protein Kinase Inhibitors/chemistry , Amino Acids/chemistry , Proteins/chemistry
19.
Comput Biol Med ; 68: 27-36, 2016 Jan 01.
Article in English | MEDLINE | ID: mdl-26599828

ABSTRACT

Bioluminescence plays an important role in nature; for example, it is used for intracellular chemical signalling in bacteria. It is also a useful reagent for various analytical research methods, ranging from cellular imaging to gene expression analysis. However, the identification and annotation of bioluminescent proteins is a difficult task because they share little sequence similarity. In this paper, we present a novel approach for within-class and between-class balancing, as well as diversifying, of a training dataset by combining the unsupervised K-Means algorithm with the Synthetic Minority Oversampling Technique (SMOTE) in order to achieve the true performance of the prediction model. Further, we varied the ratio of positive to negative data in the training dataset in order to probe for the class distribution that produces the best prediction accuracy. The appropriately balanced and diversified training set resulted in near-complete learning with greater generalization on the blind test datasets. The results strongly support the view that an optimal class distribution with a high degree of diversity is an essential factor in achieving near-perfect learning. Using random forests as the weak learners in boosting, trained on the optimally balanced and diversified training dataset, we achieved an overall accuracy of 95.3% on a tenfold cross-validation test, and an accuracy of 91.7%, a sensitivity of 89.3% and a specificity of 91.8% on a holdout test set. It is quite possible that the general framework discussed in the current work can be successfully applied to other biological datasets to deal effectively with imbalance and incomplete-learning problems.
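
For readers less familiar with the performance measures quoted in the abstract above, the sketch below computes sensitivity, specificity and accuracy from a binary confusion matrix, using their standard definitions. The Matthews correlation coefficient is included as a further common summary. The function name and the example counts are hypothetical and do not reproduce the paper's data.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute standard binary classification metrics from confusion counts."""
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative example")
    sensitivity = tp / (tp + fn)          # true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Hypothetical counts chosen only to exercise the function.
    print(binary_metrics(tp=50, fp=5, tn=45, fn=0))
```

On an imbalanced test set, accuracy is dominated by the majority class, which is why sensitivity and specificity are usually reported alongside it.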


Subject(s)
Algorithms , Luminescent Proteins/genetics , Machine Learning , Protein Sequence Analysis/methods , Predictive Value of Tests
20.
3 Biotech ; 6(1): 93, 2016 Jun.
Article in English | MEDLINE | ID: mdl-28330163

ABSTRACT

To counter the host RNA silencing defense mechanism, many plant viruses encode RNA silencing suppressor proteins. These proteins share very low sequence and structural similarity, which hampers their annotation using sequence similarity-based search methods. Machine learning-based methods can be a suitable alternative, but their performance is affected by various factors such as class imbalance, incomplete learning and the selection of inappropriate features. In this paper, we propose a novel approach to the class imbalance problem: finding the optimal class distribution for enhancing the prediction accuracy for RNA silencing suppressors. The optimal class distribution was obtained using different resampling techniques with varying degrees of class distribution, from the natural distribution to the ideal (i.e., equal) distribution. The experimental results support the view that the optimal class distribution plays an important role in achieving near-perfect learning. The best prediction results were obtained with the Sequential Minimal Optimization (SMO) learning algorithm. We achieved a sensitivity of 98.5%, a specificity of 92.6% and an overall accuracy of 95.3% on tenfold cross-validation, and the model was further validated using a leave-one-out cross-validation test. We also observed that machine learning models trained on training sets oversampled using the synthetic minority oversampling technique (SMOTE) performed better than those trained on either randomly undersampled or imbalanced training sets. Further, we have characterized the important discriminatory sequence features of RNA silencing suppressors that distinguish these proteins from other protein families.
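
The idea of moving from the natural class distribution towards an equal one by random undersampling can be sketched as below. This is an illustrative assumption about how such training sets might be built, not the authors' procedure; the function name `undersample_majority`, the ratio values and the toy data are hypothetical, and the negatives are assumed to be the majority class.

```python
import random


def undersample_majority(positives, negatives, neg_per_pos, seed=0):
    """Randomly keep at most neg_per_pos negatives for every positive.

    neg_per_pos = 1.0 gives an equal class distribution; larger values move
    towards the natural distribution, capped at the negatives available.
    """
    if neg_per_pos <= 0:
        raise ValueError("neg_per_pos must be positive")
    rng = random.Random(seed)
    n_keep = min(len(negatives), round(neg_per_pos * len(positives)))
    return list(positives), rng.sample(negatives, n_keep)


if __name__ == "__main__":
    # Hypothetical sample identifiers: 10 positives, 100 negatives.
    pos = [f"pos{i}" for i in range(10)]
    neg = [f"neg{i}" for i in range(100)]
    for ratio in (1.0, 2.0, 5.0, 10.0):
        kept_pos, kept_neg = undersample_majority(pos, neg, ratio)
        print(ratio, len(kept_pos), len(kept_neg))
```

Training and cross-validating a classifier on each resulting set, and comparing the scores, is one straightforward way to search for the best-performing class distribution.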
