Search | VHL Regional Portal

1.

Predicting lifespan-extending chemical compounds for C. elegans with machine learning and biologically interpretable features.

Ribeiro, Caio; Farmer, Christopher K; de Magalhães, João Pedro; Freitas, Alex A.

Aging (Albany NY) ; 15(13): 6073-6099, 2023 07 13.

Article in English | MEDLINE | ID: mdl-37450404

ABSTRACT

Recently, there has been a growing interest in the development of pharmacological interventions targeting ageing, as well as in the use of machine learning for analysing ageing-related data. In this work, we use machine learning methods to analyse data from DrugAge, a database of chemical compounds (including drugs) modulating lifespan in model organisms. To this end, we created four types of datasets for predicting whether or not a compound extends the lifespan of C. elegans (the most frequent model organism in DrugAge), using four different types of predictive biological features, based on: compound-protein interactions, interactions between compounds and proteins encoded by ageing-related genes, and two types of terms annotated for proteins targeted by the compounds, namely Gene Ontology (GO) terms and physiology terms from the WormBase's Phenotype Ontology. To analyse these datasets, we used a combination of feature selection methods in a data pre-processing phase and the well-established random forest algorithm for learning predictive models from the selected features. In addition, we interpreted the most important features in the two best models in light of the biology of ageing. One noteworthy feature was the GO term "Glutathione metabolic process", which plays an important role in cellular redox homeostasis and detoxification. We also predicted the most promising novel compounds for extending lifespan from a list of previously unlabelled compounds. These include nitroprusside, which is used as an antihypertensive medication. Overall, our work opens avenues for future work in employing machine learning to predict novel life-extending compounds.

Subject(s)

Caenorhabditis elegans , Longevity , Machine Learning , Longevity/drug effects , Caenorhabditis elegans/drug effects , Caenorhabditis elegans/genetics , Caenorhabditis elegans/physiology , Aging , Glutathione/analysis , Oxidation-Reduction , Gene Ontology , Algorithms , Databases, Pharmaceutical

2.

Machine learning-based predictions of dietary restriction associations across ageing-related genes.

Vega Magdaleno, Gustavo Daniel; Bespalov, Vladislav; Zheng, Yalin; Freitas, Alex A; de Magalhaes, Joao Pedro.

BMC Bioinformatics ; 23(1): 10, 2022 Jan 04.

Article in English | MEDLINE | ID: mdl-34983372

ABSTRACT

BACKGROUND: Dietary restriction (DR) is the most studied pro-longevity intervention; however, a complete understanding of its underlying mechanisms remains elusive, and new research directions may emerge from the identification of novel DR-related genes and DR-related genetic features. RESULTS: This work used a Machine Learning (ML) approach to classify ageing-related genes as DR-related or NotDR-related using 9 different types of predictive features: PathDIP pathways, two types of features based on KEGG pathways, two types of Protein-Protein Interactions (PPI) features, Gene Ontology (GO) terms, Genotype Tissue Expression (GTEx) expression features, GeneFriends co-expression features and protein sequence descriptors. Our findings suggested that features biased towards curated knowledge (i.e. GO terms and biological pathways) had the greatest predictive power, while unbiased features (mainly gene expression and co-expression data) had the least. Moreover, a combination of all the feature types diminished the predictive power compared to predictions based on curated knowledge. Feature importance analysis on the two most predictive classifiers mostly corroborated existing knowledge and supported recent findings linking DR to the Nuclear Factor Erythroid 2-Related Factor 2 (NRF2) signalling pathway and G protein-coupled receptors (GPCR). We then used the two strongest combinations of feature type and ML algorithm to predict DR-relatedness among ageing-related genes currently lacking DR-related annotations in the data, resulting in a set of promising candidate DR-related genes (GOT2, GOT1, TSC1, CTH, GCLM, IRS2 and SESN2) whose predicted DR-relatedness remains to be validated in future wet-lab experiments. CONCLUSIONS: This work demonstrated the strong potential of ML-based techniques to identify DR-associated features as our findings are consistent with literature and recent discoveries. Although the inference of new DR-related mechanistic findings based solely on GO terms and biological pathways was limited due to their knowledge-driven nature, the predictive power of these two feature types remained useful as it allowed inferring new promising candidate DR-related genes.

Subject(s)

Algorithms , Machine Learning , Gene Ontology , Longevity/genetics

3.

Ageing transcriptome meta-analysis reveals similarities and differences between key mammalian tissues.

Palmer, Daniel; Fabris, Fabio; Doherty, Aoife; Freitas, Alex A; de Magalhães, João Pedro.

Aging (Albany NY) ; 13(3): 3313-3341, 2021 02 11.

Article in English | MEDLINE | ID: mdl-33611312

ABSTRACT

By combining transcriptomic data with other data sources, inferences can be made about functional changes during ageing. Thus, we conducted a meta-analysis on 127 publicly available microarray and RNA-Seq datasets from mice, rats and humans, identifying a transcriptomic signature of ageing across species and tissues. Analyses on subsets of these datasets produced transcriptomic signatures of ageing for brain, heart and muscle. We then applied enrichment analysis and machine learning to functionally describe these signatures, revealing overexpression of immune and stress response genes and underexpression of metabolic and developmental genes. Further analyses revealed little overlap between genes differentially expressed with age in different tissues, even though genes differentially expressed with age are typically widely expressed across tissues. Additionally, we show that the ageing gene expression signatures (particularly the overexpressed signatures) of the whole meta-analysis, brain and muscle tend to include genes that are central in protein-protein interaction networks. We also show that genes underexpressed with age in the brain are highly central in a co-expression network, suggesting that underexpression of these genes may have broad phenotypic consequences. In sum, we show numerous functional similarities between the ageing transcriptomes of these important tissues, along with unique network properties of genes differentially expressed with age in both protein-protein interaction and co-expression networks.

Subject(s)

Aging/genetics , Genomics/methods , Organ Specificity/genetics , Transcriptome/genetics , Animals , Humans , Machine Learning , Mice , Oligonucleotide Array Sequence Analysis , Protein Interaction Mapping , Rats

4.

A Novel Feature Selection Method for Uncertain Features: An Application to the Prediction of Pro-/Anti-Longevity Genes.

da Silva, Pablo Nascimento; Plastino, Alexandre; Fabris, Fabio; Freitas, Alex A.

IEEE/ACM Trans Comput Biol Bioinform ; 18(6): 2230-2238, 2021.

Article in English | MEDLINE | ID: mdl-32324561

ABSTRACT

Understanding the ageing process is a very challenging problem for biologists. To help in this task, there has been a growing use of classification methods (from machine learning) to learn models that predict whether a gene influences the process of ageing or promotes longevity. One type of predictive feature often used for learning such classification models is Protein-Protein Interaction (PPI) features. One important property of PPI features is their uncertainty, i.e., a given feature (PPI annotation) is often associated with a confidence score, which is usually ignored by conventional classification methods. Hence, we propose the Lazy Feature Selection for Uncertain Features (LFSUF) method, which is tailored for coping with the uncertainty in PPI confidence scores. In addition, following the lazy learning paradigm, LFSUF selects features for each instance to be classified, making the feature selection process more flexible. We show that our LFSUF method achieves better predictive accuracy when compared to other feature selection methods that either do not explicitly take PPI confidence scores into account or deal with uncertainty globally rather than using a per-instance approach. Also, we interpret the results of the classification process using the features selected by LFSUF, showing that the number of selected features is significantly reduced, assisting the interpretability of the results. The datasets used in the experiments and the program code of the LFSUF method are freely available on the web at http://github.com/pablonsilva/FSforUncertainFeatureSpaces.

Subject(s)

Aging/genetics , Computational Biology/methods , Machine Learning , Algorithms , Animals , Drosophila melanogaster/genetics , Genome, Human/genetics , Humans , Mice , Protein Interaction Maps/genetics , Uncertainty , Yeasts/genetics

5.

Comparing enrichment analysis and machine learning for identifying gene properties that discriminate between gene classes.

Fabris, Fabio; Palmer, Daniel; de Magalhães, João Pedro; Freitas, Alex A.

Brief Bioinform ; 21(3): 803-814, 2020 05 21.

Article in English | MEDLINE | ID: mdl-30895300

ABSTRACT

Biologists very often use enrichment methods based on statistical hypothesis tests to identify gene properties that are significantly over-represented in a given set of genes of interest, by comparison with a 'background' set of genes. These enrichment methods, although based on rigorous statistical foundations, are not always the best single option to identify patterns in biological data. In many cases, one can also use classification algorithms from the machine-learning field. Unlike enrichment methods, classification algorithms are designed to maximize measures of predictive performance and are capable of analysing combinations of gene properties, instead of one property at a time. In practice, however, the majority of studies use either enrichment or classification methods (rather than both), and there is a lack of literature discussing the pros and cons of both types of method. The goal of this paper is to compare and contrast enrichment and classification methods, offering two contributions. First, we discuss the (to some extent complementary) advantages and disadvantages of both types of methods for identifying gene properties that discriminate between gene classes. Second, we provide a set of high-level recommendations for using enrichment and classification methods. Overall, by highlighting the strengths and the weaknesses of both types of methods we argue that both should be used in bioinformatics analyses.

Subject(s)

Computational Biology/methods , Gene Expression Profiling/methods , Machine Learning , Algorithms
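
The enrichment side of the comparison in the abstract above can be illustrated with a minimal sketch: a one-sided hypergeometric over-representation test for a single gene property against a background set. The function name and the counts are hypothetical, chosen only for illustration, and are not taken from the paper.

```python
from math import comb


def enrichment_p_value(k, n, K, N):
    """One-sided hypergeometric p-value P(X >= k).

    N: genes in the background set
    K: background genes that carry the property
    n: genes of interest
    k: genes of interest that carry the property
    """
    total = comb(N, n)
    upper = min(n, K)
    # math.comb returns 0 when the second argument exceeds the first,
    # so impossible configurations contribute nothing to the sum.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


if __name__ == "__main__":
    # Hypothetical counts: 1000 background genes, 50 annotated with the
    # property, 20 genes of interest, 5 of which carry the property.
    print(enrichment_p_value(k=5, n=20, K=50, N=1000))
```

Note that such a test scores one property at a time, which is the limitation the abstract contrasts with classification algorithms that can analyse combinations of properties.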

6.

Investigating the role of Simpson's paradox in the analysis of top-ranked features in high-dimensional bioinformatics datasets.

Freitas, Alex A.

Brief Bioinform ; 21(2): 421-428, 2020 03 23.

Article in English | MEDLINE | ID: mdl-30629111

ABSTRACT

An important problem in bioinformatics consists of identifying the most important features (or predictors), among a large number of features in a given classification dataset. This problem is often addressed by using a machine learning-based feature ranking method to identify a small set of top-ranked predictors (i.e. the most relevant features for classification). The many studies in this area share, however, an important limitation: they ignore the possibility that the top-ranked predictors occur in an instance of Simpson's paradox, where the positive or negative association between a predictor and a class variable reverses sign upon conditioning on each of the values of a third (confounder) variable. In this work, we review and investigate the role of Simpson's paradox in the analysis of top-ranked predictors in high-dimensional bioinformatics datasets, in order to avoid the potential danger of misinterpreting an association between a predictor and the class variable. We perform computational experiments using four well-known feature ranking methods from the machine learning field and five high-dimensional datasets of ageing-related genes, where the predictors are Gene Ontology terms. The results show that occurrences of Simpson's paradox involving top-ranked predictors are much more common for one of the feature ranking methods.

Subject(s)

Computational Biology , Datasets as Topic , Machine Learning
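
The sign reversal described in the abstract above can be checked with a few lines of arithmetic. The sketch below is only an illustration under stated assumptions: a binary predictor, a binary class and a two-valued confounder, with illustrative counts that are not data from the paper. All function and variable names are hypothetical.

```python
def rate_difference(table):
    """P(class=1 | predictor=1) - P(class=1 | predictor=0).

    table maps a predictor value (1 or 0) to (positive_count, total_count).
    """
    pos1, tot1 = table[1]
    pos0, tot0 = table[0]
    return pos1 / tot1 - pos0 / tot0


def pool(strata):
    """Sum the counts over all confounder values (i.e. ignore the confounder)."""
    return {
        v: (sum(t[v][0] for t in strata.values()),
            sum(t[v][1] for t in strata.values()))
        for v in (0, 1)
    }


def is_simpson_reversal(strata):
    """True if every stratum's association has the opposite sign to the pooled one."""
    pooled_positive = rate_difference(pool(strata)) > 0
    return all((rate_difference(t) > 0) != pooled_positive for t in strata.values())


# Illustrative counts: one table per value of the confounder.
strata = {
    "confounder=a": {1: (81, 87), 0: (234, 270)},
    "confounder=b": {1: (192, 263), 0: (55, 80)},
}

if __name__ == "__main__":
    for name, table in strata.items():
        print(name, round(rate_difference(table), 3))
    print("pooled", round(rate_difference(pool(strata)), 3))
    print("reversal:", is_simpson_reversal(strata))
```

Within each confounder value the predictor is positively associated with the class, yet the pooled association is negative, which is why a top-ranked predictor interpreted only on pooled data can be misread.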

7.

Using deep learning to associate human genes with age-related diseases.

Fabris, Fabio; Palmer, Daniel; Salama, Khalid M; de Magalhães, João Pedro; Freitas, Alex A.

Bioinformatics ; 36(7): 2202-2208, 2020 04 01.

Article in English | MEDLINE | ID: mdl-31845988

ABSTRACT

MOTIVATION: One way to identify genes possibly associated with ageing is to build a classification model (from the machine learning field) capable of classifying genes as associated with multiple age-related diseases. To build this model, we use a pre-compiled list of human genes associated with age-related diseases and apply a novel Deep Neural Network (DNN) method to find associations between gene descriptors (e.g. Gene Ontology terms, protein-protein interaction data and biological pathway information) and age-related diseases. RESULTS: The novelty of our new DNN method is its modular architecture, which has the capability of combining several sources of biological data to predict which ageing-related diseases a gene is associated with (if any). Our DNN method achieves better predictive performance than standard DNN approaches, a Gradient Boosted Tree classifier (a strong baseline method) and a Logistic Regression classifier. Given the DNN model produced by our method, we use two approaches to identify human genes that are not known to be associated with age-related diseases according to our dataset. First, we investigate genes that are close to other disease-associated genes in a complex multi-dimensional feature space learned by the DNN algorithm. Second, using the class label probabilities output by our DNN approach, we identify genes with a high probability of being associated with age-related diseases according to the model. We provide evidence of these putative associations retrieved from the DNN model with literature support. AVAILABILITY AND IMPLEMENTATION: The source code and datasets can be found at: https://github.com/fabiofabris/Bioinfo2019. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Deep Learning , Machine Learning , Aging , Gene Ontology , Humans , Neural Networks, Computer

8.

The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens.

Zhou, Naihui; Jiang, Yuxiang; Bergquist, Timothy R; Lee, Alexandra J; Kacsoh, Balint Z; Crocker, Alex W; Lewis, Kimberley A; Georghiou, George; Nguyen, Huy N; Hamid, Md Nafiz; Davis, Larry; Dogan, Tunca; Atalay, Volkan; Rifaioglu, Ahmet S; Dalkiran, Alperen; Cetin Atalay, Rengul; Zhang, Chengxin; Hurto, Rebecca L; Freddolino, Peter L; Zhang, Yang; Bhat, Prajwal; Supek, Fran; Fernández, José M; Gemovic, Branislava; Perovic, Vladimir R; Davidovic, Radoslav S; Sumonja, Neven; Veljkovic, Nevena; Asgari, Ehsaneddin; Mofrad, Mohammad R K; Profiti, Giuseppe; Savojardo, Castrense; Martelli, Pier Luigi; Casadio, Rita; Boecker, Florian; Schoof, Heiko; Kahanda, Indika; Thurlby, Natalie; McHardy, Alice C; Renaux, Alexandre; Saidi, Rabie; Gough, Julian; Freitas, Alex A; Antczak, Magdalena; Fabris, Fabio; Wass, Mark N; Hou, Jie; Cheng, Jianlin; Wang, Zheng; Romero, Alfonso E.

Genome Biol ; 20(1): 244, 2019 11 19.

Article in English | MEDLINE | ID: mdl-31744546

ABSTRACT

BACKGROUND: The Critical Assessment of Functional Annotation (CAFA) is an ongoing, global, community-driven effort to evaluate and improve the computational annotation of protein function. RESULTS: Here, we report on the results of the third CAFA challenge, CAFA3, that featured an expanded analysis over the previous CAFA rounds, both in terms of volume of data analyzed and the types of analysis performed. In a major new development, computational predictions and assessment goals drove some of the experimental assays, resulting in new functional annotations for more than 1000 genes. Specifically, we performed experimental whole-genome mutation screening in Candida albicans and Pseudomonas aeruginosa genomes, which provided us with genome-wide experimental data for genes associated with biofilm formation and motility. We further performed targeted assays on selected genes in Drosophila melanogaster, which we suspected of being involved in long-term memory. CONCLUSION: We conclude that while predictions of the molecular function and biological process annotations have slightly improved over time, those of the cellular component have not. Term-centric prediction of experimental annotations remains equally challenging; although the performance of the top methods is significantly better than the expectations set by baseline methods in C. albicans and D. melanogaster, it leaves considerable room and need for improvement. Finally, we report that the CAFA community now involves a broad range of participants with expertise in bioinformatics, biological experimentation, biocuration, and bio-ontologies, working together to improve functional annotation, computational function prediction, and our ability to manage big data in the era of large experimental screens.

Subject(s)

Molecular Sequence Annotation/trends , Animals , Biofilms , Candida albicans/genetics , Drosophila melanogaster/genetics , Genome, Bacterial , Genome, Fungal , Humans , Locomotion , Memory, Long-Term , Molecular Sequence Annotation/methods , Pseudomonas aeruginosa/genetics

9.

A new approach for interpreting Random Forest models and its application to the biology of ageing.

Fabris, Fabio; Doherty, Aoife; Palmer, Daniel; de Magalhães, João Pedro; Freitas, Alex A.

Bioinformatics ; 34(14): 2449-2456, 2018 07 15.

Article in English | MEDLINE | ID: mdl-29462247

ABSTRACT

Motivation: This work uses the Random Forest (RF) classification algorithm to predict if a gene is over-expressed, under-expressed or has no change in expression with age in the brain. RFs have high predictive power, and RF models can be interpreted using a feature (variable) importance measure. However, current feature importance measures evaluate a feature as a whole (all feature values). We show that, for a popular type of biological data (Gene Ontology-based), usually only one value of a feature is particularly important for classification and the interpretation of the RF model. Hence, we propose a new algorithm for identifying the most important and most informative feature values in an RF model. Results: The new feature importance measure identified highly relevant Gene Ontology terms for the aforementioned gene classification task, producing a feature ranking that is much more informative to biologists than an alternative, state-of-the-art feature importance measure. Availability and implementation: The dataset and source codes used in this paper are available as 'Supplementary Material' and the description of the data can be found at: https://fabiofabris.github.io/bioinfo2018/web/. Supplementary information: Supplementary data are available at Bioinformatics online.

Subject(s)

Aging/genetics , Brain/metabolism , Computational Biology/methods , Gene Expression Regulation , Software , Animals , Gene Ontology , Humans , Machine Learning

10.

Machine learning for predicting lifespan-extending chemical compounds.

Barardo, Diogo G; Newby, Danielle; Thornton, Daniel; Ghafourian, Taravat; de Magalhães, João Pedro; Freitas, Alex A.

Aging (Albany NY) ; 9(7): 1721-1737, 2017 07 18.

Article in English | MEDLINE | ID: mdl-28783712

ABSTRACT

Increasing age is a risk factor for many diseases; therefore developing pharmacological interventions that slow down ageing and consequently postpone the onset of many age-related diseases is highly desirable. In this work we analyse data from the DrugAge database, which contains chemical compounds and their effect on the lifespan of model organisms. Predictive models were built using the machine learning method random forests to predict whether or not a chemical compound will increase Caenorhabditis elegans' lifespan, using as features Gene Ontology (GO) terms annotated for proteins targeted by the compounds and chemical descriptors calculated from each compound's chemical structure. The model with the best predictive accuracy used both biological and chemical features, achieving a prediction accuracy of 80%. The top 20 most important GO terms include those related to mitochondrial processes, to enzymatic and immunological processes, and terms related to metabolic and transport processes. We applied our best model to predict compounds which are more likely to increase C. elegans' lifespan in the DGIdb database, where the effect of the compounds on an organism's lifespan is unknown. The top hit compounds can be broadly divided into four groups: compounds affecting mitochondria, compounds for cancer treatment, anti-inflammatories, and compounds for gonadotropin-releasing hormone therapies.

Subject(s)

Databases, Pharmaceutical , Longevity/drug effects , Machine Learning , Animals , Caenorhabditis elegans

11.

A review of supervised machine learning applied to ageing research.

Fabris, Fabio; Magalhães, João Pedro de; Freitas, Alex A.

Biogerontology ; 18(2): 171-188, 2017 04.

Article in English | MEDLINE | ID: mdl-28265788

ABSTRACT

Broadly speaking, supervised machine learning is the computational task of learning correlations between variables in annotated data (the training set), and using this information to create a predictive model capable of inferring annotations for new data, whose annotations are not known. Ageing is a complex process that affects nearly all animal species. This process can be studied at several levels of abstraction, in different organisms and with different objectives in mind. Not surprisingly, the diversity of the supervised machine learning algorithms applied to answer biological questions reflects the complexities of the underlying ageing processes being studied. Many works using supervised machine learning to study the ageing process have been recently published, so it is timely to review these works and to discuss their main findings and weaknesses. In summary, the main findings of the reviewed papers are: the link between specific types of DNA repair and ageing; ageing-related proteins tend to be highly connected and seem to play a central role in molecular pathways; ageing/longevity is linked with autophagy and apoptosis, nutrient receptor genes, and copper and iron ion transport. Additionally, several biomarkers of ageing were found by machine learning. Despite some interesting machine learning results, we also identified a weakness of current works on this topic: only one of the reviewed papers has corroborated the computational results of machine learning algorithms through wet-lab experiments. In conclusion, supervised machine learning has contributed to advancing our knowledge and has provided novel insights on ageing, yet future work should place a greater emphasis on validating the predictions.

Subject(s)

Aging/physiology , Computational Biology/methods , Models, Biological , Research Design , Supervised Machine Learning , Animals , Computer Simulation , Humans

12.

Simultaneous Prediction of four ATP-binding Cassette Transporters' Substrates Using Multi-label QSAR.

Aniceto, Natália; Freitas, Alex A; Bender, Andreas; Ghafourian, Taravat.

Mol Inform ; 35(10): 514-528, 2016 10.

Article in English | MEDLINE | ID: mdl-27582431

ABSTRACT

Efflux by the ATP-binding cassette (ABC) transporters affects the pharmacokinetic profile of drugs, has been implicated in drug-drug interactions and plays a major role in multi-drug resistance in cancer. It is therefore important for the pharmaceutical industry to be able to understand what phenomena rule ABC substrate recognition. Given the high degree of substrate overlap between members of the ABC transporter family, it is advantageous to employ a multi-label classification approach where predictions made for one transporter can be used for modeling of the other ABC transporters. Here, we present decision tree-based QSAR classification models able to simultaneously predict substrates and non-substrates for BCRP1, P-gp/MDR1, MRP1 and MRP2, using a dataset of 1493 compounds. To this end, two multi-label classification QSAR modelling approaches were adopted: Binary Relevance (BR) and Classifier Chain (CC). Even though both multi-label models yielded similar predictive performances in terms of overall accuracies (close to 70%), the CC model overcame the problem of skewed performance towards identifying substrates compared with non-substrates, which is a common problem in the literature. The models were thoroughly validated by using external testing, applicability domain and activity cliffs characterization. In conclusion, a multi-label classification approach is an appropriate alternative for the prediction of ABC efflux.

Subject(s)

ATP-Binding Cassette Transporters/chemistry , Ligands , Models, Molecular , Quantitative Structure-Activity Relationship , ATP-Binding Cassette Transporters/metabolism , Algorithms , Molecular Structure , Protein Binding , Reproducibility of Results , Substrate Specificity
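
The difference between the two multi-label strategies named in the abstract above, Binary Relevance (BR) and Classifier Chain (CC), can be sketched in a few lines. This is a toy illustration only: a 1-nearest-neighbour base learner is swapped in for brevity (the paper reports decision tree-based models), and all function names and data are hypothetical.

```python
def nearest_neighbour_fit(X, y):
    """Return a 1-NN predictor; a stand-in base learner, not the paper's decision trees."""
    def predict(x):
        best = min(
            range(len(X)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(X[i], x)),
        )
        return y[best]
    return predict


def fit_binary_relevance(X, Y):
    """BR: one independent classifier per label, each seeing only the input features."""
    n_labels = len(Y[0])
    models = [nearest_neighbour_fit(X, [row[j] for row in Y]) for j in range(n_labels)]
    return lambda x: [m(x) for m in models]


def fit_classifier_chain(X, Y):
    """CC: classifier j also receives labels 0..j-1 as extra input features."""
    n_labels = len(Y[0])
    models = []
    for j in range(n_labels):
        X_aug = [list(x) + list(row[:j]) for x, row in zip(X, Y)]
        models.append(nearest_neighbour_fit(X_aug, [row[j] for row in Y]))

    def predict(x):
        preds = []
        for m in models:
            # At prediction time the earlier labels are the chain's own predictions.
            preds.append(m(list(x) + preds))
        return preds
    return predict


if __name__ == "__main__":
    # Hypothetical descriptors (X) and two substrate labels per compound (Y).
    X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
    Y = [[0, 0], [0, 1], [1, 0], [1, 1]]
    print(fit_binary_relevance(X, Y)([1.0, 1.0]))
    print(fit_classifier_chain(X, Y)([0.0, 1.0]))
```

The chain lets a prediction for one transporter inform the next, which is the property the abstract motivates by the substrate overlap between ABC transporters.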

13.

New KEGG pathway-based interpretable features for classifying ageing-related mouse proteins.

Fabris, Fabio; Freitas, Alex A.

Bioinformatics ; 32(19): 2988-95, 2016 10 01.

Article in English | MEDLINE | ID: mdl-27318209

ABSTRACT

MOTIVATION: The incidence of ageing-related diseases has been constantly increasing in the last decades, raising the need for creating effective methods to analyze ageing-related protein data. These methods should have high predictive accuracy and be easily interpretable by ageing experts. To enable this, one needs interpretable classification models (supervised machine learning) and features with rich biological meaning. In this paper we propose two interpretable feature types based on Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways and compare them with traditional feature types in hierarchical classification (a more challenging classification task regarding predictive performance) and binary classification (a classification task producing easier to interpret classification models). As far as we know, this work is the first to: (i) explore the potential of the KEGG pathway data in the hierarchical classification setting, (ii) use the graph structure of KEGG pathways to create a feature type that quantifies the influence of a current protein on another specific protein within a KEGG pathway graph and (iii) propose a method for interpreting the classification models induced using KEGG features. RESULTS: We performed tests measuring predictive accuracy considering hierarchical and binary class labels extracted from the Mouse Phenotype Ontology. One of the KEGG feature types leads to the highest predictive accuracy among five individual feature types across three hierarchical classification algorithms. Additionally, the combination of the two KEGG feature types proposed in this work results in one of the best predictive accuracies when using the binary class version of our datasets, at the same time enabling the extraction of knowledge from ageing-related data using quantitative influence information. AVAILABILITY AND IMPLEMENTATION: The datasets created in this paper will be freely available after publication. CONTACT: ff79@kent.ac.uk SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Aging , Genome , Proteins , Algorithms , Animals , Mice , Phenotype

14.

An Extensive Empirical Comparison of Probabilistic Hierarchical Classifiers in Datasets of Ageing-Related Genes.

Fabris, Fabio; Freitas, Alex A; Tullet, Jennifer M A.

IEEE/ACM Trans Comput Biol Bioinform ; 13(6): 1045-1058, 2016.

Article in English | MEDLINE | ID: mdl-26661786

ABSTRACT

This study comprehensively evaluates the performance of five types of probabilistic hierarchical classification methods used for predicting Gene Ontology (GO) terms related to ageing. Of those tested, a new hybrid of a Local Hierarchical Classifier (LHC) and the Predictive Clustering Tree algorithm (LHC-PCT) had the best predictive accuracy results. We also tested the impact of two types of variations in most hierarchical classification algorithms, namely: (a) changing the base algorithm (we tested Naive Bayes and Support Vector Machines), and (b) using or omitting the Correlation-based Feature Selection (CFS) algorithm in a pre-processing step. In total, we evaluated the predictive performance of 17 variations of hierarchical classifiers across 15 datasets of ageing and longevity-related genes. We conclude that the LHC-PCT algorithm ranks better across several tests (seven out of 12). In addition, we interpreted the models generated by the PCT algorithm to show how hierarchical classification algorithms can be used to extract biological insights out of the ageing-related datasets that we compiled.

Subject(s)

Aging/genetics , Gene Expression Profiling/methods , Models, Genetic , Models, Statistical , Pattern Recognition, Automated/methods , Proteome/genetics , Algorithms , Computer Simulation , Data Mining/methods , Databases, Genetic , Humans , Machine Learning

15.

Improving the Interpretability of Classification Rules Discovered by an Ant Colony Algorithm: Extended Results.

Otero, Fernando E B; Freitas, Alex A.

Evol Comput ; 24(3): 385-409, 2016.

Article in English | MEDLINE | ID: mdl-26066807

ABSTRACT

Most ant colony optimization (ACO) algorithms for inducing classification rules use an ACO-based procedure to create a rule in a one-at-a-time fashion. An improved search strategy has been proposed in the cAnt-Miner_PB algorithm, where an ACO-based procedure is used to create a complete list of rules (ordered rules), i.e., the ACO search is guided by the quality of a list of rules instead of an individual rule. In this paper we propose an extension of the cAnt-Miner_PB algorithm to discover a set of rules (unordered rules). The main motivations for this work are to improve the interpretation of individual rules by discovering a set of rules and to evaluate the impact on the predictive accuracy of the algorithm. We also propose a new measure to evaluate the interpretability of the discovered rules to mitigate the fact that the commonly used model size measure ignores how the rules are used to make a class prediction. Comparisons with state-of-the-art rule induction algorithms, support vector machines, and the cAnt-Miner_PB producing ordered rules are also presented.

Subject(s)

Algorithms , Ants/physiology , Animals , Computational Biology

16.

Predicting the Pro-Longevity or Anti-Longevity Effect of Model Organism Genes with New Hierarchical Feature Selection Methods.

Wan, Cen; Freitas, Alex A; de Magalhães, João Pedro.

IEEE/ACM Trans Comput Biol Bioinform ; 12(2): 262-75, 2015.

Article in English | MEDLINE | ID: mdl-26357215

ABSTRACT

Ageing is a highly complex biological process that is still poorly understood. With the growing amount of ageing-related data available on the web, in particular concerning the genetics of ageing, it is timely to apply data mining methods to that data, in order to try to discover novel patterns that may assist ageing research. In this work, we introduce new hierarchical feature selection methods for the classification task of data mining and apply them to ageing-related data from four model organisms: Caenorhabditis elegans (worm), Saccharomyces cerevisiae (yeast), Drosophila melanogaster (fly), and Mus musculus (mouse). The main novel aspect of the proposed feature selection methods is that they exploit hierarchical relationships in the set of features (Gene Ontology terms) in order to improve the predictive accuracy of the Naïve Bayes and 1-Nearest Neighbour (1-NN) classifiers, which are used to classify model organisms' genes into pro-longevity or anti-longevity genes. The results show that our hierarchical feature selection methods, when used together with Naïve Bayes and 1-NN classifiers, obtain higher predictive accuracy than the standard (without feature selection) Naïve Bayes and 1-NN classifiers, respectively. We also discuss the biological relevance of a number of Gene Ontology terms very frequently selected by our algorithms in our datasets.

Subject(s)

Cellular Senescence/genetics; Computational Biology/methods; Data Mining/methods; Models, Genetic; Algorithms; Animals; Bayes Theorem; Caenorhabditis elegans/genetics; Databases, Genetic; Drosophila melanogaster/genetics; Gene Ontology; Mice; Saccharomyces cerevisiae/genetics

17.

Predicting volume of distribution with decision tree-based regression methods using predicted tissue:plasma partition coefficients.

Freitas, Alex A; Limbu, Kriti; Ghafourian, Taravat.

J Cheminform ; 7: 6, 2015.

Article in English | MEDLINE | ID: mdl-25767566

ABSTRACT

BACKGROUND: Volume of distribution is an important pharmacokinetic property that indicates the extent of a drug's distribution in the body tissues. This paper addresses the problem of how to estimate the apparent volume of distribution at steady state (Vss) of chemical compounds in the human body using decision tree-based regression methods from the area of data mining (or machine learning). Hence, the pros and cons of several different types of decision tree-based regression methods have been discussed. The regression methods predict Vss using, as predictive features, both the compounds' molecular descriptors and the compounds' tissue:plasma partition coefficients (Kt:p) - often used in physiologically-based pharmacokinetics. Therefore, this work has assessed whether the data mining-based prediction of Vss can be made more accurate by using as input not only the compounds' molecular descriptors but also (a subset of) their predicted Kt:p values. RESULTS: Comparison of the models that used only molecular descriptors, in particular, the Bagging decision tree (mean fold error of 2.33), with those employing predicted Kt:p values in addition to the molecular descriptors, such as the Bagging decision tree using adipose Kt:p (mean fold error of 2.29), indicated that the use of predicted Kt:p values as descriptors may be beneficial for accurate prediction of Vss using decision trees if prior feature selection is applied. CONCLUSIONS: Decision tree based models presented in this work have an accuracy that is reasonable and similar to the accuracy of reported Vss inter-species extrapolations in the literature. The estimation of Vss for new compounds in drug discovery will benefit from methods that are able to integrate large and varied sources of data and flexible non-linear data mining methods such as decision trees, which can produce interpretable models. 
Graphical Abstract: Decision trees for the prediction of tissue partition coefficient and volume of distribution of drugs.
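The abstract reports mean fold errors (2.33 and 2.29) without defining the measure. A minimal sketch follows, assuming the common geometric definition, 10 raised to the mean of |log10(predicted / observed)|; the paper's exact formula may differ, and the function name and sample values are hypothetical.

```python
import math

def mean_fold_error(predicted, observed):
    """Geometric mean fold error: 10 ** mean(|log10(pred / obs)|).

    A value of 1.0 means perfect prediction; 2.0 means predictions are,
    on average, off by a factor of two in either direction.
    Assumed definition, not taken from the abstract.
    """
    if not predicted or len(predicted) != len(observed):
        raise ValueError("inputs must be non-empty and of equal length")
    abs_logs = [abs(math.log10(p / o)) for p, o in zip(predicted, observed)]
    return 10 ** (sum(abs_logs) / len(abs_logs))

# Hypothetical Vss values (L/kg): one two-fold over-prediction and
# one two-fold under-prediction.
mfe = mean_fold_error(predicted=[2.0, 0.5], observed=[1.0, 1.0])
```

Because the error is averaged on a log scale, over- and under-predictions by the same factor are penalised equally.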

18.

Comparing multilabel classification methods for provisional biopharmaceutics class prediction.

Newby, Danielle; Freitas, Alex A; Ghafourian, Taravat.

Mol Pharm ; 12(1): 87-102, 2015 Jan 05.

Article in English | MEDLINE | ID: mdl-25397721

ABSTRACT

The biopharmaceutical classification system (BCS) is now well established and utilized for the development and biowaivers of immediate oral dosage forms. The prediction of BCS class can be carried out using multilabel classification. Unlike single label classification, multilabel classification methods predict more than one class label at the same time. This paper compares two multilabel methods, binary relevance and classifier chain, for provisional BCS class prediction. Large data sets of permeability and solubility of drug and drug-like compounds were obtained from the literature and were used to build models using decision trees. The separate permeability and solubility models were validated, and a BCS validation set of 127 compounds where both permeability and solubility were known was used to compare the two aforementioned multilabel classification methods for provisional BCS class prediction. Overall, the results indicate that the classifier chain method, which takes into account label interactions, performed better compared to the binary relevance method. This work offers a comparison of multilabel methods and shows the potential of the classifier chain multilabel method for improved biological property predictions for use in drug discovery and development.
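The structural difference between the two multilabel methods can be sketched in a few lines of Python. This is an illustration under stated assumptions: a 1-nearest-neighbour learner stands in for the decision trees used in the paper, the chain order (permeability before solubility) is assumed, and the single-descriptor training data are made up. Only the mapping from the two binary labels to BCS classes I to IV follows the standard BCS definition.

```python
def nearest_label(train_x, train_y, x):
    """1-nearest-neighbour base learner (a stand-in for a decision tree)."""
    best = min(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], x)),
    )
    return train_y[best]

def binary_relevance(train_x, perm, sol, x):
    """Predict each label independently from the descriptors alone."""
    return nearest_label(train_x, perm, x), nearest_label(train_x, sol, x)

def classifier_chain(train_x, perm, sol, x):
    """Predict permeability first, then pass it as an extra input feature
    to the solubility model, so label interactions can be exploited."""
    p = nearest_label(train_x, perm, x)
    chained_x = [row + [label] for row, label in zip(train_x, perm)]
    s = nearest_label(chained_x, sol, x + [p])
    return p, s

def bcs_class(high_perm, high_sol):
    """Map the two binary labels to the standard BCS class."""
    return {(1, 1): "I", (1, 0): "II", (0, 1): "III", (0, 0): "IV"}[(high_perm, high_sol)]

# Hypothetical data: one descriptor per compound, 1 = high, 0 = low.
train_x = [[0.0], [1.0], [2.0], [3.0]]
perm = [0, 0, 1, 1]
sol = [1, 1, 0, 0]

br_labels = binary_relevance(train_x, perm, sol, [2.6])
cc_labels = classifier_chain(train_x, perm, sol, [2.6])
predicted_class = bcs_class(*cc_labels)
```

On this toy data both methods agree; they differ only when the predicted permeability changes which training compounds look most similar to the query in the second model.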

Subject(s)

Biopharmaceutics/methods; Chemistry, Pharmaceutical/methods; Models, Theoretical; Administration, Oral; Algorithms; Caco-2 Cells; Computer Simulation; Drug Discovery; Humans; Imaging, Three-Dimensional; Permeability; Regression Analysis; Reproducibility of Results; Solubility

19.

Decision trees to characterise the roles of permeability and solubility on the prediction of oral absorption.

Newby, Danielle; Freitas, Alex A; Ghafourian, Taravat.

Eur J Med Chem ; 90: 751-65, 2015 Jan 27.

Article in English | MEDLINE | ID: mdl-25528330

ABSTRACT

Oral absorption of compounds depends on many physiological, physicochemical and formulation factors. Two important properties that govern oral absorption are in vitro permeability and solubility, which are commonly used as indicators of human intestinal absorption. Despite this, the nature and exact characteristics of the relationship between these parameters are not well understood. In this study a large dataset of human intestinal absorption was collated along with in vitro permeability, aqueous solubility, melting point, and maximum dose for the same compounds. The dataset allowed a permeability threshold to be established objectively to predict high or low intestinal absorption. Using this permeability threshold, classification decision trees incorporating a solubility-related parameter such as experimental or predicted solubility, or the melting point based absorption potential (MPbAP), along with structural molecular descriptors were developed and validated to predict oral absorption class. The decision trees were able to determine the individual roles of permeability and solubility in the oral absorption process. Poorly permeable compounds with high solubility show low intestinal absorption, whereas poorly water soluble compounds with high or low permeability may have high intestinal absorption provided that they have certain molecular characteristics such as a small polar surface or specific topology.

Subject(s)

Decision Trees; Absorption, Physiological; Administration, Oral; Animals; Caco-2 Cells; Dogs; Humans; Madin Darby Canine Kidney Cells; Permeability; Solubility

20.

Pre-processing feature selection for improved C&RT models for oral absorption.

Newby, Danielle; Freitas, Alex A; Ghafourian, Taravat.

J Chem Inf Model ; 53(10): 2730-42, 2013 Oct 28.

Article in English | MEDLINE | ID: mdl-24050619

ABSTRACT

There are currently thousands of molecular descriptors that can be calculated to represent a chemical compound. Utilizing all molecular descriptors in Quantitative Structure-Activity Relationships (QSAR) modeling can result in overfitting, decreased interpretability, and thus reduced model performance. Feature selection methods can overcome some of these problems by drastically reducing the number of molecular descriptors and selecting the molecular descriptors relevant to the property being predicted. In particular, decision trees such as C&RT, although they have an embedded feature selection algorithm, can be inadequate since further down the tree there are fewer compounds available for descriptor selection, and therefore descriptors may be selected which are not optimal. In this work we compare two broad approaches for feature selection: (1) a "two-stage" feature selection procedure, where a pre-processing feature selection method selects a subset of descriptors, and then classification and regression trees (C&RT) selects descriptors from this subset to build a decision tree; (2) a "one-stage" approach where C&RT is used as the only feature selection technique. These methods were applied in order to improve prediction accuracy of QSAR models for oral absorption. Additionally, this work utilizes misclassification costs in model building to overcome the problem of the biased oral absorption data sets with more highly absorbed than poorly absorbed compounds. In most cases the two-stage feature selection with pre-processing approach had higher model accuracy compared with the one-stage approach. Using the top 20 molecular descriptors from the random forest predictor importance method gave the most accurate C&RT classification model. The molecular descriptors selected by the five filter feature selection methods have been compared in relation to oral absorption. 
In conclusion, the use of filter pre-processing feature selection methods and misclassification costs produces models with better interpretability and predictability for the prediction of oral absorption.
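The "two-stage" idea, a filter that keeps the top-k descriptors followed by a cost-sensitive tree learner, can be sketched as follows. This is a simplified illustration and not the paper's pipeline: an absolute Pearson correlation filter is swapped in for the filter methods the authors compared, a single-split stump stands in for a full C&RT tree, and the data and cost values are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def filter_top_k(X, y, k):
    """Stage 1: rank descriptor columns by |correlation| with the class
    label and keep the indices of the k best (assumed filter)."""
    n_cols = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(n_cols)]
    return sorted(range(n_cols), key=lambda j: scores[j], reverse=True)[:k]

def best_stump(X, y, columns, cost_fn, cost_fp):
    """Stage 2: among the pre-selected columns, find the (column, threshold)
    split with the lowest total misclassification cost.
    Class 1 is predicted when the descriptor value exceeds the threshold."""
    best = None
    for j in columns:
        for t in sorted({row[j] for row in X}):
            cost = 0.0
            for row, label in zip(X, y):
                pred = 1 if row[j] > t else 0
                if pred == 0 and label == 1:
                    cost += cost_fn  # missed a class-1 compound
                elif pred == 1 and label == 0:
                    cost += cost_fp  # false alarm on a class-0 compound
            if best is None or cost < best[0]:
                best = (cost, j, t)
    return best

# Hypothetical descriptors (rows = compounds) and binary absorption class.
X = [[1.0, 5.0, 0.2], [2.0, 3.0, 0.9], [3.0, 4.0, 0.1], [4.0, 6.0, 0.8]]
y = [0, 0, 1, 1]

selected = filter_top_k(X, y, k=2)
stump = best_stump(X, y, selected, cost_fn=2.0, cost_fp=1.0)
```

Setting unequal costs shifts the chosen split toward protecting the class whose errors are more expensive, which is how misclassification costs can counteract an imbalanced data set.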

Subject(s)

Decision Trees; Drugs, Investigational/pharmacokinetics; Models, Statistical; Mouth Mucosa/metabolism; Administration, Oral; Algorithms; Drugs, Investigational/chemical synthesis; Humans; Quantitative Structure-Activity Relationship
