1.
BMC Bioinformatics ; 23(1): 10, 2022 Jan 04.
Article in English | MEDLINE | ID: mdl-34983372

ABSTRACT

BACKGROUND: Dietary restriction (DR) is the most studied pro-longevity intervention; however, a complete understanding of its underlying mechanisms remains elusive, and new research directions may emerge from the identification of novel DR-related genes and DR-related genetic features. RESULTS: This work used a Machine Learning (ML) approach to classify ageing-related genes as DR-related or NotDR-related using 9 different types of predictive features: PathDIP pathways, two types of features based on KEGG pathways, two types of Protein-Protein Interactions (PPI) features, Gene Ontology (GO) terms, Genotype Tissue Expression (GTEx) expression features, GeneFriends co-expression features and protein sequence descriptors. Our findings suggested that features biased towards curated knowledge (i.e. GO terms and biological pathways) had the greatest predictive power, while unbiased features (mainly gene expression and co-expression data) had the least predictive power. Moreover, a combination of all the feature types diminished the predictive power compared to predictions based on curated knowledge. Feature importance analysis on the two most predictive classifiers mostly corroborated existing knowledge and supported recent findings linking DR to the Nuclear Factor Erythroid 2-Related Factor 2 (NRF2) signalling pathway and G protein-coupled receptors (GPCR). We then used the two strongest combinations of feature type and ML algorithm to predict DR-relatedness among ageing-related genes currently lacking DR-related annotations in the data, resulting in a set of promising candidate DR-related genes (GOT2, GOT1, TSC1, CTH, GCLM, IRS2 and SESN2) whose predicted DR-relatedness remains to be validated in future wet-lab experiments. CONCLUSIONS: This work demonstrated the strong potential of ML-based techniques to identify DR-associated features, as our findings are consistent with the literature and recent discoveries. Although the inference of new DR-related mechanistic findings based solely on GO terms and biological pathways was limited due to their knowledge-driven nature, the predictive power of these two feature types remained useful, as it allowed inferring new promising candidate DR-related genes.


Subjects
Algorithms , Machine Learning , Gene Ontology , Longevity/genetics
2.
Brief Bioinform ; 21(2): 421-428, 2020 03 23.
Article in English | MEDLINE | ID: mdl-30629111

ABSTRACT

An important problem in bioinformatics consists of identifying the most important features (or predictors) among a large number of features in a given classification dataset. This problem is often addressed by using a machine learning-based feature ranking method to identify a small set of top-ranked predictors (i.e. the most relevant features for classification). The large number of studies in this area has, however, an important limitation: they ignore the possibility that the top-ranked predictors occur in an instance of Simpson's paradox, where the positive or negative association between a predictor and a class variable reverses sign upon conditioning on each of the values of a third (confounder) variable. In this work, we review and investigate the role of Simpson's paradox in the analysis of top-ranked predictors in high-dimensional bioinformatics datasets, in order to avoid the potential danger of misinterpreting an association between a predictor and the class variable. We perform computational experiments using four well-known feature ranking methods from the machine learning field and five high-dimensional datasets of ageing-related genes, where the predictors are Gene Ontology terms. The results show that occurrences of Simpson's paradox involving top-ranked predictors are much more common for one of the feature ranking methods.
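
The sign reversal described in this abstract can be checked mechanically. The following is a minimal sketch, not the authors' procedure: it assumes a binary predictor, a binary class and a categorical confounder, and it measures association as the difference in class-1 rates between predictor values. The function names and the toy counts are hypothetical and do not come from the paper.

```python
from collections import defaultdict

def association_sign(rows):
    """Sign of P(class=1 | feature=1) - P(class=1 | feature=0); 0 if undefined or equal."""
    positives = {0: 0, 1: 0}
    totals = {0: 0, 1: 0}
    for feature, label in rows:
        totals[feature] += 1
        positives[feature] += label
    if totals[0] == 0 or totals[1] == 0:
        return 0
    diff = positives[1] / totals[1] - positives[0] / totals[0]
    return (diff > 0) - (diff < 0)

def is_simpsons_paradox(records):
    """records: iterable of (feature, confounder, label), with feature and label in {0, 1}."""
    records = list(records)
    overall = association_sign([(f, y) for f, _, y in records])
    strata = defaultdict(list)
    for f, c, y in records:
        strata[c].append((f, y))
    stratum_signs = [association_sign(rows) for rows in strata.values()]
    # Paradox: a non-zero overall association that is reversed in every stratum.
    return overall != 0 and all(sign == -overall for sign in stratum_signs)

def expand(feature, confounder, positives, total):
    """Build individual records from aggregated counts (illustrative helper)."""
    return ([(feature, confounder, 1)] * positives
            + [(feature, confounder, 0)] * (total - positives))

# Hypothetical counts: the feature looks harmful overall but helpful in each stratum.
toy = (expand(1, "small", 81, 87) + expand(0, "small", 234, 270)
       + expand(1, "large", 192, 263) + expand(0, "large", 55, 80))

print(is_simpsons_paradox(toy))  # True
```

In a feature-ranking setting one would run such a check for each top-ranked predictor against each candidate confounder; how the paper chooses confounders is not stated in the abstract.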


Subjects
Computational Biology , Datasets as Topic , Machine Learning
3.
Brief Bioinform ; 21(3): 803-814, 2020 05 21.
Article in English | MEDLINE | ID: mdl-30895300

ABSTRACT

Biologists very often use enrichment methods based on statistical hypothesis tests to identify gene properties that are significantly over-represented in a given set of genes of interest, by comparison with a 'background' set of genes. These enrichment methods, although based on rigorous statistical foundations, are not always the best single option to identify patterns in biological data. In many cases, one can also use classification algorithms from the machine-learning field. Unlike enrichment methods, classification algorithms are designed to maximize measures of predictive performance and are capable of analysing combinations of gene properties, instead of one property at a time. In practice, however, the majority of studies use either enrichment or classification methods (rather than both), and there is a lack of literature discussing the pros and cons of both types of method. The goal of this paper is to compare and contrast enrichment and classification methods, offering two contributions. First, we discuss the (to some extent complementary) advantages and disadvantages of both types of methods for identifying gene properties that discriminate between gene classes. Second, we provide a set of high-level recommendations for using enrichment and classification methods. Overall, by highlighting the strengths and the weaknesses of both types of methods we argue that both should be used in bioinformatics analyses.
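
To make the "one property at a time" nature of enrichment methods concrete, here is a minimal sketch of a one-sided over-representation test based on the hypergeometric distribution. The abstract does not say which test the authors have in mind, so the choice of test, the function name and the parameter names are assumptions.

```python
from math import comb

def hypergeom_enrichment_p(k, n, K, N):
    """P(X >= k) for X ~ Hypergeometric.

    N: genes in the background set
    K: background genes annotated with the property
    n: genes in the set of interest
    k: genes of interest annotated with the property
    """
    upper = min(n, K)
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total

# Hypothetical numbers: 40 of 1000 background genes carry a GO term,
# and 8 of the 50 genes of interest carry it.
p_value = hypergeom_enrichment_p(k=8, n=50, K=40, N=1000)
print(p_value < 0.05)
```

Each property is tested independently of the others, which is the contrast the abstract draws with classification algorithms that can evaluate combinations of properties. In practice the resulting p-values would also need a multiple-testing correction.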


Subjects
Computational Biology/methods , Gene Expression Profiling/methods , Machine Learning , Algorithms
4.
Bioinformatics ; 36(7): 2202-2208, 2020 04 01.
Article in English | MEDLINE | ID: mdl-31845988

ABSTRACT

MOTIVATION: One way to identify genes possibly associated with ageing is to build a classification model (from the machine learning field) capable of classifying genes as associated with multiple age-related diseases. To build this model, we use a pre-compiled list of human genes associated with age-related diseases and apply a novel Deep Neural Network (DNN) method to find associations between gene descriptors (e.g. Gene Ontology terms, protein-protein interaction data and biological pathway information) and age-related diseases. RESULTS: The novelty of our new DNN method is its modular architecture, which has the capability of combining several sources of biological data to predict which ageing-related diseases a gene is associated with (if any). Our DNN method achieves better predictive performance than standard DNN approaches, a Gradient Boosted Tree classifier (a strong baseline method) and a Logistic Regression classifier. Given the DNN model produced by our method, we use two approaches to identify human genes that are not known to be associated with age-related diseases according to our dataset. First, we investigate genes that are close to other disease-associated genes in a complex multi-dimensional feature space learned by the DNN algorithm. Second, using the class label probabilities output by our DNN approach, we identify genes with a high probability of being associated with age-related diseases according to the model. We provide evidence of these putative associations retrieved from the DNN model with literature support. AVAILABILITY AND IMPLEMENTATION: The source code and datasets can be found at: https://github.com/fabiofabris/Bioinfo2019. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Deep Learning , Machine Learning , Aging , Gene Ontology , Humans , Neural Networks, Computer
5.
Bioinformatics ; 34(14): 2449-2456, 2018 07 15.
Article in English | MEDLINE | ID: mdl-29462247

ABSTRACT

Motivation: This work uses the Random Forest (RF) classification algorithm to predict if a gene is over-expressed, under-expressed or has no change in expression with age in the brain. RFs have high predictive power, and RF models can be interpreted using a feature (variable) importance measure. However, current feature importance measures evaluate a feature as a whole (all feature values). We show that, for a popular type of biological data (Gene Ontology-based), usually only one value of a feature is particularly important for classification and the interpretation of the RF model. Hence, we propose a new algorithm for identifying the most important and most informative feature values in an RF model. Results: The new feature importance measure identified highly relevant Gene Ontology terms for the aforementioned gene classification task, producing a feature ranking that is much more informative to biologists than an alternative, state-of-the-art feature importance measure. Availability and implementation: The dataset and source codes used in this paper are available as 'Supplementary Material' and the description of the data can be found at: https://fabiofabris.github.io/bioinfo2018/web/. Supplementary information: Supplementary data are available at Bioinformatics online.


Subjects
Aging/genetics , Brain/metabolism , Computational Biology/methods , Gene Expression Regulation , Software , Animals , Gene Ontology , Humans , Machine Learning
6.
Bioinformatics ; 32(19): 2988-95, 2016 10 01.
Article in English | MEDLINE | ID: mdl-27318209

ABSTRACT

MOTIVATION: The incidence of ageing-related diseases has been constantly increasing in the last decades, raising the need for creating effective methods to analyze ageing-related protein data. These methods should have high predictive accuracy and be easily interpretable by ageing experts. To enable this, one needs interpretable classification models (supervised machine learning) and features with rich biological meaning. In this paper we propose two interpretable feature types based on Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways and compare them with traditional feature types in hierarchical classification (a more challenging classification task regarding predictive performance) and binary classification (a classification task producing easier to interpret classification models). As far as we know, this work is the first to: (i) explore the potential of the KEGG pathway data in the hierarchical classification setting, (ii) use the graph structure of KEGG pathways to create a feature type that quantifies the influence of a current protein on another specific protein within a KEGG pathway graph and (iii) propose a method for interpreting the classification models induced using KEGG features. RESULTS: We performed tests measuring predictive accuracy considering hierarchical and binary class labels extracted from the Mouse Phenotype Ontology. One of the KEGG feature types leads to the highest predictive accuracy among five individual feature types across three hierarchical classification algorithms. Additionally, the combination of the two KEGG feature types proposed in this work results in one of the best predictive accuracies when using the binary class version of our datasets, at the same time enabling the extraction of knowledge from ageing-related data using quantitative influence information. AVAILABILITY AND IMPLEMENTATION: The datasets created in this paper will be freely available after publication. CONTACT: ff79@kent.ac.uk SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Aging , Genome , Proteins , Algorithms , Animals , Mice , Phenotype
7.
Biogerontology ; 18(2): 171-188, 2017 04.
Article in English | MEDLINE | ID: mdl-28265788

ABSTRACT

Broadly speaking, supervised machine learning is the computational task of learning correlations between variables in annotated data (the training set), and using this information to create a predictive model capable of inferring annotations for new data, whose annotations are not known. Ageing is a complex process that affects nearly all animal species. This process can be studied at several levels of abstraction, in different organisms and with different objectives in mind. Not surprisingly, the diversity of the supervised machine learning algorithms applied to answer biological questions reflects the complexities of the underlying ageing processes being studied. Many works using supervised machine learning to study the ageing process have been recently published, so it is timely to review these works and to discuss their main findings and weaknesses. In summary, the main findings of the reviewed papers are: the link between specific types of DNA repair and ageing; ageing-related proteins tend to be highly connected and seem to play a central role in molecular pathways; ageing/longevity is linked with autophagy and apoptosis, nutrient receptor genes, and copper and iron ion transport. Additionally, several biomarkers of ageing were found by machine learning. Despite some interesting machine learning results, we also identified a weakness of current works on this topic: only one of the reviewed papers has corroborated the computational results of machine learning algorithms through wet-lab experiments. In conclusion, supervised machine learning has contributed to advancing our knowledge and has provided novel insights on ageing, yet future work should place a greater emphasis on validating the predictions.


Subjects
Aging/physiology , Computational Biology/methods , Models, Biological , Research Design , Supervised Machine Learning , Animals , Computer Simulation , Humans
8.
Evol Comput ; 24(3): 385-409, 2016.
Article in English | MEDLINE | ID: mdl-26066807

ABSTRACT

Most ant colony optimization (ACO) algorithms for inducing classification rules use an ACO-based procedure to create a rule in a one-at-a-time fashion. An improved search strategy has been proposed in the cAnt-Miner[Formula: see text] algorithm, where an ACO-based procedure is used to create a complete list of rules (ordered rules), i.e., the ACO search is guided by the quality of a list of rules instead of an individual rule. In this paper we propose an extension of the cAnt-Miner[Formula: see text] algorithm to discover a set of rules (unordered rules). The main motivations for this work are to improve the interpretation of individual rules by discovering a set of rules and to evaluate the impact on the predictive accuracy of the algorithm. We also propose a new measure to evaluate the interpretability of the discovered rules to mitigate the fact that the commonly used model size measure ignores how the rules are used to make a class prediction. Comparisons with state-of-the-art rule induction algorithms, support vector machines, and the cAnt-Miner[Formula: see text] producing ordered rules are also presented.


Subjects
Algorithms , Ants/physiology , Animals , Computational Biology
9.
Mol Pharm ; 12(1): 87-102, 2015 Jan 05.
Article in English | MEDLINE | ID: mdl-25397721

ABSTRACT

The biopharmaceutical classification system (BCS) is now well established and utilized for the development and biowaivers of immediate oral dosage forms. The prediction of BCS class can be carried out using multilabel classification. Unlike single label classification, multilabel classification methods predict more than one class label at the same time. This paper compares two multilabel methods, binary relevance and classifier chain, for provisional BCS class prediction. Large data sets of permeability and solubility of drug and drug-like compounds were obtained from the literature and were used to build models using decision trees. The separate permeability and solubility models were validated, and a BCS validation set of 127 compounds where both permeability and solubility were known was used to compare the two aforementioned multilabel classification methods for provisional BCS class prediction. Overall, the results indicate that the classifier chain method, which takes into account label interactions, performed better compared to the binary relevance method. This work offers a comparison of multilabel methods and shows the potential of the classifier chain multilabel method for improved biological property predictions for use in drug discovery and development.


Subjects
Biopharmaceutics/methods , Chemistry, Pharmaceutical/methods , Models, Theoretical , Administration, Oral , Algorithms , Caco-2 Cells , Computer Simulation , Drug Discovery , Humans , Imaging, Three-Dimensional , Permeability , Regression Analysis , Reproducibility of Results , Solubility
10.
Comput Biol Med ; 180: 108999, 2024 Sep.
Article in English | MEDLINE | ID: mdl-39137672

ABSTRACT

Dietary Restriction (DR) is one of the most popular anti-ageing interventions; recently, Machine Learning (ML) has been explored to identify potential DR-related genes among ageing-related genes, aiming to minimize costly wet lab experiments needed to expand our knowledge on DR. However, to train a model from positive (DR-related) and negative (non-DR-related) examples, the existing ML approach naively labels genes without known DR relation as negative examples, assuming that lack of DR-related annotation for a gene represents evidence of absence of DR-relatedness, rather than absence of evidence. This hinders the reliability of the negative examples (non-DR-related genes) and the method's ability to identify novel DR-related genes. This work introduces a novel gene prioritization method based on the two-step Positive-Unlabelled (PU) Learning paradigm: using a similarity-based, KNN-inspired approach, our method first selects reliable negative examples among the genes without known DR associations. Then, these reliable negatives and all known positives are used to train a classifier that effectively differentiates DR-related and non-DR-related genes, which is finally employed to generate a more reliable ranking of promising genes for novel DR-relatedness. Our method significantly outperforms (p<0.05) the existing state-of-the-art approach in three predictive accuracy metrics with up to ∼40% lower computational cost in the best case, and we identify 4 new promising DR-related genes (PRKAB1, PRKAB2, IRS2, PRKAG1), all with evidence from the existing literature supporting their potential DR-related role.
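
The two-step Positive-Unlabelled idea in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the published method: genes are represented as sets of binary features, similarity is Jaccard similarity, an unlabelled gene counts as a reliable negative when none of its k most similar genes is a known positive, and step 2 uses a simple similarity-difference score as a stand-in for the trained classifier. All gene names, feature names and thresholds are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two feature sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def reliable_negatives(positives, unlabelled, k=2, max_pos_neighbours=0):
    """Step 1: keep unlabelled genes whose k nearest genes contain few known positives."""
    pool = ([(name, feats, True) for name, feats in positives.items()]
            + [(name, feats, False) for name, feats in unlabelled.items()])
    selected = {}
    for name, feats in unlabelled.items():
        neighbours = sorted(
            (other for other in pool if other[0] != name),
            key=lambda other: jaccard(feats, other[1]),
            reverse=True,
        )[:k]
        if sum(is_pos for _, _, is_pos in neighbours) <= max_pos_neighbours:
            selected[name] = feats
    return selected

def rank_candidates(positives, negatives, candidates):
    """Step 2 (stand-in scorer): mean similarity to positives minus mean similarity to negatives."""
    def mean_sim(feats, group):
        if not group:
            return 0.0
        return sum(jaccard(feats, g) for g in group.values()) / len(group)
    scores = {name: mean_sim(feats, positives) - mean_sim(feats, negatives)
              for name, feats in candidates.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical toy data: feature names stand in for annotations such as GO terms.
positives = {"P1": {"a", "b"}, "P2": {"a", "b", "c"}}
unlabelled = {"U1": {"a", "b", "d"}, "U2": {"x", "y"},
              "U3": {"x", "y", "z"}, "U4": {"y", "z"}}

negatives = reliable_negatives(positives, unlabelled, k=2)
ranking = rank_candidates(positives, negatives, unlabelled)
print(sorted(negatives), ranking[0])
```

The point of step 1 is that genes resembling known positives (U1 here) are left out of the negative set instead of being labelled negative by default, which addresses the "absence of evidence" problem the abstract describes.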


Subjects
Aging , Machine Learning , Humans , Aging/genetics , Aging/physiology , Caloric Restriction , Computational Biology/methods
11.
J Chem Inf Model ; 53(10): 2730-42, 2013 Oct 28.
Article in English | MEDLINE | ID: mdl-24050619

ABSTRACT

There are currently thousands of molecular descriptors that can be calculated to represent a chemical compound. Utilizing all molecular descriptors in Quantitative Structure-Activity Relationships (QSAR) modeling can result in overfitting, decreased interpretability, and thus reduced model performance. Feature selection methods can overcome some of these problems by drastically reducing the number of molecular descriptors and selecting the molecular descriptors relevant to the property being predicted. In particular, decision trees such as C&RT, although they have an embedded feature selection algorithm, can be inadequate since further down the tree there are fewer compounds available for descriptor selection, and therefore descriptors may be selected which are not optimal. In this work we compare two broad approaches for feature selection: (1) a "two-stage" feature selection procedure, where a pre-processing feature selection method selects a subset of descriptors, and then classification and regression trees (C&RT) selects descriptors from this subset to build a decision tree; (2) a "one-stage" approach where C&RT is used as the only feature selection technique. These methods were applied in order to improve prediction accuracy of QSAR models for oral absorption. Additionally, this work utilizes misclassification costs in model building to overcome the problem of the biased oral absorption data sets with more highly absorbed than poorly absorbed compounds. In most cases the two-stage feature selection with pre-processing approach had higher model accuracy compared with the one-stage approach. Using the top 20 molecular descriptors from the random forest predictor importance method gave the most accurate C&RT classification model. The molecular descriptors selected by the five filter feature selection methods have been compared in relation to oral absorption. In conclusion, the use of filter pre-processing feature selection methods and misclassification costs produces models with better interpretability and predictability for the prediction of oral absorption.


Subjects
Decision Trees , Drugs, Investigational/pharmacokinetics , Models, Statistical , Mouth Mucosa/metabolism , Administration, Oral , Algorithms , Drugs, Investigational/chemical synthesis , Humans , Quantitative Structure-Activity Relationship
12.
J Chem Inf Model ; 53(2): 461-74, 2013 Feb 25.
Article in English | MEDLINE | ID: mdl-23293925

ABSTRACT

Class imbalance occurs frequently in drug discovery data sets. In oral absorption data sets in the literature, there are considerably more highly absorbed compounds than poorly absorbed compounds. This produces models that are biased toward highly absorbed compounds and that lack generalization to industry settings, where more early stage drug candidates are poorly absorbed. This paper presents two strategies to cope with unbalanced class data sets: undersampling the majority high absorption class and misclassification costs using classification decision trees. The published data set by Hou et al. [J. Chem. Inf. Model. 2007, 47, 208-218], which contained percentage human intestinal absorption of 645 drug and drug-like compounds, was used for the development and validation of classification trees using classification and regression tree (C&RT) analysis. The results indicate that undersampling the majority class, highly absorbed compounds, leads to a balanced distribution (50:50) training set which can achieve better accuracies for poorly absorbed compounds, whereas the biased training set achieved higher accuracies for highly absorbed compounds. The use of misclassification costs resulted in improved class predictions when applied to reduce false positives or false negatives. Moreover, it was shown that the classical overall accuracy measure used in many publications is particularly misleading in the case of unbalanced data sets, and more appropriate measures presented here may be used for a more realistic assessment of the classification models' performance. Thus, these strategies offer improvements to cope with unbalanced class data sets to obtain classification models applicable in industry.
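
The two points made in this abstract, undersampling the majority class and the misleading nature of overall accuracy, can be illustrated with a short sketch. The abstract does not name the "more appropriate measures", so balanced accuracy (the mean of per-class recall) is used here as an assumption. The function names, class labels and the 90:10 split are hypothetical.

```python
import random

def undersample(samples, labels, seed=0):
    """Randomly reduce every class to the size of the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    n_min = min(len(group) for group in by_class.values())
    balanced = [(sample, label)
                for label, group in by_class.items()
                for sample in rng.sample(group, n_min)]
    rng.shuffle(balanced)
    return balanced

def overall_accuracy(y_true, y_pred):
    """Fraction of all predictions that are correct."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class contributes equally."""
    recalls = []
    for cls in set(y_true):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

# Hypothetical 90:10 data set and a trivial model that always predicts "high".
y_true = ["high"] * 90 + ["low"] * 10
y_pred = ["high"] * 100
print(overall_accuracy(y_true, y_pred))   # 0.9
print(balanced_accuracy(y_true, y_pred))  # 0.5

balanced_set = undersample(list(range(100)), y_true)
print(len(balanced_set))  # 20
```

A model that never identifies a poorly absorbed compound still reaches 90% overall accuracy on this toy split, while its balanced accuracy is no better than chance.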


Subjects
Drug Discovery/methods , Absorption , Administration, Oral , Databases, Pharmaceutical , Decision Trees , Humans , Models, Biological , Regression Analysis
13.
Evol Comput ; 21(4): 659-84, 2013.
Article in English | MEDLINE | ID: mdl-23339552

ABSTRACT

This study reports the empirical analysis of a hyper-heuristic evolutionary algorithm that is capable of automatically designing top-down decision-tree induction algorithms. Top-down decision-tree algorithms are of great importance, considering their ability to provide an intuitive and accurate knowledge representation for classification problems. The automatic design of these algorithms seems timely, given the large literature accumulated over more than 40 years of research in the manual design of decision-tree induction algorithms. The proposed hyper-heuristic evolutionary algorithm, HEAD-DT, is extensively tested using 20 public UCI datasets and 10 microarray gene expression datasets. The algorithms automatically designed by HEAD-DT are compared with traditional decision-tree induction algorithms, such as C4.5 and CART. Experimental results show that HEAD-DT is capable of generating algorithms which are significantly more accurate than C4.5 and CART.


Subjects
Algorithms , Classification/methods , Decision Trees , Gene Expression Profiling/methods , Humans
14.
Aging (Albany NY) ; 15(13): 6073-6099, 2023 07 13.
Article in English | MEDLINE | ID: mdl-37450404

ABSTRACT

Recently, there has been a growing interest in the development of pharmacological interventions targeting ageing, as well as in the use of machine learning for analysing ageing-related data. In this work, we use machine learning methods to analyse data from DrugAge, a database of chemical compounds (including drugs) modulating lifespan in model organisms. To this end, we created four types of datasets for predicting whether or not a compound extends the lifespan of C. elegans (the most frequent model organism in DrugAge), using four different types of predictive biological features, based on: compound-protein interactions, interactions between compounds and proteins encoded by ageing-related genes, and two types of terms annotated for proteins targeted by the compounds, namely Gene Ontology (GO) terms and physiology terms from the WormBase's Phenotype Ontology. To analyse these datasets, we used a combination of feature selection methods in a data pre-processing phase and the well-established random forest algorithm for learning predictive models from the selected features. In addition, we interpreted the most important features in the two best models in light of the biology of ageing. One noteworthy feature was the GO term "Glutathione metabolic process", which plays an important role in cellular redox homeostasis and detoxification. We also predicted the most promising novel compounds for extending lifespan from a list of previously unlabelled compounds. These include nitroprusside, which is used as an antihypertensive medication. Overall, our work opens avenues for future work in employing machine learning to predict novel life-extending compounds.


Subjects
Caenorhabditis elegans , Longevity , Machine Learning , Longevity/drug effects , Caenorhabditis elegans/drug effects , Caenorhabditis elegans/genetics , Caenorhabditis elegans/physiology , Aging , Glutathione/analysis , Oxidation-Reduction , Gene Ontology , Algorithms , Databases, Pharmaceutical
15.
BMC Genomics ; 12: 27, 2011 Jan 12.
Article in English | MEDLINE | ID: mdl-21226956

ABSTRACT

BACKGROUND: The ageing of the worldwide population means there is a growing need for research on the biology of ageing. DNA damage is likely a key contributor to the ageing process and elucidating the role of different DNA repair systems in ageing is of great interest. In this paper we propose a data mining approach, based on classification methods (decision trees and Naive Bayes), for analysing data about human DNA repair genes. The goal is to build classification models that allow us to discriminate between ageing-related and non-ageing-related DNA repair genes, in order to better understand their different properties. RESULTS: The main patterns discovered by the classification methods are as follows: (a) the number of protein-protein interactions was a predictor of DNA repair proteins being ageing-related; (b) the use of predictor attributes based on protein-protein interactions considerably increased predictive accuracy of attributes based on Gene Ontology (GO) annotations; (c) GO terms related to "response to stimulus" seem reasonably good predictors of ageing-relatedness for DNA repair genes; (d) interaction with the XRCC5 (Ku80) protein is a strong predictor of ageing-relatedness for DNA repair genes; and (e) DNA repair genes with a high expression in T lymphocytes are more likely to be ageing-related. CONCLUSIONS: The above patterns are broadly integrated in an analysis discussing relations between Ku, the non-homologous end joining DNA repair pathway, ageing and lymphocyte development. These patterns and their analysis support non-homologous end joining double strand break repair as central to the ageing-relatedness of DNA repair genes. Our work also showcases the use of protein interaction partners to improve accuracy in data mining methods and our approach could be applied to other ageing-related pathways.


Subjects
Aging/genetics , DNA Repair/genetics , Data Mining , Algorithms , Animals , Humans
16.
Mutat Res ; 728(1-2): 12-22, 2011.
Article in English | MEDLINE | ID: mdl-21600302

ABSTRACT

Given the central role of DNA in life, and how ageing can be seen as the gradual and irreversible breakdown of living systems, the idea that damage to the DNA is the crucial cause of ageing remains a powerful one. DNA damage and mutations of different types clearly accumulate with age in mammalian tissues. Human progeroid syndromes resulting in what appears to be accelerated ageing have been linked to defects in DNA repair or processing, suggesting that elevated levels of DNA damage can accelerate physiological decline and the development of age-related diseases not limited to cancer. Higher DNA damage may trigger cellular signalling pathways, such as apoptosis, that result in a faster depletion of stem cells, which in turn contributes to accelerated ageing. Genetic manipulations of DNA repair pathways in mice further strengthen this view and also indicate that disruption of specific pathways, such as nucleotide excision repair and non-homologous end joining, is more strongly associated with premature ageing phenotypes. Delaying ageing in mice by decreasing levels of DNA damage, however, has not been achieved yet, perhaps due to the complexity inherent to DNA repair and DNA damage response pathways. Another open question is whether DNA repair optimization is involved in the evolution of species longevity, and we suggest that the way cells from different organisms respond to DNA damage may be crucial in species differences in ageing. Taken together, the data suggest a major role of DNA damage in the modulation of longevity, possibly through effects on cell dysfunction and loss, although understanding how to modify DNA damage repair and response systems to delay ageing remains a crucial challenge.


Subjects
Aging/genetics , DNA Damage , Animals , Apoptosis , Biological Evolution , Cockayne Syndrome/genetics , DNA Repair , Humans , Longevity/genetics , Mutation , Species Specificity , Stem Cells/physiology
17.
IEEE/ACM Trans Comput Biol Bioinform ; 18(6): 2230-2238, 2021.
Article in English | MEDLINE | ID: mdl-32324561

ABSTRACT

Understanding the ageing process is a very challenging problem for biologists. To help in this task, there has been a growing use of classification methods (from machine learning) to learn models that predict whether a gene influences the process of ageing or promotes longevity. One type of predictive feature often used for learning such classification models is Protein-Protein Interaction (PPI) features. One important property of PPI features is their uncertainty, i.e., a given feature (PPI annotation) is often associated with a confidence score, which is usually ignored by conventional classification methods. Hence, we propose the Lazy Feature Selection for Uncertain Features (LFSUF) method, which is tailored for coping with the uncertainty in PPI confidence scores. In addition, following the lazy learning paradigm, LFSUF selects features for each instance to be classified, making the feature selection process more flexible. We show that our LFSUF method achieves better predictive accuracy when compared to other feature selection methods that either do not explicitly take PPI confidence scores into account or deal with uncertainty globally rather than using a per-instance approach. Also, we interpret the results of the classification process using the features selected by LFSUF, showing that the number of selected features is significantly reduced, assisting the interpretability of the results. The datasets used in the experiments and the program code of the LFSUF method are freely available on the web at http://github.com/pablonsilva/FSforUncertainFeatureSpaces.


Subjects
Aging/genetics , Computational Biology/methods , Machine Learning , Algorithms , Animals , Drosophila melanogaster/genetics , Human Genome/genetics , Humans , Mice , Protein Interaction Maps/genetics , Uncertainty , Yeasts/genetics
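The abstract of record 17 describes selecting features separately for each instance to be classified, taking PPI confidence scores into account. The sketch below only illustrates that general idea and is not the LFSUF algorithm: the threshold rule, the similarity measure, the 1-nearest-neighbour classifier, and all names (`select_features`, `lazy_predict`, `ppi_A`, and so on) are assumptions made for illustration. The authors' actual code is at the repository URL given in the abstract.

```python
# Illustrative sketch only: NOT the LFSUF method from the paper.
# Each instance is a dict mapping a (hypothetical) PPI feature name to a
# confidence score in [0, 1]; absent features are treated as score 0.

def select_features(instance, threshold=0.7):
    """Keep only the features this particular instance holds with high confidence."""
    return {name for name, conf in instance.items() if conf >= threshold}

def similarity(a, b, features):
    """Sum, over the selected features, of the confidence both instances share."""
    return sum(min(a.get(f, 0.0), b.get(f, 0.0)) for f in features)

def lazy_predict(train, labels, instance, threshold=0.7):
    """Select features for this one instance, then label it by its nearest neighbour."""
    features = select_features(instance, threshold)
    # If no feature passes the threshold, every similarity is 0 and the
    # first training instance wins the tie; a real method would handle this case.
    best = max(range(len(train)), key=lambda i: similarity(train[i], instance, features))
    return labels[best], features

train = [{"ppi_A": 0.9, "ppi_B": 0.2}, {"ppi_C": 0.8}]
labels = ["ageing", "not_ageing"]
query = {"ppi_A": 0.95, "ppi_C": 0.3}

label, used = lazy_predict(train, labels, query)
print(label, sorted(used))
```

Because selection happens at prediction time, a different query can end up using a different feature subset, which is the flexibility the abstract attributes to the lazy learning paradigm.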
18.
Aging (Albany NY) ; 13(3): 3313-3341, 2021 02 11.
Article in English | MEDLINE | ID: mdl-33611312

ABSTRACT

By combining transcriptomic data with other data sources, inferences can be made about functional changes during ageing. Thus, we conducted a meta-analysis on 127 publicly available microarray and RNA-Seq datasets from mice, rats and humans, identifying a transcriptomic signature of ageing across species and tissues. Analyses on subsets of these datasets produced transcriptomic signatures of ageing for brain, heart and muscle. We then applied enrichment analysis and machine learning to functionally describe these signatures, revealing overexpression of immune and stress response genes and underexpression of metabolic and developmental genes. Further analyses revealed little overlap between genes differentially expressed with age in different tissues, even though genes differentially expressed with age are typically widely expressed across tissues. Additionally, we show that the ageing gene expression signatures (particularly the overexpressed signatures) of the whole meta-analysis, brain and muscle tend to include genes that are central in protein-protein interaction networks. We also show that genes underexpressed with age in the brain are highly central in a co-expression network, suggesting that underexpression of these genes may have broad phenotypic consequences. In sum, we show numerous functional similarities between the ageing transcriptomes of these important tissues, along with unique network properties of genes differentially expressed with age in both protein-protein interaction and co-expression networks.


Subjects
Aging/genetics , Genomics/methods , Organ Specificity/genetics , Transcriptome/genetics , Animals , Humans , Machine Learning , Mice , Oligonucleotide Array Sequence Analysis , Protein Interaction Mapping , Rats
19.
Bioinformatics ; 24(18): 2064-70, 2008 Sep 15.
Article in English | MEDLINE | ID: mdl-18641010

ABSTRACT

MOTIVATION: Cellular processes often hinge upon specific interactions among proteins, and knowledge of these processes at a system level constitutes a major goal of proteomics. In particular, a greater understanding of protein-protein interactions can be gained via a more detailed investigation of the protein domain interactions that mediate the interactions of proteins. Existing high-throughput experimental techniques assay protein-protein interactions, yet they do not provide any direct information on the interactions among domains. Inferences concerning the latter can be made by analysis of the domain composition of a set of proteins and their interaction map. This inference problem is non-trivial, however, due to the high level of noise generally present in experimental data concerning protein-protein interactions. This noise leads to contradictions, i.e. the impossibility of having a pattern of domain interactions compatible with the protein-protein interaction map. RESULTS: We formulate the problem of prediction of protein domain interactions in a form that lends itself to the application of belief propagation, a powerful algorithm for such inference problems, which is based on message passing. The input to our algorithm is an interaction map among a set of proteins, and a set of domain assignments to the relevant proteins. The output is a list of probabilities of interaction between each pair of domains. Our method is able to effectively cope with errors in the protein-protein interaction dataset and systematically resolve contradictions. We applied the method to a dataset concerning the budding yeast Saccharomyces cerevisiae and tested the quality of our predictions by cross-validation on this dataset, by comparison with existing computational predictions, and finally with experimentally available domain interactions. Results compare favourably to those by existing algorithms. 
AVAILABILITY: A C language implementation of the algorithm is available upon request.


Subjects
Algorithms , Protein Interaction Domains and Motifs , Protein Interaction Mapping/methods , Amino Acid Motifs , Binding Sites , Protein Databases , Saccharomyces cerevisiae/genetics
20.
Bioinformatics ; 24(18): 1980-6, 2008 Sep 15.
Article in English | MEDLINE | ID: mdl-18676973

ABSTRACT

MOTIVATION: There is much interest in reducing the complexity inherent in the representation of the 20 standard amino acids within bioinformatics algorithms by developing a so-called reduced alphabet. Although there is no universally applicable residue grouping, there are numerous physicochemical criteria upon which one can base groupings. Local descriptors are a form of alignment-free analysis, the efficiency of which is dependent upon the correct selection of amino acid groupings. RESULTS: Within the context of G protein-coupled receptor (GPCR) classification, an optimization algorithm was developed, which was able to identify the most efficient grouping when used to generate local descriptors. The algorithm was inspired by the relatively new computational intelligence paradigm of artificial immune systems. A number of amino acid groupings produced by this algorithm were evaluated with respect to their ability to generate local descriptors capable of providing an accurate classification algorithm for GPCRs.


Subjects
Algorithms , Amino Acids/classification , G-Protein-Coupled Receptors/chemistry , G-Protein-Coupled Receptors/classification , Artificial Intelligence , Computational Biology/methods , Protein Databases , G-Protein-Coupled Receptors/metabolism , Protein Sequence Analysis/methods
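The reduced-alphabet idea in the abstract of record 20 can be illustrated with a short sketch: each of the 20 standard amino acids is mapped to a group symbol, and an alignment-free feature is computed on the reduced sequence. The four-group partition below is a hypothetical, rough physicochemical grouping chosen for illustration; it is not one of the groupings found by the paper's optimization algorithm, and the composition feature is a simplified stand-in for the local descriptors the abstract refers to.

```python
# Hypothetical 4-group reduced alphabet (an assumption for illustration,
# not a grouping reported in the paper). Together the groups cover all
# 20 standard amino acids exactly once.
GROUPS = {
    "h": "AVLIMFWC",  # broadly hydrophobic
    "p": "STNQYGP",   # broadly polar or small
    "+": "KRH",       # positively charged
    "-": "DE",        # negatively charged
}

def build_mapping(groups):
    """Invert the group table into a residue -> group-symbol lookup."""
    return {res: symbol for symbol, residues in groups.items() for res in residues}

def reduce_sequence(seq, mapping, unknown="x"):
    """Rewrite a protein sequence in the reduced alphabet; unknown residues become 'x'."""
    return "".join(mapping.get(res, unknown) for res in seq.upper())

def composition(reduced, symbols):
    """Fraction of the reduced sequence occupied by each group symbol."""
    n = len(reduced)
    return {s: (reduced.count(s) / n if n else 0.0) for s in symbols}

MAPPING = build_mapping(GROUPS)
reduced = reduce_sequence("MKDE", MAPPING)
print(reduced)                       # h+--
print(composition(reduced, GROUPS))
```

Changing `GROUPS` changes every downstream feature, which is why the choice of grouping matters for classification accuracy and why the abstract treats it as an optimization problem.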