Search | Portal Regional da BVS

Graph coloring for extracting discriminative genes in cancer data.

Mahfouz, Mohamed A; Nepomuceno, Juan A.

Ann Hum Genet ; 83(3): 141-159, 2019 May.

Article in English | MEDLINE | ID: mdl-30644085

ABSTRACT

BACKGROUND AND OBJECTIVE: The major difficulty in analyzing input gene expression data in a microarray-based approach to automated cancer diagnosis is the large number of genes (high dimensionality), many of them irrelevant (noise), compared to the very small number of samples. This research study tackles the dimensionality reduction challenge in this area. METHODS: This research study introduces a dimension-reduction technique termed graph coloring approach (GCA) for microarray data-based cancer classification, based on analyzing the absolute correlation between gene-gene pairs and partitioning genes into several hubs using graph coloring. GCA starts with a gene-selection step in which the top relevant genes are selected using a biserial correlation. Each time, a gene from an ordered list of top relevant genes is selected as the hub gene (representative) and redundant genes are added to its group; the process is repeated recursively for the remaining genes. A gene is considered redundant if its absolute correlation with the hub gene is greater than a controlling threshold. A suitable range for the threshold is estimated by computing a percentage graph for the absolute correlation between gene-gene pairs. Each value in the estimated range for the threshold can efficiently produce a new feature subset. RESULTS: GCA achieved significant improvement over several existing techniques in terms of higher accuracy and a smaller number of features. In addition, the genes selected by this method are relevant according to the information stored in scientific repositories. CONCLUSIONS: The proposed dimension-reduction technique can help biologists accurately predict cancer in several areas of the body.
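The hub-and-redundancy grouping described in the METHODS part of this abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function name `greedy_hub_grouping`, the parameters `top_k` and `threshold`, and the use of the point-biserial correlation (computed as the Pearson correlation with a 0/1 class label) for the relevance ranking are all assumptions made here, and the threshold-range estimation step is left out.

```python
import numpy as np

def greedy_hub_grouping(X, y, top_k=50, threshold=0.8):
    """Group genes around hub genes (illustrative sketch only).

    X: array of shape (samples, genes); y: binary class labels (0/1).
    Returns a dict mapping each hub gene index to its redundant genes.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)

    # Relevance of each gene: absolute Pearson correlation with the binary
    # label, which coincides with the point-biserial correlation (assumed).
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    relevance = np.abs(Xc.T @ yc) / np.where(denom == 0, np.inf, denom)

    # Ordered list of the top relevant genes.
    ranked = np.argsort(-relevance)[:top_k]

    # Absolute gene-gene correlation among the selected genes.
    corr = np.atleast_2d(np.abs(np.corrcoef(X[:, ranked], rowvar=False)))

    remaining = list(range(len(ranked)))
    groups = {}
    while remaining:
        hub = remaining.pop(0)  # most relevant gene not yet assigned
        redundant = {g for g in remaining if corr[hub, g] > threshold}
        groups[int(ranked[hub])] = [int(ranked[g]) for g in remaining if g in redundant]
        remaining = [g for g in remaining if g not in redundant]
    return groups

# Toy data: genes 0 and 1 are perfectly correlated, gene 2 is unrelated.
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
g0 = np.array([1, 2, 1, 2, 5, 6, 5, 6], dtype=float)
X = np.column_stack([g0, 2 * g0 + 1, [1, 5, 2, 6, 1, 5, 2, 6]])
groups = greedy_hub_grouping(X, y, top_k=3, threshold=0.8)
```

In this toy run the two perfectly correlated genes end up in one group and the third gene forms its own group; the hub genes would then serve as the reduced feature subset. Sweeping `threshold` over a range yields a different subset for each value.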

Subjects

Algorithms , Electronic Data Processing/methods , Neoplasms/genetics , Humans , Oligonucleotide Array Sequence Analysis

Pairwise gene GO-based measures for biclustering of high-dimensional expression data.

Nepomuceno, Juan A; Troncoso, Alicia; Nepomuceno-Chamorro, Isabel A; Aguilar-Ruiz, Jesús S.

BioData Min ; 11: 4, 2018.

Article in English | MEDLINE | ID: mdl-29610579

ABSTRACT

BACKGROUND: Biclustering algorithms search for groups of genes that share the same behavior under a subset of samples in gene expression data. Nowadays, the biological knowledge available in public repositories can be used to drive these algorithms to find biclusters composed of functionally coherent groups of genes. On the other hand, a distance among genes can be defined according to their information stored in Gene Ontology (GO). Gene pairwise GO semantic similarity measures report a value for each pair of genes which establishes their functional similarity. A scatter search-based algorithm that optimizes a merit function that integrates GO information is studied in this paper. This merit function uses a term that addresses the information through a GO measure. RESULTS: The effect of two possible different gene pairwise GO measures on the performance of the algorithm is analyzed. Firstly, three well-known yeast datasets with approximately one thousand genes are studied. Secondly, a group of human datasets related to clinical data of cancer is also explored by the algorithm. Most of these data are high-dimensional datasets composed of a huge number of genes. The resultant biclusters reveal groups of genes linked by the same functionality when the search procedure is driven by one of the proposed GO measures. Furthermore, a qualitative biological study of a group of biclusters shows their relevance from a cancer disease perspective. CONCLUSIONS: It can be concluded that the integration of biological information improves the performance of the biclustering process. The two different GO measures studied show an improvement in the results obtained for the yeast dataset. However, if datasets are composed of a huge number of genes, only one of them really improves the algorithm performance. This second case constitutes a clear option to explore interesting datasets from a clinical point of view.

An application of the Shapley value to the analysis of co-expression networks.

Cesari, Giulia; Algaba, Encarnación; Moretti, Stefano; Nepomuceno, Juan A.

Appl Netw Sci ; 3(1): 35, 2018.

Article in English | MEDLINE | ID: mdl-30839839

ABSTRACT

We study the problem of identifying relevant genes in a co-expression network using a (cooperative) game theoretic approach. The Shapley value of a cooperative game is used to assess the relevance of each gene in interaction with the others, and to stress the role of nodes in the periphery of a co-expression network for the regulation of complex biological pathways of interest. An application of the method to the analysis of gene expression data from microarrays is presented, as well as a comparison with classical centrality indices. Finally, making further assumptions about the a priori importance of genes, we combine the game theoretic model with other techniques from cluster analysis.
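As a rough illustration of how a Shapley value assigns relevance to nodes of a network, the sketch below computes exact Shapley values by enumerating all player orderings. The abstract does not specify the cooperative game used in the paper, so the characteristic function here (the number of edges with both endpoints inside a coalition) is only a stand-in, and the gene names and edges are hypothetical. Enumeration is feasible only for a handful of players.

```python
from itertools import permutations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    phi = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = frozenset()
        previous = value(coalition)
        for p in order:
            coalition = coalition | {p}
            current = value(coalition)
            phi[p] += current - previous  # marginal contribution of p
            previous = current
    n_orders = factorial(len(players))
    return {p: total / n_orders for p, total in phi.items()}

# Hypothetical co-expression network: g2 is linked to every other gene.
edges = {("g1", "g2"), ("g2", "g3"), ("g2", "g4")}

def edges_inside(coalition):
    """Stand-in game: count edges whose two endpoints both lie in the coalition."""
    return sum(1 for a, b in edges if a in coalition and b in coalition)

phi = shapley_values(["g1", "g2", "g3", "g4"], edges_inside)
```

For this particular stand-in game each edge's unit of worth is split equally between its endpoints, so a node's Shapley value is half its degree and the values sum to the total number of edges.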

Integrating biological knowledge based on functional annotations for biclustering of gene expression data.

Nepomuceno, Juan A; Troncoso, Alicia; Nepomuceno-Chamorro, Isabel A; Aguilar-Ruiz, Jesús S.

Comput Methods Programs Biomed ; 119(3): 163-80, 2015 May.

Article in English | MEDLINE | ID: mdl-25843807

ABSTRACT

Gene expression data analysis is based on the assumption that co-expressed genes imply co-regulated genes. This assumption is being reformulated because the co-expression of a group of genes may be the result of an independent activation with respect to the same experimental condition and not due to the same regulatory regime. For this reason, traditional techniques have recently been improved with the use of prior biological knowledge from open-access repositories together with gene expression data. Biclustering is an unsupervised machine learning technique that searches for patterns in gene expression data matrices. A scatter search-based biclustering algorithm that integrates biological information is proposed in this paper. In addition to the gene expression data matrix, the only input of the algorithm is a direct annotation file that relates each gene to a set of terms from a biological repository where genes are annotated. Two different biological measures, FracGO and SimNTO, are proposed to integrate this information by adding it to the fitness function to be optimized in the scatter search scheme. The measure FracGO is based on biological enrichment and SimNTO is based on the overlap among GO annotations of pairs of genes. Experimental results evaluate the proposed algorithm for two datasets and show that the algorithm performs better when biological knowledge is integrated. Moreover, an analysis and comparison of the two biological measures is presented, and it is concluded that the differences depend on both the data source and how the annotation file has been built when GO is used. It is also shown that the proposed algorithm obtains a greater number of enriched biclusters than other classical biclustering algorithms typically used as benchmarks, and an analysis of the overlap among biclusters reveals that the biclusters obtained overlap little. The proposed methodology is a general-purpose algorithm which allows the integration of biological information from several sources and can be extended to other biclustering algorithms based on the optimization of a merit function.
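To make the idea of an overlap-based measure between the annotations of gene pairs concrete, here is a small sketch. The abstract does not give the formula for SimNTO, so the code assumes a normalized term overlap (shared terms divided by the size of the smaller annotation set); the function names, gene names and term labels are hypothetical, and the way such a score would enter the fitness function is not shown.

```python
from itertools import combinations

def normalized_term_overlap(terms_a, terms_b):
    """Shared annotation terms divided by the smaller annotation set size (assumed form)."""
    a, b = set(terms_a), set(terms_b)
    if not a or not b:
        return 0.0  # a gene without annotations shares nothing
    return len(a & b) / min(len(a), len(b))

def mean_pairwise_overlap(genes, annotations):
    """Average overlap over all gene pairs of a candidate bicluster."""
    pairs = list(combinations(genes, 2))
    if not pairs:
        return 0.0
    scores = [normalized_term_overlap(annotations.get(g, ()), annotations.get(h, ()))
              for g, h in pairs]
    return sum(scores) / len(scores)

# Hypothetical direct annotation file, already parsed into gene -> set of terms.
annotations = {
    "g1": {"T1", "T2", "T3"},
    "g2": {"T2", "T3"},
    "g3": {"T4"},
}
score = mean_pairwise_overlap(["g1", "g2", "g3"], annotations)
```

A score like this could be added as one term of a merit function so that biclusters whose genes share annotations are favored during the search.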

Subjects

Algorithms , Gene Expression Profiling/statistics & numerical data , Molecular Sequence Annotation/statistics & numerical data , Unsupervised Machine Learning/statistics & numerical data , Cluster Analysis , Data Mining , Databases, Genetic/statistics & numerical data , Gene Ontology/statistics & numerical data , Genes, Fungal , Knowledge Bases , Yeasts/genetics

Biclustering of gene expression data by correlation-based scatter search.

Nepomuceno, Juan A; Troncoso, Alicia; Aguilar-Ruiz, Jesús S.

BioData Min ; 4(1): 3, 2011 Jan 24.

Article in English | MEDLINE | ID: mdl-21261986

ABSTRACT

BACKGROUND: The analysis of data generated by microarray technology is very useful for understanding how genetic information becomes functional gene products. Biclustering algorithms can determine a group of genes which are co-expressed under a set of experimental conditions. Recently, new biclustering methods based on metaheuristics have been proposed. Most of them use the Mean Squared Residue as the merit function, but interesting and relevant patterns from a biological point of view, such as shifting and scaling patterns, may not be detected using this measure. However, it is important to discover this type of pattern, since genes commonly present a similar behavior although their expression levels vary in different ranges or magnitudes. METHODS: Scatter Search is an evolutionary technique that is based on the evolution of a small set of solutions which are chosen according to quality and diversity criteria. This paper presents a Scatter Search with the aim of finding biclusters from gene expression data. In this algorithm, the proposed fitness function is based on the linear correlation among genes to detect shifting and scaling patterns, and an improvement method is included in order to select just positively correlated genes. RESULTS: The proposed algorithm has been tested with three real data sets, the Yeast Cell Cycle dataset, the human B-cell lymphoma dataset and the Yeast Stress dataset, finding a remarkable number of biclusters with shifting and scaling patterns. In addition, the performance of the proposed method and fitness function is compared to that of CC, OPSM, ISA, BiMax, xMotifs and Samba using the Gene Ontology database.
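The contrast drawn in this abstract between the Mean Squared Residue and a correlation-based measure can be checked on toy data. The sketch below is not the paper's fitness function: it computes the usual mean squared residue of a bicluster and, as an assumed stand-in for a correlation-based score, the average pairwise Pearson correlation between gene rows. The function names and the toy expression profile are hypothetical.

```python
import numpy as np

def mean_squared_residue(B):
    """Mean squared residue of a bicluster B (rows = genes, columns = conditions)."""
    B = np.asarray(B, dtype=float)
    row_means = B.mean(axis=1, keepdims=True)
    col_means = B.mean(axis=0, keepdims=True)
    residue = B - row_means - col_means + B.mean()
    return float((residue ** 2).mean())

def average_pairwise_correlation(B):
    """Average Pearson correlation over all pairs of gene rows (stand-in score)."""
    corr = np.corrcoef(np.asarray(B, dtype=float))
    upper = np.triu_indices_from(corr, k=1)  # each pair counted once
    return float(corr[upper].mean())

base = np.array([1.0, 3.0, 2.0, 5.0])              # hypothetical expression profile
shifted = np.vstack([base, base + 2, base + 7])    # shifting pattern
scaled = np.vstack([base, base * 2, base * 3])     # scaling pattern

msr_shifted = mean_squared_residue(shifted)
msr_scaled = mean_squared_residue(scaled)
corr_shifted = average_pairwise_correlation(shifted)
corr_scaled = average_pairwise_correlation(scaled)
```

On this data the residue vanishes for the shifted rows but not for the scaled rows, while the correlation-based score treats both patterns as perfectly coherent, which is the motivation the abstract gives for a correlation-based fitness function.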
