
Statistical assessment of functional categories of genes deregulated in pathological conditions by using microarray data.

Maglietta, R; Piepoli, A; Catalano, D; Licciulli, F; Carella, M; Liuni, S; Pesole, G; Perri, F; Ancona, N.

Bioinformatics ; 23(16): 2063-72, 2007 Aug 15.

Article in English | MEDLINE | ID: mdl-17540679

ABSTRACT

MOTIVATION: A major challenge in current biomedical research is the identification of cellular processes deregulated in a given pathology through the analysis of gene expression profiles. To this end, predefined lists of genes, each coding for a specific function, are compared with a list of genes ordered according to their differential expression as measured by suitable univariate statistics. RESULTS: We propose a statistically well-founded method for measuring the relevance of predefined lists of genes and for assessing their statistical significance, starting from their raw expression levels as recorded on the microarray. We use prediction accuracy as a measure of the relevance of a list. The rationale is that a functional category, coded through a list of genes, is perturbed in a given pathology if it is possible to correctly predict the occurrence of the disease in new subjects on the basis of the expression levels of only the genes belonging to the list. The accuracy is estimated with a multiple random validation strategy, and its statistical significance is assessed against two null hypotheses by using two independent permutation tests. The utility of the proposed methodology is illustrated by analyzing the relevance of Gene Ontology terms belonging to the biological process category in colon and prostate cancer, using three different microarray data sets, and by comparing it with current approaches. AVAILABILITY: Source code for the algorithms is available from the authors upon request. SUPPLEMENTARY INFORMATION: The colon cancer data set and a complete description of the experimental results are available at: ftp://bioftp:76bioftpxxx@marx.ba.issia.cnr.it/supp-info.htm.
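
The idea of scoring a gene list by its prediction accuracy and testing that score by permutation can be sketched as follows. This is a minimal illustration, not the authors' code: the nearest-centroid classifier, the function names, the split sizes and the toy data are all assumptions. The abstract does not spell out its two null hypotheses, so only one test is shown, against the null that class labels are unrelated to expression.

```python
import random


def make_toy_data(n_per_class=15, n_genes=20, n_informative=3, shift=3.0, seed=1):
    """Hypothetical expression matrix: the first genes differ between classes."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
            for g in range(n_informative):
                row[g] += shift * label
            X.append(row)
            y.append(label)
    return X, y


def nearest_centroid_accuracy(X, y, train_idx, test_idx, genes):
    """Test accuracy of a nearest-centroid classifier restricted to `genes`."""
    centroids = {}
    for label in set(y[i] for i in train_idx):
        rows = [X[i] for i in train_idx if y[i] == label]
        centroids[label] = [sum(r[g] for r in rows) / len(rows) for g in genes]
    correct = 0
    for i in test_idx:
        dist = {
            label: sum((X[i][g] - c[k]) ** 2 for k, g in enumerate(genes))
            for label, c in centroids.items()
        }
        if min(dist, key=dist.get) == y[i]:
            correct += 1
    return correct / len(test_idx)


def random_validation_accuracy(X, y, genes, n_splits=50, test_fraction=0.3, rng=None):
    """Mean accuracy over repeated random train/test splits."""
    rng = rng or random.Random(0)
    n = len(y)
    n_test = max(1, int(round(test_fraction * n)))
    scores = []
    for _ in range(n_splits):
        idx = list(range(n))
        rng.shuffle(idx)
        test_idx, train_idx = idx[:n_test], idx[n_test:]
        scores.append(nearest_centroid_accuracy(X, y, train_idx, test_idx, genes))
    return sum(scores) / len(scores)


def label_permutation_p_value(X, y, genes, n_perm=200, seed=0):
    """Observed accuracy and its p-value under shuffled class labels."""
    rng = random.Random(seed)
    # The same seed is reused so every evaluation sees the same splits.
    observed = random_validation_accuracy(X, y, genes, rng=random.Random(seed))
    hits = 0
    for _ in range(n_perm):
        y_perm = y[:]
        rng.shuffle(y_perm)
        score = random_validation_accuracy(X, y_perm, genes, rng=random.Random(seed))
        if score >= observed:
            hits += 1
    # The +1 terms count the observed labelling as one of the permutations.
    return observed, (hits + 1) / (n_perm + 1)
```

Calling `label_permutation_p_value(X, y, genes=[0, 1, 2])` on the toy data returns the mean accuracy of that list together with the fraction of label shufflings that reach the same accuracy; a small fraction suggests the list carries class information.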

Subjects

Biomarkers, Tumor/metabolism , Gene Expression Profiling/methods , Gene Expression Regulation, Neoplastic , Multigene Family , Neoplasm Proteins/metabolism , Neoplasms/metabolism , Oligonucleotide Array Sequence Analysis/methods , Data Interpretation, Statistical , Humans , Male , Neoplasm Proteins/classification

On the statistical assessment of classifiers using DNA microarray data.

Ancona, N; Maglietta, R; Piepoli, A; D'Addabbo, A; Cotugno, R; Savino, M; Liuni, S; Carella, M; Pesole, G; Perri, F.

BMC Bioinformatics ; 7: 387, 2006 Aug 19.

Article in English | MEDLINE | ID: mdl-16919171

ABSTRACT

BACKGROUND: In this paper we present a method for the statistical assessment of cancer predictors that make use of gene expression profiles. The methodology is applied to a new data set of microarray gene expression data collected at Casa Sollievo della Sofferenza Hospital, Foggia, Italy. The data set is made up of normal (22) and tumor (25) specimens extracted from 25 patients affected by colon cancer. We aim to answer some questions that are relevant for the automatic diagnosis of cancer, such as: Is the size of the available data set sufficient to build accurate classifiers? What is the statistical significance of the associated error rates? In what ways does accuracy depend on the adopted classification scheme? How many genes are correlated with the pathology, and how many are sufficient for an accurate colon cancer classification? The method we propose answers these questions whilst avoiding the potential pitfalls hidden in the analysis and interpretation of microarray data. RESULTS: We estimate the generalization error, evaluated through the Leave-K-Out Cross Validation error, for three different classification schemes by varying the number of training examples and the number of genes used. The statistical significance of the error rate is measured by using a permutation test. We provide a statistical analysis in terms of the frequencies of the genes involved in the classification. Using the whole set of genes, we found that the Weighted Voting Algorithm (WVA) classifier learns the distinction between normal and tumor specimens with 25 training examples, providing an error rate of e = 21% (p = 0.045). This remains constant even when the number of examples increases. Regularized Least Squares (RLS) and Support Vector Machines (SVM) classifiers can learn with only 15 training examples, with error rates of e = 19% (p = 0.035) and e = 18% (p = 0.037) respectively. Their error rate decreases as the training set size increases, reaching its best value with 35 training examples. In this case, RLS and SVM have error rates of e = 14% (p = 0.027) and e = 11% (p = 0.019) respectively. Concerning the number of genes, we found about 6000 genes (p < 0.05) correlated with the pathology, as determined by the signal-to-noise statistic. The performance of the RLS and SVM classifiers does not change when 74% of the genes are used. It then degrades progressively, with the error rate reaching e = 16% (p < 0.05) when only 2 genes are employed. The biological relevance of a set of genes determined by our statistical analysis, and the major roles they play in colorectal tumorigenesis, are discussed. CONCLUSIONS: The proposed method provides statistically significant answers to precise questions relevant for the diagnosis and prognosis of cancer. We found that, with as few as 15 examples, it is possible to train statistically significant classifiers for colon cancer diagnosis. As for the number of genes sufficient for a reliable classification of colon cancer, our results suggest that it depends on the accuracy required.
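
The evaluation loop described above, which ranks genes by a signal-to-noise statistic and estimates the error by repeatedly holding out k specimens, can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the statistic is taken to be the commonly used form (mean1 - mean0) / (std1 + std0), the classifier is a simplified weighted-voting rule in the spirit of WVA, genes are reselected inside each training split, and all names and toy data are hypothetical.

```python
import random
import statistics


def make_toy_data(n_per_class=20, n_genes=30, n_informative=5, shift=3.0, seed=1):
    """Hypothetical expression matrix: the first genes differ between classes."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
            for g in range(n_informative):
                row[g] += shift * label
            X.append(row)
            y.append(label)
    return X, y


def signal_to_noise(X, y, gene):
    """Assumed form: (mean_1 - mean_0) / (std_1 + std_0); labels are 0/1."""
    a = [row[gene] for row, label in zip(X, y) if label == 1]
    b = [row[gene] for row, label in zip(X, y) if label == 0]
    denom = statistics.stdev(a) + statistics.stdev(b)
    return (statistics.mean(a) - statistics.mean(b)) / denom if denom else 0.0


def top_genes(X, y, n_genes):
    """Indices of the genes with the largest absolute signal-to-noise."""
    scores = {g: abs(signal_to_noise(X, y, g)) for g in range(len(X[0]))}
    return sorted(scores, key=scores.get, reverse=True)[:n_genes]


def train_weighted_voting(X, y, genes):
    """One (gene, weight, boundary) triple per selected gene."""
    model = []
    for g in genes:
        a = [row[g] for row, label in zip(X, y) if label == 1]
        b = [row[g] for row, label in zip(X, y) if label == 0]
        boundary = (statistics.mean(a) + statistics.mean(b)) / 2
        model.append((g, signal_to_noise(X, y, g), boundary))
    return model


def predict(model, row):
    """Each gene votes weight * (expression - boundary); the sign decides."""
    vote = sum(w * (row[g] - b) for g, w, b in model)
    return 1 if vote > 0 else 0


def leave_k_out_error(X, y, n_train, n_genes, n_rounds=100, seed=0):
    """Mean test error over random splits that hold out len(y) - n_train samples."""
    rng = random.Random(seed)
    errors = []
    while len(errors) < n_rounds:
        idx = list(range(len(y)))
        rng.shuffle(idx)
        train, test = idx[:n_train], idx[n_train:]
        y_train = [y[i] for i in train]
        if min(y_train.count(0), y_train.count(1)) < 2:
            continue  # a standard deviation needs two samples per class
        X_train = [X[i] for i in train]
        # Genes are chosen from the training part only, to avoid selection bias.
        genes = top_genes(X_train, y_train, n_genes)
        model = train_weighted_voting(X_train, y_train, genes)
        wrong = sum(predict(model, X[i]) != y[i] for i in test)
        errors.append(wrong / len(test))
    return sum(errors) / len(errors)
```

Calling `leave_k_out_error` with several values of `n_train` and `n_genes` traces how the estimated error changes with training set size and gene count, which is the kind of curve the abstract reports. A label-shuffling permutation test can then be applied to each estimate to obtain a p-value.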

Subjects

Algorithms , Oligonucleotide Array Sequence Analysis/methods , Statistics as Topic/methods , Aged , Colonic Neoplasms/classification , Colonic Neoplasms/genetics , Data Interpretation, Statistical , Female , Gene Expression Profiling/methods , Gene Expression Regulation, Neoplastic/genetics , Humans , Male , Middle Aged , Models, Statistical , Numerical Analysis, Computer-Assisted , Reproducibility of Results , Software
