1.
BMC Bioinformatics; 17: 74, 2016 Feb 06.
Article in English | MEDLINE | ID: mdl-26852017

ABSTRACT

BACKGROUND: Gene set analysis (GSA) aims to evaluate the association between the expression of biological pathways, or a priori defined gene sets, and a particular phenotype. Numerous GSA methods have been proposed to assess the enrichment of sets of genes. However, most methods are developed with respect to a specific alternative scenario, such as a differential mean pattern or differential coexpression. Moreover, only a very limited number of methods can handle binary, categorical, and continuous phenotypes alike. In this paper, we develop two novel GSA tests, called SDRs, based on the sufficient dimension reduction technique, which aims to capture sufficient information about the relationship between genes and the phenotype. The advantages of our proposed methods are that they allow for categorical and continuous phenotypes and that they can identify a variety of enriched gene sets. RESULTS: Through simulation studies, we compared the type I error and power of SDRs with existing GSA methods for binary, three-category, and continuous phenotypes. We found that the SDR methods adequately control the type I error rate at the pre-specified nominal level, and that they have satisfactory power to detect gene sets with differential coexpression and to test non-linear associations between gene sets and a continuous phenotype. In addition, the SDR methods were compared with seven widely used GSA methods using two real microarray datasets for illustration. CONCLUSIONS: We conclude that the SDR methods outperform the others because of their flexibility in handling different kinds of phenotypes and their power to detect a wide range of alternative scenarios. Our real data analysis highlights the differences between GSA methods in detecting enriched gene sets.
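
As a concrete illustration of the dimension-reduction idea, the sketch below tests a gene set against a phenotype with a sliced-inverse-regression (SIR) statistic and a permutation p-value. It is a minimal stand-in, not the SDR tests developed in the paper; the function name, the quantile slicing scheme, and the use of the leading SIR eigenvalue are all illustrative assumptions.

```python
import numpy as np

def sir_gsa_pvalue(X, y, n_slices=5, n_perm=999, seed=0):
    """Permutation test for gene-set association via sliced inverse regression.

    A minimal sketch, not the paper's SDR tests: the statistic is the largest
    eigenvalue of the SIR kernel built from the gene-set expression matrix X
    (samples x genes) and the phenotype y; its null distribution is obtained
    by permuting y.
    """
    rng = np.random.default_rng(seed)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardize each gene

    def sir_stat(y_vec):
        # quantile slices work for ordinal and continuous phenotypes alike
        edges = np.quantile(y_vec, np.linspace(0, 1, n_slices + 1))
        labels = np.clip(np.searchsorted(edges, y_vec, side="right") - 1,
                         0, n_slices - 1)
        M = np.zeros((Z.shape[1], Z.shape[1]))
        for h in range(n_slices):
            idx = labels == h
            if idx.any():
                m_h = Z[idx].mean(axis=0)
                M += idx.mean() * np.outer(m_h, m_h)   # weighted slice means
        return np.linalg.eigvalsh(M)[-1]               # leading SIR eigenvalue

    observed = sir_stat(y)
    perm = np.array([sir_stat(rng.permutation(y)) for _ in range(n_perm)])
    return (1 + np.sum(perm >= observed)) / (n_perm + 1)
```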


Subjects
Computational Biology/methods, Computer Simulation, Gene Expression Profiling/methods, Gene Regulatory Networks, Oligonucleotide Array Sequence Analysis/methods, Prostatic Neoplasms/genetics, Tumor Suppressor Protein p53/genetics, Black or African American/genetics, Genotype, Humans, Male, Phenotype, Prostatic Neoplasms/ethnology
2.
BMC Res Notes; 7: 25, 2014 Jan 10.
Article in English | MEDLINE | ID: mdl-24410929

ABSTRACT

BACKGROUND: A biomarker is usually used as a diagnostic or assessment tool in medical research. Finding an ideal biomarker is not easy, and combining multiple biomarkers provides a promising alternative. Moreover, some combinations of biomarkers based on the optimal linear combination do not have enough discriminatory power. The aim of this study was therefore to identify significant biomarkers through the optimal linear combination maximizing the partial area under the receiver operating characteristic curve (pAUC). METHODS: Under the binormality assumption, we obtain the optimal linear combination of biomarkers maximizing the pAUC. Related statistical tests are developed for the assessment of a biomarker set and of an individual biomarker. Stepwise biomarker selection procedures are introduced to identify the biomarkers of statistical significance. RESULTS: A simulation study and three real examples (Duchenne muscular dystrophy, heart disease, and breast tissue data) show that our methods are well suited to biomarker selection for data sets with a moderate number of biomarkers. CONCLUSIONS: Our proposed biomarker selection approaches can be used to find significant biomarkers on the basis of hypothesis testing.
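
The paper derives the pAUC-optimal combination analytically under binormality. The hedged sketch below instead maximizes the empirical (standardized) pAUC reported by scikit-learn with a derivative-free search, which conveys the objective even though it is not the authors' estimator; the function name and the multistart heuristic are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.metrics import roc_auc_score

def best_linear_combination(X, y, max_fpr=0.2, seed=0):
    """Search for marker weights maximizing the partial AUC of the combined score.

    A minimal empirical sketch: roc_auc_score with max_fpr returns the
    standardized partial AUC, which is maximized here by Nelder-Mead with
    random restarts because the pAUC surface is non-convex.
    """
    rng = np.random.default_rng(seed)

    def neg_pauc(w):
        return -roc_auc_score(y, X @ w, max_fpr=max_fpr)

    best = None
    for _ in range(10):                      # multistart over random weights
        res = minimize(neg_pauc, rng.normal(size=X.shape[1]),
                       method="Nelder-Mead")
        if best is None or res.fun < best.fun:
            best = res
    w = best.x / np.linalg.norm(best.x)      # scale does not affect the ROC ranking
    return w, -best.fun
```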


Subjects
Biomarkers/analysis, Diagnosis, ROC Curve, Algorithms, Area Under Curve, Breast Diseases/pathology, Computer Simulation, Coronary Artery Disease/blood, Electric Impedance, Genetic Carrier Screening, Muscular Dystrophy, Duchenne/blood, Muscular Dystrophy, Duchenne/genetics, Normal Distribution, Sensitivity and Specificity
3.
Gene; 518(1): 179-86, 2013 Apr 10.
Article in English | MEDLINE | ID: mdl-23219997

ABSTRACT

In DNA microarray studies, gene-set analysis (GSA) has become the focus of gene expression data analysis. GSA utilizes the gene expression profiles of functionally related gene sets in Gene Ontology (GO) categories, or in a priori defined biological classes, to assess the significance of gene sets associated with clinical outcomes or phenotypes. Many statistical approaches have been proposed to determine whether such functionally related gene sets are differentially expressed (enriched and/or depleted) across phenotypes. However, little attention has been given to the discriminatory power of gene sets and the classification of patients. In this study, we propose a gene set analysis method in which gene sets are used to develop classifications of patients based on the Random Forest (RF) algorithm. The corresponding empirical p-value of the observed out-of-bag (OOB) error rate of the classifier, obtained with an adequate resampling method, is introduced to identify differentially expressed gene sets. In addition, we discuss the impacts and correlations of genes within each gene set based on the variable importance measures of the RF algorithm. Significant classifications are reported and visualized together with the underlying gene sets and their contribution to the phenotypes of interest. Numerical studies using both synthesized data and a series of publicly available gene expression data sets are conducted to evaluate the performance of the proposed methods. Compared with other hypothesis-testing approaches, our proposed methods are reliable and successful in identifying enriched gene sets and in discovering the contributions of genes within a gene set. The classification results for identified gene sets provide a valuable alternative to gene set testing for revealing unknown, biologically relevant classes of samples or patients. In summary, our proposed method allows one to simultaneously assess the discriminatory ability of gene sets and the importance of genes for the interpretation of data in complex biological systems. The classifications of biologically defined gene sets can reveal the underlying interactions of gene sets associated with the phenotypes and provide an insightful complement to conventional gene set analyses.
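
A minimal sketch of the classification idea described above: fit a random forest to the expression of one gene set, take its out-of-bag error as the test statistic, and obtain an empirical p-value by permuting the class labels. The parameter choices (number of trees, number of permutations) and the function name are assumptions, not the paper's settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def geneset_oob_pvalue(X_set, y, n_perm=200, seed=0):
    """Empirical p-value for one gene set from the OOB error of a random forest.

    Sketch: the observed OOB error is compared with OOB errors obtained after
    permuting the class labels; smaller error means better discrimination.
    """
    rng = np.random.default_rng(seed)

    def oob_error(labels):
        rf = RandomForestClassifier(n_estimators=500, oob_score=True,
                                    random_state=0, n_jobs=-1)
        rf.fit(X_set, labels)
        return 1.0 - rf.oob_score_

    observed = oob_error(y)
    perm_errors = np.array([oob_error(rng.permutation(y)) for _ in range(n_perm)])
    return (1 + np.sum(perm_errors <= observed)) / (n_perm + 1)
```

The fitted forest's `feature_importances_` attribute can then be inspected to rank genes within a significant set, in the spirit of the variable-importance discussion above.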


Subjects
Algorithms, Gene Expression, Molecular Sequence Annotation/methods, Oligonucleotide Array Sequence Analysis/methods, Breast Neoplasms/genetics, Databases, Genetic, Female, Genes, p53, Humans, Lung Neoplasms/genetics, Male, Phenotype
4.
Biom J; 53(1): 5-18, 2011 Feb.
Article in English | MEDLINE | ID: mdl-21259305

ABSTRACT

Case-control sampling is popular in epidemiological research because it saves cost and time. In a logistic regression model, when little is known a priori about the covariance matrix of the point estimator of the regression coefficients, no fixed-sample-size analysis is available. In this study, we propose a two-stage sequential analysis in which the optimal sample fraction and the sample size required to achieve a predetermined volume of a joint confidence set are estimated in an interim analysis. The additionally required observations are then collected in the second stage according to the estimated optimal sample fraction. At the end of the experiment, data from the two stages are combined and analyzed for statistical inference. Simulation studies are conducted to justify the proposed two-stage procedure, and an example is presented for illustration. The proposed two-stage procedure is found to perform adequately, in the sense that the resulting joint confidence set has a well-controlled volume and achieves the required coverage probability. Furthermore, the optimal sample fractions in all the selected scenarios are close to one, so the proposed procedure can be simplified by always using a balanced design.
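
A rough sketch of the interim step, under stated assumptions: after stage one, the volume of the joint Wald confidence ellipsoid for the logistic coefficients is computed from the estimated covariance matrix, and the total sample size is rescaled using the approximation that det(Cov) shrinks roughly like n^(-p). The paper's procedure additionally estimates an optimal case/control fraction, which is omitted here; statsmodels is used only for the stage-one fit, and the function name is illustrative.

```python
import numpy as np
from scipy.stats import chi2
from scipy.special import gammaln
import statsmodels.api as sm

def second_stage_size(X1, y1, target_volume, alpha=0.05):
    """Interim estimate of additional observations needed so the joint
    (1 - alpha) confidence ellipsoid for the coefficients has a target volume.
    A sketch, not the paper's two-stage rule."""
    X1 = sm.add_constant(X1)
    fit = sm.Logit(y1, X1).fit(disp=0)
    cov = np.asarray(fit.cov_params())
    p, n1 = cov.shape[0], len(y1)

    # volume of the Wald confidence ellipsoid: unit-ball volume * radius^p * det(cov)^(1/2)
    log_vol = ((p / 2) * np.log(np.pi) - gammaln(p / 2 + 1)
               + (p / 2) * np.log(chi2.ppf(1 - alpha, p))
               + 0.5 * np.linalg.slogdet(cov)[1])
    current_volume = np.exp(log_vol)

    # det(cov) ~ n^(-p)  =>  volume ~ n^(-p/2)
    n_total = int(np.ceil(n1 * (current_volume / target_volume) ** (2 / p)))
    return max(n_total - n1, 0)   # observations to collect in stage two
```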


Subjects
Data Interpretation, Statistical, Regression Analysis, Research Design/statistics & numerical data, Sample Size, Algorithms, Case-Control Studies, Logistic Models, Retrospective Studies, Sampling Studies
5.
BMC Bioinformatics; 11: 48, 2010 Jan 25.
Article in English | MEDLINE | ID: mdl-20100337

ABSTRACT

BACKGROUND: Before conducting a microarray experiment, one important issue is the number of arrays required to have adequate power to identify differentially expressed genes. This paper discusses some crucial issues in the problem formulation, parameter specification, and approaches commonly proposed for sample size estimation in microarray experiments. Common methods formulate the problem as the minimum sample size necessary to achieve, on average, a specified sensitivity (proportion of truly differentially expressed genes detected) at a specified false discovery rate (FDR) level and a specified expected proportion (pi1) of truly differentially expressed genes on the array. Unfortunately, the probability of achieving the specified sensitivity under such a formulation can be low. We instead formulate the sample size problem as the number of arrays needed to achieve a specified sensitivity with 95% probability at the specified significance level. A permutation method using a small pilot dataset to estimate sample size is proposed; this method accounts for correlation and effect size heterogeneity among genes. RESULTS: A sample size estimate based on the common formulation, to achieve the desired sensitivity on average, can be calculated using a univariate method without taking the correlation among genes into consideration. This formulation of the sample size problem is inadequate because the probability of achieving the specified sensitivity can be lower than 50%. In contrast, the sample size calculated by the proposed permutation method ensures that at least the desired sensitivity is attained with 95% probability. The method is shown to perform well for a real example dataset using a small pilot dataset with 4-6 samples per group. CONCLUSIONS: We recommend formulating the sample size problem as detecting a specified proportion of differentially expressed genes with 95% probability. This formulation ensures finding the desired proportion of true positives with high probability. The proposed permutation method takes the correlation structure and effect size heterogeneity into consideration and works well using only a small pilot dataset.
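
The sketch below conveys the resampling idea: for a candidate sample size, arrays are repeatedly drawn from a small pilot dataset, genes are tested with BH-adjusted t-tests, and the probability of reaching the target sensitivity is estimated. Unlike the paper's permutation method, it assumes the set of truly differentially expressed genes (`true_de`) is supplied, e.g. from spike-ins or a working assumption, so it is an illustration rather than the proposed procedure.

```python
import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.multitest import multipletests

def prob_of_sensitivity(pilot_a, pilot_b, true_de, n_per_group,
                        target_sens=0.8, fdr=0.05, n_boot=500, seed=0):
    """Estimate, by resampling a pilot dataset (genes x arrays per group),
    the probability that an experiment with n_per_group arrays per group
    detects at least target_sens of the genes in true_de at the given FDR.
    Choose the smallest n_per_group for which the returned value >= 0.95."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_boot):
        # bootstrap arrays (columns) to mimic a future experiment of size n
        a = pilot_a[:, rng.integers(0, pilot_a.shape[1], n_per_group)]
        b = pilot_b[:, rng.integers(0, pilot_b.shape[1], n_per_group)]
        _, pvals = ttest_ind(a, b, axis=1)              # per-gene t-tests
        rejected = multipletests(pvals, alpha=fdr, method="fdr_bh")[0]
        hits += rejected[true_de].mean() >= target_sens
    return hits / n_boot
```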


Subjects
Computational Biology/methods, Oligonucleotide Array Sequence Analysis/statistics & numerical data, False Positive Reactions, Gene Expression Profiling/methods, Sample Size
6.
J Biopharm Stat; 18(5): 869-82, 2008.
Article in English | MEDLINE | ID: mdl-18781522

ABSTRACT

An important objective in mass spectrometry (MS) is to identify, from tens or hundreds of spectra, a set of biomarkers that can potentially distinguish patients between distinct treatments (or conditions). A common two-step approach involving peak extraction and quantification is employed to identify the features of scientific interest. The selected features are then investigated further to understand the underlying biological mechanism of individual proteins, or used to develop genomic biomarkers for early diagnosis. However, the use of inadequate or ineffective peak detection and peak alignment algorithms in the peak extraction step may lead to a high rate of false positives, and it is crucial to reduce this false positive rate when detecting biomarkers from tens or hundreds of spectra. Here, a new procedure is introduced for feature extraction in mass spectrometry data that extends the continuous wavelet transform-based (CWT-based) algorithm to multiple spectra. The proposed multispectra CWT-based algorithm (MCWT) not only performs peak detection for multiple spectra but also carries out peak alignment at the same time. The MCWT algorithm constructs a reference, which integrates information from multiple raw spectra, for feature extraction. The algorithm is applied to a SELDI-TOF mass spectra data set provided by CAMDA 2006 with known polypeptide m/z positions. This new approach is easy to implement and outperforms the existing peak extraction method from the Bioconductor PROcess package.
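
A simplified sketch of the multi-spectra idea using SciPy's CWT peak finder: a reference spectrum is formed by averaging the (assumed baseline-corrected and intensity-normalized) spectra, peaks are detected once on the reference, and every spectrum is quantified at those shared positions, so no separate alignment step is needed. The MCWT algorithm constructs its reference differently; this is only an approximation.

```python
import numpy as np
from scipy.signal import find_peaks_cwt

def reference_peaks(spectra, widths=np.arange(1, 32)):
    """Detect peaks shared across multiple spectra via a CWT on a mean reference.

    spectra: array of shape (n_spectra, n_mz), assumed preprocessed.
    Returns the peak indices on the common m/z grid and the per-spectrum
    intensities at those positions.
    """
    spectra = np.asarray(spectra, dtype=float)
    reference = spectra.mean(axis=0)              # simple mean reference
    peak_idx = find_peaks_cwt(reference, widths)  # CWT-based peak detection
    peak_intensities = spectra[:, peak_idx]       # quantify at shared positions
    return peak_idx, peak_intensities
```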


Subjects
Pattern Recognition, Automated/methods, Spectrometry, Mass, Matrix-Assisted Laser Desorption-Ionization/methods, Algorithms, Humans, Peptide Mapping, Sensitivity and Specificity, Signal Processing, Computer-Assisted
7.
BMC Bioinformatics; 8: 412, 2007 Oct 25.
Article in English | MEDLINE | ID: mdl-17961233

ABSTRACT

BACKGROUND: Many researchers are concerned with the comparability and reliability of microarray gene expression data. The recent completion of the MicroArray Quality Control (MAQC) project provides a unique opportunity to assess reproducibility across multiple sites and comparability across multiple platforms. The MAQC analysis presented in support of its conclusion of inter- and intra-platform comparability/reproducibility of microarray gene expression measurements is inadequate. We evaluate the reproducibility/comparability of the MAQC data for 12901 common genes in four titration samples generated from five high-density one-color microarray platforms and the TaqMan technology. We discuss some of the problems with the use of the correlation coefficient as a metric for inter- and intra-platform reproducibility, and with the percentage of overlapping genes (POG) as a measure for evaluating a gene selection procedure, as done by MAQC. RESULTS: A total of 293 arrays were used in the intra- and inter-platform analysis. A hierarchical cluster analysis shows distinct differences in the measured intensities among the five platforms. A number of genes show a small fold-change on one platform and a large fold-change on another, even though the correlations between platforms are high. An analysis of variance shows that thirty percent of genes have expression patterns that are inconsistent across the five platforms. We illustrate that POG does not reflect the accuracy of a selected gene list: a non-overlapping gene can be truly differentially expressed with a stringent cutoff, and an overlapping gene can be non-differentially expressed with a non-stringent cutoff. In addition, POG is unsuitable as a selection criterion; it can increase or decrease irregularly as the cutoff changes, and there is no criterion for determining a cutoff so that POG is optimized. CONCLUSION: Using various statistical methods, we demonstrate that there are differences in the intensities measured by different platforms and by different sites within a platform. Within each platform, the patterns of expression are generally consistent, but there is site-to-site variability. Evaluation of data analysis methods for use in regulatory decisions should take the no-treatment-effect case into consideration: when there is no treatment effect, "a fold-change cutoff with a non-stringent p-value cutoff" could result in 100% false positive selections.
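
To make the POG criticism concrete, the snippet below computes one common variant of the percentage of overlapping genes between two platforms as a function of a shared fold-change cutoff; plotting the returned curve typically shows the irregular, non-monotone behaviour described above. Defining the overlap relative to the smaller list is an assumption, and the function name is illustrative.

```python
import numpy as np

def pog_curve(fold_change_a, fold_change_b, cutoffs):
    """Percentage of overlapping genes (POG) between two platforms at each
    common absolute fold-change cutoff (one common variant of the measure)."""
    pogs = []
    for c in cutoffs:
        list_a = set(np.flatnonzero(np.abs(fold_change_a) >= c))
        list_b = set(np.flatnonzero(np.abs(fold_change_b) >= c))
        denom = min(len(list_a), len(list_b))
        pogs.append(100.0 * len(list_a & list_b) / denom if denom else np.nan)
    return np.array(pogs)
```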


Subjects
Oligonucleotide Array Sequence Analysis/methods, Protein Array Analysis/methods, Reproducibility of Results, Cluster Analysis, DNA Probes/analysis, Databases, Genetic, Equipment Failure Analysis, False Positive Reactions, Gene Expression Profiling/instrumentation, Gene Expression Profiling/methods, Gene Expression Profiling/standards, Oligonucleotide Array Sequence Analysis/instrumentation, Oligonucleotide Array Sequence Analysis/standards, Protein Array Analysis/instrumentation, Protein Array Analysis/standards, Quality Control, Reference Standards, Regression Analysis, Sensitivity and Specificity
8.
J Biopharm Stat; 14(3): 553-73, 2004 Aug.
Article in English | MEDLINE | ID: mdl-15468752

ABSTRACT

Microarray technology allows the expression levels of a large number of genes to be measured simultaneously, but microarray data contain inherent biases introduced by the experiment. Various statistical methods have been proposed for data normalization and data analysis. This paper proposes a generalized additive model for the analysis of gene expression data. The model consists of two sub-models: a non-linear model and a linear model. We propose a two-step normalization algorithm to fit the two sub-models sequentially. The first step involves a non-parametric regression using lowess fits to adjust for non-linear systematic biases. The second step uses a linear ANOVA model to estimate the remaining effects, including the gene-by-treatment interaction, the effect of interest in a study. The proposed model is a generalization of the ANOVA model for microarray data analysis, and we show correspondences between the lowess-fit and ANOVA-model methods. The normalization procedure assumes neither that the majority of genes are unchanged in expression nor that the two channel intensities from the same spot are independent. The procedure can be applied to either one-channel or two-channel data from experiments with multiple treatments or multiple nuisance factors. Two toxicogenomic experiment data sets and a simulated data set are used to contrast the proposed method with the commonly used lowess fit and ANOVA methods.
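
A sequential sketch of the two steps, assuming a long-format table with columns gene, treatment, M (log-ratio), and A (average log-intensity): a lowess fit removes the intensity-dependent bias, and a per-gene ANOVA then tests the treatment effect. The paper fits both pieces jointly as one generalized additive model, so this two-pass version and its column names are only an approximation.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

def two_step_normalize_and_test(df):
    """Lowess normalization followed by per-gene ANOVA (a sketch of the
    two-step idea; the paper's GAM fits both components jointly)."""
    # step 1: non-parametric adjustment for intensity-dependent bias
    fitted = sm.nonparametric.lowess(df["M"], df["A"], frac=0.4,
                                     return_sorted=False)
    df = df.assign(M_adj=df["M"] - fitted)

    # step 2: per-gene ANOVA on the adjusted log-ratios
    results = {}
    for gene, sub in df.groupby("gene"):
        fit = smf.ols("M_adj ~ C(treatment)", data=sub).fit()
        results[gene] = fit.f_pvalue          # p-value for the treatment effect
    return pd.Series(results)
```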


Subjects
Oligonucleotide Array Sequence Analysis/statistics & numerical data, Algorithms, Analysis of Variance, Animals, Data Interpretation, Statistical, Humans, Metals/toxicity, Models, Statistical, Toxicogenetics
9.
Bioinformatics; 20(9): 1436-46, 2004 Jun 12.
Article in English | MEDLINE | ID: mdl-14962916

ABSTRACT

MOTIVATION: A microarray experiment is a multi-step process, and each step is a potential source of variation. There are two major sources of variation: biological variation and technical variation. This study presents a variance-components approach to investigating animal-to-animal, between-array, within-array, and day-to-day variation in two data sets. The first data set involved estimation of technical variances for pooled control and pooled treated RNA samples. The variance components included the between-array variance and two nested within-array variances: between-section (the upper and lower sections of the array are replicates) and within-section (two adjacent spots of the same gene are printed within each section). The second experiment was conducted over four different weeks; each week, reference and test samples were run with a dye-flip replicate on two hybridization days. The variance components included week-to-week, animal-to-animal, between-array, and within-array variances. RESULTS: We applied a linear mixed-effects model to quantify the different sources of variation. In the first data set, the between-array variance is greater than the between-section variance, which in turn is greater than the within-section variance. In the second data set, for the reference samples, the week-to-week variance is larger than the between-array variance, which in turn is slightly larger than the within-array variance. For the test samples, the week-to-week component shows the largest variation, and the animal-to-animal variance is slightly larger than the between-array and within-array variances. However, in a gene-by-gene analysis, the animal-to-animal variance is smaller than the between-array variance for four out of five housekeeping genes. In summary, the largest variation observed is the week-to-week effect; another important source of variability is the animal-to-animal variation. Finally, we describe the use of variance-component estimates to determine the optimal numbers of animals, arrays per animal, and sections per array when planning microarray experiments.
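
A sketch of how such variance components might be estimated for a single gene with a linear mixed-effects model in statsmodels; the column names (log_intensity, week, animal, array) and the nesting structure are illustrative assumptions rather than the study's actual design.

```python
import statsmodels.formula.api as smf

def variance_components(df):
    """Estimate week-to-week, animal-to-animal, between-array, and residual
    (within-array) variance for one gene with random intercepts (a sketch)."""
    model = smf.mixedlm(
        "log_intensity ~ 1", data=df,
        groups="week",                               # week-to-week variance
        re_formula="1",
        vc_formula={"animal": "0 + C(animal)",       # animal-to-animal
                    "array": "0 + C(array)"},        # between-array
    )
    fit = model.fit(reml=True)
    print(fit.summary())                             # variance components table
    return fit
```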


Subjects
Algorithms, Analysis of Variance, Gene Expression Profiling/methods, Models, Genetic, Oligonucleotide Array Sequence Analysis/methods, Sequence Alignment/methods, Sequence Analysis, DNA/methods, Base Sequence, Genetic Variation/genetics, Models, Statistical, Molecular Sequence Data, Reproducibility of Results, Sensitivity and Specificity
10.
J Biopharm Stat; 13(4): 675-89, 2003 Nov.
Article in English | MEDLINE | ID: mdl-14584715

ABSTRACT

When a large number of statistical tests is performed, the chance of false positive findings can increase considerably. The traditional approach is to control the probability of rejecting at least one true null hypothesis, the familywise error rate (FWE). To improve the power to detect treatment differences, an alternative approach is to control the expected proportion of errors among the rejected hypotheses, the false discovery rate (FDR). When some of the hypotheses are not true, the error rate of either the FWE- or the FDR-controlling procedure is usually lower than the designed level. This paper compares five methods for estimating the number of true null hypotheses among a large number of hypotheses. The estimated number of true null hypotheses is then used to improve the power of FWE- or FDR-controlling methods. Monte Carlo simulations are conducted to evaluate the performance of these methods. The lowest slope method, developed by Benjamini and Hochberg (2000) for the adaptive control of the FDR in multiple testing with independent statistics, and the mean of differences method appear to perform best. These two methods control the FWE properly when the number of non-true null hypotheses is small. A data set from a toxicogenomic microarray experiment is used for illustration.
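
For illustration, the sketch below implements the usual textbook description of the lowest slope estimator of the number of true null hypotheses; the stopping rule and rounding follow common presentations of Benjamini and Hochberg (2000) and may differ in detail from the versions compared in the paper.

```python
import numpy as np

def lsl_m0(pvalues):
    """Lowest-slope (LSL) estimate of the number of true null hypotheses.

    Common description: with sorted p-values p_(1) <= ... <= p_(m), compute
    m0(k) = (m + 1 - k) / (1 - p_(k)); stop at the first k >= 2 where
    m0(k) > m0(k - 1) and return min(ceil(m0(k)), m).
    """
    p = np.sort(np.asarray(pvalues, dtype=float))
    m = len(p)
    prev = np.inf
    for k in range(1, m + 1):
        denom = 1.0 - p[k - 1]
        if denom <= 0:                       # p-value of 1: treat the rest as nulls
            return m
        m0_k = (m + 1 - k) / denom
        if k >= 2 and m0_k > prev:
            return int(min(np.ceil(m0_k), m))
        prev = m0_k
    return m

# adaptive use: run the BH step-up procedure at level q * m / lsl_m0(p) to gain power
```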


Subjects
Models, Statistical, Probability, Animals, Likelihood Functions, Rats, Statistics as Topic
11.
Biometrics; 59(4): 1071-81, 2003 Dec.
Article in English | MEDLINE | ID: mdl-14969487

ABSTRACT

Testing for significance with gene expression data from DNA microarray experiments involves simultaneous comparisons of hundreds or thousands of genes. If R denotes the number of rejections (genes declared significant) and V denotes the number of false rejections, then V/R, when R > 0, is the proportion of falsely rejected hypotheses. This paper proposes a model for the distribution of the number of rejections and for the conditional distribution of V given R. Under the independence assumption, the distribution of R is a convolution of two binomials, and the conditional distribution of V given R is noncentral hypergeometric. Under an equicorrelated model, the distributions are more complex and are also derived. Five false discovery rate probability error measures are considered: FDR = E(V/R), pFDR = E(V/R | R > 0) (positive FDR), cFDR = E(V/R | R = r) (conditional FDR), mFDR = E(V)/E(R) (marginal FDR), and eFDR = E(V)/r (empirical FDR). The pFDR, cFDR, and mFDR are shown to be equivalent under the Bayesian framework, in which the number of true null hypotheses is modeled as a random variable. We present a parametric and a bootstrap procedure to estimate the FDRs. Monte Carlo simulations were conducted to evaluate the performance of these two methods. The bootstrap procedure appears to perform reasonably well, even when the alternative hypotheses are correlated (rho = 0.25). An example from a toxicogenomic microarray experiment is presented for illustration.
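
The simulation sketch below estimates three of these measures (FDR, pFDR, and mFDR) by Monte Carlo for independent z-tests with a common effect size; it is meant only to make the definitions concrete, not to reproduce the paper's binomial-convolution or equicorrelated models, and all parameter values are arbitrary.

```python
import numpy as np
from scipy.stats import norm

def fdr_measures(m0=900, m1=100, effect=3.0, alpha=0.001, n_sim=20000, seed=0):
    """Monte Carlo estimates of FDR = E(V/R), pFDR = E(V/R | R > 0), and
    mFDR = E(V)/E(R) for m0 true nulls and m1 alternatives tested at level alpha."""
    rng = np.random.default_rng(seed)
    crit = norm.ppf(1 - alpha)
    z_null = rng.standard_normal((n_sim, m0))
    z_alt = rng.standard_normal((n_sim, m1)) + effect

    V = (z_null > crit).sum(axis=1)          # false rejections
    S = (z_alt > crit).sum(axis=1)           # true rejections
    R = V + S

    vr = np.divide(V, R, out=np.zeros(n_sim), where=R > 0)
    return {
        "FDR": vr.mean(),                    # V/R taken as 0 when R = 0
        "pFDR": vr[R > 0].mean(),
        "mFDR": V.mean() / R.mean(),
    }
```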


Subjects
Biometry, Models, Genetic, Models, Statistical, Oligonucleotide Array Sequence Analysis/methods, Bayes Theorem, Cadmium/toxicity, Genes/drug effects, Genes/genetics, Oligonucleotide Array Sequence Analysis/standards, Reproducibility of Results
12.
Stat Med; 21(2): 231-45, 2002 Jan 30.
Article in English | MEDLINE | ID: mdl-11782062

ABSTRACT

Assessment of therapeutic equivalence or non-inferiority between two medical diagnostic procedures often involves comparing the response rates of paired binary endpoints. The commonly used and accepted approach to assessing equivalence is to compare an asymptotic confidence interval for the difference of the two response rates with clinically meaningful equivalence limits. This paper investigates two asymptotic test statistics, a Wald-type (sample-based) statistic and a restricted maximum likelihood estimation (RMLE-based) statistic, for assessing equivalence or non-inferiority based on paired binary endpoints. The sample size and power functions of the two tests are derived. The actual type I error and power of the two tests are computed by enumerating the exact probabilities in the rejection region. The results show that the RMLE-based test controls the type I error better than the sample-based test. To establish equivalence between two treatments with a symmetric equivalence limit of 0.15, a minimum sample size of 120 is needed. The RMLE-based test without the continuity correction performs well at the boundary point 0. A numerical example illustrates the proposed procedures.
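
A sketch of the simpler of the two statistics, the sample-based (Wald-type) non-inferiority test for paired binary endpoints, computed from the 2x2 table of paired outcomes; the RMLE-based (score-type) test that the paper recommends constrains the variance estimate under the null and is not reproduced here. The cell names n11, n10, n01, n00 denote the paired outcome counts and are illustrative.

```python
import numpy as np
from scipy.stats import norm

def paired_noninferiority_wald(n11, n10, n01, n00, margin=0.15):
    """Sample-based Wald test of H0: p_test - p_reference <= -margin
    against non-inferiority, for paired binary data (a sketch)."""
    n = n11 + n10 + n01 + n00
    diff = (n10 - n01) / n                    # estimated difference in response rates
    # sample-based variance of the paired difference (from the discordant cells)
    var = (n10 + n01 - (n10 - n01) ** 2 / n) / n ** 2
    z = (diff + margin) / np.sqrt(var)
    p_value = 1.0 - norm.cdf(z)               # one-sided p-value
    return diff, z, p_value
```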


Subjects
Diagnostic Tests, Routine/methods, Numerical Analysis, Computer-Assisted, Statistics as Topic/methods, Biopsy, Diagnostic Imaging/standards, Histocytochemistry, Humans, Likelihood Functions, Liver Diseases/diagnosis, Magnetic Resonance Imaging, Sample Size, Therapeutic Equivalency, Tomography, X-Ray Computed