Results 1 - 20 of 24
1.
J Clin Oncol ; 41(26): 4192-4199, 2023 Sep 10.
Article in English | MEDLINE | ID: mdl-37672882

ABSTRACT

PURPOSE: To improve on current standards for breast cancer prognosis and prediction of chemotherapy benefit by developing a risk model that incorporates the gene expression-based "intrinsic" subtypes luminal A, luminal B, HER2-enriched, and basal-like. METHODS: A 50-gene subtype predictor was developed using microarray and quantitative reverse transcriptase polymerase chain reaction data from 189 prototype samples. Test sets from 761 patients (no systemic therapy) were evaluated for prognosis, and 133 patients were evaluated for prediction of pathologic complete response (pCR) to a taxane and anthracycline regimen. RESULTS: The intrinsic subtypes as discrete entities showed prognostic significance (P = 2.26E-12) and remained significant in multivariable analyses that incorporated standard parameters (estrogen receptor status, histologic grade, tumor size, and node status). A prognostic model for node-negative breast cancer was built using intrinsic subtype and clinical information. The C-index estimate for the combined model (subtype and tumor size) was a significant improvement on either the clinicopathologic model or subtype model alone. The intrinsic subtype model predicted neoadjuvant chemotherapy efficacy with a negative predictive value for pCR of 97%. CONCLUSION: Diagnosis by intrinsic subtype adds significant prognostic and predictive information to standard parameters for patients with breast cancer. The prognostic properties of the continuous risk score will be of value for the management of node-negative breast cancers. The subtypes and risk score can also be used to assess the likelihood of efficacy from neoadjuvant chemotherapy.

2.
Dis Esophagus ; 36(4)2023 Mar 30.
Article in English | MEDLINE | ID: mdl-36222072

ABSTRACT

Few predictors of response to topical corticosteroid (tCS) treatment have been identified in eosinophilic esophagitis (EoE). We aimed to determine whether baseline gene expression predicts histologic response to tCS treatment for EoE. We analyzed prospectively collected samples from incident EoE cases who were treated with tCS for 8 weeks in a development cohort (prospective study) or in an independent validation cohort (clinical trial). Whole transcriptome RNA expression was determined from a baseline (pre-treatment) RNA-later preserved esophageal biopsy. Baseline expression was compared between histologic responders (<15 eos/hpf) and non-responders (≥15 eos/hpf), and differential correlation was used to assess baseline gene expression by response status. In 87 EoE cases analyzed in the development set, there were no differentially expressed genes associated with treatment response (at false discovery rate = 0.1). However, differential correlation identified a module of 22 genes with statistically significantly high pairwise correlation in non-responders (mean correlation coefficient = 0.7) compared to low correlation in responders (coefficient = 0.3). When this 22-gene module was applied to the 89 EoE cases in the independent cohort, it was not validated to predict tCS response at the 15 eos/hpf threshold (mean correlation coefficient = 0.32 in responders and 0.25 in non-responders). Exploration of other thresholds also did not validate any modules. Though we identified a 22-gene differential correlation module measured pre-treatment that was strongly associated with subsequent histologic response to tCS in EoE, this was not validated in an independent population. Alternative methods to predict steroid response should be explored.


Subjects
Eosinophilic Esophagitis; Humans; Eosinophilic Esophagitis/drug therapy; Eosinophilic Esophagitis/genetics; Eosinophilic Esophagitis/complications; Prospective Studies; Glucocorticoids/therapeutic use; Steroids/therapeutic use; Gene Expression
3.
Biostatistics ; 24(2): 388-405, 2023 04 14.
Article in English | MEDLINE | ID: mdl-33948626

ABSTRACT

The relative proportion of RNA isoforms expressed for a given gene has been associated with disease states in cancer, retinal diseases, and neurological disorders. Examination of relative isoform proportions can help determine biological mechanisms, but such analyses often require a per-gene investigation of splicing patterns. Leveraging large public data sets produced by genomic consortia as a reference, one can compare splicing patterns in a data set of interest with those of a reference panel in which samples are divided into distinct groups, such as tissue of origin, or disease status. We propose A latent Dirichlet model to Compare expressed isoform proportions TO a Reference panel (ACTOR), a latent Dirichlet model with Dirichlet Multinomial observations to compare expressed isoform proportions in a data set to an independent reference panel. We use a variational Bayes procedure to estimate posterior distributions for the group membership of one or more samples. Using the Genotype-Tissue Expression project as a reference data set, we evaluate ACTOR on simulated and real RNA-seq data sets to determine tissue-type classifications of genes. ACTOR is publicly available as an R package at https://github.com/mccabes292/actor.


Subjects
Bayes Theorem; Humans; Protein Isoforms/genetics; Protein Isoforms/analysis; Protein Isoforms/metabolism; Sequence Analysis, RNA/methods
4.
Genome Biol ; 20(1): 52, 2019 03 07.
Article in English | MEDLINE | ID: mdl-30845957

ABSTRACT

We propose a statistical boosting method, termed I-Boost, to integrate multiple types of high-dimensional genomics data with clinical data for predicting survival time. I-Boost provides substantially higher prediction accuracy than existing methods. By applying I-Boost to The Cancer Genome Atlas, we show that the integration of multiple genomics platforms with clinical variables improves the prediction of survival time over the use of clinical variables alone; gene expression values are typically more prognostic of survival time than other genomics data types; and gene modules/signatures are at least as prognostic as the collection of individual gene expression data.


Subjects
Gene Expression Profiling/methods; Gene Expression Regulation, Neoplastic; Gene Regulatory Networks; Genomics/methods; Neoplasms/mortality; Software; Humans; Models, Statistical; Neoplasms/genetics; Prognosis; Survival Rate
5.
BMC Bioinformatics ; 19(1): 95, 2018 03 09.
Article in English | MEDLINE | ID: mdl-29523079

ABSTRACT

BACKGROUND: Expression quantitative trait loci (eQTL) analysis identifies genetic markers associated with the expression of a gene. Most existing eQTL analyses and methods investigate association in a single, readily available tissue, such as blood. Joint analysis of eQTL in multiple tissues has the potential to improve, and expand the scope of, single-tissue analyses. Large-scale collaborative efforts such as the Genotype-Tissue Expression (GTEx) program are currently generating high quality data in a large number of tissues. However, computational constraints limit genome-wide multi-tissue eQTL analysis. RESULTS: We develop an integrative method under a hierarchical Bayesian framework for eQTL analysis in a large number of tissues. The model fitting procedure is highly scalable, and the computing time is a polynomial function of the number of tissues. Multi-tissue eQTLs are identified through a local false discovery rate approach, which rigorously controls the false discovery rate. Using simulation and GTEx real data studies, we show that the proposed method has superior performance to existing methods in terms of computing time and the power of eQTL discovery. CONCLUSIONS: We provide a scalable method for eQTL analysis in a large number of tissues. The method enables the identification of eQTL with different configurations and facilitates the characterization of tissue specificity.


Subjects
Gene Expression Regulation; Organ Specificity/genetics; Quantitative Trait Loci/genetics; Algorithms; Bayes Theorem; Computer Simulation; Genome-Wide Association Study; Genotype; Humans; Polymorphism, Single Nucleotide/genetics; ROC Curve
6.
J Mach Learn Res ; 182018 Apr.
Article in English | MEDLINE | ID: mdl-30853860

ABSTRACT

Community detection is the process of grouping strongly connected nodes in a network. Many community detection methods for un-weighted networks have a theoretical basis in a null model. Communities discovered by these methods therefore have interpretations in terms of statistical significance. In this paper, we introduce a null for weighted networks called the continuous configuration model. First, we propose a community extraction algorithm for weighted networks which incorporates iterative hypothesis testing under the null. We prove a central limit theorem for edge-weight sums and asymptotic consistency of the algorithm under a weighted stochastic block model. We then incorporate the algorithm in a community detection method called CCME. To benchmark the method, we provide a simulation framework involving the null to plant "background" nodes in weighted networks with communities. We show that the empirical performance of CCME on these simulations is competitive with existing methods, particularly when overlapping communities and background nodes are present. To further validate the method, we present two real-world networks with potential background nodes and analyze them with CCME, yielding results that reveal macro-features of the corresponding systems.

7.
Biometrics ; 74(2): 616-625, 2018 06.
Article in English | MEDLINE | ID: mdl-29073327

ABSTRACT

The study of expression Quantitative Trait Loci (eQTL) is an important problem in genomics and biomedicine. While detection (testing) of eQTL associations has been widely studied, less work has been devoted to the estimation of eQTL effect size. To reduce false positives, detection methods frequently rely on linear modeling of rank-based normalized or log-transformed gene expression data. Unfortunately, these approaches do not correspond to the simplest model of eQTL action, and thus yield estimates of eQTL association that can be uninterpretable and inaccurate. In this article, we propose a new, log-of-linear model for eQTL action, termed ACME, that captures allelic contributions to cis-acting eQTLs in an additive fashion, yielding effect size estimates that correspond to a biologically coherent model of cis-eQTLs. We describe a non-linear least-squares algorithm to fit the model by maximum likelihood, and obtain corresponding p-values. We perform careful investigation of the model using a combination of simulated data and data from the Genotype Tissue Expression (GTEx) project. Our results reveal little evidence for dominance effects, a parsimonious result that accords with a simple biological model for allele-specific expression and supports use of the ACME model. We show that Type-I error is well-controlled under our approach in a realistic setting, so that rank-based normalizations are unnecessary. Furthermore, we show that such normalizations can be detrimental to power and estimation accuracy under the proposed model. We then show, through effect size analyses of whole-genome cis-eQTLs in the GTEx data, that using standard normalizations instead of ACME noticeably affects the ranking and sign of estimates.


Subjects
Linear Models; Quantitative Trait Loci; Algorithms; Alleles; Gene Expression; Humans; Statistics as Topic
8.
Biostatistics ; 19(3): 391-406, 2018 07 01.
Article in English | MEDLINE | ID: mdl-29029013

ABSTRACT

Expression quantitative trait locus (eQTL) analyses identify genetic markers associated with the expression of a gene. Most up-to-date eQTL studies consider the connection between genetic variation and expression in a single tissue. Multi-tissue analyses have the potential to improve findings in a single tissue, and elucidate the genotypic basis of differences between tissues. In this article, we develop a hierarchical Bayesian model (MT-eQTL) for multi-tissue eQTL analysis. MT-eQTL explicitly captures patterns of variation in the presence or absence of eQTL, as well as the heterogeneity of effect sizes across tissues. We devise an efficient Expectation-Maximization (EM) algorithm for model fitting. Inferences concerning eQTL detection and the configuration of eQTL across tissues are derived from the adaptive thresholding of local false discovery rates, and maximum a posteriori estimation, respectively. We also provide theoretical justification of the adaptive procedure. We investigate the MT-eQTL model through an extensive analysis of a 9-tissue data set from the GTEx initiative.
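
The thresholding step mentioned above can be illustrated with a generic rule for local false discovery rates. The sketch below is not taken from the abstract: it assumes the common procedure of sorting local fdr values and keeping the largest set whose average stays at or below a target level, and the names `lfdr_threshold` and `alpha` are hypothetical.

```python
import numpy as np

def lfdr_threshold(lfdr, alpha=0.05):
    """Return a boolean mask marking discoveries.

    Assumed rule (not from the abstract): sort the local fdr values in
    increasing order and keep the largest prefix whose running mean is
    at most alpha.
    """
    lfdr = np.asarray(lfdr, dtype=float)
    order = np.argsort(lfdr)
    running_mean = np.cumsum(lfdr[order]) / np.arange(1, lfdr.size + 1)
    # The running mean of sorted values never decreases, so counting the
    # entries at or below alpha gives the length of the largest valid prefix.
    k = int(np.sum(running_mean <= alpha))
    selected = np.zeros(lfdr.size, dtype=bool)
    selected[order[:k]] = True
    return selected
```

Because the cutoff depends on the whole set of values rather than on a fixed per-test threshold, it adapts to how much signal the data contain.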


Subjects
Biostatistics/methods; Gene Expression; Genomics/methods; Genotyping Techniques/methods; Models, Statistical; Quantitative Trait Loci; Bayes Theorem; Humans
9.
J Mach Learn Res ; 18: 5458-5506, 2017.
Article in English | MEDLINE | ID: mdl-31871433

ABSTRACT

Multilayer networks are a useful way to capture and model multiple, binary or weighted relationships among a fixed group of objects. While community detection has proven to be a useful exploratory technique for the analysis of single-layer networks, the development of community detection methods for multilayer networks is still in its infancy. We propose and investigate a procedure, called Multilayer Extraction, that identifies densely connected vertex-layer sets in multilayer networks. Multilayer Extraction makes use of a significance-based score that quantifies the connectivity of an observed vertex-layer set through comparison with a fixed-degree random graph model. Multilayer Extraction directly handles networks with heterogeneous layers where community structure may be different from layer to layer. The procedure can capture overlapping communities, as well as background vertex-layer pairs that do not belong to any community. We establish consistency of the vertex-layer set optimizer of our proposed multilayer score under the multilayer stochastic block model. We investigate the performance of Multilayer Extraction on three applications and a test bed of simulations. Our theoretical and numerical evaluations suggest that Multilayer Extraction is an effective exploratory tool for analyzing complex multilayer networks. Code is publicly available at https://github.com/jdwilson4/MultilayerExtraction.

10.
Biometrics ; 71(4): 1185-94, 2015 Dec.
Article in English | MEDLINE | ID: mdl-26243050

ABSTRACT

We describe a simple, computationally efficient, permutation-based procedure for selecting the penalty parameter in LASSO-penalized regression. The procedure, permutation selection, is intended for applications where variable selection is the primary focus, and can be applied in a variety of structural settings, including that of generalized linear models. We briefly discuss connections between permutation selection and existing theory for the LASSO. In addition, we present a simulation study and an analysis of real biomedical data sets in which permutation selection is compared with selection based on the following: cross-validation (CV), the Bayesian information criterion (BIC), scaled sparse linear regression, and a selection method based on recently developed testing procedures for the LASSO.
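
A permutation-based choice of the LASSO penalty can be sketched in a few lines. The abstract does not spell out the rule, so the version below is an assumption for illustration: permute the response, record the smallest penalty at which the LASSO selects no variables, and take the median over permutations. The function names and the scaling of the objective are hypothetical.

```python
import numpy as np

def lambda_max(X, y):
    """Smallest penalty at which the LASSO selects no predictors.

    Assumes the objective (1/(2n)) * ||y - Xb||^2 + lam * ||b||_1 with an
    unpenalized intercept, so the response and columns are centered first.
    """
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    return float(np.max(np.abs(Xc.T @ yc)) / n)

def permutation_lambda(X, y, n_perm=100, seed=0):
    """Median, over permuted responses, of the penalty that selects nothing."""
    rng = np.random.default_rng(seed)
    lams = [lambda_max(X, rng.permutation(y)) for _ in range(n_perm)]
    return float(np.median(lams))
```

Permuting the response breaks any real association with the predictors, so the resulting penalty is roughly the level needed to keep pure-noise variables out of the model.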


Subjects
Models, Statistical; Animals; Bayes Theorem; Biometry/methods; Breast Neoplasms/genetics; Cholesterol, HDL/blood; Cholesterol, HDL/genetics; Computer Simulation; Databases, Factual/statistics & numerical data; Female; Genome-Wide Association Study/statistics & numerical data; Humans; Linear Models; Logistic Models; Mice; Regression Analysis
11.
Genet Epidemiol ; 39(2): 77-88, 2015 Feb.
Article in English | MEDLINE | ID: mdl-25417853

ABSTRACT

Genomewide association studies (GWAS) sometimes identify loci at which both the number and identities of the underlying causal variants are ambiguous. In such cases, statistical methods that model effects of multiple single-nucleotide polymorphisms (SNPs) simultaneously can help disentangle the observed patterns of association and provide information about how those SNPs could be prioritized for follow-up studies. Current multi-SNP methods, however, tend to assume that SNP effects are well captured by additive genetics; yet when genetic dominance is present, this assumption translates to reduced power and faulty prioritizations. We describe a statistical procedure for prioritizing SNPs at GWAS loci that efficiently models both additive and dominance effects. Our method, LLARRMA-dawg, combines a group LASSO procedure for sparse modeling of multiple SNP effects with a resampling procedure based on fractional observation weights. It estimates for each SNP the robustness of association with the phenotype both to sampling variation and to competing explanations from other SNPs. In producing an SNP prioritization that best identifies underlying true signals, we show the following: our method easily outperforms a single-marker analysis; when additive-only signals are present, our joint model for additive and dominance is equivalent to or only slightly less powerful than modeling additive-only effects; and when dominance signals are present, even in combination with substantial additive effects, our joint model is unequivocally more powerful than a model assuming additivity. We also describe how performance can be improved through calibrated randomized penalization, and discuss how dominance in ungenotyped SNPs can be incorporated through either heterozygote dosage or multiple imputation.


Subjects
Genes, Dominant/genetics; Models, Genetic; Polymorphism, Single Nucleotide/genetics; Genetic Loci/genetics; Genome-Wide Association Study; Heterozygote; Humans; Phenotype; ROC Curve
12.
Bernoulli (Andover) ; 19(1): 275-294, 2013.
Article in English | MEDLINE | ID: mdl-24194673

ABSTRACT

We investigate the maximal size of distinguished submatrices of a Gaussian random matrix. Of interest are submatrices whose entries have an average greater than or equal to a positive constant, and submatrices whose entries are well fit by a two-way ANOVA model. We identify size thresholds and associated (asymptotic) probability bounds for both large-average and ANOVA-fit submatrices. Probability bounds are obtained when the matrix and submatrices of interest are square and, in rectangular cases, when the matrix and submatrices of interest have fixed aspect ratios. Our principal result is an almost sure interval concentration result for the size of large average submatrices in the square case.

13.
Ann Appl Stat ; 7(1): 523-542, 2013 Mar 01.
Article in English | MEDLINE | ID: mdl-23745156

ABSTRACT

Research in several fields now requires the analysis of datasets in which multiple high-dimensional types of data are available for a common set of objects. In particular, The Cancer Genome Atlas (TCGA) includes data from several diverse genomic technologies on the same cancerous tumor samples. In this paper we introduce Joint and Individual Variation Explained (JIVE), a general decomposition of variation for the integrated analysis of such datasets. The decomposition consists of three terms: a low-rank approximation capturing joint variation across data types, low-rank approximations for structured variation individual to each data type, and residual noise. JIVE quantifies the amount of joint variation between data types, reduces the dimensionality of the data, and provides new directions for the visual exploration of joint and individual structure. The proposed method represents an extension of Principal Component Analysis and has clear advantages over popular two-block methods such as Canonical Correlation Analysis and Partial Least Squares. A JIVE analysis of gene expression and miRNA data on Glioblastoma Multiforme tumor samples reveals gene-miRNA associations and provides better characterization of tumor types.
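
In symbols, the three-term decomposition described above can be written roughly as follows. The notation is assumed here for illustration and does not appear in the abstract: $X_i$ is the $i$-th of $k$ data blocks, with rows as variables and columns as the shared objects; $J_i$ is its joint part, $A_i$ its individual part, $E_i$ the residual noise, and $r$ and $r_i$ are the assumed ranks.

```latex
X_i = J_i + A_i + E_i, \qquad i = 1, \dots, k,
\qquad
\operatorname{rank}\begin{pmatrix} J_1 \\ \vdots \\ J_k \end{pmatrix} = r,
\quad
\operatorname{rank}(A_i) = r_i .
```

The rank condition on the stacked joint matrix is what ties the blocks together: the $J_i$ share a single low-dimensional row space across data types, while each $A_i$ is low-rank on its own.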

14.
Bioinformatics ; 27(5): 678-85, 2011 Mar 01.
Article in English | MEDLINE | ID: mdl-21183584

ABSTRACT

MOTIVATION: DNA copy number gains and losses are commonly found in tumor tissue, and some of these aberrations play a role in tumor genesis and development. Although high resolution DNA copy number data can be obtained using array-based techniques, no single method is widely used to distinguish between recurrent and sporadic copy number aberrations. RESULTS: Here we introduce Discovering Copy Number Aberrations Manifested In Cancer (DiNAMIC), a novel method for assessing the statistical significance of recurrent copy number aberrations. In contrast to competing procedures, the testing procedure underlying DiNAMIC is carefully motivated, and employs a novel cyclic permutation scheme. Extensive simulation studies show that DiNAMIC controls false positive discoveries in a variety of realistic scenarios. We use DiNAMIC to analyze two publicly available tumor datasets, and our results show that DiNAMIC detects multiple loci that have biological relevance. AVAILABILITY: Source code implemented in R, as well as text files containing examples and sample datasets are available at http://www.bios.unc.edu/research/genomic_software/DiNAMIC.
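
The idea of a cyclic permutation null can be illustrated generically. The sketch below is not DiNAMIC itself: the test statistic (the largest column sum), the treatment of the genome as one circular sequence, and all names are assumptions made for illustration.

```python
import numpy as np

def max_column_sum(cn):
    """Largest marker-wise sum across samples (assumed test statistic)."""
    return float(np.max(cn.sum(axis=0)))

def cyclic_shift_pvalue(cn, n_perm=1000, seed=0):
    """Permutation p-value for a recurrent gain.

    cn is a samples x markers array of copy number values. Each permutation
    rotates every sample's profile by its own random offset, which keeps the
    within-sample structure but breaks alignment across samples.
    """
    rng = np.random.default_rng(seed)
    observed = max_column_sum(cn)
    n_samples, n_markers = cn.shape
    exceed = 0
    for _ in range(n_perm):
        shifts = rng.integers(0, n_markers, size=n_samples)
        shifted = np.stack([np.roll(row, s) for row, s in zip(cn, shifts)])
        exceed += max_column_sum(shifted) >= observed
    # Add one to numerator and denominator so the p-value is never zero.
    return (exceed + 1) / (n_perm + 1)
```

Rotating rather than shuffling markers preserves the serial correlation along each profile, which an unrestricted permutation would destroy.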


Subjects
DNA Copy Number Variations; Neoplasms/genetics; Sequence Analysis, DNA/methods; Software; Algorithms; Computational Biology/methods; DNA, Neoplasm/genetics; Humans
15.
BMC Genomics ; 11: 574, 2010 Oct 18.
Article in English | MEDLINE | ID: mdl-20955544

ABSTRACT

BACKGROUND: Analysis of microarray experiments often involves testing for the overrepresentation of pre-defined sets of genes among lists of genes deemed individually significant. Most popular gene set testing methods assume the independence of genes within each set, an assumption that is seriously violated, as extensive correlation between genes is a well-documented phenomenon. RESULTS: We conducted a meta-analysis of over 200 datasets from the Gene Expression Omnibus in order to demonstrate the practical impact of strong gene correlation patterns that are highly consistent across experiments. We show that a common independence assumption-based gene set testing procedure produces very high false positive rates when applied to data sets for which treatment groups have been randomized, and that gene sets with high internal correlation are more likely to be declared significant. A reanalysis of the same datasets using an array resampling approach properly controls false positive rates, leading to more parsimonious and high-confidence gene set findings, which should facilitate pathway-based interpretation of the microarray data. CONCLUSIONS: These findings call into question many of the gene set testing results in the literature and argue strongly for the adoption of resampling based gene set testing criteria in the peer reviewed biomedical literature.


Subjects
Databases, Genetic; Oligonucleotide Array Sequence Analysis/methods; Animals; Gene Expression Regulation; Humans; Mice; Publications
16.
PLoS One ; 4(3): e4886, 2009.
Article in English | MEDLINE | ID: mdl-19295910

ABSTRACT

In cells lacking the histone methyltransferase Set2, initiation of RNA polymerase II transcription occurs inappropriately within the protein-coding regions of genes, rather than being restricted to the proximal promoter. It was previously reported that this "cryptic" transcription occurs preferentially in long genes, and in genes that are infrequently transcribed. Here, we mapped the transcripts produced in an S. cerevisiae strain lacking Set2, and applied rigorous statistical methods to identify sites of cryptic transcription at high resolution. We find that suppression of cryptic transcription occurs independent of gene length or transcriptional frequency. Our conclusions differ with those reported previously because we obtained a higher-resolution dataset, we accounted for the fact that gene length and transcriptional frequency are not independent variables, and we accounted for several ascertainment biases that make cryptic transcription easier to detect in long, infrequently transcribed genes. These new results and conclusions have implications for many commonly used genomic analysis approaches, and for the evolution of high-fidelity RNA polymerase II transcriptional initiation in eukaryotes.


Subjects
Histone Deacetylases/metabolism; Methyltransferases/metabolism; Saccharomyces cerevisiae Proteins/metabolism; Saccharomyces cerevisiae/genetics; Transcription, Genetic; Algorithms; Histone Deacetylases/genetics; Methyltransferases/genetics; Microarray Analysis; Promoter Regions, Genetic; RNA/genetics; RNA/metabolism; RNA Polymerase II/metabolism; Saccharomyces cerevisiae/metabolism; Saccharomyces cerevisiae Proteins/genetics
17.
J Clin Oncol ; 27(8): 1160-7, 2009 Mar 10.
Article in English | MEDLINE | ID: mdl-19204204

ABSTRACT

PURPOSE: To improve on current standards for breast cancer prognosis and prediction of chemotherapy benefit by developing a risk model that incorporates the gene expression-based "intrinsic" subtypes luminal A, luminal B, HER2-enriched, and basal-like. METHODS: A 50-gene subtype predictor was developed using microarray and quantitative reverse transcriptase polymerase chain reaction data from 189 prototype samples. Test sets from 761 patients (no systemic therapy) were evaluated for prognosis, and 133 patients were evaluated for prediction of pathologic complete response (pCR) to a taxane and anthracycline regimen. RESULTS: The intrinsic subtypes as discrete entities showed prognostic significance (P = 2.26E-12) and remained significant in multivariable analyses that incorporated standard parameters (estrogen receptor status, histologic grade, tumor size, and node status). A prognostic model for node-negative breast cancer was built using intrinsic subtype and clinical information. The C-index estimate for the combined model (subtype and tumor size) was a significant improvement on either the clinicopathologic model or subtype model alone. The intrinsic subtype model predicted neoadjuvant chemotherapy efficacy with a negative predictive value for pCR of 97%. CONCLUSION: Diagnosis by intrinsic subtype adds significant prognostic and predictive information to standard parameters for patients with breast cancer. The prognostic properties of the continuous risk score will be of value for the management of node-negative breast cancers. The subtypes and risk score can also be used to assess the likelihood of efficacy from neoadjuvant chemotherapy.


Subjects
Breast Neoplasms/classification; Adult; Aged; Breast Neoplasms/drug therapy; Breast Neoplasms/etiology; Breast Neoplasms/mortality; Chemotherapy, Adjuvant; Female; Humans; Middle Aged; Neoplasm Recurrence, Local/etiology; Prognosis; Receptor, ErbB-2/analysis; Receptors, Estrogen/analysis; Reverse Transcriptase Polymerase Chain Reaction; Risk
18.
Bioinformatics ; 25(4): 482-9, 2009 Feb 15.
Article in English | MEDLINE | ID: mdl-19091771

ABSTRACT

MOTIVATION: Gene expression Quantitative Trait Locus (eQTL) mapping measures the association between transcript expression and genotype in order to find genomic locations likely to regulate transcript expression. The availability of both gene expression and high-density genotype data has improved our ability to perform eQTL mapping in inbred mouse and other homozygous populations. However, existing eQTL mapping software does not scale well when the number of transcripts and markers are on the order of 10^5 and 10^5-10^6, respectively. RESULTS: We propose a new method, FastMap, for fast and efficient eQTL mapping in homozygous inbred populations with binary allele calls. FastMap exploits the discrete nature and structure of the measured single nucleotide polymorphisms (SNPs). In particular, SNPs are organized into a Hamming distance-based tree that minimizes the number of arithmetic operations required to calculate the association of a SNP by making use of the association of its parent SNP in the tree. FastMap's tree can be used to perform both single marker mapping and haplotype association mapping over an m-SNP window. These performance enhancements also permit permutation-based significance testing. AVAILABILITY: The FastMap program and source code are available at the website: http://cebc.unc.edu/fastmap86.html.
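
The saving from reusing a parent SNP's result can be shown with a small sketch. This is an illustration under assumptions, not FastMap's actual code: it supposes that the association statistic depends on the sum of expression values over strains carrying allele 1, so a child SNP at small Hamming distance from its parent only needs an update over the strains where the two differ. All names are hypothetical.

```python
import numpy as np

def allele_sum(y, snp):
    """Sum of expression values over strains with allele 1 (full pass)."""
    return float(y[snp == 1].sum())

def child_sum(y, parent_snp, child_snp, parent_sum):
    """Update the parent's sum using only the strains where the SNPs differ."""
    diff = np.flatnonzero(parent_snp != child_snp)
    # +y where the child gained allele 1, -y where it lost allele 1.
    sign = child_snp[diff].astype(int) - parent_snp[diff].astype(int)
    return parent_sum + float(sign @ y[diff])
```

With n strains, the full pass costs on the order of n operations per SNP, while the update costs only as many operations as the Hamming distance to the parent.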


Subjects
Homozygote; Quantitative Trait Loci/genetics; Software; Algorithms; Animals; Gene Expression Profiling; Genome; Genotype; Mice
19.
J Am Stat Assoc ; 104(488): 1318-1323, 2009 12.
Article in English | MEDLINE | ID: mdl-21904418
20.
J Am Stat Assoc ; 104(488): 1295-1310, 2009.
Article in English | MEDLINE | ID: mdl-21127725

ABSTRACT

In this paper we define a hierarchical Bayesian model for microarray expression data collected from several studies and use it to identify genes that show differential expression between two conditions. Key features include shrinkage across both genes and studies, and flexible modeling that allows for interactions between platforms and the estimated effect, as well as concordant and discordant differential expression across studies. We evaluated the performance of our model in a comprehensive fashion, using both artificial data, and a "split-study" validation approach that provides an agnostic assessment of the model's behavior not only under the null hypothesis, but also under a realistic alternative. The simulation results from the artificial data demonstrate the advantages of the Bayesian model. The 1 - AUC values for the Bayesian model are roughly half of the corresponding values for a direct combination of t- and SAM-statistics. Furthermore, the simulations provide guidelines for when the Bayesian model is most likely to be useful. Most noticeably, in small studies the Bayesian model generally outperforms other methods when evaluated by AUC, FDR, and MDR across a range of simulation parameters, and this difference diminishes for larger sample sizes in the individual studies. The split-study validation illustrates appropriate shrinkage of the Bayesian model in the absence of platform-, sample-, and annotation-differences that otherwise complicate experimental data analyses. Finally, we fit our model to four breast cancer studies employing different technologies (cDNA and Affymetrix) to estimate differential expression in estrogen receptor positive tumors versus negative ones. Software and data for reproducing our analysis are publicly available.
