An efficient concordant integrative analysis of multiple large-scale two-sample expression data sets.

Lai, Yinglei; Zhang, Fanni; Nayak, Tapan K; Modarres, Reza; Lee, Norman H; McCaffrey, Timothy A

Affiliation

Lai Y; Department of Statistics, The George Washington University, Washington, DC 20052, USA.
Zhang F; Department of Statistics, The George Washington University, Washington, DC 20052, USA.
Nayak TK; Department of Statistics, The George Washington University, Washington, DC 20052, USA.
Modarres R; Department of Statistics, The George Washington University, Washington, DC 20052, USA.
Lee NH; Department of Pharmacology and Physiology.
McCaffrey TA; Division of Genomic Medicine, Department of Medicine, The George Washington University Medical Center, Washington, DC 20037, USA.

Bioinformatics ; 33(23): 3852-3860, 2017 Dec 01.

Article in English | MEDLINE | ID: mdl-28174897

ABSTRACT

MOTIVATION:

We have proposed a mixture-model-based approach to the concordant integrative analysis of multiple large-scale two-sample expression datasets. Since the mixture model is based on the transformed differential expression test P-values (z-scores), it is generally applicable to expression data generated by either microarray or RNA-seq platforms. The mixture model is simple, with three normal distribution components for each dataset to represent down-regulation, up-regulation and no differential expression. However, as the number of datasets increases, the model parameter space grows exponentially because the components from different datasets are combined.
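As an illustration of the single-dataset building block described above, the following is a minimal sketch of fitting a three-component normal mixture to z-scores by EM. It is not the authors' implementation. Several details are assumptions not stated in the abstract: the P-values are taken to be one-sided and mapped by z = Phi^{-1}(1 - p), the no-differential-expression component is fixed at N(0, 1), and the function names `p_to_z` and `fit_three_component_em` are hypothetical.

```python
import math
from statistics import NormalDist


def p_to_z(p_values, eps=1e-12):
    """Map one-sided P-values to z-scores via z = Phi^{-1}(1 - p).

    Assumption: this transform is one common convention; the abstract
    does not specify the exact transform used.
    """
    std = NormalDist()
    return [std.inv_cdf(min(max(1.0 - p, eps), 1.0 - eps)) for p in p_values]


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def fit_three_component_em(z, n_iter=100):
    """EM for a mixture of down / null / up normal components.

    Assumption: the null component is held fixed at N(0, 1); the down and
    up components have free means and standard deviations.
    Returns (weights, means, sds), each ordered as [down, null, up].
    """
    weights = [0.1, 0.8, 0.1]
    means = [-2.0, 0.0, 2.0]
    sds = [1.0, 1.0, 1.0]
    n = len(z)
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each z-score.
        resp = []
        for x in z:
            dens = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
            total = sum(dens) or 1e-300  # guard against numerical underflow
            resp.append([d / total for d in dens])
        # M-step: update proportions, then means and SDs of non-null components.
        for k in range(3):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / n
            if k == 1 or nk < 1e-8:
                continue  # null component stays N(0, 1); skip empty components
            means[k] = sum(r[k] * x for r, x in zip(resp, z)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, z)) / nk
            sds[k] = max(math.sqrt(var), 1e-3)
    return weights, means, sds
```

Calling `fit_three_component_em(p_to_z(p_values))` on the P-values of one dataset would return the estimated proportions of down-regulated, unchanged and up-regulated genes together with the fitted component parameters. The integrative model discussed in the abstract combines such components across datasets, which this sketch does not attempt.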

RESULTS:

In this study, motivated by the well-known generalized estimating equations (GEEs) for longitudinal data analysis, we focus on the concordant components and assume that the proportions of non-concordant components follow a special structure. We discuss the exchangeable, multiset coefficient and autoregressive structures for model reduction, and their related expectation-maximization (EM) algorithms. With these structures, the size of the parameter space grows linearly with the number of datasets. In our previous study, we applied the general mixture model to three microarray datasets for lung cancer studies. We show that more gene sets (or pathways) can be detected by the reduced mixture model with the exchangeable structure. Furthermore, we show that more genes can also be detected by the reduced model. The Cancer Genome Atlas (TCGA) data have been increasingly collected. The advantage of incorporating the concordance feature has also been clearly demonstrated based on TCGA RNA sequencing data for studying two closely related types of cancer.

AVAILABILITY AND IMPLEMENTATION:

Additional results are included in a supplemental file. Computer program R-functions are freely available at http://home.gwu.edu/~ylai/research/Concordance.

CONTACT:

ylai@gwu.edu.

SUPPLEMENTARY INFORMATION:

Supplementary data are available at Bioinformatics online.
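To make the parameter-space contrast in the abstract concrete, the sketch below enumerates the component combinations of the general model for a given number of datasets. With three states per dataset there are 3^K combinations for K datasets. "Concordant" is taken here to mean the same state in every dataset, which is an assumption about the abstract's terminology. The exchangeable, multiset coefficient and autoregressive structures are not specified in the abstract and are not implemented; the names `component_combinations`, `is_concordant` and `summarize` are hypothetical.

```python
from itertools import product

# The three per-dataset states named in the abstract.
STATES = ("down", "null", "up")


def component_combinations(n_datasets):
    """All joint state assignments across n_datasets (3 ** n_datasets of them)."""
    return list(product(STATES, repeat=n_datasets))


def is_concordant(combo):
    """Assumed definition: every dataset is in the same state."""
    return len(set(combo)) == 1


def summarize(n_datasets):
    """Return (total combinations, fully concordant combinations)."""
    combos = component_combinations(n_datasets)
    concordant = [c for c in combos if is_concordant(c)]
    return len(combos), len(concordant)
```

The total grows as 3, 9, 27, 81, ... while the number of fully concordant combinations stays at three, which is why a general model that gives each combination its own mixing proportion becomes hard to fit as datasets are added.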

Subjects

Algorithms; Gene Expression Profiling/methods; Oligonucleotide Array Sequence Analysis/methods; Sequence Analysis, RNA/methods; Databases, Genetic; Genetic Association Studies; Genome, Human; Humans; Lung Neoplasms/genetics; Models, Statistical

Full text: 1 | Collections: 01-international | Database: MEDLINE | Main subject: Algorithms / Sequence Analysis, RNA / Oligonucleotide Array Sequence Analysis / Gene Expression Profiling | Type of study: Prognostic_studies / Risk_factors_studies | Limits: Humans | Language: En | Journal: Bioinformatics | Journal subject: MEDICAL INFORMATICS | Year of publication: 2017 | Document type: Article | Country of affiliation: United States
