Empirical null estimation using zero-inflated discrete mixture distributions and its application to protein domain data.

Gauran, Iris Ivy M; Park, Junyong; Lim, Johan; Park, DoHwan; Zylstra, John; Peterson, Thomas; Kann, Maricel; Spouge, John L

Affiliation

Gauran IIM; Department of Mathematics and Statistics, University of Maryland, Baltimore County, Baltimore, Maryland 21250, U.S.A.
Park J; School of Statistics, University of the Philippines Diliman, Quezon City, 1101, Philippines.
Lim J; Department of Mathematics and Statistics, University of Maryland, Baltimore County, Baltimore, Maryland 21250, U.S.A.
Park D; Department of Statistics, Seoul National University, Seoul, 08826, Republic of Korea.
Zylstra J; Department of Mathematics and Statistics, University of Maryland, Baltimore County, Baltimore, Maryland 21250, U.S.A.
Peterson T; Department of Mathematics and Statistics, University of Maryland, Baltimore County, Baltimore, Maryland 21250, U.S.A.
Kann M; Department of Biological Sciences, University of Maryland, Baltimore County, Baltimore, Maryland 21250, U.S.A.
Spouge JL; Department of Biological Sciences, University of Maryland, Baltimore County, Baltimore, Maryland 21250, U.S.A.

Biometrics; 74(2): 458-471, June 2018.

Article in English | MEDLINE | ID: mdl-28940296

ABSTRACT

In recent mutation studies, analyses based on protein domain positions are gaining popularity over gene-centric approaches, since the latter are limited in how well they capture the functional context that the position of a mutation provides. This presents a large-scale simultaneous inference problem, with hundreds of hypothesis tests to consider at the same time. This article aims to select significant mutation counts while controlling a given level of Type I error via False Discovery Rate (FDR) procedures. One main assumption is that the mutation counts follow a zero-inflated model, which accounts for both the true zeros in the count model and the excess zeros. The class of models considered is the Zero-inflated Generalized Poisson (ZIGP) distribution. Furthermore, we assume that there exists a cut-off value such that counts smaller than this value are generated from the null distribution. We present several data-dependent methods to determine the cut-off value. We also consider a two-stage procedure based on a screening process, so that mutation counts exceeding a certain value are considered significant. Simulated and protein domain data sets are used to illustrate this procedure for estimating the empirical null using a mixture of discrete distributions. Overall, while maintaining control of the FDR, the proposed two-stage testing procedure has superior empirical power.
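
As an illustration of the zero-inflated model named in the abstract, the following is a minimal sketch of a ZIGP probability mass function. The abstract does not state a parameterization, so this sketch assumes Consul's form of the generalized Poisson, P(X = x) = theta (theta + lam x)^(x-1) exp(-theta - lam x) / x!, with theta > 0 and 0 <= lam < 1. The function names and the symbols `omega`, `theta`, and `lam` are hypothetical and are not taken from the article.

```python
import math


def gp_pmf(x, theta, lam):
    """Generalized Poisson pmf (assumed Consul form), theta > 0, 0 <= lam < 1."""
    if x < 0:
        return 0.0
    # Work on the log scale so that large counts do not overflow.
    log_p = (math.log(theta)
             + (x - 1) * math.log(theta + lam * x)
             - theta - lam * x
             - math.lgamma(x + 1))
    return math.exp(log_p)


def zigp_pmf(x, omega, theta, lam):
    """Zero-inflated GP: point mass omega at zero mixed with a GP count model."""
    base = (1.0 - omega) * gp_pmf(x, theta, lam)
    # Zeros arise from both the excess-zero component and the count model itself.
    return omega + base if x == 0 else base
```

With `lam = 0` the generalized Poisson reduces to an ordinary Poisson, so the same sketch also covers the zero-inflated Poisson as a special case.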

Subject(s)

Biometry/methods; Data Interpretation, Statistical; Protein Domains; Statistical Distributions; DNA Mutational Analysis; Databases, Protein; Humans; Mutation Rate; Poisson Distribution

Keywords

Local false discovery rate; Protein domain; Zero-inflated generalized Poisson

Full text: 1 | Database: MEDLINE | Main subject: Statistical Distributions / Data Interpretation, Statistical / Biometry / Protein Domains | Language: English | Journal: Biometrics | Year: 2018 | Document type: Article