OASIS: An interpretable, finite-sample valid alternative to Pearson's X² for scientific discovery.

Baharav, Tavor Z; Tse, David; Salzman, Julia

Affiliation

Baharav TZ; Eric and Wendy Schmidt Center, Broad Institute, Cambridge, MA 02142.
Tse D; Department of Data Science, Dana-Farber Cancer Institute, Boston, MA 02115.
Salzman J; Department of Electrical Engineering, Stanford University, Stanford, CA 94305.

Proc Natl Acad Sci U S A; 121(15): e2304671121, 2024 Apr 09.

Article in English | MEDLINE | ID: mdl-38564640

ABSTRACT

Contingency tables, that is, data represented as count matrices, are ubiquitous across quantitative research and data-science applications. Existing statistical tests are insufficient, however, as none are simultaneously computationally efficient and statistically valid for a finite number of observations. In this work, motivated by a recent application in reference-free genomic inference [K. Chaung et al., Cell 186, 5440-5456 (2023)], we develop Optimized Adaptive Statistic for Inferring Structure (OASIS), a family of statistical tests for contingency tables. OASIS constructs a test statistic that is linear in the normalized data matrix, providing closed-form P-value bounds through classical concentration inequalities. In the process, OASIS provides a decomposition of the table, lending interpretability to its rejection of the null. We derive the asymptotic distribution of the OASIS test statistic, showing that these finite-sample bounds correctly characterize the test statistic's P-value up to a variance term. Experiments on genomic sequencing data highlight the power and interpretability of OASIS. Using OASIS, we develop a method that can detect SARS-CoV-2 and Mycobacterium tuberculosis strains de novo, which existing approaches cannot achieve. We demonstrate in simulations that OASIS is robust to overdispersion, a common feature in genomic data such as single-cell RNA sequencing: under accepted noise models, OASIS provides good control of the false discovery rate, while Pearson's X² consistently rejects the null. Additionally, we show in simulations that OASIS is more powerful than Pearson's X² in certain regimes, including for some important two-group alternatives, which we corroborate with approximate power calculations.
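The idea of a statistic that is linear in the normalized table, paired with a closed-form bound from a concentration inequality, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the authors' implementation: the function name `linear_stat_pvalue_bound`, the choice of centering by the independence fit and scaling each column by the square root of its total, and the use of Hoeffding's inequality as the concentration bound. The sketch assumes a row weight vector `f` with entries in [0, 1] and a column weight vector `c`, both fixed independently of the table, and a null under which every observation in every column is drawn from one shared row distribution. Under that null the statistic is a weighted sum of independent bounded terms, which gives the bound 2·exp(−2S² / (‖c‖² − (Σ_j c_j √n_j)² / M)), where n_j are column totals and M is the grand total.

```python
import numpy as np

def linear_stat_pvalue_bound(X, f, c):
    """Illustrative linear statistic on a contingency table with a Hoeffding bound.

    X : (I, J) array of counts; f : length-I weights in [0, 1]; c : length-J weights.
    f and c are assumed to be chosen independently of X (hypothetical setup).
    Returns (S, bound), where bound is an upper bound on P(|S| >= |observed S|)
    under the shared-row-distribution null.
    """
    X = np.asarray(X, dtype=float)
    f = np.asarray(f, dtype=float)
    c = np.asarray(c, dtype=float)
    if np.any(f < 0) or np.any(f > 1):
        raise ValueError("entries of f must lie in [0, 1]")

    n = X.sum(axis=0)            # column totals n_j
    r = X.sum(axis=1)            # row totals
    M = X.sum()                  # grand total
    if np.any(n <= 0):
        raise ValueError("every column needs at least one observation")

    E = np.outer(r, n) / M       # expected counts under independence
    X_tilde = (X - E) / np.sqrt(n)   # broadcasting scales column j by 1/sqrt(n_j)
    S = f @ X_tilde @ c          # statistic, linear in the normalized table

    # Hoeffding denominator; non-negative by the Cauchy-Schwarz inequality.
    denom = c @ c - (c @ np.sqrt(n)) ** 2 / M
    if denom <= 1e-12:
        return S, 1.0            # S is identically zero for this c; no evidence
    bound = min(1.0, 2.0 * np.exp(-2.0 * S ** 2 / denom))
    return S, bound

# Usage: a 2x2 table whose columns clearly differ in row composition.
S, p = linear_stat_pvalue_bound([[50, 10], [10, 50]], f=[1, 0], c=[1, -1])
print(S, p)
```

Because the bound is a closed-form expression, it costs one pass over the table and needs no resampling. Its validity depends on `f` and `c` not being tuned on the same counts that are tested; choosing them from held-out data is one way to respect that assumption.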

Subjects

Genome; Genomics; Chromosome Mapping

Keywords

computational genomics; contingency table; finite-sample P-value; reference genome free inference

Full text: 1 | Collections: 01-international | Database: MEDLINE | Main subject: Genome / Genomics | Language: English | Journal: Proc Natl Acad Sci U S A | Year of publication: 2024 | Document type: Article
