Identifying correlations driven by influential observations in large datasets.
Bu, Kevin; Wallach, David S; Wilson, Zach; Shen, Nan; Segal, Leopoldo N; Bagiella, Emilia; Clemente, Jose C.
Affiliation
  • Bu K; Department of Genetics and Data Science, Icahn School of Medicine at Mount Sinai, New York, NY, USA.
  • Wallach DS; Department of Genetics and Data Science, Icahn School of Medicine at Mount Sinai, New York, NY, USA.
  • Wilson Z; Department of Genetics and Data Science, Icahn School of Medicine at Mount Sinai, New York, NY, USA.
  • Shen N; Department of Genetics and Data Science, Icahn School of Medicine at Mount Sinai, New York, NY, USA.
  • Segal LN; Division of Pulmonary, Critical Care and Sleep Medicine, Department of Medicine, New York University School of Medicine, New York, NY, USA.
  • Bagiella E; Department of Population Health Science and Policy, Icahn School of Medicine at Mount Sinai, New York, NY, USA.
  • Clemente JC; Department of Genetics and Data Science, Icahn School of Medicine at Mount Sinai, New York, NY, USA.
Brief Bioinform; 23(1), 2022 Jan 17.
Article in En | MEDLINE | ID: mdl-34864851
ABSTRACT
Although high-throughput data allow researchers to interrogate thousands of variables simultaneously, they can also introduce a significant number of spurious results. Here we demonstrate that correlation analysis of large datasets can yield numerous false positives due to the presence of outliers that canonical methods fail to identify. We present Correlations Under The InfluencE (CUTIE), an open-source jackknifing-based method that detects such cases with both parametric and non-parametric correlation measures, and that can also uniquely rescue correlations not originally deemed significant or with incorrect sign. Our approach can additionally be used to identify variables or samples that induce these false correlations in high proportion. A meta-analysis of various omics datasets using CUTIE reveals that this issue is pervasive across different domains, although microbiome data are particularly susceptible to it. Although the significance of a correlation ultimately depends on the thresholds used, our approach provides an efficient way to automatically identify those that warrant closer examination in very large datasets.
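
The jackknifing idea described in the abstract can be illustrated with a minimal sketch. This is not the CUTIE implementation: the function names (`pearson_r`, `approx_p_value`, `jackknife_check`) are hypothetical, the threshold `alpha=0.05` is an assumed default, and the p-value uses a Fisher z-transform normal approximation rather than an exact test. The sketch recomputes a Pearson correlation with each sample left out in turn and reports the samples whose removal flips the significance call, in either direction.

```python
import math
from statistics import NormalDist


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant variable carries no correlation signal
    return sxy / math.sqrt(sxx * syy)


def approx_p_value(r, n):
    """Two-sided p-value via the Fisher z-transform (normal approximation).

    Assumption for this sketch only; an exact t-based test would differ
    slightly, especially for small n.
    """
    if n <= 3:
        return 1.0
    r = max(min(r, 0.999999999), -0.999999999)  # keep atanh finite
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def jackknife_check(x, y, alpha=0.05):
    """Leave each sample out once and list those that flip significance."""
    x, y = list(x), list(y)
    n = len(x)
    p_full = approx_p_value(pearson_r(x, y), n)
    significant = p_full < alpha
    influential = []
    for i in range(n):
        x_loo = x[:i] + x[i + 1:]
        y_loo = y[:i] + y[i + 1:]
        p_loo = approx_p_value(pearson_r(x_loo, y_loo), n - 1)
        if (p_loo < alpha) != significant:
            influential.append(i)
    return {"p_full": p_full, "significant": significant,
            "influential": influential}


# Ten weakly related points plus one extreme sample at index 10.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 50]
y = [5, 3, 6, 4, 5, 4, 6, 3, 5, 4, 50]
result = jackknife_check(x, y)
print(result["significant"], result["influential"])
```

In this toy dataset the full correlation is significant only because of the last sample, so that sample is the one reported. Because the check compares each leave-one-out result with the full-data call, the same loop would also surface a correlation that becomes significant once a single sample is removed.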
Full text: 1 | Collection: 01-internacional | Database: MEDLINE | Main subject: Microbiota | Type of study: Prognostic_studies / Systematic_reviews | Language: En | Journal: Brief Bioinform | Journal subject: Biology / Medical Informatics | Year: 2022 | Document type: Article | Affiliation country: United States