Distributional bias compromises leave-one-out cross-validation.

Austin, George I; Pe'er, Itsik; Korem, Tal.

ArXiv; 2024 Jun 03.

Article in English | MEDLINE | ID: mdl-38883233

ABSTRACT

Cross-validation is a common method for estimating the predictive performance of machine learning models. In a data-scarce regime, where one typically wishes to maximize the number of instances used for training the model, an approach called 'leave-one-out cross-validation' is often used. In this design, a separate model is built for predicting each data instance after training on all other instances. Since this results in a single test data point available per model trained, predictions are aggregated across the entire dataset to calculate common rank-based performance metrics such as the area under the receiver operating characteristic or precision-recall curves. In this work, we demonstrate that this approach creates a negative correlation between the average label of each training fold and the label of its corresponding test instance, a phenomenon that we term distributional bias. As machine learning models tend to regress to the mean of their training data, this distributional bias tends to negatively impact performance evaluation and hyperparameter optimization. We show that this effect generalizes to leave-P-out cross-validation and persists across a wide range of modeling and evaluation approaches, and that it can lead to a bias against stronger regularization. To address this, we propose a generalizable rebalanced cross-validation approach that corrects for distributional bias. We demonstrate that our approach improves cross-validation performance evaluation in synthetic simulations and in several published leave-one-out analyses.
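The mechanism described in the abstract can be illustrated with a small numerical sketch. It is not the authors' code. The function names, the toy labels, and the "label-blind" predictor (one that simply outputs the mean label of its training fold) are assumptions made for illustration. The rebalancing step shown, which also drops one random training instance of the opposite label from each fold, is one simple way to hold the training-label mean constant. The abstract does not specify the authors' procedure.

```python
import numpy as np


def loo_training_means(y):
    """Mean label of each leave-one-out training fold."""
    y = np.asarray(y, dtype=float)
    return (y.sum() - y) / (y.size - 1)


def auroc(y_true, scores):
    """Rank-based AUROC: P(positive scores above negative), ties count 0.5."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    diff = scores[y_true][:, None] - scores[~y_true][None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))


def rebalanced_loo_training_means(y, rng):
    """Hypothetical rebalancing: each fold also drops one random training
    instance whose label is opposite to the held-out instance."""
    y = np.asarray(y, dtype=float)
    means = np.empty(y.size)
    for i in range(y.size):
        train = np.delete(np.arange(y.size), i)   # leave instance i out
        opposite = train[y[train] != y[i]]        # candidates to drop
        drop = rng.choice(opposite)
        keep = train[train != drop]
        means[i] = y[keep].mean()
    return means


# Toy binary labels (assumed): 25 negatives, 15 positives.
y = np.array([0] * 25 + [1] * 15)

# A label-blind predictor outputs its training-fold mean for the test point.
loo_means = loo_training_means(y)
loo_corr = np.corrcoef(loo_means, y)[0, 1]
loo_auc = auroc(y, loo_means)

rng = np.random.default_rng(0)
reb_means = rebalanced_loo_training_means(y, rng)
reb_auc = auroc(y, reb_means)

print("LOOCV: corr(train mean, test label) =", loo_corr, "AUROC =", loo_auc)
print("Rebalanced: AUROC =", reb_auc)
```

Under plain leave-one-out, every positive test instance is paired with a lower training mean than every negative one, so a predictor that carries no information is ranked perfectly backwards when predictions are pooled. With the rebalancing sketched here, all folds share the same training mean and the same predictor falls back to chance level.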

Processing-bias correction with DEBIAS-M improves cross-study generalization of microbiome-based prediction models.

Austin, George I; Kav, Aya Brown; Park, Heekuk; Biermann, Jana; Uhlemann, Anne-Catrin; Korem, Tal.

bioRxiv; 2024 Feb 12.

Article in English | MEDLINE | ID: mdl-38405914

ABSTRACT

Every step in common microbiome profiling protocols has variable efficiency for each microbe. For example, different DNA extraction kits may have different efficiency for Gram-positive and -negative bacteria. These variable efficiencies, combined with technical variation, create strong processing biases, which impede the identification of signals that are reproducible across studies and the development of generalizable and biologically interpretable prediction models. "Batch-correction" methods have been used to alleviate these issues computationally with some success. However, many make strong parametric assumptions which do not necessarily apply to microbiome data or processing biases, or require the use of an outcome variable, which risks overfitting. Lastly and importantly, existing transformations used to correct microbiome data are largely non-interpretable, and could, for example, introduce values to features that were initially mostly zeros. Altogether, processing bias currently compromises our ability to glean robust and generalizable biological insights from microbiome data. Here, we present DEBIAS-M (Domain adaptation with phenotype Estimation and Batch Integration Across Studies of the Microbiome), an interpretable framework for inference and correction of processing bias, which facilitates domain adaptation in microbiome studies. DEBIAS-M learns bias-correction factors for each microbe in each batch that simultaneously minimize batch effects and maximize cross-study associations with phenotypes. Using benchmarks of HIV and colorectal cancer classification from gut microbiome data, and cervical neoplasia prediction from cervical microbiome data, we demonstrate that DEBIAS-M outperforms batch-correction methods commonly used in the field. Notably, we show that the inferred bias-correction factors are stable, interpretable, and strongly associated with specific experimental protocols. 
Overall, we show that DEBIAS-M allows for better modeling of microbiome data and identification of interpretable signals that are reproducible across studies.
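To make the idea of per-microbe, per-batch correction factors concrete, the following is a minimal sketch and not the DEBIAS-M implementation. It assumes the factors act multiplicatively on relative abundances, followed by renormalization. The abstract does not state this functional form. The function name `apply_bias_factors`, the toy data, and the hand-picked factor values are hypothetical, and the step that learns the factors is omitted entirely.

```python
import numpy as np


def apply_bias_factors(rel_abund, batch_ids, log_factors):
    """Scale each taxon by its batch-specific factor, then renormalize.

    rel_abund   : (n_samples, n_taxa) relative abundances, rows sum to 1
    batch_ids   : (n_samples,) integer batch index of each sample
    log_factors : (n_batches, n_taxa) assumed log-scale correction factors
    """
    rel_abund = np.asarray(rel_abund, dtype=float)
    scaled = rel_abund * np.exp(log_factors[batch_ids])
    return scaled / scaled.sum(axis=1, keepdims=True)


# Toy data (assumed): three samples, three taxa, two batches.
x = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.0, 0.8],
              [0.1, 0.6, 0.3]])
batches = np.array([0, 0, 1])

# Hand-picked factors for illustration only; a real method would learn them.
log_w = np.array([[0.0, 1.0, -1.0],
                  [0.5, 0.0, 0.0]])

corrected = apply_bias_factors(x, batches, log_w)
print(corrected)
```

Under this assumed multiplicative form, a taxon that was absent from a sample stays absent after correction, in contrast to the transformations the abstract criticizes for introducing values into features that were mostly zeros.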

The Salivary Microbiome and Predicted Metabolite Production Are Associated with Barrett's Esophagus and High-Grade Dysplasia or Adenocarcinoma.

Solfisburg, Quinn S; Baldini, Federico; Baldwin-Hunter, Brittany; Austin, George I; Lee, Harry H; Park, Heekuk; Freedberg, Daniel E; Lightdale, Charles J; Korem, Tal; Abrams, Julian A.

Cancer Epidemiol Biomarkers Prev; 33(3): 371-380, 2024 Mar 01.

Article in English | MEDLINE | ID: mdl-38117184

ABSTRACT

BACKGROUND: Esophageal adenocarcinoma (EAC) is rising in incidence, and established risk factors do not explain this trend. Esophageal microbiome alterations have been associated with Barrett's esophagus (BE) and dysplasia and EAC. The oral microbiome is tightly linked to the esophageal microbiome; this study aimed to identify salivary microbiome-related factors associated with BE, dysplasia, and EAC. METHODS: Clinical data and oral health history were collected from patients with and without BE. The salivary microbiome was characterized, assessing differential relative abundance of taxa by 16S rRNA gene sequencing and associations between microbiome composition and clinical features. Microbiome metabolic modeling was used to predict metabolite production. RESULTS: A total of 244 patients (125 non-BE and 119 BE) were analyzed. Patients with high-grade dysplasia (HGD)/EAC had a significantly higher prevalence of tooth loss (P = 0.001). There were significant shifts with increased dysbiosis associated with HGD/EAC, independent of tooth loss, with the largest shifts within the genus Streptococcus. Modeling predicted significant shifts in the microbiome metabolic capacities, including increases in L-lactic acid and decreases in butyric acid and L-tryptophan production in HGD/EAC. CONCLUSIONS: Marked dysbiosis in the salivary microbiome is associated with HGD and EAC, with notable increases within the genus Streptococcus and accompanying changes in predicted metabolite production. Further work is warranted to identify the biological significance of these alterations and to validate metabolic shifts. IMPACT: There is an association between oral dysbiosis and HGD/EAC. Further work is needed to establish the diagnostic, predictive, and causal potential of this relationship.

Subject(s)

Adenocarcinoma; Barrett Esophagus; Esophageal Neoplasms; Microbiota; Tooth Loss; Humans; Dysbiosis; RNA, Ribosomal, 16S/genetics; Butyric Acid

Robustness of cancer microbiome signals over a broad range of methodological variation.

Sepich-Poore, Gregory D; McDonald, Daniel; Kopylova, Evguenia; Guccione, Caitlin; Zhu, Qiyun; Austin, George; Carpenter, Carolina; Fraraccio, Serena; Wandro, Stephen; Kosciolek, Tomasz; Janssen, Stefan; Metcalf, Jessica L; Song, Se Jin; Kanbar, Jad; Miller-Montgomery, Sandrine; Heaton, Robert; Mckay, Rana; Patel, Sandip Pravin; Swafford, Austin D; Korem, Tal; Knight, Rob.

Oncogene; 43(15): 1127-1148, 2024 Apr.

Article in English | MEDLINE | ID: mdl-38396294

ABSTRACT

In 2020, we identified cancer-specific microbial signals in The Cancer Genome Atlas (TCGA) [1]. Multiple peer-reviewed papers independently verified or extended our findings [2-12]. Given this impact, we carefully considered concerns by Gihawi et al. [13] that batch correction and database contamination with host sequences artificially created the appearance of cancer type-specific microbiomes. (1) We tested batch correction by comparing raw and Voom-SNM-corrected data per-batch, finding predictive equivalence and significantly similar features. We found consistent results with a modern microbiome-specific method (ConQuR [14]), and when restricting to taxa found in an independent, highly-decontaminated cohort. (2) Using Conterminator [15], we found low levels of human contamination in our original databases (~1% of genomes). We demonstrated that the increased detection of human reads in Gihawi et al. [13] was due to using a newer human genome reference. (3) We developed Exhaustive, a method twice as sensitive as Conterminator, to clean RefSeq. We comprehensively host-deplete TCGA with many human (pan)genome references. We repeated all analyses with this and the Gihawi et al. [13] pipeline, and found cancer type-specific microbiomes. These extensive re-analyses and updated methods validate our original conclusion that cancer type-specific microbial signatures exist in TCGA, and show they are robust to methodology.

Subject(s)

Microbiota; Neoplasms; Humans; Neoplasms/genetics; Microbiota/genetics

Correction: Robustness of cancer microbiome signals over a broad range of methodological variation.

Oncogene; 43(20): 1579, 2024 May.

Article in English | MEDLINE | ID: mdl-38580705
