Search | BVS Regional Portal

Distributional bias compromises leave-one-out cross-validation.

Austin, George I; Pe'er, Itsik; Korem, Tal.

ArXiv ; 2024 Jun 03.

Article in English | MEDLINE | ID: mdl-38883233

ABSTRACT

Cross-validation is a common method for estimating the predictive performance of machine learning models. In a data-scarce regime, where one typically wishes to maximize the number of instances used for training the model, an approach called 'leave-one-out cross-validation' is often used. In this design, a separate model is built for predicting each data instance after training on all other instances. Since this results in a single test data point available per model trained, predictions are aggregated across the entire dataset to calculate common rank-based performance metrics such as the area under the receiver operating characteristic or precision-recall curves. In this work, we demonstrate that this approach creates a negative correlation between the average label of each training fold and the label of its corresponding test instance, a phenomenon that we term distributional bias. As machine learning models tend to regress to the mean of their training data, this distributional bias tends to negatively impact performance evaluation and hyperparameter optimization. We show that this effect generalizes to leave-P-out cross-validation and persists across a wide range of modeling and evaluation approaches, and that it can lead to a bias against stronger regularization. To address this, we propose a generalizable rebalanced cross-validation approach that corrects for distributional bias. We demonstrate that our approach improves cross-validation performance evaluation in synthetic simulations and in several published leave-one-out analyses.
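The negative correlation described in this abstract can be illustrated with a short simulation. The sketch below is not the authors' code: the function names, the random binary labels, and the "predict the training mean" dummy model are all assumptions chosen for illustration. The rebalancing step (also dropping one randomly chosen training instance with the opposite label) is only one plausible way to hold the training-fold mean constant and should not be read as the published procedure.

```python
import numpy as np

def loo_training_means(y):
    """Mean label of each leave-one-out training fold (fold i excludes y[i])."""
    y = np.asarray(y, dtype=float)
    return (y.sum() - y) / (len(y) - 1)

def auroc(y_true, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.
    Tied pairs count as 0.5."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())

def rebalanced_training_means(y, rng):
    """Hypothetical rebalancing: for each held-out instance, also drop one
    random training instance with the opposite label, so every training
    fold has the same label mean."""
    y = np.asarray(y, dtype=float)
    means = np.empty(len(y))
    for i in range(len(y)):
        opposite = np.flatnonzero(y != y[i])
        j = rng.choice(opposite)
        means[i] = (y.sum() - y[i] - y[j]) / (len(y) - 2)
    return means

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=40)
y[0], y[1] = 0, 1  # guarantee that both classes are present

# A dummy "model" that ignores features and predicts its training-fold mean.
train_means = loo_training_means(y)
corr = np.corrcoef(train_means, y)[0, 1]
auc_standard = auroc(y, train_means)

rebalanced_means = rebalanced_training_means(y, rng)
auc_rebalanced = auroc(y, rebalanced_means)

print(f"correlation(train mean, test label) = {corr:.3f}")
print(f"AUROC, standard LOOCV   = {auc_standard:.3f}")
print(f"AUROC, rebalanced LOOCV = {auc_rebalanced:.3f}")
```

Because the training-fold mean is a decreasing linear function of the held-out label, the dummy model ranks every positive below every negative under standard leave-one-out, even though it has learned nothing. With the rebalancing assumed here, all of its predictions tie and the AUROC returns to chance level.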

Processing-bias correction with DEBIAS-M improves cross-study generalization of microbiome-based prediction models.

Austin, George I; Kav, Aya Brown; Park, Heekuk; Biermann, Jana; Uhlemann, Anne-Catrin; Korem, Tal.

bioRxiv ; 2024 Feb 12.

Article in English | MEDLINE | ID: mdl-38405914

ABSTRACT

Every step in common microbiome profiling protocols has variable efficiency for each microbe. For example, different DNA extraction kits may have different efficiency for Gram-positive and -negative bacteria. These variable efficiencies, combined with technical variation, create strong processing biases, which impede the identification of signals that are reproducible across studies and the development of generalizable and biologically interpretable prediction models. "Batch-correction" methods have been used to alleviate these issues computationally with some success. However, many make strong parametric assumptions which do not necessarily apply to microbiome data or processing biases, or require the use of an outcome variable, which risks overfitting. Lastly and importantly, existing transformations used to correct microbiome data are largely non-interpretable, and could, for example, introduce values to features that were initially mostly zeros. Altogether, processing bias currently compromises our ability to glean robust and generalizable biological insights from microbiome data. Here, we present DEBIAS-M (Domain adaptation with phenotype Estimation and Batch Integration Across Studies of the Microbiome), an interpretable framework for inference and correction of processing bias, which facilitates domain adaptation in microbiome studies. DEBIAS-M learns bias-correction factors for each microbe in each batch that simultaneously minimize batch effects and maximize cross-study associations with phenotypes. Using benchmarks of HIV and colorectal cancer classification from gut microbiome data, and cervical neoplasia prediction from cervical microbiome data, we demonstrate that DEBIAS-M outperforms batch-correction methods commonly used in the field. Notably, we show that the inferred bias-correction factors are stable, interpretable, and strongly associated with specific experimental protocols. Overall, we show that DEBIAS-M allows for better modeling of microbiome data and identification of interpretable signals that are reproducible across studies.

The Salivary Microbiome and Predicted Metabolite Production Are Associated with Barrett's Esophagus and High-Grade Dysplasia or Adenocarcinoma.

Solfisburg, Quinn S; Baldini, Federico; Baldwin-Hunter, Brittany; Austin, George I; Lee, Harry H; Park, Heekuk; Freedberg, Daniel E; Lightdale, Charles J; Korem, Tal; Abrams, Julian A.

Cancer Epidemiol Biomarkers Prev ; 33(3): 371-380, 2024 03 01.

Article in English | MEDLINE | ID: mdl-38117184

ABSTRACT

BACKGROUND: Esophageal adenocarcinoma (EAC) is rising in incidence, and established risk factors do not explain this trend. Esophageal microbiome alterations have been associated with Barrett's esophagus (BE) and dysplasia and EAC. The oral microbiome is tightly linked to the esophageal microbiome; this study aimed to identify salivary microbiome-related factors associated with BE, dysplasia, and EAC. METHODS: Clinical data and oral health history were collected from patients with and without BE. The salivary microbiome was characterized, assessing differential relative abundance of taxa by 16S rRNA gene sequencing and associations between microbiome composition and clinical features. Microbiome metabolic modeling was used to predict metabolite production. RESULTS: A total of 244 patients (125 non-BE and 119 BE) were analyzed. Patients with high-grade dysplasia (HGD)/EAC had a significantly higher prevalence of tooth loss (P = 0.001). There were significant shifts with increased dysbiosis associated with HGD/EAC, independent of tooth loss, with the largest shifts within the genus Streptococcus. Modeling predicted significant shifts in the microbiome metabolic capacities, including increases in L-lactic acid and decreases in butyric acid and L-tryptophan production in HGD/EAC. CONCLUSIONS: Marked dysbiosis in the salivary microbiome is associated with HGD and EAC, with notable increases within the genus Streptococcus and accompanying changes in predicted metabolite production. Further work is warranted to identify the biological significance of these alterations and to validate metabolic shifts. IMPACT: There is an association between oral dysbiosis and HGD/EAC. Further work is needed to establish the diagnostic, predictive, and causal potential of this relationship.

Subject(s)

Adenocarcinoma , Barrett Esophagus , Esophageal Neoplasms , Microbiota , Tooth Loss , Humans , Dysbiosis , 16S Ribosomal RNA/genetics , Butyric Acid

Contamination source modeling with SCRuB improves cancer phenotype prediction from microbiome data.

Austin, George I; Park, Heekuk; Meydan, Yoli; Seeram, Dwayne; Sezin, Tanya; Lou, Yue Clare; Firek, Brian A; Morowitz, Michael J; Banfield, Jillian F; Christiano, Angela M; Pe'er, Itsik; Uhlemann, Anne-Catrin; Shenhav, Liat; Korem, Tal.

Nat Biotechnol ; 41(12): 1820-1828, 2023 Dec.

Article in English | MEDLINE | ID: mdl-36928429

ABSTRACT

Sequencing-based approaches for the analysis of microbial communities are susceptible to contamination, which could mask biological signals or generate artifactual ones. Methods for in silico decontamination using controls are routinely used, but do not make optimal use of information shared across samples and cannot handle taxa that only partially originate in contamination or leakage of biological material into controls. Here we present Source tracking for Contamination Removal in microBiomes (SCRuB), a probabilistic in silico decontamination method that incorporates shared information across multiple samples and controls to precisely identify and remove contamination. We validate the accuracy of SCRuB in multiple data-driven simulations and experiments, including induced contamination, and demonstrate that it outperforms state-of-the-art methods by an average of 15-20 times. We showcase the robustness of SCRuB across multiple ecosystems, data types and sequencing depths. Demonstrating its applicability to microbiome research, SCRuB facilitates improved predictions of host phenotypes, most notably the prediction of treatment response in melanoma patients using decontaminated tumor microbiome data.

Subject(s)

Microbiota , Neoplasms , Humans , Microbiota/genetics , Phenotype
