ABSTRACT
OBJECTIVE: Pooling multisite resting-state electroencephalography (rsEEG) datasets may introduce bias due to batch effects (i.e., cross-site differences in the rsEEG related to scanner/sample characteristics). The Combining Batches (ComBat) models, introduced for microarray expression and adapted for neuroimaging, can control for batch effects while preserving the variability of biological covariates. We aim to evaluate four ComBat harmonization methods in a pooled sample from five independent rsEEG datasets of young and old adults. METHODS: RsEEG signals (n = 374) were automatically preprocessed. Oscillatory and aperiodic rsEEG features were extracted in sensor space. Features were harmonized using neuroCombat (standard ComBat used in neuroimaging), neuroHarmonize (variant with nonlinear adjustment of covariates), OPNested-GMM (variant based on Gaussian Mixture Models to fit bimodal feature distributions), and HarmonizR (variant based on resampling to handle missing feature values). Relationships between rsEEG features and age were explored before and after harmonizing batch effects. RESULTS: Batch effects were identified in rsEEG features. All ComBat methods reduced batch effects and features' dispersion; HarmonizR and OPNested-GMM ComBat achieved the greatest performance. Harmonized Beta power, individual Alpha peak frequency, Aperiodic exponent, and offset in posterior electrodes showed significant relations with age. All ComBat models maintained the direction of observed relationships while increasing the effect size. CONCLUSIONS: ComBat models, particularly HarmonizR and OPNested-GMM ComBat, effectively control for batch effects in rsEEG spectral features. SIGNIFICANCE: This workflow can be used in multisite studies to harmonize batch effects in sensor-space rsEEG spectral features while preserving biological associations.
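To illustrate what "controlling for batch effects" means, the sketch below aligns the mean and spread of one feature across sites. It is a deliberately simplified location-scale adjustment, not ComBat itself: it omits the empirical Bayes shrinkage and the preservation of biological covariates described above, and it does not use the API of neuroCombat, neuroHarmonize, OPNested-GMM or HarmonizR. The function name `location_scale_harmonize` and the toy values are hypothetical.

```python
from statistics import mean, pstdev


def location_scale_harmonize(values, batches):
    """Rescale each batch to the pooled mean and standard deviation.

    Simplified illustration only: assumes every batch has at least two
    distinct values (non-zero spread) and ignores covariates entirely.
    """
    grand_mean = mean(values)
    grand_sd = pstdev(values)
    harmonized = [0.0] * len(values)
    for batch in sorted(set(batches)):
        idx = [i for i, b in enumerate(batches) if b == batch]
        batch_values = [values[i] for i in idx]
        batch_mean = mean(batch_values)
        batch_sd = pstdev(batch_values)
        for i in idx:
            # Standardize within the batch, then map onto the pooled scale.
            z = (values[i] - batch_mean) / batch_sd
            harmonized[i] = z * grand_sd + grand_mean
    return harmonized


# Toy feature (e.g. one spectral feature at one electrode) from two sites
# whose recordings differ by a constant offset.
values = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0]
batches = ["A", "A", "A", "B", "B", "B"]
harmonized = location_scale_harmonize(values, batches)

site_a_mean = mean(harmonized[:3])
site_b_mean = mean(harmonized[3:])
print(site_a_mean, site_b_mean)  # both sites now share the pooled mean
```

Because this sketch removes all between-site differences in mean and variance, it would also remove genuine biological differences between sites (for example, if one site recruited older adults), which is the problem the covariate-aware ComBat variants are designed to avoid.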
ABSTRACT
Neuroimaging research requires purpose-built analysis software, which is challenging to install and may produce different results across computing environments. The community-oriented, open-source Neurodesk platform (https://www.neurodesk.org/) harnesses a comprehensive and growing suite of neuroimaging software containers. Neurodesk includes a browser-accessible virtual desktop, command-line interface and computational notebook compatibility, allowing for accessible, flexible, portable and fully reproducible neuroimaging analysis on personal workstations, high-performance computers and the cloud.
Subject(s)
Neuroimaging, Software, Neuroimaging/methods, Humans, User-Computer Interface, Reproducibility of Results, Brain/diagnostic imaging
ABSTRACT
Machine learning (ML) is increasingly used in cognitive, computational and clinical neuroscience. The reliable and efficient application of ML requires a sound understanding of its subtleties and limitations. Training ML models on datasets with imbalanced classes is a particularly common problem, and it can have severe consequences if not adequately addressed. With the neuroscience ML user in mind, this paper provides a didactic assessment of the class imbalance problem and illustrates its impact through systematic manipulation of data imbalance ratios in (i) simulated data and (ii) brain data recorded with electroencephalography (EEG), magnetoencephalography (MEG) and functional magnetic resonance imaging (fMRI). Our results illustrate how the widely used Accuracy (Acc) metric, which measures the overall proportion of successful predictions, yields misleadingly high performance as class imbalance increases. Because Acc weights the per-class ratios of correct predictions proportionally to class size, it largely disregards the performance on the minority class. A binary classification model that learns to systematically vote for the majority class will yield an artificially high decoding accuracy that directly reflects the imbalance between the two classes, rather than any genuine generalizable ability to discriminate between them. We show that other evaluation metrics, such as the Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) and the less common Balanced Accuracy (BAcc) metric, defined as the arithmetic mean of sensitivity and specificity, provide more reliable performance evaluations for imbalanced data. Our findings also highlight the robustness of Random Forest (RF), and the benefits of using stratified cross-validation and hyperparameter optimization to tackle data imbalance.
Critically, for neuroscience ML applications that seek to minimize overall classification error, we recommend the routine use of BAcc, which in the specific case of balanced data is equivalent to using standard Acc, and readily extends to multi-class settings. Importantly, we present a list of recommendations for dealing with imbalanced data, as well as open-source code to allow the neuroscience community to replicate and extend our observations and explore alternative approaches to coping with imbalanced data.
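The contrast between Acc and BAcc described above can be checked with a small worked example. The sketch below is not the authors' released code; it uses plain Python, hypothetical function names, and a made-up 90/10 class split to show what happens when a classifier always votes for the majority class.

```python
def accuracy(y_true, y_pred):
    """Overall proportion of correct predictions (Acc)."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def balanced_accuracy(y_true, y_pred, positive=1):
    """Arithmetic mean of sensitivity and specificity (BAcc).

    Assumes a binary problem in which both classes occur in y_true.
    """
    tp = fn = tn = fp = 0
    for t, p in zip(y_true, y_pred):
        if t == positive:
            if p == positive:
                tp += 1
            else:
                fn += 1
        else:
            if p == positive:
                fp += 1
            else:
                tn += 1
    sensitivity = tp / (tp + fn)  # recall on the positive class
    specificity = tn / (tn + fp)  # recall on the negative class
    return (sensitivity + specificity) / 2


# Hypothetical imbalanced dataset: 90 majority (0) and 10 minority (1) samples.
y_true = [0] * 90 + [1] * 10
# A degenerate classifier that always predicts the majority class.
y_pred = [0] * 100

acc = accuracy(y_true, y_pred)
bacc = balanced_accuracy(y_true, y_pred)
print(acc, bacc)  # 0.9 0.5
```

Acc simply mirrors the 90% share of the majority class, whereas BAcc stays at the 0.5 chance level because sensitivity on the minority class is zero.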
Subject(s)
Benchmarking, Brain, Humans, Magnetoencephalography, Machine Learning, Electroencephalography, Algorithms
ABSTRACT
OBJECTIVE: This study aims 1) to analyse differences in resting-state electroencephalogram (rs-EEG) spectral features between Parkinson's Disease (PD) and healthy subjects (non-PD) using Functional Data Analysis (FDA) and 2) to explore, in four independent cohorts, the external validity and reproducibility of the findings using both an epoch-to-epoch FDA approach and an averaged-epochs approach. METHODS: We included 169 subjects (85 non-PD; 84 PD) from four centres. Rs-EEG signals were preprocessed with a combination of automated pipelines. Sensor-level relative power spectral density (PSD), dominant frequency (DF), and DF variability (DFV) features were extracted. Each feature was compared between PD and non-PD on averaged epochs and using FDA to model the epoch-to-epoch change of each feature. RESULTS: For averaged epochs, significantly higher theta relative PSD in PD was found across all datasets. Also, higher pre-alpha relative PSD was observed in PD patients in three of four datasets. For FDA, similar findings were achieved in theta, but all datasets showed consistently significant posterior pre-alpha differences across multiple epochs. CONCLUSIONS: Increased generalised theta, with posterior pre-alpha relative PSD, was the most reproducible finding in PD. SIGNIFICANCE: Rs-EEG theta and pre-alpha findings are generalisable in PD. FDA constitutes a reliable and powerful tool to analyse the rs-EEG epoch by epoch.
Subject(s)
Parkinson Disease, Humans, Parkinson Disease/diagnosis, Reproducibility of Results, Electroencephalography
ABSTRACT
Neuroimaging data analysis often requires purpose-built software, which can be challenging to install and may produce different results across computing environments. Beyond being a roadblock to neuroscientists, these issues of accessibility and portability can hamper the reproducibility of neuroimaging data analysis pipelines. Here, we introduce the Neurodesk platform, which harnesses software containers to support a comprehensive and growing suite of neuroimaging software (https://www.neurodesk.org/). Neurodesk includes a browser-accessible virtual desktop environment and a command line interface, mediating access to containerized neuroimaging software libraries on various computing platforms, including personal and high-performance computers, cloud computing and Jupyter Notebooks. This community-oriented, open-source platform enables a paradigm shift for neuroimaging data analysis, allowing for accessible, flexible, fully reproducible, and portable data analysis pipelines.