ABSTRACT
Alzheimer's disease (AD) and Parkinson's disease (PD) are influenced by genetic and environmental factors. Using data from UK Biobank, SAIL Biobank, and FinnGen, we conducted an unbiased, population-scale study to: 1) investigate how 155 endocrine, nutritional, metabolic, and digestive system disorders are associated with AD and PD risk prior to diagnosis, considering known genetic influences; 2) assess the specificity of plasma biomarkers for AD or PD in individuals with these conditions; and 3) develop a multi-modal classification model integrating genetics, proteomics, and clinical data relevant to conditions affecting the gut-brain axis. Our findings show that certain disorders preceding an AD or PD diagnosis elevate the risk of these diseases, among them insulin-dependent and non-insulin-dependent diabetes mellitus, noninfective gastroenteritis and colitis, functional intestinal disorders, and bacterial intestinal infections. Polygenic risk scores revealed lower genetic predisposition to AD and PD in individuals with co-occurring disorders in the study categories, underscoring the importance of regulating the gut-brain axis to potentially prevent or delay the onset of neurodegenerative diseases. The proteomic profile of AD/PD cases was influenced by comorbid endocrine, nutritional, metabolic, and digestive system conditions. Importantly, we developed multi-modal prediction models integrating clinical, genetic, proteomic, and demographic data; the combination performs better in disease classification than any single-paradigm approach. This work aims to illuminate the intricate interplay between the physiological factors involved in the gut-brain axis and the development of AD and PD, providing a multifactorial, systemic understanding that goes beyond traditional approaches.
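The polygenic risk scores mentioned in this abstract are, in their simplest form, a weighted sum of risk-allele dosages. The sketch below illustrates only that general idea. The variant IDs, the effect sizes, and the helper name `polygenic_risk_score` are hypothetical and are not taken from the study.

```python
def polygenic_risk_score(dosages, weights):
    """Weighted sum of risk-allele dosages (0, 1, or 2) over shared variants."""
    # Only variants present in both the genotype and the weight table count.
    shared = set(dosages) & set(weights)
    return sum(dosages[v] * weights[v] for v in shared)


# Hypothetical per-variant effect sizes (for example, log odds ratios).
weights = {"rsA": 0.5, "rsB": -0.25, "rsC": 0.125}
# Hypothetical genotype of one individual, coded as risk-allele counts.
dosages = {"rsA": 2, "rsB": 1, "rsC": 0}

score = polygenic_risk_score(dosages, weights)
print(score)
```

In practice, such raw scores are usually standardized against a reference population before groups are compared.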
ABSTRACT
While machine learning (ML) research has recently grown in popularity, its application in the omics domain is constrained by limited access to the sufficiently large, high-quality datasets needed to train ML models. Federated learning (FL) represents an opportunity to enable collaborative curation of such datasets among participating institutions. We compare the simulated performance of several models trained using FL against classically trained ML models on the task of multi-omics Parkinson's disease prediction. We find that FL model performance tracks that of centrally trained ML models: the most performant FL model achieves an AUC-PR of 0.876 ± 0.009, which is 0.014 ± 0.003 lower than its centrally trained counterpart. We also determine that the dispersion of samples within a federation plays a meaningful role in model performance. Our study implements several open-source FL frameworks and aims to highlight some of the challenges and opportunities when applying these collaborative methods in multi-omics studies.
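The abstract does not state which aggregation rule the FL frameworks used. A common choice is federated averaging (FedAvg), in which a server averages the clients' model parameters weighted by each client's local sample count. The sketch below assumes that rule. The function name `fed_avg` and the toy values are hypothetical.

```python
def fed_avg(client_params, client_sizes):
    """Sample-size-weighted average of per-client parameter vectors."""
    total = sum(client_sizes)
    n_params = len(client_params[0])
    return [
        sum(p[i] * n for p, n in zip(client_params, client_sizes)) / total
        for i in range(n_params)
    ]


# Two hypothetical sites with unequal sample counts (1 vs 3 samples).
params = [[1.0, 0.0], [3.0, 4.0]]
sizes = [1, 3]

global_params = fed_avg(params, sizes)
print(global_params)
```

Because the larger site dominates the average, the way samples are spread across sites directly shapes the aggregated model, which is consistent with the abstract's observation about sample dispersion within a federation.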
ABSTRACT
High-dimensional data analysis starts with projecting the data to low dimensions to visualize and understand the underlying data structure. Several methods have been developed for dimensionality reduction, but they are limited to cross-sectional datasets. The recently proposed Aligned-UMAP, an extension of the uniform manifold approximation and projection (UMAP) algorithm, can visualize high-dimensional longitudinal datasets. We demonstrated its utility for researchers to identify exciting patterns and trajectories within enormous datasets in biological sciences. We found that the algorithm parameters also play a crucial role and must be tuned carefully to utilize the algorithm's potential fully. We also discussed key points to remember and directions for future extensions of Aligned-UMAP. Further, we made our code open source to enhance the reproducibility and applicability of our work. We believe our benchmarking study becomes more important as more and more high-dimensional longitudinal data in biomedical research become available.
ABSTRACT
Overlapping symptoms and co-pathologies are common in closely related neurodegenerative diseases (NDDs). Investigating genetic risk variants across these NDDs can give further insight into disease manifestations. In this study, we leveraged genome-wide single nucleotide polymorphisms and genome-wide association study summary statistics to cluster patients based on their genetic status across identified risk variants for five NDDs (Alzheimer's disease, Parkinson's disease, amyotrophic lateral sclerosis, Lewy body dementia and frontotemporal dementia). The multi-disease and disease-specific clustering results presented here provide evidence that NDDs have more overlapping genetic aetiology than previously expected and that neurodegeneration should be viewed as a spectrum of symptomology. These clustering analyses also reveal potential subsets of patients with these diseases who are significantly depleted for any known common genetic risk factors, suggesting that environmental or other factors are at work. Given that NDDs with overlapping pathologies share genetic risk loci, future research into how these variants might differently affect downstream protein expression, pathology and NDD manifestation in general is important for refining and treating NDDs.
Subjects
Alzheimer Disease, Lewy Body Disease, Neurodegenerative Diseases, Parkinson Disease, Humans, Neurodegenerative Diseases/genetics, Genome-Wide Association Study, Parkinson Disease/genetics, Lewy Body Disease/genetics, Alzheimer Disease/genetics, Risk Factors
ABSTRACT
The clinical manifestations of Parkinson's disease (PD) are characterized by heterogeneity in age at onset, disease duration, rate of progression, and the constellation of motor versus non-motor features. There is an unmet need for the characterization of distinct disease subtypes as well as improved, individualized predictions of the disease course. We used unsupervised and supervised machine learning methods on comprehensive, longitudinal clinical data from the Parkinson's Progression Markers Initiative (n = 294 cases) to identify patient subtypes and to predict disease progression. The resulting models were validated in an independent, clinically well-characterized cohort from the Parkinson's Disease Biomarkers Program (n = 263 cases). Our analysis distinguished three disease subtypes with highly predictable progression rates, corresponding to slow, moderate, and fast disease progression. We achieved highly accurate projections of disease progression 5 years after initial diagnosis with an average area under the curve (AUC) of 0.92 (95% CI: 0.95 ± 0.01 for the slower-progressing group (PDvec1), 0.87 ± 0.03 for moderate progressors, and 0.95 ± 0.02 for the fast-progressing group (PDvec3)). We identified serum neurofilament light as a significant indicator of fast disease progression among other key biomarkers of interest. We replicated these findings in an independent cohort, released the analytical code, and developed models in an open science manner. Our data-driven study provides insights to deconstruct PD heterogeneity. This approach could have immediate implications for clinical trials by improving the detection of significant clinical outcomes. We anticipate that machine learning models will improve patient counseling, clinical trial design, and ultimately individualized patient care.
ABSTRACT
Personalized medicine promises individualized disease prediction and treatment. The convergence of machine learning (ML) and available multimodal data is key moving forward. We build upon previous work to deliver multimodal predictions of Parkinson's disease (PD) risk and systematically develop a model using GenoML, an automated ML package, to make improved multi-omic predictions of PD, validated in an external cohort. We investigated top features, constructed hypothesis-free disease-relevant networks, and investigated drug-gene interactions. We performed automated ML on multimodal data from the Parkinson's Progression Markers Initiative (PPMI). After selecting the best-performing algorithm, all PPMI data were used to tune the selected model. The model was validated in the Parkinson's Disease Biomarkers Program (PDBP) dataset. Our initial model showed an area under the curve (AUC) of 89.72% for the diagnosis of PD. The tuned model was then validated on external data (PDBP, AUC 85.03%). Optimizing thresholds for classification increased the diagnosis prediction accuracy and other metrics. Finally, networks were built to identify gene communities specific to PD. Combining data modalities outperforms the single-biomarker paradigm. The University of Pennsylvania Smell Identification Test (UPSIT) and the polygenic risk score (PRS) contributed most to the predictive power of the model, but their accuracy is supplemented by many smaller-effect transcripts and risk SNPs. Our model is best suited to identifying large groups of individuals to monitor within a health registry or biobank to prioritize for further testing. This approach allows complex predictive models to be reproducible and accessible to the community, with the package, code, and results publicly available.
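The abstract reports that optimizing classification thresholds improved accuracy but does not say which criterion was used. One common option is to choose the cutoff that maximizes Youden's J statistic (sensitivity + specificity - 1). The sketch below assumes that criterion. The helper `best_threshold` and the toy scores are hypothetical.

```python
def best_threshold(scores, labels):
    """Return the (cutoff, J) pair maximizing Youden's J on the given data."""
    positives = sum(labels)
    negatives = len(labels) - positives
    best_t, best_j = None, -1.0
    # Try each observed score as a cutoff; predict positive when score >= cutoff.
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < t and y == 0)
        j = tp / positives + tn / negatives - 1
        if j > best_j:  # ties keep the lowest cutoff
            best_t, best_j = t, j
    return best_t, best_j


# Hypothetical predicted probabilities and true labels (1 = case, 0 = control).
scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]

threshold, j = best_threshold(scores, labels)
print(threshold, j)
```

A threshold chosen this way should be selected on training or validation data and then applied unchanged to the external cohort, otherwise the reported metrics would be optimistic.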
ABSTRACT
BACKGROUND: Amyotrophic lateral sclerosis (ALS) is known to represent a collection of overlapping syndromes. Various classification systems based on empirical observations have been proposed, but it is unclear to what extent they reflect ALS population substructures. We aimed to use machine-learning techniques to identify the number and nature of ALS subtypes to obtain a better understanding of this heterogeneity, enhance our understanding of the disease, and improve clinical care. METHODS: In this retrospective study, we applied unsupervised (Uniform Manifold Approximation and Projection [UMAP]) modelling, semi-supervised (neural network UMAP) modelling, and supervised (ensemble learning based on LightGBM) modelling to a population-based discovery cohort of patients who were diagnosed with ALS while living in the Piedmont and Valle d'Aosta regions of Italy, for whom detailed clinical data, such as age at symptom onset, were available. We excluded patients with missing Revised ALS Functional Rating Scale (ALSFRS-R) feature values from the unsupervised and semi-supervised steps. We replicated our findings in an independent population-based cohort of patients who were diagnosed with ALS while living in the Emilia Romagna region of Italy. FINDINGS: Between Jan 1, 1995, and Dec 31, 2015, 2858 patients were entered in the discovery cohort. After excluding 497 (17%) patients with missing ALSFRS-R feature values, data for 42 clinical features across 2361 (83%) patients were available for the unsupervised and semi-supervised analysis. We found that semi-supervised machine learning produced the optimum clustering of the patients with ALS. These clusters roughly corresponded to the six clinical subtypes defined by the Chiò classification system (ie, bulbar, respiratory, flail arm, classical, pyramidal, and flail leg ALS). Between Jan 1, 2009, and March 1, 2018, 1097 patients were entered in the replication cohort.
After excluding 108 (10%) patients with missing ALSFRS-R feature values, data for 42 clinical features across 989 patients were available for the unsupervised and semi-supervised analysis. All 1097 patients were included in the supervised analysis. The same clusters were identified in the replication cohort. By contrast, other ALS classification schemes, such as the El Escorial categories, Milano-Torino clinical staging, and King's clinical stages, did not adequately label the clusters. Supervised learning identified 11 clinical parameters that predicted ALS clinical subtypes with high accuracy (area under the curve 0·982 [95% CI 0·980-0·983]). INTERPRETATION: Our data-driven study provides insight into the ALS population substructure and confirms that the Chiò classification system successfully identifies ALS subtypes. Additional validation is required to determine the accuracy and clinical use of these algorithms in assigning clinical subtypes. Nevertheless, our algorithms offer a broad insight into the clinical heterogeneity of ALS and help to determine the actual subtypes of disease that exist within this fatal neurodegenerative syndrome. The systematic identification of ALS subtypes will improve clinical care and clinical trial design. FUNDING: US National Institute on Aging, US National Institutes of Health, Italian Ministry of Health, European Commission, University of Torino Rita Levi Montalcini Department of Neurosciences, Emilia Romagna Regional Health Authority, and Italian Ministry of Education, University, and Research. TRANSLATIONS: For the Italian and German translations of the abstract see Supplementary Materials section.