ABSTRACT
Institutions in highly regulated domains such as finance and healthcare often have restrictive rules around data sharing. Federated learning is a distributed learning framework that enables multi-institutional collaborations on decentralized data with improved protection for each collaborator's data privacy. In this paper, we propose a communication-efficient scheme for decentralized federated learning called ProxyFL, or proxy-based federated learning. Each participant in ProxyFL maintains two models: a private model and a publicly shared proxy model designed to protect the participant's privacy. Proxy models allow efficient information exchange among participants without the need for a centralized server. The proposed method eliminates a significant limitation of canonical federated learning by allowing model heterogeneity; each participant can have a private model with any architecture. Furthermore, our protocol for communication by proxy leads to stronger privacy guarantees using differential privacy analysis. Experiments on popular image datasets, and on a cancer diagnostic problem using high-quality gigapixel histology whole slide images, show that ProxyFL can outperform existing alternatives with much less communication overhead and stronger privacy.
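As a rough illustration of how proxy models could be exchanged without a central server, the sketch below runs gossip-style averaging rounds on a ring of participants. The ring topology, the equal-weight averaging, the flat parameter lists, and the name `ring_exchange` are assumptions made for illustration only; the abstract does not specify the communication protocol, and the private models and the differential privacy mechanism are left out entirely.

```python
from typing import List


def ring_exchange(proxies: List[List[float]]) -> List[List[float]]:
    """Run one gossip round on a ring (hypothetical topology).

    Participant i receives the proxy parameters of participant i-1 and
    averages them with its own. No central server is involved.
    """
    n = len(proxies)
    return [
        [(own + received) / 2.0
         for own, received in zip(proxies[i], proxies[(i - 1) % n])]
        for i in range(n)
    ]


# Three participants, each holding a one-parameter proxy model (toy values).
proxies = [[0.0], [4.0], [8.0]]
for _ in range(20):
    proxies = ring_exchange(proxies)

# Each proxy keeps half of its own value and passes half on, so the mean
# across participants is preserved and all proxies drift toward it.
final_values = [p[0] for p in proxies]
```

In this toy setting every proxy approaches the initial mean of 4.0, while anything a participant keeps in a separate private model never leaves its owner.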
ABSTRACT
OBJECTIVE: To predict older adults' risk of avoidable hospitalisation related to ambulatory care sensitive conditions (ACSC) using machine learning applied to administrative health data of Ontario, Canada. DESIGN, SETTING AND PARTICIPANTS: A retrospective cohort study was conducted on a large cohort of all residents covered under a single-payer system in Ontario, Canada over a period of 10 years (2008-2017). The study included 1.85 million Ontario residents between 65 and 74 years old at any time throughout the study period. DATA SOURCES: Administrative health data from Ontario, Canada obtained from the ICES (formerly known as the Institute for Clinical Evaluative Sciences) Data Repository. MAIN OUTCOME MEASURES: Risk of hospitalisations due to ACSCs 1 year after the observation period. RESULTS: The study used a total of 1 854 116 patients, split into train, validation and test sets. The ACSC incidence rates among the data points were 1.1% for all sets. The final XGBoost model achieved an area under the receiver operating characteristic curve of 80.5% and an area under the precision-recall curve of 0.093 on the test set, and the predictions were well calibrated, including in key subgroups. When ranking the model predictions, those in the top 5% of risk as predicted by the model captured 37.4% of those who presented with an ACSC-related hospitalisation. A variety of features such as the previous number of ambulatory care visits, presence of ACSC-related hospitalisations during the observation window, age, rural residence and prescription of certain medications were contributors to the prediction. Our model was also able to capture the geospatial heterogeneity of ACSC risk in Ontario, and especially the elevated risk in rural and marginalised regions. CONCLUSIONS: This study aimed to predict the 1-year risk of hospitalisation from ambulatory-care sensitive conditions in seniors aged 65-74 years old with a single, large-scale machine learning model. 
The model shows the potential to inform population health planning and interventions to reduce the burden of ACSC-related hospitalisations.
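The "top 5% of risk captured 37.4%" figure in the abstract above is a capture rate: the share of all observed events that fall among the highest-ranked predictions. A minimal sketch of that calculation follows; the function name `capture_rate` and the toy scores and labels are hypothetical and are not taken from the study.

```python
from typing import Sequence


def capture_rate(scores: Sequence[float], labels: Sequence[int],
                 top_fraction: float) -> float:
    """Share of all positive outcomes found among the top-scoring fraction."""
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("capture rate is undefined without positive outcomes")
    # Number of individuals in the top group (at least one).
    n_top = max(1, int(round(len(scores) * top_fraction)))
    # Indices sorted from highest to lowest predicted risk.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    captured = sum(labels[i] for i in ranked[:n_top])
    return captured / total_positives


# Toy data: 10 individuals, 2 of whom had the outcome.
scores = [0.9, 0.8, 0.1, 0.2, 0.7, 0.05, 0.3, 0.4, 0.6, 0.5]
labels = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]

top20 = capture_rate(scores, labels, 0.20)  # top 2 scores hold 1 of 2 events
top30 = capture_rate(scores, labels, 0.30)  # top 3 scores hold both events
```

With an incidence near 1%, as reported above, this metric is often easier to interpret than overall accuracy, because it states directly how many events a targeted intervention on the top-ranked group could reach.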
Subjects
Ambulatory Care Sensitive Conditions , Population Health , Aged , Cohort Studies , Hospitalization , Humans , Machine Learning , Ontario/epidemiology , Retrospective Studies
ABSTRACT
BACKGROUND: The COVID-19 pandemic has led to an increased demand for health care resources and, in some cases, shortage of medical equipment and staff. Our objective was to develop and validate a multivariable model to predict risk of hospitalization for patients infected with SARS-CoV-2. METHODS: We used routinely collected health records in a patient cohort to develop and validate our prediction model. This cohort included adult patients (age ≥ 18 yr) from Ontario, Canada, who tested positive for SARS-CoV-2 ribonucleic acid by polymerase chain reaction between Feb. 2 and Oct. 5, 2020, and were followed up through Nov. 5, 2020. Patients living in long-term care facilities were excluded, as they were all assumed to be at high risk of hospitalization for COVID-19. Risk of hospitalization within 30 days of diagnosis of SARS-CoV-2 infection was estimated via gradient-boosting decision trees, and variable importance examined via Shapley values. We built a gradient-boosting model using the Extreme Gradient Boosting (XGBoost) algorithm and compared its performance against 4 empirical rules commonly used for risk stratification based on age and number of comorbidities. RESULTS: The cohort included 36 323 patients with 2583 hospitalizations (7.1%). Hospitalized patients had a higher median age (64 yr v. 43 yr), were more likely to be male (56.3% v. 47.3%) and had a higher median number of comorbidities (3, interquartile range [IQR] 2-6 v. 1, IQR 0-3) than nonhospitalized patients. Patients were split into development (n = 29 058, 80.0%) and held-out validation (n = 7265, 20.0%) cohorts. The gradient-boosting model achieved high discrimination (development cohort: area under the receiver operating characteristic curve across the 5 folds of 0.852; validation cohort: 0.8475) and strong calibration (slope = 1.01, intercept = -0.01). Patients scoring in the top 10% of predicted risk accounted for 47.4% of hospitalizations, and those in the top 30% accounted for 80.6%. 
INTERPRETATION: We developed and validated an accurate risk stratification model using routinely collected health administrative data. We envision that modelling such risk stratification based on routinely collected health data could support management of COVID-19 on a population health level.
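Discrimination in the abstract above is reported as the area under the receiver operating characteristic curve (AUC). One way to read that number is as the probability that a randomly chosen hospitalized patient receives a higher score than a randomly chosen nonhospitalized patient. The sketch below computes it by direct pairwise comparison; the function name `roc_auc` and the toy data are hypothetical, and the quadratic loop is meant for illustration, not for cohorts of the size described here.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example: 3 of the 4 positive/negative pairs are ordered correctly.
auc = roc_auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])  # 0.75
```

The function returns a value between 0 and 1; some abstracts in this listing report the same quantity on a percent scale.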
Subjects
COVID-19/epidemiology , Decision Trees , Hospitalization/statistics & numerical data , Risk Assessment , Adult , Aged , COVID-19/therapy , Female , Humans , Male , Middle Aged , Models, Statistical , Ontario/epidemiology , Risk Assessment/methods , Risk Factors
ABSTRACT
Importance: Systems-level barriers to diabetes care could be improved with population health planning tools that accurately discriminate between high- and low-risk groups to guide investments and targeted interventions. Objective: To develop and validate a population-level machine learning model for predicting type 2 diabetes 5 years before diabetes onset using administrative health data. Design, Setting, and Participants: This decision analytical model study used linked administrative health data from the diverse, single-payer health system in Ontario, Canada, between January 1, 2006, and December 31, 2016. A gradient boosting decision tree model was trained on data from 1 657 395 patients, validated on 243 442 patients, and tested on 236 506 patients. Costs associated with each patient were estimated using a validated costing algorithm. Data were analyzed from January 1, 2006, to December 31, 2016. Exposures: A random sample of 2 137 343 residents of Ontario without type 2 diabetes was obtained at study start time. More than 300 features from data sets capturing demographic information, laboratory measurements, drug benefits, health care system interactions, social determinants of health, and ambulatory care and hospitalization records were compiled over 2-year patient medical histories to generate quarterly predictions. Main Outcomes and Measures: Discrimination was assessed using the area under the receiver operating characteristic curve statistic, and calibration was assessed visually using calibration plots. Feature contribution was assessed with Shapley values. Costs were estimated in 2020 US dollars. Results: This study trained a gradient boosting decision tree model on data from 1 657 395 patients (12 900 257 instances; 6 666 662 women [51.7%]). 
The developed model achieved a test area under the curve of 80.26 (range, 80.21-80.29), demonstrated good calibration, and was robust to sex, immigration status, area-level marginalization with regard to material deprivation and race/ethnicity, and low contact with the health care system. The top 5% of patients predicted as high risk by the model represented 26% of the total annual diabetes cost in Ontario. Conclusions and Relevance: In this decision analytical model study, a machine learning model approach accurately predicted the incidence of diabetes in the population using routinely collected health administrative data. These results suggest that the model could be used to inform decision-making for population health planning and diabetes prevention.
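The abstract above assesses feature contribution with Shapley values. As a hedged illustration of the underlying definition, the sketch below computes exact Shapley values for a toy value function by averaging each feature's marginal contribution over all orderings. The feature names, the value function `toy_value`, and the helper `shapley_values` are hypothetical; exhaustive enumeration is only feasible for a handful of features, so models with hundreds of features rely on specialised approximations or tree-specific algorithms instead.

```python
from itertools import permutations
from typing import Callable, Dict, FrozenSet, Sequence


def shapley_values(features: Sequence[str],
                   value: Callable[[FrozenSet[str]], float]) -> Dict[str, float]:
    """Exact Shapley values by enumerating every ordering of the features."""
    totals = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for ordering in orderings:
        present: FrozenSet[str] = frozenset()
        for f in ordering:
            with_f = present | {f}
            # Marginal contribution of f given the features already present.
            totals[f] += value(with_f) - value(present)
            present = with_f
    return {f: total / len(orderings) for f, total in totals.items()}


def toy_value(subset: FrozenSet[str]) -> float:
    """Hypothetical risk score: two main effects plus one interaction."""
    score = 0.0
    if "age" in subset:
        score += 2.0
    if "visits" in subset:
        score += 1.0
    if "age" in subset and "visits" in subset:
        score += 1.0  # interaction term, split equally by Shapley values
    return score


phi = shapley_values(["age", "visits"], toy_value)
# phi["age"] is 2.5 and phi["visits"] is 1.5; together they sum to
# toy_value({"age", "visits"}) - toy_value({}) = 4.0.
```

The contributions always add up to the difference between the full prediction and the baseline, which is what makes them usable as a per-feature breakdown of a single prediction.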
Subjects
Age of Onset , Algorithms , Decision Making, Computer-Assisted , Diabetes Mellitus, Type 2/diagnosis , Diabetes Mellitus, Type 2/physiopathology , Forecasting/methods , Machine Learning , Risk Assessment/methods , Adolescent , Adult , Aged , Aged, 80 and over , Child , Cohort Studies , Diabetes Mellitus, Type 2/epidemiology , Electronic Health Records/statistics & numerical data , Female , Humans , Incidence , Male , Middle Aged , Ontario/epidemiology , Retrospective Studies , Young Adult
ABSTRACT
Across jurisdictions, governments and health insurance providers hold a large amount of data from patient interactions with the healthcare system. We aimed to develop a machine learning-based model for predicting adverse outcomes due to diabetes complications using administrative health data from the single-payer health system in Ontario, Canada. A Gradient Boosting Decision Tree model was trained on data from 1,029,366 patients, validated on 272,864 patients, and tested on 265,406 patients. Discrimination was assessed using the AUC statistic, and calibration was assessed visually using calibration plots, both overall and across population subgroups. Our model predicting three-year risk of adverse outcomes due to diabetes complications (hyper/hypoglycemia, tissue infection, retinopathy, cardiovascular events, amputation) included 700 features from multiple diverse data sources and had strong discrimination (average test AUC = 77.7, range 77.7-77.9). Through the design and validation of a high-performance model to predict adverse outcomes from diabetes complications at the population level, we demonstrate the potential of machine learning and administrative health data to inform health planning and healthcare resource allocation for diabetes management.
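Calibration plots of the kind mentioned above compare predicted risk with observed outcome rates within groups of similar predictions. The sketch below computes the points of such a plot using equal-sized bins; the binning scheme, the function name `calibration_bins`, and the toy data are assumptions for illustration, since the abstract does not describe how the plots were constructed.

```python
from typing import List, Sequence, Tuple


def calibration_bins(probs: Sequence[float], outcomes: Sequence[int],
                     n_bins: int = 10) -> List[Tuple[float, float]]:
    """Return (mean predicted risk, observed event rate) for each bin.

    Individuals are sorted by predicted risk and split into n_bins groups
    of (nearly) equal size; empty groups are skipped.
    """
    order = sorted(range(len(probs)), key=lambda i: probs[i])
    points = []
    for b in range(n_bins):
        start = b * len(order) // n_bins
        stop = (b + 1) * len(order) // n_bins
        members = order[start:stop]
        if not members:
            continue
        mean_predicted = sum(probs[i] for i in members) / len(members)
        observed_rate = sum(outcomes[i] for i in members) / len(members)
        points.append((mean_predicted, observed_rate))
    return points


# Toy data: a low-risk group and a high-risk group of four people each.
probs = [0.1, 0.1, 0.1, 0.1, 0.9, 0.9, 0.9, 0.9]
outcomes = [0, 0, 0, 1, 1, 1, 1, 0]
points = calibration_bins(probs, outcomes, n_bins=2)
# Low-risk bin: predicted about 0.10, observed 0.25.
# High-risk bin: predicted about 0.90, observed 0.75.
```

A well-calibrated model produces points close to the diagonal where predicted and observed values are equal. Running the same function on each subgroup separately gives the subgroup-level view described in the abstract.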