2.
Article in English | MEDLINE | ID: mdl-38470976

ABSTRACT

BACKGROUND: Estimating the risk of revision after arthroplasty could inform patient and surgeon decision-making. However, there is a lack of well-performing prediction models assisting in this task, which may be due to current conventional modeling approaches such as traditional survivorship estimators (such as Kaplan-Meier) or competing risk estimators. Recent advances in machine learning survival analysis might improve decision support tools in this setting. Therefore, this study aimed to assess the performance of machine learning compared with that of conventional modeling to predict revision after arthroplasty. QUESTION/PURPOSE: Does machine learning perform better than traditional regression models for estimating the risk of revision for patients undergoing hip or knee arthroplasty? METHODS: Eleven datasets from published studies from the Dutch Arthroplasty Register reporting on factors associated with revision or survival after partial or total knee and hip arthroplasty between 2018 and 2022 were included in our study. The 11 datasets were observational registry studies, with a sample size ranging from 3038 to 218,214 procedures. We developed a set of time-to-event models for each dataset, leading to 11 comparisons. A set of predictors (factors associated with revision surgery) was identified based on the variables that were selected in the included studies. We assessed the predictive performance of two state-of-the-art statistical time-to-event models for 1-, 2-, and 3-year follow-up: a Fine and Gray model (which models the cumulative incidence of revision) and a cause-specific Cox model (which models the hazard of revision). These were compared with a machine-learning approach (a random survival forest model, which is a decision tree-based machine-learning algorithm for time-to-event analysis). 
Performance was assessed according to discriminative ability (time-dependent area under the receiver operating characteristic curve), calibration (slope and intercept), and overall prediction error (scaled Brier score). Discrimination, quantified as the area under the receiver operating characteristic curve, measures the model's ability to distinguish patients who achieved the outcomes from those who did not and ranges from 0.5 to 1.0, with 1.0 indicating the highest discrimination score and 0.5 the lowest. Calibration plots the predicted versus the observed probabilities; a perfect plot has an intercept of 0 and a slope of 1. The Brier score calculates a composite of discrimination and calibration, with 0 indicating perfect prediction and 1 the poorest. A scaled version of the Brier score, 1 - (model Brier score/null model Brier score), can be interpreted as the amount of overall prediction error. RESULTS: Using machine learning survivorship analysis, we found no differences between the competing risks estimator and traditional regression models for patients undergoing arthroplasty in terms of discriminative ability (patients who received a revision compared with those who did not). We found no consistent differences between the validated performance (time-dependent area under the receiver operating characteristic curve) of different modeling approaches because the differences ranged between -0.04 and 0.03 across the 11 datasets (the time-dependent area under the receiver operating characteristic curve of the models across 11 datasets ranged from 0.52 to 0.68). In addition, the calibration metrics and scaled Brier scores produced comparable estimates, showing no advantage of machine learning over traditional regression models. CONCLUSION: Machine learning did not outperform traditional regression models. 
CLINICAL RELEVANCE: Neither machine learning modeling nor traditional regression methods were sufficiently accurate to offer prognostic information when predicting revision arthroplasty. The benefit of these modeling approaches may be limited in this context.
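
The Brier score and its scaled version described in the abstract above can be sketched in a few lines of Python. This is a simplified illustration only: it assumes binary outcomes at a fixed time horizon with no censoring, whereas the study used time-to-event versions of these metrics. It also assumes the null model predicts the observed event rate for every patient. The function names and the example data are hypothetical.

```python
def brier_score(outcomes, predictions):
    """Mean squared difference between predicted risks (0..1) and outcomes (0 or 1)."""
    if not outcomes or len(outcomes) != len(predictions):
        raise ValueError("outcomes and predictions must be non-empty and equal in length")
    return sum((p - y) ** 2 for y, p in zip(outcomes, predictions)) / len(outcomes)


def scaled_brier_score(outcomes, predictions):
    """1 - (model Brier score / null model Brier score)."""
    model_score = brier_score(outcomes, predictions)
    # Assumption: the null model assigns the observed event rate to everyone.
    event_rate = sum(outcomes) / len(outcomes)
    null_score = brier_score(outcomes, [event_rate] * len(outcomes))
    if null_score == 0:
        raise ValueError("undefined when all outcomes are identical")
    return 1 - model_score / null_score


# Hypothetical data: 1 = revised within the horizon, 0 = not revised.
observed = [0, 0, 1, 1]
predicted = [0.1, 0.2, 0.8, 0.9]
print(brier_score(observed, predicted))
print(scaled_brier_score(observed, predicted))
```

Under this formula a scaled score of 1 corresponds to perfect predictions and a score of 0 to a model that does no better than the null model.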

3.
NPJ Digit Med ; 7(1): 58, 2024 Mar 06.
Article in English | MEDLINE | ID: mdl-38448743

ABSTRACT

Although artificial intelligence (AI) technology is progressing at an unprecedented rate, our ability to translate these advancements into clinical value and adoption at the bedside remains comparatively limited. This paper reviews the current use of implementation outcomes in randomized controlled trials evaluating AI-based clinical decision support and finds limited adoption. To advance trust and clinical adoption of AI, there is a need to bridge the gap between traditional quantitative metrics and implementation outcomes to better grasp the reasons behind the success or failure of AI systems and improve their translation into clinical value.

4.
Bone Jt Open ; 5(1): 9-19, 2024 Jan 16.
Article in English | MEDLINE | ID: mdl-38226447

ABSTRACT

Aims: Machine-learning (ML) prediction models in orthopaedic trauma hold great promise in assisting clinicians in various tasks, such as personalized risk stratification. However, an overview of current applications and a critical appraisal against peer-reviewed guidelines are lacking. The objectives of this study are to 1) provide an overview of current ML prediction models in orthopaedic trauma; 2) evaluate the completeness of reporting following the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) statement; and 3) assess the risk of bias following the Prediction model Risk Of Bias Assessment Tool (PROBAST). Methods: A systematic search screening 3,252 studies identified 45 ML-based prediction models in orthopaedic trauma up to January 2023. Transparent reporting was assessed with the TRIPOD statement and risk of bias with PROBAST. Results: A total of 40 studies reported on training and internal validation; four studies performed both development and external validation, and one study performed only external validation. The most commonly reported outcomes were mortality (33%, 15/45) and length of hospital stay (9%, 4/45), and the majority of prediction models were developed in the hip fracture population (60%, 27/45). The overall median completeness for the TRIPOD statement was 62% (interquartile range 30 to 81%). The overall risk of bias according to PROBAST was low in 24% (11/45), high in 69% (31/45), and unclear in 7% (3/45) of the studies. High risk of bias was mainly due to analysis domain concerns including small datasets with a low number of outcomes, complete-case analysis in case of missing data, and no reporting of performance measures. Conclusion: The results of this study showed that despite a myriad of potential clinically useful applications, a substantial proportion of ML studies in orthopaedic trauma lack transparent reporting and are at high risk of bias. 
These problems must be resolved by following established guidelines to instil confidence in ML models among patients and clinicians. Otherwise, there will remain a sizeable gap between the development of ML prediction models and their clinical application in our day-to-day orthopaedic trauma practice.

5.
OTA Int ; 6(5 Suppl): e283, 2023 Dec.
Article in English | MEDLINE | ID: mdl-38152438

ABSTRACT

Objectives: With more than 300,000 patients per year in the United States alone, hip fractures are one of the most common injuries occurring in the elderly. The incidence is predicted to rise to 6 million cases per annum worldwide by 2050. Many fracture registries have been established, serving as tools for quality surveillance and evaluating patient outcomes. Most registries are based on billing and procedural codes, prone to under-reporting of cases. Deep learning (DL) is able to interpret radiographic images and assist in fracture detection; we propose to conduct a DL-based approach intended to autocreate a fracture registry, specifically for the hip fracture population. Methods: Conventional radiographs (n = 18,834) from 2919 patients from Massachusetts General Brigham hospitals were extracted (images designated as hip radiographs within the medical record). We designed a cascade model consisting of 3 submodules for image view classification (MI), postoperative implant detection (MII), and proximal femoral fracture detection (MIII), including data augmentation and scaling, and convolutional neural networks for model development. An ensemble model of 10 models (based on ResNet, VGG, DenseNet, and EfficientNet architectures) was created to detect the presence of a fracture. Results: The accuracy of the developed submodules reached 92%-100%; visual explanations of model predictions were generated through gradient-based methods. Time for the automated model-based fracture-labeling was 0.03 seconds/image, compared with an average of 12 seconds/image for human annotation as calculated in our preprocessing stages. Conclusion: This semisupervised DL approach labeled hip fractures with high accuracy. This mitigates the burden of annotations in a large data set, which is time-consuming and prone to under-reporting. The DL approach may prove beneficial for future efforts to autocreate construct registries that outperform current diagnosis and procedural codes. 
Clinicians and researchers can use the developed DL approach for quality improvement, diagnostic and prognostic research purposes, and building clinical decision support tools.

6.
BMJ Open ; 13(10): e074700, 2023 10 18.
Article in English | MEDLINE | ID: mdl-37852772

ABSTRACT

INTRODUCTION: Despite technological advancements in recent years, glenoid component loosening remains a common complication after anatomical total shoulder arthroplasty (ATSA) and is one of the main causes of revision surgery. Increasing emphasis is placed on the prevention of glenoid component failure. Previous studies have successfully predicted range of motion, patient-reported outcomes and short-term complications after ATSA using machine learning methods, but an accurate predictive model for (glenoid component) revision is currently lacking. This study aims to use a large international database to accurately predict aseptic loosening of the glenoid component after ATSA using machine learning algorithms. METHODS AND ANALYSIS: For this multicentre, retrospective study, individual patient data will be compiled from previously published studies reporting revision of ATSA. A systematic literature search will be performed in Medline (PubMed) identifying all studies reporting outcomes of ATSA. Authors will be contacted and invited to participate in the Machine Learning Consortium by sharing their anonymised databases. All databases reporting revisions after ATSA will be included, and individual patients with a follow-up less than 2 years or a fracture as the indication for ATSA will be excluded. First, features (predictive variables) will be identified using a random forest feature selection. The resulting features from the compiled database will be used to train various machine learning algorithms (stochastic gradient boosting, random forest, support vector machine, neural network and elastic-net penalised logistic regression). The developed and validated algorithms will be evaluated across discrimination (c-statistic), calibration, the Brier score and the decision curve analysis. The best-performing algorithm will be used to create an open-access online prediction tool. ETHICS AND DISSEMINATION: Data will be collected adhering to the WHO regulation on data sharing. 
An Institutional Review Board review is not applicable. The study results will be published in a peer-reviewed journal.


Subject(s)
Arthroplasty, Replacement, Shoulder , Humans , Arthroplasty, Replacement, Shoulder/adverse effects , Retrospective Studies , Scapula , Machine Learning , Probability , Treatment Outcome , Multicenter Studies as Topic
8.
JAMIA Open ; 6(2): ooad033, 2023 Jul.
Article in English | MEDLINE | ID: mdl-37266187

ABSTRACT

Objective: When correcting for the "class imbalance" problem in medical data, the effects of resampling applied on classifier algorithms remain unclear. We examined the effect on performance over several combinations of classifiers and resampling ratios. Materials and Methods: Multiple classification algorithms were trained on 7 resampled datasets: no correction, random undersampling, 4 ratios of Synthetic Minority Oversampling Technique (SMOTE), and random oversampling with the Adaptive Synthetic algorithm (ADASYN). Performance was evaluated in Area Under the Curve (AUC), precision, recall, Brier score, and calibration metrics. A case study on prediction modeling for 30-day unplanned readmissions in previously admitted Urology patients was presented. Results: For most algorithms, using resampled data showed a significant increase in AUC and precision, ranging from 0.74 (CI: 0.69-0.79) to 0.93 (CI: 0.92-0.94), and 0.35 (CI: 0.12-0.58) to 0.86 (CI: 0.81-0.92) respectively. All classification algorithms showed significant increases in recall, and significant decreases in Brier score with distorted calibration overestimating positives. Discussion: Imbalance correction resulted in an overall improved performance, yet poorly calibrated models. There can still be clinical utility due to a strong discriminating performance, specifically when predicting only low and high risk cases is clinically more relevant. Conclusion: Resampling data resulted in increased performances in classification algorithms, yet produced an overestimation of positive predictions. Based on the findings from our case study, a thoughtful predefinition of the clinical prediction task may guide the use of resampling techniques in future studies aiming to improve clinical decision support tools.
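
The two simplest imbalance corrections named in the abstract above, random undersampling and random oversampling, can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not the authors' pipeline: it assumes binary labels coded 0/1, balances the classes to a 1:1 ratio, and does not implement SMOTE or ADASYN, which generate synthetic samples rather than copying existing ones. All function names are hypothetical.

```python
import random


def _minority_majority(labels):
    """Return (minority, majority) index lists for 0/1 labels."""
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    if len(positives) <= len(negatives):
        return positives, negatives
    return negatives, positives


def random_undersample(rows, labels, seed=0):
    """Randomly drop majority-class rows until both classes have equal size."""
    rng = random.Random(seed)
    minority, majority = _minority_majority(labels)
    keep = sorted(minority + rng.sample(majority, len(minority)))
    return [rows[i] for i in keep], [labels[i] for i in keep]


def random_oversample(rows, labels, seed=0):
    """Randomly duplicate minority-class rows until both classes have equal size."""
    rng = random.Random(seed)
    minority, majority = _minority_majority(labels)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    keep = list(range(len(labels))) + extra
    return [rows[i] for i in keep], [labels[i] for i in keep]


# Hypothetical data: one readmission (1) among six admissions.
rows = ["a", "b", "c", "d", "e", "f"]
labels = [1, 0, 0, 0, 0, 0]
print(random_undersample(rows, labels))
print(random_oversample(rows, labels))
```

Both functions change the apparent event rate of the data, which is consistent with the abstract's observation that resampled models overestimated positives. In a sketch like this, resampling would normally be applied to the training data only.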

9.
Injury ; 54(7): 110757, 2023 Jul.
Article in English | MEDLINE | ID: mdl-37164900

ABSTRACT

PURPOSE: The effects of clockwise torque on proximal femoral fracture fixation have been the subject of ongoing debate: fixated right-sided trochanteric fractures seem more rotationally stable than left-sided fractures in the biomechanical setting, but this theoretical advantage has not been demonstrated in the clinical setting to date. The purpose of this study was to identify a difference in early reoperation rate between patients undergoing surgery for left- versus right-sided proximal femur fractures using cephalomedullary nailing (CMN). MATERIALS AND METHODS: The American College of Surgeons National Surgical Quality Improvement Program was queried from 2016 to 2019 to identify patients aged 50 years and older undergoing CMN for a proximal femoral fracture. The primary outcome was any unplanned reoperation within 30 days following surgery. The difference was calculated using a Chi-square test, and observed power was calculated using post-hoc power analysis. RESULTS: In total, of 20,122 patients undergoing CMN for proximal femoral fracture management, 1.8% (n=371) had to undergo an unplanned reoperation within 30 days after surgery. Overall, 208 (2.0%) were left-sided and 163 (1.7%) right-sided fractures (p=0.052; risk ratio [RR] 1.22, 95% confidence interval [CI] 1.00-1.50; odds ratio [OR] 1.23, 95% CI 1.00-1.51; power 49.2%, α=0.05). CONCLUSION: This study shows a higher risk of reoperation for left-sided compared to right-sided proximal femur fractures after CMN in a large sample size. Although results may be underpowered and statistically insignificant, this finding might substantiate the hypothesis that clockwise rotation during implant insertion and (postoperative) weightbearing may lead to higher reoperation rates. LEVEL OF EVIDENCE: Therapeutic level II.
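
For readers unfamiliar with the effect measures reported above, the following Python sketch shows one common way to compute a risk ratio and an odds ratio with 95% confidence intervals from a 2x2 table, using the standard error on the log scale. The abstract does not state which interval method the authors used, so this choice is an assumption, and the counts in the example are hypothetical because the abstract does not report the number of left- and right-sided fractures.

```python
import math


def _with_ci(estimate, se_log, z=1.96):
    """Return (estimate, lower, upper) using a normal interval on the log scale."""
    log_est = math.log(estimate)
    return estimate, math.exp(log_est - z * se_log), math.exp(log_est + z * se_log)


def risk_ratio(events_a, total_a, events_b, total_b):
    """Risk in group A divided by risk in group B, with an approximate 95% CI."""
    rr = (events_a / total_a) / (events_b / total_b)
    se_log = math.sqrt(1 / events_a - 1 / total_a + 1 / events_b - 1 / total_b)
    return _with_ci(rr, se_log)


def odds_ratio(events_a, total_a, events_b, total_b):
    """Odds in group A divided by odds in group B, with an approximate 95% CI."""
    a, b = events_a, total_a - events_a
    c, d = events_b, total_b - events_b
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return _with_ci((a * d) / (b * c), se_log)


# Hypothetical counts: 20 reoperations in 1,000 patients vs 10 in 1,000.
print(risk_ratio(20, 1000, 10, 1000))
print(odds_ratio(20, 1000, 10, 1000))
```

When the outcome is rare, as with a 30-day reoperation rate of about 2%, the risk ratio and odds ratio are close to each other, which matches the similar RR and OR values in the abstract.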


Subject(s)
Femoral Fractures , Fracture Fixation, Intramedullary , Hip Fractures , Proximal Femoral Fractures , Humans , Middle Aged , Aged , Reoperation , Torque , Bone Nails , Treatment Outcome , Femoral Fractures/surgery , Hip Fractures/surgery , Femur , Retrospective Studies
10.
Bone Jt Open ; 4(3): 168-181, 2023 Mar 14.
Article in English | MEDLINE | ID: mdl-37051847

ABSTRACT

This study aimed to develop prediction models using machine-learning (ML) algorithms for 90-day and one-year mortality prediction in femoral neck fracture (FNF) patients aged 50 years or older based on the Hip fracture Evaluation with Alternatives of Total Hip arthroplasty versus Hemiarthroplasty (HEALTH) and Fixation using Alternative Implants for the Treatment of Hip fractures (FAITH) trials. This study included 2,388 patients from the HEALTH and FAITH trials, with 90-day and one-year mortality proportions of 3.0% (71/2,388) and 6.4% (153/2,388), respectively. The mean age was 75.9 years (SD 10.8) and 65.9% of patients (1,574/2,388) were female. The algorithms included patient and injury characteristics. Six algorithms were developed, internally validated and evaluated across discrimination (c-statistic; discriminative ability between those with risk of mortality and those without), calibration (observed outcome compared to the predicted probability), and the Brier score (composite of discrimination and calibration). The developed algorithms distinguished between patients at high and low risk for 90-day and one-year mortality. The penalized logistic regression algorithm had the best performance metrics for both 90-day (c-statistic 0.80, calibration slope 0.95, calibration intercept -0.06, and Brier score 0.039) and one-year (c-statistic 0.76, calibration slope 0.86, calibration intercept -0.20, and Brier score 0.074) mortality prediction in the hold-out set. Using high-quality data, the ML-based prediction models accurately predicted 90-day and one-year mortality in patients aged 50 years or older with an FNF. The final models must be externally validated to assess generalizability to other populations, and prospectively evaluated in the process of shared decision-making.

11.
Eur J Trauma Emerg Surg ; 49(3): 1545-1553, 2023 Jun.
Article in English | MEDLINE | ID: mdl-36757419

ABSTRACT

PURPOSE: Mortality prediction in elderly femoral neck fracture patients is valuable in treatment decision-making. A previously developed and internally validated clinical prediction model shows promise in identifying patients at risk of 90-day and 2-year mortality. Validation in an independent cohort is required to assess the generalizability; especially in geographically distinct regions. Therefore we questioned, is the SORG Orthopaedic Research Group (SORG) femoral neck fracture mortality algorithm externally valid in an Israeli cohort to predict 90-day and 2-year mortality? METHODS: We previously developed a prediction model in 2022 for estimating the risk of mortality in femoral neck fracture patients using a multicenter institutional cohort of 2,478 patients from the USA. The model included the following input variables that are available on clinical admission: age, male gender, creatinine level, absolute neutrophil, hemoglobin level, international normalized ratio (INR), congestive heart failure (CHF), displaced fracture, hemiplegia, chronic obstructive pulmonary disease (COPD), history of cerebrovascular accident (CVA) and beta-blocker use. To assess the generalizability, we used an intercontinental institutional cohort from the Sheba Medical Center in Israel (level I trauma center), queried between June 2008 and February 2022. Generalizability of the model was assessed using discrimination, calibration, Brier score, and decision curve analysis. RESULTS: The validation cohort included 2,033 patients, aged 65 years or above, that underwent femoral neck fracture surgery. Most patients were female 64.8% (n = 1317), the median age was 81 years (interquartile range = 75-86), and 80.4% (n = 1635) patients sustained a displaced fracture (Garden III/IV). The 90-day mortality was 9.4% (n = 190) and 2-year mortality was 30.0% (n = 610). 
Despite numerous baseline differences, the model performed acceptably in the validation cohort on discrimination (c-statistic 0.67 for 90-day, 0.67 for 2-year), calibration, Brier score, and decision curve analysis. CONCLUSIONS: The previously developed SORG femoral neck fracture mortality algorithm demonstrated good performance in an independent intercontinental population. The current iteration should not be relied on for patient care, although it suggests potential utility in assessing patients at low risk for 90-day or 2-year mortality. Further studies should evaluate this tool in a prospective setting and evaluate its feasibility and efficacy in clinical practice. The algorithm can be freely accessed: https://sorg-apps.shinyapps.io/hipfracturemortality/ . LEVEL OF EVIDENCE: Level III, Prognostic study.


Subject(s)
Femoral Neck Fractures , Models, Statistical , Aged , Humans , Male , Female , Aged, 80 and over , Prognosis , Israel/epidemiology , Prospective Studies , Femoral Neck Fractures/surgery , Retrospective Studies
12.
Bone Joint J ; 105-B(1): 56-63, 2023 01.
Article in English | MEDLINE | ID: mdl-36587260

ABSTRACT

AIMS: This study aimed to answer the following questions: do 3D-printed models lead to a more accurate recognition of the pattern of complex fractures of the elbow?; do 3D-printed models lead to a more reliable recognition of the pattern of these injuries?; and do junior surgeons benefit more from 3D-printed models than senior surgeons? METHODS: A total of 15 orthopaedic trauma surgeons (seven juniors, eight seniors) evaluated 20 complex elbow fractures for their overall pattern (i.e. varus posterior medial rotational injury, terrible triad injury, radial head fracture with posterolateral dislocation, anterior (trans-)olecranon fracture-dislocation, posterior (trans-)olecranon fracture-dislocation) and their specific characteristics. First, fractures were assessed based on radiographs and 2D and 3D CT scans; and in a subsequent round, one month later, with additional 3D-printed models. Diagnostic accuracy (acc) and inter-surgeon reliability (κ) were determined for each assessment. RESULTS: Accuracy significantly improved with 3D-printed models for the whole group on pattern recognition (acc2D/3D = 0.62 vs acc3Dprint= 0.69; Δacc = 0.07 (95% confidence interval (CI) 0.00 to 0.14); p = 0.025). A significant improvement was also seen in reliability for pattern recognition with the additional 3D-printed models (κ2D/3D = 0.41 (moderate) vs κ3Dprint = 0.59 (moderate); Δκ = 0.18 (95% CI 0.14 to 0.22); p ≤ 0.001). Accuracy was comparable between junior and senior surgeons with the 3D-printed model (accjunior = 0.70 vs accsenior = 0.68; Δacc = -0.02 (95% CI -0.17 to 0.13); p = 0.904). Reliability was also comparable between junior and senior surgeons without the 3D-printed model (κjunior = 0.39 (fair) vs κsenior = 0.43 (moderate); Δκ = 0.03 (95% CI -0.03 to 0.10); p = 0.318). 
However, junior surgeons showed greater improvement regarding reliability than seniors with 3D-printed models (κjunior = 0.65 (substantial) vs κsenior = 0.54 (moderate); Δκ = 0.11 (95% CI 0.04 to 0.18); p = 0.002). CONCLUSION: The use of 3D-printed models significantly improved the accuracy and reliability of recognizing the pattern of complex fractures of the elbow. However, the current long printing time and non-reusable materials could limit the usefulness of 3D-printed models in clinical practice. They could be suitable as a reusable tool for teaching residents.


Subject(s)
Elbow Injuries , Elbow Joint , Joint Dislocations , Radius Fractures , Ulna Fractures , Humans , Elbow , Reproducibility of Results , Elbow Joint/diagnostic imaging , Elbow Joint/surgery , Radius Fractures/diagnostic imaging , Radius Fractures/surgery , Joint Dislocations/surgery , Ulna Fractures/surgery , Printing, Three-Dimensional
13.
Gen Hosp Psychiatry ; 78: 42-49, 2022.
Article in English | MEDLINE | ID: mdl-35853417

ABSTRACT

OBJECTIVE: Anxiety, depression and greater pain intensity before total knee arthroplasty (TKA) may increase the probability of revision surgery for remaining symptoms even without clear pathology or technical issues. We aimed to assess whether preoperative anxiety/depression and pain intensity are associated with revision TKA for less clear indications. METHODS: Less clear indications for revision were defined after a Delphi process in which consensus was reached among 59 orthopaedic knee experts. We performed Cox regression analyses on primary TKA patients registered in the Dutch Arthroplasty Registry (LROI) who completed the EuroQol 5D-3L (EQ5D-3L) anxiety/depression score to examine associations of preoperative anxiety/depression and pain (Numeric Rating Scale (NRS)) with TKA revision for less clear reasons. These analyses were adjusted for age, BMI, sex, smoking, ASA score, EQ5D-3L thermometer and OKS score. RESULTS: In total, 25.9% of the 56,233 included patients reported moderate or severe symptoms of anxiety/depression on the EQ5D-3L anxiety/depression score. Of those, 615 revisions (45.5%) were performed for less clear reasons for revision (patellar pain, malalignment, instability, progression of osteoarthritis or arthrofibrosis). Higher NRS pain at rest and EQ5D-3L pain score, but not the EQ5D-3L anxiety/depression score, were associated with revision for a less clear reason (HR: 1.058, 95% CI 1.019-1.099 and HR: 1.241, 95% CI 1.044-1.476, respectively). CONCLUSION: Our findings suggest that pain intensity is a risk factor for TKA revision for a less clear reason. The finding that preoperative pain intensity was associated with reason for revision confirms a likely influence of subjective, personal factors on offer and acceptance of TKA revision. 
The association between anxiety/depression and reason for revision after TKA may also be found when including more specific outcome measures to assess anxiety/depression and we therefore hope to encourage further research on this topic with our study, ideally in a prospective setting. STUDY DESIGN: Longitudinal Cohort Study Level III, Delphi Consensus.


Subject(s)
Arthroplasty, Replacement, Knee , Osteoarthritis, Knee , Anxiety/epidemiology , Depression/epidemiology , Humans , Longitudinal Studies , Osteoarthritis, Knee/surgery , Pain/epidemiology , Pain Measurement , Prospective Studies , Treatment Outcome
14.
Eur J Trauma Emerg Surg ; 48(6): 4669-4682, 2022 Dec.
Article in English | MEDLINE | ID: mdl-35643788

ABSTRACT

PURPOSE: Preoperative prediction of mortality in femoral neck fracture patients aged 65 years or above may be valuable in the treatment decision-making. A preoperative clinical prediction model can aid surgeons and patients in the shared decision-making process, and optimize care for elderly femoral neck fracture patients. This study aimed to develop and internally validate a clinical prediction model using machine learning (ML) algorithms for 90 day and 2 year mortality in femoral neck fracture patients aged 65 years or above. METHODS: A retrospective cohort study at two trauma level I centers and three (non-level I) community hospitals was conducted to identify patients undergoing surgical fixation for a femoral neck fracture. Five different ML algorithms were developed and internally validated and assessed by discrimination, calibration, Brier score and decision curve analysis. RESULTS: In total, 2478 patients were included with 90 day and 2 year mortality rates of 9.1% (n = 225) and 23.5% (n = 582) respectively. The models included patient characteristics, comorbidities and laboratory values. The stochastic gradient boosting algorithm had the best performance for 90 day mortality prediction, with good discrimination (c-statistic = 0.74), calibration (intercept = - 0.05, slope = 1.11) and Brier score (0.078). The elastic-net penalized logistic regression algorithm had the best performance for 2 year mortality prediction, with good discrimination (c-statistic = 0.70), calibration (intercept = - 0.03, slope = 0.89) and Brier score (0.16). The models were incorporated into a freely available web-based application, including individual patient explanations for interpretation of the model to understand the reasoning how the model made a certain prediction: https://sorg-apps.shinyapps.io/hipfracturemortality/ CONCLUSIONS: The clinical prediction models show promise in estimating mortality prediction in elderly femoral neck fracture patients. 
External and prospective validation of the models may improve surgeons' ability to make treatment decisions. LEVEL OF EVIDENCE: Prognostic Level II.


Subject(s)
Femoral Neck Fractures , Aged , Humans , Retrospective Studies , Femoral Neck Fractures/surgery , Models, Statistical , Prognosis , Machine Learning , Algorithms
15.
Clin Orthop Relat Res ; 480(12): 2350-2360, 2022 12 01.
Article in English | MEDLINE | ID: mdl-35767811

ABSTRACT

BACKGROUND: Femoral neck fractures are common and are frequently treated with internal fixation. A major disadvantage of internal fixation is the substantially high number of conversions to arthroplasty because of nonunion, malunion, avascular necrosis, or implant failure. A clinical prediction model identifying patients at high risk of conversion to arthroplasty may help clinicians in selecting patients who could have benefited from arthroplasty initially. QUESTION/PURPOSE: What is the predictive performance of a machine-learning (ML) algorithm to predict conversion to arthroplasty within 24 months after internal fixation in patients with femoral neck fractures? METHODS: We included 875 patients from the Fixation using Alternative Implants for the Treatment of Hip fractures (FAITH) trial. The FAITH trial consisted of patients with low-energy femoral neck fractures who were randomly assigned to receive a sliding hip screw or cancellous screws for internal fixation. Of these patients, 18% (155 of 875) underwent conversion to THA or hemiarthroplasty within the first 24 months. All patients were randomly divided into a training set (80%) and test set (20%). First, we identified 27 potential patient and fracture characteristics that may have been associated with our primary outcome, based on biomechanical rationale and previous studies. Then, random forest algorithms (a decision tree-based ML algorithm that selects variables) identified 10 predictors of conversion: BMI, cardiac disease, Garden classification, use of cardiac medication, use of pulmonary medication, age, lung disease, osteoarthritis, sex, and the level of the fracture line. Based on these variables, five different ML algorithms were trained to identify patterns related to conversion. 
The predictive performance of these trained ML algorithms was assessed on the training and test sets based on the following performance measures: (1) discrimination (the model's ability to distinguish patients who had conversion from those who did not; expressed with the area under the receiver operating characteristic curve [AUC]), (2) calibration (the plotted estimated versus the observed probabilities; expressed with the calibration curve intercept and slope), and (3) the overall model performance (Brier score: a composite of discrimination and calibration). RESULTS: None of the five ML algorithms performed well in predicting conversion to arthroplasty in the training set and the test set; AUCs of the algorithms in the training set ranged from 0.57 to 0.64, slopes of calibration plots ranged from 0.53 to 0.82, calibration intercepts ranged from -0.04 to 0.05, and Brier scores ranged from 0.14 to 0.15. The algorithms were further evaluated in the test set; AUCs ranged from 0.49 to 0.73, calibration slopes ranged from 0.17 to 1.29, calibration intercepts ranged from -1.28 to 0.34, and Brier scores ranged from 0.13 to 0.15. CONCLUSION: The predictive performance of the trained algorithms was poor, despite the use of one of the best datasets available worldwide on this subject. If the current dataset consisted of different variables or more patients, the performance may have been better. Also, various reasons for conversion to arthroplasty were pooled in this study, but the separate prediction of underlying pathology (such as, avascular necrosis or nonunion) may be more precise. Finally, it may be possible that it is inherently difficult to predict conversion to arthroplasty based on preoperative variables alone. Therefore, future studies should aim to include more variables and to differentiate between the various reasons for arthroplasty. LEVEL OF EVIDENCE: Level III, prognostic study.
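
Two steps from the abstract above, the random 80%/20% split and the AUC used to express discrimination, can be sketched in plain Python. This is a minimal illustration rather than the study's code: the AUC is computed here as the proportion of (converted, not converted) patient pairs in which the converted patient received the higher predicted risk, with ties counted as half, and the split simply shuffles patient indices with a fixed seed. The function names, the seed, and the example predictions are hypothetical.

```python
import random


def train_test_split(n_patients, train_fraction=0.8, seed=0):
    """Shuffle patient indices and split them into training and test index lists."""
    indices = list(range(n_patients))
    random.Random(seed).shuffle(indices)
    n_train = round(n_patients * train_fraction)
    return indices[:n_train], indices[n_train:]


def auc(outcomes, predictions):
    """Area under the ROC curve via pairwise comparison of predicted risks."""
    positives = [p for y, p in zip(outcomes, predictions) if y == 1]
    negatives = [p for y, p in zip(outcomes, predictions) if y == 0]
    if not positives or not negatives:
        raise ValueError("both outcome classes must be present")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0  # correctly ranked pair
            elif pos == neg:
                wins += 0.5  # tie counts as half
    return wins / (len(positives) * len(negatives))


train_idx, test_idx = train_test_split(875)
print(len(train_idx), len(test_idx))

# Hypothetical predictions: 1 = conversion to arthroplasty, 0 = no conversion.
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

An AUC of 0.5 means the model ranks patient pairs no better than chance, which helps put the test-set range of 0.49 to 0.73 reported in the abstract into context.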


Subject(s)
Arthroplasty, Replacement, Hip , Femoral Neck Fractures , Humans , Prognosis , Models, Statistical , Femoral Neck Fractures/surgery , Arthroplasty, Replacement, Hip/adverse effects , Fracture Fixation, Internal/adverse effects , Algorithms , Machine Learning , Necrosis/etiology , Necrosis/surgery , Retrospective Studies , Treatment Outcome
16.
Clin Orthop Relat Res ; 480(11): 2205-2213, 2022 11 01.
Article in English | MEDLINE | ID: mdl-35561268

ABSTRACT

BACKGROUND: Postoperative delirium in patients aged 60 years or older with hip fractures adversely affects clinical and functional outcomes. The economic cost of delirium is estimated to be as high as USD 25,000 per patient, with a total budgetary impact between USD 6.6 billion and USD 82.4 billion annually in the United States alone. Forty percent of delirium episodes are preventable, and accurate risk stratification can decrease the incidence and improve clinical outcomes in patients. A previously developed clinical prediction model (the SORG Orthopaedic Research Group hip fracture delirium machine-learning algorithm) is highly accurate on internal validation (in 28,207 patients with hip fractures aged 60 years or older in a US cohort) in identifying at-risk patients, and it can facilitate the best use of preventive interventions; however, it has not been tested in an independent population. For an algorithm to be useful in real life, it must be valid externally, meaning that it must perform well in a patient cohort different from the cohort used to "train" it. Despite many promising machine-learning prediction models and many promising delirium models, only a few have been externally validated, and even fewer are international validation studies. QUESTION/PURPOSE: Does the SORG hip fracture delirium algorithm, initially trained on a database from the United States, perform well on external validation in patients aged 60 years or older in Australia and New Zealand? METHODS: We previously developed a model in 2021 for assessing risk of delirium in hip fracture patients using records of 28,207 patients obtained from the American College of Surgeons National Surgical Quality Improvement Program. 
Variables in the original model included age, American Society of Anesthesiologists (ASA) class, functional status (independent or partially or totally dependent for any activities of daily living), preoperative dementia, preoperative delirium, and preoperative need for a mobility aid. To assess whether this model could be applied elsewhere, we used records from an international hip fracture registry. Between June 2017 and December 2018, 6672 patients older than 60 years of age in Australia and New Zealand were treated surgically for a femoral neck, intertrochanteric hip, or subtrochanteric hip fracture and entered into the Australian & New Zealand Hip Fracture Registry. Patients were excluded if they had a pathological hip fracture or septic shock. Of all patients, 6% (402 of 6672) did not meet the inclusion criteria, leaving 94% (6270 of 6672) of patients available for inclusion in this retrospective analysis. Seventy-one percent (4249 of 5986) of patients were aged 80 years or older, after accounting for 5% (284 of 6270) of missing values; 68% (4292 of 6266) were female, after accounting for 0.06% (4 of 6270) of missing values, and 83% (4690 of 5661) of patients were classified as ASA III/IV, after accounting for 10% (609 of 6270) of missing values. Missing data were imputed using the missForest methodology. In total, 39% (2467 of 6270) of patients developed postoperative delirium. The performance of the SORG hip fracture delirium algorithm on the validation cohort was assessed by discrimination, calibration, Brier score, and a decision curve analysis. Discrimination, known as the area under the receiver operating characteristic curve (c-statistic), measures the model's ability to distinguish patients who achieved the outcomes from those who did not and ranges from 0.5 to 1.0, with 1.0 indicating the highest discrimination score and 0.50 the lowest. 
Calibration plots the predicted versus the observed probabilities; a perfect plot has an intercept of 0 and a slope of 1. The Brier score calculates a composite of discrimination and calibration, with 0 indicating perfect prediction and 1 the poorest. RESULTS: The SORG hip fracture algorithm, when applied to an external patient cohort, distinguished between patients at low risk and patients at moderate to high risk of developing postoperative delirium. The SORG hip fracture algorithm performed with a c-statistic of 0.74 (95% confidence interval 0.73 to 0.76). The calibration plot showed high accuracy in the lower predicted probabilities (intercept -0.28, slope 0.52) and a Brier score of 0.22 (the null model Brier score was 0.24). The decision curve analysis showed that the model can be beneficial compared with no model or compared with characterizing all patients as at risk for developing delirium. CONCLUSION: Algorithms developed with machine learning are a potential tool for refining treatment of at-risk patients. If high-risk patients can be reliably identified, resources can be appropriately directed toward their care. Although the current iteration of SORG should not be relied on for patient care, it suggests potential utility in assessing risk. Further assessment in different populations, made easier by international collaborations and standardization of registries, would be useful in the development of universally valid prediction models. The model can be freely accessed at: https://sorg-apps.shinyapps.io/hipfxdelirium/ . LEVEL OF EVIDENCE: Level III, therapeutic study.
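To put the reported Brier score of 0.22 next to the null model score of 0.24, the following sketch may help. It assumes the null model predicts the cohort prevalence for every patient, so its Brier score is p(1 - p), and it uses the commonly cited scaled Brier score, 1 - (model Brier score / null model Brier score). The function names are hypothetical, and because the inputs are the rounded values from the abstract, the result is only approximate.

```python
def null_brier(prevalence):
    """Brier score of a model that predicts the prevalence for everyone."""
    return prevalence * (1.0 - prevalence)


def scaled_brier(model_brier, null_model_brier):
    """1 - (model Brier / null Brier); 0 means no gain over the null model."""
    return 1.0 - model_brier / null_model_brier


# Delirium prevalence reported in the abstract: 2467 of 6270 patients.
prevalence = 2467 / 6270
print(round(null_brier(prevalence), 2))    # 0.24, matching the reported null score

# Rounded values from the abstract, so the output is approximate.
print(round(scaled_brier(0.22, 0.24), 3))  # 0.083
```

Under these assumptions the model reduces overall prediction error by roughly 8% relative to the null model.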


Subject(s)
Delirium , Hip Fractures , Orthopedics , Activities of Daily Living , Algorithms , Australia , Delirium/diagnosis , Delirium/epidemiology , Delirium/etiology , Female , Hip Fractures/surgery , Humans , Male , Middle Aged , Models, Statistical , Prognosis , Retrospective Studies
17.
Clin Orthop Relat Res ; 480(9): 1766-1775, 2022 09 01.
Article in English | MEDLINE | ID: mdl-35412473

ABSTRACT

BACKGROUND: Incidental durotomy is an intraoperative complication in spine surgery that can lead to postoperative complications, increased length of stay, and higher healthcare costs. Natural language processing (NLP) is an artificial intelligence method that assists in understanding free-text notes and may be useful in the automated surveillance of adverse events in orthopaedic surgery. A previously developed NLP algorithm is highly accurate in the detection of incidental durotomy on internal validation and external validation in an independent cohort from the same country. External validation in a cohort with linguistic differences, referred to as geographical validation, is required to assess the transportability of the developed algorithm. Ideally, the performance of a prediction model, the NLP algorithm, is constant across geographic regions to ensure reproducibility and model validity. QUESTION/PURPOSE: Can we geographically validate an NLP algorithm for the automated detection of incidental durotomy across three independent cohorts from two continents? METHODS: Patients 18 years or older undergoing a primary procedure of (thoraco)lumbar spine surgery were included. In Massachusetts, between January 2000 and June 2018, 1000 patients were included from two academic and three community medical centers. In Maryland, between July 2016 and November 2018, 1279 patients were included from one academic center, and in Australia, between January 2010 and December 2019, 944 patients were included from one academic center. The authors retrospectively studied the free-text operative notes of included patients for the primary outcome that was defined as intraoperative durotomy. Incidental durotomy occurred in 9% (93 of 1000), 8% (108 of 1279), and 6% (58 of 944) of the patients, respectively, in the Massachusetts, Maryland, and Australia cohorts. No missing reports were observed. 
Three datasets (Massachusetts, Australian, and combined Massachusetts and Australian) were divided into training and holdout test sets in an 80:20 ratio. An extreme gradient boosting (an efficient and flexible tree-based algorithm) NLP algorithm was individually trained on each training set, and the performance of the three NLP algorithms (respectively American, Australian, and combined) was assessed by discrimination via area under the receiver operating characteristic curves (AUC-ROC; this measures the model's ability to distinguish patients who obtained the outcomes from those who did not), calibration metrics (which plot the predicted and the observed probabilities) and Brier score (a composite of discrimination and calibration). In addition, the sensitivity (true positives, recall), specificity (true negatives), positive predictive value (also known as precision), negative predictive value, F1-score (composite of precision and recall), positive likelihood ratio, and negative likelihood ratio were calculated. RESULTS: The combined NLP algorithm (the combined Massachusetts and Australian data) achieved excellent performance on independent testing data from Australia (AUC-ROC 0.97 [95% confidence interval 0.87 to 0.99]), Massachusetts (AUC-ROC 0.99 [95% CI 0.80 to 0.99]) and Maryland (AUC-ROC 0.95 [95% CI 0.93 to 0.97]). The NLP developed based on the Massachusetts cohort had excellent performance in the Maryland cohort (AUC-ROC 0.97 [95% CI 0.95 to 0.99]) but worse performance in the Australian cohort (AUC-ROC 0.74 [95% CI 0.70 to 0.77]). CONCLUSION: We demonstrated the clinical utility and reproducibility of an NLP algorithm with combined datasets retaining excellent performance in individual countries relative to algorithms developed in the same country alone for detection of incidental durotomy. 
Further multi-institutional, international collaborations can facilitate the creation of universal NLP algorithms that improve the quality and safety of orthopaedic surgery globally. The combined NLP algorithm has been incorporated into a freely accessible web application that can be found at https://sorg-apps.shinyapps.io/nlp_incidental_durotomy/ . Clinicians and researchers can use the tool to help incorporate the model in evaluating spine registries or quality and safety departments to automate detection of incidental durotomy and optimize prevention efforts. LEVEL OF EVIDENCE: Level III, diagnostic study.
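The classification metrics listed in this abstract all derive from the four cells of a 2x2 confusion matrix. The sketch below shows the standard definitions. The function name and the counts are hypothetical, chosen only for illustration, and are not results from the durotomy cohorts.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Standard metrics from a 2x2 confusion matrix (counts of true/false
    positives and negatives)."""
    sensitivity = tp / (tp + fn)   # recall: share of true cases detected
    specificity = tn / (tn + fp)   # share of non-cases correctly cleared
    ppv = tp / (tp + fp)           # precision
    npv = tn / (tn + fn)
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "f1": f1,
        "lr_positive": sensitivity / (1 - specificity),
        "lr_negative": (1 - sensitivity) / specificity,
    }


# Hypothetical counts for 1000 operative notes (illustration only).
metrics = diagnostic_metrics(tp=40, fp=20, fn=10, tn=930)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With a rare outcome such as incidental durotomy, the positive predictive value can be modest even when sensitivity and specificity are high, which is one reason the abstract reports several metrics rather than accuracy alone.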


Subject(s)
Artificial Intelligence , Natural Language Processing , Algorithms , Australia , Humans , Reproducibility of Results , Retrospective Studies
18.
Clin Orthop Relat Res ; 480(1): 150-159, 2022 01 01.
Article in English | MEDLINE | ID: mdl-34427569

ABSTRACT

BACKGROUND: Reliably recognizing the overall pattern and specific characteristics of proximal humerus fractures may aid in surgical decision-making. With conventional onscreen imaging modalities, there is considerable and undesired interobserver variability, even when observers receive training in the application of the classification systems used. It is unclear whether three-dimensional (3D) models, which now can be fabricated with desktop printers at relatively little cost, can decrease interobserver variability in fracture classification. QUESTIONS/PURPOSES: Do 3D-printed handheld models of proximal humerus fractures improve agreement among residents and attending surgeons regarding (1) specific fracture characteristics and (2) patterns according to the Neer and Hertel classification systems? METHODS: Plain radiographs, as well as two-dimensional (2D) and 3D CT images, were collected from 20 patients (aged 18 years or older) who sustained a three-part or four-part proximal humerus fracture treated at a Level I trauma center between 2015 and 2019. The included images were chosen to comprise images from patients whose fractures were considered as difficult-to-classify, displaced fractures. Consequently, the images were assessed for eight fracture characteristics and categorized according to the Neer and Hertel classifications by four orthopaedic residents and four attending orthopaedic surgeons during two separate sessions. In the first session, the assessment was performed with conventional onscreen imaging (radiographs and 2D and 3D CT images). In the second session, 3D-printed handheld models were used for assessment, while onscreen imaging was also available. 
Although proximal humerus classifications such as the Neer classification have, in the past, been shown to have low interobserver reliability, we theorized that by receiving direct tactile and visual feedback from 3D-printed handheld fracture models, clinicians would be able to recognize the complex 3D aspects of classification systems reliably. Interobserver agreement was determined with the multirater Fleiss kappa and scored according to the categorical rating by Landis and Koch. To determine whether there was a difference between the two sessions, we calculated the delta (difference in the) kappa value with 95% confidence intervals and a two-tailed p value. Post hoc power analysis revealed that with the current sample size, a delta kappa value of 0.40 could be detected with 80% power at alpha = 0.05. RESULTS: Using 3D-printed models in addition to conventional imaging did not improve interobserver agreement of the following fracture characteristics: more than 2 mm medial hinge displacement, more than 8 mm metaphyseal extension, surgical neck fracture, anatomic neck fracture, displacement of the humeral head, more than 10 mm lesser tuberosity displacement, and more than 10 mm greater tuberosity displacement. Agreement regarding the presence of a humeral head-splitting fracture was improved but only to a level that was insufficient for clinical or scientific use (fair to substantial, delta kappa = 0.33 [95% CI 0.02 to 0.64]). Assessing 3D-printed handheld models adjunct to onscreen conventional imaging did not improve the interobserver agreement for pattern recognition according to Neer (delta kappa = 0.02 [95% CI -0.11 to 0.07]) and Hertel (delta kappa = 0.01 [95% CI -0.11 to 0.08]). There were no differences between residents and attending surgeons in terms of whether 3D models helped them classify the fractures, but there were a few differences in identifying fracture characteristics. 
However, none of the identified differences improved to almost perfect agreement (kappa value above 0.80), so even those few differences are unlikely to be clinically useful. CONCLUSION: Using 3D-printed handheld fracture models in addition to conventional onscreen imaging of three-part and four-part proximal humerus fractures does not improve agreement among residents and attending surgeons on specific fracture characteristics and patterns. Therefore, we do not recommend that clinicians expend the time and costs needed to create these models if the goal is to classify or describe patients' fracture characteristics or pattern, since doing so is unlikely to improve clinicians' abilities to select treatment or estimate prognosis. LEVEL OF EVIDENCE: Level III, diagnostic study.
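The multirater Fleiss kappa and the Landis and Koch categories used in this abstract can be sketched as follows. The input layout (one row per fracture, one column per category, each cell holding the number of raters who chose that category) and the toy ratings are assumptions for illustration. The study's confidence intervals and p values for delta kappa are not reproduced here.

```python
def fleiss_kappa(counts):
    """Fleiss kappa for counts[i][j] = number of raters who assigned
    subject i to category j. Every subject must have the same number of raters."""
    n_subjects = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])
    # Overall proportion of assignments falling in each category.
    p_j = [
        sum(row[j] for row in counts) / (n_subjects * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement among rater pairs, per subject.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_i) / n_subjects
    p_expected = sum(p * p for p in p_j)
    return (p_observed - p_expected) / (1 - p_expected)


def landis_koch(kappa):
    """Categorical label for a kappa value according to Landis and Koch."""
    if kappa < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
                         (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"


# Hypothetical ratings: 4 fractures, 8 raters, 2 categories (present/absent).
session = [[8, 0], [4, 4], [0, 8], [6, 2]]
kappa = fleiss_kappa(session)
print(round(kappa, 3), landis_koch(kappa))  # 0.492 moderate
```

Delta kappa is then simply the kappa from the second session minus the kappa from the first, computed on the same fractures and raters.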


Subject(s)
Shoulder Fractures , Tomography, X-Ray Computed , Humans , Humeral Head , Observer Variation , Printing, Three-Dimensional , Reproducibility of Results , Shoulder Fractures/diagnostic imaging , Shoulder Fractures/surgery
19.
J Orthop Res ; 40(2): 475-483, 2022 Feb.
Article in English | MEDLINE | ID: mdl-33734466

ABSTRACT

Machine learning (ML) studies are becoming increasingly popular in orthopedics but lack a critical appraisal of their adherence to peer-reviewed guidelines. The objective of this review was to (1) evaluate quality and transparent reporting of ML prediction models in orthopedic surgery based on the transparent reporting of multivariable prediction models for individual prognosis or diagnosis (TRIPOD), and (2) assess risk of bias with the Prediction model Risk Of Bias ASsessment Tool. A systematic review was performed to identify all ML prediction studies published in orthopedic surgery through June 18th, 2020. After screening 7138 studies, 59 studies met the study criteria and were included. Two reviewers independently extracted data and discrepancies were resolved by discussion with at least two additional reviewers present. Across all studies, the overall median completeness for the TRIPOD checklist was 53% (interquartile range 47%-60%). The overall risk of bias was low in 44% (n = 26), high in 41% (n = 24), and unclear in 15% (n = 9). High overall risk of bias was driven by incomplete reporting of performance measures, inadequate handling of missing data, and use of small datasets with inadequate outcome numbers. Although the number of ML studies in orthopedic surgery is increasing rapidly, over 40% of the existing models are at high risk of bias. Furthermore, over half incompletely reported their methods and/or performance measures. Until these issues are adequately addressed to give patients and providers trust in ML models, a considerable gap remains between the development of ML prediction models and their implementation in orthopedic practice.


Subject(s)
Orthopedic Procedures , Orthopedics , Bias , Humans , Machine Learning , Prognosis
20.
J Bone Joint Surg Am ; 104(6): 544-551, 2022 03 16.
Article in English | MEDLINE | ID: mdl-34921550

ABSTRACT

BACKGROUND: Statistical models using machine learning (ML) have the potential for more accurate estimates of the probability of binary events than logistic regression. The present study used existing data sets from large musculoskeletal trauma trials to address the following study questions: (1) Do ML models produce better probability estimates than logistic regression models? (2) Are ML models influenced by different variables than logistic regression models? METHODS: We created ML and logistic regression models that estimated the probability of a specific fracture (posterior malleolar involvement in distal spiral tibial shaft and ankle fractures, scaphoid fracture, and distal radial fracture) or adverse event (subsequent surgery [after distal biceps repair or tibial shaft fracture], surgical site infection, and postoperative delirium) using 9 data sets from published musculoskeletal trauma studies. Each data set was split into training (80%) and test (20%) subsets. Fivefold cross-validation of the training set was used to develop the ML models. The best-performing model was then assessed in the independent testing data. Performance was assessed by (1) discrimination (c-statistic), (2) calibration (slope and intercept), and (3) overall performance (Brier score). RESULTS: The mean c-statistic was 0.01 higher for the logistic regression models compared with the best ML models for each data set (range, -0.01 to 0.06). There were fewer variables strongly associated with variation in the ML models, and many were dissimilar from those in the logistic regression models. CONCLUSIONS: The observation that ML models produce probability estimates comparable with logistic regression models for binary events in musculoskeletal trauma suggests that their benefit may be limited in this context.
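The data-splitting scheme described in this abstract (an 80%/20% training and test split, with fivefold cross-validation inside the training set) can be sketched with index lists alone. The function names, the fixed seed, and the sample size of 100 are hypothetical; the abstract does not say how the splits were generated or whether they were stratified by outcome.

```python
import random


def train_test_split(n, test_fraction=0.2, seed=0):
    """Shuffle the indices 0..n-1 and hold out a fraction as the test set."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = round(n * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_folds(indices, k=5):
    """Yield (train, validation) index lists; each index is used for
    validation exactly once across the k folds."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, validation


# Hypothetical data set of 100 patients (illustration only).
train_idx, test_idx = train_test_split(100)
print(len(train_idx), len(test_idx))  # 80 20

for fold_train, fold_val in k_folds(train_idx, k=5):
    # A model would be fitted on fold_train and scored on fold_val here;
    # the held-out test_idx is touched only once, after model selection.
    print(len(fold_train), len(fold_val))  # 64 16
```

Keeping the test indices out of the cross-validation loop is what allows the final test-set performance to be described as independent.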


Subject(s)
Ankle Fractures , Orthopedics , Scaphoid Bone , Tibial Fractures , Algorithms , Ankle Fractures/surgery , Feasibility Studies , Humans , Logistic Models , Machine Learning , Retrospective Studies , Tibial Fractures/surgery