ABSTRACT
BACKGROUND: It is well documented that routinely collected patient sociodemographic characteristics (such as race and insurance type) and geography-based social determinants of health (SDoH) measures (for example, the Area Deprivation Index) are associated with health disparities, including symptom severity at presentation. However, the association of patient-level SDoH factors (such as housing status) with musculoskeletal health disparities is not as well documented. Such insight might help with the development of more-targeted interventions to help address health disparities in orthopaedic surgery. QUESTIONS/PURPOSES: (1) What percentage of patients presenting for new patient visits in an orthopaedic surgery clinic were unemployed but seeking work, reported transportation issues that could limit their ability to attend a medical appointment or acquire medications, reported trouble paying for medications, and/or had no current housing? (2) Accounting for traditional sociodemographic factors and patient-level SDoH measures, what factors are associated with poorer patient-reported outcome physical health scores at presentation? (3) Accounting for traditional sociodemographic factors and patient-level SDoH measures, what factors are associated with poorer patient-reported outcome mental health scores at presentation? METHODS: New patient encounters at one Level 1 trauma center clinic from March 2018 to December 2020 were identified. Included patients had to meet two criteria: they had completed the Patient-Reported Outcomes Measurement Information System (PROMIS) Global-10 at their new orthopaedic surgery clinic encounter as part of routine clinical care, and they had visited their primary care physician and completed a series of specific SDoH questions. The SDoH questionnaire was developed in our institution to improve data that drive interventions to address health disparities as part of our accountable care organization work. 
Over the study period, the SDoH questionnaire was only distributed at primary care provider visits. The SDoH questions focused on transportation, housing, employment, and ability to pay for medications. Because we do not have a way to determine how many patients had both primary care provider office visits and new orthopaedic surgery clinic visits over the study period, we were unable to determine how many patients could have been included; however, 9057 patients were evaluated in this cross-sectional study. The mean age was 61 ± 15 years, and most patients self-reported being of White race (83% [7561 of 9057]). Approximately half the patient sample had commercial insurance (46% [4167 of 9057]). To get a better sense of how this study cohort compared with the overall patient population seen at the participating center during the time in question, we reviewed all new patient clinic encounters (n = 135,223). The demographic information between the full patient sample and our study subgroup appeared similar. Using our study cohort, two multivariable linear regression models were created to determine which traditional metrics (for example, self-reported race or insurance type) and patient-specific SDoH factors (for example, lack of reliable transportation) were associated with worse physical and mental health symptoms (that is, lower PROMIS scores) at new patient encounters. The variance inflation factor was used to assess for multicollinearity. For all analyses, p values < 0.05 designated statistical significance. The concept of minimum clinically important difference (MCID) was used to assess clinical importance. Regression coefficients represent the projected change in PROMIS physical or mental health symptom scores (that is, the dependent variable in our regression analyses) accounting for the other included variables. 
Thus, a regression coefficient for a given variable at or above a known MCID value suggests a clinical difference between those patients with and without the presence of that given characteristic. In this manuscript, regression coefficients at or above 4.2 (or at or below -4.2) for PROMIS Global Physical Health and at or above 5.1 (or at or below -5.1) for PROMIS Global Mental Health were considered clinically relevant. RESULTS: Among the included patients, 8% (685 of 9057) were unemployed but seeking work, 4% (399 of 9057) reported transportation issues that could limit their ability to attend a medical appointment or acquire medications, 4% (328 of 9057) reported trouble paying for medications, and 2% (181 of 9057) had no current housing. Lack of reliable transportation to attend doctor visits or pick up medications (β = -4.52 [95% CI -5.45 to -3.59]; p < 0.001), trouble paying for medications (β = -4.55 [95% CI -5.55 to -3.54]; p < 0.001), Medicaid insurance (β = -5.81 [95% CI -6.41 to -5.20]; p < 0.001), and workers compensation insurance (β = -5.99 [95% CI -7.65 to -4.34]; p < 0.001) were associated with clinically worse function at presentation. Trouble paying for medications (β = -6.01 [95% CI -7.10 to -4.92]; p < 0.001), Medicaid insurance (β = -5.35 [95% CI -6.00 to -4.69]; p < 0.001), and workers compensation (β = -6.07 [95% CI -7.86 to -4.28]; p < 0.001) were associated with clinically worse mental health at presentation. CONCLUSION: Although transportation issues and financial hardship were associated with worse presenting physical function and mental health, Medicaid and workers compensation insurance remained associated with worse presenting physical function and mental health even after controlling for these more detailed, patient-level SDoH factors. 
Because of that, interventions to decrease health disparities should focus not only on sociodemographic variables (for example, insurance type) but also on tangible patient-specific SDoH characteristics. For example, this may include giving taxi vouchers or ride-sharing credits to patients who need them to attend clinic visits, initiating financial assistance programs for necessary medications, and/or identifying and connecting certain patient groups with social support services early in the care cycle. LEVEL OF EVIDENCE: Level III, prognostic study.
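The MCID rule described in the methods can be sketched in a few lines of Python. The thresholds (4.2 for physical health, 5.1 for mental health) and the coefficients are the values reported in the abstract; the function and variable names are illustrative assumptions, not the authors' code.

```python
# MCID thresholds reported in the abstract, applied to the absolute value
# of a regression coefficient. Names here are hypothetical.
MCID = {"physical": 4.2, "mental": 5.1}


def clinically_relevant(beta, domain):
    """Return True when |beta| meets or exceeds the MCID for the given domain."""
    return abs(beta) >= MCID[domain]


# Coefficients as reported in the RESULTS section of the abstract.
reported = [
    ("lack of reliable transportation", "physical", -4.52),
    ("trouble paying for medications", "physical", -4.55),
    ("Medicaid insurance", "physical", -5.81),
    ("workers compensation insurance", "physical", -5.99),
    ("trouble paying for medications", "mental", -6.01),
    ("Medicaid insurance", "mental", -5.35),
    ("workers compensation insurance", "mental", -6.07),
]

flags = {(name, domain): clinically_relevant(beta, domain)
         for name, domain, beta in reported}
```

Each reported coefficient meets its domain's threshold. The same transportation coefficient (-4.52) would not meet the stricter mental health threshold of 5.1.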
Subjects
Musculoskeletal Diseases , Orthopedics , United States , Humans , Middle Aged , Aged , Mental Health , Social Determinants of Health , Cross-Sectional Studies , Musculoskeletal Diseases/diagnosis , Musculoskeletal Diseases/therapy
ABSTRACT
BACKGROUND: The ability to predict survival accurately in patients with osseous metastatic disease of the extremities is vital for patient counseling and guiding surgical intervention. We, the Skeletal Oncology Research Group (SORG), previously developed a machine-learning algorithm (MLA) based on data from 1999 to 2016 to predict 90-day and 1-year survival of surgically treated patients with extremity bone metastasis. As treatment regimens for oncology patients continue to evolve, this SORG MLA-driven probability calculator requires temporal reassessment of its accuracy. QUESTION/PURPOSE: Does the SORG-MLA accurately predict 90-day and 1-year survival in patients who receive surgical treatment for a metastatic long-bone lesion in a more recent cohort of patients treated between 2016 and 2020? METHODS: Between 2017 and 2021, we identified 674 patients 18 years and older through the ICD codes for secondary malignant neoplasm of bone and bone marrow and CPT codes for completed pathologic fractures or prophylactic treatment of an impending fracture. We excluded 40% (268 of 674) of patients, including 18% (118) who did not receive surgery; 11% (72) who had metastases in places other than the long bones of the extremities; 3% (23) who received treatment other than intramedullary nailing, endoprosthetic reconstruction, or dynamic hip screw; 3% (23) who underwent revision surgery, 3% (17) in whom there was no tumor, and 2% (15) who were lost to follow-up within 1 year. Temporal validation was performed using data on 406 patients treated surgically for bony metastatic disease of the extremities from 2016 to 2020 at the same two institutions where the MLA was developed. Variables used to predict survival in the SORG algorithm included perioperative laboratory values, tumor characteristics, and general demographics. 
To assess the models' discrimination, we computed the c-statistic, commonly referred to as the area under the receiver operating characteristic curve (AUC) for binary classification. This value ranges from 0.5 (representing chance-level performance) to 1.0 (indicating excellent discrimination). Generally, an AUC of 0.75 is considered high enough for use in clinical practice. To evaluate the agreement between predicted and observed outcomes, a calibration plot was used, and the calibration slope and intercept were calculated. Perfect calibration would result in a slope of 1 and intercept of 0. For overall performance, the Brier score and null-model Brier score were determined. The Brier score can range from 0 (representing perfect prediction) to 1 (indicating the poorest prediction). Proper interpretation of the Brier score necessitates a comparison with the null-model Brier score, which represents the score for an algorithm that predicts a probability equal to the population prevalence of the outcome for each patient. Finally, a decision curve analysis was conducted to compare the potential net benefit of the algorithm with other decision-support methods, such as treating all or none of the patients. Overall, 90-day and 1-year mortality were lower in the temporal validation cohort than in the development cohort (90 day: 23% versus 28%; p < 0.001, and 1 year: 51% versus 59%; p < 0.001). RESULTS: Mortality was lower in the validation cohort than in the cohort on which the model was trained: 23% versus 28% at the 90-day timepoint and 51% versus 59% at the 1-year timepoint. The AUC was 0.78 (95% CI 0.72 to 0.82) for 90-day survival and 0.75 (95% CI 0.70 to 0.79) for 1-year survival, indicating the model could distinguish the two outcomes reasonably well. 
For the 90-day model, the calibration slope was 0.71 (95% CI 0.53 to 0.89), and the intercept was -0.66 (95% CI -0.94 to -0.39), suggesting the predicted risks were overly extreme, and that in general, the risk of the observed outcome was overestimated. For the 1-year model, the calibration slope was 0.73 (95% CI 0.56 to 0.91) and the intercept was -0.67 (95% CI -0.90 to -0.43). With respect to overall performance, the model's Brier scores for the 90-day and 1-year models were 0.16 and 0.22. These scores were higher than the Brier scores from internal validation in the development study (0.13 and 0.14), indicating the models' performance has declined over time. CONCLUSION: The SORG MLA to predict survival after surgical treatment of extremity metastatic disease showed decreased performance on temporal validation. Moreover, in patients undergoing innovative immunotherapy, mortality risk was overestimated to varying degrees. Clinicians should be aware of this overestimation and discount the prediction of the SORG MLA according to their own experience with this patient population. Generally, these results show that temporal reassessment of these MLA-driven probability calculators is of paramount importance because the predictive performance may decline over time as treatment regimens evolve. The SORG-MLA is available as a freely accessible internet application at https://sorg-apps.shinyapps.io/extremitymetssurvival/ . LEVEL OF EVIDENCE: Level III, prognostic study.
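The Brier score and its null-model counterpart, as defined in the methods above, can be illustrated with a minimal Python sketch. The function names and the five-patient toy cohort are hypothetical and are not taken from the study.

```python
def brier_score(outcomes, predicted):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(outcomes, predicted)) / len(outcomes)


def null_model_brier_score(outcomes):
    """Brier score of a model that predicts the outcome prevalence for every patient."""
    prevalence = sum(outcomes) / len(outcomes)
    return brier_score(outcomes, [prevalence] * len(outcomes))


# Hypothetical toy cohort: 1 = died, 0 = survived.
toy_outcomes = [1, 0, 0, 1, 0]
toy_predicted = [0.8, 0.3, 0.1, 0.6, 0.4]

toy_brier = brier_score(toy_outcomes, toy_predicted)
toy_null_brier = null_model_brier_score(toy_outcomes)
```

A model adds information only when its Brier score falls below the null-model score. Because the null model predicts the prevalence for everyone, its score reduces to prevalence × (1 − prevalence). For a cohort with 23% mortality this works out to about 0.18, a figure the abstract itself does not report.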
Subjects
Bone Neoplasms , Humans , Prognosis , Bone Neoplasms/therapy , Algorithms , Extremities , Machine Learning , Retrospective Studies
ABSTRACT
BACKGROUND: The Skeletal Oncology Research Group machine-learning algorithm (SORG-MLA) was developed to predict the survival of patients with spinal metastasis. The algorithm was successfully tested in five international institutions using 1101 patients from different continents. The incorporation of 18 prognostic factors strengthens its predictive ability but limits its clinical utility because some prognostic factors might not be clinically available when a clinician wishes to make a prediction. QUESTIONS/PURPOSES: We performed this study to (1) evaluate the SORG-MLA's performance with missing data and (2) develop an internet-based application to impute the missing data. METHODS: A total of 2768 patients were included in this study. The data of 617 patients who were treated surgically were intentionally erased, and the data of the other 2151 patients who were treated with radiotherapy and medical treatment were used to impute the artificially missing data. Compared with those who were treated nonsurgically, patients undergoing surgery were younger (median 59 years [IQR 51 to 67 years] versus median 62 years [IQR 53 to 71 years]) and had a higher proportion of patients with at least three spinal metastatic levels (77% [474 of 617] versus 72% [1547 of 2151]), more neurologic deficit (normal American Spinal Injury Association [E] 68% [301 of 443] versus 79% [1227 of 1561]), higher BMI (23 kg/m² [IQR 20 to 25 kg/m²] versus 22 kg/m² [IQR 20 to 25 kg/m²]), higher platelet count (240 × 10³/µL [IQR 173 to 327 × 10³/µL] versus 227 × 10³/µL [IQR 165 to 302 × 10³/µL]), higher lymphocyte count (15 × 10³/µL [IQR 9 to 21 × 10³/µL] versus 14 × 10³/µL [IQR 8 to 21 × 10³/µL]), lower serum creatinine level (0.7 mg/dL [IQR 0.6 to 0.9 mg/dL] versus 0.8 mg/dL [IQR 0.6 to 1.0 mg/dL]), less previous systemic therapy (19% [115 of 617] versus 24% [526 of 2151]), fewer Charlson comorbidities other than cancer (28% [170 of 617] versus 36% [770 of 2151]), and longer median survival. 
The two patient groups did not differ in other regards. These findings aligned with our institutional philosophy of selecting patients for surgical intervention based on their level of favorable prognostic factors such as BMI or lymphocyte counts and lower levels of unfavorable prognostic factors such as white blood cell counts or serum creatinine level, as well as the degree of spinal instability and severity of neurologic deficits. This approach aims to identify patients with better survival outcomes and prioritize their surgical intervention accordingly. Seven factors (serum albumin and alkaline phosphatase levels, international normalized ratio, lymphocyte and neutrophil counts, and the presence of visceral or brain metastases) were considered possible missing items based on five previous validation studies and clinical experience. Artificially missing data were imputed using the missForest imputation technique, which was previously applied and successfully tested to fit the SORG-MLA in validation studies. Discrimination, calibration, overall performance, and decision curve analysis were applied to evaluate the SORG-MLA's performance. The discrimination ability was measured with an area under the receiver operating characteristic curve. It ranges from 0.5 to 1.0, with 0.5 indicating the worst discrimination and 1.0 indicating perfect discrimination. An area under the curve of 0.7 is considered clinically acceptable discrimination. Calibration refers to the agreement between the predicted outcomes and actual outcomes. An ideal calibration model will yield predicted survival rates that are congruent with the observed survival rates. The Brier score measures the squared difference between the actual outcome and predicted probability, which captures calibration and discrimination ability simultaneously. A Brier score of 0 indicates perfect prediction, whereas a Brier score of 1 indicates the poorest prediction. 
A decision curve analysis was performed for the 6-week, 90-day, and 1-year prediction models to evaluate their net benefit across different threshold probabilities. Using the results from our analysis, we developed an internet-based application that facilitates real-time data imputation for clinical decision-making at the point of care. This tool allows healthcare professionals to efficiently and effectively address missing data, ensuring that patient care remains optimal at all times. RESULTS: Generally, the SORG-MLA demonstrated good discriminatory ability, with areas under the curve greater than 0.7 in most cases, and good overall performance, with up to 25% improvement in Brier scores in the presence of one to three missing items. The only exceptions were albumin level and lymphocyte count, because the SORG-MLA's performance was reduced when these two items were missing, indicating that the SORG-MLA might be unreliable without these values. The model tended to underestimate the patient survival rate. As the number of missing items increased, the model's discriminatory ability was progressively impaired, and a marked underestimation of patient survival rates was observed. Specifically, when three items were missing, the number of actual survivors was up to 1.3 times greater than the number of expected survivors, while only 10% discrepancy was observed when only one item was missing. When either two or three items were omitted, the decision curves exhibited substantial overlap, indicating a lack of consistent disparities in performance. This finding suggests that the SORG-MLA consistently generates accurate predictions, regardless of the two or three items that are omitted. We developed an internet application (https://sorg-spine-mets-missing-data-imputation.azurewebsites.net/) that allows the use of SORG-MLA with up to three missing items. 
CONCLUSION: The SORG-MLA generally performed well in the presence of one to three missing items, except for serum albumin level and lymphocyte count (which are essential for adequate predictions, even using our modified version of the SORG-MLA). We recommend that future studies develop prediction models that can be used when data are missing, or provide a means to impute those missing data, because some data are not available at the time a clinical decision must be made. CLINICAL RELEVANCE: The results suggest the algorithm could be helpful when a radiologic evaluation cannot be performed in time because of a lengthy waiting period, especially when an early operation could be beneficial. It could help orthopaedic surgeons decide whether to intervene palliatively or extensively, even when the surgical indication is clear.
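The comparison of actual with expected survivors reported in the results corresponds to an observed-to-expected ratio. The sketch below is one plausible way to compute such a ratio and is not necessarily the authors' exact procedure. The function name and the ten-patient toy data are hypothetical.

```python
def observed_to_expected_ratio(survived, predicted_survival):
    """Ratio of observed survivors to the sum of predicted survival probabilities.

    A ratio above 1 means more patients survived than the model expected,
    that is, the model underestimated survival.
    """
    observed = sum(survived)
    expected = sum(predicted_survival)
    return observed / expected


# Hypothetical toy data: 8 of 10 patients survived, but the model assigned
# each patient a 60% survival probability (6 expected survivors).
toy_survived = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
toy_predicted_survival = [0.6] * 10

toy_ratio = observed_to_expected_ratio(toy_survived, toy_predicted_survival)
```

In this toy case the ratio is about 1.33, the same direction of miscalibration (underestimated survival) that the abstract describes when several items are missing.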
ABSTRACT
INTRODUCTION: Complications after total hip arthroplasty (THA) may result in readmission or reoperation and impose a significant cost on the healthcare system. Understanding which patients are at risk for complications can potentially allow for targeted interventions to decrease complication rates through preoperative health optimization. The purpose of the current study was to develop and internally validate machine learning (ML) algorithms capable of performing patient-specific predictions of all-cause complications within two years of primary THA. METHODS: This was a retrospective case-control study of clinical registry data from 616 primary THA patients from one large academic and two community hospitals. The primary outcome was all-cause complications at a minimum of 2 years after primary THA. Recursive feature elimination was applied to identify preoperative variables with the greatest predictive value. Five ML algorithms were developed on the training set using tenfold cross-validation and internally validated on the independent testing set of patients. Algorithms were assessed by discrimination, calibration, Brier score, and decision curve analysis to quantify performance. RESULTS: The observed complication rate was 16.6%. The stochastic gradient boosting model achieved the best performance with an AUC = 0.88, calibration intercept = 0.1, calibration slope = 1.22, and Brier score = 0.09. The most important factors for predicting complications were age, drug allergies, prior hip surgery, smoking, and opioid use. 
Individual patient-level explanations were provided for the algorithm predictions and incorporated into an open access digital application: https://sorg-apps.shinyapps.io/tha_complication/ CONCLUSIONS: The stochastic gradient boosting algorithm demonstrated good discriminatory capacity for identifying patients at high risk of experiencing a postoperative complication and provided proof of concept for creating office-based applications from ML that can perform real-time prediction. However, the clinical utility of the current algorithm is unknown, and the definitions of complications are broad. Further investigation on larger data sets and rigorous external validation are necessary before assessing clinical utility with respect to risk stratification of patients undergoing primary THA. LEVEL OF EVIDENCE: III, therapeutic study.
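Discrimination as summarized by the AUC can be read as the probability that a randomly chosen patient with a complication receives a higher predicted risk than a randomly chosen patient without one. The pairwise Python sketch below illustrates that definition only. It is not the study's pipeline, and the function name and toy data are hypothetical.

```python
def auc(outcomes, predicted):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    positives = [p for y, p in zip(outcomes, predicted) if y == 1]
    negatives = [p for y, p in zip(outcomes, predicted) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one event and one non-event")
    concordant = 0.0
    for p_pos in positives:
        for p_neg in negatives:
            if p_pos > p_neg:
                concordant += 1.0
            elif p_pos == p_neg:
                concordant += 0.5
    return concordant / (len(positives) * len(negatives))


# Hypothetical toy data: 1 = complication, 0 = no complication.
toy_outcomes = [0, 0, 1, 1]
toy_predicted = [0.1, 0.4, 0.35, 0.8]
toy_auc = auc(toy_outcomes, toy_predicted)
```

Three of the four event/non-event pairs in the toy data are ranked correctly, giving an AUC of 0.75. The pairwise loop is quadratic in cohort size, which is acceptable for illustration but slow for large registries.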
Subjects
Hip Arthroplasty , Humans , Retrospective Studies , Case-Control Studies , Hip Arthroplasty/adverse effects , Algorithms , Machine Learning
ABSTRACT
INTRODUCTION: Arthroplasty care delivery is facing a growing supply-demand mismatch. To meet future demand for joint arthroplasty, systems will need to identify potential surgical candidates prior to evaluation by orthopaedic surgeons. MATERIALS AND METHODS: Retrospective review was conducted at two academic medical centers and three community hospitals from March 1 to July 31, 2020 to identify new patient telemedicine encounters (without prior in-person evaluation) for consideration of hip or knee arthroplasty. The primary outcome was surgical indication for joint replacement. Five machine learning algorithms were developed to predict likelihood of surgical indication and assessed by discrimination, calibration, overall performance, and decision curve analysis. RESULTS: Overall, 158 patients underwent new patient telemedicine evaluation for consideration of THA, TKA, or UKA and 65.2% (n = 103) were indicated for operative intervention prior to in-person evaluation. The median age was 65 (interquartile range 59-70) and 60.8% were women. Variables found to be associated with operative intervention were radiographic degree of arthritis, prior trial of intra-articular injection, trial of physical therapy, opioid use, and tobacco use. In the independent testing set (n = 46) not used for algorithm development, the stochastic gradient boosting algorithm achieved the best performance with AUC 0.83, calibration intercept 0.13, calibration slope 1.03, Brier score 0.15 relative to a null model Brier score of 0.23, and higher net benefit than the default alternatives on decision curve analysis. CONCLUSION: We developed a machine learning algorithm to identify potential surgical candidates for joint arthroplasty in the setting of osteoarthritis without an in-person evaluation or physical examination. 
If externally validated, this algorithm could be deployed by various stakeholders, including patients, providers, and health systems, to direct appropriate next steps in patients with osteoarthritis and improve efficiency in identifying surgical candidates. LEVEL OF EVIDENCE: III.
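The net benefit compared in a decision curve analysis is conventionally defined as true positives per patient minus false positives per patient, weighted by the odds of the threshold probability. The sketch below implements that standard definition together with the "treat all" default; the "treat none" default has a net benefit of 0 by definition. Function names and toy data are hypothetical, not taken from the study.

```python
def net_benefit(outcomes, predicted, threshold):
    """Net benefit at a threshold probability: TP/n - FP/n * threshold/(1 - threshold)."""
    n = len(outcomes)
    true_pos = sum(1 for y, p in zip(outcomes, predicted) if p >= threshold and y == 1)
    false_pos = sum(1 for y, p in zip(outcomes, predicted) if p >= threshold and y == 0)
    odds = threshold / (1.0 - threshold)
    return true_pos / n - (false_pos / n) * odds


def net_benefit_treat_all(outcomes, threshold):
    """Net benefit of classifying every patient as positive."""
    return net_benefit(outcomes, [1.0] * len(outcomes), threshold)


# Hypothetical toy data: 1 = indicated for surgery, 0 = not indicated.
toy_outcomes = [1, 1, 0, 0]
toy_predicted = [0.9, 0.6, 0.7, 0.2]

toy_model_nb = net_benefit(toy_outcomes, toy_predicted, 0.5)
toy_treat_all_nb = net_benefit_treat_all(toy_outcomes, 0.5)
```

A decision curve repeats this calculation over a range of thresholds. A model is useful at a given threshold when its net benefit exceeds both defaults, as it does in the toy example at 0.5.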
Subjects
Hip Arthroplasty , Knee Arthroplasty , Osteoarthritis , Humans , Female , Aged , Male , Algorithms , Machine Learning , Retrospective Studies
ABSTRACT
INTRODUCTION: Numerous prognostication models have been developed to estimate survival in patients with extremity metastatic bone disease, but few include albumin despite albumin's role in malnutrition and inflammation. The purpose of this study was to examine two independent datasets to determine the value of albumin in prognosticating survival in this population. MATERIALS AND METHODS: Patients with extremity metastatic bone disease undergoing surgical management were identified from two independent populations. Population 1: Retrospective chart review at two tertiary care centers. Population 2: A large, national, North American multicenter surgical registry with 30-day follow-up. Bivariate and multivariate analyses were used to examine albumin's value for prognostication at 1, 3, and 12 months after surgery. RESULTS: In Population 1, 1,090 patients were identified, with 1-, 3-, and 12-month mortality of 95 (8.8%), 305 (28.9%), and 639 (62.0%), respectively. In Population 2, 1,675 patients were identified, with 1-month postoperative mortality of 148 (8.8%). In both populations, hypoalbuminemia was an independent prognostic factor for mortality at 30 days. In the institutional set, hypoalbuminemia was additionally associated with 3- and 12-month mortality. CONCLUSIONS: Hypoalbuminemia is a marker for mortality in extremity metastatic bone disease. Further consideration of this marker could improve existing prognostication models in this population. LEVEL OF EVIDENCE: III.
Subjects
Bone Diseases , Hypoalbuminemia , Albumins , Biomarkers , Extremities/surgery , Humans , Postoperative Complications/epidemiology , Retrospective Studies , Risk Factors , Treatment Outcome
ABSTRACT
BACKGROUND: The prediction of survival is valuable to optimize treatment of metastatic long-bone disease. The Skeletal Oncology Research Group (SORG) machine-learning (ML) algorithm has been previously developed and internally validated. The purpose of this study was to determine if the SORG ML algorithm accurately predicts 90-day and 1-year survival in an external metastatic long-bone disease patient cohort. METHODS: A retrospective review of 264 patients who underwent surgery for long-bone metastases between 2003 and 2019 was performed. Variables used in the stochastic gradient boosting SORG algorithm were age, sex, primary tumor type, visceral/brain metastases, systemic therapy, and 10 preoperative laboratory values. Model performance was assessed by discrimination, calibration, and overall performance. RESULTS: The SORG ML algorithms retained good discriminative ability (area under the curve [AUC]: 0.83; 95% confidence interval [CI]: 0.76-0.88 for 90-day mortality and AUC: 0.84; 95% CI: 0.79-0.88 for 1-year mortality) and performed well on calibration, overall performance, and decision curve analysis. CONCLUSION: The previously developed ML algorithms demonstrated good performance in the current study, thereby providing external validation. The models were incorporated into an accessible application (https://sorg-apps.shinyapps.io/extremitymetssurvival/) that clinicians may freely use to help predict survival for individual patients and to inform decision-making discussions before operative management of long-bone metastatic lesions.
Subjects
Bone Neoplasms/mortality , Bone Neoplasms/secondary , Machine Learning , Aged , Algorithms , Bone Neoplasms/surgery , Extremities , Female , Humans , Male , Middle Aged , Retrospective Studies
ABSTRACT
BACKGROUND: Postoperative delirium in patients aged 60 years or older with hip fractures adversely affects clinical and functional outcomes. The economic cost of delirium is estimated to be as high as USD 25,000 per patient, with a total budgetary impact between USD 6.6 billion and USD 82.4 billion annually in the United States alone. Forty percent of delirium episodes are preventable, and accurate risk stratification can decrease the incidence and improve clinical outcomes in patients. A previously developed clinical prediction model (the SORG Orthopaedic Research Group hip fracture delirium machine-learning algorithm) is highly accurate on internal validation (in 28,207 patients with hip fractures aged 60 years or older in a US cohort) in identifying at-risk patients, and it can facilitate the best use of preventive interventions; however, it has not been tested in an independent population. For an algorithm to be useful in real life, it must be valid externally, meaning that it must perform well in a patient cohort different from the cohort used to "train" it. Although there are many promising machine-learning prediction models and many promising delirium models, only a few have been externally validated, and even fewer have been validated internationally. QUESTION/PURPOSE: Does the SORG hip fracture delirium algorithm, initially trained on a database from the United States, perform well on external validation in patients aged 60 years or older in Australia and New Zealand? METHODS: We previously developed a model in 2021 for assessing risk of delirium in hip fracture patients using records of 28,207 patients obtained from the American College of Surgeons National Surgical Quality Improvement Program. 
Variables in the original model included age, American Society of Anesthesiologists (ASA) class, functional status (independent or partially or totally dependent for any activities of daily living), preoperative dementia, preoperative delirium, and preoperative need for a mobility aid. To assess whether this model could be applied elsewhere, we used records from an international hip fracture registry. Between June 2017 and December 2018, 6672 patients older than 60 years of age in Australia and New Zealand were treated surgically for a femoral neck, intertrochanteric hip, or subtrochanteric hip fracture and entered into the Australian & New Zealand Hip Fracture Registry. Patients were excluded if they had a pathological hip fracture or septic shock. Of all patients, 6% (402 of 6672) did not meet the inclusion criteria, leaving 94% (6270 of 6672) of patients available for inclusion in this retrospective analysis. Seventy-one percent (4249 of 5986) of patients were aged 80 years or older, after accounting for 5% (284 of 6270) of missing values; 68% (4292 of 6266) were female, after accounting for 0.06% (4 of 6270) of missing values; and 83% (4690 of 5661) of patients were classified as ASA III/IV, after accounting for 10% (609 of 6270) of missing values. Missing data were imputed using the missForest methodology. In total, 39% (2467 of 6270) of patients developed postoperative delirium. The performance of the SORG hip fracture delirium algorithm on the validation cohort was assessed by discrimination, calibration, Brier score, and a decision curve analysis. Discrimination, measured by the area under the receiver operating characteristic curve (c-statistic), reflects the model's ability to distinguish patients who experienced the outcome from those who did not and ranges from 0.5 to 1.0, with 1.0 indicating the highest discrimination score and 0.5 the lowest. 
Calibration plots show the predicted versus the observed probabilities; a perfect plot has an intercept of 0 and a slope of 1. The Brier score is a composite of discrimination and calibration, with 0 indicating perfect prediction and 1 the poorest. RESULTS: The SORG hip fracture algorithm, when applied to an external patient cohort, distinguished between patients at low risk and patients at moderate to high risk of developing postoperative delirium. The SORG hip fracture algorithm performed with a c-statistic of 0.74 (95% confidence interval 0.73 to 0.76). The calibration plot showed high accuracy in the lower predicted probabilities (intercept -0.28, slope 0.52). The Brier score was 0.22 (the null model Brier score was 0.24). The decision curve analysis showed that the model can be beneficial compared with no model or compared with characterizing all patients as at risk for developing delirium. CONCLUSION: Algorithms developed with machine learning are a potential tool for refining treatment of at-risk patients. If high-risk patients can be reliably identified, resources can be appropriately directed toward their care. Although the current iteration of SORG should not be relied on for patient care, it suggests potential utility in assessing risk. Further assessment in different populations, made easier by international collaborations and standardization of registries, would be useful in the development of universally valid prediction models. The model can be freely accessed at: https://sorg-apps.shinyapps.io/hipfxdelirium/ . LEVEL OF EVIDENCE: Level III, therapeutic study.
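The calibration intercept and slope reported above (-0.28 and 0.52) are commonly obtained by regressing the observed outcome on the logit of the predicted probability. The sketch below fits both parameters jointly by Newton-Raphson in plain Python. This is an assumption about the general technique, not the authors' code, and some studies instead estimate the intercept with the slope fixed at 1. Function names and the grouped toy data are hypothetical.

```python
import math


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def calibration_intercept_slope(outcomes, predicted, iterations=25):
    """Fit logit(P(y = 1)) = a + b * logit(predicted) by Newton-Raphson."""
    x = [logit(p) for p in predicted]
    a, b = 0.0, 1.0  # start at perfect calibration
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, outcomes):
            mu = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            w = mu * (1.0 - mu)
            g0 += yi - mu            # gradient with respect to the intercept
            g1 += (yi - mu) * xi     # gradient with respect to the slope
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


# Hypothetical toy data: predictions of 10% and 90% risk, while the observed
# event rates in those groups are 20% and 80% (predictions too extreme).
toy_predicted = [0.1] * 10 + [0.9] * 10
toy_outcomes = [1] * 2 + [0] * 8 + [1] * 8 + [0] * 2
toy_intercept, toy_slope = calibration_intercept_slope(toy_outcomes, toy_predicted)
```

In the toy data the fitted slope is about 0.63 and the intercept is 0. A slope below 1 is the signature of predictions that are too extreme, the same pattern the reported slope of 0.52 indicates.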
Subjects
Delirium; Hip Fractures; Orthopedics; Activities of Daily Living; Algorithms; Australia; Delirium/diagnosis; Delirium/epidemiology; Delirium/etiology; Female; Hip Fractures/surgery; Humans; Male; Middle Aged; Models, Statistical; Prognosis; Retrospective Studies
ABSTRACT
BACKGROUND: Patient-reported outcome measures (PROMs), including the Patient-reported Outcomes Measurement Information System (PROMIS), are increasingly used to measure healthcare value. The minimum clinically important difference (MCID) is a metric that helps clinicians determine whether a statistically detectable improvement in a PROM after surgical care is likely to be large enough to be important to a patient or to justify an intervention that carries risk and cost. There are two major categories of MCID calculation methods, anchor-based and distribution-based. This variability, coupled with heterogeneous surgical cohorts used for existing MCID values, limits their application to clinical care. QUESTIONS/PURPOSES: In our study, we sought (1) to determine MCID thresholds and attainment percentages for PROMIS after common orthopaedic procedures using distribution-based methods, (2) to use anchor-based MCID values from published studies as a comparison, and (3) to compare MCID attainment percentages using PROMIS scores to other validated outcomes tools such as the Hip Disability and Osteoarthritis Outcome Score (HOOS) and Knee Disability and Osteoarthritis Outcome Score (KOOS). METHODS: This was a retrospective study at two academic medical centers and three community hospitals. The inclusion criteria for this study were patients who were age 18 years or older and who underwent elective THA for osteoarthritis, TKA for osteoarthritis, one-level posterior lumbar fusion for lumbar spinal stenosis or spondylolisthesis, anatomic total shoulder arthroplasty or reverse total shoulder arthroplasty for glenohumeral arthritis or rotator cuff arthropathy, arthroscopic anterior cruciate ligament reconstruction, arthroscopic partial meniscectomy, or arthroscopic rotator cuff repair. This yielded 14,003 patients. 
Patients undergoing revision operations or surgery for nondegenerative pathologies and patients without preoperative PROMs assessments were excluded, leaving 9925 patients who completed preoperative PROMIS assessments and 9478 who completed other preoperative validated outcomes tools (HOOS, KOOS, numerical rating scale for leg pain, numerical rating scale for back pain, and QuickDASH). Approximately 66% (6529 of 9925) of patients had postoperative PROMIS scores (Physical Function, Mental Health, Pain Intensity, Pain Interference, and Upper Extremity) and were included for analysis. PROMIS scores are population normalized with a mean score of 50 ± 10, with most scores falling between 30 and 70. Approximately 74% (7007 of 9478) of patients had postoperative historical assessment scores and were included for analysis. The proportion who reached the MCID was calculated for each procedure cohort at 6 months of follow-up using distribution-based MCID methods, which included a fraction of the SD (1/2 or 1/3 SD) and minimum detectable change (MDC) using statistical significance (such as the MDC 90 from p < 0.1). Previously published anchor-based MCID thresholds from similar procedure cohorts and analogous PROMs were used to calculate the proportion reaching MCID. RESULTS: Within a given distribution-based method, MCID thresholds for PROMIS assessments were similar across multiple procedures. The MCID threshold ranged between 3.4 and 4.5 points across all procedures using the 1/2 SD method. Except for meniscectomy (3.5 points), the anchor-based PROMIS MCID thresholds (range 4.5 to 8.1 points) were higher than the SD distribution-based MCID values (2.3 to 4.5 points). The difference in MCID thresholds based on the calculation method led to a similar trend in MCID attainment. Using THA as an example, MCID attainment using PROMIS was achieved by 76% of patients using an anchor-based threshold of 7.9 points.
However, 82% of THA patients attained MCID using the MDC 95 method (6.1 points), and 88% reached MCID using the 1/2 SD method (3.9 points). Using the HOOS metric (scaled from 0 to 100), 86% of THA patients reached the anchor-based MCID threshold (17.5 points). However, 91% of THA patients attained the MCID using the MDC 90 method (12.5 points), and 93% reached MCID using the 1/2 SD method (8.4 points). In general, the proportion of patients reaching MCID was lower for PROMIS than for other validated outcomes tools; for example, with the 1/2 SD method, 72% of patients who underwent arthroscopic partial meniscectomy reached the MCID on PROMIS Physical Function compared with 86% on KOOS. CONCLUSION: MCID calculations can provide clinical correlation for PROM scores interpretation. The PROMIS form is increasingly used because of its generalizability across diagnoses. However, we found lower proportions of MCID attainment using PROMIS scores compared with historical PROMs. By using historical proportions of attainment on common orthopaedic procedures and a spectrum of MCID calculation techniques, the PROMIS MCID benchmarks are realizable for common orthopaedic procedures. For clinical practices that routinely collect PROMIS scores in the clinical setting, these results can be used by individual surgeons to evaluate personal practice trends and by healthcare systems to quantify whether clinical care initiatives result in meaningful differences. Furthermore, these MCID thresholds can be used by researchers conducting retrospective outcomes research with PROMIS. LEVEL OF EVIDENCE: Level III, therapeutic study.
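The fraction-of-SD thresholds and the attainment percentages described above can be illustrated with a short sketch. It rests on several assumptions not stated in the abstract: the SD is taken from baseline scores, higher scores mean better function, and the T-scores below are invented for illustration rather than drawn from the study.

```python
from statistics import stdev


def distribution_mcid(baseline_scores, fraction=0.5):
    """Distribution-based MCID as a fraction of the baseline standard deviation."""
    return fraction * stdev(baseline_scores)


def attainment(pre, post, threshold):
    """Share of patients whose improvement (post - pre) meets the threshold."""
    changes = [after - before for before, after in zip(pre, post)]
    return sum(change >= threshold for change in changes) / len(changes)


# Hypothetical PROMIS Physical Function T-scores (not study data).
pre = [34.0, 38.0, 41.0, 36.0, 44.0, 39.0]
post = [45.0, 40.0, 47.0, 44.0, 45.0, 48.0]

half_sd = distribution_mcid(pre, 0.5)
third_sd = distribution_mcid(pre, 1 / 3)

print(f"1/2 SD threshold: {half_sd:.2f}, attainment: {attainment(pre, post, half_sd):.0%}")
print(f"1/3 SD threshold: {third_sd:.2f}, attainment: {attainment(pre, post, third_sd):.0%}")
print(f"Larger threshold of 5 points, attainment: {attainment(pre, post, 5.0):.0%}")
```

The last line mirrors the trend reported in the results: a larger threshold, such as an anchor-based one, leaves fewer patients classified as reaching the MCID.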
Subjects
Osteoarthritis; Patient Reported Outcome Measures; Adolescent; Arthroscopy; Back Pain; Humans; Minimal Clinically Important Difference; Retrospective Studies; Treatment Outcome
ABSTRACT
BACKGROUND: Incidental durotomy is an intraoperative complication in spine surgery that can lead to postoperative complications, increased length of stay, and higher healthcare costs. Natural language processing (NLP) is an artificial intelligence method that assists in understanding free-text notes and may be useful in the automated surveillance of adverse events in orthopaedic surgery. A previously developed NLP algorithm is highly accurate in the detection of incidental durotomy on internal validation and external validation in an independent cohort from the same country. External validation in a cohort with linguistic differences, referred to as geographical validation, is required to assess the transportability of the developed algorithm. Ideally, the performance of a prediction model, here the NLP algorithm, is constant across geographic regions to ensure reproducibility and model validity. QUESTION/PURPOSE: Can we geographically validate an NLP algorithm for the automated detection of incidental durotomy across three independent cohorts from two continents? METHODS: Patients 18 years or older undergoing a primary procedure of (thoraco)lumbar spine surgery were included. In Massachusetts, between January 2000 and June 2018, 1000 patients were included from two academic and three community medical centers. In Maryland, between July 2016 and November 2018, 1279 patients were included from one academic center, and in Australia, between January 2010 and December 2019, 944 patients were included from one academic center. The authors retrospectively studied the free-text operative notes of included patients for the primary outcome, which was defined as intraoperative durotomy. Incidental durotomy occurred in 9% (93 of 1000), 8% (108 of 1279), and 6% (58 of 944) of the patients, respectively, in the Massachusetts, Maryland, and Australia cohorts. No missing reports were observed.
Three datasets (Massachusetts, Australian, and combined Massachusetts and Australian) were divided into training and holdout test sets in an 80:20 ratio. An extreme gradient boosting (an efficient and flexible tree-based algorithm) NLP algorithm was individually trained on each training set, and the performance of the three NLP algorithms (respectively American, Australian, and combined) was assessed by discrimination via area under the receiver operating characteristic curves (AUC-ROC; this measures the model's ability to distinguish patients who obtained the outcomes from those who did not), calibration metrics (which plot the predicted and the observed probabilities) and Brier score (a composite of discrimination and calibration). In addition, the sensitivity (true positives, recall), specificity (true negatives), positive predictive value (also known as precision), negative predictive value, F1-score (composite of precision and recall), positive likelihood ratio, and negative likelihood ratio were calculated. RESULTS: The combined NLP algorithm (the combined Massachusetts and Australian data) achieved excellent performance on independent testing data from Australia (AUC-ROC 0.97 [95% confidence interval 0.87 to 0.99]), Massachusetts (AUC-ROC 0.99 [95% CI 0.80 to 0.99]) and Maryland (AUC-ROC 0.95 [95% CI 0.93 to 0.97]). The NLP developed based on the Massachusetts cohort had excellent performance in the Maryland cohort (AUC-ROC 0.97 [95% CI 0.95 to 0.99]) but worse performance in the Australian cohort (AUC-ROC 0.74 [95% CI 0.70 to 0.77]). CONCLUSION: We demonstrated the clinical utility and reproducibility of an NLP algorithm with combined datasets retaining excellent performance in individual countries relative to algorithms developed in the same country alone for detection of incidental durotomy. 
Further multi-institutional, international collaborations can facilitate the creation of universal NLP algorithms that improve the quality and safety of orthopaedic surgery globally. The combined NLP algorithm has been incorporated into a freely accessible web application that can be found at https://sorg-apps.shinyapps.io/nlp_incidental_durotomy/ . Clinicians and researchers can use the tool to help incorporate the model in evaluating spine registries or quality and safety departments to automate detection of incidental durotomy and optimize prevention efforts. LEVEL OF EVIDENCE: Level III, diagnostic study.
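The 80:20 division into training and holdout sets can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the abstract does not say whether the split was stratified by outcome, so the stratification, the function name, and the fixed seed are all assumptions. Only the Massachusetts counts (93 durotomies among 1000 notes) come from the text.

```python
import random


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), sampling test_fraction within each class."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Synthetic label vector with the Massachusetts counts: 93 durotomies in 1000 notes.
labels = [1] * 93 + [0] * 907
train_idx, test_idx = stratified_split(labels)

print(len(train_idx), len(test_idx))
print(sum(labels[i] for i in test_idx), "durotomies in the holdout set")
```

Stratifying keeps the rare outcome (under 10% of notes) represented in the holdout set at roughly the same rate as in the training set, which matters when the test set holds only a few hundred notes.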
Subjects
Artificial Intelligence; Natural Language Processing; Algorithms; Australia; Humans; Reproducibility of Results; Retrospective Studies
ABSTRACT
BACKGROUND: The Skeletal Oncology Research Group machine-learning algorithms (SORG-MLAs) estimate 90-day and 1-year survival in patients with long-bone metastases undergoing surgical treatment and have demonstrated good discriminatory ability on internal validation. However, the performance of a prediction model could potentially vary by race or region, and the SORG-MLA must be externally validated in an Asian cohort. Furthermore, the authors of the original developmental study did not consider the Eastern Cooperative Oncology Group (ECOG) performance status, a survival prognosticator repeatedly validated in other studies, in their algorithms because of missing data. QUESTIONS/PURPOSES: (1) Is the SORG-MLA generalizable to Taiwanese patients for predicting 90-day and 1-year mortality? (2) Is the ECOG score an independent factor associated with 90-day and 1-year mortality while controlling for SORG-MLA predictions? METHODS: All 356 patients who underwent surgery for long-bone metastases between 2014 and 2019 at one tertiary care center in Taiwan were included. Ninety-eight percent (349 of 356) of patients were of Han Chinese descent. The median (range) patient age was 61 years (25 to 95), 52% (184 of 356) were women, and the median BMI was 23 kg/m² (13 to 39 kg/m²). The most common primary tumors were lung cancer (33% [116 of 356]) and breast cancer (16% [58 of 356]). Fifty-five percent (195 of 356) of patients presented with a complete pathologic fracture. Intramedullary nailing was the most commonly performed type of surgery (59% [210 of 356]), followed by plate screw fixation (23% [81 of 356]) and endoprosthetic reconstruction (18% [65 of 356]). Six patients were lost to follow-up within 90 days; 30 were lost to follow-up within 1 year. Eighty-five percent (301 of 356) of patients were followed until death or for at least 2 years. Survival was 82% (287 of 350) at 90 days and 49% (159 of 326) at 1 year.
The model's performance metrics included discrimination (concordance index [c-index]), calibration (intercept and slope), and Brier score. In general, a c-index of 0.5 indicates random guess and a c-index of 0.8 denotes excellent discrimination. Calibration refers to the agreement between the predicted outcomes and the actual outcomes, with a perfect calibration having an intercept of 0 and a slope of 1. The Brier score of a prediction model must be compared with and ideally should be smaller than the score of the null model. A decision curve analysis was then performed for the 90-day and 1-year prediction models to evaluate their net benefit across a range of different threshold probabilities. A multivariate logistic regression analysis was used to evaluate whether the ECOG score was an independent prognosticator while controlling for the SORG-MLA's predictions. We did not perform retraining/recalibration because we were not trying to update the SORG-MLA algorithm in this study. RESULTS: The SORG-MLA had good discriminatory ability at both timepoints, with a c-index of 0.80 (95% confidence interval 0.74 to 0.86) for 90-day survival prediction and a c-index of 0.84 (95% CI 0.80 to 0.89) for 1-year survival prediction. However, the calibration analysis showed that the SORG-MLAs tended to underestimate Taiwanese patients' survival (90-day survival prediction: calibration intercept 0.78 [95% CI 0.46 to 1.10], calibration slope 0.74 [95% CI 0.53 to 0.96]; 1-year survival prediction: calibration intercept 0.75 [95% CI 0.49 to 1.00], calibration slope 1.22 [95% CI 0.95 to 1.49]). The Brier score of the 90-day and 1-year SORG-MLA prediction models was lower than their respective null model (0.12 versus 0.16 for 90-day prediction; 0.16 versus 0.25 for 1-year prediction), indicating good overall performance of SORG-MLAs at these two timepoints. 
Decision curve analysis showed SORG-MLAs provided net benefits when threshold probabilities ranged from 0.40 to 0.95 for 90-day survival prediction and from 0.15 to 1.0 for 1-year prediction. The ECOG score was an independent factor associated with 90-day mortality (odds ratio 1.94 [95% CI 1.01 to 3.73]) but not 1-year mortality (OR 1.07 [95% CI 0.53 to 2.17]) after controlling for SORG-MLA predictions for 90-day and 1-year survival, respectively. CONCLUSION: SORG-MLAs retained good discriminatory ability in Taiwanese patients with long-bone metastases, although their actual survival time was slightly underestimated. More international validation and incremental value studies that address factors such as the ECOG score are warranted to refine the algorithms, which can be freely accessed online at https://sorg-apps.shinyapps.io/extremitymetssurvival/. LEVEL OF EVIDENCE: Level III, therapeutic study.
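The net benefit plotted in a decision curve analysis can be sketched with the standard formula, net benefit = TP/n - (FP/n) × pt/(1 - pt), where pt is the threshold probability. The snippet below is an illustrative sketch with hypothetical function names and invented outcomes and predictions; it is not the SORG-MLA and does not use study data.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of acting on patients whose predicted risk is >= threshold."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1 - threshold)


def net_benefit_treat_all(y_true, threshold):
    """Reference strategy: act on every patient regardless of predicted risk."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Hypothetical outcomes (1 = event) and predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
y_prob = [0.9, 0.8, 0.6, 0.4, 0.3, 0.2, 0.1, 0.7, 0.35, 0.05]

for pt in (0.2, 0.5, 0.8):
    print(pt, round(net_benefit(y_true, y_prob, pt), 3),
          round(net_benefit_treat_all(y_true, pt), 3))
```

A model "provides net benefit" at a given threshold when its value exceeds both the treat-all reference and the treat-none reference, which is always 0.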
Subjects
Bone Neoplasms/mortality; Bone Neoplasms/secondary; Machine Learning; Adult; Aged; Aged, 80 and over; Bone Neoplasms/surgery; Extremities/pathology; Extremities/surgery; Female; Humans; Male; Middle Aged; Postoperative Period; Predictive Value of Tests; Prognosis; Taiwan
ABSTRACT
Artificial Intelligence is gaining traction in medicine for its ease of use and advancements in technology. This study evaluates the current literature on the use of artificial intelligence in adult spinal deformity.
Subjects
Artificial Intelligence; Machine Learning; Technology
ABSTRACT
BACKGROUND: Predicting survival in myxoid liposarcoma (MLS) patients is very challenging given its propensity to metastasize and the controversial role of adjuvant therapy. The purpose of this study was to develop a machine-learning algorithm for the prediction of survival at five years for patients with MLS and to externally validate it using our institutional cohort. METHODS: Two databases, the Surveillance, Epidemiology, and End Results (SEER) program database and an institutional database, were used in this study. Five machine learning models were created based on the SEER database and performance was rated using the TRIPOD criteria. The model that performed best on the SEER data was again tested on our institutional database. RESULTS: The elastic-net penalized logistic regression model was the best according to our performance indicators. This model had an area under the curve (AUC) of 0.85 on the SEER testing data and an AUC of 0.76 when tested against the institutional database. An application to use this calculator is available at https://sorg-apps.shinyapps.io/myxoid_liposarcoma/. CONCLUSION: MLS is a soft-tissue sarcoma with adjunct treatment options that are, in part, decided by prognostic survival. We developed the first machine-learning predictive algorithm specifically for MLS using the SEER registry that retained performance during external validation with institutional data.
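The AUC values quoted above can be read as the probability that a randomly chosen patient with the event receives a higher predicted score than a randomly chosen patient without it. The sketch below computes the AUC directly from that definition; the function name and the six example patients are hypothetical and unrelated to the SEER or institutional data.

```python
def auc(y_true, y_score):
    """AUC as the share of (event, non-event) pairs ranked correctly.

    Tied scores count as half a concordant pair.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    concordant = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                concordant += 1.0
            elif p == q:
                concordant += 0.5
    return concordant / (len(pos) * len(neg))


# Hypothetical outcomes (1 = event) and model scores.
y_true = [1, 1, 0, 1, 0, 0]
y_score = [0.9, 0.7, 0.6, 0.4, 0.3, 0.2]
print(round(auc(y_true, y_score), 3))
```

The pairwise loop is quadratic in cohort size, so for large registries a rank-based implementation would be preferable; the definition is the same.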
Subjects
Algorithms; Liposarcoma, Myxoid/mortality; Machine Learning; Adult; Cohort Studies; Female; Humans; Logistic Models; Male; Middle Aged; Registries; Reproducibility of Results; SEER Program; Survival Analysis; United States/epidemiology
ABSTRACT
PURPOSE: To (1) determine the diagnostic efficacy of artificial intelligence (AI) methods for detecting anterior cruciate ligament (ACL) and meniscus tears and to (2) compare the efficacy to human clinical experts. METHODS: PubMed, OVID/Medline, and Cochrane libraries were queried in November 2019 for research articles pertaining to AI use for detection of ACL and meniscus tears. Information regarding AI model, prediction accuracy/area under the curve (AUC), sample sizes of testing/training sets, and imaging modalities were recorded. RESULTS: A total of 11 AI studies were identified: 5 investigated ACL tears, 5 investigated meniscal tears, and 1 investigated both. The AUC of AI models for detecting ACL tears ranged from 0.895 to 0.980, and the prediction accuracy ranged from 86.7% to 100%. Of these studies, 3 compared AI models to clinical experts. Two found no significant differences in diagnostic capability, whereas one found that radiologists had a significantly greater sensitivity for detecting ACL tears (P = .002) and statistically similar specificity and accuracy. Of the 5 studies investigating the meniscus, the AUC for AI models ranged from 0.847 to 0.910 and prediction accuracy ranged from 75.0% to 90.0%. Of these studies, 2 compared AI models with clinical experts. One found no significant differences in diagnostic accuracy, whereas one found that the AI model had a significantly lower specificity (P = .003) and accuracy (P = .015) than radiologists. Two studies reported that the addition of AI models significantly increased the diagnostic performance of clinicians compared to their efforts without these models. CONCLUSIONS: AI prediction capabilities were excellent and may enhance the diagnosis of ACL and meniscal pathology; however, AI did not outperform clinical experts. 
CLINICAL RELEVANCE: AI models promise to improve diagnosing certain pathologies as well as or better than human experts, are excellent for detecting ACL and meniscus tears, and may enhance the diagnostic capabilities of human experts; however, when compared with these experts, they may not offer any significant advantage.
Subjects
Anterior Cruciate Ligament Injuries/diagnosis; Anterior Cruciate Ligament/pathology; Artificial Intelligence; Tibial Meniscus Injuries/diagnosis; Humans; Models, Theoretical
ABSTRACT
Background and purpose - Advancements in software and hardware have enabled the rise of clinical prediction models based on machine learning (ML) in orthopedic surgery. Given their growing popularity and their likely implementation in clinical practice, we evaluated which outcomes these new models have focused on and what methodologies are being employed. Material and methods - We performed a systematic search in PubMed, Embase, and Cochrane Library for studies published up to June 18, 2020. Studies reporting on non-ML prediction models or non-orthopedic outcomes were excluded. After screening 7,138 studies, 59 studies reporting on 77 prediction models were included. We extracted data regarding outcome, study design, and reported performance metrics. Results - Of the 77 identified ML prediction models, the most commonly reported outcome domain was medical management (17/77). Spinal surgery was the most commonly involved orthopedic subspecialty (28/77). The most frequently employed algorithm was neural networks (42/77). Median size of datasets was 5,507 (IQR 635-26,364). The median area under the curve (AUC) was 0.80 (IQR 0.73-0.86). Calibration was reported for 26 of the models and 14 provided decision-curve analysis. Interpretation - ML prediction models have been developed for a wide variety of topics in orthopedics. Topics regarding medical management were the most commonly studied. Heterogeneity between studies is based on study size, algorithm, and time-point of outcome. Calibration and decision-curve analysis were generally poorly reported.
Subjects
Clinical Decision-Making; Machine Learning; Neural Networks, Computer; Orthopedic Procedures; Predictive Value of Tests; Humans
ABSTRACT
Background and purpose - External validation of machine learning (ML) prediction models is an essential step before clinical application. We assessed the proportion, performance, and transparent reporting of externally validated ML prediction models in orthopedic surgery, using the Transparent Reporting for Individual Prognosis or Diagnosis (TRIPOD) guidelines. Material and methods - We performed a systematic search using synonyms for every orthopedic specialty, ML, and external validation. The proportion was determined by using 59 ML prediction models with only internal validation in orthopedic surgical outcome published up until June 18, 2020, previously identified by our group. Model performance was evaluated using discrimination, calibration, and decision-curve analysis. The TRIPOD guidelines assessed transparent reporting. Results - We included 18 studies externally validating 10 different ML prediction models of the 59 available ML models after screening 4,682 studies. All external validations identified in this review retained good discrimination. Other key performance measures were provided in only 3 studies, rendering overall performance evaluation difficult. The overall median TRIPOD completeness was 61% (IQR 43-89), with 6 items being reported in less than 4/18 of the studies. Interpretation - Most current predictive ML models are not externally validated. The 18 available external validation studies were characterized by incomplete reporting of performance measures, limiting a transparent examination of model performance. Further prospective studies are needed to validate or refute the myriad of predictive ML models in orthopedics while adhering to existing guidelines. This ensures clinicians can take full advantage of validated and clinically implementable ML decision tools.
Subjects
Decision Support Techniques; Machine Learning/standards; Models, Statistical; Orthopedic Procedures; Humans; Treatment Outcome; Validation Studies as Topic
ABSTRACT
BACKGROUND: The widespread use of electronic patient-generated health data has led to unprecedented opportunities for automated extraction of clinical features from free-text medical notes. However, processing this rich resource of data for clinical and research purposes depends on labor-intensive and potentially error-prone manual review. The aim of this study was to develop a natural language processing (NLP) algorithm for binary classification (single metastasis versus two or more metastases) in bone scintigraphy reports of patients undergoing surgery for bone metastases. MATERIAL AND METHODS: Bone scintigraphy reports of patients undergoing surgery for bone metastases were each labeled by three independent reviewers using a binary classification (single metastasis versus two or more metastases) to establish a ground truth. A stratified 80:20 split was used to develop and test an extreme-gradient boosting supervised machine learning NLP algorithm. RESULTS: A total of 704 free-text bone scintigraphy reports from 704 patients were included in this study and 617 (88%) had multiple bone metastases. In the independent test set (n = 141) not used for model development, the NLP algorithm achieved an AUC-ROC of 0.97 (95% confidence interval [CI], 0.92-0.99) for classification of multiple bone metastases and an AUC-PRC of 0.99 (95% CI, 0.99-0.99). At a threshold of 0.90, the NLP algorithm correctly identified multiple bone metastases in 117 of the 124 patients who had multiple bone metastases in the testing cohort (sensitivity 0.94) and yielded 3 false positives (specificity 0.82). At the same threshold, the NLP algorithm had a positive predictive value of 0.97 and F1-score of 0.96. CONCLUSIONS: NLP has the potential to automate clinical data extraction from free-text radiology notes in orthopedics, thereby optimizing the speed, accuracy, and consistency of clinical chart review.
Pending external validation, the NLP algorithm developed in this study may be implemented as a means to aid researchers in tackling large amounts of data.
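The threshold-level metrics reported above follow directly from the confusion-matrix counts given in the abstract. The sketch below recomputes them; the function name is hypothetical, and the false-negative and true-negative counts are derived from the stated totals (141 test reports, 124 with multiple metastases, 117 correctly identified, 3 false positives).

```python
def classification_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, positive predictive value, and F1-score."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "ppv": ppv, "f1": f1}


# Counts taken or derived from the abstract.
tp, fp = 117, 3
fn = 124 - tp            # patients with multiple metastases that were missed
tn = 141 - 124 - fp      # single-metastasis reports classified correctly

metrics = classification_metrics(tp, fp, fn, tn)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because only 17 of the 141 test reports describe a single metastasis, the specificity rests on a small denominator, which is worth keeping in mind when reading the 0.82 figure.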
Subjects
Algorithms; Natural Language Processing; Cohort Studies; Humans; Predictive Value of Tests; Radionuclide Imaging
ABSTRACT
BACKGROUND: A preoperative estimation of survival is critical for deciding on the operative management of metastatic bone disease of the extremities. Several tools have been developed for this purpose, but there is room for improvement. Machine learning is an increasingly popular and flexible method of prediction model building based on a data set. It raises some skepticism, however, because of the complex structure of these models. QUESTIONS/PURPOSES: The purposes of this study were (1) to develop machine learning algorithms for 90-day and 1-year survival in patients who received surgical treatment for a bone metastasis of the extremity, and (2) to use these algorithms to identify those clinical factors (demographic, treatment related, or surgical) that are most closely associated with survival after surgery in these patients. METHODS: All 1090 patients who underwent surgical treatment for a long-bone metastasis at two institutions between 1999 and 2017 were included in this retrospective study. The median age of the patients in the cohort was 63 years (interquartile range [IQR] 54 to 72 years), 56% of patients (610 of 1090) were female, and the median BMI was 27 kg/m² (IQR 23 to 30 kg/m²). The most commonly affected location was the femur (70%), followed by the humerus (22%). The most common primary tumors were breast (24%) and lung (23%). Intramedullary nailing was the most commonly performed type of surgery (58%), followed by endoprosthetic reconstruction (22%), and plate screw fixation (14%). Missing data were imputed using the missForest method. Features were selected by random forest algorithms, and five different models were developed on the training set (80% of the data): stochastic gradient boosting, random forest, support vector machine, neural network, and penalized logistic regression. These models were chosen as a result of their classification capability in binary datasets.
Model performance was assessed on both the training set and the validation set (20% of the data) by discrimination, calibration, and overall performance. RESULTS: We found no differences among the five models for discrimination, with an area under the curve ranging from 0.86 to 0.87. All models were well calibrated, with intercepts ranging from -0.03 to 0.08 and slopes ranging from 1.03 to 1.12. Brier scores ranged from 0.13 to 0.14. The stochastic gradient boosting model was chosen to be deployed as a freely available web-based application, and explanations on both a global and an individual level were provided. For 90-day survival, the three most important factors associated with poorer survivorship were lower albumin level, higher neutrophil-to-lymphocyte ratio, and rapid growth primary tumor. For 1-year survival, the three most important factors associated with poorer survivorship were lower albumin level, rapid growth primary tumor, and lower hemoglobin level. CONCLUSIONS: Although the final models must be externally validated, the algorithms showed good performance on internal validation. The final models have been incorporated into a freely accessible web application that can be found at https://sorg-apps.shinyapps.io/extremitymetssurvival/. Pending external validation, clinicians may use this tool to predict survival for their individual patients to help in shared treatment decision making. LEVEL OF EVIDENCE: Level III, therapeutic study.
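A calibration intercept and slope such as those reported above are commonly obtained by regressing the observed outcome on the logit of the predicted probability. The sketch below fits both parameters jointly with a small Newton-Raphson logistic regression; this is a simplification (the intercept is often estimated separately with the slope fixed at 1), the function names are hypothetical, and the toy cohort is constructed so that predictions match observed event rates exactly.

```python
import math


def logit(p):
    return math.log(p / (1 - p))


def calibration_intercept_slope(y_true, y_prob, iterations=25):
    """Fit y ~ intercept + slope * logit(p) by Newton-Raphson logistic regression."""
    x = [logit(p) for p in y_prob]
    a, b = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y_true):
            mu = 1 / (1 + math.exp(-(a + b * xi)))
            w = mu * (1 - mu)
            g0 += yi - mu            # gradient with respect to the intercept
            g1 += (yi - mu) * xi     # gradient with respect to the slope
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


# Toy cohort: for each predicted risk k/10, exactly k of 10 patients have the event.
y_prob, y_true = [], []
for k in range(1, 10):
    y_prob += [k / 10] * 10
    y_true += [1] * k + [0] * (10 - k)

intercept, slope = calibration_intercept_slope(y_true, y_prob)
print(round(intercept, 3), round(slope, 3))
```

With perfectly calibrated predictions the fit recovers an intercept of 0 and a slope of 1; a slope below 1 would suggest predictions that are too extreme, and a slope above 1 predictions that are too conservative.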
Subjects
Bone Neoplasms/surgery; Decision Support Techniques; Machine Learning; Orthopedic Procedures; Aged; Bone Neoplasms/diagnostic imaging; Bone Neoplasms/mortality; Bone Neoplasms/secondary; Boston; Clinical Decision-Making; Female; Humans; Male; Middle Aged; Orthopedic Procedures/adverse effects; Orthopedic Procedures/mortality; Patient Selection; Predictive Value of Tests; Reproducibility of Results; Retrospective Studies; Risk Assessment; Risk Factors; Time Factors; Treatment Outcome
ABSTRACT
BACKGROUND: Machine learning (ML) is a subdomain of artificial intelligence that enables computers to abstract patterns from data without explicit programming. A myriad of impactful ML applications already exists in orthopaedics ranging from predicting infections after surgery to diagnostic imaging. However, no systematic reviews that we know of have compared, in particular, the performance of ML models with that of clinicians in musculoskeletal imaging to provide an up-to-date summary regarding the extent of applying ML to imaging diagnoses. By doing so, this review delves into where current ML developments stand in aiding orthopaedists in assessing musculoskeletal images. QUESTIONS/PURPOSES: This systematic review aimed (1) to compare performance of ML models versus clinicians in detecting, differentiating, or classifying orthopaedic abnormalities on imaging by (A) accuracy, sensitivity, and specificity, (B) input features (for example, plain radiographs, MRI scans, ultrasound), (C) clinician specialties, and (2) to compare the performance of clinician-aided versus unaided ML models. METHODS: A systematic review was performed in PubMed, Embase, and the Cochrane Library for studies published up to October 1, 2019, using synonyms for machine learning and all potential orthopaedic specialties. We included all studies that compared ML models head-to-head against clinicians in the binary detection of abnormalities in musculoskeletal images. After screening 6531 studies, we ultimately included 12 studies. We conducted quality assessment using the Methodological Index for Non-randomized Studies (MINORS) checklist. All 12 studies were of comparable quality, and they all clearly included six of the eight critical appraisal items (study aim, input feature, ground truth, ML versus human comparison, performance metric, and ML model description). 
This justified summarizing the findings in a quantitative form by calculating the median absolute improvement of the ML models compared with clinicians for the following metrics of performance: accuracy, sensitivity, and specificity. RESULTS: ML models provided, in aggregate, only very slight improvements in diagnostic accuracy and sensitivity compared with clinicians working alone and were on par in specificity (3% (interquartile range [IQR] -2.0% to 7.5%), 0.06% (IQR -0.03 to 0.14), and 0.00 (IQR -0.048 to 0.048), respectively). Inputs used by the ML models were plain radiographs (n = 8), MRI scans (n = 3), and ultrasound examinations (n = 1). Overall, ML models outperformed clinicians more when interpreting plain radiographs than when interpreting MRIs (17 of 34 and 3 of 16 performance comparisons, respectively). Orthopaedists and radiologists performed similarly to ML models, while ML models mostly outperformed other clinicians (outperformance in 7 of 19, 7 of 23, and 6 of 10 performance comparisons, respectively). Two studies evaluated the performance of clinicians aided and unaided by ML models; both demonstrated considerable improvements in ML-aided clinician performance by reporting a 47% decrease of misinterpretation rate (95% confidence interval [CI] 37 to 54; p < 0.001) and a mean increase in specificity of 0.048 (95% CI 0.029 to 0.068; p < 0.001) in detecting abnormalities on musculoskeletal images. CONCLUSIONS: At present, ML models have comparable performance to clinicians in assessing musculoskeletal images. ML models may enhance the performance of clinicians as a technical supplement rather than as a replacement for clinical intelligence. Future ML-related studies should emphasize how ML models can complement clinicians, instead of determining the overall superiority of one versus the other. 
This can be accomplished by improving transparent reporting, diminishing bias, determining the feasibility of implementation in the clinical setting, and appropriately tempering conclusions. LEVEL OF EVIDENCE: Level III, diagnostic study.
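The summary statistic described in the Methods above (the median absolute improvement of ML models over clinicians, with its interquartile range) can be sketched as follows. This is a minimal illustration under stated assumptions, not the review authors' code: the function names and the paired accuracy values are hypothetical, and the quartile convention (Python's "inclusive" method) is an assumption, since the abstract does not say which one was used.

```python
from statistics import median, quantiles


def absolute_improvements(ml_scores, clinician_scores):
    """Pairwise ML-minus-clinician differences for one performance metric."""
    if len(ml_scores) != len(clinician_scores):
        raise ValueError("each ML score needs a paired clinician score")
    return [ml - cl for ml, cl in zip(ml_scores, clinician_scores)]


def median_and_iqr(values):
    """Return the median and the (Q1, Q3) bounds of the interquartile range."""
    # Assumption: "inclusive" quartiles; other conventions give slightly different bounds.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return median(values), (q1, q3)


# Hypothetical paired accuracies for five head-to-head comparisons (not study data).
ml_accuracy = [0.90, 0.85, 0.78, 0.92, 0.81]
clinician_accuracy = [0.88, 0.86, 0.70, 0.90, 0.80]

diffs = absolute_improvements(ml_accuracy, clinician_accuracy)
med, (q1, q3) = median_and_iqr(diffs)
print(f"median absolute improvement: {med:.3f} (IQR {q1:.3f} to {q3:.3f})")
```

A positive median means the ML models scored higher than clinicians in the typical comparison; an IQR that spans zero, as in the review's reported results, means a substantial share of comparisons favored clinicians.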
Subjects
Clinical Competence , Machine Learning , Magnetic Resonance Imaging , Musculoskeletal Diseases/diagnostic imaging , Musculoskeletal System/diagnostic imaging , Orthopedic Surgeons , Radiographic Image Interpretation, Computer-Assisted , Ultrasonography , Diagnosis, Differential , Humans , Pattern Recognition, Automated , Predictive Value of Tests , Reproducibility of Results , Visual Perception
ABSTRACT
BACKGROUND: The Skeletal Oncology Research Group (SORG) machine learning algorithm for predicting survival in patients with chondrosarcoma was developed using data from the Surveillance, Epidemiology, and End Results (SEER) registry. This algorithm was externally validated on a dataset of patients from the United States in an earlier study, where it demonstrated generally good performance but overestimated 5-year survival. In addition, this algorithm has not yet been validated in patients outside the United States; doing so is important because algorithm performance may be misleading when applied to different populations. QUESTIONS/PURPOSES: Does the SORG algorithm retain validity in patients who underwent surgery for primary chondrosarcoma outside the United States, specifically in Italy? METHODS: A total of 737 patients were treated for chondrosarcoma between January 2000 and October 2014 at the Italian tertiary care center that was used for international validation. We excluded patients whose first surgical procedure was performed elsewhere (n = 25), patients who underwent nonsurgical treatment (n = 27), patients with a chondrosarcoma of the soft tissue or skull (n = 60), and patients with peripheral, periosteal, or mesenchymal chondrosarcoma (n = 161). Thus, 464 patients were ultimately included in this external validation study. Because the earlier SEER study served as the training set, this study, unlike most of its type, does not have separate training and validation sets. Although the earlier study overestimated 5-year survival, we did not modify the algorithm in this report, as this is the first international validation and the prior performance in the single-institution validation study from the United States may have been driven by a small sample or non-generalizable patterns related to its single-center setting. 
Variables needed for the SORG algorithm were manually collected from electronic medical records. These included sex, age, histologic subtype, tumor grade, tumor size, tumor extension, and tumor location. By inputting these variables into the algorithm, we calculated the predicted probabilities of survival for each patient. The performance of the SORG algorithm was assessed in this study through discrimination (the ability of a model to distinguish between a binary outcome), calibration (the agreement of observed and predicted outcomes), overall performance (the accuracy of predictions), and decision curve analysis (whether decisions guided by the model yield greater net benefit than decisions made without it). For discrimination, the c-statistic (commonly known as the area under the receiver operating characteristic curve for binary classification) was calculated; this ranges from 0.5 (no better than chance) to 1.0 (excellent discrimination). The agreement between predicted and observed outcomes was visualized with a calibration plot, and the calibration slope and intercept were calculated. Perfect calibration results in a slope of 1 and an intercept of 0. For overall performance, the Brier score and the null-model Brier score were calculated. The Brier score ranges from 0 (perfect prediction) to 1 (poorest prediction). Appropriate interpretation of the Brier score requires comparison with the null-model Brier score. The null-model Brier score is the score for an algorithm that predicts a probability equal to the population prevalence of the outcome for every patient. A decision curve analysis was performed to compare the potential net benefit of the algorithm versus other means of decision support, such as treating all or none of the patients. 
There were several differences between this study and the earlier SEER study, and such differences are important because they help us to determine the performance of the algorithm in a group different from the initial study population. In this study from Italy, 5-year survival was lower than in the earlier SEER study (71% [319 of 450 patients] versus 76% [1131 of 1487 patients]; p = 0.03). A larger proportion of patients had dedifferentiated chondrosarcoma than in the earlier SEER study (25% [118 of 464 patients] versus 8.5% [131 of 1544 patients]; p < 0.001). In addition, in this study patients were older, tumors were larger, and the proportion of high-grade tumors was higher than in the earlier SEER study (age: 56 years [interquartile range {IQR} 42 to 67] versus 52 years [IQR 40 to 64]; p = 0.007; tumor size: 80 mm [IQR 50 to 120] versus 70 mm [IQR 42 to 105]; p < 0.001; tumor grade: 22% [104 of 464 had Grade 1], 42% [196 of 464 had Grade 2], and 35% [164 of 464 had Grade 3] versus 41% [592 of 1456 had Grade 1], 40% [588 of 1456 had Grade 2], and 19% [276 of 1456 had Grade 3]; p ≤ 0.001). RESULTS: Validation of the SORG algorithm in a primarily Italian population achieved a c-statistic of 0.86 (95% confidence interval [CI] 0.82 to 0.89), suggesting good-to-excellent discrimination. The calibration plot showed good agreement between the predicted probability and observed survival for predicted probabilities of 0.8 to 1.0. With predicted survival probabilities lower than 0.8, however, the SORG algorithm underestimated the observed proportion of patients with 5-year survival, reflected in the overall calibration intercept of 0.82 (95% CI 0.67 to 0.98) and calibration slope of 0.68 (95% CI 0.42 to 0.95). The Brier score for 5-year survival was 0.15, compared with a null-model Brier score of 0.21. The algorithm showed a favorable decision curve analysis in the validation cohort. 
CONCLUSIONS: The SORG algorithm to predict 5-year survival for patients with chondrosarcoma retained good discriminative ability and overall performance on international external validation; however, it underestimated 5-year survival for patients with predicted probabilities from 0 to 0.8, as shown by the calibration plot's departure from the observed outcomes, with a maximum underestimation of 20%. The differences may reflect the baseline differences noted between the two study populations. The overall performance of the algorithm supports its utility and the validation presented here. A free digital application for the algorithm is available at https://sorg-apps.shinyapps.io/extremitymetssurvival/. LEVEL OF EVIDENCE: Level III, prognostic study.
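Three of the performance measures defined in the Methods above (the c-statistic, the Brier score with its null-model reference, and the net benefit used in decision curve analysis) can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the SORG implementation: the function names are hypothetical, the outcome is assumed to be coded 1 for alive at 5 years and 0 otherwise, the eight-patient dataset is invented, and net benefit is computed with the standard formula (true positives minus threshold-weighted false positives, per patient). Calibration slope and intercept require fitting a logistic model and are not shown.

```python
def c_statistic(y_true, y_prob):
    """Share of (event, non-event) pairs in which the event case has the
    higher predicted probability; ties count as half."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not pos or not neg:
        raise ValueError("both outcome classes are required")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def null_model_brier(y_true):
    """Brier score when every patient is assigned the outcome prevalence."""
    prevalence = sum(y_true) / len(y_true)
    return brier_score(y_true, [prevalence] * len(y_true))


def net_benefit(y_true, y_prob, threshold):
    """Net benefit of acting on predictions at or above `threshold`."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1 - threshold)


# Hypothetical data (not study data): 1 = alive at 5 years, 0 = not.
y_true = [1, 1, 1, 0, 1, 0, 1, 1]
y_prob = [0.90, 0.80, 0.70, 0.60, 0.85, 0.20, 0.40, 0.95]

print(f"c-statistic:      {c_statistic(y_true, y_prob):.3f}")
print(f"Brier score:      {brier_score(y_true, y_prob):.3f}")
print(f"null-model Brier: {null_model_brier(y_true):.3f}")
print(f"net benefit @0.5: {net_benefit(y_true, y_prob, 0.5):.3f}")
```

A Brier score below the null-model score, as in the study's reported 0.15 versus 0.21, indicates that the predictions carry more information than assigning every patient the prevalence. A decision curve repeats the net-benefit calculation across a range of thresholds and compares the model with the treat-all and treat-none strategies.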