Results 1 - 20 of 24
1.
BMC Med Educ ; 24(1): 817, 2024 Jul 29.
Article in English | MEDLINE | ID: mdl-39075511

ABSTRACT

CONTEXT: Objective Structured Clinical Examinations (OSCEs) are an increasingly popular evaluation modality for medical students. While the face-to-face interaction allows for more in-depth assessment, it may cause standardization problems. Methods to quantify, limit or adjust for examiner effects are needed. METHODS: Data originated from 3 OSCEs undergone by 900-student classes of 5th- and 6th-year medical students at Université Paris Cité in the 2022-2023 academic year. Sessions had five stations each, and one of the three sessions was scored by consensus by two raters (rather than one). We report OSCEs' longitudinal consistency for one of the classes and staff-related and student variability by session. We also propose a statistical method to adjust for inter-rater variability by deriving a statistical random student effect that accounts for staff-related and station random effects. RESULTS: From the four sessions, a total of 16,910 station scores were collected from 2615 student sessions, with two of the sessions undergone by the same students, and 36, 36, 35 and 20 distinct staff teams in each station for each session. Scores had staff-related heterogeneity (p < 10⁻¹⁵), with staff-level standard errors approximately doubled compared to chance. With mixed models, staff-related heterogeneity explained respectively 11.4%, 11.6%, and 4.7% of station score variance (95% confidence intervals, 9.5-13.8, 9.7-14.1, and 3.9-5.8, respectively) with 1, 1 and 2 raters, suggesting a moderating effect of consensus grading. Student random effects explained a small proportion of variance, respectively 8.8%, 11.3%, and 9.6% (8.0-9.7, 10.3-12.4, and 8.7-10.5), and this low amount of signal resulted in student rankings being no more consistent over time with this metric, rather than with average scores (p=0.45). CONCLUSION: Staff variability impacts OSCE scores as much as student variability, and the former can be reduced with dual assessment or adjusted for with mixed models.
Both are small compared to unmeasured sources of variability, making them difficult to capture consistently.


Subject(s)
Clinical Competence , Educational Measurement , Observer Variation , Students, Medical , Humans , Educational Measurement/methods , Educational Measurement/standards , Clinical Competence/standards , Education, Medical, Undergraduate/standards , Paris , Reproducibility of Results
2.
J Anesth Analg Crit Care ; 4(1): 50, 2024 Jul 31.
Article in English | MEDLINE | ID: mdl-39085969

ABSTRACT

BACKGROUND: Lung ultrasonography (LUS) is a non-invasive imaging method used to diagnose and monitor conditions such as pulmonary edema, pneumonia, and pneumothorax. It is especially valuable where other imaging techniques like CT scan or chest X-rays are of limited access, especially in low- and middle-income countries with reduced resources. Furthermore, LUS reduces radiation exposure and its related blood cancer adverse events, which is particularly relevant in children and young subjects. The score obtained with LUS allows semi-quantification of regional loss of aeration, and it can provide a valuable and reliable assessment of the severity of most respiratory diseases. However, inter-observer reliability of the score has never been systematically assessed. This study aims to assess experienced LUS operators' agreement on a sample of video clips showing predefined findings. METHODS: Twenty-five anonymized video clips comprehensively depicting the different values of LUS score were shown to renowned LUS experts blinded to patients' clinical data and the study's aims using an online form. Clips were acquired from five different ultrasound machines. Fleiss-Cohen weighted kappa was used to evaluate experts' agreement. RESULTS: Over a period of 3 months, 20 experienced operators completed the assessment. Most worked in the ICU (10), ED (6), HDU (2), cardiology ward (1), or obstetric/gynecology department (1). The proportional LUS score mean was 15.3 (SD 1.6). Inter-rater agreement varied: 6 clips had full agreement, 3 had 19 out of 20 raters agreeing, and 3 had 18 agreeing, while the remaining 13 had 17 or fewer people agreeing on the assigned score. Scores 0 and 3 were more reproducible than scores 1 and 2. Fleiss' Kappa for overall answers was 0.87 (95% CI 0.815-0.931, p < 0.001). CONCLUSIONS: The inter-rater agreement between experienced LUS operators is very high, although not perfect.
The strong agreement and the small variance enable us to say that a 20% tolerance around a measured value of a LUS score is a reliable estimate of the patient's true LUS score, resulting in reduced variability in score interpretation and greater confidence in its clinical use.
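As an illustration of the agreement statistic named in the abstract above, the following is a minimal sketch of the unweighted Fleiss' kappa for several raters scoring the same clips. It is not the authors' analysis (the methods mention a Fleiss-Cohen weighted variant, which this sketch does not implement), and the function name `fleiss_kappa` and the example scores are hypothetical.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Unweighted Fleiss' kappa.

    ratings: one list per clip, holding the score given by every rater
    (the same number of raters is assumed for every clip).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    counts = [Counter(item) for item in ratings]
    categories = sorted({score for item in ratings for score in item})

    # Observed agreement: share of agreeing rater pairs, averaged over clips.
    p_obs = sum(
        (sum(c[k] ** 2 for k in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ) / n_items

    # Chance agreement: sum of squared overall proportions of each score.
    p_exp = sum(
        (sum(c[k] for c in counts) / (n_items * n_raters)) ** 2
        for k in categories
    )

    # Undefined (division by zero) if every rating falls in one category.
    return (p_obs - p_exp) / (1 - p_exp)


# Hypothetical data: 4 clips, 5 raters, LUS scores from 0 to 3.
example = [
    [0, 0, 0, 0, 0],
    [1, 1, 2, 1, 1],
    [2, 2, 2, 1, 2],
    [3, 3, 3, 3, 3],
]
print(round(fleiss_kappa(example), 3))
```

In this toy data the disagreements sit on the intermediate scores, mirroring the abstract's observation that scores 1 and 2 were less reproducible than 0 and 3.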

3.
Langenbecks Arch Surg ; 409(1): 170, 2024 Jun 01.
Article in English | MEDLINE | ID: mdl-38822883

ABSTRACT

PURPOSE: Perioperative decision making for large (> 2 cm) rectal polyps with ambiguous features is complex. The most common intraprocedural assessment is clinician judgement alone while radiological and endoscopic biopsy can provide periprocedural detail. Fluorescence-augmented machine learning (FA-ML) methods may optimise local treatment strategy. METHODS: Surgeons of varying grades, all performing colonoscopies independently, were asked to visually judge endoscopic videos of large benign and early-stage malignant (potentially suitable for local excision) rectal lesions on an interactive video platform (Mindstamp) with results compared with and between final pathology, radiology and a novel FA-ML classifier. Statistical analyses of data used Fleiss Multi-rater Kappa scoring, Spearman Coefficient and Frequency tables. RESULTS: Thirty-two surgeons judged 14 ambiguous polyp videos (7 benign, 7 malignant). In all cancers, initial endoscopic biopsy had yielded false-negative results. Five of each lesion type had had a pre-excision MRI with a 60% false-positive malignancy prediction in benign lesions and a 60% over-staging and 40% equivocal rate in cancers. Average clinical visual cancer judgement accuracy was 49% (with only 'fair' inter-rater agreement), many reporting uncertainty and higher reported decision confidence did not correspond to higher accuracy. This compared to 86% ML accuracy. Size was misjudged visually by a mean of 20% with polyp size underestimated in 4/6 and overestimated in 2/6. Subjective narratives regarding decision-making requested for 7/14 lesions revealed wide rationale variation between participants. CONCLUSION: Current available clinical means of ambiguous rectal lesion assessment is suboptimal with wide inter-observer variation. Fluorescence based AI augmentation may advance this field via objective, explainable ML methods.


Subject(s)
Colonoscopy , Rectal Neoplasms , Humans , Rectal Neoplasms/pathology , Rectal Neoplasms/surgery , Rectal Neoplasms/diagnostic imaging , Intestinal Polyps/pathology , Intestinal Polyps/surgery , Machine Learning , Male , Fluorescence , Female , Observer Variation
4.
Epileptic Disord ; 26(1): 109-120, 2024 Feb.
Article in English | MEDLINE | ID: mdl-38031822

ABSTRACT

OBJECTIVE: We published a list of "must-know" routine EEG (rEEG) findings for trainees based on expert opinion. Here, we studied the accuracy and inter-rater agreement (IRA) of these "must-know" rEEG findings among international experts. METHODS: A previously validated online rEEG examination was disseminated to EEG experts. It consisted of a survey and 30 multiple-choice questions predicated on the previously published "must-know" rEEG findings divided into four domains: normal, abnormal, normal variants, and artifacts. Questions contained de-identified 10-20-s epochs of EEG that were considered unequivocal examples by five EEG experts. RESULTS: The examination was completed by 258 international EEG experts. Overall mean accuracy and IRA (AC1) were 81% and substantial (0.632), respectively. The domain-specific mean accuracies and IRA were: 76%, moderate (0.558) (normal); 78%, moderate (0.575) (abnormal); 85%, substantial (0.678) (normal variants); 85%, substantial (0.740) (artifacts). Academic experts had a higher accuracy than private practice experts (82% vs. 77%; p = .035). Country-specific overall mean accuracies and IRA were: 92%, almost perfect (0.836) (U.S.); 86%, substantial (0.762) (Brazil); 79%, substantial (0.646) (Italy); and 72%, moderate (0.496) (India). In conclusion, collective expert accuracy and IRA of "must-know" rEEG findings are suboptimal and heterogeneous. SIGNIFICANCE: We recommend the development and implementation of pragmatic, accessible, country-specific ways to measure and improve the expert accuracy and IRA.


Subject(s)
Electroencephalography , Neurology , Adult , Child , Humans , Observer Variation , Artifacts , Italy
5.
Data Brief ; 51: 109662, 2023 Dec.
Article in English | MEDLINE | ID: mdl-37869619

ABSTRACT

Accurate segmentation of liver and tumor regions in medical imaging is crucial for the diagnosis, treatment, and monitoring of hepatocellular carcinoma (HCC) patients. However, manual segmentation is time-consuming and subject to inter- and intra-rater variability. Therefore, automated methods are necessary but require rigorous validation of high-quality segmentations based on a consensus of raters. To address the need for reliable and comprehensive data in this domain, we present LiverHccSeg, a dataset that provides liver and tumor segmentations on multiphasic contrast-enhanced magnetic resonance imaging from two board-approved abdominal radiologists, along with an analysis of inter-rater agreement. LiverHccSeg provides a curated resource for liver and HCC tumor segmentation tasks. The dataset includes a scientific reading and co-registered contrast-enhanced multiphasic magnetic resonance imaging (MRI) scans with corresponding manual segmentations by two board-approved abdominal radiologists and relevant metadata, and offers researchers a comprehensive foundation for external validation and benchmarking of liver and tumor segmentation algorithms. The dataset also provides an analysis of the agreement between the two sets of liver and tumor segmentations. Through the calculation of appropriate segmentation metrics, we provide insights into the consistency and variability in liver and tumor segmentations among the radiologists. A total of 17 cases were included for liver segmentation and 14 cases for HCC tumor segmentation. Liver segmentations demonstrated high agreement (mean Dice, 0.95 ± 0.01 [standard deviation]) and HCC tumor segmentations showed higher variation (mean Dice, 0.85 ± 0.16 [standard deviation]). The applications of LiverHccSeg can be manifold, ranging from testing machine learning algorithms on public external data to radiomic feature analyses.
Leveraging the inter-rater agreement analysis within the dataset, researchers can investigate the impact of variability on segmentation performance and explore methods to enhance the accuracy and robustness of liver and tumor segmentation algorithms in HCC patients. By making this dataset publicly available, LiverHccSeg aims to foster collaborations, facilitate innovative solutions, and ultimately improve patient outcomes in the diagnosis and treatment of HCC.
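The Dice values quoted in the abstract above measure the overlap between two raters' segmentations. The sketch below shows the standard formula, 2|A ∩ B| / (|A| + |B|), on flattened binary masks. It is an illustration only: the function name `dice_coefficient` and the toy masks are hypothetical, and the dataset's own tooling may compute the metric differently (for example on 3D arrays).

```python
def dice_coefficient(mask_a, mask_b):
    """Dice similarity between two binary masks given as flat 0/1 sequences."""
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have the same number of voxels")
    size_a = sum(1 for v in mask_a if v)
    size_b = sum(1 for v in mask_b if v)
    overlap = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    if size_a + size_b == 0:
        # Convention chosen for this sketch: two empty masks agree fully.
        return 1.0
    return 2 * overlap / (size_a + size_b)


# Hypothetical 6-voxel masks from two raters.
rater_1 = [0, 1, 1, 1, 0, 0]
rater_2 = [0, 0, 1, 1, 1, 0]
print(dice_coefficient(rater_1, rater_2))
```

Small structures are penalised more heavily by a fixed number of mismatched voxels, which is one plausible reason tumor segmentations show lower and more variable Dice values than whole-liver segmentations.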

6.
Diagnostics (Basel) ; 13(16)2023 Aug 12.
Article in English | MEDLINE | ID: mdl-37627921

ABSTRACT

BACKGROUND: Neonatal pain assessment (NPA) represents a huge global problem of essential importance, as a timely and accurate assessment of neonatal pain is indispensable for implementing pain management. PURPOSE: To investigate the consistency of pain scores derived through video-based NPA (VB-NPA) and on-site NPA (OS-NPA), providing the scientific foundation and feasibility of adopting VB-NPA results in a real-world scenario as the gold standard for neonatal pain in clinical studies and labels for artificial intelligence (AI)-based NPA (AI-NPA) applications. SETTING: A total of 598 neonates were recruited from a pediatric hospital in China. METHODS: This observational study recorded 598 neonates who underwent one of 10 painful procedures, including arterial blood sampling, heel blood sampling, fingertip blood sampling, intravenous injection, subcutaneous injection, peripheral intravenous cannulation, nasopharyngeal suctioning, retention enema, adhesive removal, and wound dressing. Two experienced nurses performed OS-NPA and VB-NPA at a 10-day interval through double-blind scoring using the Neonatal Infant Pain Scale to evaluate the pain level of the neonates. Intra-rater and inter-rater reliability were calculated and analyzed, and a paired samples t-test was used to explore the bias and consistency of the assessors' pain scores derived through OS-NPA and VB-NPA. The impact of different label sources was evaluated using three state-of-the-art AI methods trained with labels given by OS-NPA and VB-NPA, respectively. RESULTS: The intra-rater reliability of the same assessor was 0.976-0.983 across different times, as measured by the intraclass correlation coefficient. The inter-rater reliability was 0.983 for single measures and 0.992 for average measures. No significant differences were observed between the OS-NPA scores and the assessment of an independent VB-NPA assessor. 
The different label sources only caused a limited accuracy loss of 0.022-0.044 for the three AI methods. CONCLUSION: VB-NPA in a real-world scenario is an effective way to assess neonatal pain due to its high intra-rater and inter-rater reliability compared to OS-NPA and could be used for the labeling of large-scale NPA video databases for clinical studies and AI training.

7.
Appl Neuropsychol Adult ; : 1-5, 2022 Jun 15.
Article in English | MEDLINE | ID: mdl-35705310

ABSTRACT

BACKGROUND: Despite its wide use in dementia diagnosis on the basis of cut-off points, the inter-rater variability of the Addenbrooke's Cognitive Examination-Third Edition (ACE-III) has been poorly studied. METHODS: Thirty-one healthcare professionals from an older adults' mental health team scored two ACE-III protocols based on mock patients in a computerised form. Scoring accuracy, as well as total and domain-specific scoring variability, were calculated; factors relevant to participants were obtained, including their level of experience and self-rated confidence administering the ACE-III. RESULTS: There was considerable inter-rater variability (up to 18 points for one of the cases), and one case's mean score was significantly higher (by nearly four points) than the true score. The Fluency, Visuospatial and Attention domains had greater levels of variability than Language and Memory. Higher scoring accuracy was not associated with either greater levels of experience or higher self-confidence in administering the ACE-III. CONCLUSIONS: The results suggest that the ACE-III is susceptible to scoring error and considerable inter-rater variability, which highlights the critical importance of initial, and continued, administration and scoring training.

8.
J Med Imaging (Bellingham) ; 9(2): 024001, 2022 Mar.
Article in English | MEDLINE | ID: mdl-35300345

ABSTRACT

Purpose: An accurate zonal segmentation of the prostate is required for prostate cancer (PCa) management with MRI. Approach: The aim of this work is to present UFNet, a deep learning-based method for automatic zonal segmentation of the prostate from T2-weighted (T2w) MRI. It takes into account the image anisotropy, includes both spatial and channelwise attention mechanisms and uses loss functions to enforce prostate partition. The method was applied on a private multicentric three-dimensional T2w MRI dataset and on the public two-dimensional T2w MRI dataset ProstateX. To assess the model performance, the structures segmented by the algorithm on the private dataset were compared with those obtained by seven radiologists of various experience levels. Results: On the private dataset, we obtained a Dice score (DSC) of 93.90 ± 2.85 for the whole gland (WG), 91.00 ± 4.34 for the transition zone (TZ), and 79.08 ± 7.08 for the peripheral zone (PZ). Results were significantly better than other compared networks' (p-value < 0.05). On ProstateX, we obtained a DSC of 90.90 ± 2.94 for WG, 86.84 ± 4.33 for TZ, and 78.40 ± 7.31 for PZ. These results are similar to state-of-the-art results and, on the private dataset, are coherent with those obtained by radiologists. Zonal locations and sectorial positions of lesions annotated by radiologists were also preserved. Conclusions: Deep learning-based methods can provide an accurate zonal segmentation of the prostate leading to a consistent zonal location and sectorial position of lesions, and therefore can be used as a helping tool for PCa diagnosis.

9.
Neurophysiol Clin ; 52(2): 157-169, 2022 Apr.
Article in English | MEDLINE | ID: mdl-34906430

ABSTRACT

OBJECTIVE: To assess the inter-rater reliability of MScanFit MUNE using a "Round Robin" research design. METHODS: Twelve raters from different centres examined six healthy study participants over two days. Median, ulnar and common peroneal nerves were stimulated, and compound muscle action potential (CMAP)-scans were recorded from abductor pollicis brevis (APB), abductor digiti minimi (ADM) and anterior tibial (TA) muscles respectively. From this we calculated the Motor Unit Number Estimation (MUNE) and "A50", a motor unit size parameter. For statistical analysis, we used Limits of Agreement (LOA) and Coefficient of Variation (COV). Study participants scored their perception of pain from the examinations on a rating scale from 0 (no pain) to 10 (unbearable pain). RESULTS: Before this study, 41.6% of the raters had performed MScanFit less than five times. The mean MUNE-values were: 99.6 (APB), 131.4 (ADM) and 126.2 (TA), with LOA: 19.5 (APB), 29.8 (ADM) and 20.7 (TA), and COV: 13.4 (APB), 6.3 (ADM) and 5.6 (TA). MUNE-values correlated with CMAP max amplitudes (R² values were: 0.463 (APB) (p<0.001), 0.421 (ADM) (p<0.001) and 0.645 (TA) (p<0.001)). The average perception of pain was 4. DISCUSSION: MScanFit indicates a high level of inter-rater reliability, even with only limited rater experience and is overall reasonably well tolerated by patients. These results may indicate MScanFit as a reliable MUNE method with potential as a biomarker in drug trials.


Subject(s)
Amyotrophic Lateral Sclerosis , Motor Neurons , Action Potentials/physiology , Electromyography/methods , Humans , Motor Neurons/physiology , Muscle, Skeletal/innervation , Pain , Reproducibility of Results
10.
Eur Radiol ; 32(4): 2798-2809, 2022 Apr.
Article in English | MEDLINE | ID: mdl-34643779

ABSTRACT

OBJECTIVE: Automated quantification of infratentorial multiple sclerosis lesions on magnetic resonance imaging is clinically relevant but challenging. To overcome some of these problems, we propose a fully automated lesion segmentation algorithm using 3D convolutional neural networks (CNNs). METHODS: The CNN was trained on a FLAIR image alone or on FLAIR and T1-weighted images from 1809 patients acquired on 156 different scanners. An additional training using an extra class for infratentorial lesions was implemented. Three experienced raters manually annotated three datasets from 123 MS patients from different scanners. RESULTS: The inter-rater sensitivity (SEN) was 80% for supratentorial lesions but only 62% for infratentorial lesions. There was no statistically significant difference between the inter-rater SEN and the SEN of the CNN with respect to the raters. For supratentorial lesions, the CNN featured an intra-rater intra-scanner SEN of 0.97 (R1 = 0.90, R2 = 0.84) and for infratentorial lesion a SEN of 0.93 (R1 = 0.61, R2 = 0.73). CONCLUSION: The performance of the CNN improved significantly for infratentorial lesions when specifically trained on infratentorial lesions using a T1 image as an additional input and matches the detection performance of experienced raters. Furthermore, for infratentorial lesions the CNN was more robust against repeated scans than experienced raters. KEY POINTS: • A 3D convolutional neural network was trained on MRI data from 1809 patients (156 different scanners) for the quantification of supratentorial and infratentorial multiple sclerosis lesions. • Inter-rater variability was higher for infratentorial lesions than for supratentorial lesions. The performance of the 3D convolutional neural network (CNN) improved significantly for infratentorial lesions when specifically trained on infratentorial lesions using a T1 image as an additional input. 
• The detection performance of the CNN matches the detection performance of experienced raters.


Subject(s)
Multiple Sclerosis , Algorithms , Humans , Image Processing, Computer-Assisted/methods , Magnetic Resonance Imaging/methods , Multiple Sclerosis/diagnostic imaging , Multiple Sclerosis/pathology ,
11.
J Sleep Res ; 31(2): e13481, 2022 04.
Article in English | MEDLINE | ID: mdl-34510622

ABSTRACT

The clinical relevance of rapid eye movement sleep-related obstructive sleep apnea (REM OSA) is supported by its associated adverse health outcomes and impact on optimal treatment strategies. To date, no assessment of REM OSA phenotyping performance has been conducted for any type of sleep testing technology. The objective of this study was to assess this for polysomnography and peripheral arterial tone-based home sleep apnea testing (PAT HSAT). In a dataset comprising 261 participants, the sensitivity and specificity of the agreement on REM OSA phenotyping was assessed for two independent scorings of polysomnography and a synchronously administered PAT HSAT. The sensitivity and specificity of REM OSA phenotyping were 0.87 and 0.89, respectively, for the polysomnography inter-scorer comparison, and 0.68 and 0.97 for the PAT HSAT on a single-night basis, using the conventional minimum required rapid eye movement sleep time of 30 min. Polysomnography-based REM OSA phenotyping was found to be sensitive and specific even for a single-night testing protocol. Peripheral arterial tone-based REM OSA phenotyping showed a lower sensitivity but a slightly higher specificity compared to polysomnography. In order to increase performance and conclusiveness of peripheral arterial tone-based REM OSA phenotyping, a multi-night protocol of 2-5 nights could be considered. Finally, the minimum required rapid eye movement sleep time could be lowered from the conventional 30 min to 15 min without significantly lowering REM OSA phenotyping sensitivity and specificity, while increasing the level of phenotyping conclusiveness.
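The sensitivity and specificity figures in the abstract above compare one phenotyping result against another taken as the reference. As a hedged illustration, the sketch below computes both from two lists of yes/no phenotype calls. The function name `sensitivity_specificity` and the example labels are hypothetical and are not taken from the study's data.

```python
def sensitivity_specificity(reference, test):
    """Return (sensitivity, specificity) of `test` against `reference`.

    Both arguments are equal-length sequences of booleans
    (True = participant phenotyped as REM OSA).
    """
    pairs = list(zip(reference, test))
    tp = sum(1 for r, t in pairs if r and t)          # true positives
    fn = sum(1 for r, t in pairs if r and not t)      # false negatives
    tn = sum(1 for r, t in pairs if not r and not t)  # true negatives
    fp = sum(1 for r, t in pairs if not r and t)      # false positives
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical calls for 10 participants.
reference_scoring = [True, True, True, True, False,
                     False, False, False, False, False]
second_scoring = [True, True, True, False, False,
                  False, False, False, False, True]
sens, spec = sensitivity_specificity(reference_scoring, second_scoring)
print(sens, spec)
```

The function assumes the reference contains at least one positive and one negative case; otherwise one of the two ratios is undefined.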


Subject(s)
Sleep Apnea, Obstructive , Humans , Polysomnography/methods , Sensitivity and Specificity , Sleep , Sleep Apnea, Obstructive/diagnosis , Sleep, REM
12.
Ann Intensive Care ; 11(1): 22, 2021 Feb 03.
Article in English | MEDLINE | ID: mdl-33534010

ABSTRACT

PURPOSE: Frailty is a valuable predictor for outcome in elderly ICU patients, and has been suggested to be used in various decision-making processes prior to and during an ICU admission. There are many instruments developed to assess frailty, but few of them can be used in emergency situations. In this setting the clinical frailty scale (CFS) is frequently used. The present study is a sub-study within a larger outcome study of elderly ICU patients in Europe (the VIP-2 study) in order to document the reliability of the CFS. MATERIALS AND METHODS: From the VIP-2 study, 129 ICUs in 20 countries participated in this sub-study. The patients were acute admissions ≥ 80 years of age and frailty was assessed at admission by two independent observers using the CFS. Information was obtained from the patient or, if not feasible, from the family/caregivers or from hospital files. The profession of the rater and source of data were recorded along with the score. Interrater variability was calculated using linear weighted kappa analysis. RESULTS: 1923 pairs of assessors were included and background data of patients were similar to the whole cohort (n = 3920). We found a very high inter-rater agreement (weighted kappa 0.86), also in subgroup analyses. The agreement when comparing information from family or hospital records was better than using only direct patient information, and pairs of raters from same profession performed better than from different professions. CONCLUSIONS: Overall, we documented a high reliability using CFS in this setting. This frailty score could be used more frequently in elderly ICU patients in order to create a more holistic and realistic impression of the patient's condition prior to ICU admission.
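The linear weighted kappa mentioned in the abstract above gives partial credit when two raters choose neighbouring levels of an ordinal scale. The following is a minimal sketch for two raters, assuming scores coded 1 to `n_levels` and the usual linear weights 1 - |i - j| / (n_levels - 1). The function name and the example scores are hypothetical, and this is not the study's own code.

```python
def linear_weighted_kappa(rater_1, rater_2, n_levels):
    """Cohen's kappa with linear weights for two raters on an ordinal scale.

    Scores are assumed to be integers from 1 to n_levels.
    """
    n = len(rater_1)

    def weight(i, j):
        # Full credit for identical scores, none for opposite ends of the scale.
        return 1 - abs(i - j) / (n_levels - 1)

    # Observed weighted agreement across all rated patients.
    p_obs = sum(weight(a, b) for a, b in zip(rater_1, rater_2)) / n

    # Marginal counts of each score for each rater.
    row = [sum(1 for a in rater_1 if a == k) for k in range(1, n_levels + 1)]
    col = [sum(1 for b in rater_2 if b == k) for k in range(1, n_levels + 1)]

    # Weighted agreement expected by chance from the marginals.
    p_exp = sum(
        weight(i, j) * row[i - 1] * col[j - 1]
        for i in range(1, n_levels + 1)
        for j in range(1, n_levels + 1)
    ) / n ** 2

    return (p_obs - p_exp) / (1 - p_exp)


# Hypothetical frailty scores from two observers on a 9-level scale.
observer_a = [3, 4, 5, 6, 7, 4, 5, 6]
observer_b = [3, 5, 5, 6, 6, 4, 4, 7]
print(round(linear_weighted_kappa(observer_a, observer_b, n_levels=9), 3))
```

Because near-misses still earn most of the credit, a weighted kappa on a nine-level scale is typically higher than the unweighted kappa computed on the same ratings.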

13.
Intensive Care Med ; 46(7): 1382-1393, 2020 Jul.
Article in English | MEDLINE | ID: mdl-32451578

ABSTRACT

PURPOSE: Definitions of acute respiratory distress syndrome (ARDS) include radiographic criteria, but there are concerns about reliability and prognostic relevance. This study aimed to evaluate the independent relationship between chest imaging and mortality and examine the inter-rater variability of interpretations of chest radiographs (CXR) in pediatric ARDS (PARDS). METHODS: Prospective, international observational study in children meeting Pediatric Acute Lung Injury Consensus Conference (PALICC) criteria for PARDS, which requires new infiltrate(s) consistent with pulmonary parenchymal disease, without mandating bilateral infiltrates. Mortality analysis focused on the entire cohort, whereas inter-observer variability used a subset of patients with blinded, simultaneous interpretation of CXRs by intensivists and radiologists. RESULTS: Bilateral infiltrates and four quadrants of alveolar consolidation were associated with mortality on a univariable basis, using CXRs from 708 patients with PARDS. For patients on either invasive (IMV) or non-invasive ventilation (NIV) with PaO2/FiO2 (PF) ratios (or SpO2/FiO2 (SF) ratio equivalent) > 100, neither bilateral infiltrates (OR 1.3 (95% CI 0.68, 2.5), p = 0.43), nor 4 quadrants of alveolar consolidation (OR 1.6 (0.85, 3), p = 0.14) were associated with mortality. For patients with PF ≤ 100, bilateral infiltrates (OR 3.6 (1.4, 9.4), p = 0.01) and four quadrants of consolidation (OR 2.0 (1.14, 3.5), p = 0.02) were associated with higher mortality. A subset of 702 CXRs from 233 patients had simultaneous interpretations. Interobserver agreement for bilateral infiltrates and quadrants was "slight" (kappa 0.31 and 0.33). Subgroup analysis showed agreement did not differ when stratified by PARDS severity but was slightly higher for children with chronic respiratory support (kappa 0.62), NIV at PARDS diagnosis (kappa 0.53), age > 10 years (kappa 0.43) and fluid balance > 40 ml/kg (kappa 0.48). 
CONCLUSION: Bilateral infiltrates and quadrants of alveolar consolidation are associated with mortality only for those with PF ratio ≤ 100, although there is high inter-rater variability in these chest X-ray parameters.


Subject(s)
, Child , Humans , Incidence , Prognosis , Prospective Studies , Reproducibility of Results , /diagnostic imaging
14.
J Voice ; 33(4): 453-464, 2019 Jul.
Article in English | MEDLINE | ID: mdl-29731380

ABSTRACT

OBJECTIVE: To present and test a production-matching method with external references, looking at the improvement of inter-rater variability of expert evaluations. METHOD: It consists of adjusting quality attribute levels of a synthetic vowel for a simultaneous matching with the natural patient vowel (NPV) attributes. In an initial experiment, seven speech-language pathology (SLP) experts performed this task with the new method and evaluated the same NPV with the standard method. Targets were twelve NPVs with a variety of quality attribute combinations. In a second experiment, we employed the proposed method to assess the evaluation performance of 65 SLP students. RESULTS: Expert evaluations show less dispersion for the proposed method than those obtained using the standard rating method. Student individual responses were compared with overall responses from their own group and were cross referenced with expert responses. A Kappa index is proposed as a measure of SLP students' performance. CONCLUSIONS: The proposed method was readily accepted by both SLP experts and students. Experts' consensus was improved. SLP students could benefit by quickly learning to discriminate complex attributes, which usually demands years of experience.


Subject(s)
Dysphonia/diagnosis , Judgment , Speech Acoustics , Speech Perception , Speech Production Measurement , Speech-Language Pathology/methods , Voice Quality , Consensus , Dysphonia/physiopathology , Humans , Observer Variation , Predictive Value of Tests , Reproducibility of Results
15.
Inflamm Bowel Dis ; 24(2): 254-260, 2018 01 18.
Article in English | MEDLINE | ID: mdl-29361106

ABSTRACT

Background: Endoscopy is routinely performed in patients with inflammatory bowel disease to evaluate disease severity and guide important clinical decisions. However, variability in the interpretation of endoscopic findings can significantly impact patient management. Methods: Fifty-eight gastroenterologists were invited to participate in an online survey including pictures and video recordings of colonoscopies performed in patients with ulcerative colitis (UC) and Crohn's disease (CD). Participants were asked to rate the colorectal mucosa in patients with UC using the Mayo endoscopic subscore (MES), and the neo-terminal ileum and anastomosis in operated patients with CD using the Rutgeerts score (RS). Interrater agreement (IRA), overall and for several key end points, was assessed using Krippendorff's alpha. Results: The IRAs for the MES and RS were 0.47 (95% confidence interval [CI], 0.41-0.54) and 0.33 (95% CI, 0.28-0.38), respectively. The IRAs for UC mucosal healing (MES ≤ 1) and complete mucosal healing (MES = 0) were 0.57 (95% CI, 0.40-0.72) and 0.89 (95% CI, 0.73-1), and the IRAs for CD postoperative recurrence (RS ≥ i2) and severe postoperative recurrence (RS ≥ 3) were 0.44 (95% CI, 0.24-0.62) and 0.54 (95% CI, 0.36-0.71), respectively. Unexpectedly, although clinical information significantly influenced the IRA, participant expertise and consultation of scores did not produce significant changes in the IRA. Conclusions: A high rate of disagreement in endoscopic scoring was found in this study, even among experienced physicians. The variability in the assessment of mucosal healing and postoperative recurrence may translate into relevant differences in patient management.


Subject(s)
Colonoscopy , Gastroenterologists , Inflammatory Bowel Diseases/diagnosis , Severity of Illness Index , Humans , Inflammatory Bowel Diseases/physiopathology , Inflammatory Bowel Diseases/therapy , Intestinal Mucosa/pathology , Observer Variation , Portugal , Recurrence , Wound Healing/physiology
16.
Magn Reson Med ; 79(5): 2500-2510, 2018 05.
Article in English | MEDLINE | ID: mdl-28994492

ABSTRACT

PURPOSE: To investigate and compare human judgment and machine learning tools for quality assessment of clinical MR spectra of brain tumors. METHODS: A very large set of 2574 single voxel spectra with short and long echo time from the eTUMOUR and INTERPRET databases were used for this analysis. Original human quality ratings from these studies as well as new human guidelines were used to train different machine learning algorithms for automatic quality control (AQC) based on various feature extraction methods and classification tools. The performance was compared with variance in human judgment. RESULTS: AQC built using the RUSBoost classifier that combats imbalanced training data performed best. When furnished with a large range of spectral and derived features where the most crucial ones had been selected by the TreeBagger algorithm it showed better specificity (98%) in judging spectra from an independent test-set than previously published methods. Optimal performance was reached with a virtual three-class ranking system. CONCLUSION: Our results suggest that feature space should be relatively large for the case of MR tumor spectra and that three-class labels may be beneficial for AQC. The best AQC algorithm showed a performance in rejecting spectra that was comparable to that of a panel of human expert spectroscopists. Magn Reson Med 79:2500-2510, 2018. © 2017 International Society for Magnetic Resonance in Medicine.


Subject(s)
Brain Neoplasms/diagnostic imaging , Image Interpretation, Computer-Assisted/methods , Machine Learning , Magnetic Resonance Imaging/methods , Algorithms , Brain/diagnostic imaging , Humans , Quality Control
17.
Neuroinformatics ; 16(1): 51-63, 2018 01.
Article in English | MEDLINE | ID: mdl-29103086

ABSTRACT

Quantified volume and count of white-matter lesions based on magnetic resonance (MR) images are important biomarkers in several neurodegenerative diseases. Routine extraction of these biomarkers requires accurate and reliable automated lesion segmentation. To objectively and reliably determine a standard automated method, however, the creation of standard validation datasets is extremely important. Ideally, these datasets should be publicly available in conjunction with standardized evaluation methodology to enable objective validation of novel and existing methods. For validation purposes, we present a novel MR dataset of 30 multiple sclerosis patients and a novel protocol for creating reference white-matter lesion segmentations based on multi-rater consensus. On these datasets, three expert raters individually segmented white-matter lesions using semi-automated lesion contouring tools developed in-house. Later, the raters revised the segmentations in several joint sessions to reach a consensus on segmentation of lesions. To evaluate the variability, and as quality assurance, the protocol was executed twice on the same MR images, with a six-month break. The obtained intra-consensus variability was substantially lower than the intra- and inter-rater variabilities, showing improved reliability of lesion segmentation by the proposed protocol. Hence, the obtained reference segmentations may represent a more precise target against which to evaluate, compare, and train automatic segmentations. To encourage further use and research, we will publicly disseminate on our website http://lit.fe.uni-lj.si/tools the tools used to create lesion segmentations, the original and preprocessed MR image datasets, and the consensus lesion segmentations.


Subject(s)
Consensus , Databases, Factual , Image Processing, Computer-Assisted/methods , Magnetic Resonance Imaging/methods , Multiple Sclerosis/diagnostic imaging , Adult , Female , Humans , Male , Middle Aged
18.
J Neurosci Methods ; 289: 48-56, 2017 Sep 01.
Article in English | MEDLINE | ID: mdl-28648717

ABSTRACT

BACKGROUND: Manual analysis of behavior is labor intensive and subject to inter-rater variability. Although considerable progress in automation of analysis has been made, complex behavior such as grooming still lacks satisfactory automated quantification. NEW METHOD: We trained a freely available, automated classifier, Janelia Automatic Animal Behavior Annotator (JAABA), to quantify self-grooming duration and number of bouts based on video recordings of SAPAP3 knockout mice (a mouse line that self-grooms excessively) and wild-type animals. RESULTS: We compared the JAABA classifier with human expert observers to test its ability to measure self-grooming in three scenarios: mice in an open field, mice on an elevated plus-maze, and tethered mice in an open field. In each scenario, the classifier identified both grooming and non-grooming with great accuracy and correlated highly with results obtained by human observers. Consistently, the JAABA classifier confirmed previous reports of excessive grooming in SAPAP3 knockout mice. COMPARISON WITH EXISTING METHODS: Thus far, manual analysis was regarded as the only valid quantification method for self-grooming. We demonstrate that the JAABA classifier is a valid and reliable scoring tool that is more cost-efficient than manual scoring, is easy to use, requires minimal effort, provides high throughput, and prevents inter-rater variability. CONCLUSION: We introduce the JAABA classifier as an efficient analysis tool for the assessment of rodent self-grooming with expert quality. In our "how-to" instructions, we provide all information necessary to implement behavioral classification with JAABA.


Subject(s)
Automation, Laboratory/methods , Grooming , Mice , Motor Activity , Pattern Recognition, Automated/methods , Software , Animals , Exploratory Behavior , Female , Male , Nerve Tissue Proteins/deficiency , Nerve Tissue Proteins/genetics , Observer Variation , Orexins/genetics , Orexins/metabolism , Reproducibility of Results , Video Recording/methods
19.
Crit Care ; 21(1): 12, 2017 01 20.
Article in English | MEDLINE | ID: mdl-28107822

ABSTRACT

BACKGROUND: Poor inter-rater reliability in chest radiograph interpretation has been reported in the context of acute respiratory distress syndrome (ARDS), although not for the Berlin definition of ARDS. We sought to examine the effect of training material on the accuracy and consistency of intensivists' chest radiograph interpretations for ARDS diagnosis. METHODS: We conducted a rater agreement study in which 286 intensivists (residents 41.3%, junior attending physicians 35.3%, and senior attending physicians 23.4%) independently reviewed the same 12 chest radiographs developed by the ARDS Definition Task Force ("the panel") before and after training. Radiographic diagnoses by the panel were classified into the consistent (n = 4), equivocal (n = 4), and inconsistent (n = 4) categories and were used as a reference. The 1.5-hour training course attended by all 286 intensivists included an introduction to the diagnostic rationale and a subsequent in-depth discussion to reach consensus for all 12 radiographs. RESULTS: Overall diagnostic accuracy, which was defined as the percentage of chest radiographs that were interpreted correctly, improved but remained poor after training (42.0 ± 14.8% before training vs. 55.3 ± 23.4% after training, p < 0.001). Diagnostic sensitivity and specificity improved after training for all diagnostic categories (p < 0.001), with the exception of specificity for the equivocal category (p = 0.883). Diagnostic accuracy was higher for the consistent category than for the inconsistent and equivocal categories (p < 0.001). Comparisons of pre-training and post-training results revealed that inter-rater agreement was poor and did not improve after training, as assessed by overall agreement (0.450 ± 0.406 vs. 0.461 ± 0.575, p = 0.792), Fleiss's kappa (0.133 ± 0.575 vs. 0.178 ± 0.710, p = 0.405), and intraclass correlation coefficient (ICC; 0.219 vs. 0.276, p = 0.470). CONCLUSIONS: The radiographic diagnostic accuracy and inter-rater agreement were poor when the Berlin radiographic definition was used, and were not significantly improved by the training set of chest radiographs developed by the ARDS Definition Task Force. TRIAL REGISTRATION: The study was registered at ClinicalTrials.gov (registration number NCT01704066) on 6 October 2012.
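For readers unfamiliar with the Fleiss's kappa reported in this abstract, the sketch below shows the standard computation from a table of per-item category counts. The function name `fleiss_kappa` and the example counts are hypothetical and are not data from the study; it assumes every item was rated by the same number of raters.

```python
def fleiss_kappa(counts):
    """Fleiss's kappa for a table where counts[i][j] is the number of
    raters who assigned item i to category j."""
    n_items = len(counts)
    if n_items == 0:
        raise ValueError("no items")
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    n_categories = len(counts[0])

    # Per-item agreement: share of rater pairs that chose the same category.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    mean_agreement = sum(per_item) / n_items

    # Chance agreement from the overall share of ratings in each category.
    shares = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    chance = sum(p * p for p in shares)
    if chance == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (mean_agreement - chance) / (1.0 - chance)


# Hypothetical example: 4 radiographs, 3 raters, categories (ARDS, not ARDS).
print(fleiss_kappa([[2, 1], [1, 2], [2, 1], [1, 2]]))
```

Values near zero or below, as in this toy table, mean the raters agree no more often than chance would predict.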


Subject(s)
Clinical Competence/standards , Radiography, Thoracic/methods , /diagnosis , Teaching/standards , Clinical Competence/statistics & numerical data , Female , Humans , Male , Observer Variation , Prospective Studies , Radiography, Thoracic/statistics & numerical data , Reproducibility of Results , /diagnostic imaging , Teaching/statistics & numerical data
20.
Eur J Nucl Med Mol Imaging ; 44(5): 850-857, 2017 May.
Article in English | MEDLINE | ID: mdl-27966045

ABSTRACT

PURPOSE: The aim of this study was to assess the inter-rater variability of the visual interpretation of 11C-PiB PET images regarding the positivity/negativity of amyloid deposition that were obtained in a multicenter clinical research project, the Japanese Alzheimer's Disease Neuroimaging Initiative (J-ADNI). The results of visual interpretation were also compared with a semi-automatic quantitative analysis using the mean cortical standardized uptake value ratio to the cerebellar cortex (mcSUVR). METHODS: A total of 162 11C-PiB PET scans, including 45 mild Alzheimer's disease, 60 mild cognitive impairment, and 57 normal cognitive control cases that had been acquired as J-ADNI baseline scans, were analyzed. Based on visual interpretation by three independent raters followed by a consensus read, each case was classified as positive, equivocal, or negative for deposition (ternary criteria) and further dichotomized by merging the former two (binary criteria). RESULTS: Complete agreement of visual interpretation by the three raters was observed for 91.3% of the cases (Cohen κ = 0.88 on average) in ternary criteria and for 92.3% (κ = 0.89) in binary criteria. Cases that were interpreted as visually positive in the consensus read showed significantly higher mcSUVR than those visually negative (2.21 ± 0.37 vs. 1.27 ± 0.09, p < 0.001), and positive or negative decision by visual interpretation was dichotomized by a cut-off value of mcSUVR = 1.5. Significant positive/negative associations were observed between mcSUVR and the number of raters who evaluated as positive (ρ = 0.87, p < 0.0001) and negative (ρ = -0.85, p < 0.0001) interpretation. Cases of disagreement among raters showed generally low mcSUVR. CONCLUSIONS: Inter-rater agreement was almost perfect in 11C-PiB PET scans. Positive or negative decision by visual interpretation was dichotomized by a cut-off value of mcSUVR = 1.5. As some cases of disagreement among raters tended to show low mcSUVR, referring to a quantitative method may facilitate correct diagnosis when evaluating images of low amyloid deposition.
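The ternary and binary agreement figures in this abstract can be illustrated with a short sketch. It assumes, as one plausible reading of "Cohen κ on average", that kappa is computed for each pair of raters and then averaged; the abstract does not confirm this, and the function names and labels below are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa for two raters' labels over the same cases."""
    if len(a) != len(b) or not a:
        raise ValueError("both raters must label the same non-empty case list")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    freq_a, freq_b = Counter(a), Counter(b)
    # Chance agreement from each rater's own label frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use one label")
    return (observed - expected) / (1.0 - expected)


def binarize(labels):
    """Merge 'positive' and 'equivocal' into 'positive' (binary criteria)."""
    return ["positive" if x in ("positive", "equivocal") else "negative"
            for x in labels]


def mean_pairwise_kappa(raters):
    """Average Cohen's kappa over every pair of raters (an assumed reading)."""
    pairs = list(combinations(raters, 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)


# Hypothetical reads of four scans by two raters.
r1 = ["positive", "negative", "equivocal", "negative"]
r2 = ["positive", "negative", "positive", "negative"]
print(cohen_kappa(r1, r2))                      # ternary criteria
print(cohen_kappa(binarize(r1), binarize(r2)))  # binary criteria
```

In this toy example, merging the equivocal reads into the positive class removes the only disagreement, which mirrors why the binary figure in the abstract is slightly higher than the ternary one.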


Subject(s)
Alzheimer Disease/diagnostic imaging , Benzothiazoles , Image Interpretation, Computer-Assisted/methods , Neuroimaging , Positron-Emission Tomography , Aniline Compounds , Consensus , Female , Humans , Male , Middle Aged , Observer Variation , Thiazoles