Results 1 - 8 of 8
1.
Radiol Artif Intell ; 6(2): e230205, 2024 Mar.
Article in English | MEDLINE | ID: mdl-38265301

ABSTRACT

This study evaluated the ability of generative large language models (LLMs) to detect speech recognition errors in radiology reports. A dataset of 3233 CT and MRI reports was assessed by radiologists for speech recognition errors. Errors were categorized as clinically significant or not clinically significant. The performance of five generative LLMs (GPT-3.5-turbo, GPT-4, text-davinci-003, Llama-v2-70B-chat, and Bard) in detecting these errors was compared, using manual error detection as the reference standard. Prompt engineering was used to optimize model performance. GPT-4 demonstrated high accuracy in detecting clinically significant errors (precision, 76.9%; recall, 100%; F1 score, 86.9%) and not clinically significant errors (precision, 93.9%; recall, 94.7%; F1 score, 94.3%). Text-davinci-003 achieved F1 scores of 72% and 46.6% for clinically significant and not clinically significant errors, respectively. GPT-3.5-turbo obtained F1 scores of 59.1% and 32.2%, while Llama-v2-70B-chat scored 72.8% and 47.7%. Bard showed the lowest accuracy, with F1 scores of 47.5% and 20.9%. GPT-4 effectively identified challenging errors, including nonsense phrases and internally inconsistent statements. Longer reports, resident dictation, and overnight shifts were associated with higher error rates. In conclusion, advanced generative LLMs show potential for automatic detection of speech recognition errors in radiology reports. Keywords: CT, Large Language Model, Machine Learning, MRI, Natural Language Processing, Radiology Reports, Speech, Unsupervised Learning. Supplemental material is available for this article.
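The F1 scores reported above are the harmonic mean of the precision and recall figures; a minimal sketch of that calculation (the function name is illustrative, not from the study):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# GPT-4 on clinically significant errors: precision 76.9%, recall 100%
print(round(100 * f1_score(0.769, 1.0), 1))  # 86.9, matching the reported F1 score
```

The same formula reproduces the other reported F1 values from their precision/recall pairs.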


Subject(s)
Camelids, New World , Radiology Information Systems , Radiology , Speech Perception , Animals , Speech , Speech Recognition Software , Reproducibility of Results
2.
Eur Radiol ; 34(2): 810-822, 2024 Feb.
Article in English | MEDLINE | ID: mdl-37606663

ABSTRACT

OBJECTIVES: Non-contrast computed tomography of the brain (NCCTB) is commonly used to detect intracranial pathology but is subject to interpretation errors. Machine learning can augment clinical decision-making and improve NCCTB scan interpretation. This retrospective detection accuracy study assessed the performance of radiologists assisted by a deep learning model and compared the standalone performance of the model with that of unassisted radiologists. METHODS: A deep learning model was trained on 212,484 NCCTB scans drawn from a private radiology group in Australia. Scans from inpatient, outpatient, and emergency settings were included. Scan inclusion criteria were age ≥ 18 years and series slice thickness ≤ 1.5 mm. Thirty-two radiologists reviewed 2848 scans with and without the assistance of the deep learning system and rated their confidence in the presence of each finding using a 7-point scale. Differences in AUC and Matthews correlation coefficient (MCC) were calculated using a ground-truth gold standard. RESULTS: The model demonstrated an average area under the receiver operating characteristic curve (AUC) of 0.93 across 144 NCCTB findings and significantly improved radiologist interpretation performance. Assisted and unassisted radiologists demonstrated an average AUC of 0.79 and 0.73 across 22 grouped parent findings and 0.72 and 0.68 across 189 child findings, respectively. When assisted by the model, radiologist AUC was significantly improved for 91 findings (158 findings were non-inferior), and reading time was significantly reduced. CONCLUSIONS: The assistance of a comprehensive deep learning model significantly improved radiologist detection accuracy across a wide range of clinical findings and demonstrated the potential to improve NCCTB interpretation. CLINICAL RELEVANCE STATEMENT: This study evaluated a comprehensive CT brain deep learning model, which performed strongly, improved the performance of radiologists, and reduced interpretation time. 
The model may reduce errors, improve efficiency, facilitate triage, and better enable the delivery of timely patient care.
KEY POINTS:
• This study demonstrated that the use of a comprehensive deep learning system assisted radiologists in the detection of a wide range of abnormalities on non-contrast brain computed tomography scans.
• The deep learning model demonstrated an average area under the receiver operating characteristic curve of 0.93 across 144 findings and significantly improved radiologist interpretation performance.
• The assistance of the comprehensive deep learning model significantly reduced the time required for radiologists to interpret computed tomography scans of the brain.
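The study compares readers via differences in AUC and the Matthews correlation coefficient (MCC). A minimal sketch of the MCC computed from confusion-matrix counts (the counts below are illustrative only, not data from the study):

```python
import math

def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom

# Illustrative counts only (not from the study)
print(round(matthews_corrcoef(tp=90, tn=85, fp=15, fn=10), 3))  # 0.751
```

Unlike accuracy, the MCC stays informative under the class imbalance typical of per-finding radiology labels, which is presumably why the study reports it alongside AUC.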


Subject(s)
Deep Learning , Adolescent , Humans , Radiography , Radiologists , Retrospective Studies , Tomography, X-Ray Computed/methods , Adult
3.
Diagnostics (Basel) ; 13(14)2023 Jul 09.
Article in English | MEDLINE | ID: mdl-37510062

ABSTRACT

This retrospective case-control study evaluated the diagnostic performance of a commercially available chest radiography deep convolutional neural network (DCNN) in identifying the presence and position of central venous catheters, enteric tubes, and endotracheal tubes, in addition to a subgroup analysis of different types of lines/tubes. A held-out test dataset of 2568 studies was sourced from community radiology clinics and hospitals in Australia and the USA, and was then ground-truth labelled for the presence, position, and type of line or tube from the consensus of a thoracic specialist radiologist and an intensive care clinician. DCNN model performance for identifying and assessing the positioning of central venous catheters, enteric tubes, and endotracheal tubes over the entire dataset, as well as within each subgroup, was evaluated. The area under the receiver operating characteristic curve (AUC) was assessed. The DCNN algorithm displayed high performance in detecting the presence of lines and tubes in the test dataset with AUCs > 0.99, and good position classification performance over a subpopulation of ground truth positive cases with AUCs of 0.86-0.91. The subgroup analysis showed that model performance was robust across the various subtypes of lines or tubes, although position classification performance of peripherally inserted central catheters was relatively lower. Our findings indicated that the DCNN algorithm performed well in the detection and position classification of lines and tubes, supporting its use as an assistant for clinicians. Further work is required to evaluate performance in rarer scenarios, as well as in less common subgroups.
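The AUC reported here can be read as the probability that a randomly chosen positive case receives a higher model score than a randomly chosen negative case. A minimal pairwise-counting sketch of that definition (fine for small samples; an illustration, not the study's evaluation code):

```python
def auc(pos_scores, neg_scores):
    """AUC as the probability a positive outscores a negative (ties count half)."""
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos_scores
        for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))

# Illustrative scores only
print(auc([0.9, 0.8, 0.7], [0.6, 0.7, 0.2]))
```

For large datasets a rank-based implementation (e.g. the Mann-Whitney U statistic) computes the same quantity in O(n log n).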

4.
Diagnostics (Basel) ; 13(4)2023 Feb 15.
Article in English | MEDLINE | ID: mdl-36832231

ABSTRACT

Limitations of the chest X-ray (CXR) have resulted in attempts to create machine learning systems to assist clinicians and improve interpretation accuracy. An understanding of the capabilities and limitations of modern machine learning systems is necessary for clinicians as these tools begin to permeate practice. This systematic review aimed to provide an overview of machine learning applications designed to facilitate CXR interpretation. A systematic search strategy was executed to identify research into machine learning algorithms capable of detecting >2 radiographic findings on CXRs published between January 2020 and September 2022. Model details and study characteristics, including risk of bias and quality, were summarized. Initially, 2248 articles were retrieved, with 46 included in the final review. Published models demonstrated strong standalone performance and were typically as accurate, or more accurate, than radiologists or non-radiologist clinicians. Multiple studies demonstrated an improvement in the clinical finding classification performance of clinicians when models acted as a diagnostic assistance device. Device performance was compared with that of clinicians in 30% of studies, while effects on clinical perception and diagnosis were evaluated in 19%. Only one study was prospectively run. On average, 128,662 images were used to train and validate models. Most classified less than eight clinical findings, while the three most comprehensive models classified 54, 72, and 124 findings. This review suggests that machine learning devices designed to facilitate CXR interpretation perform strongly, improve the detection performance of clinicians, and improve the efficiency of radiology workflow. Several limitations were identified, and clinician involvement and expertise will be key to driving the safe implementation of quality CXR machine learning systems.

5.
JAMA Netw Open ; 5(12): e2247172, 2022 12 01.
Article in English | MEDLINE | ID: mdl-36520432

ABSTRACT

Importance: Early detection of pneumothorax, most often via chest radiography, can help determine need for emergent clinical intervention. The ability to accurately detect and rapidly triage pneumothorax with an artificial intelligence (AI) model could assist with earlier identification and improve care. Objective: To compare the accuracy of an AI model vs consensus thoracic radiologist interpretations in detecting any pneumothorax (incorporating both nontension and tension pneumothorax) and tension pneumothorax. Design, Setting, and Participants: This diagnostic study was a retrospective standalone performance assessment using a data set of 1000 chest radiographs captured between June 1, 2015, and May 31, 2021. The radiographs were obtained from patients aged at least 18 years at 4 hospitals in the Mass General Brigham hospital network in the United States. Included radiographs were selected using 2 strategies from all chest radiography performed at the hospitals, including inpatient and outpatient. The first strategy identified consecutive radiographs with pneumothorax through a manual review of radiology reports, and the second strategy identified consecutive radiographs with tension pneumothorax using natural language processing. For both strategies, negative radiographs were selected by taking the next negative radiograph acquired from the same radiography machine as each positive radiograph. The final data set was an amalgamation of these processes. Each radiograph was interpreted independently by up to 3 radiologists to establish consensus ground-truth interpretations. Each radiograph was then interpreted by the AI model for the presence of pneumothorax and tension pneumothorax. This study was conducted between July and October 2021, with the primary analysis performed between October and November 2021. 
Main Outcomes and Measures: The primary end points were the areas under the receiver operating characteristic curves (AUCs) for the detection of pneumothorax and tension pneumothorax. The secondary end points were the sensitivities and specificities for the detection of pneumothorax and tension pneumothorax. Results: The final analysis included radiographs from 985 patients (mean [SD] age, 60.8 [19.0] years; 436 [44.3%] female patients), including 307 patients with nontension pneumothorax, 128 patients with tension pneumothorax, and 550 patients without pneumothorax. The AI model detected any pneumothorax with an AUC of 0.979 (95% CI, 0.970-0.987), sensitivity of 94.3% (95% CI, 92.0%-96.3%), and specificity of 92.0% (95% CI, 89.6%-94.2%) and tension pneumothorax with an AUC of 0.987 (95% CI, 0.980-0.992), sensitivity of 94.5% (95% CI, 90.6%-97.7%), and specificity of 95.3% (95% CI, 93.9%-96.6%). Conclusions and Relevance: These findings suggest that the assessed AI model accurately detected pneumothorax and tension pneumothorax in this chest radiograph data set. The model's use in the clinical workflow could lead to earlier identification and improved care for patients with pneumothorax.
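Sensitivity and specificity, the secondary end points above, come directly from confusion-matrix counts. A minimal sketch; the counts below are illustrative reconstructions chosen to echo the cohort sizes in the abstract (435 pneumothorax cases, 550 negatives), not the study's raw data:

```python
def sensitivity_specificity(tp: int, fn: int, tn: int, fp: int):
    """Sensitivity = TP/(TP+FN); specificity = TN/(TN+FP)."""
    return tp / (tp + fn), tn / (tn + fp)

# Illustrative counts: 410 of 435 pneumothorax cases flagged, 506 of 550 negatives cleared
sens, spec = sensitivity_specificity(tp=410, fn=25, tn=506, fp=44)
print(round(100 * sens, 1), round(100 * spec, 1))  # 94.3 92.0
```

The 95% CIs quoted in the abstract would then typically come from bootstrapping or an exact binomial interval on these proportions.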


Subject(s)
Deep Learning , Pneumothorax , Humans , Female , Adolescent , Adult , Middle Aged , Male , Pneumothorax/diagnostic imaging , Radiography, Thoracic , Artificial Intelligence , Retrospective Studies , Radiography
6.
Sci Data ; 8(1): 285, 2021 10 28.
Article in English | MEDLINE | ID: mdl-34711836

ABSTRACT

Correct catheter position is crucial to ensuring appropriate catheter function and avoiding complications. This paper describes a dataset consisting of 50,612 image-level labels and 17,999 manually labelled annotations, drawn from 30,083 chest radiographs in the publicly available NIH ChestXRay14 dataset, with manually annotated and segmented endotracheal tubes (ETT), nasoenteric tubes (NET), and central venous catheters (CVCs).


Subject(s)
Catheterization , Radiography, Thoracic , Thorax/diagnostic imaging , Catheters , Central Venous Catheters , Humans , Intubation, Gastrointestinal , Intubation, Intratracheal
7.
Lancet Digit Health ; 3(8): e496-e506, 2021 08.
Article in English | MEDLINE | ID: mdl-34219054

ABSTRACT

BACKGROUND: Chest x-rays are widely used in clinical practice; however, interpretation can be hindered by human error and a lack of experienced thoracic radiologists. Deep learning has the potential to improve the accuracy of chest x-ray interpretation. We therefore aimed to assess the accuracy of radiologists with and without the assistance of a deep-learning model. METHODS: In this retrospective study, a deep-learning model was trained on 821 681 images (284 649 patients) from five data sets from Australia, Europe, and the USA. 2568 enriched chest x-ray cases from adult patients (≥16 years) who had at least one frontal chest x-ray were included in the test dataset; cases were representative of inpatient, outpatient, and emergency settings. 20 radiologists reviewed cases with and without the assistance of the deep-learning model with a 3-month washout period. We assessed the change in accuracy of chest x-ray interpretation across 127 clinical findings when the deep-learning model was used as a decision support by calculating area under the receiver operating characteristic curve (AUC) for each radiologist with and without the deep-learning model. We also compared AUCs for the model alone with those of unassisted radiologists. If the lower bound of the adjusted 95% CI of the difference in AUC between the model and the unassisted radiologists was more than -0·05, the model was considered to be non-inferior for that finding. If the lower bound exceeded 0, the model was considered to be superior. FINDINGS: Unassisted radiologists had a macroaveraged AUC of 0·713 (95% CI 0·645-0·785) across the 127 clinical findings, compared with 0·808 (0·763-0·839) when assisted by the model. The deep-learning model statistically significantly improved the classification accuracy of radiologists for 102 (80%) of 127 clinical findings, was statistically non-inferior for 19 (15%) findings, and no findings showed a decrease in accuracy when radiologists used the deep-learning model. 
Unassisted radiologists had a macroaveraged mean AUC of 0·713 (0·645-0·785) across all findings, compared with 0·957 (0·954-0·959) for the model alone. Model classification alone was significantly more accurate than unassisted radiologists for 117 (94%) of 124 clinical findings predicted by the model and was non-inferior to unassisted radiologists for all other clinical findings. INTERPRETATION: This study shows the potential of a comprehensive deep-learning model to improve chest x-ray interpretation across a large breadth of clinical practice. FUNDING: Annalise.ai.
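The superiority/non-inferiority calls in this abstract follow a simple decision rule on the lower bound of the adjusted 95% CI of the AUC difference, with a margin of -0.05. A minimal sketch encoding that rule as stated:

```python
def classify_difference(ci_lower: float, margin: float = -0.05) -> str:
    """Classify a model-vs-reader AUC difference by the lower bound of its
    adjusted 95% CI: > 0 means superior; > margin means non-inferior."""
    if ci_lower > 0:
        return "superior"
    if ci_lower > margin:
        return "non-inferior"
    return "not non-inferior"

print(classify_difference(0.02))   # superior
print(classify_difference(-0.03))  # non-inferior
print(classify_difference(-0.08))  # not non-inferior
```

The numeric CI bounds passed in above are illustrative; the study reports only the resulting counts per finding.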


Subject(s)
Deep Learning , Mass Screening/methods , Models, Biological , Radiographic Image Interpretation, Computer-Assisted , Radiography, Thoracic , X-Rays , Adolescent , Adult , Aged , Aged, 80 and over , Area Under Curve , Artificial Intelligence , Female , Humans , Infections/diagnosis , Infections/diagnostic imaging , Male , Middle Aged , ROC Curve , Radiologists , Retrospective Studies , Thoracic Injuries/diagnosis , Thoracic Injuries/diagnostic imaging , Thoracic Neoplasms/diagnosis , Thoracic Neoplasms/diagnostic imaging , Young Adult
8.
Radiology ; 290(2): 514-522, 2019 02.
Article in English | MEDLINE | ID: mdl-30398431

ABSTRACT

Purpose To examine Generative Visual Rationales (GVRs) as a tool for visualizing neural network learning of chest radiograph features in congestive heart failure (CHF). Materials and Methods A total of 103 489 frontal chest radiographs in 46 712 patients acquired from January 1, 2007, to December 31, 2016, were divided into a labeled data set (with B-type natriuretic peptide [BNP] result as a marker of CHF) and unlabeled data set (without BNP result). A generative model was trained on the unlabeled data set, and a neural network was trained on the encoded representations of the labeled data set to estimate BNP. The model was used to visualize how a radiograph with high estimated BNP would look without disease (a "healthy" radiograph). An overfitted model was developed for comparison, and 100 GVRs were blindly assessed by two experts for features of CHF. Area under the receiver operating characteristic curve (AUC), κ coefficient, and mixed-effects logistic regression were used for statistical analyses. Results At a cutoff BNP of 100 ng/L as a marker of CHF, the correctly trained model achieved an AUC of 0.82. Assessment of GVRs revealed that the correctly trained model highlighted conventional radiographic features of CHF as reasons for an elevated BNP prediction more frequently than the overfitted model, including cardiomegaly (153 [76.5%] of 200 vs 64 [32%] of 200, respectively; P < .001) and pleural effusions (47 [23.5%] of 200 vs 16 [8%] of 200, respectively; P = .003). Conclusion Features of congestive heart failure on chest radiographs learned by neural networks can be identified using Generative Visual Rationales, enabling detection of bias and overfitted models. © RSNA, 2018 See also the editorial by Ngo in this issue.
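The abstract reports a κ coefficient for agreement between the two expert assessors. A minimal two-rater Cohen's kappa sketch (the ratings below are illustrative, not data from the study):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1:
        return 1.0
    return (observed - expected) / (1 - expected)

# Illustrative ratings (1 = CHF feature judged present), not study data
a = [1, 1, 0, 1, 0, 0, 1, 0]
b = [1, 0, 0, 1, 0, 1, 1, 0]
print(round(cohens_kappa(a, b), 3))  # 0.5
```

Kappa corrects raw percent agreement for the agreement expected by chance given each rater's marginal label frequencies.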


Subject(s)
Heart Failure/diagnostic imaging , Neural Networks, Computer , Radiographic Image Interpretation, Computer-Assisted/methods , Radiography, Thoracic/methods , Adolescent , Adult , Aged , Aged, 80 and over , Child , Child, Preschool , Databases, Factual , Female , Heart Failure/blood , Humans , Infant , Infant, Newborn , Male , Middle Aged , Natriuretic Peptide, Brain/blood , ROC Curve , Thorax/diagnostic imaging , Young Adult