Large-Scale Validation of the Feasibility of GPT-4 as a Proofreading Tool for Head CT Reports

Authors
Songsoo Kim; Donghyun Kim; Hyun Joo Shin; Seung Hyun Lee; Yeseul Kang; Sejin Jeong; Jaewoong Kim; Miran Han; Seong-Joon Lee; Joonho Kim; Jungyon Yum; Changho Han; Dukyong Yoon
Citation
RADIOLOGY, Vol. 314(1): e240701, 2025-01
Journal Title
RADIOLOGY
ISSN
0033-8419
Issue Date
2025-01
MeSH
Diagnostic Errors* / prevention & control ; Diagnostic Errors* / statistics & numerical data ; Feasibility Studies ; Head* / diagnostic imaging ; Humans ; Retrospective Studies ; Tomography, X-Ray Computed* / methods
Abstract
Background: The increasing workload of radiologists can lead to burnout and errors in radiology reports. Large language models, such as OpenAI's GPT-4, hold promise as error revision tools for radiology.

Purpose: To test the feasibility of GPT-4 as a proofreading tool by measuring its error detection, reasoning, and revision performance on head CT reports containing varying error types, and to validate its clinical utility by comparison with human readers.

Materials and Methods: A total of 10 300 head CT reports were retrospectively extracted from the Medical Information Mart for Intensive Care III (MIMIC-III) public dataset. Experiment 1 used 300 unaltered reports and 300 versions with deliberately introduced errors: GPT-4 was first optimized on 200 of these reports, and the remaining 400 were used to evaluate error type detection, reasoning, and revision, and to analyze reports with undetected errors. GPT-4 performance was also compared with that of human readers. In experiment 2, the detection performance of GPT-4 was validated on 10 000 unaltered reports deemed error-free by physicians, and false-positive responses were analyzed. A permutation test was used to assess differences in performance.

Results: GPT-4 demonstrated commendable performance in error detection (sensitivity, 84% for interpretive errors and 89% for factual errors), reasoning, and revision. Compared with GPT-4, human readers had lower sensitivity for factual error detection (0.33-0.69 vs 0.89; P = .008 for radiologist 4, P < .001 for the others) and took longer to review each report (82-121 seconds vs 16 seconds, P < .001). Across the 10 000 reports, GPT-4 flagged 96 errors, with a low positive predictive value of 0.05, yet 14% of the false-positive responses were potentially beneficial.

Conclusion: GPT-4 effectively detects, reasons about, and revises errors in radiology reports. While it shows excellent performance in identifying factual errors, its ability to prioritize clinically significant findings is limited. With its strengths and limitations recognized, GPT-4 could serve as a feasible proofreading tool.
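For readers interested in what a setup like the one described in the abstract involves: the core workflow amounts to sending each report to GPT-4 with a fixed proofreading instruction and reading back its judgment. A minimal sketch using the OpenAI Python SDK follows; the system prompt, toy report, and decoding settings are illustrative assumptions, not the authors' optimized configuration.

```python
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

# Illustrative proofreading instruction; the authors' optimized prompt is
# not reproduced in the abstract, so this wording is an assumption.
SYSTEM_PROMPT = (
    "You are a radiology report proofreader. Given a head CT report, "
    "state whether it contains an interpretive or factual error, explain "
    "your reasoning, and provide a revised report if an error is found."
)

# Toy report with a deliberate factual inconsistency between sections.
report = """FINDINGS: No acute intracranial hemorrhage.
IMPRESSION: Acute intracranial hemorrhage."""

response = client.chat.completions.create(
    model="gpt-4",
    temperature=0,  # deterministic output, typical for evaluation settings
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": report},
    ],
)
print(response.choices[0].message.content)
```

The abstract also states that a permutation test was used to assess performance differences. The paper's exact procedure is not given here, so the following is a minimal sketch of a paired permutation test for a difference in sensitivity between two readers scored on the same error-containing reports; the outcome arrays and sample size are simulated for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def paired_permutation_test(a, b, n_perm=10_000):
    """Two-sided paired permutation test for a difference in sensitivity.

    a, b: binary arrays (1 = error detected) for two readers evaluated
    on the same error-containing reports.
    """
    a, b = np.asarray(a), np.asarray(b)
    observed = abs(a.mean() - b.mean())
    count = 0
    for _ in range(n_perm):
        swap = rng.random(a.size) < 0.5   # randomly swap each paired outcome
        pa = np.where(swap, b, a)
        pb = np.where(swap, a, b)
        if abs(pa.mean() - pb.mean()) >= observed:
            count += 1
    return count / n_perm

# Hypothetical outcomes: GPT-4 vs one human reader on 100 reports.
gpt4   = rng.random(100) < 0.89   # ~89% sensitivity, as in the abstract
reader = rng.random(100) < 0.50   # ~50% sensitivity, illustrative only
print(f"P = {paired_permutation_test(gpt4, reader):.3f}")
```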
Full Text
https://pubs.rsna.org/doi/10.1148/radiol.240701
DOI
10.1148/radiol.240701
Appears in Collections:
1. College of Medicine (의과대학) > Dept. of Radiology (영상의학교실) > 1. Journal Papers
1. College of Medicine (의과대학) > Dept. of Biomedical Systems Informatics (의생명시스템정보학교실) > 1. Journal Papers
Yonsei Authors
Shin, Hyun Joo (신현주) ORCID: https://orcid.org/0000-0002-7462-2609
Yoon, Dukyong (윤덕용)
URI
https://ir.ymlib.yonsei.ac.kr/handle/22282913/207043

