Large Language Models for CAD-RADS 2.0 Extraction From Semi-Structured Coronary CT Angiography Reports: A Multi-Institutional Study

Min, Dabin; Jin, Kwang Nam; Bang, SangHeum; Kim, Moon Young; Kim, Hack-Lyoung; Jeong, Won Gi; Lee, Hye-Jeong; Beck, Kyongmin Sarah; Hwang, Sung Ho; Kim, Eun Young; Park, Chang Min

doi:10.3348/kjr.2025.0293



Citation: KOREAN JOURNAL OF RADIOLOGY, Vol.26(9) : 817-831, 2025-09

Journal Title: KOREAN JOURNAL OF RADIOLOGY

ISSN: 1229-6929

Issue Date: 2025-09

Keywords: Coronary CT angiography ; CAD-RADS 2.0 ; Information extraction ; Large language model ; Prompting strategy

Abstract:

Objective: To evaluate the accuracy of large language models (LLMs) in extracting Coronary Artery Disease-Reporting and Data System (CAD-RADS) 2.0 components from coronary CT angiography (CCTA) reports, and to assess the impact of prompting strategies.

Materials and Methods: In this multi-institutional study, we collected 319 synthetic, semi-structured CCTA reports from six institutions; synthetic reports were used to protect patient privacy while maintaining clinical relevance. The dataset included 150 reports from a primary institution (100 for instruction development and 50 for internal testing) and 169 reports from five external institutions for external testing. Board-certified radiologists established reference standards following the CAD-RADS 2.0 guidelines for all three components: stenosis severity, plaque burden, and modifiers. Six LLMs (GPT-4, GPT-4o, Claude-3.5-Sonnet, o1-mini, Gemini-1.5-Pro, and DeepSeek-R1-Distill-Qwen-14B) were evaluated using an optimized instruction combined with different prompting strategies: zero-shot or few-shot prompting, each with or without chain-of-thought (CoT) prompting. Accuracy was assessed and compared using McNemar's test.

Results: LLMs demonstrated robust accuracy across all CAD-RADS 2.0 components. Peak stenosis severity accuracies reached 0.980 (48/49; Claude-3.5-Sonnet and o1-mini) in internal testing and 0.946 (158/167; GPT-4o and o1-mini) in external testing. Plaque burden extraction was the most accurate component, with multiple models achieving perfect accuracy (43/43) in internal testing and 0.993 (137/138; GPT-4o and o1-mini) in external testing. Modifier detection demonstrated consistently high accuracy (≥ 0.990) across most models. One open-source model, DeepSeek-R1-Distill-Qwen-14B, showed relatively low accuracy for stenosis severity: 0.898 (44/49) in internal testing and 0.820 (137/167) in external testing. CoT prompting significantly enhanced the accuracy of several models, with GPT-4 showing the most substantial improvements: in external testing, stenosis severity accuracy increased by 0.192 (P < 0.001) and plaque burden accuracy by 0.152 (P < 0.001).

Conclusion: LLMs demonstrated high accuracy in automated extraction of CAD-RADS 2.0 components from semi-structured CCTA reports, particularly when used with CoT prompting.
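
The abstract compares accuracies with McNemar's test, which is a test for paired outcomes: each report is scored under two conditions (for example, the same model with and without CoT prompting), and only the reports on which the two conditions disagree carry information. The following is a minimal Python sketch of the exact (binomial) form of the test, assuming per-report correctness flags are available. The function name `mcnemar_exact` and the example data are hypothetical and are not taken from the study, and the abstract does not state which variant of the test (exact or chi-square) the authors used.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar test on paired per-report correctness flags.

    correct_a, correct_b: equal-length sequences of booleans, one entry per
    report, True when the extraction matched the reference standard.
    Returns (b, c, p_value), where b counts reports correct only under A
    and c counts reports correct only under B.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("paired inputs must have the same length")

    # Discordant pairs: the two conditions disagree on the same report.
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if not x and y)
    n = b + c
    if n == 0:
        # No disagreement at all, so there is no evidence of a difference.
        return b, c, 1.0

    # Under the null hypothesis, each discordant pair is equally likely to
    # favour A or B, so min(b, c) follows Binomial(n, 0.5).
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Made-up illustration (NOT study data): 10 reports scored under two prompts.
with_cot = [True, True, True, True, True, True, True, True, False, False]
without_cot = [True, True, False, False, False, False, False, False, True, False]
print(mcnemar_exact(with_cot, without_cot))  # (6, 1, 0.125)
```

With only seven discordant reports in this toy example, a 6-to-1 split is not significant at the 0.05 level, which illustrates why the test depends on the number of disagreements rather than on the overall accuracies alone.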