Evaluating the Accuracy and Diagnostic Reasoning of Multimodal Large Language Models in Interpreting Neuroradiology Cases From RadioGraphics

Suh, Pae Sun; Ko, Ji Su; Shim, Woo Hyun; Heo, Hwon; Woo, Chang-Yun; Park, Hyungjun; Suh, Chong Hyun

doi:10.3348/kjr.2025.1045

DC Field	Value	Language
dc.contributor.author	Suh, Pae Sun	-
dc.contributor.author	Ko, Ji Su	-
dc.contributor.author	Shim, Woo Hyun	-
dc.contributor.author	Heo, Hwon	-
dc.contributor.author	Woo, Chang-Yun	-
dc.contributor.author	Park, Hyungjun	-
dc.contributor.author	Suh, Chong Hyun	-
dc.date.accessioned	2026-03-26T02:47:48Z	-
dc.date.available	2026-03-26T02:47:48Z	-
dc.date.created	2026-03-20	-
dc.date.issued	2026-03	-
dc.identifier.issn	1229-6929	-
dc.identifier.uri	https://ir.ymlib.yonsei.ac.kr/handle/22282913/211526	-
dc.description.abstract	Objective: To evaluate the accuracy and reasoning capabilities of multimodal large language models compared with those of neuroradiology subspecialty-trained radiologists in neuroradiology case interpretation. Materials and Methods: This experimental study used 401 custom-made radiologic quizzes derived from articles published in RadioGraphics covering neuroradiology and head and neck topics (October 2020 to February 2024). We prompted the GPT-4 Turbo with Vision (GPT-4V), GPT-4 Omni, Gemini Flash, and Claude models to provide the top three differential diagnoses with a rationale and describe examination characteristics such as imaging modality, sequence, use of contrast, image plane, and body part. The temperature was set to either 0 or 1 (T1). Two neuroradiologists answered the same questions. The accuracies of the large language models (LLMs) and the neuroradiologists were compared using generalized estimating equations. Three neuroradiologists assessed the rationale provided by the LLMs for their differential diagnoses using four-point scales, separately for specific lesion locations and imaging findings, and evaluated the presence of hallucinations and the overall acceptability of the responses. Results: Top-3 accuracy (i.e., correct answers present among top-3 differential diagnoses) of LLMs ranged from 29.9% (120 of 401) to 49.4% (198 of 401, obtained with GPT-4V in the T1 setting), while radiologists achieved 80.3% (322 of 401) and 68.3% (274 of 401), respectively (P < 0.001). Regarding the rationale for differential diagnoses, GPT-4V (T1) accurately identified both the specific lesion location and imaging findings in 30.7% (123 of 401) and 12.9% (16 of 124) of cases without textual clinical history. Hallucinations occurred in 4.5% (18 of 401), and only 29.4% (118 of 401) of the LLM-generated analyses were deemed acceptable. GPT-4V (T1) demonstrated high accuracy in identifying the imaging modality (97.4% [800 of 821]) and scanned body parts (92.2% [756 of 820]). Conclusion: LLMs markedly underperformed compared with neuroradiologists and showed unsatisfactory reasoning for their differential diagnoses, with performance declining further in cases without textual input of clinical history. These findings highlight the limitations of current multimodal LLMs in neuroradiological interpretation and their reliance on text input.	-
dc.language	English	-
dc.publisher	Korean Society of Radiology	-
dc.relation.isPartOf	KOREAN JOURNAL OF RADIOLOGY	-
dc.subject.MESH	Clinical Competence	-
dc.subject.MESH	Clinical Reasoning*	-
dc.subject.MESH	Diagnosis, Differential	-
dc.subject.MESH	Humans	-
dc.subject.MESH	Language*	-
dc.subject.MESH	Large Language Models	-
dc.title	Evaluating the Accuracy and Diagnostic Reasoning of Multimodal Large Language Models in Interpreting Neuroradiology Cases From RadioGraphics	-
dc.type	Article	-
dc.contributor.googleauthor	Suh, Pae Sun	-
dc.contributor.googleauthor	Ko, Ji Su	-
dc.contributor.googleauthor	Shim, Woo Hyun	-
dc.contributor.googleauthor	Heo, Hwon	-
dc.contributor.googleauthor	Woo, Chang-Yun	-
dc.contributor.googleauthor	Park, Hyungjun	-
dc.contributor.googleauthor	Suh, Chong Hyun	-
dc.identifier.doi	10.3348/kjr.2025.1045	-
dc.relation.journalcode	J02884	-
dc.identifier.eissn	2005-8330	-
dc.identifier.pmid	41735814	-
dc.subject.keyword	Large language model	-
dc.subject.keyword	Vision capability	-
dc.subject.keyword	Image interpretation	-
dc.subject.keyword	Rationale evaluation	-
dc.contributor.affiliatedAuthor	Suh, Pae Sun	-
dc.identifier.scopusid	2-s2.0-105031191032	-
dc.identifier.wosid	001699173100005	-
dc.citation.volume	27	-
dc.citation.number	3	-
dc.citation.startPage	214	-
dc.citation.endPage	226	-
dc.identifier.bibliographicCitation	KOREAN JOURNAL OF RADIOLOGY, Vol.27(3) : 214-226, 2026-03	-
dc.identifier.rimsid	92046	-
dc.type.rims	ART	-
dc.description.journalClass	1	-
dc.subject.keywordAuthor	Large language model	-
dc.subject.keywordAuthor	Vision capability	-
dc.subject.keywordAuthor	Image interpretation	-
dc.subject.keywordAuthor	Rationale evaluation	-
dc.type.docType	Article	-
dc.identifier.kciid	ART003304465	-
dc.description.isOpenAccess	Y	-
dc.description.journalRegisteredClass	scie	-
dc.description.journalRegisteredClass	scopus	-
dc.description.journalRegisteredClass	kci	-
dc.relation.journalWebOfScienceCategory	Radiology, Nuclear Medicine & Medical Imaging	-
dc.relation.journalResearchArea	Radiology, Nuclear Medicine & Medical Imaging	-
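
The percentages in the abstract are each reported alongside a raw count, so they can be recomputed as a consistency check. The following is a minimal sketch, not part of the study: the dictionary labels, the `percent` helper, and the rounding tolerance are assumptions chosen for illustration; only the count pairs and percentages come from the abstract.

```python
# Recompute each percentage in the abstract from its reported count.
# Labels and helper names are illustrative assumptions, not from the paper.

def percent(numerator: int, denominator: int) -> float:
    """Return numerator/denominator as a percentage."""
    return 100.0 * numerator / denominator

# (numerator, denominator, percentage as printed in the abstract)
reported = {
    "LLM top-3 accuracy, lowest": (120, 401, 29.9),
    "LLM top-3 accuracy, highest (GPT-4V, T1)": (198, 401, 49.4),
    "Radiologist 1 top-3 accuracy": (322, 401, 80.3),
    "Radiologist 2 top-3 accuracy": (274, 401, 68.3),
    "Location and findings both correct": (123, 401, 30.7),
    "Same, cases without clinical history": (16, 124, 12.9),
    "Hallucinations": (18, 401, 4.5),
    "Acceptable analyses": (118, 401, 29.4),
    "Imaging modality identified": (800, 821, 97.4),
    "Body part identified": (756, 820, 92.2),
}

# A reported value rounded to one decimal place can differ from the exact
# value by at most 0.05 percentage points; a hair of slack is added for
# floating-point error.
TOLERANCE = 0.05 + 1e-9

mismatches = [
    label
    for label, (num, den, stated) in reported.items()
    if abs(percent(num, den) - stated) > TOLERANCE
]

for label, (num, den, stated) in reported.items():
    print(f"{label}: {num}/{den} = {percent(num, den):.1f}% (stated {stated}%)")
print("Mismatches:", mismatches)
```

Under these assumptions every stated percentage agrees with its count to one decimal place, so `mismatches` is empty.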

Appears in Collections: 1. College of Medicine (의과대학) > Dept. of Radiology (영상의학교실) > 1. Journal Papers
