Evaluating diagnostic accuracy of large language models in neuroradiology cases using image inputs from JAMA neurology and JAMA clinical challenges

Albaqshi, Ahmed; Ko, Ji Su; Suh, Chong Hyun; Suh, Pae Sun; Shim, Woo Hyun; Heo, Hwon; Woo, Chang-Yun; Park, Hyungjun

doi:10.1038/s41598-025-06458-z

YUHSpace


DC Field	Value	Language
dc.contributor.author	Albaqshi, Ahmed	-
dc.contributor.author	Ko, Ji Su	-
dc.contributor.author	Suh, Chong Hyun	-
dc.contributor.author	Suh, Pae Sun	-
dc.contributor.author	Shim, Woo Hyun	-
dc.contributor.author	Heo, Hwon	-
dc.contributor.author	Woo, Chang-Yun	-
dc.contributor.author	Park, Hyungjun	-
dc.date.accessioned	2026-01-20T02:39:39Z	-
dc.date.available	2026-01-20T02:39:39Z	-
dc.date.created	2026-01-14	-
dc.date.issued	2025-11	-
dc.identifier.uri	https://ir.ymlib.yonsei.ac.kr/handle/22282913/210001	-
dc.description.abstract	This study assesses the diagnostic performance of six LLMs (GPT-4v, GPT-4o, Gemini 1.5 Pro, Gemini 1.5 Flash, Claude 3.0, and Claude 3.5) on complex neurology cases from JAMA Neurology and JAMA, focusing on their image interpretation abilities. We selected 56 radiology cases from JAMA Neurology and JAMA (May 2015 to April 2024), rephrasing the text and reshuffling the multiple-choice answers. Each LLM processed four input types: original quiz with images, rephrased text with images, rephrased text only, and images only. Model performance was compared with that of three neuroradiologists, and consistency was assessed across five repetitions using Fleiss' kappa. In the image-only condition, LLMs answered six specific questions regarding modality, sequence, contrast, plane, anatomical location, and pathologic location, and their accuracy was evaluated. Claude 3.5 achieved the highest accuracy (80.4%) on the original image and text inputs. Accuracy with the rephrased quiz text plus images ranged from 62.5% (35/56) to 76.8% (43/56). Accuracy with the rephrased quiz text only ranged from 51.8% (29/56) to 76.8% (43/56). LLMs performed on par with first-year fellows (71.4% [40/56]) and surpassed junior faculty (51.8% [29/56]) and second-year fellows (48.2% [27/56]). All LLMs gave highly consistent results across the five repetitions (Fleiss' kappa, 0.860-1.000). In image-only tasks, LLM accuracy in identifying pathologic locations ranged from 21.5% (28/130) to 63.1% (82/130). LLMs exhibit strong diagnostic performance with clinical text, yet their ability to interpret complex radiologic images independently is limited. Further refinement in image analysis is essential for these models to integrate fully into radiologic workflows.	-
dc.language	English	-
dc.publisher	Nature Publishing Group	-
dc.relation.isPartOf	SCIENTIFIC REPORTS	-
dc.subject.MESH	Humans	-
dc.subject.MESH	Jamaica	-
dc.subject.MESH	Language*	-
dc.subject.MESH	Large Language Models	-
dc.subject.MESH	Neuroimaging* / methods	-
dc.subject.MESH	Neurology*	-
dc.title	Evaluating diagnostic accuracy of large language models in neuroradiology cases using image inputs from JAMA neurology and JAMA clinical challenges	-
dc.type	Article	-
dc.contributor.googleauthor	Albaqshi, Ahmed	-
dc.contributor.googleauthor	Ko, Ji Su	-
dc.contributor.googleauthor	Suh, Chong Hyun	-
dc.contributor.googleauthor	Suh, Pae Sun	-
dc.contributor.googleauthor	Shim, Woo Hyun	-
dc.contributor.googleauthor	Heo, Hwon	-
dc.contributor.googleauthor	Woo, Chang-Yun	-
dc.contributor.googleauthor	Park, Hyungjun	-
dc.identifier.doi	10.1038/s41598-025-06458-z	-
dc.relation.journalcode	J02646	-
dc.identifier.eissn	2045-2322	-
dc.identifier.pmid	41309648	-
dc.subject.keyword	Artificial intelligence	-
dc.subject.keyword	Deep learning	-
dc.subject.keyword	Image interpretation, computer-assisted	-
dc.subject.keyword	Neuroimaging	-
dc.contributor.affiliatedAuthor	Suh, Pae Sun	-
dc.identifier.scopusid	2-s2.0-105023653878	-
dc.identifier.wosid	001630276300001	-
dc.citation.volume	15	-
dc.citation.number	1	-
dc.identifier.bibliographicCitation	SCIENTIFIC REPORTS, Vol.15(1), 2025-11	-
dc.identifier.rimsid	90963	-
dc.type.rims	ART	-
dc.description.journalClass	1	-
dc.subject.keywordAuthor	Artificial intelligence	-
dc.subject.keywordAuthor	Deep learning	-
dc.subject.keywordAuthor	Image interpretation, computer-assisted	-
dc.subject.keywordAuthor	Neuroimaging	-
dc.type.docType	Article	-
dc.description.isOpenAccess	Y	-
dc.description.journalRegisteredClass	scie	-
dc.description.journalRegisteredClass	scopus	-
dc.relation.journalWebOfScienceCategory	Multidisciplinary Sciences	-
dc.relation.journalResearchArea	Science & Technology - Other Topics	-
dc.identifier.articleno	43027	-
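
The abstract reports accuracies as percentages with their underlying counts and summarizes repeat-run consistency with Fleiss' kappa. The Python sketch below shows one way such figures could be recomputed. It is illustrative only: the record does not include the authors' analysis code, the function names `accuracy_pct` and `fleiss_kappa` are hypothetical, and the answer labels in the usage note are made-up placeholders, not study data.

```python
from collections import Counter


def accuracy_pct(correct: int, total: int) -> float:
    """Return accuracy as a percentage rounded to one decimal place."""
    return round(100.0 * correct / total, 1)


def fleiss_kappa(ratings: list[list[str]]) -> float:
    """Fleiss' kappa for categorical ratings.

    ratings[i] holds the labels assigned to item i, one per repetition
    (or rater). Every item must have the same number of labels (>= 2).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(row) != n_raters for row in ratings):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    categories = sorted({label for row in ratings for label in row})
    counts = [Counter(row) for row in ratings]

    # Observed agreement: mean over items of the per-item agreement P_i.
    p_i = [
        (sum(c[k] ** 2 for k in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_bar = sum(p_i) / n_items

    # Chance agreement: sum of squared overall category proportions.
    p_j = [sum(c[k] for c in counts) / (n_items * n_raters) for k in categories]
    p_e = sum(p * p for p in p_j)

    if p_e == 1.0:
        # Only one category was ever used; kappa is undefined here, and this
        # sketch (by assumption) reports it as perfect agreement.
        return 1.0
    return (p_bar - p_e) / (1.0 - p_e)
```

For example, `accuracy_pct(43, 56)` reproduces the 76.8% quoted in the abstract, and `fleiss_kappa([["A"] * 5, ["B"] * 5, ["C"] * 5])` returns 1.0, since five repetitions that always agree give perfect agreement.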

Appears in Collections: 1. College of Medicine (의과대학) > Dept. of Radiology (영상의학교실) > 1. Journal Papers
