Comparative analysis of multimodal large language models GPT-4o and o1 versus clinicians in clinical case challenge questions: Retrospective cross-sectional study
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.author | Jung, Jaewon | - |
| dc.contributor.author | Kim, Hyunjae | - |
| dc.contributor.author | Bae, SungA | - |
| dc.contributor.author | Park, Jin Young | - |
| dc.date.accessioned | 2026-03-11T00:17:26Z | - |
| dc.date.available | 2026-03-11T00:17:26Z | - |
| dc.date.created | 2026-03-09 | - |
| dc.date.issued | 2026-01 | - |
| dc.identifier.issn | 0025-7974 | - |
| dc.identifier.uri | https://ir.ymlib.yonsei.ac.kr/handle/22282913/211092 | - |
| dc.description.abstract | Generative pretrained transformer 4 (GPT-4) has demonstrated strong performance in standardized medical examinations but has limitations in real-world clinical settings. To address these limitations, the multimodal GPT-4o model integrates text and image inputs, and the multimodal o1 model incorporates advanced reasoning. This study compared the performance of GPT-4o and o1 against that of Medscape respondents (majority vote) in real-world clinical case challenges. This retrospective, cross-sectional study used 1426 Medscape case challenge questions from May 2011 to June 2024. Each case included text and images of patient history, physical examinations, diagnostic tests, and imaging studies. Medscape respondents were required to choose 1 answer from among multiple options, with the most frequent response defined as the Medscape respondents' decision. GPT models (3.5 Turbo, 4 Turbo, 4 Omni, and o1) were used to interpret the text and images and generate formatted responses. We compared the performances of the Medscape respondents and GPT models using mixed-effects logistic regression analysis. Medscape respondents (majority vote) achieved an overall accuracy of 85.0%, whereas GPT-4o and o1 demonstrated higher accuracies of 88.4% (P = .005) and 94.3% (P < .001), respectively. In multimodal analysis involving images (n = 917), GPT-4o achieved an accuracy of 88.3% (P = .005), while o1 achieved 93.9% (P < .001), both significantly outperforming Medscape respondents. o1 demonstrated the highest accuracy across all question categories, achieving 92.6% (P < .001) in diagnosis, 97.0% (P < .001) in disease characteristics, 92.6% (P = .002) in examination, and 94.8% (P = .005) in treatment. In terms of medical specialty, o1 achieved 93.6% (P < .001) accuracy in internal medicine, 96.6% (P = .030) in major surgery, 97.3% (P = .030) in psychiatry, and 95.4% (P < .001) in minor specialties, significantly surpassing Medscape respondents. Across 5 trials, GPT-4o and o1 provided the correct answer 5/5 times in 86.2% and 90.7% of the cases, respectively. The GPT-4o and o1 models achieved higher accuracy than Medscape respondents (majority vote) in clinical case evaluations, particularly in disease diagnosis. GPT-4o and o1 could serve as valuable tools to assist healthcare professionals in structured scenarios. | - |
| dc.language | English | - |
| dc.publisher | Lippincott Williams & Wilkins | - |
| dc.relation.isPartOf | MEDICINE | - |
| dc.subject.MESH | Cross-Sectional Studies | - |
| dc.subject.MESH | Diagnosis* | - |
| dc.subject.MESH | Humans | - |
| dc.subject.MESH | Large Language Models* | - |
| dc.subject.MESH | Retrospective Studies | - |
| dc.title | Comparative analysis of multimodal large language models GPT-4o and o1 versus clinicians in clinical case challenge questions: Retrospective cross-sectional study | - |
| dc.type | Article | - |
| dc.contributor.googleauthor | Jung, Jaewon | - |
| dc.contributor.googleauthor | Kim, Hyunjae | - |
| dc.contributor.googleauthor | Bae, SungA | - |
| dc.contributor.googleauthor | Park, Jin Young | - |
| dc.identifier.doi | 10.1097/MD.0000000000047071 | - |
| dc.relation.journalcode | J02214 | - |
| dc.identifier.eissn | 1536-5964 | - |
| dc.identifier.pmid | 41578521 | - |
| dc.subject.keyword | artificial intelligence | - |
| dc.subject.keyword | clinical decision-making | - |
| dc.subject.keyword | diagnostic accuracy | - |
| dc.subject.keyword | GPT-4 omni | - |
| dc.subject.keyword | multimodal large language model | - |
| dc.subject.keyword | o1 | - |
| dc.contributor.affiliatedAuthor | Jung, Jaewon | - |
| dc.contributor.affiliatedAuthor | Kim, Hyunjae | - |
| dc.contributor.affiliatedAuthor | Bae, SungA | - |
| dc.contributor.affiliatedAuthor | Park, Jin Young | - |
| dc.identifier.scopusid | 2-s2.0-105028457347 | - |
| dc.identifier.wosid | 001679956900033 | - |
| dc.citation.volume | 105 | - |
| dc.citation.number | 4 | - |
| dc.identifier.bibliographicCitation | MEDICINE, Vol.105(4), 2026-01 | - |
| dc.identifier.rimsid | 91838 | - |
| dc.type.rims | ART | - |
| dc.description.journalClass | 1 | - |
| dc.subject.keywordAuthor | artificial intelligence | - |
| dc.subject.keywordAuthor | clinical decision-making | - |
| dc.subject.keywordAuthor | diagnostic accuracy | - |
| dc.subject.keywordAuthor | GPT-4 omni | - |
| dc.subject.keywordAuthor | multimodal large language model | - |
| dc.subject.keywordAuthor | o1 | - |
| dc.type.docType | Article | - |
| dc.description.isOpenAccess | Y | - |
| dc.description.journalRegisteredClass | scie | - |
| dc.description.journalRegisteredClass | scopus | - |
| dc.relation.journalWebOfScienceCategory | Medicine, General & Internal | - |
| dc.relation.journalResearchArea | General & Internal Medicine | - |
| dc.identifier.articleno | e47071 | - |