

Success and failure of human-AI collaboration in clinical reasoning: An experimental study on challenging real-world cases

DC Field Value Language
dc.contributor.authorOng, Kai Tzu-iunn-
dc.contributor.authorSeo, Junwon-
dc.contributor.authorKim, Hyojun-
dc.contributor.authorKim, Jiwoo-
dc.contributor.authorKim, Jihoon-
dc.contributor.authorKim, Sunghwan-
dc.contributor.authorYeo, Jinyoung-
dc.contributor.authorChoi, Eun Young-
dc.contributor.author최은영-
dc.date.accessioned2026-03-25T03:10:36Z-
dc.date.available2026-03-25T03:10:36Z-
dc.date.created2026-03-20-
dc.date.issued2026-05-
dc.identifier.issn1386-5056-
dc.identifier.urihttps://ir.ymlib.yonsei.ac.kr/handle/22282913/211452-
dc.description.abstractBackground: While conversational human-AI collaboration (HAC) using large language models (LLMs) has shown potential to enhance clinical reasoning, its effectiveness in highly specialized and challenging clinical scenarios remains unclear. This study aimed to evaluate the effectiveness of HAC and to analyze the causes of its success and failure. Methods: A crossover experimental study was conducted using 30 challenging cases from JAMA Ophthalmology. Thirty participants (10 board-certified ophthalmologists, 10 ophthalmology residents, and 10 senior medical students) completed the cases under two conditions: independent work (human-only) and collaboration through free-text conversation with Claude-3.5-Sonnet (HAC). Performance accuracy, self-rated confidence, and cognitive burden were assessed. HAC interaction logs were analyzed to evaluate the appropriateness of the LLM's accepting and arguing behaviors, which were categorized into six patterns. Sliding paired t-tests across incremental thresholds were used to assess how accuracy gains from HAC varied by task difficulty. Results: HAC significantly improved mean accuracy compared to the human-only condition (from 0.45 to 0.60, P < 0.001), although 20% of participants showed a decline in performance and the mean remained below the LLM-only accuracy (0.70). HAC significantly increased confidence and reduced cognitive burden (both P < 0.001) in both successful and failed HAC. The appropriateness of LLM behaviors was substantially higher in successful HAC than in failed HAC (F1 score = 0.92 vs. 0.29, P < 0.001). In successful HAC, 92.6% followed the pattern LLM presents correct insight/human accepts, while 58.6% of failures involved LLM presents incorrect insight/human accepts. HAC improved accuracy significantly in tasks where the human-only correct response rate exceeded 47% (P < 0.05), but not below 30% (P >= 0.188). 
Conclusions: These findings suggest that HAC benefits complex clinical decisions in ophthalmology but remains limited by human, model, and task-level factors requiring further improvement.-
dc.languageEnglish-
dc.publisherElsevier Science Ireland Ltd.-
dc.relation.isPartOfINTERNATIONAL JOURNAL OF MEDICAL INFORMATICS-
dc.subject.MESHAdult-
dc.subject.MESHClinical Reasoning*-
dc.subject.MESHCooperative Behavior*-
dc.subject.MESHCross-Over Studies-
dc.subject.MESHFemale-
dc.subject.MESHHumans-
dc.subject.MESHMale-
dc.titleSuccess and failure of human-AI collaboration in clinical reasoning: An experimental study on challenging real-world cases-
dc.typeArticle-
dc.contributor.googleauthorOng, Kai Tzu-iunn-
dc.contributor.googleauthorSeo, Junwon-
dc.contributor.googleauthorKim, Hyojun-
dc.contributor.googleauthorKim, Jiwoo-
dc.contributor.googleauthorKim, Jihoon-
dc.contributor.googleauthorKim, Sunghwan-
dc.contributor.googleauthorYeo, Jinyoung-
dc.contributor.googleauthorChoi, Eun Young-
dc.identifier.doi10.1016/j.ijmedinf.2026.106342-
dc.relation.journalcodeJ01129-
dc.identifier.eissn1872-8243-
dc.identifier.pmid41689881-
dc.subject.keywordhuman-AI collaboration-
dc.subject.keywordClinical reasoning-
dc.subject.keywordOphthalmology-
dc.subject.keywordLarge language model-
dc.subject.keywordConfidence-
dc.subject.keywordCognitive burden-
dc.subject.keywordModel behaviors-
dc.subject.keywordTask difficulty-
dc.contributor.affiliatedAuthorSeo, Junwon-
dc.contributor.affiliatedAuthorKim, Jiwoo-
dc.contributor.affiliatedAuthorChoi, Eun Young-
dc.identifier.scopusid2-s2.0-105029904759-
dc.identifier.wosid001702499200001-
dc.citation.volume211-
dc.identifier.bibliographicCitationINTERNATIONAL JOURNAL OF MEDICAL INFORMATICS, Vol.211, 2026-05-
dc.identifier.rimsid91992-
dc.type.rimsART-
dc.description.journalClass1-
dc.subject.keywordAuthorhuman-AI collaboration-
dc.subject.keywordAuthorClinical reasoning-
dc.subject.keywordAuthorOphthalmology-
dc.subject.keywordAuthorLarge language model-
dc.subject.keywordAuthorConfidence-
dc.subject.keywordAuthorCognitive burden-
dc.subject.keywordAuthorModel behaviors-
dc.subject.keywordAuthorTask difficulty-
dc.subject.keywordPlusOVERCONFIDENCE-
dc.type.docTypeArticle-
dc.description.isOpenAccessY-
dc.description.journalRegisteredClassscie-
dc.description.journalRegisteredClassscopus-
dc.relation.journalWebOfScienceCategoryComputer Science, Information Systems-
dc.relation.journalWebOfScienceCategoryHealth Care Sciences & Services-
dc.relation.journalWebOfScienceCategoryMedical Informatics-
dc.relation.journalResearchAreaComputer Science-
dc.relation.journalResearchAreaHealth Care Sciences & Services-
dc.relation.journalResearchAreaMedical Informatics-
dc.identifier.articleno106342-
Appears in Collections:
1. College of Medicine (의과대학) > Dept. of Ophthalmology (안과학교실) > 1. Journal Papers


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.