Success and failure of human-AI collaboration in clinical reasoning: An experimental study on challenging real-world cases

Authors
Ong, Kai Tzu-iunn; Seo, Junwon; Kim, Hyojun; Kim, Jiwoo; Kim, Jihoon; Kim, Sunghwan; Yeo, Jinyoung; Choi, Eun Young
Citation
 INTERNATIONAL JOURNAL OF MEDICAL INFORMATICS, Vol.211, 2026-05 
Article Number
 106342 
Journal Title
INTERNATIONAL JOURNAL OF MEDICAL INFORMATICS
ISSN
 1386-5056 
Issue Date
2026-05
MeSH
Adult ; Clinical Reasoning* ; Cooperative Behavior* ; Cross-Over Studies ; Female ; Humans ; Male
Keywords
human-AI collaboration ; Clinical reasoning ; Ophthalmology ; Large language model ; Confidence ; Cognitive burden ; Model behaviors ; Task difficulty
Abstract
Background: While conversational human-AI collaboration (HAC) using large language models (LLMs) has shown potential to enhance clinical reasoning, its effectiveness in highly specialized and challenging clinical scenarios remains unclear. This study aimed to evaluate the effectiveness of HAC and to analyze the causes of its success and failure.

Methods: A crossover experimental study was conducted using 30 challenging cases from JAMA Ophthalmology. Thirty participants (10 board-certified ophthalmologists, 10 ophthalmology residents, and 10 senior medical students) completed the cases under two conditions: independent work (human-only) and collaboration through free-text conversation with Claude-3.5-Sonnet (HAC). Performance accuracy, along with self-rated confidence and cognitive burden, was assessed. HAC interaction logs were analyzed to evaluate the appropriateness of the LLM's accepting and arguing behaviors, which were categorized into six patterns. Sliding paired t-tests across incremental thresholds were used to assess how accuracy gains from HAC varied by task difficulty.

Results: HAC significantly improved mean accuracy compared to the human-only condition (from 0.45 to 0.60, P < 0.001), although 20% of participants showed a decline in performance and the mean remained below the LLM-only accuracy (0.70). HAC significantly increased confidence and reduced cognitive burden (both P < 0.001) in both successful and failed HAC. The appropriateness of LLM behaviors was substantially higher in successful HAC than in failed HAC (F1 score = 0.92 vs. 0.29, P < 0.001). In successful HAC, 92.6% followed the pattern "LLM presents correct insight / human accepts," while 58.6% of failures involved "LLM presents incorrect insight / human accepts." HAC improved accuracy significantly in tasks where the human-only correct response rate exceeded 47% (P < 0.05), but not below 30% (P >= 0.188).
Conclusions: These findings suggest that HAC benefits complex clinical decisions in ophthalmology but remains limited by human, model, and task-level factors requiring further improvement.
DOI
10.1016/j.ijmedinf.2026.106342
Appears in Collections:
1. College of Medicine (의과대학) > Dept. of Ophthalmology (안과학교실) > 1. Journal Papers
Yonsei Authors
Choi, Eun Young (최은영) ORCID: https://orcid.org/0000-0002-1668-6452
URI
https://ir.ymlib.yonsei.ac.kr/handle/22282913/211452