Success and failure of human-AI collaboration in clinical reasoning: An experimental study on challenging real-world cases

Authors
Ong, Kai Tzu-iunn; Seo, Junwon; Kim, Hyojun; Kim, Jiwoo; Kim, Jihoon; Kim, Sunghwan; Yeo, Jinyoung; Choi, Eun Young
Citation
 INTERNATIONAL JOURNAL OF MEDICAL INFORMATICS, Vol.211, 2026-05 
Article Number
 106342 
Journal Title
INTERNATIONAL JOURNAL OF MEDICAL INFORMATICS
ISSN
 1386-5056 
Issue Date
2026-05
MeSH
Adult ; Clinical Reasoning* ; Cooperative Behavior* ; Cross-Over Studies ; Female ; Humans ; Male
Keywords
human-AI collaboration ; Clinical reasoning ; Ophthalmology ; Large language model ; Confidence ; Cognitive burden ; Model behaviors ; Task difficulty
Abstract
Background: While conversational human-AI collaboration (HAC) using large language models (LLMs) has shown potential to enhance clinical reasoning, its effectiveness in highly specialized and challenging clinical scenarios remains unclear. This study aimed to evaluate the effectiveness of HAC and to analyze the causes of its success and failure.

Methods: A crossover experimental study was conducted using 30 challenging cases from JAMA Ophthalmology. Thirty participants (10 board-certified ophthalmologists, 10 ophthalmology residents, and 10 senior medical students) completed the cases under two conditions: independent work (human-only) and collaboration through free-text conversation with Claude-3.5-Sonnet (HAC). Performance accuracy, along with self-rated confidence and cognitive burden, was assessed. HAC interaction logs were analyzed to evaluate the appropriateness of the LLM's accepting and arguing behaviors, which were categorized into six patterns. Sliding paired t-tests across incremental thresholds were used to assess how accuracy gains from HAC varied by task difficulty.

Results: HAC significantly improved mean accuracy compared to the human-only condition (from 0.45 to 0.60, P < 0.001), although 20% of participants showed a decline in performance and the mean remained below the LLM-only accuracy (0.70). HAC significantly increased confidence and reduced cognitive burden (both P < 0.001) in both successful and failed HAC. The appropriateness of LLM behaviors was substantially higher in successful HAC than in failed HAC (F1 score = 0.92 vs. 0.29, P < 0.001). In successful HAC, 92.6% followed the pattern "LLM presents correct insight / human accepts," while 58.6% of failures involved "LLM presents incorrect insight / human accepts." HAC improved accuracy significantly in tasks where the human-only correct response rate exceeded 47% (P < 0.05), but not below 30% (P >= 0.188).
Conclusions: These findings suggest that HAC benefits complex clinical decisions in ophthalmology but remains limited by human, model, and task-level factors requiring further improvement.
DOI
10.1016/j.ijmedinf.2026.106342
Appears in Collections:
1. College of Medicine (의과대학) > Dept. of Ophthalmology (안과학교실) > 1. Journal Papers
Yonsei Authors
Choi, Eun Young (최은영) ORCID: https://orcid.org/0000-0002-1668-6452
URI
https://ir.ymlib.yonsei.ac.kr/handle/22282913/211452