

Success and failure of human-AI collaboration in clinical reasoning: An experimental study on challenging real-world cases

DC Field Value Language
dc.contributor.authorOng, Kai Tzu-iunn-
dc.contributor.authorSeo, Junwon-
dc.contributor.authorKim, Hyojun-
dc.contributor.authorKim, Jiwoo-
dc.contributor.authorKim, Jihoon-
dc.contributor.authorKim, Sunghwan-
dc.contributor.authorYeo, Jinyoung-
dc.contributor.authorChoi, Eun Young-
dc.contributor.author최은영-
dc.date.accessioned2026-03-25T03:10:36Z-
dc.date.available2026-03-25T03:10:36Z-
dc.date.created2026-03-20-
dc.date.issued2026-05-
dc.identifier.issn1386-5056-
dc.identifier.urihttps://ir.ymlib.yonsei.ac.kr/handle/22282913/211452-
dc.description.abstractBackground: While conversational human-AI collaboration (HAC) using large language models (LLMs) has shown potential to enhance clinical reasoning, its effectiveness in highly specialized and challenging clinical scenarios remains unclear. This study aimed to evaluate the effectiveness of HAC and to analyze the causes of its success and failure. Methods: A crossover experimental study was conducted using 30 challenging cases from JAMA Ophthalmology. Thirty participants (10 board-certified ophthalmologists, 10 ophthalmology residents, and 10 senior medical students) completed the cases under two conditions: independent work (human-only) and collaboration through free-text conversation with Claude-3.5-Sonnet (HAC). Performance accuracy, self-rated confidence, and cognitive burden were assessed. HAC interaction logs were analyzed to evaluate the appropriateness of the LLM's accepting and arguing behaviors, which were categorized into six patterns. Sliding paired t-tests across incremental thresholds were used to assess how accuracy gains from HAC varied by task difficulty. Results: HAC significantly improved mean accuracy compared to the human-only condition (from 0.45 to 0.60, P < 0.001), although 20% of participants showed a decline in performance and the mean remained below the LLM-only accuracy (0.70). HAC significantly increased confidence and reduced cognitive burden (both P < 0.001) in both successful and failed HAC. The appropriateness of LLM behaviors was substantially higher in successful HAC than in failed HAC (F1 score = 0.92 vs. 0.29, P < 0.001). In successful HAC, 92.6% followed the pattern LLM presents correct insight/human accepts, while 58.6% of failures involved LLM presents incorrect insight/human accepts. HAC improved accuracy significantly in tasks where the human-only correct response rate exceeded 47% (P < 0.05), but not below 30% (P >= 0.188). 
Conclusions: These findings suggest that HAC benefits complex clinical decisions in ophthalmology but remains limited by human, model, and task-level factors requiring further improvement.-
dc.languageEnglish-
dc.publisherElsevier Science Ireland Ltd.-
dc.relation.isPartOfINTERNATIONAL JOURNAL OF MEDICAL INFORMATICS-
dc.subject.MESHAdult-
dc.subject.MESHClinical Reasoning*-
dc.subject.MESHCooperative Behavior*-
dc.subject.MESHCross-Over Studies-
dc.subject.MESHFemale-
dc.subject.MESHHumans-
dc.subject.MESHMale-
dc.titleSuccess and failure of human-AI collaboration in clinical reasoning: An experimental study on challenging real-world cases-
dc.typeArticle-
dc.contributor.googleauthorOng, Kai Tzu-iunn-
dc.contributor.googleauthorSeo, Junwon-
dc.contributor.googleauthorKim, Hyojun-
dc.contributor.googleauthorKim, Jiwoo-
dc.contributor.googleauthorKim, Jihoon-
dc.contributor.googleauthorKim, Sunghwan-
dc.contributor.googleauthorYeo, Jinyoung-
dc.contributor.googleauthorChoi, Eun Young-
dc.identifier.doi10.1016/j.ijmedinf.2026.106342-
dc.relation.journalcodeJ01129-
dc.identifier.eissn1872-8243-
dc.identifier.pmid41689881-
dc.subject.keywordhuman-AI collaboration-
dc.subject.keywordClinical reasoning-
dc.subject.keywordOphthalmology-
dc.subject.keywordLarge language model-
dc.subject.keywordConfidence-
dc.subject.keywordCognitive burden-
dc.subject.keywordModel behaviors-
dc.subject.keywordTask difficulty-
dc.contributor.affiliatedAuthorSeo, Junwon-
dc.contributor.affiliatedAuthorKim, Jiwoo-
dc.contributor.affiliatedAuthorChoi, Eun Young-
dc.identifier.scopusid2-s2.0-105029904759-
dc.identifier.wosid001702499200001-
dc.citation.volume211-
dc.identifier.bibliographicCitationINTERNATIONAL JOURNAL OF MEDICAL INFORMATICS, Vol.211, 2026-05-
dc.identifier.rimsid91992-
dc.type.rimsART-
dc.description.journalClass1-
dc.subject.keywordAuthorhuman-AI collaboration-
dc.subject.keywordAuthorClinical reasoning-
dc.subject.keywordAuthorOphthalmology-
dc.subject.keywordAuthorLarge language model-
dc.subject.keywordAuthorConfidence-
dc.subject.keywordAuthorCognitive burden-
dc.subject.keywordAuthorModel behaviors-
dc.subject.keywordAuthorTask difficulty-
dc.subject.keywordPlusOVERCONFIDENCE-
dc.type.docTypeArticle-
dc.description.isOpenAccessY-
dc.description.journalRegisteredClassscie-
dc.description.journalRegisteredClassscopus-
dc.relation.journalWebOfScienceCategoryComputer Science, Information Systems-
dc.relation.journalWebOfScienceCategoryHealth Care Sciences & Services-
dc.relation.journalWebOfScienceCategoryMedical Informatics-
dc.relation.journalResearchAreaComputer Science-
dc.relation.journalResearchAreaHealth Care Sciences & Services-
dc.relation.journalResearchAreaMedical Informatics-
dc.identifier.articleno106342-
Appears in Collections:
1. College of Medicine (의과대학) > Dept. of Ophthalmology (안과학교실) > 1. Journal Papers


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.