Comparative diagnostic agreement of a supervised machine learning model and a general-purpose, zero-shot, non-domain-adapted large language model for classifying headache disorders using structured questionnaires

Authors
 Katsuki, Masahito  ;  Moran, Kieran  ;  O'Connor, Siobhan  ;  Ward, Tomas  ;  Fuse, Yutaro  ;  Gargari, Omid Kohandel  ;  Romozzi, Marina  ;  Gonzalez-Martinez, Alicia  ;  Huerta, Miguel A.  ;  Ha, Woo-Seok  ;  Cheung, Jackson T. S.  ;  Matsumori, Yasuhiko 
Citation
 CEPHALALGIA, Vol.46(4), 2026-04 
Article Number
 03331024261441574 
Journal Title
CEPHALALGIA
ISSN
 0333-1024 
Issue Date
2026-04
Keywords
artificial intelligence (AI) ; ChatGPT ; headache diagnosis ; large language models (LLMs) ; machine learning (ML)
Abstract
Background: Accurate diagnosis of headache disorders is essential in clinical practice. Supervised machine learning models trained on structured clinical data have shown good performance, whereas the diagnostic ability of large language models (LLMs) for headache disorders has not been evaluated. This study compared a validated machine learning classifier with a general-purpose, zero-shot, non-domain-adapted LLM using the same structured patient questionnaire data, focusing on their agreement with specialist-confirmed diagnoses as the ground truth. This study was designed to reflect current real-world use scenarios, in which clinicians may apply off-the-shelf LLMs for diagnostic purposes without few-shot prompting, domain-specific fine-tuning, or adaptation, rather than to assess the theoretical upper limits of LLM capabilities.
Methods: We analyzed 1818 patients from an independent hold-out test cohort who completed a 22-item structured headache questionnaire and received specialist-confirmed diagnoses. A previously developed machine learning model and a general-purpose, non-domain-adapted LLM (GPT-4.1 with zero-shot prompting) each generated five-class International Classification of Headache Disorders, 3rd edition (ICHD-3)-based predictions: migraine and/or medication-overuse headache (MOH), tension-type headache (TTH), trigeminal autonomic cephalalgias (TACs), other primary headache disorders, and secondary headaches. Agreement with the specialist's diagnosis and diagnostic performance metrics were calculated. Class-wise sensitivity and specificity were compared using McNemar's test.
Results: The machine learning classifier showed significantly higher diagnostic agreement with the specialist than the LLM (Cohen's kappa: 0.46 vs. 0.26; 95% confidence interval of the difference: 0.15-0.25). Although the LLM showed slightly higher macro-averaged sensitivity (balanced accuracy) than the machine learning model, the machine learning classifier showed higher macro-averaged precision, specificity, and F-value. Class-wise analysis showed that the machine learning model demonstrated greater sensitivity for migraine and/or MOH and secondary headaches, while the LLM showed higher sensitivity for TTH. Regarding specificity, the machine learning model outperformed the LLM in TTH, TACs, and other primary headache disorders, whereas the LLM showed higher specificity only for migraine and/or MOH.
Conclusions: A supervised machine learning model trained on real-world clinical data showed better agreement with a specialist-confirmed diagnosis than a general-purpose, zero-shot, non-domain-adapted LLM. These findings indicate that, in its current off-the-shelf configuration under this experimental setting, the diagnostic agreement between a general-purpose LLM and specialists can be limited for headache disorders.
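Illustrative note (not part of the published abstract): the agreement statistics named above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' analysis code. The function names and the toy labels are hypothetical, and the exact binomial form of McNemar's test is an assumption, since the abstract does not say which variant was used.

```python
from collections import Counter
from math import comb


def cohens_kappa(truth, pred):
    """Chance-corrected agreement between two label sequences."""
    n = len(truth)
    observed = sum(t == p for t, p in zip(truth, pred)) / n
    t_counts, p_counts = Counter(truth), Counter(pred)
    # Agreement expected if the two raters labeled independently.
    expected = sum(t_counts[c] * p_counts[c] for c in t_counts) / (n * n)
    return (observed - expected) / (1 - expected)


def macro_sensitivity(truth, pred):
    """Unweighted mean of per-class sensitivity (balanced accuracy)."""
    recalls = []
    for c in sorted(set(truth)):
        hits = sum(t == c and p == c for t, p in zip(truth, pred))
        recalls.append(hits / sum(t == c for t in truth))
    return sum(recalls) / len(recalls)


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant counts.

    b: cases model A got right and model B got wrong
    c: cases model B got right and model A got wrong
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical toy labels, not data from the study.
toy_truth = ["migraine", "migraine", "TTH", "TTH"]
toy_pred = ["migraine", "TTH", "TTH", "TTH"]

print(cohens_kappa(toy_truth, toy_pred))
print(macro_sensitivity(toy_truth, toy_pred))
print(mcnemar_exact(1, 5))
```

For a class-wise sensitivity comparison, the discordant counts would be taken only among patients whose specialist-confirmed diagnosis is that class. For specificity they would be taken among patients without it.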
Files in This Item:
93122.pdf Download
DOI
10.1177/03331024261441574
Appears in Collections:
1. College of Medicine (의과대학) > Dept. of Neurology (신경과학교실) > 1. Journal Papers
Yonsei Authors
Ha, Woo Seok(하우석) https://orcid.org/0000-0003-1188-449X
URI
https://ir.ymlib.yonsei.ac.kr/handle/22282913/212510
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
