Comparative diagnostic agreement of a supervised machine learning model and a general-purpose, zero-shot, non-domain-adapted large language model for classifying headache disorders using structured questionnaires

Katsuki, Masahito; Moran, Kieran; O'Connor, Siobhan; Ward, Tomas; Fuse, Yutaro; Gargari, Omid Kohandel; Romozzi, Marina; Gonzalez-Martinez, Alicia; Huerta, Miguel A.; Ha, Woo-Seok; Cheung, Jackson T. S.; Matsumori, Yasuhiko

doi:10.1177/03331024261441574

Citation: CEPHALALGIA, Vol.46(4), 2026-04

Article Number: 03331024261441574

Journal Title: CEPHALALGIA

ISSN: 0333-1024

Issue Date: 2026-04

Keywords: artificial intelligence (AI) ; ChatGPT ; headache diagnosis ; large language models (LLMs) ; machine learning (ML)

Abstract:

Background: Accurate diagnosis of headache disorders is essential in clinical practice. Supervised machine learning models trained on structured clinical data have shown good performance, whereas the diagnostic ability of large language models (LLMs) for headache disorders has not been evaluated. This study compared a validated machine learning classifier with a general-purpose, zero-shot, non-domain-adapted LLM using the same structured patient questionnaire data, focusing on their agreement with specialist-confirmed diagnoses as the ground truth. This study was designed to reflect current real-world use scenarios, in which clinicians may apply off-the-shelf LLMs for diagnostic purposes without few-shot prompting, domain-specific fine-tuning, or adaptation, rather than to assess the theoretical upper limits of LLM capabilities.

Methods: We analyzed 1818 patients from an independent hold-out test cohort who completed a 22-item structured headache questionnaire and received specialist-confirmed diagnoses. A previously developed machine learning model and a general-purpose, non-domain-adapted LLM (GPT-4.1 with zero-shot prompting) each generated five-class International Classification of Headache Disorders, 3rd edition (ICHD-3)-based predictions: migraine and/or medication-overuse headache (MOH), tension-type headache (TTH), trigeminal autonomic cephalalgias (TACs), other primary headache disorders, and secondary headaches. Agreement with the specialist's diagnosis and diagnostic performance metrics were calculated. Class-wise sensitivity and specificity were compared using McNemar's test.

Results: The machine learning classifier showed significantly higher diagnostic agreement with the specialist than the LLM (Cohen's kappa: 0.46 vs. 0.26; 95% confidence interval of the difference: 0.15-0.25). Although the LLM showed slightly higher macro-averaged sensitivity (balanced accuracy) than the machine learning model, the machine learning classifier showed higher macro-averaged precision, specificity, and F-value. Class-wise analysis showed that the machine learning model demonstrated greater sensitivity for migraine and/or MOH and secondary headaches, while the LLM showed higher sensitivity for TTH. Regarding specificity, the machine learning model outperformed the LLM in TTH, TACs, and other primary headache disorders, whereas the LLM showed higher specificity only for migraine and/or MOH.

Conclusions: A supervised machine learning model trained on real-world clinical data showed better agreement with a specialist-confirmed diagnosis than a general-purpose, zero-shot, non-domain-adapted LLM. These findings indicate that, in its current off-the-shelf configuration under this experimental setting, the diagnostic agreement between a general-purpose LLM and specialists can be limited for headache disorders.
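
The agreement statistics named in the abstract (Cohen's kappa, macro-averaged sensitivity, and McNemar's test) can be illustrated with a minimal Python sketch. Everything below is an assumption made for illustration: the labels are invented toy data rather than the study's, the function names are hypothetical, and the exact binomial form of McNemar's test is used because the abstract does not state which variant or software the authors applied.

```python
from collections import Counter
from math import comb


def cohen_kappa(y_true, y_pred):
    """Chance-corrected agreement between two label sequences."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Agreement expected if the two raters labelled independently.
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (observed - expected) / (1 - expected)


def macro_sensitivity(y_true, y_pred):
    """Unweighted mean of per-class sensitivity (balanced accuracy)."""
    recalls = []
    for c in sorted(set(y_true)):
        support = sum(t == c for t in y_true)
        hits = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        recalls.append(hits / support)
    return sum(recalls) / len(recalls)


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar p-value from paired correct/incorrect flags."""
    only_a = sum(a and not b for a, b in zip(correct_a, correct_b))
    only_b = sum(b and not a for a, b in zip(correct_a, correct_b))
    n = only_a + only_b  # discordant pairs only
    if n == 0:
        return 1.0
    k = min(only_a, only_b)
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, p)


# Invented toy labels (M = migraine, T = TTH, S = secondary); NOT study data.
truth = ["M", "M", "M", "M", "T", "T", "T", "T", "S", "S"]
model_a = ["M", "M", "M", "M", "T", "T", "T", "M", "S", "S"]
model_b = ["M", "M", "T", "T", "T", "T", "T", "T", "S", "M"]

kappa_a = cohen_kappa(truth, model_a)
kappa_b = cohen_kappa(truth, model_b)
sens_a = macro_sensitivity(truth, model_a)
sens_b = macro_sensitivity(truth, model_b)

# Paired comparison: was each model correct on the same patient?
flags_a = [t == p for t, p in zip(truth, model_a)]
flags_b = [t == p for t, p in zip(truth, model_b)]
p_value = mcnemar_exact(flags_a, flags_b)

print(f"kappa A={kappa_a:.3f}, kappa B={kappa_b:.3f}")
print(f"macro sensitivity A={sens_a:.3f}, B={sens_b:.3f}")
print(f"McNemar exact p={p_value:.3f}")
```

For a class-wise sensitivity comparison like the one the abstract reports, the paired flags would presumably be restricted to patients whose specialist diagnosis is the class in question; for specificity, to patients without it. The toy example pools all patients only to keep the sketch short.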