Insufficient reporting quality in large language model studies in the field of radiology

Authors
 Suh, Pae Sun  ;  Jeong, So Yeong  ;  Ueda, Daiju  ;  Shim, Woo Hyun  ;  Heo, Hwon  ;  Woo, Chang-Yun  ;  Park, Hyungjun  ;  Suh, Chong Hyun 
Citation
 INSIGHTS INTO IMAGING, Vol.17(1), 2026-03 
Article Number
 71 
Journal Title
INSIGHTS INTO IMAGING
ISSN
 1869-4101 
Issue Date
2026-03
Keywords
Large language model ; Radiology ; Reporting quality ; Systematic review
Abstract
Objectives: Our systematic review aimed to evaluate the quality of reporting in research articles involving large language models (LLMs) in the field of radiology.

Materials and methods: After a search of the PubMed-MEDLINE and EMBASE databases, a total of 246 eligible studies published between November 30, 2022, and December 31, 2024, were included. The analysis assessed the percentage of studies adhering to the key reporting elements required for LLM research, based on the MInimum reporting items for CLear Evaluation of Accuracy Reports of Large Language Models in healthcare (MI-CLEAR-LLM) and the Transparent Reporting of a Multivariable Model for Individual Prognosis Or Diagnosis-large language models (TRIPOD-LLM) checklists. Studies published before and after July 25, 2024, were compared using a chi-square test.

Results: The most common topic was performance evaluation of LLMs on radiologic cases (44.3%, 109/246), followed by radiology reporting (37.8%, 93/246). Although all studies reported the LLM's name, only 27.6% (68/246) specified the model version, 35.8% (88/246) mentioned the access date, and 25.2% (62/246) mentioned application programming interface usage. Full prompts were provided in 41.1% (101/246) of studies. Output-probability-related items, including the number of attempts (22.8%, 56/246) and settings such as temperature (16.7%, 41/246), were under-reported. These reporting insufficiencies persisted in studies published both before and after July 25, 2024.

Conclusion: Most studies assessing large language models in radiology lacked sufficient reporting of the key elements required for LLM research. We recommend that authors adhere to these elements to ensure transparency and improve the reproducibility of future studies.

Critical relevance statement: Our study highlights the need for improved reporting quality and adherence to key elements to ensure transparent reporting and improve the reproducibility of future studies using large language models.

Key points:
- Numerous studies on large language models (LLMs) in radiology lack standardized methodologies, leading to high variability and inconsistent reporting.
- Our review demonstrated insufficient reporting of key elements for LLM research, particularly model details and output probability.
- Better reporting and adherence to key elements are essential for enhancing transparency and reproducibility in future LLM research.
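The before/after comparison described in the methods (a chi-square test on the proportion of studies reporting a given element before versus after July 25, 2024) can be sketched as follows. This is a minimal illustration only; the counts in the table are hypothetical and are not taken from the study's data.

```python
# Hypothetical sketch of the chi-square comparison described in the abstract:
# does the rate of reporting a key element (e.g., model version) differ
# between studies published before vs. after the cutoff date?
from scipy.stats import chi2_contingency

# 2x2 contingency table (all counts hypothetical, not from the paper):
# rows: published before / after July 25, 2024
# cols: reported the element / did not report it
table = [
    [30, 90],  # hypothetical: 30 of 120 earlier studies reported it
    [38, 88],  # hypothetical: 38 of 126 later studies reported it
]

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.2f}, p={p:.3f}, dof={dof}")
```

A p-value above the chosen significance level would indicate no detectable change in reporting rates across the cutoff, which is the pattern the review reports for most elements.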
Files in This Item:
92181.pdf
DOI
10.1186/s13244-026-02236-1
Appears in Collections:
1. College of Medicine (의과대학) > Dept. of Radiology (영상의학교실) > 1. Journal Papers
Yonsei Authors
Suh, Pae Sun (서배선), ORCID: https://orcid.org/0000-0002-8618-9558
URI
https://ir.ymlib.yonsei.ac.kr/handle/22282913/211663
