Analyzing evaluation methods for large language models in the medical field: a scoping review

Lee, Junbok; Park, Sungkyung; Shin, Jaeyong; Cho, Belong

doi:10.1186/s12911-024-02709-7

YUHSpace

DC Field	Value	Language
dc.contributor.author	Lee, Junbok	-
dc.contributor.author	Park, Sungkyung	-
dc.contributor.author	Shin, Jaeyong	-
dc.contributor.author	Cho, Belong	-
dc.date.accessioned	2025-02-03T09:26:18Z	-
dc.date.available	2025-02-03T09:26:18Z	-
dc.date.created	2025-03-20	-
dc.date.issued	2024-11	-
dc.identifier.issn	1472-6947	-
dc.identifier.uri	https://ir.ymlib.yonsei.ac.kr/handle/22282913/202463	-
dc.description.abstract	Background: Owing to the rapid growth in the popularity of Large Language Models (LLMs), various performance evaluation studies have been conducted to confirm their applicability in the medical field. However, there is still no clear framework for evaluating LLMs. Objective: This study reviews studies on LLM evaluations in the medical field and analyzes the research methods used in these studies. It aims to provide a reference for future researchers designing LLM studies. Methods & materials: We conducted a scoping review of three databases (PubMed, Embase, and MEDLINE) to identify LLM-related articles published between January 1, 2023, and September 30, 2023. We analyzed the types of methods, number of questions (queries), evaluators, repeat measurements, additional analysis methods, use of prompt engineering, and metrics other than accuracy. Results: A total of 142 articles met the inclusion criteria. LLM evaluation was primarily categorized as either providing test examinations (n = 53, 37.3%) or being evaluated by a medical professional (n = 80, 56.3%), with some hybrid cases (n = 5, 3.5%) or a combination of the two (n = 4, 2.8%). Most studies had 100 or fewer questions (n = 18, 29.0%), 15 (24.2%) performed repeated measurements, 18 (29.0%) performed additional analyses, and 8 (12.9%) used prompt engineering. For medical assessment, most studies used 50 or fewer queries (n = 54, 64.3%), had two evaluators (n = 43, 48.3%), and 14 (14.7%) used prompt engineering. Conclusions: More research is required regarding the application of LLMs in healthcare. Although previous studies have evaluated performance, future studies will likely focus on improving performance. A well-structured methodology is required for these studies to be conducted systematically.	-
dc.description.statementOfResponsibility	open	-
dc.language	English	-
dc.publisher	BioMed Central	-
dc.relation.isPartOf	BMC MEDICAL INFORMATICS AND DECISION MAKING	-
dc.rights	CC BY-NC-ND 2.0 KR	-
dc.title	Analyzing evaluation methods for large language models in the medical field: a scoping review	-
dc.type	Article	-
dc.contributor.college	College of Medicine (의과대학)	-
dc.contributor.department	Dept. of Preventive Medicine (예방의학교실)	-
dc.contributor.googleauthor	Lee, Junbok	-
dc.contributor.googleauthor	Park, Sungkyung	-
dc.contributor.googleauthor	Shin, Jaeyong	-
dc.contributor.googleauthor	Cho, Belong	-
dc.identifier.doi	10.1186/s12911-024-02709-7	-
dc.relation.journalcode	J00363	-
dc.identifier.eissn	1472-6947	-
dc.identifier.pmid	39614219	-
dc.subject.keyword	Large language model	-
dc.subject.keyword	LLM	-
dc.subject.keyword	Evaluation methods	-
dc.contributor.alternativeName	Shin, Jae Yong	-
dc.contributor.affiliatedAuthor	Shin, Jaeyong	-
dc.identifier.scopusid	2-s2.0-85211120645	-
dc.identifier.wosid	001366888900001	-
dc.citation.volume	24	-
dc.citation.number	1	-
dc.identifier.bibliographicCitation	BMC MEDICAL INFORMATICS AND DECISION MAKING, Vol.24(1), 2024-11	-
dc.identifier.rimsid	85528	-
dc.type.rims	ART	-
dc.description.journalClass	1	-
dc.subject.keywordAuthor	Large language model	-
dc.subject.keywordAuthor	LLM	-
dc.subject.keywordAuthor	Evaluation methods	-
dc.subject.keywordPlus	CHATGPT	-
dc.subject.keywordPlus	PERFORMANCE	-
dc.subject.keywordPlus	QUESTIONS	-
dc.subject.keywordPlus	EDUCATION	-
dc.subject.keywordPlus	ACCURACY	-
dc.type.docType	Article	-
dc.description.isOpenAccess	Y	-
dc.description.journalRegisteredClass	scie	-
dc.description.journalRegisteredClass	scopus	-
dc.relation.journalWebOfScienceCategory	Medical Informatics	-
dc.relation.journalResearchArea	Medical Informatics	-
dc.identifier.articleno	366	-

Appears in Collections: 1. College of Medicine (의과대학) > Dept. of Preventive Medicine (예방의학교실) > 1. Journal Papers
