Application of a Natural Language Processing Framework for Data Extraction From Pathology Reports Across Multiple Cancer Types

Park, Phillip; Choi, Yeonho; Han, Nayoung; Park, Soobin; Park, Ye-Lin; Hwang, Juyeon; Choi, Kui Son; Yoo, Chong Woo; Kim, Hyun-Jin

doi:10.3346/jkms.2026.41.e79

YUHSpace

DC Field	Value	Language
dc.contributor.author	Park, Phillip	-
dc.contributor.author	Choi, Yeonho	-
dc.contributor.author	Han, Nayoung	-
dc.contributor.author	Park, Soobin	-
dc.contributor.author	Park, Ye-Lin	-
dc.contributor.author	Hwang, Juyeon	-
dc.contributor.author	Choi, Kui Son	-
dc.contributor.author	Yoo, Chong Woo	-
dc.contributor.author	Kim, Hyun-Jin	-
dc.date.accessioned	2026-03-25T07:12:49Z	-
dc.date.available	2026-03-25T07:12:49Z	-
dc.date.created	2026-03-20	-
dc.date.issued	2026-03	-
dc.identifier.issn	1011-8934	-
dc.identifier.uri	https://ir.ymlib.yonsei.ac.kr/handle/22282913/211475	-
dc.description.abstract	Background: Pathology reports provide comprehensive insights into the clinical and pathological features of different cancer types. However, extracting these semi-structured data for research is challenging. To better utilize pathology reports in cancer studies, we developed an efficient natural language processing (NLP) system to automate the extraction of items from pathology reports, facilitating streamlined storage, retrieval, and analysis of clinical data in a centralized database. Methods: To determine the optimal model for our study, we conducted a comparative analysis of various deep learning architectures, including long short-term memory, convolutional neural network, and transformer-based models such as bidirectional encoder representations from transformers (BERT), BioBERT, and ClinicalBERT. The proficiency of the ClinicalBERT model in medical terminology and context significantly enhanced the accuracy and efficiency of data extraction from these reports. Results: Among the aforementioned models, ClinicalBERT exhibited the best performance and was selected as the base model. The ClinicalBERT model demonstrated exceptional performance in accurately classifying variables across multiple cancer types. Regarding stomach cancer, perfect F1 scores (F1 = 1.0) were achieved for variables such as angiolymphatic invasion and operation name; however, lower performance was observed for distant metastasis (F1 = 0.3889). Regarding liver cancer, high performance was consistently observed for most variables, with F1 scores above 0.99. Regarding colorectal cancer, perfect F1 scores (F1 = 1.0) were achieved for variables such as Dworak's grade, lymph node, operation name, and total mesorectal excision, while slightly lower but acceptable performance was noted for surgical margin (F1 = 0.9259). Regarding breast cancer, perfect F1 scores (F1 = 1.0) were achieved for several variables, including nipple margin, organ, and superficial margin, while strong performance was noted for lateral and medial margins (F1 > 0.94). Conclusion: This study underscores the efficacy of NLP systems, specifically the ClinicalBERT model, in automating the extraction of important clinical data from pathology reports across various cancer types. This approach can not only simplify the process but also enhance the accuracy of the extracted information.	-
dc.language	English	-
dc.publisher	대한의학회(The Korean Academy of Medical Sciences)	-
dc.relation.isPartOf	JOURNAL OF KOREAN MEDICAL SCIENCE	-
dc.subject.MESH	Data Mining	-
dc.subject.MESH	Databases, Factual	-
dc.subject.MESH	Deep Learning	-
dc.subject.MESH	Humans	-
dc.subject.MESH	Liver Neoplasms / pathology	-
dc.subject.MESH	Natural Language Processing*	-
dc.subject.MESH	Neoplasms* / classification	-
dc.subject.MESH	Neoplasms* / pathology	-
dc.subject.MESH	Neural Networks, Computer	-
dc.subject.MESH	Stomach Neoplasms / pathology	-
dc.title	Application of a Natural Language Processing Framework for Data Extraction From Pathology Reports Across Multiple Cancer Types	-
dc.type	Article	-
dc.contributor.googleauthor	Park, Phillip	-
dc.contributor.googleauthor	Choi, Yeonho	-
dc.contributor.googleauthor	Han, Nayoung	-
dc.contributor.googleauthor	Park, Soobin	-
dc.contributor.googleauthor	Park, Ye-Lin	-
dc.contributor.googleauthor	Hwang, Juyeon	-
dc.contributor.googleauthor	Choi, Kui Son	-
dc.contributor.googleauthor	Yoo, Chong Woo	-
dc.contributor.googleauthor	Kim, Hyun-Jin	-
dc.identifier.doi	10.3346/jkms.2026.41.e79	-
dc.relation.journalcode	J01517	-
dc.identifier.eissn	1598-6357	-
dc.identifier.pmid	41775279	-
dc.subject.keyword	Pathology Report	-
dc.subject.keyword	Natural Language Processing	-
dc.subject.keyword	Cancer	-
dc.subject.keyword	Database	-
dc.contributor.affiliatedAuthor	Hwang, Juyeon	-
dc.identifier.scopusid	2-s2.0-105031873203	-
dc.identifier.wosid	001705947200001	-
dc.citation.volume	41	-
dc.citation.number	8	-
dc.identifier.bibliographicCitation	JOURNAL OF KOREAN MEDICAL SCIENCE, Vol.41(8), 2026-03	-
dc.identifier.rimsid	92011	-
dc.type.rims	ART	-
dc.description.journalClass	1	-
dc.subject.keywordAuthor	Pathology Report	-
dc.subject.keywordAuthor	Natural Language Processing	-
dc.subject.keywordAuthor	Cancer	-
dc.subject.keywordAuthor	Database	-
dc.type.docType	Article	-
dc.identifier.kciid	ART003310021	-
dc.description.isOpenAccess	Y	-
dc.description.journalRegisteredClass	scie	-
dc.description.journalRegisteredClass	scopus	-
dc.description.journalRegisteredClass	kci	-
dc.relation.journalWebOfScienceCategory	Medicine, General & Internal	-
dc.relation.journalResearchArea	General & Internal Medicine	-
dc.identifier.articleno	e79	-

Appears in Collections: 4. Graduate School of Public Health (보건대학원) > Graduate School of Public Health (보건대학원) > 1. Journal Papers
