Leveraging natural language processing for efficient information extraction from breast cancer pathology reports: Single-institution study

DC Field | Value | Language
dc.contributor.author | Park, Phillip | -
dc.contributor.author | Choi, Yeonho | -
dc.contributor.author | Han, Nayoung | -
dc.contributor.author | Park, Ye-Lin | -
dc.contributor.author | Hwang, Juyeon | -
dc.contributor.author | Chae, Heejung | -
dc.contributor.author | Yoo, Chong Woo | -
dc.contributor.author | Choi, Kui Son | -
dc.contributor.author | Kim, Hyun-Jin | -
dc.date.accessioned | 2025-11-12T06:50:57Z | -
dc.date.available | 2025-11-12T06:50:57Z | -
dc.date.created | 2025-07-24 | -
dc.date.issued | 2025-02 | -
dc.identifier.issn | 1932-6203 | -
dc.identifier.uri | https://ir.ymlib.yonsei.ac.kr/handle/22282913/208705 | -
dc.description.abstract | Background: Pathology reports provide important information for accurate cancer diagnosis and optimal treatment decision making. In particular, breast cancer is known to be the most common cancer in women worldwide. Objective: To extract data from breast cancer pathology reports at a single institution, we compared the accuracy of regular expressions and natural language processing (NLP). Methods: A total of 1,215 breast cancer pathology reports were annotated for NLP model development. As NLP models, we considered three BERT models with specific vocabularies: BERT-basic, BioBERT, and ClinicalBERT. K-fold cross-validation was used to verify the performance of the BERT models. The results of the regular expression method and the BERT model were compared using named entity recognition (NER). Results: Among the three BERT models, BioBERT was the most accurate parsing model (average performance = 0.99901) for breast cancer pathology reports when k was set to 5. BioBERT also had the lowest error rate for all items in the breast cancer pathology report compared with the other BERT models (accuracy for all variables >= 0.9). We therefore selected BioBERT as the NLP model. When comparing the results of BioBERT and regular expressions using NER, we found that BioBERT was more accurate than the regular expression method, especially for items such as intraductal component (BioBERT: 1.0, RegEx: 0.1644), lymph node (BioBERT: 0.9886, RegEx: 0.4792), and lymphovascular invasion (BioBERT: 0.9918, RegEx: 0.3759). Conclusions: Our results showed that the NLP model, BioBERT, had higher accuracy than regular expressions, suggesting the importance of BioBERT in the processing of breast cancer pathology reports. | -
dc.format | application/pdf | -
dc.language | English | -
dc.publisher | Public Library of Science | -
dc.relation.isPartOf | PLOS ONE | -
dc.subject.MESH | Breast Neoplasms* / pathology | -
dc.subject.MESH | Female | -
dc.subject.MESH | Humans | -
dc.subject.MESH | Information Storage and Retrieval* / methods | -
dc.subject.MESH | Natural Language Processing* | -
dc.title | Leveraging natural language processing for efficient information extraction from breast cancer pathology reports: Single-institution study | -
dc.type | Article | -
dc.contributor.googleauthor | Park, Phillip | -
dc.contributor.googleauthor | Choi, Yeonho | -
dc.contributor.googleauthor | Han, Nayoung | -
dc.contributor.googleauthor | Park, Ye-Lin | -
dc.contributor.googleauthor | Hwang, Juyeon | -
dc.contributor.googleauthor | Chae, Heejung | -
dc.contributor.googleauthor | Yoo, Chong Woo | -
dc.contributor.googleauthor | Choi, Kui Son | -
dc.contributor.googleauthor | Kim, Hyun-Jin | -
dc.identifier.doi | 10.1371/journal.pone.0318726 | -
dc.relation.journalcode | J02540 | -
dc.identifier.eissn | 1932-6203 | -
dc.identifier.pmid | 39965024 | -
dc.subject.keyword | Article | -
dc.subject.keyword | Breast Cancer | -
dc.subject.keyword | Data Extraction | -
dc.subject.keyword | Decision Making | -
dc.subject.keyword | Efficient Information Extraction | -
dc.subject.keyword | Electronic Health Record | -
dc.subject.keyword | Histology | -
dc.subject.keyword | Lymph Node | -
dc.subject.keyword | Lymph Vessel Metastasis | -
dc.subject.keyword | Natural Language Processing | -
dc.subject.keyword | Transfer Learning (machine Learning) | -
dc.subject.keyword | Breast Tumor | -
dc.subject.keyword | Data Mining | -
dc.subject.keyword | Female | -
dc.subject.keyword | Human | -
dc.subject.keyword | Pathology | -
dc.subject.keyword | Procedures | -
dc.subject.keyword | Breast Neoplasms | -
dc.subject.keyword | Humans | -
dc.contributor.affiliatedAuthor | Hwang, Juyeon | -
dc.identifier.scopusid | 2-s2.0-85218078033 | -
dc.identifier.wosid | 001425429000012 | -
dc.citation.volume | 20 | -
dc.citation.number | 2 | -
dc.identifier.bibliographicCitation | PLOS ONE, Vol.20(2), 2025-02 | -
dc.identifier.rimsid | 88147 | -
dc.type.rims | ART | -
dc.description.journalClass | 1 | -
dc.subject.keywordAuthor | Article | -
dc.subject.keywordAuthor | Breast Cancer | -
dc.subject.keywordAuthor | Data Extraction | -
dc.subject.keywordAuthor | Decision Making | -
dc.subject.keywordAuthor | Efficient Information Extraction | -
dc.subject.keywordAuthor | Electronic Health Record | -
dc.subject.keywordAuthor | Histology | -
dc.subject.keywordAuthor | Lymph Node | -
dc.subject.keywordAuthor | Lymph Vessel Metastasis | -
dc.subject.keywordAuthor | Natural Language Processing | -
dc.subject.keywordAuthor | Transfer Learning (machine Learning) | -
dc.subject.keywordAuthor | Breast Tumor | -
dc.subject.keywordAuthor | Data Mining | -
dc.subject.keywordAuthor | Female | -
dc.subject.keywordAuthor | Human | -
dc.subject.keywordAuthor | Pathology | -
dc.subject.keywordAuthor | Procedures | -
dc.subject.keywordAuthor | Breast Neoplasms | -
dc.subject.keywordAuthor | Humans | -
dc.subject.keywordPlus | CLASSIFICATION | -
dc.type.docType | Article | -
dc.description.isOpenAccess | Y | -
dc.description.journalRegisteredClass | scie | -
dc.description.journalRegisteredClass | scopus | -
dc.relation.journalWebOfScienceCategory | Multidisciplinary Sciences | -
dc.relation.journalResearchArea | Science & Technology - Other Topics | -
dc.identifier.articleno | e0318726 | -
Appears in Collections:
4. Graduate School of Public Health (보건대학원) > Graduate School of Public Health (보건대학원) > 1. Journal Papers
