Cardiovascular system; Risk prediction; Big data; Health analytics
Abstract
As the prevention and early management of cardiovascular and cerebrovascular diseases grow in importance, researchers worldwide are building and comparing risk prediction models based on health examination big data. This study used health insurance data to predict the incidence of cardiocerebrovascular disease (CCVD) with several models and compared their performance on samples with different initial risk levels. Data from 410,859 individuals from the National Health Insurance Service between 2002 and 2019 were analyzed. Several linear models were applied to predict the occurrence of CCVD in two distinct samples. The models were based on logistic regression with penalty terms added to the objective function, and their predictive performance was compared using multiple evaluation metrics, including the area under the receiver operating characteristic curve. The logistic regression model using variables selected by the LASSO algorithm showed the highest predictive performance, although its differences from the other models were not statistically significant. The models performed better on samples with higher incidence rates and initial risk levels. In summary, this study predicted and compared the incidence of CCVD in patients with different health conditions using national sample cohort data from the National Health Insurance Service. The findings underscore the importance of developing diverse models to predict diseases such as CCVD, which have high medical costs and incidence rates, and can inform the development of healthcare policies.
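As an illustration of the main evaluation metric named above, the following minimal sketch computes the area under the receiver operating characteristic curve as the probability that a randomly chosen case receives a higher predicted risk than a randomly chosen non-case. The function name `roc_auc`, the labels, and the two sets of scores are hypothetical toy values chosen for this example; they are not the study's data, models, or code.

```python
from itertools import product


def roc_auc(labels, scores):
    """AUC as the share of (case, non-case) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one case and one non-case")
    # Compare every case score with every non-case score.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


# Hypothetical outcomes (1 = disease occurred) and predicted risks
# from two illustrative models; not taken from the study.
labels = [0, 0, 1, 0, 1, 1]
scores_a = [0.10, 0.30, 0.35, 0.40, 0.70, 0.90]
scores_b = [0.20, 0.60, 0.30, 0.50, 0.40, 0.80]

auc_a = roc_auc(labels, scores_a)
auc_b = roc_auc(labels, scores_b)
print(f"model A AUC = {auc_a:.3f}, model B AUC = {auc_b:.3f}")
```

The pairwise form is quadratic in sample size, so for a cohort of the size analyzed here a rank-based implementation from a statistics library would be the practical choice. The sketch is meant only to make the definition of the metric concrete.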