이분형 자료의 분류문제에서 불균형을 다루기 위한 표본재추출 방법 비교



Other Titles: Comparison of resampling methods for dealing with imbalanced data in binary classification problem

Citation: Korean Journal of Applied Statistics (응용통계연구), Vol.32(3) : 349-374, 2019

Abstract: A class imbalance problem arises in binary data when one class greatly outnumbers the other. Approaches such as transforming the training data have been studied to address this problem. In this study, we compared resampling methods, one family of approaches for handling imbalance in classification, with the aim of detecting the minority class more effectively. Through simulation, we compared a total of 20 methods spanning over-sampling, under-sampling, and combined over- and under-sampling. Logistic regression, support vector machine, and random forest models, which are commonly used in classification problems, served as classifiers. The simulation results showed that random under-sampling (RUS) had the highest sensitivity among methods whose accuracy exceeded 0.5. The next most sensitive method was the adaptive synthetic sampling (ADASYN) approach, an over-sampling method. These results indicate that RUS is suitable for detecting minority class observations. Results from applying the methods to several real data sets were similar to those of the simulation.
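To make the idea of random under-sampling and the two reported metrics concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names, the 1:1 target ratio, the fixed seed, and the toy data are all assumptions chosen for illustration, since the abstract does not state these settings.

```python
import random
from collections import Counter


def random_under_sample(X, y, seed=0):
    """Randomly keep as many rows of each class as the smallest class has.

    Assumption: classes are balanced to a 1:1 ratio; the abstract does not
    say which ratio the study used.
    """
    rng = random.Random(seed)
    counts = Counter(y)
    n_min = min(counts.values())
    keep = []
    for label in counts:
        idx = [i for i, lab in enumerate(y) if lab == label]
        keep.extend(rng.sample(idx, n_min))  # sample without replacement
    keep.sort()  # preserve the original row order
    return [X[i] for i in keep], [y[i] for i in keep]


def sensitivity(y_true, y_pred, positive=1):
    """Share of actual positives (minority class) that were predicted positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


def accuracy(y_true, y_pred):
    """Share of all observations that were predicted correctly."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Hypothetical toy data: 95 majority-class rows (0) and 5 minority-class rows (1).
X = [[float(i)] for i in range(100)]
y = [0] * 95 + [1] * 5

X_bal, y_bal = random_under_sample(X, y)
class_counts = Counter(y_bal)  # both classes now have 5 rows each
```

A classifier would then be fitted on `X_bal` and `y_bal` and evaluated on untouched test data with `sensitivity` and `accuracy`. Because under-sampling discards majority-class rows, accuracy can drop as sensitivity rises, which is why reading the two metrics together matters.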