Optimal Selection of Resampling Methods for Imbalanced Data with High Complexity

Other Titles: 복잡도가 높은 불균형 자료에서 최적의 표본 재추출 방법 (Optimal Resampling Methods for Imbalanced Data with High Complexity)

Authors: 김애니

College: College of Medicine (의과대학)

Department: Others (기타)

Degree: Master's (석사)

Issue Date: 2021-08

Abstract: Class imbalance is a major problem in classification tasks. When classes are imbalanced, the decision boundary is easily biased toward the majority class. Possible solutions fall into two groups: data-level solutions (resampling) and algorithm-level solutions. Resampling is the more adaptable of the two because it can be combined with various classifiers. However, some results have shown that oversampling, the most widely used resampling method, worsens classification performance. This is due to overgeneralization, which occurs when examples produced by oversampling fall in the majority class domain although they should represent the minority class. This thesis claims that overgeneralization is aggravated in complex data settings. To mitigate it in complex datasets, two alternative approaches are provided. The first incorporates a filtering method into oversampling. The second simply applies undersampling. The study investigates the relationship between complexity and imbalance in classification. Simulation studies and real data analysis were performed to compare resampling results across scenarios with different complexities, imbalance levels, and sample sizes. The study thereby aids researchers in choosing the optimal resampling method for complex datasets.

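Korean abstract (translated): Most learning algorithms assume that classes are balanced. With class-imbalanced data, classification algorithms are easily biased toward the majority class, which makes imbalance an important problem. Solutions to class imbalance fall broadly into data-level and algorithm-level solutions. The data-level solution, resampling, can be used together with a variety of classification algorithms, so it is used in many studies and new methods continue to be developed. However, some studies show that oversampling, the most widely used resampling method, degrades classification performance. This is due to the overgeneralization problem of oversampling. Overgeneralization refers to the case where synthetic data generated by oversampling, which should represent the minority class, are instead generated in the majority class region. This thesis argues that overgeneralization worsens in complex data and proposes two ways to address it there. The first is to combine oversampling with a filter, and the second is to apply undersampling. The aim of the study is to propose the optimal resampling method for complex data. To this end, resampling methods were compared using simulations that consider various complexities, degrees of imbalance, and sample sizes, as well as real data. The study thus presents suitable resampling methods according to the complexity of the data.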
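The mechanism described in the abstract can be illustrated with a small Python sketch. This is not the thesis's implementation: it assumes a SMOTE-style interpolation as the oversampler and a k-nearest-neighbour vote as the filter, neither of which is named on this page. The function names, the toy data, and the choice k=3 are all hypothetical.

```python
import math
import random


def interpolate_oversample(minority, n_new, k=3, rng=None):
    """Assumed SMOTE-style oversampler: each synthetic point lies on the
    segment between a minority point and one of its k nearest minority
    neighbours."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        t = rng.random()
        synthetic.append(tuple(b + t * (o - b) for b, o in zip(base, other)))
    return synthetic


def filter_synthetic(synthetic, minority, majority, k=3):
    """Assumed filter: keep a synthetic point only if more than half of its
    k nearest ORIGINAL points belong to the minority class."""
    labelled = [(p, 1) for p in minority] + [(p, 0) for p in majority]
    kept = []
    for s in synthetic:
        nearest = sorted(labelled, key=lambda pl: math.dist(s, pl[0]))[:k]
        if 2 * sum(label for _, label in nearest) > k:
            kept.append(s)
    return kept


def random_undersample(majority, n_keep, rng=None):
    """Undersampling alternative: keep a random subset of the majority."""
    rng = rng or random.Random(0)
    return rng.sample(majority, n_keep)


# Toy "complex" layout (hypothetical): two minority clusters with the
# majority class sitting between them, so interpolating across clusters
# can drop synthetic minority points into the majority region.
minority = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
majority = [(5, 5), (5, 6), (6, 5), (4, 5), (5, 4)]

synthetic = interpolate_oversample(minority, n_new=20, k=3)
filtered = filter_synthetic(synthetic, minority, majority, k=3)
reduced_majority = random_undersample(majority, n_keep=3)
print(len(synthetic), "synthetic points,", len(filtered), "kept after filtering")
```

In this toy layout each minority cluster has only three points, so with k=3 some neighbours come from the other cluster and the interpolated points can cross the majority region. The filter discards those points, whereas undersampling avoids generating synthetic points altogether.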

Files in This Item: TA03006.pdf

Appears in Collections: 1. College of Medicine (의과대학) > Others (기타) > 2. Thesis

URI: https://ir.ymlib.yonsei.ac.kr/handle/22282913/185530
