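024 Strategies for Handling Imbalanced Data

Keywords: imbalance, imbalanced

Overview

Many real-world classification problems have imbalanced classes. Fraud detection (1% fraud, 99% legitimate) and disease diagnosis (5% positive, 95% negative) are typical examples. This post walks through strategies for handling imbalanced data effectively.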

Environment

Python version: 3.11 recommended
Required packages: flaml[automl], scikit-learn, imbalanced-learn

pip install "flaml[automl]" scikit-learn imbalanced-learn

What Is Imbalanced Data?

Definition

Imbalanced data is data in which the number of samples differs markedly between classes.

from sklearn.datasets import make_classification
import numpy as np

# Generate imbalanced data (95:5)
X, y = make_classification(
    n_samples=10000,
    n_features=20,
    n_informative=10,
    n_classes=2,
    weights=[0.95, 0.05],  # 95% vs 5%
    random_state=42
)

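print("Class distribution:")
unique, counts = np.unique(y, return_counts=True)
for cls, count in zip(unique, counts):
    print(f"  Class {cls}: {count} ({count/len(y)*100:.1f}%)")

Output

Class distribution:
  Class 0: 9500 (95.0%)
  Class 1: 500 (5.0%)

Note: make_classification randomly reassigns 1% of labels by default (flip_y=0.01), so the counts you see may differ slightly from this nominal 95:5 split.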

The Problem

from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score, classification_report

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

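# A model that only ever predicts the majority class
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train, y_train)
y_pred_dummy = dummy.predict(X_test)

print("Always predicting the majority class (0):")
print(f"  Accuracy: {accuracy_score(y_test, y_pred_dummy):.4f}")  # about 0.95
# zero_division=0 avoids a warning: class 1 is never predicted, so its precision is undefined
print(f"  F1 (macro): {f1_score(y_test, y_pred_dummy, average='macro', zero_division=0):.4f}")  # low

Problem: accuracy is about 95%, yet the minority class is never predicted.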
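The gap between the two scores can be checked by hand. The sketch below is plain arithmetic under an assumption: a test split of exactly 1,900 negatives and 100 positives (the nominal 95:5 ratio, which the real split may not match exactly) and a classifier that always predicts class 0.

```python
# Confusion-matrix counts for an always-predict-0 classifier.
# Assumed test split: 1,900 negatives and 100 positives (nominal 95:5).
n_neg, n_pos = 1900, 100

tn, fp = n_neg, 0   # every negative is predicted correctly
fn, tp = n_pos, 0   # every positive is missed


def f1(tp, fp, fn):
    """F1 for one class; taken as 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


accuracy = (tp + tn) / (n_neg + n_pos)
f1_class1 = f1(tp, fp, fn)
# For class 0 the roles swap: its "true positives" are the true negatives.
f1_class0 = f1(tn, fn, fp)
macro_f1 = (f1_class0 + f1_class1) / 2

print(f"Accuracy:   {accuracy:.4f}")
print(f"F1 class 0: {f1_class0:.4f}")
print(f"F1 class 1: {f1_class1:.4f}")
print(f"F1 (macro): {macro_f1:.4f}")
```

Macro F1 averages the per-class scores with equal weight, so a class-1 F1 of zero pulls it below 0.5 even though accuracy stays at 0.95.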

Strategy 1: Choose an Appropriate Evaluation Metric

Poor Choice: Accuracy

from flaml import AutoML

# Optimize for accuracy (not recommended)
automl_acc = AutoML()
automl_acc.fit(
    X_train, y_train,
    task="classification",
    time_budget=30,
    metric="accuracy",
    verbose=0
)

y_pred_acc = automl_acc.predict(X_test)
print("Optimized for accuracy:")
print(classification_report(y_test, y_pred_acc))

Better Choice: F1, AUC

# Optimize for ROC AUC (recommended)
automl_auc = AutoML()
automl_auc.fit(
    X_train, y_train,
    task="classification",
    time_budget=30,
    metric="roc_auc",
    verbose=0
)

y_pred_auc = automl_auc.predict(X_test)
print("Optimized for ROC AUC:")
print(classification_report(y_test, y_pred_auc))
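ROC AUC depends only on how positives rank relative to negatives, not on a decision cutoff, which is one reason it is less distorted by the class ratio than accuracy. The sketch below computes it straight from that ranking definition on made-up scores. `pairwise_auc` is a hypothetical helper written for illustration, not a FLAML or scikit-learn function.

```python
def pairwise_auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. O(n_pos * n_neg), for illustration only.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample from each class")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Made-up example: 8 negatives followed by 2 positives.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.60, 0.40, 0.90]

auc = pairwise_auc(y_true, scores)
# Accuracy, by contrast, depends on the cutoff (0.5 here).
accuracy_at_half = sum((s >= 0.5) == (y == 1) for y, s in zip(y_true, scores)) / len(y_true)

print(f"AUC: {auc:.4f}")
print(f"Accuracy at cutoff 0.5: {accuracy_at_half:.2f}")
```

In this toy data 15 of the 16 positive-negative pairs are ordered correctly, so the AUC is 15/16 regardless of which cutoff is later chosen.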

Strategy 2: Class Weights

Using sample_weight

# Compute class-balanced sample weights
from sklearn.utils.class_weight import compute_sample_weight

sample_weights = compute_sample_weight('balanced', y_train)

print("Example sample weights:")
print(f"  Class 0 sample weight: {sample_weights[y_train==0][0]:.4f}")
print(f"  Class 1 sample weight: {sample_weights[y_train==1][0]:.4f}")

# Pass the weights to FLAML
automl_weighted = AutoML()
automl_weighted.fit(
    X_train, y_train,
    task="classification",
    time_budget=30,
    metric="f1",
    sample_weight=sample_weights,
    verbose=0
)

y_pred_weighted = automl_weighted.predict(X_test)
print("\nModel trained with sample weights:")
print(classification_report(y_test, y_pred_weighted))
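The 'balanced' setting weights each class by n_samples / (n_classes * class_count). The sketch below reproduces that formula in plain Python to show the resulting numbers; it assumes a training split of exactly 7,600 negatives and 400 positives, and `balanced_sample_weights` is a hypothetical helper, not part of scikit-learn.

```python
from collections import Counter


def balanced_sample_weights(labels):
    """Per-sample weights using n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    class_weight = {c: n_samples / (n_classes * k) for c, k in counts.items()}
    return [class_weight[y] for y in labels], class_weight


# Assumed training split: 7,600 negatives and 400 positives (nominal 95:5).
labels = [0] * 7600 + [1] * 400
weights, class_weight = balanced_sample_weights(labels)

print(f"Class 0 weight: {class_weight[0]:.4f}")
print(f"Class 1 weight: {class_weight[1]:.4f}")

# With these weights both classes contribute the same total weight.
total_0 = sum(w for w, y in zip(weights, labels) if y == 0)
total_1 = sum(w for w, y in zip(weights, labels) if y == 1)
print(f"Total weight per class: {total_0:.1f} vs {total_1:.1f}")
```

Each minority sample counts 19 times as much as a majority sample here, which is exactly the inverse of the 95:5 class ratio.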

Strategy 3: Resampling

Oversampling

Oversampling increases the number of minority-class samples.

from imblearn.over_sampling import SMOTE, RandomOverSampler

# SMOTE (Synthetic Minority Over-sampling Technique)
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

print("After SMOTE:")
unique, counts = np.unique(y_train_smote, return_counts=True)
for cls, count in zip(unique, counts):
    print(f"  Class {cls}: {count}")

# Train FLAML on the SMOTE-resampled data (the test set is left untouched)
automl_smote = AutoML()
automl_smote.fit(
    X_train_smote, y_train_smote,
    task="classification",
    time_budget=30,
    metric="f1",
    verbose=0
)

y_pred_smote = automl_smote.predict(X_test)
print("\nSMOTE + FLAML:")
print(classification_report(y_test, y_pred_smote))
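SMOTE does not duplicate minority rows; it creates new points by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below is a simplified illustration of that idea on made-up 2-D points. `smote_like_sample` is a hypothetical helper and is not how imbalanced-learn implements SMOTE internally.

```python
import random


def smote_like_sample(minority, k=3, rng=None):
    """Create one synthetic point by interpolating toward a near neighbour.

    Simplified illustration of the idea behind SMOTE; not imblearn's code.
    """
    rng = rng or random.Random()
    base = rng.choice(minority)
    # The k nearest minority neighbours of `base` by squared Euclidean distance
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))
    neighbour = rng.choice(others[:k])
    # A random position on the segment between base and neighbour
    u = rng.random()
    return tuple(a + u * (b - a) for a, b in zip(base, neighbour))


# Made-up 2-D minority samples
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3), (0.9, 0.7)]
rng = random.Random(42)
synthetic = [smote_like_sample(minority, k=3, rng=rng) for _ in range(5)]

for point in synthetic:
    print(f"({point[0]:.3f}, {point[1]:.3f})")
```

Because every synthetic point lies on a segment between two real minority samples, the new data stays inside the region the minority class already occupies.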

Undersampling

Undersampling reduces the number of majority-class samples.

from imblearn.under_sampling import RandomUnderSampler

# Random undersampling
rus = RandomUnderSampler(random_state=42)
X_train_under, y_train_under = rus.fit_resample(X_train, y_train)

print("After undersampling:")
unique, counts = np.unique(y_train_under, return_counts=True)
for cls, count in zip(unique, counts):
    print(f"  Class {cls}: {count}")
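Random undersampling amounts to keeping every minority row and drawing an equally sized random subset of the majority rows. A minimal sketch of that idea on made-up data follows; `random_undersample` is a hypothetical helper, not the imbalanced-learn implementation.

```python
import random


def random_undersample(X, y, seed=0):
    """Randomly drop rows until every class has the size of the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(y):
        by_class.setdefault(label, []).append(idx)
    target = min(len(idxs) for idxs in by_class.values())
    keep = []
    for idxs in by_class.values():
        keep.extend(rng.sample(idxs, target))
    keep.sort()  # preserve the original row order
    return [X[i] for i in keep], [y[i] for i in keep]


# Made-up data: 95 negatives and 5 positives, one feature each
X = [[float(i)] for i in range(100)]
y = [0] * 95 + [1] * 5

X_under, y_under = random_undersample(X, y, seed=42)
print(f"Before: {y.count(0)} vs {y.count(1)}")
print(f"After:  {y_under.count(0)} vs {y_under.count(1)}")
```

The trade-off is visible in the counts: the classes are balanced, but 90 of the 95 majority rows are discarded, which is why undersampling suits large datasets better than small ones.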

Combined Sampling

from imblearn.combine import SMOTETomek

# SMOTE + Tomek links
smote_tomek = SMOTETomek(random_state=42)
X_train_st, y_train_st = smote_tomek.fit_resample(X_train, y_train)

print("After SMOTETomek:")
unique, counts = np.unique(y_train_st, return_counts=True)
for cls, count in zip(unique, counts):
    print(f"  Class {cls}: {count}")
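A Tomek link is a pair of samples from different classes that are each other's nearest neighbour; such pairs sit on the class boundary, and removing them is the cleaning step that follows oversampling in this combined approach. The sketch below detects Tomek links on made-up 1-D data with a brute-force search. `find_tomek_links` is a hypothetical helper for illustration only.

```python
def find_tomek_links(X, y):
    """Return index pairs (i, j) that are mutual nearest neighbours with different labels."""

    def nearest(i):
        # Brute-force nearest neighbour by squared Euclidean distance
        return min(
            (j for j in range(len(X)) if j != i),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(X[i], X[j])),
        )

    nn = [nearest(i) for i in range(len(X))]
    return [
        (i, j)
        for i, j in enumerate(nn)
        if i < j and nn[j] == i and y[i] != y[j]
    ]


# Made-up 1-D data: the two classes meet around 2.0
X = [(0.0,), (0.5,), (1.9,), (2.0,), (3.0,), (3.4,)]
y = [0, 0, 0, 1, 1, 1]

links = find_tomek_links(X, y)
print(links)
```

Only the pair at 1.9 and 2.0 qualifies: the other mutual nearest neighbours share a label, so they are not Tomek links.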

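Strategy 4: Threshold Adjustment

from sklearn.metrics import precision_recall_curve

# Predicted probability of the positive class
y_prob = automl_auc.predict_proba(X_test)[:, 1]

# Precision and recall at every candidate threshold
precision, recall, thresholds = precision_recall_curve(y_test, y_prob)

# Find the threshold that maximizes F1.
# precision and recall have one more element than thresholds (the final
# point has no threshold), so drop it before computing F1.
f1_scores = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-10)
optimal_idx = np.argmax(f1_scores)
optimal_threshold = thresholds[optimal_idx]

print(f"Optimal threshold: {optimal_threshold:.4f}")
print(f"F1 at that threshold: {f1_scores[optimal_idx]:.4f}")

# Predict with the optimal threshold.
# Note: the threshold is tuned on the test set here for brevity, so this F1
# is optimistic. In practice, tune it on a separate validation set.
y_pred_optimal = (y_prob >= optimal_threshold).astype(int)
print("\nWith the optimal threshold:")
print(classification_report(y_test, y_pred_optimal))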
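To get an unbiased estimate, the threshold is usually chosen on held-out validation data and then frozen before it is applied to new data. The sketch below shows that workflow in plain Python on made-up scores; `best_f1_threshold` and all of the numbers are hypothetical and only illustrate the procedure.

```python
def best_f1_threshold(y_true, scores):
    """Return (threshold, f1) maximizing F1 over the distinct score values."""
    best_t, best_f1 = 0.5, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        fn = sum(1 for y, s in zip(y_true, scores) if s < t and y == 1)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Made-up validation labels and scores (hypothetical model output)
y_val = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
scores_val = [0.02, 0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.40, 0.45, 0.80]

threshold, f1_val = best_f1_threshold(y_val, scores_val)
print(f"Threshold chosen on validation data: {threshold:.2f} (F1 = {f1_val:.4f})")

# The frozen threshold is then applied to scores the search never saw
scores_new = [0.10, 0.50, 0.33, 0.90]
y_pred_new = [int(s >= threshold) for s in scores_new]
print(y_pred_new)
```

With the default cutoff of 0.5 the same validation scores would catch only one of the three positives, which is why lowering the threshold helps when positives are rare.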

Comparing the Strategies

from sklearn.metrics import f1_score

strategies = {
    'Accuracy-optimized': y_pred_acc,
    'ROC AUC-optimized': y_pred_auc,
    'Sample weight': y_pred_weighted,
    'SMOTE': y_pred_smote,
    'Threshold adjustment': y_pred_optimal
}

print("Performance by strategy:")
print("-" * 60)
print(f"{'Strategy':<22} {'F1 (class 1)':<15} {'F1 (macro)':<15}")
print("-" * 60)

for name, y_pred in strategies.items():
    f1_cls1 = f1_score(y_test, y_pred, pos_label=1)
    f1_macro = f1_score(y_test, y_pred, average='macro')
    print(f"{name:<22} {f1_cls1:<15.4f} {f1_macro:<15.4f}")

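Strategy Selection Guide

Situation	Recommended strategy
Mild imbalance (70:30)	Choosing an appropriate metric is enough
Moderate imbalance (90:10)	Sample weight + F1/AUC metric
Severe imbalance (99:1)	SMOTE + sample weight + AUC
Little data	Oversampling (SMOTE)
Plenty of data	Also consider undersampling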

Putting It Together: Imbalance Handling with FLAML

from flaml import AutoML
from imblearn.over_sampling import SMOTE
from sklearn.metrics import classification_report
from sklearn.utils.class_weight import compute_sample_weight

def train_with_imbalance_handling(X_train, y_train, X_test, y_test,
                                   use_smote=True, use_weight=True):
    """Train FLAML with imbalance handling applied to the training data."""

    # 1. Apply SMOTE (training data only; the test set is never resampled)
    if use_smote:
        smote = SMOTE(random_state=42)
        X_train_processed, y_train_processed = smote.fit_resample(X_train, y_train)
    else:
        X_train_processed, y_train_processed = X_train, y_train

    # 2. Compute sample weights.
    # Note: after SMOTE with default settings the classes are already balanced,
    # so 'balanced' weights are all 1.0 and only matter when use_smote=False.
    if use_weight:
        weights = compute_sample_weight('balanced', y_train_processed)
    else:
        weights = None

    # 3. Train FLAML
    automl = AutoML()
    automl.fit(
        X_train_processed, y_train_processed,
        task="classification",
        time_budget=30,
        metric="f1",  # a metric suited to imbalanced data
        sample_weight=weights,
        verbose=0
    )

    # 4. Predict on the untouched test set
    y_pred = automl.predict(X_test)

    return automl, y_pred

# Usage
automl, y_pred = train_with_imbalance_handling(
    X_train, y_train, X_test, y_test,
    use_smote=True, use_weight=True
)

print("Combined imbalance handling result:")
print(classification_report(y_test, y_pred))

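Summary

On imbalanced data, accuracy can be misleading.
Use F1 score or ROC AUC as the evaluation metric.
Sample weights let you give more weight to the minority class.
SMOTE oversamples the minority class with synthetic samples.
Threshold adjustment lets you tune the precision-recall trade-off.
Combine strategies to fit your situation.

Coming Up Next

The next post takes a closer look at applying class weights with sample_weight.

FLAML AutoML Master Series #024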
