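025 Applying Class Weights with sample_weight

Keywords: sample_weight, weights

Overview

One of the simplest ways to handle imbalanced data is to use sample weights (sample_weight). Assigning higher weights to minority-class samples makes each of them contribute more to the training loss, so the model treats them as more important.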

Environment

Python version: 3.11 recommended
Required packages: flaml[automl], scikit-learn

pip install "flaml[automl]" scikit-learn

What is sample_weight?

Concept

A sample weight expresses how much each individual sample counts during training.

import numpy as np

# Example: 5 samples (3 of class 0, 2 of class 1)
y = np.array([0, 0, 0, 1, 1])

# No weighting (all samples count equally)
weights_none = np.array([1, 1, 1, 1, 1])

# Weighted (class 1 gets a higher weight, so both classes total 3.0)
weights_balanced = np.array([1, 1, 1, 1.5, 1.5])

print("Total weight per class:")
print(f"  Unweighted: class 0={weights_none[y==0].sum()}, class 1={weights_none[y==1].sum()}")
print(f"  Weighted:   class 0={weights_balanced[y==0].sum()}, class 1={weights_balanced[y==1].sum()}")
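
To make "counts more" concrete, here is a minimal sketch of how weights enter a loss function. It assumes a log-loss objective and uses made-up predicted probabilities (`p` is hypothetical, not the output of any model in this post). The weighted loss is a weighted average of the per-sample losses, which is also how scikit-learn's log_loss treats sample_weight.

```python
import numpy as np
from sklearn.metrics import log_loss

y = np.array([0, 0, 0, 1, 1])
# Hypothetical predicted probabilities of class 1 (made up for illustration)
p = np.array([0.1, 0.2, 0.3, 0.4, 0.6])
w = np.array([1, 1, 1, 1.5, 1.5])

# Per-sample negative log-likelihood
per_sample = -(y * np.log(p) + (1 - y) * np.log(1 - p))

unweighted = per_sample.mean()
weighted = np.average(per_sample, weights=w)

print(f"unweighted log loss: {unweighted:.4f}")
print(f"weighted log loss:   {weighted:.4f}")

# scikit-learn applies sample_weight as the same weighted average
sk_weighted = log_loss(y, p, sample_weight=w)
print(f"sklearn weighted:    {sk_weighted:.4f}")
```

In this toy case the class 1 samples are predicted poorly, so giving them more weight raises the loss, and a learner minimizing the weighted loss has more incentive to fix them.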

How to compute the weights

Method 1: compute_sample_weight

A utility function from scikit-learn.

from sklearn.utils.class_weight import compute_sample_weight

# Generate imbalanced data
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_classes=2,
    weights=[0.9, 0.1],  # roughly 90:10 imbalance
    random_state=42
)

# Compute the weights automatically
sample_weights = compute_sample_weight('balanced', y)

print("Sample weight per class:")
print(f"  Class 0 sample weight: {sample_weights[y==0][0]:.4f}")
print(f"  Class 1 sample weight: {sample_weights[y==1][0]:.4f}")
print(f"  Ratio: 1:{sample_weights[y==1][0]/sample_weights[y==0][0]:.2f}")

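Output

For an exact 900:100 split the output looks like this. Note that make_classification flips a small fraction of labels by default (flip_y=0.01), so the actual class counts, and therefore the weights, may differ slightly when you run it.

Sample weight per class:
  Class 0 sample weight: 0.5556
  Class 1 sample weight: 5.0000
  Ratio: 1:9.00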
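
These numbers follow directly from scikit-learn's 'balanced' heuristic, n_samples / (n_classes * count_per_class). The short sketch below reproduces the arithmetic, assuming an exact 900:100 split (the `counts` dictionary is an illustrative assumption, not measured from the data above).

```python
n_samples = 1000
n_classes = 2
counts = {0: 900, 1: 100}  # assumed exact 90:10 split

# 'balanced' weight for each class
balanced = {cls: n_samples / (n_classes * n) for cls, n in counts.items()}
for cls, w in balanced.items():
    print(f"class {cls}: {w:.4f}")
# class 0: 0.5556
# class 1: 5.0000

# Each class ends up with the same total weight: n_samples / n_classes = 500
totals = {cls: balanced[cls] * counts[cls] for cls in counts}
print(totals)
```

The second print shows why the scheme is called balanced: after weighting, both classes carry the same total weight.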

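Method 2: compute_class_weight

Computes one weight per class rather than one per sample.

from sklearn.utils.class_weight import compute_class_weight

# Compute class weights (returned in the same order as `classes`)
classes = np.unique(y)
class_weights = compute_class_weight('balanced', classes=classes, y=y)

print("Class weights:")
for cls, weight in zip(classes, class_weights):
    print(f"  Class {cls}: {weight:.4f}")

# Convert to per-sample weights (valid here because the labels are 0 and 1,
# so each label doubles as an index into class_weights)
sample_weights_from_class = class_weights[y]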
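
Class weights and sample weights are two routes to the same result. As a rough check, the sketch below fits a plain scikit-learn LogisticRegression twice: once with class_weight="balanced" and once with sample weights from compute_sample_weight. LogisticRegression is used only as a convenient stand-in estimator for illustration; it is not what FLAML necessarily selects.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_sample_weight

X, y = make_classification(
    n_samples=1000, n_features=10, n_classes=2,
    weights=[0.9, 0.1], random_state=42,
)

# Route 1: let the estimator derive per-class weights itself
clf_class = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

# Route 2: pass explicit per-sample weights
sw = compute_sample_weight("balanced", y)
clf_sample = LogisticRegression(max_iter=1000).fit(X, y, sample_weight=sw)

same_pred = np.array_equal(clf_class.predict(X), clf_sample.predict(X))
max_coef_diff = np.abs(clf_class.coef_ - clf_sample.coef_).max()
print(f"identical predictions: {same_pred}")
print(f"max coefficient difference: {max_coef_diff:.2e}")
```

Sample weights are the more general of the two, since they also allow per-sample adjustments that no per-class scheme can express.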

Method 3: manual calculation

def compute_weights_manual(y, strategy='balanced'):
    """Compute sample weights by hand."""
    # `inverse` maps each sample to the index of its class in `classes`
    classes, inverse, counts = np.unique(y, return_inverse=True, return_counts=True)
    n_samples = len(y)
    n_classes = len(classes)

    if strategy == 'balanced':
        # balanced: n_samples / (n_classes * count_per_class)
        weights_per_class = n_samples / (n_classes * counts)
    elif strategy == 'inverse':
        # inverse: 1 / count_per_class, normalized so the class weights sum to n_classes
        weights_per_class = 1 / counts
        weights_per_class = weights_per_class / weights_per_class.sum() * n_classes
    else:
        raise ValueError(f"Unknown strategy: {strategy!r}")

    # Indexing by `inverse` works for any label values, not just 0..n_classes-1
    return weights_per_class[inverse]

# Test
manual_weights = compute_weights_manual(y, 'balanced')
print(f"Manual result: {manual_weights[y==0][0]:.4f}, {manual_weights[y==1][0]:.4f}")
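
The 'balanced' and 'inverse' strategies are both proportional to 1 / count_per_class, so they differ only in overall scale. The sketch below checks this for an assumed 900:100 split (the `counts` array is illustrative).

```python
import numpy as np

counts = np.array([900, 100])  # assumed class counts
n_samples, n_classes = counts.sum(), len(counts)

# balanced: n_samples / (n_classes * count_per_class)
balanced = n_samples / (n_classes * counts)

# inverse: 1 / count_per_class, normalized so the weights sum to n_classes
inverse = 1 / counts
inverse = inverse / inverse.sum() * n_classes

print("balanced:", balanced)
print("inverse: ", inverse)

ratio_balanced = balanced[1] / balanced[0]
ratio_inverse = inverse[1] / inverse[0]
print(f"minority/majority ratio: {ratio_balanced:.2f} vs {ratio_inverse:.2f}")
```

Both give the minority class 9 times the weight of the majority class. The overall scale may still matter for learners whose regularization or minimum-weight hyperparameters interact with the total weight, so the two are not always interchangeable in practice.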

Using sample_weight in FLAML

Basic usage

from flaml import AutoML
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Split the data (stratified to preserve the class ratio)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Compute the weights from the training labels only
train_weights = compute_sample_weight('balanced', y_train)

# Train with FLAML (weights applied)
automl = AutoML()
automl.fit(
    X_train, y_train,
    task="classification",
    time_budget=30,
    metric="f1",
    sample_weight=train_weights,  # pass the weights
    verbose=0
)

y_pred = automl.predict(X_test)
print("Results with sample_weight:")
print(classification_report(y_test, y_pred))

Weighted vs. unweighted comparison

# Train without weights
# Note: this baseline also optimizes a different metric (accuracy),
# so the comparison reflects both the metric change and the weighting
automl_no_weight = AutoML()
automl_no_weight.fit(
    X_train, y_train,
    task="classification",
    time_budget=30,
    metric="accuracy",  # the common default choice
    verbose=0
)

y_pred_no_weight = automl_no_weight.predict(X_test)

# Train with weights
automl_weighted = AutoML()
automl_weighted.fit(
    X_train, y_train,
    task="classification",
    time_budget=30,
    metric="f1",
    sample_weight=train_weights,
    verbose=0
)

y_pred_weighted = automl_weighted.predict(X_test)

# Compare
from sklearn.metrics import f1_score

print("Comparison:")
print(f"{'Method':<25} {'F1 (class 1)':<15} {'F1 (macro)':<15}")
print("-" * 55)

f1_no = f1_score(y_test, y_pred_no_weight, pos_label=1)
f1_macro_no = f1_score(y_test, y_pred_no_weight, average='macro')
print(f"{'Unweighted':<25} {f1_no:<15.4f} {f1_macro_no:<15.4f}")

f1_w = f1_score(y_test, y_pred_weighted, pos_label=1)
f1_macro_w = f1_score(y_test, y_pred_weighted, average='macro')
print(f"{'Weighted':<25} {f1_w:<15.4f} {f1_macro_w:<15.4f}")
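
To see the effect of the weights alone, it can help to hold everything else fixed. The sketch below is a simplified, scikit-learn-only comparison that swaps FLAML for a single LogisticRegression, so it illustrates the mechanism rather than reproducing the FLAML results. The only thing that changes between the two runs is sample_weight.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_sample_weight

X, y = make_classification(
    n_samples=1000, n_features=10, n_classes=2,
    weights=[0.9, 0.1], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

runs = [
    ("unweighted", None),
    ("weighted", compute_sample_weight("balanced", y_train)),
]

results = {}
for name, sw in runs:
    # Same estimator and settings; only the sample weights differ
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train, sample_weight=sw)
    y_pred = clf.predict(X_test)
    results[name] = {
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "recall": recall_score(y_test, y_pred, zero_division=0),
        "predicted_positives": int(y_pred.sum()),
    }
    print(name, results[name])
```

Looking at precision and recall separately usually makes the trade-off clearer than a single F1 number: weighting tends to buy minority-class recall by predicting the positive class more often.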

Custom weights

Cost-based weights

These weights reflect the real-world cost of each type of error.

def cost_based_weights(y, cost_fp=1, cost_fn=10):
    """
    Compute cost-based sample weights.

    cost_fp: cost of a false positive (a negative predicted as positive)
    cost_fn: cost of a false negative (a positive predicted as negative); usually higher
    """
    weights = np.ones(len(y), dtype=float)

    # Misclassifying a negative (0) sample is a false positive -> cost_fp
    # Misclassifying a positive (1) sample is a false negative -> cost_fn
    weights[y == 0] = cost_fp
    weights[y == 1] = cost_fn

    # Normalize so that the mean weight is 1
    weights = weights / weights.mean()

    return weights

# Example: missing a positive is 10 times more costly
custom_weights = cost_based_weights(y_train, cost_fp=1, cost_fn=10)

print("Custom weights:")
print(f"  Class 0 mean weight: {custom_weights[y_train==0].mean():.4f}")
print(f"  Class 1 mean weight: {custom_weights[y_train==1].mean():.4f}")
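
A small worked example shows what the normalization step does. The sketch below uses a toy label vector with nine negatives and one positive and the assumed costs 1 and 10; the names here are illustrative.

```python
import numpy as np

y_toy = np.array([0] * 9 + [1])  # toy 90:10 labels
cost_fp, cost_fn = 1, 10         # assumed costs

raw = np.where(y_toy == 1, cost_fn, cost_fp).astype(float)
# mean(raw) = (9 * 1 + 1 * 10) / 10 = 1.9
weights = raw / raw.mean()

print(f"class 0 weight: {weights[0]:.4f}")   # 1 / 1.9  = 0.5263
print(f"class 1 weight: {weights[-1]:.4f}")  # 10 / 1.9 = 5.2632
print(f"mean weight:    {weights.mean():.4f}")
print(f"ratio:          {weights[-1] / weights[0]:.1f}")
```

Normalizing rescales the weights so their mean is 1 but leaves the 10:1 cost ratio intact. Unlike the 'balanced' scheme, that ratio is set by the costs and does not depend on how rare the positive class is.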

Down-weighting outliers

Assign lower weights to outliers.

from sklearn.ensemble import IsolationForest

def outlier_aware_weights(X, y, base_weights, contamination=0.1):
    """Assign lower weights to outliers."""
    # y is unused: IsolationForest is unsupervised and looks only at X
    iso = IsolationForest(contamination=contamination, random_state=42)
    outlier_labels = iso.fit_predict(X)

    # Weight 0.5 for outliers (-1), 1.0 for inliers (1)
    outlier_weights = np.where(outlier_labels == 1, 1.0, 0.5)

    return base_weights * outlier_weights

# Apply
outlier_weights = outlier_aware_weights(X_train, y_train, train_weights)
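
Because the outlier detector ignores the labels, it is worth checking which class the flagged samples come from before trusting the combined weights. If minority samples are flagged disproportionately, the down-weighting partly undoes the class weighting. The sketch below is one way to inspect this, using the same synthetic data settings as earlier in this post.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest

X, y = make_classification(
    n_samples=1000, n_features=10, n_classes=2,
    weights=[0.9, 0.1], random_state=42,
)

iso = IsolationForest(contamination=0.1, random_state=42)
is_outlier = iso.fit_predict(X) == -1  # -1 marks an outlier

# Overall share of flagged samples (close to `contamination` by construction)
frac_flagged = float(is_outlier.mean())
# Share of each class that was flagged
frac_by_class = {int(c): float(is_outlier[y == c].mean()) for c in np.unique(y)}

print(f"flagged overall: {frac_flagged:.3f}")
for c, frac in frac_by_class.items():
    print(f"  class {c}: {frac:.3f} flagged")
```

If the minority class shows a much higher flagged share, one option is to fit the detector per class or to skip down-weighting for minority samples; which is appropriate depends on the data.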

Weights in cross-validation

# Weights for cross-validation
full_weights = compute_sample_weight('balanced', y)

# FLAML runs the cross-validation internally, so pass the weights
# for the entire training set to fit().
# eval_method="cv" forces cross-validation (the default "auto" may choose holdout)
automl_cv = AutoML()
automl_cv.fit(
    X, y,
    task="classification",
    time_budget=60,
    metric="f1",
    sample_weight=full_weights,
    eval_method="cv",
    n_splits=5,  # 5-fold CV
    verbose=0
)

print(f"Best model under cross-validation: {automl_cv.best_estimator}")
print(f"Validation F1: {1 - automl_cv.best_loss:.4f}")
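
For intuition about what passing weights into a cross-validated fit involves, the sketch below does the fold handling by hand with scikit-learn. It is an illustration of the general technique with a stand-in LogisticRegression, not a description of FLAML's internals: the weights are sliced with the same indices as the rows, and only the training part goes into fit().

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.utils.class_weight import compute_sample_weight

X, y = make_classification(
    n_samples=1000, n_features=10, n_classes=2,
    weights=[0.9, 0.1], random_state=42,
)
weights = compute_sample_weight("balanced", y)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_f1 = []
for train_idx, val_idx in skf.split(X, y):
    clf = LogisticRegression(max_iter=1000)
    # Slice the weights with the same indices as the training rows
    clf.fit(X[train_idx], y[train_idx], sample_weight=weights[train_idx])
    # Validation F1 is computed without weights in this sketch
    fold_f1.append(f1_score(y[val_idx], clf.predict(X[val_idx])))

print("F1 per fold:", np.round(fold_f1, 4))
print(f"mean F1: {np.mean(fold_f1):.4f}")
```

StratifiedKFold keeps the class ratio roughly constant in every fold, which matters on imbalanced data where a plain split could leave a fold with very few minority samples.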

Cautions

1. Excessive weights

# Caution: overly extreme weights
extreme_weights = np.where(y_train == 1, 100, 1)  # 100x difference

# In this case the model may focus almost entirely on the minority class,
# and majority-class performance can drop sharply
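
One way to judge whether a weight is extreme is to look at the share of total weight each class ends up with. The sketch below does this arithmetic for an assumed 900:100 split and three candidate minority weights (the counts and candidate values are illustrative).

```python
import numpy as np

counts = np.array([900, 100])  # assumed majority / minority counts

minority_share = {}
for w_pos in (1, 9, 100):
    class_w = np.array([1.0, w_pos])
    total = counts * class_w  # total weight carried by each class
    minority_share[w_pos] = float(total[1] / total.sum())
    print(f"minority weight {w_pos:>3}: minority share of total weight = {minority_share[w_pos]:.3f}")
# minority weight   1: minority share of total weight = 0.100
# minority weight   9: minority share of total weight = 0.500
# minority weight 100: minority share of total weight = 0.917
```

A weight of 9 balances the two classes, while a weight of 100 reverses the imbalance: the majority class is left with under 10% of the total weight.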

2. Combining weights and metric

# Recommended:
# - sample_weight + metric="f1" (or "roc_auc")
# Not recommended:
# - sample_weight + metric="accuracy" (selecting models by accuracy can cancel out the benefit)

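3. No weights at test time

# Apply weights only during training (same settings as the basic usage example)
automl.fit(X_train, y_train, task="classification", time_budget=30, metric="f1", sample_weight=train_weights)

# Do not use weights for prediction or evaluation
y_pred = automl.predict(X_test)  # no weights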

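Summary

sample_weight expresses the importance of each sample.
compute_sample_weight('balanced', y) computes the weights automatically.
In FLAML, pass them to fit() through the sample_weight parameter.
On imbalanced data, weighting can improve minority-class performance, typically recall, sometimes at the cost of precision.
Cost-based weights let you reflect real business costs.
Using the weights together with metric="f1" or "roc_auc" is recommended.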

Next up

The next post covers cross-validation settings with n_splits: how to configure and use cross-validation in FLAML.

FLAML AutoML Master Series #025
