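028 Classification Project - Credit Card Fraud Detection

Keywords: fraud detection, imbalanced data, hands-on project

Overview

Credit card fraud detection is a classic imbalanced classification problem. In real data, fraudulent transactions are very rare, typically around 0.1-1% of all transactions, yet detecting them accurately is critical.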

Environment

Python version: 3.11 recommended
Required packages: flaml[automl], pandas, scikit-learn, imbalanced-learn, matplotlib (used for the plots in Step 2)

pip install "flaml[automl]" pandas scikit-learn imbalanced-learn matplotlib

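Project Overview

Goal

Build a model that detects fraudulent transactions among credit card transactions.

Evaluation Metrics

Recall: the share of actual fraud cases that the model catches
Precision: the share of transactions flagged as fraud that are actually fraud
F1 Score: the harmonic mean of Precision and Recall
ROC AUC: how well the model ranks fraud above normal transactions across all thresholds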
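
As a rough worked example of how the first three metrics follow from confusion-matrix counts, the sketch below uses hypothetical counts (the function name and the numbers are illustrative assumptions, not results from this project). It also shows why accuracy alone can look good while recall is zero.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when nothing is predicted positive
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts: 2,000 transactions, 20 of them fraudulent.
# A model that always predicts "normal" never raises a fraud alert.
always_normal = classification_metrics(tp=0, fp=0, fn=20, tn=1980)

# A hypothetical detector that catches 15 of the 20 fraud cases
# at the price of 10 false alarms.
detector = classification_metrics(tp=15, fp=10, fn=5, tn=1970)

print(always_normal)
print(detector)
```

Both models have accuracy above 99%, so only Recall, Precision, and F1 separate them.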

Step 1: Data Preparation

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Generate imbalanced data that resembles the real problem
# (for a real project, the Kaggle Credit Card Fraud dataset is recommended)
X, y = make_classification(
    n_samples=10000,
    n_features=30,
    n_informative=15,
    n_redundant=5,
    n_classes=2,
    weights=[0.99, 0.01],  # 99% normal, 1% fraud
    flip_y=0.01,  # a small amount of label noise
    random_state=42
)

# Convert to a DataFrame
feature_names = [f'V{i}' for i in range(1, 31)]
df = pd.DataFrame(X, columns=feature_names)
df['Class'] = y  # 0: normal, 1: fraud

print("Dataset info:")
print(f"  Total transactions: {len(df):,}")
print(f"  Normal transactions: {(df['Class'] == 0).sum():,} ({(df['Class'] == 0).mean()*100:.2f}%)")
print(f"  Fraud transactions: {(df['Class'] == 1).sum():,} ({(df['Class'] == 1).mean()*100:.2f}%)")

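Output

The figures below are illustrative. Because flip_y=0.01 randomly reassigns about 1% of the labels, an actual run typically reports somewhat more than 100 fraud cases.

Dataset info:
  Total transactions: 10,000
  Normal transactions: 9,900 (99.00%)
  Fraud transactions: 100 (1.00%)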

Step 2: Exploratory Data Analysis

import matplotlib.pyplot as plt

# Visualize the class distribution
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Class distribution (linear scale)
class_counts = df['Class'].value_counts()
axes[0].bar(['Normal', 'Fraud'], class_counts.values, color=['green', 'red'])
axes[0].set_title('Class distribution')
axes[0].set_ylabel('Number of transactions')

# Same counts on a log scale, so the rare fraud class is visible
axes[1].bar(['Normal', 'Fraud'], class_counts.values, color=['green', 'red'])
axes[1].set_yscale('log')
axes[1].set_title('Class distribution (log scale)')
axes[1].set_ylabel('Number of transactions (log)')

plt.tight_layout()
plt.show()

# Per-feature statistics
print("\nNormal vs. fraud feature means (first 5 features):")
print(df.groupby('Class')[feature_names[:5]].mean())

Step 3: Train/Test Split

# Separate features and target
X = df[feature_names]
y = df['Class']

# Train/test split (stratified)
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y  # preserve the class ratio in both splits
)

print("Data split:")
print(f"  Training data: {len(X_train)}")
print(f"  Test data: {len(X_test)}")
print("\nTraining data class distribution:")
print(f"  Normal: {(y_train == 0).sum()} ({(y_train == 0).mean()*100:.2f}%)")
print(f"  Fraud: {(y_train == 1).sum()} ({(y_train == 1).mean()*100:.2f}%)")
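
To illustrate what stratify=y does conceptually, here is a minimal pure-Python sketch that splits each class separately so both parts keep the same class ratio. The helper stratified_split is a hypothetical name used only for illustration, and the exact 9,900/100 label counts are an assumption; scikit-learn's actual implementation differs in detail.

```python
import random


def stratified_split(labels, test_size, seed):
    """Split indices so that each class contributes test_size of its samples."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        # Shuffle the indices of this class only, then cut them proportionally
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_size)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return train_idx, test_idx


# Assumed labels: 9,900 normal (0) and 100 fraud (1)
labels = [0] * 9900 + [1] * 100
train_idx, test_idx = stratified_split(labels, test_size=0.2, seed=42)

fraud_in_train = sum(labels[i] for i in train_idx)
fraud_in_test = sum(labels[i] for i in test_idx)
print(len(train_idx), fraud_in_train)
print(len(test_idx), fraud_in_test)
```

Without stratification, a purely random split of so few fraud cases could leave the test set with noticeably more or fewer of them than the 1% ratio implies, which makes the evaluation unstable.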

Step 4: Baseline Model

from sklearn.dummy import DummyClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Always predict the majority class
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train, y_train)
y_pred_dummy = dummy.predict(X_test)

print("Baseline model (always predicts normal):")
# zero_division=0 avoids warnings, since the fraud class is never predicted
print(classification_report(y_test, y_pred_dummy, target_names=['Normal', 'Fraud'], zero_division=0))

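Output

The report below is illustrative; exact values depend on the generated data.

Baseline model (always predicts normal):
              precision    recall  f1-score   support

      Normal       0.99      1.00      1.00      1980
       Fraud       0.00      0.00      0.00        20

    accuracy                           0.99      2000
   macro avg       0.50      0.50      0.49      2000
weighted avg       0.98      0.99      0.99      2000

→ Accuracy is 99%, yet the model does not detect a single fraud case!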

Step 5: Basic FLAML Model

from flaml import AutoML

# Basic FLAML (no weighting)
automl_basic = AutoML()
automl_basic.fit(
    X_train, y_train,
    task="classification",
    time_budget=60,
    metric="accuracy",  # caution: not suitable for imbalanced data
    verbose=0
)

y_pred_basic = automl_basic.predict(X_test)
print("Basic FLAML model (optimized for accuracy):")
print(classification_report(y_test, y_pred_basic, target_names=['Normal', 'Fraud'], zero_division=0))

Step 6: Handling Imbalanced Data

Method 1: Use an appropriate metric

# Optimize for F1 score
automl_f1 = AutoML()
automl_f1.fit(
    X_train, y_train,
    task="classification",
    time_budget=60,
    metric="f1",  # optimize F1 instead of accuracy
    verbose=0
)

y_pred_f1 = automl_f1.predict(X_test)
print("FLAML (optimized for F1):")
print(classification_report(y_test, y_pred_f1, target_names=['Normal', 'Fraud'], zero_division=0))

Method 2: Apply sample weights

from sklearn.utils.class_weight import compute_sample_weight

# Compute per-sample weights from the class frequencies
sample_weights = compute_sample_weight('balanced', y_train)

print("Weights:")
print(f"  Normal transactions: {sample_weights[(y_train == 0).to_numpy()][0]:.4f}")
print(f"  Fraud transactions: {sample_weights[(y_train == 1).to_numpy()][0]:.4f}")

# Train with the weights applied
automl_weighted = AutoML()
automl_weighted.fit(
    X_train, y_train,
    task="classification",
    time_budget=60,
    metric="f1",
    sample_weight=sample_weights,
    verbose=0
)

y_pred_weighted = automl_weighted.predict(X_test)
print("\nFLAML (weighted):")
print(classification_report(y_test, y_pred_weighted, target_names=['Normal', 'Fraud'], zero_division=0))
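
For intuition about the weights printed above, the 'balanced' mode assigns each class the weight n_samples / (n_classes * class_count). The sketch below reproduces that formula in plain Python; the helper name balanced_class_weights and the 7,920/80 training counts are assumptions for illustration (they correspond to an exact 1% fraud rate in an 8,000-row training set).

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight per class: n_samples / (n_classes * number of samples in the class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Assumed training labels: 7,920 normal (0) and 80 fraud (1)
train_labels = [0] * 7920 + [1] * 80
weights = balanced_class_weights(train_labels)

print(f"Normal: {weights[0]:.4f}")
print(f"Fraud:  {weights[1]:.4f}")
```

Under these assumed counts, each fraud sample weighs about 99 times as much as a normal one, so both classes contribute the same total weight to the training loss.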

Method 3: SMOTE oversampling

from imblearn.over_sampling import SMOTE

# Apply SMOTE to the training set only; the test set keeps its original distribution
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

print(f"Before SMOTE: normal={(y_train == 0).sum()}, fraud={(y_train == 1).sum()}")
print(f"After SMOTE: normal={(y_train_smote == 0).sum()}, fraud={(y_train_smote == 1).sum()}")

# Train on the SMOTE-resampled data
automl_smote = AutoML()
automl_smote.fit(
    X_train_smote, y_train_smote,
    task="classification",
    time_budget=60,
    metric="f1",
    verbose=0
)

y_pred_smote = automl_smote.predict(X_test)
print("\nFLAML (SMOTE):")
print(classification_report(y_test, y_pred_smote, target_names=['Normal', 'Fraud'], zero_division=0))
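
The core idea behind SMOTE is to create a synthetic minority sample on the line segment between a real minority sample and one of its minority-class neighbors. The following is a deliberately simplified sketch of that interpolation step only: the function name and the two example points are hypothetical, and the nearest-neighbor search that real SMOTE performs is skipped.

```python
import random


def smote_like_sample(x, neighbor, rng):
    """Interpolate a synthetic point between x and one of its neighbors."""
    lam = rng.random()  # random position along the segment, in [0, 1)
    return [xi + lam * (ni - xi) for xi, ni in zip(x, neighbor)]


rng = random.Random(42)

# Two hypothetical fraud samples with two features each
fraud_a = [1.0, 2.0]
fraud_b = [3.0, 6.0]

synthetic = smote_like_sample(fraud_a, fraud_b, rng)
print(synthetic)
```

Because the synthetic points are derived from existing fraud samples, they add no genuinely new information, which is one reason to evaluate only on untouched, non-resampled test data.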

Step 7: Overall Comparison

from sklearn.metrics import f1_score, recall_score, precision_score, roc_auc_score

# Hard (0/1) predictions for each approach
models = {
    'Baseline': y_pred_dummy,
    'FLAML basic': y_pred_basic,
    'FLAML (F1)': y_pred_f1,
    'FLAML (weighted)': y_pred_weighted,
    'FLAML (SMOTE)': y_pred_smote
}

# Fitted estimators, needed because AUC is computed from predicted probabilities
estimators = {
    'Baseline': dummy,
    'FLAML basic': automl_basic,
    'FLAML (F1)': automl_f1,
    'FLAML (weighted)': automl_weighted,
    'FLAML (SMOTE)': automl_smote
}

print("Performance by model:")
print("-" * 70)
print(f"{'Model':<20} {'F1':<10} {'Recall':<10} {'Precision':<10} {'AUC':<10}")
print("-" * 70)

for name, y_pred in models.items():
    f1 = f1_score(y_test, y_pred, pos_label=1, zero_division=0)
    recall = recall_score(y_test, y_pred, pos_label=1, zero_division=0)
    precision = precision_score(y_test, y_pred, pos_label=1, zero_division=0)

    # The baseline's constant probabilities give an AUC of 0.5
    auc = roc_auc_score(y_test, estimators[name].predict_proba(X_test)[:, 1])

    print(f"{name:<20} {f1:<10.4f} {recall:<10.4f} {precision:<10.4f} {auc:<10.4f}")

Step 8: Threshold Tuning

# Predicted fraud probabilities from the weighted model (treated here as the best model)
y_prob = automl_weighted.predict_proba(X_test)[:, 1]

# Try a range of thresholds
# Note: choosing a threshold on the test set gives optimistic estimates;
# in practice, tune it on a separate validation set.
thresholds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]

print("\nPerformance by threshold:")
print("-" * 60)
print(f"{'Threshold':<10} {'F1':<10} {'Recall':<10} {'Precision':<10}")
print("-" * 60)

for threshold in thresholds:
    y_pred_th = (y_prob >= threshold).astype(int)
    f1 = f1_score(y_test, y_pred_th, pos_label=1, zero_division=0)
    recall = recall_score(y_test, y_pred_th, pos_label=1, zero_division=0)
    precision = precision_score(y_test, y_pred_th, pos_label=1, zero_division=0)
    print(f"{threshold:<10} {f1:<10.4f} {recall:<10.4f} {precision:<10.4f}")
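
One way to turn the table above into a decision is to pick the threshold with the best F1 on held-out validation data and only then report test performance. The sketch below shows that selection in plain Python; the function name and the small validation labels and probabilities are made up for illustration and are not outputs of the models in this post.

```python
def best_f1_threshold(y_true, y_prob, thresholds):
    """Return (threshold, f1) for the candidate threshold with the highest F1."""
    best_threshold, best_f1 = None, -1.0
    for t in thresholds:
        tp = sum(1 for y, p in zip(y_true, y_prob) if p >= t and y == 1)
        fp = sum(1 for y, p in zip(y_true, y_prob) if p >= t and y == 0)
        fn = sum(1 for y, p in zip(y_true, y_prob) if p < t and y == 1)
        # F1 written directly in terms of counts: 2TP / (2TP + FP + FN)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:
            best_threshold, best_f1 = t, f1
    return best_threshold, best_f1


# Hypothetical validation labels and predicted fraud probabilities
y_val = [0, 0, 0, 0, 1, 0, 1, 1]
p_val = [0.05, 0.10, 0.20, 0.35, 0.30, 0.60, 0.70, 0.90]

threshold, f1 = best_f1_threshold(y_val, p_val, [0.1, 0.3, 0.5, 0.7])
print(threshold, round(f1, 4))
```

If missing fraud is costlier than a false alarm, the same loop can instead select the highest threshold that still meets a minimum Recall target.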

Step 9: Cost-Based Decision Making

from sklearn.metrics import confusion_matrix

def calculate_cost(y_true, y_pred, cost_fn=100, cost_fp=1):
    """
    Compute the total misclassification cost.
    cost_fn: cost of a False Negative (a missed fraud)
    cost_fp: cost of a False Positive (a normal transaction flagged as fraud)
    """
    # labels=[0, 1] guarantees a 2x2 matrix even if one class is never predicted
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

    total_cost = fn * cost_fn + fp * cost_fp
    return total_cost, {'tn': tn, 'fp': fp, 'fn': fn, 'tp': tp}

# Compare costs (assuming a missed fraud costs 100 times more than a false alarm)
print("\nCost analysis (FN cost=100, FP cost=1):")
print("-" * 50)

for name, y_pred in models.items():
    cost, metrics = calculate_cost(y_test, y_pred, cost_fn=100, cost_fp=1)
    print(f"{name:<20}: total cost={cost:,}, FN={metrics['fn']}, FP={metrics['fp']}")
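
The cost assumption also implies a threshold. If p is the probability of fraud, letting a transaction through has expected cost p * cost_fn, while flagging it has expected cost (1 - p) * cost_fp, so flagging is cheaper when p > cost_fp / (cost_fp + cost_fn). The sketch below works this out for the 100:1 ratio; it assumes the predicted probabilities are well calibrated, and the function names and sample data are hypothetical.

```python
def cost_optimal_threshold(cost_fn, cost_fp):
    """Threshold above which flagging a transaction has lower expected cost."""
    return cost_fp / (cost_fp + cost_fn)


def total_cost(y_true, y_prob, threshold, cost_fn=100, cost_fp=1):
    """Total cost of missed frauds and false alarms at a given threshold."""
    fn = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p < threshold)
    fp = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p >= threshold)
    return fn * cost_fn + fp * cost_fp


threshold = cost_optimal_threshold(cost_fn=100, cost_fp=1)

# Hypothetical labels and calibrated fraud probabilities
y_true = [0, 0, 0, 1, 0, 1]
y_prob = [0.001, 0.02, 0.30, 0.04, 0.005, 0.80]

cost_default = total_cost(y_true, y_prob, threshold=0.5)
cost_tuned = total_cost(y_true, y_prob, threshold=threshold)
print(f"threshold={threshold:.4f}, cost at 0.5={cost_default}, cost at tuned={cost_tuned}")
```

With a 100:1 cost ratio the break-even probability is roughly 1%, far below the default 0.5, which is why the default cutoff tends to be expensive under this kind of cost structure.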

Step 10: Saving the Final Model

import pickle

# Save the chosen model (the weighted model is used here)
best_model = automl_weighted

with open('fraud_detection_model.pkl', 'wb') as f:
    pickle.dump(best_model, f)

print("Model saved: fraud_detection_model.pkl")

# Load the model and predict
with open('fraud_detection_model.pkl', 'rb') as f:
    loaded_model = pickle.load(f)

# Predict a new transaction (the first test row stands in for one)
new_transaction = X_test.iloc[0:1]
prediction = loaded_model.predict(new_transaction)
probability = loaded_model.predict_proba(new_transaction)[0, 1]

print("\nNew transaction prediction:")
print(f"  Predicted label: {'Fraud' if prediction[0] == 1 else 'Normal'}")
print(f"  Fraud probability: {probability:.4f}")

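Summary

Fraud detection is an extremely imbalanced classification problem.
Accuracy is misleading here; use F1, Recall, and AUC instead.
Sample weights and SMOTE are two ways to handle the imbalance.
Threshold tuning balances Recall against Precision.
Cost analysis selects the best model from a business perspective.
In production, the cost of a False Negative is usually very high.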

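Next Post

In the next post, we will cover Classification Project - Customer Churn Prediction, using FLAML to tackle the customer churn prediction problem that matters in marketing.

FLAML AutoML Master Series #028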
