019 Classification Evaluation Metrics - F1 Score, AUC

Keywords: F1, AUC, ROC

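Overview

The previous post covered Precision and Recall. This post looks at F1 Score, which combines the two into a single number, and ROC AUC, which measures how well a classifier separates the classes across all thresholds.

Setup

Python version: 3.11 recommended
Required packages: flaml[automl], scikit-learn, matplotlib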

pip install "flaml[automl]" scikit-learn matplotlib

F1 Score

Definition

F1 Score is the harmonic mean of Precision and Recall.

F1 = 2 × (Precision × Recall) / (Precision + Recall)

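Why the Harmonic Mean?

The harmonic mean is pulled toward the smaller of the two values, so one very low value drags the result down even when the other is high.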

# Arithmetic mean vs. harmonic mean
precision = 0.9
recall = 0.1

arithmetic_mean = (precision + recall) / 2
harmonic_mean = 2 * precision * recall / (precision + recall)

print(f"Precision: {precision}, Recall: {recall}")
print(f"Arithmetic mean: {arithmetic_mean:.4f}")
print(f"Harmonic mean (F1): {harmonic_mean:.4f}")

Output

Precision: 0.9, Recall: 0.1
Arithmetic mean: 0.5000
Harmonic mean (F1): 0.1800

F1 Score is high only when both Precision and Recall are high.

Computing F1 Score

from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")

# Manual calculation from Precision and Recall
f1_manual = 2 * precision * recall / (precision + recall)
print(f"F1 (manual): {f1_manual:.4f}")
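The same score can also be derived directly from confusion-matrix counts, since substituting Precision = TP / (TP + FP) and Recall = TP / (TP + FN) into the formula gives F1 = 2TP / (2TP + FP + FN). The following is a minimal sketch in plain Python that reuses the labels above; the helper name `f1_from_counts` is hypothetical, and returning 0.0 when the denominator is zero is an assumed convention.

```python
def f1_from_counts(y_true, y_pred, positive=1):
    """Compute F1 for the positive class from raw TP/FP/FN counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    # Assumed convention: F1 is 0.0 when there are no positives at all
    f1 = 2 * tp / denom if denom else 0.0
    return tp, fp, fn, f1


y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

tp, fp, fn, f1 = f1_from_counts(y_true, y_pred)
print(f"TP={tp}, FP={fp}, FN={fn}, F1={f1:.4f}")
```

With these labels there are 4 true positives, 1 false positive, and 1 false negative, so Precision, Recall, and F1 all come out to 0.8.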

F1 in Multiclass Classification

from sklearn.metrics import f1_score

y_true = [0, 1, 2, 0, 1, 2, 0, 1, 2]
y_pred = [0, 1, 1, 0, 2, 2, 0, 1, 2]

# Micro F1: computed from the total TP, FP, and FN over all classes
micro_f1 = f1_score(y_true, y_pred, average='micro')

# Macro F1: unweighted mean of the per-class F1 scores
macro_f1 = f1_score(y_true, y_pred, average='macro')

# Weighted F1: mean of the per-class F1 scores weighted by class sample count
weighted_f1 = f1_score(y_true, y_pred, average='weighted')

print(f"Micro F1: {micro_f1:.4f}")
print(f"Macro F1: {macro_f1:.4f}")
print(f"Weighted F1: {weighted_f1:.4f}")

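Type	Description	When to use
Micro F1	Computed from the total TP, FP, and FN	Overall performance; dominated by the majority class under imbalance
Macro F1	Unweighted mean of per-class F1	Treats every class equally
Weighted F1	Mean of per-class F1 weighted by sample count	Accounts for class size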
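In the example above the three classes are the same size, so the three averages happen to coincide. To make the difference visible, here is a rough sketch in plain Python on hypothetical imbalanced labels. The helper names `per_class_counts` and `f1` are illustrative, not part of scikit-learn, and treating an undefined F1 as 0.0 is an assumed convention.

```python
from collections import Counter


def per_class_counts(y_true, y_pred):
    """Return {label: (tp, fp, fn)} for every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    counts = {}
    for c in labels:
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        counts[c] = (tp, fp, fn)
    return counts


def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0  # assumed: undefined F1 counts as 0


# Hypothetical imbalanced labels: 6 samples of class 0, 3 of class 1, 1 of class 2
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 0, 0]

counts = per_class_counts(y_true, y_pred)
support = Counter(y_true)
class_f1 = {c: f1(*counts[c]) for c in counts}

# Macro: every class contributes equally
macro_f1 = sum(class_f1.values()) / len(class_f1)
# Weighted: each class contributes in proportion to its sample count
weighted_f1 = sum(class_f1[c] * support[c] for c in class_f1) / len(y_true)
# Micro: pool TP, FP, FN over all classes, then compute a single F1
micro_f1 = f1(*(sum(col) for col in zip(*counts.values())))
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(f"Micro F1: {micro_f1:.4f} (accuracy: {accuracy:.4f})")
print(f"Macro F1: {macro_f1:.4f}")
print(f"Weighted F1: {weighted_f1:.4f}")
```

Here Micro F1 is 0.8000 (identical to accuracy, as it always is for single-label multiclass data), Weighted F1 is 0.7543, and Macro F1 drops to 0.5524 because the single class-2 sample that was missed counts as much as the whole majority class.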

ROC and AUC

ROC Curve (Receiver Operating Characteristic)

The ROC curve shows how TPR and FPR change as the decision threshold varies.

TPR (True Positive Rate) = Recall = TP / (TP + FN)
FPR (False Positive Rate) = FP / (FP + TN)
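To see how a single threshold maps to one point on the ROC curve, the two rates can be computed by hand. This is a small sketch in plain Python; the labels and scores are made up for illustration, and `tpr_fpr` is a hypothetical helper name.

```python
def tpr_fpr(y_true, y_score, threshold):
    """Return (TPR, FPR) when scores >= threshold are predicted positive."""
    y_pred = [1 if s >= threshold else 0 for s in y_score]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp / (tp + fn), fp / (fp + tn)


# Hypothetical labels and predicted probabilities
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_score = [0.1, 0.3, 0.4, 0.7, 0.35, 0.6, 0.8, 0.9]

for threshold in (0.2, 0.5, 0.75):
    tpr, fpr = tpr_fpr(y_true, y_score, threshold)
    print(f"threshold={threshold:.2f}  TPR={tpr:.2f}  FPR={fpr:.2f}")
```

Raising the threshold from 0.2 to 0.5 to 0.75 moves the point from (FPR 0.75, TPR 1.00) to (0.25, 0.75) to (0.00, 0.50): fewer false alarms, at the cost of missed positives. Sweeping the threshold over every score traces out the full curve.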

Plotting the ROC Curve

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score
from flaml import AutoML
import matplotlib.pyplot as plt

# Prepare the data
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with FLAML
automl = AutoML()
automl.fit(X_train, y_train, task="classification", time_budget=30, verbose=0)

# Predicted probability of the positive class (column 1)
y_prob = automl.predict_proba(X_test)[:, 1]

# Compute the ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_prob)

# Plot
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, 'b-', linewidth=2, label='ROC Curve')
plt.plot([0, 1], [0, 1], 'k--', label='Random Classifier')
plt.xlabel('False Positive Rate (FPR)')
plt.ylabel('True Positive Rate (TPR)')
plt.title('ROC Curve')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

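Interpreting the ROC Curve

        1.0 ┤      ●━━━━━━━━━
            │     ╱
        TPR │    ╱  ← good model (bows toward the upper left)
            │   ╱
            │  ╱
        0.5 ┤ ╱ - - - - ← random classifier (diagonal)
            │╱
        0.0 ┼━━━━━━━━━━━━
            0.0       0.5       1.0
                     FPR

The closer the curve bows toward the upper left, the better the model.
The diagonal corresponds to a random classifier (a coin flip).
A perfect model passes through the point (0, 1).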

AUC (Area Under the Curve)

AUC is the area under the ROC curve.

# Compute AUC
auc = roc_auc_score(y_test, y_prob)
print(f"ROC AUC: {auc:.4f}")
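Besides being an area, AUC has a ranking interpretation: it is the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below checks this in plain Python on made-up scores; `pairwise_auc` is a hypothetical helper, and on the same inputs its result should agree with `roc_auc_score`.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    # A correctly ordered pair scores 1, a tie scores 0.5, a wrong order scores 0
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted probabilities
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_score = [0.1, 0.3, 0.4, 0.7, 0.35, 0.6, 0.8, 0.9]

auc_value = pairwise_auc(y_true, y_score)
print(f"Pairwise AUC: {auc_value:.4f}")
```

Of the 16 positive-negative pairs, 13 are ordered correctly, giving an AUC of 0.8125. This view also explains why AUC depends only on the ordering of the scores and not on their absolute values.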

AUC Interpretation

AUC value	Interpretation (rule of thumb)
1.0	Perfect classifier
0.9 ~ 1.0	Excellent
0.8 ~ 0.9	Good
0.7 ~ 0.8	Fair
0.6 ~ 0.7	Average
0.5	Random classifier
< 0.5	Worse than random

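Advantages of AUC

Threshold-independent: it does not depend on any particular threshold
Useful on imbalanced data: TPR and FPR are each computed within a single class, so the class ratio does not directly affect it
Convenient for comparing models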
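Threshold independence can be checked empirically. In the sketch below (synthetic data generated with NumPy, purely an assumption for illustration), a strictly increasing transform changes every score but leaves the ranking, and therefore the AUC, unchanged, while accuracy moves as soon as the cut-off moves.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)
# Hypothetical scores in [0, 1): noisy, but higher on average for positives
y_score = 0.3 * y_true + 0.7 * rng.random(200)

auc_raw = roc_auc_score(y_true, y_score)
# Cubing is strictly increasing on non-negative values, so the ranking is preserved
auc_cubed = roc_auc_score(y_true, y_score ** 3)

# Accuracy, by contrast, depends on where the threshold is placed
acc_at_05 = accuracy_score(y_true, (y_score >= 0.5).astype(int))
acc_at_03 = accuracy_score(y_true, (y_score >= 0.3).astype(int))

print(f"AUC (raw scores):   {auc_raw:.4f}")
print(f"AUC (cubed scores): {auc_cubed:.4f}")
print(f"Accuracy @0.5: {acc_at_05:.4f}, @0.3: {acc_at_03:.4f}")
```

This is also why AUC is convenient for comparing models whose scores are on different scales: only the ordering of the samples matters.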

Comparing ROC Curves of Multiple Models

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

models = {
    'FLAML': automl,
    'Random Forest': RandomForestClassifier(random_state=42).fit(X_train, y_train),
    'Logistic Regression': LogisticRegression(max_iter=1000).fit(X_train, y_train)
}

plt.figure(figsize=(10, 8))

for name, model in models.items():
    if hasattr(model, 'predict_proba'):
        y_prob = model.predict_proba(X_test)[:, 1]
        fpr, tpr, _ = roc_curve(y_test, y_prob)
        auc = roc_auc_score(y_test, y_prob)
        plt.plot(fpr, tpr, linewidth=2, label=f'{name} (AUC={auc:.4f})')

plt.plot([0, 1], [0, 1], 'k--', label='Random')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve Comparison')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

Using F1 and AUC in FLAML

Optimizing F1 Score

automl = AutoML()
automl.fit(
    X_train, y_train,
    task="classification",
    time_budget=60,
    metric="f1"  # optimize F1 Score
)

Optimizing ROC AUC

automl = AutoML()
automl.fit(
    X_train, y_train,
    task="classification",
    time_budget=60,
    metric="roc_auc"  # optimize AUC
)

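AUC in Multiclass Classification

# AUC for multiclass problems (X_train, y_train should hold a multiclass dataset here)
automl = AutoML()
automl.fit(
    X_train, y_train,
    task="classification",
    time_budget=60,
    metric="roc_auc_ovr"  # One-vs-Rest AUC
)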
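To make the One-vs-Rest idea concrete outside of FLAML, the sketch below computes it with scikit-learn alone. The iris dataset and the `LogisticRegression` model are stand-ins chosen for illustration (an assumption, not what FLAML would select); any classifier with `predict_proba` would do.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Stand-in model for illustration
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_prob = clf.predict_proba(X_test)  # shape: (n_samples, n_classes)

# One-vs-Rest AUC: pass the full probability matrix, not a single column
ovr_auc = roc_auc_score(y_test, y_prob, multi_class="ovr")

# Manual check: one binary AUC per class ("this class" vs. "all others"),
# then the unweighted mean, which is the default averaging
per_class_auc = [
    roc_auc_score((y_test == k).astype(int), y_prob[:, i])
    for i, k in enumerate(clf.classes_)
]

print(f"OvR AUC: {ovr_auc:.4f}")
print(f"Mean of per-class AUCs: {np.mean(per_class_auc):.4f}")
```

Unlike the binary case, where only the positive-class column `[:, 1]` is passed, the multiclass call needs one probability column per class.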

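Metric Selection Guide

Situation	Recommended metric	Reason
Balanced data	Accuracy	Simple and easy to interpret
Imbalanced data	F1, AUC	Captures performance on the positive class
FP and FN equally important	F1 Score	Balances Precision and Recall
Predicted probabilities matter	AUC	Threshold-independent evaluation
Per-class performance	Macro F1	Treats every class equally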
Practical Example

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, roc_auc_score, classification_report
)
from flaml import AutoML

# Data and model
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

automl = AutoML()
automl.fit(X_train, y_train, task="classification", time_budget=30, verbose=0)

y_pred = automl.predict(X_test)
y_prob = automl.predict_proba(X_test)[:, 1]

# Print all metrics
print("="*50)
print("Classification Model Evaluation Summary")
print("="*50)
print(f"Accuracy:  {accuracy_score(y_test, y_pred):.4f}")
print(f"Precision: {precision_score(y_test, y_pred):.4f}")
print(f"Recall:    {recall_score(y_test, y_pred):.4f}")
print(f"F1 Score:  {f1_score(y_test, y_pred):.4f}")
print(f"ROC AUC:   {roc_auc_score(y_test, y_prob):.4f}")

# Per-class precision, recall, and F1 in a single table
print(classification_report(y_test, y_pred))

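Summary

Metric	Characteristics	When to use
F1 Score	Harmonic mean of Precision and Recall	Imbalanced data, when both matter
ROC AUC	Area under the ROC curve	Threshold-independent evaluation, model comparison

F1 Score measures the balance between Precision and Recall.
The ROC curve shows the TPR-FPR relationship as the threshold varies.
AUC is the area under the ROC curve; it ranges from 0 to 1, where 0.5 corresponds to a random classifier and 1.0 to a perfect one.
On imbalanced data, use F1 or AUC.
In FLAML, set metric="f1" or metric="roc_auc".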

Next Post

The next post covers how to interpret the confusion matrix, including how to visualize and analyze it.

FLAML AutoML Master Series #019
