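022 Multiclass Classification - Predicting Iris Species

Keywords: multiclass classification, iris

Overview

The Iris dataset is the "Hello World" of machine learning datasets. The task is a multiclass classification problem: predict which of three iris species a flower belongs to from the length and width of its sepals and petals.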

Environment

Python version: 3.11 recommended
Required packages: flaml[automl], pandas, scikit-learn, seaborn, matplotlib

pip install "flaml[automl]" pandas scikit-learn seaborn matplotlib

Project Overview

Goal

Predict the iris species from sepal and petal measurements.

Dataset

Samples: 150 (50 per species)
Features: 4
Classes: Setosa, Versicolor, Virginica (3)

Step 1: Load and Explore the Data

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

# Load the data
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = iris.target
df['species_name'] = df['species'].map({0: 'Setosa', 1: 'Versicolor', 2: 'Virginica'})

print("Data shape:", df.shape)
print("\nFirst 5 rows:")
print(df.head())

print("\nClass distribution:")
print(df['species_name'].value_counts())

Output

Data shape: (150, 6)

Class distribution:
Setosa        50
Versicolor    50
Virginica     50

Feature Statistics

print("\nMean of each feature by species:")
print(df.groupby('species_name')[iris.feature_names].mean())

Step 2: Visualize the Data

Pairplot

# sns.pairplot creates its own figure, so a separate plt.figure() call
# would only leave an empty extra window.
g = sns.pairplot(df, hue='species_name', diag_kind='kde',
                 vars=iris.feature_names)
g.fig.suptitle('Iris Dataset Pairplot', y=1.02)
g.savefig('iris_pairplot.png', dpi=100)
plt.show()

Feature Distributions

fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes = axes.ravel()

for i, feature in enumerate(iris.feature_names):
    for species in df['species_name'].unique():
        data = df[df['species_name'] == species][feature]
        axes[i].hist(data, alpha=0.5, label=species, bins=15)
    axes[i].set_xlabel(feature)
    axes[i].set_ylabel('Count')
    axes[i].legend()

plt.suptitle('Feature Distributions by Species')
plt.tight_layout()
plt.show()

Step 3: Prepare the Data

from sklearn.model_selection import train_test_split

# Separate features and target
X = df[iris.feature_names]
y = df['species']

# Split the data; stratify=y keeps the class ratio equal in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

print(f"Training samples: {X_train.shape[0]}")
print(f"Test samples: {X_test.shape[0]}")
print("\nTraining class distribution:")
print(pd.Series(y_train).value_counts().sort_index())
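
To make the effect of stratify=y concrete, here is a minimal pure-Python sketch of a stratified split. It is not scikit-learn's implementation: the helper name stratified_split is hypothetical, and it simply shuffles the indices of each class separately and cuts every class at the same ratio.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with the same class ratio in both parts."""
    rng = random.Random(seed)

    # Group sample indices by class label
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Same layout as the Iris target: 50 samples for each of 3 classes
labels = [0] * 50 + [1] * 50 + [2] * 50
train_idx, test_idx = stratified_split(labels)

print("train:", Counter(labels[i] for i in train_idx))
print("test: ", Counter(labels[i] for i in test_idx))
```

With 50 samples per class and test_size=0.2, every class contributes 40 training and 10 test samples, which is why the classification report later shows a support of 10 for each species.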

Step 4: Train with FLAML AutoML

from flaml import AutoML

# Create the AutoML object
automl = AutoML()

# Training settings
settings = {
    "task": "classification",
    "time_budget": 60,  # seconds
    "metric": "accuracy",
    "estimator_list": ["lgbm", "xgboost", "rf", "extra_tree"],
    "seed": 42,
    "verbose": 1
}

print("="*60)
print("FLAML AutoML training (multiclass classification)")
print("="*60)

# Train
automl.fit(X_train, y_train, **settings)

print(f"\nBest model: {automl.best_estimator}")
# With metric="accuracy", FLAML minimizes 1 - accuracy, so convert it back
print(f"Validation accuracy: {1 - automl.best_loss:.4f}")

Step 5: Evaluate

from sklearn.metrics import (
    accuracy_score, classification_report,
    confusion_matrix, f1_score
)

# Predict
y_pred = automl.predict(X_test)

# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {accuracy:.4f}")

# F1 score (multiclass)
f1_macro = f1_score(y_test, y_pred, average='macro')
f1_weighted = f1_score(y_test, y_pred, average='weighted')
print(f"Macro F1: {f1_macro:.4f}")
print(f"Weighted F1: {f1_weighted:.4f}")

# Classification report
print("\nClassification report:")
print(classification_report(y_test, y_pred,
                           target_names=['Setosa', 'Versicolor', 'Virginica']))

Output

Test accuracy: 1.0000
Macro F1: 1.0000
Weighted F1: 1.0000

Classification report:
              precision    recall  f1-score   support

      Setosa       1.00      1.00      1.00        10
  Versicolor       1.00      1.00      1.00        10
   Virginica       1.00      1.00      1.00        10

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30
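
Because every prediction above is correct, Macro F1 and Weighted F1 come out identical, which hides how they differ. The following sketch computes both by hand on a small, hypothetical imbalanced example (these labels are made up and are not the Iris results). The helper names are illustrative only.

```python
from collections import Counter


def per_class_f1(y_true, y_pred, label):
    """F1 of one class, treating it as positive and all others as negative."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_and_weighted_f1(y_true, y_pred):
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)  # number of true samples per class
    scores = {label: per_class_f1(y_true, y_pred, label) for label in labels}
    macro = sum(scores.values()) / len(labels)  # every class counts equally
    weighted = sum(scores[l] * support[l] for l in labels) / len(y_true)
    return macro, weighted


# Hypothetical labels: class 0 is common, classes 1 and 2 are rare
y_true_demo = [0] * 6 + [1] * 2 + [2] * 2
y_pred_demo = [0] * 6 + [1, 2] + [2, 2]

macro, weighted = macro_and_weighted_f1(y_true_demo, y_pred_demo)
print(f"Macro F1:    {macro:.4f}")     # 0.8222
print(f"Weighted F1: {weighted:.4f}")  # 0.8933
```

The single mistake falls on a rare class, so it pulls Macro F1 down more than Weighted F1. On a balanced dataset such as Iris the two averages stay close to each other.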

Step 6: Visualize the Confusion Matrix

# Confusion matrix (rows: actual class, columns: predicted class)
cm = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Setosa', 'Versicolor', 'Virginica'],
            yticklabels=['Setosa', 'Versicolor', 'Virginica'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Iris Classification - Confusion Matrix')
plt.tight_layout()
plt.savefig('iris_confusion_matrix.png', dpi=100)
plt.show()
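
For readers who want to see what the heatmap is built from, here is a minimal sketch that counts a k×k confusion matrix by hand and lists the off-diagonal cells, meaning the class pairs that were confused. The labels are hypothetical and the helper name confusion_counts is not part of any library.

```python
def confusion_counts(y_true, y_pred, n_classes):
    """Row i, column j counts samples of true class i predicted as class j."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


names = ['Setosa', 'Versicolor', 'Virginica']

# Hypothetical labels with two mistakes between classes 1 and 2
y_true_demo = [0, 0, 1, 1, 1, 2, 2, 2]
y_pred_demo = [0, 0, 1, 2, 1, 2, 1, 2]

cm_demo = confusion_counts(y_true_demo, y_pred_demo, n_classes=3)
for name, row in zip(names, cm_demo):
    print(f"{name:<12}{row}")

# Off-diagonal cells are the misclassifications
confused = [
    (names[i], names[j], cm_demo[i][j])
    for i in range(3)
    for j in range(3)
    if i != j and cm_demo[i][j] > 0
]
for actual, predicted, count in confused:
    print(f"{actual} predicted as {predicted}: {count}")
```

In this made-up example only Versicolor and Virginica are mixed up, which is also the pair that overlaps most in the pairplot from Step 2.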

Step 7: Analyze Predicted Probabilities

# Predicted probabilities
y_prob = automl.predict_proba(X_test)

print("Predicted probabilities (first 10 rows):")
print(f"Shape: {y_prob.shape}")

prob_df = pd.DataFrame(y_prob, columns=['Setosa', 'Versicolor', 'Virginica'])
prob_df['actual'] = y_test.values
prob_df['predicted'] = y_pred
print(prob_df.head(10))

Prediction Confidence Analysis

# Maximum probability per sample (confidence)
max_probs = y_prob.max(axis=1)

plt.figure(figsize=(10, 5))
plt.hist(max_probs, bins=20, edgecolor='black')
plt.xlabel('Prediction Confidence (Max Probability)')
plt.ylabel('Count')
plt.title('Distribution of Prediction Confidence')
# With 3 classes the maximum probability can never fall below 1/3
plt.axvline(x=1/3, color='r', linestyle='--', label='Chance level (1/3)')
plt.legend()
plt.tight_layout()
plt.show()

print(f"Mean confidence: {max_probs.mean():.4f}")
print(f"Minimum confidence: {max_probs.min():.4f}")
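
Beyond the mean and minimum, it is often useful to look at the individual predictions the model is unsure about. The sketch below flags rows whose maximum probability falls under a cutoff. The probabilities are hypothetical, and the 0.6 threshold is an arbitrary assumption chosen for illustration, not a value recommended by FLAML.

```python
def low_confidence(probabilities, threshold=0.6):
    """Return (row index, predicted class, confidence) for uncertain rows."""
    flagged = []
    for i, row in enumerate(probabilities):
        confidence = max(row)
        if confidence < threshold:
            flagged.append((i, row.index(confidence), confidence))
    return flagged


# Hypothetical predict_proba output for 4 samples and 3 classes
probs_demo = [
    [0.98, 0.01, 0.01],  # confident
    [0.05, 0.55, 0.40],  # torn between classes 1 and 2
    [0.02, 0.18, 0.80],  # confident
    [0.34, 0.33, 0.33],  # close to the 1/3 chance level
]

for index, predicted, confidence in low_confidence(probs_demo):
    print(f"row {index}: class {predicted}, confidence {confidence:.2f}")
```

With a real model the same idea applies to a NumPy array by passing y_prob.tolist().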

Step 8: Feature Importance

if automl.best_estimator in ['lgbm', 'xgboost', 'rf', 'extra_tree']:
    # automl.model is FLAML's wrapper; .estimator is the underlying fitted model
    model = automl.model.estimator
    importance = np.asarray(model.feature_importances_)

    # Plot, most important feature first
    plt.figure(figsize=(10, 6))
    sorted_idx = np.argsort(importance)[::-1]

    plt.bar(range(len(importance)), importance[sorted_idx])
    plt.xticks(range(len(importance)),
               np.array(iris.feature_names)[sorted_idx], rotation=45)
    plt.xlabel('Feature')
    plt.ylabel('Importance')
    plt.title('Feature Importance for Iris Classification')
    plt.tight_layout()
    plt.savefig('iris_feature_importance.png', dpi=100)
    plt.show()

    print("\nFeature importance:")
    for i in sorted_idx:
        print(f"  {iris.feature_names[i]}: {importance[i]:.4f}")
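
One caveat when reading these numbers: the raw scale depends on which estimator won. For example, LightGBM's scikit-learn wrapper reports split counts by default, while scikit-learn's random forest reports fractions that sum to 1. A simple way to compare runs is to convert the values to shares, as in this sketch. The raw values below are hypothetical, and normalize_importance is an illustrative helper, not a FLAML function.

```python
def normalize_importance(names, raw_importance):
    """Convert raw importances to shares of the total, largest first."""
    total = sum(raw_importance)
    if total == 0:
        return [(name, 0.0) for name in names]
    shares = [(name, value / total) for name, value in zip(names, raw_importance)]
    return sorted(shares, key=lambda item: item[1], reverse=True)


feature_names = [
    'sepal length (cm)', 'sepal width (cm)',
    'petal length (cm)', 'petal width (cm)',
]
raw_demo = [12, 8, 45, 35]  # hypothetical split counts, not real output

for name, share in normalize_importance(feature_names, raw_demo):
    print(f"{name:<20}{share:.2f}")
```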

Step 9: Predict New Data

# New iris measurements
new_flowers = pd.DataFrame({
    'sepal length (cm)': [5.0, 6.5, 7.0],
    'sepal width (cm)': [3.5, 3.0, 3.2],
    'petal length (cm)': [1.4, 5.0, 6.0],
    'petal width (cm)': [0.2, 1.8, 2.0]
})

# Predict classes and class probabilities
predictions = automl.predict(new_flowers)
probabilities = automl.predict_proba(new_flowers)

species_names = ['Setosa', 'Versicolor', 'Virginica']

print("Predictions for new flowers:")
print("-" * 60)
for i in range(len(new_flowers)):
    pred_class = species_names[predictions[i]]
    probs = probabilities[i]
    print(f"\nSample {i+1}:")
    print(f"  Prediction: {pred_class}")
    print(f"  Probabilities: Setosa={probs[0]:.3f}, Versicolor={probs[1]:.3f}, Virginica={probs[2]:.3f}")

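Binary vs. Multiclass Classification

Item	Binary	Multiclass
Number of classes	2	3 or more
predict_proba shape	(n, 2)	(n, k)
F1 Score	Single score (positive class)	macro, micro, weighted averages
ROC AUC	Computed directly	OvR, OvO
Confusion matrix	2×2	k×k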
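
The OvR (one-vs-rest) entry in the table can be illustrated with a short sketch: treat each class in turn as the positive class, compute an ordinary binary AUC from that class's probability column, and average the results. This is a pure-Python illustration of the idea using the pairwise-ranking definition of AUC, with hypothetical probabilities. It assumes every class has at least one positive and one negative sample, and it is not how scikit-learn implements the metric internally.

```python
def binary_auc(scores, is_positive):
    """Probability that a random positive outranks a random negative (ties = 0.5)."""
    pos = [s for s, flag in zip(scores, is_positive) if flag]
    neg = [s for s, flag in zip(scores, is_positive) if not flag]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def ovr_macro_auc(y_true, probabilities):
    """Average of the k one-vs-rest AUCs, one per probability column."""
    n_classes = len(probabilities[0])
    aucs = []
    for k in range(n_classes):
        scores = [row[k] for row in probabilities]
        is_positive = [t == k for t in y_true]
        aucs.append(binary_auc(scores, is_positive))
    return sum(aucs) / n_classes, aucs


# Hypothetical (n, k) probabilities for 6 samples and 3 classes
y_true_demo = [0, 0, 1, 1, 2, 2]
probs_demo = [
    [0.8, 0.1, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.3, 0.6],  # true class 1, but class 2 scores highest
    [0.1, 0.3, 0.6],
    [0.2, 0.2, 0.6],
]

macro_auc, per_class = ovr_macro_auc(y_true_demo, probs_demo)
print("Per-class AUC:", per_class)        # [1.0, 0.875, 0.875]
print(f"Macro OvR AUC: {macro_auc:.4f}")  # 0.9167
```

In the binary case only one of these AUCs is needed, because the two probability columns carry the same information.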

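Summary

We solved a three-class classification problem with the Iris dataset.
FLAML reached 100% accuracy on the 30-sample test set.
predict_proba() returns the probability of each class.
For multiclass problems, report Macro F1 and Weighted F1.
The confusion matrix shows which classes get confused with each other.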

Next Post

The next post covers evaluation metrics for multiclass classification: the differences between Micro, Macro, and Weighted averaging and when to use each.

FLAML AutoML Master Series #022
