035 Classification Part: Complete Review

Keywords: classification, review, checklist

Overview

Part 2 covered classification problems with FLAML from several angles. This post summarizes what we have learned so far and provides a checklist you can use right away on real projects.

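What Part 2 Covered

Basic Concepts (Posts 16-18)

No.	Topic	Key Points
16	What Is a Classification Problem?	Binary/multiclass classification, probability prediction
17	Classification Metrics	Accuracy, Precision, Recall, F1, AUC
18	Titanic Project	A practical classification pipeline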

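Evaluation and Interpretation (Posts 19-23)

No.	Topic	Key Points
19	Confusion Matrix	Analyzing TP, FP, TN, FN
20	Precision vs Recall	Trade-off, F1 Score
21	ROC Curve and AUC	Visualization, model comparison
22	Multiclass Classification	Iris species prediction
23	Multiclass Metrics	Micro, Macro, Weighted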

Imbalanced Data (Posts 24-25)

No.	Topic	Key Points
24	Strategies for Imbalanced Data	SMOTE, resampling
25	sample_weight	Applying class weights

Advanced Settings (Posts 26-27)

No.	Topic	Key Points
26	n_splits	Cross-validation settings
27	early_stop	Early stopping for efficiency

Hands-on Projects (Posts 28-30)

No.	Topic	Key Points
28	Fraud Detection	Handling extreme imbalance
29	Customer Churn	Business value analysis
30	Spam Classification	Text classification

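Model Interpretation and Integration (Posts 31-34)

No.	Topic	Key Points
31	Feature Importance	Feature Importance
32	SHAP	Model interpretation, XAI
33	sklearn Pipelines	Integrating preprocessing
34	Categorical Handling	Encoding strategies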

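Classification Problem-Solving Checklist

1. Understanding the Data

# Checklist
checks_data = [
    "Check data size (number of samples and features)",
    "Check the target distribution (balanced or imbalanced)",
    "Check for missing values and decide how to handle them",
    "Check for outliers",
    "Check feature types (numeric or categorical)",
    "Analyze correlations between features"
]

print("✅ Data understanding checklist:")
for check in checks_data:
    print(f"  □ {check}")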
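The "check the target distribution" item can be made concrete with a few lines of standard-library Python. This is a minimal sketch, not part of FLAML: the helper names `class_distribution` and `imbalance_ratio` and the example labels are hypothetical and chosen only for illustration.

```python
from collections import Counter

def class_distribution(labels):
    """Return {class: (count, share)} ordered from most to least frequent."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: (n, n / total) for cls, n in counts.most_common()}

def imbalance_ratio(labels):
    """Majority count divided by minority count (1.0 means perfectly balanced)."""
    counts = Counter(labels).values()
    return max(counts) / min(counts)

# Hypothetical target column: 90 negatives, 10 positives
y_example = [0] * 90 + [1] * 10
distribution = class_distribution(y_example)
ratio = imbalance_ratio(y_example)
print(distribution)
print(f"imbalance ratio: {ratio:.1f}")
```

A ratio well above 1 is a signal to revisit the imbalance items in the preprocessing and modeling checklists.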

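2. Preprocessing

checks_preprocessing = [
    "Handle missing values (drop or impute)",
    "Choose an encoding scheme for categorical variables",
    "Check whether scaling is needed",
    "Check whether feature engineering is needed",
    "Decide how to handle class imbalance",
    "Split into train/validation/test sets (with stratify)"
]

print("\n✅ Preprocessing checklist:")
for check in checks_preprocessing:
    print(f"  □ {check}")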
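To show what a stratified split does, here is a small standard-library sketch that splits indices class by class so each class keeps roughly the same share in both parts. It is an illustration of the idea only, not how `train_test_split(..., stratify=y)` is implemented in scikit-learn; `stratified_split` and the example labels are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx), sampling test rows separately per class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)  # per-class test count
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical labels: 80 negatives followed by 20 positives
labels_example = [0] * 80 + [1] * 20
train_idx, test_idx = stratified_split(labels_example)
n_test_positives = sum(labels_example[i] for i in test_idx)
print(len(train_idx), len(test_idx), n_test_positives)
```

With a plain random split, a rare class can end up with very few rows in the test set, which makes metrics such as recall unstable.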

3. Modeling

checks_modeling = [
    "Choose an evaluation metric that fits the problem",
    "Set time_budget",
    "Set n_splits (consider the data size)",
    "Decide whether to apply sample_weight",
    "Decide whether to specify estimator_list",
    "Set seed (reproducibility)"
]

print("\n✅ Modeling checklist:")
for check in checks_modeling:
    print(f"  □ {check}")

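4. Evaluation

checks_evaluation = [
    "Run the final evaluation on the test data",
    "Analyze the confusion matrix",
    "Review the classification report",
    "Check the ROC curve and AUC",
    "Check whether the decision threshold needs adjusting",
    "Check for overfitting"
]

print("\n✅ Evaluation checklist:")
for check in checks_evaluation:
    print(f"  □ {check}")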
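The threshold item can be sketched as follows: given true labels and predicted positive-class probabilities, compute precision and recall at several cutoffs and keep the one with the best F1. This is a minimal standard-library sketch; the function names and toy data are hypothetical, and in practice the threshold should be chosen on validation data, not on the final test set.

```python
def precision_recall_at(y_true, y_prob, threshold):
    """Precision and recall when predicting 1 for prob >= threshold."""
    tp = fp = fn = 0
    for truth, prob in zip(y_true, y_prob):
        pred = 1 if prob >= threshold else 0
        if pred == 1 and truth == 1:
            tp += 1
        elif pred == 1 and truth == 0:
            fp += 1
        elif pred == 0 and truth == 1:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def best_f1_threshold(y_true, y_prob, thresholds):
    """Return (threshold, f1) for the candidate with the highest F1."""
    best_threshold, best_f1 = None, -1.0
    for t in thresholds:
        p, r = precision_recall_at(y_true, y_prob, t)
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        if f1 > best_f1:
            best_threshold, best_f1 = t, f1
    return best_threshold, best_f1

# Toy validation data (hypothetical)
y_val = [0, 0, 1, 1]
p_val = [0.1, 0.4, 0.35, 0.8]
chosen_threshold, chosen_f1 = best_f1_threshold(y_val, p_val, [0.3, 0.5, 0.7])
print(chosen_threshold, round(chosen_f1, 3))
```

Lowering the threshold trades precision for recall, which is usually the right direction when false negatives are the costlier error.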

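5. Interpretation and Deployment

checks_deploy = [
    "Analyze feature importance",
    "Run SHAP analysis (if needed)",
    "Validate from a business perspective",
    "Save the pipeline",
    "Set up a prediction API",
    "Plan for monitoring"
]

print("\n✅ Interpretation and deployment checklist:")
for check in checks_deploy:
    print(f"  □ {check}")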
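For the "save the pipeline" item, one common option is Python's built-in `pickle`. The sketch below uses a plain dictionary as a stand-in so it runs without any trained model; the helpers `save_model` and `load_model` and the file name are hypothetical, and the assumption is that the fitted object you pass in practice is picklable.

```python
import os
import pickle
import tempfile

def save_model(model, path):
    """Serialize a fitted model (or whole pipeline) to disk."""
    with open(path, "wb") as f:
        pickle.dump(model, f, pickle.HIGHEST_PROTOCOL)

def load_model(path):
    """Load a previously saved model. Only unpickle files you trust."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Stand-in object for illustration; in practice pass the fitted model instead
stand_in_model = {"best_estimator": "lgbm", "threshold": 0.5}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.pkl")
    save_model(stand_in_model, model_path)
    restored = load_model(model_path)

print(restored == stand_in_model)
```

Loading generally requires the same library versions that were installed when the model was saved, so it is worth recording them alongside the file.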

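Recommended Settings by Scenario

Balanced Data + General Classification

from flaml import AutoML

# Basic settings (X_train and y_train are the training split prepared earlier)
automl = AutoML()
automl.fit(
    X_train, y_train,
    task="classification",
    time_budget=120,
    metric="accuracy",  # or "roc_auc"
    n_splits=5,
    seed=42
)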

Imbalanced Data

from flaml import AutoML
from sklearn.utils.class_weight import compute_sample_weight

# Compute per-sample weights so that rare classes count more
weights = compute_sample_weight('balanced', y_train)

automl = AutoML()
automl.fit(
    X_train, y_train,
    task="classification",
    time_budget=120,
    metric="f1",  # or "roc_auc"; use "macro_f1" for multiclass targets
    sample_weight=weights,
    n_splits=5,
    seed=42
)
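To see what the `'balanced'` option produces, the sketch below reproduces the formula scikit-learn documents for it, `n_samples / (n_classes * class_count)`, in plain Python. The function name `balanced_sample_weights` and the example labels are hypothetical; in real code keep using `compute_sample_weight`.

```python
from collections import Counter

def balanced_sample_weights(labels):
    """Weight each sample by n_samples / (n_classes * count of its class)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    class_weight = {cls: n_samples / (n_classes * n) for cls, n in counts.items()}
    return [class_weight[label] for label in labels]

# Hypothetical labels: 90 negatives followed by 10 positives
y_example = [0] * 90 + [1] * 10
example_weights = balanced_sample_weights(y_example)
print(example_weights[0], example_weights[-1])
```

Each class ends up with the same total weight (50 each here), so errors on the minority class are no longer drowned out by the majority.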

Large Datasets

automl = AutoML()
automl.fit(
    X_train, y_train,
    task="classification",
    time_budget=300,  # longer budget
    metric="roc_auc",
    n_splits=3,  # fewer folds
    early_stop=True,
    seed=42
)

Quick Prototyping

automl = AutoML()
automl.fit(
    X_train, y_train,
    task="classification",
    time_budget=30,  # short budget
    metric="accuracy",
    estimator_list=["lgbm", "rf"],  # fast learners only
    n_splits=3,
    seed=42
)
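The four scenarios above can be folded into one helper that returns keyword arguments for `automl.fit`. This is a sketch under stated assumptions: `suggest_fit_settings` is a hypothetical name, the 100,000-row cutoff for "large" is an arbitrary illustration, and the 0.3 minority-share cutoff is borrowed from the pipeline later in this post. Only parameters already used in this post appear in the result.

```python
def suggest_fit_settings(n_rows, minority_share, prototyping=False):
    """Pick fit keyword arguments from rough data characteristics."""
    settings = {"task": "classification", "seed": 42}
    if prototyping:
        settings.update(time_budget=30, metric="accuracy",
                        estimator_list=["lgbm", "rf"], n_splits=3)
    elif n_rows >= 100_000:  # assumed cutoff for "large"
        settings.update(time_budget=300, metric="roc_auc",
                        n_splits=3, early_stop=True)
    elif minority_share < 0.3:  # imbalanced binary target
        settings.update(time_budget=120, metric="f1", n_splits=5)
    else:
        settings.update(time_budget=120, metric="accuracy", n_splits=5)
    return settings

balanced_cfg = suggest_fit_settings(n_rows=5_000, minority_share=0.5)
imbalanced_cfg = suggest_fit_settings(n_rows=5_000, minority_share=0.05)
print(balanced_cfg)
print(imbalanced_cfg)
```

The result can be passed as `automl.fit(X_train, y_train, **settings)`; for the imbalanced case, `sample_weight` still has to be computed from the training labels and added separately.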

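Metric Selection Guide

# FLAML's built-in metric strings include accuracy, roc_auc, f1, macro_f1,
# micro_f1, and log_loss. Check the FLAML documentation before passing
# precision or recall as `metric`; they may require a custom metric function.
metric_guide = {
    'Balanced data': 'accuracy',
    'Imbalanced data': 'f1 or roc_auc',
    'High cost of FP': 'precision',
    'High cost of FN': 'recall',
    'Probability estimates matter': 'roc_auc or log_loss',
    'Multiclass': 'macro_f1 or micro_f1'
}

print("Metric selection guide:")
print("-" * 40)
for situation, metric in metric_guide.items():
    print(f"  {situation}: {metric}")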
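The difference between the macro, micro, and weighted averages in the multiclass row is easiest to see by computing them by hand. The sketch below uses only the standard library; the function names and toy labels are hypothetical, and real code would call `f1_score(..., average=...)` from scikit-learn instead.

```python
def per_class_counts(y_true, y_pred):
    """Count TP, FP, FN, and support for every class."""
    classes = sorted(set(y_true) | set(y_pred))
    stats = {c: {"tp": 0, "fp": 0, "fn": 0, "support": 0} for c in classes}
    for truth, pred in zip(y_true, y_pred):
        stats[truth]["support"] += 1
        if truth == pred:
            stats[truth]["tp"] += 1
        else:
            stats[pred]["fp"] += 1
            stats[truth]["fn"] += 1
    return stats

def f1_from_counts(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def averaged_f1(y_true, y_pred):
    """Return macro, micro, and support-weighted F1."""
    stats = per_class_counts(y_true, y_pred)
    per_class = {c: f1_from_counts(s["tp"], s["fp"], s["fn"]) for c, s in stats.items()}
    total = sum(s["support"] for s in stats.values())
    return {
        # Macro: plain mean over classes, so rare classes count equally
        "macro": sum(per_class.values()) / len(per_class),
        # Micro: pool all counts first, so frequent classes dominate
        "micro": f1_from_counts(sum(s["tp"] for s in stats.values()),
                                sum(s["fp"] for s in stats.values()),
                                sum(s["fn"] for s in stats.values())),
        # Weighted: mean over classes weighted by class frequency
        "weighted": sum(per_class[c] * stats[c]["support"] for c in stats) / total,
    }

y_true_example = [0, 0, 0, 0, 1, 1, 2]
y_pred_example = [0, 0, 0, 1, 1, 2, 2]
f1_scores = averaged_f1(y_true_example, y_pred_example)
print({name: round(value, 4) for name, value in f1_scores.items()})
```

For single-label multiclass problems, micro F1 equals accuracy, so macro F1 is usually the more informative choice when minority classes matter.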

Quick Reference Code

Complete Classification Pipeline

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.utils.class_weight import compute_sample_weight
from sklearn.metrics import classification_report, roc_auc_score
from flaml import AutoML

def classification_pipeline(df, target_col, time_budget=120):
    """End-to-end classification pipeline: encode, split, train, evaluate."""

    # 1. Separate features and target
    X = df.drop(columns=[target_col])
    y = df[target_col]

    # 2. Encode categorical (object dtype) columns as integers
    le_dict = {}  # fitted encoders, keyed by column name
    for col in X.select_dtypes(include=['object']).columns:
        le = LabelEncoder()
        X[col] = le.fit_transform(X[col].astype(str))
        le_dict[col] = le

    # 3. Stratified train/test split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    # 4. Check for imbalance and compute weights
    n_classes = y_train.nunique()
    minority_share = y_train.value_counts(normalize=True).min()
    # Treat as imbalanced when the rarest class has under 60% of its
    # balanced share (this equals the 0.3 cutoff for binary targets)
    if minority_share < 0.6 / n_classes:
        weights = compute_sample_weight('balanced', y_train)
        # "f1" is a binary metric; use "macro_f1" for multiclass targets
        metric = "f1" if n_classes == 2 else "macro_f1"
    else:
        weights = None
        metric = "accuracy"

    # 5. Train with FLAML
    automl = AutoML()
    automl.fit(
        X_train, y_train,
        task="classification",
        time_budget=time_budget,
        metric=metric,
        sample_weight=weights,
        seed=42,
        verbose=0
    )

    # 6. Evaluate on the held-out test set
    y_pred = automl.predict(X_test)
    y_prob = automl.predict_proba(X_test)

    print("=" * 50)
    print("Classification results")
    print("=" * 50)
    print(f"Best model: {automl.best_estimator}")
    print("\nClassification report:")
    print(classification_report(y_test, y_pred))

    # ROC AUC from the positive-class column applies to binary targets only
    if n_classes == 2:
        auc = roc_auc_score(y_test, y_prob[:, 1])
        print(f"ROC AUC: {auc:.4f}")

    return automl, X_test, y_test

# Usage example
# model, X_test, y_test = classification_pipeline(df, 'target')
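The ROC AUC that the pipeline prints for binary targets has a simple reading: it is the probability that a randomly chosen positive gets a higher score than a randomly chosen negative, with ties counting as half. The sketch below computes it that way in plain Python. `roc_auc_pairwise` is a hypothetical helper for illustration; it is quadratic in the data size, so real code should keep using `roc_auc_score`.

```python
def roc_auc_pairwise(y_true, y_score):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for t, s in zip(y_true, y_score) if t == 1]
    negatives = [s for t, s in zip(y_true, y_score) if t == 0]
    if not positives or not negatives:
        raise ValueError("Both classes must be present to compute ROC AUC")

    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))

# Toy data (hypothetical)
auc_example = roc_auc_pairwise([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc_example)
```

Because it depends only on the ranking of scores, ROC AUC is unaffected by the decision threshold, which is why it pairs well with a separate threshold-tuning step.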

Key Functions and Methods

flaml_methods = {
    'automl.fit()': 'Train models',
    'automl.predict()': 'Predict classes',
    'automl.predict_proba()': 'Predict probabilities',
    'automl.score()': 'Evaluate accuracy',
    'automl.best_estimator': 'Name of the best model',
    'automl.model': 'Best model object (underlying estimator: automl.model.estimator)',
    'automl.best_config': 'Best hyperparameters',
    'automl.best_loss': 'Best loss (1 - metric for metrics that are maximized)',
}

sklearn_metrics = {
    'accuracy_score': 'Accuracy',
    'precision_score': 'Precision',
    'recall_score': 'Recall',
    'f1_score': 'F1 score',
    'roc_auc_score': 'ROC AUC',
    'confusion_matrix': 'Confusion matrix',
    'classification_report': 'Classification report',
}

print("Key FLAML methods:")
for method, desc in flaml_methods.items():
    print(f"  {method}: {desc}")

print("\nsklearn evaluation functions:")
for func, desc in sklearn_metrics.items():
    print(f"  {func}: {desc}")
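Since `best_loss` is a loss and not a score, it helps to convert it back before reporting. The sketch below assumes that for metrics that are maximized the loss is `1 - metric`, and that for `log_loss` the loss is the metric itself. Both the helper `loss_to_score` and the contents of `MAXIMIZED_METRICS` are assumptions for illustration and should be checked against the FLAML documentation for the metric you use.

```python
# Metrics assumed to be maximized, i.e. reported as loss = 1 - metric
MAXIMIZED_METRICS = {"accuracy", "roc_auc", "f1", "macro_f1", "micro_f1"}

def loss_to_score(best_loss, metric):
    """Convert a reported best loss back to the metric's natural scale."""
    if metric in MAXIMIZED_METRICS:
        return 1.0 - best_loss
    return best_loss  # e.g. log_loss is already minimized as-is

# Hypothetical values standing in for automl.best_loss
accuracy_example = loss_to_score(0.08, "accuracy")
log_loss_example = loss_to_score(0.35, "log_loss")
print(round(accuracy_example, 2), log_loss_example)
```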

Part 3 Preview

Part 3 moves on to regression problems:

Basic concepts of regression
Regression metrics (MSE, RMSE, MAE, R²)
House price and sales forecasting projects
Log transformation and feature engineering
Ensembling and stacking

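Summary

In Part 2 we worked through the full workflow for classification problems with FLAML:

Basic concepts: understanding classification problems, evaluation metrics
Hands-on projects: Titanic, fraud detection, customer churn, spam classification
Imbalance handling: SMOTE, sample_weight, threshold adjustment
Model interpretation: feature importance, SHAP
Pipelines: sklearn integration, categorical handling

You now have the tools to tackle classification problems with confidence!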

FLAML AutoML Master Series #035

Part 2: Classification, Complete
