027 Early Stopping - early_stop

Keywords: early stopping, early_stop, efficiency

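Overview

Early stopping halts training once the model stops improving. In FLAML, the early_stop option applies the same idea to the hyperparameter search: the search can end before the time budget runs out if it is judged to have converged. Used well, it saves training time while still reaching good performance.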

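Environment

Python version: 3.11 recommended
Required packages: flaml[automl], scikit-learn (the examples also import matplotlib and pandas)

pip install "flaml[automl]" scikit-learn matplotlib pandas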

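What Is Early Stopping?

Basic Concept

If validation performance stops improving during training, training is halted before the planned number of iterations is reached.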

import numpy as np
import matplotlib.pyplot as plt

# Example learning curves (synthetic data, seeded so the plot is reproducible)
rng = np.random.default_rng(42)
epochs = np.arange(1, 101)
train_loss = 1 / (1 + epochs * 0.1) + rng.normal(0, 0.02, 100)
val_loss = 1 / (1 + epochs * 0.08) + 0.1 + rng.normal(0, 0.03, 100)

# Simulate overfitting: validation loss drifts upward after epoch 50
val_loss[50:] = val_loss[50:] + np.linspace(0, 0.3, 50)

plt.figure(figsize=(10, 6))
plt.plot(epochs, train_loss, label='Train Loss')
plt.plot(epochs, val_loss, label='Validation Loss')
plt.axvline(x=50, color='r', linestyle='--', label='Early Stop Point')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Early Stopping Example')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print("Without early stopping: train for 100 epochs")
print("With early stopping: stop at epoch 50 → 50% of the training time saved!")
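
The stopping point in the plot above is drawn by hand. In practice it is usually found with a patience rule: stop once the validation loss has gone a fixed number of epochs without improving. The following is a minimal sketch of that rule in plain Python. The function name early_stopping_epoch, the patience and min_delta parameters, and the loss values are all made up for illustration and are not part of FLAML.

```python
def early_stopping_epoch(val_losses, patience=5, min_delta=0.0):
    """Return (stop_epoch, best_epoch), both 1-based.

    Training stops once the validation loss has failed to improve by more
    than min_delta for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0

    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss - min_delta:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch

    # The rule never fired: training ran through every epoch
    return len(val_losses), best_epoch


# Made-up validation losses: falling until epoch 4, rising afterwards
losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.53, 0.56, 0.60, 0.65]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=3)
print(f"best epoch: {best_epoch}, stopped at epoch: {stop_epoch}")
```

With patience=3 the loop stops three epochs after the best one, so the model from the best epoch is the one worth keeping.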

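Why Is Early Stopping Needed?

# 1. Prevent overfitting
# 2. Save training time
# 3. Use resources efficiently

benefits = {
    'Prevent overfitting': 'Stop before validation performance degrades',
    'Save time': 'Skip training iterations that no longer help',
    'Resource efficiency': 'Free up GPU/CPU resources'
}

for benefit, description in benefits.items():
    print(f"✓ {benefit}: {description}")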

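FLAML's Early Stopping Mechanism

Automatic Early Stopping

By default, FLAML keeps the search efficient by bounding it with time_budget and by trying low-cost configurations first. Stopping the search once it has converged is a separate option, early_stop, covered in the next section.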

from flaml import AutoML
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Prepare the data
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# FLAML with default settings (the search is bounded by time_budget)
automl = AutoML()
automl.fit(
    X_train, y_train,
    task="classification",
    time_budget=60,
    metric="accuracy",
    verbose=1  # low verbosity; raise it (the default is 3) to see more of the search log
)

print(f"\nBest model: {automl.best_estimator}")
print(f"Validation accuracy: {1 - automl.best_loss:.4f}")

The early_stop Parameter

# early_stop is a boolean: when True, the search stops early once it is
# considered to have converged. The default is False.
automl_early = AutoML()
automl_early.fit(
    X_train, y_train,
    task="classification",
    time_budget=60,
    early_stop=True,  # stop the search on convergence (not the default)
    verbose=0
)

automl_no_early = AutoML()
automl_no_early.fit(
    X_train, y_train,
    task="classification",
    time_budget=60,
    early_stop=False,  # keep searching until time_budget is used up (default)
    verbose=0
)

print(f"early_stop=True: {automl_early.best_estimator}")
print(f"early_stop=False: {automl_no_early.best_estimator}")

Early Stopping by Model

LightGBM/XGBoost Early Stopping

# LightGBM and XGBoost support early stopping natively
automl_lgbm = AutoML()
automl_lgbm.fit(
    X_train, y_train,
    task="classification",
    time_budget=60,
    estimator_list=["lgbm"],
    verbose=0
)

# Inspect the selected hyperparameters
print("LightGBM configuration:")
config = automl_lgbm.best_config
for key, value in config.items():
    print(f"  {key}: {value}")

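Custom Early Stopping

# Custom early stopping predicate (illustrative)
def custom_early_stop(result, **kwargs):
    """
    Custom early stopping condition.

    result: dict of results so far
    return: True to stop training
    """
    # Example: stop once accuracy reaches 0.99
    if 'accuracy' in result and result['accuracy'] >= 0.99:
        return True
    return False

# Note (advanced use): AutoML.fit documents early_stop as a boolean, so do not
# assume a function can be passed as early_stop=custom_early_stop.
# A predicate like this is better applied in a loop you control yourself.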
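
One way to use a stopping predicate like the one above is inside a hand-written loop over trials. The sketch below is an assumption-heavy illustration rather than FLAML code: reached_target, run_trials, and the simulated accuracies are all hypothetical names and values, and no FLAML API is called.

```python
def reached_target(result, target=0.99):
    """Hypothetical stopping rule: True once accuracy reaches the target."""
    return result.get("accuracy", 0.0) >= target


def run_trials(trial_results, should_stop):
    """Consume trial results in order and stop as soon as should_stop fires."""
    completed = []
    for result in trial_results:
        completed.append(result)
        if should_stop(result):
            break
    return completed


# Made-up accuracies standing in for successive trials
simulated = [{"accuracy": a} for a in (0.95, 0.97, 0.992, 0.98, 0.995)]
completed = run_trials(simulated, reached_target)
print(f"ran {len(completed)} of {len(simulated)} trials")
```

Keeping the rule in a separate function makes it easy to swap in a different condition, such as a minimum number of trials before stopping is allowed.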

Early Stopping and the Time Budget

Relationship with time_budget

import time

# Short time_budget + early stopping
start = time.time()
automl_short = AutoML()
automl_short.fit(
    X_train, y_train,
    task="classification",
    time_budget=30,  # 30 seconds
    early_stop=True,
    verbose=0
)
time_short = time.time() - start

# Long time_budget + early stopping
start = time.time()
automl_long = AutoML()
automl_long.fit(
    X_train, y_train,
    task="classification",
    time_budget=120,  # 120 seconds
    early_stop=True,
    verbose=0
)
time_long = time.time() - start

print(f"time_budget=30s: actual {time_short:.1f}s, score={1 - automl_short.best_loss:.4f}")
print(f"time_budget=120s: actual {time_long:.1f}s, score={1 - automl_long.best_loss:.4f}")
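
The interaction between the two limits can be pictured with a small simulation: the time budget is a hard cap, and early stopping can end the search sooner when nothing improves. This is only a sketch under stated assumptions. The patience-based convergence rule, the simulate_search function, and the trial costs and losses are invented for illustration and do not reproduce FLAML's own convergence test.

```python
def simulate_search(trials, time_budget, early_stop=False, patience=3):
    """Simulate a search over (cost_seconds, loss) trials.

    Stops when the next trial would exceed time_budget or, if early_stop is
    True, when `patience` consecutive trials bring no improvement.
    Returns (trials_run, best_loss, elapsed_seconds, reason).
    """
    elapsed = 0.0
    best_loss = float("inf")
    stale = 0
    trials_run = 0

    for cost, loss in trials:
        if elapsed + cost > time_budget:
            return trials_run, best_loss, elapsed, "time budget"
        elapsed += cost
        trials_run += 1
        if loss < best_loss:
            best_loss = loss
            stale = 0
        else:
            stale += 1
        if early_stop and stale >= patience:
            return trials_run, best_loss, elapsed, "converged"

    return trials_run, best_loss, elapsed, "no trials left"


# Made-up trials: (cost in seconds, validation loss)
trials = [(5, 0.20), (5, 0.12), (10, 0.08), (10, 0.09),
          (10, 0.08), (10, 0.10), (10, 0.085), (10, 0.09)]

print(simulate_search(trials, time_budget=30))
print(simulate_search(trials, time_budget=120))
print(simulate_search(trials, time_budget=120, early_stop=True))
```

With these made-up numbers, the 30-second run is cut off by the budget after 4 trials, while the 120-second run with early stopping ends after 6 trials (50 simulated seconds) with the same best loss. A generous budget therefore costs little when the search converges early.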

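The retrain_full Option

# Retrain on the full data after the best configuration is found
automl_retrain = AutoML()
automl_retrain.fit(
    X_train, y_train,
    task="classification",
    time_budget=60,
    retrain_full=True,  # retrain the best configuration on the full training data (default)
    verbose=0
)

# What retrain_full=True does:
# 1. Hyperparameters are searched, scoring each configuration on validation data
#    (cross-validation or a holdout split)
# 2. The best configuration is retrained on the full training data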
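
The two steps above can be sketched without FLAML: pick a hyperparameter on a holdout split, then refit on every row with the winning value. The example below uses a tiny hand-written k-nearest-neighbors regressor on synthetic data. Everything in it (knn_predict, holdout_mse, the candidate values of k, the data) is an assumption chosen to keep the sketch short, not FLAML's procedure.

```python
import random
from statistics import mean


def knn_predict(train, x, k):
    """Predict y at x as the mean y of the k nearest training points."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return mean(y for _, y in nearest)


def holdout_mse(train, holdout, k):
    """Mean squared error on the holdout points for a given k."""
    return mean((knn_predict(train, x, k) - y) ** 2 for x, y in holdout)


# Synthetic data: y = 2x plus noise
random.seed(0)
data = [(i / 10, 2 * (i / 10) + random.gauss(0, 1)) for i in range(100)]
random.shuffle(data)

# Step 1: search on a split of the training data
search_part, holdout_part = data[:80], data[80:]
candidates = (1, 3, 5, 9)
best_k = min(candidates, key=lambda k: holdout_mse(search_part, holdout_part, k))

# Step 2: "retrain" with the chosen k on all rows, holdout included
final_train = search_part + holdout_part
prediction = knn_predict(final_train, 5.0, best_k)
print(f"chosen k: {best_k}, prediction at x=5.0: {prediction:.2f}")
```

The point of the second step is that the holdout rows were only borrowed for model selection. Once the choice is made, the final model can learn from them too.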

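Efficient Hyperparameter Search

FLAML's Search Strategy

# FLAML uses Cost-Frugal Optimization (CFO):
# - Starts from low-cost configurations
# - Allocates more resources to promising configurations
# - Spends little time on poorly performing configurations

automl = AutoML()
automl.fit(
    X_train, y_train,
    task="classification",
    time_budget=60,
    verbose=3  # verbose log output (3 is the default; higher values print more)
)

# Inspect the search history.
# config_history gets one entry each time the best model is updated,
# so its length is the number of improvements, not the number of trials.
print("\nNumber of best-model updates:", len(automl.config_history))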
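
The "start cheap, grow only while it pays off" idea can be shown with a toy search over a single model-size hyperparameter. This is not FLAML's CFO implementation, which uses a randomized local search over many hyperparameters. The function low_cost_first_search, the doubling rule, and the toy_evaluate loss and cost formulas are all assumptions made up for this sketch.

```python
def low_cost_first_search(evaluate, start_size=4, max_size=1024, budget=100.0):
    """Toy search: begin at the cheapest size, double while the loss improves."""
    size = start_size
    best_size, best_loss = None, float("inf")
    spent = 0.0
    evaluated = []

    while size <= max_size and spent < budget:
        loss, cost = evaluate(size)
        spent += cost
        evaluated.append(size)
        if loss >= best_loss:
            break  # growing the model stopped paying off
        best_size, best_loss = size, loss
        size *= 2

    return best_size, best_loss, spent, evaluated


def toy_evaluate(size):
    """Made-up validation loss and training cost for a model of a given size."""
    loss = 1 / size ** 0.5 + 0.0005 * size  # improves, then overfits
    cost = 0.01 * size                      # larger models cost more
    return loss, cost


best_size, best_loss, spent, evaluated = low_cost_first_search(toy_evaluate)
print(f"sizes tried: {evaluated}")
print(f"best size: {best_size}, loss: {best_loss:.4f}, cost spent: {spent:.2f}")
```

Because the cheap sizes are tried first, most of the cost is spent only on the last one or two evaluations, and the expensive sizes beyond the turning point are never trained at all.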

Analyzing the Search History

import pandas as pd

# Convert the search history to a DataFrame.
# config_history maps iteration -> (estimator, config, time), with a new
# entry each time the best model is updated.
history = []
for iter_id, (estimator, config, wall_time) in automl.config_history.items():
    history.append({
        'iteration': iter_id,
        'estimator': estimator,
        'config': config,
        'time_s': wall_time
    })

df_history = pd.DataFrame(history)
print("Search history (first 10 rows):")
print(df_history.head(10))
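
When reading a search history, the best-so-far loss over time is usually more informative than the raw per-trial losses, because it shows when the search actually made progress. The sketch below computes it from a made-up per-trial log. The trial_log values are hypothetical and are not FLAML output.

```python
from itertools import accumulate

# Hypothetical per-trial log: (elapsed_seconds, validation_loss)
trial_log = [(1.2, 0.080), (2.5, 0.095), (4.1, 0.062), (6.0, 0.070), (9.3, 0.055)]

# Running minimum of the loss: the best value seen up to each trial
losses = [loss for _, loss in trial_log]
best_so_far = list(accumulate(losses, min))

# Keep only the trials that set a new best
improvements = [
    (elapsed, loss)
    for i, (elapsed, loss) in enumerate(trial_log)
    if i == 0 or loss < best_so_far[i - 1]
]

for elapsed, loss in improvements:
    print(f"new best at {elapsed:.1f}s: loss={loss:.3f}")
```

Long gaps between consecutive improvements are the signal that a convergence-based stop is looking for.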

Practical Tips

1. Early Stopping on Large Datasets

from sklearn.datasets import make_classification

# Generate a large dataset
X_large, y_large = make_classification(
    n_samples=50000,
    n_features=50,
    n_informative=30,
    random_state=42
)

X_train_lg, X_test_lg, y_train_lg, y_test_lg = train_test_split(
    X_large, y_large, test_size=0.2, random_state=42
)

# Search efficiently with early stopping
automl_large = AutoML()
automl_large.fit(
    X_train_lg, y_train_lg,
    task="classification",
    time_budget=120,
    early_stop=True,
    n_splits=3,  # fewer folds for speed (applies when cross-validation is used)
    verbose=0
)

print(f"Best model on the large dataset: {automl_large.best_estimator}")
print(f"Test accuracy: {automl_large.score(X_test_lg, y_test_lg):.4f}")

2. Monitoring Overfitting

from sklearn.metrics import accuracy_score

# Judge overfitting from the gap between training and test accuracy
y_pred_train = automl.predict(X_train)
y_pred_test = automl.predict(X_test)

train_acc = accuracy_score(y_train, y_pred_train)
test_acc = accuracy_score(y_test, y_pred_test)

print(f"Training accuracy: {train_acc:.4f}")
print(f"Test accuracy: {test_acc:.4f}")
print(f"Gap: {train_acc - test_acc:.4f}")

# The 0.05 threshold is a rule of thumb, not a hard limit
if train_acc - test_acc > 0.05:
    print("⚠️ Possible overfitting")
else:
    print("✓ No clear sign of overfitting")

3. Ensembles and Early Stopping

# Early stopping combined with an ensemble
automl_ensemble = AutoML()
automl_ensemble.fit(
    X_train, y_train,
    task="classification",
    time_budget=120,
    ensemble=True,  # enable the ensemble
    early_stop=True,
    verbose=0
)

print(f"Ensemble model score: {automl_ensemble.score(X_test, y_test):.4f}")

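Summary

Early stopping improves efficiency by cutting off training that no longer helps.
FLAML uses an efficient search strategy by default.
Setting early_stop=True stops the search once it has converged; the default is False.
Combining time_budget with early stopping caps the search time and lets the search end sooner when it converges.
retrain_full=True retrains the best configuration on the full training data after the search.
On large datasets, reduce n_splits and enable early stopping together.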

Next Post

The next post covers a classification project: credit card fraud detection. It shows how to apply FLAML in a realistic fraud detection scenario.

FLAML AutoML Master Series #027
