028 Random Forest Classification in Detail

Keywords: random forest, rf

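Overview

Random Forest is an ensemble algorithm that combines many decision trees. Because each tree is trained on a different bootstrap sample and the predictions are aggregated, it reduces the overfitting a single tree is prone to and usually performs better.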

Environment

Python version: 3.11 recommended
Required package: pycaret[full]>=3.0

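How Random Forest Works

Bootstrap sampling: draw samples from the original data with replacement
Random feature selection: consider only a random subset of features at each split
Majority vote: aggregate the predictions of all trees

Data → [bootstrap sample 1] → tree 1 → prediction 1
     → [bootstrap sample 2] → tree 2 → prediction 2  → majority vote → final prediction
     → [bootstrap sample N] → tree N → prediction N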
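
The sampling and voting steps above can be sketched in plain Python. This is only an illustration under simplifying assumptions: `bootstrap_sample` and `majority_vote` are hypothetical helper names (not part of PyCaret or scikit-learn), the rows are stand-in integers, and per-split random feature selection is left out.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) items from rows with replacement."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(predictions):
    """Return the most common label among the per-tree predictions."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(42)
rows = list(range(10))                 # stand-in for 10 training rows
sample = bootstrap_sample(rows, rng)   # same size as the data, duplicates allowed
out_of_bag = sorted(set(rows) - set(sample))

print("bootstrap sample:", sample)
print("left out (out-of-bag):", out_of_bag)
print("vote:", majority_vote([1, 0, 1, 1, 0]))
```

Each tree would be trained on its own `sample`; rows that end up in `out_of_bag` are the ones an OOB estimate can later use for validation.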

Random Forest in PyCaret

from pycaret.classification import *
from pycaret.datasets import get_data

# Load the data
data = get_data('diabetes')
clf = setup(data, target='Class variable', session_id=42, verbose=False)

# Create a random forest model
rf = create_model('rf')

Key Hyperparameters

# n_estimators: number of trees (default 100)
rf_100 = create_model('rf', n_estimators=100)
rf_200 = create_model('rf', n_estimators=200)
rf_500 = create_model('rf', n_estimators=500)

# max_depth: maximum depth of each tree
rf_d5 = create_model('rf', max_depth=5)
rf_d10 = create_model('rf', max_depth=10)
rf_dnone = create_model('rf', max_depth=None)  # no limit (default)

# min_samples_split: minimum number of samples required to split a node
rf_split = create_model('rf', min_samples_split=10)

# min_samples_leaf: minimum number of samples in a leaf node
rf_leaf = create_model('rf', min_samples_leaf=5)

# max_features: number (or fraction) of features considered at each split
rf_sqrt = create_model('rf', max_features='sqrt')  # default
rf_log2 = create_model('rf', max_features='log2')
rf_half = create_model('rf', max_features=0.5)

# bootstrap: whether to use bootstrap sampling
rf_boot = create_model('rf', bootstrap=True)  # default

# class_weight: class weights
rf_balanced = create_model('rf', class_weight='balanced')
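
To make `class_weight='balanced'` less abstract, the sketch below reproduces the weighting rule scikit-learn documents for it, n_samples / (n_classes * class_count), in plain Python. The helper name `balanced_class_weights` and the 90/10 label split are assumptions made up for this illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Hypothetical imbalanced target: 90 negatives, 10 positives
y = [0] * 90 + [1] * 10
weights = balanced_class_weights(y)
print(weights)  # the rare class gets the larger weight
```

With these labels the minority class is weighted 5.0 and the majority class about 0.56, so errors on the rare class cost roughly nine times more during training.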

Effect of n_estimators

from pycaret.classification import *
from pycaret.datasets import get_data
import pandas as pd

data = get_data('credit')
clf = setup(data, target='default', session_id=42, verbose=False)

results = []

for n_trees in [10, 50, 100, 200, 500]:
    rf = create_model('rf', n_estimators=n_trees, verbose=False)
    # Score grid from the last call: one row per fold plus 'Mean' and 'Std' rows
    metrics = pull()

    # Read the summary rows directly; averaging the whole column would
    # also average in the 'Mean' and 'Std' rows
    results.append({
        'n_trees': n_trees,
        'accuracy': metrics.loc['Mean', 'Accuracy'],
        'auc': metrics.loc['Mean', 'AUC'],
        'std': metrics.loc['Std', 'Accuracy']
    })

df = pd.DataFrame(results)
print(df)

# 100-200 trees are usually enough
# Beyond that, the gains are marginal while training time keeps growing
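
One way to see why extra trees give diminishing returns is an idealized voting model. The sketch below assumes every tree is right independently with the same probability `p` (a strong assumption; real trees in a forest are correlated, so gains level off even sooner) and computes the exact probability that the majority is right. `majority_accuracy` is a hypothetical helper, not a PyCaret function.

```python
from math import comb


def majority_accuracy(n_voters, p):
    """P(majority of n independent voters is right), each right with prob. p (n odd)."""
    needed = n_voters // 2 + 1
    return sum(comb(n_voters, k) * p**k * (1 - p)**(n_voters - k)
               for k in range(needed, n_voters + 1))


# Accuracy of the vote as the number of (idealized) trees grows
for n in [1, 11, 51, 101, 201, 501]:
    print(n, round(majority_accuracy(n, 0.6), 4))
```

The printed values climb quickly at first and then flatten out close to 1, which mirrors the pattern in the comments above.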

OOB (Out-of-Bag) Score

Each tree is validated on the samples that were not drawn into its bootstrap sample:

from pycaret.classification import *
from pycaret.datasets import get_data

data = get_data('diabetes')
clf = setup(data, target='Class variable', session_id=42, verbose=False)

# Enable the OOB score
rf = create_model('rf', oob_score=True, verbose=False)

# Check the OOB score
print(f"OOB Score: {rf.oob_score_:.4f}")

# The OOB score is usually close to the cross-validation score
# It gives a performance estimate without a separate validation set
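
How many samples end up out-of-bag? A given row is missed by one bootstrap sample of size n with probability (1 - 1/n)^n, which tends to 1/e (about 37%). The short simulation below checks this; `oob_fraction` is a hypothetical helper and the row indices are synthetic.

```python
import random


def oob_fraction(n_samples, rng):
    """Fraction of rows that never appear in one bootstrap sample of size n_samples."""
    drawn = {rng.randrange(n_samples) for _ in range(n_samples)}
    return 1 - len(drawn) / n_samples


rng = random.Random(0)
n = 10_000
simulated = oob_fraction(n, rng)
expected = (1 - 1 / n) ** n   # probability that a given row is never drawn

print(f"simulated: {simulated:.3f}, expected: {expected:.3f}")
```

So each tree has roughly a third of the training rows available as unseen validation data, which is what makes the OOB estimate possible without a hold-out set.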

Feature Importance

from pycaret.classification import *
from pycaret.datasets import get_data
import pandas as pd

data = get_data('diabetes')
clf = setup(data, target='Class variable', session_id=42, verbose=False)

rf = create_model('rf', verbose=False)

# Feature importances
# Use the transformed training data so the names match the columns the model was fit on
feature_names = get_config('X_train_transformed').columns
importances = rf.feature_importances_

importance_df = pd.DataFrame({
    'feature': feature_names,
    'importance': importances
}).sort_values('importance', ascending=False)

print("Feature importances:")
print(importance_df)

# Plot
plot_model(rf, plot='feature')
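
Where do the numbers in `feature_importances_` come from? In scikit-learn they are impurity-based: each tree reports its own normalized importances, and the forest averages them. The sketch below checks that on synthetic data with scikit-learn directly (no PyCaret session needed); the dataset and settings are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data as a stand-in for a real dataset
X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           random_state=0)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Average the per-tree importances and renormalize so they sum to 1
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])
manual = per_tree.mean(axis=0)
manual = manual / manual.sum()

print(np.round(forest.feature_importances_, 3))
print("sum:", forest.feature_importances_.sum())
print("matches per-tree average:", np.allclose(manual, forest.feature_importances_))
```

Because the values are relative shares that sum to 1, they rank features within one model but are not comparable across datasets.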

Tuning

from pycaret.classification import *
from pycaret.datasets import get_data

data = get_data('credit')
clf = setup(data, target='default', session_id=42, verbose=False)

# Baseline model
rf = create_model('rf', verbose=False)

# Automatic tuning
tuned_rf = tune_model(rf, optimize='AUC')

# Custom grid
custom_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [5, 10, 15, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2', 0.5]
}

tuned_rf = tune_model(rf, custom_grid=custom_grid, optimize='AUC')
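
It helps to know how large this grid is before tuning. The sketch below simply counts the combinations. As far as I know, `tune_model` defaults to a random search with `n_iter=10`, so under that assumption only a small sample of the grid is evaluated unless `n_iter` is raised.

```python
from math import prod

custom_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [5, 10, 15, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2', 0.5],
}

# Number of distinct hyperparameter combinations in the grid
n_combinations = prod(len(values) for values in custom_grid.values())
print(n_combinations)  # 324
```

With 324 combinations and cross-validation on each, an exhaustive search would be expensive; sampling a subset is a reasonable trade-off for a first pass.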

Visualization

from pycaret.classification import *
from pycaret.datasets import get_data

data = get_data('diabetes')
clf = setup(data, target='Class variable', session_id=42, verbose=False)

rf = create_model('rf', verbose=False)

# ROC curve (AUC)
plot_model(rf, plot='auc')

# Confusion matrix
plot_model(rf, plot='confusion_matrix')

# Feature importance
plot_model(rf, plot='feature')

# Learning curve
plot_model(rf, plot='learning')

Inspecting Individual Trees

# Assumes `rf` and the PyCaret session from the previous section
# Access the individual trees in the forest
print(f"Number of trees: {len(rf.estimators_)}")

# First tree
first_tree = rf.estimators_[0]

# Plot the first tree
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

plt.figure(figsize=(15, 8))
plot_tree(
    first_tree,
    # Names of the columns the model was actually fit on
    feature_names=list(get_config('X_train_transformed').columns),
    max_depth=3,  # limit the plotted depth
    filled=True,
    rounded=True
)
plt.title("Random Forest - First Tree")
plt.tight_layout()
plt.savefig('rf_first_tree.png', dpi=150)
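
Once the individual trees are accessible, it is easy to check how their outputs are combined. In scikit-learn the forest averages the class probabilities of its trees (a soft vote) rather than counting hard votes. The sketch below verifies this on synthetic data with scikit-learn directly; the dataset is an assumption made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=6, random_state=0)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# Average the class probabilities predicted by each individual tree
tree_probas = np.stack([tree.predict_proba(X) for tree in forest.estimators_])
mean_proba = tree_probas.mean(axis=0)

# The forest's probabilities should equal that average
same_proba = np.allclose(mean_proba, forest.predict_proba(X))

# The predicted class is the one with the highest averaged probability
manual_pred = forest.classes_[mean_proba.argmax(axis=1)]
same_pred = np.array_equal(manual_pred, forest.predict(X))

print("probabilities match:", same_proba)
print("predictions match:", same_pred)
```

For fully grown trees with pure leaves, each tree outputs probabilities of 0 or 1, so the average behaves like a plain majority vote.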

Parallel Processing

# n_jobs: number of CPU cores to use
rf_parallel = create_model('rf', n_jobs=-1)  # use all cores
rf_single = create_model('rf', n_jobs=1)     # single core

# Shortens training time on large datasets
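
A natural question is whether `n_jobs` changes the model itself. With a fixed `random_state` it should only affect speed, because each tree's seed is set before the parallel fit. The sketch below checks this with scikit-learn on synthetic data; `n_jobs=2` is used so it also runs on small machines, and the dataset is an assumption for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Same seed, different number of workers
single = RandomForestClassifier(n_estimators=50, random_state=42, n_jobs=1).fit(X, y)
parallel = RandomForestClassifier(n_estimators=50, random_state=42, n_jobs=2).fit(X, y)

same = np.allclose(single.predict_proba(X), parallel.predict_proba(X))
print("same predictions:", same)
```

The comparison uses `np.allclose` rather than exact equality because the order in which per-tree results are summed can differ between runs.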

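Pros and Cons

Pros:

High accuracy
Robust to overfitting
Provides feature importances
Scales to large datasets
Less sensitive to hyperparameter tuning

Cons:

Hard to interpret (black box)
Training can be slow
High memory usage
May be unsuitable for real-time prediction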

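When to Use It?

# 1. A good baseline for most classification problems
best = compare_models()  # RF often ranks near the top

# 2. When you need feature importances
plot_model(rf, plot='feature')

# 3. When overfitting is a concern
# More stable than a single tree

# 4. Rapid prototyping
# Performs well even without tuning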

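Random Forest vs Decision Tree

Aspect	Decision Tree	Random Forest
Overfitting	High	Low
Interpretability	High	Low
Performance	Moderate	High
Stability	Low	High
Training time	Fast	Moderate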
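
The overfitting row of the table can be checked with a small experiment. The sketch below fits an unrestricted decision tree and a random forest on noisy synthetic data using scikit-learn directly and prints train and test accuracy for each. The dataset, noise level, and split are assumptions for illustration, and the exact numbers depend on them.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data (flip_y assigns random labels to about 10% of rows)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # (train accuracy, test accuracy)
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"{name}: train={scores[name][0]:.3f}, test={scores[name][1]:.3f}")
```

The unrestricted tree memorizes the training set, noise included, so the gap between its train and test accuracy is the thing to look at; the forest would typically be expected to generalize better on data like this.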

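Summary

A random forest is an ensemble of many trees
n_estimators of 100-200 is typical
Control overfitting with max_depth and min_samples_leaf
Use the OOB score to estimate performance without a separate validation set
Use feature importances for variable selection
A good baseline for most problems

Coming Up Next

The next post covers Gradient Boosting classification in detail.

PyCaret Machine Learning Master Series #028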
