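017 Binary Classification - Titanic Survival Prediction

Keywords: binary classification, Titanic

Overview

The Titanic dataset is one of the first datasets that machine learning beginners encounter. The task is a binary classification problem: predict whether a passenger survived, based on information about that passenger.

In this post, we build a Titanic survival prediction model with FLAML.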

Environment

Python version: 3.11 recommended
Required packages: flaml[automl], pandas, scikit-learn, seaborn, matplotlib

pip install "flaml[automl]" pandas scikit-learn seaborn matplotlib

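Project Overview

Goal

Predict survived (1) or died (0) from Titanic passenger information.

Dataset Information

Passengers: 891 (training data)
Features: age, sex, passenger class, fare, and others
Target: survived (0: died, 1: survived)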

Step 1: Load the Data

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from flaml import AutoML
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load the Titanic data from seaborn (downloaded on first use, so this needs network access)
df = sns.load_dataset('titanic')

print("Data shape:", df.shape)
print("\nColumns:")
print(df.columns.tolist())
print("\nFirst 5 rows:")
print(df.head())

Output

Data shape: (891, 15)

Columns:
['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare',
 'embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town',
 'alive', 'alone']

Step 2: Explore the Data

Target Distribution

print("Survival distribution:")
print(df['survived'].value_counts())
print(f"\nSurvival rate: {df['survived'].mean()*100:.1f}%")

# Plot (sort_index keeps the bars in 0, 1 order so the colors match the labels)
plt.figure(figsize=(6, 4))
df['survived'].value_counts().sort_index().plot(kind='bar', color=['salmon', 'lightgreen'])
plt.title('Survival distribution')
plt.xlabel('Survived (0=died, 1=survived)')
plt.ylabel('Number of passengers')
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()

Checking Missing Values

print("Missing values:")
print(df.isnull().sum())
print("\nMissing value ratio (%):")
print((df.isnull().sum() / len(df) * 100).round(1))

Output

Missing values:
survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
...
deck           688

Survival Rate by Key Feature

# Survival rate by sex
print("Survival rate by sex:")
print(df.groupby('sex')['survived'].mean())

# Survival rate by passenger class
print("\nSurvival rate by passenger class:")
print(df.groupby('pclass')['survived'].mean())
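The two groupby calls above look at one feature at a time. A minimal sketch of combining them into a single sex by passenger class table follows. The helper name `survival_rate_table` is hypothetical, and the small frame is hand-made sample data, not the real Titanic rows.

```python
import pandas as pd

def survival_rate_table(df: pd.DataFrame) -> pd.DataFrame:
    """Mean of `survived` for each sex x pclass combination."""
    return df.pivot_table(index="sex", columns="pclass",
                          values="survived", aggfunc="mean")

# Hand-made sample for illustration only (NOT the real Titanic data)
toy = pd.DataFrame({
    "sex": ["female", "female", "male", "male", "male", "female"],
    "pclass": [1, 3, 1, 3, 3, 1],
    "survived": [1, 0, 1, 0, 0, 1],
})

table = survival_rate_table(toy)
print(table)
```

With the real data, the same helper could be called as `survival_rate_table(df)` on the frame loaded in Step 1.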

Step 3: Preprocess the Data

# Select the features to use
features = ['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked', 'alone']

# Copy the data
df_model = df[features + ['survived']].copy()

# Handle missing values (assign back instead of a chained inplace fillna,
# which recent pandas versions warn about)
df_model['age'] = df_model['age'].fillna(df_model['age'].median())
df_model['embarked'] = df_model['embarked'].fillna(df_model['embarked'].mode()[0])

# Encode categorical variables
df_model['sex'] = df_model['sex'].map({'male': 0, 'female': 1})
df_model['embarked'] = df_model['embarked'].map({'S': 0, 'C': 1, 'Q': 2})
df_model['alone'] = df_model['alone'].astype(int)

print("Data after preprocessing:")
print(df_model.head())
print(f"\nMissing values: {df_model.isnull().sum().sum()}")
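The preprocessing above computes the median age and the most frequent port on the full dataset, before the train/test split. A common alternative is to learn those fill values from the training rows only and then apply them to both splits. The sketch below illustrates that idea on a tiny hand-made frame. The helpers `fit_fill_values` and `apply_fill_values` are hypothetical names, not part of the original walkthrough.

```python
import pandas as pd

def fit_fill_values(train: pd.DataFrame) -> dict:
    """Learn fill values from the training rows only."""
    return {
        "age": train["age"].median(),
        "embarked": train["embarked"].mode()[0],
    }

def apply_fill_values(df: pd.DataFrame, fill_values: dict) -> pd.DataFrame:
    """Fill missing values column by column; returns a new DataFrame."""
    return df.fillna(value=fill_values)

# Hand-made sample for illustration only (NOT the real Titanic data)
toy_train = pd.DataFrame({"age": [20, 30, None, 40],
                          "embarked": ["S", "S", "C", None]})
toy_test = pd.DataFrame({"age": [None, 50],
                         "embarked": [None, "Q"]})

fill_values = fit_fill_values(toy_train)      # median age 30.0, mode "S"
filled_train = apply_fill_values(toy_train, fill_values)
filled_test = apply_fill_values(toy_test, fill_values)
print(filled_test)
```

For a dataset this small the difference is usually minor, but the pattern keeps the test rows from influencing any statistic used during training.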

Step 4: Split the Data

# Separate features and target
X = df_model.drop('survived', axis=1)
y = df_model['survived']

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y  # preserve the class ratio
)

print(f"Training data: {X_train.shape[0]} passengers")
print(f"Test data: {X_test.shape[0]} passengers")
print(f"\nTraining survival rate: {y_train.mean()*100:.1f}%")
print(f"Test survival rate: {y_test.mean()*100:.1f}%")
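To see what `stratify=y` does in isolation, here is a small sketch on synthetic labels (60 zeros and 40 ones, made up for this example). With stratification, both splits keep the same 40% positive rate as the full label vector.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels for illustration: 60 negatives, 40 positives
y_toy = np.array([0] * 60 + [1] * 40)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.2,
    random_state=42,
    stratify=y_toy,  # keep the 60/40 class ratio in both splits
)

print(f"Overall positive rate: {y_toy.mean():.2f}")
print(f"Train positive rate:   {y_tr.mean():.2f}")
print(f"Test positive rate:    {y_te.mean():.2f}")
```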

Step 5: Train with FLAML AutoML

# Create the AutoML object
automl = AutoML()

# Settings
settings = {
    "task": "classification",
    "time_budget": 60,  # seconds
    "metric": "accuracy",
    "estimator_list": ["lgbm", "xgboost", "rf", "extra_tree"],
    "seed": 42,
    "verbose": 1
}

print("="*60)
print("Starting FLAML AutoML training")
print("="*60)

# Train
automl.fit(X_train, y_train, **settings)

Output

[flaml.automl.logger: INFO] Iteration 1, current learner lgbm
[flaml.automl.logger: INFO] at 0.3s, best lgbm's error=0.1796, ...
...
[flaml.automl.logger: INFO] Best ML model: lgbm
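The settings dictionary is the main place to steer the search. As a hedged variation, the fragment below switches the metric to ROC AUC and asks for 5-fold cross-validation. To my understanding `metric`, `eval_method`, `n_splits`, and `log_file_name` are arguments accepted by `AutoML.fit`, but check them against the FLAML version you have installed. The log file name is a made-up example.

```python
# Alternative settings sketch (a config fragment, not run against FLAML here)
settings_cv = {
    "task": "classification",
    "time_budget": 60,                       # seconds, same budget as above
    "metric": "roc_auc",                     # rank-based metric instead of accuracy
    "eval_method": "cv",                     # cross-validation instead of a holdout split
    "n_splits": 5,                           # number of CV folds
    "estimator_list": ["lgbm", "xgboost", "rf", "extra_tree"],
    "seed": 42,
    "log_file_name": "titanic_flaml.log",    # hypothetical file name for the search log
}

# Usage would mirror the original call:
# automl.fit(X_train, y_train, **settings_cv)
```

Note that with a metric other than accuracy, `1 - automl.best_loss` would report that metric's validation score rather than validation accuracy.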

Step 6: Check the Results

print("="*60)
print("Training results")
print("="*60)
print(f"Best model: {automl.best_estimator}")
print(f"Validation accuracy: {1 - automl.best_loss:.4f}")

print("\nBest hyperparameters:")
for key, value in automl.best_config.items():
    print(f"  {key}: {value}")

Step 7: Evaluate on the Test Set

# Predict
y_pred = automl.predict(X_test)
y_prob = automl.predict_proba(X_test)[:, 1]  # probability of the positive class (survived)

print("="*60)
print("Test evaluation")
print("="*60)

# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")

# Classification report
print("\nClassification report:")
print(classification_report(y_test, y_pred, target_names=['Died', 'Survived']))

Output

Test evaluation
============================================================
Accuracy: 0.8212

Classification report:
              precision    recall  f1-score   support

        Died       0.84      0.87      0.85       110
    Survived       0.79      0.75      0.77        69

    accuracy                           0.82       179
   macro avg       0.81      0.81      0.81       179
weighted avg       0.82      0.82      0.82       179
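An accuracy figure is easier to judge next to a baseline. Using only the support counts and accuracy from the report above, this short calculation shows what a model that always predicts the majority class (died) would score on the same test set.

```python
# Numbers taken from the classification report above
support_died = 110
support_survived = 69
model_accuracy = 0.8212

total = support_died + support_survived

# A classifier that always predicts the majority class gets every
# majority-class passenger right and every other passenger wrong.
baseline_accuracy = max(support_died, support_survived) / total
improvement = model_accuracy - baseline_accuracy

print(f"Test set size:      {total}")
print(f"Majority baseline:  {baseline_accuracy:.4f}")
print(f"Model accuracy:     {model_accuracy:.4f}")
print(f"Gain over baseline: {improvement:.4f}")
```

The baseline comes out at roughly 0.61, so the reported 0.82 is a gain of about 21 percentage points over always predicting "died".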

Step 8: Visualize the Results

Confusion Matrix

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Predicted died', 'Predicted survived'],
            yticklabels=['Actual died', 'Actual survived'])
plt.title('Confusion matrix')
plt.tight_layout()
plt.savefig('titanic_confusion_matrix.png', dpi=100)
plt.show()

print("\nReading the confusion matrix:")
print(f"  True Negative (died, predicted correctly): {cm[0,0]}")
print(f"  False Positive (died, predicted as survived): {cm[0,1]}")
print(f"  False Negative (survived, predicted as died): {cm[1,0]}")
print(f"  True Positive (survived, predicted correctly): {cm[1,1]}")
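The four cells of the confusion matrix are enough to recompute the precision and recall shown in the classification report. The sketch below does this on a small set of made-up labels (not the article's predictions), treating 1 (survived) as the positive class.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_pred_toy = [0, 0, 1, 0, 1, 1, 0, 1, 1, 0]

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

precision = tp / (tp + fp)  # of those predicted survived, how many did
recall = tp / (tp + fn)     # of those who survived, how many were found
accuracy_toy = (tp + tn) / (tp + tn + fp + fn)

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"precision={precision:.2f}, recall={recall:.2f}, accuracy={accuracy_toy:.2f}")
```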

Feature Importance

if automl.best_estimator in ['lgbm', 'xgboost', 'rf', 'extra_tree']:
    # automl.model is FLAML's wrapper; the fitted underlying estimator is in .estimator
    model = automl.model.estimator
    importance = model.feature_importances_

    # Plot
    plt.figure(figsize=(10, 6))
    sorted_idx = np.argsort(importance)
    plt.barh(range(len(sorted_idx)), importance[sorted_idx])
    plt.yticks(range(len(sorted_idx)), X.columns[sorted_idx])
    plt.xlabel('Feature Importance')
    plt.title('Feature importance')
    plt.tight_layout()
    plt.savefig('titanic_feature_importance.png', dpi=100)
    plt.show()
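Tree-based `feature_importances_` values are computed from the training process itself. A different, model-agnostic technique is permutation importance, which measures how much the score drops on held-out data when one column is shuffled. This is not what the walkthrough above uses. The sketch below only illustrates the idea with scikit-learn on synthetic data, where column 0 determines the label and column 1 is pure noise.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: the label depends only on column 0; column 1 is noise
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(300, 2))
y_syn = (X_syn[:, 0] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=0, stratify=y_syn
)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle each column several times on the held-out split and
# record the average drop in accuracy
result = permutation_importance(clf, X_te, y_te, n_repeats=5, random_state=0)

for name, mean in zip(["informative", "noise"], result.importances_mean):
    print(f"{name}: {mean:.3f}")
```

On this synthetic example the informative column should show a clearly larger drop than the noise column.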

Step 9: Prediction Example

# Predict for a new passenger
new_passenger = pd.DataFrame({
    'pclass': [1],       # first class
    'sex': [1],          # female
    'age': [25],         # 25 years old
    'sibsp': [0],        # 0 siblings/spouses aboard
    'parch': [0],        # 0 parents/children aboard
    'fare': [100],       # fare of 100 (same units as the dataset's fare column)
    'embarked': [1],     # embarked at Cherbourg
    'alone': [1]         # traveling alone
})

prediction = automl.predict(new_passenger)[0]
probability = automl.predict_proba(new_passenger)[0, 1]

print("New passenger prediction:")
print(f"  Predicted outcome: {'survived' if prediction == 1 else 'died'}")
print(f"  Survival probability: {probability*100:.1f}%")
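The example above types in the encoded values (1 for female, 1 for Cherbourg) by hand, which is easy to get wrong. A small helper can apply the same mappings used during preprocessing to a raw record. `encode_passenger` is a hypothetical helper written for this sketch, and deriving `alone` from `sibsp + parch == 0` is an assumption about how that column is defined.

```python
import pandas as pd

# Same mappings as in the preprocessing step
SEX_MAP = {"male": 0, "female": 1}
EMBARKED_MAP = {"S": 0, "C": 1, "Q": 2}
FEATURES = ["pclass", "sex", "age", "sibsp", "parch", "fare", "embarked", "alone"]

def encode_passenger(raw: dict) -> pd.DataFrame:
    """Turn one raw passenger record into a single-row, model-ready frame."""
    row = dict(raw)
    row["sex"] = SEX_MAP[raw["sex"]]
    row["embarked"] = EMBARKED_MAP[raw["embarked"]]
    # Assumption: "alone" means no siblings/spouses and no parents/children aboard
    row["alone"] = int(raw["sibsp"] + raw["parch"] == 0)
    return pd.DataFrame([row])[FEATURES]  # enforce the training column order

encoded = encode_passenger({
    "pclass": 1, "sex": "female", "age": 25,
    "sibsp": 0, "parch": 0, "fare": 100, "embarked": "C",
})
print(encoded)
```

The resulting frame has the same columns and values as `new_passenger`, so it could be passed to `automl.predict` in the same way.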

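Insights

Factors That Affect Survival

Sex: women had a higher survival rate ("ladies first")
Passenger class: 1st class > 2nd class > 3rd class
Age: children had a higher survival rate
Fare: passengers who paid higher fares had a higher survival rate

Summary

We built a binary classification model on the Titanic dataset.
We performed preprocessing, including missing value handling and categorical encoding.
FLAML automatically found a model with about 82% test accuracy.
We used feature importance to identify the main predictive factors.
Sex, passenger class, and age have a large effect on survival.

Next Post

The next post covers classification metrics: Accuracy, Precision, and Recall. It goes through the various metrics for evaluating classification models in detail.

FLAML AutoML Master Series #017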
