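030 Classification Project - Spam Email Classification

Keywords: spam, text classification, NLP

Overview

Spam email classification is a classic binary classification problem on text data. This post walks through the full pipeline, from text preprocessing to classification with FLAML.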

Environment

Python version: 3.11 recommended
Required packages: flaml[automl], pandas, scikit-learn, matplotlib, seaborn

pip install "flaml[automl]" pandas scikit-learn matplotlib seaborn

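Project Overview

Goal

Predict whether an email is spam by analyzing its content.

Evaluation Metrics

Precision: of the emails predicted as spam, the fraction that really are spam (guards against false positives)
Recall: of the emails that really are spam, the fraction the model catches (guards against false negatives)
F1 Score: the harmonic mean of Precision and Recall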
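To make the three metrics concrete, here is a minimal sketch that computes them from confusion-matrix counts. The helper name `precision_recall_f1` and the counts are hypothetical, chosen only for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1) from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts: 90 spam caught, 10 ham flagged by mistake, 30 spam missed
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
# precision=0.900 recall=0.750 f1=0.818
```

For a spam filter, a false positive (a legitimate email sent to the spam folder) is usually the more costly mistake, which is why precision is worth watching alongside F1.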

Step 1: Data Preparation

import pandas as pd
import numpy as np

# Build a synthetic sample spam dataset
# (a real project would use something like the UCI SMS Spam Collection)

np.random.seed(42)

# Spam email examples
spam_texts = [
    "Free money! Click here to claim your prize now!",
    "Congratulations! You've won $1000 gift card",
    "URGENT: Your account will be suspended",
    "LIMITED TIME OFFER: Buy now and get 90% off",
    "You are selected for exclusive deal",
    "Make money fast! Work from home",
    "Free iPhone giveaway! Enter now",
    "Your bank account needs verification",
    "Hot singles in your area waiting",
    "Lose weight fast with this one trick",
] * 100  # 1,000 emails

# Ham (legitimate) email examples
ham_texts = [
    "Meeting scheduled for tomorrow at 3pm",
    "Please review the attached document",
    "Thank you for your email regarding the project",
    "Can we discuss the quarterly report?",
    "Reminder: Team lunch on Friday",
    "Your order has been shipped",
    "Weekly newsletter: Tech updates",
    "Invoice for your recent purchase",
    "Project deadline extension approved",
    "Happy birthday! Best wishes from the team",
] * 400  # 4,000 emails

# Build the DataFrame
spam_df = pd.DataFrame({'text': spam_texts, 'label': 1})
ham_df = pd.DataFrame({'text': ham_texts, 'label': 0})
df = pd.concat([spam_df, ham_df], ignore_index=True)
df = df.sample(frac=1, random_state=42).reset_index(drop=True)  # shuffle

print("Dataset info:")
print(f"  Total emails: {len(df)}")
print(f"  Ham: {(df['label'] == 0).sum()} ({(df['label'] == 0).mean()*100:.1f}%)")
print(f"  Spam: {(df['label'] == 1).sum()} ({(df['label'] == 1).mean()*100:.1f}%)")

Output

Dataset info:
  Total emails: 5000
  Ham: 4000 (80.0%)
  Spam: 1000 (20.0%)
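One caveat about this synthetic data: it repeats only 20 distinct templates, so any random train/test split will place identical texts on both sides, and scores measured on it will look better than they would on real mail. The sketch below illustrates the effect with stand-in template strings and a plain shuffle (both are assumptions for illustration, not the split used later in this post).

```python
import random

# Stand-ins for the 10 spam and 10 ham templates (hypothetical strings)
spam_templates = [f"spam template {i}" for i in range(10)]
ham_templates = [f"ham template {i}" for i in range(10)]
texts = spam_templates * 100 + ham_templates * 400

# Simple 80/20 random split
rng = random.Random(42)
rng.shuffle(texts)
split = int(len(texts) * 0.8)
train, test = texts[:split], texts[split:]

# Fraction of test texts that also appear verbatim in the training set
train_set = set(train)
leak_rate = sum(t in train_set for t in test) / len(test)
print(f"Test texts also seen in training: {leak_rate:.1%}")

# Deduplicating shows how little distinct data there really is
unique_texts = sorted(set(texts))
print(f"Distinct texts: {len(unique_texts)} of {len(texts)}")
```

With a real corpus, dropping exact duplicates before splitting (for example with `df.drop_duplicates(subset='text')`) is a reasonable first guard against this kind of leakage.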

Step 2: Exploring the Text

# Look at a few samples
print("Spam examples:")
for text in df[df['label'] == 1]['text'].head(3).values:
    print(f"  - {text}")

print("\nHam examples:")
for text in df[df['label'] == 0]['text'].head(3).values:
    print(f"  - {text}")

# Text length analysis
df['text_length'] = df['text'].str.len()
df['word_count'] = df['text'].str.split().str.len()

print("\nText statistics (mean per label):")
print(df.groupby('label')[['text_length', 'word_count']].mean().round(1))

Step 3: Text Preprocessing

import re
from sklearn.feature_extraction.text import TfidfVectorizer

# Text cleaning function
def clean_text(text):
    """Lowercase the text, strip special characters, and collapse whitespace."""
    # Convert to lowercase
    text = text.lower()
    # Remove special characters (keep only letters, digits, and whitespace)
    text = re.sub(r'[^a-z0-9\s]', '', text)
    # Collapse runs of whitespace into a single space
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# Apply the cleaning step
df['clean_text'] = df['text'].apply(clean_text)

print("Preprocessing result:")
print(f"Original: {df['text'].iloc[0]}")
print(f"Cleaned:  {df['clean_text'].iloc[0]}")

Step 4: Feature Extraction (TF-IDF)

from sklearn.model_selection import train_test_split

# Split the data before vectorizing, so TF-IDF is fit on the training set only
X_train_text, X_test_text, y_train, y_test = train_test_split(
    df['clean_text'], df['label'],
    test_size=0.2, random_state=42, stratify=df['label']
)

# TF-IDF vectorization
tfidf = TfidfVectorizer(
    max_features=5000,      # maximum number of features
    min_df=2,               # ignore terms that appear in fewer than 2 documents
    max_df=0.95,            # ignore terms that appear in more than 95% of documents
    ngram_range=(1, 2),     # unigrams + bigrams
    stop_words='english'    # remove English stop words
)

# Fit on the training data, then transform both sets
X_train = tfidf.fit_transform(X_train_text)
X_test = tfidf.transform(X_test_text)

print(f"Number of TF-IDF features: {X_train.shape[1]}")
print(f"Training data: {X_train.shape}")
print(f"Test data: {X_test.shape}")

# Inspect some of the features
feature_names = tfidf.get_feature_names_out()
print(f"\nSample features: {feature_names[:10]}")
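To see what the vectorizer is computing, here is a hand-rolled sketch of TF-IDF on three toy documents. It follows the smoothed IDF and L2 row normalization that `TfidfVectorizer` applies with its default settings, but the whitespace tokenizer and the toy documents are simplifying assumptions, so treat it as an illustration rather than a reimplementation.

```python
import math
from collections import Counter

# Toy corpus (hypothetical documents)
docs = ["free money now", "free prize now", "meeting tomorrow"]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(docs)

# Document frequency: in how many documents each word appears
df_counts = Counter(w for doc in tokenized for w in set(doc))

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
idf = {w: math.log((1 + n_docs) / (1 + df_counts[w])) + 1 for w in vocab}


def tfidf_row(tokens):
    """Raw term counts times IDF, then scaled to unit L2 length."""
    counts = Counter(tokens)
    raw = [counts[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm if norm else 0.0 for v in raw]


rows = [tfidf_row(t) for t in tokenized]
doc0 = dict(zip(vocab, rows[0]))
for word, weight in doc0.items():
    if weight:
        print(f"{word}: {weight:.3f}")
```

In the first document, "money" appears in only one document of the corpus while "free" and "now" appear in two, so "money" ends up with the largest weight. That is the intuition behind TF-IDF: words that are rare across the corpus but present in a document are treated as more informative.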

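Step 5: Training with FLAML AutoML

from flaml import AutoML

# Train with FLAML (the TF-IDF output is a sparse matrix, which FLAML accepts)
automl = AutoML()
automl.fit(
    X_train, y_train,
    task="classification",
    time_budget=120,        # search time in seconds
    metric="f1",
    seed=42,
    verbose=1
)

print(f"\nBest model: {automl.best_estimator}")
# FLAML minimizes a loss; for metric="f1" the loss is 1 - F1
print(f"Validation F1: {1 - automl.best_loss:.4f}")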

Step 6: Model Evaluation

from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc
import matplotlib.pyplot as plt
import seaborn as sns

# Predict labels and spam probabilities
y_pred = automl.predict(X_test)
y_prob = automl.predict_proba(X_test)[:, 1]

# Classification report
print("Classification report:")
print(classification_report(y_test, y_pred, target_names=['Ham', 'Spam']))

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Ham', 'Spam'],
            yticklabels=['Ham', 'Spam'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Spam Classification - Confusion Matrix')
plt.tight_layout()
plt.show()

# ROC curve
fpr, tpr, _ = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f'ROC (AUC = {roc_auc:.4f})')
plt.plot([0, 1], [0, 1], 'k--')  # chance-level diagonal
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Spam Classification - ROC Curve')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
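The ROC curve summarizes every possible decision threshold at once. To see what moving the threshold does to precision and recall, here is a minimal sketch with hypothetical labels and probabilities (the helper `precision_recall_at` is illustrative; in the pipeline above you would pass `y_test` and `y_prob` instead).

```python
def precision_recall_at(y_true, y_prob, threshold):
    """Precision and recall when predicting spam for prob >= threshold."""
    tp = fp = fn = 0
    for label, prob in zip(y_true, y_prob):
        pred = 1 if prob >= threshold else 0
        if pred == 1 and label == 1:
            tp += 1
        elif pred == 1 and label == 0:
            fp += 1
        elif pred == 0 and label == 1:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical true labels (1 = spam) and predicted spam probabilities
y_true_demo = [1, 1, 1, 0, 0, 0, 0, 1]
y_prob_demo = [0.95, 0.80, 0.55, 0.60, 0.30, 0.10, 0.05, 0.40]

for threshold in (0.3, 0.5, 0.7):
    p, r = precision_recall_at(y_true_demo, y_prob_demo, threshold)
    print(f"threshold={threshold:.1f} precision={p:.3f} recall={r:.3f}")
# threshold=0.3 precision=0.667 recall=1.000
# threshold=0.5 precision=0.750 recall=0.750
# threshold=0.7 precision=1.000 recall=0.500
```

Raising the threshold trades recall for precision. If sending legitimate mail to the spam folder is the costlier error, a threshold above 0.5 may be the better operating point.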

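Step 7: Important Word Analysis

# Extract feature importances when the best model exposes them
# (tree-based models such as LightGBM do).
# automl.model is FLAML's wrapper; the fitted underlying estimator is automl.model.estimator.
best_model = automl.model.estimator
if hasattr(best_model, 'feature_importances_'):
    importance = best_model.feature_importances_

    # Top 20 most important words (fewer if the vocabulary is smaller)
    n_top = min(20, len(importance))
    top_indices = np.argsort(importance)[::-1][:n_top]
    top_words = [feature_names[i] for i in top_indices]
    top_importance = [importance[i] for i in top_indices]

    plt.figure(figsize=(12, 6))
    plt.barh(range(n_top), top_importance[::-1])
    plt.yticks(range(n_top), top_words[::-1])
    plt.xlabel('Importance')
    plt.title(f'Spam Classification - Top {n_top} Important Words')
    plt.tight_layout()
    plt.show()

    print("Most important words for spam detection:")
    for word, imp in zip(top_words[:10], top_importance[:10]):
        print(f"  {word}: {imp:.4f}")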

Step 8: Spam Keyword Analysis

# Compare word frequencies in spam vs. ham
from collections import Counter

def get_word_freq(texts):
    """Count word frequencies across a collection of texts."""
    all_words = ' '.join(texts).split()
    return Counter(all_words)

spam_words = get_word_freq(df[df['label'] == 1]['clean_text'])
ham_words = get_word_freq(df[df['label'] == 0]['clean_text'])

print("Most frequent words in spam:")
for word, count in spam_words.most_common(10):
    print(f"  {word}: {count}")

print("\nMost frequent words in ham:")
for word, count in ham_words.most_common(10):
    print(f"  {word}: {count}")

# Words characteristic of spam (frequent in spam, rare or absent in ham)
spam_ratio = {}
for word, spam_count in spam_words.most_common(100):
    ham_count = ham_words.get(word, 0)
    # +1 smoothing keeps words that never appear in ham, without dividing by zero
    spam_ratio[word] = spam_count / (ham_count + 1)

print("\nSpam-specific words (spam / (ham + 1) ratio):")
for word, ratio in sorted(spam_ratio.items(), key=lambda x: -x[1])[:10]:
    print(f"  {word}: {ratio:.2f}x")

Step 9: Classifying New Emails

def classify_email(text, model, vectorizer):
    """Classify a new email as spam or ham."""
    # Preprocess with the same cleaning used for training
    clean = clean_text(text)
    # Vectorize with the already-fitted vectorizer
    vec = vectorizer.transform([clean])
    # Predict the label and class probabilities
    pred = model.predict(vec)[0]
    prob = model.predict_proba(vec)[0]

    return {
        'text': text,
        'prediction': 'Spam' if pred == 1 else 'Ham',
        'spam_probability': prob[1],
        'confidence': max(prob)
    }

# Test emails
test_emails = [
    "Hi John, can we meet tomorrow to discuss the project?",
    "FREE MONEY! Click here now to claim your $5000 prize!",
    "Your Amazon order #12345 has been shipped",
    "URGENT: Verify your bank account immediately or it will be closed"
]

print("Classification results for new emails:")
print("-" * 70)
for email in test_emails:
    result = classify_email(email, automl, tfidf)
    print(f"Email: {email[:50]}...")
    print(f"  → {result['prediction']} (spam probability: {result['spam_probability']:.4f})")
    print()

Step 10: Saving the Pipeline

import pickle

# Save the model and vectorizer together
# Note: pickle stores clean_text by reference (module and name), so the script
# that loads this file must be able to import or define the same function.
spam_classifier = {
    'model': automl,
    'vectorizer': tfidf,
    'clean_text': clean_text
}

with open('spam_classifier.pkl', 'wb') as f:
    pickle.dump(spam_classifier, f)

print("Model saved: spam_classifier.pkl")

# Load and use (only unpickle files you trust)
with open('spam_classifier.pkl', 'rb') as f:
    loaded = pickle.load(f)

# Prediction function
def predict_spam(text):
    clean = loaded['clean_text'](text)
    vec = loaded['vectorizer'].transform([clean])
    prob = loaded['model'].predict_proba(vec)[0, 1]
    return 'Spam' if prob > 0.5 else 'Ham', prob

result, prob = predict_spam("Win a free vacation! Click now!")
print(f"\nTest: {result} (spam probability: {prob:.4f})")

Tips for Improving Performance

1. Adjusting N-grams

# Bigrams only
tfidf_bigram = TfidfVectorizer(ngram_range=(2, 2), max_features=3000)

# Unigrams through trigrams
tfidf_trigram = TfidfVectorizer(ngram_range=(1, 3), max_features=5000)
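As a quick illustration of what `ngram_range` controls, the sketch below generates word n-grams by hand. The helper `word_ngrams` is hypothetical and uses a plain whitespace split, which is simpler than the vectorizer's own tokenizer.

```python
def word_ngrams(text, n_min, n_max):
    """Return all word n-grams of length n_min..n_max, shortest first."""
    tokens = text.lower().split()
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


print(word_ngrams("free money now", 1, 2))
# ['free', 'money', 'now', 'free money', 'money now']
print(word_ngrams("free money now", 2, 2))
# ['free money', 'money now']
```

Bigrams let the model tell "free money" apart from "free" and "money" appearing separately, at the cost of a larger and sparser feature space.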

2. Adding Custom Features

def extract_features(text):
    """Extract hand-crafted features from the raw text."""
    # Call this on the raw text: clean_text() strips '$', '!' and uppercase letters
    return {
        'has_url': 1 if 'http' in text.lower() or 'www' in text.lower() else 0,
        'has_money': 1 if '$' in text or 'money' in text.lower() else 0,
        'exclamation_count': text.count('!'),
        'uppercase_ratio': sum(1 for c in text if c.isupper()) / len(text) if text else 0,
        'digit_ratio': sum(1 for c in text if c.isdigit()) / len(text) if text else 0,
    }
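The function above only produces the extra features; they still have to be joined to the TF-IDF matrix before training. One possible way, sketched below under the assumption that SciPy is available (it is installed with scikit-learn), is to stack the extra columns onto the sparse matrix with `scipy.sparse.hstack`. The demo texts and the names `tfidf_demo` and `X_combined` are illustrative, and `extract_features` is repeated so the block runs on its own.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer


def extract_features(text):
    """Same hand-crafted features as above, repeated to keep this block self-contained."""
    return {
        'has_url': 1 if 'http' in text.lower() or 'www' in text.lower() else 0,
        'has_money': 1 if '$' in text or 'money' in text.lower() else 0,
        'exclamation_count': text.count('!'),
        'uppercase_ratio': sum(1 for c in text if c.isupper()) / len(text) if text else 0,
        'digit_ratio': sum(1 for c in text if c.isdigit()) / len(text) if text else 0,
    }


# Hypothetical raw emails
raw_texts = [
    "FREE MONEY! Click http://example.com now!",
    "Meeting scheduled for tomorrow at 3pm",
    "You've won $1000! Claim your prize",
    "Please review the attached document",
]

# Text features from TF-IDF (sparse)
tfidf_demo = TfidfVectorizer()
X_text = tfidf_demo.fit_transform(raw_texts)

# Hand-crafted features as a dense array, one row per email
extra = np.array([list(extract_features(t).values()) for t in raw_texts], dtype=float)

# Stack the columns side by side and keep the result sparse
X_combined = hstack([X_text, csr_matrix(extra)]).tocsr()
print(X_text.shape, extra.shape, X_combined.shape)
```

The combined matrix can be passed to `automl.fit` in place of `X_train`. The same stacking has to be applied to the test set and to any new email at prediction time, using the already-fitted vectorizer.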

3. A Longer Time Budget

# Give the search more time (in seconds)
automl.fit(X_train, y_train, task="classification", time_budget=300, metric="f1", seed=42)

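Summary

Text classification follows a preprocessing → vectorization → training pipeline.
TF-IDF converts text into numeric features.
FLAML also accepts sparse matrices as input.
Feature importances reveal which keywords matter most for spam detection.
Saving the pipeline makes the whole classification system reusable.
Words such as "free", "money", and "urgent" are typical spam keywords.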

Next Post

The next post covers feature importance analysis: how to examine which features a model considers important.

FLAML AutoML Master Series #030
