073 Understanding Time Series Forecasting

Keywords: time series

Overview

Time series forecasting predicts future values by analyzing patterns in past data. It is used for data that changes over time, such as sales, stock prices, demand, and weather.

Environment

Python version: 3.11 recommended
Required packages: pycaret[full]>=3.0

What Is a Time Series?

Time series data:
Data recorded in chronological order

t₁ → t₂ → t₃ → t₄ → t₅ → ... → tₙ → ?
y₁    y₂    y₃    y₄    y₅         yₙ    ?

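Characteristics:
- Order matters (the data cannot be shuffled)
- Temporal dependence (the past influences the future)
- Autocorrelation (values are correlated with their own earlier values)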
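
To make the first and third characteristics concrete, the sketch below compares the lag-1 autocorrelation of an ordered series with the same values after shuffling. The helper `lag_autocorr` and the synthetic series are hypothetical and exist only for this illustration.

```python
import numpy as np

def lag_autocorr(y, lag):
    """Sample autocorrelation of y at the given lag (hypothetical helper)."""
    y = np.asarray(y, dtype=float)
    centered = y - y.mean()
    # Covariance between the series and its lagged copy, normalized by the variance
    return float(np.sum(centered[lag:] * centered[:-lag]) / np.sum(centered ** 2))

rng = np.random.default_rng(0)
trending = np.arange(200) + rng.normal(0, 5, 200)  # upward trend plus noise
shuffled = rng.permutation(trending)               # same values, time order destroyed

r_ordered = lag_autocorr(trending, 1)
r_shuffled = lag_autocorr(shuffled, 1)
print(f"lag-1 autocorrelation (ordered):  {r_ordered:.3f}")
print(f"lag-1 autocorrelation (shuffled): {r_shuffled:.3f}")
```

The ordered series should show a lag-1 autocorrelation close to 1, while the shuffled copy should be close to 0, which is why shuffling a time series discards the information a forecasting model relies on.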

Example Time Series Data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Generate the time series data
np.random.seed(42)
dates = pd.date_range('2020-01-01', periods=365*3, freq='D')  # 1,095 daily points (about 3 years)

# Components
trend = np.linspace(100, 200, len(dates))  # upward trend
seasonality = 30 * np.sin(2 * np.pi * np.arange(len(dates)) / 365)  # yearly seasonality
weekly = 10 * np.sin(2 * np.pi * np.arange(len(dates)) / 7)  # weekly pattern
noise = np.random.normal(0, 10, len(dates))  # noise

# Combine the components additively
values = trend + seasonality + weekly + noise

# Build the DataFrame with the date as its index
df = pd.DataFrame({
    'date': dates,
    'value': values
})
df.set_index('date', inplace=True)

print(f"Date range: {df.index.min()} ~ {df.index.max()}")
print(f"Number of observations: {len(df)}")
print(df.head())

Components of a Time Series

import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

# Additive decomposition with a yearly period (365 daily observations per cycle)
decomposition = seasonal_decompose(df['value'], model='additive', period=365)

fig, axes = plt.subplots(4, 1, figsize=(14, 12))

axes[0].plot(df.index, df['value'])
axes[0].set_title('Original')
axes[0].set_ylabel('Value')

axes[1].plot(df.index, decomposition.trend)
axes[1].set_title('Trend')
axes[1].set_ylabel('Value')

axes[2].plot(df.index, decomposition.seasonal)
axes[2].set_title('Seasonality')
axes[2].set_ylabel('Value')

axes[3].plot(df.index, decomposition.resid)
axes[3].set_title('Residual')
axes[3].set_ylabel('Value')

plt.tight_layout()
plt.savefig('time_series_decomposition.png', dpi=150)

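Component Descriptions

1. Trend
   - Long-term upward or downward tendency
   - Linear or nonlinear

2. Seasonality
   - Pattern that repeats at a fixed period
   - Daily, weekly, monthly, yearly

3. Cycle
   - Fluctuations without a fixed period
   - Business cycles, for example

4. Residual
   - Variation the other components do not explain
   - Noise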

Stationarity

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller

# Stationarity check (Augmented Dickey-Fuller test)
def check_stationarity(series, name):
    # ADF null hypothesis: the series has a unit root (is non-stationary)
    result = adfuller(series.dropna())
    print(f"\n{name}")
    print(f"ADF Statistic: {result[0]:.4f}")
    print(f"p-value: {result[1]:.4f}")
    # Rejecting the null at the 5% level is treated as evidence of stationarity
    print("Stationary:", "Yes" if result[1] < 0.05 else "No")
    return result[1] < 0.05

# Original series
is_stationary = check_stationarity(df['value'], "Original Series")

# First difference
df['diff1'] = df['value'].diff()
is_stationary_diff1 = check_stationarity(df['diff1'], "First Difference")

# Visualization
fig, axes = plt.subplots(2, 1, figsize=(14, 8))

axes[0].plot(df.index, df['value'])
axes[0].set_title('Original (Non-stationary)')

axes[1].plot(df.index, df['diff1'])
axes[1].set_title('First Difference (Stationary)')

plt.tight_layout()
plt.savefig('stationarity.png', dpi=150)

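Why Stationarity Matters

Stationary time series:
- Mean and variance are constant over time, and autocovariance depends only on the lag
- Assumed by many statistical models
- Makes forecasts more stable

Non-stationary → stationary transformations:
- Differencing
- Log transformation
- Seasonal differencing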
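
The three transformations listed above can be sketched with plain NumPy. The series below is a hypothetical one built for this illustration (exponential growth multiplied by a 12-step seasonal pattern); it is not the `df` used elsewhere in this post.

```python
import numpy as np

t = np.arange(120)
# Hypothetical series: exponential growth times a seasonal pattern of period 12
y = np.exp(0.02 * t) * (1 + 0.2 * np.sin(2 * np.pi * t / 12))

# Log transform: turns the multiplicative structure into an additive one
log_y = np.log(y)

# First difference: removes the linear trend that remains in log_y
diff1 = np.diff(log_y)

# Seasonal difference at lag 12: removes the repeating seasonal pattern
seasonal_diff = log_y[12:] - log_y[:-12]

print(f"first difference length:    {len(diff1)}")
print(f"seasonal difference length: {len(seasonal_diff)}")
print(f"seasonal difference range:  {seasonal_diff.min():.4f} to {seasonal_diff.max():.4f}")
```

In this constructed case the seasonal difference of the logged series collapses to a constant (the growth accumulated over 12 steps), so both trend and seasonality are gone. Real data will rarely be this clean, and each differencing step shortens the series.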

Autocorrelation

from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# ACF (autocorrelation function)
plot_acf(df['value'].dropna(), lags=50, ax=axes[0])
axes[0].set_title('Autocorrelation Function (ACF)')

# PACF (partial autocorrelation function)
plot_pacf(df['value'].dropna(), lags=50, ax=axes[1])
axes[1].set_title('Partial Autocorrelation Function (PACF)')

plt.tight_layout()
plt.savefig('acf_pacf.png', dpi=150)

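Interpreting ACF and PACF

ACF (Autocorrelation Function):
- Correlation between the series and itself at lag k
- Includes indirect effects carried through intermediate lags

PACF (Partial ACF):
- Correlation at lag k after removing the effect of intermediate lags
- Direct effect only

Model selection:
- ACF decays exponentially, PACF cuts off after lag p → AR(p)
- ACF cuts off after lag q, PACF decays exponentially → MA(q)
- Both decay exponentially → ARMA(p, q)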
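
As a rough check of the AR(p) rule above, the sketch below simulates an AR(1) process and estimates its ACF and PACF with NumPy only. The helpers `acf_at` and `pacf_at` are hypothetical and hand-rolled for this illustration (the PACF is estimated by regressing on the previous k lags); they are not the statsmodels implementation, and the coefficient 0.7 is an arbitrary choice.

```python
import numpy as np

def acf_at(y, lag):
    """Sample autocorrelation at one lag (hypothetical helper)."""
    centered = y - y.mean()
    return float(np.sum(centered[lag:] * centered[:-lag]) / np.sum(centered ** 2))

def pacf_at(y, k):
    """PACF at lag k: the last coefficient of a least-squares fit on lags 1..k."""
    centered = y - y.mean()
    n = len(centered)
    # Column j holds the series shifted back by j steps, aligned with centered[k:]
    X = np.column_stack([centered[k - j:n - j] for j in range(1, k + 1)])
    coef, *_ = np.linalg.lstsq(X, centered[k:], rcond=None)
    return float(coef[-1])

# Simulate AR(1): y_t = 0.7 * y_{t-1} + noise
rng = np.random.default_rng(42)
phi, n = 0.7, 2000
y = np.zeros(n)
for t in range(1, n):
    y[t] = phi * y[t - 1] + rng.normal()

for lag in (1, 2, 3):
    print(f"lag {lag}: ACF={acf_at(y, lag):.3f}  PACF={pacf_at(y, lag):.3f}")
```

The ACF should shrink gradually across lags, while the PACF should be near 0.7 at lag 1 and near zero afterwards, which is the AR(1) signature described in the list.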

Time Series Forecasting Approaches

1. Statistical Methods

ARIMA, SARIMA, ETS, etc.

Characteristics:
- Grounded in mathematical models
- Easy to interpret
- Workable with relatively little data
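
To give a feel for how small a statistical forecaster can be, here is a hand-rolled sketch of simple exponential smoothing, the most basic member of the exponential smoothing family. The function name, the `alpha` value, and the sample history are assumptions made for this illustration; this is not the ETS implementation of statsmodels or PyCaret.

```python
def simple_exp_smoothing(values, alpha):
    """Return the one-step-ahead forecast after smoothing all of `values`."""
    level = values[0]  # initialize the level with the first observation
    for y in values[1:]:
        # New level = weighted average of the latest observation and the old level
        level = alpha * y + (1 - alpha) * level
    return level

history = [10.0, 12.0, 11.0, 13.0]  # hypothetical observations
forecast = simple_exp_smoothing(history, alpha=0.5)
print(forecast)  # 12.0
```

A larger `alpha` makes the forecast react faster to recent observations; a smaller one smooths more heavily. The method has a single parameter and its state is one number, which is part of why such models are easy to interpret and usable on short series.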

2. Machine Learning Methods

Random Forest, XGBoost, LightGBM, etc.

Characteristics:
- Learn nonlinear patterns
- Make it easy to use external variables
- Require feature engineering

3. Deep Learning Methods

LSTM, Transformer, etc.

Characteristics:
- Learn complex patterns
- Require large amounts of data
- Can handle long sequences

A Simple Forecasting Example

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Feature engineering: calendar features from the date index
df_features = df.copy()
df_features['year'] = df_features.index.year
df_features['month'] = df_features.index.month
df_features['day'] = df_features.index.day
df_features['dayofweek'] = df_features.index.dayofweek
df_features['dayofyear'] = df_features.index.dayofyear

# Lag features: the value observed `lag` days earlier
for lag in [1, 7, 14, 30, 365]:
    df_features[f'lag_{lag}'] = df_features['value'].shift(lag)

# Rolling means, shifted by one day so the current value is never included
df_features['rolling_7'] = df_features['value'].shift(1).rolling(7).mean()
df_features['rolling_30'] = df_features['value'].shift(1).rolling(30).mean()

# Drop the rows left incomplete by the lags and rolling windows
df_features = df_features.dropna()

# Train/test split (time order preserved, no shuffling)
train_size = int(len(df_features) * 0.8)
train = df_features[:train_size]
test = df_features[train_size:]

feature_cols = ['year', 'month', 'day', 'dayofweek', 'dayofyear',
                'lag_1', 'lag_7', 'lag_14', 'lag_30', 'lag_365',
                'rolling_7', 'rolling_30']

X_train = train[feature_cols]
y_train = train['value']
X_test = test[feature_cols]
y_test = test['value']

# Train the model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predict (one step ahead, because lag_1 uses the previous day's actual value)
y_pred = model.predict(X_test)

# Evaluate
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mape = np.mean(np.abs((y_test - y_pred) / y_test)) * 100

print(f"MAE: {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"MAPE: {mape:.2f}%")

# Visualization

plt.figure(figsize=(14, 6))
plt.plot(test.index, y_test, label='Actual', alpha=0.7)
plt.plot(test.index, y_pred, label='Predicted', alpha=0.7)
plt.xlabel('Date')
plt.ylabel('Value')
plt.title('Time Series Prediction')
plt.legend()
plt.grid(True, alpha=0.3)
plt.savefig('simple_forecast.png', dpi=150)

Forecast Evaluation Metrics

import numpy as np

def evaluate_forecast(y_true, y_pred):
    """Evaluation metrics for time series forecasts."""

    # MAE (Mean Absolute Error)
    mae = np.mean(np.abs(y_true - y_pred))

    # RMSE (Root Mean Squared Error)
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

    # MAPE (Mean Absolute Percentage Error)
    mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100

    # SMAPE (Symmetric MAPE)
    smape = np.mean(2 * np.abs(y_true - y_pred) / (np.abs(y_true) + np.abs(y_pred))) * 100

    # R² Score
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    r2 = 1 - (ss_res / ss_tot)

    return {
        'MAE': mae,
        'RMSE': rmse,
        'MAPE': mape,
        'SMAPE': smape,
        'R2': r2
    }

metrics = evaluate_forecast(y_test.values, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")

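Characteristics of Each Metric

MAE:
- Easy to interpret (same unit as the original data)
- Less sensitive to outliers

RMSE:
- Penalizes large errors more heavily
- Widely used as an optimization target

MAPE:
- Ratio-based (expressed in %)
- Unstable when actual values are close to 0

SMAPE:
- Improved variant of MAPE
- Symmetric

R²:
- Explanatory power (at most 1; negative when the model does worse than predicting the mean)
- Closer to 1 is better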
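
Two of these properties are easy to see with small hand-made arrays. The numbers below are hypothetical and chosen only for this illustration: the first pair shows how one actual value near zero inflates MAPE while MAE stays small, and the second pair shows an R² below zero.

```python
import numpy as np

# Every forecast is off by about 1.0, but one actual value is close to zero
y_true = np.array([0.1, 100.0, 100.0, 100.0])
y_pred = np.array([1.1, 101.0, 99.0, 101.0])

mae = np.mean(np.abs(y_true - y_pred))
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
print(f"MAE:  {mae:.2f}")
print(f"MAPE: {mape:.2f}%")

# Forecasts that move in the opposite direction of the actuals
y_true2 = np.array([1.0, 2.0, 3.0])
y_pred2 = np.array([3.0, 2.0, 1.0])

ss_res = np.sum((y_true2 - y_pred2) ** 2)            # residual sum of squares
ss_tot = np.sum((y_true2 - np.mean(y_true2)) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot
print(f"R2:   {r2:.2f}")
```

The MAE stays at about 1 while the MAPE exceeds 200%, driven almost entirely by the single near-zero actual. The second case gives an R² of -3, so an R² outside the 0 to 1 range signals a model that performs worse than a constant mean forecast.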

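Challenges in Time Series Forecasting

1. Data leakage
   - Future information ends up being used in training
   - Requires cross-validation that respects time order

2. Concept drift
   - Patterns change over time
   - The model needs periodic updates

3. External factors
   - Holidays, events, economic conditions
   - Exogenous variables need to be considered

4. Multiple seasonality
   - Daily + weekly + yearly
   - Requires modeling combined patterns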
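
For the first challenge, cross-validation that respects time order can be sketched as an expanding-window split: each fold trains only on observations that come before its test window. The generator below is a hypothetical, hand-rolled helper written for this illustration; scikit-learn's `TimeSeriesSplit` offers comparable behavior.

```python
def expanding_window_splits(n_samples, n_splits, test_size):
    """Yield (train_indices, test_indices) pairs that never train on the future."""
    for i in range(n_splits):
        # Later folds end later; the last fold ends at the final observation
        test_end = n_samples - (n_splits - 1 - i) * test_size
        test_start = test_end - test_size
        yield list(range(0, test_start)), list(range(test_start, test_end))

splits = list(expanding_window_splits(n_samples=10, n_splits=3, test_size=2))
for train_idx, test_idx in splits:
    print(f"train={train_idx}  test={test_idx}")
```

Every training index is smaller than every test index in the same fold, so no fold can learn from observations that follow its test window. An ordinary shuffled k-fold split gives no such guarantee.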

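Summary

- A time series is data recorded in chronological order
- It can be decomposed into trend, seasonality, and residual
- Checking stationarity is essential
- Split the data while preserving time order
- Evaluate with MAE, RMSE, MAPE, and similar metrics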

Next Post Preview

The next post covers getting started with the PyCaret time series module.

PyCaret Machine Learning Master Series #073
