076 Multivariate Time Series Forecasting

Keywords: multivariate, exogenous

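Overview

Multivariate time series forecasting, as covered here, uses additional exogenous variables alongside the forecast target to improve accuracy. This lets the model account for external factors such as weather, promotions, and economic indicators.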

Environment

Python version: 3.11 recommended
Required package: pycaret[full]>=3.0

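What Are Exogenous Variables?

Exogenous variables:
- External factors that influence the forecast target
- Their values over the forecast horizon must be known (or estimated) at forecast time

Examples:
- Sales forecasting: promotions, price, holidays
- Electricity demand forecasting: temperature, humidity
- Traffic forecasting: weather, events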

y(t+1) = f(y(t), y(t-1), ..., X₁(t+1), X₂(t+1), ...)
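The relation above can be made concrete as a feature table in which each row pairs past target values with the exogenous values for that row's own date. The following is a minimal sketch on toy data; the helper name `make_supervised` and the toy numbers are assumptions for illustration and are not part of PyCaret.

```python
import pandas as pd

def make_supervised(df, target, lags=(1, 2)):
    """Build a table of lagged target values plus same-day exogenous values.

    Hypothetical helper for illustration; not part of PyCaret.
    """
    out = df.copy()
    for lag in lags:
        # Past target values relative to the row being predicted
        out[f"{target}_lag_{lag}"] = out[target].shift(lag)
    # Rows without a full lag history cannot be used for training
    return out.dropna()

toy = pd.DataFrame(
    {"sales": [100.0, 110.0, 120.0, 150.0, 125.0],
     "promotion": [0, 0, 0, 1, 0]},
    index=pd.date_range("2020-01-01", periods=5, freq="D"),
)
table = make_supervised(toy, target="sales")
print(table)
```

Each remaining row holds the value to predict (`sales`), the exogenous value for the same day (`promotion`), and the two preceding sales values, which is the shape of input that f receives.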

Data Preparation

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)
dates = pd.date_range('2020-01-01', periods=365*2, freq='D')

# Target: daily sales
base = 1000
trend = np.linspace(0, 200, len(dates))
seasonality = 100 * np.sin(2 * np.pi * np.arange(len(dates)) / 365)
weekly = 50 * np.sin(2 * np.pi * np.arange(len(dates)) / 7)

# Exogenous variables
temperature = 20 + 10 * np.sin(2 * np.pi * np.arange(len(dates)) / 365) + np.random.normal(0, 3, len(dates))
promotion = np.random.binomial(1, 0.1, len(dates))  # promotion on about 10% of days
holiday = np.random.binomial(1, 0.03, len(dates))  # holiday on about 3% of days
price = 50 + np.random.normal(0, 5, len(dates))

# Effects of the exogenous variables on sales
temp_effect = 2 * (temperature - 20)  # temperature effect
promo_effect = 200 * promotion  # promotion effect
holiday_effect = 150 * holiday  # holiday effect
price_effect = -5 * (price - 50)  # price effect

noise = np.random.normal(0, 50, len(dates))

sales = base + trend + seasonality + weekly + temp_effect + promo_effect + holiday_effect + price_effect + noise

# Build the DataFrame
data = pd.DataFrame({
    'date': dates,
    'sales': sales,
    'temperature': temperature,
    'promotion': promotion,
    'holiday': holiday,
    'price': price
})
data.set_index('date', inplace=True)

print(f"Data period: {data.index.min()} ~ {data.index.max()}")
print(data.head())
print(f"\nColumns: {data.columns.tolist()}")

Multivariate Forecasting with PyCaret

from pycaret.time_series import *

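# Environment setup. The PyCaret 3.x time series setup() takes no
# exogenous_features argument: every non-target column in data is
# treated as an exogenous variable (use ignore_features to exclude columns).
ts = setup(
    data=data,
    target='sales',
    fh=30,
    fold=3,
    session_id=42,
    verbose=False
)

# Compare models. With exogenous columns present, the default
# enforce_exogenous=True loads only models that support them.
best = compare_models()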

Models That Support Exogenous Variables

from pycaret.time_series import *

# All non-target columns are used as exogenous variables
# (the PyCaret 3.x setup() takes no exogenous_features argument)
ts = setup(
    data=data,
    target='sales',
    fh=30,
    session_id=42,
    verbose=False
)

# ARIMA with exogenous variables (ARIMAX)
arimax = create_model('auto_arima')

# Prophet with regressors
prophet = create_model('prophet')

# Machine learning models (exogenous variables included automatically)
lightgbm = create_model('lightgbm_cds_dt')
rf = create_model('rf_cds_dt')
xgboost = create_model('xgboost_cds_dt')

Preparing Future Exogenous Variables

import pandas as pd
import numpy as np

# Exogenous values for the forecast period
# (in practice: known values or forecasts).
# The data ends on 2021-12-30 (2020 is a leap year), so the
# 30-day forecast window starts on the following day.
future_dates = pd.date_range(data.index.max() + pd.Timedelta(days=1), periods=30, freq='D')

# Future exogenous variables
future_exog = pd.DataFrame({
    'date': future_dates,
    'temperature': 20 + 10 * np.sin(2 * np.pi * np.arange(30) / 365),  # expected temperature
    'promotion': [1, 0, 0, 0, 0, 0, 0, 1, 0, 0,
                  0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
                  0, 0, 1, 0, 0, 0, 0, 0, 0, 0],  # planned promotions
    'holiday': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
                0, 0, 0, 0, 0, 0, 0, 0, 0, 0],  # known holidays
    'price': [50] * 30  # planned price
})
future_exog.set_index('date', inplace=True)

print("Future exogenous variables:")
print(future_exog.head(10))
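Before passing a future exogenous frame to a model, it can help to check that it has the same exogenous columns as the training data, one row per forecast step, no missing values, and an index that starts the day after the training data ends. The sketch below assumes daily data; `check_future_exog` is a hypothetical helper, not a PyCaret function.

```python
import pandas as pd

def check_future_exog(future_exog, train, target, fh):
    """Return a list of problems found in a future exogenous frame (empty if none).

    Hypothetical helper for illustration; assumes a daily DatetimeIndex.
    """
    problems = []
    expected_cols = [c for c in train.columns if c != target]
    missing = [c for c in expected_cols if c not in future_exog.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    if len(future_exog) != fh:
        problems.append(f"expected {fh} rows, got {len(future_exog)}")
    if future_exog.isna().any().any():
        problems.append("contains missing values")
    # The first future timestamp should directly follow the last training one
    expected_start = train.index.max() + pd.Timedelta(days=1)
    if len(future_exog) and future_exog.index.min() != expected_start:
        problems.append(f"index should start at {expected_start.date()}")
    return problems

toy_train = pd.DataFrame(
    {"sales": [1.0, 2.0, 3.0], "price": [50, 50, 49]},
    index=pd.date_range("2021-12-28", periods=3, freq="D"),
)
good = pd.DataFrame({"price": [50, 50]},
                    index=pd.date_range("2021-12-31", periods=2, freq="D"))
late = pd.DataFrame({"price": [50, 50]},
                    index=pd.date_range("2022-01-01", periods=2, freq="D"))

print(check_future_exog(good, toy_train, "sales", fh=2))
print(check_future_exog(late, toy_train, "sales", fh=2))
```

With the data from this post, the call would be `check_future_exog(future_exog, data, 'sales', fh=30)`.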

Running the Forecast

from pycaret.time_series import *

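# All non-target columns are used as exogenous variables
# (the PyCaret 3.x setup() takes no exogenous_features argument)
ts = setup(
    data=data,
    target='sales',
    fh=30,
    session_id=42,
    verbose=False
)

model = create_model('lightgbm_cds_dt')

# Forecast the hold-out period (the last 30 days of data).
# The exogenous values for those days are already in data,
# so no X argument is needed here.
predictions = predict_model(model)
print(predictions)

# Forecast beyond the end of data: refit on all rows, then pass
# the future exogenous values prepared earlier through X.
final_model = finalize_model(model)
future_predictions = predict_model(final_model, X=future_exog)
print(future_predictions)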

Analyzing the Influence of Exogenous Variables

from pycaret.time_series import *
import matplotlib.pyplot as plt
import numpy as np

# All non-target columns are used as exogenous variables
# (the PyCaret 3.x setup() takes no exogenous_features argument)
ts = setup(
    data=data,
    target='sales',
    fh=30,
    session_id=42,
    verbose=False
)

# LightGBM model
lgbm = create_model('lightgbm_cds_dt')

# Feature importance plot (availability depends on the module and model)
try:
    plot_model(lgbm, plot='feature')
except Exception:
    print("The feature importance plot is supported only for some models.")

Promotion Effect Simulation

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor

# Manual simulation
# Prepare the training data
data_ml = data.copy()
data_ml['dayofweek'] = data_ml.index.dayofweek
data_ml['dayofyear'] = data_ml.index.dayofyear
data_ml['month'] = data_ml.index.month

for lag in [1, 7, 14]:
    data_ml[f'sales_lag_{lag}'] = data_ml['sales'].shift(lag)

data_ml = data_ml.dropna()

feature_cols = ['temperature', 'promotion', 'holiday', 'price',
                'dayofweek', 'dayofyear', 'month',
                'sales_lag_1', 'sales_lag_7', 'sales_lag_14']

X = data_ml[feature_cols]
y = data_ml['sales']

# Train the model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X, y)

# Scenario comparison: with vs. without promotion
last_data = data_ml.iloc[-1:].copy()

# Without promotion
scenario_no_promo = last_data[feature_cols].copy()
scenario_no_promo['promotion'] = 0

# With promotion
scenario_promo = last_data[feature_cols].copy()
scenario_promo['promotion'] = 1

pred_no_promo = model.predict(scenario_no_promo)[0]
pred_promo = model.predict(scenario_promo)[0]

# Signed format so a negative effect is not printed as "+-"
promo_lift = pred_promo - pred_no_promo
print(f"Forecast without promotion: {pred_no_promo:.0f}")
print(f"Forecast with promotion: {pred_promo:.0f}")
print(f"Promotion effect: {promo_lift:+.0f} ({promo_lift / pred_no_promo * 100:+.1f}%)")
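The comparison above uses a single row, so the estimate depends on that one day's conditions. A steadier alternative is to apply the same on/off comparison to every row and average the difference. The sketch below swaps in ordinary least squares with NumPy instead of the random forest to stay self-contained, and regenerates a small synthetic sample with the same promotion (+200) and price (-5) effects. All names prefixed with `demo_` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
demo_promo = rng.binomial(1, 0.1, n)
demo_price = 50 + rng.normal(0, 5, n)
# Same effect sizes as the data generation step: +200 per promotion, -5 per price unit
demo_sales = 1000 + 200 * demo_promo - 5 * (demo_price - 50) + rng.normal(0, 50, n)

# Fit sales ~ intercept + promotion + price by ordinary least squares
A = np.column_stack([np.ones(n), demo_promo, demo_price])
demo_coef, *_ = np.linalg.lstsq(A, demo_sales, rcond=None)

def demo_predict(features):
    """Linear prediction from the fitted coefficients."""
    return features @ demo_coef

# Counterfactual copies of every row: promotion forced off, then forced on
A_off, A_on = A.copy(), A.copy()
A_off[:, 1] = 0
A_on[:, 1] = 1
avg_effect = float(np.mean(demo_predict(A_on) - demo_predict(A_off)))
print(f"Average promotion effect: {avg_effect:+.1f}")
```

For a linear model the averaged difference equals the promotion coefficient. With a tree-based model such as the random forest above, the same loop over all rows gives an average that also reflects interactions with the other features.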

What-If Analysis

import numpy as np
import matplotlib.pyplot as plt

# Predicted sales as a function of price
prices = np.arange(40, 65, 1)
predictions_by_price = []

base_scenario = last_data[feature_cols].copy()

for p in prices:
    scenario = base_scenario.copy()
    scenario['price'] = p
    pred = model.predict(scenario)[0]
    predictions_by_price.append(pred)

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Sales vs. price
axes[0].plot(prices, predictions_by_price, 'b-', linewidth=2)
axes[0].set_xlabel('Price')
axes[0].set_ylabel('Predicted Sales')
axes[0].set_title('Sales vs Price')
axes[0].grid(True, alpha=0.3)

# Revenue vs. price
revenue = prices * np.array(predictions_by_price)
axes[1].plot(prices, revenue, 'g-', linewidth=2)
axes[1].axvline(x=prices[np.argmax(revenue)], color='red', linestyle='--',
                label=f'Optimal Price: ${prices[np.argmax(revenue)]}')
axes[1].set_xlabel('Price')
axes[1].set_ylabel('Revenue')
axes[1].set_title('Revenue vs Price')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('whatif_analysis.png', dpi=150)

print(f"Revenue-maximizing price: ${prices[np.argmax(revenue)]}")
print(f"Expected maximum revenue: ${max(revenue):,.0f}")
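A grid search can only report an optimum inside the grid, so it is worth checking where the revenue peak of the underlying demand curve lies. The sketch below uses only the base level (1000) and the price slope (-5 per unit above 50) from the data generation step, and ignores trend, seasonality, and the other effects. That simplification is an assumption made for illustration.

```python
import numpy as np

BASE_SALES = 1000   # base level from the data generation step
PRICE_SLOPE = -5    # change in sales per unit of price above the reference
REF_PRICE = 50

def demand(price):
    """Simplified demand curve: base level plus the linear price effect only."""
    return BASE_SALES + PRICE_SLOPE * (price - REF_PRICE)

grid = np.arange(40, 65, 1)
grid_revenue = grid * demand(grid)
best_on_grid = int(grid[np.argmax(grid_revenue)])

# Revenue p * (a + b * (p - p0)) peaks where its derivative is zero:
# a + 2 * b * p - b * p0 = 0  =>  p* = (b * p0 - a) / (2 * b)
analytic_optimum = (PRICE_SLOPE * REF_PRICE - BASE_SALES) / (2 * PRICE_SLOPE)

print(f"Best price on grid: {best_on_grid}")
print(f"Analytic optimum:   {analytic_optimum}")
```

Under these assumptions the analytic peak lies well above the grid, so revenue rises across the whole 40 to 64 range. If the model-based search lands on or near the grid boundary, the search range is deciding the answer, not the demand curve. A tree-based model also cannot extrapolate beyond the prices seen in training, so widening the grid alone would not help.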

Scenario Forecasting

import pandas as pd
import numpy as np

# Compare several scenarios
scenarios = {
    'Base': {'promotion': 0, 'price': 50, 'holiday': 0},
    'Promotion': {'promotion': 1, 'price': 50, 'holiday': 0},
    'Price Cut': {'promotion': 0, 'price': 45, 'holiday': 0},
    'Promo + Price Cut': {'promotion': 1, 'price': 45, 'holiday': 0},
    'Holiday + Promo': {'promotion': 1, 'price': 50, 'holiday': 1}
}

results = []
for name, params in scenarios.items():
    scenario = last_data[feature_cols].copy()
    for key, value in params.items():
        scenario[key] = value

    pred = model.predict(scenario)[0]
    revenue = pred * params['price']

    results.append({
        'Scenario': name,
        'Promotion': params['promotion'],
        'Price': params['price'],
        'Holiday': params['holiday'],
        'Predicted Sales': pred,
        'Revenue': revenue
    })

df_scenarios = pd.DataFrame(results)
print("\nScenario comparison:")
print(df_scenarios.round(0))

Selecting Exogenous Variables

import pandas as pd
import matplotlib.pyplot as plt

# Feature importance of the random forest trained above
feature_importance = pd.DataFrame({
    'Feature': feature_cols,
    'Importance': model.feature_importances_
}).sort_values('Importance', ascending=True)

plt.figure(figsize=(10, 6))
plt.barh(feature_importance['Feature'], feature_importance['Importance'])
plt.xlabel('Importance')
plt.title('Feature Importance')
plt.tight_layout()
plt.savefig('feature_importance.png', dpi=150)

print("\nFeature importance:")
print(feature_importance.sort_values('Importance', ascending=False))

Using Forecasted Exogenous Values

import pandas as pd
import numpy as np

# When the exogenous variables themselves must be forecast
# Example: use a weather forecast for temperature and planned values for price

def prepare_future_exogenous(forecast_horizon, planned_promotions, planned_prices):
    """Prepare future exogenous variables."""

    # Start on the day after the last observation in data (2021-12-30)
    start = data.index.max() + pd.Timedelta(days=1)
    future_dates = pd.date_range(start, periods=forecast_horizon, freq='D')

    # Temperature: average historical pattern (or a weather forecast API)
    dayofyear = future_dates.dayofyear
    temp_pattern = 20 + 10 * np.sin(2 * np.pi * dayofyear / 365)

    # Holidays: taken from a calendar (illustrative dates)
    known_holidays = ['2022-01-01', '2022-01-17']
    holidays = [1 if str(d.date()) in known_holidays else 0 for d in future_dates]

    future_exog = pd.DataFrame({
        'date': future_dates,
        'temperature': temp_pattern,
        'promotion': planned_promotions[:forecast_horizon],
        'holiday': holidays,
        'price': planned_prices[:forecast_horizon]
    })
    future_exog.set_index('date', inplace=True)

    return future_exog

# Usage example
planned_promos = [0] * 30  # promotion plan for 30 days
planned_promos[7] = 1  # promotion on day 8
planned_promos[14] = 1  # promotion on day 15

planned_prices = [50] * 30  # price plan for 30 days

future_exog = prepare_future_exogenous(30, planned_promos, planned_prices)
print(future_exog.head(10))

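Cautions

When using exogenous variables:

1. Future values are required
   - Exogenous values must be known at forecast time
   - If unknown, use forecasts or planned values

2. Causality
   - Correlation ≠ causation
   - Domain knowledge is required

3. Multicollinearity
   - Watch for high correlation among exogenous variables
   - Check the VIF (variance inflation factor)

4. Data leakage
   - Avoid using information from the future
   - Use lag variables appropriately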
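The VIF check in item 3 can be sketched with NumPy alone. For each column, VIF is 1 / (1 - R²), where R² comes from regressing that column on all the others. The toy variables below (`temp`, `humidity`, `feels_like`) are assumptions for illustration, with `feels_like` built as a near copy of `temp` so that the collinearity shows up.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of a 2-D array.

    VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
    column j on all other columns plus an intercept.
    """
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    factors = []
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residual = target - others @ coef
        r2 = 1 - residual.var() / target.var()
        factors.append(1 / (1 - r2))
    return np.array(factors)

rng = np.random.default_rng(0)
temp = rng.normal(20, 5, 300)
humidity = rng.normal(60, 10, 300)
feels_like = temp + rng.normal(0, 0.5, 300)  # nearly a copy of temp

vif_values = vif(np.column_stack([temp, humidity, feels_like]))
print(dict(zip(["temp", "humidity", "feels_like"], vif_values.round(1))))
```

A common rule of thumb treats VIF values above roughly 5 to 10 as a sign of problematic collinearity. In that case, dropping or combining one of the correlated variables is a reasonable first step.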

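Summary

Exogenous variables can improve forecast accuracy
Supported by ARIMAX, Prophet, and ML models
Future exogenous values are required
What-If analysis supports decision-making
Feature importance reveals each variable's influence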

Coming Up Next

The next post covers time series cross-validation.

PyCaret Machine Learning Master Series #076
