Sometime you have a time series of historical data and you would like to know how this series of data will look in the days or years to come. If the data looks "smooth" you would probably make a linear fit and extrapolate from the historical data to the future.
But what if the data is not resembling a line, but has time dependent structures? For example, the outdoor temperature is strongly dependent on the season, and while a straight line fit could be a pretty good representation of the yearly average temperature, it helps me very little if I need to know if I need to bring a scarf tomorrow.
If you have multiple features in the data, you could run a multivariate machine learning regression on the problem, e.g., using a random forest regression with influencing factors as features, and some of the periodicity modelled as explicit features.
But what if all you have is a time series of values? In this post I will demonstrate using Gaussian Process regression to predict the monthly retail trade turnover in Switzerland for the coming year.
I decided to do this example in python, and I am importing the Gaussian Process implementation from scikit-learn.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import datetime
from IPython.display import display, HTML
from sklearn import gaussian_process
%matplotlib inline
colors = ['#00257A','#009FDA','#6a7f10']
The data was downloaded from the Swiss Federal Statistical Office. Use the "STAT-TAB" query tool and select data from "Retail Trade Turnover Statistics - monthly series".
After loading the Excel file as a pandas dataframe, we see that we only have two columns:
- The date, with monthly granularity
- The retail trade turnover in billions of Swiss francs.
Let's take advantage of the time series functionality of pandas. I replaced the default index with a date time index. (Since our input data is only at monthly granularity, I gave each month a default day in order to have full dates in the index. Using PeriodIndex would also be an option.) The upper data frame is the original, and the lower data frame is the result.
df = pd.read_excel("px-x-0603020000_101.xlsx")
print 'Original data frame:'
display(df.head())
df = pd.DataFrame({'date':pd.to_datetime(df.Date, format="%YM%m").apply(lambda x: x.replace(day=15)), 'Retail trade turnover' : df['Retail trade turnover']} )
df.set_index('date', inplace=True)
print 'Data frame with DateTimeIndex:'
display(df.head())
Before doing any advanced analytics it is always a good idea to see the data with your own eyes. With one line of python we can show the content of the data frame. As the below figure shows, it contains quite a lot of time dependent structure.
- There is a strong sales peak at Christmas times, followed by a dip shortly thereafter.
- For many years the yearly turnover was increasing, but this growth appears to have stagnated now.
ax = df.plot(figsize=(14,6), lw=2, style='k-')
Since we want to make predictions, we need to append some future dates to the data frame. Here is one example where it is convenient to use the date time as index of the data frame.
last_data_date = df.iloc[-1:].index[0]
print 'Last data was collected ', last_data_date
periodicity = 'm'
dfFuture = pd.DataFrame({ 'day': pd.Series([last_data_date, datetime.datetime(2019,9,30)])})
dfFuture.set_index('day',inplace=True)
dfFuture = dfFuture.asfreq(periodicity)
df = df.append(dfFuture)
df.tail()
We will train the machine learning model on one part of the data, then use the model to predict how the future will be. To ensure that the model is not overfitting the data, it is important to validate the model on a part of the data that the model never saw during training. In other words, the training data set and the validation data sets must not overlap.
There are hence three sets of the data that should be considered:
- The training data set - used to build an accurate model
- The validation data set - used to ensure the correctness of the model
- The testing data set - used to make prediction of future events
To keep it simple in this example, I defined a date where I stop using data for training. The "test" dataset is is covering a wide range of dates, so we can see what happens when we predict dates we used for training, the actual validation date range, and the future prediction date range.
The "nugget" below should ideally contain the uncertainty of the central values. In this simple example we do not have access to the uncertainties, so we cheat a bit and add an arbritrary value.
start_test = last_data_date-datetime.timedelta(days=365)
stop_test = last_data_date+datetime.timedelta(days=365)
ts_train = df[df.index <= start_test]
ts_test = df[df.index <= stop_test]
data_train = np.array((ts_train.index.month, ts_train.index.year)).T
data_target = np.array(ts_train['Retail trade turnover'])[np.newaxis].ravel()
data_nugget = np.array(0.01*np.ones(len(ts_train.index)))[np.newaxis].ravel()
data_test = np.array((ts_test.index.month, ts_test.index.year )).T
With the data prepared we are ready to define Gaussian Process as the machine learning model, and train the model on the train data set.
gp = gaussian_process.GaussianProcess(theta0=1e-2, thetaL=1e-9, thetaU=1e-0, nugget=data_nugget)
gp.fit(data_train, data_target)
Now we take the trained model and predict the retail trade turnover as a function of the date. This model returns also the uncertainty of the prediction, and both the prediction and the uncertainty are added as columns to the data frame. As expected the early data have both the actual turnover and the prediction, while the future dates only have the prediction.
y_pred, sigma2_pred = gp.predict(data_test, eval_MSE=True)
ts_test.loc[:,'Prediction'] = y_pred
ts_test.loc[:,'PredictionSigma'] = np.sqrt(sigma2_pred)
display(ts_test.head())
display(ts_test.tail())
The figure below shows the actual and predicted retail trade turnover in Switzerland from 2002 to 2017.
- Dates left of the dashed line were used to train the model, while date right of this line were used for validation.
- The black line is the actual historical values, and the blue line are the predicted values.
- The 95% confidence interval of the prediction is shown as a shaded region around the prediction.
It is interesting to see that the model successfully learned that there is a large peak around Christmas, which it accurately predicted for December 2015. The model is quite confident that sales will go up also this coming Christmas, and I think not many would argue against that prediction.
So how accurate is our model? From the validation data set we estimate that the prediction is on average overestimating by the turnover by 3 billion Swiss francs, and it is typically wrong (in either direction) by 5 billion each month. The uncertainties of the prediction make a decent estimate of this accuracy and precision. However, as we predict further into the future the machine learning model gives growing uncertainty of it's predictions.
ax = ts_test['Retail trade turnover'].plot(lw=2, alpha=0.75, figsize=(14,6), style=['k-'], label='Actual')
ts_test['Prediction'].plot(lw=4, alpha=0.75, style=[colors[1]])
plt.fill_between(ts_test.index, ts_test.Prediction-2*ts_test.PredictionSigma, \
ts_test.Prediction+2*ts_test.PredictionSigma, color='b', alpha=0.2)
ax.set_ylabel('Retail trade turnover [billion CHF]')
ax.set_xlabel('Time')
ax.grid(0)
ax.axvline(start_test, c='k', ls='--')
ax.annotate('Training-testing boundary', xy=(start_test, 0), xytext=(start_test-datetime.timedelta(days=15), 10), \
horizontalalignment='right')
plt.ylim(0, plt.ylim()[1])
plt.gca().legend(frameon=False, loc='best')
plt.savefig('predict.png')
std = np.std(np.array(ts_test.dropna().loc[start_test:last_data_date]['Retail trade turnover'] \
-ts_test.dropna().loc[start_test:last_data_date]['Prediction']))
mean = np.mean(np.array(ts_test.dropna().loc[start_test:last_data_date]['Prediction'] \
-ts_test.dropna().loc[start_test:last_data_date]['Retail trade turnover']))
print 'Model deviation = ', np.round(mean, 2), '+-', np.round(std, 2)
This post has shown how to fit a time series with periodic structures, using no other information than the values in the time series itself. It has also shown how the confidence of these predictions can be estimated.
Comments !