Working with time series has always been a serious challenge. Because the data is naturally ordered, the common Machine Learning methods cannot be applied as they are: by default they tend to shuffle the entries, losing the time information.
While working on Stock Market Prediction I had to face this kind of challenge which, despite being pretty common, is not well covered in the documentation.
One of the first problems I encountered when first applying ML was how to deal with Cross Validation (CV). This technique is widely used to perform feature and model selection once the data collection and cleaning phases have been carried out. The idea behind CV is that, in order to select the best predictors and algorithm, it is mandatory to measure the accuracy of our process on a set of data which is different from the one used to train the model.
Just to be sure we are on the same page, let’s walk through the following basic example: imagine you are a professor teaching a student a new topic. To train the student you give him some questions with solutions (supervised learning), so that he can check whether his reasoning was right or wrong. But if you had to test his knowledge you wouldn’t ask him the same questions as the previous day, right? That would be unfair; the student would be happy of course, but the result of the test would not be reliable. To really check whether he has digested the new concept you have to give him brand new exercises, something he has not seen before.
This is the concept at the base of Cross Validation. The most accepted technique in the ML world consists in randomly picking samples out of the available data and splitting them into a train and a test set. To be completely precise, the steps are generally the following:
- Randomly split the data into a train and a test set.
- Focus on the train set and split it again, randomly, into chunks (called folds).
- Let’s say you have 10 folds; train on 9 of them and test on the 10th.
- Repeat step three 10 times, each time holding out a different fold, to get 10 accuracy measures on 10 different and separate folds.
- Compute the average of the 10 accuracies, which is the final reliable number telling us how the model is performing.
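The fold-related steps above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not how scikit-learn implements it: the names shuffled_k_fold_indices, cross_validate and score_fn are hypothetical, and the initial split into train and test set is assumed to have been done already.

```python
import random


def shuffled_k_fold_indices(n_samples, n_folds, seed=0):
    """Yield (train_idx, test_idx) pairs for a shuffled k-fold split."""
    indices = list(range(n_samples))
    # Random order: acceptable only when the order of the rows does not matter.
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds chunks of near-equal size.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for held_out in range(n_folds):
        test_idx = folds[held_out]
        train_idx = [idx
                     for j, fold in enumerate(folds) if j != held_out
                     for idx in fold]
        yield train_idx, test_idx


def cross_validate(score_fn, n_samples, n_folds=10):
    """Average score_fn(train_idx, test_idx) over all the folds.

    score_fn is a hypothetical callback that trains on train_idx,
    tests on test_idx and returns an accuracy.
    """
    scores = [score_fn(train_idx, test_idx)
              for train_idx, test_idx
              in shuffled_k_fold_indices(n_samples, n_folds)]
    return sum(scores) / len(scores)
```

Every row ends up in exactly one test fold, so each accuracy is measured on data the model has not seen during that round of training.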
The issue with Time Series is that the previous approach (implemented by the most common built-in scikit-learn functions) cannot be applied. The reason is that in the Time Series case the data cannot be shuffled randomly, because we would lose its natural order, which in most cases matters.
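To make the problem concrete, here is a minimal sketch under the assumption that observation i happens at time step i. It compares a shuffled 80/20 split with an ordered one and checks whether the model would be trained on observations that are later than the ones it is tested on. The helper name trains_on_the_future is hypothetical.

```python
import random

n_samples = 100                       # observation i happens at time step i
time_steps = list(range(n_samples))

# Shuffled split: the kind of split that random CV performs.
shuffled = time_steps[:]
random.Random(0).shuffle(shuffled)
shuffled_train, shuffled_test = shuffled[:80], shuffled[80:]

# Ordered split: the first 80 time steps to train, the last 20 to test.
ordered_train, ordered_test = time_steps[:80], time_steps[80:]


def trains_on_the_future(train, test):
    """True if some training observation is later than the earliest test one."""
    return max(train) > min(test)


print(trains_on_the_future(shuffled_train, shuffled_test))
print(trains_on_the_future(ordered_train, ordered_test))   # prints False
```

With the shuffled split the model gets to see observations from after the period it is evaluated on, which is exactly the information it would not have in a real forecasting setting.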
Thus one possible solution is the following:
- Split the data into a train and a test set at a given date (e.g. the test set is what happens from 2 April 2014 onwards, that date included).
- Split the train set (e.g. what happens before 2 April 2014, that date not included) into, for example, 10 consecutive time folds.
- Then, in order not to lose the time information, perform the following steps:
  - Train on fold 1 -> Test on fold 2
  - Train on fold 1+2 -> Test on fold 3
  - Train on fold 1+2+3 -> Test on fold 4
  - Train on fold 1+2+3+4 -> Test on fold 5
  - Train on fold 1+2+3+4+5 -> Test on fold 6
  - Train on fold 1+2+3+4+5+6 -> Test on fold 7
  - Train on fold 1+2+3+4+5+6+7 -> Test on fold 8
  - Train on fold 1+2+3+4+5+6+7+8 -> Test on fold 9
  - Train on fold 1+2+3+4+5+6+7+8+9 -> Test on fold 10
- Compute the average of the accuracies of the 9 test folds (number of folds – 1)
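The list above can be sketched as two small helpers that only produce row positions. This is a minimal sketch under stated assumptions: the names split_by_date and expanding_window_folds are hypothetical, the calendar of 120 consecutive days is made up for illustration, and rows are assumed to be sorted by date already.

```python
from datetime import date


def split_by_date(dates, cutoff):
    """Return (train_positions, test_positions); the test set starts at cutoff."""
    train = [i for i, d in enumerate(dates) if d < cutoff]
    test = [i for i, d in enumerate(dates) if d >= cutoff]
    return train, test


def expanding_window_folds(n_train, number_folds):
    """Yield (train_range, test_range): train on folds 1..i-1, test on fold i."""
    fold_size = n_train // number_folds   # leftover rows at the end are ignored
    for i in range(2, number_folds + 1):
        train_range = range(0, fold_size * (i - 1))
        test_range = range(fold_size * (i - 1), fold_size * i)
        yield train_range, test_range


# Made-up daily calendar: 100 days before the cutoff date, 20 from it onwards.
cutoff = date(2014, 4, 2)
dates = [date.fromordinal(cutoff.toordinal() + offset)
         for offset in range(-100, 20)]
train_pos, test_pos = split_by_date(dates, cutoff)

# Prints 9 lines; the first one is: train rows 0-9, test rows 10-19
for train_range, test_range in expanding_window_folds(len(train_pos), 10):
    print('train rows 0-%d, test rows %d-%d'
          % (train_range[-1], test_range[0], test_range[-1]))
```

In every round the training rows come strictly before the test rows, so the time order is preserved. Recent versions of scikit-learn also provide sklearn.model_selection.TimeSeriesSplit, which yields expanding-window train/test indices along the same lines.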
The previous approach is encapsulated in the following function, which I built myself to be exactly sure of what it was doing. The code is fully commented and provided below.
import numpy as np


def performTimeSeriesCV(X_train, y_train, number_folds, algorithm, parameters):
    """
    Given X_train and y_train (the test set is excluded from the Cross Validation),
    number of folds, the ML algorithm to implement and the parameters to test,
    the function acts based on the following logic: it splits X_train and y_train in a
    number of folds equal to number_folds. Then it trains on one fold and tests accuracy
    on the consecutive one as follows:
    - Train on fold 1, test on 2
    - Train on fold 1-2, test on 3
    - Train on fold 1-2-3, test on 4
    Returns mean of test accuracies.
    """
    print('Parameters --------------------------------> ', parameters)
    print('Size train set: ', X_train.shape)

    # k is the size of each fold. It is computed dividing the number of
    # rows in X_train by number_folds. This number is floored and coerced to int
    k = int(np.floor(float(X_train.shape[0]) / number_folds))
    print('Size of each fold: ', k)

    # initialize to zero the accuracies array. It is important to stress that
    # in the CV of Time Series if I have n folds I test n-1 folds as the first
    # one is always needed to train
    accuracies = np.zeros(number_folds - 1)

    # loop from the first 2 folds to the total number of folds
    for i in range(2, number_folds + 1):
        # the split is the percentage at which to split the folds into train
        # and test. For example when i = 2 we are taking the first 2 folds out
        # of the total available. In this specific case we have to split the
        # two of them in half (train on the first, test on the second),
        # so split = 1/2 = 0.5 = 50%. When i = 3 we are taking the first 3 folds
        # out of the total available, meaning that we have to split the three of them
        # in two at split = 2/3 = 0.66 = 66% (train on the first 2 and test on the
        # third)
        split = float(i - 1) / i

        # example with i = 4 (first 4 folds):
        # Splitting the first 4 chunks at 3/4
        print('Splitting the first ' + str(i) + ' chunks at ' + str(i - 1) + '/' + str(i))

        # as we loop over the folds X and y are updated and increase in size.
        # This is the data that is going to be split and it increases in size
        # in the loop as we account for more folds. If k = 300, with i starting from 2
        # the result is the following in the loop
        # i = 2
        # X = X_train[:(600)]
        # y = y_train[:(600)]
        # i = 3
        # X = X_train[:(900)]
        # y = y_train[:(900)]
        X = X_train[:(k * i)]
        y = y_train[:(k * i)]
        print('Size of train + test: ', X.shape)  # the number of rows is going to be k*i

        # X and y contain both the folds to train and the fold to test.
        # index is the integer telling us where to split, according to the
        # split percentage we have set above. X has k*i rows, so the product is a
        # whole number; round() only guards against floating-point error
        index = int(round(X.shape[0] * split))

        # folds used to train the model
        X_trainFolds = X[:index]
        y_trainFolds = y[:index]

        # fold used to test the model (it starts right where the train folds end)
        X_testFold = X[index:]
        y_testFold = y[index:]

        # i starts from 2 so the zeroth element in accuracies array is i-2.
        # performClassification() is a function which takes care of a
        # classification problem. This is only an example and you can replace
        # this function with whatever ML approach you need.
        accuracies[i - 2] = performClassification(X_trainFolds, y_trainFolds, X_testFold, y_testFold, algorithm, parameters)

        # example with i = 4:
        # Accuracy on fold 4 : 0.85423
        print('Accuracy on fold ' + str(i) + ': ', accuracies[i - 2])

    # the function returns the mean of the accuracy on the n-1 folds
    return accuracies.mean()
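Since performClassification() is not shown here, the function above cannot run on its own. The following is a purely hypothetical stand-in with the same argument order, useful only to try out the fold logic: it ignores algorithm and parameters and always predicts the most frequent label seen in the training folds. It is not the classification routine used in this post.

```python
from collections import Counter


def performClassification(X_train, y_train, X_test, y_test, algorithm, parameters):
    """Hypothetical stand-in: always predict the most frequent training label.

    algorithm and parameters are accepted only to match the call made in
    performTimeSeriesCV() and are ignored by this baseline.
    Returns the accuracy on the test fold as a float between 0 and 1.
    """
    majority_label = Counter(list(y_train)).most_common(1)[0][0]
    y_test = list(y_test)
    if not y_test:
        # An empty test fold cannot be scored; report 0 rather than divide by zero.
        return 0.0
    hits = sum(1 for label in y_test if label == majority_label)
    return hits / float(len(y_test))
```

A baseline like this is also a handy sanity check: any real classifier plugged into the Cross Validation should do better than it on average.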
To show what performTimeSeriesCV() produces, a real sample output is provided.
Specifically, the following is the output of the next command, which performs a Cross Validation for a binary classification on a stock price (Positive or Negative Return) using Quadratic Discriminant Analysis.
performTimeSeriesCV(X_train, y_train, 10, 'QDA', )