Pythonic Cross Validation on Time Series

Working with time series has always posed a serious challenge. Because the data is naturally ordered, the common Machine Learning methods, which by default tend to shuffle the entries, cannot be applied without losing the time information.

While working on Stock Market Prediction I had to face exactly this kind of challenge which, despite being pretty common, is not well covered at a documentation level.

One of the first problems I encountered when applying ML was how to deal with Cross Validation. This technique is widely used to perform feature and model selection once the data collection and cleaning phases have been carried out. The idea behind CV is that, in order to select the best predictors and algorithm, it is mandatory to measure the accuracy of our process on a set of data different from the one used to train the model.

Just to be sure we are on the same page, let's walk through the following basic example: imagine you are a professor teaching a student a new topic. To train the student you provide him with some questions and their solutions (supervised learning), just to give him the chance to check whether his reasoning was right or wrong. But if you had to test his knowledge, you wouldn't ask him the same questions from the previous day, right? That would be unfair; well, the student would be happy of course, but the result of the test would not be reliable. To really check whether he has digested the new concept or not, you have to provide him with brand new exercises, something he has not seen before.

This is the concept at the base of Cross Validation. The most widely accepted technique in the ML world consists of randomly picking samples out of the available data and splitting them into a train and a test set. To be completely precise, the steps are generally the following (a minimal scikit-learn sketch follows the list):

  1. Split the data randomly into a train and a test set.
  2. Focus on the train set and split it again, randomly, into chunks (called folds).
  3. Let's say you got 10 folds: train on 9 of them and test on the 10th.
  4. Repeat step 3 ten times, each time testing on a different fold, to get 10 accuracy measures on 10 separate folds.
  5. Compute the average of the 10 accuracies; this is the final, reliable number telling us how the model is performing.
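
As a reference point, here is a minimal sketch of that procedure in scikit-learn. The dataset and the model are placeholders, not the ones used in this post:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

# Toy data standing in for a real, already-cleaned dataset.
rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = (X[:, 0] + rng.randn(200) > 0).astype(int)

# 10 random folds: train on 9, test on the held-out one, repeat.
accuracies = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    accuracies.append(model.score(X[test_idx], y[test_idx]))

# The average accuracy over the 10 folds is the number we report.
print(np.mean(accuracies))
```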

The issue with Time Series is that the previous approach (implemented by the most common built-in Scikit-learn functions) cannot be applied. The reason is that time series data cannot be shuffled randomly: we would lose the natural order of the observations, which in most cases matters.
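
To see the problem concretely, note that scikit-learn's train_test_split shuffles by default, so consecutive observations end up scattered between train and test. A tiny illustration with made-up data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten consecutive 'days' of observations, in temporal order.
X = np.arange(10).reshape(-1, 1)
y = np.arange(10)

# shuffle=True is the default: past and future days get mixed together.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)
print(y_train)  # the days come out in shuffled order; the time structure is gone
```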

Thus one possible solution is the following:

  1. Split the data into a train and a test set based on a date (e.g. the test set is everything from 2 April 2014 onwards, that date included); a pandas sketch of this split follows the list.
  2. Split the train set (i.e. everything before 2 April 2014, that date excluded) into, for example, 10 consecutive time folds.
  3. Then, in order not to lose the time information, perform the following steps:
  4. Train on fold 1 -> Test on fold 2
  5. Train on folds 1+2 -> Test on fold 3
  6. Train on folds 1+2+3 -> Test on fold 4
  7. Train on folds 1+2+3+4 -> Test on fold 5
  8. Train on folds 1+2+3+4+5 -> Test on fold 6
  9. Train on folds 1+2+3+4+5+6 -> Test on fold 7
  10. Train on folds 1+2+3+4+5+6+7 -> Test on fold 8
  11. Train on folds 1+2+3+4+5+6+7+8 -> Test on fold 9
  12. Train on folds 1+2+3+4+5+6+7+8+9 -> Test on fold 10
  13. Compute the average of the accuracies over the 9 test folds (number of folds - 1).
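
As a sketch of step 1, assuming the data sits in a pandas DataFrame indexed by date (the file and column names here are hypothetical):

```python
import pandas as pd

# Hypothetical daily stock data; 'stock_data.csv' and 'Date' are placeholders.
df = pd.read_csv('stock_data.csv', index_col='Date', parse_dates=True).sort_index()

split_date = pd.Timestamp('2014-04-02')
train = df[df.index < split_date]   # everything before 2 April 2014
test = df[df.index >= split_date]   # from 2 April 2014 onwards, included
```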

This approach is encapsulated in the following function, which I built myself to be exactly sure of what it was doing. The code is fully commented and provided below.
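
Here is a minimal sketch of such a function, assuming X and y are arrays already sorted by time and classifier is any estimator exposing scikit-learn's fit/score interface (the exact signature of the original implementation may differ):

```python
import numpy as np

def performTimeSeriesCV(X, y, number_folds, classifier):
    # Size of each consecutive time fold.
    k = int(np.floor(len(y) / number_folds))

    accuracies = np.zeros(number_folds - 1)

    # i-th iteration: train on folds 1 .. i-1, test on fold i.
    for i in range(2, number_folds + 1):
        split = k * (i - 1)  # index where the training window ends

        X_train, y_train = X[:split], y[:split]          # folds 1 .. i-1
        X_test, y_test = X[split:k * i], y[split:k * i]  # fold i

        classifier.fit(X_train, y_train)
        accuracies[i - 2] = classifier.score(X_test, y_test)

    # Average accuracy over the number_folds - 1 test folds.
    return accuracies.mean()
```

For reference, recent versions of scikit-learn also ship sklearn.model_selection.TimeSeriesSplit, which produces essentially this kind of expanding-window split out of the box.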

To show the result of performTimeSeriesCV(), a sample of real output is provided below.

Specifically, the following is the output of a command intended to perform Cross Validation for binary classification on a stock's price (Positive or Negative Return) using Quadratic Discriminant Analysis.
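
The call could look roughly like the following (a sketch only: X, y and the number of folds are placeholders, and the exact invocation behind the output is not reproduced here):

```python
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

# X: features sorted by date; y: 1 for a positive return, 0 for a negative one.
mean_accuracy = performTimeSeriesCV(X, y, number_folds=10,
                                    classifier=QuadraticDiscriminantAnalysis())
print(mean_accuracy)
```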

[Figure: sample output of performTimeSeriesCV()]

by Francesco Pochetti
