- Introduction and Discussion of the Problem
- Feature Generation
- Classification Algorithms
- Feature/Model Selection
- Results on Test Set
- Trading Algorithm and Portfolio Performance
We closed the previous post with the results of Cross Validation. Eventually we decided that our best combination is the following:
- Algorithm: Random Forests (n_estimators = 100)
- Features: n = 9 / delta = 9
Random Forests Results
From a strict Machine Learning point of view, now that we have picked the model and the features, it would be time for algorithm parameter selection. This can be achieved by performing another round of CV on the train set, looping over a set of parameters and then selecting the ones that maximize the average accuracy across folds. Obviously the parameters of interest change with the algorithm. In the Scikit-Learn implementation of Random Forests there are several elements we are allowed to play with:
- n_estimators (default = 10): The number of trees in the forest.
- max_features (default = sqrt(n_features)): The number of features to consider when looking for the best split.
- Other minor parameters, such as the maximum depth of the tree, the minimum number of samples required to split an internal node, and so on and so forth.
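The parameter-selection round described above can be sketched with scikit-learn's GridSearchCV, which runs the CV loop over a parameter grid for us. This is only a minimal sketch: the synthetic data stands in for the real feature frame, and the grid values are illustrative.

```python
# Parameter selection via cross validation: loop over a grid of
# Random Forests parameters and keep the combination with the best
# average accuracy across folds.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Stand-in for the real train set (the actual frame has 225 features).
X_train, y_train = make_classification(n_samples=500, n_features=20,
                                       random_state=0)

param_grid = {
    "n_estimators": [50, 100, 200],
    "max_features": ["sqrt", 0.5, None],  # None = all features
}

search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)
print(search.best_params_)
print(round(search.best_score_, 3))
```

`best_params_` then holds the winning combination, and `best_estimator_` is already refitted on the whole train set.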
How much these parameters affect the performance of a model strongly depends on the specific problem we are facing; there is no generic rule of thumb to stick to. It’s quite well known that increasing the number of estimators decreases the train error (and hopefully the test error too), but past a certain point no relevant improvement is recorded: we would only be raising the computational cost without any real benefit. For this reason I stuck to 100 trees and never changed it.
As for the max_features parameter, this can be a tricky one. Random Forests is an improved evolution of the Bagging (Bootstrap Aggregating) algorithm, which solves the innate high-variance problem of a single tree but introduces a high correlation among the bootstrapped trees, because all the features are taken into account at every split. Thus, in the presence of a few dominant predictors, Bagging splits the trees in essentially the same way every time, and we end up averaging a bunch of nearly identical trees. Random Forests was introduced to solve this issue: only a random subset of the available features is considered at each split, so the trees are generally uncorrelated with one another and the final result is much more reliable. Theoretically speaking, if I set max_features equal to the total number of features I would end up performing Bagging without even realizing it. At the end of the day this parameter may be pretty relevant; in any case the default (the square root of the number of predictors) strikes a very good balance in the bias-variance trade-off.
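The Bagging-versus-Random-Forests point can be made concrete in a few lines: setting max_features=None makes every split see all predictors (Bagging-like behavior), while the sqrt default decorrelates the trees. The synthetic data below, with only a few informative predictors, is an assumed stand-in for the real set.

```python
# Compare a Bagging-like forest (all features at each split) with a
# proper Random Forest (sqrt of the features at each split) on data
# dominated by a handful of informative predictors.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=600, n_features=30, n_informative=3,
                           n_redundant=0, random_state=1)

bagging_like = RandomForestClassifier(n_estimators=100, max_features=None,
                                      random_state=1)
random_forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                       random_state=1)

acc_bag = cross_val_score(bagging_like, X, y, cv=5).mean()
acc_rf = cross_val_score(random_forest, X, y, cv=5).mean()
print(f"all features (Bagging-like): {acc_bag:.3f}")
print(f"sqrt features (Random Forest): {acc_rf:.3f}")
```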
Let’s go back to our first concern, which was to measure the accuracy of the best CV model on the test set. To do that I train the model on the whole train set and report the accuracy on the test set. After the model has been trained it is saved to a file (.pickle), so that it can be reused in the future without having to be generated from scratch each time we need it. The output of the code is provided below.
```
Maximum time lag applied 9
Delta days accounted: 9
Size of data frame: (5456, 153)
Number of NaN after merging: 13311
Number of NaN after time interpolation: 0
Number of NaN after temporal shifting: 0
Size of data frame after feature creation: (5446, 225)
Accuracy on Test Set: 0.571428571429
```
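The train-save-evaluate step can be sketched as follows. This is a minimal version with synthetic stand-in data; the file name model.pickle and the split are assumptions, not the actual code from the pipeline.

```python
# Fit the chosen model on the whole train set, persist it with pickle,
# then reload it and report the accuracy on the test set.
import pickle
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Save the fitted model so it never has to be rebuilt from scratch.
with open("model.pickle", "wb") as f:
    pickle.dump(model, f)

# Reload and score, exactly as the saved model would be reused later.
with open("model.pickle", "rb") as f:
    reloaded = pickle.load(f)
print("Accuracy on Test Set:", reloaded.score(X_test, y_test))
```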
The accuracy of our final model is around 57%, which is quite disappointing. We always have to keep in mind that we are performing binary classification, for which random guessing has a success probability of 50%. So basically our best model is only 7 percentage points better than tossing a coin. I have to admit that I struggled with these accuracy figures for quite a while, and after some reasoning and a lot of literature I arrived at the following conclusions.
As stressed at the very beginning of this series of posts, the Stock Market Prediction Problem is very tough. A lot of research has been produced on this topic, a lot more will surely follow, and getting relevant results out of the blue is hard. We always have to remind ourselves that we are basically going against the Efficient Market Hypothesis (EMH), which asserts that markets are informationally efficient: they adjust as soon as an event or a pattern is identified. As a result, forecasting in this kind of environment is very challenging.
The most common mistake is to believe that all the relevant information sits inside the market itself. There is surely some significant signal beyond historical exchange data, but its extraction is not straightforward and generally requires technical time series analysis and econometrics. It is feasible to get something out of raw data with basic financial analysis, as I did, but the real information has to be dug out much more in depth.
In addition, we must also account for the poor results of off-the-shelf Machine Learning algorithms. As far as I can tell from the available literature, much better performance can be obtained by implementing custom cost functions that take into account correlations and other more advanced metrics.
To conclude, I am very glad about the process I followed and the pipeline I built, but the results are evidently not satisfactory.
In any case, we finally have a model! The next step is to build a trading algorithm on top of our predictions and see what happens in real life with a backtesting example.