
Part IV – Model/Feature Selection


Index

  1. Introduction and Discussion of the Problem
  2. Feature Generation
  3. Classification Algorithms
  4. Feature/Model Selection
  5. Results on Test Set
  6. Trading Algorithm and Portfolio Performance

In the last post I introduced the classification algorithms tested for the project’s purposes, together with the function in charge of data preparation and splitting. We are now ready for some real ML reasoning.

Cross Validation

We have introduced the algorithms and the features, but now the really tough questions are the following:

  • What is the best algorithm and what are the best features?
  • And most importantly, what does “best” mean?

In order to answer these questions we have to introduce Cross Validation, which I already treated quite extensively in this post about Pythonic Cross Validation on Time Series.

Here is a quick recap for those who don’t want to go through the linked post.

First of all we have to decide on a metric to evaluate the results of our predictions. A generally well accepted measurement is the Area Under the Curve (AUC), i.e. the area under the ROC curve, which plots the true positive rate against the false positive rate at several probability thresholds. The output of a prediction algorithm can always be interpreted as the probability of a certain case belonging to a specific class. For a binary classification the default behavior of the algorithm is to classify as “0” a case whose probability of belonging to “1” is less than 50%, and vice versa. This threshold can be varied as needed, depending on the field of investigation, and there are situations in which this kind of tuning is absolutely fundamental, for example when the relative proportion of the two classes is extremely skewed towards one of them. In that case the 50% threshold does not really make sense: if I’m building a model to detect whether a person is affected by a particular disease and the disease rate in the whole population is, say, 3%, then I want to be very careful, as that 3% is very likely to end up in my misclassification error.

The AUC takes care of this kind of issue by measuring the robustness of a classifier at several probability thresholds. In my case, stock markets being notoriously close to random, I decided to stick to the more classic accuracy of a classifier, fixing my threshold at 50%.
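To make the distinction concrete, here is a minimal sketch (on toy labels and probabilities, not the project’s data) contrasting AUC, which aggregates over all thresholds, with accuracy at a fixed 50% threshold:

import numpy as np
from sklearn.metrics import roc_auc_score, accuracy_score

# toy labels and predicted probabilities, for illustration only
y_true = np.array([0, 0, 1, 1, 0, 1])
y_proba = np.array([0.2, 0.4, 0.6, 0.8, 0.55, 0.3])

# AUC aggregates performance over every possible threshold
print 'AUC: ', roc_auc_score(y_true, y_proba)

# accuracy commits to a single threshold (here the classic 50%)
y_pred = (y_proba >= 0.5).astype(int)
print 'Accuracy at 50% threshold: ', accuracy_score(y_true, y_pred)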

Having said that, let’s come back to Cross Validation. To show the technique I’ll walk through a practical example.

We want to assess the performance of a Random Forest Classifier in the following conditions:

  1. 100 trees (n_estimators in Scikit-Learn)
  2. n = 2 / delta = 2. Thus we are lagging the returns by at most 2 days, and we are computing at most 2-day returns and 2-day moving-average returns (a toy sketch of these features follows this list).
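To give a feeling for what n = 2 / delta = 2 produces, here is a toy sketch of these features on a made-up price series; the real feature code was introduced in the Feature Generation post, so the names used here are hypothetical:

import pandas as pd

# made-up daily closing prices, for illustration only
close = pd.Series([100.0, 101.5, 100.8, 102.3, 103.0, 102.1, 104.0])
returns = close.pct_change()                  # 1-day returns

feat = pd.DataFrame(index=close.index)
feat['ret_lag_1'] = returns.shift(1)          # returns lagged by 1 day
feat['ret_lag_2'] = returns.shift(2)          # ... up to n = 2 days
feat['ret_2d'] = close.pct_change(periods=2)  # 2-day returns (delta = 2)
feat['ret_ma_2'] = returns.rolling(2).mean()  # 2-day moving average of returns
print feat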

What we do next is the following:

  1. Split the data in train and test set given a date (in my case 1 April 2014, included, marks the start of the test set).
  2. Split the train set (everything before 1 April 2014, excluded) in 10 consecutive time folds.
  3. Then, in order not to lose the time information, perform the following steps (a code sketch of this scheme follows the list):
  4. Train on fold 1 –>  Test on fold 2
  5. Train on fold 1+2 –>  Test on fold 3
  6. Train on fold 1+2+3 –>  Test on fold 4
  7. Train on fold 1+2+3+4 –>  Test on fold 5
  8. Train on fold 1+2+3+4+5 –>  Test on fold 6
  9. Train on fold 1+2+3+4+5+6 –>  Test on fold 7
  10. Train on fold 1+2+3+4+5+6+7 –>  Test on fold 8
  11. Train on fold 1+2+3+4+5+6+7+8 –>  Test on fold 9
  12. Train on fold 1+2+3+4+5+6+7+8+9 –>  Test on fold 10
  13. Compute the average of the accuracies of the 9 test folds (number of folds – 1).
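The performCV function itself only appears through its printed output below, so here is a minimal sketch of the expanding-window scheme it implements, reconstructed from the steps above; the function name, signature and internals are my reconstruction, not the verbatim code:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def walkForwardCV(X_train, y_train, folds, n_estimators=100):
    """
    Expanding-window CV sketch: split the train set into `folds`
    consecutive chunks, train on chunks 1..i-1, test on chunk i,
    and average the resulting accuracies.
    X_train and y_train are assumed to be numpy arrays sorted by date.
    """
    k = int(np.floor(len(X_train) / float(folds)))  # size of each fold
    accuracies = []
    for i in range(2, folds + 1):
        # take the first i chunks and split them at (i-1)/i
        X = X_train[:k * i]
        y = y_train[:k * i]
        index = int(np.floor(len(X) * (i - 1) / float(i)))
        X_tr, y_tr = X[:index], y[:index]  # train on folds 1 .. i-1
        X_te, y_te = X[index:], y[index:]  # test on fold i
        clf = RandomForestClassifier(n_estimators=n_estimators)
        clf.fit(X_tr, y_tr)
        acc = clf.score(X_te, y_te)
        print 'Accuracy on fold %d: ' % i, acc
        accuracies.append(acc)
    return np.mean(accuracies)  # average over (folds - 1) accuracies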

Repeat steps 2-13 in the following conditions:

  • n = 2 / delta = 3
  • n = 2 / delta = 4
  • n = 2 / delta = 5
  • n = 3 / delta = 2
  • n = 3 / delta = 3
  • n = 3 / delta = 4
  • n = 3 / delta = 5
  • n = 4 / delta = 2
  • n = 4 / delta = 3
  • n = 4 / delta = 4
  • n = 4 / delta = 5
  • n = 5 / delta = 2
  • n = 5 / delta = 3
  • n = 5 / delta = 4
  • n = 5 / delta = 5

and get the average of the accuracies of the 9 test folds in each one of the previous cases. Obviously there is an infinite number of combinations to generate and assess; I decided to stop at a maximum of 10 days, so I basically performed a double for loop up to n = 10 / delta = 10.

Each time the script enters an iteration of the for loop, it generates a brand new dataframe with a different set of features. Then, on top of the newly created dataframe, 10-fold Cross Validation is performed in order to assess the performance of the selected algorithm with that particular set of predictors.

I repeated this set of operations for all the algorithms introduced in the previous post (Random Forest, KNN, SVM, Adaptive Boosting, Gradient Tree Boosting, QDA) and after all the computations the best result is the following:

  • Random Forest | n = 9 / delta = 9 | Average Accuracy = 0.530748

The output for this specific configuration is provided below, together with the Python function in charge of looping over n and delta and performing CV on each iteration.

– output for n = 9 / delta = 9
Maximum time lag applied 9
Delta days accounted:  9

Size of data frame:  (5456, 153)
Number of NaN after merging:  13311
Number of NaN after time interpolation:  0
Number of NaN after temporal shifting:  0
Size of data frame after feature creation:  (5446, 225)

Parameters -------------------------------->  [100]
Size train set:  (5341, 224)
Size of each fold:  534

Splitting the first 2 chunks at 1/2
Size of train+test:  (1068, 224)
Performing RF Classification...
Size of train set:  (534, 224)
Size of test set:  (533, 224)
Accuracy on fold 2:  0.514071294559

Splitting the first 3 chunks at 2/3
Size of train+test:  (1602, 224)
Performing RF Classification...
Size of train set:  (1068, 224)
Size of test set:  (533, 224)
Accuracy on fold 3:  0.538461538462

Splitting the first 4 chunks at 3/4
Size of train+test:  (2136, 224)
Performing RF Classification...
Size of train set:  (1602, 224)
Size of test set:  (533, 224)
Accuracy on fold 4:  0.480300187617

Splitting the first 5 chunks at 4/5
Size of train+test:  (2670, 224)
Performing RF Classification...
Size of train set:  (2136, 224)
Size of test set:  (533, 224)
Accuracy on fold 5:  0.527204502814

Splitting the first 6 chunks at 5/6
Size of train+test:  (3204, 224)
Performing RF Classification...
Size of train set:  (2670, 224)
Size of test set:  (533, 224)
Accuracy on fold 6:  0.547842401501

Splitting the first 7 chunks at 6/7
Size of train+test:  (3738, 224)
Performing RF Classification...
Size of train set:  (3204, 224)
Size of test set:  (533, 224)
Accuracy on fold 7:  0.544090056285

Splitting the first 8 chunks at 7/8
Size of train+test:  (4272, 224)
Performing RF Classification...
Size of train set:  (3738, 224)
Size of test set:  (533, 224)
Accuracy on fold 8:  0.55722326454

Splitting the first 9 chunks at 8/9
Size of train+test:  (4806, 224)
Performing RF Classification...
Size of train set:  (4272, 224)
Size of test set:  (533, 224)
Accuracy on fold 9:  0.547842401501

Splitting the first 10 chunks at 9/10
Size of train+test:  (5340, 224)
Performing RF Classification...
Size of train set:  (4806, 224)
Size of test set:  (533, 224)
Accuracy on fold 10:  0.519699812383

Mean Accuracy for (9, 9): 0.530748

– function to perform Feature and Model Selection

def performFeatureSelection(maxdeltas, maxlags, fout, cut, start_test, path_datasets, savemodel, method, folds, parameters):
    """
    Performs feature and model selection for a specific algorithm:
    loops over n (time lags) and delta (return windows), rebuilds the
    feature dataframe at each iteration and runs walk-forward CV on it.
    Relies on the helper functions introduced in the previous posts
    (loadDatasets, applyRollMeanDelayedReturns, mergeDataframes,
    count_missing, applyTimeLag, prepareDataForClassification, performCV).
    """
    for maxlag in range(3, maxlags + 2):
        lags = range(2, maxlag)  # lag the returns by up to maxlag - 1 days
        print ''
        print '============================================================='
        print 'Maximum time lag applied', max(lags)
        print ''
        for maxdelta in range(3, maxdeltas + 2):
            # reload the raw datasets so each iteration starts from scratch
            datasets = loadDatasets(path_datasets, fout)
            delta = range(2, maxdelta)  # return/moving-average windows
            print 'Delta days accounted: ', max(delta)
            # compute delta-day returns and moving-average returns
            datasets = applyRollMeanDelayedReturns(datasets, delta)
            # merge the single-index dataframes into one
            finance = mergeDataframes(datasets, 6, cut)
            print 'Size of data frame: ', finance.shape
            print 'Number of NaN after merging: ', count_missing(finance)
            # fill the gaps left by the merge: linear interpolation first...
            finance = finance.interpolate(method='linear')
            print 'Number of NaN after time interpolation: ', count_missing(finance)
            # ...then column means for anything still missing
            finance = finance.fillna(finance.mean())
            print 'Number of NaN after mean interpolation: ', count_missing(finance)
            # shift the features back in time to build the lagged predictors
            finance = applyTimeLag(finance, lags, delta)
            print 'Number of NaN after temporal shifting: ', count_missing(finance)
            print 'Size of data frame after feature creation: ', finance.shape
            # split at start_test and run walk-forward CV on the train set
            X_train, y_train, X_test, y_test = prepareDataForClassification(finance, start_test)
            print performCV(X_train, y_train, folds, method, parameters, fout, savemodel)
            print ''
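For reference, a hypothetical call reproducing the grid explored above might look as follows; the values of fout, cut and path_datasets, as well as the 'RF' method label, are assumptions that depend on the actual datasets and on performCV’s conventions:

from datetime import datetime

# double loop up to n = 10 / delta = 10, Random Forest with 100 trees,
# 10-fold walk-forward CV, test set starting on 1 April 2014
performFeatureSelection(maxdeltas=10, maxlags=10, fout='sp500', cut=datetime(1993, 1, 1),
                        start_test=datetime(2014, 4, 1), path_datasets='./datasets',
                        savemodel=False, method='RF', folds=10, parameters=[100])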

It is very important to notice that nothing has been done at the algorithmic-parameter level. What I mean is that with the previous approach we have been able to achieve three very important goals:

  1. Assess the best classification algorithm, comparing all of them on the same sets of features.
  2. Assess, for each algorithm, the best set of features.
  3. Pick the (model, features) pair maximizing the CV accuracy (a small sketch of this last step follows).
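The final selection step then boils down to taking the argmax over all the recorded runs, along these lines; the RF entry reproduces the real figure above, while the other tuples are purely hypothetical placeholders standing in for the remaining runs:

# each CV run recorded as a (method, n, delta, mean accuracy) tuple
results = [
    ('RF',  9, 9, 0.530748),
    ('KNN', 4, 2, 0.5102),
    ('SVM', 6, 3, 0.5187),
]
best = max(results, key=lambda r: r[3])  # maximize CV accuracy
print 'Best (model, n, delta):', best[:3], '| CV accuracy:', best[3]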

Nice! So let’s move on to the test set!
