Index
- Introduction and Discussion of the Problem
- Feature Generation
- Classification Algorithms
- Feature/Model Selection
- Results on Test Set
- Trading Algorithm and Portfolio Performance
So finally, as a result of the last post, we have a dataframe to play with.
Before diving into model and feature selection I would like to give a brief overview of the classification algorithms I tested. In the next post I'll show the techniques and the decision process I followed to select the best model and the best features. To keep things simple I'll present actual results only for the best model, so it is worth introducing the other ones I implemented here.
Classification ML
The following is a pretty awesome algorithm cheat-sheet provided as part of the Scikit-Learn documentation. I'll cover the Classification branch of the tree, going through the code needed to get the selected algorithms up and running.
First of all we need to prepare our data for the actual Machine Learning work; prepareDataForClassification() handles that. The function takes care of:
- turning the S&P 500 daily returns from a float column into a string one (Up, Down);
- encoding that column as a binary integer (1, 0), the only form accepted by Scikit-Learn classifiers;
- selecting the features used for prediction (basically all the columns except the first one, the float returns, and the last one, the encoded returns);
- splitting the whole dataframe into train and test sets (based on a date passed as an argument) and returning, for each of them, the predictors and the actual target.
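All the snippets in this post assume the imports below. The list is my reconstruction (the original post does not show it), based on the classes actually used; the module paths match the Python 2 era scikit-learn this code was written against.

import cPickle
from datetime import datetime

from sklearn import preprocessing, neighbors
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.qda import QDA  # recent releases: from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis as QDA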
def prepareDataForClassification(dataset, start_test):
    """
    generates the categorical output column, attaches it to the dataframe,
    label-encodes the categories and splits into train and test
    """
    le = preprocessing.LabelEncoder()

    # turn the float returns into an Up/Down string column
    dataset['UpDown'] = dataset['Return_Out']
    dataset.UpDown[dataset.UpDown >= 0] = 'Up'
    dataset.UpDown[dataset.UpDown < 0] = 'Down'
    # encode the strings as binary integers (1, 0)
    dataset.UpDown = le.fit(dataset.UpDown).transform(dataset.UpDown)

    # all columns except the first (float returns) and the last (UpDown)
    features = dataset.columns[1:-1]
    X = dataset[features]
    y = dataset.UpDown

    # split on the date passed as argument
    X_train = X[X.index < start_test]
    y_train = y[y.index < start_test]
    X_test = X[X.index >= start_test]
    y_test = y[y.index >= start_test]

    return X_train, y_train, X_test, y_test
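Assuming the dataframe built in the last post is indexed by date, the call looks like this (the split date is purely illustrative):

X_train, y_train, X_test, y_test = prepareDataForClassification(dataset, datetime(2014, 1, 1))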
performClassification() is in charge of calling the selected algorithm and returning the classification result.
def performClassification(X_train, y_train, X_test, y_test, method, parameters, fout, savemodel):
    """
    performs classification on daily returns using several algorithms (method).
    method --> string algorithm
    parameters --> list of parameters passed to the classifier (if any)
    fout --> string with name of stock to be predicted
    savemodel --> boolean. If TRUE saves the model to pickle file
    """
    if method == 'RF':
        return performRFClass(X_train, y_train, X_test, y_test, parameters, fout, savemodel)
    elif method == 'KNN':
        return performKNNClass(X_train, y_train, X_test, y_test, parameters, fout, savemodel)
    elif method == 'SVM':
        return performSVMClass(X_train, y_train, X_test, y_test, parameters, fout, savemodel)
    elif method == 'ADA':
        return performAdaBoostClass(X_train, y_train, X_test, y_test, parameters, fout, savemodel)
    elif method == 'GTB':
        return performGTBClass(X_train, y_train, X_test, y_test, parameters, fout, savemodel)
    elif method == 'QDA':
        return performQDAClass(X_train, y_train, X_test, y_test, parameters, fout, savemodel)
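For instance, a Random Forest run could be kicked off as follows ('sp500' is just an illustrative label for the output pickle, and parameters is empty because the RF snippet below ignores it):

accuracy = performClassification(X_train, y_train, X_test, y_test,
                                 method='RF', parameters=[],
                                 fout='sp500', savemodel=False)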
The following snippets implement the selected algorithms.
Random Forest
def performRFClass(X_train, y_train, X_test, y_test, parameters, fout, savemodel):
    """
    Random Forest Binary Classification
    """
    clf = RandomForestClassifier(n_estimators=1000, n_jobs=-1)
    clf.fit(X_train, y_train)

    if savemodel:
        fname_out = '{}-{}.pickle'.format(fout, datetime.now())
        with open(fname_out, 'wb') as f:
            cPickle.dump(clf, f, -1)

    accuracy = clf.score(X_test, y_test)
    return accuracy
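Side note, my addition rather than the original post's: a fitted Random Forest also exposes feature_importances_, which will come in handy for the feature selection covered in the next post. A minimal sketch, reusing X_train and y_train from above:

clf = RandomForestClassifier(n_estimators=1000, n_jobs=-1)
clf.fit(X_train, y_train)

# rank the predictors by the importance the forest assigned them
importances = sorted(zip(X_train.columns, clf.feature_importances_),
                     key=lambda pair: pair[1], reverse=True)
for name, score in importances:
    print name, score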
K-Nearest Neighbors
def performKNNClass(X_train, y_train, X_test, y_test, parameters, fout, savemodel):
    """
    KNN binary Classification
    """
    clf = neighbors.KNeighborsClassifier()
    clf.fit(X_train, y_train)

    if savemodel:
        fname_out = '{}-{}.pickle'.format(fout, datetime.now())
        with open(fname_out, 'wb') as f:
            cPickle.dump(clf, f, -1)

    accuracy = clf.score(X_test, y_test)
    return accuracy
Support Vector Machines
def performSVMClass(X_train, y_train, X_test, y_test, parameters, fout, savemodel):
    """
    SVM binary Classification
    """
    c = parameters[0]
    g = parameters[1]
    # pass the penalty C and kernel coefficient gamma to the classifier
    # (the original snippet unpacked them but never used them)
    clf = SVC(C=c, gamma=g)
    clf.fit(X_train, y_train)

    if savemodel:
        fname_out = '{}-{}.pickle'.format(fout, datetime.now())
        with open(fname_out, 'wb') as f:
            cPickle.dump(clf, f, -1)

    accuracy = clf.score(X_test, y_test)
    return accuracy
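One caveat worth flagging (my addition, not in the original post): SVMs are sensitive to the scale of the predictors, so standardizing them usually pays off. A minimal sketch with scikit-learn's StandardScaler, fit on the train set only to avoid leaking test-set information:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X_train)       # learn mean and std on training data only
clf = SVC(C=c, gamma=g)                      # c, g as unpacked in performSVMClass
clf.fit(scaler.transform(X_train), y_train)  # same transform applied to both sets
accuracy = clf.score(scaler.transform(X_test), y_test)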
Adaptive Boosting
def performAdaBoostClass(X_train, y_train, X_test, y_test, parameters, fout, savemodel):
    """
    Ada Boosting binary Classification
    """
    n = parameters[0]
    l = parameters[1]
    # pass the number of estimators and the learning rate to the classifier
    # (the original snippet unpacked them but never used them)
    clf = AdaBoostClassifier(n_estimators=n, learning_rate=l)
    clf.fit(X_train, y_train)

    if savemodel:
        fname_out = '{}-{}.pickle'.format(fout, datetime.now())
        with open(fname_out, 'wb') as f:
            cPickle.dump(clf, f, -1)

    accuracy = clf.score(X_test, y_test)
    return accuracy
Gradient Tree Boosting
def performGTBClass(X_train, y_train, X_test, y_test, parameters, fout, savemodel):
    """
    Gradient Tree Boosting binary Classification
    """
    clf = GradientBoostingClassifier(n_estimators=100)
    clf.fit(X_train, y_train)

    if savemodel:
        fname_out = '{}-{}.pickle'.format(fout, datetime.now())
        with open(fname_out, 'wb') as f:
            cPickle.dump(clf, f, -1)

    accuracy = clf.score(X_test, y_test)
    return accuracy
Quadratic Discriminant Analysis
def performQDAClass(X_train, y_train, X_test, y_test, parameters, fout, savemodel):
    """
    Quadratic Discriminant Analysis binary Classification
    """
    def replaceTiny(x):
        # floor near-zero values to keep the class covariance matrices well-conditioned
        if abs(x) < 0.0001:
            x = 0.0001
        return x

    # apply elementwise (the original used DataFrame.apply, which operates
    # column-wise and would fail here; the helper also never returned x)
    X_train = X_train.applymap(replaceTiny)
    X_test = X_test.applymap(replaceTiny)

    clf = QDA()
    clf.fit(X_train, y_train)

    if savemodel:
        fname_out = '{}-{}.pickle'.format(fout, datetime.now())
        with open(fname_out, 'wb') as f:
            cPickle.dump(clf, f, -1)

    accuracy = clf.score(X_test, y_test)
    return accuracy
Well, time to turn to some Model and Feature Selection!