Part II – Feature Generation


  1. Introduction and Discussion of the Problem
  2. Feature Generation
  3. Classification Algorithms
  4. Feature/Model Selection
  5. Results on Test Set
  6. Trading Algorithm and Portfolio Performance


In the last post I went through the project’s introduction and the data collection, together with a little bit of feature analysis. In this article I’ll deal with additional feature generation and lay the foundations of feature selection.

Basically I want to answer the following questions: after having collected financial historical data

  • how do I get relevant information out of it?
  • how do I add flexibility to my system plugging in artificially generated features?

Old and New Features

Together with the algorithm this is the most important question to be answered. Turns out that it is very hard to define a priori a good set of features. I’m not talking about feature selection; this topic will be faced in the future using Cross Validation. I’m talking about the basic set of features to start playing with. Excluding the date, I could just have taken all 6 columns (Open, High, Low, Close, Volume, Adj Close) for the 8 selected indices (NASDAQ, Dow Jones, Frankfurt, London, Paris, Tokyo, Hong Kong, Australia) plus the output (S&P 500), merged them into a unique dataframe of 54 columns and run an algorithm on top of that.

The latter approach is quite naive as it doesn’t really take into account any financial dynamics, plugging into the model absolute values of prices and not their fluctuations.

It would be much more informative to turn all the predictors into returns as well, to account for the variation of the predictors more than sticking to their static values. To achieve that I focused on the Adjusted Price of each predictor. So out of the 56 possible columns I selected only 9 of them.

  1. AdjClose_sp
  2. AdjClose_nasdaq
  3. AdjClose_djia
  4. AdjClose_frankfurt
  5. AdjClose_london
  6. AdjClose_paris
  7. AdjClose_nikkei
  8. AdjClose_hkong
  9. AdjClose_australia

Then in order to account for time variations I decided to play with 4 basic financial metrics:

  1. Days Returns: percentage difference of Adjusted Close Price of i-th day  compared to (i-1)-th day.  $$ Return_{i} = \frac{AdjClose_{i} – AdjClose_{i-1} }{AdjClose_{i-1}} $$
  2. Multiple Day Returns: percentage difference of Adjusted Close Price of i-th day  compared to (i-delta)-th day. Example: 3-days Return is the percentage difference of Adjusted Close Price of today compared to the one of 3 days ago. $$ Return_{\delta} = \frac{AdjClose_{i} – AdjClose_{i-\delta} }{AdjClose_{i-\delta}} $$
  3. Returns Moving Average: average returns on last delta days. Example: 3-days Return is the percentage difference of Adjusted Close Price of today compared to the one of 3 days ago. $$ MovingAverage_{\delta} = \frac{Return_{i} + Return_{i-2} + Return_{i-2} +\dots Return_{i-\delta}}{\delta} $$
  4. Time Lagged Returns: shift the daily returns n days backwards. Example: if n =  1 todays’ Return becomes yesterdays’ Return.

Thus to recap this is the process I follow to build my dataset:

  1. I start with 8 basic predictors (the Adjusted Close Price of the 8 world major stock indices) + 1 output/predictor (Adjusted Close Price of S&P 500). Take in mind that despite S&P daily returns being my predicted values I still want to keep inside my model some information regarding Standard & Poors itself. I hope this is not confusing.
  2. I compute the daily returns of each of them.
  3. I add features to the DataFrame using the 4 financial metrics described above. Thus basically playing around with n and delta I can generate as many features as I want adding more and more flexibility to my dataframe.
  4. I get rid of the Adjusted Close Price I had at the beginning, ending up with a perfectly scaled dataset. No need for normalization as all the data I generated are in the same range (as you probably noticed we basically performed always the same kind of returns computations).
  5. Notice that a bunch of missing values are automatically produced. To make this point clear let’s walk through a practical example: what happens when I compute the 3-day-Moving Average on my Daily Returns column? Pandas is going to replace today’s return with the average of the returns of the last 3 days. Now let’s suppose that the first entry in our dataframe corresponds to 20 April 2014. There is going to be no 3-day-Moving Average for 20 April 2014 as there are no 3 previous days. The same for 21 and 22 April 2014. Actually the first non missing day is going to be 23 April 2014 as it would be possible to compute the average of daily returns on the 3 previous days. Notice that the same issue (with slightly different results) rises with the other financial metrics are taken into account.

As follows I provide the code which takes care of all the previously discussed feature generation, plus solving a couple of intermediate issues.

– given the AdjClose and Returns, addFeatures() returns delta-Multiple Day Returns and delta-Returns Moving Average. This function is called for several deltas inside applyRollMeanDelayedReturns()

– given the list of datasets and the range of deltas to explore applyRollMeanDelayedReturns() adds features to each dataset and returns the augmented list

mergeDataframes() is fundamental as it takes the list of augmented datasets produced by applyRollMeanDelayedReturns() and merges all of them in the finance dataframe applying a time cut at the beginning of the series (all data after 1993 – I decided to implement this time cut as Australia ASX-200 data is not available before that time) and selecting only the relevant columns out of each dataframe. This is the step in which we get rid of all the Open, High, Low, Close, Volume, AdjClose columns in each dataset and keep only the Daily Returns, delta-Multiple Day Returns and delta-Returns Moving Average previously created.

I want to stress the specific merging Pandas command. This step is quite tricky as I’m concatenating dataframes by date index. The issue arising is that markets all over the world have not the same trading days due basically to different national holidays. So there is going to be an unavoidable annoying mismatch between dates. I faced the issue in the following way:

  1. Perform an OUTER JOIN of all the predictors. This command generates a UNION of all the columns, thus creating a bunch of NAs. I afterwards imputed them by linear interpolation. Let’s call the result of this operation PREDICTORS.
  2. Perform a S&P LEFT JOIN PREDICTORS. This line shrinks PREDICTORS to match S&P date index. Thus no additional NA will be created in the output dataframe.

applyTimeLag() achieves two goals:

  1. First of all takes the finance dataframe as input, focuses only on Daily Returns columns and generates Time Lagged Returns accordingly.
  2. Secondly, and maybe even more importantly, in this line of code

    the function shifts a day backwards S&P 500 daily returns. This step is crucial as, in order for my prediction to work correctly, I need to align tomorrow’s returns to today’s data. If I didn’t that I would forecast today’s exchange results!

Well, time to perform some Machine Learning on top of our brand new finance DataFrame.

by Francesco Pochetti