This is the first of a series of posts summarizing the work I’ve done on Stock Market Prediction as part of my portfolio project at Data Science Retreat.
The scope of this post is to get an overview of the whole work, specifically walking through the foundations and core ideas.
First of all I provide the list of modules needed to have the Python code running correctly in all the following posts. I import them only once at the beginning and that’s it.
import cPickle
import numpy as np
import pandas as pd
import datetime
from sklearn import preprocessing
from datetime import datetime
from sklearn.ensemble import RandomForestClassifier
from sklearn import neighbors
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC
import operator
import pandas.io.data
from sklearn.qda import QDA
import re
from dateutil import parser
from backtest import Strategy, Portfolio
Nice! Now we can start!
Introduction
The idea at the base of this project is to build a model to predict financial market movements. The forecasting algorithm aims to foresee whether tomorrow’s exchange closing price is going to be lower or higher than today’s. The next step will be to develop a trading strategy on top of that, based on our predictions, and backtest it against a benchmark.
Specifically, I’ll go through the pipeline, decision process and results I obtained trying to model S&P 500 daily returns.
My whole work will be structured as follows:
- Introduction and Discussion of the Problem
- Feature Generation
- Classification Algorithms
- Feature/Model Selection
- Results on Test Set
- Trading Algorithm and Portfolio Performance
Problem Definition
The aim of the project is to predict whether future daily returns of the S&P 500 are going to be positive or negative.
Thus the problem I’m facing is a binary classification.
The metric I deal with is the daily return, which is computed as follows:
$$ Return_{i} = \frac{AdjClose_{i} - AdjClose_{i-1}}{AdjClose_{i-1}} $$
The Return on the i-th day is equal to the Adjusted Close Price on the i-th day minus the Adjusted Close Price on the (i-1)-th day, divided by the Adjusted Close Price on the (i-1)-th day. The Adjusted Close Price of a stock is its close price modified to take dividends into account. It is common practice to use this metric in return computations.
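As a quick sanity check of the formula, here is a minimal sketch on made-up adjusted close values (the numbers are invented for illustration and are not market data). It computes the return by hand and compares it with pandas' `pct_change`, the method the download helpers in this post rely on.

```python
import numpy as np
import pandas as pd

# Hypothetical adjusted close prices for four consecutive trading days.
adj_close = pd.Series([100.0, 102.0, 99.96, 99.96], name="AdjClose")

# Return_i = (AdjClose_i - AdjClose_{i-1}) / AdjClose_{i-1}
previous_close = adj_close.shift(1)
manual_returns = (adj_close - previous_close) / previous_close

# pct_change applies the same formula; the first day has no previous
# close, so its return is NaN.
pandas_returns = adj_close.pct_change()

print(manual_returns.round(4).tolist())
print(pandas_returns.round(4).tolist())
```

Both lists should agree: the first entry is undefined, and every later entry is the relative change with respect to the previous day.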
Since the beginning I decided to focus only on the S&P 500, a stock market index based on the market capitalizations of 500 large companies having common stock listed on the NYSE (New York Stock Exchange) or NASDAQ. Being such a diversified portfolio, the S&P 500 index is typically used as a market benchmark, for example to compute betas of companies listed on the exchange.
Feature Analysis
The main idea is to use the world’s major stock indices as input features for the machine learning based predictor. The intuition behind this approach is that globalization has deepened the interaction between financial markets around the world. The shock wave of the US financial crisis (triggered by the Lehman Brothers collapse) hit the economy of almost every country, and the debt crisis that originated in Greece brought down all major stock indices. Nowadays, no financial market is isolated. Economic data, political perturbations and other overseas events can cause dramatic fluctuations in domestic markets. A “bad day” on the Australian or Japanese exchange is likely to affect Wall Street’s opening and trend. In the light of the previous considerations the following predictors have been selected:
- NASDAQ Composite (^IXIC Yahoo Finance)
- Dow Jones Industrial Average (^DJI Quandl)
- Frankfurt DAX (^GDAXI Yahoo Finance)
- London FTSE-100 (^FTSE Yahoo Finance)
- Paris CAC 40 (^FCHI Yahoo Finance)
- Tokyo Nikkei-225 (^N225 Yahoo Finance)
- Hong Kong Hang Seng (^HSI Yahoo Finance)
- Australia ASX-200 (^AXJO Yahoo Finance)
It is very easy to get historical daily prices for these indices. Python has libraries that handle the download. The data can be pulled down from Yahoo Finance or Quandl and cleanly formatted into a dataframe with the following columns:
- Date : in days
- Open : price of the stock at the opening of the trading (in US dollars)
- High : highest price of the stock during the trading day (in US dollars)
- Low : lowest price of the stock during the trading day (in US dollars)
- Close : price of the stock at the closing of the trading (in US dollars)
- Volume : number of shares traded during the day
- Adj Close : price of the stock at the closing of the trading adjusted with dividends (in US dollars)
The following is a screenshot of the Yahoo Finance website showing a subset of NASDAQ Composite historical prices. This is exactly what a pandas DataFrame looks like after the data has been downloaded.
Output of Prediction
How do I plug the desired output of my prediction into my dataframe? The answer is pretty straightforward and basically consists of repeating the exact same steps followed for the predictors. Thus eventually, together with the 8 selected major stock indices, we’ll end up downloading a 9th dataset for the S&P 500. Notice that the output of our prediction is a binary class; we want to be able to answer the following question: is tomorrow going to be an Up or Down day? In order to do that, the S&P data must undergo a simple two-step manipulation:
- Compute S&P 500 daily returns (we’ll do this for predictors as well, as discussed in the next post).
- Generate an additional column in the DataFrame with ‘Up’ whenever the return on that specific day was positive and ‘Down’ whenever it was negative.
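The two steps above can be sketched as follows. This is only an illustration on invented returns: the column name `UpDown` is hypothetical, `Return_Out` follows the naming used by the download code in this post, and labelling a return of exactly zero as ‘Down’ is an assumption, since the text only covers strictly positive and strictly negative returns.

```python
import numpy as np
import pandas as pd

# Hypothetical S&P 500 daily returns; the first row is NaN because
# there is no previous close to compare with.
sp500 = pd.DataFrame(
    {"Return_Out": [np.nan, 0.012, -0.004, 0.0, 0.007]},
    index=pd.date_range("2014-01-06", periods=5, freq="B"),
)

# Drop the undefined first return before labelling.
sp500 = sp500.dropna(subset=["Return_Out"]).copy()

# 'Up' for strictly positive returns, 'Down' otherwise.
# Assumption: a return of exactly zero is labelled 'Down'.
sp500["UpDown"] = np.where(sp500["Return_Out"] > 0, "Up", "Down")

print(sp500)
```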
This step of the pipeline is actually very important and it must be absolutely clear, so I’ll spend a few more words on it. As I stressed, the output of my prediction is whether S&P 500 daily returns are positive or not. To carry out this kind of prediction I use the following indices: NASDAQ, Dow Jones, Frankfurt, London, Paris, Tokyo, Hong Kong, Australia and S&P 500 itself. Obviously I won’t use the S&P 500 return of a given day to forecast that same return! This would not make sense. What I mean by S&P 500 itself is that I’ll work with S&P 500 historical close prices, lagged in time. The intuition is that I do not want to lose any potential information contained in the output data.
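To make the idea of lagging concrete, here is a minimal sketch on invented prices. The lag values and the `_lag` column names are assumptions chosen for illustration; this post does not fix them. The point is only that each row ends up holding values from earlier days, never the same-day return being predicted.

```python
import pandas as pd

# Hypothetical adjusted close prices of the S&P 500.
sp500 = pd.DataFrame(
    {"AdjClose_Out": [1800.0, 1810.0, 1805.0, 1820.0, 1815.0]},
    index=pd.date_range("2014-01-06", periods=5, freq="B"),
)
sp500["Return_Out"] = sp500["AdjClose_Out"].pct_change()

# Lagged copies of the return: on each row they hold the return
# observed 1 and 2 trading days earlier, so they are already known
# before that day's close.
for lag in (1, 2):
    sp500["Return_Out_lag%d" % lag] = sp500["Return_Out"].shift(lag)

# Rows without a full history contain NaN and are dropped.
features = sp500.dropna()
print(features)
```

With five days of prices and a maximum lag of two, only the last two rows have a complete history.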
So, to recap, the logic is the following:
- Download 9 dataframes (NASDAQ, Dow Jones, Frankfurt, London, Paris, Tokyo, Hong Kong, Australia, S&P 500).
- Compute S&P 500 daily returns and turn them into a binary variable (Up, Down). This is my output and it won’t be touched anymore.
- Play with all the other columns of the 9 available dataframes (S&P 500 included) as explained in the following post.
For the sake of completeness I attach the Python code in charge of data gathering and very first preparation:
def getStock(symbol, start, end):
    """
    Downloads Stock from Yahoo Finance.
    Computes daily Returns based on Adj Close.
    Returns pandas dataframe.
    """
    df = pd.io.data.get_data_yahoo(symbol, start, end)

    # the last column is the adjusted close: rename it,
    # then tag every column with the symbol
    df.columns.values[-1] = 'AdjClose'
    df.columns = df.columns + '_' + symbol
    df['Return_%s' %symbol] = df['AdjClose_%s' %symbol].pct_change()

    return df

def getStockFromQuandl(symbol, name, start, end):
    """
    Downloads Stock from Quandl.
    Computes daily Returns based on Adj Close.
    Returns pandas dataframe.
    """
    import Quandl
    df = Quandl.get(symbol, trim_start = start, trim_end = end, authtoken="your token")

    # same renaming as in getStock, using a readable name instead of the Quandl code
    df.columns.values[-1] = 'AdjClose'
    df.columns = df.columns + '_' + name
    df['Return_%s' %name] = df['AdjClose_%s' %name].pct_change()

    return df

def getStockDataFromWeb(fout, start_string, end_string):
    """
    Collects predictors data from Yahoo Finance and Quandl.
    Returns a list of dataframes.
    """
    start = parser.parse(start_string)
    end = parser.parse(end_string)

    # predictors
    nasdaq = getStock('^IXIC', start, end)
    frankfurt = getStock('^GDAXI', start, end)
    london = getStock('^FTSE', start, end)
    paris = getStock('^FCHI', start, end)
    hkong = getStock('^HSI', start, end)
    nikkei = getStock('^N225', start, end)
    australia = getStock('^AXJO', start, end)

    djia = getStockFromQuandl("YAHOO/INDEX_DJI", 'Djia', start_string, end_string)

    # output of the prediction (fout is the symbol to forecast)
    out = pd.io.data.get_data_yahoo(fout, start, end)
    out.columns.values[-1] = 'AdjClose'
    out.columns = out.columns + '_Out'
    out['Return_Out'] = out['AdjClose_Out'].pct_change()

    return [out, nasdaq, djia, frankfurt, london, paris, hkong, nikkei, australia]
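To see what each downloaded dataframe looks like after this preparation without hitting the network, here is a small offline sketch. The helper `add_suffix_and_return` is hypothetical and only mirrors the renaming and return steps of `getStock`; the rows are made up and merely shaped like a Yahoo Finance download.

```python
import pandas as pd


def add_suffix_and_return(df, name):
    """Hypothetical helper mirroring the post-download steps of getStock."""
    df = df.copy()
    cols = list(df.columns)
    cols[-1] = "AdjClose"  # the last column is the adjusted close
    df.columns = [c + "_" + name for c in cols]
    df["Return_" + name] = df["AdjClose_" + name].pct_change()
    return df


# Made-up rows shaped like a Yahoo Finance download.
raw = pd.DataFrame(
    {
        "Open": [10.0, 10.5, 10.4],
        "High": [10.6, 10.8, 10.9],
        "Low": [9.9, 10.3, 10.2],
        "Close": [10.5, 10.4, 10.8],
        "Volume": [1000, 1200, 900],
        "Adj Close": [10.5, 10.4, 10.8],
    },
    index=pd.to_datetime(["2014-01-06", "2014-01-07", "2014-01-08"]),
)

nasdaq = add_suffix_and_return(raw, "^IXIC")
print(list(nasdaq.columns))
```

Suffixing every column with the index name keeps the columns distinguishable once the nine dataframes are combined.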
Let’s move to the details of feature generation.