# Stock Trading Algorithm on top of Market Event Study

This post is the result of the first six weeks of the Computational Investing course I’m currently following on Coursera. The course is an introduction to Portfolio Management and Optimization in Python and lays the foundations for the second part of the track, which will deal with Machine Learning for Trading. Let’s move quickly to the core of the business.

The question I want to answer is the following:

• Is it possible to exploit event studies in a trading strategy?

First of all we should clarify what an event study is. As Wikipedia states, an event study is a statistical method to assess the impact of an event on the value of a firm. This definition is very broad and can easily incorporate facts directly concerning the company (e.g. the private life of the CEO, mergers with other firms, confidential news from insiders) as well as anomalous fluctuations in the price of the stock. I naively (and maybe incorrectly) categorized events regarding a company into these two types, news related and market related, but there should be little difference as they are generally tightly correlated. In any case, as it is not easy to access and parse news feeds in real time, we will focus on market related events, meaning that in the rest of the post an event must be intended as an anomalous behavior in the price of a stock whose consequences we could exploit to trade in a more efficient way.

Now that we have properly defined an event we can go back to the beginning and think a little bit more about what studying an event really means. To understand it let’s walk through a complete example and suppose that we have an event whenever the closing price of a stock at the end of day i is less than $10 whilst at the end of day i-1 it was more than $10. Thus we are examining a significant drop in the price of the stock. Given this definition the question is: what happens, statistically, to the prices of stocks experiencing this kind of fluctuation? Is there a trend that could somehow be exploited? The reasoning behind these questions is that if we knew in advance that a stock followed a specific pattern as a consequence of some event, we could adjust our trading strategy accordingly. If statistics suggests that the price is bound to increase, maybe it is a good idea to go long on the shares, whereas in the opposite case the best decision is to short.
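To make the definition concrete, here is a minimal pandas sketch of this event condition (the function name and the threshold argument are mine, not part of the course material):

```python
import pandas as pd

def detect_events(close: pd.Series, threshold: float = 10.0) -> pd.Series:
    """Return a boolean Series that is True on days where the closing price
    falls below `threshold` after having closed above it the day before."""
    prev = close.shift(1)                       # closing price of day i-1
    return (close < threshold) & (prev > threshold)
```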

In order to run an event study we take advantage of the EventProfiler class inside the QSTK library. This class allows us to define an event and then, given a time interval and a list of stocks, it works in the following way: it scrolls firm after firm and whenever it finds an event it marks that day as day 0. Then it takes the 20 days after and the 20 days before the event and saves the timeframe. After having analyzed all the stocks it aligns the events on day 0, averages all the prices before and after, and scales the result by the market (SPY). The output is a chart which basically answers this question: what happens on average when the closing price of a stock at the end of day i is less than $10 whilst at the end of day i-1 it was more than $10? The test period was the one between 1 January 2008 and 31 December 2009 (in the middle of the financial crisis), while the stocks chosen were the 500 contained in the S&P index in 2012. The graph is shown below and the following information can be extracted: first, 461 such events were registered during the investigated time frame. Second, on the day of the event there is a drop of about 10% in the stock price w.r.t. the day before. Third, the price seems to recover after day zero, even though the confidence intervals of the daily increase are huge.
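QSTK does all of this internally; just to clarify the mechanics, the following is a simplified, hand-rolled sketch of the same alignment-and-averaging procedure (it does not use the actual EventProfiler API, and dividing by SPY is only one of several possible ways to scale by the market):

```python
import numpy as np
import pandas as pd

def simple_event_profile(prices: pd.DataFrame, market: pd.Series,
                         lookback: int = 20, lookforward: int = 20) -> pd.Series:
    """Sketch: collect the +/-20 day windows around every event, align them on
    day 0 and average them, after scaling prices by the market benchmark."""
    rel_prices = prices.divide(market, axis=0)        # market-relative prices
    windows = []
    for sym in prices.columns:
        close = prices[sym]
        events = (close < 10.0) & (close.shift(1) > 10.0)
        for i in np.where(events.values)[0]:
            if i - lookback < 0 or i + lookforward >= len(close):
                continue                               # skip events too close to the edges
            win = rel_prices[sym].iloc[i - lookback:i + lookforward + 1].values
            windows.append(win / win[lookback])        # normalize to 1.0 on day 0
    index = range(-lookback, lookforward + 1)
    return pd.Series(np.mean(windows, axis=0), index=index, name="avg_relative_price")
```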

Now the idea is the following. If the observed behavior holds, what we can do is build a trading strategy consisting of buying on the day of the event and selling, let’s say, after 5 days (we don’t want to hold too long despite the price increasing almost monotonically). Just to recap, here you find the whole pipeline from event definition to portfolio assessment.

Now that we have a plan let’s dive into the code (you can find all the code on Github).

First of all I’ll introduce the two main functions, one after the other.

find_events(ls_symbols,  d_data,  shares=100):  given the list of the stocks in the portfolio, their historical prices and the number of shares to be traded, this function identifies events and issues a Buy Order on the day of the event and a Sell Order after 5 trading days. Finally it returns a csv file to be passed to the market simulator. The first lines of the csv file are previewed below (year, month, day, stock, order, shares).
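The actual implementation lives in the GitHub repository; the sketch below only illustrates the kind of logic involved, assuming d_data is a dict of DataFrames keyed by field as in the QSTK homeworks (names and layout are assumptions):

```python
import csv
import pandas as pd

def find_events(ls_symbols, d_data, shares=100, out_csv="orders.csv"):
    """Sketch: for every symbol, buy `shares` on the day the close drops below
    $10 and sell them 5 trading days later (or on the last available day)."""
    close = d_data["close"]                      # DataFrame: dates x symbols
    rows = []
    for sym in ls_symbols:
        prices = close[sym]
        events = (prices < 10.0) & (prices.shift(1) > 10.0)
        for i in range(len(prices)):
            if not events.iloc[i]:
                continue
            buy_day = prices.index[i]
            sell_day = prices.index[min(i + 5, len(prices) - 1)]
            rows.append([buy_day.year, buy_day.month, buy_day.day, sym, "Buy", shares])
            rows.append([sell_day.year, sell_day.month, sell_day.day, sym, "Sell", shares])
    with open(out_csv, "w", newline="") as f:
        csv.writer(f).writerows(sorted(rows))    # orders sorted by date
    return out_csv
```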

marketsim(investment, orders_file, out_file): given the initial investment in dollars ($50,000 in our case), the csv file containing all the orders (the output of find_events()) and the file in which to save the results of the simulation, this function places the orders in chronological order and automatically updates the value of the portfolio. It returns a csv file with the portfolio value over time, a plot comparing the portfolio performance against the market benchmark, and prints to screen a summary of the main financial metrics used to evaluate the portfolio.

main(): this function calls the previous two after getting and cleaning all the relevant data.

This is the output, as promised: well, despite the huge crisis (-19% market return) our trading strategy brought us a remarkable +19% gain! This was just an example, but a very powerful one to show the possibilities of event studies in finance.

by Francesco Pochetti

# Part VI – Trading Algorithm and Portfolio Performance

### Index

Now that we have a prediction we can also develop a trading strategy and test it against the real markets.

## Trading Strategy

The idea is the following. I built a forecasting algorithm, so now I know with a certain confidence whether tomorrow’s closing price will be higher or lower than today’s. How can I use this information? The approach I’m about to go through is explained in detail on QuantStart, a very nice website with great financial tutorials in Python; basically I picked their code and adapted it to my needs. The strategy is very basic and works in this way:

• if the probability of the day being “up” exceeds 50%, the strategy purchases 500 shares of the S&P 500 and sells them at the end of the day;
• if the probability of a down day exceeds 50%, the strategy sells 500 shares of the S&P 500 and then buys them back at the close.

I start with 100k US$ and buy and sell only playing with this amount of money.
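The real backtest_portfolio code (adapted from QuantStart) is reported in the Python Code section below; the following is only a condensed sketch of the same logic, with the open price approximated by the previous close, as discussed in the limitations right after:

```python
import numpy as np
import pandas as pd

def backtest_portfolio(close, up_probability, shares=500, initial_capital=100000.0):
    """Sketch: go long 500 shares when the predicted probability of an 'up' day
    exceeds 50%, otherwise go short, closing the position at the end of the day.
    `up_probability` is assumed to be aligned day by day with `close`."""
    signal = np.where(up_probability > 0.5, 1, -1)     # +1 = long, -1 = short
    # Daily P&L of the round trip: entry at (approximately) yesterday's close,
    # exit at today's close.
    daily_pnl = signal * shares * close.diff()
    portfolio = initial_capital + daily_pnl.fillna(0.0).cumsum()
    portfolio.name = "portfolio_value"
    return portfolio
```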

It is quite evident that this strategy has only learning purposes. Even though we could be successful and make some positive returns by the end of the test period, this approach is absolutely not applicable in real life for basically two reasons:

1. Transaction costs (such as commission fees) have not been added to this backtesting system. Since the strategy carries out a round-trip trade once per day, these fees are likely to significantly curtail the returns.
2. The strategy assumes that the closing price of today is going to be equal to the opening price of tomorrow which is unlikely to happen.

In any case I stress again that the purpose of this exercise is only a learning one, so it is worth going on and seeing how to implement this process in Python.

Basically everything is contained in the Python Code section. Instead of being too verbose in the post body, I believed that in this context it would be much better to comment directly inside the code, so you’ll find all the relevant explanations below.

## Portfolio Performance

This is maybe the most important part of all the blog posts I have written so far, as it summarizes in a single plot all the work done.

In the figure below (whose code you can find at the end in the Python Code section) there are two subplots:

1. S&P 500 Close Price in the period 1 April 2014 – 28 August 2014. This first graph shows the actual trend of the market index in the backtest period. In this particular period the market had a return of almost 6%.
2. Portfolio Value in the period 1 April 2014 – 28 August 2014. This graph shows the trend of the portfolio generated on top of our predictions. As you can see, the starting value is 100k \$, which ends up, after 5 months of trading, about 10% higher.

The results are quite good, and show the potential of this kind of approach. As I explained in all the recent posts there is much more work to be done and a lot to be improved. In any case I think that the whole process I just described can be the basis of a more robust pipeline.

Thanks a lot for reading and see you with the next project!

## Python Code

In the previous code snippet there are two calls to the following external functions:

1. getPredictionFromBestModel() : Function
2. backtest_portfolio() : Class Method

Below I provide the code for both of them, reporting right before each snippet the line at which it is called.

##### – getPredictionFromBestModel()

The Portfolio interface is provided at the end.
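The original snippet is not reproduced here; a minimal sketch of the reload-and-predict part could look as follows (the real function also rebuilds the features, which I omit; the 'Up' label matches the output column defined in Part I):

```python
import pickle

def getPredictionFromBestModel(model_path, X_test):
    """Sketch: reload the Random Forest pickled in Part V and return, for each
    test day, the predicted probability of an 'up' day."""
    with open(model_path, "rb") as f:
        model = pickle.load(f)
    # predict_proba returns one column per class; pick the one labelled 'Up'.
    up_index = list(model.classes_).index("Up")
    return model.predict_proba(X_test)[:, up_index]
```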

##### – Plotting Portfolio Performance with Matplotlib
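A minimal version of the plotting code could look like this (function and variable names are placeholders; the actual snippet is in the repository):

```python
import matplotlib.pyplot as plt

def plot_performance(sp500_close, portfolio_value, out_file="performance.png"):
    """Sketch: S&P 500 close price (top) and portfolio value (bottom)."""
    fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(10, 8))
    ax1.plot(sp500_close.index, sp500_close.values, color="steelblue")
    ax1.set_ylabel("S&P 500 Close Price ($)")
    ax1.set_title("S&P 500, 1 April 2014 - 28 August 2014")
    ax2.plot(portfolio_value.index, portfolio_value.values, color="darkgreen")
    ax2.set_ylabel("Portfolio Value ($)")
    ax2.set_title("Portfolio generated on top of the predictions")
    fig.autofmt_xdate()                       # tilt the date labels
    fig.savefig(out_file, bbox_inches="tight")
    plt.show()
```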

by Francesco Pochetti

# Part V – Results on Test Set

### Index

We closed the previous post with the results of Cross Validation.

Eventually we decided that our best combination is the following:

• Algorithm: Random Forests (n_estimators = 100)
• Features: n = 9 / delta  = 9

## Random Forests Results

From a strict Machine Learning point of view, now that we have picked model and features it is time for algorithm parameter selection. This goal can be achieved by performing another round of CV on the train set, looping over a set of parameters and then selecting the ones maximizing the average accuracy across folds. Obviously the parameters of interest change with the algorithm. In the Random Forests Scikit-Learn implementation there are several elements we are allowed to play with:

1. n_estimators (default = 10): The number of trees in the forest.
2. max_features (default = sqrt(n_features)): The number of features to consider when looking for the best split.
3. Other minor parameters such as the maximum depth of the tree, the minimum number of samples required to split an internal node, and so on and so forth.

The capability of these parameters to significantly affect the performance of a model strongly depends on the specific problem we are facing; there is no generic rule of thumb to stick to. It’s quite well known that increasing the number of estimators decreases the train error (and hopefully the test one too), but after a certain limit has been passed no relevant improvement will be recorded: we would only be raising the computational cost without a real benefit. For this reason I stuck to 100 trees and never changed it.

As for the max_features parameter, this can be a tricky one. Random Forests are just an improved evolution of the Bagging, or Bootstrap Aggregating, algorithm, which solves the innate high-variance problem of a single tree but introduces an unavoidable high correlation among all the bootstrapped trees, due to the fact that all the features are taken into account at each split. Thus, in the presence of a few dominant predictors, Bagging basically splits the trees always in the same way and we end up averaging a bunch of nearly identical trees. Random Forests solve this issue: the main improvement consists in the fact that only a random subset of the available features is considered each time a split is made. As a consequence the trees are largely decorrelated from one another and the final result is much more reliable. Theoretically speaking, if I set max_features equal to the total number of features I’d end up performing Bagging without even realizing it. At the end of the day this parameter may be pretty relevant; in any case the default one (square root of the number of predictors) is a very good balance in terms of bias-variance trade-off.
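For reference, such a parameter-selection round could be sketched as follows with a plain grid search (this uses the current scikit-learn module layout, which differs from the one available when these posts were written):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

def tune_random_forest(X_train, y_train):
    """Sketch: grid-search the two Random Forest parameters discussed above,
    scoring each combination by average CV accuracy on the training set."""
    param_grid = {
        "n_estimators": [50, 100, 200],
        "max_features": ["sqrt", "log2", None],   # None = use all features (plain Bagging)
    }
    grid = GridSearchCV(RandomForestClassifier(random_state=0),
                        param_grid, cv=5, scoring="accuracy")
    grid.fit(X_train, y_train)
    return grid.best_estimator_, grid.best_params_
```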

Let’s go back to our first concern, which was to measure the accuracy of the best CV model on the test set. To do that I train the model on the whole train set and report the accuracy on the test set. After the model has been trained it is saved to a file (.pickle) in order to be reused in the future, avoiding regenerating it from scratch each time we need it. The output of the code is provided below.
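A minimal sketch of this train-score-pickle step, under the assumption that the features and the Up/Down labels are already in place (file name and helper name are mine):

```python
import pickle
from sklearn.ensemble import RandomForestClassifier

def train_and_save(X_train, y_train, X_test, y_test, path="best_rf_model.pickle"):
    """Sketch: fit the chosen model on the whole train set, report test accuracy
    and pickle the fitted estimator so it can be reused without retraining."""
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    accuracy = model.score(X_test, y_test)        # fraction of correct Up/Down calls
    print("Test set accuracy: {:.3f}".format(accuracy))
    with open(path, "wb") as f:
        pickle.dump(model, f)
    return model, accuracy
```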

The accuracy of our final model is around 57%. This result is quite disappointing. We always have to keep in mind that we are performing a binary classification, in which case random guessing would have a success probability of 50%. So basically our best model is only 7% better than tossing a coin. I have to admit that I struggled with these accuracy figures for quite a bit, and after some reasoning and a lot of literature I arrived at the following conclusions.

As stressed at the very beginning of this series of posts, the Stock Market Prediction Problem is very tough. Lots of research has been produced on this topic, lots more will surely be produced in the future, and getting relevant results out of the blue is hard. We always have to remind ourselves that we are basically going against the Efficient Market Hypothesis (EMH), which asserts that markets are informationally efficient, meaning that they immediately adjust as soon as an event or a pattern is identified. As a result, forecasting in this kind of environment is very challenging.

The most common mistake is to believe that all the relevant information is inside the market itself. There is for sure some significant information beyond historical exchange data, but its extraction is not straightforward and can generally only be achieved with technical time series analysis and econometrics. It’s feasible to get something out of raw data with basic financial analysis, as I did, but the real information must be dug out in much greater depth.

In addition to this we must also account for the poor results of common Machine Learning algorithms. As far as I can tell from the available literature, much better performance can be obtained by implementing custom cost functions that take into account correlations and other more advanced metrics.

To conclude, I’m very happy with the process I followed and the pipeline I built up, but the results are evidently not satisfactory.

In any case, we finally have a model! The next step is to build a trading algorithm on top of our predictions and see what happens in real life with a backtest example.

# Part II – Feature Generation

### Index

In the last post I went through the project’s introduction and the data collection, together with a little bit of feature analysis. In this article I’ll deal with additional feature generation and lay the foundations of feature selection.

Basically I want to answer the following questions: after having collected financial historical data

• how do I get relevant information out of it?
• how do I add flexibility to my system plugging in artificially generated features?

## Old and New Features

Together with the choice of the algorithm, this is the most important question to be answered. It turns out that it is very hard to define a priori a good set of features. I’m not talking about feature selection; that topic will be addressed later using Cross Validation. I’m talking about the basic set of features to start playing with. Excluding the date, I could just have taken all 6 columns (Open, High, Low, Close, Volume, Adj Close) for the 8 selected indices (NASDAQ, Dow Jones, Frankfurt, London, Paris, Tokyo, Hong Kong, Australia) plus the output (S&P 500), merged them into a unique dataframe of 54 columns and run an algorithm on top of that.

This approach is quite naive, as it doesn’t really take any financial dynamics into account, plugging into the model absolute values of prices and not their fluctuations.

It would be much more informative to turn all the predictors into returns as well, to account for the variation of the predictors rather than sticking to their static values. To achieve that I focused on the Adjusted Close Price of each predictor, so out of the 54 possible columns I selected only 9 of them.

Then in order to account for time variations I decided to play with 4 basic financial metrics:

1. Daily Returns: percentage difference of the Adjusted Close Price of the i-th day compared to the (i-1)-th day.  $$Return_{i} = \frac{AdjClose_{i} - AdjClose_{i-1}}{AdjClose_{i-1}}$$
2. Multiple Day Returns: percentage difference of the Adjusted Close Price of the i-th day compared to the (i-delta)-th day. Example: the 3-day Return is the percentage difference between today’s Adjusted Close Price and the one of 3 days ago. $$Return_{\delta} = \frac{AdjClose_{i} - AdjClose_{i-\delta}}{AdjClose_{i-\delta}}$$
3. Returns Moving Average: average of the daily returns over the last delta days. Example: the 3-day Moving Average is the average of the daily returns of the last 3 days. $$MovingAverage_{\delta} = \frac{Return_{i} + Return_{i-1} + Return_{i-2} + \dots + Return_{i-\delta+1}}{\delta}$$
4. Time Lagged Returns: shift the daily returns n days backwards. Example: if n = 1, today’s row carries yesterday’s Return. (A minimal pandas sketch of these four metrics follows right after this list.)
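Here is the promised sketch for a single Adjusted Close series, using the current pandas API (column names are placeholders; the actual helper functions are described further down):

```python
import pandas as pd

def add_return_features(adj_close: pd.Series, delta: int = 3, lag: int = 1) -> pd.DataFrame:
    """Sketch: daily returns, delta-day returns, delta-day moving average of
    returns and lagged returns for a single Adjusted Close series."""
    out = pd.DataFrame(index=adj_close.index)
    out["Return"] = adj_close.pct_change()                  # (P_i - P_{i-1}) / P_{i-1}
    out["Return_%d" % delta] = adj_close.pct_change(delta)  # (P_i - P_{i-delta}) / P_{i-delta}
    # Average of the last `delta` daily returns (window includes the current day).
    out["RollMeanReturn_%d" % delta] = out["Return"].rolling(delta).mean()
    out["Return_lag%d" % lag] = out["Return"].shift(lag)    # yesterday's return in today's row
    return out
```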

Thus to recap this is the process I follow to build my dataset:

1. I start with 8 basic predictors (the Adjusted Close Prices of the 8 world major stock indices) + 1 output/predictor (the Adjusted Close Price of the S&P 500). Keep in mind that despite the S&P daily returns being my predicted values, I still want to keep inside my model some information regarding Standard & Poor’s itself. I hope this is not confusing.
2. I compute the daily returns of each of them.
3. I add features to the DataFrame using the 4 financial metrics described above. Thus, basically playing around with n and delta, I can generate as many features as I want, adding more and more flexibility to my dataframe.
4. I get rid of the Adjusted Close Price I had at the beginning, ending up with a perfectly scaled dataset. No need for normalization as all the data I generated are in the same range (as you probably noticed we basically performed always the same kind of returns computations).
5. Notice that a bunch of missing values are automatically produced. To make this point clear let’s walk through a practical example: what happens when I compute the 3-day Moving Average on my Daily Returns column? Pandas is going to replace today’s return with the average of the returns of the last 3 days. Now let’s suppose that the first entry in our dataframe corresponds to 20 April 2014. There is going to be no 3-day Moving Average for 20 April 2014 as there are no 3 previous days. The same goes for 21 and 22 April 2014. Actually the first non-missing day is going to be 23 April 2014, as only then is it possible to compute the average of the daily returns of the 3 previous days. Notice that the same issue (with slightly different results) arises when the other financial metrics are taken into account.

Below I provide the code which takes care of all the previously discussed feature generation, plus the solution of a couple of intermediate issues.

– given the AdjClose and Returns, addFeatures() returns delta-Multiple Day Returns and delta-Returns Moving Average. This function is called for several deltas inside applyRollMeanDelayedReturns()

– given the list of datasets and the range of deltas to explore applyRollMeanDelayedReturns() adds features to each dataset and returns the augmented list

mergeDataframes() is fundamental as it takes the list of augmented datasets produced by applyRollMeanDelayedReturns() and merges all of them into the finance dataframe, applying a time cut at the beginning of the series (all data after 1993 – I decided to implement this cut as Australia ASX-200 data is not available before that time) and selecting only the relevant columns out of each dataframe. This is the step in which we get rid of all the Open, High, Low, Close, Volume, AdjClose columns in each dataset and keep only the Daily Returns, delta-Multiple Day Returns and delta-Returns Moving Average previously created.

I want to stress the specific Pandas merging command. This step is quite tricky as I’m concatenating dataframes by date index. The issue is that markets all over the world do not share the same trading days, basically due to different national holidays, so there is going to be an unavoidable, annoying mismatch between dates. I faced the issue in the following way:

1. Perform an OUTER JOIN of all the predictors. This command generates a UNION of all the columns, thus creating a bunch of NAs. I afterwards imputed them by linear interpolation. Let’s call the result of this operation PREDICTORS.
2. Perform a S&P LEFT JOIN PREDICTORS. This line shrinks PREDICTORS to match the S&P date index, so that no additional NAs are created in the output dataframe. (A minimal sketch of the two joins follows this list.)
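A condensed sketch of the two joins (the real mergeDataframes() also applies the 1993 time cut and the column selection; names are placeholders, and the predictor columns are assumed to already carry distinct names):

```python
import pandas as pd

def merge_predictors(sp500: pd.DataFrame, predictors: list) -> pd.DataFrame:
    """Sketch: OUTER JOIN of all predictor frames (union of dates, NAs filled
    by linear interpolation), then a LEFT JOIN onto the S&P 500 date index."""
    merged = pd.concat(predictors, axis=1, join="outer")   # union of all trading days
    merged = merged.interpolate(method="linear")           # impute holiday gaps
    finance = sp500.join(merged, how="left")               # shrink to S&P 500 dates
    return finance
```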

applyTimeLag() achieves two goals:

1. First of all, it takes the finance dataframe as input, focuses only on the Daily Returns columns and generates the Time Lagged Returns accordingly.
2. Secondly, and maybe even more importantly, in this line of code

the function shifts the S&P 500 daily returns one day backwards. This step is crucial as, in order for my prediction to work correctly, I need to align tomorrow’s returns to today’s data. If I didn’t do that, I would be forecasting today’s exchange results!
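In current pandas terms, the alignment boils down to something like this (the column name is a placeholder):

```python
import pandas as pd

def align_target(finance: pd.DataFrame, target_col: str = "SP500_Return") -> pd.DataFrame:
    """Sketch: shift the S&P 500 daily returns one day backwards so that the
    value sitting on row i is the return of day i+1, i.e. what we want to predict."""
    finance = finance.copy()
    finance[target_col] = finance[target_col].shift(-1)
    return finance.dropna(subset=[target_col])    # the last row has no 'tomorrow'
```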

Well, time to perform some Machine Learning on top of our brand new finance DataFrame.

by Francesco Pochetti

# Part I – Stock Market Prediction in Python Intro

This is the first of a series of posts summarizing the work I’ve done on Stock Market Prediction as part of my portfolio project at Data Science Retreat.

The scope of this post is to get an overview of the whole work, specifically walking through the foundations and core ideas.

First of all I provide the list of modules needed to have the Python code running correctly in all the following posts. I import them only once at the beginning and that’s it.
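As an indication, a minimal set of imports along these lines covers the snippets discussed in this series (the posts originally relied on pandas.io.data, whose functionality has since moved to the pandas_datareader package):

```python
import datetime
import pickle

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas_datareader import data as web            # historical prices (Yahoo/Quandl)
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
```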

Nice! Now we can start!

## Introduction

The idea at the base of this project is to build a model to predict financial market movements. The forecasting algorithm aims to foresee whether tomorrow’s exchange closing price is going to be lower or higher with respect to today’s. The next step will be to develop a trading strategy on top of that, based on our predictions, and backtest it against a benchmark.

Specifically, I’ll go through the pipeline, decision process and results I obtained trying to model S&P 500 daily returns.

My whole work will be structured as follows:

## Problem Definition

The aim of the project is to predict whether future daily returns of the S&P 500 are going to be positive or negative.

Thus the problem I’m facing is a binary classification.

The metric I deal with is daily return which is computed as follows:

$$Return_{i} = \frac{AdjClose_{i} - AdjClose_{i-1}}{AdjClose_{i-1}}$$

The Return on the i-th day is equal to the Adjusted Close Price on the i-th day minus the Adjusted Close Price on the (i-1)-th day, divided by the Adjusted Close Price on the (i-1)-th day. The Adjusted Close Price of a stock is its close price modified to take dividends into account. It is common practice to use this metric in return computations.

Since the beginning I decided to focus only on the S&P 500, a stock market index based on the market capitalizations of 500 large companies having common stock listed on the NYSE (New York Stock Exchange) or NASDAQ. Being such a diversified portfolio, the S&P 500 index is typically used as a market benchmark, for example to compute the betas of companies listed on the exchange.

## Feature Analysis

The main idea is to use the world’s major stock indices as input features for the machine learning based predictor. The intuition behind this approach is that globalization has deepened the interaction between financial markets around the world. The shock wave of the US financial crisis (starting with the Lehman Brothers collapse) hit the economy of almost every country, and the debt crisis that originated in Greece dragged down all major stock indices. Nowadays, no financial market is isolated: economic data, political perturbations and any other overseas affairs can cause dramatic fluctuations in domestic markets. A “bad day” on the Australian or Japanese exchange is going to heavily affect Wall Street’s opening and trend. In light of the previous considerations the following predictors have been selected: NASDAQ, Dow Jones, Frankfurt, London, Paris, Tokyo, Hong Kong and Australia, plus the S&P 500 itself.

It is very easy to get historical daily prices for the previous indices: Python provides easy libraries to handle the download. The data can be pulled down from Yahoo Finance or Quandl and cleanly formatted into a dataframe with the following columns:

• Date : in days
• Open : price of the stock at the opening of the trading (in US dollars)
• High : highest price of the stock during the trading day (in US dollars)
• Low : lowest price of the stock during the trading day (in US dollars)
• Close : price of the stock at the closing of the trading (in US dollars)
• Volume : number of shares traded during the day
• Adj Close : price of the stock at the closing of the trading adjusted with dividends (in US dollars)

The following is a screenshot of the Yahoo Finance website showing a subset of NASDAQ Composite historical prices. This is exactly how the Pandas DataFrame looks after the data has been downloaded.

## Output of Prediction

How do I plug the desired output of my prediction into my dataframe? The answer is pretty straightforward and basically consists in repeating the exact same steps followed for the predictors. Thus eventually, together with the 8 selected major stock indices, we’ll end up downloading a 9th dataset for the S&P 500. Notice that the output of our prediction is a binary classification; we want to be able to answer the following question: is tomorrow going to be an Up or Down day? In order to do that the S&P data must undergo a simple two-step manipulation (sketched right after the list):

1. Compute S&P 500 daily returns (we’ll do this for predictors as well, as discussed in the next post).
2. Generate an additional column in the DataFrame with ‘Up’ whenever the return on that specific day was positive and ‘Down’ whenever it was negative.
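Sketched in pandas (the column names follow the Yahoo layout listed in the previous section; the helper name is mine):

```python
import numpy as np
import pandas as pd

def add_updown_label(sp500: pd.DataFrame) -> pd.DataFrame:
    """Sketch: compute S&P 500 daily returns and turn them into the
    binary Up/Down output column."""
    sp500 = sp500.copy()
    sp500["Return"] = sp500["Adj Close"].pct_change()      # first row will be NaN
    sp500["UpDown"] = np.where(sp500["Return"] > 0, "Up", "Down")
    return sp500
```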

This passage of the pipeline is actually very important and it must be absolutely clear, so I’ll spend a couple of words in addition to what I’ve already written. As I stressed, the output of my prediction is whether the S&P 500 daily returns are positive or not. To carry out this kind of prediction I use the following indices: NASDAQ, Dow Jones, Frankfurt, London, Paris, Tokyo, Hong Kong, Australia and the S&P 500 itself. Obviously I won’t use S&P 500 daily returns to forecast S&P 500 daily returns! That would not make sense. What I mean by S&P 500 itself is that I’ll play with S&P 500 historical close prices, lagging them in time accordingly. The intuition is that I do not want to lose any potential information contained in the output data.

So to recap the logic is the following:

1. Download 9 dataframes (NASDAQ, Dow Jones, Frankfurt, London, Paris, Tokyo, Hong Kong, Australia, S&P 500).
2. Compute S&P 500 daily returns and turn them into a binary variable (Up, Down). This is my output and it won’t be touched anymore.
3. Play with all the other columns of the 9 available dataframes (S&P 500 included) as explained in the following post.

For the sake of completeness I attach the Python code in charge of data gathering and very first preparation:
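The snippet itself is not reproduced here; a minimal sketch of the same step using today's pandas_datareader follows (ticker symbols are indicative, and the Yahoo endpoint has changed over the years, so treat this as a template rather than the original code):

```python
import datetime
from pandas_datareader import data as web

def load_indices(start=datetime.datetime(1993, 1, 1),
                 end=datetime.datetime(2014, 8, 28)):
    """Sketch: download daily OHLCV data for the S&P 500 and the eight
    predictor indices from Yahoo Finance into a dict of DataFrames."""
    symbols = {
        "sp500": "^GSPC", "nasdaq": "^IXIC", "djia": "^DJI",
        "frankfurt": "^GDAXI", "london": "^FTSE", "paris": "^FCHI",
        "tokyo": "^N225", "hongkong": "^HSI", "australia": "^AXJO",
    }
    return {name: web.DataReader(ticker, "yahoo", start, end)
            for name, ticker in symbols.items()}
```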

Let’s move to the details of feature generation.

by Francesco Pochetti