Monthly Archives: September 2014

Part VI – Trading Algorithm and Portfolio Performance


  1. Introduction and Discussion of the Problem
  2. Feature Generation
  3. Classification Algorithms
  4. Feature/Model Selection
  5. Results on Test Set
  6. Trading Algorithm and Portfolio Performance

Now that we have a prediction we can also develop a trading strategy and test it against the real markets.

Trading Strategy

The idea is the following. I built a forecasting algorithm, and now I know with a certain confidence whether tomorrow’s closing price will be higher or lower than today’s closing price. How can I use this information?

The idea I’m about to go through is explained pretty much in detail on QuantStart, a very nice website with great financial tutorials in Python. Basically I picked their code and adapted it to my needs.

The strategy is very basic and works in this way: if the probability of the day being “up” exceeds 50%, the strategy purchases 500 shares of S&P 500 and sells them at the end of the day. If the probability of a “down” day exceeds 50%, the strategy sells 500 shares of S&P 500 and buys them back at the close. I start with 100k US $ and trade using only this amount of money.
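A minimal sketch of this rule, not the QuantStart code itself. It assumes a pandas Series of closing prices and a Series of daily signals (+1 for a predicted up day, -1 for a predicted down day); the function and variable names are hypothetical. Each day's profit is booked as the close-to-close price change, which bakes in the simplifying assumption that today's open equals yesterday's close.

```python
import pandas as pd

def backtest_daily_signals(close, signal, shares=500, initial_capital=100_000.0):
    """Portfolio value over time for a fixed-size long/short daily strategy.

    close  : Series of closing prices (assumed input)
    signal : Series aligned with close, +1 (long), -1 (short) or 0 (no trade)
    """
    # Price change attributed to each day (first day has no previous close).
    price_change = close.diff().fillna(0.0)
    # Long positions gain when the price rises, short positions when it falls.
    daily_pnl = signal * shares * price_change
    return initial_capital + daily_pnl.cumsum()

# Toy prices and signals, made up for illustration only.
close = pd.Series([100.0, 101.0, 99.0, 102.0])
signal = pd.Series([0, 1, 1, -1])
value = backtest_daily_signals(close, signal)
print(value.tolist())
```

Commissions and slippage are deliberately left out here, so the curve this produces is an upper bound on what the rule could earn.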

It is quite evident that this strategy serves only learning purposes. Even though we could be successful and end the test period with some positive returns, this approach is absolutely not applicable in real life, for basically two reasons:

  1. Transaction costs (such as commission fees) have not been added to this backtesting system. Since the strategy carries out a round-trip trade once per day, these fees are likely to significantly curtail the returns.
  2. The strategy assumes that the closing price of today is going to be equal to the opening price of tomorrow which is unlikely to happen.

In any case, I stress again that the purpose of this exercise is only a learning one, so it is worth going on to see how to implement this process in Python.

Basically everything is contained in the Python Code section. Instead of being too verbose in the post body, I believed that in this context it would be much better to comment directly inside the code, so you’ll find all the relevant explanations below.

Portfolio Performance

This is maybe the most important part of all the blog posts I have written so far, as it summarizes in a single plot all the work done.

In the figure below (whose code you can find at the end in the Python Code section) there are two subplots:

  1. S&P 500 Close Price in the period 1 April 2014 – 28 August 2014. This first graph shows the actual trend of the market index in the backtest period. In this particular period the market had a return of almost 6%.
  2. Portfolio Value in the period 1 April 2014 – 28 August 2014. This graph shows the trend of the portfolio generated on top of our predictions. As you can see, the start value is 100k $, which ends up, after 5 months of trading, with a return of about 10%.
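Both percentages are total returns over the backtest window. As a minimal sketch of that arithmetic (the numbers below are made up for illustration and are not the actual backtest series, and `total_return` is a hypothetical helper):

```python
def total_return(values):
    """Fractional change between the first and the last value of a series."""
    return values[-1] / values[0] - 1.0

# Hypothetical portfolio values, chosen only to illustrate the computation.
portfolio = [100_000.0, 104_000.0, 110_000.0]
print(f"{total_return(portfolio):.1%}")
```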


The results are quite good, and show the potential of this kind of approach. As I explained in all the recent posts there is much more work to be done and a lot to be improved. In any case I think that the whole process I just described can be the base of a more robust pipeline.

Thanks a lot for reading and see you with the next project!

Python Code

In the previous code snippet there are calls to the following external functions:

  1. getPredictionFromBestModel() : Function
  2. MarketIntradayPortfolio() : Class
  3. backtest_portfolio() : Class Method

Below I provide the code for all of them adding the line at which they were called right before the code itself.

– getPredictionFromBestModel()

– MarketIntradayPortfolio(Portfolio)

The Portfolio interface is provided at the end

– backtest_portfolio()

 – Plotting Portfolio Performance with Matplotlib



by Francesco Pochetti

Part V – Results on Test Set



We closed the previous post with the results of Cross Validation.

Eventually we decided that our best combination is the following:

  • Algorithm: Random Forests (n_estimators = 100)
  • Features: n = 9 / delta = 9

Random Forests Results

From a strict Machine Learning point of view, now that we have picked model and features it should be time for algorithm parameter selection. This goal can be achieved by performing another round of CV on the train set, looping over a set of parameters and then selecting the ones that maximize the average accuracy across folds. Obviously the parameters of interest change with the algorithm. In the Scikit-Learn implementation of Random Forests there are several elements we are allowed to play with:

  1. n_estimators (default = 10 at the time of writing; 100 since Scikit-Learn 0.22): The number of trees in the forest.
  2. max_features (default = sqrt(n_features)): The number of features to consider when looking for the best split.
  3. Other minor parameters such as the maximum depth of the tree, the minimum number of samples required to split an internal node, and so forth.
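The parameter-selection round described above could look roughly like the sketch below. It is not the code used in this project: the data is a synthetic stand-in, and Scikit-Learn's `TimeSeriesSplit` is swapped in for my own time-ordered CV function so that the example stays self-contained.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Synthetic stand-in for the real train set (assumption, not the blog's data).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 9))
y_train = (X_train[:, 0] + rng.normal(scale=2.0, size=300) > 0).astype(int)

cv = TimeSeriesSplit(n_splits=5)  # ordered folds, no shuffling
scores = {}
# None means "use all features at every split", which is plain Bagging.
for max_features in ("sqrt", 0.5, None):
    clf = RandomForestClassifier(
        n_estimators=100, max_features=max_features, random_state=0
    )
    fold_acc = cross_val_score(clf, X_train, y_train, cv=cv, scoring="accuracy")
    scores[max_features] = fold_acc.mean()

best = max(scores, key=scores.get)
print("best max_features:", best, "| average accuracy:", round(scores[best], 3))
```

The same loop extends to n_estimators or the tree-depth parameters by nesting one more iteration, at a proportional cost in computation.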

The capability of these parameters to significantly affect the performance of a model strongly depends on the specific problem we are facing. There is no generic rule of thumb to stick to. It’s quite well known that increasing the number of estimators decreases the train error (and hopefully the test one too), but after a certain limit has been passed no relevant improvement will be recorded; we would only be raising the computation cost without a real benefit. For this reason I stuck to 100 trees and never changed it.

As for the max_features parameter, this can be a tricky one. Random Forests is an evolution of the Bagging (Bootstrap Aggregating) algorithm. Bagging solves the innate high-variance problem of a single tree, but introduces an unavoidable high correlation among all the bootstrapped trees, because all the features are taken into account at each split. Thus, in the presence of a few dominant predictors, Bagging basically splits the trees always in the same way and we end up averaging a bunch of nearly identical trees. Random Forests was introduced to solve this issue. The main improvement is that only a random subset of the available features is considered each time a split is evaluated. As a consequence the trees are much less correlated with one another and the final result is much more reliable. Theoretically speaking, if I set max_features equal to the total number of features I’d end up performing Bagging without even realizing it. At the end of the day this parameter may be pretty relevant. In any case the default (square root of the number of predictors) is a very good balance in terms of bias-variance trade-off.

Let’s go back to our first concern, which was to measure the accuracy of the best CV model on the test set. To do that I train the model on the whole train set and report the accuracy on the test set. After the model has been trained it is saved to a file (.pickle) so that it can be reused in the future without generating it from scratch each time we need it. The output of the code is provided below.
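As a rough illustration of the train / save / reload / score sequence (a sketch on synthetic data, with a hypothetical file name, not the original script or its output):

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic, time-ordered stand-in for the real train and test sets (assumption).
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 9))
y = (X[:, 0] + rng.normal(scale=2.0, size=400) > 0).astype(int)
X_train, y_train, X_test, y_test = X[:300], y[:300], X[300:], y[300:]

# Train on the whole train set.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Save the fitted model so it does not have to be rebuilt every time.
path = os.path.join(tempfile.mkdtemp(), "best_model.pickle")  # hypothetical name
with open(path, "wb") as f:
    pickle.dump(model, f)

# Later: reload it and measure the accuracy on the held-out test set.
with open(path, "rb") as f:
    reloaded = pickle.load(f)

accuracy = reloaded.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.3f}")
```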

The accuracy of our final model is around 57%. This result is quite disappointing. We always have to keep in mind that we are performing a binary classification, in which case random guessing would have a success probability of 50%. So basically our best model is only 7 percentage points better than tossing a coin. I have to admit that I struggled with these accuracies for quite a bit, and after some reasoning and a lot of literature I arrived at the following conclusions.

As stressed at the very beginning of this set of posts, the Stock Market Prediction Problem is very tough. Lots of research has been produced on this topic and lots will surely be produced in the future, and getting relevant results out of the blue is hard. We always have to remind ourselves that we are basically going against the Efficient Market Hypothesis (EMH), which asserts that markets are informationally efficient, meaning that they immediately adjust as soon as an event or a pattern is identified. As a result, forecasting procedures in this kind of environment are very challenging.

The most common mistake is to believe that the relevant information is inside the market itself. There is for sure some significant information beyond historical exchange data but its extraction is not straightforward and generally can be achieved by technical time series analysis and econometrics. It’s feasible to get something out of raw data with basic financial analysis, as I did, but real information must be scraped much more in depth.

In addition to this we must also account for the poor results of common Machine Learning algorithms. As far as I have read in the available literature, much better performance can be obtained by implementing custom cost functions that take into account correlations and other more advanced metrics.

In conclusion, I’m very happy with the process I followed and the pipeline I built, but the results are evidently not satisfactory.

In any case finally we have a model! Next step is to build a trading algorithm on top of our predictions and see what happens in real life with a backtest example.

by Francesco Pochetti

Part IV – Model/Feature Selection



In the last post I introduced the classification algorithms tested for the project’s purposes. The function in charge of data preparation and splitting has also been presented. Basically we are now ready for some real ML reasoning.

Cross Validation

We have introduced the algorithms and the features, but now the really tough questions are the following:

  • What is the best algorithm and what are the best features?
  • But most importantly what does “best” mean?

In order to answer these questions we have to introduce Cross Validation, which I already treated quite extensively in this post about Pythonic Cross Validation on Time Series.

Just to recap a little bit for the ones who don’t want to go through the linked post.

First of all we have to decide on a metric to evaluate the results of our prediction. A well accepted measurement is the Area Under the Curve (AUC), i.e. the area under the ROC curve, which plots the true positive rate against the false positive rate as the probability threshold varies. The output of a prediction algorithm can always be interpreted as a probability for a certain case to belong to a specific class or not. For a binary classification the default behavior of the algorithm is to classify as “0” a case whose probability of belonging to “1” is less than 50%, and vice versa. This threshold can be varied as needed, depending on the field of investigation. There may be situations in which this kind of trick is absolutely fundamental, for example when the relative proportion of the two classes is extremely skewed towards one of them. In this case the 50% threshold does not really make sense. If I’m building a model to detect whether a person is affected by a particular disease and the disease rate in the whole population is, let’s say, 3%, then I do want to be very careful, as this 3% is very likely to fall within my misclassification error.

The AUC takes care of this kind of issue by measuring the robustness of a classifier at several probability thresholds. In my case, stock markets being notoriously random, I decided to stick to the more classic accuracy of a classifier, fixing my threshold at 50%.
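To make the difference between the two metrics concrete, here is a small pure-Python sketch. The probabilities and labels are made up, and the helper names are hypothetical. The AUC is computed through its rank interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one.

```python
def accuracy_at_threshold(probs, labels, threshold=0.5):
    """Share of correct predictions when class 1 is assigned for p >= threshold."""
    preds = [1 if p >= threshold else 0 for p in probs]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def auc(probs, labels):
    """Threshold-free AUC: fraction of (positive, negative) pairs ranked correctly."""
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    # A tie between a positive and a negative score counts as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up predicted probabilities of class "1" and the true classes.
probs = [0.9, 0.6, 0.4, 0.3]
labels = [1, 0, 1, 0]

print(accuracy_at_threshold(probs, labels))        # depends on the 50% cut
print(accuracy_at_threshold(probs, labels, 0.35))  # same model, different cut
print(auc(probs, labels))                          # no cut involved
```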

Having said that let’s come back to Cross Validation. In order to show the technique I’ll walk through a practical example.

We want to assess the performance of a Random Forest Classifier in the following conditions:

  1. 100 trees (n_estimators in Scikit-Learn)
  2. n = 2 / delta = 2. Thus we are lagging the returns by at most 2 days and we are computing at most 2-day returns and 2-day moving average returns.

What we do next is what follows:

  1. Split data in train and test set given a Date (in my case after 1 April 2014 included).
  2. Split train set (before 1 April 2014 not included) in 10 consecutive time folds.
  3. Then, in order not to lose the time information, perform the following steps:
  4. Train on fold 1 –>  Test on fold 2
  5. Train on fold 1+2 –>  Test on fold 3
  6. Train on fold 1+2+3 –>  Test on fold 4
  7. Train on fold 1+2+3+4 –>  Test on fold 5
  8. Train on fold 1+2+3+4+5 –>  Test on fold 6
  9. Train on fold 1+2+3+4+5+6 –>  Test on fold 7
  10. Train on fold 1+2+3+4+5+6+7 –>  Test on fold 8
  11. Train on fold 1+2+3+4+5+6+7+8 –>  Test on fold 9
  12. Train on fold 1+2+3+4+5+6+7+8+9 –>  Test on fold 10
  13. Compute the average of the accuracies of the 9 test folds (number of folds  – 1)
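The fold bookkeeping in the steps above can be sketched in a few lines of pure Python. This is only an illustration of the indexing, with a hypothetical function name, not the function used in the project:

```python
def forward_chaining_splits(n_samples, n_folds):
    """Yield (train_indices, test_indices) for consecutive time folds.

    Fold k+1 is always tested with a model trained on folds 1..k,
    so the training window never looks into the future.
    """
    fold_size = n_samples // n_folds
    # Fold boundaries; the last fold absorbs any remainder.
    bounds = [i * fold_size for i in range(n_folds)] + [n_samples]
    for k in range(1, n_folds):
        train = list(range(0, bounds[k]))
        test = list(range(bounds[k], bounds[k + 1]))
        yield train, test

splits = list(forward_chaining_splits(n_samples=20, n_folds=10))
for train, test in splits[:3]:
    print("train:", train, "-> test:", test)
```

With 10 folds this produces 9 train/test pairs, which is why the average is taken over 9 accuracies.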

Repeat steps 1-12 in the following conditions:

  • n = 2 / delta = 3
  • n = 2 / delta = 4
  • n = 2 / delta = 5
  • n = 3 / delta = 2
  • n = 3 / delta = 3
  • n = 3 / delta = 4
  • n = 3 / delta = 5
  • n = 4 / delta = 2
  • n = 4 / delta = 3
  • n = 4 / delta = 4
  • n = 4 / delta = 5
  • n = 5 / delta = 2
  • n = 5 / delta = 3
  • n = 5 / delta = 4
  • n = 5 / delta = 5

and get the average of the accuracies of the 9 test folds in each one of the previous cases. Obviously there is an infinite number of possibilities to generate and assess. What I did was to stop at a maximum of 10 days. Thus I basically performed a double for loop up to n = 10 / delta = 10.

Each time the script gets into one iteration of the for loop it generates a brand new dataframe with a different set of features. Then, on top of the newborn dataframe, 10-fold Cross Validation is performed in order to assess the performance of the selected algorithm with that particular set of predictors.
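The skeleton of that double loop can be sketched as follows. `score_fn` is a hypothetical callable standing in for "build the dataframe for this (n, delta) pair and return the average CV accuracy"; the toy scoring function below exists only to exercise the loop and has nothing to do with real market data.

```python
from itertools import product

def select_best_features(score_fn, max_n=10, max_delta=10):
    """Return (best_n, best_delta, best_score) over the n / delta grid."""
    results = {
        (n, delta): score_fn(n, delta)
        for n, delta in product(range(2, max_n + 1), range(2, max_delta + 1))
    }
    (best_n, best_delta), best_score = max(results.items(), key=lambda kv: kv[1])
    return best_n, best_delta, best_score

def toy_score(n, delta):
    """Made-up accuracy surface peaking at n = 4, delta = 7."""
    return 0.55 - 0.001 * ((n - 4) ** 2 + (delta - 7) ** 2)

best_n, best_delta, best_score = select_best_features(toy_score)
print(best_n, best_delta, best_score)
```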

I repeated this set of operations for all the algorithms introduced in the previous post (Random Forest, KNN, SVM, Adaptive Boosting, Gradient Tree Boosting, QDA) and after all the computations the best result is the following:

  • Random Forest | n = 9 / delta = 9 | Average Accuracy = 0.530748

The output for these specific conditions is provided below, together with the Python function in charge of looping over n and delta and performing CV on each iteration.

– function to perform Feature and Model Selection

It is very important to notice that nothing has been done at an algorithmic parameter level. What I mean is that with the previous approach we have been able to achieve three very important goals:

  1. Assess the best classification algorithm, comparing all of them on the same set of features.
  2. Assess for each algorithm the best set of features.
  3. Pick the couple (Model/Features) maximizing the CV Accuracy

Nice! So I would suggest we move on to the test set!

by Francesco Pochetti

Part III – Scikit Classification Algorithms



So finally, as a result of last post, we have a dataframe to play with.

Before diving into model and feature selection I would like to give a brief overview of the Classification Algorithms I tested. In the next post I’ll show the techniques and the decision process followed to select the best model and the best features. To keep it simple I’ll present actual results only for the best model, so it is worth introducing the other ones I implemented here.

Classification ML

The following is a pretty awesome algorithm cheat-sheet provided as part of the Scikit-Learn Documentation. I’ll cover the Classification branch of the tree, going through the code needed to have the selected algorithms running.


First of all we need to prepare our data for the proper Machine Learning stuff. So let’s take care of that with prepareDataForClassification(). This function takes care of

  1. turning S&P 500 daily returns from a float to a string column (Up, Down)
  2. encoding it as a binary integer 1,0 (the only form accepted by Scikit-Learn Classifiers).
  3. selecting the features used for prediction (basically all the columns except for the first one – float returns – and the last one – string returns).
  4. splitting the whole dataframe into train and test set (based on a date passed as argument) and returning, for each of them, the predictors and the actual outcome to be predicted.
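A minimal pandas sketch of these four steps is shown below. It is a reconstruction under stated assumptions, not the original prepareDataForClassification(): the column names are hypothetical, the first column is assumed to hold the float returns, and a zero return is arbitrarily counted as Down.

```python
import pandas as pd

def prepare_data_for_classification(df, start_test):
    """Return X_train, y_train, X_test, y_test split at the date start_test."""
    data = df.copy()
    returns = data.iloc[:, 0]  # first column: float returns
    # 1. float returns -> string labels (zero counted as Down: an assumption)
    data["UpDown"] = returns.apply(lambda r: "Up" if r > 0 else "Down")
    # 2. string labels -> binary integers (Up = 1, Down = 0)
    y = (data["UpDown"] == "Up").astype(int)
    # 3. features: every column except the first (float) and last (string) one
    X = data[data.columns[1:-1]]
    # 4. time-based split: test set starts at start_test (included)
    is_train = data.index < pd.Timestamp(start_test)
    return X.loc[is_train], y.loc[is_train], X.loc[~is_train], y.loc[~is_train]

# Tiny made-up dataframe, only to show the shapes involved.
idx = pd.to_datetime(["2014-03-28", "2014-03-31", "2014-04-01", "2014-04-02"])
df = pd.DataFrame(
    {"Return_Out": [0.01, -0.02, 0.005, -0.001],
     "feat_a": [0.1, 0.2, 0.3, 0.4],
     "feat_b": [1.0, 2.0, 3.0, 4.0]},
    index=idx,
)
X_train, y_train, X_test, y_test = prepare_data_for_classification(df, "2014-04-01")
print(X_train.shape, X_test.shape)
```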

performClassification() is in charge of calling the selected algorithm and returning the classification result

Following are the code snippets implementing the algorithms

Random Forest

K-Nearest Neighbors

Support Vector Machines

Adaptive Boosting

Gradient Tree Boosting

Quadratic Discriminant Analysis

Well, time to turn to some Model and Feature Selection!

by Francesco Pochetti

Part II – Feature Generation




In the last post I went through the project’s introduction and the data collection, together with a little bit of feature analysis. In this article I’ll deal with additional feature generation and lay the foundations of feature selection.

Basically I want to answer the following questions: after having collected financial historical data

  • how do I get relevant information out of it?
  • how do I add flexibility to my system plugging in artificially generated features?

Old and New Features

Together with the algorithm this is the most important question to be answered. It turns out that it is very hard to define a priori a good set of features. I’m not talking about feature selection; that topic will be addressed later using Cross Validation. I’m talking about the basic set of features to start playing with. Excluding the date, I could just have taken all 6 columns (Open, High, Low, Close, Volume, Adj Close) for the 8 selected indices (NASDAQ, Dow Jones, Frankfurt, London, Paris, Tokyo, Hong Kong, Australia) plus the output (S&P 500), merged them into a single dataframe of 54 columns and run an algorithm on top of that.

The latter approach is quite naive as it doesn’t really take into account any financial dynamics, plugging into the model absolute values of prices and not their fluctuations.

It would be much more informative to turn all the predictors into returns as well, to account for the variation of the predictors rather than sticking to their static values. To achieve that I focused on the Adjusted Close Price of each predictor. So out of the 54 possible columns I selected only 9 of them.

  1. AdjClose_sp
  2. AdjClose_nasdaq
  3. AdjClose_djia
  4. AdjClose_frankfurt
  5. AdjClose_london
  6. AdjClose_paris
  7. AdjClose_nikkei
  8. AdjClose_hkong
  9. AdjClose_australia

Then in order to account for time variations I decided to play with 4 basic financial metrics:

  1. Daily Returns: percentage difference of Adjusted Close Price of i-th day compared to (i-1)-th day. $$ Return_{i} = \frac{AdjClose_{i} - AdjClose_{i-1}}{AdjClose_{i-1}} $$
  2. Multiple Day Returns: percentage difference of Adjusted Close Price of i-th day compared to (i-delta)-th day. Example: the 3-day Return is the percentage difference of Adjusted Close Price of today compared to the one of 3 days ago. $$ Return_{\delta} = \frac{AdjClose_{i} - AdjClose_{i-\delta}}{AdjClose_{i-\delta}} $$
  3. Returns Moving Average: average of the daily returns over the last delta days. Example: the 3-day Returns Moving Average is the mean of the daily returns of the last 3 days. $$ MovingAverage_{\delta} = \frac{Return_{i} + Return_{i-1} + \dots + Return_{i-\delta+1}}{\delta} $$
  4. Time Lagged Returns: shift the daily returns n days backwards. Example: if n = 1, today’s Return becomes yesterday’s Return.
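In pandas each of the four metrics is essentially a one-liner. The sketch below uses a made-up price series, and the function and column names are hypothetical, chosen only for this illustration:

```python
import pandas as pd

def add_return_features(adj_close, delta, n):
    """Build the four return-based metrics for one index."""
    out = pd.DataFrame(index=adj_close.index)
    # 1. Daily Returns
    out["Return"] = adj_close.pct_change()
    # 2. Multiple Day Returns over delta days
    out[f"Return_{delta}d"] = adj_close.pct_change(delta)
    # 3. Moving average of the daily returns over the last delta days
    out[f"MovAvg_{delta}d"] = out["Return"].rolling(delta).mean()
    # 4. Daily returns lagged by n days
    out[f"Return_lag{n}"] = out["Return"].shift(n)
    return out

# Made-up prices growing by 10% a day, so the expected values are easy to check.
adj_close = pd.Series([100.0, 110.0, 121.0, 133.1])
features = add_return_features(adj_close, delta=2, n=1)
print(features)
```

The first rows of every derived column are NaN, because there is not enough history yet to compute them.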

Thus to recap this is the process I follow to build my dataset:

  1. I start with 8 basic predictors (the Adjusted Close Price of the 8 major world stock indices) + 1 output/predictor (Adjusted Close Price of S&P 500). Keep in mind that despite S&P daily returns being my predicted values I still want to keep inside my model some information regarding Standard & Poor’s itself. I hope this is not confusing.
  2. I compute the daily returns of each of them.
  3. I add features to the DataFrame using the 4 financial metrics described above. Thus basically playing around with n and delta I can generate as many features as I want adding more and more flexibility to my dataframe.
  4. I get rid of the Adjusted Close Price I had at the beginning, ending up with a perfectly scaled dataset. No need for normalization as all the data I generated are in the same range (as you probably noticed we basically performed always the same kind of returns computations).
  5. Notice that a bunch of missing values are automatically produced. To make this point clear let’s walk through a practical example: what happens when I compute the 3-day Moving Average on my Daily Returns column? Pandas is going to replace today’s return with the average of the returns of the last 3 days. Now let’s suppose that the first entry in our dataframe corresponds to 20 April 2014. There is going to be no 3-day Moving Average for 20 April 2014 as there are no 3 previous days. The same goes for 21 and 22 April 2014. The first non-missing day is going to be 23 April 2014, as it is then possible to compute the average of the daily returns over 3 days. Notice that the same issue (with slightly different results) arises when the other financial metrics are taken into account.

Below I provide the code which takes care of all the previously discussed feature generation, plus solving a couple of intermediate issues.

– given the AdjClose and Returns, addFeatures() returns delta-Multiple Day Returns and delta-Returns Moving Average. This function is called for several deltas inside applyRollMeanDelayedReturns()

– given the list of datasets and the range of deltas to explore applyRollMeanDelayedReturns() adds features to each dataset and returns the augmented list

mergeDataframes() is fundamental as it takes the list of augmented datasets produced by applyRollMeanDelayedReturns() and merges all of them in the finance dataframe applying a time cut at the beginning of the series (all data after 1993 – I decided to implement this time cut as Australia ASX-200 data is not available before that time) and selecting only the relevant columns out of each dataframe. This is the step in which we get rid of all the Open, High, Low, Close, Volume, AdjClose columns in each dataset and keep only the Daily Returns, delta-Multiple Day Returns and delta-Returns Moving Average previously created.

I want to stress the specific Pandas merging command. This step is quite tricky as I’m concatenating dataframes by date index. The issue is that markets all over the world do not have the same trading days, basically due to different national holidays. So there is going to be an unavoidable, annoying mismatch between dates. I faced the issue in the following way:

  1. Perform an OUTER JOIN of all the predictors. This command generates a UNION of all the columns, thus creating a bunch of NAs. I afterwards imputed them by linear interpolation. Let’s call the result of this operation PREDICTORS.
  2. Perform a S&P LEFT JOIN PREDICTORS. This line shrinks PREDICTORS to match S&P date index. Thus no additional NA will be created in the output dataframe.
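The two joins can be reproduced on a toy example. The dates, values and column names below are invented; the point is only to show how a holiday gap in one market gets interpolated and how the final index ends up matching the S&P one.

```python
import pandas as pd

days = pd.to_datetime(["2014-04-17", "2014-04-18", "2014-04-21", "2014-04-22"])

# Made-up returns: S&P trades on the first three days, NASDAQ skips the second
# one (pretend holiday), Frankfurt trades on all four.
sp = pd.DataFrame({"Return_sp": [0.001, 0.002, 0.003]}, index=days[[0, 1, 2]])
nasdaq = pd.Series([0.010, 0.030], index=days[[0, 2]], name="Return_nasdaq")
frankfurt = pd.Series([0.1, 0.2, 0.3, 0.4], index=days, name="Return_frankfurt")

# 1. OUTER JOIN of the predictors on the date index, then linear interpolation.
predictors = pd.concat([nasdaq, frankfurt], axis=1, join="outer").sort_index()
predictors = predictors.interpolate(method="linear")

# 2. S&P LEFT JOIN PREDICTORS: keep only the S&P trading days.
finance = sp.join(predictors, how="left")
print(finance)
```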

applyTimeLag() achieves two goals:

  1. First of all it takes the finance dataframe as input, focuses only on Daily Returns columns and generates Time Lagged Returns accordingly.
  2. Secondly, and maybe even more importantly, in this line of code

    the function shifts S&P 500 daily returns one day backwards. This step is crucial as, in order for my prediction to work correctly, I need to align tomorrow’s returns to today’s data. If I didn’t do that I would be forecasting today’s exchange results!
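A minimal sketch of this alignment, with made-up numbers and hypothetical column names (the original line of applyTimeLag() is not reproduced here):

```python
import pandas as pd

df = pd.DataFrame({"Return_sp": [0.01, -0.02, 0.005, 0.003]})

# Row i now carries the return of day i+1, i.e. tomorrow's outcome.
df["Return_Out"] = df["Return_sp"].shift(-1)

# The last row has no "tomorrow", so it cannot be used for training.
df = df.dropna(subset=["Return_Out"])
print(df)
```

After the shift, each row pairs the information available today with the value that has to be predicted, which is exactly what a supervised classifier expects.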

Well, time to perform some Machine Learning on top of our brand new finance DataFrame.

by Francesco Pochetti

Part I – Stock Market Prediction in Python Intro

This is the first of a series of posts summarizing the work I’ve done on Stock Market Prediction as part of my portfolio project at Data Science Retreat.

The scope of this post is to get an overview of the whole work, specifically walking through the foundations and core ideas.

First of all I provide the list of modules needed to have the Python code running correctly in all the following posts. I import them only once at the beginning and that’s it.

Nice! Now we can start!


The idea at the base of this project is to build a model to predict financial market movements. The forecasting algorithm aims to foresee whether tomorrow’s exchange closing price is going to be lower or higher than today’s. The next step will be to develop a trading strategy on top of that, based on our predictions, and backtest it against a benchmark.

Specifically, I’ll go through the pipeline, decision process and results I obtained trying to model S&P 500 daily returns.

My whole work will be structured as follows:

  1. Introduction and Discussion of the Problem
  2. Feature Generation
  3. Classification Algorithms
  4. Feature/Model Selection
  5. Results on Test Set
  6. Trading Algorithm and Portfolio Performance

Problem Definition

The aim of the project is to predict whether future daily returns of the S&P 500 are going to be positive or negative.

Thus the problem I’m facing is a binary classification.

The metric I deal with is daily return which is computed as follows:

$$ Return_{i} = \frac{AdjClose_{i} - AdjClose_{i-1}}{AdjClose_{i-1}} $$

The Return on the i-th day is equal to the Adjusted Stock Close Price on the i-th day minus the Adjusted Stock Close Price on the (i-1)-th day, divided by the Adjusted Stock Close Price on the (i-1)-th day. The Adjusted Close Price of a stock is its close price modified by taking into account dividends. It is common practice to use this metric in Returns computations.

Since the beginning I decided to focus only on S&P 500, a stock market index based on the market capitalizations of 500 large companies having common stock listed on the NYSE (New York Stock Exchange) or NASDAQ. Being such a diversified portfolio, the S&P 500 index is typically used as a market benchmark, for example to compute betas of companies listed on the exchange.

Feature Analysis

The main idea is to use major world stock indices as input features for the machine learning based predictor. The intuition behind this approach is that globalization has deepened the interaction between financial markets around the world. The shock wave of the US financial crisis (starting with the Lehman Brothers collapse) hit the economy of almost every country, and the debt crisis that originated in Greece brought down all major stock indices. Nowadays, no financial market is isolated. Economic data, political perturbations and any other overseas affairs can cause dramatic fluctuations in domestic markets. A “bad day” on the Australian or Japanese exchange is going to heavily affect Wall Street’s opening and trend. In the light of the previous considerations the following predictors have been selected:

It is very easy to get historical daily prices of the previous indices. Python provides easy libraries to handle the download. The data can be pulled down from Yahoo Finance or Quandl and cleanly formatted into a dataframe with the following columns:

  • Date : in days
  • Open : price of the stock at the opening of the trading (in US dollars)
  • High : highest price of the stock during the trading day (in US dollars)
  • Low : lowest price of the stock during the trading day (in US dollars)
  • Close : price of the stock at the closing of the trading (in US dollars)
  • Volume : number of shares traded
  • Adj Close : price of the stock at the closing of the trading adjusted with dividends (in US dollars)

The following is a screenshot of Yahoo Finance website showing a subset of NASDAQ Composite historical prices. This is exactly what a Pandas DataFrame looks like after having downloaded the data.


Output of Prediction

How do I plug the desired output of my prediction inside my dataframe? The answer is pretty straightforward and basically consists in repeating the exact same steps followed for predictors. Thus eventually, together with the 8 selected major stock indices, we’ll end up downloading a 9th dataset for S&P 500. Notice that the output of our prediction is a binary classification; we want to be able to answer the following question: is tomorrow going to be an Up or Down day? In order to do that the S&P data must undergo a simple two-step manipulation:

  1. Compute S&P 500 daily returns (we’ll do this for predictors as well, as discussed in the next post).
  2. Generate an additional column in the DataFrame with ‘Up’ whenever the return on that specific day was positive and ‘Down’ whenever it was negative.

This passage of the pipeline is actually very important and it must be absolutely clear, so I’ll spend a few more words on it. As I stressed, the output of my prediction is whether S&P 500 daily returns are positive or not. To carry out this kind of prediction I use the following indices: NASDAQ, Dow Jones, Frankfurt, London, Paris, Tokyo, Hong Kong, Australia and S&P 500 itself. Obviously I won’t use S&P 500 daily returns to forecast S&P 500 daily returns! This would not make sense. What I mean by S&P 500 itself is that I’ll play with S&P 500 historical close prices, lagging them in time accordingly. The intuition is that I do not want to lose any potential information contained in the output data.

So to recap the logic is the following:

  1. Download 9 dataframes (NASDAQ, Dow Jones, Frankfurt, London, Paris, Tokyo, Hong Kong, Australia, S&P 500).
  2. Compute S&P 500 daily returns and turn them into a binary variable (Up, Down). This is my output and it won’t be touched anymore.
  3. Play with all the other columns of the 9 available dataframes (S&P 500 included) as explained in the following post.

For the sake of completeness I attach the Python code in charge of data gathering and very first preparation:

Let’s move to the details of feature generation.

by Francesco Pochetti

Pythonic Cross Validation on Time Series

Working with time series has always been a serious issue. The fact that the data is naturally ordered rules out the common Machine Learning methods, which by default tend to shuffle the entries, losing the time information.

Dealing with Stock Market Prediction I had to face this kind of challenge which, despite being pretty common, is not well covered in the documentation.

One of the first problems I encountered when applying ML was how to deal with Cross Validation. This technique is widely used to perform feature and model selection once the data collection and cleaning phases have been carried out. The idea behind CV is that in order to select the best predictors and algorithm it is mandatory to measure the accuracy of our process on a set of data which is different from the one used to train the model.

Just to be sure we are on the same page let’s walk through the following basic example; imagine you are a professor teaching a student a new topic. To train the student you provide him some questions with solutions (supervised learning), just to give him the possibility to check whether his reasoning was right or wrong. But if you had to test his knowledge you wouldn’t ask him the same questions of the previous day, right? That would be unfair; well the student would be happy of course but the result of the test would not be reliable. To really check whether he has digested the new concept or not you have to provide him brand new exercises. Something he has not seen before.

This is the concept at the base of Cross Validation. The most accepted technique in the ML world consists in randomly picking samples out of the available data and splitting them into a train and a test set. To be completely precise, the steps are generally the following:

  1. Split the data randomly into a train and a test set.
  2. Focus on the train set and split it again randomly into chunks (called folds).
  3. Let’s say you got 10 folds; train on 9 of them and test on the 10th.
  4. Repeat step three 10 times to get 10 accuracy measures on 10 different and separate folds.
  5. Compute the average of the 10 accuracies which is the final reliable number telling us how the model is performing.
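
Steps 2 to 5 above can be sketched in plain Python as follows. This is only a minimal illustration: `shuffled_kfold_indices`, `cross_validate` and the `score_fn` callback (which is expected to train on the given training indices and return an accuracy on the test indices) are hypothetical names introduced here, not the scikit-learn API.

```python
import random


def shuffled_kfold_indices(n_samples, n_folds, seed=0):
    """Yield (train_idx, test_idx) pairs for shuffled k-fold cross validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)            # random order, reproducible
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test_idx = folds[k]                         # one fold held out for testing
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_idx, test_idx


def cross_validate(score_fn, n_samples, n_folds=10):
    """Average the score returned by score_fn(train_idx, test_idx) over all folds."""
    scores = [score_fn(train_idx, test_idx)
              for train_idx, test_idx in shuffled_kfold_indices(n_samples, n_folds)]
    return sum(scores) / len(scores)
```

In practice scikit-learn offers ready-made helpers for this (for example `KFold` and `cross_val_score`), so a hand-written version is mostly useful for understanding what happens under the hood.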

The issue with Time Series is that the previous approach (implemented by the most common built-in scikit-learn functions) cannot be applied. The reason is that time series data cannot be shuffled randomly, because we would lose its natural order, which in most cases matters.

Thus one possible solution is the following:

  1. Split the data into a train and a test set given a date (e.g. the test set is everything from 2 April 2014 onwards, that day included).
  2. Split the train set (e.g. everything before 2 April 2014, that day excluded) into, for example, 10 consecutive time folds.
  3. Then, in order not to lose the time information, perform the following steps:
  4. Train on fold 1 –>  Test on fold 2
  5. Train on fold 1+2 –>  Test on fold 3
  6. Train on fold 1+2+3 –>  Test on fold 4
  7. Train on fold 1+2+3+4 –>  Test on fold 5
  8. Train on fold 1+2+3+4+5 –>  Test on fold 6
  9. Train on fold 1+2+3+4+5+6 –>  Test on fold 7
  10. Train on fold 1+2+3+4+5+6+7 –>  Test on fold 8
  11. Train on fold 1+2+3+4+5+6+7+8 –>  Test on fold 9
  12. Train on fold 1+2+3+4+5+6+7+8+9 –>  Test on fold 10
  13. Compute the average of the accuracies of the 9 test folds (number of folds  – 1)
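
The expanding-window scheme in the steps above can be sketched as follows. This is not the performTimeSeriesCV() function mentioned in this post, just a minimal stand-alone illustration: `time_series_folds`, `time_series_cv` and the `fit_and_score` callback are hypothetical names, and dropping the leftover rows when the data does not divide evenly is an assumption.

```python
def time_series_folds(n_samples, n_folds):
    """Yield (train_idx, test_idx): train on folds 1..k, test on fold k+1."""
    # Assumption: rows left over after integer division are simply ignored.
    fold_size = n_samples // n_folds
    for k in range(1, n_folds):
        train_idx = list(range(0, k * fold_size))                    # everything so far
        test_idx = list(range(k * fold_size, (k + 1) * fold_size))   # the next fold
        yield train_idx, test_idx


def time_series_cv(fit_and_score, n_samples, n_folds=10):
    """Average fit_and_score(train_idx, test_idx) over the n_folds - 1 test folds."""
    scores = [fit_and_score(train_idx, test_idx)
              for train_idx, test_idx in time_series_folds(n_samples, n_folds)]
    return sum(scores) / len(scores)


for train_idx, test_idx in time_series_folds(n_samples=100, n_folds=10):
    # Every training index precedes every test index, so no future data leaks in.
    print(len(train_idx), "training rows ->", len(test_idx), "test rows")
```

The rows must already be sorted by date for this to make sense. More recent versions of scikit-learn ship `TimeSeriesSplit`, which follows the same expanding-window idea.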

The previous approach is encapsulated in the following function, which I built myself to be sure of exactly what it was doing. The code is fully commented and provided below.

To show what performTimeSeriesCV() produces, a sample of real output is provided.

Specifically, the following is the output of the next command, which performs a Cross Validation for a binary classification on a stock price (Positive or Negative Return) using Quadratic Discriminant Analysis.


by Francesco Pochetti

Financial Sentiment Analysis Part II – Sentiment Extraction

As promised, I’ll devote this second post to walking through the remaining part of the Financial Sentiment Analysis pipeline.

Just to recap, the steps we wanted to clarify are the following:

  1. Scrape the historical archives of a web financial blog in order to get for each post the following information: date, keywords, text.
  2. Save all this information in a JSON file.
  3. Read the JSON file with Pandas and preprocess the text with NLTK (Natural Language ToolKit) and BeautifulSoup.
  4. Perform Sentiment Analysis on the clean text data in order to get sentiment scores for each day.
  5. Generate a final Pandas DataFrame and correlate it with stocks prices to test our hypothesis.
  6. Repeat points 1-5 for as many blogs as possible. For the sake of simplicity I report only the pipeline for a single blog, Bloomberg Business Week. The code can easily be adapted to other websites; in any case all my experiments are available on GitHub.

In the previous post we went through steps 1-2. The goal of this second article is to cover the remaining steps (3-5), from data cleaning to sentiment extraction.

Step 3 – Data Cleaning

This step is crucial as it consists in preprocessing all the information stored in the JSON file generated after the web crawling. The main tasks to be carried out are:

  1. read JSON in Pandas DataFrame
  2. unlist all the entries (XPath queries return Python lists)
  3. convert dates to datetime
  4. join the keywords and body columns into one text column and drop the two original columns
  5. the result at this point must be a Pandas DataFrame with 2 columns: date of the post and text of the post
  6. turn all the text to lowercase
  7. get rid of all the HTML tags from text using BeautifulSoup
  8. tokenize the text and get rid of stop words using NLTK
  9. return clean text in form of a list of words

Two functions take care of this work: readJson (steps 1-5) and cleanText (steps 6-9). Both are reported below.
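
To give a feeling for what steps 6-9 involve, here is a minimal stand-alone sketch. It is not the cleanText function itself: to stay dependency-free it swaps BeautifulSoup for the standard library's `html.parser` and NLTK's stop-word list for a tiny hand-written set, and the name `clean_text` and the sample sentence are made up for the example.

```python
import re
from html.parser import HTMLParser

# Assumption: a tiny illustrative stop-word set; NLTK's English list is much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "are", "on"}


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment, dropping the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def clean_text(raw_html):
    """Lowercase, strip HTML tags, tokenize and remove stop words."""
    extractor = _TextExtractor()
    extractor.feed(raw_html.lower())          # step 6: lowercase
    extractor.close()                         # step 7: tags are gone, text is kept
    text = " ".join(extractor.parts)
    tokens = re.findall(r"[a-z']+", text)     # step 8: naive word tokenizer
    return [t for t in tokens if t not in STOP_WORDS]  # step 9: list of words


print(clean_text("<p>The markets are <b>rallying</b> on strong earnings</p>"))
# ['markets', 'rallying', 'strong', 'earnings']
```

A regular-expression tokenizer like this one is cruder than NLTK's tokenizers (it drops numbers, for instance), which is acceptable for a dictionary-based word count but worth keeping in mind.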

Step 4 – Computing Sentiment Score

This is the core of the whole pipeline. The problem of computing the sentiment of a piece of text is extremely complex. There is a whole field devoted to developing algorithms for this kind of problem (Natural Language Processing), and at present there are several possible approaches to consider.

  1. One solution consists in getting chunks of pre-prepared texts labeled as Positive or Negative and then performing a supervised binary classification over our posts, dividing them into the two available categories.
  2. Another approach consists in getting a dictionary of Positive and Negative words and then counting how many of each occur in each post. A Sentiment Measure is then derived from those counts.

I decided to follow the second path, which first of all required me to find dictionaries available for this specific purpose. After some intense googling I came across the web page of Prof Bill McDonald, professor of Finance at the University of Notre Dame. His research group recently conducted a very interesting study on Sentiment Analysis of Financial Texts, showing that in order to get a decent accuracy in this kind of computation it is absolutely mandatory to use a dictionary of words developed specifically for finance. In fact, it is not uncommon for words that have a negative meaning in a normal context to lose that negative meaning in a financial one. After the paper was published the authors put their dictionaries online, so I downloaded them and used them for my analysis.

To get to the final result (extracting the sentiment from a text) I wrote a handful of functions, described below:

    • loadPositive(): loads the dictionary of positive words into a list.

    • loadNegative(): loads the dictionary of negative words into a list.

    • countPos(cleantext, positive): counts the number of words contained both in the post and in the positive dictionary.

    • countNeg(cleantext, negative): counts the number of words contained both in the post and in the negative dictionary.

    • getSentiment(cleantext, negative, positive): returns the difference between the number of positive and the number of negative words in a post.

    • upDateSentimentDataFrame(dataframe): performs all the computations described above on the whole Pandas dataframe of posts. The returned dataframe is then saved to csv file.

    • prepareToConcat(filename): reads the csv sentiment dataframe and groups it by day, averaging the sentiment scores of each day.
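
The counting logic behind these functions can be sketched in a few lines of plain Python. The names below are hypothetical and deliberately different from the functions listed above, the two word sets are toy stand-ins for the downloaded financial dictionaries, and counting every occurrence of a word (rather than each distinct word once) is an assumption.

```python
from collections import defaultdict

# Assumption: toy word sets standing in for the real financial dictionaries.
POSITIVE = {"gain", "profit", "strong"}
NEGATIVE = {"loss", "decline", "weak"}


def count_matches(tokens, dictionary):
    """Number of tokens in the post that also appear in the dictionary."""
    return sum(1 for token in tokens if token in dictionary)


def sentiment_score(tokens, positive=POSITIVE, negative=NEGATIVE):
    """Positive word count minus negative word count."""
    return count_matches(tokens, positive) - count_matches(tokens, negative)


def daily_average(scored_posts):
    """scored_posts: iterable of (date_string, score). Returns {date: mean score}."""
    by_day = defaultdict(list)
    for day, score in scored_posts:
        by_day[day].append(score)
    return {day: sum(scores) / len(scores) for day, scores in by_day.items()}


posts = [  # made-up, already cleaned posts
    ("2014-04-01", ["strong", "profit", "loss"]),
    ("2014-04-01", ["weak", "decline"]),
    ("2014-04-02", ["gain"]),
]
scored = [(day, sentiment_score(tokens)) for day, tokens in posts]
print(daily_average(scored))  # {'2014-04-01': -0.5, '2014-04-02': 1.0}
```

Storing the dictionaries as sets rather than lists makes each membership test constant time, which matters once the word lists and the number of posts grow.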

Step 5 – Final Steps

It seems we have almost reached the end. Now we have a Pandas DataFrame consisting of a single ‘score’ column (one value for each day) and indexed by date. What we have to do now is to plug this sentiment feature into our model.

To do so we save the output of the prepareToConcat(filename) function to a sentiment.csv file and then call mergeSentimenToStocks(stocks), which takes the previously built stock prices dataframe as its argument and left joins it with our newly built sentiment dataframe.
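
What a left join means here can be shown without Pandas: every row of the stocks table is kept, and days with no blog post simply get a missing score. The sketch below is only an illustration of that behaviour, with a hypothetical `left_join_sentiment` helper and made-up prices; in Pandas the same result comes from a left join such as `DataFrame.join(..., how="left")`.

```python
def left_join_sentiment(stock_rows, daily_scores):
    """Attach a 'score' to every stock row; None when that day has no sentiment.

    stock_rows:   list of dicts, each with a 'date' key
    daily_scores: dict mapping date -> average sentiment score
    """
    return [dict(row, score=daily_scores.get(row["date"])) for row in stock_rows]


stocks = [  # made-up closing prices
    {"date": "2014-04-01", "close": 100.0},
    {"date": "2014-04-02", "close": 101.5},
]
scores = {"2014-04-01": -0.5}  # no posts were published on 2014-04-02

for row in left_join_sentiment(stocks, scores):
    print(row)
```

Days without a score then have to be handled explicitly before fitting a model, for example by filling them with a neutral value or by dropping those rows.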

There you go! Now the score of the financial blog is a real new feature and we can run a model on top of that. Task accomplished!

Cool, isn’t it?


by Francesco Pochetti