Monthly Archives: October 2014

Stock Trading Algorithm on top of Market Event Study

This post is the result of the first six weeks of class from the Computational Investing course I’m currently following on Coursera. The course is an introduction to Portfolio Management and Optimization in Python and lays the foundations for the second part of the track, which will deal with Machine Learning for Trading. Let’s move quickly to the core of the business.

The question I want to answer is the following:

  • Is it possible to exploit event studies in a trading strategy?

First of all we should clarify what an event study is. As Wikipedia states, an event study is a statistical method to assess the impact of an event on the value of a firm. This definition is very broad and can easily incorporate facts directly concerning the company (e.g. the private life of the CEO, mergers with other firms, confidential news from insiders) or anomalous fluctuations in the price of the stock. I naively (and maybe incorrectly) categorized events regarding a company into these two types, news related and market related, but there should be no real difference, as they are generally tightly correlated. In any case, as it is not easy to access and parse news feeds in real time, we will focus on market related events: in the rest of the post an event must be understood as an anomalous behavior in the price of a stock whose consequences we could exploit to trade more efficiently.

Now that we have properly defined an event we can go back to the beginning and think a little bit more about what studying an event really means. To understand it, let’s walk through a complete example and suppose that we have an event whenever the closing price of a stock at the end of day i is less than $10 whilst at the end of day i-1 it was more than $10. Thus we are examining a significant drop in the price of the stock. Given this definition the question is: what statistically happens to the prices of stocks experiencing this kind of fluctuation? Is there a trend that could somehow be exploited? The reasoning behind these questions is that if we knew in advance that a stock followed a specific pattern as a consequence of some event, we could adjust our trading strategy accordingly. If statistics suggest that the price is bound to increase, it may be a good idea to long the shares, whereas in the opposite case the best decision is to short.
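The example event above can be sketched in a few lines of plain Python (function and variable names are mine, not QSTK’s):

```python
def find_crossing_events(closes, threshold=10.0):
    """Return indices i where close[i] < threshold and close[i-1] >= threshold."""
    events = []
    for i in range(1, len(closes)):
        if closes[i] < threshold and closes[i - 1] >= threshold:
            events.append(i)
    return events

# Hypothetical closing prices for one stock:
print(find_crossing_events([11.2, 10.5, 9.8, 9.9, 10.3, 9.7]))  # -> [2, 5]
```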

In order to run an event study we take advantage of the EventProfiler class inside the QSTK library. This class allows us to define an event and then, given a time interval and a list of stocks, it works in the following way: it scrolls firm after firm and whenever it finds an event it sets that day as day 0. Then it goes 20 days ahead and 20 days back and saves the timeframe. After having analyzed all the stocks it aligns the events on day 0, averages all the prices before and after and scales the result by the market (SPY). The output is a chart which basically answers this question: what happens on average when the closing price of a stock at the end of day i is less than $10 whilst at the end of day i-1 it was more than $10? The test period was the one between 1 January 2008 and 31 December 2009 (in the middle of the financial crisis), while the stocks chosen were the 500 contained in the S&P index in 2012. The graph is shown below and the following information can be extracted: first, 461 such events were registered during the investigated time frame. Second, on the day of the event there is a drop of about 10% in the stock price w.r.t. the day before. Third, the price seems to recover after day zero, even though the confidence intervals of the daily increase are huge.
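The aggregation EventProfiler performs can be sketched in plain Python (this is my own toy reimplementation of the idea, not QSTK’s code): align every event on day 0, keep a symmetric window around it, scale by the day-0 price and average across all events.

```python
def event_study(prices_by_stock, events_by_stock, window=20):
    """Average the +/- `window` day price frame around every event, scaled to day 0."""
    frames = []
    for sym, events in events_by_stock.items():
        closes = prices_by_stock[sym]
        for day0 in events:
            # keep only events with a full window on both sides
            if day0 - window < 0 or day0 + window >= len(closes):
                continue
            frame = closes[day0 - window : day0 + window + 1]
            frames.append([p / closes[day0] for p in frame])  # scale to day 0
    if not frames:
        return []
    n = len(frames)
    return [sum(f[t] for f in frames) / n for t in range(2 * window + 1)]
```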

Now the idea is the following. If the observed behavior holds, what we can do is build a trading strategy consisting in buying on the day of the event and selling, let’s say, after 5 days (we don’t want to hold too long despite the price increasing almost monotonically). Just to recap, here you find the whole pipeline from event definition to portfolio assessment.


Now that we have a plan let’s dive into the code (you can find all the code on Github).

First of all I’ll introduce one after the other the two main functions.

find_events(ls_symbols,  d_data,  shares=100):  given the list of the stocks in the portfolio, their historical prices and the number of shares to be traded, this function identifies events and issues a Buy order on the day of the event and a Sell order after 5 trading days. Eventually it returns a csv file to be passed to the market simulator. The first lines of the csv file are previewed below (year, month, day, stock, order, shares).









marketsim(investment, orders_file, out_file):  given the initial investment in dollars ($50,000 in our case), the csv file containing all the orders (the output of find_events()) and the file in which to save the results of the simulation, this function places the orders in chronological order and automatically updates the value of the portfolio. It returns a csv file with the portfolio value over time, a plot comparing the portfolio performance against the market benchmark, and prints to screen a summary of the main financial metrics used to evaluate the portfolio.
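The bookkeeping inside such a simulator boils down to a cash balance plus a holdings dictionary, marked to market every day. A simplified toy sketch (integer day indices instead of dates; names are mine, not the original code):

```python
def simulate(investment, orders, prices, n_days):
    """orders: list of (day, symbol, 'Buy'|'Sell', shares); prices: symbol -> daily closes."""
    cash, holdings, values = investment, {}, []
    by_day = {}
    for day, sym, side, qty in orders:
        by_day.setdefault(day, []).append((sym, side, qty))
    for day in range(n_days):
        # execute the day's orders at that day's closing price
        for sym, side, qty in by_day.get(day, []):
            sign = 1 if side == "Buy" else -1
            cash -= sign * qty * prices[sym][day]
            holdings[sym] = holdings.get(sym, 0) + sign * qty
        # mark the portfolio to market
        equity = sum(q * prices[s][day] for s, q in holdings.items())
        values.append(cash + equity)
    return values
```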

main(): this function calls the previous two after getting and cleaning all the relevant data.

This is the output, as promised:


Well, despite the huge crisis (-19% market return) our trading strategy earned us a remarkable +19%! This was just an example, but a very powerful one to show the possibilities of event studies in finance.




by Francesco Pochetti

Community Detection in Social Networks

In this post I would like to share a very basic approach to Community Detection in Social Networks.


I came across this fascinating topic following the superb course on Mining Massive Datasets provided on Coursera by Stanford University. The specific field of finding overlapping clusters in graphs is introduced and treated in depth during the third week of classes (links to the PDF slides available for Part 1 and Part 2). I immediately found it extremely interesting and decided to play around by myself. There are at least two very strong reasons to directly check the potential of this group of algorithms: first of all my complete lack of knowledge in the field, and secondly the data I found a couple of weeks ago in the Kaggle competition “Learning Social Circles in Networks“. The contest challenges participants to correctly infer Facebook users’ social communities. Such circles may be disjoint, overlap, or be hierarchically nested.  To do this, machine learners have access to:

  1. A list of the user’s friends
  2. Anonymized Facebook profiles of each of those friends
  3. A network of connections between those friends (their “ego network”)

This is exactly what I needed for my learning purposes!


The approach I propose below is structured in two main parts:

  1. Build the graph of the ego-networks, extracting nodes and edges from the Kaggle data. I implemented this step in Python, generating the graphs with Networkx and saving the adjacency matrix of each of them to a separate file.
  2. Community detection on top of the undirected graph. I performed this step in R, loading the graphs as adjacency matrices and then running a bunch of clustering algorithms available in R-igraph.

The use of both Python and R was not planned in the first place. I directly dived into the former, supported by Networkx, but as soon as I started digging into the community detection algorithms I realized that R-igraph had a wonderful ensemble of methods directly available. Note that igraph supports Python as well, but apparently the two bindings do not offer the same features and the R one seems to be much richer. I was a bit disappointed at the very beginning but in the end I grabbed the opportunity to learn a new package.

Enough words, I’d say. Let’s go for some code.

Building Ego-Networks

The Kaggle data (available here) is organized in 110 .egonet files (corresponding to 110 anonymized Facebook users), each containing the network of the user’s friends. A practical example may help to clarify the data structure.

Let’s focus on the file 0.egonet, which contains all the information on user 0‘s network. Each row of the file lists the friends of the first user in the line, who is directly part of the ego’s network. Below the first 4 lines are shown (for the sake of clarity only the first five connections in each line are reported).

0 has 1 as friend who has 146-189-229… as friends as well.

0 has 2 as friend who has 146-191-229… as friends as well.

0 has 3 as friend who has 185-80-61… as friends as well.

0 has 4 as friend who has 72-61-187… as friends as well.

Well I guess you get the point…

Below I attach the Python code which accesses every egonet file and builds a list of nodes and edges to be fed to the Networkx constructor.  Just to be clear, [0, 1, 2, 3, 4 …] are vertices of the graph while [(0-1), (1-146), (1-189), (1-229) …] are edges or connections. After a graph has been constructed, its adjacency matrix is computed and saved to a csv file.
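Since the full script lives on Github, here is a minimal sketch of just the parsing step, assuming the `<friend>: <friend> <friend> ...` per-line layout illustrated above (function name and the `ego=0` default are mine):

```python
def parse_egonet(lines, ego=0):
    """Turn egonet file lines into (nodes, edges) ready for a graph constructor."""
    nodes, edges = {ego}, set()
    for line in lines:
        head, _, rest = line.partition(":")
        friend = int(head)
        nodes.add(friend)
        edges.add(tuple(sorted((ego, friend))))  # ego -- friend
        for other in map(int, rest.split()):
            nodes.add(other)
            edges.add(tuple(sorted((friend, other))))  # friend -- friend-of-friend
    return nodes, edges
```

The resulting node and edge sets can be fed directly to `networkx.Graph`, from which the adjacency matrix is then extracted and saved.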

The result of the provided code is 110 CSV files containing the adjacency matrices of each ego network graph. Let’s move to the real clustering part.

Detecting Communities

First of all let’s plot a graph and see what it looks like before community detection. Below is the R code to load the data from the CSV file, build the network (we stick to 0.egonet) and draw it.


 Time for some clustering.

R-igraph provides several powerful community detection algorithms. Each of them works in a different way and I highly encourage you to have a look at this very informative post on Stack Overflow describing all of them in detail. I decided to go for the whole bunch of algorithms as I wanted to somewhat compare their performances, which I did with the help of Modularity. This metric measures the strength of division of a network into modules. Networks with high modularity have dense connections between the nodes within modules but sparse connections between nodes in different modules.

Modularity is basically the fraction of the edges that fall within the given groups minus the expected such fraction if edges were distributed at random. So the higher the better.
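That definition translates almost directly into code. A small sketch for a hard (non-overlapping) partition, using Newman’s formulation Q = sum_c (e_c/m - (d_c/2m)^2), where e_c is the number of edges inside community c and d_c the total degree of its nodes (igraph computes this for you; this pure-Python version is only illustrative):

```python
def modularity(edges, community):
    """edges: iterable of (u, v); community: dict node -> community id."""
    m = len(edges)
    within = {}      # edges with both endpoints in the same community
    degree_sum = {}  # total degree per community
    for u, v in edges:
        for node in (u, v):
            c = community[node]
            degree_sum[c] = degree_sum.get(c, 0) + 1
        if community[u] == community[v]:
            c = community[u]
            within[c] = within.get(c, 0) + 1
    return sum(
        within.get(c, 0) / m - (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )
```

For example, two triangles joined by a single edge, split into their natural communities, score Q = 5/14, well above what a random split would give.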

Here you find the results on the user-0-network.

The spinglass algorithm (based on a statistical physics approach) is the best one, with a modularity of 0.4649. It turns out that for this particular problem of community detection in small ego-social-networks the spinglass method beats the others in all the 110 egonet graphs.

Below you can find a nice visualization of the detected clusters, in R as well. By the way, the plot at the top of the post is exactly the same as the following one, visualized in a fancier way.


by Francesco Pochetti

Image Text Recognition in Python

In this post I’m going to summarize the work I’ve done on Text Recognition in Natural Scenes as part of my second portfolio project at Data Science Retreat.

The importance of image processing has increased a lot over the last years. Especially with the growing market of smartphones, people have started producing a huge amount of photos and videos which are continuously streamed on social platforms. Thus the increased interest of the industry towards this kind of problem is completely justified.

Machine learning obviously plays a very significant role in this field. Automatic text detection and character recognition is just an example. One can cite other sophisticated applications such as animal species or plant identification, human being detection or, more generally, extraction of any kind of information of commercial use.

The topic I was interested in diving into is OCR, which stands for Optical Character Recognition. This field has been the object of very intensive study in the past decades. Actually, at present, the problem of character recognition from black and white documents is considered solved. It is pretty common practice to scan a sheet of paper and use some standard software to convert it to a text file. There are also very good open source tools out there, such as Tesseract-OCR, which can read and detect up to 60 languages. In any case those are the easy cases: the images are grayscale with very good contrast, there is no specific issue in single character contour detection and few problems due to lighting or shadows.

A completely different scenario starts being depicted when we deal with natural scenes: for example, a photo taken by a Twitter user and then posted on the social platform. In this case the problem has not been solved at all. Actually there are still quite big issues in processing this kind of image. Very big improvements have been made by some Google services such as Translate, which recently added a new feature capable of detecting and translating text from images, but the results are still not completely satisfactory and they highly depend on the quality of the picture and on the environmental conditions (night/day, light/shadow) in which it was taken.

Of course the target of my project is not to find a final solution to this kind of open problem, but in any case it is still worth trying and practicing with such a fascinating topic.

In order to show the results of my work I’ll walk through a complete example, starting from the raw image (which could ideally be a picture from a user) and ending with the detected text. The results are not satisfactory yet, but in any case the pipeline (image processing + machine learning) is properly working, which opens the way to huge improvements.  I implemented the whole project in Python (Pandas/Scikit-Learn/Numpy/Skimage) but for the sake of simplicity and shortness I won’t walk through the code, which is available on Github. The post is organized as follows:

  1. Image Preprocessing and Object Detection
  2. Text Detection
  3. Text Classification
  4. Text Reconstruction
  5. Conclusions

Image Preprocessing and Object Detection

The image I picked to test my code is the following one:

As you can see, together with the text at the bottom, the background image is quite complex and overwhelming. The quote and the name of the author are also printed in two different font sizes, which adds some sort of additional challenge to the task.

After having loaded the image, it needs to be preprocessed. Specifically it goes through the next two steps:

  • Denoising: this is done applying a total variation approach which consists in reducing as much as possible the integral of the absolute gradient of the image, where the gradient of an image can simply be interpreted as a directional change in the intensity or color in the image itself.
  • Increasing Contrast: this is done applying Otsu‘s method which calculates an “optimal” threshold by maximizing the variance between two classes of pixels,  separated by the threshold. Equivalently, this threshold minimizes the intra-class variance.
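As an illustration of the second step, Otsu’s threshold can be computed from the grayscale histogram alone. A pure-Python sketch (the project itself relies on skimage for this; the function name is mine):

```python
def otsu_threshold(pixels, levels=256):
    """Pick the threshold maximizing between-class variance of a grayscale histogram."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_b = sum_b = 0
    for t in range(levels):
        w_b += hist[t]       # background pixel count
        if w_b == 0:
            continue
        w_f = total - w_b    # foreground pixel count
        if w_f == 0:
            break
        sum_b += t * hist[t]
        mu_b = sum_b / w_b
        mu_f = (sum_all - sum_b) / w_f
        var_between = w_b * w_f * (mu_b - mu_f) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t
```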

After image cleaning, object detection is performed. Contours are identified and a rectangle is drawn around object candidates. The result of this process is the following figure.

As you can see, a lot of rectangles have been identified. Not all of them contain text, but we’ll take care of that in the following section. After this, the objects are converted to grayscale, resized to 20 x 20 pixels and then stacked into a 3D numpy array. The coordinates of each rectangle are also saved in order to be able to reconstruct the image afterwards. The result of these operations is shown in the following standard output and generated figure (plotting 100 random images from the selected text candidates):


Text Detection

Here comes the interesting part. It’s time to dive into some Machine Learning.

The challenge now is to detect which of the identified objects contain text in order to be able to classify it. I approached the problem in the following way: basically I have to train a model to make such a decision, which means that first of all I need a proper dataset, consisting ideally of half images containing text and half not containing it. To do that I decided to merge two existing data sources:

  1. I took 50k images from the 78903 available in the 74K Chars dataset. This is the half containing text and I labeled each image as a 1.
  2. I took all the 50k images in the CIFAR-10 dataset on Kaggle. This is the half NOT containing text and I labeled each image as a 0.

The complete dataset was then composed of 100k images, properly labeled and randomly shuffled. Then I needed a model to perform the binary classification.

The model I turned to worked in two steps:

  1. Feature Extraction: this step is performed by computing the Histogram of Oriented Gradients (HOG) of the image. This technique is based on the fact that local object appearance and shape within an image can be described by the distribution of intensity gradients, where the gradient of an image can simply be interpreted as a directional change in the intensity or color of the image itself. This approach is commonly used for object detection as it is able to detect contours of shapes in a fairly easy way. The result of this step is generally an ensemble of data points which carry much more information than the raw pixels. These new features are ready to be passed to the classifier.
  2. Classification: this step is performed using Support Vector Machines with a linear kernel. The idea at the base of this choice is that we don’t want to overcomplicate the situation. On the contrary, as we have already performed an “enrichment” step such as HOG, we want to apply a model which, while powerful, keeps things simple. A linear SVM is worth trying.
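To make the feature extraction step concrete, here is a toy, single-cell version of the HOG idea: a magnitude-weighted histogram of gradient orientations. Real HOG implementations work over many cells and normalized blocks; this sketch is mine, not the project’s code.

```python
import math

def orientation_histogram(img, bins=9):
    """img: 2-D list of grayscale values; returns a magnitude-weighted
    histogram of gradient orientations in [0, 180) degrees."""
    h = [0.0] * bins
    rows, cols = len(img), len(img[0])
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            gx = img[y][x + 1] - img[y][x - 1]  # central differences
            gy = img[y + 1][x] - img[y - 1][x]
            mag = math.hypot(gx, gy)
            ang = math.degrees(math.atan2(gy, gx)) % 180.0  # unsigned orientation
            h[int(ang / 180.0 * bins) % bins] += mag
    return h
```

A vertical edge, for instance, puts all its gradient energy into the 0-degree bin, which is exactly the kind of shape cue the downstream SVM can exploit.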

I ran Grid Search Cross Validation in order to optimize the hyperparameters of the pipeline (both for HOG and LinearSVC) and the following are my results on both train and test set for the binary classification problem:

97.28% on previously unseen data is very good. We are now ready to run the model on all the rectangles we detected in the first place and select only the ones containing real text. The result of these operations is shown in the following standard output and generated figure:

You can see that the result is good despite not being completely satisfactory: some objects were classified as characters while that is not the case at all. This is going to cause us more than one problem in the next steps, but let’s go on anyway.

Text Classification

Now that we have final candidates it’s time to classify the single characters. The approach I followed is exactly the same I considered for the Text Detection.

In this case the dataset is composed of the 78903 images available in the 74K Chars dataset. We are not dealing with a binary classification anymore as in this case the number of classes is 36:

  • integers [0-9] : 10 classes
  • lowercase letters of English alphabet [a-z] : 26 classes

I actually decided to reduce the initial number of classes from 62 to 36, as I counted uppercase and lowercase English characters as belonging to the same group.

As for the machine learning part, I followed the same exact approach considered in the previous section. The pipeline is composed of a feature extraction step performed by HOG and a classification step carried out by a linear SVM. After hyperparameter selection by Grid Search CV, the following are the results on train and test set:

This is quite promising, so let’s run the model on our text candidates and see what happens. The output is shown in the plot below, in which each character is reported together with the result of the prediction.

Nice! So now we have a prediction and it’s time to reconstruct the string.

Text Reconstruction

This is the last part of the work and simply consists in putting together all the pieces of the puzzle we have built so far. Just to recap, we have characters with rectangle coordinates from the original image, and predictions. What we can do is simply build another figure, plotting the predicted strings in the right positions. The result of this approach is the following:

This chaotic outcome was sort of expected considering the errors accumulated during the several steps, but it is very encouraging! I want to emphasize that I manually added the violet rectangles at the end of the procedure to point out the structure of the sentence, so they are not generated automatically by the code.


To conclude, I developed a first basic system for automatic text detection and classification in natural scenes (code here on Github). It definitely suffers from several problems, but a working pipeline was my first target and it is actually doing its job.

Now lots of possibilities for improvement are available. First of all, an accurate analysis of the bottlenecks is necessary in order to identify weak points and select the steps needing serious refactoring. As for the algorithmic part, it is definitely worth giving Neural Networks and Deep Learning a try (nntools by Theano could be an idea), both for the binary text/no-text classification and for the OCR multi-classification. A significant improvement in both steps would result in far less noise in the last part of the program, making the text reconstruction phase much more tractable.

To recap I think the following could be good starting points:

  1. Get rid of nested rectangles in object detection. This solves the problem of detecting a circle (classified as an ‘o’) inside an ‘a’.
  2. Manually label objects as containing or not containing text. It is possible to add a wait_for_key during the object detection phase and, as soon as a rectangle is identified, manually specify whether it is text or not. For example a tree may be misclassified as text and then classified as a T. Manual labeling is very time consuming, and before diving into that it is necessary to analyze the pipeline and be sure it is worth doing.
  3. Introduce as a final step a ‘Guess Missing Text’ phase to correct little mistakes. For example, if in the end we should detect the word ‘house’ but we identify ‘hous’, well, of course that’s a house!
  4. Implement Neural Networks and Deep Learning.
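Point 3 could start as simply as snapping each detected string to the nearest dictionary word by edit distance (a hypothetical sketch, not part of the current code; `correct` and its vocabulary are mine):

```python
def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def correct(word, vocabulary):
    """Return the vocabulary word closest to the detected string."""
    return min(vocabulary, key=lambda w: edit_distance(word, w))

print(correct("hous", ["house", "mouse", "horse"]))  # -> house
```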

That’s it!

by Francesco Pochetti