Tag Archives: Scikit

Support Vector Machines

All posts in the series:

  1. Linear Regression
  2. Logistic Regression
  3. Neural Networks
  4. The Bias v.s. Variance Tradeoff
  5. Support Vector Machines
  6. K-means Clustering
  7. Dimensionality Reduction and Recommender Systems
  8. Principal Component Analysis
  9. Recommendation Engines

Here my pythonic playground about Support Vector Machines.
The code below was originally written in matlab for the programming assignments of Andrew Ng’s Machine Learning course on Coursera.
I had some fun translating everything into python!
Find the full code here on Github and the nbviewer version here.

by Francesco Pochetti

Pythonic Neural Networks

All posts in the series:

  1. Linear Regression
  2. Logistic Regression
  3. Neural Networks
  4. The Bias v.s. Variance Tradeoff
  5. Support Vector Machines
  6. K-means Clustering
  7. Dimensionality Reduction and Recommender Systems
  8. Principal Component Analysis
  9. Recommendation Engines

Here my implementation of Neural Networks in numpy.
The code below was originally written in matlab for the programming assignments of Andrew Ng’s Machine Learning course on Coursera.
I had some fun translating everything into python!
Find the full code here on Github and the nbviewer version here.

by Francesco Pochetti

Predict physical and chemical properties of soil using spectral measurements

Check out on NBViewer the work I’ve done with Pandas, Scikit-Learn, Matplotlib wrapped up in  IPython about predicting physical and chemical properties of African soil using spectral measurements on Kaggle.

The code and the files are also available on Github.

Here the challenge: “Advances in rapid, low cost analysis of soil samples using infrared spectroscopy, georeferencing of soil samples, and greater availability of earth remote sensing data provide new opportunities for predicting soil functional properties at unsampled locations. Soil functional properties are those properties related to a soil’s capacity to support essential ecosystem services such as primary productivity, nutrient and water retention, and resistance to soil erosion. Digital mapping of soil functional properties, especially in data sparse regions such as Africa, is important for planning sustainable agricultural intensification and natural resources management.

Diffuse reflectance infrared spectroscopy has shown potential in numerous studies to provide a highly repeatable, rapid and low cost measurement of many soil functional properties. The amount of light absorbed by a soil sample is measured, with minimal sample preparation, at hundreds of specific wavebands across a range of wavelengths to provide an infrared spectrum (Fig. 1). The measurement can be typically performed in about 30 seconds, in contrast to conventional reference tests, which are slow and expensive and use chemicals.

Conventional reference soil tests are calibrated to the infrared spectra on a subset of samples selected to span the diversity in soils in a given target geographical area. The calibration models are then used to predict the soil test values for the whole sample set. The predicted soil test values from georeferenced soil samples can in turn be calibrated to remote sensing covariates, which are recorded for every pixel at a fixed spatial resolution in an area, and the calibration model is then used to predict the soil test values for each pixel. The result is a digital map of the soil properties.

This competition asks you to predict 5 target soil functional properties from diffuse reflectance infrared spectroscopy measurements.”

by Francesco Pochetti

Image Text Recognition in Python

In this post I’m going to summarize the work I’ve done on Text Recognition in Natural Scenes as part of my second portfolio project at Data Science Retreat.

The importance of image processing has increased a lot during the last years. Especially with the growing market of smart phones people has started producing a huge amount of photos and videos which are continuously streamed on social platforms. Thus the increased interest of the industry towards this kind of problems is completely justified.

Machine learning obviously plays a very significant role in this field. Automatic text detection and character recognition is just an example. One can cite other sophisticated applications such as animal species or plants identification, human beings detection or, more in general, extraction of any kind of information of commercial use.

The topic I was interested to dive into is OCR which stands for Optical Character Recognition. This field has been object of very intensive study in the past decades. Actually, at present, the problem of character recognition from black and white documents is considered solved. It is pretty common practice to scan a sheet of paper and use some standard software to convert it to a text file. There are also very good open source tools out there, such as Tesseract-OCR, which can read and detect up to 60  languages. In any case those are easy cases. The image are gray scale, very good contrast, no specific issue in single character contour detection and little problems due to lighting or shadows.

A completely different scenario starts being depicted when we deal with natural scenes. For example a photo taken by a twitter user and then posted on the social platform. In this case the problem has not been solved at all. Actually there are still quite big issues in processing this kind of images. Very big improvements have been made by some Google services such as Translator which recently added a new feature capable of detecting and translating text from images, but anyway the results are not completely satisfactory and they highly depend on the quality of the picture and on the environmental conditions (night/day, light/shadow) in which it was taken.

Of course the target of my project is not to find a final solution to this kind of open problem but in any case it is still worth trying and practice with such a fascinating topic.

In order to show  the results of my work I’ll walk through a complete example, starting from the raw image (which could be ideally a picture from a user) and ending with detected text. The results are not satisfactory yet but in any case the pipeline (image processing + machine learning) is properly working, which opens the way to huge improvements.  I implemented the whole project in Python (Pandas/Scikit-Learn/Numpy/Skimage) but for the sake of simplicity and shortness I won’t walk through the code which is available on Github. The post is organized as follows:

  1. Image Preprocessing and Object Detection
  2. Text Detection
  3. Text Classification
  4. Text Reconstruction
  5. Conclusions

Image Preprocessing and Object Detection

The image I picked to test my code is the following one:

laoAs you can see together with text at the bottom the background image is quite complex and overwhelming. The quote and the name of the author are also printed in two different font size which adds some sort of additional challenge to the task.

After having loaded the image, it needs to be preprocessed. Specifically it goes through the next two steps:

  • Denoising: this is done applying a total variation approach which consists in reducing as much as possible the integral of the absolute gradient of the image, where the gradient of an image can simply be interpreted as a directional change in the intensity or color in the image itself.
  • Increasing Contrast: this is done applying Otsu‘s method which calculates an “optimal” threshold by maximizing the variance between two classes of pixels,  separated by the threshold. Equivalently, this threshold minimizes the intra-class variance.

After image cleaning, object detection is performed. Contours are identified and a rectangle is drawn around objects candidates. The result of this process is the following figure.

preprocessed1As you can see a lot of rectangles have been identified. Not all of them contain text but we’ll take care of that in the following section. After this the objects are converted to greyscale, resized to 20 X 20 pixels and then stacked into a 3D numpy array. The coordinates of each rectangle are also saved in order to be able to reconstruct the image afterwards. The result of this operations is showed in the following standard output and generated figure (plotting 100 random images from the text-candidates selected):


Text Detection

Here comes the interesting part. It’s time to dive into some Machine Learning.

The challenge now is to detect which ones of the identified objects contain text in order to be able to classify it. I approached the problem in the following way: basically I have to train a model to make such a decision which means that first of all I need a proper dataset, consisting ideally of half images containing text and half not containing it. To do that I decided two merge two existing data sources:

  1. I took 50k images from the 78903 available in the 74K Chars dataset. This is the half containing text and I labeled each image as a 1.
  2. I took all the 50k images in the CIFAR-10 dataset on Kaggle. This is the half NOT containing text and I labeled each image as a 0.

The complete dataset was then composed of 100k images, properly labeled and randomly shuffled. Then I needed a model to perform the binary classification.

The model I turned to worked in two steps:

  1. Feature Extraction: this step is performed computing the Histogram Of Gradient (HOG) of the image. This technique is based on the fact that local object appearance and shape within an image can be described by the distribution of intensity gradients, where the gradient of an image can simply be interpreted as a directional change in the intensity or color in the image itself. This approach is commonly used for object detection as it is able ti detect in a fairly easy way contours of shapes. The result of this step is generally an ensemble of data points which carry much more information than the beginning. These new features are ready to be passed to the classifier.
  2. Classification: this step is performed using Support Vector Machines with a Linear Kernel. The idea at the base of this choice is that we don’t want to over complicate the situation. On the contrary as we already performed an “enrichment” step such as HOG we want to apply a model which, being powerful, would keep things simple. Linear SVM is worth trying.

I run Grid Search Cross Validation in order to optimize the hyperparameters of the Pipeline (both for HOG and LinearSVC) and the following are my results on both train and test test for the binary classification problem:

 97.28% on previously unseen data is very good. We are now ready to run the model on all the rectangles we had detected in the first place and select only the ones containing real text. The result of this operations is showed in the following standard output and generated figure:

OCT1 You can see that the result is very good despite not being completely satisfactory. There are objects which were classified as being characters while it is not the case at all. This is going to cause us more than one problem in the future steps but anyway let’s go on.

Text Classification

Now that we have final candidates it’s time to classify the single characters. The approach I followed is exactly the same I considered for the Text Detection.

In this case the dataset is composed of the 78903 images available in the 74K Chars dataset. We are not dealing with a binary classification anymore as in this case the number of classes is 36:

  • integers [0-9] : 10 classes
  • lowercase letters of English alphabet [a-z] : 26 classes

I actually decided to reduce the initial number of classes from 62 to 36 as I counted as belonging to the same group uppercase ans lowercase English characters.

As for the Machine Learning part I followed the same exact approach considered in the previous section. The pipeline is composed by a feature extraction step performed by HOG and a classification step carried out by a Linear SVM. After hyperparameter selection by Grid Search CV the following are the results on train and test set:

This is quite promising so let’s run the model on our text candidates and see what happens. The output is shown in the plot below in which each character is reported together with the result of the prediction.

SCR1Nice! So now we have a prediction and it’s time to reconstruct the string.

Text Reconstruction

This is the last part of the work and simply consists in putting together all the pieces of the puzzle we have build so far. Just to recap, we have characters with rectangles coordinates from the original image and predictions. What we can do is simply build an other figure plotting the predicted strings in the right positions. The result of this approach is the following:

final1This chaotic outcome was sort of expected considering the errors accumulated during the several steps but it is very encouraging! I want to emphasize that I manually added the violet rectangles at the end of the procedure to point out the structure of the sentence, so they are not generated automatically by the code.


Concluding I developed a first basic system for automatic text detection and classification in natural scenes (code here on Github). It definitely suffers from several problems but a working pipeline was my first target and it is actually doing its job.

Now lots of possibilities for improvement are available. First of all an accurate analysis of the bottlenecks is necessary in order to define weak points and select the steps needing serious refactoring. As for the algorithmic part it is definitely worth giving a try to Neural Networks and Deep Learning (nntools by Theano could be an idea), both for binary text-no-text classification and for OCR multi-classification. A significant improvement in both steps would result in far less noise in the last part of the program turning into more affordable the text reconstruction phase.

To recap I think the following could be good starting points:

  1. Get rid of nested rectangles in object detection. Solves the problem of detecting a circle (classified as an ‘o’) inside an ‘a’.
  2. Manually labeling objects containing or not containing text. It is possible to add a wait_for_key during the object detection phase and as soon as a rectangle is identified manually specify if it’s text or not. For example a tree may be miscassified as text and then classified as a T. Manual detection is very time consuming and before diving into that it is necessary to analyze the pipeline and be sure that it is worth doing it.
  3. Introduce as a final step a ‘Guess Missing Text Phase’ to correct little mistakes. For example if in the end we should detect the word ‘house’ but we identify ‘hous’, well of course that’s a house!
  4. Implement Neural Network and Deep Learning.

That’s it!

by Francesco Pochetti

Part VI – Trading Algorithm and Portfolio Performance


  1. Introduction and Discussion of the Problem
  2. Feature Generation
  3. Classification Algorithms
  4. Feature/Model Selection
  5. Results on Test Set
  6. Trading Algorithm and Portfolio Performance

Now that we have a prediction we can also develop a trading strategy and test it against the real markets.

Trading Strategy

The idea is the following. I built a forecasting algorithm and now I know with a certain confidence if the closing price of tomorrow will be higher or lower than the the closing price of today. How can I use this information?

The idea I’m about to go through is explained pretty much in detail on QuantStart, a very nice website with great financial tutorials in Python. Basically I picked their code and adapted it to my needs.

The strategy is very basic and works in this way: if the probability of the day being “up” exceeds 50%, the strategy purchases 500 shares of S&P 500 and sells it at the end of the day. if the probability of a down day exceeds 50%, the strategy sells 500 shares of S&P 500 and then buys back at the close. The idea is that I start with 100k US $ and buy and sell only playing with this amount of money.

It is quite evident that this strategy has only learning purposes. Even though we could be successfull and make at the end of the test period some positive returns, this approach is absolutely not applicable in real life for basically two reasons:

  1. Transaction costs (such as commission fees) have not been added to this backtesting system. Since the strategy carries out a round-trip trade once per day, these fees are likely to significantly curtail the returns.
  2. The strategy assumes that the closing price of today is going to be equal to the opening price of tomorrow which is unlikely to happen.

In any case I stress again that the purpose of this exercise is only a lerning one so it is worth going on and see how to implement this process in Python.

Basically everything is contained in the Python Code section. Instead of being too verbose in the post body I believed that in this context it would have been much better to comment directly inside the code,  so you’ll find all the relevant explanations below.

Portofolio Performance

This is maybe the most important part of all the blog posts I have written so far, as It summarizes in a single plot all the work done.

In the figure below (whose code you can find at the end in the Python Code section) there are two subplots:

  1. S&P 500 Close Price in the period 1 April 2014 – 28 August 2014. This first graph shows the actual trend of the market index in the backtest period. In this particular period the market had a return of almost 6%.
  2. Portofolio Value in the period 1 April 2014 – 28 August 2014. This graph shows the trend of the Porfolio generated on top of our predictions. As you can see the start value is 100k $ which end up at a final value, after 5 months of trading, of about 10%.


The results are quite good, and show the potential of this kind of approach. As I explained in all the recent posts there is much more work to be done and a lot to be improved. In any case i think that the whole process I just described can be the base of a more robust pipeline.

Thanks a lot for reading and see you with the next project!

Python Code

In the previous code snippet there are two call to the following external functions:

  1. getPredictionFromBestModel() : Function
  2. MarketIntradayPortfolio() : Class
  3. backtest_portfolio() : Class Method

Below I provide the code for all of them adding the line at which they were called right before the code itself.

– getPredictionFromBestModel()

– MarketIntradayPortFolio(Portolio)

Portolio interface is provided at the end

– backtest_portofolio()

 – Plotting Portfolio Performance with Matplotlib



by Francesco Pochetti

Part V – Results on Test Set


  1. Introduction and Discussion of the Problem
  2. Feature Generation
  3. Classification Algorithms
  4. Feature/Model Selection
  5. Results on Test Set
  6. Trading Algorithm and Portfolio Performance

We closed the previous post with the results of Cross Validation.

Eventually we decided that our best combinations is the following:

  • Algorithm: Random Forests (n_estimators = 100)
  • Features: n = 9 / delta  = 9

Random Forests Results

From a strict Machine Learning point of view now that we have picked model and features it should be time of algorithm parameter selection. This goal can be achieved performing an other round of CV on train set looping over a set of parameters and then select the ones maximizing the average accuracy on folds. Obviously the parameters of interest change with the algorithm. In Random Forests Scikit-Learn implementation there are several elements we are allowed to play with:

  1. n_estimators (default = 10): The number of trees in the forest.
  2. max_features (default = sqrt(n_features)): The number of features to consider when looking for the best split
  3. Other minor parameters such as the maximum depth of the tree, the minimum number of samples required to split an internal node and so on so forth.

The capability of these parameters to significantly affect the performances of a model strongly depend on the specific problem we are facing. There is now generic rule of thumb to stick to. It’s quite well known that increasing the number of estimators decreases the train error (and hopefully the test one too), but in any case after a certain limit has been passed no relevant improvement will be recorded. Thus, we will only be raising the computation cost without a real benefit. For this reason I stick to 100 trees and never changed it.

As for the max_features parameter, this can be a tricky one. Random Forests is just an improved evolution on the Bagging or Bootstrap Aggregating Algorithm, which solves the high variance innate problem of a single tree but introduces an unavoidable high correlation among all the bootstrapped trees due to the fact that all the features are taken into account at each iteration. Thus, in presence of  few dominant predictors, Bagging basically splits the trees always in the same way and eventually we end up averaging a bunch of identical trees. To solve this issue Random Forests was implemented. The main improvement consists in the fact that only a subset of the available features are selected each time a tree is built. As a consequence the trees are generally uncorrelated one to another and the final result is much more reliable. Theoretically speaking if I set max_features to be equal to the total amount of features I’d end up performing Bagging without even realizing it. At the end of the day this parameter may be pretty relevant. In any case the default one (square root of the number of predictors) is a very good balance in terms of bias-variance trade-off.

Let’s go back to our firts concern which was to measure the accuracy of the best CV model on test set. To do that I train the model on the whole train set and report the accuracy on the test set. After the model has been trained it is saved to a file (.pickle) in order to be reused in the future and to avoid to generate it from scratch each time we need it. The output of the cose is provided below.

The accuracy of our final model is around 57%. This result is quite disappointing. We have always to keep in mind that we are performing a binary classification, in which case random guessing would have a success probability of 50%. So basically our best model is is only 7% better than tossing a coin. I have to admit that I struggled with this accuracies for quite a bit and after some reasoning and a lot of literature I arrived to the following conclusions.

As stressed at the very beginning of this set of posts the Stock Market Prediction Problem is very though. Lots of research has been produced on this topic and lots will surely be produced in the future and getting relevant results out of the blue is hard. We have always to remind ourselves that we are basically going against the Efficient Market Hypothesis (EMH), asserting that markets are informationally efficient, which means that they immediately auto adjust as soon as an event or a pattern is identified. As a result of this forecasting procedures in this kind of environment are very challenging.

The most common mistake is to believe that the relevant information is inside the market itself. There is for sure some significant information beyond historical exchange data but its extraction is not straightforward and generally can be achieved by technical time series analysis and econometrics. It’s feasible to get something out of raw data with basic financial analysis, as I did, but real information must be scraped much more in depth.

In addition to this we must account also for poor results of common Machine Learning Algorithms. As far as I read from the available literature much better performances can be obtained by implementing custom cost functions taking into account correlations and other more advanced metrics.

Concluding I’m very glad with the process I followed and the pipeline I built up, but the results are evidently not satisfactory.

In any case finally we have a model! Next step is to build a trading algorithm on top of our predictions and see what happens in real life with a backtest example.

by Francesco Pochetti

Part IV – Model/Feature Selection


  1. Introduction and Discussion of the Problem
  2. Feature Generation
  3. Classification Algorithms
  4. Feature/Model Selection
  5. Results on Test Set
  6. Trading Algorithm and Portfolio Performance

In the last post I introduced the classification algorithms tested for the project’s purposes. The function in charge of data preparation and splitting has also been presented. Basically we are now ready for some real ML reasoning.

Cross Validation

We have introduced the algorithms and the features but now the real though question is the following:

  • What is the best algorithm and what are the best features?
  • But most importantly what does “best” mean?

In order to answer this questions we have to introduce Cross Validation which I actually already treated quite extensively in this post about Pythonic Cross Validation on Time Series.

Just to recap a little bit for the ones who don’t want to go through the linked post.

First of all we have to decide a metrics to evaluate the results of our prediction. Generally a well accepted measurement is the Area Under the Curve (AUC), which consists in the percentage of misclassified events counted at several probability thresholds. The output of a prediction algorithm can always be interpreted as a probability for a certain case to belong to a specific class or not. For a binary classification the default behavior of the algorithm is to classify as “0” a case whose probability of belonging to “1” is less than 50%, and viceversa. This threshold can be varied as needed, depending on the field of investigation. There may be situations in which this kind of tricks is absolutely fundamental. This is the case for example in which the relative proportion of the two classes is extremely skewed towards one of them. In this case the 50% threshold does not really make sense. If I’m building a model to detect whether a person is affected by a particular disease or not and the disease rate in the whole population is let’s say 3% then I do want to be very careful, as this 3% is very likely to fall in my misclassification error.

The AUC takes care of this kind of issues measuring the robustness of a classifier at several probability thresholds. In my case, being stock markets notoriously randomic I decided to stick to the more classic accuracy of a classifier fixing my threshold at 50%.

Having said that let’s come back to Cross Validation. In order to show the technique I’ll walk through a practical example.

We want to assess the performance of a Random Forest Classifier in the following conditions:

  1. 100 trees (n_estimators in Scikit-Learn)
  2. n = 2 / delta = 2 . Thus we are lagging the returns at maximun for 2 days and we are computing at maximum 2-day-returns and 2-day-moving average returns.

What we do next is what follows:

  1. Split data in train and test set given a Date (in my case after 1 April 2014 included).
  2. Split train set (before 1 April 2014 not included) in 10 consecutive time folds.
  3. Then, in order not lo lose the time information, perform the following steps:
  4. Train on fold 1 –>  Test on fold 2
  5. Train on fold 1+2 –>  Test on fold 3
  6. Train on fold 1+2+3 –>  Test on fold 4
  7. Train on fold 1+2+3+4 –>  Test on fold 5
  8. Train on fold 1+2+3+4+5 –>  Test on fold 6
  9. Train on fold 1+2+3+4+5+6 –>  Test on fold 7
  10. Train on fold 1+2+3+4+5+6+7 –>  Test on fold 8
  11. Train on fold 1+2+3+4+5+6+7+8 –>  Test on fold 9
  12. Train on fold 1+2+3+4+5+6+7+8+9 –>  Test on fold 10
  13. Compute the average of the accuracies of the 9 test folds (number of folds  – 1)

Repeat steps 1-12 in the following conditions:

  • n = 2 / delta = 3
  • n = 2 / delta = 4
  • n = 2 / delta = 5
  • n = 3 / delta = 2
  • n = 3 / delta = 3
  • n = 3 / delta = 4
  • n = 3 / delta = 5
  • n = 4 / delta = 2
  • n = 4 / delta = 3
  • n = 4 / delta = 4
  • n = 4 / delta = 5
  • n = 5 / delta = 2
  • n = 5 / delta = 3
  • n = 5 / delta = 4
  • n = 5 / delta = 5

and get average of the accuracies of the 9 test folds  in each one of the previous cases. Obviously there is an infinite number of possibilities to generate and assess. What I did is to stop at a maximum of 10 days. Thus I basically performed a double for loop up to n = 10 / delta = 10.

Each time the script gets into one iteration of the for loop it generates a brand new dataframe with a different set of features. Then, on top of the newborn dataframe, 10-fold Cross Validation is performed in order to assess the performance of the selected algorithm  with that particular set of predictors.

I repeated this set of operations for all the algorithms introducued in the previous post (Random Forest, KNN, SVM Adaptive Boosting, Gradient Tree Boosting, QDA) and after all the computations the best result is the following:

  • Random Forest | n = 9 / delta = 9 | Average Accuracy = 0.530748

The output for this specific conditions is provided below together with the python function in charge of looping over n and delta and performing CV on each iteration.

– function to perform Feature and Model Selection

It is very important to notice that nothing has been done at an algorithmic parameter level. What I mean is that with the previous approach we have been able to achieve two very important goals:

  1. Assess the best classification algorithm, comparing all of them on the same set of features.
  2. Assess for each algorithm the best set of features.
  3. Pick the couple (Model/Features) maximizing the CV Accuracy

Nice! So let’s move to the test set I would suggest!

by Francesco Pochetti

Part III – Scikit Classification Algorithms


  1. Introduction and Discussion of the Problem
  2. Feature Generation
  3. Classification Algorithms
  4. Feature/Model Selection
  5. Results on Test Set
  6. Trading Algorithm and Portfolio Performance

So finally, as a result of last post, we have a dataframe to play with.

Before diving into model and feature selection I would like to make a little overview of the Classification Algorithms I tested. In the next post I’ll show the techniques and the decision process followed to select the best model and the best features. Anyway to keep it simple I’ll present actual results only for the best model, so it is worth to introduce the other ones I implemented.

Classification ML

The following is a pretty awesome algorithm cheat-sheet provided as part of the Scikit-Learn Documentation. I’ll cover the Classification branch of the tree, going through the code needed to have the selected algorithms running.


First of all we need to prepare our data for the proper Machine Learning stuff. So let’s take care of that with prepareDataForClassification(). This function takes care of

  1. turning S&P 500 daily returns from a float to a string column (Up, Down)
  2. encoding it as a binary integer 1,0 (the only form accepted by Scikit-Learn Classifiers).
  3. selecting the features used for prediction (basically all the columns except for the first one – float returns – and the last one – string returns).
  4. splitting the whole dataframe into train and test set (based on a date passed as argument) and returns for each of them predictors and actual prediction.

performClassification() is in charge of calling the selected algorithm and returning the classification result

Following are the code snippets implementing the algorithms

Random Forest

K- Nearest Neighbors

Support Vector Machines

Adaptive Boosting

Gradient Tree Boosting

Quadratic Discriminant Analysis

Well, time to turn to some Model and Feature Selection!

by Francesco Pochetti