Tag Archives: Pandas

Predict physical and chemical properties of soil using spectral measurements

Check out on NBViewer the work I’ve done with Pandas, Scikit-Learn and Matplotlib, all wrapped up in an IPython notebook, on predicting physical and chemical properties of African soil from spectral measurements, a challenge hosted on Kaggle.

The code and the files are also available on Github.

Here is the challenge: “Advances in rapid, low cost analysis of soil samples using infrared spectroscopy, georeferencing of soil samples, and greater availability of earth remote sensing data provide new opportunities for predicting soil functional properties at unsampled locations. Soil functional properties are those properties related to a soil’s capacity to support essential ecosystem services such as primary productivity, nutrient and water retention, and resistance to soil erosion. Digital mapping of soil functional properties, especially in data sparse regions such as Africa, is important for planning sustainable agricultural intensification and natural resources management.

Diffuse reflectance infrared spectroscopy has shown potential in numerous studies to provide a highly repeatable, rapid and low cost measurement of many soil functional properties. The amount of light absorbed by a soil sample is measured, with minimal sample preparation, at hundreds of specific wavebands across a range of wavelengths to provide an infrared spectrum (Fig. 1). The measurement can be typically performed in about 30 seconds, in contrast to conventional reference tests, which are slow and expensive and use chemicals.

Conventional reference soil tests are calibrated to the infrared spectra on a subset of samples selected to span the diversity in soils in a given target geographical area. The calibration models are then used to predict the soil test values for the whole sample set. The predicted soil test values from georeferenced soil samples can in turn be calibrated to remote sensing covariates, which are recorded for every pixel at a fixed spatial resolution in an area, and the calibration model is then used to predict the soil test values for each pixel. The result is a digital map of the soil properties.

This competition asks you to predict 5 target soil functional properties from diffuse reflectance infrared spectroscopy measurements.”

by Francesco Pochetti

Part III – Scikit Classification Algorithms


  1. Introduction and Discussion of the Problem
  2. Feature Generation
  3. Classification Algorithms
  4. Feature/Model Selection
  5. Results on Test Set
  6. Trading Algorithm and Portfolio Performance

So finally, as a result of the last post, we have a dataframe to play with.

Before diving into model and feature selection I would like to give a short overview of the Classification Algorithms I tested. In the next post I’ll show the techniques and the decision process I followed to select the best model and the best features. To keep it simple I’ll present actual results only for the best model, so it is worth introducing the other ones I implemented here.

Classification ML

The following is a pretty awesome algorithm cheat-sheet provided as part of the Scikit-Learn Documentation. I’ll cover the Classification branch of the tree, going through the code needed to have the selected algorithms running.


First of all we need to prepare our data for the actual Machine Learning work, so let’s do that with prepareDataForClassification(). This function takes care of:

  1. turning S&P 500 daily returns from a float into a string column (Up, Down).
  2. encoding it as a binary integer (1, 0), a numeric form that every Scikit-Learn classifier accepts.
  3. selecting the features used for prediction (basically all the columns except for the first one – float returns – and the last one – string returns).
  4. splitting the whole dataframe into train and test sets (based on a date passed as an argument) and returning, for each of them, the predictors and the actual values to predict.
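The four steps above can be sketched roughly as follows. This is only a minimal illustration, not the original function: the name `prepare_data_for_classification`, the `start_test` argument, the `UpDown` column name and the choice of labelling a zero return as Up are all assumptions, as is the layout of the toy dataframe (float returns in the first column, predictors after it, dates in the index).

```python
import numpy as np
import pandas as pd


def prepare_data_for_classification(df, start_test):
    """Hypothetical sketch: label returns, encode them, split by date.

    Assumes `df` has a DatetimeIndex, float returns in the first column
    and the predictor columns after it.
    """
    data = df.copy()
    return_col = data.columns[0]

    # 1. float returns -> string labels (a zero return counts as Up: assumption)
    data["UpDown"] = np.where(data[return_col] >= 0, "Up", "Down")

    # 2. string labels -> binary integers (Up = 1, Down = 0)
    y = (data["UpDown"] == "Up").astype(int)

    # 3. features: every column except the first (float returns)
    #    and the last (string returns)
    X = data[data.columns[1:-1]]

    # 4. date-based split: the test set starts at `start_test` (included)
    train_mask = data.index < start_test
    return X[train_mask], y[train_mask], X[~train_mask], y[~train_mask]


# Toy data, invented purely for the example
dates = pd.date_range("2014-03-28", periods=6, freq="D")
demo = pd.DataFrame(
    {
        "Return_SP500": [0.01, -0.02, 0.003, -0.001, 0.02, -0.01],
        "Lag1": [0.00, 0.01, -0.02, 0.003, -0.001, 0.02],
        "Lag2": [0.00, 0.00, 0.01, -0.02, 0.003, -0.001],
    },
    index=dates,
)
X_train, y_train, X_test, y_test = prepare_data_for_classification(
    demo, pd.Timestamp("2014-04-02")
)
```

With this toy frame the five rows before 2 April 2014 end up in the train set and the last row in the test set.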

performClassification() is in charge of calling the selected algorithm and returning the classification result.

The following are the code snippets implementing the algorithms:

Random Forest

K- Nearest Neighbors

Support Vector Machines

Adaptive Boosting

Gradient Tree Boosting

Quadratic Discriminant Analysis
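As a hedged sketch of how the six algorithms listed above can be wired together in Scikit-Learn, the block below builds one estimator per algorithm and scores each on a held-out slice. The dictionary keys, the `perform_classification` name and signature, the hyperparameters and the synthetic dataset are all assumptions made for illustration; they are not taken from the original snippets.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# One factory per algorithm; the hyperparameters are illustrative defaults
CLASSIFIERS = {
    "RF": lambda: RandomForestClassifier(n_estimators=100, random_state=0),
    "KNN": lambda: KNeighborsClassifier(n_neighbors=5),
    "SVM": lambda: SVC(kernel="rbf", C=1.0),
    "ADA": lambda: AdaBoostClassifier(n_estimators=50, random_state=0),
    "GTB": lambda: GradientBoostingClassifier(n_estimators=100, random_state=0),
    "QDA": lambda: QuadraticDiscriminantAnalysis(),
}


def perform_classification(X_train, y_train, X_test, y_test, method):
    """Fit the selected algorithm and return its accuracy on the test data."""
    model = CLASSIFIERS[method]()
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)


# Synthetic stand-in for the real dataframe
X, y = make_classification(
    n_samples=200, n_features=8, n_informative=4, n_redundant=0, random_state=0
)
# Ordered split (first 150 rows train, last 50 test), mimicking a date cut
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

scores = {
    name: perform_classification(X_train, y_train, X_test, y_test, name)
    for name in CLASSIFIERS
}
```

Keeping the estimators behind a single dispatch function makes it easy to loop over all of them with identical train and test data.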

Well, time to turn to some Model and Feature Selection!

by Francesco Pochetti

Pythonic Cross Validation on Time Series

Working with time series has always been a serious challenge. Because the data is naturally ordered, the common Machine Learning methods, which by default tend to shuffle the entries and so lose the time information, cannot be applied as they are.

While working on stock market prediction I had to face this kind of challenge which, despite being pretty common, is not well covered in the documentation.

One of the first problems I encountered when applying my first ML algorithms was how to deal with Cross Validation. This technique is widely used to perform feature and model selection once the data collection and cleaning phases have been carried out. The idea behind CV is that, in order to select the best predictors and algorithm, it is mandatory to measure the accuracy of our process on a set of data which is different from the one used to train the model.

Just to be sure we are on the same page, let’s walk through the following basic example: imagine you are a professor teaching a student a new topic. To train the student you give him some questions with solutions (supervised learning), so that he can check whether his reasoning was right or wrong. But if you had to test his knowledge you wouldn’t ask him the same questions as the previous day, right? That would be unfair; the student would be happy of course, but the result of the test would not be reliable. To really check whether he has digested the new concept you have to give him brand new exercises, something he has not seen before.

This is the concept at the base of Cross Validation. The most accepted technique in the ML world consists in randomly picking samples out of the available data and splitting them into a train and a test set. To be completely precise, the steps are generally the following:

  1. Randomly split the data into a train and a test set.
  2. Focus on the train set and split it again randomly into chunks (called folds).
  3. Let’s say you got 10 folds; train on 9 of them and test on the 10th.
  4. Repeat step three 10 times, each time holding out a different fold, to get 10 accuracy measures on 10 different and separate folds.
  5. Compute the average of the 10 accuracies, which is the final reliable number telling us how the model is performing.
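As a rough illustration of the five steps above, here is a minimal sketch using Scikit-Learn's built-in helpers. The synthetic dataset, the Random Forest model and the 80/20 train/test proportion are assumptions chosen only for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic data standing in for a real, non time-ordered dataset
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_redundant=0, random_state=0
)

# Step 1: random train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Step 2: split the train set randomly into 10 folds
folds = KFold(n_splits=10, shuffle=True, random_state=0)

# Steps 3-4: train on 9 folds, test on the remaining one, 10 times over
model = RandomForestClassifier(n_estimators=50, random_state=0)
fold_accuracies = cross_val_score(
    model, X_train, y_train, cv=folds, scoring="accuracy"
)

# Step 5: average the 10 accuracies
cv_accuracy = fold_accuracies.mean()
```

Note the `shuffle=True` in `KFold`: this random shuffling is exactly what becomes a problem once the rows have a time order.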

The issue with Time Series is that the previous approach (implemented by the most common built-in Scikit functions) cannot be applied. The reason is that in the Time Series case the data cannot be shuffled randomly, because we would lose its natural order, which in most cases matters.

Thus one possible solution is the following:

  1. Split the data into a train and a test set given a date (e.g. the test set is what happens from 2 April 2014 onwards, that date included).
  2. Split the train set (e.g. what happens before 2 April 2014, that date not included) into, for example, 10 consecutive time folds.
  3. Then, in order not to lose the time information, perform the following steps:
  4. Train on fold 1 –>  Test on fold 2
  5. Train on fold 1+2 –>  Test on fold 3
  6. Train on fold 1+2+3 –>  Test on fold 4
  7. Train on fold 1+2+3+4 –>  Test on fold 5
  8. Train on fold 1+2+3+4+5 –>  Test on fold 6
  9. Train on fold 1+2+3+4+5+6 –>  Test on fold 7
  10. Train on fold 1+2+3+4+5+6+7 –>  Test on fold 8
  11. Train on fold 1+2+3+4+5+6+7+8 –>  Test on fold 9
  12. Train on fold 1+2+3+4+5+6+7+8+9 –>  Test on fold 10
  13. Compute the average of the accuracies of the 9 test folds (number of folds – 1).
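The expanding-window scheme listed above can be sketched in a few lines. This is an illustrative reimplementation under stated assumptions, not the original function: the names `expanding_window_splits` and `time_series_cv_accuracy`, the use of equally sized consecutive folds and the synthetic data are all hypothetical. Quadratic Discriminant Analysis is used only because it is one of the classifiers discussed in this series.

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis


def expanding_window_splits(n_samples, n_folds):
    """Yield (train_idx, test_idx): train on folds 1..k, test on fold k+1."""
    # Consecutive, order-preserving folds (no shuffling)
    folds = np.array_split(np.arange(n_samples), n_folds)
    for k in range(1, n_folds):
        yield np.concatenate(folds[:k]), folds[k]


def time_series_cv_accuracy(model, X, y, n_folds=10):
    """Return the mean accuracy and the per-fold accuracies (n_folds - 1 values)."""
    accuracies = []
    for train_idx, test_idx in expanding_window_splits(len(X), n_folds):
        model.fit(X[train_idx], y[train_idx])
        accuracies.append(model.score(X[test_idx], y[test_idx]))
    return float(np.mean(accuracies)), accuracies


# Synthetic, time-ordered stand-in for the train set (invented data)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)

mean_accuracy, fold_accuracies = time_series_cv_accuracy(
    QuadraticDiscriminantAnalysis(), X, y, n_folds=10
)

# Small check of the split logic: 20 samples in 4 folds give 3 train/test pairs
splits = list(expanding_window_splits(20, 4))
```

Every training window ends strictly before its test fold begins, so the model never sees the future. Scikit-Learn's `TimeSeriesSplit` (in `sklearn.model_selection`) produces comparable expanding-window splits, if a built-in is preferred.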

The previous approach is encapsulated in the following function, which I built myself to be sure of exactly what it was doing. The code is fully commented and provided below.

To show what performTimeSeriesCV() returns, a real sample output is provided.

Specifically, the following is the output of the next command, which performs a Cross Validation for a binary classification on a stock price (Positive or Negative Return) using Quadratic Discriminant Analysis.


by Francesco Pochetti