Tag Archives: Python

Given a picture, would you be able to identify which camera took it?

[Link to Jupyter Notebook]

“Finding footage of a crime caught on tape is an investigator’s dream. But even with crystal clear, damning evidence, one critical question always remains: is the footage real?

Today, one way to help authenticate footage is to identify the camera that the image was taken with. Forgeries often require splicing together content from two different cameras. But, unfortunately, the most common way to do this now is using image metadata, which can be easily falsified itself.

This problem is actively studied by several researchers around the world. Many machine learning solutions have been proposed in the past: least-squares estimates of a camera’s color demosaicing filters as classification features, co-occurrences of pixel value prediction errors as features that are passed to sophisticated ensemble classifiers, and using CNNs to learn camera model identification features. However, this is a problem yet to be sufficiently solved.

For this competition, the IEEE Signal Processing Society is challenging you to build an algorithm that identifies which camera model captured an image by using traces intrinsically left in the image. Helping to solve this problem would have a big impact on the verification of evidence used in criminal and civil trials and even news reporting.”

As soon as I read the competition’s overview I thought it would be the perfect playground to dive into the Keras Deep Learning library. Less than a month earlier I had given PyTorch a shot in the context of a very interesting (and complex!) iceberg vs ship image classification challenge. The library delivered the promised results. It was robust, easy to use and, most importantly, building on top of NumPy, it made any operation on a model’s inputs very handy. PyTorch was worth trying; nevertheless, Keras remains one of the most widespread Deep Learning frameworks around, and for a reason: it is incredibly easy to build networks with it. The API is high level enough to let the user focus on the model only, almost forgetting the network’s low-level details. I had quickly played with it in the past but I felt I needed to dig somewhat deeper into it to fully appreciate all its capabilities. The IEEE competition was just perfect to challenge myself on CNNs.

You can find the full code, with all my experiments, here on NbViewer. I won’t get into the fine details of what I implemented and how; the linked notebook is not hard to follow if one needs to dive into it. I will keep this post short, focusing on the high level strategy I went for, some key learnings and the results.

The first point I am relieved to have figured out is how to perform K-Fold Cross Validation for image classification. To be honest, I did not actually figure anything out myself: I just found the solution I had long been looking for, here on StackOverflow. CV is very simple to perform in a “standard” ML framework, i.e. in a context where you have a clean labeled dataset you can freely shuffle as many times as needed. In an image classification challenge, though, this is not always the case. In general, the best way to structure your data here is to have two separate folders, one for training and one for validation, each split into sub-folders, one per class. The reason behind this is that all relevant Deep Learning frameworks allow users to perform on-the-fly pre-processing on images before sending them to the GPU for model training. This step generally requires the pictures to be stored as JPG/PNG files somewhere, already split by class and by train/validation. This is great. The major drawback is, of course, that the structure is rather static, hence dynamically iterating over multiple training sets, as CV requires, is not trivial to implement.

A possible solution is to keep the folder structure as it is, changing its contents during every iteration of the cross validation process. Say we perform 5-fold CV. We split the dataset into 5 non-overlapping chunks. Then we stack 4 of them to create the training set and we save the related images into the respective folder tree (i.e. training/class1, training/class2, …, training/classN). We do the same with the 5th fold, which gets saved into the validation directory tree. We repeat this process 5 times, overwriting the contents of the relevant folders each time. This ensures our data generators pick different images from the same folders during every iteration. Of course, this approach involves deleting/moving/saving images continuously, hence it is not really tractable for very big datasets. In my case, though, with only 2750 pictures it was ok to go for it (note: I eventually ended up not implementing a CV approach, as I preferred devoting time to faster experimenting instead, sacrificing a more robust assessment of model performance).

The code to perform CV would look something like this:
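Below is a minimal sketch of the idea, assuming the labeled images live under data/all/&lt;class_name&gt;/ and that the train/validation trees are rebuilt at every fold (paths, helper constants and the 5-fold setup are illustrative, not the exact code from the notebook):

```python
import os
import shutil
from glob import glob

import numpy as np
from sklearn.model_selection import KFold

ALL_DIR = 'data/all'          # one sub-folder per class, holding every labeled image
TRAIN_DIR = 'data/train'      # rebuilt at every CV iteration
VALID_DIR = 'data/validation'

classes = sorted(os.listdir(ALL_DIR))

# collect (filepath, class) pairs once
samples = np.array([(path, c)
                    for c in classes
                    for path in glob(os.path.join(ALL_DIR, c, '*'))])

kf = KFold(n_splits=5, shuffle=True, random_state=42)

for fold, (train_idx, valid_idx) in enumerate(kf.split(samples)):
    # wipe and recreate the train/validation folder trees for this fold
    for root in (TRAIN_DIR, VALID_DIR):
        shutil.rmtree(root, ignore_errors=True)
        for c in classes:
            os.makedirs(os.path.join(root, c))

    # copy each image into the folder tree it belongs to for this fold
    for idx, root in ((train_idx, TRAIN_DIR), (valid_idx, VALID_DIR)):
        for path, c in samples[idx]:
            shutil.copy(path, os.path.join(root, c))

    # here a fresh model would be trained with generators pointing at
    # TRAIN_DIR and VALID_DIR (e.g. ImageDataGenerator().flow_from_directory)
    print('Fold %d ready: %d training / %d validation images'
          % (fold, len(train_idx), len(valid_idx)))
```

Each fold then simply re-points the usual flow_from_directory generators at data/train and data/validation before training a fresh model.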

What I tried

The main problem with this competition is the size of the dataset. 2750 images for 10 classes is not a lot, hence the biggest challenge consists in fighting overfitting. Almost any CNN would perfectly learn how to model the training set, but it would certainly fail to generalize to unseen samples. This is actually what happened when I tried to train a convolutional network from scratch. My training accuracy was close to 99% whereas on the validation set I was around 50%. This is of course not acceptable.

There are several ways to handle this situation, among which the most widespread are fine-tuning a pre-trained NN, playing around with data augmentation and adding dropout. I tried all of them and eventually I managed to push a ResNet50 to ~90% accuracy on the validation set.
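Data augmentation in Keras, for instance, is handled by ImageDataGenerator. Below is a small illustrative example of an augmented training generator (the specific transformations and parameters are my own choices here, not necessarily the ones used in the notebook):

```python
from keras.preprocessing.image import ImageDataGenerator

# random transformations applied on the fly to every training batch
train_datagen = ImageDataGenerator(
    rescale=1. / 255,
    rotation_range=15,
    width_shift_range=0.1,
    height_shift_range=0.1,
    zoom_range=0.1,
    horizontal_flip=True)

train_generator = train_datagen.flow_from_directory(
    'data/train',              # one sub-folder per camera model
    target_size=(224, 224),
    batch_size=32,
    class_mode='categorical')
```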

Before providing more details on that, it is worth spending a few lines on the very first approach I went for.

One of the coolest aspects of pre-trained nets is that they can be used as static feature extractors. Here is what I mean: every deep network is structured as multiple conv layers stacked on top of each other. The purpose of such architectures is to learn as many features as possible from the input images (the deeper the net, the more details get captured). These features are then flattened into a one dimensional array and passed to a couple of dense layers up until the softmax output. The interesting thing is that if we remove the dense layers and run the pre-trained nets on top of new images, we just get the features learned by the convolutions. So basically we could think of extracting these features and then using them as inputs to an independent classifier. This is exactly what I did. Let’s see how. I had 2062 training images (688 left for validation). I then loaded 3 pre-trained CNNs, popped their top layers and fed them my images, getting the following:

  • VGG16: numpy array of size (2062, 7, 7, 512) which reshaped returns (2062, 25088)
  • ResNet50: numpy array of size (2062, 1, 1, 2048) which reshaped returns (2062, 2048)
  • Inception V3: numpy array of size (2062, 8, 8, 2048) which reshaped returns (2062, 131072)

h-stacking the 3 together generates a dataset of shape (2062, 158208). The next step is to perform dimensionality reduction on this matrix. I ran a PCA, shrinking the feature space to 1000 (generally the first dense layer after the last convolution maps to around 1K nodes as well), and then applied a Random Forest Classifier on top of it. Results were very poor, with a maximum of 30% accuracy on the validation set (after optimizing hyper-parameters). This is disappointing but rather expected. The main flaw with this approach is the assumption that the features learned by the pre-trained nets on ImageNet are meaningful to our dataset as well. This is the case only when the images in question resemble the ImageNet ones, or better, when our model tries to accomplish a task very similar to the pre-trained one. Evidently, our case does not fall into this bucket. ImageNet networks were trained to recognize cats, dogs, birds etc. They were not tuned to distinguish between an iPhone and a Samsung camera. Cats vs dogs and iPhone vs Samsung are very different models and we should expect the features needed to achieve these two goals to be very different as well. It would be the same as teaching a child the difference between a tree and a fish and then showing him a picture of a shark and asking whether it was taken by a Sony or an LG device. I would not be surprised if the kid looked a bit confused! This is essentially what is happening with our Random Forest Classifier.
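For reference, here is a rough sketch of this feature-extraction pipeline with a single pre-trained network (ResNet50); X_train and y_train are assumed to be the already loaded and resized images and their labels, and all names and parameters are illustrative:

```python
from keras.applications.resnet50 import ResNet50, preprocess_input
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

# X_train: (n_images, 224, 224, 3) float array, y_train: (n_images,) labels
# (assumed to have been loaded and resized beforehand)

# include_top=False drops the dense layers, leaving only the convolutional base
base = ResNet50(weights='imagenet', include_top=False, input_shape=(224, 224, 3))

# run the frozen network on the images and flatten the resulting feature maps
features = base.predict(preprocess_input(X_train.astype('float32')))
features = features.reshape(features.shape[0], -1)

# shrink the feature space before feeding a classical classifier
pca = PCA(n_components=1000)
features_reduced = pca.fit_transform(features)

clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
clf.fit(features_reduced, y_train)
```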

A better option consists in fine-tuning a pre-trained network (inspiration taken from this brilliant post). The idea is to load a model which has been trained on ImageNet (a ResNet50 architecture for example) and then to shift its knowledge from cats and dogs to iPhones and Nexus. The process of re-learning needs to be performed in baby steps, as if we were trying to teach the above kid a completely new task. He could easily get lost, so we need to wisely leverage the fact that he already knows a bunch of rules about images, without asking too much of him. His knowledge is just too specific, so it is a matter of taking a couple of steps back and walking him through a new route. Not as if he were completely ignorant; just a new, smaller branch of his vision capabilities. The way to accomplish that with a deep net is to start from the output and walk backwards, unfreezing a couple of layers and re-training them. The model does not re-learn the whole thing: we are just forcing the last N layers to adjust their weights for the new task at hand. The high level concept of “picture”, with all its core features, is safely stored in the very first convolutions, which we don’t touch at all. If we are still not satisfied even after unfreezing a couple of the last layers, it means the features we are trying to tune are still too dependent on the original dataset the network saw in the first place (i.e. ImageNet). We then need to unroll a couple of additional blocks and re-train. SGD with a very small learning rate and a few epochs should do the job, at least at the beginning of the exploration.
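As a rough illustration of what this looks like in Keras, here is a minimal fine-tuning sketch; the new classification head, the number of unfrozen layers and the hyper-parameters are illustrative choices, not necessarily the exact ones from the notebook:

```python
from keras.applications.resnet50 import ResNet50
from keras.models import Model
from keras.layers import Dense, Dropout, GlobalAveragePooling2D
from keras.optimizers import SGD

base = ResNet50(weights='imagenet', include_top=False, input_shape=(224, 224, 3))

# new classification head for the 10 camera models
x = GlobalAveragePooling2D()(base.output)
x = Dropout(0.5)(x)
predictions = Dense(10, activation='softmax')(x)
model = Model(inputs=base.input, outputs=predictions)

# freeze everything except the last few layers of the convolutional base
for layer in base.layers[:-10]:
    layer.trainable = False
for layer in base.layers[-10:]:
    layer.trainable = True

# a very small learning rate so the unfrozen weights only get nudged
model.compile(optimizer=SGD(lr=1e-4, momentum=0.9),
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# model.fit_generator(train_generator, validation_data=valid_generator, epochs=5)
```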

This is exactly what I did, and it turns out that it worked decently with a ResNet50 (VGG16, Xception and Inception V3 did not perform well).

Evidently, I have not explored all the possible approaches to this competition. The idea was to get familiar with Keras, which I loved and am sure I will use again in the future. The next challenge is to dive into TensorFlow. It is without doubt the Deep Learning framework most companies and researchers use, and I absolutely have to check it out. Keep you posted on my progress!


by Francesco Pochetti

Recommendation Engines

All posts in the series:

  1. Linear Regression
  2. Logistic Regression
  3. Neural Networks
  4. The Bias v.s. Variance Tradeoff
  5. Support Vector Machines
  6. K-means Clustering
  7. Dimensionality Reduction and Recommender Systems
  8. Principal Component Analysis
  9. Recommendation Engines

Here is my pythonic playground about Recommendation Engines.
The code below was originally written in Matlab for the programming assignments of Andrew Ng’s Machine Learning course on Coursera.
I had some fun translating everything into Python!
Find the full code here on Github and the nbviewer version here.

by Francesco Pochetti

K-means Clustering

All posts in the series:

  1. Linear Regression
  2. Logistic Regression
  3. Neural Networks
  4. The Bias v.s. Variance Tradeoff
  5. Support Vector Machines
  6. K-means Clustering
  7. Dimensionality Reduction and Recommender Systems
  8. Principal Component Analysis
  9. Recommendation Engines

Here is my pythonic playground about K-means Clustering.
The code below was originally written in Matlab for the programming assignments of Andrew Ng’s Machine Learning course on Coursera.
I had some fun translating everything into Python!
Find the full code here on Github and the nbviewer version here.

by Francesco Pochetti

Support Vector Machines

All posts in the series:

  1. Linear Regression
  2. Logistic Regression
  3. Neural Networks
  4. The Bias v.s. Variance Tradeoff
  5. Support Vector Machines
  6. K-means Clustering
  7. Dimensionality Reduction and Recommender Systems
  8. Principal Component Analysis
  9. Recommendation Engines

Here is my pythonic playground about Support Vector Machines.
The code below was originally written in Matlab for the programming assignments of Andrew Ng’s Machine Learning course on Coursera.
I had some fun translating everything into Python!
Find the full code here on Github and the nbviewer version here.

by Francesco Pochetti

The Bias v.s. Variance Tradeoff

All posts in the series:

  1. Linear Regression
  2. Logistic Regression
  3. Neural Networks
  4. The Bias v.s. Variance Tradeoff
  5. Support Vector Machines
  6. K-means Clustering
  7. Dimensionality Reduction and Recommender Systems
  8. Principal Component Analysis
  9. Recommendation Engines

Here is my pythonic playground about Bias vs. Variance in Machine Learning.
The code below was originally written in Matlab for the programming assignments of Andrew Ng’s Machine Learning course on Coursera.
I had some fun translating everything into Python!
Find the full code here on Github and the nbviewer version here.

by Francesco Pochetti

Pythonic Logistic Regression

All posts in the series:

  1. Linear Regression
  2. Logistic Regression
  3. Neural Networks
  4. The Bias v.s. Variance Tradeoff
  5. Support Vector Machines
  6. K-means Clustering
  7. Dimensionality Reduction and Recommender Systems
  8. Principal Component Analysis
  9. Recommendation Engines

Here is my implementation of Logistic Regression in NumPy.
The code below was originally written in Matlab for the programming assignments of Andrew Ng’s Machine Learning course on Coursera.
I had some fun translating everything into Python!
Find the full code here on Github and the nbviewer version here.

by Francesco Pochetti

Pythonic Linear Regression

All posts in the series:

  1. Linear Regression
  2. Logistic Regression
  3. Neural Networks
  4. The Bias v.s. Variance Tradeoff
  5. Support Vector Machines
  6. K-means Clustering
  7. Dimensionality Reduction and Recommender Systems
  8. Principal Component Analysis
  9. Recommendation Engines

Here is my implementation of Linear Regression in NumPy.
The code below was originally written in Matlab for the programming assignments of Andrew Ng’s Machine Learning course on Coursera.
I had some fun translating everything into Python!
Find the full code here on Github and the nbviewer version here.

by Francesco Pochetti

Predict physical and chemical properties of soil using spectral measurements

Check out on NBViewer the work I’ve done with Pandas, Scikit-Learn and Matplotlib, wrapped up in an IPython notebook, about predicting physical and chemical properties of African soil from spectral measurements for a Kaggle competition.

The code and the files are also available on Github.

Here is the challenge: “Advances in rapid, low cost analysis of soil samples using infrared spectroscopy, georeferencing of soil samples, and greater availability of earth remote sensing data provide new opportunities for predicting soil functional properties at unsampled locations. Soil functional properties are those properties related to a soil’s capacity to support essential ecosystem services such as primary productivity, nutrient and water retention, and resistance to soil erosion. Digital mapping of soil functional properties, especially in data sparse regions such as Africa, is important for planning sustainable agricultural intensification and natural resources management.

Diffuse reflectance infrared spectroscopy has shown potential in numerous studies to provide a highly repeatable, rapid and low cost measurement of many soil functional properties. The amount of light absorbed by a soil sample is measured, with minimal sample preparation, at hundreds of specific wavebands across a range of wavelengths to provide an infrared spectrum (Fig. 1). The measurement can be typically performed in about 30 seconds, in contrast to conventional reference tests, which are slow and expensive and use chemicals.

Conventional reference soil tests are calibrated to the infrared spectra on a subset of samples selected to span the diversity in soils in a given target geographical area. The calibration models are then used to predict the soil test values for the whole sample set. The predicted soil test values from georeferenced soil samples can in turn be calibrated to remote sensing covariates, which are recorded for every pixel at a fixed spatial resolution in an area, and the calibration model is then used to predict the soil test values for each pixel. The result is a digital map of the soil properties.

This competition asks you to predict 5 target soil functional properties from diffuse reflectance infrared spectroscopy measurements.”

by Francesco Pochetti

PiPad – How to build a tablet with a Raspberry Pi

The Project

When I first came across the Raspberry Pi on the web I immediately started thinking about a cool application for this amazing mini computer. There are actually a ton of very interesting projects it is possible to dive into with the Pi, ranging from a relatively simple web server to pretty complicated home automation stuff. The one which ultimately caught my full attention was without doubt the PiPad, whose name and idea I am borrowing from Michael K Castor. So, first of all, thank you very much Michael for pioneering this application and for sharing your fantastic experience on your blog! I took inspiration from his work, personalizing the pipeline and components. Adding the Pi Camera Module is a good example (thanks Amandine Esser for pushing me to always raise the bar!).

I also owe a huge thanks to Pierre Esser, who helped me out with the electronic part of the project. My electronics skills are unfortunately very limited (work in progress on that) and his help was absolutely fundamental to put together an ON/OFF button which could power the board and the screen at the same time.

Before diving into the technical details I thought it would be worth sharing what the tablet looks like right now, just a couple of days after I finished putting everything together. Here is a demo video. It seems to work pretty well, actually!

I think we are done for the intro, so let’s get started.

The basic idea is to do the following:

  1. Get a Raspberry Pi
  2. Get a touchscreen
  3. Power both board and screen with an external battery
  4. Connect to the Pi all the necessary devices (WiFi dongle, bluetooth, camera, audio output etc)
  5. Build a wooden enclosure with enough room for everything and with an easy-to-open (book-like) structure, so that any broken/malfunctioning pieces can be replaced in the future.

The plan sounds a little bit oversimplified as it is stated above but those are the main points.

What I needed

How I did it

I started with the electronics. I plugged the mouse, WiFi dongle, keyboard, Camera Module and SD Card (with NOOBS – I planned to install Raspbian) into the Raspberry Pi. Then I focused on the screen. Here the first issues started to arise. As soon as I began checking the cables I realized I had made quite a big mistake. As explained here on the Chalk-Elec website, the screen can be powered either by an external power supply (5V/2A) or by USB. The second option was the one I was looking for, as I planned to plug the screen directly into the Pi and get power from there. However, by default the LCD can only be run via an external power supply, which is not exactly what I had in mind. You don’t really want a tablet which needs to be constantly attached to a plug. Not very portable, I would say. For USB power to work some soldering is needed, as a 0R resistor has to be detached from a specific position and moved to another dedicated place on the board. Not too complicated, but still a bit risky, as I was not even sure I had enough voltage to power the LCD via the Pi. I was instead pretty sure that 5V was definitely enough to power the Raspberry or the screen alone. Hence I went for powering the two in parallel, also in light of the future need for an ON/OFF button. The idea was the following:

  1. Cut a USB cable and solder the red wire (one of the 2 bringing power) to one of the external ON/OFF switch connectors. This USB cable links the battery to the switch, carrying 5V.
  2. Cut a second piece of wire and solder it to the central connector to get the power out of the switch. This piece of wire will work as a bridge between the ON/OFF button and the screen/Pi.
  3. Cut the external power cable provided with the LCD. The pin side of the cable will be plugged into the screen board, while the other end will be soldered to the wire at point 2, ensuring 5V to the LCD.
  4. Cut an external power cable used to recharge mobile phones. The mini-USB side will be plugged into the Raspberry while the other end will be soldered to the wire at point 2, ensuring 5V to the main board.

I followed “my instructions” and, sure enough, I had a fully functioning ON/OFF switch for my tablet. I connected everything as needed, flipped the switch, and both the Pi and the screen powered up.

I won’t spend too many words on the software side of the project. The first experience with Raspbian was pretty smooth. I needed to tweak the system a little to get the WiFi dongle working and the screen running at full size. I also had to calibrate the LCD to adjust it to my touch. Nothing impossible, actually. Everything was pretty straightforward, and without too much effort I had a fully working touchscreen.

After making sure all the electronics were in place I started working on the enclosure. I wish I had had a CNC machine to cut the plywood cleanly; unfortunately this was not the case. I knew I had to sacrifice precision, but it was an acceptable trade-off in the absence of more accurate machines. Hence I began with the frame, following the strategy below:

  1. I cut 8 pieces of wood from a long regular plywood stick. Those would make the external part of my case. I glued them together 4 pieces at a time to obtain 2 separate frames.
  2. I connected the 2 frames with the hinges, making sure the folding was smooth enough to ensure a comfortable closing/opening of the tablet.
  3. The time for a first check had arrived. I needed to put all the electronics in place to achieve two main goals: optimize as much as possible the limited room available under the screen, and decide where to carve the frame to expose the key components (SD Card, USB, battery recharge, audio). As soon as I figured out the exact position of all my pieces, I decided where to cut the enclosure and went on with all the carving. Specifically, I created holes for the battery charger (bottom frame), ON/OFF switch (both frames), audio jack (bottom frame), USB exit (bottom frame), SD Card (bottom frame) and neodymium magnets (both frames). For all this woodwork I used nothing more than a drill, an X-Acto knife and a saw, cleaning everything up with a rasp and sandpaper.
  4. I had my frame almost ready. Now, assuming the top of it would be covered by the screen, I still needed a back. So I took another piece of thin plywood and cut a rectangular slice just for this purpose. I glued it to the bottom frame and, after making sure everything had dried correctly, I drilled the last 5 holes: 1 for the Pi Camera Module (which would then be used as a back camera), the other 4 for the status lights of the external battery (to be able to check if and when to recharge it).
  5. Time for some varnishing. The enclosure was ready hence I moved to the next step which was to varnish the whole thing (a couple of layers were enough).
  6. Then I proceeded with putting all the electronics in place. I screwed the Raspberry and the Camera into position. I connected all the cables to the board and glued the relevant pieces into their respective holes (magnets included). I laid the battery down and fixed it with extra-strong double-sided tape strips. The same strips were quite useful to fix the screen to the top frame and finally close the enclosure.
  7. And now the moment of truth. I switched it on and… the screen lit up and the PiPad booted! Fantastic! It was (and it still is) working!

Next Steps

I am pretty happy with the result, I must admit. For the moment I don’t have any specific plans to upgrade the tablet with new hardware. I still need to focus on the software side of the project and solve a couple of annoying issues: first of all the sound, which is not working, and then wrapping the Camera Module into a more user-friendly interface, rather than having to go to the command line every time. The touchscreen works smoothly (at least with Raspbian) and the virtual keyboard I have installed is not too bad either.

I also have to mention that a couple of months ago the official 7″ Raspberry Pi Touchscreen was released. It will definitely be a game changer and will probably make DIY tablet solutions like mine obsolete pretty soon. This is of course very cool, as the community is always working hard to continuously raise the bar.

There is still work to be done but the first results are pretty awesome! Keep you posted then!

by Francesco Pochetti

Stock Trading Algorithm on top of Market Event Study

This post is the result of the first six weeks of class from the Computational Investing course I’m currently following on Coursera. The course is an introduction to Portfolio Management and Optimization in Python and lays the foundations for the second part of the track, which will deal with Machine Learning for Trading. Let’s move quickly to the core of the business.

The question I want to answer is the following:

  • Is it possible to exploit event studies in a trading strategy?

First of all we should clarify what an event study is. As Wikipedia states, an event study is a statistical method to assess the impact of an event on the value of a firm. This definition is very broad and can easily incorporate facts directly concerning the company (i.e. the private life of the CEO, mergers with other firms, confidential news from insiders) or anomalous fluctuations in the price of the stock. I naively (and maybe incorrectly) categorized events regarding a company into these two types, news related and market related, but there should be no difference as they are generally tightly correlated. In any case, as it is not easy to access and parse news feeds in real time, we will focus on market related events, meaning that in the rest of the post an event is to be understood as an anomalous behavior in the price of a stock whose consequences we could exploit to trade in a more efficient way.

Now that we have properly defined an event we can go back to the beginning and think a little bit more about what studying an event really means. To understand it, let’s walk through a complete example and suppose that we have an event whenever the closing price of a stock at the end of day i is less than $10 while at the end of day i-1 it was more than $10. Thus we are examining a significant drop in the price of the stock. Given this definition the question is: what statistically happens to the prices of stocks experiencing this kind of fluctuation? Is there a trend that could somehow be exploited? The reason behind these questions is that if we knew in advance that a stock followed a specific pattern as a consequence of some event, we could adjust our trading strategy accordingly. If statistics suggests that the price is bound to increase, maybe it is a good idea to go long on the shares, whereas in the opposite case the best decision is to short.
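As a minimal pandas sketch of this event definition (this is an illustration of the logic, not the QSTK implementation used below; close is assumed to be a DataFrame of adjusted closing prices with dates as the index and tickers as the columns):

```python
import pandas as pd

def find_price_drop_events(close, threshold=10.0):
    """Flag days on which a stock closes below `threshold` after having
    closed at or above it on the previous trading day."""
    prev_close = close.shift(1)
    return (close < threshold) & (prev_close >= threshold)  # boolean DataFrame
```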

In order to run an event study we take advantage of the EventProfiler class inside the QSTK library. This class allows us to define an event and then, given a time interval and a list of stocks, it works in the following way: it scrolls firm after firm and whenever it finds an event it sets that day as day 0. Then it goes 20 days ahead and 20 days before the event and saves the timeframe. After having analyzed all the stocks it aligns the events on day 0, averages all the prices before and after and scales the result by the market (SPY). The output is a chart which basically answers this question: what happens on average when the closing price of a stock at the end of day i is less than $10 while at the end of day i-1 it was more than $10? The test period was the one between 1 January 2008 and 31 December 2009 (in the middle of the financial crisis), while the stocks chosen were the 500 contained in the S&P index in 2012. The graph is shown below and the following information can be extracted: first, 461 such events were registered during the investigated time frame. Second, on the day of the event there is a drop of about 10% in the stock price with respect to the day before. Third, the price seems to recover after day zero, even though the confidence intervals of the daily increase are huge.

[Event study chart: S&P 500 (2012 composition) stocks, 2008-2009, $10 threshold]

Now the idea is the following. If the observed behavior holds, what we can do is build a trading strategy consisting of buying on the day of the event and selling, let’s say, after 5 days (we don’t want to hold for too long, despite the price increasing almost monotonically). Just to recap, here is the whole pipeline from event definition to portfolio assessment.

[Diagram: the full trading pipeline, from event definition to portfolio assessment]

Now that we have a plan let’s dive into the code (you can find all the code on Github).

First of all I’ll introduce the two main functions, one after the other.

find_events(ls_symbols, d_data, shares=100): given the list of the stocks in the portfolio, their historical prices and the number of shares to be traded, this function identifies events and issues a Buy Order on the day of the event and a Sell Order after 5 trading days. Eventually it returns a csv file to be passed to the market simulator. The first lines of the csv file are previewed below (year, month, day, stock, order, shares).

[Preview of the first lines of the orders csv]
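For illustration, a simplified version of this order-generation logic (not the exact course code) could look like the sketch below, assuming events is the boolean DataFrame produced by the earlier event-detection snippet:

```python
import csv

def write_orders(events, out_file='orders.csv', shares=100, hold_days=5):
    """Write a Buy order on each event day and a Sell order `hold_days`
    trading days later (or on the last available day)."""
    dates = events.index
    with open(out_file, 'w', newline='') as f:
        writer = csv.writer(f)
        for sym in events.columns:
            for i, day in enumerate(dates):
                if events.loc[day, sym]:
                    sell_day = dates[min(i + hold_days, len(dates) - 1)]
                    writer.writerow([day.year, day.month, day.day,
                                     sym, 'Buy', shares])
                    writer.writerow([sell_day.year, sell_day.month, sell_day.day,
                                     sym, 'Sell', shares])
```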

marketsim(investment, orders_file, out_file): given the initial investment in dollars ($50,000 in our case), the csv file containing all the orders (the output of find_events()) and the file to save the results of the simulation to, this function places the orders in chronological order and automatically updates the value of the portfolio. It returns a csv file with the portfolio value over time, produces a plot comparing the portfolio performance against the market benchmark, and prints to screen a summary of the main financial metrics used to evaluate the portfolio.
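Under the hood, the core of such a simulator is just a chronological walk over the orders, updating cash and holdings day by day and marking the portfolio to market. A stripped-down sketch (ignoring plotting and the financial metrics, with illustrative data structures) might look like this:

```python
import pandas as pd

def simulate(investment, orders, prices):
    """Return the daily portfolio value, assuming `prices` is a DataFrame of
    closing prices (dates x tickers) and `orders` a DataFrame with columns
    ['date', 'symbol', 'order', 'shares'] sorted chronologically."""
    cash = float(investment)
    holdings = {sym: 0 for sym in prices.columns}
    values = pd.Series(index=prices.index, dtype=float)

    for day in prices.index:
        # execute all orders placed on this day at the closing price
        todays = orders[orders['date'] == day]
        for _, o in todays.iterrows():
            price = prices.loc[day, o['symbol']]
            sign = 1 if o['order'].lower() == 'buy' else -1
            holdings[o['symbol']] += sign * o['shares']
            cash -= sign * o['shares'] * price

        # mark the portfolio to market
        values[day] = cash + sum(holdings[s] * prices.loc[day, s]
                                 for s in prices.columns)
    return values
```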

main(): this function calls the previous two after getting and cleaning all the relevant data.

This is the output, as promised:

[Chart: portfolio value over time compared with the market benchmark]

Well, despite the huge crisis (-19% market return), our trading strategy brought us a remarkable +19% gain! This was just an example, but in any case a powerful demonstration of the possibilities of event studies in finance.

by Francesco Pochetti