Given a picture, would you be able to identify which camera took it?

[Link to Jupyter Notebook]

“Finding footage of a crime caught on tape is an investigator’s dream. But even with crystal clear, damning evidence, one critical question always remains: is the footage real?

Today, one way to help authenticate footage is to identify the camera that the image was taken with. Forgeries often require splicing together content from two different cameras. But, unfortunately, the most common way to do this now is using image metadata, which can be easily falsified itself.

This problem is actively studied by several researchers around the world. Many machine learning solutions have been proposed in the past: least-squares estimates of a camera’s color demosaicing filters as classification features, co-occurrences of pixel value prediction errors as features that are passed to sophisticated ensemble classifiers, and using CNNs to learn camera model identification features. However, this is a problem yet to be sufficiently solved.

For this competition, the IEEE Signal Processing Society is challenging you to build an algorithm that identifies which camera model captured an image by using traces intrinsically left in the image. Helping to solve this problem would have a big impact on the verification of evidence used in criminal and civil trials and even news reporting.”

As soon as I read the competition’s overview I thought it would be the perfect playground to dive into the Keras Deep Learning library. Less than a month earlier I had given PyTorch a shot in the context of a very interesting (and complex!) iceberg VS ship image classification challenge. The library delivered the promised results. It was robust, easy to use and, most importantly, its NumPy-like tensors made any operation on the models’ inputs very handy. PyTorch was a must-try; nevertheless, Keras remains one of the most widely used Deep Learning frameworks around, and for a reason: it is incredibly easy to build networks with it. The API is high level enough to let the user focus on the model only, almost forgetting the network’s low-level details. I had quickly played with it in the past, but I felt I needed to dig deeper to fully appreciate all its capabilities. The IEEE competition was just perfect to challenge myself on CNNs.

You can find the full code, with all my experiments, here on NbViewer. I won’t get into the fine details of what I implemented and how. The linked notebook is not hard to follow if one needs to dive into it. I will keep this post short, focusing on the high level strategy I went for, some key learnings and the results.

The first point I am relieved to have figured out is how to perform K-Fold Cross Validation for Image Classification. To be honest, I did not actually figure anything out myself. I just found the solution I had LONG been looking for, here on StackOverflow. CV is very simple to perform in a “standard” ML framework, i.e. in a context where you have a clean labeled dataset you can freely shuffle as many times as needed. In an image classification challenge, though, this is not always the case. In general, the best way to structure your data here is to have two separate folders, one for training and one for validation, each split into sub-folders, one per class. The reason behind this is that all relevant Deep Learning frameworks allow users to perform on-the-fly pre-processing on images before sending them to the GPU for model training. This step generally requires the pictures to be stored as JPG/PNG files somewhere, already split by class and by train/validation. This is great. The major drawback is, of course, that the structure is rather static, hence dynamically iterating over multiple training sets, as CV requires, is not trivial to implement.

A possible solution is to keep the folder structure as it is, changing its contents during every iteration of the cross validation process. So, say we perform 5-fold CV. We split the dataset into 5 non-overlapping chunks. Then we stack 4 of them to create the training set and we save the related images into the respective folder tree (i.e. training/class1, training/class2, …, training/classN). We do the same with the 5th fold, which is saved into the validation directory tree. We repeat this process 5 times, holding out a different fold each time and overwriting the contents of the relevant folders. This ensures our data generators pick different images from the same folders during every iteration.

Of course, this approach involves deleting/moving/saving images continuously, hence it is not really tractable for very big datasets. In my case, though, with only 2750 pictures it was OK to go for it (Note: I eventually ended up not implementing a CV approach, as I preferred to devote my time to faster experimentation, sacrificing a more robust assessment of model performance).

The code to perform CV would look something like
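A minimal sketch of the folder-rotation idea described above, using only the standard library. The helper names (`make_folds`, `populate`, `cross_validate`), the `training`/`validation` directory names and the `train_and_score` callback are assumptions made for illustration, not the code from the notebook. The folds here are plain shuffled chunks rather than stratified by class.

```python
import os
import random
import shutil


def make_folds(image_paths, k=5, seed=0):
    """Shuffle the paths once and deal them into k non-overlapping folds."""
    paths = sorted(image_paths)
    random.Random(seed).shuffle(paths)
    return [paths[i::k] for i in range(k)]


def populate(root, paths):
    """Wipe `root`, then copy each image into root/<class>/.

    The class label is assumed to be the name of the image's parent folder.
    """
    shutil.rmtree(root, ignore_errors=True)
    os.makedirs(root, exist_ok=True)
    for src in paths:
        label = os.path.basename(os.path.dirname(src))
        dst_dir = os.path.join(root, label)
        os.makedirs(dst_dir, exist_ok=True)
        shutil.copy(src, dst_dir)


def cross_validate(image_paths, train_and_score, k=5,
                   train_dir="training", valid_dir="validation"):
    """Rebuild the two folder trees for every fold and collect the scores.

    `train_and_score(train_dir, valid_dir)` is a hypothetical callback that
    trains a model from the folders (e.g. via data generators) and returns
    a validation score.
    """
    folds = make_folds(image_paths, k)
    scores = []
    for i in range(k):
        # Stack every fold except the i-th one into the training set.
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        populate(train_dir, train)
        populate(valid_dir, folds[i])
        scores.append(train_and_score(train_dir, valid_dir))
    return scores
```

Copying files keeps the original dataset untouched, at the cost of extra disk writes on every fold, which is the reason this only scales to small datasets.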

What I tried

The main problem with this competition is the size of the dataset. 2750 images for 10 classes is not a lot, hence the biggest challenge consists in fighting overfitting. Almost any CNN would perfectly learn how to model the training set, but it would certainly fail to generalize to unseen samples. This is actually what happened when I tried to train a convolutional network from scratch. My training accuracy was close to 99% whereas on the validation set I was around 50%. This is of course not acceptable.

There are several ways to handle this situation, among which the most widespread are fine-tuning a pre-trained NN, playing around with data augmentation and adding dropout. I tried all of them and eventually I managed to push a ResNet50 to ~90% accuracy on the validation set.

Before providing more details on that, it is worth spending a few lines on the very first approach I went for.

One of the coolest aspects of pre-trained nets is that they can be used as static feature extractors. Here is what I mean: every deep network is structured as multiple conv layers stacked on top of each other. The purpose of such architectures is to learn as many features as possible from the input images (the deeper the net, the more details get captured). These features are then flattened into a one-dimensional array and passed to a couple of dense layers up until the softmax output. The interesting thing is that if we remove the dense layers and run the pre-trained nets on new images, we just get the features learned by the convolutions. So basically we could think of extracting these features and then using them as inputs to an independent classifier. This is exactly what I did. Let’s see how. I had 2062 training images (688 left for validation). I then loaded 3 pre-trained CNNs, popped their top layers and fed them my images, getting the following:

  • VGG16: numpy array of size (2062, 7, 7, 512) which reshaped returns (2062, 25088)
  • ResNet50: numpy array of size (2062, 1, 1, 2048) which reshaped returns (2062, 2048)
  • Inception V3: numpy array of size (2062, 8, 8, 2048) which reshaped returns (2062, 131072)
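For one of the three networks, the extraction step might look like the sketch below. It assumes the `tensorflow.keras` flavour of Keras; the `extract_features` helper and its `weights` argument are names introduced here for illustration. The output shape of a headless network can depend on the Keras version: in recent releases, as far as I know, ResNet50 without its top returns 7x7 feature maps unless `pooling="avg"` is passed, so the (2062, 1, 1, 2048) shape above may not reproduce as is.

```python
import numpy as np
from tensorflow.keras.applications import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input


def extract_features(images, weights="imagenet"):
    """Run RGB images (n, height, width, 3) through VGG16 without its dense head.

    Returns one flat feature vector per image.
    """
    base = VGG16(weights=weights, include_top=False,
                 input_shape=images.shape[1:])
    # astype() returns a copy, so the caller's array is left untouched.
    batch = preprocess_input(images.astype("float32"))
    maps = base.predict(batch, verbose=0)   # (n, 7, 7, 512) for 224x224 inputs
    return maps.reshape(len(maps), -1)      # (n, 25088)
```

Calling `extract_features(train_images)` on 224x224 pictures would give the (n, 25088) matrix listed for VGG16; the same pattern applies to the other two architectures with their own `preprocess_input` functions.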

h-stacking the 3 together generates a dataset of shape (2062, 158208). The next step is to perform a dimensionality reduction on the matrix. I ran a PCA, shrinking the feature space to 1000 dimensions (generally the first dense layer after the last convolution maps to around 1K nodes as well), and then applied a Random Forest Classifier on top of it. Results were very poor, with a max of 30% accuracy on the validation set (after optimizing hyper-parameters). This is disappointing but rather expected. The main flaw with this approach is the assumption that the features learned by the pre-trained nets on ImageNet are meaningful to our dataset as well. This is the case only when the images in question resemble the ImageNet ones or, better, when our model tries to accomplish a task very similar to the pre-trained one. Evidently, our case does not fall in this bucket. ImageNet networks were trained to recognize cats, dogs, birds etc. They were not tuned to distinguish between an iPhone’s and a Samsung’s camera. Cats VS dogs and iPhone VS Samsung are very different tasks, and we should expect the features needed to achieve these two goals to be very different as well. It would be the same as teaching a child the difference between a tree and a fish and then showing him a picture of a shark and asking whether it was taken by a Sony or an LG device. I would not be surprised if the kid looked a bit confused! This is actually what is happening with our Random Forest Classifier.
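The stacking, PCA and Random Forest steps can be sketched with NumPy and scikit-learn. The function names, the number of trees and the random seed are assumptions for illustration; the actual hyper-parameters used in the notebook are not reproduced here.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline


def stack_features(*feature_maps):
    """Flatten each (n, h, w, c) array to (n, h*w*c) and h-stack them."""
    flat = [f.reshape(f.shape[0], -1) for f in feature_maps]
    return np.hstack(flat)


def fit_pca_forest(X_train, y_train, n_components=1000, seed=0):
    """Fit PCA followed by a Random Forest on the stacked features."""
    # PCA cannot keep more components than min(n_samples, n_features).
    n_components = min(n_components, X_train.shape[0], X_train.shape[1])
    clf = make_pipeline(
        PCA(n_components=n_components, random_state=seed),
        RandomForestClassifier(n_estimators=200, random_state=seed),
    )
    return clf.fit(X_train, y_train)
```

With the three arrays from the list above, `stack_features(vgg, resnet, inception)` has 25088 + 2048 + 131072 = 158208 columns. Wrapping both steps in a pipeline means the PCA fitted on the training set is reused unchanged when calling `predict` or `score` on the validation features.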

A better option consists in fine-tuning a pre-trained network (inspiration taken from this brilliant post). The idea is to load a model which has been trained on ImageNet (a ResNet50 architecture for example) and then to shift its knowledge from cats and dogs to iPhones and Nexus phones. The process of re-learning needs to be performed in baby steps, as if we were trying to teach the above kid a completely new task. He could easily get lost, so we need to wisely leverage the fact that he already knows a bunch of rules about images, without asking too much from him. His knowledge is just too specific, so it is a matter of taking a couple of steps back and walking him through a new route. Not as if he was completely ignorant. Just a new smaller branch of his vision capabilities. The way of accomplishing that with a deep net is to start from the output and walk backwards, unfreezing a couple of layers and re-training only those. The model would not re-learn the whole thing. We are just forcing the last N layers to adjust their weights for the new task at hand. The high level concept of “picture”, with all its core features, is safely stored in the very first convolutions, which we don’t touch at all. If we are still not satisfied even after unfreezing a couple of the last layers, then it means the features we are trying to tune are still too dependent on the original dataset the network saw in the first place (i.e. ImageNet). We then need to unfreeze a couple of additional blocks and re-train. SGD with a very small learning rate and a few epochs should do the job, at least at the beginning of the exploration.
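A hedged sketch of this unfreeze-the-tail strategy, assuming the `tensorflow.keras` API. The `build_finetune_model` helper, the pooling + dropout + softmax head, the number of unfrozen layers and the learning rate are all illustrative assumptions, not the exact configuration from the notebook.

```python
from tensorflow.keras import layers, models, optimizers
from tensorflow.keras.applications import ResNet50


def build_finetune_model(n_classes=10, n_trainable=10, weights="imagenet",
                         input_shape=(224, 224, 3)):
    """ResNet50 base with only its last `n_trainable` layers unfrozen."""
    base = ResNet50(weights=weights, include_top=False,
                    input_shape=input_shape)

    # Walk the base from input to output: freeze everything before the
    # cutoff, leave the tail trainable so it can adapt to the new task.
    cutoff = len(base.layers) - n_trainable
    for index, layer in enumerate(base.layers):
        layer.trainable = index >= cutoff

    # New classification head replacing the ImageNet dense layers.
    x = layers.GlobalAveragePooling2D()(base.output)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(n_classes, activation="softmax")(x)

    model = models.Model(base.input, outputs)
    # SGD with a very small learning rate, so the pre-trained weights
    # are only nudged rather than overwritten.
    model.compile(optimizer=optimizers.SGD(learning_rate=1e-4, momentum=0.9),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

If validation accuracy stalls, calling the function again with a larger `n_trainable` corresponds to unfreezing additional blocks, as described above.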

This is exactly what I did, and it turns out that it worked decently with a ResNet50 (VGG16, Xception and Inception V3 did not perform well).

Evidently, I have not explored all the possible approaches for this competition. The idea was to get familiar with Keras, which I loved and which I am sure I will use again in the future. The next challenge is to dive into TensorFlow. This is without doubt the Deep Learning framework most companies and researchers use, and I absolutely have to check it out. I will keep you posted on my progress!

 

by Francesco Pochetti
