
Style Transfer with fast.ai and PyTorch

Reading Time: 8 minutes

Link to Jupyter notebook

In this post, I will go over a fascinating technique known as Style Transfer. By the end of this experiment, we’ll end up creating our own pieces of art, stealing the brush from the hands of Picasso, Monet, and Van Gogh and painting novel masterpieces ourselves!

As has been the case for my last few posts, the inspiration for this one also comes from fast.ai. In the 13th lesson of the DL Part 2 course, Jeremy Howard tackles the thrilling topic of modifying an image by applying a specific artistic style to it. In short, Style Transfer. The original paper was published in September 2015, and it is quite an enjoyable read which I highly recommend. I am often scared by academic papers, and I am the first in line to admit that they are not the most fun reads ever. This one, though, clearly stands out from the crowd for its simplicity and conciseness: lots of pictures, little math, and straight to the point.

Ok, let’s stop the chit-chat and dive into the material. There is a lot to discuss!

The idea was to start from Jeremy’s notebook and extend it: play around with the code, try several artistic styles, and check the effect of the various parameters on the end result. The outcome is this Jupyter notebook. The plan now is to go through it, exposing my findings together with the theory that goes along with them.

Art celebrities going wild with my portrait

Demystifying Style Transfer

If you think about it, the task of modifying a given image according to the style of another one is anything but obvious.

First, what exactly is the style of an image? We need something very specific, as it has to be converted into math, somehow.

Second, how do I actually modify an existing picture to the point where it reasonably resembles the original one, yet is different enough to have its own stylistic twist on top?

These are all good questions. Let’s start by breaking the task down in two. The first challenge consists in reconstructing the content of the original image, up to a certain extent. The second, more challenging, sub-task consists in extracting the concept of style from a piece of art and applying it to our original object. Let’s go.

Content Reconstruction

Here’s the context. You are given an Original Image (OI for brevity) and a Random Noise Image (RNI), and you are tasked with tweaking the pixel values of RNI (which are, by definition, random) until it resembles OI as much as possible. This is not that hard. At the end of the day, it boils down to setting up a loss function, defined as the MSE between RNI and OI, and minimizing it, tuning RNI at each iteration. This is relatively standard to achieve with a PyTorch optimizer, as in the sketch below. Is this what we want, though? Not really. What we want is for RNI to “look like” OI, not to be an exact copy of it. Copying OI into RNI would not work, as there would be no room left to add the Style part down the line.
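As a minimal sketch (the variable names and image size here are my own, not the notebook’s), the pixel-level version of the idea looks something like this in PyTorch:

```python
import torch
import torch.nn.functional as F

# oi: the Original Image as a (1, 3, H, W) tensor; a random placeholder stands in for it here
oi = torch.rand(1, 3, 288, 288)
# rni: the Random Noise Image whose pixel values we are going to tweak
rni = torch.randn_like(oi, requires_grad=True)

optimizer = torch.optim.LBFGS([rni])

def closure():
    optimizer.zero_grad()
    loss = F.mse_loss(rni, oi)   # plain pixel-wise MSE between RNI and OI
    loss.backward()
    return loss

for _ in range(50):
    optimizer.step(closure)

# rni converges towards an exact copy of oi, which is precisely why the next step
# swaps raw pixels for convolutional activations
```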

What we need is a way to create an abstraction of OI. By abstraction, I mean a summary of the content of OI which is close enough to the pixels to look like OI, and distant enough to capture just the key brushstrokes defining the image itself. Guess what: this is what a CNN is really good at. So, what we will do is stick OI through a neural network, then feed into the MSE loss not the raw pixels, but the activation values calculated by the forward pass at some layer of choice. Of course, RNI needs to undergo the same treatment. Convolutional activations are extracted at some point, the MSE loss is minimized, and the RNI pixel values are tweaked accordingly. This process is illustrated in the visualization below, where a VGG16 architecture is used.

Image 1
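In PyTorch, a minimal sketch of this idea could look like the following, using a pre-trained VGG16 from torchvision and a forward hook to grab the activations of one chosen layer (the layer index and image size are illustrative choices of mine, not necessarily the notebook’s):

```python
import torch
import torch.nn.functional as F
from torchvision import models

vgg = models.vgg16(pretrained=True).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

saved = {}
def save_activation(module, inp, out):
    saved['act'] = out

layer_idx = 20                                   # which layer to tap is a free choice
vgg[layer_idx].register_forward_hook(save_activation)

def activations(img):
    vgg(img)
    return saved['act']

oi = torch.rand(1, 3, 288, 288)                  # placeholder for the real Original Image
target_act = activations(oi).detach()            # content target, computed once

rni = torch.randn_like(oi, requires_grad=True)   # the image we optimize
optimizer = torch.optim.LBFGS([rni])

def closure():
    optimizer.zero_grad()
    loss = F.mse_loss(activations(rni), target_act)  # MSE on activations, not pixels
    loss.backward()
    return loss

for _ in range(100):
    optimizer.step(closure)
```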

Which network to use and which specific Conv Layer to pick?

Up to you, in all honesty. The important bit is to understand what is going on. As for myself, I have tried both VGG16 (proposed by Jeremy Howard originally) and DenseNet121. The results I obtained with the former are way more visually appealing than the ones from the latter, hence I will stick to VGG16 for the rest of the discussion, even though all findings are perfectly applicable to any architecture. When it comes to which specific Conv Layer to pick to calculate the activations, the decision is about how much of OI you want to keep. The first layers are very close to the raw pixels and carry information very similar to the original image. The deeper you go into the network, the higher the level of abstraction, so the last layers carry more condensed and complex information, as a result of the pitiless reshaping of pooling and of the multiple non-linearities they went through. The picture below shows the dramatic effect on content reconstruction of choosing increasingly deeper layers. The more you advance, the less you “see”. The image in the bottom right corner, instead, shows the result of aggregating all layers of the network. The data provided by all activations at the same time clearly translates into an almost perfect reconstruction of OI (similar to the one from Conv Layer 3).

Image 2

As for the content part, we should be all set. We know how to reconstruct an image starting from random noise. For the rest of the experiments, I decided to pick the level of abstraction obtained by using the third Convolutional Layer starting from the top left corner in the above picture (Conv Layer 20, to be clearer). Now let’s move to the Style part.
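As a side note, the layer numbering depends on how you enumerate the network. A quick way to inspect torchvision’s VGG16 and see which indices correspond to actual Conv2d layers (this enumeration may not match the “Conv Layer 20” naming used in the notebook) is:

```python
import torch
from torchvision import models

vgg = models.vgg16(pretrained=True).features
for i, layer in enumerate(vgg):
    if isinstance(layer, torch.nn.Conv2d):
        print(i, layer)
# VGG16 has 13 Conv2d layers, sitting at indices 0, 2, 5, 7, 10, 12, 14, 17, 19, 21, 24, 26, 28
```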

Style Reconstruction

This one is a little more challenging than the previous one, as it is not really clear what style even means. It is actually very understandable for us humans: we immediately recognize a Monet or a Van Gogh, and the sharp, geometric, semi-abstract strokes which made Picasso a master are easy to spot. How do we teach the concept of “sharp, geometric, semi-abstract strokes” to a machine, though? Guess what: Convolutional Neural Networks to the rescue again!

Not so fast. In this case, it won’t be a simple minimization of the Mean Squared Error between the activations of the Style Image (SI) and RNI. It can’t be, as the activations, even if they are abstractions of the picture, still don’t carry explicit features of the style (whatever that means in math language). Or better, they actually do. We just need to scratch the surface and dig a little to find them.

What Leon A. Gatys et al. proposed in the original “A Neural Algorithm of Artistic Style” paper is to build the Gram Matrix with the activations of each Conv Layer. What is a Gram Matrix, and why is it supposed to contain any relevant information about the style of a painting?

To better understand what is going on, here is a visualization of what happens behind the scenes.

Image 3 

The Gram Matrix of a set of Conv Layer activations is the product of the matrix having as many rows as filters (each row being a flattened filter) by its transpose. So, basically, let’s say we take the third Convolutional Layer, with 256 filters of size 60 x 60. We slice out the first filter and flatten it into a 3600-dimensional array. We do the same with the second filter, and calculate the dot product between the two. Then we pick the third and repeat, and so on and so forth, until we have exhausted all possible combinations of filters. Each time we compute a dot product, we store the output within a 256 x 256 matrix. The result is the Gram Matrix.
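In PyTorch this whole procedure collapses into a few lines, since flattening the filters and taking all pairwise dot products is just a matrix multiplied by its own transpose (the normalization at the end is a common, optional extra step):

```python
import torch

def gram_matrix(activations):
    # activations: (batch, channels, height, width), e.g. (1, 256, 60, 60) in the example above
    b, c, h, w = activations.shape
    flat = activations.view(b, c, h * w)            # each filter flattened into a 3600-dim row
    gram = torch.bmm(flat, flat.transpose(1, 2))    # all pairwise dot products -> (b, 256, 256)
    return gram / (c * h * w)                       # optional normalization
```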

Why would we want such a thing? A couple of reasons. The first is that we need to lose the spatial information about SI. Let’s assume we want to teach a machine how to paint as Van Gogh did. Like this.

Image 4: Van Gogh “The Starry Night”

We don’t want to copy the moon, or the tree, or any actual concrete element of this masterpiece. What we want is to extract the peculiar brush strokes which made Van Gogh unique. Therefore, spatial information has to get dropped, somehow. Flattening 2D filters and multiplying them is a good way of messing up the spatial ordering.

Another good reason behind the Gram Matrix is that it allows us to get an idea of which features appear as standalone elements, which appear together with other features, and which don’t appear at all. I guess having such an understanding of a painting would be a good starting point for defining a style. By features, I mean the aspects of an image which Conv filters are specifically trained to capture, like corners, or diagonals, or geometric shapes, or textures, or combinations of all of those. At the end of the day, we are passing SI into a pre-trained VGG model, which tells us stuff like “hey, the blue filters are activated in here. The kernels responsible for recognizing bold, curvy strokes are very active too. The neat, geometrically shaped filters are almost completely turned off, though”.

Calculating the dot product between all these filters is nothing more than correlating them. This is actually what we want, as it summarizes in a single number the exact concept of artistic style we need to extract. Look below (Image 5) at what happens when we apply this idea (also illustrated in Image 3) to Van Gogh’s “The Starry Night”. As usual, different layers provide different levels of style reconstruction. The last one, aggregating all of them, is just spot on. This is what we are looking for: a deep NN successfully managed to synthesize the bold, vibrant brush strokes expressing the Dutch artist’s style.

Image 5
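The style loss itself is then just the MSE between the Gram Matrices of SI and RNI, usually summed over several layers. Here is a hedged sketch of how that could be wired up, reusing the forward-hook trick and the gram_matrix helper from above (the list of style layers is an arbitrary choice of mine, not necessarily the notebook’s):

```python
import torch
import torch.nn.functional as F
from torchvision import models

def gram_matrix(acts):                              # same helper as above
    b, c, h, w = acts.shape
    flat = acts.view(b, c, h * w)
    return torch.bmm(flat, flat.transpose(1, 2)) / (c * h * w)

vgg = models.vgg16(pretrained=True).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

style_layers = [0, 5, 10, 17, 24]                   # roughly one conv per block, an arbitrary choice
saved = {}

def make_hook(idx):
    def hook(module, inp, out):
        saved[idx] = out
    return hook

for idx in style_layers:
    vgg[idx].register_forward_hook(make_hook(idx))

def style_grams(img):
    saved.clear()
    vgg(img)
    return [gram_matrix(saved[idx]) for idx in style_layers]

si = torch.rand(1, 3, 288, 288)                     # placeholder for the real Style Image
target_grams = [g.detach() for g in style_grams(si)]

def style_loss(img):
    return sum(F.mse_loss(g, t) for g, t in zip(style_grams(img), target_grams))
```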

Adding Style and Content

Now that we have content and style, we just need to put them together. The idea is relatively simple. For both tasks, we were minimizing a separate loss function: in the case of content, the MSE between the Conv Layer activations calculated for RNI and OI; in the case of style, the MSE between the Gram Matrices of RNI and SI. The next step is to minimize the sum of these two losses, as illustrated in Image 6. That’s it!

Image 6
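In code, this amounts to nothing more than adding the two losses inside the optimization closure. A sketch, assuming the content pieces (activations, target_act) and the style pieces (style_loss, target_grams) from the snippets above are in scope (in practice you would register both sets of hooks on a single VGG instance), and with a weighting of my own that anticipates the Style2Content ratio discussed next:

```python
import torch
import torch.nn.functional as F

style_weight = 0.5    # 0 -> pure content reconstruction, 1 -> pure style reconstruction

rni = torch.randn(1, 3, 288, 288, requires_grad=True)
optimizer = torch.optim.LBFGS([rni])

def closure():
    optimizer.zero_grad()
    content_loss = F.mse_loss(activations(rni), target_act)   # from the content snippet
    loss = (1 - style_weight) * content_loss + style_weight * style_loss(rni)
    loss.backward()
    return loss

for _ in range(300):
    optimizer.step(closure)
```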

Now, of course, instead of calculating the straight sum of content and style, we can play around with a parameter to weigh one more than the other. I actually gave it a go with Kandinsky, and the below series of pictures is the result. I ran the style transfer routine with 6 increasing values of the Style2Content ratio, from 0 (only content) to 1 (only style). The image in the top right corner is obtained without any style contribution, whereas the other extreme, in the bottom right corner, is generated without any content. The results in between are pretty much self-explanatory (and quite appealing too!)

Image 7

This is great! The results we can obtain using such a simple approach are outstanding. It was super fun to experiment with Style Transfer. Now onto the next challenge! Stay tuned for updates.

Follow me! If you like what you are reading, feel free to follow me on LinkedIn or Twitter!
