
Is it a thriller? Inferring movie genre from its poster in AWS SageMaker

Reading Time: 10 minutes

Link to Jupyter Notebook

Setting the stage

Amazon SageMaker is a very popular service within the AWS ML suite, offering users the possibility to build, train and deploy Machine Learning models at scale. Despite it being introduced at re:Invent 2017, I had never had the chance to explore its functionality and test its true potential. Until a couple of weeks ago.

The idea I had in mind was to build a web application served by a SageMaker hosted model. This way I could experience the whole development cycle, from training, to deployment, to the integration of the production endpoint within the AWS ecosystem. What follows is the story of this journey. Given its complexity, I decided to split it into two parts:

  1. This blog post focuses purely on the Machine Learning section, on how to spin up a SageMaker notebook instance, run some modeling prototypes on it, train a model and eventually deploy it to a production endpoint.
  2. A second post will follow, focusing on what to do with the SageMaker endpoint. In that one, I will address how to build a serverless web application to expose the model to the world.

Movie genre from its poster

As for the modeling challenge to pick, I felt particularly inspired by the movies-posters dataset on Kaggle. A perfect combo of my passion for computer vision and cinema. In the original dataset, each movie is tagged with multiple genres, making it, technically, a multi-label multi-class classification task. I opted for a simplified version of the problem and decided to turn it into a binary classification exercise. Given my inevitable preference for thriller and crime movies, I thought it would make sense to categorize a movie as thriller/crime versus anything else.

The custom labeling logic I used is the following: a movie is flagged as thriller/crime if its genres include both thriller and crime, and include neither romance nor comedy. This might sound silly, but I found a lot of instances in which a movie was labeled as thriller and romance at the same time, and the end result was quite far from my definition of thriller. The only way I figured I could get a close proxy of my actual tastes was to apply the above restrictions.

The complete dataset is composed of ~39k posters. The previous labeling logic returned a total of ~1.1k thriller/crime movies. To further simplify my life I decided to build a balanced dataset, randomly sampling an equal number of NON-thriller/crime movies from the original list.
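For clarity, this is roughly how the labeling and sampling logic can be expressed in pandas. It is only a sketch: the `movies` DataFrame and its pipe-separated `Genre` column are hypothetical names, not necessarily the ones used in the notebook.

import pandas as pd

# Hypothetical DataFrame `movies` with a pipe-separated `Genre` column
def is_thriller_crime(genres: str) -> int:
    g = set(genres.lower().split('|'))
    # thriller AND crime present, romance and comedy absent
    return int({'thriller', 'crime'} <= g and not {'romance', 'comedy'} & g)

movies['label'] = movies['Genre'].apply(is_thriller_crime)

# Balance the classes: keep every thriller/crime poster and randomly sample
# an equal number of everything else
positives = movies[movies['label'] == 1]
negatives = movies[movies['label'] == 0].sample(n=len(positives), random_state=42)
balanced = pd.concat([positives, negatives])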

I hence ended up with a perfectly balanced set of ~2.2k posters. Here is some code to show a random selection of 12 of them (everything is included in the Jupyter Notebook linked at the top of the page).

import os
import mxnet as mx
import matplotlib.pyplot as plt

def standard_transform(data, label):
    # Cast to float and apply the default resize/crop augmenters for a 224x224 input
    data = data.astype('float32')
    augs = mx.image.CreateAugmenter(data_shape=(3, 224, 224))
    for aug in augs:
        data = aug(data)
    return data, label

def show_batch(rec_file):
    # FOLDER points at the directory containing the .rec files
    dataset = mx.gluon.data.vision.ImageRecordDataset(os.path.join(FOLDER, rec_file),
                                                      transform=standard_transform)
    loader = mx.gluon.data.DataLoader(dataset, batch_size=64, shuffle=True)
    x, y = next(iter(loader))
    fig, axes = plt.subplots(3, 4, figsize=(12, 8))
    for i, ax in enumerate(axes.flat):
        im = x[i]
        label = ['Other', 'Thriller/Crime'][int(y[i].asnumpy()[0])]
        ax.set_title(label)
        ax.set_axis_off()
        ax.imshow((im.clip(0, 255) / 255).asnumpy())

show_batch('valid_bi.rec')

Creating the dataset to feed to Gluon

`show_batch('valid_bi.rec')`. I know what you are thinking. Wait, what is a `.rec` file? Aren't we reading `.png` images directly?

Ok, let's rewind a little bit. I missed a key part of the story: converting the dataset into a format Gluon likes. Technically, plain images in `.png` format are fine. Gluon features an `ImageFolderDataset`, which accepts a root folder with image files. This is very convenient. I am running all my experiments within a SageMaker notebook instance, though. Where am I supposed to store the images? In an S3 bucket. So I moved all my posters into S3, spun up a notebook and ran

data_iter = mx.gluon.data.vision.ImageFolderDataset("s3://my-bucket-containing-jpg-images")

which threw all kinds of authentication-related errors, complaining about my `Access Key ID` and `Secret Access Key` not being set. This was weird, as I had made sure to attach all the relevant policies (S3 full access included) to the IAM role I had used for SageMaker. On top of that, from within the notebook itself, I was indeed capable of reading files from S3 via the `boto3` Python library. Probably a Gluon-specific issue.
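For reference, plain `boto3` calls like the following worked without a hitch from the very same notebook (the object key and local path here are just placeholders):

import boto3

# Direct S3 access via boto3 works fine with the notebook's IAM role
s3 = boto3.client('s3')
s3.download_file('movies-posters-raw', 'posters/sample.jpg', '/tmp/sample.jpg')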

Another option might have been to transfer all the posters from S3 to the SageMaker instance. This would work very well, as reading from a local folder is not an issue. Anyway, given the problems I was facing with plain `.png`s, I decided to drop raw images altogether and opt for the format MXNet officially recommends: `recordIO`. The library heavily relies on this binary format to build very efficient data loaders for deep learning.

How do we obtain a `.rec` file given a folder of images? There are a couple of tutorials available as part of the MXNet documentation. In a nutshell, it boils down to running `im2rec.py`, a script shipped with the library's installation. To find out where it is located on your machine, just run `mxnet.test_utils.get_im2rec_path()`, which returns the absolute path to the file. In theory, you need to run it twice: the first time to generate a `.lst` file, the second to produce the final `.rec` file. A `.lst` is just a fancy name for a text file listing all the images to be included in the dataset, one per row. Each row needs to contain an image ID, a label and a path to the image (either absolute or relative, depending on the directory from which you decide to run the script). To have more control over the process I preferred to generate the `.lst` files (one for training and another one for validation) manually. For that, I wrote a little function (`save_lst`) which saves a `pd.DataFrame` in the appropriate format. Let's take the example of the training set, `raw_train`.

`save_lst` takes a `pd.DataFrame`, already containing the relevant columns in the exact order, and simply saves it into a text file with a `.lst` extension (taking care of the quotation marks which might appear around labels in the process).

def save_lst(x, name):
    # Dump the DataFrame as a tab-separated file first...
    x.to_csv(os.path.join(FOLDER, 'temp.lst'), index=False, header=None, sep='\t')

    # ...then strip any quotation marks pandas may have added around fields
    with open(os.path.join(FOLDER, 'temp.lst'), "rt") as fin:
        with open(os.path.join(FOLDER, name), "wt") as fout:
            for line in fin:
                fout.write(line.replace('"', ''))

Running the function on both `raw_train` and `raw_valid` produces the two `.lst` files.
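For illustration, the rows of such a file follow the im2rec convention of tab-separated index, label and (relative) image path; the IDs and file names below are made up:

0	1	posters/123456.jpg
1	0	posters/234567.jpg
2	1	posters/345678.jpg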

Once the `.lst` files are created, we can proceed to run the `im2rec.py` script. Given that we have used relative paths, we need to run the script from within the folder containing 1) the `.lst` files and 2) the subdirectory with the actual images. For clarity, running an `ls` command should produce the following

$ ls
train.lst    valid.lst    posters

with `posters` being the directory containing the poster images. Now we can execute the script from a terminal window, like this

$ python full-path-to-im2rec.py-file train.lst posters/
$ python full-path-to-im2rec.py-file valid.lst posters/

The script will run for a while and eventually generate one `.rec` and one `.idx` file for each .lst processed.

Now that we have our images in `recordIO` format, we can upload them to S3 and, hopefully, load the dataset into SageMaker using a standard `ImageRecordDataset`. Yet, once again, I stumbled upon authentication issues, with Gluon complaining about not having the right credentials to access S3. As stated before, the IAM role attached to the notebook had all the necessary policies, and I was indeed capable of loading files from S3 into SageMaker with the `boto3` Python SDK. This was weird and rather frustrating, as I was not able to find an easy fix online and it represented a clear blocker for the project: no data, no model. The only solution I could come up with was to download the `.rec` files from S3 into the notebook at instance boot. This is achievable with lifecycle configuration scripts, which are lines of code executed when the SageMaker EC2 instance is fired up. This is what my lifecycle configuration script looks like

#!/bin/bash
set -e
mkdir /tmp/data
aws s3 cp s3://movies-posters-raw/train_bi.rec /tmp/data/
aws s3 cp s3://movies-posters-raw/train_bi.idx /tmp/data/
aws s3 cp s3://movies-posters-raw/train_bi.lst /tmp/data/
aws s3 cp s3://movies-posters-raw/valid_bi.rec /tmp/data/
aws s3 cp s3://movies-posters-raw/valid_bi.idx /tmp/data/
aws s3 cp s3://movies-posters-raw/valid_bi.lst /tmp/data/

As you can see, I create a `/tmp/data` directory and dump all the relevant files I had previously uploaded to S3 in there. Eventually, I was able to open up my SageMaker Jupyter notebook and successfully run

dataset = mx.gluon.data.vision.ImageRecordDataset('/tmp/data/train_bi.rec')

Training a model in Gluon

Let's recap for a sec. We have the data pre-packaged in the format MXNet/Gluon likes the most (`recordIO`). The next step is to build a binary classifier to predict whether a poster belongs to a thriller/crime movie or not. I could have gone ahead and simply trained a classifier with the built-in SageMaker algorithm; this is what I wanted to experiment with, after all. I still decided to take a small detour and train a network myself with Gluon.

I won't spend too much time on this section, as it is not the main purpose of the post. Still, I think it is worth mentioning, for the record, and also to have a benchmark for future comparison with the built-in SageMaker model. The steps I followed are pretty standard (here is the notebook once again).

  1. Pre-processing images while creating the dataset. Both train and validation pictures are resized to `182 x 182` pixels (`182` is the width of the poster, i.e. the smaller of the two dimensions) and normalized using ImageNet stats (we will be using transfer learning with pre-trained CNNs). Train images (not validation) also go through augmentation (random crop, brightness, contrast, etc.).
  2. Loading the datasets in batches via a DataLoader.
  3. Defining the architecture. I went with a pre-trained ResNet34.
  4. Training the model (a minimal sketch of this step follows the list).
    1. Start by freezing the entire CNN except for the last fully connected layer. Train for a couple of epochs at a moderately high learning rate.
    2. Unfreeze the CNN and apply to the convolutional layers a learning rate 100 times lower than the one used for the last fully connected layer. We don't want to screw up the ImageNet weights too much; we just need to tweak them a bit. Keep training for another couple of epochs.
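Here is a minimal sketch of steps 3 and 4, assuming a `train_loader` DataLoader has already been built as described in steps 1 and 2; the exact epochs and learning rates are illustrative, not necessarily the ones used in the notebook.

import mxnet as mx
from mxnet import gluon, autograd
from mxnet.gluon.model_zoo import vision as models

ctx = mx.gpu(0) if mx.context.num_gpus() > 0 else mx.cpu()

# Pre-trained ResNet34 backbone with a fresh 2-class output layer
net = models.resnet34_v2(pretrained=True, ctx=ctx)
with net.name_scope():
    net.output = gluon.nn.Dense(2)
net.output.initialize(mx.init.Xavier(), ctx=ctx)

loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()

def run_epochs(trainer, epochs):
    for _ in range(epochs):
        for data, label in train_loader:
            data, label = data.as_in_context(ctx), label.as_in_context(ctx)
            with autograd.record():
                loss = loss_fn(net(data), label)
            loss.backward()
            trainer.step(data.shape[0])

# Phase 1: freeze the backbone, train only the new output layer
for param in net.features.collect_params().values():
    param.grad_req = 'null'
run_epochs(gluon.Trainer(net.output.collect_params(), 'adam',
                         {'learning_rate': 1e-3}), epochs=2)

# Phase 2: unfreeze the backbone with a learning rate 100x lower than the head
for param in net.features.collect_params().values():
    param.grad_req = 'write'
    param.lr_mult = 0.01
run_epochs(gluon.Trainer(net.collect_params(), 'adam',
                         {'learning_rate': 1e-3}), epochs=2)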

Results were overall extremely poor. I managed to squeeze only a disappointing 63% accuracy out of the validation set. Initially, I thought I was getting something really wrong as, no matter what I did, it was super hard to improve the score. Then I tried throwing the same dataset at a fastai learner. Fastai embeds state-of-the-art deep learning best practices in how it handles the training phase, and it generally achieves stunning results in almost no time. Therefore, in order to benchmark my model, I went for a standard fastai loop and got 68% accuracy. This was encouraging on one side, as it showed I was not doing that badly after all, and very frustrating on the other, as it highlighted that either the problem was a very tough one or something was fundamentally wrong with the dataset, i.e. mislabeled posters or too aggressive an augmentation. I could argue that an accuracy of ~63% is better than random, so I decided to park this issue for the time being and stay focused on the real scope of this exercise, i.e. practicing with SageMaker.

Using the SageMaker default image-classification algorithm

Using SageMaker as a Jupyter notebook hosting platform is, to say the least, underestimating its real potential. The beauty of this AWS service resides in the ability to train, host and deploy a model all in one go. So that is exactly what I did next.

The Image Classification built-in algorithm was my first choice. I was impressed by the number of hyper-parameters left to the user to play around with. From the number of layers in the ResNet (the architecture used under the hood), to the learning rate, to the image augmentation strategies, to the option of using pre-trained models, the possible configurations rival a fully custom CNN implementation such as my previous one in Gluon.

As soon as the data is ready in `recordIO` format, the rest is basically a copy-paste from this notebook, with appropriate tuning of the relevant hyper-parameters. The ones I ended up using are the following

    "HyperParameters": {
        "image_shape": '3,182,182',
        "num_layers": '34',
        "num_training_samples": '1758',
        "num_classes": '2',
        "mini_batch_size": '128',
        "epochs": '10',
        "learning_rate": '0.001',
        "use_pretrained_model": '0',
        "augmentation_type": 'crop_color_transform'}

In terms of the other parameters to be set for the SageMaker API to work, there are only a few things to take care of: basically, the S3 locations containing the data and the EC2 instance type to spin up for training. A minimal sketch of the full call is shown below. Once everything is in place, we can kickstart the job and follow its progress in the training jobs section of the console.
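This is roughly what the request looks like with the boto3 `create_training_job` call; the container image URI, role ARN, output bucket and instance type are placeholders to adapt, while the hyper-parameters are the ones listed above.

import boto3

sm = boto3.client('sagemaker')

# Placeholders: the built-in image-classification container URI is region and
# account specific, and the role ARN is the one attached to the notebook
training_image = '<account>.dkr.ecr.<region>.amazonaws.com/image-classification:latest'
role_arn = 'arn:aws:iam::<account>:role/<sagemaker-execution-role>'

sm.create_training_job(
    TrainingJobName='is-thriller-movie-2019-02-21-18-06-52',
    AlgorithmSpecification={'TrainingImage': training_image,
                            'TrainingInputMode': 'File'},
    RoleArn=role_arn,
    InputDataConfig=[
        {'ChannelName': 'train',
         'ContentType': 'application/x-recordio',
         'DataSource': {'S3DataSource': {
             'S3DataType': 'S3Prefix',
             'S3Uri': 's3://movies-posters-raw/train_bi.rec',
             'S3DataDistributionType': 'FullyReplicated'}}},
        {'ChannelName': 'validation',
         'ContentType': 'application/x-recordio',
         'DataSource': {'S3DataSource': {
             'S3DataType': 'S3Prefix',
             'S3Uri': 's3://movies-posters-raw/valid_bi.rec',
             'S3DataDistributionType': 'FullyReplicated'}}}],
    OutputDataConfig={'S3OutputPath': 's3://<output-bucket>/models'},  # placeholder
    ResourceConfig={'InstanceCount': 1,
                    'InstanceType': 'ml.p2.xlarge',  # example GPU instance
                    'VolumeSizeInGB': 50},
    StoppingCondition={'MaxRuntimeInSeconds': 3600},
    HyperParameters={'image_shape': '3,182,182',
                     'num_layers': '34',
                     'num_training_samples': '1758',
                     'num_classes': '2',
                     'mini_batch_size': '128',
                     'epochs': '10',
                     'learning_rate': '0.001',
                     'use_pretrained_model': '0',
                     'augmentation_type': 'crop_color_transform'})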

For whatever reason, training a model from scratch (`"use_pretrained_model": '0'`) gives similar results to using transfer learning in Gluon, which is really surprising and probably further proof of how tough the task is. Once again, even with this approach I managed to reach only ~63% accuracy. Here are the training and validation metrics I extracted from CloudWatch.

import boto3
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

client = boto3.client('logs')
lgn = '/aws/sagemaker/TrainingJobs'
lsn = 'is-thriller-movie-2019-02-21-18-06-52/algo-1-1550772609'
log = client.get_log_events(logGroupName=lgn, logStreamName=lsn)

# Parse the per-epoch accuracy values out of the training log messages
train_acc = []
val_acc = []
for event in log['events']:
    msg = event['message']
    if 'Train-accuracy' in msg:
        train_acc.append(float(msg.split('=')[1]))
    if 'Validation-accuracy' in msg:
        val_acc.append(float(msg.split('=')[1]))

# Drop duplicated log lines while preserving the epoch order
train_acc = list(dict.fromkeys(train_acc))
val_acc = list(dict.fromkeys(val_acc))

results = pd.DataFrame({'Train Accuracy': train_acc,
                        'Validation Accuracy': val_acc,
                        'Epochs': np.arange(10) + 1})
fig, ax = plt.subplots(figsize=(11, 6))
results.plot(ax=ax, x='Epochs', y=['Train Accuracy', 'Validation Accuracy'])
plt.show()

CloudWatch (CW) was arguably the most useful experimentation partner during this whole exercise. After running a training job in SageMaker, it is possible to check its outcome in the CW Logs at quite an interesting level of detail: basically the output we would be accustomed to seeing within a notebook. The training and validation accuracy charted above come straight from those logs. Lots of interesting info is in there; among other things, SageMaker tracks the evolution of the validation set's performance and saves the best model whenever it improves.

Deploying and testing the endpoint

Once the training is done, we can move to the deployment part. In a nutshell, this consists of spinning up an EC2 instance which will be in charge of running the model at inference time. The result of the process is a SageMaker endpoint with an ARN attached to it: just a fancy way of saying that we now have a string we can use to call an AWS service. The only interesting choice you get to make at this stage is the type of EC2 instance that will serve the predictions. In my case, as I don't need to serve batch predictions (just one image at a time), I opted for the least expensive machine available, `ml.t2.medium`.
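For reference, the deployment boils down to three boto3 calls; the model, endpoint-config and endpoint names below are made up, and `training_image` / `role_arn` are the same placeholders used for the training job.

import boto3

sm = boto3.client('sagemaker')

model_name = 'is-thriller-model'              # hypothetical names
endpoint_config_name = 'is-thriller-config'
endpoint_name = 'is-thriller-endpoint'

# 1. Register the model artifact produced by the training job
sm.create_model(
    ModelName=model_name,
    PrimaryContainer={
        'Image': training_image,  # same container image used for training
        'ModelDataUrl': 's3://<output-bucket>/<job-name>/output/model.tar.gz'},  # placeholder
    ExecutionRoleArn=role_arn)

# 2. Endpoint configuration: which model, how many instances, which type
sm.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[{
        'VariantName': 'AllTraffic',
        'ModelName': model_name,
        'InitialInstanceCount': 1,
        'InstanceType': 'ml.t2.medium'}])

# 3. Spin up the endpoint itself (this can take a few minutes)
sm.create_endpoint(EndpointName=endpoint_name,
                   EndpointConfigName=endpoint_config_name)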

Let's see if this works. I called the newly created endpoint from within the SageMaker notebook itself, submitting the poster of the 2005 movie Hostage, starring Bruce Willis.

import json
import boto3
import numpy as np

# test_image (path to the poster) and endpoint_name are defined earlier in the notebook
runtime = boto3.Session().client(service_name='runtime.sagemaker')
with open(test_image, 'rb') as f:
    payload = bytearray(f.read())
response = runtime.invoke_endpoint(EndpointName=endpoint_name,
                                   ContentType='application/x-image',
                                   Body=payload)
# The response body is JSON: a list with one probability per class
result = json.loads(response['Body'].read())
# Pick the class with the highest probability
index = np.argmax(result)
object_categories = ['Non Thriller', 'Thriller']
print("Result: label - " + object_categories[index] + ", probability - " + str(result[index]))
# Result: label - Thriller, probability - 0.6329127550125122

Nice! SageMaker correctly processes the poster and returns a prediction. In this case, the model is 63% confident the movie is a thriller/crime.

The network has been correctly deployed to an endpoint. The next step is to make the endpoint accessible to the world, in order to serve the model to external users. For that, we'll build a little web application, so rendez-vous in the next post!
