
Benchmarking TorchVision ResNet18 on Amazon SageMaker CPU, GPU, and Inferentia instances (with a Neo twist)


Code and disclaimer: you can find the notebook with the relevant code here. Most of it was copied from the official Amazon SageMaker repo, without much change.

TLDR 🤷‍♂️

Head to the Benchmark results section below for the details.

The what and the why💡

The goal of this short post is to benchmark, both cost- and latency-wise, a couple of ML inference options we have on Amazon SageMaker. We’ll do that by deploying a pretrained PyTorch ResNet18 model on:

  1. CPU
  2. GPU
  3. AWS Inferentia

Which one is the fastest and most cost-effective? As a heavy AWS user in the ML space, that’s the kind of question I want a clear answer to. Let’s go.

Deploy a pretrained PyTorch model to SageMaker 🛠️

The prerequisite for any benchmarking is to have a model productionised somewhere first. So, let’s see how to deploy a PyTorch NN to SageMaker under the above scenarios. I opted for simplicity and went for an ImageNet-pretrained ResNet18. The process is the following:

  1. Download the model from the Torch Hub
  2. Convert it to TorchScript and save it locally. TorchScripting a model is incredibly useful and you should consider this route whenever possible. In my experiments, scripting does not necessarily translate into a latency boost with respect to the non-scripted version. The big advantage is that once you script a model and save it, you don’t need the network definition around to load it back into a PyTorch object. In the non-scripted case, you have to create an “empty” model object with all its layers defined and then load the saved weights into it, which means keeping the network architecture around in your production code. Not being obliged to do so potentially saves a lot of lines of code (see the short sketch right after this list). In our specific test, scripting is also a prerequisite for compiling the model with Neo, so I am killing two birds with one stone.
  3. Zip the model together with its inference script into a model.tar.gz file (model artifacts). The inference script contains the logic to load and pre-process the input to the endpoint (an image in our case), load the model from within the Docker container the code is running on, run inference, and post-process the output. This script allows the user to override a set of pre-defined functions AWS exposes to customize the prediction pipeline. There are several resources about this, but I found this notebook one of the clearest so I’d suggest going for it if you feel you need more details.
  4. Upload the model artifacts to S3. The model.tar.gz file is pulled by SageMaker, decompressed, and made available inside the Docker container that serves the endpoint.
  5. Create a PyTorchModel object from the SageMaker SDK. This is where we set the S3 location of the model’s artifacts, the IAM role in case we need SageMaker to perform actions on additional AWS services, the name of the inference script, together with a ton of additional potentially useful arguments.
  6. (Optional) Compile the model with Neo. This step is needed only if we want to optimize the trained model for a specific target hardware. We’ll try this approach and check whether it delivers the latency improvements advertised by AWS.
  7. Deploy the model to a SageMaker real-time instance and check latency.
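
To make the scripting point from step 2 concrete, here is a minimal sketch of the two loading routes. The weights.pth file name is purely illustrative; model.pth is the file the snippet further down actually saves.

import torch
from torchvision import models

# NON-SCRIPTED ROUTE: the network definition must be available at load time
model = models.resnet18()                          # re-create the architecture
model.load_state_dict(torch.load("weights.pth"))   # then load the saved weights
model.eval()

# SCRIPTED ROUTE: the architecture is baked into the saved file,
# so no class definition (or torchvision import) is needed to load it back
scripted = torch.jit.load("model.pth")
scripted.eval()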

The code to achieve the above (except imports, boto3 objects, and inference script) fits onto a single screen. Here it is.

# DOWNLOAD MODEL, SCRIPT IT AND SAVE
resnet18 = models.resnet18(pretrained=True)
input_shape = [1, 3, 224, 224]
trace = torch.jit.trace(resnet18.float().eval(), torch.zeros(input_shape).float())
trace.save("model.pth")

# ZIP MODEL ARTIFACTS AND UPLOAD TO S3
with tarfile.open("model.tar.gz", "w:gz") as f:
    f.add("model.pth")
    f.add("serve_base.py")
model_uri = sess.upload_data(path="./model.tar.gz", key_prefix="neo_pytorch")

# DEFINE PYTORCHMODEL
pytorch_model = PyTorchModel(
    model_data=model_uri,
    role=role,
    entry_point="serve_neo.py",
    framework_version="1.5.0",
    py_version="py3",
)

# COMPILE IT WITH NEO (OPTIONAL)
pytorch_model = pytorch_model.compile(
    target_instance_family="ml_c5",
    input_shape={"input0": [1, 3, 224, 224]},
    output_path=model_uri,
    framework="pytorch",
    framework_version="1.5.1",
    role=role,
    job_name=f"neo-pytorch-{int(time.time())}",
)

# DEPLOY TO SAGEMAKER
predictor = pytorch_model.deploy(instance_type="ml.c5.xlarge",
                                 endpoint_name="pytorch-endpoint",
                                 initial_instance_count=1)

# INVOKE THE ENDPOINT WITH AN IMAGE OF A CAT
with open("cat.jpg", "rb") as f:
    payload = f.read()

response = sm_runtime.invoke_endpoint(
    EndpointName=predictor.endpoint_name, ContentType="application/x-image", Body=payload
)

As for the inference scripts, you might have noticed from the notebook that I am using two different versions: one for the Neo-compiled models (serve_neo.py), and another one for the plain PyTorch networks (serve_base.py). The former implements just the input_fn function, whereas the latter overrides the whole suite of inference functions. The only reason behind this difference is that I lazily copied the two examples (Neo vs. standard, non-optimized deployment) from two different GitHub sources and didn’t invest time in merging them. Therefore, I am not sure whether the Neo-compiled models also work with the serve_base approach, or if there is some AWS dark magic at play.
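
For reference, a serve_base.py-style script boils down to overriding the four hooks the SageMaker PyTorch serving container exposes. Below is a minimal sketch of what that looks like; the exact pre/post-processing in the notebook’s scripts may differ.

import io
import json
import os

import torch
from PIL import Image
from torchvision import transforms


def model_fn(model_dir):
    # Load the TorchScript model shipped inside model.tar.gz
    model = torch.jit.load(os.path.join(model_dir, "model.pth"), map_location="cpu")
    return model.eval()


def input_fn(request_body, request_content_type="application/x-image"):
    # Decode the raw image bytes and apply standard ImageNet preprocessing
    image = Image.open(io.BytesIO(request_body)).convert("RGB")
    preprocess = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])
    return preprocess(image).unsqueeze(0)


def predict_fn(input_data, model):
    # Plain forward pass, no gradients needed at inference time
    with torch.no_grad():
        return model(input_data)


def output_fn(prediction, accept="application/json"):
    # Return the index of the most likely ImageNet class
    return json.dumps({"class_id": int(prediction.argmax(dim=1))})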

Choose inference hardware and inference strategies 🤔

All right, we know how to deploy models to a SageMaker endpoint. Which machine shall we choose though?

I wanted to have some sort of apples2apples comparison here, to be able to say something like “I changed X, all the rest being equal”. This is not entirely possible of course, so I did my best to pick machines with a similar number of cores and amount of RAM, within the same price range. Both ml.c5.xlarge and ml.inf1.xlarge have 4 vCPUs and 8 GB of RAM. The first belongs to the Compute Optimized group of instances, hence it is quite performant already; the second is the least powerful of the AWS Inferentia machines. As for GPUs, I went for the least expensive option, the ml.g4dn.xlarge, featuring a 16 GB NVIDIA T4.

If you are wondering what AWS Inferentia is, I’d suggest heading to the service page. In a nutshell 👇

“AWS Inferentia is our first purpose-built accelerator designed to accelerate deep learning workloads and is part of a long-term strategy to deliver on this vision” … “Each AWS Inferentia chip has four first-generation NeuronCores and supports up to 128 tera operations per second (TOPS) of performance with up to 16 Inferentia chips per EC2 Inf1 instance” … “Developers can train models using popular frameworks such as TensorFlow, PyTorch, and MXNet, and easily deploy them to AWS Inferentia-based Inf1 instances using the AWS Neuron SDK”

This chip promises wonders, and it is the main reason I wanted to run some benchmark tests. We are not obliged to deploy on AWS Inferentia though. We can simply stick to SageMaker Neo and optimize the model for a different target hardware. What is Neo? Same as above, I won’t venture into explaining something I have barely scratched the surface of. The service page is the best way to get started. In a nutshell 👇

“Amazon SageMaker Neo enables developers to optimize machine learning (ML) models for inference on SageMaker in the cloud and supported devices at the edge” … “Amazon SageMaker Neo automatically optimizes machine learning models to perform up to 25x faster with no loss in accuracy. SageMaker Neo uses the tool chain best suited for your model and target hardware platform while providing a simple standard API for model compilation”

[Diagram: How Neo works]

Summarizing, here are our four deployment scenarios (a small configuration sketch follows the list):

  1. CPU: ml.c5.xlarge
  2. GPU: ml.g4dn.xlarge
  3. CPU (same as #1) with hardware optimization (Neo compilation): ml.c5.xlarge
  4. AWS Inferentia with hardware optimization (Neo compilation): ml.inf1.xlarge
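
As a rough sketch, the only knobs that change across the four scenarios are the Neo compilation target (if any) and the instance type passed to deploy(). The dictionary below is just an illustrative way to organize the runs; ml_inf1 is the Neo target family corresponding to Inferentia.

# ILLUSTRATIVE SCENARIO MAP: Neo compilation target (None = no compilation)
# and the SageMaker instance type used at deploy time
scenarios = {
    "cpu":        {"neo_target": None,      "instance_type": "ml.c5.xlarge"},
    "gpu":        {"neo_target": None,      "instance_type": "ml.g4dn.xlarge"},
    "cpu_neo":    {"neo_target": "ml_c5",   "instance_type": "ml.c5.xlarge"},
    "inferentia": {"neo_target": "ml_inf1", "instance_type": "ml.inf1.xlarge"},
}
# For the last two scenarios, the compile() call shown earlier is run with the
# corresponding target_instance_family before deploy(); the first two skip it.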

Also, if you want to read more about how to accelerate your inference Deep Learning workflows, I highly recommend this wonderful read.

Benchmark results 🏃‍♂️💨

I then deployed the same model to 4 different endpoints, following the strategies illustrated above.

I timed the endpoint-invoking operation from within the notebook like so …

%%timeit
_ = sm_runtime.invoke_endpoint(EndpointName=predictor.endpoint_name, 
                               ContentType="application/x-image", 
                               Body=payload)

… and checked the latency reported in CloudWatch too (a sketch of how to pull that metric programmatically follows right after). The results are summarized in the table and the bar chart below.
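
On the CloudWatch side, the relevant metric is ModelLatency in the AWS/SageMaker namespace, which is reported in microseconds. A minimal sketch of pulling it with boto3 (the 30-minute window is just an example):

from datetime import datetime, timedelta
import boto3

cloudwatch = boto3.client("cloudwatch")

stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/SageMaker",
    MetricName="ModelLatency",  # reported in microseconds
    Dimensions=[
        {"Name": "EndpointName", "Value": predictor.endpoint_name},
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    StartTime=datetime.utcnow() - timedelta(minutes=30),
    EndTime=datetime.utcnow(),
    Period=1800,
    Statistics=["Average"],
)
# Datapoints is empty if the endpoint received no traffic in the window
avg_latency_ms = stats["Datapoints"][0]["Average"] / 1000  # microseconds -> milliseconds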

It was important to me to measure both the speed and the cost-effectiveness of the inference pipeline. To get a sense of the latter, I estimated the number of predictions occurring in 1 hour by converting the CloudWatch latency from milliseconds to seconds and dividing 3600 (the number of seconds in 1 hour) by it. This obviously assumes predictions happen one after the other, which is not exactly what might happen in a real production setting, but it was a good enough proxy for me. I then divided the hourly price of the AWS instance by the number of predictions in 1 hour and obtained the estimated cost per prediction.
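
In code, the back-of-the-envelope math looks like this. The numbers in the example call are placeholders, not the measured benchmark results.

def cost_per_prediction(latency_ms, instance_price_per_hour):
    # Assumes predictions run back to back on a single instance
    predictions_per_hour = 3600 / (latency_ms / 1000)  # seconds in an hour / latency in seconds
    return instance_price_per_hour / predictions_per_hour

# Example with placeholder values
print(cost_per_prediction(latency_ms=50.0, instance_price_per_hour=0.20))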

Long story short

  • AWS Inferentia wins hands down. 6x faster and 4x cheaper compared to the plain CPU solution 🤯.
  • Neo optimization alone on bare CPU already helps, bringing latency and costs down by 1.3x compared to a standard ml.c5.xlarge.
  • The GPU is 4.3x faster than the plain CPU, but per prediction it ends up more expensive than the Neo-compiled CPU version (a 1.2x cost reduction over plain CPU for the GPU vs 1.3x for CPU+Neo).

It seems AWS Inferentia is worth the investment if you can afford getting Neo to work on your model. I tried ResNet18, which is a super simple one, but can’t guarantee the process will be as smooth on more complex networks.

Ranting time 😭

I hesitated until the very end to add a “what-the-heck-let-me-rant” section to this post. Truth is, if I didn’t, I’d omit a big chunk of the work, so I eventually went for it. Let me explain how everything panned out over time.

It happened more or less in the following way:

  1. I had an ONNX model of a YoloV5 face detector, an output of this previous post of mine. My first idea was to deploy it to Inferentia. The attempt failed quickly though: SageMaker Neo only supports Image Classification and SVM (really?) ONNX models. Why? I hope this will change in the future, as ONNX is literally the most widely known platform- and framework-agnostic solution to deploy models out there. Why support only image classification, and most importantly, why SVM?
  2. I then told myself, “all right, let’s go with MXNet then!”. At the time of writing, MXNet had the most models supported for Neo compilation. On top of that, MXNet is the official AWS Deep Learning framework of choice: Amazon is backing it and has an AI unit actively contributing to it. Considering I wanted to deploy to SageMaker, I thought the combo was a great choice. How naive 😅! I literally spent more than 3 weeks (working a couple of hours, every other day) trying to figure this MXNet+Neo+Inferentia madness out (thanks to Mike Chambers for the support here!). The screenshots below show my GitHub commit history (proving the time spent) plus the completely incomprehensible error message I kept getting from SageMaker (here is my failed notebook for transparency). This entire post was supposed to take no more than a weekend. For real. How hard can it be to deploy a pretrained model and run some benchmarks? Turns out I was off. By a lot. Or rather, I was wrong about MXNet but right about the initial time estimate: it was just a matter of using the “right” DL framework. The moment I switched to PyTorch (thanks to Matthew McClean for the suggestion!) it literally took me two days end2end to get the code to work, run the tests, and put together the write-up. How on earth did MXNet fail so badly? I found this deeply concerning and very unexpected.

Anyway, it is what it is. The important bit is that I managed to ship what I had in mind.

Happy hacking!

[Screenshot: my GitHub commit history — I basically spent more than 3 weeks (working a couple of hours, every other day) trying to figure MXNet out 😤]
[Screenshot: the gibberish MXNet error message]
