
LogoNet: the journey to an AWS-powered cloud application running ControlNet on SageMaker async endpoints

  • Here is the ControlNet fork
  • Here is our repo with the code we used in our stack
  • 🚨 We pulled the plug on LogoNet, so the service is currently not available

Context

In this post, Lucas and I describe the end-to-end cloud solution we built to deploy LogoNet. That’s the name we gave to a web application that accepts an image as user input and returns 30 variations of it, generated from curated prompts fed to a ControlNet model. Below you can find the high-level workflow, starting with the user uploading a logo and submitting an email address, and ending with the user receiving an email with the ControlNet results.

High-level LogoNet flow using the PyTorch logo

The diffusion magic

We won’t spend much time on the ML component of the product. Despite it being at the core of the project, we didn’t do anything too exciting there. We just forked this repo, which comes with a very clean implementation of ControlNet, and used the code with minimal adjustments. Specifically, among all the provided options, we went for Canny edge detection, which conditions the diffusion model on the contours of the object(s) depicted in the image. This is why it works well on stylized and simple images such as brand logos: those are generally objects on a monochrome background with clearly defined edges, which helps preserve the object almost completely and integrate it nicely with whatever we prompt Stable Diffusion to draw. The figure above shows how this strategy applies to the PyTorch logo. Its clear shape is left unaltered and then rendered as part of the prompt. So, when we “ask” the model to come up with “Japanese Spaceship in the style of a woodblock print” (literally one of the prompts we used), the PyTorch logo melts nicely into the landscape, creating unexpected and (hopefully) visually pleasing artifacts.

So, you guessed right. Prompts are the magic ingredient to get good results. We iterated quite a bit on those and came up with a list of 30. At inference time we would run the input image through the model, submitting one prompt at a time, in a loop (without setting a seed).
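To make that concrete, here is a minimal sketch of what such a loop could look like, written with the diffusers library rather than the ControlNet fork we actually used; the model IDs, prompt list, and file paths are illustrative only.

```python
# Minimal sketch of the per-prompt loop, using diffusers (NOT our actual fork).
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

CURATED_PROMPTS = [
    "Japanese Spaceship in the style of a woodblock print",
    # ... 29 more curated prompts
]

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

def generate_variations(logo_path: str, output_dir: str) -> None:
    image = np.array(Image.open(logo_path).convert("RGB").resize((512, 512)))
    # Condition the diffusion model on the Canny edges of the logo
    edges = cv2.Canny(image, 100, 200)
    canny_image = Image.fromarray(np.stack([edges] * 3, axis=-1))
    for i, prompt in enumerate(CURATED_PROMPTS):
        # One prompt at a time, no fixed seed: every run produces new variations
        result = pipe(prompt, image=canny_image, num_inference_steps=30).images[0]
        result.save(f"{output_dir}/variation_{i:02d}.png")
```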

The AWS architecture

Let’s dive directly into the cloud solution. Here’s how it works.

  1. Users navigate to visualneurons.com. This is a static website hosted on an S3 bucket, with Amazon Route 53 resolving the public domain and pointing to it. They type in their email address (needed for subsequent communication) and upload a logo from local storage.
  2. Then they hit Upload.
    • The button triggers a basic JS email validation, plus the actual upload of the image file to our private S3 bucket. The S3 putObject operation permits attaching the email address to the file as metadata. This will allow us to retrieve it later on in the pipeline.
    • Now, here is an interesting challenge. How do we let users upload an object to an S3 bucket they don’t have access to? Meet Amazon Cognito. This service, together with the JavaScript SDK, makes the entire process a 🍰. The idea is to create Identity Pools. Those are Cognito entities whose purpose is to provide temporary access to specific AWS resources to all kinds of users. This post does a fantastic job of introducing the service setup, so we won’t reinvent the wheel.
  3. The event of the image file landing on S3 triggers a Lambda function responsible for the following (a sketch of such a handler follows this list):
    • Read the image file it was triggered by, and extract the user’s email address from the metadata.
    • Create and save to S3 the payload for the SageMaker async endpoint. This is a JSON file containing whatever the endpoint was built to accept and process. In our case, it features three fields: the user’s email address, the S3 URI of the input logo, and the S3 URI of the location where the endpoint will save ControlNet’s outputs.
    • Invoke the SageMaker async endpoint with the S3 location of the JSON payload file. Given the endpoint is async (more details on this later), we don’t need to wait for it to process our request. The call immediately returns and we are done.
    • Send an email to the user confirming the logo has been correctly ingested and that they’ll hear from us soon. This is achieved via Amazon Simple Email Service (SES). If you are setting it up for the first time, refer to this old post of mine for how to get out of the sandbox and activate it correctly.
  4. Once the above is done, Lambda exits, effectively passing the ball to SageMaker. In a nutshell, the SageMaker async endpoint is in charge of reading the input logo from S3, running it through ControlNet once per curated prompt, saving the 30 outputs to the S3 location specified in the payload, and emailing the results to the user. We are going to dive deep into how this works in the next section.
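For concreteness, here is a hedged sketch of what such a Lambda handler could look like with boto3. Bucket names, the endpoint name, the metadata key, and the sender address are all placeholders; the real handler differs in its details.

```python
# Hypothetical sketch of the Lambda handler triggered by the S3 upload.
# All names (buckets, endpoint, sender address) are placeholders.
import json
import uuid

import boto3

s3 = boto3.client("s3")
sm_runtime = boto3.client("sagemaker-runtime")
ses = boto3.client("ses")

PAYLOAD_BUCKET = "logonet-payloads"         # placeholder
OUTPUT_PREFIX = "s3://logonet-outputs"      # placeholder
ENDPOINT_NAME = "logonet-controlnet-async"  # placeholder
SENDER = "hello@visualneurons.com"          # placeholder

def handler(event, context):
    # 1. Figure out which object triggered us and read its metadata
    record = event["Records"][0]["s3"]
    bucket, key = record["bucket"]["name"], record["object"]["key"]
    head = s3.head_object(Bucket=bucket, Key=key)
    email = head["Metadata"]["email"]  # attached by the client at upload time

    # 2. Build the JSON payload the async endpoint expects and save it to S3
    job_id = str(uuid.uuid4())
    payload = {
        "email": email,
        "input_uri": f"s3://{bucket}/{key}",
        "output_uri": f"{OUTPUT_PREFIX}/{job_id}/",
    }
    payload_key = f"payloads/{job_id}.json"
    s3.put_object(Bucket=PAYLOAD_BUCKET, Key=payload_key, Body=json.dumps(payload))

    # 3. Invoke the async endpoint with the S3 location of the payload.
    # The call returns immediately; SageMaker queues the request internally.
    sm_runtime.invoke_endpoint_async(
        EndpointName=ENDPOINT_NAME,
        InputLocation=f"s3://{PAYLOAD_BUCKET}/{payload_key}",
        ContentType="application/json",
    )

    # 4. Confirm ingestion to the user via SES
    ses.send_email(
        Source=SENDER,
        Destination={"ToAddresses": [email]},
        Message={
            "Subject": {"Data": "LogoNet: we got your logo!"},
            "Body": {"Text": {"Data": "Your logo was ingested. You'll hear from us soon."}},
        },
    )
    return {"statusCode": 200}
```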

Deploy ControlNet to a SageMaker async endpoint

Why async?

First things first. What are SageMaker async endpoints, and why did we go for asynchronous inference at all? The opening of the official docs says it all:

Amazon SageMaker Asynchronous Inference is a new capability in SageMaker that queues incoming requests and processes them asynchronously. This option is ideal for requests with large payload sizes (up to 1GB), long processing times (up to one hour), and near real-time latency requirements. Asynchronous Inference enables you to save on costs by autoscaling the instance count to zero when there are no requests to process, so you only pay when your endpoint is processing requests.

https://docs.aws.amazon.com/sagemaker/latest/dg/async-inference.html

At least for us, the key elements were:

  1. ⏳ “long processing times (up to one hour)”: we used a beefy ml.g5.2xlarge (NVIDIA A10 with 24 GB of GPU memory). The model does not allow processing multiple prompts at a time, so the input image had to be fed to ControlNet in a for loop, 30 times (one per curated prompt). Even on an A10, that takes around 15 minutes. Every other SageMaker inference option has a hard timeout of 1 minute. Async was basically our only option.
  2. 💸 “save on costs by autoscaling the instance count to zero when there are no requests to process”: this is a killer feature. We were expecting very low traffic to the endpoint, with a potential spike following the public announcement that would fade away quickly. Given ml.g5.2xlarge is priced at ~$1.2/hour, having the endpoint run 24/7 would cost ~$30/day. Far from cheap, especially for a fun side project, so it was imperative to find a solution. We immediately turned to SageMaker serverless inference, but it currently does not support GPUs, and we were not sure its timeouts would be as generous as we needed. The ability to autoscale the instance count to zero made async endpoints THE ideal solution. We’d have to deal with cold starts when waking up from zero, but that was a fair tradeoff compared to the cost of keeping a GPU running.

How does SageMaker async inference work?

The endpoint creation is no different from a standard real-time one: it’s just a matter of setting the AsyncInferenceConfig when defining the endpoint configuration (more on this later). As for the inference pipeline, the key components to keep in mind are the following:

  1. Invocations are acknowledged instantaneously, meaning that, upon a call, SageMaker immediately returns the identifier of a file in S3. Say 1234.out. That’s the file SageMaker will write to when it’s done processing the request internally. Its contents are whatever the predict function, tied to the /invocations endpoint, returns. So, how do we know when inference is done? By regularly polling S3 and checking for the existence of s3://output_bucket_defined_in_the_config/1234.out. The moment you find it, that’s the signal the endpoint completed successfully (see the sketch after this list).
  2. The input payload to the endpoint must be an S3 URI pointing to a file. So, no, you don’t pass a JSON as usual. You pass the location of a file. The endpoint reads that file and does whatever you instructed it to do. Async endpoints are tightly coupled with S3, given inputs/outputs are read/written from/to S3.
  3. Incoming requests are placed in an internal queue. After you invoke the endpoint and receive the file identifier as a response, the request is queued up and will be handled later when its time comes.
  4. You can optionally use Amazon Simple Notification Service (SNS) to send success/error notifications. We set it up to send email notifications in case of failures. Very helpful, as it would tell us when to go look in CloudWatch.
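To make the flow concrete, here’s a hedged sketch of the invoke-then-poll pattern with boto3; the endpoint name and payload location are placeholders, and in practice the SNS notifications above (or an S3 event trigger on the output prefix) are nicer than busy-polling.

```python
# Hypothetical invoke-then-poll flow for an async endpoint. Names are placeholders.
import time
from urllib.parse import urlparse

import boto3
from botocore.exceptions import ClientError

sm_runtime = boto3.client("sagemaker-runtime")
s3 = boto3.client("s3")

# The invocation returns immediately with the S3 URI the result will be written to
response = sm_runtime.invoke_endpoint_async(
    EndpointName="logonet-controlnet-async",          # placeholder
    InputLocation="s3://logonet-payloads/1234.json",  # placeholder
    ContentType="application/json",
)
output_uri = response["OutputLocation"]  # e.g. s3://<output-bucket>/<prefix>/1234.out

# Poll S3 until the output file shows up, i.e. the request left the queue
# and was processed. Until then, head_object raises a 404 ClientError.
parsed = urlparse(output_uri)
bucket, key = parsed.netloc, parsed.path.lstrip("/")
while True:
    try:
        s3.head_object(Bucket=bucket, Key=key)
        print(f"Done! Result available at {output_uri}")
        break
    except ClientError:
        time.sleep(30)  # still queued or still processing
```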

Deployment

Given the very custom nature of ControlNet’s execution environment, we went for a deployment based on a custom Docker container. I (Francesco) have described in detail how it’s done in this previous post of mine, so we won’t repeat ourselves. To recap:

  1. Put together a serve script to use in SageMaker. Test it out locally first, both outside and inside (point 2) a Docker container. You don’t want to go to the cloud without doing this first; you can’t imagine the number of headaches you’ll save.
  2. Write a Dockerfile and try invoking the serve script from within a running container locally. If this succeeds, we are ready to go to the cloud.
  3. Build and upload the Docker image to Amazon ECR.
  4. Deploy the model to Amazon SageMaker and test it out. Note that we don’t have any tar.gz model artifacts in S3: the model weights are downloaded directly into the image at build time.

You can follow the entire deployment process in this notebook. The key part is shown below 👇: creating the endpoint config, which needs to feature an AsyncInferenceConfig section. Its contents are quite self-explanatory. To keep going with the example from the previous section, S3OutputPath is where SageMaker will save 1234.out. That’s it.
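Since the notebook snippet isn’t reproduced here, below is a hedged sketch of what that endpoint configuration looks like with boto3; all names and ARNs are placeholders, and the SNS topics are optional.

```python
# Hypothetical endpoint config for the async endpoint. Names/ARNs are placeholders.
import boto3

sm = boto3.client("sagemaker")

sm.create_endpoint_config(
    EndpointConfigName="logonet-controlnet-async-config",  # placeholder
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName": "logonet-controlnet",  # placeholder, backed by the ECR image
            "InstanceType": "ml.g5.2xlarge",
            "InitialInstanceCount": 1,
        }
    ],
    AsyncInferenceConfig={
        "OutputConfig": {
            # Where SageMaker writes 1234.out once the request is processed
            "S3OutputPath": "s3://logonet-async-outputs/",  # placeholder
            # Optional SNS topics; we used the error one for failure emails
            "NotificationConfig": {
                "SuccessTopic": "arn:aws:sns:...:logonet-success",  # placeholder
                "ErrorTopic": "arn:aws:sns:...:logonet-errors",     # placeholder
            },
        },
        "ClientConfig": {
            # Keep this at 1: two concurrent requests on a single A10 GPU
            # were enough to trigger CUDA OOM (see "What went wrong")
            "MaxConcurrentInvocationsPerInstance": 1,
        },
    },
)

sm.create_endpoint(
    EndpointName="logonet-controlnet-async",  # placeholder
    EndpointConfigName="logonet-controlnet-async-config",
)
```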

Setting up autoscaling

Useful links: here, here, and here.

This turned out to be quite counterintuitive and confusing. We defined a “standard” scaling policy that would scale out ⬆️ the instance count from 1 (the number of instances at deployment) to a maximum of 2, and scale in ⬇️ from 2 down to 0, based on ApproximateBacklogSizePerInstance, i.e. the number of items in the queue divided by the number of instances behind the endpoint.

First scaling policy. Did not quite work as expected though…

We thought we were all set, and indeed the policy kicked in as expected, increasing and decreasing instances. Except for a not-so-minor detail. Once scaled-in to 0, the endpoint would not scale-out again when new requests came in. We tried all sorts of metrics and thresholds. Just did not respond.

Turns out that’s because we needed a second policy on top. A StepScaling policy to be precise. Something specifically designed to wake up the endpoint from zero. Weird, to say the least. Once that was configured, everything started functioning as expected.
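For reference, here’s a hedged sketch of both policies via the Application Auto Scaling API; names, thresholds, and cooldowns are placeholders rather than the exact values we used. The step-scaling policy is wired to a CloudWatch alarm on the HasBacklogWithoutCapacity metric, which fires when requests are queued while no instance is running.

```python
# Hypothetical autoscaling setup. Names, thresholds and cooldowns are placeholders.
import boto3

autoscaling = boto3.client("application-autoscaling")
cloudwatch = boto3.client("cloudwatch")

endpoint_name = "logonet-controlnet-async"  # placeholder
resource_id = f"endpoint/{endpoint_name}/variant/AllTraffic"
dimension = "sagemaker:variant:DesiredInstanceCount"

# Allow the variant to scale between 0 and 2 instances
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension=dimension,
    MinCapacity=0,
    MaxCapacity=2,
)

# Policy 1: target tracking on the backlog size per instance
autoscaling.put_scaling_policy(
    PolicyName="backlog-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension=dimension,
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 1.0,
        "CustomizedMetricSpecification": {
            "MetricName": "ApproximateBacklogSizePerInstance",
            "Namespace": "AWS/SageMaker",
            "Dimensions": [{"Name": "EndpointName", "Value": endpoint_name}],
            "Statistic": "Average",
        },
        "ScaleInCooldown": 600,
        "ScaleOutCooldown": 300,
    },
)

# Policy 2: step scaling to wake the endpoint up from zero instances
step_policy = autoscaling.put_scaling_policy(
    PolicyName="wake-up-from-zero",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension=dimension,
    PolicyType="StepScaling",
    StepScalingPolicyConfiguration={
        "AdjustmentType": "ChangeInCapacity",
        "MetricAggregationType": "Average",
        "Cooldown": 300,
        "StepAdjustments": [{"MetricIntervalLowerBound": 0, "ScalingAdjustment": 1}],
    },
)

# Alarm that fires when requests are queued but no instance is running,
# triggering the step scaling policy above
cloudwatch.put_metric_alarm(
    AlarmName="logonet-has-backlog-without-capacity",  # placeholder
    MetricName="HasBacklogWithoutCapacity",
    Namespace="AWS/SageMaker",
    Statistic="Average",
    Dimensions=[{"Name": "EndpointName", "Value": endpoint_name}],
    Period=60,
    EvaluationPeriods=2,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=[step_policy["PolicyARN"]],
)
```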

What to watch out for

Resource permissions, essentially. As always, those are critical, and we banged our heads against the wall for way too long because of them. A few gotchas (rather obvious once fixed) come to mind 🤦:

  • The S3 bucket where Cognito users upload images needs CORS to be set in order for visualneurons.com to send files (see the sketch after this list).
  • Both the IAM roles used for Lambda and for SageMaker need SES access to be able to send emails.
  • The IAM Role used to spin up the SageMaker async endpoint needs SNS Publish access to be able to send notifications.
  • The Cognito identity pool used to provide temporary S3 access to users has 2 IAM roles attached to it. The Unauth role’s policy needs to grant write access to the specific S3 bucket(s) users upload to.
  • Can’t stress this enough: make sure to test your serve script locally before wrapping it up inside a SageMaker endpoint. I wish AWS made this easier; right now it requires a lot of Docker gymnastics.
  • Navigating SageMaker CloudWatch logs is a nightmare, especially when multiple requests are being handled at the same time. It is extremely hard to follow the execution steps sequentially, which makes debugging tricky if not impossible.
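As an example of the CORS point above, here’s a minimal sketch of a bucket CORS configuration applied with boto3; the bucket name is a placeholder and the rules should be tightened to your own needs.

```python
# Hypothetical CORS configuration for the upload bucket, applied with boto3.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_cors(
    Bucket="logonet-uploads",  # placeholder
    CORSConfiguration={
        "CORSRules": [
            {
                # Only the static site is allowed to PUT/POST objects from the browser
                "AllowedOrigins": ["https://visualneurons.com"],
                "AllowedMethods": ["PUT", "POST"],
                "AllowedHeaders": ["*"],
                "ExposeHeaders": ["ETag"],
                "MaxAgeSeconds": 3000,
            }
        ]
    },
)
```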

What went wrong

As we documented here and here, among other things, the main issue we stumbled upon was the GPU running out of memory. Yes, even on that beefy ml.g5.2xlarge (NVIDIA A10 with 24 GB of GPU memory). The reason was a very rookie mistake we made while defining the AsyncInferenceConfig section of the endpoint config: we had set MaxConcurrentInvocationsPerInstance to 2. The name of the parameter is self-explanatory. We naively thought “Well, this is gonna speed things up for sure! SageMaker can obviously handle more than 1 request at a time on a single instance”. And yes, SageMaker can of course do that. The GPU can’t, though. We had completely missed the fact that we had a single GPU and that SageMaker was not going to arbitrate access to it based on its memory consumption. If 2 requests come in at the same time, 2 different processes will try to access the GPU simultaneously, and 24 GB of memory is not enough to handle that, even after resizing incoming images to 512 resolution. Diffusion models are just very big. When CUDA OOM is thrown, the machine gets into some sort of unrecoverable state: it’s simply dead, with no option other than rebooting. Redeploying with MaxConcurrentInvocationsPerInstance set to 1 solved it.

This is obvious in hindsight, but it still left us thinking. How is it even possible to execute GPU inference at scale? Simply by throwing more money at it (e.g. more and bigger GPUs)? Batching, if latency constraints allow it? That’s what NVIDIA Triton offers, for instance, but it comes with a very, very steep learning curve. Implementing the GPU-access logic from scratch, across multiple GPUs and machines, is a nightmare even to think about 😱.

Overall, this experience reminded us of the obvious: deploy on CPU when possible (not an option for us, as latency would have increased from 15 minutes to an hour). It’s cheaper and easier to maintain and debug. The barrier to entry for GPU deployment still seems quite high.
