Note: you can follow along with the post in this Jupyter notebook
Context
The ML announcement I liked the most at AWS re:Invent 2022 was SageMaker support for shadow deployment. What is it? The idea is the following. You have model A deployed on an endpoint. You have developed model B (say, a supposedly more performant version of A) and you want to check if it works as expected.
- You could go for canary deployments, redirecting a % of live traffic to model B. This might be dangerous though. In case your model fails, no matter how small the % is, some users will be negatively affected.
- You could also manually run inference on model B. I mean select a portion of historical traffic and ping the (deployed-but-not-live) endpoint with it. Super manual, but also super safe.
- Or, you can deploy in shadow mode. This translates into deploying model B alongside A and automatically mirroring to it all the traffic that goes through A. Only A is wired up to the production system, so only predictions coming out of A are used to take business decisions downstream. B is shadowing A, so its predictions are simply logged in CloudWatch (and in S3 if you want). The diagram below illustrates the logic. You can keep the test going for as long as you want. Then, when you have enough data to establish whether B indeed works better than A, you can promote B to production, officially replacing A.
Why did I find this exciting? Because we basically rebuilt this functionality at Bolt (the company I work for). It’s an internal feature to maintain, so I am glad the SageMaker team has made this available to the public. We might kill our implementation altogether and rely on SageMaker only (which we currently use to productionize models anyway).
Trigger a shadow deployment with the SageMaker SDK
As explained in the excellent announcement post by Antje Barth, you can use the AWS console to create shadow tests in Amazon SageMaker. I wanted to explore the automatable route offered by the SDK though, so I opted for revisiting this official tutorial with a HuggingFace (HF) twist. This also gave me the chance to experiment with deploying HF models using the raw SageMaker SDK instead of the Estimator, Predictor and Model HF wrappers I was already familiar with. Long story short, everything worked super smoothly, so let’s dive in.
First things first, we need an HF image URI on Amazon ECR. For this you have to pick the exact combination of version, image_scope, py_version, instance_type and base_framework_version. These allow the image_uris.retrieve method to pull up the URI pointing to the correct ECR image.
from sagemaker import image_uris

image_uri = image_uris.retrieve(
    framework="huggingface",
    region="eu-west-1",
    version="4.17",
    image_scope="inference",
    py_version="py38",
    instance_type="ml.g4dn.xlarge",
    base_framework_version="pytorch1.10",
)
If, like me, you are like, “wait, how am I supposed to know the correct combo of all these values?”, then here is how I figured those out. Not sure there is a better way, but it kinda worked. The image_uris.config_for_framework method returns a JSON with all the combinations of images available. You can inspect the result to extract what you are interested in.
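For reference, here is a minimal sketch of that inspection (the exact structure of the returned dictionary may vary across sagemaker releases, so treat the keys below as an assumption and just print whatever comes back):

from sagemaker import image_uris
import json

# Full image config for the HuggingFace framework: a nested dict parsed from the SDK's bundled JSON
config = image_uris.config_for_framework("huggingface")

# Eyeball the "inference" section to find valid version / py_version / base framework combinations
print(json.dumps(config["inference"], indent=2)[:2000])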
Next step is creating the models we’ll be testing against each other: the production and the shadow variant. Keep in mind that up to this point there is still no difference between the two. They are both perfectly legit models, living independently of each other and not tied to an endpoint yet. HuggingFace made the use of create_model a super neat experience, by letting us pass environment variables to the container via the Environment dictionary within the Containers list.
model_name1 = "PROD-nvidia-segformer"
model_name2 = "SHADOW-microsoft-beit"
print(f"Prod model name: {model_name1}")
print(f"Shadow model name: {model_name2}")
resp = sm.create_model(
ModelName=model_name1,
ExecutionRoleArn=role,
Containers=[{"Image": image_uri,
"Environment": {
'HF_MODEL_ID':'nvidia/segformer-b0-finetuned-ade-512-512',
'HF_TASK':'image-segmentation'
},}],
)
resp = sm.create_model(
ModelName=model_name2,
ExecutionRoleArn=role,
Containers=[{"Image": image_uri,
"Environment": {
'HF_MODEL_ID':'microsoft/beit-base-finetuned-ade-640-640',
'HF_TASK':'image-segmentation'
},}],
)
By setting HF_MODEL_ID and HF_TASK, the container will download the right model from the HF Hub automatically. No need to point it to a model.tar.gz artifact in S3 (you can do that too if you need something custom). How do you get those values? You can extract them from the HF model card in the Hub. In my case I wanted to deploy:
- SegFormer (b0-sized) model fine-tuned on ADE20k by NVIDIA (thanks Philipp Schmid for the super informative post – credit to him for the mask plotting section below) as a production variant
- BEiT (base-sized model, fine-tuned on ADE20k) by Microsoft as a shadow variant
Both are semantic segmentation models.
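As a quick local sanity check (not part of the original tutorial), you can load the same checkpoints with the transformers pipeline API; HF_TASK and HF_MODEL_ID map directly to its task and model arguments:

import requests
from PIL import Image
from transformers import pipeline

# Same task / model identifiers passed to the container via environment variables
segmenter = pipeline("image-segmentation", model="nvidia/segformer-b0-finetuned-ade-512-512")

url = "https://huggingface.co/datasets/hf-internal-testing/fixtures_ade20k/raw/main/ADE_val_00000001.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# The pipeline returns one dict per segment, each with a label and a PIL mask
print([pred["label"] for pred in segmenter(image)])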
Then we define the endpoint config. This is where the action happens. The create_endpoint_config method was updated (make sure your sagemaker and boto3 libraries are up to date) to accept the ShadowProductionVariants argument, allowing users to specify everything that matters at the endpoint level (the model, the instance type and count, the share of traffic going through each variant, etc.).
model_name1 = "PROD-nvidia-segformer"
model_name2 = "SHADOW-microsoft-beit"
production_variant_name = model_name1
shadow_variant_name = model_name2
ep_config_name = "shadow-hf-epconfig"
create_endpoint_config_response = sm.create_endpoint_config(
EndpointConfigName=ep_config_name,
ProductionVariants=[
{
"VariantName": production_variant_name,
"ModelName": model_name1,
"InstanceType": "ml.g4dn.xlarge", # we deploy to a GPU instance
"InitialInstanceCount": 1,
"InitialVariantWeight": 1,
}
],
ShadowProductionVariants=[
{
"VariantName": shadow_variant_name,
"ModelName": model_name2,
"InstanceType": "ml.g4dn.xlarge", # we deploy to a GPU instance
"InitialInstanceCount": 1,
"InitialVariantWeight": 1,
}
],
)
We are just left with deploying the endpoint.
endpoint_name = "shadow-hf"
create_endpoint_api_response = sm.create_endpoint(
EndpointName=endpoint_name,
EndpointConfigName=ep_config_name,
)
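Endpoint creation takes a few minutes. If you are scripting this end to end, you can block until the endpoint is ready with the built-in boto3 waiter (a small convenience, not shown in the original notebook):

# Block until the endpoint status turns InService before sending traffic
waiter = sm.get_waiter("endpoint_in_service")
waiter.wait(EndpointName=endpoint_name)
print(sm.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"])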
Once the endpoint is up and running, let’s check that inference works. When invoking the endpoint, we get back the prediction from the production variant only. The payload is also forwarded to the shadow model though: each image runs through both of them.
# download a sample image to run inference on
!wget https://huggingface.co/datasets/hf-internal-testing/fixtures_ade20k/raw/main/ADE_val_00000001.jpg

import json

image_path = "ADE_val_00000001.jpg"

with open(image_path, "rb") as f:
    payload = bytearray(f.read())

# sm_runtime is a boto3 SageMaker Runtime client, e.g. sm_runtime = boto3.client("sagemaker-runtime")
response = sm_runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="image/x-image",
    Body=payload,
)

res = json.loads(response["Body"].read().decode("utf-8"))

# get_overlay is the mask-plotting helper from the notebook (adapted from Philipp Schmid's post)
get_overlay(image_path, res)
Next step is pulling variant-level metrics from CloudWatch to compare production and shadow. Let’s invoke the endpoint in a loop for a while to create some data…
import random
import time


def invoke_endpoint(endpoint_name, should_raise_exp=False):
    with open(image_path, "rb") as f:
        payload = bytearray(f.read())
    try:
        # send 200 requests, spaced by random pauses of up to 3 seconds
        for i in range(200):
            time.sleep(random.uniform(0, 1) * 3)
            response = sm_runtime.invoke_endpoint(
                EndpointName=endpoint_name, ContentType="image/x-image", Body=payload
            )
    except Exception as e:
        print("E", end="", flush=True)
        if should_raise_exp:
            raise e


invoke_endpoint(endpoint_name)
… and plot aggregated CloudWatch metrics.
Nice! That’s exactly what we wanted. For the record, you would get the same charts within the SageMaker console by manually triggering a shadow test, as explained in Antje’s post. Shadow deployment makes it easy to compare the two models. In this case, for instance, we get to notice that, despite RAM utilization being almost equal, latency is an order of magnitude higher for the shadow variant than for the production one. That’s something you would want to know before promoting the shadow model to live.
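If you prefer to pull the underlying numbers programmatically instead of reading them off the charts, the per-variant metrics live in the AWS/SageMaker CloudWatch namespace. A rough sketch, reusing the variables defined earlier (adjust the period, time range and statistics to taste):

import boto3
from datetime import datetime, timedelta, timezone

cw = boto3.client("cloudwatch", region_name="eu-west-1")

def variant_latency(variant_name, minutes=60):
    """Average ModelLatency (in microseconds) per minute for a variant over the last `minutes`."""
    stats = cw.get_metric_statistics(
        Namespace="AWS/SageMaker",
        MetricName="ModelLatency",
        Dimensions=[
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        StartTime=datetime.now(timezone.utc) - timedelta(minutes=minutes),
        EndTime=datetime.now(timezone.utc),
        Period=60,
        Statistics=["Average"],
    )
    return sorted(stats["Datapoints"], key=lambda d: d["Timestamp"])

print(variant_latency(production_variant_name))
print(variant_latency(shadow_variant_name))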
Feedback to the AWS team
The clear missing piece of this new SageMaker functionality is the ability to compare business metrics between the model variants. As shown above, shadow tests currently allow checking the standard metrics automatically logged by CloudWatch: latency, network overhead, CPU and RAM consumption, but that’s about it. Don’t get me wrong, this is already very useful, as those numbers are indeed part of the decision whether to promote a variant to production or not. They are not the only ones taken into account though. Business KPIs are a key component too. Under the “business” umbrella I include both ML-related metrics, such as accuracy, precision, recall, confidence distributions, etc., and financial metrics, e.g. the monetary impact of the models’ inferences (customer churn, cost of a false positive, forecasting errors, etc.).
Also, it’d be great to visualize the predictions on which the variants disagree, e.g. an image the production model predicts as a cat and the shadow model as a dog. Error analysis is incredibly valuable and helps uncover non-obvious patterns that average metrics simply hide.
I hear the SageMaker team protesting: “But you can save your predictions to S3! Once they are logged there, you can compute any performance indicator you wish and compare shadow and production variants”.
True. It is not automated though. It’s a “feature” customers have to build themselves. I understand it is not easy to automate the calculation of such custom metrics, but I am sure the AWS team can do better and figure something out. For the moment, shadow deployments are a great addition to the SageMaker suite of products, but they are only halfway there.
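For reference, the S3 logging mentioned above boils down to adding a DataCaptureConfig to the endpoint config. A minimal sketch (the bucket path is a placeholder, and the sampling percentage is up to you):

# Placeholder bucket/prefix; replace with your own
s3_capture_path = "s3://<your-bucket>/shadow-capture"

# Pass this as the DataCaptureConfig argument of the sm.create_endpoint_config call shown earlier,
# alongside ProductionVariants and ShadowProductionVariants
data_capture_config = {
    "EnableCapture": True,
    "InitialSamplingPercentage": 100,
    "DestinationS3Uri": s3_capture_path,
    "CaptureOptions": [{"CaptureMode": "Input"}, {"CaptureMode": "Output"}],
}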