Neural Magic: Training YoloV5 with Sparse Transfer Learning and deploying to Amazon SageMaker with a custom Docker container

Note: You can find the Jupyter notebook with all the steps I followed here, and the folder with the relevant accompanying files here.

Introduction

The goal of this post is to experiment with the Neural Magic (NM) suite of open-source libraries by training a face detector running at GPU speed on CPU (yes, you heard me right) and then deploying it to Amazon SageMaker (SM) with a custom Docker container.

NM is on a mission to make Deep Learning more accessible to everybody. The way they are doing this is by heavily optimizing both the training and the inference steps of a modeling pipeline, getting to the point where the network runs on CPU at GPU latency. The constraint NM is relaxing is the assumption that in order to reach real-time performance from a DL model, it needs to execute on GPU. GPUs are fast indeed, but they are also expensive and not trivial to set up and maintain, acting as a gatekeeper to the world of competitive AI. If we could obtain the same level of performance from a standard CPU, we’d open the doors to much more widespread adoption of the technology, dramatically lowering the barriers to entry into the field. If you think this is science-fiction, then let me introduce you to Neural Magic. Not only are they a very friendly and helpful bunch (their Slack channel is a testament to that), but they are also literally crushing it when it comes to delivering on their mission.

Here is what we are going to do:

  1. Prepare a dataset to train YoloV5 from Ultralytics. We will use the NM sparseml library for that. Under the hood, sparseml invokes a fork of the Ultralytics repo, so we need a dataset following its training conventions.
  2. Apply Sparse Transfer Learning starting off from a pre-sparsified pre-trained model from NM.
  3. Convert the sparsified and quantized model to ONNX.
  4. Leverage the DeepSparse inference engine (NM proprietary engine) to run at 60+ FPS on CPU.
  5. Deploy the model on Amazon SageMaker on top of DeepSparse. We’ll use a custom Docker container for this one.

Let’s get started.

The dataset

As always, the very first thing we need when embarking on an ML project is the data. We are going to use the WIDER Face dataset. Feel free to download any of the training, validation, or test sets. Either way, to speed up the process, we’ll randomly sample only 1k images. Always start small! You’ll be surprised by how far you get.

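To give an idea of what the sampling step looks like, here is a rough sketch (not the notebook’s exact code, and the paths are hypothetical) that picks 1k images and splits them 800/200 into the train/valid folders that the faces_data.yaml below expects.

import random, shutil
from pathlib import Path

# Rough sketch: sample 1k WIDER Face images and split them 800/200.
# "WIDER_train/images" and "neuralmagic_faces" are hypothetical locations.
random.seed(42)
images = random.sample(list(Path("WIDER_train/images").rglob("*.jpg")), 1000)
for split, subset in [("train", images[:800]), ("valid", images[800:])]:
    dest = Path("neuralmagic_faces") / split / "images"
    dest.mkdir(parents=True, exist_ok=True)
    for img in subset:
        shutil.copy(img, dest / img.name)
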
WIDER Face ships with labels in an awkward .mat format, whereas Ultralytics requires each image to be labeled with its own text file: one line per bounding box, containing the class id followed by the box’s centre coordinates, width, and height, all normalized to [0, 1].

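To make that convention concrete, here is a minimal sketch (not the IceVision-based code from the notebook; the helper and file names are illustrative) of writing one image’s labels in the Ultralytics format.

from pathlib import Path

def write_yolo_labels(label_path: Path, boxes_xyxy, img_w: int, img_h: int, class_id: int = 0):
    # one line per box: "<class> <x_center> <y_center> <width> <height>", normalized to [0, 1]
    lines = []
    for x_min, y_min, x_max, y_max in boxes_xyxy:
        x_c = (x_min + x_max) / 2 / img_w
        y_c = (y_min + y_max) / 2 / img_h
        w = (x_max - x_min) / img_w
        h = (y_max - y_min) / img_h
        lines.append(f"{class_id} {x_c:.6f} {y_c:.6f} {w:.6f} {h:.6f}")
    label_path.write_text("\n".join(lines))

# e.g. a single face box on a hypothetical 1024x768 image
write_yolo_labels(Path("train/labels/some_image.txt"), [(100, 120, 300, 380)], 1024, 768)
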
To simplify the conversion, I opted to parse the labels with IceVision. The library offers very flexible data processing functionality, coupling each image and its annotations into a record object. You can check how I did that in the first section of the Jupyter notebook accompanying this post. Once the records are created, we can easily access the labels from each of them, reformat them accordingly, and save them as text files. Once again, check out the notebook for the code snippet used to do so. We complete this part by putting together a faces_data.yaml file with the details of the dataset to be passed to NM and Ultralytics for training.

path: /home/ubuntu/data/neuralmagic_faces
train: train/images
val: valid/images
test: valid/images

nc: 1
names: ['face']

Sparse Transfer Learning with Neural Magic

We are ready to leverage the magic of the SparseML library and sparsified networks. What is sparsification? It’s a technique for removing redundant information from a model. It’s no news to anyone working in the DL domain that neural networks are almost always over-parametrized. Sparsifying a model means cutting away unnecessary connections between neurons (pruning), converting weights to a less precise storage format, e.g. from FP32 to INT8 (quantization), or both at the same time. Neural Magic provides pre-sparsified models and recipes (training strategies, essentially) that let users plug in their own datasets and fine-tune their own networks.

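To make the two ideas concrete, here is a toy numpy illustration (not SparseML code, just a conceptual sketch) of pruning and INT8 quantization applied to a small weight matrix.

import numpy as np

# toy weight matrix standing in for a layer's parameters
weights = np.random.randn(4, 4).astype(np.float32)

# pruning: zero out the 75% of weights with the smallest magnitude
threshold = np.quantile(np.abs(weights), 0.75)
pruned = np.where(np.abs(weights) < threshold, 0.0, weights)

# quantization: map the remaining FP32 values to INT8 with a simple symmetric scale
scale = np.abs(pruned).max() / 127.0
quantized = np.clip(np.round(pruned / scale), -127, 127).astype(np.int8)

print(f"non-zero weights after pruning: {np.count_nonzero(pruned)} / {pruned.size}")
# at inference time, the dequantized weights are approximately `quantized * scale`
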
The command below points to the pre-sparsified weights and the recipe for a “pruned-Quantized YOLOv5s model with HardSwish activations sparsified using the ultralytics/yolov5 SparseML integration on the COCO dataset. [It] achieves 94% recovery of the performance for the dense baseline. The majority of layers are pruned between 50 and 75%, with some more sensitive layers pruned to 40%. The final accuracy is 52.5 mAP@0.5 as reported by the Ultralytics training script”.

The recipe runs for 50 epochs, conveniently logging to Weights & Biases. It took 24 minutes on a g4dn.xlarge GPU-powered EC2 machine (800 training + 200 validation images).

sparseml.yolov5.train \
  --data faces_data.yaml \
  --cfg models_v5.0/yolov5s.yaml \
  --weights zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned_quant-aggressive_94?recipe_type=transfer \
  --hyp data/hyps/hyp.finetune.yaml \
  --recipe zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned_quant-aggressive_94?recipe_type=transfer_learn \
  --project neural_magic --name yolov5s_pruned_quant-aggressive_94

Once done, we convert the best.pt weights (saved automatically by Neural Magic at the end of the training process) to ONNX, so that we can use the optimized DeepSparse engine for inference.

sparseml.yolov5.export_onnx \
 --weights ~/KagglePlaygrounds/neuralmagic_faces/neural_magic/yolov5s_pruned_quant-aggressive_94/weights/best.pt \
 --dynamic

DeepSparse inference engine

The ONNX model can then be tested by running the deepsparse CLI.

deepsparse.object_detection.annotate \
 --source ./testfaces/fraface7.jpeg \
 --model_filepath ~/KagglePlaygrounds/neuralmagic_faces/neural_magic/yolov5s_pruned_quant-aggressive_94/weights/best.onnx

The above inference command saves the annotated image below. It seems to work!

We can also check out the YoloV5 Inference Pipelines shipping with DeepSparse. This is basically the Python code we’ll run on the SageMaker endpoint. Let’s see how it works.

import PIL.Image

from deepsparse.pipeline import Pipeline

# load_image and draw_bbox are small helper functions defined in the accompanying notebook
model_stub = "./neural_magic/yolov5s_pruned_quant-aggressive_94/weights/best.onnx"
images_paths = ['./testfaces/fraface7.jpeg']
images = [load_image(i) for i in images_paths]

yolo_pipeline = Pipeline.create(
    task="yolo",
    model_path=model_stub,
)

# run inference: iou_thres controls NMS overlap, conf_thres the detection confidence cutoff
pipeline_outputs = yolo_pipeline(images=images, iou_thres=0.6, conf_thres=0.5)

boxes = pipeline_outputs[0].boxes
scores = pipeline_outputs[0].scores

# visualise the first detected box on a downsized copy of the image
img = PIL.Image.open(images_paths[0])
img.thumbnail((512, 512))
img = draw_bbox(img, boxes[0])
img

Bingo! We got a face detected here as well.

How fast is the pruned-quantized model we have just trained? Let’s check it out. deepsparse.benchmark comes to the rescue. For reference, I executed it on a c6i.2xlarge CPU-only EC2 machine (8 vCPUs, 16 GB RAM).

deepsparse.benchmark ~/KagglePlaygrounds/neuralmagic_faces/neural_magic/yolov5s_pruned_quant-aggressive_94/weights/best.onnx \
 --scenario async

2022-08-21 13:58:04 deepsparse.benchmark.benchmark_model INFO     Thread pinning to cores enabled
DeepSparse Engine, Copyright 2021-present / Neuralmagic, Inc. version: 1.0.2 (7dc5fa34) (release) (optimized) (system=avx512, binary=avx512)
2022-08-21 13:58:08 deepsparse.benchmark.benchmark_model INFO     num_streams default value chosen of 2. This requires tuning and may be sub-optimal
2022-08-21 13:58:08 deepsparse.benchmark.benchmark_model INFO     deepsparse.engine.Engine:
	onnx_file_path: /home/ubuntu/KagglePlaygrounds/neuralmagic_faces/neural_magic/yolov5s_pruned_quant-aggressive_94/weights/best.onnx
	batch_size: 1
	num_cores: 4
	num_streams: 0
	scheduler: Scheduler.multi_stream
	cpu_avx_type: avx512
	cpu_vnni: True
2022-08-21 13:58:08 deepsparse.utils.onnx INFO     Generating input 'input', type = uint8, shape = [1, 3, 640, 640]
2022-08-21 13:58:08 deepsparse.benchmark.benchmark_model INFO     Starting 'multistream' performance measurements for 10 seconds
Original Model Path: /home/ubuntu/KagglePlaygrounds/neuralmagic_faces/neural_magic/yolov5s_pruned_quant-aggressive_94/weights/best.onnx
Batch Size: 1
Scenario: async
Throughput (items/sec): 62.4997
Latency Mean (ms/batch): 31.9485
Latency Median (ms/batch): 31.8533
Latency Std (ms/batch): 0.8996
Iterations: 626

We are getting 62 FPS on CPU! For comparison, I also trained a small YoloV5 model with IceVision (no optimization whatsoever) and obtained a mean latency of 537 ms against the 32 ms achieved with NM, i.e. ~17x slower. The first section of the notebook shows the IceVision training and how I got there.

I decided not to invest too much time into more exhaustive benchmarking. If you are curious about what Neural Magic can actually achieve, and you are ready to be mind-blown, head without further ado to this great post from Dickson: Supercharging YOLOv5: How I Got 182.4 FPS Inference Without a GPU (yes, 182.4 FPS 🤯).

Deploying to Amazon SageMaker

All right, we have a blazing fast model running on CPU. How about deploying it to Amazon SageMaker?

We’ll go down the road of a custom Docker container, which, believe it or not, I had never played around with myself, so I thought this would be a great opportunity to check it out. We’ll advance step by step and try to figure everything out along the way. Here is what we are going to do:

  1. Put together a serve script to use in SageMaker. Test it out locally first.
  2. Write a Dockerfile and try invoking the serve script from within a running container locally. If this succeeds, we are ready to go to the cloud.
  3. Upload the Docker image to Amazon ECR.
  4. Package the sparse ONNX YoloV5 into a tar.gz file and upload it to S3.
  5. Deploy the model to Amazon SageMaker and test it out.

The serve script

First things first: under our working directory let’s create an opt/ml/model folder structure. This is technically not strictly needed to deploy to SageMaker, but it is very useful as it mimics the SM deployment environment. When AWS spins up the production machine, it downloads the model artifacts from S3 and untars them into /opt/ml/model inside the running container. Having the same folder structure locally helps reproduce what happens at deployment time and catch nasty bugs that would be much harder to spot remotely.

The next thing we’ll do is soft-link this local opt/ml folder to the system-level /opt/ml path. This is important because it lets us use /opt/ml as-is inside the serve script, and the script will work both locally and when containerised.

sudo ln -s /home/ubuntu/KagglePlaygrounds/neuralmagic_faces/sagemaker_deploy/opt/ml /opt/ml

The serve script is relatively simple. Here is the original; below is its skeleton with the most important parts. Keep in mind that we are not training the model in SageMaker. We already have a trained network (best.onnx); we just need to deploy it. What does the serve script need to contain?

There are two critical bits:

  1. A web server: done in Flask (any other option, such as FastAPI, would work too). We have to define at least two endpoints: /ping, needed for health checks, and /invocations, wrapping the predict function in charge of the inference logic.
  2. A function loading the model from /opt/ml/model/.

import base64
import json

import cv2
import numpy as np
from deepsparse.pipeline import Pipeline
from flask import Flask, Response, request

app = Flask(__name__)

def load_model():
    # SageMaker untars the model artifact downloaded from S3 into /opt/ml/model inside the container
    return Pipeline.create(task="yolo", model_path="/opt/ml/model/best.onnx")

@app.route("/invocations", methods=["POST"])
def predict():
    model = load_model()
    payload = json.loads(request.data)  # {"image": "<base64-encoded image>"}
    img_bytes = base64.b64decode(payload["image"])
    image = cv2.imdecode(np.frombuffer(img_bytes, dtype=np.uint8), cv2.IMREAD_COLOR)
    outputs = model(images=[image])
    return Response(response=json.dumps({"boxes": outputs[0].boxes, "scores": outputs[0].scores}), status=200)

@app.route("/ping")
def ping():
    return Response(response="OK", status=200)

app.run(host="0.0.0.0", port=8080)

Running the serve script locally

Let’s run it as is

cd sagemaker_deploy
chmod +x serve
./serve

This is what we get

Good. Our Flask application is running at http://127.0.0.1:8080. Let’s try pinging it.

In sequence, this is the ping-pong between the two terminals

Terminal 2 >> curl http://127.0.0.1:8080/ping
Terminal 1 >> "GET /ping HTTP/1.1" 200 -
Terminal 2 >> OK

Terminal 2 >> (echo -n '{"image": "'; base64 ./testfaces/fraface7.jpeg; echo '"}') | curl -H "Content-Type: application/json" -d @-  http://127.0.0.1:8080/invocations
Terminal 1 >> DeepSparse Engine, Copyright 2021-present / Neuralmagic, Inc. version: 1.0.2 (7dc5fa34) (release) (optimized) (system=avx512, binary=avx512)
INFO:root:Processing image of size (4608, 3456, 3)
Terminal 2 >> {"boxes": [[1127, 955, 2167, 2633]], "scores": [0.5956084728240967]}

Great! It seems everything is working. Let’s move to Docker.

Running the serve script in Docker locally

The first step is the Dockerfile. This is super simple, given we just need three libraries for serving: opencv (image preprocessing), deepsparse (model inference) and flask (web server). We copy the local serve script to the location SageMaker looks for it (/usr/local/bin) and expose port 8080 so the application can be reached from outside the container.

FROM python:3.9

RUN apt-get update
RUN apt-get install ffmpeg libsm6 libxext6  -y

RUN pip install opencv-python-headless==4.6.0.66
RUN pip install deepsparse[yolo]==1.0.2   
RUN pip install flask

WORKDIR /usr/local/bin
COPY serve /usr/local/bin/serve

EXPOSE 8080

Then we open up a terminal to build the image and run a container on top of it

cd path_of_dockerfile
docker build -t neuralmagic .
docker run --name nmserve --rm -v /opt/ml:/opt/ml neuralmagic serve

You should get the exact same output as before. The screenshot shows just the container running and the result of a health-check ping. The commands to ping the endpoints are the same as above; just make sure to replace localhost (127.0.0.1) with the container’s IP address (172.17.0.2 in my case).

Fantastic! We are getting closer. The serve script works from within a Docker container too. We are ready for the next step: SageMaker.

Deploy a SageMaker endpoint

SageMaker needs a couple of things:

  1. The Docker image needs to be stored in Amazon ECR
  2. The model artifact, i.e. the best.onnx file, needs to be compressed into a tar.gz archive and stored in S3

Number 1 can be achieved with the script below, which builds the Docker image, logs into ECR, and pushes the image.

TAG="neuralmagic"
IMAGE_URI="custom-images"
ACCOUNT_ID=$(aws sts get-caller-identity --query "Account" --output text)
REGION="eu-west-1"

docker build -t $TAG .
aws ecr get-login-password --region $REGION | docker login --username AWS --password-stdin $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com
docker tag $TAG:latest $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/$IMAGE_URI:$TAG
docker push $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/$IMAGE_URI:$TAG

Number 2 is super easy too. We first compress the model…

cd KagglePlaygrounds/neuralmagic_faces/sagemaker_deploy/opt/ml/model/
tar -cvpzf model.tar.gz best.onnx

… and then upload it to S3 using Python. That’s it.

import boto3, sagemaker

region = "eu-west-1"
session = sagemaker.Session(boto_session=boto3.Session(region_name=region))
default_bucket = session.default_bucket()
model_uri = session.upload_data(path="/home/ubuntu/KagglePlaygrounds/neuralmagic_faces/sagemaker_deploy/opt/ml/model/model.tar.gz", key_prefix="neural_magic")

All that’s left is deploying the model with the SageMaker Python SDK. Let’s do that.

import boto3, sagemaker

region = "eu-west-1"
sm_client = boto3.client(service_name="sagemaker", region_name=region)
runtime_sm_client = boto3.client("sagemaker-runtime", region_name=region)
session = sagemaker.Session(boto_session=boto3.Session(region_name=region))
role = "IAM Role with SageMaker permissions"
ecr_image = "257446244580.dkr.ecr.eu-west-1.amazonaws.com/custom-images:neuralmagic"

model = sagemaker.model.Model(image_uri=ecr_image,
                              name="neural-magic",
                              model_data="s3://sagemaker-eu-west-1-257446244580/neural_magic/model.tar.gz",
                              role=role,
                              sagemaker_session=session,
                              predictor_cls=sagemaker.Predictor
                             )

predictor = model.deploy(initial_instance_count=1, 
                         instance_type='ml.m5.large', 
                         endpoint_name="neural-magic")

Once deployed, we can invoke the endpoint with a base64-encoded image, just as we did from the terminal. How cool is that!

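Here is a minimal sketch of the invocation from Python, reusing the runtime_sm_client created above and the same {"image": "<base64>"} payload we sent with curl earlier.

import base64, json

with open("./testfaces/fraface7.jpeg", "rb") as f:
    payload = json.dumps({"image": base64.b64encode(f.read()).decode("utf-8")})

response = runtime_sm_client.invoke_endpoint(
    EndpointName="neural-magic",
    ContentType="application/json",
    Body=payload,
)
print(json.loads(response["Body"].read()))  # e.g. {"boxes": [[...]], "scores": [...]}
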
We can also check in CloudWatch whether the logs are what we expect (they should be).

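If you prefer to stay in the notebook, here is a small sketch of pulling the latest endpoint logs with boto3, assuming the default log group naming for SageMaker endpoints (/aws/sagemaker/Endpoints/<endpoint-name>).

logs = boto3.client("logs", region_name=region)
log_group = "/aws/sagemaker/Endpoints/neural-magic"

# fetch the most recently active log stream and print its events
stream = logs.describe_log_streams(logGroupName=log_group, orderBy="LastEventTime", descending=True)["logStreams"][0]
for event in logs.get_log_events(logGroupName=log_group, logStreamName=stream["logStreamName"])["events"]:
    print(event["message"])
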
We are done. Just remember to delete the SageMaker endpoint to avoid incurring nasty unexpected costs.

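Assuming the predictor and model objects from the deployment snippet above, the cleanup is a couple of lines.

predictor.delete_endpoint()  # tears down the endpoint (and its endpoint config)
model.delete_model()         # optionally remove the SageMaker model object as well
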
We trained a blazing fast face detector with Neural Magic, achieving GPU speed on CPU via sparsification. We then packed it up and deployed it to Amazon SageMaker with a custom Docker container. Happy hacking!
