
IceVision + SAHI: democratise small object detection

Reading Time: 4 minutes

In this short post, I will introduce the latest addition to the IceVision library: Slicing Aided Hyper Inference, also known as SAHI.

A little bit of context first: IceVision is an agnostic computer vision framework that allows users to train object detection and segmentation models end to end, from data preparation to fully-fledged inference. Together with a super flexible and entirely customizable data API, it offers a wide range of models, from YOLOv5 to TorchVision, EfficientDet, and MMDetection. I have already explored IceVision here and here, so check those posts out for more info.

As the GitHub repo puts it, SAHI is

A lightweight vision library for performing large scale object detection & instance segmentation. […] detection of small objects and inference on large images are still major issues in practical usage. Here comes the SAHI to help developers overcome these real-world problems with many vision utilities.

As shown in the below GIF (taken from the repo), SAHI allows running inference on slices of the original image, as opposed to feeding the entire picture to the model in one go.

The benefit is clear. Let’s say I had an HD image (720 x 1280 pixels). This is not even “big” by current image quality standards. As an example, my Huawei P20 Lite front camera shoots pictures as large as 3456 x 4608 pixels, and it’s a 4-year-old device! Regardless, even an HD image is (generally) already too big to fit on a GPU as is. In most cases, we need to resize it. Obviously, this comes at the cost of losing valuable information. If you are not convinced, take a look at what happens when shrinking the HD JPG below (720 x 1280)…

HD resolution (720 x 1280)

… to a quarter of its size per side (180 x 320). Those are items from the Fridge dataset, which I randomly pasted onto a greyish background. It is a lot harder to distinguish what is going on here.

1/4 HD resolution (180 x 320)
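
For reference, the downscaled version above can be reproduced with a couple of lines of Pillow; the file names are just placeholders.

from PIL import Image

# Shrink the 1280 x 720 image to a quarter of its size per side (320 x 180).
# Note that PIL expects sizes as (width, height).
img = Image.open("fridge_hd.jpg")  # placeholder file name
img.resize((320, 180)).save("fridge_quarter.jpg")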

This is the visual summary of why “small object detection” is hard, and also why it is far more effective to keep the original resolution and run inference on a smaller sliding window, so as to keep the pixels’ information intact. The idea is great, but the execution is far from trivial. How do we aggregate bounding boxes predicted at the windows’ borders? Should we simply merge them or use NMS? How much overlap should consecutive windows have? Those are all questions that are easier asked than answered. But fear not: SAHI comes to the rescue, turning all of the above into a couple of function calls.
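
To make the mechanics a bit more concrete, here is a minimal, purely illustrative sketch of how overlapping windows can be laid over an image. This is not SAHI’s actual implementation; the slice_coords helper and its defaults are my own assumptions for the example.

# Purely illustrative: compute top-left/bottom-right corners of overlapping
# slices. SAHI’s real implementation additionally handles the image borders,
# runs the model on each slice, and merges the resulting boxes (e.g. via NMS).
def slice_coords(img_h, img_w, slice_h=128, slice_w=128, overlap=0.2):
    step_h = int(slice_h * (1 - overlap))
    step_w = int(slice_w * (1 - overlap))
    coords = []
    for y in range(0, max(img_h - slice_h, 0) + 1, step_h):
        for x in range(0, max(img_w - slice_w, 0) + 1, step_w):
            coords.append((x, y, x + slice_w, y + slice_h))
    return coords

# A 720 x 1280 HD frame yields 72 windows of 128 x 128 pixels with 20% overlap.
print(len(slice_coords(720, 1280)))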

To make practitioners’ lives even easier, we have fully integrated it with the IceVision API, so that the same single line of code works out of the box for all models. Check it out in action in this notebook.

For instance, training a VFNet MMDet model and running SAHI inference on it is as simple as putting together 30 lines of code.

from icevision.all import *
from icevision.models.inference_sahi import IceSahiModel

# Download the Fridge dataset and parse its Pascal VOC annotations
url = "https://cvbp-secondary.z19.web.core.windows.net/datasets/object_detection/odFridgeObjects.zip"
dest_dir = "fridge"
data_dir = icedata.load_data(url, dest_dir)
parser = parsers.VOCBBoxParser(annotations_dir=data_dir / "odFridgeObjects/annotations", images_dir=data_dir / "odFridgeObjects/images")
train_records, valid_records = parser.parse()

# Training transforms: presize to 512, augment, resize to 384 x 384, and normalize;
# validation transforms: resize, pad, and normalize
image_size = 384
train_tfms = tfms.A.Adapter([*tfms.A.aug_tfms(size=(image_size, image_size), presize=512), tfms.A.Normalize()])
valid_tfms = tfms.A.Adapter([*tfms.A.resize_and_pad((image_size, image_size)), tfms.A.Normalize()])

train_ds = Dataset(train_records, train_tfms)
valid_ds = Dataset(valid_records, valid_tfms)

# VFNet (MMDetection) with a pretrained ResNet-50 FPN backbone
model_type = models.mmdet.vfnet
backbone = model_type.backbones.resnet50_fpn_mstrain_2x
model = model_type.model(backbone=backbone(pretrained=True), num_classes=len(parser.class_map)) 

train_dl = model_type.train_dl(train_ds, batch_size=16, num_workers=8, shuffle=True)
valid_dl = model_type.valid_dl(valid_ds, batch_size=16, num_workers=8, shuffle=False)

# Fine-tune for 20 epochs with a fastai learner, tracking COCO mAP
metrics = [COCOMetric(metric_type=COCOMetricType.bbox)]
learn = model_type.fastai.learner(dls=[train_dl, valid_dl], model=model, metrics=metrics)
learn.fine_tune(20, 3e-4, freeze_epochs=1)

# Wrap the trained model with SAHI and run inference on 128 x 128 slices with 20% overlap
sahimodel = IceSahiModel(model_type=model_type, model=model, class_map=parser.class_map, tfms=valid_tfms, confidence_threshold=0.4)
pred = sahimodel.get_sliced_prediction(
    "small_fridge.jpg",
    keep_sahi_format=False,
    return_img=True,
    slice_height=128,
    slice_width=128,
    overlap_height_ratio=0.2,
    overlap_width_ratio=0.2,
)

Results are astounding.

On the left, below, is what you get when you run IceVision’s `end2end_detect` (check the Single Image Inference section here) on the “small Fridge” image, resizing it to 384 x 384, i.e. executing inference in one go. Not a single bounding box is detected with a confidence threshold of 40%. We are using a VFNet model trained on the Fridge dataset to >95% mAP, so not a bad model at all. Nevertheless, all objects get lost.

On the right side of the slider, instead, is the SAHI prediction, executed on sliding square 128 x 128 patches with 20% overlap along both height and width. I can count only 4 milk bottles falling through the cracks. To be fair, not all detections are labeled with the right class, but the result is still pretty impressive.

Comparison of single-shot prediction (left) vs SAHI inference (right).
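
For reference, the single-shot baseline on the left was produced with a call along the lines of the one below; the exact keyword arguments may differ slightly depending on your IceVision version, so check the linked Single Image Inference section.

from PIL import Image

# Whole-image inference: the picture is resized to 384 x 384 before being fed
# to the model, which is where the small objects get lost.
img = Image.open("small_fridge.jpg")
pred_dict = model_type.end2end_detect(
    img, valid_tfms, model,
    class_map=parser.class_map,
    detection_threshold=0.4,
)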

In terms of API, the `get_sliced_prediction` method of the `IceSahiModel` class is just a convenience wrapper around the original `get_sliced_prediction` from SAHI. It accepts the same arguments, together with a couple of extra ones:

  • `keep_sahi_format`: if `True`, returns SAHI’s native `PredictionResult` object. If `False`, it returns a dictionary in the same format as `end2end_detect`, with predicted labels, confidence scores, and bounding box coordinates (see the snippet after this list).
  • `return_img`: if `True`, adds to the dictionary mentioned in the previous point a `PIL.Image.Image` annotated with boxes, scores, and labels.
  • Arguments to control the annotated image’s appearance, such as `display_label`, `display_box`, `display_score`, etc.
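
With `keep_sahi_format=False` and `return_img=True`, the returned dictionary can then be consumed like any other `end2end_detect` output. The key names below (`img`, `detection`, `labels`, `scores`, `bboxes`) are what I would expect from that format, but double-check them against the notebook for your IceVision version.

# Hypothetical access pattern mirroring the end2end_detect-style dictionary;
# key names may differ slightly across versions.
pred["img"].save("sahi_annotated.jpg")  # annotated PIL.Image.Image (return_img=True)
print(pred["detection"]["labels"])      # predicted class names
print(pred["detection"]["scores"])      # confidence scores
print(pred["detection"]["bboxes"])      # bounding box coordinates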

Feel free to give it a shot and get back to us on Discord if you have any questions!
