fast.ai DL2 Lesson 9: Single Shot Detection detailed walkthrough


Notes

  • The code (functions, classes, etc) I refer to in this post comes from this notebook, which I put together during a deep dive on SSD
  • Ground Truth Bounding Box: 4-dimensional array representing the rectangle which surrounds the ground truth object in the image (related to the dataset)
  • Anchor Box: 4-dimensional array representing the rectangular patch of the input image the model looks at, to figure out which objects it contains (related to the model)
  • Anchor Box Activations or Predicted Bounding Box: 4-dimensional array which should end up as close as possible to the Ground Truth Bounding Box. I found the difference between an Anchor Box and its Activations extremely confusing. In fact, they are the same thing: the Activations are nothing else than an updated version of the Anchor Box as the training phase proceeds. We keep them separated just because (in this specific implementation of SSD) we apply some position/size constraints to the Activations, based on the position/size of the original Anchor. The job of the Anchor Box is to fit the Ground Truth Bounding Box (and its class) as well as possible. Nothing else.

Introduction

A couple of months ago I decided to take a closer look at object detection. The result of my experiments, at the time, was this post, where I challenged myself to detect pneumonia in chest radiographs. Back then, I had considerably simplified my life, turning the original “detect ANY pneumonia trace in an x-ray” into “detect the LARGEST pneumonia trace in an x-ray”. The tasks look similar, but the truth is that the former is much more general and tougher than the latter. The inspiration had come, as usual, from the excellent fast.ai course. Lessons 8 and 9 of DL2 deal with object detection. Specifically, lesson 8 covers the basics, going through the simpler example of classifying and localizing only the largest element in a picture. The following lecture then extends the code to the more general and complex Single Shot Detection (SSD) case. At the time, I had found lesson 9 complicated and hard to follow. Therefore I had decided to experiment with the ideas developed during lesson 8 and postpone a more serious analysis of SSD to the future. That time has finally come, so let’s get into it!

As an outcome of the deep dive, I ended up restructuring Jeremy’s original notebook, putting together a new one. Same code, just organized in a more linear way (at least in my opinion). Here is a summary of my learnings.

As in any deep learning task, we need to figure out three pillars:

  1. The data
  2. The neural network architecture
  3. The loss function

The Data

The dataset we’ll deal with is the MNIST counterpart of object detection: Pascal VOC 2007. A collection of ~2500 images, featuring objects from 20 different classes. Each image might contain one or more objects from the same or different classes. Objects are identified within the corresponding image by a bounding box (BB), represented by 4 numbers. As per (annoying) Computer Vision conventions:

  1. X value of top left corner
  2. Y value of top left corner
  3. width
  4. height

The first thing we do is turn those into a friendlier format (calling get_trn_anno(); FYI bb_hw operates in the opposite direction, getting back to the height-width format). A sketch of the two conversion helpers follows the list below.

  1. Y value of top left corner
  2. X value of top left corner
  3. Y value of bottom right corner
  4. X value of bottom right corner
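
These conversion helpers are not reproduced in this post; a minimal sketch, roughly consistent with the fastai lesson’s hw_bb (used inside get_trn_anno()) and bb_hw, could look like this:

import numpy as np

# VOC-style [x, y, width, height]  ->  [top-left y, top-left x, bottom-right y, bottom-right x]
def hw_bb(bb):
    return np.array([bb[1], bb[0], bb[3] + bb[1] - 1, bb[2] + bb[0] - 1])

# and back again, e.g. for plotting with matplotlib (which wants x, y, width, height)
def bb_hw(a):
    return np.array([a[1], a[0], a[3] - a[1] + 1, a[2] - a[0] + 1])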

Here is how the data looks for image ID 17. The picture is annotated with 2 objects, a pottedplant and a motorbike, whose coordinates are stored as 4-dimensional arrays.

trn_anno = get_trn_anno()
trn_anno[17]
# [(array([ 61, 184, 198, 278]), 15), (array([ 77,  89, 335, 402]), 13)]
id2cat[15], id2cat[13]
# ('pottedplant', 'motorbike')

We have arranged the original Pascal VOC dataset in a more usable form but we still need to turn it into an appropriate shape to be fed to a neural network. The model will receive 3 inputs:

  1. Images: we know how to handle that
  2. Label 1: objects’ classes
  3. Label 2: objects’ bounding boxes

So the first challenge is to hack into the `ImageClassifierData` class, forcing it to receive 2 sets of labels. To achieve that, we define `mcs`, an array containing one array per image, each holding one class ID per object in that image. For instance, as shown below, the first image features just one object, identified as class 6, i.e. a `car`. From `mcs` we define an appropriate validation (`val_mcs`) and training set (`trn_mcs`).

We then address the BBs. For that we define (and save to CSV) a `pandas.DataFrame` with images’ filenames and a space-separated list of `4n` numbers, where `n` is the number of objects per image.

mc = [[cats[p[1]] for p in trn_anno[o]] for o in trn_ids]
id2cat = list(cats.values())
cat2id = {v:k for k,v in enumerate(id2cat)}
mcs = np.array([np.array([cat2id[p] for p in o]) for o in mc])
mcs 
# array([array([6]), array([14, 12]), array([ 1,  1, 14, 14, 14]), ..., array([17,  8, 14, 14, 14]),
#       array([6]), array([11])], dtype=object)
val_idxs = get_cv_idxs(len(trn_fns))
((val_mcs,trn_mcs),) = split_by_idx(val_idxs, mcs)
val_mcs.shape, trn_mcs.shape
# ((500,), (2001,))
mbb = [np.concatenate([p[0] for p in trn_anno[o]]) for o in trn_ids]
mbbs = [' '.join(str(p) for p in o) for o in mbb]
df = pd.DataFrame({'fn': [trn_fns[o] for o in trn_ids], 'bbox': mbbs}, columns=['fn','bbox'])
df.to_csv(MBB_CSV, index=False)
df.head()

We then point a `fastai.dataset.ImageClassifierData` to the CSV file, calling the class with a ResNet34 model for preprocessing and data augmentation transforms. Note that both the images and the BBs get augmented; YES, the dependent variable needs to be augmented too, as it must follow the adjustments of the underlying image (i.e. `tfm_y=TfmType.COORD`).

f_model=resnet34
sz=224
bs=64
aug_tfms = [RandomRotate(3, p=0.5, tfm_y=TfmType.COORD),
            RandomLighting(0.05, 0.05, tfm_y=TfmType.COORD),
            RandomFlip(tfm_y=TfmType.COORD)]
tfms = tfms_from_model(f_model, sz, crop_type=CropType.NO, tfm_y=TfmType.COORD, aug_tfms=aug_tfms)
md = ImageClassifierData.from_csv(PATH, JPEGS, MBB_CSV, tfms=tfms, bs=bs, continuous=True, num_workers=4)
x,y=to_np(next(iter(md.val_dl)))
x.shape, y.shape
# ((64, 3, 224, 224), (64, 56))

What we get out of it is `md`. This object, as of now, contains just images (`x`) and bounding boxes (`y` – 14 per image; we’ll see later on that fastai pads batches with zeros to have consistent shapes).
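
To make the `(64, 56)` shape a bit more concrete, here is a purely illustrative snippet (not fastai's actual collate code) showing how a variable number of boxes per image can be zero-padded to a fixed 14 slots:

import numpy as np

# illustrative only: pad a variable number of boxes to 14 slots (56 numbers per image)
def pad_boxes(boxes, max_objs=14):
    out = np.zeros((max_objs, 4), dtype=int)
    out[max_objs - len(boxes):] = boxes  # real boxes end up last, zeros first (as in the batches shown later)
    return out.reshape(-1)               # flat 56-dimensional target

pad_boxes(np.array([[20, 98, 117, 152], [45, 31, 191, 187]])).shape  # (56,)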

Let’s add objects’ classes to the mix. To do that we define a custom `ConcatLblDataset`, concatenating the original `md` datasets with the arrays containing the classes’ IDs. As you can see in the output below, `y` now contains 2 labels, bounding boxes `(64, 56)` and objects’ classes `(64, 14)`, 64 being the batch size.

class ConcatLblDataset(Dataset):
    def __init__(self, ds, y2):
        self.ds,self.y2 = ds,y2
        self.sz = ds.sz
    def __len__(self): return len(self.ds)
    
    def __getitem__(self, i):
        x,y = self.ds[i]
        return (x, (y,self.y2[i]))
trn_ds2 = ConcatLblDataset(md.trn_ds, trn_mcs)
val_ds2 = ConcatLblDataset(md.val_ds, val_mcs)
md.trn_dl.dataset = trn_ds2
md.val_dl.dataset = val_ds2
x,y=to_np(next(iter(md.val_dl)))
x=md.val_ds.ds.denorm(x)
x.shape, y[0].shape, y[1].shape
# ((64, 224, 224, 3), (64, 56), (64, 14))

Let’s display 12 random images within a batch to make all of this a little more tangible. Here it is. Pictures and ground truth (GT) BBs.

fig, axes = plt.subplots(3, 4, figsize=(16, 12))
for i,ax in enumerate(axes.flat):
    show_ground_truth(ax, x[i], y[0][i], y[1][i])
plt.tight_layout()

The Architecture

In terms of architecture, the idea behind Single Shot Detection (SSD) is quite simple. The basic principle consists in dividing the image into a grid and classifying each cell of the grid independently.

Let’s assume, for the sake of simplicity, that we split the image using a 4 x 4 grid. Each one of the 16 cells (anchor boxes – AB) is responsible for the area it covers, providing one class and one set of AB activations (ABAct – the difference between AB, BB and ABAct was one of the most confusing parts to me; I already highlighted it in the Notes at the top of the post and I will try to make myself clearer later on too). The reason why we need ABs and cannot simply throw the entire picture at the network is that we want to force the model’s attention onto multiple parts of the input at the same time. If we didn’t do that, we’d likely end up catching just the most prominent objects, potentially missing out on the rest. Given the 4x4 grid, we end up with 16 25-dimensional vectors, one per AB: 4 coordinates identifying the ABAct and 21 class probabilities (20 + 1 for background). We can visualize this logic with the following cubic shape.

The above cube is of shape BS x 16 x 25 (BS = batch size). Now, if you think about it, this is a shape a ConvNet might produce. That would be very convenient as we could build a network capable of getting directly from the input (the raw image) to the output. All in one shot. Without having to loop through the image 16 times. Hence, the name, Single Shot Detection.
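
Here is a minimal shape check (a sketch, not code from the notebook) showing how a convolution applied to a 4x4 feature map maps onto that BS x 16 x 25 cube. Note that the actual head below keeps the 21 class scores and the 4 coordinates in two separate convolutions; a single 25-channel one is used here just to make the point.

import torch
import torch.nn as nn

feats = torch.randn(64, 256, 4, 4)       # backbone features for a batch of 64
conv = nn.Conv2d(256, 25, 3, padding=1)  # 25 channels = 21 classes + 4 box coordinates
out = conv(feats)                        # -> (64, 25, 4, 4)
cube = out.permute(0, 2, 3, 1).contiguous().view(64, 16, 25)
cube.shape                               # torch.Size([64, 16, 25])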

Here is the architecture in more detail. We start from a ResNet34 backbone, removing the fully connected layers after the last residual block. We replace the top with a custom head, consisting of a couple of additional `conv` layers which then split up to separately “specialize” into classification and regression respectively. The result is the `16 x 25` tensor we are looking for.

What follows is the detailed code used to put together the architecture. We’ll explore the idea behind the variable `k` later on. For the moment ignore it.

class StdConv(nn.Module):
    def __init__(self, nin, nout, stride=2, drop=0.1):
        super().__init__()
        self.conv = nn.Conv2d(nin, nout, 3, stride=stride, padding=1)
        self.bn = nn.BatchNorm2d(nout)
        self.drop = nn.Dropout(drop)
        
    def forward(self, x): return self.drop(self.bn(F.relu(self.conv(x))))
        
def flatten_conv(x,k):
    bs,nf,gx,gy = x.size()
    x = x.permute(0,2,3,1).contiguous()
    return x.view(bs,-1,nf//k)
class OutConv(nn.Module):
    def __init__(self, k, nin, bias):
        super().__init__()
        self.k = k
        self.oconv1 = nn.Conv2d(nin, (len(id2cat)+1)*k, 3, padding=1)
        self.oconv2 = nn.Conv2d(nin, 4*k, 3, padding=1)
        self.oconv1.bias.data.zero_().add_(bias)
        
    def forward(self, x):
        return [flatten_conv(self.oconv1(x), self.k),
                flatten_conv(self.oconv2(x), self.k)]
    
class SSD_Head(nn.Module):
    def __init__(self, k, bias):
        super().__init__()
        self.drop = nn.Dropout(0.25)
        self.sconv0 = StdConv(512,256, stride=1)
        self.sconv2 = StdConv(256,256)
        self.out = OutConv(k, 256, bias)
        
    def forward(self, x):
        x = self.drop(F.relu(x))
        x = self.sconv0(x)
        x = self.sconv2(x)
        return self.out(x)
k = 1
f_model = resnet34
head_reg4 = SSD_Head(k, -3.)
models = ConvnetBuilder(f_model, 0, 0, 0, custom_head=head_reg4)
learn = ConvLearner(md, models)
learn.opt_fn = optim.Adam
head_reg4
# output
SSD_Head(
  (drop): Dropout(p=0.25)
  (sconv0): StdConv(
    (conv): Conv2d(512, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (bn): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
    (drop): Dropout(p=0.1)
  )
  (sconv2): StdConv(
    (conv): Conv2d(256, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (bn): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
    (drop): Dropout(p=0.1)
  )
  (out): OutConv(
    (oconv1): Conv2d(256, 21, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (oconv2): Conv2d(256, 4, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  )
)

The Loss Function

This is for sure the most complex part of the whole process. What actually motivates a loss function? Its purpose is to quantify how good or bad the outputs of the neural network are compared to the ground truth. In all the deep learning problems I had faced so far, that had been an almost trivial task. In regression, you are just comparing two numbers: the MSE loss works off the shelf. In classification, whether binary or multiclass, you are comparing a probability output with a label: binary or categorical cross entropy work off the shelf too.

The situation here is more complicated. First of all, we are dealing with regression and classification at the same time. This is actually not a big deal. We have a model architecture producing N (4+c+1)-dimensional arrays (with c being the number of classes, +1 for the background and `4` ABAct coordinates). It is just a matter of calculating a regression loss (an L1 loss in this implementation) for the ABAct, a classification loss for the class probabilities, and summing them up.

The real challenge is the following:

  • Given that we have N ABs, how do we match the ABAct each one carries, with the GT BBs?
  • For instance, consider the ABAct generated by the top-left AB: to which ground truth object should we map it, in order to check whether it is a good ABAct or not?

This is not a trivial task. Assuming we have 2 objects, a horse and a person (like in the image below), ideally, we would like the network to produce 2 ABAct only. These ABAct would need to overlap as much as possible with the GT BBs. An acceptable function must provide a measure of how far or close we are from this target.

Let’s put together the thoughts we have poured in the previous paragraph and come up with a strategy. Here is what we are going to do within the loss (`ssd_1_loss`)

  1. The number of ground truth objects changes from image to image, but each minibatch contains multiple images. Therefore the GT image-level tensors are padded with zeros to produce consistent shapes. The first step consists in stripping the padding and fetching just the non-zero GT BBs bbox and GT classes clas (`bbox,clas = get_y(bbox,clas)`).
  2. The predicted ABAct b_bb, especially at the beginning of training, could be scattered all across the image. They need to be re-anchored to the ABs anchors they belong to. On top of that, we still allow the ABAct to stretch somewhat outside the boundaries of their AB, shifting on the X/Y axis and dilating/shrinking their heights and widths. This is critical to give the model the opportunity to find big objects spreading over multiple ABs. (`a_ic = actn_to_bb(b_bb, anchors)`)
  3. We start implementing the matching logic, mapping each GT BB bbox with the ABs anchor_cnr. As hinted before, the way to achieve that is to calculate the overlap between each GT BB and each AB. This is quantified by the Jaccard Index or Intersection Over Union IOU. (overlaps = jaccard(bbox.data, anchor_cnr.data))
  4. Now that we have the Jaccard overlaps, we actually map each AB to a GT object (and vice-versa). First, we go over the ABs and assign to each one the object which overlaps with it the most. At the end of this cycle, an overlap of 0 could either mean that the AB overlapped with none of the objects (in which case it is background) or that it overlapped with an object of class 0 (this does not really matter for now). Then we loop through the objects and assign to each one the AB it overlaps with the most. We finally override the choice made in the first loop for the ABs coming out of the second loop. This guarantees that every GT object keeps at least its best-matching AB, even if the corresponding IOU is not particularly high. gt_clas provides the class which was assigned to each AB. (gt_overlap,gt_idx = map_to_ground_truth(overlaps); gt_clas = clas[gt_idx])
  5. This step is purely operational. No mind-bending computations. What we do is select all ABs with an overlap with GT exceeding a specific threshold. We then return their IDs pos_idx and GT classes and BBs, appropriately reshaped to be ingested into the relevant losses. (pos = gt_overlap > 0.4; pos_idx = torch.nonzero(pos)[:,0]; gt_clas[1-pos] = len(id2cat); gt_bbox = bbox[gt_idx])
  6. We compute the regression loss, simply as the mean L1 distance between GT BBs and ABAct (loc_loss = ((a_ic[pos_idx] - gt_bbox[pos_idx]).abs()).mean())
  7. We compute the binary cross entropy loss between GT and predicted classes. This step is crucial and will be expanded below in more detail. (clas_loss = loss_f(b_c, gt_clas) where loss_f will be discussed in detail later on)
  8. We sum the two losses from steps 6 and 7
  9. We repeat steps 1-8 for each image in the minibatch (`ssd_loss`)

Let’s now go over the previous list step by step, providing relevant code.

0. Getting a batch of images and running it through a pre-trained model

idx = 14 # just grabbing an image out of the 64 within the batch
x,y = next(iter(md.val_dl)) # fetching one batch of data
x,y = V(x),V(y)
x.shape, y[0].shape, y[0].view(64,-1, 4).shape, y[1].shape
# output
(torch.Size([64, 3, 224, 224]),
 torch.Size([64, 56]),
 torch.Size([64, 14, 4]),
 torch.Size([64, 14]))
# this picture contains 2 objects: ['person', 'horse']
y[0][idx].view(-1, 4), y[1][idx], [id2cat[i] for i in to_np(y[1][idx]) if i > 0]
# output
(Variable containing:
     0     0     0     0
     0     0     0     0
     0     0     0     0
     0     0     0     0
     0     0     0     0
     0     0     0     0
     0     0     0     0
     0     0     0     0
     0     0     0     0
     0     0     0     0
     0     0     0     0
     0     0     0     0
    20    98   117   152
    45    31   191   187
 [torch.cuda.FloatTensor of size 14x4 (GPU 0)], Variable containing:
   0
   0
   0
   0
   0
   0
   0
   0
   0
   0
   0
   0
  14
  12
 [torch.cuda.LongTensor of size 14 (GPU 0)], ['person', 'horse'])
learn.model.eval() # setting the model in evaluation mode for prediction
batch = learn.model(x) # predicting on batch
b_clas,b_bb = batch # classes and ABAct for the 16 ABs
b_clas.size(), b_bb.size()
# (torch.Size([64, 16, 21]), torch.Size([64, 16, 4]))

1. Cleaning up batch padding and scaling Bounding Boxes between 0 and 1

b_clasi = b_clas[idx] # getting class predictions for the image at index 14 in the batch
b_bboxi = b_bb[idx] # getting the ABAct for the same image
ima=md.val_ds.ds.denorm(to_np(x))[idx] # denormalising the image to be able to plot it
# removing padded 0s and grabbing just the non-zero entries in classes and BBs.
# in this case the image contains just 2 objects.
# also scaling BBs between 0 and 1
bbox,clas = get_y(y[0][idx], y[1][idx]) 
bbox,clas
# output
(Variable containing:
  0.0893  0.4375  0.5223  0.6786
  0.2009  0.1384  0.8527  0.8348
 [torch.cuda.FloatTensor of size 2x4 (GPU 0)], Variable containing:
  14
  12
 [torch.cuda.LongTensor of size 2 (GPU 0)])
# checking if we can scale back the BBs correctly.
# numbers match unscaled BBs, so it looks ok
to_np((bbox*224).long())
# array([[ 20,  98, 117, 152],
#       [ 45,  31, 191, 187]], dtype=int64)
# visualizing the ground truth image and classes and BBs
fig, ax = plt.subplots(figsize=(7,7))
torch_gt(ax, ima, bbox, clas)
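
The `get_y` helper is not reproduced in this post. A minimal sketch consistent with the behavior described above (drop the zero-padded rows and scale coordinates to [0, 1] by dividing by the image size `sz` = 224):

def get_y(bbox, clas):
    bbox = bbox.view(-1,4)/sz                            # 14 boxes, scaled to [0, 1]
    bb_keep = ((bbox[:,2]-bbox[:,0])>0).nonzero()[:,0]   # padded rows have zero height: drop them
    return bbox[bb_keep], clas[bb_keep]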

2A. Defining and visualizing Anchor Boxes

# generating the 16 ABs
anc_grid = 4
k = 1 # number of anchor box shapes per grid cell (just 1 for now; more on this later)
anc_offset = 1/(anc_grid*2)
# X coords of centers of ABs
anc_x = np.repeat(np.linspace(anc_offset, 1-anc_offset, anc_grid), anc_grid)
# Y coords of centers of ABs
anc_y = np.tile(np.linspace(anc_offset, 1-anc_offset, anc_grid), anc_grid)
# putting X and Y centers together
anc_ctrs = np.tile(np.stack([anc_x,anc_y], axis=1), (k,1))
anc_sizes = np.array([[1/anc_grid,1/anc_grid] for i in range(anc_grid*anc_grid)])
anchors = V(np.concatenate([anc_ctrs, anc_sizes], axis=1), requires_grad=False).float()
anchors # this represents the center of the ABs with height and width (they are 0.25 squares)
# output
 Variable containing:
 0.1250  0.1250  0.2500  0.2500
 0.1250  0.3750  0.2500  0.2500
 0.1250  0.6250  0.2500  0.2500
 0.1250  0.8750  0.2500  0.2500
 0.3750  0.1250  0.2500  0.2500
 0.3750  0.3750  0.2500  0.2500
 0.3750  0.6250  0.2500  0.2500
 0.3750  0.8750  0.2500  0.2500
 0.6250  0.1250  0.2500  0.2500
 0.6250  0.3750  0.2500  0.2500
 0.6250  0.6250  0.2500  0.2500
 0.6250  0.8750  0.2500  0.2500
 0.8750  0.1250  0.2500  0.2500
 0.8750  0.3750  0.2500  0.2500
 0.8750  0.6250  0.2500  0.2500
 0.8750  0.8750  0.2500  0.2500
[torch.cuda.FloatTensor of size 16x4 (GPU 0)]
# visualizing the centers of the ABs
fig2 = plt.figure()
ax2 = fig2.add_subplot(111, aspect='equal')
plt.scatter(anc_x, anc_y)
plt.xlim(0, 1)
plt.ylim(0, 1);
# these are instead the ABs formatted as centers-height-width which
# are turned into top-left/bottom-right corners.
# we'll visualize the ABs soon with appropriate class predictions
anchor_cnr = hw2corners(anchors[:,:2], anchors[:,2:])
anchor_cnr
# output
Variable containing:
 0.0000  0.0000  0.2500  0.2500
 0.0000  0.2500  0.2500  0.5000
 0.0000  0.5000  0.2500  0.7500
 0.0000  0.7500  0.2500  1.0000
 0.2500  0.0000  0.5000  0.2500
 0.2500  0.2500  0.5000  0.5000
 0.2500  0.5000  0.5000  0.7500
 0.2500  0.7500  0.5000  1.0000
 0.5000  0.0000  0.7500  0.2500
 0.5000  0.2500  0.7500  0.5000
 0.5000  0.5000  0.7500  0.7500
 0.5000  0.7500  0.7500  1.0000
 0.7500  0.0000  1.0000  0.2500
 0.7500  0.2500  1.0000  0.5000
 0.7500  0.5000  1.0000  0.7500
 0.7500  0.7500  1.0000  1.0000
[torch.cuda.FloatTensor of size 16x4 (GPU 0)]
# classes assigned to each AB (highest probability)
[f"AB {i+1}; class ID {j}; class {'bg' if j==len(id2cat) else id2cat[j]}" for i, j in enumerate(to_np(b_clasi.max(1)[1]))]
# output
['AB 1; class ID 20; class bg',
 'AB 2; class ID 20; class bg',
 'AB 3; class ID 14; class person',
 'AB 4; class ID 14; class person',
 'AB 5; class ID 20; class bg',
 'AB 6; class ID 14; class person',
 'AB 7; class ID 14; class person',
 'AB 8; class ID 20; class bg',
 'AB 9; class ID 20; class bg',
 'AB 10; class ID 20; class bg',
 'AB 11; class ID 12; class horse',
 'AB 12; class ID 20; class bg',
 'AB 13; class ID 20; class bg',
 'AB 14; class ID 20; class bg',
 'AB 15; class ID 20; class bg',
 'AB 16; class ID 20; class bg']
# visualizing the ABs with class prediction for the part of the image they are responsible for
fig, ax = plt.subplots(figsize=(7,7))
torch_gt(ax, ima, anchor_cnr, b_clasi.max(1)[1])

2B. Turning ABAct into Bounding Boxes (re-centering into ABs + perturbing position/size)

As I stated before, the difference between Anchor Box (AB), Anchor Box Activations (ABAct, i.e. the 4 numbers representing the bounding box prediction within the 25-dimensional array each AB carries) and Bounding Box (BB) was one of the most confusing parts of this whole exercise. For reference, I might use ABAct and Predicted Bounding Box interchangeably: they are the same thing. We define square ABs at the beginning of the process and then allow them to get tweaked during training. Technically, the original ABs get replaced by their ABAct as gradient descent operates. What we do, at the end of each batch, is grab the ABAct and constrain it according to the shape and position of its original AB. If we did not do that, the ABAct would wander all over the image during training (look at the image just below, which reconstructs this process for one AB!). There would be nothing inherently wrong with that behavior; we apply the constraint because later on we will add more flexible and dynamic ABs, which will cover most of the input anyway, even if constrained.
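
The constraining itself happens inside `actn_to_bb`, which is also not reproduced in the post. A rough sketch of its logic, assuming a `grid_sizes` variable holding the cell size of each anchor's grid (0.25 for the 4x4 grid): activations are squashed with `tanh`, centers are allowed to shift by at most half a cell, and heights/widths can shrink/grow between 0.5x and 1.5x of the original anchor.

def actn_to_bb(actn, anchors):
    actn_bbs = torch.tanh(actn)                                    # squash activations into [-1, 1]
    actn_ctrs = (actn_bbs[:,:2]/2 * grid_sizes) + anchors[:,:2]    # shift the center by at most half a cell
    actn_hw = (actn_bbs[:,2:]/2 + 1) * anchors[:,2:]               # scale height/width between 0.5x and 1.5x
    return hw2corners(actn_ctrs, actn_hw)                          # back to corner format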

Let’s visualize what happens to ABAct when the original AB’s constraints are applied to it.

a_ic = actn_to_bb(b_bboxi, anchors)
# slightly modified show_ground_truth to visualize how actn_to_bb works
def show_ground_truth(ax, im, bbox, clas=None, prs=None, thresh=0.3):
    bb = [bb_hw(o) for o in bbox.reshape(-1,4)]
    if prs is None:  prs  = [None]*len(bb)
    if clas is None: clas = [None]*len(bb)
    ax = show_img(im, ax=ax)
    k=0
    for i,(b,c,pr) in enumerate(zip(bb, clas, prs)):
        if((b[2]>1) and (pr is None or pr > thresh)):
            k+=1
            draw_rect(ax, b, color=colr_list[i%num_colr])
            txt = f'{k}: '
            if isinstance(c, str): 
                txt += c
            else:
                if c is not None: txt += ('bg' if c==len(id2cat) else id2cat[c])
            if pr is not None: txt += f' {pr:.2f}'
            draw_text(ax, b[:2] + np.array([0, np.random.randint(0, 40)]), txt, color=colr_list[i%num_colr])
# example with the first AB of the image and its ABAct
idx = 0
bbo = torch.cat((b_bboxi[idx,:], anchor_cnr[idx,:], a_ic[idx,:]), 0).view(3,4)
# as you can see the ABAct (1) is far from the AB it is anchored to (2)
# we re-center it on 2 and alter its position/size
fig, ax = plt.subplots(figsize=(7,7))
show_ground_truth(ax, ima, to_np((bbo*224).long()), ['Predicted Bounding Box', 'Original Anchor Box', 'Adjusted Bounding Box'])
# this is the result after all ABAct have gone through AB re-centering
# and position/size alterations
fig, ax = plt.subplots(figsize=(7,7))
torch_gt(ax, ima, a_ic, b_clasi.max(1)[1], b_clasi.max(1)[0].sigmoid(), thresh=0.0)

3. Calculating IOU between Anchor Boxes and Ground Truth Bounding Boxes

# overlapping ABs and ground truth BBs.
# as you can see overlaps is shaped 2x16 as jaccard calculates the IOU
# between all possible combinations of the 2 objects and the 16 ABs
overlaps = jaccard(bbox.data, anchor_cnr.data)
overlaps
# output
Columns 0 to 9 
 0.0000  0.0640  0.2077  0.0000  0.0000  0.1033  0.3652  0.0000  0.0000  0.0084
 0.0107  0.0244  0.0244  0.0081  0.0571  0.1377  0.1377  0.0428  0.0571  0.1377
Columns 10 to 15 
 0.0245  0.0000  0.0000  0.0000  0.0000  0.0000
 0.1377  0.0428  0.0227  0.0523  0.0523  0.0172
[torch.cuda.FloatTensor of size 2x16 (GPU 0)]
# For each of the 2 objects, this produces the IDs of the AB 
# with the highest IOU 
overlaps.max(1)
# output
(
  0.3652
  0.1377
 [torch.cuda.FloatTensor of size 2 (GPU 0)], 
   6
  10
 [torch.cuda.LongTensor of size 2 (GPU 0)])
# For each of the 16 ABs, this produces the IDs of the object 
# with the highest IOU 
overlaps.max(0)
# output
(
  0.0107
  0.0640
  0.2077
  0.0081
  0.0571
  0.1377
  0.3652
  0.0428
  0.0571
  0.1377
  0.1377
  0.0428
  0.0227
  0.0523
  0.0523
  0.0172
 [torch.cuda.FloatTensor of size 16 (GPU 0)], 
  1
  0
  0
  1
  1
  1
  0
  1
  1
  1
  1
  1
  1
  1
  1
  1
 [torch.cuda.LongTensor of size 16 (GPU 0)])
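
The `jaccard` helper is not shown in the post either. A minimal sketch of how the IOU matrix can be computed, assuming boxes in top-left/bottom-right corner format:

def intersect(box_a, box_b):
    # intersection area of every box in box_a with every box in box_b (broadcasted)
    max_xy = torch.min(box_a[:,2:].unsqueeze(1), box_b[:,2:].unsqueeze(0))
    min_xy = torch.max(box_a[:,:2].unsqueeze(1), box_b[:,:2].unsqueeze(0))
    inter = (max_xy - min_xy).clamp(min=0)
    return inter[:,:,0] * inter[:,:,1]

def box_sz(b): return (b[:,2]-b[:,0]) * (b[:,3]-b[:,1])

def jaccard(box_a, box_b):
    inter = intersect(box_a, box_b)
    union = box_sz(box_a).unsqueeze(1) + box_sz(box_b).unsqueeze(0) - inter
    return inter / union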

4. Mapping each Anchor Box to a Ground Truth Object

# we put the 2 above sets of IDs (ABs and objects) together and 
# produce a final assignment. Each AB gets assigned an object
gt_overlap,gt_idx = map_to_ground_truth(overlaps)
gt_overlap, gt_idx, clas 
# output
(
  0.0107
  0.0640
  0.2077
  0.0081
  0.0571
  0.1377
  1.9900
  0.0428
  0.0571
  0.1377
  1.9900
  0.0428
  0.0227
  0.0523
  0.0523
  0.0172
 [torch.cuda.FloatTensor of size 16 (GPU 0)], 
  1
  0
  0
  1
  1
  1
  0
  1
  1
  1
  1
  1
  1
  1
  1
  1
 [torch.cuda.LongTensor of size 16 (GPU 0)], Variable containing:
  14
  12
 [torch.cuda.LongTensor of size 2 (GPU 0)])
# these are the classes of the objects assigned to each AB.
# we are not done yet, though, for most of these objects
# the IOU with the AB was really small. We have to clean those up
# and assign the uncertain ones to background
gt_clas = clas[gt_idx]
gt_clas
# output
Variable containing:
 12
 14
 14
 12
 12
 12
 14
 12
 12
 12
 12
 12
 12
 12
 12
 12
[torch.cuda.LongTensor of size 16 (GPU 0)]
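
For reference, `map_to_ground_truth` is not shown in the post. A sketch consistent with the outputs above (note the forced 1.99 overlaps at positions 6 and 10, the ABs chosen by the two objects):

def map_to_ground_truth(overlaps, print_it=False):
    prior_overlap, prior_idx = overlaps.max(1)        # for each GT object: its best AB
    if print_it: print(prior_overlap)
    gt_overlap, gt_idx = overlaps.max(0)              # for each AB: its best GT object
    gt_overlap[prior_idx] = 1.99                      # make sure every object keeps its best AB
    for i,o in enumerate(prior_idx): gt_idx[o] = i    # and override the assignment for those ABs
    return gt_overlap, gt_idx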

5. Cleaning up Anchor Boxes and selecting only the ones with high IOU with Ground Truth Objects (rest is background)

thresh = 0.4
pos = gt_overlap > thresh
pos_idx = torch.nonzero(pos)[:,0]
neg_idx = torch.nonzero(1-pos)[:,0]
pos_idx, neg_idx
# output
(
   6
  10
 [torch.cuda.LongTensor of size 2 (GPU 0)], 
   0
   1
   2
   3
   4
   5
   7
   8
   9
  11
  12
  13
  14
  15
 [torch.cuda.LongTensor of size 14 (GPU 0)])
gt_clas[1-pos] = len(id2cat)
[id2cat[o] if o < len(id2cat) else 'bg' for o in to_np(gt_clas)]
# after the override, only the ABs at indices 6 and 10 keep a real class
# ('person' and 'horse'); every other AB is now labelled as background

6+7. Calculating regression and classification losses

gt_bbox = bbox[gt_idx]
loc_loss = ((a_ic[pos_idx] - gt_bbox[pos_idx]).abs()).mean()
loss_f = BCE_Loss(len(id2cat))
clas_loss  = loss_f(b_clasi, gt_clas)
loc_loss,clas_loss
# output
(Variable containing:
  0.5397
 [torch.cuda.FloatTensor of size 1 (GPU 0)], Variable containing:
  0.1056
 [torch.cuda.FloatTensor of size 1 (GPU 0)])

Putting together the loss function

Finally, we put together all the steps we have reviewed in detail above and pack them into a clean pair of functions.

#putting everything together
def ssd_1_loss(b_c,b_bb,bbox,clas,print_it=False):
    bbox,clas = get_y(bbox,clas)
    a_ic = actn_to_bb(b_bb, anchors)
    overlaps = jaccard(bbox.data, anchor_cnr.data)
    gt_overlap,gt_idx = map_to_ground_truth(overlaps,print_it)
    gt_clas = clas[gt_idx]
    pos = gt_overlap > 0.4
    pos_idx = torch.nonzero(pos)[:,0]
    gt_clas[1-pos] = len(id2cat)
    gt_bbox = bbox[gt_idx]
    loc_loss = ((a_ic[pos_idx] - gt_bbox[pos_idx]).abs()).mean()
    clas_loss  = loss_f(b_c, gt_clas)
    return loc_loss, clas_loss
def ssd_loss(pred,targ,print_it=False):
    lcs,lls = 0.,0.
    for b_c,b_bb,bbox,clas in zip(*pred,*targ):
        loc_loss,clas_loss = ssd_1_loss(b_c,b_bb,bbox,clas,print_it)
        lls += loc_loss
        lcs += clas_loss
    if print_it: print(f'loc: {lls.data[0]}, clas: {lcs.data[0]}')
    return lls+lcs

Training the model

Having figured out the loss, we now have all the three elements (data, architecture, loss) to finally train a model.

head_reg4 = SSD_Head(k, -3.)
models = ConvnetBuilder(f_model, 0, 0, 0, custom_head=head_reg4)
learn = ConvLearner(md, models)
learn.opt_fn = optim.Adam
loss_f = BCE_Loss(len(id2cat))
learn.crit = ssd_loss
lr = 3e-3
lrs = np.array([lr/100,lr/10,lr])
learn.lr_find(lrs/1000,1.)
learn.sched.plot(1)
learn.fit(lr, 1, cycle_len=5, use_clr=(20,10))
# output
    epoch      trn_loss   val_loss                                                                                                                                                                                                              
        0      43.001565  31.549975 
        1      33.59092   28.610656                                                                                                                                                                                                             
        2      29.400848  26.712649                                                                                                                                                                                                             
        3      26.551586  26.045135                                                                                                                                                                                                             
        4      24.349144  25.652218       
# quickly inspecting network's predictions
fig, axes = plt.subplots(3, 4, figsize=(16, 12))
for idx,ax in enumerate(axes.flat):
    ima=md.val_ds.ds.denorm(to_np(x))[idx]
    bbox,clas = get_y(y[0][idx], y[1][idx])
    a_ic = actn_to_bb(b_bb[idx], anchors)
    torch_gt(ax, ima, a_ic, b_clas[idx].max(1)[1], b_clas[idx].max(1)[0].sigmoid(), 0.01)
plt.subplots_adjust(wspace=0.15, hspace=0.15)

Most of the predicted BBs read bg (background). This can be easily fixed by filtering out those boxes. The real problem is that none of the BBs is big enough to correctly bound bigger objects. The third image from the top left corner is such an example. The model actually figures out that there is a bird, but none of the 16 ABs is wide enough to fit an appropriate BB.

How do we fix that? That's easy. We throw more anchors into the mix. Let's see how.

Improving SSD model: more anchors!

The trick is to introduce anchors of different shapes and height/width ratios, allowing them to overlap with each other and cover patches of the image of various sizes. Another way to introduce more flexibility is to add more convolutional grids. So far, we have trained the model on a 4x4 grid. We could throw into the mix a `2x2` and a `1x1` grid too. This would basically mean having 3 SSD models running in parallel.

Playing around with grid coarseness (additional convolutional layers), zooms and aspect ratios, we generate a list of 189 different ABs, all of them looking at a different part of the input, all of them at the same time. That's a lot more than 16, and it is almost guaranteed that some of them will regularly overlap with GT objects with an IOU higher than 40-50%.

Note: time to reveal what the variable k is for. It stores the number of shape variations (zooms x aspect ratios) we generate for each cell of each grid (we have 3 grids here, `[4,2,1]`). In our case k = 3 x 3 = 9, and it is crucial to pass it to the model (it ends up in `flatten_conv`) so that tensors' shapes are adjusted accordingly along the way.
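
As a quick sanity check of the 189 figure, here is the arithmetic (9 shapes per cell, on 4x4, 2x2 and 1x1 grids):

anc_grids = [4, 2, 1]
k = 3 * 3                            # 3 zooms x 3 aspect ratios
sum(g*g for g in anc_grids) * k      # (16 + 4 + 1) * 9 = 189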

anc_grids = [4,2,1]
anc_zooms = [0.7, 1., 1.3]
anc_ratios = [(1.,1.), (1.,0.5), (0.5,1.)]
anchor_scales = [(anz*i,anz*j) for anz in anc_zooms for (i,j) in anc_ratios]
k = len(anchor_scales)
anc_offsets = [1/(o*2) for o in anc_grids]
anc_x = np.concatenate([np.repeat(np.linspace(ao, 1-ao, ag), ag)
                        for ao,ag in zip(anc_offsets,anc_grids)])
anc_y = np.concatenate([np.tile(np.linspace(ao, 1-ao, ag), ag)
                        for ao,ag in zip(anc_offsets,anc_grids)])
anc_ctrs = np.repeat(np.stack([anc_x,anc_y], axis=1), k, axis=0)
# heights/widths for every grid cell and every zoom/ratio combination,
# then the 189 anchors and their top-left/bottom-right corners
anc_sizes = np.concatenate([np.array([[o/ag,p/ag] for i in range(ag*ag) for o,p in anchor_scales])
                            for ag in anc_grids])
anchors = V(np.concatenate([anc_ctrs, anc_sizes], axis=1), requires_grad=False).float()
anchor_cnr = hw2corners(anchors[:,:2], anchors[:,2:])
# adding a tiny bit of noise and scaling to pixel coords, so overlapping boxes are distinguishable in the plot
a=np.reshape((to_np(anchor_cnr) + to_np(torch.randn(*anchor_cnr.size()))*0.01)*224, -1)
anchor_cnr.size()
# torch.Size([189, 4])
# look at this mess! We are visualizing 189 ABs!
fig, ax = plt.subplots(figsize=(7,7))
show_ground_truth(ax, x[0], a)

Given that we have 189 anchors and not 16 anymore, we slightly adapt our network architecture to accommodate that.

class SSD_MultiHead(nn.Module):
    def __init__(self, k, bias):
        super().__init__()
        self.drop = nn.Dropout(drop)
        self.sconv0 = StdConv(512,256, stride=1, drop=drop)
        self.sconv1 = StdConv(256,256, drop=drop)
        self.sconv2 = StdConv(256,256, drop=drop)
        self.sconv3 = StdConv(256,256, drop=drop)
        self.out0 = OutConv(k, 256, bias)
        self.out1 = OutConv(k, 256, bias)
        self.out2 = OutConv(k, 256, bias)
        self.out3 = OutConv(k, 256, bias)
    def forward(self, x):
        x = self.drop(F.relu(x))
        x = self.sconv0(x)
        x = self.sconv1(x)
        o1c,o1l = self.out1(x)
        x = self.sconv2(x)
        o2c,o2l = self.out2(x)
        x = self.sconv3(x)
        o3c,o3l = self.out3(x)
        return [torch.cat([o1c,o2c,o3c], dim=1),
                torch.cat([o1l,o2l,o3l], dim=1)]
drop=0.4
head_reg4 = SSD_MultiHead(k, -4.)
models = ConvnetBuilder(f_model, 0, 0, 0, custom_head=head_reg4)
learn = ConvLearner(md, models)
learn.opt_fn = optim.Adam

That's a lot of anchors! Let's run a forward pass and check tensors' shapes along the way

I honestly found it very hard to visualize how the model could properly process that many anchors. Therefore, to convince myself everything made sense, I ran the PyTorch forward pass line by line, printing tensors' shapes and checking how the input evolved through the network.

x = V(torch.randn(64, 512, 7, 7))
x.size()
# torch.Size([64, 512, 7, 7])
x = head_reg4.drop(F.relu(x))
x.size()
# torch.Size([64, 512, 7, 7])
x = head_reg4.sconv0(x)
x.size()
# torch.Size([64, 256, 7, 7])
x = head_reg4.sconv1(x)
x.size()
# torch.Size([64, 256, 4, 4])
o1c, o1l = head_reg4.out1(x)
o1c.size(), o1l.size()
# (torch.Size([64, 144, 21]), torch.Size([64, 144, 4]))
x = head_reg4.sconv2(x)
x.size()
# torch.Size([64, 256, 2, 2])
o2c, o2l = head_reg4.out2(x)
o2c.size(), o2l.size()
# (torch.Size([64, 36, 21]), torch.Size([64, 36, 4]))
x = head_reg4.sconv3(x)
x.size()
# torch.Size([64, 256, 1, 1])
o3c, o3l = head_reg4.out3(x)
o3c.size(), o3l.size()
# (torch.Size([64, 9, 21]), torch.Size([64, 9, 4]))
torch.cat([o1c,o2c,o3c], dim=1).size(), torch.cat([o1l,o2l,o3l], dim=1).size()
# (torch.Size([64, 189, 21]), torch.Size([64, 189, 4]))

Training and showing new results

Let's train this enhanced model and see what the predictions look like.

learn.crit = ssd_loss
lr = 1e-2
lrs = np.array([lr/100,lr/10,lr])
learn.lr_find(lrs/1000,1.)
learn.sched.plot(n_skip_end=2)
learn.fit(lrs, 1, cycle_len=4, use_clr=(20,8))
learn.freeze_to(-2)
learn.fit(lrs/2, 1, cycle_len=4, use_clr=(20,8))
x,y = next(iter(md.val_dl))
y = V(y)
batch = learn.model(V(x))
b_clas,b_bb = batch
x = to_np(x)
fig, axes = plt.subplots(3, 4, figsize=(16, 12))
for idx,ax in enumerate(axes.flat):
    ima=md.val_ds.ds.denorm(x)[idx]
    bbox,clas = get_y(y[0][idx], y[1][idx])
    a_ic = actn_to_bb(b_bb[idx], anchors)
    torch_gt(ax, ima, a_ic, b_clas[idx].max(1)[1], b_clas[idx].max(1)[0].sigmoid(), 0.21)
plt.subplots_adjust(wspace=0.15, hspace=0.15)

That's a lot better than before. The new model can now correctly bound the bird in the third image from the top left corner. This is obviously the result of having such a variety of ABs as input.

Something is still off though. Look at the cactus and the potted plant (third column, second and third row). Why did the model believe there was no object at all in there? Is there a reason why it thinks only background is in there?

The Focal Loss

It looks like the classification chunk of our pipeline is not working as expected. If the model flags every AB as background, then getting the BB right doesn't really matter. What is going wrong?

The time has come to take a closer look at a topic I purposely touched upon only vaguely so far: the loss function we use for the multi-class classification task.

It turns out that up until now we have used a Binary Cross Entropy loss (BCE, implemented below in the `BCE_Loss` class). This might look weird as, generally, BCE is used for binary classification. For multi-class tasks, Categorical Cross Entropy (CCE) is a much more common choice, because CCE looks at all classes at the same time, providing an answer to the very natural question: "which of the N classes does this object most likely belong to?". The class with the highest probability wins. We are done in just one step.

In the case of SSD, though, the question we are asking CCE to provide an answer to is a little more complicated. Specifically, we ask: "which of the N classes does this object (having an IOU of at least 40% with our box AND not being background) most likely belong to?". That was a mouthful, and we are asking CCE to come up with an answer within just one computation. This is hard. A better option is to use BCE instead. Its strength resides in the fact that we ask a set of simpler questions: "is this a plane? A car? A pottedplant? A person?" and so on and so forth. If the answer is always NO, then we are dealing with background. It turns out that this tweak works pretty well, outperforming CCE considerably.

class BCE_Loss(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.num_classes = num_classes
    def forward(self, pred, targ):
        t = one_hot_embedding(targ, self.num_classes+1)
        t = V(t[:,:-1].contiguous())#.cpu()
        x = pred[:,:-1]
        w = self.get_weight(x,t)
        return F.binary_cross_entropy_with_logits(x, t, w, size_average=False)/self.num_classes
    
    def get_weight(self,x,t): return None

Nevertheless, as you saw in our modeling attempts so far, the BCE twist does not provide enough flexibility. The network still fails.

What are we missing?

What is happening is that we are not considering the dangers of class imbalance. In one-stage detectors such as SSD, the model parses a lot of different anchors in one go. The vast majority of those contain nothing interesting and are hence flagged as background. Only a handful get the chance to overlap with an actual object. This is a big problem as, during training, the model opts for the easy and most likely option: labelling everything as background. This is why a couple of examples above show no boxes at all. The network has a much easier life guessing nothing than guessing a specific object.

This issue, even though well known in any classification task, was actually surfaced as a major pitfall of SSD only in August 2017, when FAIR published the game-changing paper Focal Loss for Dense Object Detection. The biggest contribution of this work is having figured out the major issue within SSD (BCE not accounting for class-imbalance) and providing a solution for it, the Focal Loss (FL). From the paper itself:

We discover that the extreme foreground-background class imbalance encountered during training of dense detectors is the central cause.

We propose to address this class imbalance by reshaping the standard cross entropy loss such that it down-weights the loss assigned to well-classified examples. Our novel Focal Loss focuses training on a sparse set of hard examples and prevents the vast number of easy negatives from overwhelming the detector during training.

Focal Loss for Dense Object Detection Paper

The chart below is also extracted from the paper. On the y axis we find the value of the loss function, plotted against the probability of the ground truth class (pt). Any meaningful loss should decrease as pt increases. This is indeed the case for CE. It is worth noting, though, that even when the classifier is reasonably confident about its predictions (pt > 0.6), the CE loss still has a non-trivial magnitude (`~0.5`). For it to become really negligible, the level of confidence needs to approach 0.9. This is too high a bar to represent a viable payoff for the model during the training phase.

Multiplying CE by an appropriate weight, instead, manages to significantly lower the loss even for cases when the network is less sure about its guess.

Figure 1 (from the Focal Loss paper): loss value vs probability of the ground truth class (pt), for CE and for FL at increasing values of gamma.

As in the previous quote, the paper does an excellent job at summarizing what is going on:

The CE loss can be seen as the blue (top) curve in Figure 1. One notable property of this loss, which can be easily seen in its plot, is that even examples that are easily classified (pt > .5) incur a loss with non-trivial magnitude. When summed over a large number of easy examples, these small loss values can overwhelm the rare class. [...]

A common method for addressing class imbalance is to introduce a weighting factor alpha between [0, 1] for class 1 and 1-alpha for class -1. In practice, alpha may be set by inverse class frequency or treated as a hyperparameter to set by cross-validation. [...]

As our experiments will show, the large class imbalance encountered during training of dense detectors overwhelms the cross-entropy loss. Easily classified negatives comprise the majority of the loss and dominate the gradient. While α balances the importance of positive/negative examples, it does not differentiate between easy/hard examples. Instead, we propose to reshape the loss function to down-weight easy examples and thus focus training on hard negatives. [...]

Intuitively, the modulating factor reduces the loss contribution from easy examples and extends the range in which an example receives low loss. For instance, with gamma = 2, an example classified with pt = 0.9 would have 100x lower loss compared with CE and with pt ~ 0.968 it would have 1000x lower loss. This, in turn, increases the importance of correcting misclassified examples (whose loss is scaled down by at most 4x for pt < .5 and gamma = 2)

Focal Loss for Dense Object Detection Paper
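
To make those numbers concrete, here is a quick back-of-the-envelope check of the modulating factor in the paper's formulation (gamma = 2, ignoring the alpha weight; note that the `FocalLoss` class further down works on the sigmoid/BCE version and uses gamma = 1):

import math

gamma = 2
for pt in [0.5, 0.9, 0.968]:
    ce = -math.log(pt)               # plain cross entropy for the true class
    fl = (1 - pt)**gamma * ce        # focal loss: CE scaled by the modulating factor
    print(f'pt={pt:.3f}  CE={ce:.4f}  FL={fl:.5f}  ratio={ce/fl:.0f}x')
# pt=0.5 -> 4x lower, pt=0.9 -> 100x lower, pt=0.968 -> ~1000x lower, matching the quote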

Literally, the fix proposed by the FAIR folks is to multiply CE by a weighting factor. As simple as that. The simplicity of the innovation is also pretty clear from the amount of code needed to actually implement it in Python. Below is the `FocalLoss` class, which adds to `BCE_Loss` (from which it inherits) the `get_weight` method, in charge of calculating the scaling factor proposed in the paper (within `BCE_Loss` it returned `None`).

class FocalLoss(BCE_Loss):
    def get_weight(self,x,t):
        alpha,gamma = 0.25,1
        p = x.sigmoid()
        pt = p*t + (1-p)*(1-t)
        w = alpha*t + (1-alpha)*(1-t)
        return w * (1-pt).pow(gamma)

All the rest stays the same. Not one other line of code changes.

So, here is what the model produces after having been re-trained with the Focal Loss. Nice results!

Non Maximum Suppression

The SSD results with the addition of Focal Loss look a lot better than before. There is still one annoying thing to solve, though. In all the plotted examples, the network detects the same object multiple times. Look at the third picture on the first row. The one representing a bird. Ideally, we'd want only one box around the bird, not an arbitrary number.

The way to address this issue is the so-called Non-Maximum Suppression (NMS). It works in the following way. For each pair of predicted boxes of the same class, we calculate their IOU. If the IOU is higher than a specific threshold, the box with the lower confidence gets suppressed (discarded). The process goes on until, ideally, we end up with just one box per object.
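
Here is a rough sketch of the idea (not the fastai implementation used in the notebook), relying on the same `jaccard` IOU helper as before and meant to be run once per class:

def nms_sketch(boxes, scores, iou_thresh=0.5):
    keep = []
    order = scores.sort(descending=True)[1]      # indices, highest confidence first
    while order.numel() > 0:
        i = order[0].item()
        keep.append(i)                           # keep the most confident box...
        if order.numel() == 1: break
        rest = order[1:]
        ious = jaccard(boxes[i].unsqueeze(0), boxes[rest])[0]
        order = rest[ious <= iou_thresh]         # ...and drop everything overlapping it too much
    return keep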

Here are a few examples from the previously shown collection, after going through NMS. It seems we nailed the bird!
