
Plant Pathologies: fastai’s wonders in Computer Vision

Reading Time: 11 minutes

Note: full Jupyter notebook here.

Introduction

I am writing this post to summarize my latest efforts in exploring the Computer Vision functionality of the new fastai library.

After reading the first eight chapters of fastbook and attending five lectures of the 2020 course, I decided it was the right time to take a break and get my hands dirty with one of the Deep Learning applications the library offers: Computer Vision. As always, the prerequisite for any experimenting in Machine Learning is a dataset. Unsurprisingly, I turned to Kaggle and borrowed the data from the Plant Pathology competition. As the overview page puts it:

Misdiagnosis of the many diseases impacting agricultural crops can lead to misuse of chemicals leading to the emergence of resistant pathogen strains, increased input costs, and more outbreaks with significant economic loss and environmental impacts. Current disease diagnosis based on human scouting is time-consuming and expensive […] Objectives of ‘Plant Pathology Challenge’ are to train a model using images of training dataset to […] Accurately classify a given image from testing dataset into different diseased category or a healthy leaf;

Stated differently, the goal of the competition is to classify pictures of leaves into four categories: healthy OR rust OR scab OR multiple diseases. Notice the OR. This is a multi-class classification problem, not a multi-label one, i.e. each leaf is tagged with one category only.

As previously mentioned, the purpose of this exercise was to explore as much as possible of the breadth of possibilities offered by fastai's powerful new API. Let's check out the techniques/tricks/approaches I ended up experimenting with:

  1. DataBlock API: package data into a model-usable format.
  2. Learning rate (LR) finder: stop guessing the LR for a neural network and pick the best one from the beginning.
  3. Discriminative LRs: use higher rates for the final layers of the model and lower, less aggressive ones to tune the shallowest layers (i.e. those closest to the input).
  4. Fit one cycle: smart learning rate and momentum schedule during training.
  5. Callbacks: access and edit properties of the model during training, to customize loss functions, plots, performance metrics and much more.
  6. Test Time Augmentation (TTA): at inference time, don’t run an image through the network just once. Create multiple versions of it via data augmentation and average predictions over all of them (check chapter 7 of fastbook).
  7. LabelSmoothing: address the issue of the model’s over-confidence driven by softmax. Pushing a neural network to predict 1 for one class and 0 for all the rest is a little harsh (check chapter 7 of fastbook).
  8. Mixed precision training (MPT): speed up training by reducing the precision of computations (from 32-bit to 16-bit floating point) in non-critical parts of the training process. It's like rounding numbers when the full float is not strictly needed, such as in the forward and backward calculations. This does not apply to the weights update in the optimizer step, for which we need the highest possible precision. MPT reduces the model's memory footprint, making training lighter and faster. The best part of the story is that activating MPT in fastai is as simple as calling Learner.to_fp16 (see the sketch right after this list). That's it.
  9. Progressive resizing: start training with low-resolution images and progressively increase image size. This is basically another version of transfer learning, e.g. train a model on 128-shaped pics and re-use learned weights to initialize a 256-shaped-pics CNN (check chapter 7 of fastbook).
  10. Rich visualizations: pretty much any step of the pipeline, from data preparation to training and validation can be further explored by producing insightful visualizations (mostly powered by callbacks, of course).
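
To give an idea of how little code item 8 requires, here is a minimal sketch, assuming dls is an already-built DataLoaders object:

from fastai.vision.all import *

# any Learner can be switched to mixed precision with to_fp16
# (dls is assumed to be an already-built DataLoaders object)
learn = cnn_learner(dls, resnet34).to_fp16()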

Computer Vision with fastai

DataBlock API

The DataBlock API is the foundational brick of the entire data ingestion and processing pipeline in fastai. Its flexibility was already made apparent in my NLP experiments here, yet it does not cease to amaze me. For the Plant Pathology challenge, the data came in CSV format, containing the list of picture IDs and labels, zipped up with the images themselves.

The first step was to load the CSV file into a pandas.DataFrame and turn the one-hot-encoded labels into a single-column multi-class target. Like this:

# cols holds the four one-hot label columns of train.csv
cols = ['healthy', 'multiple_diseases', 'rust', 'scab']
df = pd.read_csv(path/'train.csv')
# collapse the one-hot columns into a single multi-class target
df['target'] = np.array(cols)[np.argmax(df[cols].values, axis=1)]
df.head()

That's basically all we need to do. The rest is taken care of by the DataBlock: fetching images and labels, train/validation splitting and data augmentation.

We simply pass the DataFrame to the constructor and fastai bundles up datasets and dataloaders in a single object. For context, notice below how I have defined a little helper function (get_dls), charged with grabbing one of the three stratified folds I divided my images into and running the DataBlock magic on top of it. This will also come in handy when performing progressive resizing later on.

def get_x(r): return path/'images'/f"{r['image_id']}.jpg"
def get_y(r): return r['target']

def get_dls(bs, size, fold, df):
    # grab one of the three stratified folds
    df_fold = df.copy()
    df_fold = df_fold.loc[df_fold.fold==fold].reset_index()

    dblock = DataBlock(blocks=(ImageBlock, CategoryBlock),
                       get_x=get_x,
                       get_y=get_y,
                       splitter=IndexSplitter(df_fold.loc[df_fold.which=='valid'].index),
                       item_tfms=Resize(700),
                       batch_tfms=aug_transforms(size=size, max_rotate=30., min_scale=0.75,
                                                 flip_vert=True, do_flip=True))
    dls = dblock.dataloaders(df_fold, bs=bs)
    assert (len(dls.train_ds) + len(dls.valid_ds)) == len(df_fold)
    return dls

Now look at how easy it is to play with data!
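
As a quick sketch of what that looks like (assuming df is the fold-annotated DataFrame prepared above), building the dataloaders and eyeballing a batch of augmented images takes two lines; show_batch is fastai's built-in batch visualizer:

dls = get_dls(bs=64, size=224, fold=0, df=df)
dls.show_batch(max_n=9)  # display 9 augmented leaves with their labels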

Learning rate (LR) finder

Jeremy always says: "If you have to get one thing right, that's the learning rate". The LR is the single most important hyper-parameter the destiny of a neural network depends on. Too low and training takes forever. Too large and training diverges. It just needs to be the right amount. Deep Learning practitioners used to tune it through an endless iteration of experiments, simply trying multiple values and checking what worked best. That was before Leslie Smith introduced the method currently proposed by fastai within Learner.lr_find(). As often happens, the idea is quite simple: we start from a very low LR and progressively increase it, grabbing a minibatch of data at each iteration and calculating the loss, pushing the LR higher until the loss explodes. When we plot the loss over the LR, we get something similar to the chart below.

This is expected. When the LR is too low, the network does not train. Only at higher LRs does SGD start working its magic. As you can see, the learning process accelerates in the 1e-4 to 1e-1 range, with the loss plunging rapidly. At 1e-1 we witness a U-turn, though: the loss quickly increases and ends up exploding. Actually, an LR of 1e-1 is already too big: notice how at ~5e-2 the loss descent has already decelerated. Looking at the above chart, we want to pick a learning rate corresponding to the steepest decrease in loss. 3e-3 seems a good candidate, and it is what I end up using in the notebook. How hard is it to produce such a graph? As hard as typing the following two lines of code:

# ARCH and LOSS_FUNC are defined elsewhere in the notebook,
# e.g. resnet34 and LabelSmoothingCrossEntropy()
learn = cnn_learner(dls, ARCH, loss_func=LOSS_FUNC)
learn.lr_find()

Discriminative Learning Rates

lr_find() allows us to choose an appropriate LR. Does that mean it is THE right one? The answer would probably be a plain and simple YES, if we weren't using transfer learning (TL). As a reminder, transfer learning consists of re-using models that have already been trained on another task, somehow similar to the one at hand. This avoids starting the learning process from scratch and instead leverages meaningful weights. TL is probably the single most important concept in Deep Learning, and of course it applies to our use case as well. We are classifying leaves, so it makes total sense to use models trained on ImageNet, as it is almost guaranteed they already know pretty well what a leaf looks like. The whole TL process is then based on slightly tweaking the shallowest layers (the ones closest to the input), as after all this is where the low-level ImageNet vision features are encoded, and more aggressively training the final layers, as those are the least similar to the ones needed for the original ImageNet challenge. Differently put, we have to be cautious with shallower weights and a bit more reckless with the rest of the network, i.e. the learning rate cannot be the same for both! We have to use different LRs for layers at different depths. This is what discriminative LRs are all about.

In fastai, that's achieved by passing slice(low_lr, high_lr) to any fit function. Fastai automatically allocates low_lr to the shallowest layer group, high_lr to the model's head, and equally spaced LR values in between, which is what the following line of code does. Neat.

base_lr = 3e-3  # the rate suggested by the LR finder
learn.fit_one_cycle(5, slice(base_lr/(2.6**4), base_lr))

Fit one cycle

Together with lr_find, this is something I have already implemented from scratch in this notebook and discussed in this post. fit_one_cycle represents another great innovation introduced by Leslie Smith and proposed by fastai. This learning policy consists of starting the training phase with a very low LR, linearly increasing it to the optimal rate obtained by the LR finder, and then annealing it to almost zero in a cosine fashion. At the same time, momentum (MOM) follows the opposite schedule, based on the logic that higher LRs, i.e. bigger jumps on the loss landscape, should not be constrained by high MOMs. Hence high LR, low MOM, and vice-versa. Read more here. Everything comes conveniently packaged in a single function call, as shown below.
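
A minimal sketch of what that call looks like (learn as defined earlier); plot_sched is fastai's built-in way to visualize the resulting schedules:

learn.fit_one_cycle(5, lr_max=3e-3)
learn.recorder.plot_sched()  # LR and momentum schedules over the whole fit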

Callbacks

If you need a single reason to use fastai, that's its callback system.

As Deep Learning practitioners, we often need more than just creating dataloaders and invoking a fit method. We might want to track losses, learning rates and momentum during training, or alter parts of the model and optimizer altogether, to quickly experiment with different setups or implement a specific research paper. The thing is, even though this is of vital importance, in most libraries it is incredibly painful to achieve. Assuming we manage to, getting there very likely involves writing tons of fully custom, non-reusable code. Considering that the strength of an ML practitioner lies in their ability to iterate across ideas as fast as possible, this is obviously not an ideal scenario. It'd be great if we had a way to trigger events at specific points of the learning pipeline, just by invoking a callback, as JavaScript does with onclick or onload. In the DL world, onclick would probably be replaced by on_epoch_end or on_batch_start. Something along these lines. Well, it turns out fastai implements exactly that! As per the Learner API, the list of supported callbacks is the following:

'Start Fit', 'begin_fit', 'Start Epoch Loop', 'begin_epoch', 'Start Train', 'begin_train',
'Start Batch Loop', 'begin_batch', 'after_pred', 'after_loss', 'after_backward',
'after_step', 'after_cancel_batch', 'after_batch','End Batch Loop','End Train',
'after_cancel_train', 'after_train', 'Start Valid', 'begin_validate','Start Batch Loop',
'**CBs same as train batch**', 'End Batch Loop', 'End Valid', 'after_cancel_validate',
'after_validate', 'End Epoch Loop', 'after_cancel_epoch', 'after_epoch', 'End Fit',
'after_cancel_fit', 'after_fit'

What does this mean in practice?

Let's say you need to implement mean column-wise ROC AUC as an evaluation metric, which was exactly my problem for the Plant Pathology challenge. ROC AUC is natively supported only for binary classification tasks. Ours is a multi-class problem, so we need to implement something custom.

The idea is the following:

  1. begin_epoch: at the beginning of each epoch, we have to create two placeholder tensors, one for the ground-truth labels (y_true) and the other for model-output probabilities (probas).
  2. after_batch: at the end of each batch, only during the validation phase, we have to:
    1. turn model-output activations into probabilities, via softmax, and append them to probas.
    2. grab ground-truth labels and append them to y_true.
  3. after_epoch: at the end of each epoch, we have to compare y_true and probas and calculate the column-wise ROC AUC.

Good luck coding the above in a neat way. With fastai, it all boils down to the following piece of cake. I was mind-blown when I figured it out. Incredibly flexible.

import sklearn.metrics as skm
import torch.nn.functional as F

class ColumnWiseRocAuc(Callback):
    def begin_epoch(self):
        # placeholder tensors for probabilities and ground-truth labels
        self.probas, self.y_true = Tensor([]), LongTensor([])

    def after_batch(self):
        # accumulate only during the validation phase
        if not self.training:
            preds = F.softmax(self.pred, dim=1).cpu()  # activations -> probabilities
            self.probas = torch.cat((self.probas, preds))
            self.y_true = torch.cat((self.y_true, self.y.cpu()))

    def after_epoch(self):
        print(f"mean column-wise ROC AUC: {skm.roc_auc_score(self.y_true, self.probas, multi_class='ovr')}")

# dls, arch and loss_func need to be defined separately of course
learn = cnn_learner(dls, arch, loss_func=loss_func, metrics=accuracy, cbs=ColumnWiseRocAuc())

Note: to be fair, this is probably not the best example of leveraging callbacks: if you need to implement a custom metric in fastai, you should really use the AccumMetric class. This is what I ended up doing in practice, as I wanted the metric to be automatically displayed in the progress bar during training. The relevant code is below. Still, I hope you got the point of the previous demonstration!

def _accumulate(self, learn):
    pred = learn.pred
    # hack: apply softmax instead of sigmoid, as roc_auc_score expects probabilities
    if self.sigmoid: pred = torch.nn.functional.softmax(pred, dim=1)
    if self.thresh:  pred = (pred >= self.thresh)
    targ = learn.y
    pred, targ = to_detach(pred), to_detach(targ)
    if self.flatten: pred, targ = flatten_check(pred, targ)
    self.preds.append(pred)
    self.targs.append(targ)

# monkey-patch AccumMetric with the softmax-enabled version
AccumMetric.accumulate = _accumulate

def RocAuc(axis=-1, average='macro', sample_weight=None, max_fpr=None, multi_class='ovr'):
    "Area Under the Receiver Operating Characteristic Curve, adapted for multi-class problems"
    return skm_to_fastai(skm.roc_auc_score, axis=axis,
                         average=average, sample_weight=sample_weight, max_fpr=max_fpr,
                         flatten=False, multi_class=multi_class, sigmoid=True)

# dls, arch and loss_func need to be defined separately of course
learn = cnn_learner(dls, arch, loss_func=loss_func, metrics=[accuracy, RocAuc()])

Test Time Augmentation (TTA)

The idea behind TTA is very simple: at inference time, instead of running an image through the classifier once, we apply the same data augmentation techniques used for training, generating N different versions of the same picture and producing N different predictions. The final output is just the average of the N probabilities. This adds robustness to the classifier and generally helps with accuracy, at the cost of slowing down the inference process. Nothing fastai came up with, really; TTA has been around for some time. Still, fastai ships with a single neat function that takes care of everything under the hood. Learner.tta runs on the validation set by default, but any arbitrary dataloader can be passed too. In my case, I used it on the Kaggle test set.
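
Here is a minimal sketch of that usage, where test_items is a hypothetical list of paths to the Kaggle test images:

# build a dataloader over the test images
test_dl = learn.dls.test_dl(test_items)
# average predictions over n augmented versions of each image
preds, _ = learn.tta(dl=test_dl, n=4)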

LabelSmoothing

First introduced in this paper, LabelSmoothing brings a very interesting idea to the Deep Learning table. Here's the context: especially in a multi-class scenario, one-hot encoding the target variable might not be the smartest thing to do. By definition, it assumes one label is correct (flagged with 1) and the rest are not (flagged with 0). This is harsh, though. Consider the case where some of your data points are mislabeled. Noisy targets are far more common than we think, as the labeling process often involves a human in the loop. Assigning a 1 to a class (and 0 to the others) encourages the model to produce activations as high as possible for that specific entry, in an attempt to push the softmax output to a value close to 1. This drives a generally overconfident behavior of the neural network which, learning only ones and zeroes, leads to potential overfitting and to some undesirable bumps during the learning phase (e.g. very large penalties for wrong predictions).

LabelSmoothing addresses this issue by rewriting the one-hot encoded target in a less confident way. Assuming we have 10 classes, an idea could be to replace the 1 with 0.82 and the remaining 0s with 0.02, making sure everything still adds up to one.

# CLASSIC ONE-HOT ENCODED TARGET
[0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
# LABEL-SMOOTHED TARGET
[0.02, 0.02, 0.82, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02]

The more general recipe is to pick a smoothing parameter eps (0.2 in the example above, meaning we are 20% unsure of our labels) and then replace each 0 with eps/N_classes and the 1 with 1-eps+eps/N_classes. LabelSmoothing shines in case of noisy labels, of course. It generally requires training for more epochs to pick up steam and really add value to the learning process.

In fastai, all you have to do is pick LabelSmoothingCrossEntropy as the loss function when defining your Learner object.
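
A minimal sketch of that (eps, the smoothing parameter, defaults to 0.1; I set it explicitly just to surface it):

# label-smoothed cross entropy instead of the default loss
learn = cnn_learner(dls, resnet34, loss_func=LabelSmoothingCrossEntropy(eps=0.1))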

Progressive resizing (PR)

Progressive resizing belongs to the simple-yet-brilliant bag of tricks fastai makes possible. The idea is incredibly powerful: start model training on low-resolution images, then progressively increase image size along the way. The learning pipeline could look something like the following:

  1. Train the model on (3, 128, 128)-shaped pictures.
  2. Resize images to (3, 256, 256) and keep training for another couple of epochs.
  3. Resize images to (3, 512, 512) and train for a little more.
  4. Stop

PR indirectly achieves a couple of very important things:

  • it acts as a data augmentation technique, hence it helps prevent overfitting.
  • each time we upscale images, we are basically implementing transfer learning, where the high-resolution model re-uses the pre-trained weights of the low-resolution network.
  • it allows training models faster, at least in the early stages, due to the small image size, and then iterating on bigger and slower networks later on.

As usual, fastai makes PR a breeze to implement. It is sufficient to overwrite learn.dls with dataloaders built on upscaled images, and that's it.

# start training at low resolution (224px)
dls = get_dls(bs=128, size=224, fold=0, df=aug_df)
learn = cnn_learner(dls, resnet34)
learn.fine_tune(base_lr=3e-3)
# WE UPSCALE IMAGES HERE: swap in 512px dataloaders and keep training
learn.dls = get_dls(bs=32, size=512, fold=0, df=aug_df)
learn.fine_tune(base_lr=3e-3)

Rich visualizations

Last but not least, an additional aspect I really enjoy about the fastai library is its built-in visualization capabilities. You can really tell it has been built from the ground up with a data science mindset. Questions like the ones below (the captions of the notebook's plots) are typical of any ML pipeline, and fastai makes finding an answer trivial.
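
As a sketch of the calls powering these plots (assuming learn is the trained Learner from above):

interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix()   # confusion matrix on the validation set
interp.plot_top_losses(9)        # the most costly mistakes
learn.show_results(max_n=9)      # predictions vs ground truths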

Can you plot the confusion matrix on the validation set?

What are the images associated with the highest loss, i.e. the most costly mistakes the model is making?

Can you show a sample of image predictions/ground truths to get a handle on how the model is doing? (for these 9 leaves we are spot on!)

Stay tuned for more fastai-related posts in the upcoming weeks!
