Note: Jupyter notebooks can be found here.
Introduction
Last year Kaggle hosted a competition from PetFinder.my that challenged ML practitioners to predict how fast cats and dogs would be adopted from a shelter. The contest immediately caught my attention due to the considerable variety of data made available: to date, it is one of the very few contests (possibly the only one) where images, text, and tabular samples are provided at the same time. Such a goldmine potentially allows for very rich feature engineering and modeling approaches, giving Data Scientists the chance to explore a wide range of interesting techniques.
This is the first write-up of a two-to-three-post series in which I dive into this Kaggle challenge, exploring different algorithmic options and building up solutions of increasing complexity. In this journey, I will have two more-or-less constant companions: Apache MXNet and Amazon SageMaker Studio (simply Studio, from now on). The former is an AWS-backed Deep Learning framework, well known for its scalability, speed, and flexibility. The latter is an IDE added by AWS as part of the SageMaker offering, allowing users to control the end-to-end ML workflow, from data ingestion to production, while almost never leaving the UI.
Amazon SageMaker Studio
Let’s get started with the infrastructure. As just mentioned, I decided to take Studio for a spin and drop the usual SageMaker Notebooks environment. I onboarded it via the Quick Start option, as explained here. The result is the following Control Panel.
Clicking on the user shows the recently employed Apps. In AWS jargon, these are nothing more than the Docker images automatically invoked by Studio to run code under different environments. Here are mine. As you can see, there is a default App, guaranteeing that JupyterServer is always up and running, and a mxnet-1-6-gpu-py36-ml-g4dn-xlarge App, which I recently used and killed (hence the Deleted state).
I can trigger the actual IDE by clicking the Open Studio launcher in Screenshot 1. A new tab pops up in the browser, showing a JupyterLab-styled interface like the one below (since I have already used the tool, in my case it is already populated with files in the left panel and Python notebooks in the centre).
A couple of things to notice here. Studio automatically did two things:
- it started the kernel I had last chosen, i.e. the Python 3 (MXNet GPU Optimized) one (blue bounding box).
- it attached the EC2 instance I had last selected, i.e. ml.g4dn.xlarge (green bounding box).
This also means my earlier Screenshot 2 has now turned into the following, with the appropriate App (Docker image) up and running (and the bill getting charged for it!).
If you are already familiar with Jupyter, you are good to go. You might still wonder how to change the Python environment, though: what if you don’t need a GPU? What if you want PyTorch instead of MXNet? This is when the cool stuff starts. You can actually change both the EC2 instance and the Docker container at your pleasure, combining them as flexibly as you like. TensorFlow on a compute-optimized machine? Check. Standard Data Science toolkit on general-purpose hardware? Check. Just click on either the green or blue bounding boxes highlighted in Screenshot 3 and select what best suits your needs. Everything happens on the fly. The kernel will, of course, get re-initialized and any variables will be lost in the process, but your files (notebooks, etc.) are made available from machine to machine seamlessly. I really love the experience.
The PetFinder challenge
Data
The dataset Kagglers are provided with is extensively explained here. In short, there are 3 main components which deserve attention:
- Tabular: this consists of attributes at the PetID level, such as FurLength, Name, Vaccinated, Dewormed, Color, Breed, etc. These fields are structured in classical tabular form, more or less ready to be fed to a RandomForest, for instance.
- Text: this is the Description field, corresponding to the caption coming with the pet listing and going along the lines of “2 years old mixed terrier welsh corgi waiting to rehome, etc…”. The competition host also provides the sentiment-analysis outputs obtained by submitting Description to Google’s Natural Language API.
- Images: these are the pictures completing the pet listing, arguably its most important part. Each pet comes with a varying number of pictures, with a small minority of listings not accompanied by any image.
The dependent variable, i.e. the thing we must predict, is AdoptionSpeed. It has five distinct levels, integer values between 0 and 4, from “adopted on the same day as listed” to “no adoption after 100 days of being listed”. I tackled the task from a multi-class classification perspective, even though many participants addressed it from a regression point of view instead. I have yet to try the latter myself.
Experiments
| Experiment | Links | Features | Multi-class Accuracy | Quadratic Weighted Kappa |
| --- | --- | --- | --- | --- |
| 1. RandomForest (non-HPO) | Notebook | Tabular: low-cardinality categorical + numerical features. Text: sentence average of word-level FastText pre-trained embeddings. Images: average over multiple images of ResNet18 convolutional features | 0.404 | 0.337 |
| 2. CatBoost (HPO) | Same as above | Tabular: all categorical + numerical features. Text: same as above. Images: same as above | 0.421 | 0.370 |
| 3. CatBoost (HPO) | Notebook, NLP, CV | Tabular: all categorical + numerical features. Text: sentence average of word-level fine-tuned embeddings. Images: average over multiple images of ResNet18 fine-tuned convolutional features | 0.425 | 0.380 |
| 4. Deep Learning (MLP) | Same as above | Same as previous | 0.383 | 0.323 |
Above you can find a table summarizing all my experiments. The last column reports the Quadratic Weighted Kappa (QWK), which is the metric used in the Kaggle competition. For reference, the winner scored a QWK of 0.453.
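As a side note, QWK can be computed directly with scikit-learn; here is a minimal sketch, with made-up placeholder labels:

```python
# Quadratic Weighted Kappa with scikit-learn; y_true/y_pred are placeholder arrays.
from sklearn.metrics import accuracy_score, cohen_kappa_score

y_true = [0, 2, 4, 1, 3, 2, 4]   # actual AdoptionSpeed labels (0-4)
y_pred = [0, 2, 3, 1, 4, 2, 4]   # model predictions

print("Accuracy:", accuracy_score(y_true, y_pred))
print("QWK:", cohen_kappa_score(y_true, y_pred, weights="quadratic"))
```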
As you can see, the best performing approach builds on a hyper-parameter-optimized CatBoost trained on tabular fields, fine-tuned word embeddings, and fine-tuned image convolutional features (more details on what fine-tuned means later on). Funnily enough, an MLP on the same feature set scores last, whereas a CatBoost coming from a much less time-consuming feature-engineering process holds a very reasonable second place, surprisingly close to the top.
Let’s take a closer look at the results.
RandomForest and CatBoost baseline
The first step was to build a baseline (code in the PetFinder_ML notebook). The idea is very simple and consists of the following steps (a rough sketch of the feature extraction follows the list):
- taking tabular features as they are. CatBoost does not even require encoding of categorical variables, making the training process completely smooth. RandomForest does, of course, which is why I barely used any categorical features in this case, to avoid spending time on pre-processing and to keep the baseline as simple as possible.
- passing each picture through a pre-trained ResNet18, extracting the last conv layer features (512-dimensional), and averaging those over all the available images at the PetID level (as shown in Image 6).
- tokenizing the pet description, looking up tokens in a pre-trained word-embeddings dictionary (FastText, 200-dimensional), and averaging those out (as shown in Image 7).
- concatenating 1, 2, and 3 into a numeric-only dataset (as shown in Image 8).
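Here is a minimal sketch of what the image and text feature extraction could look like with MXNet Gluon; the FastText dictionary, the image paths, and the final tabular vector are placeholders standing in for the notebook's actual objects.

```python
# A rough sketch of the baseline feature extraction (not the notebook's exact code).
import numpy as np
import mxnet as mx
from mxnet.gluon.model_zoo import vision

ctx = mx.cpu()
# Pre-trained ResNet18; .features ends in global average pooling, so each image
# yields a 512-dimensional vector.
resnet = vision.resnet18_v2(pretrained=True, ctx=ctx)

def image_features(image_paths):
    """Average the 512-d conv features over all the images of one pet."""
    feats = []
    for path in image_paths:
        img = mx.image.imread(path)
        img = mx.image.imresize(img, 224, 224).astype('float32') / 255.0
        img = mx.nd.image.normalize(img.transpose((2, 0, 1)),
                                    mean=(0.485, 0.456, 0.406),
                                    std=(0.229, 0.224, 0.225))
        feats.append(resnet.features(img.expand_dims(0)).squeeze().asnumpy())
    return np.mean(feats, axis=0) if feats else np.zeros(512)

def text_features(description, fasttext_vectors, dim=200):
    """Average pre-trained word vectors over the tokens of a description."""
    tokens = description.lower().split()
    vecs = [fasttext_vectors[t] for t in tokens if t in fasttext_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# One numeric-only row per pet: tabular + image + text features.
# row = np.concatenate([tabular_vector, image_features(paths), text_features(desc, ft)])
```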
On top of this feature set, CatBoost (after some careful hyper-parameter tuning) easily beats RandomForest, scoring 0.421 accuracy and 0.370 QWK.
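For reference, the CatBoost side of the baseline looks roughly like the sketch below; `df` and `cat_cols` are placeholders for the assembled dataset and its categorical column names, and the hyper-parameters shown are illustrative rather than the tuned values from the notebook.

```python
# Illustrative CatBoost setup; not the exact tuned configuration from the notebook.
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split

X = df.drop(columns=['AdoptionSpeed'])
y = df['AdoptionSpeed']
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

model = CatBoostClassifier(loss_function='MultiClass', iterations=1000,
                           learning_rate=0.05, depth=6, verbose=100)
model.fit(X_train, y_train,
          cat_features=cat_cols,        # CatBoost handles categoricals natively
          eval_set=(X_val, y_val),
          use_best_model=True)

preds = model.predict(X_val).ravel().astype(int)
print('Accuracy:', (preds == y_val.values).mean())
```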
Fine-tuning image and text features
The question I asked myself immediately after completing the previous step was: what if instead of using Wikipedia pre-trained word embeddings and an ImageNet pre-trained CNN, we did something a little more custom?
The “little more custom” thing I am referring to consists of the following (rough code sketches for both parts follow the list):
- Text transfer-learning:
- load a Wikipedia pre-trained language model (LM, i.e. a model that predicts the next word in a sentence).
- replace its head with a new dense layer to output 5 classes.
- fine-tune the network, training it on pets’ descriptions, with AdoptionSpeed as target.
- use word embeddings (now trained on OUR task) from the LM’s encoder to replicate the baseline approach (Image 7).
- Steps 1-4 are outlined in the PetFinder_NLP notebook.
- Image transfer-learning:
- load an ImageNet pre-trained ResNet.
- replace its head with a new dense layer to output 5 classes, instead of 1000.
- fine-tune the CNN, training it on pets’ pictures, with AdoptionSpeed as target.
- use convolutional features (now trained on OUR task) from the CNN to replicate the baseline approach (Image 6).
- Steps 1-4 are outlined in the PetFinder_CV notebook.
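On the text side, a minimal Gluon/GluonNLP sketch of the idea could look like the following; the specific pre-trained LM (standard_lstm_lm_200 on wikitext-2) and the classifier head are assumptions for illustration and may differ from what the PetFinder_NLP notebook actually does.

```python
# Sketch of text transfer-learning: re-use a pre-trained LM, swap in a 5-way head.
import mxnet as mx
from mxnet import gluon
import gluonnlp as nlp

ctx = mx.cpu()
# Wikipedia-style pre-trained language model (assumed for illustration).
lm_model, vocab = nlp.model.get_model('standard_lstm_lm_200',
                                      dataset_name='wikitext-2',
                                      pretrained=True, ctx=ctx)

class DescriptionClassifier(gluon.Block):
    """Keep the LM's embedding + encoder, replace the head with a Dense(5)."""
    def __init__(self, lm, num_classes=5, **kwargs):
        super().__init__(**kwargs)
        self.embedding = lm.embedding               # pre-trained word embeddings
        self.encoder = lm.encoder                   # pre-trained LSTM encoder
        self.output = gluon.nn.Dense(num_classes)   # new head: 5 AdoptionSpeed classes

    def forward(self, tokens):                      # tokens: (seq_len, batch) word ids
        hidden = self.encoder(self.embedding(tokens))
        return self.output(hidden[-1])              # classify from the last time step

net = DescriptionClassifier(lm_model)
net.output.initialize(mx.init.Xavier(), ctx=ctx)
# ...fine-tune on pet descriptions with AdoptionSpeed as target, then read the
# adapted word vectors back from net.embedding[0].weight.data() and plug them
# into the baseline's sentence-averaging step (Image 7).
```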
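On the image side, a similarly hedged sketch follows the standard Gluon fine-tuning pattern (keep the pre-trained convolutional features, re-initialize a new 5-way head); `train_data` is a placeholder DataLoader yielding batches of pet images and AdoptionSpeed labels.

```python
# Sketch of image transfer-learning: fine-tune an ImageNet ResNet18 on AdoptionSpeed.
import mxnet as mx
from mxnet import gluon, autograd
from mxnet.gluon.model_zoo import vision

ctx = mx.gpu() if mx.context.num_gpus() else mx.cpu()

pretrained = vision.resnet18_v2(pretrained=True, ctx=ctx)
finetune_net = vision.resnet18_v2(classes=5)         # new 5-way head
finetune_net.features = pretrained.features          # keep the ImageNet conv features
finetune_net.output.initialize(mx.init.Xavier(), ctx=ctx)

loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()
trainer = gluon.Trainer(finetune_net.collect_params(), 'adam', {'learning_rate': 1e-4})

for epoch in range(3):                               # illustrative number of epochs
    for images, labels in train_data:                # labels = AdoptionSpeed (0-4)
        images, labels = images.as_in_context(ctx), labels.as_in_context(ctx)
        with autograd.record():
            loss = loss_fn(finetune_net(images), labels)
        loss.backward()
        trainer.step(images.shape[0])

# After fine-tuning, finetune_net.features(batch) yields the 512-d vectors that
# replace the plain ImageNet features in the baseline pipeline (Image 6).
```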
This means we build exactly the same data structure as described in Image 8, with the (hopefully) considerable difference that text and image features are calculated in a smarter way compared to the baseline.
Given our brand new dataset, I first trained a CatBoost classifier on top of it (notebook). After HPO, I managed to reach 0.425 accuracy and 0.380 QWK, the best so far.
The second attempt was to build a rather basic neural network (a Multi-Layer Perceptron, MLP; notebook), whose architecture is displayed in Image 9. I had already gone down this route in the past, in this post, copying fastai’s TabularModel. The DL approach scored terribly, as you can see from the previous table.
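For context, a rough Gluon rendition of such a fastai-style TabularModel could look like the sketch below; the embedding sizes and layer widths are illustrative and do not necessarily match the architecture in Image 9.

```python
# Sketch of a TabularModel-style MLP in Gluon: categorical embeddings + dense stack.
from mxnet import gluon, init, nd

class TabularMLP(gluon.nn.Block):
    def __init__(self, cat_cardinalities, emb_dims, num_classes=5, **kwargs):
        super().__init__(**kwargs)
        self.embeddings = gluon.nn.Sequential()
        for card, dim in zip(cat_cardinalities, emb_dims):
            self.embeddings.add(gluon.nn.Embedding(card, dim))
        self.body = gluon.nn.Sequential()
        for units in (512, 256):                     # illustrative layer widths
            self.body.add(gluon.nn.Dense(units, activation='relu'),
                          gluon.nn.BatchNorm(),
                          gluon.nn.Dropout(0.3))
        self.body.add(gluon.nn.Dense(num_classes))   # 5 AdoptionSpeed classes

    def forward(self, x_cat, x_cont):
        # One embedding lookup per categorical column; x_cat: (batch, n_cat) of ids.
        embs = [emb(x_cat[:, i]) for i, emb in enumerate(self.embeddings)]
        x = nd.concat(*embs, x_cont, dim=1)          # join with continuous features
        return self.body(x)

# Usage sketch (cardinalities and embedding sizes are made up):
# net = TabularMLP(cat_cardinalities=[10, 3, 300], emb_dims=[6, 2, 50])
# net.initialize(init.Xavier())
```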
What to try next
Having scrolled through a number of Kaggle discussions, I found several tricks that could be tried to improve the model’s performance: TF-IDF for text encoding, SVD for dimensionality reduction, switching to a regression approach instead of a multi-class one, carefully optimizing thresholds across classes, and using a weighted loss instead of up-sampling the minority class as I am currently doing. Nevertheless, as I am mostly interested in experimenting with NN architectures for tabular data (in MXNet), I will give TabNet and DeepGBM a go. Those will be the object of my next PetFinder-related posts.