As soon as you throw the phrase "Deep Learning" into the air, domains such as Computer Vision and NLP immediately come to mind. This is all very legitimate, as the ML community has shown countless times how deep nets shine when applied to a picture or a piece of text. As Jeremy Howard correctly points out in his (amazing!) fast.ai MOOC, though, the above applications do not represent what the large majority of data science folks in industry are working on. Believe it or not, most ML tasks nowadays are still delivered in tabular form. So, basically, as a data scientist you are going to be handed a sort of giant Excel spreadsheet with N columns, of which N-1 are features and the remaining one is the dependent variable, i.e. what you are asked to predict. It can be a regression or a classification task; the concept does not really change. Now, even though this all looks very different from training a neural network to perform sentiment analysis on IMDB reviews, at the end of the day, from a very high-level perspective, modern DL frameworks still expect batches of data to be fed iteratively into an appropriate series of layers.
Great! So that means if I Google "deep learning on tabular data" I am going to find lots of illuminating code samples, right? Not really. Again, as Jeremy Howard points out, applying a deep net to a YouTube video classification challenge looks way sexier than doing the same on a house price regression problem, even though the latter might be a lot more common than the former. And it is no coincidence that the top search result is actually the fast.ai blog post on this topic. It is basically the only resource out there!
Inspired by this lack of press coverage, I decided to take on the challenge and reproduce the approach detailed in this notebook on a different dataset: Kaggle's NYC Taxi Fare Prediction competition (actually on just 500K randomly selected rows out of the 55M in the original training set). I also wanted to go a step further and dig down a level of abstraction in the fast.ai library. The amount of work and the results obtained by Jeremy's team on this new framework are incredible, and its ease of use is beyond question. The price the user pays is, of course, that the actual PyTorch code is wrapped in a couple of additional layers. I really wanted to get my hands dirty with Facebook's framework, though, so I dived into the fast.ai code, unearthed the pieces relevant to my task, and adapted them.
Specifically, any Deep Learning project needs to have at least 3 parts:
- a dataset and a way to iterate over it to produce batches of features/labels
- a model
- a training/evaluation loop
Here is the code for each one of them, adapted to my challenge. You can find the full code here on NBViewer.
Dataset/DataLoader
The `RegressionColumnarDataset` class takes 3 inputs in the constructor:

- `df`: a `pandas.DataFrame` containing all the features
- `cats`: the list of categorical variables
- `y`: the dependent variable

It returns a `Dataset` object whose `__getitem__` method retrieves a list of 3 `numpy` arrays:

- categorical features for a set of data points in the dataframe
- numerical features for the same set of points
- labels for the same set of points
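The actual class lives in the notebook linked above; as a rough sketch (assuming the categorical columns have already been label-encoded to integers), it might look like this:

```python
import numpy as np
import pandas as pd
from torch.utils.data import Dataset

class RegressionColumnarDataset(Dataset):
    def __init__(self, df: pd.DataFrame, cats: list, y: np.ndarray):
        # split the dataframe into categorical and numerical columns
        self.cats = df[cats].values.astype(np.int64)
        self.conts = df.drop(cats, axis=1).values.astype(np.float32)
        self.y = y.astype(np.float32)

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        # a list of 3 numpy arrays: categoricals, numericals, label
        return [self.cats[idx], self.conts[idx], self.y[idx]]
```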
This is not sufficient, though. PyTorch needs something to iterate over in order to produce batches which are read from disk, prepared by the CPU, and then passed to the GPU for training. To achieve this, we need a `DataLoader`, which is what we define in lines 22-23 for both the training and the validation sets.
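For reference, and assuming `trainds` and `valds` are instances of the class sketched above (the batch size here is an arbitrary choice), those two lines boil down to something like:

```python
from torch.utils.data import DataLoader

# shuffle the training set between epochs; keep the validation set in order
trainloader = DataLoader(trainds, batch_size=128, shuffle=True)
valloader = DataLoader(valds, batch_size=128, shuffle=False)
```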
Model
Let's take a look at what the model `m` contains by printing the object to console. As you can see, we have a series of 15 `Embedding` layers. These are used to encode the categorical variables. Together they constitute a matrix of size (batch_size, 195), where 195 is obtained by summing up all the embeddings' sizes (the second integer in each tuple). Adding the 9 numerical features we have in our dataset, we get to 204, which is the input shape of the first fully connected layer (`Linear` in PyTorch). After that, we just have `Linear` layers, followed by `BatchNorm1d` and `Dropout`. Nothing fancy.
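A stripped-down sketch of such a mixed-input architecture might look as follows (the class name, layer sizes and dropout rate are illustrative, not the exact values from the notebook):

```python
import torch
import torch.nn as nn

class MixedInputModel(nn.Module):
    """Embeddings for categoricals, concatenated with numericals, then an MLP."""
    def __init__(self, emb_szs, n_cont, out_sz, szs=(1000, 500), drop=0.4):
        super().__init__()
        # one (cardinality, embedding_dim) tuple per categorical variable
        self.embs = nn.ModuleList([nn.Embedding(c, s) for c, s in emb_szs])
        n_emb = sum(s for _, s in emb_szs)      # 195 in the post's setup
        sizes = [n_emb + n_cont, *szs]          # 195 + 9 = 204 input units
        self.lins = nn.ModuleList(
            [nn.Linear(sizes[i], sizes[i + 1]) for i in range(len(sizes) - 1)])
        self.bns = nn.ModuleList([nn.BatchNorm1d(s) for s in szs])
        self.drops = nn.ModuleList([nn.Dropout(drop) for _ in szs])
        self.out = nn.Linear(sizes[-1], out_sz)

    def forward(self, x_cat, x_cont):
        # one embedding lookup per categorical column, then concatenate
        x = torch.cat([emb(x_cat[:, i]) for i, emb in enumerate(self.embs)], dim=1)
        x = torch.cat([x, x_cont], dim=1)       # glue the numericals on
        for lin, bn, drop in zip(self.lins, self.bns, self.drops):
            x = drop(bn(torch.relu(lin(x))))
        return self.out(x)
```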
Training/Evaluation Loop
This is where the Deep Learning magic happens. I have wrapped everything inside a function which iterates over the training/validation `DataLoader`s, performing a forward and a backprop pass, followed by a step of the optimizer (i.e. updating the weights according to the latest gradients). The loss function is `mse_loss`.
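A bare-bones version of such a loop might look like this (the function name and signature are hypothetical; the scheduler discussed below is stepped once per mini-batch):

```python
import torch
import torch.nn.functional as F

def fit(model, trainloader, valloader, epochs, opt, scheduler=None):
    for epoch in range(epochs):
        model.train()
        for x_cat, x_cont, y in trainloader:
            opt.zero_grad()
            pred = model(x_cat, x_cont).squeeze(1)
            loss = F.mse_loss(pred, y)    # forward pass
            loss.backward()               # backprop
            opt.step()                    # weight update
            if scheduler is not None:
                scheduler.step()          # adjust the LR every mini-batch
        model.eval()
        with torch.no_grad():             # evaluation pass, no gradients needed
            val_loss = sum(F.mse_loss(model(xc, xn).squeeze(1), y).item()
                           for xc, xn, y in valloader) / len(valloader)
        print(f"epoch {epoch}: validation loss {val_loss:.4f}")
```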
An interesting twist to this procedure is the Learning Rate scheduler, which is in charge of modifying the LR during training. If you are wondering why it might be a good idea to dynamically change this parameter while the learning phase is ongoing, there are plenty of blog posts out there treating this subject. This one I particularly liked. It is no coincidence that it references a couple of posts by fast.ai fellows; Jeremy Howard continuously stresses the importance of properly adjusting this parameter.

For my specific case I opted for PyTorch's Cosine Annealing scheduler, which updates the LR at every mini-batch, oscillating between a max and a min value following a cosine function. This is how it looks. As you can see, the LR oscillates between 0.01 and 0. And here is the training/validation loss per epoch. The model is learning!
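For completeness, here is a sketch of how such a scheduler can be wired up and passed to the training loop above (the choice of SGD with momentum is an assumption; `T_max` sets the cycle length in mini-batches):

```python
from torch import optim

opt = optim.SGD(m.parameters(), lr=0.01, momentum=0.9)
# with the default eta_min=0, the LR follows a cosine curve between
# 0.01 and 0, completing one full cycle every T_max scheduler steps
scheduler = optim.lr_scheduler.CosineAnnealingLR(opt, T_max=len(trainloader))
```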