
Identify toxic online comments with scikit-learn and Gluon NLP


In this post I wanted to review a list of common approaches to a standard NLP task. The goal I set for myself was to start with standard Machine Learning methods and then grow in complexity towards state-of-the-art Deep Learning strategies. As usual, I borrowed data and challenge from a Kaggle competition. The one I picked for this exercise is the Toxic Comment Classification Challenge, which consists in identifying and classifying toxic online comments. The fact that we are dealing with a multi-label, multi-class classification problem adds an even more interesting twist to the discussion.

The dataset consists of a little fewer than 160k pieces of text, each of which can be categorized as toxic, severe toxic, obscene, threat, insult, identity hate, or none of the above. Our goal is to maximize the macro ROC-AUC, i.e. the plain arithmetic average of the ROC-AUC of the 6 models trained to produce the 6 output classes.

Here are the approaches I experimented with, along with results and comments.

For the actual code, please visit this Jupyter Notebook.

Naïve-Bayes-LinearSVM on top of Tf-IDF BOW

This baseline was proposed by Jeremy Howard in this kernel. The approach is simple and astonishingly effective. In fact, it turns out to be the most accurate of all the methods I tried, beating even the Transfer Learning approach in MXNet. The strategy is based on 4 main steps:

  1. Turn the text corpus into a Bag Of Words (BOW)
  2. Enrich the standard CountVectorizer output with the more insightful Tf-IDF metric
  3. Calculate the Naïve-Bayes matrix multipliers. These are nothing more than a fancy way of defining the conditional probabilities of a token given a class. Once we have computed them, we can appropriately invert them to obtain the conditional probability of a class given a set of tokens (Bayes Rule)
  4. Run LinearSVM on top of the matrix calculated in #3

This technique produced an outstanding 0.98057 ROC-AUC on the validation set, which is pretty incredible for something so simple and so fast at the same time.
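To make the recipe concrete, here is a minimal sketch of the procedure for a single label. Names such as `train_texts` and `y` are placeholders, and the hyper-parameters are illustrative rather than the exact ones from the notebook:

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical inputs: `train_texts` is the list of raw comments and `y` the
# 0/1 vector for one of the six labels (the whole thing is repeated per label).
vec = TfidfVectorizer(ngram_range=(1, 2), min_df=3, max_df=0.9, sublinear_tf=True)
X = vec.fit_transform(train_texts)

def nb_log_count_ratio(X, y):
    """Naive-Bayes multipliers: log of P(token | class 1) / P(token | class 0)."""
    p = (X[y == 1].sum(0) + 1) / ((y == 1).sum() + 1)
    q = (X[y == 0].sum(0) + 1) / ((y == 0).sum() + 1)
    return np.log(p / q)                      # shape (1, n_features)

r = nb_log_count_ratio(X, np.asarray(y))
X_nb = X.multiply(r)                          # broadcast the multipliers over every row
clf = LinearSVC(C=0.1)
clf.fit(X_nb, y)                              # LinearSVM on top of the NB-weighted Tf-IDF matrix
```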


Logistic Regression on top of Tf-IDF BOW

Do the Naive-Bayes multipliers really help? To answer this question I ran a simple Logistic Regression classifier (basically the same as a LinearSVM) on top of the exact same dataset as before, just skipping point #3. Results are still very good, even though the approach is a little worse than before, scoring 0.97941 on the validation set. Again, it is really interesting how applying such a trivial technique as Naive-Bayes helps boost performance at basically zero cost.
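The comparison boils down to dropping the re-weighting step, something along these lines (again with placeholder names, reusing `vec`, `X` and `y` from the previous sketch):

```python
from sklearn.linear_model import LogisticRegression

# Plain Logistic Regression on the raw Tf-IDF matrix, no Naive-Bayes multipliers.
lr = LogisticRegression(C=4.0, solver='liblinear')
lr.fit(X, y)                                          # one binary classifier per label
val_scores = lr.predict_proba(vec.transform(val_texts))[:, 1]
```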


Logistic Regression on top of pre-trained GloVe Word Embeddings

This is something I always try whenever I deal with an NLP task. It is just too easy not to give it a shot. The approach consists in:

  1. Download the GloVe pre-trained embeddings. I generally go for the 6B Wikipedia + Gigaword ones. This time I opted for the 300-dimensional vectors
  2. Tokenize the dataset
  3. Loop through each comment and look up the corresponding embedding for each token in the GloVe dictionary, defaulting to an array of zeros whenever the lookup fails. So, say a comment is composed of 10 tokens; at the end of the procedure it would be represented as a (10, 300) matrix. Averaging the matrix column-wise results in a 300-dimensional representation of the entire comment, ready to be fed to an ML pipeline (see the sketch right after this list)
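A minimal sketch of the lookup-and-average step, assuming the 300-dimensional `glove.6B.300d.txt` file is available locally and `tokenized_comments` is a list of token lists (both names are placeholders):

```python
import numpy as np

EMB_DIM = 300

def load_glove(path='glove.6B.300d.txt'):
    # Parse the GloVe text file into a dict: token -> 300-d float32 vector.
    glove = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            glove[parts[0]] = np.asarray(parts[1:], dtype='float32')
    return glove

def comment_vector(tokens, glove, dim=EMB_DIM):
    # Look up each token, default to zeros on a miss, then average column-wise.
    vectors = [glove.get(t, np.zeros(dim, dtype='float32')) for t in tokens]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim, dtype='float32')

glove = load_glove()
X_glove = np.vstack([comment_vector(toks, glove) for toks in tokenized_comments])
```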

Running a Logistic Regression classifier on top of this dataset returns a rather disappointing 0.94220 ROC-AUC. This is understandable, considering that

  1. each comment is a straight average of word-token vectors (maybe weighting by Tf-IDF would work better)
  2. these embeddings are pre-trained on a dataset (Wikipedia + Gigaword) which has nothing to do with our current task, i.e. certainly not a hate-speech-related dataset.

Logistic Regression on top of ad-hoc trained Gensim Embeddings

Let’s check if training some embeddings from scratch on top of our original dataset helps.

To do that, I used Gensim. The ease of use of this library is simply fantastic. The model expects a corpus of tokenized sentences and then trains token-level embeddings of a user-specified size. That's it. This entire explanation is actually longer than the lines of code needed to do the job!
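Roughly, the training call looks like this (a sketch with illustrative hyper-parameters; `tokenized_comments` is the same placeholder list of token lists as before):

```python
import numpy as np
from gensim.models import Word2Vec

# Train 300-dimensional word vectors directly on the toxic-comments corpus.
# Note: `size` is called `vector_size` in Gensim 4.x and later.
w2v = Word2Vec(sentences=tokenized_comments, size=300, window=5,
               min_count=2, workers=4)

def comment_vector_w2v(tokens, w2v, dim=300):
    # Same averaging trick as with GloVe, skipping out-of-vocabulary tokens.
    vectors = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim, dtype='float32')

X_w2v = np.vstack([comment_vector_w2v(toks, w2v) for toks in tokenized_comments])
```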

To be able to compare this approach with GloVe, I opted for 300-dimensional vectors. After generating my embeddings, I applied the same approach explained before for GloVe, averaging token-level vectors to obtain a comment-level representation. The result goes directly into a Logistic Regression classifier, producing a ROC-AUC of 0.95545. Slightly better than before, confirming that training embeddings on task-related datasets helps!


Transfer Learning with Gluon NLP

In my last post, I explored PyTorch. I defined my own datasets, dataloaders and network for the first time. I was literally mind-blown by how easy and pythonic its imperative framework is compared to Keras' declarative approach. The fact that you can build a deep net in something as simple as a Python class and then use it as a standard function makes the whole process ridiculously simple. The debugging experience is almost fun compared to the physical pain suffered with TensorFlow. Take-home message: PyTorch is great. Now, it turns out that MXNet's Gluon is actually very similar to PyTorch!

On top of that, Gluon ships with a pretty advanced NLP toolkit, which makes working with text very easy. It also incorporates pre-trained Language Models, the secret sauce for Transfer Learning. Why Transfer Learning at all, though? The thing is, how can we expect a model to figure out whether text is toxic or not if it cannot “speak” English at all? It would be the same as asking me to solve the same problem in Mandarin. I don't understand a word of Mandarin! My guess is that I would perform very poorly at identifying hate speech.

So, the whole idea is very simple.

  1. We train an independent LM to generate English text. This is what Gluon has already done for us, and it is represented by the bottom 2 layers in the diagram on the left (stolen from here): an embedding matrix followed by a standard LSTM encoder.
  2. We then pool the LSTM output (which by default generates a sequence of hidden states, one per token) and feed it to a sigmoid-activated final dense layer (the top 2 layers in the diagram on the left).

Here is the same model with some additional details:

I won't spend any additional time on Gluon's technicalities. The official Sentiment Analysis (SA) with a pre-trained Language Model (LM) tutorial already does an awesome job of describing the process, and any attempt of mine to do better would miserably fail.
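Still, for the record, here is a rough sketch of the glue code, modeled on the tutorial's network. The pooling here is a plain mean over time steps rather than the tutorial's masked mean-pooling layer, and names like `ToxicNet` are mine:

```python
import mxnet as mx
from mxnet import gluon, nd
from mxnet.gluon import nn
import gluonnlp as nlp

# Grab a pre-trained language model (embedding + LSTM encoder) and its vocabulary.
lm_model, vocab = nlp.model.get_model('standard_lstm_lm_200',
                                      dataset_name='wikitext-2',
                                      pretrained=True)

class ToxicNet(gluon.Block):
    """Pre-trained embedding + LSTM encoder, mean-pooled over time, followed by
    a dense layer with one output per label (6 here); sigmoid lives in the loss."""
    def __init__(self, embedding, encoder, num_labels=6, **kwargs):
        super(ToxicNet, self).__init__(**kwargs)
        with self.name_scope():
            self.embedding = embedding            # reused from the LM
            self.encoder = encoder                # reused from the LM
            self.output = nn.Dense(num_labels)    # the only freshly initialized layer

    def forward(self, data):                      # data: (seq_len, batch_size) token ids
        encoded = self.encoder(self.embedding(data))   # (seq_len, batch, hidden)
        pooled = nd.mean(encoded, axis=0)               # average over time steps
        return self.output(pooled)                      # (batch, num_labels) logits

net = ToxicNet(lm_model.embedding, lm_model.encoder)
net.output.initialize(mx.init.Xavier())                 # pre-trained layers keep their weights
loss_fn = gluon.loss.SigmoidBinaryCrossEntropyLoss()    # multi-label objective
```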

This approach yields a disappointing ROC-AUC of 0.95160. It is rather expected, though. Very rarely have I seen off-the-shelf Deep Learning approaches perform super well. Except, maybe, for Transfer Learning in CV, a rather long optimization process is generally required to fine-tune all the network's hyper-parameters. In any case, this was mainly a proof-of-concept, so I am already quite happy that I was able to get familiar with Gluon's API and to adapt the tutorial's code to my task. Maybe, later on, I will devote some serious effort to making this work properly ;).

Thanks for reading!
