hi , I would like I am doing at for this amazing experience and ask one reason 6 months in touch , do so via linkedin in kindle with instant video in the kitchen !
If you are wondering whether I am drunk or not, the answer is that I am not.
What you read above is a farewell email generated by a Deep Learning Language Model which I trained on a corpus of “last-day” emails some of my former co-workers sent when saying goodbye to the team.
I am leaving Amazon and I told myself “I need to drop a note to the Kindle folks but how can I raise the bar of creativity? Maybe I can ask a Neural Network to come up with something for me!” and the Farewell Email Writer was born.
The resulting email does not make much sense, especially as the model thinks there is an instant video in the kitchen instead of a cake (which is usually what you bring in on your last day)!
Nevertheless, it was a very fun exercise. Let me provide a bit more context on this challenge.
How did you do it?
I fine-tuned a language model, originally trained on Wikipedia, on a custom dataset of farewell emails sent by some of my former co-workers.
How many emails did you use?
35 unique past emails.
What is a language model?
A language model is a model which has learnt a specific language, i.e. it is able to produce arbitrary pieces of text that make sense, in this case in English. Concretely, given the words seen so far, it estimates how likely each word in its vocabulary is to come next.
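To make this concrete, here is a toy, completely made-up illustration in plain Python: a “bigram” model that only looks at the previous word. A real neural language model does the same job, but conditions on the whole history and learns its probabilities from data instead of having them hard-coded.

```python
import random

# Made-up next-word probabilities, purely for illustration.
bigram_probs = {
    'hi':     {',': 0.9, 'team': 0.1},
    ',':      {'i': 0.5, 'thanks': 0.3, 'today': 0.2},
    'thanks': {'for': 0.7, 'to': 0.3},
}

def next_word(previous):
    """Sample the next word given only the previous one."""
    options = bigram_probs[previous]
    return random.choices(list(options), weights=options.values())[0]

print(next_word('hi'))  # ',' about 90% of the time
```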
What does it mean that it was pre-trained?
It means the network was “taught” English by “reading” Wikipedia. It was by looking at that corpus that it learnt how words relate to each other and how to string them together into coherent sentences.
What is fine tuning (a.k.a. transfer learning) and why is it useful?
Fine-tuning a network consists of taking a model which already “knows” how to do something and steering its knowledge a little towards a specific task.
In this case I needed the model to learn how to “write farewell emails”. It turns out that it is much easier to teach that to a network which already “speaks” English.
This is intuitive. Imagine if I asked you to reproduce a Picasso: you’d probably argue that you need to take painting classes first. This strategy is especially useful when you have very little data at your disposal. Had I had a corpus of 1M farewell emails to train a network on, I probably would not have cared about fine-tuning (even though it would have been orders of magnitude faster in any case!). With only 35 emails, though, I am obliged to grab a model which already “speaks” English and to “add” the email bit of knowledge on top.
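To give a flavour of what this looks like in code, here is a minimal sketch of the fine-tuning step using the GluonNLP toolkit discussed below. The hyper-parameters are made up, and the data batch is fabricated at random just so the sketch runs end to end; in the real experiment the batches would come from the 35 farewell emails.

```python
import mxnet as mx
import gluonnlp as nlp

# Grab an AWD-LSTM language model pre-trained on WikiText-2 (a corpus derived
# from Wikipedia), together with the vocabulary it was trained on.
model, vocab = nlp.model.get_model('awd_lstm_lm_1150',
                                   dataset_name='wikitext-2',
                                   pretrained=True)

loss_fn = mx.gluon.loss.SoftmaxCrossEntropyLoss()
trainer = mx.gluon.Trainer(model.collect_params(),
                           'sgd', {'learning_rate': 1e-3})

def detach(hidden):
    # Stop gradients from flowing back across batch boundaries.
    if isinstance(hidden, (list, tuple)):
        return [detach(h) for h in hidden]
    return hidden.detach()

# A fabricated (data, target) pair standing in for batches built from the
# 35 farewell emails: target is data shifted by one word.
batch_size, seq_len = 8, 20
tokens = mx.nd.random.randint(0, len(vocab), (seq_len + 1, batch_size))
email_batches = [(tokens[:-1].astype('float32'), tokens[1:].astype('float32'))]

for epoch in range(5):
    hidden = model.begin_state(batch_size=batch_size, func=mx.nd.zeros)
    for data, target in email_batches:
        hidden = detach(hidden)
        with mx.autograd.record():
            output, hidden = model(data, hidden)
            loss = loss_fn(output.reshape(-3, -1), target.reshape(-1))
        loss.backward()
        trainer.step(batch_size)
```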
Can you provide more details on the architecture of the pre-trained model?
I have used an AWD-LSTM (which stands for ASGD Weight-Dropped LSTM) language model.
These kinds of architectures reflect very recent advances in Deep Learning for NLP, mainly based on improved regularization and optimization strategies for word-level models.
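Since the fine-tuning sketch above already pulls this exact architecture from the GluonNLP model zoo, you can peek at its structure simply by printing the Gluon block:

```python
# `model` is the awd_lstm_lm_1150 block loaded in the sketch above.
# Printing a Gluon block lists its children: the word embedding,
# the stacked weight-dropped LSTM layers and the output decoder.
print(model)
```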
Which framework did you use?
I used Amazon’s MXNet Gluon, a very powerful framework which combines the benefits of imperative (like Facebook’s PyTorch) and declarative (like Google’s TensorFlow) libraries.
Specifically, I ran the experiment on top of GluonNLP, a very flexible and resourceful Deep Learning Toolkit for Natural Language Processing (NLP).
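As a tiny illustration of what “combining imperative and declarative” means in practice (a toy network, not part of my experiment): in Gluon you write and debug a model eagerly, PyTorch-style, and a single call to hybridize() then compiles it into an optimized static graph, TensorFlow-style.

```python
import mxnet as mx
from mxnet.gluon import nn

# A small feed-forward network, defined imperatively.
net = nn.HybridSequential()
net.add(nn.Dense(64, activation='relu'))
net.add(nn.Dense(2))
net.initialize()

x = mx.nd.random.uniform(shape=(1, 10))
print(net(x))    # executes line by line, easy to debug (imperative)

net.hybridize()  # compile the same network into a static graph
print(net(x))    # same result, now via the symbolic backend (declarative)
```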
Can we look at the code?
Sure. I uploaded the Jupyter notebook here on NbViewer and it is also embedded at the bottom of this post.
How does the model know how to start and (most importantly) when to end an email?
This is a great question!
As soon as the network’s training is over, we can feed it a start-token.
In this case I triggered the text generation with the string “hi , “.
When the model “reads” the above token it starts exploring the space of possible subsequent words given the previous ones,
i.e. given “hi , ”, what is the most likely word to follow?
This technique is called beam search: instead of greedily committing to the single most likely next word, the search keeps the few most promising partial sentences and extends each of them.
The network stops generating text as soon as the next word it picks is the token “<eos>”, which stands for end-of-sentence.
When this happens, it knows it has to stop and start over with a new email.
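Here is a minimal sketch of that generation loop, assuming the fine-tuned `model` and `vocab` from the fine-tuning sketch above. For simplicity it does greedy decoding, i.e. a beam of width 1; real beam search would keep several candidate emails alive at each step.

```python
import mxnet as mx

def generate(model, vocab, seed=('hi', ','), max_len=100):
    """Greedily generate an email, one word at a time, until '<eos>'."""
    hidden = model.begin_state(batch_size=1, func=mx.nd.zeros)
    tokens = list(seed)
    # Feed the seed tokens through the network to build up its hidden state.
    for tok in tokens:
        output, hidden = model(mx.nd.array([[vocab[tok]]]), hidden)
    while len(tokens) < max_len:
        # Pick the single most likely next word (a beam of width 1).
        next_idx = int(output[-1].argmax(axis=-1).asscalar())
        word = vocab.idx_to_token[next_idx]
        if word == '<eos>':  # the end-of-sentence token: the email is done
            break
        tokens.append(word)
        output, hidden = model(mx.nd.array([[next_idx]]), hidden)
    return ' '.join(tokens)

print(generate(model, vocab))
```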
Why does the model produce such poor quality emails?
I used very little data (only 35 emails) and I did not spend time tweaking the network’s parameters.
I just wanted to make sure I nailed the details of the implementation, as this was my first time playing around with fine-tuning in NLP.
It is generally much more common in Computer Vision.
Follow me! If you like what you are reading feel free to follow me on LinkedIn or Twitter!