
Detecting machine VS human-generated Wikipedia articles with fastai


Note: notebook, dataset and AWS CloudFormation template.

Introduction

At the time of writing, I have the honor of attending the live-stream of fastai's 2020 course (part 1). Admittedly, it requires some effort, as I have to wake up in the middle of the night, at around 3 AM (I live in Belgium), to connect to the YouTube stream. Needless to say, it is totally worth it. The content is top-notch, and on top of that, Jeremy and Sylvain recently released fastai2, a rewrite from scratch of their famous library. The bar set by the previous version was already very high, yet they somehow managed to raise it even further, gifting the community with a flexible, efficient, and easy-to-use code base. For reference, here are a few useful links: the fastai paper, the fastai book, and fastai2.

As part of the self-imposed course homework, I decided to take the library for a spin and test its text module, building a classifier to detect machine vs human-generated Wikipedia articles.

AWS infrastructure

First things first. Let's cover the infrastructure part. How do we run fastai2 code? There are several options available: PaperSpace, AWS, Microsoft Azure, Google Colab, and your own machine. Being a huge Amazon Web Services fan, I obviously went down the SageMaker route. The amazing fastai community shared a CloudFormation template to serve this purpose, so the process was a real breeze. For those unfamiliar with the concept, let me quickly summarize it. CloudFormation allows users to spin up entire resource stacks on AWS by simply providing a yaml configuration file. Such a file is nothing more than text with detailed instructions on which resources to create, the names and permissions to assign to them, technical details related to single instances, and any scripts to run at boot time. The idea is to submit this to-do list to CloudFormation to automate the creation of complex infrastructures, without having to jump from service to service in the AWS console and check all the boxes manually. If you already have the yaml config, the process is as simple as navigating to the CloudFormation UI and creating a new stack. Like this.

Creating a CloudFormation stack based on a yaml config file
fastai2 stack successfully created
As part of the stack, the fastai2 SageMaker notebook instance is spun up
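
If you prefer scripting over clicking through the console, the same stack can also be created with a few lines of boto3. Here is a minimal sketch, assuming the template has been saved locally; the stack name, file name and region are illustrative assumptions, not the values used by the official template.

import boto3

# Create the stack from a locally saved copy of the template
# (stack name, file name and region below are illustrative)
cf = boto3.client("cloudformation", region_name="eu-west-1")

with open("fastai2-sagemaker.yaml") as f:
    template_body = f.read()

cf.create_stack(
    StackName="fastai2-sagemaker",
    TemplateBody=template_body,
    Capabilities=["CAPABILITY_NAMED_IAM"],  # the stack creates IAM roles
)

# Block until the stack (and hence the SageMaker notebook instance) is ready
cf.get_waiter("stack_create_complete").wait(StackName="fastai2-sagemaker")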

The end result is the creation of a SageMaker notebook instance with the fastai2 library and all its dependencies installed. Good, we have our playground. Let’s hack something out of it.

Creating a dataset

As usual, the first thing any DL practitioner needs is a dataset. As I did not find anything pre-cooked, I decided to build one myself. This is how I did it (section 1 of this notebook):

  1. Download Wikitext-2 (a subset of Wikitext-103); see the loading sketch right after this list
  2. Iterate over a couple hundred randomly selected Wikipedia articles and:
    1. Add the untouched original to the human dataset
    2. Feed a Huggingface large GPT-2 model with the first 2-3 sentences of the original article, and ask the transformer to generate ~900-tokens-long text. Add the result to the machine dataset.
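
Before diving into the generation code, here is a minimal sketch of one way to download Wikitext-2 and load it into the wikitext DataFrame that the generation function indexes into; the exact loading code in the notebook may differ.

import pandas as pd
from fastai.text.all import untar_data, URLs

# fastai ships Wikitext-2 as WIKITEXT_TINY: one article per row, no header
path = untar_data(URLs.WIKITEXT_TINY)
wikitext = pd.read_csv(path/"train.csv", header=None)
print(wikitext.shape)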

As you can see from the notebook, all the magic happens within the generate_text function (copy-pasted below for reference). The language model is the massive gpt2-large English transformer from Huggingface (36-layer, 1280-hidden, 20-heads, 774M parameters).

def generate_text(index):
    # Assumes wikitext (one article per row), tokenizer, model, DEVICE and the sampling
    # hyperparameters (LENGTH, TEMPERATURE, K, P, ...) are defined earlier in the notebook
    doc = wikitext.loc[index, 0]
    # Use the first ~400 characters, extended to the end of the sentence, as the prompt
    dot = doc[400:].find(".")
    prompt_text = doc[:(400+dot+1)]
    
    encoded_prompt = tokenizer.encode(prompt_text, add_special_tokens=False, return_tensors="pt")
    encoded_prompt = encoded_prompt.to(DEVICE)
    output_sequences = model.generate(
        input_ids=encoded_prompt,
        max_length=LENGTH + len(encoded_prompt[0]),
        temperature=TEMPERATURE,
        top_k=K,
        top_p=P,
        repetition_penalty=REPETITION_PENALTY,
        do_sample=True,
        num_return_sequences=NUM_RETURN_SEQUENCES
    )
    generated_sequence = output_sequences[0].tolist()
    # Decode text
    text = tokenizer.decode(generated_sequence, clean_up_tokenization_spaces=True)
    # Remove all text after the stop token
    text = text[: text.find(STOP_TOKEN) if STOP_TOKEN else None]
    # Add the prompt at the beginning of the sequence. Remove the excess text that was used for pre-processing
    total_sequence = (prompt_text + text[len(tokenizer.decode(encoded_prompt[0], clean_up_tokenization_spaces=True)) :])
    # Trim to the last complete sentence
    total_sequence = total_sequence[:(total_sequence.rfind(".")+1)]
    
    # Save the generated article, and the original truncated to roughly the same length
    save_text(total_sequence, 
              f"/home/ec2-user/SageMaker/course-v4/nbs/wikitext/machine/{index}.txt")
    save_text(wikitext.loc[index, 0][: (wikitext.loc[index, 0][:len(total_sequence)].rfind(".")+1)], 
              f"/home/ec2-user/SageMaker/course-v4/nbs/wikitext/human/{index}.txt")
    print(index, len(generated_sequence))
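
The function above relies on a tokenizer and model defined earlier in the notebook. For completeness, here is a minimal sketch of how they can be loaded with Huggingface's transformers library; the hyperparameter values are illustrative assumptions, not necessarily the ones used in the notebook.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Sampling hyperparameters (illustrative values only)
LENGTH = 900
TEMPERATURE = 1.0
K = 50
P = 0.95
REPETITION_PENALTY = 1.0
NUM_RETURN_SEQUENCES = 1
STOP_TOKEN = None

# gpt2-large: 36-layer, 1280-hidden, 20-heads, 774M parameters
tokenizer = GPT2Tokenizer.from_pretrained("gpt2-large")
model = GPT2LMHeadModel.from_pretrained("gpt2-large").to(DEVICE)
model.eval()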

With this approach I created 335 machine articles which, added to their 335 human counterparts, constitute my 670-article dataset. Here is an example of an original Wikipedia piece:

Original Wikipedia article

And here is the machine-generated version, i.e. what happens when we feed the first 3 sentences of the original one to GPT-2:

GPT-2-generated article. The green highlighted section is the prompt which was fed to the language model.

Modeling: transfer-learning in NLP

Let’s move to the modeling part, covered in section 2 of this notebook. Following fastai’s best practices, we apply transfer learning.

  1. First, we fine-tune an English-pre-trained language model on our dataset. The core principle driving this step is that, in order for our network to discriminate human vs machine, it needs a very good understanding of English itself (syntax, grammar, etc.). The model we'll use is an AWD_LSTM trained on Wikipedia. So, technically, it knows quite a bit of the language already. Still, our dataset is not strictly a Wikipedia sample. Half of it was generated by GPT-2, and several tokens might not be contained in the pre-trained model's vocabulary. This fine-tuning step is done to remedy that and bring the AWD_LSTM closer to our task of interest.
  2. Second, we actually train a binary classifier, to teach a model the difference between human and machine-generated text.

1. Fine-tuning a pre-trained language model

Step 1 trains a language model on our articles, i.e. a model that predicts the next token in a sentence given the previous ones. As we are using a pre-trained network, in practice this means we

  1. randomly initialize the embeddings of the words present in our articles but missing from the original Wikipedia corpus.
  2. train those embeddings first, freezing the rest of the parameters.
  3. unfreeze and train the entire network at the same time.

The screenshot below shows the creation of the DataBlock and related dataloaders that package the data in a usable format. Take a moment to look at the output of show_batch: the text column represents the X (the model's inputs), whilst the text_ column the y (the label). Notice how text_ is shifted one word ahead with respect to text, as we would expect, given that our task is to predict the next token.

The training part is taken care of by fastai’s fit_one_cycle‘s magic.
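
The notebook's exact schedule is in the screenshot, but the overall shape of this step looks roughly like the sketch below, assuming dls_lm are the language-model dataloaders built from the DataBlock above; the learning rates and epoch counts are illustrative assumptions.

from fastai.text.all import *

# AWD_LSTM pre-trained on Wikipedia; track next-token accuracy and perplexity
learn = language_model_learner(dls_lm, AWD_LSTM, drop_mult=0.3,
                               metrics=[accuracy, Perplexity()]).to_fp16()

learn.fit_one_cycle(1, 1e-2)   # train only the new embeddings (rest of the network frozen)
learn.unfreeze()
learn.fit_one_cycle(5, 1e-3)   # then fine-tune the entire network

# Save the encoder for the classification step (the name matches the load_encoder call below)
learn.save_encoder('finetuned')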

2. Training a binary classifier

Once the original Wikipedia-based AWD_LSTM has been fine-tuned, we can move to the task we actually care about: training a binary classifier to detect fake articles.

Same process as before: we create a DataBlock and dataloaders, define a learner object and fit_one_cycle. The DataBlock is almost identical to the one we instantiated in the first step, with only a couple of (very significant) differences.

# TO FINE-TUNE THE LANGUAGE MODEL
dls_lm = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True),
    get_items=get_text_files, 
    splitter=RandomSplitter(0.1)
).dataloaders(path)
# TO TRAIN A BINARY CLASSIFIER
dls_clas = DataBlock(
    blocks=(TextBlock.from_folder(path, vocab=dls_lm.vocab), CategoryBlock),
    get_y=parent_label,
    get_items=get_text_files,
    splitter=GrandparentSplitter()
).dataloaders(path)

The most interesting changes occur within TextBlock.from_folder:

  • is_lm=True is dropped. This makes sense, as the language model does not need explicit labels: they are in the text itself, in the form of the “next word”, which is why get_y is not defined either. For the binary human vs machine classifier, instead, each article must be appropriately labeled, which is why we define get_y=parent_label, i.e. take the label from the name of the folder the text file is saved in. Take a look at the path structure of an article: /wikifake_tiny/valid/machine/104.txt. get_y=parent_label means: use the name of the parent folder as the label, machine in this case. splitter=GrandparentSplitter() means: use the name of the grandparent folder to determine the training/validation split, valid in this case.
  • vocab=dls_lm.vocab is added. This is to make sure that the vocabulary used for the binary classifier is the same as the one used for the language model. The token-to-index mapping must be the same, otherwise all that fine-tuning work would be useless.

And here is how a data batch for the new binary classifier looks.
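
Reproducing it is a one-liner, assuming dls_clas are the classifier dataloaders created above.

# Display a few (text, label) pairs from the classifier dataloaders
dls_clas.show_batch(max_n=3)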

To close up on the magic of DataBlock, make sure to check out the summary() method, i.e. the output of the following piece of code. It prints a succinct step-by-step description of what happens under the hood, from data reading to tokenization and numericalization, up until batching. Incredibly handy for debugging your pipeline.

DataBlock(
    blocks=(TextBlock.from_folder(path, vocab=dls_lm.vocab),CategoryBlock),
    get_y = parent_label,
    get_items=get_text_files,
    splitter=GrandparentSplitter()
).summary(path)

We are now left with training the actual model. Nothing could be easier with fastai. Here are the steps:

  1. Create a text_classifier_learner on top of the dataloaders, load the encoder part of the language model fine-tuned earlier, and fit_one_cycle on the frozen network (i.e. training just the head).
  2. Unfreeze an additional layer group and fit_one_cycle again, using discriminative learning rates (DLR), i.e. higher learning rates for layers closer to the head and lower learning rates for deeper layers (this is what slice(1e-2/(2.6**4), 1e-2) does).
  3. Unfreeze an additional layer group and fit_one_cycle again, using DLR.
  4. Unfreeze the entire network and fit_one_cycle again, using DLR.
# 1
learn = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5, metrics=accuracy).to_fp16()
learn = learn.load_encoder('finetuned')
learn.fit_one_cycle(2, 1.4e-2)
# 2
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2))
# 3
learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(5e-3/(2.6**4),5e-3))
# 4
learn.unfreeze()
learn.fit_one_cycle(6, slice(1e-3/(2.6**4),1e-3))

The last step achieves 91% accuracy. Mind-blowing.
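
If you want to dig a bit deeper than the headline accuracy, fastai's interpretation utilities make it easy to see where the model goes wrong. A minimal sketch (not part of the original notebook):

from fastai.interpret import ClassificationInterpretation

# Confusion matrix and the validation articles with the highest loss
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix()
interp.plot_top_losses(5)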

Running inference with the trained model on new articles is also incredibly easy. Look at this.

idx = 0
# learn.dls.vocab is [token_vocab, category_vocab]: index 1 holds the class names
classes = learn.dls.vocab[1]
pred, pred_idx, probs = learn.predict(dls_clas.valid_ds[idx][0])
actual = int(dls_clas.valid_ds[idx][1])
print(f"Article's actual class: {classes[actual]}\nPrediction: {classes[pred_idx]}; Probability: {probs[pred_idx]:.04f}")
# output
Article's actual class: machine
Prediction: machine; Probability: 0.9916

Great! This was just an introductory post on the amazing capabilities of the fastai library. I am planning to deliver more in the upcoming weeks, so stay tuned!
