This is a classic. Yet, if you have never tried to implement it yourself, you never really know what it looks like, right? I belonged to the (probably not that big) group of tech enthusiasts who had never rolled out an end-to-end solution on AWS. Until last week, when I stumbled upon an old Python file on my laptop. It turned out to contain the implementation of an AWS Lambda function I had put together some time before. Its purpose was to get triggered each time a text file was uploaded to an S3 bucket and to pass its contents to Amazon Comprehend for a quick sentiment analysis. That code sparked some interest and, considering I had always wanted to implement a complete solution in the cloud, I told myself this was the right moment to take up the challenge.
The idea
Therefore, I came up with the following side project: ingest IMDb reviews into S3, feed them to Comprehend to get a sentiment score, and then use QuickSight to visualize whether the actual rating matched what the AWS NLP engine had captured.
Let’s get started.
AWS architecture
The above diagram illustrates how the solution works. These are the steps in detail.
- An external system is in charge of uploading IMDb reviews to S3. I am keeping this deliberately vague, as it depends on the use case: in real life it could be IMDb itself interacting with AWS, or a third-party aggregator sending reviews over to the cloud. For my personal proof of concept I used GluonNLP, sampling the IMDb dataset and uploading the reviews to an S3 bucket.
- I packed the reviews in JSON files with three fields:
- review (original IMDb review)
- rating (original IMDb rating, from 1 to 10)
- submitted_on (timestamp of upload to S3).
- An AWS Lambda function is triggered each time a new JSON is PUT to the bucket. For those new to the concept, Lambda is a compute service that lets you run code without provisioning or managing servers, a cornerstone of serverless infrastructure. The function performs three actions, in this order:
- it calls the Amazon Comprehend API and requests two services: a language detection, to figure out which language the review was written in, and a subsequent sentiment analysis. This second part generates one of four possible sentiment outcomes (POSITIVE, NEGATIVE, MIXED, NEUTRAL), along with their probabilities.
- it writes the data (original JSON + sentiment score) to a DynamoDB table (covered in #4).
- it saves the original JSON, enriched with the sentiment score, to a second S3 bucket (covered in #5).
- As described in 3.2, Lambda creates a new JSON, adding two fields to the original file: sentiment (predicted sentiment) and sentiment_score (probability of sentiment). This JSON is written to a DynamoDB table, for future reference.
- As described in 3.3, as a final step, Lambda writes the enriched JSON to a second S3 bucket, creating a data lake for Athena.
- Amazon Athena picks it up from here. It queries the text files in S3 (without the need for an external ETL pipeline) and creates an external table.
- Amazon QuickSight is the BI tool topping the whole pipeline. It is in charge of visualizing the data and producing the analysis.
Uploading JSON files to S3
This is the part where an external system (IMDb itself, a third-party aggregator, etc.) PUTs objects to a bucket. For the sake of my proof of concept I pulled ~10k reviews from IMDb, packed them into JSON files and uploaded them to S3. The interaction with AWS is handled by the AWS CLI (in charge of access credentials) and boto3. I randomly sampled the IMDb dataset, processing ~80 reviews per minute, for a total duration of 2 hours. I added this artificial time component to the reviews to be able to enrich the visualization in QuickSight. Here is the code I used to accomplish this part:
import datetime
import json
import time

import boto3
import gluonnlp as nlp
import numpy as np


def jsonifize(l):
    # Wrap a (review, rating) pair into a dict and derive a
    # timestamp-based file name for the S3 object
    now = datetime.datetime.now()
    return ({"review": l[0],
             "rating": l[1],
             "submitted_on": now.strftime("%Y-%m-%d %H:%M:%S.%f")},
            f"{now.strftime('%Y%m%d%H%M%S%f')}.json")


# Load the train and test segments of the IMDb dataset via GluonNLP
train_dataset, test_dataset = [nlp.data.IMDB(root='data/imdb', segment=segment)
                               for segment in ('train', 'test')]
all_ = list(train_dataset) + list(test_dataset)

s3 = boto3.resource('s3')

# Upload randomly sampled reviews for 2 hours (~80 reviews per minute)
t_end = time.time() + 60 * 120
while time.time() < t_end:
    review = all_[np.random.randint(0, high=len(all_))]
    content, name = jsonifize(review)
    obj = s3.Object('imdb-sentiment-analysis', name)
    obj.put(Body=json.dumps(content))
    time.sleep(0.5)
This code populates the imdb-sentiment-analysis S3 bucket with IMDb reviews in JSON format. Here is a screenshot of how the bucket looks.
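For reference, the contents of a single uploaded object look something like this (the review text and the values are illustrative):

{"review": "One of the best films I have seen in years. The cast is superb...", "rating": 9, "submitted_on": "2019-06-12 14:03:27.512345"}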
Triggering Lambda
In order to make this work, we first need to create an IAM Role providing Lambda with the right set of permissions: specifically, access to Comprehend, S3 and DynamoDB. LambdaSentiment is the Role I set up for this purpose.
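The exact policy depends on how tightly you want to scope permissions. A minimal sketch for this walkthrough's buckets and table could look like the following (on top of the standard CloudWatch Logs permissions Lambda needs for logging):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["comprehend:DetectDominantLanguage", "comprehend:DetectSentiment"],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::imdb-sentiment-analysis/*"
        },
        {
            "Effect": "Allow",
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::imdb-sentiment-enriched/*"
        },
        {
            "Effect": "Allow",
            "Action": "dynamodb:PutItem",
            "Resource": "arn:aws:dynamodb:*:*:table/imdb_review_sentiment"
        }
    ]
}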
The second step is to create the actual Lambda function. We need to specify a name, an IAM Role, the type of trigger event (S3 object creation + S3 bucket name, i.e. imdb-sentiment-analysis) and, optionally, some environment variables. In our case, as we have to write to DynamoDB, it is useful to set the table name (imdb_review_sentiment) as a variable. The following screenshot shows the key components of the function page.
And here is the code executed each time Lambda is invoked by an S3 PUT trigger. The relevant parts are commented to simplify the reading.
import boto3
import os
import json
from decimal import Decimal


def lambda_handler(event, context):

    ######################################
    # LOADING JSON FILE FROM S3
    ######################################
    s3 = boto3.client('s3')
    record = event['Records'][0]
    bucket = record['s3']['bucket']['name']
    key = record['s3']['object']['key']
    file_content = s3.get_object(Bucket=bucket, Key=key)['Body'].read().decode('utf-8')
    json_content = json.loads(file_content)
    text = json_content['review'][:10000]
    rating = json_content['rating']
    timestamp = json_content['submitted_on']

    ######################################
    # INVOKING COMPREHEND
    ######################################
    if len(text) == 0:
        text = 'EMPTY'
        languagecode = 'EMPTY'
        sentiment = 'EMPTY'
        score = Decimal(0)
    else:
        comprehend = boto3.client(service_name='comprehend', region_name='eu-west-1')
        language = comprehend.detect_dominant_language(Text=text)
        languagecode = language['Languages'][0]['LanguageCode']
        if languagecode in ['en', 'es']:
            comp_sentiment = comprehend.detect_sentiment(Text=text, LanguageCode=languagecode)
            sentiment = comp_sentiment['Sentiment']
            score = str(comp_sentiment['SentimentScore'][comp_sentiment['Sentiment'].title()])
        else:
            sentiment = 'UNSUPPORTED LANGUAGE'
            score = 0

    ######################################
    # SAVING JSON WITH SENTIMENT TO S3
    ######################################
    content = {
        'review': text,
        'submitted_on': timestamp,
        'rating': rating,
        'language': languagecode,
        'sentiment': sentiment,
        'sentiment_score': float(score)
    }
    s3.put_object(Body=json.dumps(content),
                  Bucket='imdb-sentiment-enriched',
                  Key="reviews-sentiment/" + key)

    ######################################
    # CREATING NEW RECORD IN DYNAMODB TABLE
    ######################################
    dynamodb = boto3.resource('dynamodb')
    table = dynamodb.Table(os.environ['DB_TABLE_NAME'])
    table.put_item(
        Item={
            'review': text,
            'submitted_on': timestamp,
            'rating': rating,
            'language': languagecode,
            'sentiment': sentiment,
            'sentiment_score': Decimal(score)
        }
    )
    return
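If you want to exercise the handler locally before wiring up the S3 trigger, a stripped-down event containing only the fields the handler actually reads is enough. The bucket and key below are illustrative, and you obviously need valid AWS credentials plus an existing object for the call to succeed:

# Minimal, hand-crafted S3 PUT event for local testing
test_event = {
    'Records': [{
        's3': {
            'bucket': {'name': 'imdb-sentiment-analysis'},
            'object': {'key': '20190612140327512345.json'}
        }
    }]
}

lambda_handler(test_event, None)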
Calling Amazon Comprehend
This call is the core of the whole pipeline, as it feeds reviews to Comprehend to analyse their sentiment. Amazon's NLP service is super straightforward. It is possible to play around with it directly in the AWS console or to make programmatic calls to the relevant API and get a JSON back. Here is how Amazon Comprehend looks in the console. I fed it a review and it diligently returned the sentiment analysis.
When it comes to the API, its usage is also ridiculously simple. A call to the sentiment service resembles the following:
comprehend = boto3.client(service_name='comprehend', region_name='eu-west-1')
comp_sentiment = comprehend.detect_sentiment(Text='review to be analysed', LanguageCode='en')
and it responds by returning a JSON in the following format.
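The exact scores below are illustrative (and I am leaving out the response metadata boto3 attaches), but the structure is this:

{
    'Sentiment': 'POSITIVE',
    'SentimentScore': {
        'Positive': 0.9723,
        'Negative': 0.0051,
        'Neutral': 0.0198,
        'Mixed': 0.0028
    }
}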
This allows sentiment and score to be retrieved very easily:
sentiment = comp_sentiment['Sentiment']
score = str(comp_sentiment['SentimentScore'][comp_sentiment['Sentiment'].title()])
Writing to DynamoDB
I added this step to the pipeline mainly with the goal of familiarizing myself with DynamoDB, given that I almost never work with NoSQL databases. As I am dumping the processed JSONs to S3 anyway, the DynamoDB part is not critical. It is always nice to have data stored in a database, though, and given its flexibility, this NoSQL option is a perfect fit for the task. In DynamoDB, tables, items, and attributes are the core components we get to work with: a table is a collection of items, and each item is a collection of attributes. DynamoDB uses primary keys to uniquely identify each item in a table, while secondary indexes provide additional querying flexibility. After creation, the table gets automatically updated by Lambda each time a new object is processed. Here are a couple of screenshots illustrating how the table, and one of its items, appear within the AWS console.
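If you prefer creating the table programmatically instead of through the console, a minimal boto3 sketch could look like this. Picking submitted_on as partition key is an assumption on my side (upload timestamps are unique in this setup); everything else is schemaless:

import boto3

dynamodb = boto3.resource('dynamodb')

# Only the key attributes need to be declared up front; all other
# attributes (review, rating, sentiment, ...) are added freely per item
table = dynamodb.create_table(
    TableName='imdb_review_sentiment',
    KeySchema=[{'AttributeName': 'submitted_on', 'KeyType': 'HASH'}],
    AttributeDefinitions=[{'AttributeName': 'submitted_on', 'AttributeType': 'S'}],
    BillingMode='PAY_PER_REQUEST'  # on-demand capacity, nothing to provision
)
table.wait_until_exists()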
Querying S3 with Athena
As stated in the AWS architecture paragraph, alongside writing to DynamoDB, Lambda dumps the sentiment-enriched reviews into a new S3 bucket (imdb-sentiment-enriched). Here it is.
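A single enriched object in this bucket looks something like this (values illustrative):

{"review": "One of the best films I have seen in years. The cast is superb...", "submitted_on": "2019-06-12 14:03:27.512345", "rating": 9, "language": "en", "sentiment": "POSITIVE", "sentiment_score": 0.9723}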
The file names are the same as in the bucket that triggers the pipeline; the JSONs simply carry the two additional fields, sentiment and sentiment_score. What we need now is to create a data structure that can be imported into QuickSight. Amazon Athena to the rescue.
I will not try to reformulate concepts which the AWS docs already brilliantly summarize, so here are the first two lines of the Athena introduction on AWS. Could it be any clearer?
Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage [...]. Athena is easy to use. Simply point to your data in Amazon S3, define the schema, and start querying using standard SQL.
https://aws.amazon.com/athena/
So, it literally boils down to pointing the service at the S3 bucket where the text files are located, defining the schema and creating a table. Stunningly simple. As our files are in JSON format, we use a JSON serializer/deserializer (SerDe) to parse the raw text records. We then create an external table (imdb_sentiment) using Hive data definition language and start querying it. Here is the Hive code I executed, together with a couple of screenshots illustrating the Athena UI. I was blown away by its simplicity.
CREATE EXTERNAL TABLE `imdb_sentiment`(
  `submitted_on` timestamp,
  `rating` int,
  `language` string,
  `sentiment` string,
  `sentiment_score` float,
  `review` string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://imdb-sentiment-enriched/reviews-sentiment'
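Once the table exists, it can be queried like any other SQL table, from the Athena console or programmatically. As a quick sanity check, something along these lines reproduces the kind of aggregation QuickSight will later build on (the results bucket name is hypothetical, and I am assuming the table lives in the default database):

import boto3

athena = boto3.client('athena', region_name='eu-west-1')

# Average Comprehend confidence per predicted sentiment;
# Athena writes the query output to an S3 location of your choice
response = athena.start_query_execution(
    QueryString="""
        SELECT sentiment, COUNT(*) AS reviews, AVG(sentiment_score) AS avg_score
        FROM imdb_sentiment
        GROUP BY sentiment
    """,
    QueryExecutionContext={'Database': 'default'},
    ResultConfiguration={'OutputLocation': 's3://my-athena-results/'}
)
print(response['QueryExecutionId'])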
Visualizing data with Amazon QuickSight
Let's move to the final stage. We finally have all the elements to perform our analysis! Here is the original question we wanted to investigate: is Amazon Comprehend capable of capturing the sentiment of an IMDb review, as suggested by the actual rating the user provided?
Before diving into the task, a couple of clarifications on my side. As already mentioned in the Uploading JSON files to S3 paragraph, I added an artificial time component to this dataset: the submitted_on field included in the original JSON files, which is nothing more than the timestamp of when the review was dumped to S3. The purpose of this field is to simulate a stream of reviews being ingested into the system as soon as users submit them, making it possible to assess any trend over time. As I sampled the IMDb dataset in a random fashion, for obvious reasons, no trend is visible in the time series.

A second element to keep in mind is that the IMDb dataset I used is the classic NLP sentiment analysis benchmark widely referenced in the ML community. It contains only strongly polarized reviews: either with a score <= 4 out of 10 (flagged as NEGATIVE), or with a score >= 7 out of 10 (flagged as POSITIVE). This means that, ideally, Comprehend should never produce a NEUTRAL or MIXED prediction, provided the text matches the rating. A user might, for instance, submit a very high score and then translate his thoughts into a more measured review, tricking Comprehend into believing he did not like the movie that much after all. This is the phenomenon we actually want to uncover!
QuickSight is a nice BI tool which makes it easy to visualize data coming from a number of AWS sources. In our case we point it to the Hive table Athena created in the previous step of the pipeline. As soon as the data gets loaded, we can start drag-and-dropping fields into an empty visual and watch the service quickly populate it for us.
I thought it would make sense to produce two separate charts to build up the story.
The first shows the average Comprehend score split by sentiment prediction. Its purpose is to spot whether Comprehend is confident in its own predictions or not. As you can see in the chart below, the NLP service is pretty confident about POSITIVE (~80% average score) and NEGATIVE (~70% average score) sentiments. The outcome is a lot more uncertain and wiggly when it comes to NEUTRAL (~60% average score) and MIXED (~50% average score) predictions. We should again mention that, at least in theory, NEUTRAL and MIXED shouldn't show up at all, given the strongly polarized nature of the dataset.
The second visualization shows the average actual IMDb rating split by sentiment prediction. This one checks whether Comprehend is able to catch highly rated reviews as POSITIVE and low rated reviews as NEGATIVE: a measure of the accuracy of the model and of the consistency between the rating and the text. As expected, reviews flagged as POSITIVE are the ones with the highest actual rating (and the opposite holds for NEGATIVE ones). The chart also shows how Comprehend is far less accurate for NEUTRAL and MIXED reviews. The variability of these two time series is huge, without a clear distinction between them. It is important to notice, though, that on average both series are upper bounded by the POSITIVE line and lower bounded by the NEGATIVE one, which is exactly what we would expect.
I did not proceed to create aggregated metrics, which would basically remove time from the equation. QuickSight makes this really easy but, in any case, producing a thorough analysis was not the main goal of this whole exercise. Hence, I decided to draw a line here and enjoy my first fully functioning and automated AWS pipeline!
Average Comprehend score split by sentiment prediction
Average actual IMDb rating split by sentiment prediction
A couple of screenshots from the QuickSight AWS UI
Pricing
Something I have noticed in almost every AWS-related post out there is that the pricing part is often completely missing. I understand it might not be the sexiest section of the process but, you know, money is money, and I believe it is important to have an estimate of how much a pipeline would cost. In my specific case, as shown below, this exercise added up to a total of ~$42, of which $34 (80%) corresponds to the ~10k calls to the Comprehend API. This is interesting to know, as it means two things:
- Amazon Comprehend is actually quite expensive.
- All the rest (S3, Lambda, DynamoDB, Athena) is really cheap!
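To put the Comprehend figure in perspective: ~$34 for ~10k analyzed reviews works out to roughly $0.0034 per review, i.e. about $3.40 per 1,000 reviews. Negligible for a proof of concept, but definitely something to model carefully before pointing a production-scale stream of documents at the API.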