
COVID-19: end-to-end analytics with AWS Glue, Athena and QuickSight


Note: in this GitHub repo you can find the two notebooks and the Python script (COVID-19*) I created while working on the project. Also, the Kaggle dataset I refer to (this one) gets updated daily by its curator. At the time of writing, it contained 5k rows, but it constantly increases in size. As for the infrastructure, AWS Glue is the serverless ETL service we will be using the most throughout the post. Check it out.

Introduction

Any piece of analytics is only as good as the data it sits upon.

Nothing new here. Yet, even if acknowledged, this fact doesn’t make the process of getting data into a usable format any easier. Nowadays, most companies have engineers taking care of the heavy lifting when it comes to business-critical ETL jobs (Extract Transform Load). Still, more often than expected, analysts need to roll up their sleeves and process large amounts of messy data to enrich their models. It is therefore of paramount importance to be up-to-date and comfortable with the most common technologies to get the job done.

The purpose of this post is to go through one of these scenarios and use AWS ETL and analytics services to pull it off. The idea is to start from relatively messy data, clean it up and build a report on top of it.

First things first: we need a use case. Given the current news, I thought it would be interesting to build a pipeline around an openly-available COVID-19 dataset. For the record, COVID-19 is an infectious disease caused by a virus of the corona family, the same family as SARS. It originated in China in December 2019 and quickly spread to almost the entire world in a pandemic of vast proportions. At the time of writing, the worldwide counter reports 100k+ confirmed cases and 4k+ deaths, with the toll rising constantly.

The dataset I will be using is this one. It shows the cumulative confirmed, recovered and death figures by day, country and province from 22 January to 1 March 2020. Latitude and longitude coordinates are also added at the country level, i.e. a single GPS point per nation. The problem with this dataset is that, ironically, it is already perfectly usable. All fields are clean and properly formatted. So, to make the whole thing a little more interesting, I decided to screw it up as much as I could, and then test whether I could fix it back using AWS. Let me outline the work to be done:

  1. Get the COVID-19 Kaggle dataset and randomly break fields, e.g. screw up formatting, add null values, duplicates, etc, to make it more similar to sources we deal with on a day-to-day basis.
  2. Drop the crappified version of #1 onto Amazon S3.
  3. Build an ETL pipeline to clean #2 up.
  4. Build a QuickSight report on top.

Time to dive in.

AWS diagram: project’s workflow

This is the plan in detail:

  1. Save the messed-up dataset to the pochetti-covid-19-input S3 bucket. I will deep-dive into this one in the following section.
  2. Run the covid19 AWS Glue Crawler on top of the pochetti-covid-19-input S3 bucket to parse JSONs and create the pochetti_covid_19_input table in the Glue Data Catalog.
  3. Run the covid19-json Glue ETL Job on top of the pochetti_covid_19_input table to:
    • clean the data, i.e. undo all the crappification logic previously implemented
    • save instances in JSON format to the pochetti-covid-19-output S3 bucket.
  4. Run the covid19-output AWS Glue Crawler on top of the pochetti-covid-19-output S3 bucket to parse JSONs and create the pochetti_covid_19_output table in the Glue Data Catalog.
  5. Query the pochetti_covid_19_output table in the Glue Data Catalog via Amazon Athena. Remove duplicates and create the final, clean, covid19_athena table in the Glue Data Catalog.
  6. Connect Amazon QuickSight to the covid19_athena table and build relevant visualizations.

1. Crappifying the dataset

N.B. I must give credit where credit is due, of course. AFAIK the neologism “crappify a dataset” was coined by Jeremy Howard in the context of generative models for fast.ai. I am shamelessly stealing the term.

If you simply download the dataset from here and open it up, you will be pleasantly surprised by how clean and ready-to-use it is. This is because it has already been preprocessed by the dataset curator. That’s nice of him but, given our goal is to practice with ETL pipelines, we need something a bit messier, which is why I opted for crappifying the original dataset into something more realistic. This is the notebook where I implement the logic; here is what I did in detail.

  1. The dataset contains 5000 rows by 8 columns (Province-State, Country-Region, Lat, Long, Date, Confirmed, Deaths, Recovered).
  2. I add an id column copying the row index, to be able to discriminate duplicates more efficiently later on.
  3. I create duplicated rows, randomly selecting samples of the dataframe and appending them to the original one. This operation increases the size to 34410 rows, with some instances repeated up to 18 times.
  4. For each row:
    1. I turn it into JSON
    2. I mess up the Date format, from the original %Y-%m-%d to a randomly selected option among ["%Y-%m-%d", "%Y/%m/%d", "%Y-%b-%d", "%Y/%B/%d", "%Y, %B, %d"]
    3. In case Province-State is null, I override the field with a randomly selected option among ['NULL', '', 'missing', '--']
    4. I add up to 10 integer fields to the JSON, in the format Random-Field-{i}: {i}
    5. In case one of the 3 fields Confirmed, Deaths, Recovered equals 0, I override it with a randomly selected option among ['zero', 'o', 0]
    6. I upload the resulting crappified JSON into the pochetti-covid-19-input S3 bucket, with filename json-{i}.

Points #4.1 to #4.5 are achieved by calling the function process_entry (code provided below for reference) on each individual row of the pandas dataframe.

import json
import random
from datetime import datetime

def crappify_date(x):
    # convert the epoch-milliseconds Date into a string with a randomly chosen format
    formats = ["%Y-%m-%d", "%Y/%m/%d", "%Y-%b-%d", "%Y/%B/%d", "%Y, %B, %d"]
    x['Date'] = datetime.fromtimestamp(x['Date']/1000.0).strftime(random.choice(formats))
    return x

def crappify_province(x):
    # replace missing provinces with one of several inconsistent null markers
    if x['Province-State'] is None: x['Province-State'] = random.choice(['NULL', '', 'missing', '--'])
    return x

def add_fields(x):
    # append up to 10 useless Random-Field-{i} integer fields
    n = random.randint(0, 10)
    if n == 0: return x
    else:
        for i in range(n):
            x[f"Random-Field-{i}"] = i
    return x

def crappify_zeros(x):
    # pick one of the count fields and, if it equals 0, replace it with a messy representation
    f = random.choice(['Confirmed', 'Deaths', 'Recovered'])
    if x[f] == 0: x[f] = random.choice(['zero', 'o', 0])
    return x

def crappify_json(x):
    # chain all the crappification steps on a single JSON record
    x = crappify_date(x)
    x = crappify_province(x)
    x = add_fields(x)
    return crappify_zeros(x)

def process_entry(df, i):
    # turn row i of the dataframe into a JSON dict and crappify it
    x = json.loads(df.loc[i].to_json(date_format='epoch'))
    return crappify_json(x)
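Point #4.6, the upload itself, is not shown above. Here is a minimal sketch of how that loop might look with boto3, assuming default AWS credentials and that the pochetti-covid-19-input bucket already exists:

import json
import boto3

s3 = boto3.client('s3')

def upload_entries(df, bucket='pochetti-covid-19-input'):
    # apply points #4.1-4.5 to each row, then push the result to S3 (point #4.6)
    for i in df.index:
        entry = process_entry(df, i)
        s3.put_object(Bucket=bucket, Key=f"json-{i}", Body=json.dumps(entry))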

And this is what one of the messed-up JSONs looks like.

#BEFORE 
{'Province-State': 'San Benito, CA', 
 'Country-Region': 'US', 
 'Lat': 36.5761, 
 'Long': -120.9876, 
 'Date': 1581724800000, 
 'Confirmed': 2, 
 'Deaths': 0, 
 'Recovered': 0} 

#AFTER 
{'Province-State': 'San Benito, CA', 
 'Country-Region': 'US', 
 'Lat': 36.5761, 
 'Long': -120.9876, 
 'Date': '2020, February, 15', 
 'Confirmed': 2, 
 'Deaths': 'o', 
 'Recovered': 0, 
 'id': 50, 
 'Random-Field-0': 0, 
 'Random-Field-1': 1, 
 'Random-Field-2': 2, 
 'Random-Field-3': 3, 
 'Random-Field-4': 4, 
 'Random-Field-5': 5}

And here are the contents of the pochetti-covid-19-input bucket.

2. Crawl S3 input with Glue

This step is as simple as creating an AWS Glue Crawler (covid19) and pointing it to an S3 bucket. The Crawler dives into the JSON files, figures out their structure and registers it as a new table in the Glue Data Catalog. Almost pure magic.

S3 bucket with input, messy data. This is the source we point the Glue Crawler to.
covid19 Glue Crawler in charge of parsing S3 and storing data into a Glue Data Catalog table
pochetti_covid_19_input table created within the Glue Data Catalog. As you can see, the Glue Crawler correctly identified 34410 objects in S3 and parsed the following JSON fields out of those: Confirmed, Country-Region, Date, Deaths, Lat, Long, Province-State, Random-Field-0, Random-Field-1, Random-Field-2, Random-Field-3, Random-Field-4, Random-Field-5, Random-Field-6, Random-Field-7, Random-Field-8, Random-Field-9, Recovered, id.
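I set the crawler up through the console, but the same could be done programmatically. Here is a minimal sketch with boto3; the IAM role name is a placeholder, while the covid19 database matches the one queried in Athena later on:

import boto3

glue = boto3.client('glue')

# register a crawler pointing at the input bucket; the IAM role below is a placeholder
glue.create_crawler(
    Name='covid19',
    Role='AWSGlueServiceRole-covid19',  # hypothetical role with S3 read + Glue permissions
    DatabaseName='covid19',             # Glue Data Catalog database the table lands in
    Targets={'S3Targets': [{'Path': 's3://pochetti-covid-19-input'}]})

glue.start_crawler(Name='covid19')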

3. Run ETL with Glue

Now that we have the data nicely organized in a table, we can proceed to the actual cleaning. Specifically, we have to:

  • Select only the fields we are interested in, i.e. Confirmed, Country-Region, Date, Deaths, Lat, Long, Province-State, Recovered, id, removing the Random-Field-* columns (undo the add_fields function).
  • Fix the Date formatting (undo the crappify_date function).
  • Homogenize the null values of Province-State (undo the crappify_province function).
  • Homogenize Confirmed, Recovered and Deaths in case the value of any of those fields equals 0 (undo the crappify_zeros function).

The way to achieve that is by leveraging a Glue ETL Job. Let’s see how we can create one.

Navigate to the AWS Glue ETL Jobs page and click “Add job”.
Type a name and select an IAM Role (you can let Glue create one for you or DIY in advance). As for the environment, I went for Spark (Python) and asked Glue to propose a script template for me. Make sure to tick the Job Metrics box under Monitoring Options, as it allows saving logs to CloudWatch.
Point the ETL Job to the data source of interest. In my case the previously-created pochetti_covid_19_input table within the Glue Data Catalog.
We want to change the table schema as part of our data cleansing process, as we need to drop all the Random-Field-* fields.
This is when we decide the format and destination of the Job’s output. In our case, we save a new clean JSON into the pochetti-covid-19-output S3 bucket.
Next, we tell Glue which fields we want to keep and their respective data types. By clicking on the X (red square) on the target side, we get rid of unwanted columns.

The process ends by clicking “Save job and edit script” at the bottom (not visible) of the previous screenshot. That’s when we land on a console-embedded IDE containing the PySpark script Glue has put together for us. It is just a skeleton with minimal code at this point. We need to add all the cleansing logic to fix the data, as described before. How do we actually do that without the ability to test the code? We don’t really want to run the ETL Job each time we add a line just to make sure there are no syntax errors, and, as far as I could see, there is no TEST button around.

The answer is a Dev Endpoint.

A Dev Endpoint is nothing other than a SageMaker notebook instance running in a PySpark environment, allowing the user to connect to the Glue Data Catalog and run/test code on demand, enjoying the flexibility of Jupyter. The result is this notebook, which I ended up turning into the actual Spark script (here) Glue runs as part of the ETL Job.

You can create a Dev Endpoint within the AWS Glue console. Doing so automatically triggers the creation of a SageMaker notebook instance, as shown in the following screenshot.
As soon as the SageMaker notebook instance is up and running, you can just open it up and start coding in a Spark environment.
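To give an idea of the shape such a script takes, here is a heavily simplified sketch of the cleansing logic using Glue’s PySpark APIs. It is not the actual script linked above: column names may be lower-cased by the crawler and the Spark date patterns are illustrative assumptions.

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext
from pyspark.sql import functions as F

glueContext = GlueContext(SparkContext.getOrCreate())

# read the crawled input table from the Glue Data Catalog
dyf = glueContext.create_dynamic_frame.from_catalog(
    database='covid19', table_name='pochetti_covid_19_input')
df = dyf.toDF()

# keep only the columns we care about, dropping any Random-Field-*
keep = ['Province-State', 'Country-Region', 'Lat', 'Long',
        'Date', 'Confirmed', 'Deaths', 'Recovered', 'id']
df = df.select(*keep)

# normalize Date: try each crappified pattern in turn and fall back to the next
patterns = ['yyyy-MM-dd', 'yyyy/MM/dd', 'yyyy-MMM-dd', 'yyyy/MMMM/dd', 'yyyy, MMMM, dd']
df = df.withColumn('Date', F.date_format(
    F.coalesce(*[F.to_date('Date', p) for p in patterns]), 'yyyy-MM-dd'))

# homogenize the inconsistent null markers in Province-State
df = df.withColumn('Province-State',
    F.when(F.col('Province-State').isin('NULL', '', 'missing', '--'), None)
     .otherwise(F.col('Province-State')))

# map the fake zeros ('zero', 'o') back to 0 and cast the counts to integers
for c in ['Confirmed', 'Deaths', 'Recovered']:
    df = df.withColumn(c,
        F.when(F.col(c).isin('zero', 'o'), 0).otherwise(F.col(c)).cast('int'))

# write the cleaned records as JSON to the output bucket
glueContext.write_dynamic_frame.from_options(
    frame=DynamicFrame.fromDF(df, glueContext, 'clean'),
    connection_type='s3',
    connection_options={'path': 's3://pochetti-covid-19-output'},
    format='json')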

Now that we have a fully functioning Glue ETL Job, we are just left with running it.

Successful Job run

As expected, the ETL saves clean JSONs into the pochetti-covid-19-output S3 bucket.

The output of the Glue ETL Job

Inspecting the top file (run-1583319324043-part-r-00000) reveals properly formatted contents, as shown here:

{"Province-State":"Anhui",
 "Country-Region":"Mainland China",
 "Lat":31.8257,
 "Long":117.2264,
 "Recovered":0,
 "Confirmed":1,
 "Deaths":0,
 "Date":"2020-01-22",
 "id":0}

Let’s move on.

4. Crawl S3 output with Glue

This is just an intermediate step to take the JSONs sitting in the output S3 bucket and register them as a table within the Glue Data Catalog. That’s exactly what a Glue Crawler is for. We point it to the pochetti-covid-19-output S3 bucket and we let it create the pochetti_covid_19_output table.

As expected, the Glue Crawler parsed 34410 objects and created a table with the following columns: Confirmed, Country-Region, Date, Deaths, Lat, Long, Province-State, Recovered, id.
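If you prefer to double-check the resulting schema from code rather than from the console, a quick boto3 sanity check might look like this (database name as in the Athena query below):

import boto3

glue = boto3.client('glue')

# fetch the table the output crawler created and print its columns
table = glue.get_table(DatabaseName='covid19', Name='pochetti_covid_19_output')
for col in table['Table']['StorageDescriptor']['Columns']:
    print(col['Name'], col['Type'])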

5. Query Glue Catalog with Athena

Only one step is left before we can effectively use the data: removing duplicates by id.

I decided to do that in SQL, leveraging the flexibility of Amazon Athena. The idea is to clean up repeated rows with a simple query (pasted below for reference) and create a new table (covid19_athena) out of it, within the same Glue Data Catalog database. As expected, the table contains 5000 data points (down from 34410), like the original Kaggle dataset, which proves the cleansing process was successful.

CREATE TABLE covid19_athena AS
SELECT "province-state",
       "country-region",
       "lat",
       "long",
       "date",
       "confirmed",
       "deaths",
       "recovered"
FROM (
    SELECT t.*,
           row_number() OVER (PARTITION BY id ORDER BY date) AS r
    FROM "covid19"."pochetti_covid_19_output" t)
WHERE r = 1
Removing duplicates by id within the pochetti_covid_19_output table and storing results in the new covid19_athena table
As expected, the newly created covid19_athena table contains 5000 rows (down from 34410), the same as the original Kaggle dataset. This completes the cleaning process.
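For reference, the same CTAS statement could also be submitted programmatically via boto3; the local SQL file and the query-results staging bucket below are hypothetical placeholders:

import boto3

athena = boto3.client('athena')

# the deduplication CTAS statement shown above, saved to a local file (hypothetical)
ctas = open('covid19_dedup.sql').read()

# submit the query against the covid19 database; OutputLocation is a placeholder bucket
response = athena.start_query_execution(
    QueryString=ctas,
    QueryExecutionContext={'Database': 'covid19'},
    ResultConfiguration={'OutputLocation': 's3://pochetti-athena-results/'})

print(response['QueryExecutionId'])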

6. Visualize with QuickSight

Visualization time (that’s the point where every analyst would ideally like to start, trusting that the data cleaning has already been taken care of by someone else). As you saw, it took us a while to get here, proving how complex and tricky these processes are.

What is left is as simple as opening up Amazon QuickSight, choosing a data source (the covid19_athena table within the Glue Data Catalog) and starting to plot. One thing to mention: during the data loading phase, you might need to convert the type of the Date field from string to date. That’s it.

Here are a couple of screenshots of the charts I put together, pivoting the data on different dimensions. QuickSight makes it quite easy.

#Confirmed cases by geospatial location as of 1 March 2020
#Confirmed, #Recovered and #Deaths WW by date
Heatmap of #Confirmed by country and date

Could it have been done differently?

As far as I was able to dig, my understanding is that the real benefit of using AWS Glue comes from handling huge datasets. So, say we had to process millions of JSONs in S3, then what we did would make a lot of sense (hopefully). As for alternative approaches to achieve the same goal, here are a couple of ideas:

  • If you know which kind of pre-processing is needed upfront, before data hits your systems, then I’d say that leveraging a plain and simple Lambda function triggered by an S3 object upload event would be the simplest solution. Apply any cleaning steps within Lambda and dump everything to another bucket (see the sketch after this list).
  • If the data is already on-premise and you don’t have a lot of it, probably the easiest way is to loop through the relevant objects, run a Lambda on top of them and save again to S3. Ugly and hacky but effective.
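To illustrate the first option, here is a minimal Lambda handler sketch; clean_up and the pochetti-covid-19-clean destination bucket are hypothetical placeholders:

import json
import boto3

s3 = boto3.client('s3')

def handler(event, context):
    # triggered by an s3:ObjectCreated:* event on the raw input bucket
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = record['s3']['object']['key']

        raw = json.loads(s3.get_object(Bucket=bucket, Key=key)['Body'].read())

        clean = clean_up(raw)  # hypothetical function holding the cleansing rules

        # dump the cleaned record to a second bucket (name is a placeholder)
        s3.put_object(Bucket='pochetti-covid-19-clean', Key=key, Body=json.dumps(clean))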

Here you can find an AWS Glue FAQ page, addressing lots of interesting points related to the product and its interaction with the data-related AWS suite.
