Note: in this GitHub repo you can find the two notebooks and the Python script (COVID-19*) I created while working on the project. Also, the Kaggle dataset I refer to (this one) is updated daily by its curator. At the time of writing it contained 5k rows, but it constantly grows in size. As for the infrastructure, AWS Glue is the serverless ETL service we will be using the most throughout the post. Check it out.
Introduction
Any piece of analytics is just as good as the data it is sitting upon.
Nothing new here. Yet, even if acknowledged, this fact doesn’t make the process of getting data into a usable format any easier. Nowadays, most companies have engineers taking care of the heavy lifting when it comes to business-critical ETL (Extract, Transform, Load) jobs. Still, more often than expected, analysts need to roll up their sleeves and process large amounts of messy data to enrich their models. It is therefore of paramount importance to be up to date and comfortable with the most common technologies to get the job done.
The purpose of this post is to go through one of these scenarios and use AWS ETL and analytics services to pull it off. The idea is to start from relatively messy data, clean it up and build a report on top of it.
First things first: we need a use case. Given the current news, I thought it would be interesting to build a pipeline around an openly available COVID-19 dataset. For the record, COVID-19 is an infectious disease caused by a virus of the coronavirus family, the same family as SARS. It originated in China in December 2019 and quickly spread to almost the entire world in a pandemic of vast proportions. At the time of writing, the worldwide counter reports 100k+ confirmed cases and 4k+ deaths, with the toll rising constantly.
The dataset I will be using is this one. It shows the cumulative confirmed, recovered and death figures by day, country and province from 22 January to 1 March 2020. Latitude and longitude coordinates are also provided at the country level, i.e. a single GPS point per nation. The problem with this dataset is that, ironically, it is already perfectly usable. All fields are clean and properly formatted. So, to make the whole thing a little more interesting, I decided to screw it up as much as I could, and then test whether I could fix it back using AWS. Let me outline the work to be done:
- Get the COVID-19 Kaggle dataset and randomly break fields, e.g. screw up formatting, add null values, duplicates, etc, to make it more similar to sources we deal with on a day-to-day basis.
- Drop the crappified version of #1 onto Amazon S3.
- Build an ETL pipeline to clean #2 up.
- Build a QuickSight report on top.
Time to dive in.
AWS diagram: project’s workflow
This is the plan in detail:
- Save the messed-up dataset to the pochetti-covid-19-input S3 bucket. I will dive deeper into this one in the following section.
- Run the covid19 AWS Glue Crawler on top of the pochetti-covid-19-input S3 bucket to parse JSONs and create the pochetti_covid_19_input table in the Glue Data Catalog.
- Run the covid19-json Glue ETL Job on top of the pochetti_covid_19_input table to:
- clean the data, e.g. undo all the crappification logic previously implemented
- save instances in JSON format to the pochetti-covid-19-output S3 bucket.
- Run the covid19-output AWS Glue Crawler on top of the pochetti-covid-19-output S3 bucket to parse JSONs and create the pochetti_covid_19_output table in the Glue Data Catalog.
- Query the pochetti_covid_19_output table in the Glue Data Catalog via Amazon Athena. Remove duplicates and create the final, clean, covid19_athena table in the Glue Data Catalog.
- Connect Amazon QuickSight to the covid19_athena table and build relevant visualizations.
1. Crappifying the dataset
N.B. I must give credit where credit is due, of course. AFAIK the neologism “crappify a dataset” was coined by Jeremy Howard in the context of generative models for fast.ai. I am shamelessly stealing the term.
If you simply download the dataset from here and open it up, you will be pleasantly surprised by how clean and ready-to-use it is. This is because it has already been preprocessed by the dataset curator. That’s nice of him but, given that our goal is to practice with ETL pipelines, we need something a bit messier. Which is why I opted for crappifying the original dataset to get something more realistic. This is the notebook where I implement the logic, and here is what I did in detail.
- The dataset contains 5000 rows by 8 columns (Province-State, Country-Region, Lat, Long, Date, Confirmed, Deaths, Recovered).
- I add an id column copying the row index, to be able to discriminate duplicates more efficiently later on.
- I create duplicated rows, randomly selecting samples of the dataframe and appending them to the original one. This operation increases the size to 34410 rows, with some instances repeated up to 18 times.
- For each row:
- I turn it into JSON
- I mess up the Date format, from the original "%Y-%m-%d" to a randomly selected option among ["%Y-%m-%d", "%Y/%m/%d", "%Y-%b-%d", "%Y/%B/%d", "%Y, %B, %d"]
- In case Province-State is null, I override the field with a randomly selected option among ['NULL', '', 'missing', '--']
- I add up to 10 integer fields to the JSON, in the format Random-Field-{i}: {i}
- In case one of the 3 fields Confirmed, Deaths, Recovered equals 0, I override it with a randomly selected option among ['zero', 'o', 0]
- I upload the resulting crappified JSON to the pochetti-covid-19-input S3 bucket, with filename json-{i}.
Steps 4.1 to 4.5 are achieved by calling the function process_entry (code provided below for reference) on each individual row of the pandas dataframe.
import json
import random
from datetime import datetime

def crappify_date(x):
    # rewrite the epoch-millisecond Date with a randomly chosen string format
    formats = ["%Y-%m-%d", "%Y/%m/%d", "%Y-%b-%d", "%Y/%B/%d", "%Y, %B, %d"]
    x['Date'] = datetime.fromtimestamp(x['Date']/1000.0).strftime(random.choice(formats))
    return x

def crappify_province(x):
    # replace a null Province-State with one of several inconsistent "missing" markers
    if x['Province-State'] is None: x['Province-State'] = random.choice(['NULL', '', 'missing', '--'])
    return x

def add_fields(x):
    # append up to 10 useless Random-Field-{i} integer fields
    n = random.randint(0, 10)
    if n == 0: return x
    else:
        for i in range(n):
            x[f"Random-Field-{i}"] = i
        return x

def crappify_zeros(x):
    # pick one of the count fields and, if it equals 0, replace it with a messy representation of zero
    f = random.choice(['Confirmed', 'Deaths', 'Recovered'])
    if x[f] == 0: x[f] = random.choice(['zero', 'o', 0])
    return x

def crappify_json(x):
    x = crappify_date(x)
    x = crappify_province(x)
    x = add_fields(x)
    return crappify_zeros(x)

def process_entry(df, i):
    # turn row i of the dataframe into a dict and apply the full crappification pipeline
    x = json.loads(df.loc[i].to_json(date_format='epoch'))
    return crappify_json(x)
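For completeness, here is a minimal sketch of the driver loop that ties these functions together: read the Kaggle CSV, add the id column, inject the duplicates and push each crappified JSON to S3. The CSV filename, the duplication factor and the assumption that the column names already match the ones used above are illustrative; see the notebook linked above for the real code.

import json
import boto3
import pandas as pd

df = pd.read_csv("covid_19_data.csv")    # original Kaggle export; filename and column names assumed
df["Date"] = pd.to_datetime(df["Date"])  # needed so to_json(date_format='epoch') emits epoch milliseconds
df["id"] = df.index                      # row index as id, to spot duplicates later on

# inject duplicates: sample rows with replacement and append them to the original dataframe
dupes = df.sample(n=6 * len(df), replace=True, random_state=42)
df = pd.concat([df, dupes]).reset_index(drop=True)

s3 = boto3.client("s3")
for i in range(len(df)):
    x = process_entry(df, i)  # crappify one row at a time
    s3.put_object(Bucket="pochetti-covid-19-input", Key=f"json-{i}", Body=json.dumps(x))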
And this is what one of the messed-up JSONs looks like.
#BEFORE
{'Province-State': 'San Benito, CA',
'Country-Region': 'US',
'Lat': 36.5761,
'Long': -120.9876,
'Date': 1581724800000,
'Confirmed': 2,
'Deaths': 0,
'Recovered': 0}
#AFTER
{'Province-State': 'San Benito, CA',
'Country-Region': 'US',
'Lat': 36.5761,
'Long': -120.9876,
'Date': '2020, February, 15',
'Confirmed': 2,
'Deaths': 'o',
'Recovered': 0,
'id': 50,
'Random-Field-0': 0,
'Random-Field-1': 1,
'Random-Field-2': 2,
'Random-Field-3': 3,
'Random-Field-4': 4,
'Random-Field-5': 5}
Here, instead, are the contents of the pochetti-covid-19-input bucket.
2. Crawl S3 input with Glue
This step is as simple as creating an AWS Glue Crawler (covid19) and pointing it to an S3 bucket. The Crawler dives into the JSON files, figures out their structure and registers it as a new table in the Glue Data Catalog. Almost pure magic.
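I did this through the console, but the same can be scripted with boto3 if you prefer. A minimal sketch, assuming you already have a Glue service IAM role (the role ARN below is a placeholder) and a covid19 database in the Data Catalog:

import boto3

glue = boto3.client("glue")

# register a crawler pointing at the input bucket (role ARN is a placeholder)
glue.create_crawler(
    Name="covid19",
    Role="arn:aws:iam::123456789012:role/GlueServiceRole",
    DatabaseName="covid19",
    Targets={"S3Targets": [{"Path": "s3://pochetti-covid-19-input"}]},
)

# kick it off; once it finishes, the pochetti_covid_19_input table shows up in the Data Catalog
glue.start_crawler(Name="covid19")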
3. Run ETL with Glue
Now that we have the data nicely organized in a table, we can proceed to the actual cleaning. Specifically, we have to:
- Select only the fields we are interested in, i.e. Confirmed, Country-Region, Date, Deaths, Lat, Long, Province-State, Recovered, id, removing any Random-Field-* (undo the add_fields function).
- Fix the Date formatting (undo the crappify_date function).
- Homogenize the null values of Province-State (undo the crappify_province function).
- Homogenize Confirmed, Recovered and Deaths in case the value on any of those fields equals 0 (undo the crappify_zeros function).
The way to achieve that is by leveraging a Glue ETL Job. Let’s see how we can create one.
The process ends by clicking “Save job and edit script” at the bottom (not visible) of the previous screenshot. That’s when we land on a console-embedded IDE containing the PySpark script Glue has put together for us. It is just a skeleton with minimal code at this point. We need to add all the cleansing logic to fix the data, as described before. How do you actually do that without the ability to test the code? We don’t really want to run the ETL Job each time we add a line just to make sure there are no syntax errors and, as far as I could see, there were no TEST buttons around.
The answer is a Dev Endpoint.
A Dev Endpoint is nothing more than a SageMaker notebook instance, running against a PySpark environment, that allows the user to connect to the Glue Data Catalog and run/test code on demand, with all the flexibility of Jupyter. The result is this notebook, which I ended up turning into the actual Spark script (here) that Glue runs as part of the ETL Job.
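To give a sense of what that script does, here is a condensed sketch of the cleansing logic, not the verbatim job script: I assume the crawler lower-cased the column names (as Glue crawlers typically do) and that the catalog database is called covid19; adjust both to whatever your Data Catalog shows.

from pyspark.context import SparkContext
from pyspark.sql import functions as F
from awsglue.context import GlueContext
from awsglue.transforms import SelectFields
from awsglue.dynamicframe import DynamicFrame

glueContext = GlueContext(SparkContext.getOrCreate())

# 1. read the crawled input table and drop the Random-Field-* columns
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="covid19", table_name="pochetti_covid_19_input")
dyf = SelectFields.apply(frame=dyf, paths=[
    "province-state", "country-region", "lat", "long",
    "date", "confirmed", "deaths", "recovered", "id"])

df = dyf.toDF()

# 2. parse the randomly formatted dates back to a single yyyy-MM-dd representation
date_formats = ["yyyy-MM-dd", "yyyy/MM/dd", "yyyy-MMM-dd", "yyyy/MMMM/dd", "yyyy, MMMM, dd"]
df = df.withColumn(
    "date",
    F.date_format(F.coalesce(*[F.to_date("date", fmt) for fmt in date_formats]), "yyyy-MM-dd"))

# 3. homogenize the fake "missing" markers in province-state back to null
df = df.withColumn(
    "province-state",
    F.when(F.col("province-state").isin("NULL", "", "missing", "--"), F.lit(None))
     .otherwise(F.col("province-state")))

# 4. map the messy zero representations ('zero', 'o') back to 0 and cast to int
for c in ["confirmed", "deaths", "recovered"]:
    df = df.withColumn(
        c, F.when(F.col(c).isin("zero", "o"), F.lit(0)).otherwise(F.col(c)).cast("int"))

# 5. write the clean records as JSON to the output bucket
out = DynamicFrame.fromDF(df, glueContext, "out")
glueContext.write_dynamic_frame.from_options(
    frame=out, connection_type="s3",
    connection_options={"path": "s3://pochetti-covid-19-output"},
    format="json")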
Now that we have a fully functioning Glue ETL Job, we are just left with running it.
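Running it from the console is a one-click affair; if you wanted to script it instead, a minimal boto3 sketch (job name as created above) would look like this:

import boto3

glue = boto3.client("glue")

# start the job, then check its state (in practice you would poll until SUCCEEDED or FAILED)
run_id = glue.start_job_run(JobName="covid19-json")["JobRunId"]
state = glue.get_job_run(JobName="covid19-json", RunId=run_id)["JobRun"]["JobRunState"]
print(run_id, state)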
As expected, the ETL saves clean JSONs into the pochetti-covid-19-output S3 bucket.
Inspecting the top file (run-1583319324043-part-r-00000) reveals properly formatted contents, as shown here:
{"Province-State":"Anhui",
"Country-Region":"Mainland China",
"Lat":31.8257,
"Long":117.2264,
"Recovered":0,
"Confirmed":1,
"Deaths":0,
"Date":"2020-01-22",
"id":0}
Let’s move on.
4. Crawl S3 output with Glue
This is just an intermediate step to expose the JSONs sitting in the output S3 bucket as a table within the Glue Data Catalog. That’s exactly what a Glue Crawler is for. We point it to the pochetti-covid-19-output S3 bucket and let it create the pochetti_covid_19_output table.
5. Query Glue Catalog with Athena
Only one step is left before we can actually use the data: removing duplicates by id.
I decided to do that in SQL, leveraging the flexibility of Amazon Athena. The idea is to clean up repeated rows with a simple query (pasted below for reference) and create a new table (covid19_athena) out of it, within the same Glue Data Catalog database. As expected, the resulting table contains 5000 data points (down from 34410), matching the original Kaggle dataset, which confirms the cleansing process worked.
CREATE TABLE covid19_athena AS
SELECT "province-state",
       "country-region",
       "lat",
       "long",
       "date",
       "confirmed",
       "deaths",
       "recovered"
FROM (
    SELECT t.*,
           row_number() OVER (PARTITION BY id ORDER BY "date") AS r
    FROM "covid19"."pochetti_covid_19_output" t)
WHERE r = 1
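As a side note, you don’t have to run this from the Athena console; the same query can be submitted with boto3. A minimal sketch, where the results OutputLocation bucket is a placeholder:

import boto3

athena = boto3.client("athena")

dedup_query = """
CREATE TABLE covid19_athena AS
SELECT "province-state", "country-region", "lat", "long",
       "date", "confirmed", "deaths", "recovered"
FROM (
    SELECT t.*, row_number() OVER (PARTITION BY id ORDER BY "date") AS r
    FROM "covid19"."pochetti_covid_19_output" t)
WHERE r = 1
"""

# submit the query; Athena also needs an S3 location for its result files (placeholder below)
response = athena.start_query_execution(
    QueryString=dedup_query,
    QueryExecutionContext={"Database": "covid19"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-query-results/"},
)
print(response["QueryExecutionId"])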
6. Visualize with QuickSight
Visualization time (the point where every analyst would ideally like to start, trusting that the data cleaning has already been taken care of by someone else). As you saw, it took us a while to get here, proving how complex and tricky these processes can be.
What is left is as simple as opening up Amazon QuickSight, choosing a data source (the covid19_athena table within the Glue Data Catalog) and starting to plot. One thing to mention: during the data loading phase, you might need to convert the type of the Date field from string to date. That’s it.
Here are a couple of screenshots of the charts I put together, pivoting the data on different dimensions. QuickSight makes it quite easy.
Could it have been done differently?
As far as I was able to dig, my understanding is that the real benefit of using AWS Glue comes from handling huge datasets. So, say we had to process millions of JSONs in S3, then what we did would make a lot of sense (hopefully). As for alternative approaches to achieve the same goal, here are a couple of ideas:
- If you know which kind of pre-processing is needed upfront, before data hits your systems, then I’d say that leveraging a plain and simple Lambda function triggered by an S3 object-upload event would be the simplest solution (see the sketch after this list). Apply any cleaning steps within Lambda and dump everything to another bucket.
- If the data is already on-premise and you don’t have a lot of it, probably the easiest way is to loop through the relevant objects, run a Lambda on top of them and save again to S3. Ugly and hacky but effective.
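For the first option, here is a minimal sketch of what such a Lambda handler could look like. The bucket name, the clean_record helper and the event wiring are illustrative assumptions, not code from this project:

import json
import boto3

s3 = boto3.client("s3")

def clean_record(record):
    # placeholder for whatever cleaning you need: fix dates, nulls, zeros, drop junk fields
    return record

def handler(event, context):
    # triggered by an S3 ObjectCreated event on the raw-input bucket
    for item in event["Records"]:
        bucket = item["s3"]["bucket"]["name"]
        key = item["s3"]["object"]["key"]

        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        record = clean_record(json.loads(body))

        # write the cleaned record to a separate bucket (name is a placeholder)
        s3.put_object(Bucket="my-clean-output-bucket",
                      Key=key,
                      Body=json.dumps(record))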
Here you can find an AWS Glue FAQ page, addressing lots of interesting points related to the product and its interaction with the data-related AWS suite.