Skip to content

How to build an expense tracker with Amazon Textract

Reading Time: 6 minutes

Note: the Jupyter notebook is available here.

According to the official AWS documentation

Amazon Textract is a service that automatically extracts text and data from scanned documents. Amazon Textract goes beyond simple optical character recognition (OCR) to also identify the contents of fields in forms and information stored in tables.

https://aws.amazon.com/textract/

The description immediately tickled my curiosity. The feeling turned into pure excitement when I happened to watch Andy Jassy showcase the service during his event’s keynote. The capabilities of this new AWS AI resource looked truly incredible. I needed to give it a go.

Tracking expenses

The first application which crossed my mind when thinking about Textract was an expense tracker. The idea came from the wonderful ExpenseIt app I used multiple times on business trips. The application automates the receipts ingestion process by allowing users to take pictures of tickets and extracting all sort of information from them, from the paid amount to place and date. A real piece of cake. Inspired by ExpensIt and given the unreasonable number of receipts I keep at home, I decided to use Textract to parse them and try to build a homemade expense tracker.

Here a couple of examples of receipts from a Seattle trip I took a couple of years ago.

As you would expect, the scans are far from being neat. To get a sense of how Textract would perform on those, I went for the Try Amazon Textract feature available within the AWS console. This option allows users to upload a picture of a document and run the OCR magic on top of it directly in the UI.

This is what I got for the Chipotle ticket you saw above on the right, in the group of three receipts.

A couple of obvious considerations came my mind when I looked at the results.

  • High OCR quality: The quality of the OCR model is outstanding. In any receipt I tried, Textract managed to isolate all the relevant (horizontally oriented) text. The greyed areas you see in the left side panel of the screenshot are the bounding boxes the service draws around each piece of text. None is missing. Not only does Textract a good job at detecting characters. As the right side panel of the screenshot shows, it manages to correctly read them too. I was not able to find typos so far.
  • Pieces of text on the same line are not recognized as such: At the same time, for my expense tracking use case, a first issue appears evident. If you take a look again at the Chipotle ticket screenshot, you might notice that Textract highlighted lines. Each bounding box is what the service thinks is a whole row. So, as far as Textract is concerned, TAKE OUT Total and 2,64 belong to two different lines. This is clearly a problem. As we will see later on, the Textract API returns each piece of text separately in a JSON, so technically there is no way to match TAKE OUT Total and 2,64. No way to infer that 2,64 is the amount of money we are looking for. Technically, for each piece of text, Textract returns the bounding boxes coordinates too. This means that despite the system not recognizing TAKE OUT Total and 2,64 to belong to the same line explicitly, the fact that the y coordinates of the two boxes are very close can potentially be exploited to deduct their relationship programmatically. Turns out another handier trick can be used too. I obviously don’t know how Textract operates in the backend. If I am allowed to take a wild guess though, the structure of the JSON output suggests that the service parses documents top-down and left-right, as the order of the text pieces follows their relative position in the receipt. You can check this yourself looking at the screenshot above. 2,64 comes just after TAKE OUT Total. If we manage to identify the box referring to the Total, it is very likely the actual amount might be contained in the following one.

The Textract API

How do we submit scanned documents to Textract programmatically? boto3 and the python SDK to the rescue, of course. Detecting text inside a receipt literally translates into a one-liner:

json = textract_boto_client.detect_document_text(Document={'Bytes': image_binary})

The rest consists of handling the response, i.e. parsing the JSON and implementing the little trick I explained above. Here the function I ended up putting together. Once Textract returns the list of text objects detected, I use regular expressions to identify the date the receipt was emitted and the box referring to the total. The former does not require any further processing, except for correctly being re-formatted as a date. As for the latter, I just need its position, as the following element in the list is likely the money amount itself. All this info is then wrapped into a second JSON and saved to S3.

def process_receipt(document, 
                    bucket = 'textract-expense-tracker', 
                    display=False, upload=True, printout=False):
                      
    s3_object = boto3.resource('s3').Object(bucket, document)
    s3_response = s3_object.get()
    stream = io.BytesIO(s3_response['Body'].read())
    bucket_location = boto3.client('s3').get_bucket_location(Bucket=bucket)
    
    ######################################
    #INVOKING TEXTRACT
    ######################################
    text = boto3.client('textract')
    image_binary = stream.getvalue()
    response = text.detect_document_text(Document={'Bytes': image_binary})
    dates = []
    totals = []
    
    for i, block in enumerate(response["Blocks"]):
        if block["BlockType"] == "LINE":
            date = re.search("[0-3]?[0-9]/[0-3]?[0-9]/(?:[0-9]{2})?[0-9]{2}", block["Text"])
            total = re.search("Total|TOTAL", block["Text"])
            if date is not None: 
                dates.append(date.group())
            if total is not None: 
                total = ''.join(c for c in response["Blocks"][i+1]["Text"] if c in digits+"."+",")
                if total == '':
                    total = ''.join(c for c in response["Blocks"][i-1]["Text"] if c in digits+"."+",")
                totals.append(total.replace(",", "."))
    date = list(set(dates))[0]
    dt = parser.parse(date)
    date = dt.strftime("%Y-%m-%d")
    amount = list(set(totals))[0]
    
    if printout:
         print(f"Document: {document}; Date: {date}; Amount: {amount}")
            
    if display: 
        image = Image.open(stream)
        fig, ax = plt.subplots(figsize=(5,10))
        ax.imshow(image)
        plt.show()
    
    ######################################
    #SAVING JSON WITH SENTIMENT TO S3
    ######################################
    content = {            
            'receipt' : f"https://{bucket}.s3-{bucket_location['LocationConstraint']}.amazonaws.com/{document}",
            'submitted_on' : strftime("%Y-%m-%d %H:%M:%S GMT", gmtime()),
            'date' : date,
            'amount' : float(amount)
    }
    
    if upload: 
        boto3.client('s3').put_object(Body=json.dumps(content), Bucket=bucket, Key="textracted/"+document.replace("jpg", "json"));
    
    return

The following is the result of running the function on the Chipotle receipt. As you can see Textract, together with my custom logic, correctly identified the date (2017-05-12) and the amount spent (2.64).

A couple of caveats, before I forget. The algorithm expects the “total box” to contain either Total or TOTAL. This is obviously not a robust solution as it breaks for all the languages where the word is spelled differently. As for the money part, the currency is not available (in most cases). This means a receipt you collected in the US or France will be parsed in the same way, just extracting a raw number, you cannot thus compare. A way of fixing this could be to use Textract to detect the name of the place the ticket was emitted at or its address, and then trigger a lookup to figure out in which country it is located. This info is not always available through and it might need some additional thought.

Aggregating expenses

As previously mentioned, after being extracted from the receipt, the data is stored in S3 in JSON format. This is an example:

{
  "receipt": "https://textract-expense-tracker.s3-eu-west-1.amazonaws.com/receipt7.jpg",
  "submitted_on": "2019-06-17 21:20:16 GMT",
  "date": "2017-05-11",
  "amount": 8.81
}

This structure allows to easily read the files in pandas, build aDataFrameand aggregate expenses as needed. Here the end result:

jsons = []
for extract in bucket.objects.filter():
    if ".json" in extract.key:
        content_object = s3.Object(extract.bucket_name, extract.key)
        file_content = content_object.get()['Body'].read().decode('utf-8')
        jsons.append(json.loads(file_content))
expenses = pd.concat([pd.DataFrame(j, index=[0]) for j in jsons], ignore_index=True)
expenses.groupby("date")["amount"].sum().reset_index().plot(x="date", y="amount", rot=30, figsize=(10,6))

Integration with other AWS services

This was a quick proof of concept and I did not want to push it further. Nevertheless, a potential plan to develop it into a fully functional AWS-based expense tracker is easily stated. Below the pipeline I had sketched (points 4 and 5 are similar to what I already did here):

  1. Upload receipt to S3
  2. Trigger Lambda
  3. Lambda invokes the process_receipt function (Textract magic behind the scenes)
  4. Athena parses the JSON files Lambda has saved to S3 and stores the data into a table.
  5. QuickSight connects to Athena, allowing all sorts of visualizations.

Happy expense tracking!

Discover more from

Subscribe now to keep reading and get access to the full archive.

Continue reading