Introduction
In this post, I briefly dive into the fascinating domain of OCR, in a quest to examine the most commonly used engines, and try to answer the following ever-lasting question: which one is better?
Despite its apparent simplicity, this is a very tricky question to address. As always, it depends on the application of interest. Reading an invoice is very different from capturing text from a lit-up traffic signal at night. Language plays a major role too. English and Arabic tokens are not quite alike, and the latter might require more effort than the former.
To keep things simple, I opted to test only the recognition of Latin alphanumeric text on a whitish background. Even if not completely thorough, such an experiment should already provide some insights into our quest.
Here is what we need:
- a selection of OCR engines to put to the test
- some labeled data to run them on
- a metric to measure performance
OCR engines
I selected:
- Tesseract: probably the most famous and widespread open-source solution (41.1k stars on GitHub at the time of writing). Available in Python via the Python-Tesseract library, this engine is powerful and accurate. Note: if you need to install it on Ubuntu, as I did, these two resources might be helpful.
- EasyOCR: much younger than Tesseract, EasyOCR is quickly gaining in popularity, with 12.1k GitHub stars and counting. As the name suggests, this engine is incredibly easy to use. Its API is just a `pip install` away, providing one-liner solutions for a growing number of languages, with handwritten text support upcoming. The library also comes with first-class, blazingly fast GPU support. For comparison, I was not able to get Tesseract to work on CUDA. I know it is possible, but it is not trivial, and eventually I gave up.
- Amazon Textract: part of the AWS suite of services, Textract is (for the moment) not open-source and not free (but very cheap). Nevertheless, I really wanted to add it to the mix, as I have played with it myself in the past (obtaining great results) and, most importantly, I know several enterprise-level offerings running on top of it. Also, like most AWS services, it is available in Python via `boto3`, so definitely worth a shot.
Getting a dataset
This is easier said than done. After a couple of Google queries, I quickly realized there are no large-scale, varied, public OCR datasets (at least I could not find any). Most OCR engine benchmarks I found online are based on synthetic data, which is not ideal, given that the real world tends to be a lot messier, but not a deal-breaker either. Some analyses, such as this one, simply create image files by writing random words on a white canvas using `PIL`. Not a terrible idea, but I wanted something more flexible.
Meet TextRecognitionDataGenerator. This true gem exposes a super neat CLI (and a python wrapper) allowing users to generate images with text in a fully custom way. You can change the font, the resolution, the background, the inclination, the blurring, the margin, the distortion, and much more. `trdg -c 1000 -w 5 -f 64` for instance, is how you create 1000 images of random text of up to 5 separated words with a resolution of 64 pixels (height).
Needless to say, creating experimental datasets was a piece of cake with `trdg`. I decided to generate 6 of those, to run tests at an increasing level of text recognition difficulty. Here they are, in order of complexity:
- `trdg -c 500 -na 2 -rs -num -let -t 4 -f 64 -w 5 -r -b 1 -tc '#000000' -bl 1 -rbl -k 1 -rk -m 10,10,10,10 --output_dir 1`, which translates into:
  - `-c 500`: create 500 images
  - `-na 2`: with the following label convention: [ID].[EXT] + one file labels.txt containing id-to-label mappings
  - `-rs -num -let`: using random sequences of numbers and letters
  - `-w 5 -r`: with up to 5 separate tokens (words)
  - `-t 4`: spawning 4 threads for execution
  - `-f 64`: with 64px high resolution
  - `-b 1`: on white background
  - `-tc '#000000'`: with black text
  - `-bl 1 -rbl`: applying a Gaussian blur with a random radius between 0 and 1
  - `-k 1 -rk`: text inclined with a random angle between -1 and 1 degrees
  - `-m 10,10,10,10`: with a margin of 10px on all sides (quite big)
  - `--output_dir 1`: saving results in a folder named 1.
- `trdg -c 500 -na 2 -rs -num -let -t 4 -f 64 -w 5 -r -b 1 -tc '#000000' -bl 1 -rbl -k 1 -rk --output_dir 2`, same as #1 except for no margin added.
- `trdg -c 500 -na 2 -rs -num -let -t 4 -f 32 -w 5 -r -b 1 -tc '#000000' -bl 1 -rbl -k 1 -rk --output_dir 3`, same as #2 but smaller in resolution (`-f 32`): 32px instead of 64px.
- `trdg -c 500 -na 2 -rs -num -let -t 4 -f 32 -w 5 -r -b 0 -tc '#000000' -bl 1 -rbl -k 1 -rk --output_dir 4`, same as #3, but writing on a random noise background instead of plain white (`-b 0`).
- `trdg -c 500 -na 2 -rs -num -let -t 4 -f 32 -w 5 -r -b 0 -tc '#000000,#2B21C8' -bl 1 -rbl -k 1 -rk --output_dir 5`, same as #4, except that now we write with a colored font in random shades between black and blue (`-tc '#000000,#2B21C8'`).
- `trdg -c 500 -na 2 -rs -num -let -t 4 -f 32 -w 5 -r -b 0 -tc '#000000,#2B21C8' -bl 1 -rbl -k 1 -rk -d 3 --output_dir 6`, same as #5 with the additional complexity of randomly distorted text (`-d 3`).
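Since every dataset is generated with `-na 2`, the ground truth for each folder lives in a single `labels.txt`. Here is a minimal sketch of how that file could be loaded. It assumes each line has the form `<file name> <label>`, split on the first space only, which is worth checking against your own `trdg` output. The helper names `parse_labels` and `load_labels` are mine, not part of `trdg`.

```python
from pathlib import Path


def parse_labels(lines):
    """Map image file names to their ground-truth text.

    Assumption: each line looks like "<file name> <label>". We split on the
    FIRST space only, because a label may itself contain several words.
    """
    labels = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        file_name, _, label = line.partition(" ")
        labels[file_name] = label
    return labels


def load_labels(folder):
    """Read <folder>/labels.txt (assumed name and location) into a dict."""
    with open(Path(folder) / "labels.txt", encoding="utf-8") as handle:
        return parse_labels(handle)
```

Calling `load_labels("1")` would then return a dictionary keyed by image file name for the first dataset, provided the file follows the assumed layout.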
Measuring performance
How are we going to measure performance? This is actually quite tricky.
Consider the following: `actual_text = "aaaa"` and `predicted_text = "aaaa"`. That’s 100% accurate. What about `actual_text = "aaaa"` and `predicted_text = "aaab"`? Easy: 75% accurate.
This one instead? `actual_text = "aaaa"` and `predicted_text = "aaaabb"`. Not that trivial anymore. The engine recognized 2 additional characters. How do we compute accuracy in this case? We could come up with a more general definition of the metric. Truth is, maybe accuracy is not the right one for string comparisons. We need something more flexible, capable of handling and comparing sequences of variable sizes.
Meet the Levenshtein distance. From Wikipedia, it amounts to “the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other” (so, the lower the better). Let’s see how this works in practice in the screenshot below. It looks like exactly what we are looking for.
For clarity, in terms of lower and upper bounds, the Levenshtein distance between two strings equals zero if and only if the strings are equal, and is at most the length of the longer string.
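To make the definition concrete, here is a minimal pure-Python sketch of the Levenshtein distance using the classic dynamic-programming recurrence. It is an illustration only, not necessarily the implementation used in the notebook, where a dedicated library may well be faster.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn `a` into `b`."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


print(levenshtein("aaaa", "aaaa"))    # 0
print(levenshtein("aaaa", "aaab"))    # 1
print(levenshtein("aaaa", "aaaabb"))  # 2
```

The three calls mirror the examples above: identical strings score 0, one wrong character scores 1, and two spurious trailing characters score 2, with no need to decide what "accuracy" means for strings of different lengths.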
Running tests in Python
You can find here the link to the Jupyter notebook with all the relevant code. As for the GPU tests, I ran those on a `g4dn.xlarge` EC2 instance on AWS. Last but not least, below are the functions executing the three separate OCR engines, extracted from the notebook, for reference.
import io

import boto3
import easyocr
import pytesseract
from PIL import Image

# Engines are initialised once and reused across calls
reader = easyocr.Reader(['en'], gpu=True)
textract = boto3.client('textract')


def image_to_byte_array(image: Image.Image) -> bytes:
    # Serialise the image in its original file format; Textract expects raw bytes
    img_byte_arr = io.BytesIO()
    image.save(img_byte_arr, format=image.format)
    return img_byte_arr.getvalue()


def aws_textract_read(path):
    img = Image.open(path)
    response = textract.detect_document_text(Document={'Bytes': image_to_byte_array(img)})
    # Join all detected LINE blocks into one space-separated string
    lines = [block["Text"] for block in response["Blocks"] if block["BlockType"] == "LINE"]
    return " ".join(lines).strip()


def easy_ocr_read(path):
    # detail=0 returns plain strings; only the first detected text region is kept
    text = reader.readtext(path, detail=0)
    return text[0] if text else ""


def pytesseract_read(path):
    text = pytesseract.image_to_string(Image.open(path))
    # Strip line breaks and the trailing form feed that Tesseract appends
    text = text.replace("\n\x0c", "")
    text = text.replace("\n", "")
    text = text.replace("\\x", "")
    return text
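For reference, the loop tying a reader function to the labels and the metric could look like the sketch below. It is a hypothetical harness rather than the notebook's actual code: `evaluate` and its parameters are names I made up, and the reader, the labels and the metric are all passed in so the snippet stays independent of any specific engine.

```python
from pathlib import Path


def evaluate(read_fn, labels, metric, folder="."):
    """Return one score per image for a single OCR engine.

    read_fn: callable taking an image path and returning the predicted text
    labels:  dict mapping image file name -> ground-truth text
    metric:  callable (truth, prediction) -> number, e.g. an edit distance
    folder:  directory containing the images
    """
    scores = []
    for file_name, truth in labels.items():
        prediction = read_fn(str(Path(folder) / file_name))
        scores.append(metric(truth, prediction))
    return scores
```

Running it once per engine and per dataset folder yields the lists of distances that can then be averaged or plotted.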
Test results
Global test results at a glance
The two charts below summarise results on the 6 synthetic datasets, both split by experiment and combined. In the first bar plot, it is pretty clear how (as expected) performance degrades as the complexity of the dataset increases. This behavior occurs across all three OCR engines. Note how recognizing big black text on a white background (#1) is 4x easier compared to small, distorted, colored text on a greyish background (#6). Note also how much lower the Levenshtein distance is in experiments #1 and #2, with 64px-high images, than in the rest (#3 to #6), which use smaller but more realistic 32px-high pictures.
PyTesseract seems to win across the board (beware of the averages though), followed by Textract and EasyOCR, in that order. To be fair, looking at the box plots in the second chart, Amazon Textract displays a distribution more skewed towards zero compared to the other two libraries. It is the only engine with a first quartile equal to 0.
The following sections show results split by experiment. I won’t comment on each one of those separately, as the charts are pretty self-explanatory. What stands out is that the general trends are confirmed on the single datasets: EasyOCR performs the worst, while Tesseract and Textract compete for the top spot. These two engines are almost on par looking at averages, with Textract leading the race in terms of the distributions’ medians, suggesting that Textract’s outputs are more extreme: mostly very close to the truth, with a few large misses pulling the average up.
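A toy example helps to see how averages and medians can rank two engines differently. The numbers below are invented purely for illustration and are not results from these experiments.

```python
from statistics import mean, median

# Made-up Levenshtein distances for two hypothetical engines (NOT real results)
steady = [1, 1, 1, 2, 1, 1, 2, 1]    # a small error on almost every image
extreme = [0, 0, 0, 0, 0, 0, 0, 12]  # usually perfect, with one large failure

for name, distances in (("steady", steady), ("extreme", extreme)):
    print(name, "mean:", mean(distances), "median:", median(distances))
```

The "extreme" engine has the lower median but the higher mean, the same kind of disagreement seen between Textract and Tesseract, which is why looking at the full distribution matters.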
1. White background + black text + 10px margin + 64px resolution
2. White background + black text + no margin + 64px resolution
3. White background + black text + no margin + 32px resolution
4. Noise background + black text + no margin + 32px resolution
5. Noise background + black/blue text + no margin + 32px resolution
6. Noise background + black/blue text + no margin + 32px resolution + distorted text
Speed comparison
Together with performance, in order to make a more educated choice, it is also important to compare OCR engines in terms of speed. The table below serves this purpose. I have not added Amazon Textract to the test, given it requires invoking a remote service on AWS, making it incomparable with EasyOCR and PyTesseract, both running locally.
Results are pretty clear. As you can see, EasyOCR on GPU (NVIDIA T4 on `g4dn.xlarge`) is blazingly fast: 4x faster than the same library on CPU, and 7x faster than PyTesseract on CPU.
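For anyone wanting to reproduce this kind of comparison, a timing helper along the following lines is one option. It is a sketch with names of my own choosing (`mean_seconds_per_image`, `warmup`), not the notebook's code, and it measures wall-clock time only.

```python
import time


def mean_seconds_per_image(read_fn, paths, warmup=1):
    """Average wall-clock seconds that read_fn spends per image path."""
    paths = list(paths)
    if not paths:
        raise ValueError("need at least one image path")
    for path in paths[:warmup]:
        read_fn(path)  # untimed warm-up calls: model loading, GPU initialisation, caches
    start = time.perf_counter()
    for path in paths:
        read_fn(path)
    return (time.perf_counter() - start) / len(paths)
```

The warm-up calls matter most on GPU, where the first inference typically pays a one-off initialisation cost that would otherwise distort the average.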
Conclusions
Overall, Amazon Textract and Tesseract lead the pack in terms of Levenshtein distance, without a clear winner between the two. Tesseract dominates when comparing averages, whereas Textract wins if we switch to medians. As for speed, EasyOCR tops the rest hands down. The outcome is evident already on CPU, becoming even more marked on GPU.