In this post I’m going to summarize the work I’ve done on Text Recognition in Natural Scenes as part of my second portfolio project at Data Science Retreat.
The importance of image processing has grown considerably in recent years. With the growing smartphone market in particular, people have started producing huge amounts of photos and videos, which are continuously streamed on social platforms. The industry’s increased interest in this kind of problem is therefore fully justified.
Machine learning obviously plays a very significant role in this field. Automatic text detection and character recognition is just one example. Other sophisticated applications include identifying animal or plant species, detecting people or, more generally, extracting any kind of commercially useful information.
The topic I wanted to dive into is OCR, which stands for Optical Character Recognition. This field has been studied very intensively over the past decades. In fact, the problem of recognizing characters in black and white documents is now considered solved. It is common practice to scan a sheet of paper and use standard software to convert it to a text file. There are also very good open source tools, such as Tesseract-OCR, which can read and detect up to 60 languages. Those are the easy cases, though: the images are grayscale with very good contrast, the contours of single characters are easy to detect, and lighting or shadows cause few problems.
The picture changes completely when we deal with natural scenes, for example a photo taken by a Twitter user and then posted on the social platform. In this case the problem has not been solved at all, and processing this kind of image still raises quite big issues. Some Google services have made major improvements, such as Translator, which recently added a feature capable of detecting and translating text from images. Even so, the results are not completely satisfactory and depend heavily on the quality of the picture and on the environmental conditions (night/day, light/shadow) in which it was taken.
Of course the goal of my project is not to find a final solution to this open problem, but it is still worth trying and practicing with such a fascinating topic.
In order to show the results of my work I’ll walk through a complete example, starting from the raw image (which could ideally be a picture from a user) and ending with the detected text. The results are not satisfactory yet, but the pipeline (image processing + machine learning) is working properly, which opens the way to huge improvements. I implemented the whole project in Python (Pandas/Scikit-Learn/Numpy/Skimage), but for the sake of simplicity and brevity I won’t walk through the code, which is available on Github. The post is organized as follows:
- Image Preprocessing and Object Detection
- Text Detection
- Text Classification
- Text Reconstruction
The image I picked to test my code is the following one:
As you can see, the text sits at the bottom and the background image is quite complex and overwhelming. The quote and the name of the author are also printed in two different font sizes, which adds an extra challenge to the task.
Once loaded, the image needs to be preprocessed. Specifically, it goes through the following two steps:
- Denoising: this is done by applying a total variation approach, which consists in reducing as much as possible the integral of the absolute gradient of the image, where the gradient of an image can simply be interpreted as a directional change in intensity or color.
- Increasing Contrast: this is done by applying Otsu’s method, which calculates an “optimal” threshold by maximizing the variance between two classes of pixels separated by the threshold. Equivalently, this threshold minimizes the intra-class variance.
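To make the Otsu criterion concrete, here is a minimal NumPy sketch of the threshold search. It is not the code from the repository (which relies on Skimage); the function name `otsu_threshold` and the toy image are assumptions made for illustration, and the input is assumed to be an 8-bit greyscale array.

```python
import numpy as np

def otsu_threshold(gray):
    """Return the 8-bit threshold that maximizes the between-class variance."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    levels = np.arange(256)

    w0 = np.cumsum(prob)                  # weight of the class "pixel <= t"
    w1 = 1.0 - w0                         # weight of the class "pixel > t"
    cum_mean = np.cumsum(prob * levels)   # cumulative mean up to level t
    total_mean = cum_mean[-1]

    # Between-class variance for every candidate threshold t:
    # sigma_b^2(t) = (total_mean * w0(t) - cum_mean(t))^2 / (w0(t) * w1(t))
    denom = w0 * w1
    valid = denom > 0                     # skip thresholds with an empty class
    sigma_b = np.zeros(256)
    sigma_b[valid] = (total_mean * w0[valid] - cum_mean[valid]) ** 2 / denom[valid]
    return int(np.argmax(sigma_b))

# Toy bimodal "image": 200 dark pixels and 200 bright pixels (made-up data).
rng = np.random.default_rng(0)
dark = rng.integers(20, 60, size=200)
bright = rng.integers(180, 230, size=200)
image = np.concatenate([dark, bright]).astype(np.uint8).reshape(20, 20)

t = otsu_threshold(image)
binary = image > t                        # True for the bright class
print("threshold:", t, "bright pixels:", int(binary.sum()))
```

On this toy image any threshold between the two clusters separates them perfectly, and the search returns the first such value.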
After image cleaning, object detection is performed. Contours are identified and a rectangle is drawn around each candidate object. The result of this process is the following figure.
As you can see, a lot of rectangles have been identified. Not all of them contain text, but we’ll take care of that in the following section. The objects are then converted to greyscale, resized to 20 x 20 pixels and stacked into a 3D numpy array. The coordinates of each rectangle are also saved so that the image can be reconstructed afterwards. The result of these operations is shown in the following standard output and generated figure (plotting 100 random images from the selected text candidates):
Images After Contour Detection
Fullscale: (342, 20, 20)
Flattened: (342, 400)
Contour Coordinates: (342, 4)
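The three shapes printed above can be reproduced with a few lines of NumPy. The sketch below is only illustrative: it assumes boxes are stored as (x, y, width, height), uses a simple nearest-neighbour resize instead of the Skimage routines used in the project, and the names `resize_nearest` and `stack_candidates` are hypothetical.

```python
import numpy as np

def resize_nearest(gray, size=20):
    """Nearest-neighbour resize of a 2D array to (size, size)."""
    rows = np.arange(size) * gray.shape[0] // size
    cols = np.arange(size) * gray.shape[1] // size
    return gray[rows[:, None], cols]

def stack_candidates(gray_image, boxes, size=20):
    """Crop each (x, y, w, h) box, resize it and stack the results."""
    crops = [resize_nearest(gray_image[y:y + h, x:x + w], size)
             for x, y, w, h in boxes]
    fullscale = np.stack(crops)                     # (n, size, size)
    flattened = fullscale.reshape(len(crops), -1)   # (n, size * size)
    coords = np.asarray(boxes)                      # (n, 4), kept for reconstruction
    return fullscale, flattened, coords

# Made-up greyscale image and three made-up candidate boxes.
gray_image = np.arange(100 * 120, dtype=float).reshape(100, 120)
boxes = [(10, 5, 30, 40), (50, 20, 12, 25), (0, 0, 60, 60)]

fullscale, flattened, coords = stack_candidates(gray_image, boxes)
print("Fullscale:", fullscale.shape)
print("Flattened:", flattened.shape)
print("Contour Coordinates:", coords.shape)
```

Keeping `coords` in the same order as the stacked crops is what later allows each prediction to be placed back on the original image.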
Here comes the interesting part. It’s time to dive into some Machine Learning.
The challenge now is to detect which of the identified objects contain text, so that it can then be classified. I have to train a model to make that decision, which means that first of all I need a proper dataset, ideally consisting half of images containing text and half of images not containing it. To build it I decided to merge two existing data sources:
- I took 50k images from the 78903 available in the 74K Chars dataset. This is the half containing text and I labeled each image as a 1.
- I took all the 50k images in the CIFAR-10 dataset on Kaggle. This is the half NOT containing text and I labeled each image as a 0.
The complete dataset was then composed of 100k images, properly labeled and randomly shuffled. Then I needed a model to perform the binary classification.
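Merging and shuffling the two sources could look like the sketch below. It assumes both sets have already been converted to flattened 20 x 20 greyscale arrays; the function name `build_binary_dataset` and the tiny placeholder arrays are made up for illustration.

```python
import numpy as np

def build_binary_dataset(text_images, background_images, seed=0):
    """Concatenate both sources, label them 1 / 0 and shuffle them together."""
    X = np.concatenate([text_images, background_images])
    y = np.concatenate([np.ones(len(text_images), dtype=int),
                        np.zeros(len(background_images), dtype=int)])
    # One shared permutation keeps every image aligned with its label.
    order = np.random.default_rng(seed).permutation(len(X))
    return X[order], y[order]

# Placeholder data: 5 "text" rows filled with 1.0 and 5 "no text" rows with 0.0.
text_images = np.full((5, 400), 1.0)
background_images = np.full((5, 400), 0.0)

X, y = build_binary_dataset(text_images, background_images)
print(X.shape, y)
```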
The model I turned to worked in two steps:
- Feature Extraction: this step is performed by computing the Histogram of Oriented Gradients (HOG) of the image. This technique is based on the fact that local object appearance and shape within an image can be described by the distribution of intensity gradients, where the gradient of an image can simply be interpreted as a directional change in intensity or color. This approach is commonly used for object detection, as it detects the contours of shapes fairly easily. The result of this step is a set of features that carry much more useful information than the raw pixels. These new features are ready to be passed to the classifier.
- Classification: this step is performed using Support Vector Machines with a linear kernel. The idea behind this choice is that we don’t want to overcomplicate things. On the contrary, since we already performed an “enrichment” step with HOG, we want a model which, while being powerful, keeps things simple. A linear SVM is worth trying.
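To give an idea of what the feature extraction step computes, here is a deliberately simplified HOG in plain NumPy. It is a sketch and not the Skimage implementation used in the project: it builds one orientation histogram per cell and applies a single global normalization, without the overlapping block normalization of the full method. The function name `simple_hog` and the cell and bin sizes are assumptions.

```python
import numpy as np

def simple_hog(gray, cell=5, bins=9):
    """Per-cell histograms of unsigned gradient orientation, weighted by magnitude."""
    gy, gx = np.gradient(gray.astype(float))
    magnitude = np.hypot(gx, gy)
    angle = np.rad2deg(np.arctan2(gy, gx)) % 180.0   # unsigned orientation
    # Clamp to the last bin to guard against rounding at exactly 180 degrees.
    bin_index = np.minimum((angle / (180.0 / bins)).astype(int), bins - 1)

    n_rows, n_cols = gray.shape[0] // cell, gray.shape[1] // cell
    features = np.zeros((n_rows, n_cols, bins))
    for r in range(n_rows):
        for c in range(n_cols):
            window = (slice(r * cell, (r + 1) * cell),
                      slice(c * cell, (c + 1) * cell))
            features[r, c] = np.bincount(bin_index[window].ravel(),
                                         weights=magnitude[window].ravel(),
                                         minlength=bins)

    features = features.ravel()
    norm = np.linalg.norm(features)
    return features / norm if norm > 0 else features

# A 20 x 20 image with a single vertical edge: all gradients point horizontally.
edge = np.zeros((20, 20))
edge[:, 10:] = 1.0
features = simple_hog(edge)
print(features.shape)
```

For a 20 x 20 image with 5 x 5 cells and 9 bins this gives 4 * 4 * 9 = 144 features per image. A vector like this, rather than the 400 raw pixels, is what would then be handed to the linear classifier.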
I ran Grid Search Cross Validation to optimize the hyperparameters of the Pipeline (both for HOG and LinearSVC); the following are my results on both the train and test sets for the binary classification problem:
Generated model on 90% train set --> linearsvc-hog-fulltrain2-90.pickle
Loaded 100000 images each (20, 20) pixels
Target set shape: (90000, 400)
Target shape: (90000,)
Test set shape: (10000, 400)
Target shape: (10000,)
Accuracy on train set: 0.974344444444
Accuracy on test set: 0.9728
An accuracy of 97.28% on previously unseen data is very good. We are now ready to run the model on all the rectangles detected in the first place and select only the ones containing real text. The result of these operations is shown in the following standard output and generated figure:
Images After Text Detection
Fullscale: (86, 20, 20)
Flattened: (86, 400)
Contour Coordinates: (86, 4)
Rectangles Identified as NOT containing Text 256 out of 342
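The selection step boils down to applying the same boolean mask to the images and to their coordinates, so that both stay aligned. A minimal sketch follows, with made-up predictions standing in for the output of the trained model and a hypothetical function name `keep_text_candidates`:

```python
import numpy as np

def keep_text_candidates(fullscale, coords, is_text):
    """Keep only the candidates predicted as text, together with their boxes."""
    mask = np.asarray(is_text, dtype=bool)
    kept = fullscale[mask]
    flattened = kept.reshape(len(kept), kept.shape[1] * kept.shape[2])
    return kept, flattened, coords[mask]

# Six made-up candidates; the 0/1 predictions below are invented for the example.
fullscale = np.zeros((6, 20, 20))
coords = np.arange(24).reshape(6, 4)
is_text = [1, 0, 0, 1, 0, 0]

kept, flattened, kept_coords = keep_text_candidates(fullscale, coords, is_text)
rejected = len(fullscale) - len(kept)
print(kept.shape, flattened.shape, kept_coords.shape, rejected)
```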
You can see that the result is very good, although not completely satisfactory. Some objects were classified as characters when they are not characters at all. This is going to cause more than one problem in the following steps, but let’s go on anyway.
Now that we have the final candidates it’s time to classify the single characters. The approach I followed is exactly the same one I used for Text Detection.
In this case the dataset is composed of the 78903 images available in the 74K Chars dataset. We are not dealing with a binary classification anymore as in this case the number of classes is 36:
- integers [0-9] : 10 classes
- lowercase letters of English alphabet [a-z] : 26 classes
I actually decided to reduce the initial number of classes from 62 to 36 by counting uppercase and lowercase English characters as belonging to the same group.
As for the Machine Learning part, I followed exactly the same approach as in the previous section. The pipeline is composed of a feature extraction step performed by HOG and a classification step carried out by a Linear SVM. After hyperparameter selection by Grid Search CV, the following are the results on the train and test sets:
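The reduction from 62 to 36 classes is just a case-insensitive relabeling. As a small sketch (the helper names `merge_case` and `to_class_index` are hypothetical, and labels are assumed to be single characters):

```python
import string

# The 36 target classes: digits followed by lowercase letters.
CLASSES = string.digits + string.ascii_lowercase

def merge_case(label):
    """Map a 62-class label (0-9, A-Z, a-z) to its case-insensitive label."""
    return label.lower()

def to_class_index(label):
    """Integer class index in the 36-class problem."""
    return CLASSES.index(merge_case(label))

original = string.digits + string.ascii_uppercase + string.ascii_lowercase
reduced = sorted({merge_case(c) for c in original})
print(len(original), "->", len(reduced))
print(to_class_index("A"), to_class_index("a"), to_class_index("7"))
```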
Generated model on 90% train set --> linearsvc-hog-fulltrain36-90.pickle
Loaded 77635 images each (20, 20) pixels
Train set shape: (69872, 400)
Target shape: (69872,)
Test set shape: (7763, 400)
Target shape: (7763,)
Accuracy on train set: 0.942079803068
Accuracy on test set: 0.877109364936
This is quite promising so let’s run the model on our text candidates and see what happens. The output is shown in the plot below in which each character is reported together with the result of the prediction.
This is the last part of the work and simply consists in putting together all the pieces of the puzzle we have built so far. Just to recap, we have the characters, the coordinates of their rectangles in the original image and the predictions. What we can do is simply build another figure, plotting the predicted strings in the right positions. The result of this approach is the following:
This chaotic outcome was somewhat expected, considering the errors accumulated over the several steps, but it is very encouraging! I want to emphasize that I manually added the violet rectangles at the end of the procedure to point out the structure of the sentence, so they are not generated automatically by the code.
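The idea of putting each prediction back at the position of its rectangle can be sketched without any plotting library by writing characters onto a coarse text grid. This is only an illustration of the placement logic, not the plotting code from the project; the function name `render_predictions`, the grid size and the boxes are all made up, and boxes are assumed to be (x, y, width, height).

```python
def render_predictions(boxes, chars, image_size, grid=(10, 40)):
    """Place each predicted character on a coarse text grid at its box centre."""
    height, width = image_size
    n_rows, n_cols = grid
    canvas = [[" "] * n_cols for _ in range(n_rows)]
    for (x, y, w, h), ch in zip(boxes, chars):
        # Scale the box centre from pixel coordinates to grid coordinates.
        row = min(int((y + h / 2) / height * n_rows), n_rows - 1)
        col = min(int((x + w / 2) / width * n_cols), n_cols - 1)
        canvas[row][col] = ch
    return ["".join(line).rstrip() for line in canvas]

# Four made-up boxes near the bottom of a 100 x 200 image.
boxes = [(10, 80, 8, 10), (20, 80, 8, 10), (30, 80, 8, 10), (40, 80, 8, 10)]
chars = ["t", "e", "x", "t"]

lines = render_predictions(boxes, chars, image_size=(100, 200))
print("\n".join(lines))
```

Any false positive from the detection step ends up on the same canvas, which is one way to see why errors from earlier stages make the final picture noisy.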
To conclude, I developed a first basic system for automatic text detection and classification in natural scenes (code here on Github). It definitely suffers from several problems, but a working pipeline was my first target and it is actually doing its job.
Lots of possibilities for improvement are now available. First of all, an accurate analysis of the bottlenecks is necessary in order to identify the weak points and select the steps that need serious refactoring. As for the algorithmic part, it is definitely worth giving Neural Networks and Deep Learning a try (nntools by Theano could be an idea), both for the binary text/no-text classification and for the multi-class OCR step. A significant improvement in both steps would result in far less noise in the last part of the program, making the text reconstruction phase more manageable.
To recap I think the following could be good starting points:
- Get rid of nested rectangles in object detection. This solves the problem of detecting a circle (classified as an ‘o’) inside an ‘a’.
- Manually label objects as containing or not containing text. It is possible to add a wait_for_key during the object detection phase and, as soon as a rectangle is identified, manually specify whether it’s text or not. For example, a tree may be misclassified as text and then classified as a T. Manual labeling is very time consuming, so before diving into it it is necessary to analyze the pipeline and make sure it is worth doing.
- Introduce a final ‘Guess Missing Text’ phase to correct small mistakes. For example, if we should detect the word ‘house’ but identify ‘hous’, well, of course that’s a house!
- Implement Neural Networks and Deep Learning.
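Two of these ideas are easy to prototype. The sketch below shows one possible way to drop nested rectangles and one possible ‘Guess Missing Text’ step based on fuzzy matching with Python’s standard `difflib` module. None of this exists in the project yet: the function names, the (x, y, width, height) box format, the small vocabulary and the similarity cutoff are all assumptions.

```python
import difflib

def contains(outer, inner):
    """True if box `inner` lies entirely within box `outer` (boxes are x, y, w, h)."""
    ox, oy, ow, oh = outer
    ix, iy, iw, ih = inner
    return (ox <= ix and oy <= iy and
            ix + iw <= ox + ow and iy + ih <= oy + oh)

def drop_nested(boxes):
    """Keep only the boxes that are not fully contained in another box."""
    kept = []
    for i, box in enumerate(boxes):
        nested = any(
            contains(other, box) and (other != box or j < i)  # keep one of two duplicates
            for j, other in enumerate(boxes) if j != i
        )
        if not nested:
            kept.append(box)
    return kept

def guess_word(detected, vocabulary, cutoff=0.8):
    """Replace a detected string by the closest vocabulary word, if close enough."""
    matches = difflib.get_close_matches(detected, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else detected

# An 'a' with the inner circle detected as a separate box, plus an unrelated box.
boxes = [(0, 0, 20, 20), (5, 5, 5, 5), (30, 0, 10, 10)]
print(drop_nested(boxes))

vocabulary = ["house", "horse", "mouse"]   # hypothetical word list
print(guess_word("hous", vocabulary))
print(guess_word("xyz", vocabulary))
```

The cutoff controls how aggressive the correction is: a string with no sufficiently similar vocabulary word is returned unchanged.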