Tesseract-OCR: Evaluating Handwritten Text Recognition
The Why, What And Who:
As Data Scientists (DS), we are often asked to produce evaluations of models’ capabilities. We do this to help identify the best approach or approaches for specific use cases and deployment environments.
Key objectives for our DS team:
Our project manager entrusted us with achieving the following aims for the Story Squad app:
1. Explore, apply, and implement new image preprocessing techniques to help improve Tesseract-OCR’s performance.
2. Provide baseline metrics to objectively evaluate the Google Vision and Tesseract-OCR pipelines.
# To install Tesseract-OCR
sudo apt install tesseract-ocr

# Check the Tesseract version
tesseract --version

# Extract text from an image
tesseract sample_image.png stdout -l eng
Optical character recognition (OCR) is the extraction of typed or printed text, for example from a PDF or image, into a text string. OCR of typed text is a well-understood problem. Human Handwritten Text Recognition (HTR) is more challenging because of the uniqueness of individuals’ handwriting styles, especially for kids’ handwriting between the ages of 8 and 12.
We were part of a 4-member DS team. Our DS team was part of a large multifunctional development team of dozens of other Data Scientists, Stack Engineers, iOS developers, and Front End developers.
What is Story Squad?
Story Squad is the dream of a former teacher, Graig Peterson, to create opportunities for children to have creative writing and drawing time, off-screen. Here’s how it works: Story Squad provides users of the website a new chapter in an ongoing story each weekend. They read the story, and then follow with both a writing and drawing prompt to spend an hour off-screen writing and drawing. When they’re done, they upload photos of each.
The team transcribes the stories into text; the text is analyzed for complexity, screened for inappropriate content, and then sent to a moderator. (1)
(https://www.storysquad.education/)
For this project, we worked within an agile framework; we therefore had constant interactions with external stakeholders from the Story Squad team, which helped us understand and develop the main user story and break it down into tasks to achieve the primary objectives.
Stakeholder User Story
Because of the cost of using Google Cloud Vision as the current project’s OCR solution to transcribe thousands of handwritten documents, Story Squad wants to objectively evaluate Google Vision and Tesseract-OCR, to see if the team can use Tesseract as the main OCR engine for the Story Squad app.
Known Unknowns: The Challenge of Working with Tesseract-OCR
Tesseract-OCR’s support for offline handwritten text is poor; as of version 4, there is no human-handwriting traineddata file available. The training process for the integrated LSTM model is also not straightforward, and it is labor-intensive. Because of resource and time constraints, we did not have access to a trained LSTM model that generalizes well to human handwritten text. As our testing will show, we had to rely heavily on preprocessing techniques to test whether Tesseract-OCR in its default configuration could produce useful transcriptions of handwriting samples.
A major hurdle was that none of the DS team members had experience working with Tesseract-OCR specifically, or with OCR in general. Previous teams had done some work to build out parts of the Tesseract-OCR pipeline. There were some functional parts, but the performance of the trained model we inherited was not as good as the Google Vision API.
“It’s not that I’m so smart, it’s just that I stay with problems longer.” — Albert Einstein
Sticking to a UPER framework (Understand, Plan, Execute, Review).
We spent a significant amount of time researching, reading, viewing videos, and learning so that we could fully understand how to work within the Tesseract-OCR framework. Once we had brought ourselves up to speed on the subject and felt confident, we planned and executed without many issues. We also conducted reviews of our plan with individuals outside our 4-member team and tried to use best practices with our git pull requests to smooth out the approval process for our code.
The Metrics:
The metrics we used were Character Error Rate (CER), Word Error Rate (WER), and the Levenshtein distance ratio (LDR).
- A CER of 10% means that every tenth character is incorrectly identified (and these are not only letters, but also punctuation marks, spaces, etc.). (2)
- The WER shows how good the exact reproduction of the words in the text is. As a rule, the WER is three to four times higher than the CER and is proportional to it. (2)
- The Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into the other. It is named after the Soviet mathematician Vladimir Levenshtein, who considered this distance in 1965. (3) A small worked example follows this list.
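To make the definition concrete, here is a minimal pure-Python sketch of the classic dynamic-programming computation of the Levenshtein distance (the function name and example strings are ours, purely for illustration):

def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3: k->s, e->i, insert g

Libraries such as fuzzywuzzy (8) expose a related ratio score rescaled to 0–100, which is the LDR we report below.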
The process:
After creating a Pipenv environment through the Linux terminal on our Windows 10 machine, we mainly used OpenCV (4) and its cv2 module to preprocess the images before text extraction by Tesseract-OCR. The cv2 functions we used were: resizing, BGR to RGB, BGR to GRAY, and adaptive threshold. We tried other functions, with mixed results and added processing cost, but we settled on the few previously mentioned functions, as they gave us the biggest bang for our processing cost. Although OpenCV (4) can be used to deskew images, we had better results using the sbrunner/deskew repo (https://github.com/sbrunner/deskew). (5)
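As a minimal sketch of those cv2 steps (the file names and threshold parameters here are illustrative, not our exact production values):

import cv2

# Load the scanned page (path is illustrative)
image = cv2.imread("sample_image.png")

# Upscale so Tesseract gets more pixels per character
image = cv2.resize(image, None, fx=2.0, fy=2.0, interpolation=cv2.INTER_CUBIC)

# OpenCV loads images in BGR order; convert to grayscale for thresholding
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Adaptive threshold binarizes locally, so uneven lighting does not wash out pen strokes
binary = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                               cv2.THRESH_BINARY, 31, 10)

cv2.imwrite("preprocessed.png", binary)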
Original Image & Ground Truth
We used a sample with handwriting to test our hypothesis against the ground truth text, using the CER, WER, and LDR. Ground truth is information that is known to be true.
We found better results by deskewing the image first and then applying the other preprocessing steps. We needed to use the skimage io module to read the image.
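A sketch of that deskew-first step, loosely following the deskew library’s documented usage (5) (the file names are placeholders):

import numpy as np
from skimage import io
from skimage.color import rgb2gray
from skimage.transform import rotate
from deskew import determine_skew

# Read the image with skimage's io module, as noted above
image = io.imread("sample_image.png")

# Estimate the skew angle from a grayscale copy; fall back to 0 if none is found
angle = determine_skew(rgb2gray(image)) or 0.0

# Rotate to correct the skew, then rescale back to 8-bit pixel values
deskewed = rotate(image, angle, resize=True) * 255
io.imsave("deskewed.png", deskewed.astype(np.uint8))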
Tesseract-OCR Evaluation results
The team evaluated our results using pytesseract (6), a Python wrapper for the Tesseract-OCR binary. We also used two other libraries to produce our scores: asrtoolkit (7) for CER and WER, and fuzzywuzzy (8) for the Levenshtein distance ratio.
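Scoring a single hypothesis against the ground truth can then look roughly like this (assuming asrtoolkit’s cer/wer helpers and fuzzywuzzy’s ratio; the file names are placeholders):

import pytesseract
from PIL import Image
from asrtoolkit import cer, wer
from fuzzywuzzy import fuzz

with open("ground_truth.txt") as f:
    ground_truth = f.read()

# One hypothesis: Tesseract's transcription of a preprocessed image
hypothesis = pytesseract.image_to_string(Image.open("preprocessed.png"), lang="eng")

print(f"CER: {cer(ground_truth, hypothesis):.1f}%")    # lower is better
print(f"WER: {wer(ground_truth, hypothesis):.1f}%")    # lower is better
print(f"LDR: {fuzz.ratio(ground_truth, hypothesis)}")  # 0-100, higher is better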
We created seven hypothesis text extractions to compare with our ground truth text, passing each individual hypothesis along with the ground truth to each of the evaluating functions. For more detailed information on the results of each evaluation, please visit the following link.
Results from our evaluation:
In the results below, for CER and WER a rating close to 0 is desirable and one close to 100 is not; for LDR, a rating close to 100 is desirable and one near 0 is not.
As the results from our evaluation show in the above graphs:
- The preprocessing techniques we used improved Tesseract-OCR performance, even with the default eng.traineddata. (10)
- As expected, Google Cloud Vision (9) produces far better results at this stage of the project.
- Although for hypothesis #5 we used a custom traineddata file, ssq.traineddata (10), the results were actually less desirable than with the default eng.traineddata; I theorize that the custom-trained LSTM model is still not generalizing well. (Loading a custom model is sketched after this list.)
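For reference, this is roughly how a custom traineddata file can be supplied through pytesseract, using Tesseract’s --tessdata-dir flag (the tessdata directory path here is an assumption for illustration):

import pytesseract
from PIL import Image

# Point Tesseract at the directory holding ssq.traineddata (path is illustrative)
custom_config = r'--tessdata-dir "./tessdata"'

text = pytesseract.image_to_string(Image.open("preprocessed.png"),
                                   lang="ssq", config=custom_config)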
For the Story Squad project, there are definitely some challenges ahead. Improving the LSTM model for better text extraction will, I think, be key.
This will allow the team to see the impact of further preprocessing methods or steps.
As for our code, there are some improvements still to be done, such as adding further preprocessing steps and arguments to control the parameters within the cv2 module, and creating a helper function out of the deskew process and integrating it into the “pre_process_image” function, as sketched below.
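One possible shape for that refactor (a sketch only; the parameter names and defaults are assumptions, and only the pre_process_image name comes from our codebase):

import cv2
import numpy as np
from deskew import determine_skew

def pre_process_image(image: np.ndarray, scale: float = 2.0,
                      block_size: int = 31, c: int = 10) -> np.ndarray:
    """Deskew, upscale, grayscale, and binarize a page for Tesseract.

    scale, block_size, and c expose the cv2 parameters as arguments.
    """
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    # Deskew first, since that ordering gave us better results
    angle = determine_skew(gray) or 0.0
    h, w = gray.shape
    matrix = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    gray = cv2.warpAffine(gray, matrix, (w, h), borderValue=255)

    # Then resize and binarize, as in the earlier preprocessing steps
    gray = cv2.resize(gray, None, fx=scale, fy=scale,
                      interpolation=cv2.INTER_CUBIC)
    return cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                 cv2.THRESH_BINARY, block_size, c)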
I was extremely lucky to be part of this project. I would like to thank the team at Story Squad. It is such an innovative idea to use technology to engage kids in spending off-screen time writing and drawing, and in using their imagination and storytelling capabilities.
As for myself, the challenges created by our lack of knowledge of OCR allowed me to put a growth mindset into practice and learn a few new skills, frameworks, and libraries. I am always pleasantly surprised by the bounty of information and open-source projects that allow us to push our knowledge forward, and I hope to contribute back to the data science community as well. I also plan to continue working on OCR and machine vision to further my knowledge and capabilities.
Citations
(1) Lambda-School-Labs/story-squad-ds (2021). Available at: https://github.com/Lambda-School-Labs/story-squad-ds (Accessed: 22 June 2021).
(2) Word Error Rate & Character Error Rate — How to evaluate a model (2019). Available at: https://rechtsprechung-im-ostseeraum.archiv.uni-greifswald.de/word-error-rate-character-error-rate-how-to-evaluate-a-model/ (Accessed: 22 June 2021).
(3) Levenshtein distance — Wikipedia (2021). Available at: https://en.wikipedia.org/wiki/Levenshtein_distance (Accessed: 22 June 2021).
(4) Image preprocessing OpenCV: Install OpenCV-Python in Ubuntu
(5) Image Text Deskewing sbrunner/deskew
(6) Tesseract-OCR python Wrapper pytesseract
(7) CER and WER ratings asrtoolkit
(8) Levenshtein distance rating fuzzywuzzy
(9) Document Text Tutorial | Cloud Vision API | Google Cloud
(10) Tesseract training repo dakotagporter/tesstrain