Tesseract is an optical character recognition (OCR) engine that originated at HP Labs and was released as an open-source project in 2005. From 2006 onward, Google sponsored its development, and the engine has been improved and updated continuously since then. Tesseract has become a widely used OCR engine that supports over 100 languages.
Compared to proprietary OCR software, Tesseract offers not only a free OCR engine, but also the possibility of continually improving recognition quality through human feedback. This is often necessary because a default installation does not always deliver optimal recognition quality.
In our two-part Tesseract Guide, we explain how the software works and how you can get the most out of it. In the first part, we show you how to properly install, set up and train the tool.
In the second part, you will learn what to consider when using Tesseract OCR and which best practices to follow.
You are reading an auto-translated version of the original German post.
1. How Tesseract works
Tesseract works in several steps to extract text from images. First, it performs preprocessing of the image to optimize its quality for text recognition. Then Tesseract OCR segments the image into text blocks, lines and words and analyzes the structure of the text.
In the preprocessing phase, Leptonica comes into play, which is also an open-source library. It is responsible for image processing and manipulation. Leptonica optimizes the images by reducing noise, normalizing colors, and adjusting scaling so that Tesseract's recognition works more effectively. Leptonica does not handle text recognition and extraction itself; that is the exclusive responsibility of Tesseract.
In the recognition phase, the tool uses machine learning to identify the characters in the images. Since version 4.0, the software has focused on Long Short-Term Memory (LSTM) networks to further improve recognition accuracy. Finally, the tool performs post-processing to correct incorrectly recognized characters and generate the final text result.
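To make the preprocessing idea more concrete, here is a minimal sketch of one common step, binarization, which turns a grayscale image into pure black and white. It uses Otsu's method in plain Python on a small grid of gray values. The function names are chosen for illustration only, and this is not meant to mirror how Leptonica or Tesseract implement the step internally.

```python
def otsu_threshold(pixels):
    """Pick the gray level (0-255) that best separates dark and light pixels."""
    histogram = [0] * 256
    for row in pixels:
        for value in row:
            histogram[value] += 1

    total = sum(histogram)
    weighted_total = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    dark_count = 0  # number of pixels with value <= level
    dark_sum = 0    # sum of their gray values
    for level in range(256):
        dark_count += histogram[level]
        dark_sum += level * histogram[level]
        light_count = total - dark_count
        if dark_count == 0 or light_count == 0:
            continue  # one class is empty, so there is nothing to separate
        dark_mean = dark_sum / dark_count
        light_mean = (weighted_total - dark_sum) / light_count
        # Between-class variance (up to a constant factor)
        variance = dark_count * light_count * (dark_mean - light_mean) ** 2
        if variance > best_variance:
            best_threshold, best_variance = level, variance
    return best_threshold


def binarize(pixels):
    """Return a grid with 1 for dark (ink) pixels and 0 for light (paper) pixels."""
    threshold = otsu_threshold(pixels)
    return [[1 if value <= threshold else 0 for value in row] for row in pixels]


# Tiny made-up "image": dark strokes (about 10) on light paper (about 200)
image = [[10, 12, 200],
         [11, 210, 205]]
print(binarize(image))  # [[1, 1, 0], [1, 0, 0]]
```

In practice you would apply such a step to real image data through an imaging library rather than to nested lists; the sketch only shows the principle.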
2. Areas of application of Tesseract
Tesseract OCR is used in various fields where scanned documents, images or PDFs need to be converted into editable text. Some of the most common use cases are:
- Automation of data entry and text extraction
- Digitization of books and archive material
- Recognition of text on business cards and forms
- Automatic recognition of text in images
- Recognition of license plates and traffic signs
- Text recognition with Tesseract on mobile devices and web services
Tesseract is a versatile and powerful OCR engine that can be used by both developers and end users. It provides a solid base for OCR projects and can be customized to meet specific requirements.
3. Installation and setup of Tesseract
With its advanced neural networks, Tesseract takes text recognition to a new level. Here's how to properly install and set up the open-source software:
3.1 System requirements
Tesseract OCR can be installed on various platforms. As a rough guideline for smooth operation, plan for at least a dual-core processor with 2 GHz and 2 GB of RAM.
A quad-core processor or better and at least 4 GB of RAM are recommended for processing larger amounts of text.
As a rule of thumb, the more computing power and memory the system has, the faster processing runs. This matters most when running OCR on entire books or other large documents.
3.2 Installation on different platforms
You can install Tesseract OCR on Windows, macOS and Linux. If you have any questions or problems during the installation, the official documentation of the software will help you.
3.2.1 Installing Tesseract on Windows
The installation on Windows is quick and easy. After downloading the latest version, you can start the installation program, which automatically installs required dependencies such as Leptonica.
During the installation, you can customize the settings. We recommend that you select all the required components, especially the language data. The language data enables optimal text recognition with the Tesseract software. For the installation you need at least Windows 7.
3.2.2 Installing Tesseract on macOS
To install Tesseract on macOS, you need at least version 10.7.5. On macOS, Tesseract is typically installed through a package manager such as Homebrew (brew install tesseract) or MacPorts. As with Windows, you should install the language modules you need right away. If you skipped this step, you can add them later, either through the package manager or by copying the corresponding language data files manually into the tessdata directory of your installation.
3.2.3 Installing Tesseract on Linux
Installing the Tesseract OCR engine on Linux systems is a bit more involved than on Windows and macOS. You first need to install the necessary packages. Their names vary depending on the Linux distribution; on Debian- and Ubuntu-based systems, the core package is "tesseract-ocr". It can be installed with the following command:
sudo apt-get install tesseract-ocr
On these distributions, the "tesseract-ocr-all" package can additionally be installed to get support for all available languages.
After installing the packages, Tesseract OCR can be started. You can always install additional language modules to extend the software. To do this, either download the modules manually or install them via the package management system of the Linux distribution you are using. For example, to install the language module for German, you can use the following command:
sudo apt-get install tesseract-ocr-deu
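To check which language modules are actually available after installation, you can ask Tesseract itself with its --list-langs option. The following Python sketch wraps that call. The helper names are made up for this example, the sample output is purely illustrative (the header line differs between versions), and the parser simply assumes that every line after the header is one language code.

```python
import shutil
import subprocess


def parse_list_langs(output: str) -> list[str]:
    """Extract language codes from the text printed by `tesseract --list-langs`.

    Assumption: the first line is a header and every following non-empty
    line is exactly one language code.
    """
    lines = [line.strip() for line in output.splitlines()]
    return [line for line in lines[1:] if line]


def installed_languages() -> list[str]:
    """Ask the local Tesseract installation which languages it can use."""
    if shutil.which("tesseract") is None:
        return []  # Tesseract is not installed or not on PATH
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True, text=True, check=False,
    )
    # Fall back to stderr in case the list was not written to stdout.
    return parse_list_langs(result.stdout or result.stderr)


# Illustrative output only, not copied from a real installation.
sample = "List of available languages (3):\ndeu\neng\nosd\n"
print(parse_list_langs(sample))  # ['deu', 'eng', 'osd']
```

If "deu" is missing from the list returned by installed_languages(), the German language module has not been installed yet.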
Tesseract itself is a command-line tool and does not include a graphical user interface. If you prefer to work with a GUI, you can install a separate third-party front end such as gImageReader, which uses Tesseract for the actual recognition.
3.3 Setting up the environment variables
To run Tesseract correctly on an operating system, you need to set up the environment variables accordingly. These help the operating system and the tool locate the programs and files they need.
On Windows, for example, you must add the Tesseract installation directory to the PATH environment variable.
This allows you to call the Tesseract software from any folder, regardless of where the files to be processed are stored. Similar steps apply to macOS and Linux if the directory containing the Tesseract executable is not already on the PATH.
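A quick way to verify the setup is to check whether the tesseract executable can be found through PATH. The Python sketch below does this and also reports TESSDATA_PREFIX, an environment variable Tesseract can use to locate its language data. The function name is chosen for this example only, and whether you need TESSDATA_PREFIX at all depends on your installation.

```python
import os
import shutil


def check_tesseract_env(environ=None):
    """Report whether Tesseract is reachable via PATH and where tessdata is expected."""
    env = os.environ if environ is None else environ
    return {
        # Full path to the executable, or None if it is not on PATH
        "executable": shutil.which("tesseract", path=env.get("PATH", "")),
        # Optional variable pointing to the language data directory
        "tessdata_prefix": env.get("TESSDATA_PREFIX"),
    }


report = check_tesseract_env()
if report["executable"] is None:
    print("tesseract was not found on PATH")
else:
    print("tesseract found at", report["executable"])
print("TESSDATA_PREFIX =", report["tessdata_prefix"])
```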
Francesco Piscani shows how to install and set up the software on Linux in the following video:
4. Training of Tesseract
Even the best OCR engine is only as good as the data it was trained on. While Tesseract's standard models can quickly complete simple OCR tasks, the software requires training for special use cases. This is crucial to achieve optimal results.
To improve the performance of the tool, you need to adapt the OCR models to specific use cases. This process is called training. It usually involves creating training data, fine-tuning existing OCR models, and evaluating and measuring performance. Only then can the tool reliably read data from more complex documents.
4.1 Creating training data
To train the Tesseract software, you need a sufficiently large collection of sample images or documents. This data must already be annotated.
In order to perform the (time-consuming) annotation of data as quickly as possible, you can resort to various tools. These help to automate the process - or at least speed it up.
An example of such a tool is Lios, an open-source program designed for working with OCR-recognized text. It can help create training data by pre-annotating automatically, which reduces the manual effort.
More Tesseract training data can also be downloaded via GitHub.
Another option for obtaining training data is to use templates that match your data extraction requirements. For example, you can take existing templates that are similar to your target document structures and use them to create corresponding training data for Tesseract OCR. This is usually faster and less expensive than creating training data manually.
4.2 Fine-tuning existing models
To adapt existing models to specific use cases, you should fine-tune them.
Fine-tuning involves training existing models with additional data to improve the performance of the Tesseract OCR engine for a specific task.
It is important to note that fine-tuning is only successful if the additional training data is relevant to the specific task.
For fine-tuning you need to prepare two types of files:
- the Tesseract Traineddata file
- the LSTM checkpoint file
The Traineddata file contains the data used by Tesseract during training to recognize letters, words and characters. The LSTM checkpoint file contains the information that the LSTM model uses for its predictions.
To extract an LSTM model from a standard model and prepare it for fine-tuning, perform the following steps:
- Load the standard model in Tesseract.
- Extract the LSTM model from the standard model.
- Modify the LSTM model to match the specific task for which fine-tuning is being performed.
- Train the tuned model with the additional training data and save the model checkpoints.
Model checkpoints are intermediate training results that are saved periodically during fine-tuning.
These checkpoints are important because they capture the current state of the model at that point in training. If training is interrupted, it can be resumed from the last saved checkpoint.
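The fine-tuning steps can be sketched with Tesseract's training tools combine_tessdata and lstmtraining. The Python snippet below only assembles the command lines and does not run them. All file paths are hypothetical placeholders, the helper names are made up for this example, and the iteration count is an arbitrary illustration value. The training list file is assumed to name training files you have already prepared.

```python
import shlex


def extract_lstm_cmd(traineddata, lstm_out):
    """Command that extracts the LSTM model from a standard traineddata file."""
    return ["combine_tessdata", "-e", traineddata, lstm_out]


def finetune_cmd(lstm_model, traineddata, train_listfile, model_output,
                 max_iterations=400):
    """Command that continues training from the extracted model.

    Checkpoints are written using `model_output` as the file name prefix.
    """
    return [
        "lstmtraining",
        "--continue_from", lstm_model,
        "--traineddata", traineddata,
        "--train_listfile", train_listfile,
        "--model_output", model_output,
        "--max_iterations", str(max_iterations),
    ]


def finalize_cmd(checkpoint, traineddata, final_traineddata):
    """Command that converts a saved checkpoint into a usable traineddata file."""
    return [
        "lstmtraining", "--stop_training",
        "--continue_from", checkpoint,
        "--traineddata", traineddata,
        "--model_output", final_traineddata,
    ]


# Hypothetical paths, for illustration only
steps = [
    extract_lstm_cmd("eng.traineddata", "eng.lstm"),
    finetune_cmd("eng.lstm", "eng.traineddata", "train_files.txt", "out/invoices"),
    finalize_cmd("out/invoices_checkpoint", "eng.traineddata", "out/invoices.traineddata"),
]
for command in steps:
    print(shlex.join(command))
```

Printing the commands first makes it easy to review them before running each one, for example with subprocess.run(command, check=True).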
Once the fine-tuned model is created, you can use it in the OCR application. However, make sure that the training set reflects the documents you want to recognize in practice.
4.3 Evaluation and performance measurement
Evaluating and measuring performance is an important step in ensuring that the Tesseract OCR engine provides the expected accuracy and reliability. To achieve this, various metrics are used to assess OCR performance.
One of the most important key figures is the reading accuracy. It is usually given as a percentage and measures the proportion of correctly recognized characters in relation to all characters to be recognized.
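As a minimal sketch of how such a figure can be computed, the following Python code compares the recognized text with a manually verified reference text. It approximates the number of wrong characters with the Levenshtein edit distance (substitutions, insertions and deletions), which is a common choice but only one possible definition of reading accuracy. The function names and sample strings are made up for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, recognized: str) -> float:
    """Share of reference characters recognized correctly, between 0.0 and 1.0."""
    if not reference:
        raise ValueError("reference text must not be empty")
    errors = levenshtein(reference, recognized)
    return max(0.0, 1.0 - errors / len(reference))


# Two wrong characters ("I" read as "l", "0" read as "O") out of twelve
print(f"{character_accuracy('Invoice 2024', 'lnvoice 2O24'):.1%}")  # 83.3%
```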
In addition, other key figures such as
- the error rate,
- the misrecognized characters,
- the execution speed and
- the accuracy with different fonts and languages
can be measured. Here, it is important to consider the expected performance under real usage conditions and compare it with other OCR engines or methods.
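To see which characters are misrecognized most often, you can align the reference text with the OCR output and count the substitutions. The sketch below does this with Python's standard difflib module. It is deliberately simple: only replaced spans of equal length are counted, while insertions and deletions are ignored. The function name and sample strings are illustrative.

```python
from collections import Counter
from difflib import SequenceMatcher


def character_confusions(reference: str, recognized: str) -> Counter:
    """Count (expected, recognized) character pairs where the OCR output differs."""
    matcher = SequenceMatcher(None, reference, recognized, autojunk=False)
    confusions = Counter()
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Only count clean substitutions where both spans have the same length
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            confusions.update(zip(reference[i1:i2], recognized[j1:j2]))
    return confusions


confusions = character_confusions("hello world", "hel1o w0rld")
for (expected, got), count in confusions.most_common():
    print(f"expected {expected!r}, got {got!r}: {count}x")
```

Frequent pairs such as "l" and "1" or "O" and "0" are a useful hint about which kinds of samples to add to the training data.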
For performance measurement, you can use various tools and techniques, such as
- standardized test data sets,
- a manual check of the results,
- statistical analyses or
- machine learning.
The choice of process depends on the specific application and available resources. However, keep in mind that several factors affect OCR performance. These include image quality, font, language, and the layout and format of the document.
Is there an alternative to Tesseract?
Yes, there are several alternatives to Tesseract OCR. Here are a few of them:
- Abbyy FineReader: This OCR software provides high-accuracy text recognition and is especially well suited to scanning books and documents. It supports a wide range of languages and has strong layout analysis functions.
- Amazon Textract: This is a service from Amazon Web Services that provides OCR capabilities. It can not only extract text from documents, but also recognize forms and tables.
- Google Cloud Vision OCR: This service is part of the Google Cloud Platform and can recognize text in a variety of languages and fonts.
Please note that some of these alternatives are fee-based and their costs and features may differ from Tesseract. It is always important to consider your specific requirements before choosing an OCR solution.
Tesseract Guide Part 2: Usage, result optimization and best practices
Now read the second part of our comprehensive guide. In it, we show you how to use the software in practice and improve the results it delivers. Along the way, we provide best-practice tips to help you efficiently achieve the results you need.
FAQ
Tesseract is an open-source optical character recognition engine that originated at HP Labs and was later developed further with Google's backing. The software enables the recognition and extraction of text from images and scanned documents. Tesseract is one of the most widely used OCR engines and supports over 100 languages.
Tesseract extracts text from images in several steps: First, it optimizes image quality through binarization, noise reduction, and scaling. Then, Tesseract segments the image into text blocks, lines, and words to analyze the text structure. In the recognition phase, Tesseract identifies characters using machine learning, specifically Long Short-Term Memory (LSTM) networks. Finally, it corrects incorrectly recognized characters and generates the final text result.
Tesseract OCR is used, for example, in the automation of data entry, digitization of books and archival materials, recognition of text on business cards and forms, and automatic recognition of text in images. Companies in the finance and healthcare industries, among others, use the technology.