Document Understanding is a domain that encompasses a set of techniques and technologies aimed at extracting information from unstructured documents and transforming it into structured data. While Computer Vision and Natural Language Processing (NLP) are important components of Document Understanding, it is a distinct domain that requires a combination of both.
Computer vision focuses on the visual aspects of a document, such as images and layout, and uses algorithms to extract information from these elements. NLP, on the other hand, deals with the linguistic aspects of a document and uses techniques such as named entity recognition and sentiment analysis to process the text content.
Although both computer vision and NLP can be effective in their respective domains, they alone cannot provide a complete understanding of a document. For example, a document may contain images that convey important information, while the text content may be limited or irrelevant. In such cases, a combination of computer vision and NLP is essential to gain a full understanding of the document.

Document Understanding: Definition
Document Understanding is the extraction of meaningful information from unstructured or semi-structured documents and its conversion into structured data for analysis and further use. This process is supported by technologies that use machine learning, NLP, computer vision, or even traditional OCR to automate information extraction.
Today, various providers offer different types of AI. These can be, for example, computer vision, NLP (natural language processing), or even simple forms of machine learning.
This raises the question: why is a document AI that involves document understanding so much more difficult to implement than plain computer vision or NLP models, which consider only the visual or only the semantic component of the information?
The simple answer is that an AI that can understand documents must work in two dimensions. Often called Hybrid AI, it combines semantic and visual information to understand, classify, or even process the content of documents the way humans do.

In practice: Example of an invoice
Consider an invoice from a vendor that contains information about the products or services purchased, the total amount owed, and the payment due date. In this scenario, both computer vision and NLP AI play an important role in understanding documents.
Computer Vision AI can be used to recognize and extract information such as invoice number, date, supplier name and address. It can also be used to process the visual layout of the invoice, for example, to identify tables and columns and extract the relevant data.
However, computer vision alone is not sufficient to fully understand the invoice. For example, it cannot interpret the specific products or services that were purchased or the pricing information associated with each item. This is where NLP AI comes into play.
NLP AI can be used to determine and extract information such as the names of the products or services purchased, the quantities, and the prices. It can also be used to process the description and specifications of each item and extract relevant information such as the unit of measure, the tax rate and any discounts.
In summary, an invoice requires a combination of computer vision and NLP AI to provide a comprehensive understanding of the document. While computer vision AI is essential for identifying and extracting information about the visual layout, NLP AI is necessary for processing and extracting the detailed information in the text content. Without both components, the information contained in the invoice cannot be fully understood and utilized.
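To make the combination concrete, here is a minimal sketch for a single invoice field. It assumes that OCR has already produced words with page coordinates. The `Word` class, the sample coordinates, and the `value_right_of` helper are invented for illustration and are not part of any specific product.

```python
from dataclasses import dataclass
import re


@dataclass
class Word:
    text: str
    x: float  # left edge of the word on the page
    y: float  # top edge of the word on the page


# Hypothetical OCR output for a tiny invoice (coordinates are invented).
words = [
    Word("Invoice", 40, 50), Word("No.", 95, 50), Word("2023-0042", 140, 50),
    Word("Due", 40, 70), Word("date", 70, 70), Word("2023-05-31", 140, 70),
    Word("Total", 40, 300), Word("1,190.00", 140, 300), Word("EUR", 200, 300),
]


def value_right_of(words, label, pattern, y_tolerance=5.0):
    """Return the first word matching `pattern` on the same line as `label`, to its right."""
    for anchor in words:
        if anchor.text.lower() != label.lower():
            continue
        # Visual signal: same line (similar y) and further right (larger x).
        same_line = [
            w for w in words
            if abs(w.y - anchor.y) <= y_tolerance and w.x > anchor.x
        ]
        # Textual signal: the candidate must look like the expected value.
        for candidate in sorted(same_line, key=lambda w: w.x):
            if re.fullmatch(pattern, candidate.text):
                return candidate.text
    return None


invoice_total = value_right_of(words, "Total", r"[\d.,]+")
due_date = value_right_of(words, "date", r"\d{4}-\d{2}-\d{2}")
```

With the sample data, `invoice_total` is "1,190.00" and `due_date` is "2023-05-31". Neither signal would be enough on its own: the position alone does not say which word is an amount, and the pattern alone does not say which amount is the total.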
Which documents can be read?
It is important to know that it is easier for AI to read structured documents than unstructured ones. On an ID card, you can recognize the information you are looking for immediately, whereas in general terms and conditions you first have to search for it.
In the best case, Document AI is trainable and continuously learns where to find which information on a given document type.
There are different approaches to reading structured, semi-structured and unstructured documents:
- Standardized documents are, for example, ID cards or vehicle registration documents. One might think that a simple rule-based approach would suffice. However, the information is not so easy to identify correctly. One might assume that they are always in the same place. But this is not the case, especially if documents were previously folded or photographed freehand with a smartphone and are distorted or rotated.
- Semi-structured documents contain the same kinds of information, but in different places from one document to the next. AI models learn to find the information they are looking for based on keywords, e.g. "phone number", which can appear anywhere on any page.
- Unstructured documents contain the information being sought at any position and without keywords. This is where the AI's learning ability comes into play. If you teach the AI which terms and information are relevant, it can filter these out on its own in subsequent documents.
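The keyword idea for semi-structured documents can be sketched as follows. This is a hand-written stand-in for what a trained model would learn, and the page texts and the `find_by_keyword` helper are hypothetical.

```python
import re

# Hypothetical page texts as they might come out of OCR.
pages = [
    "ACME GmbH\nCustomer service\nOpening hours: 9-17",
    "Contact\nPhone number: +49 30 1234567\nE-mail: info@example.com",
]


def find_by_keyword(pages, keyword):
    """Return (page_index, value) for the first line that starts with `keyword`."""
    pattern = re.compile(
        rf"^{re.escape(keyword)}\s*[:\-]?\s*(.+)$",
        re.IGNORECASE | re.MULTILINE,
    )
    # The keyword may sit on any page, so every page is searched in turn.
    for page_index, text in enumerate(pages):
        match = pattern.search(text)
        if match:
            return page_index, match.group(1).strip()
    return None


phone = find_by_keyword(pages, "phone number")
```

Here the phone number is found on the second page (index 1). For unstructured documents there is no such keyword to anchor on, which is why this simple lookup does not carry over to them.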
In addition to simple numbers and words, Document AI can also capture checkboxes and tables with proper training.
How does Document Understanding work?
A document understanding robot is created with the help of RPA. The workflow is set up in the appropriate software and can look like this, for example:
- Create taxonomy: Taxonomy refers to a classification model. In the Taxonomy Manager, you must first define a document type and classify the fields to be read (e.g. invoice number, invoice total and date). The special thing about Konfuzio is that the taxonomy is freely configurable and can therefore be adapted to all types of documents and languages.
- Digitize document: With the help of OCR software, you can digitize the previously defined document and convert it into a text form that is readable by the robot.
- Classify: Using the keywords, the robot assigns the digitized document to a document class defined in the Taxonomy Manager.
- Extract: Once the AI has identified what type of document it is, the data is extracted from the individual fields. Rule-based or model-based approaches are used for this.
- Validate: If required, employees can display the results of the extraction in the Validation Station. There, they can check the read values and correct them if necessary. This feedback by a human, often also called human-in-the-loop, offers the AI the opportunity to learn.
- Export: Finally, the data is exported to various systems. These can be e.g. SAP systems, but also Excel tables.
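The steps above can be sketched as a small pipeline. This is only an illustration under simplifying assumptions: the `TAXONOMY` dictionary, all function names, and the sample text are hypothetical, the OCR step is a placeholder, and the code does not use the API of Konfuzio or of any RPA tool.

```python
import csv
import io
import re

# Taxonomy: document types, their keywords, and the fields to read (hypothetical).
TAXONOMY = {
    "invoice": {
        "keywords": ["invoice", "amount due"],
        "fields": {
            "invoice_number": r"Invoice\s+No\.?\s*:?\s*(\S+)",
            "total": r"Total\s*:?\s*([\d.,]+)",
        },
    },
    "payslip": {
        "keywords": ["gross pay", "net pay"],
        "fields": {"net_pay": r"Net pay\s*:?\s*([\d.,]+)"},
    },
}


def digitize(document):
    """Placeholder for the OCR step; here the input is already text."""
    return document


def classify(text):
    """Pick the document type whose keywords occur most often in the text."""
    lowered = text.lower()
    scores = {
        doc_type: sum(lowered.count(k) for k in spec["keywords"])
        for doc_type, spec in TAXONOMY.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def extract(text, doc_type):
    """Rule-based extraction: one regular expression per field."""
    result = {}
    for field, pattern in TAXONOMY[doc_type]["fields"].items():
        match = re.search(pattern, text, re.IGNORECASE)
        result[field] = match.group(1) if match else None
    return result


def validate(fields):
    """Return the names of fields a human reviewer should check (missing values)."""
    return [name for name, value in fields.items() if value is None]


def export_csv(fields):
    """Export the extracted fields as a one-row CSV string."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(fields))
    writer.writeheader()
    writer.writerow(fields)
    return buffer.getvalue()


sample = "Invoice No: 2023-0042\nTotal: 1,190.00\nAmount due by 2023-05-31"
text = digitize(sample)
doc_type = classify(text)
fields = extract(text, doc_type)
needs_review = validate(fields)
exported = export_csv(fields)
```

In this sketch the sample is classified as an invoice, both fields are found, and nothing is flagged for review. In a real setup, the extraction step could be model-based instead of rule-based, and the corrections made during validation would flow back into training.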

Document Understanding in practice with Konfuzio
Being able to use Document Understanding in practice is a game changer. Learn why and how you can use Document Understanding with the following use case.
Example: Separate AI for image and text processing of news articles
To explain the whole thing with an example, let's start with the simple question:
Why does an AI recognize a hockey player better than a paycheck?

As you can see in the image, a photo of an ice hockey player is shown above a news article. The article therefore carries two kinds of information: the text and, separately, the picture.
Let's take the first dimension of this information, the text of the article, and process it with an NLP component. This NLP component was not specifically designed for the use case, but it can already read out so-called entities, e.g. persons, places, or organizations such as companies.
In addition, you can use the visual component of the article and find out, for example, that the picture can be segmented into separate regions. These can be, for example, the ceiling, the wall, or the person, each identified individually, but without taking the context into account: namely, that this is an ice hockey player in a stadium.
Both AIs, computer vision and named entity models, have their justification. However, they cannot simply be combined to process documents.
For this reason, the Konfuzio software was created to enable both semantic and visual components for processing information in the business context, i.e. within the document.
Document Understanding through Hybrid AI for Salary Statements
If you compare the salary statement with the plain newspaper article, you will immediately see that the salary statement conveys several layers of information through its 2-D layout.
Example:
The table-like structure for gross pay on the salary statement provides information on whether a gross pay item is a one-time payment or regular compensation for the employee. This information in particular is quite relevant when assessing a potential borrower's income situation.
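A minimal sketch of how position gives such an amount its meaning: each amount is assigned the row label at the same height and the column header at the same horizontal position. The `Cell` class, the coordinates, and the table contents are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    text: str
    x: float  # horizontal position of the cell on the page
    y: float  # vertical position of the cell on the page


# Hypothetical layout of a gross-pay table (coordinates are invented).
column_headers = [Cell("Regular", 200, 100), Cell("One-time", 300, 100)]
row_labels = [Cell("Base salary", 40, 120), Cell("Christmas bonus", 40, 140)]
amounts = [Cell("3,500.00", 200, 120), Cell("1,000.00", 300, 140)]


def nearest(cells, value, axis):
    """Return the cell whose coordinate on `axis` is closest to that of `value`."""
    return min(cells, key=lambda c: abs(getattr(c, axis) - getattr(value, axis)))


def label_amounts(amounts, row_labels, column_headers):
    """Attach the row label (closest y) and column header (closest x) to each amount."""
    return [
        {
            "item": nearest(row_labels, amount, "y").text,
            "payment_type": nearest(column_headers, amount, "x").text,
            "amount": amount.text,
        }
        for amount in amounts
    ]


labelled = label_amounts(amounts, row_labels, column_headers)
```

Read as plain running text, "1,000.00" is just a number. Only its position under "One-time" and next to "Christmas bonus" shows that it is not regular income.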
This is why rule-based information extraction is not enough
Providers that offer only OCR, NLP, or IDP solutions typically work rule-based when reading out this information. This has the disadvantage that they return incorrect information, for example, for incorrectly oriented scans or documents that were scanned in crookedly.
Most of the time, the data is not in a suitable form and follows no fixed sequence. It is present in an unstructured form.
There is no single technique or procedure for extracting data from unstructured PDFs, because the data can appear anywhere and the right approach depends on what kind of data you want to extract.
Rule-based tools work by locating anchor points in the document, such as labels or fixed positions. Based on this context, the document is then searched for the actual values of interest.
The downside is: As soon as there is a slight change in the format, this approach no longer works. If you are now a company that works with 60 different service providers in 10 different countries, you can assume that your rule-based tool will quickly reach its limits.
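A short sketch of this fragility, using a hypothetical rule and two invented invoice texts that carry the same information:

```python
import re

# A rule written for one vendor's layout: the label "Total:" followed by the amount.
TOTAL_RULE = re.compile(r"Total:\s*([\d.,]+)")


def extract_total(text):
    """Apply the fixed rule and return the amount, or None if the rule does not match."""
    match = TOTAL_RULE.search(text)
    return match.group(1) if match else None


vendor_a = "Invoice 2023-0042\nTotal: 1,190.00 EUR"
vendor_b = "Invoice 2023-0042\nAmount payable\n1,190.00 EUR"

total_a = extract_total(vendor_a)  # the rule matches this layout
total_b = extract_total(vendor_b)  # same amount, different label and line break
```

The rule finds the total for the first vendor and returns nothing for the second. Every new layout needs another rule, which is what makes the approach hard to maintain across many service providers.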
Of course, one might consider that comprehensive training data can also train machine learning models, computer vision models, or NLP models to address these particular characteristics of a document.
The difficulty here, however, is that the number of training documents in the domain-specific context is usually severely limited, and thus thousands of training documents cannot be made available to train such a document AI.
As you can see, purely rule-based and layout-based information extraction offers a first approach and has its justification, which is why various vendors in the market provide it. Purely AI-based information extraction is often limited by the number of training documents: if too few are provided, even AI-based extraction reaches only a low degree of accuracy.

Mind your neighbors: Document Understanding by Konfuzio
The approach of Konfuzio works differently: the information is obtained from the semantics of the document (e.g., wording, language, form, or anchor words) and is associated with the position of the text on the page, e.g., whether a word is in a table or in body text.
Hence the title "Mind your neighbors": based on the surrounding information, the "neighbors", the AI can reliably recognize and assign content using both one-dimensional and 2D information.
To learn more about how models that work one-dimensionally on the continuous text are combined with the 2D information of the text, see the term Segmentation.
This makes it possible to consider a piece of text not only in its semantic context, as a NER model would, but also to include additional information that comes from the orientation and position of the text in the document.
For example, the street and house number are statistically often found directly below the name of the employee.
The AI combines the latest NER research with computer vision research to create a comprehensive document understanding: it can learn the typical visual components of a document without a fixed, layout-based extraction. At the same time, the AI takes into account the semantic context that becomes accessible through a one-dimensional representation of the information. This kind of AI is also called Document AI.
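The neighbor idea can be illustrated with a small feature sketch. It is not Konfuzio's model, only an assumption about what kind of input such a model could use: for each token, the text itself plus the closest token to its left and the closest token above it. The `Token` class, the coordinates, and the helper functions are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    x: float  # horizontal position on the page
    y: float  # vertical position on the page, growing downwards


# Hypothetical tokens from a salary statement (coordinates are invented).
tokens = [
    Token("Jane", 40, 100), Token("Doe", 80, 100),
    Token("Main", 40, 115), Token("Street", 80, 115), Token("12", 130, 115),
    Token("Gross", 40, 300), Token("3,500.00", 130, 300),
]


def left_neighbor(tokens, token, tolerance=5.0):
    """Closest token on the same line that lies to the left, or None."""
    same_line = [
        t for t in tokens
        if abs(t.y - token.y) <= tolerance and t.x < token.x
    ]
    return max(same_line, key=lambda t: t.x, default=None)


def above_neighbor(tokens, token, tolerance=5.0, max_gap=30.0):
    """Closest token in the same column within `max_gap` above, or None."""
    same_column = [
        t for t in tokens
        if abs(t.x - token.x) <= tolerance and 0 < token.y - t.y <= max_gap
    ]
    return max(same_column, key=lambda t: t.y, default=None)


def features(tokens, token):
    """Combine a token's own text with the text of its neighbors."""
    left = left_neighbor(tokens, token)
    above = above_neighbor(tokens, token)
    return {
        "text": token.text,
        "is_numeric": token.text.replace(",", "").replace(".", "").isdigit(),
        "left": left.text if left else None,
        "above": above.text if above else None,
    }


street = features(tokens, tokens[2])
house_number = features(tokens, tokens[4])
gross_amount = features(tokens, tokens[6])
```

Both "12" and "3,500.00" are numeric, so the text alone does not tell them apart. Their neighbors do: one sits to the right of "Street", the other to the right of "Gross". In the same way, "Main" is found directly below the first name, which matches the address pattern described above.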
More information about the Document Understanding from Konfuzio can be found on the website.
What are the benefits of Document Understanding?
Especially where large volumes of documents are processed, Document Understanding adds tremendous value.
The benefits of using Document Understanding in the enterprise are as follows:
- Automated processing of large volumes of documents
- Reduced error rate
- Time and cost savings
- Elimination of repetitive tasks for employees
- Increasing the productivity of employees
- Increased employee satisfaction
Conclusion: Document Understanding must be used sensibly
If you want to take advantage of an AI that can handle document understanding, you need to find the right software. Not every OCR or IDP software is suitable for this.
If you regularly deal with demanding documents such as salary slips or complicated spreadsheets, software that is capable of real document understanding is an investment that will make your business more efficient.
You will need to spend some time training the AI at the beginning, but once it is ready, you can benefit from its work and have the AI read your complicated documents quickly, easily and correctly.
This means that you have smartly automated a time-consuming, error-prone process and can use the time gained for other tasks.
Do you already use Document Understanding in your company? Feel free to write me your opinion on the topic or further questions in the comments!