Donut, introduced by Kim et al. (2021) in the paper "OCR-free Document Understanding Transformer," is a novel approach to document image processing that does not rely on optical character recognition (OCR). The model is designed to work across multiple languages and is computationally cheaper than traditional OCR-based pipelines.
In this article, we will take a deeper look at Donut's architecture, its components, and its performance in real-world applications.

In the Donut paper, the researchers present a method for training a combined vision and language model (a self-contained, end-to-end model) that can understand visually noisy documents and generate structured data from them. They use a training strategy called teacher forcing: during training, the model is given the correct previous outputs as context, rather than its own earlier guesses.
When the model is actually tested, it instead receives a prompt: a short piece of text that tells it what to generate. The researchers added special tokens (similar to markers) for the different tasks, so the model can recognize which task it is supposed to perform.
To illustrate the process, imagine you are teaching a child to write a story. Teacher forcing would be like dictating the correct story so far at every step, so the child only ever has to come up with the next word, while the prompt is the sentence or idea that gets their creative juices flowing.
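To make this concrete, here is a minimal PyTorch sketch of the teacher-forced input (the token IDs are made up for illustration): the decoder's input is the ground-truth sequence shifted one position to the right, so each step is conditioned on the correct preceding tokens rather than on the model's own predictions.

```python
import torch

# Hypothetical ground-truth token ids for one target sequence; 2 = EOS.
target_ids = torch.tensor([[57, 112, 9, 834, 2]])
bos_id = 0  # beginning-of-sequence / task-start token id

# Teacher-forced decoder input: the ground truth shifted one position to
# the right, so the prediction at step t sees the correct tokens 0..t-1
# instead of the model's own earlier guesses.
decoder_input_ids = torch.cat(
    [torch.full((1, 1), bos_id), target_ids[:, :-1]], dim=1
)
print(decoder_input_ids)  # tensor([[  0,  57, 112,   9, 834]])
```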
After the model generates a response, the researchers convert the output to JSON, a common format for representing and organizing data. Special tokens (again, markers) indicate the beginning and end of each piece of information in the output. If the model's output is not structured correctly, that particular piece of information is considered lost.
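As a minimal sketch of that conversion, assuming the HuggingFace port of Donut: the DonutProcessor of the CORD-fine-tuned checkpoint exposes a token2json helper that parses the paired start/end markers into nested dictionaries (the field names below come from the CORD receipt schema).

```python
from transformers import DonutProcessor

# Processor of the CORD-fine-tuned checkpoint, whose tokenizer knows the
# CORD field markers (<s_menu>, <s_nm>, ...).
processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")

# A generated token sequence: paired start/end markers delimit each field.
sequence = "<s_menu><s_nm>Latte</s_nm><s_cnt>2</s_cnt><s_price>7.00</s_price></s_menu>"

# token2json parses the markers into a nested Python dict.
print(processor.token2json(sequence))
# e.g. {'menu': {'nm': 'Latte', 'cnt': '2', 'price': '7.00'}}
```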

Overall, the Donut paper describes a method for training and testing a vision-language model with teacher forcing, task prompts, and a structured output format, so that the model can understand visually noisy documents and generate structured data from them.
The Konfuzio team has been exploring the Donut document understanding model by Kim et al. (2021) as a promising method for automatic document processing. The model uses a novel data representation that captures the relationships between different elements in a document more precisely and effectively. It also shows encouraging results in document classification and information extraction, making it a strong candidate for automated document processing solutions.
This article was written in German, automatically translated into other languages and editorially reviewed. We welcome feedback at the end of the article.
Architecture and components
The main components of the Donut architecture are the encoder, which processes the visual input, and the decoder, which generates the textual output. The model operates in two main stages:
Encoding: In this stage, the encoder processes the input image and converts it into embeddings, numeric vectors that represent visual, textual, or other types of data. This step turns the visual information of the document into a machine-readable format.
Decoding: The decoder takes the embeddings produced by the encoder and generates text autoregressively, using the previously generated words as context for each new word. This allows the model to produce a textual representation of the input image without resorting to OCR; a sketch of the full pipeline follows below.
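The sketch below shows the two stages end to end, assuming the HuggingFace port of Donut with the CORD-fine-tuned checkpoint; "receipt.png" is a placeholder path, and the generation settings are one reasonable configuration rather than the definitive one.

```python
import re
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

model_id = "naver-clova-ix/donut-base-finetuned-cord-v2"
processor = DonutProcessor.from_pretrained(model_id)
model = VisionEncoderDecoderModel.from_pretrained(model_id)

# Encoding: the processor resizes and normalizes the page image; the Swin
# encoder inside the model turns it into a sequence of embeddings.
image = Image.open("receipt.png").convert("RGB")  # placeholder path
pixel_values = processor(image, return_tensors="pt").pixel_values

# The task prompt primes the decoder for the receipt-parsing task.
decoder_input_ids = processor.tokenizer(
    "<s_cord-v2>", add_special_tokens=False, return_tensors="pt"
).input_ids

# Decoding: generate() runs the decoder autoregressively, each step
# conditioned on the encoder embeddings and the tokens emitted so far.
outputs = model.generate(
    pixel_values,
    decoder_input_ids=decoder_input_ids,
    max_length=model.decoder.config.max_position_embeddings,
    pad_token_id=processor.tokenizer.pad_token_id,
    eos_token_id=processor.tokenizer.eos_token_id,
    bad_words_ids=[[processor.tokenizer.unk_token_id]],
    return_dict_in_generate=True,
)

sequence = processor.batch_decode(outputs.sequences)[0]
sequence = sequence.replace(processor.tokenizer.eos_token, "")
sequence = sequence.replace(processor.tokenizer.pad_token, "")
sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()  # drop task prompt
print(processor.token2json(sequence))
```

Note that no OCR engine appears anywhere in this pipeline: the pixel values go straight into the encoder, and the decoder emits the structured sequence directly.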
Performance and limitations
Despite its innovative approach, Donut's performance was not particularly convincing in certain applications. In tests with two instances of the model (the default checkpoint and a version fine-tuned on the CORD receipt dataset), success was measured by how accurately ground-truth annotations were extracted. Unfortunately, the overall precision for the tested categories never exceeded 10%.
Some of the limitations and disadvantages identified in these tests are:
Limited language independence: Although Donut was designed to work in multiple languages, its performance was suboptimal on German and English data. In some cases, the generated text even contained unrelated Chinese characters.
Low processing speed: Even on a GPU, inference was relatively slow, which could limit the model's practicality in real-world scenarios.
Future directions and improvements
Given these limitations, researchers are currently exploring fine-tuning the Donut model on specific datasets to improve its performance. The goal is a more language-independent and efficient version of the model that can better understand and process diverse document images.
Fine-tuning involves adjusting the model's parameters to better fit the target dataset, resulting in a more specialized model tailored to the task at hand. By fine-tuning Donut on the desired data, researchers hope to achieve better extraction accuracy and overall performance; a rough sketch of that workflow follows below.
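Assuming the HuggingFace port of Donut, the idea looks roughly like this: the ground truth of each document is serialized into a tagged token sequence, hypothetical field markers such as <s_invoice> are registered with the tokenizer, and the model is trained with teacher forcing on (image, sequence) pairs. The dataset iterable and field names below are placeholders, not a definitive training recipe.

```python
import torch
from transformers import DonutProcessor, VisionEncoderDecoderModel

processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base")

# Register hypothetical field markers so every field in the target schema
# gets its own start/end token pair, then grow the decoder's embeddings.
new_tokens = ["<s_invoice>", "</s_invoice>", "<s_total>", "</s_total>"]
processor.tokenizer.add_special_tokens({"additional_special_tokens": new_tokens})
model.decoder.resize_token_embeddings(len(processor.tokenizer))
model.config.pad_token_id = processor.tokenizer.pad_token_id
model.config.decoder_start_token_id = processor.tokenizer.convert_tokens_to_ids("<s_invoice>")

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
model.train()

# Placeholder dataset: (PIL image, tagged target string) pairs, e.g.
# (page_image, "<s_invoice><s_total>42.00</s_total></s_invoice>").
train_pairs = []

for image, target in train_pairs:
    pixel_values = processor(image, return_tensors="pt").pixel_values
    labels = processor.tokenizer(
        target + processor.tokenizer.eos_token,
        add_special_tokens=False,
        return_tensors="pt",
    ).input_ids
    # Passing labels makes the model build the teacher-forced decoder input
    # internally (labels shifted one step right) and return the loss.
    loss = model(pixel_values=pixel_values, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```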
Summary
In summary, Donut represents a new approach to document image processing because it no longer relies on OCR. Although current performance has not been satisfactory in some applications, the potential for improvement through fine-tuning and further research is promising. As the technology evolves and adapts, Donut could become a valuable tool for language-independent and computationally efficient document image processing.