Only a short time after the triumph of large language models, artificial intelligence has taken another decisive step: the Multimodal Large Language Models presented last year can process visual elements as well as text. This brings us another step closer to the long-envisioned general AI.
Multimodal deep learning plays a key role here. Although still a young subfield of machine learning, it is already achieving impressive results in object recognition as well as speech and image analysis. This opens up a wide range of opportunities, especially in intelligent document processing. It is now becoming clear what is actually possible, but also where the new limits lie.
You are reading an auto-translated version of the original German post.
Another dimension of generative AI
Until recently, the common standard was this: pre-trained large language models (LLMs) with domain-specific fine-tuning are used to solve various tasks in natural language processing (NLP). The basic ability to recognize complex relationships in human language comes from analyzing immense amounts of text in an unsupervised learning process. The resulting possibilities for analyzing, generating, translating and summarizing text were certainly enough to turn the tech sector on its head - think ChatGPT. However, they model only one dimension of human perception: a very important one, but still only one.
Multimodal LLMs overcome this limitation by supplementing the capabilities of conventional models with the processing of multimodal information - for example images, but also audio and video. They can therefore solve far more comprehensive tasks, and in many cases they do not even need to be specially adapted for them. The combination with separate vision models, which has often been necessary up to now, could thus become considerably less important. Overall, a significant breakthrough is emerging, expressed in the following fundamental advances:
- Approximation of human perception by centralized processing of different types of information
- Increased usability and more flexible interaction through visual elements
- Solving novel tasks without separate fine-tuning
- No restriction to the scope of natural language processing

How do Multimodal LLMs work?
Multimodal LLMs essentially continue to build on the Transformer architecture introduced by Google in 2017. The developments of recent years have already shown that this architecture allows extensive extensions and reinterpretations - above all in the choice of training data and learning procedures, as is the case here.
Multimodal Deep Learning
This special form of machine and deep learning focuses on developing algorithms whose combination allows different data types to be processed. This is still done using neural networks, whose depth lets them cope with particularly information-dense input, above all visual content, and enables a more intensive learning process. Multimodal deep learning therefore not only handles diversified input, but also brings gains in speed and performance. One of the greatest challenges, however, lies in providing the necessary volumes of training data.
Replacement of classic fine-tuning
In addition, novel methods such as so-called "instruction tuning" are used in place of previous paradigms. This describes fine-tuning pre-trained LLMs on a whole range of tasks at once - unlike the task-specific fine-tuning that was common before. The result is significantly more generalized applicability: such models are prepared even for previously unknown tasks without further supervised training or countless example prompts.
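To illustrate the underlying principle: instruction tuning reframes many different tasks as natural-language instructions paired with expected answers, so that one fine-tuning run covers a broad mix of tasks. The following minimal sketch shows what such training samples could look like; the field names and contents are purely illustrative and not taken from any specific dataset or model.

```python
# Schematic instruction-tuning samples: each task is expressed as an
# instruction plus an expected response, so a single fine-tuning run
# covers many task types at once. Field names are illustrative only.
instruction_samples = [
    {
        "instruction": "Summarize the following invoice text in one sentence.",
        "input": "Invoice no. 4711, issued 2024-01-15, total amount 1,190.00 EUR ...",
        "output": "Invoice 4711 from 15 January 2024 over a total of 1,190.00 EUR.",
    },
    {
        "instruction": "Extract the IBAN from the document text.",
        "input": "Please transfer the amount to IBAN DE89 3704 0044 0532 0130 00.",
        "output": "DE89 3704 0044 0532 0130 00",
    },
    {
        # Multimodal variant: the sample references an image instead of raw text.
        "instruction": "Name the document type shown in the attached image.",
        "image": "samples/delivery_note_001.png",
        "output": "Delivery note",
    },
]
```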

The variety of data the model is trained on is of enormous importance for this process. Dedicated encoding mechanisms are responsible for processing image and video content in addition to language. In this way, the model learns to recognize connections between text and other forms of content and can respond to visual input with verbal explanations or interpretations.
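Conceptually, this can be pictured as a separate image encoder whose output is projected into the same embedding space as the text tokens, so that one language model processes both jointly. The following PyTorch sketch is a strongly simplified illustration of that idea, not the architecture of any particular model; all dimensions, module names and shapes are assumptions.

```python
import torch
import torch.nn as nn

class TinyMultimodalLM(nn.Module):
    """Strongly simplified sketch: projected image embeddings share one
    sequence with text token embeddings inside a single transformer."""

    def __init__(self, vocab_size=32000, d_model=512, img_feat_dim=768):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        # Stand-in for the output of a pre-trained vision encoder (e.g. a ViT).
        self.image_projection = nn.Linear(img_feat_dim, d_model)
        encoder_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, image_features, token_ids):
        # image_features: (batch, num_patches, img_feat_dim) from a vision encoder
        # token_ids:      (batch, seq_len) text token ids
        img_tokens = self.image_projection(image_features)
        txt_tokens = self.token_embedding(token_ids)
        # Image and text embeddings are concatenated into one sequence.
        sequence = torch.cat([img_tokens, txt_tokens], dim=1)
        hidden = self.backbone(sequence)
        return self.lm_head(hidden)

model = TinyMultimodalLM()
dummy_image = torch.randn(1, 16, 768)           # 16 "patches" of visual features
dummy_text = torch.randint(0, 32000, (1, 10))   # 10 text tokens
logits = model(dummy_image, dummy_text)          # shape: (1, 26, 32000)
```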
The insights from the first survey on this topic (A Survey on Multimodal Large Language Models, Yin, Fu et al., 2023) suggest great potential for a broad range of AI applications. This has not escaped subsequent research: with DocLLM, an extension of traditional language models was developed for multimodal document understanding that primarily incorporates the spatial layout structure of documents. These approaches open up extensive new possibilities.

Gamechanger for intelligent document processing
The automated processing of business documents is a complex task, but one that artificial intelligence is making increasingly easy to handle. To date, large language models have played a particularly important role in machine-processing the text they contain. The major difficulty is that documents are often only available as images and therefore first require additional techniques such as optical character recognition. The same applies to capturing layout information, for which computer vision has mostly been used so far. Multimodal LLMs have the potential to simplify this considerably. The following capabilities contribute to this (a minimal usage sketch follows the list):
- Generate output based on visual input, e.g. summarize the content of an uploaded business document or image
- Analyze novel document types without additional fine-tuning
- Answer queries, e.g. name the cost items of an invoice on request
- Parse documents and output the data in various formats, e.g. JSON
- Multilingual use without separate translation, e.g. analyze an English document and answer questions about it in German
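As an illustration of the query and parsing capabilities above, a multimodal model can be asked to return structured data directly from a document image. The following sketch uses the OpenAI Python client as one possible example; the model name, prompt and file name are assumptions, and a productive setup would additionally validate the returned JSON.

```python
import base64
from openai import OpenAI  # pip install openai

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Encode a scanned invoice so it can be passed as an image input.
with open("invoice_scan.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",  # assumption: any multimodal-capable chat model
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Extract invoice number, date and total amount as JSON."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }
    ],
)

print(response.choices[0].message.content)  # e.g. {"invoice_number": "...", ...}
```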
Document analysis is accelerated
Compared to IDP software based on conventional large language models, multimodal LLMs can significantly increase process speed. This begins at the implementation stage, which is less time-consuming because less training is required. The fact that highly specialized business applications, which previously had to be integrated to make the models usable for individual tasks, can be dropped also contributes to this. Added to that is raw performance, which has increased with almost every generation of large AI models. At the same time, developers are ensuring more intuitive handling, which prevents errors and excessive correction loops during further processing.
The alternative - How DocumentGPT reads documents
In the search for alternatives to Google's well-known text bot Bard, it makes sense to look at ChatGPT and OpenAI's multimodal LLM GPT-4. In 2023, the model still frequently responded to visual input (e.g. an ID card) with messages such as "Sorry, I cannot help with that". By now the object is recognized, but extracting the data from the ID card still fails. In addition, there are limits to its precision with subject-specific documents and objects, e.g. in the medical field. There is also no specialized access to business archives that would make it productively usable in companies.
Or is there? DocumentGPT is an AI technology from Konfuzio that optically extracts labels and inscriptions. Language processing is then handled via the GPT-4 API using OpenAI's latest LLM. Since access to the multimodal functionalities is not available via the API, Konfuzio's visual OCR functionalities are needed to extract the data first and only then send it on for pure language processing. On the integration side, Konfuzio's APIs and SDK allow seamless embedding into existing workflows, overcoming current hurdles.
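In code, this division of labour could look roughly as follows: a separate OCR step first turns the document image into plain text, and only that text is sent to the language model. The run_ocr helper below is a hypothetical placeholder for the extraction step (the actual Konfuzio SDK calls are not shown here); the language-model call itself uses the standard OpenAI chat completions API.

```python
from openai import OpenAI  # pip install openai

def run_ocr(document_path: str) -> str:
    """Hypothetical placeholder for the OCR/extraction step, e.g. performed
    with the Konfuzio platform; returns the recognised plain text."""
    return "Sample text recognised from " + document_path

client = OpenAI()  # expects OPENAI_API_KEY in the environment

document_text = run_ocr("id_card_scan.png")

# Only the extracted plain text is sent on for pure language processing.
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system",
         "content": "You answer questions about the supplied document text."},
        {"role": "user",
         "content": f"Document text:\n{document_text}\n\nSummarize the key data fields."},
    ],
)
print(response.choices[0].message.content)
```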
Test DocumentGPT on the Konfuzio marketplace and see for yourself. You can register for free at app.konfuzio.com and apply for access to the powerful AI model.

Limitations of Multimodal LLMs
With every technological advance, the boundaries of what is possible are shifted, but not removed entirely. New AI models in particular often gain more generalized applicability, but at the expense of errors and weaknesses in individual areas. Initial tests of the models reveal which limitations research could focus on in the near future:
- Low data accuracy: Incorrectly extracted data can have troublesome consequences for companies.
- Hallucinations: No less problematic is the invention of data that is not present in the document at all.
- Calculation errors: Even earlier large language models sometimes struggled with basic arithmetic, yet important financial documents leave little room for error.
- Lack of specialization: The more generalized applicability cannot yet outperform fine-tuned models in all areas.
- Processing of high image resolutions: A current study suggests that multimodal LLMs still fail to analyze image information at high resolutions.
Solution approaches
Even though the experimental status of current Multimodal Large Language Models hardly allows integrated solutions for these weaknesses yet, complementary strategies are already emerging. After all, the basic idea of optimizing the performance of AI models is nothing new. The following approaches, for example, can help achieve good results with documents and text even at the current state of development:
Human in the Loop is a valuable concept that both prevents errors and improves the model's future performance through annotations. To this end, human team members provide feedback in a regular loop. More information can be found in this blog post.
Expert systems can take over this human logic in error handling by being programmed with a chain of checking steps and rules for action.
The result is hybrid models, which allow a high degree of automation despite the error-proneness of the underlying language model.
It is therefore particularly important to apply a business logic that is implemented in various ways - by human or machine - as a validation layer around the new system (a minimal sketch of such a layer follows this list).
Supplementary models such as DocLLM can add further capabilities to existing MLLMs and thus at least partially solve existing problems. Another current example is Monkey, which addresses the limitations around high image resolutions.
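How such a validation layer around the model output could look is sketched below: simple business rules check the extracted fields, and anything that fails the rules or falls below a confidence threshold is routed to human review (human in the loop). Thresholds, field names and rules are illustrative assumptions, not part of any specific product.

```python
from dataclasses import dataclass

@dataclass
class ExtractedField:
    name: str
    value: str
    confidence: float  # confidence reported by the extraction model (0..1)

def passes_business_rules(fields: dict[str, ExtractedField]) -> bool:
    """Minimal rule set, purely illustrative: the total must be a parseable
    number and an invoice number must be present."""
    try:
        float(fields["total_amount"].value.replace(",", "."))
    except (KeyError, ValueError):
        return False
    return bool(fields.get("invoice_number") and fields["invoice_number"].value)

def route_document(fields: dict[str, ExtractedField], threshold: float = 0.9) -> str:
    """Hybrid decision: automate only if rules pass and confidence is high,
    otherwise hand the document to a human reviewer."""
    low_confidence = any(f.confidence < threshold for f in fields.values())
    if passes_business_rules(fields) and not low_confidence:
        return "auto_approve"
    return "human_review"

extraction = {
    "invoice_number": ExtractedField("invoice_number", "4711", 0.97),
    "total_amount": ExtractedField("total_amount", "1190,00", 0.72),
}
print(route_document(extraction))  # -> "human_review" (amount below threshold)
```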
To keep track of these concepts, introducing an all-in-one solution is a good idea: Konfuzio is a platform for intelligent document processing that combines principles of business logic and large language models as an interface between humans and AI. The flexible use of technologies such as computer vision and optical character recognition has been optimized over the years and adapts to the latest (multimodal) developments at any time.
Conclusion
Thanks to the ability to process multimodal content and combine it with text, large language models have reached a new dimension of generative AI. Boundaries that were previously clearly drawn in natural language processing are being overcome. Multimodal LLMs not only understand images and videos, but also offer greater flexibility in language processing. This is ensured by new methods such as "instruction tuning", which is not limited to individual tasks and therefore makes subsequent supervised training superfluous in many cases.
This innovation offers particularly great potential for intelligent document processing. This was previously heavily dependent on fine-tuning and the combination with specialized business applications and vision models. However, multimodal LLMs cannot yet completely replace this approach. Separate validation mechanisms are still required to prevent inaccuracies and errors. Everything else is likely to be a question of time, which will soon be answered by developments that are already underway.
Would you like to find out more about the possibilities of multimodal LLMs for companies? Please feel free to contact us.