Multimodal LLMs - Beyond the Limits of Language

Tim Filzinger

Only a short time after the triumph of large language models, artificial intelligence has achieved another decisive breakthrough: newly developed multimodal large language models can process visual elements in addition to text. This brings us another step closer to the long-envisioned general AI.

Multimodal deep learning plays a key role in this. Although still a young specialty of machine learning, it is already achieving impressive results in object recognition as well as speech and image analysis. This opens up a wide range of opportunities - especially in the field of intelligent document processing.

A new dimension of generative AI

Until recently, it was the common standard: pre-trained large language models (LLMs) with domain-specific fine-tuning are used to solve a wide range of natural language processing (NLP) tasks. Their basic ability to recognize complex contexts in human language comes from analyzing immense amounts of text in an unsupervised learning process. The resulting capabilities in analyzing, generating, translating and summarizing text were enough to turn the tech sector on its head - think ChatGPT. However, they model only one dimension of human perception: a very important one, but a single one.

Multimodal LLMs have recently overcome this limit by supplementing the capabilities of conventional models with the processing of multimodal information - for example images, but also audio and video formats. They are thus able to solve much more comprehensive tasks and in many cases do not even need to be specially tuned for the purpose. The combination with vision models, often necessary until now, could thereby become considerably less important. Overall, this marks a significant breakthrough, expressed in the following fundamental advances:

  • Approximation of human perception through centralized processing of different types of information
  • Increased usability and more flexible interaction through visual elements
  • Solving novel tasks without separate fine-tuning
  • No restriction to the scope of natural language processing
The range of supported formats could grow even further.

How do Multimodal LLMs work?

Multimodal LLMs essentially continue to build on the Transformer architecture introduced by Google in 2017. Developments in recent years have already shown that comprehensive extensions and reinterpretations of this architecture are possible - especially in the choice of training data and learning procedures, as is the case here.

Multimodal Deep Learning

This new special form of machine and deep learning focuses on developing algorithms whose combination allows the processing of different data types. This is still done using neural networks, which, thanks to their depth, can also handle particularly high information content, as found above all in visual material. This enables a more intensive learning process. Multimodal deep learning therefore not only allows the handling of diversified input, but also leads to increased speed and performance. One of the greatest challenges, however, lies in providing the necessary data volumes.

Replacement of classic fine tuning

In addition, novel methods such as so-called "instruction tuning" are used. Unlike classic fine-tuning, this means fine-tuning a pre-trained LLM on a whole range of tasks at once. The result is a clearly more generalized applicability: such models can also handle previously unseen tasks without further supervised training or elaborate prompting.

The versatility of the data passing through the model is of enormous importance for this process. Dedicated encoding mechanisms process image and video content in addition to language. In this way, the model learns to recognize connections between text and other forms of content and can respond to visual input with linguistic explanations or interpretations.
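How text and image end up in one model can be illustrated with a minimal sketch: many multimodal LLMs project the output of a vision encoder into the text embedding space and prepend the resulting image "tokens" to the prompt tokens, so the transformer processes both modalities as one sequence. The dimensions and the linear projection below are purely illustrative assumptions, not the architecture of any specific model:

```python
import numpy as np

rng = np.random.default_rng(0)

d_vision, d_text = 768, 4096    # illustrative embedding dimensions
n_patches, n_tokens = 16, 8

# Output of a (hypothetical) frozen vision encoder: one vector per image patch
patch_embeddings = rng.standard_normal((n_patches, d_vision))

# Learned projection that maps vision features into the text embedding space
W_proj = rng.standard_normal((d_vision, d_text)) * 0.02

# Regular token embeddings of the text prompt
token_embeddings = rng.standard_normal((n_tokens, d_text))

# Project the patches and prepend them to the text tokens: the language
# model now sees image content as just more positions in its input sequence
image_tokens = patch_embeddings @ W_proj
sequence = np.concatenate([image_tokens, token_embeddings], axis=0)

print(sequence.shape)  # (24, 4096): 16 image tokens + 8 text tokens
```

In real systems the projection is trained jointly with (or on top of) the language model, which is what lets the model ground its linguistic answers in the visual input.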

The insights from the first survey on this topic (A Survey on Multimodal Large Language Models, Yin, Fu et al., 2023) suggest great potential for a broad field of AI applications. This has not escaped subsequent research: with DocLLM, an extension of traditional language models was developed for multimodal document understanding that primarily incorporates the spatial layout structure of documents. These approaches open up extensive new possibilities.

Gamechanger for intelligent document processing

The automated processing of business documents is a complex process, but one that is becoming increasingly easy to handle with artificial intelligence. To date, large language models have played a particularly important role in machine processing of the text documents contain. The major difficulty: documents are often available only as images and therefore first require additional techniques such as optical character recognition. The same applies to capturing layout information, for which computer vision has mostly been used so far. Multimodal LLMs have the potential to simplify this considerably. The following capabilities help to achieve this:

  • Generating output based on visual input, e.g. summarizing the content of an uploaded business document or image
  • Analyzing novel documents without additional fine-tuning
  • Answering queries, e.g. naming the cost items of an invoice on request
  • Parsing documents and outputting the data in various formats, e.g. JSON
  • Multilingualism without separate translation, e.g. analyzing an English document and answering questions about it in German
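As a sketch of the parse-to-JSON capability, a request to a multimodal chat API can bundle the document image and the extraction instruction into a single message. The example below only assembles such a payload along the lines of OpenAI's image-input message format; the model name, prompt wording and field list are assumptions for illustration, not part of any specific product:

```python
import base64
import json

def build_invoice_request(image_bytes: bytes,
                          model: str = "gpt-4-vision-preview") -> dict:
    """Build a chat-completion payload asking a multimodal model to parse
    an invoice image into structured JSON (field names are examples)."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    prompt = (
        "Extract invoice number, date, line items and total amount "
        "from this document and answer only with a JSON object."
    )
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                # The image travels inline as a base64 data URL
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }

payload = build_invoice_request(b"\x89PNG placeholder bytes")
print(json.dumps(payload)[:60])
```

Because the image and the instruction share one request, no separate OCR or layout-analysis step is needed before the model can answer.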

Document analysis is accelerated

Compared to previous IDP software based on conventional large language models, multimodal LLMs can significantly increase process speed. This starts with implementation, which is less time-consuming thanks to the reduced training effort. It also helps that highly specialized business applications, which previously had to be integrated to make the models applicable to individual use cases, are no longer required. Added to this is the increased performance, which has scaled with practically every generation of large AI models. At the same time, developers are ensuring more intuitive handling, which prevents errors and sprawling correction loops during further processing.

The alternative - How DocumentGPT reads documents

In the search for alternatives to Google's well-known text bot Bard, it is obvious to look at ChatGPT and OpenAI's new multimodal LLM GPT-4. For document processing, however, the pleasure is short-lived: in October 2023, uploading an ID card is possible, but a request about the image content is answered with "Sorry, I cannot help with that". Even clarifying prompts do not change this output. The linguistic capabilities of GPT-4 remain undisputed; what is missing is practically usable multimodal access to them.

Or is it? DocumentGPT is an AI technology from Konfuzio that enables optical extraction of labels and captions. Language processing is then handled via the GPT-4 API by OpenAI's latest LLM. At the other end, seamless integration into existing workflows can be achieved via Konfuzio's APIs and SDK, overcoming the hurdles that currently exist.

Test DocumentGPT on the Konfuzio Marketplace and see for yourself. There you can register for free and request access to the powerful AI model.

DocumentGPT succeeds where ChatGPT has failed so far.

Limitations of Multimodal LLMs

With every technological advance, the boundaries of what is possible shift, but they do not disappear entirely. New AI models in particular often gain generalized applicability at the expense of errors and weaknesses in individual areas. Initial tests of the models reveal the limitations that research could focus on in the near future:

Low data accuracy: Incorrect extraction of data can have troublesome consequences for companies.

Hallucinations: No less problematic is the generation of data that is not present in the document at all.

Calculation errors: Even earlier large language models sometimes struggled with basic arithmetic. Important financial documents, however, leave little room for error.

Lack of specialization: The more generalized applicability cannot yet outperform fine-tuned models in all areas.

Solution approaches

Even if the experimental status of current multimodal large language models hardly allows integrated solutions to these weaknesses yet, complementary strategies are already emerging. After all, the basic idea of optimizing the performance of AI models is nothing new. The following approaches, for example, could help achieve good results with documents and text even at the current state of development:

Human in the loop is a valuable concept that both prevents errors and improves the future performance of the model through annotations. For this, human team members provide feedback in a regular loop.

Expert systems can replace this human logic in troubleshooting by being programmed with a chain of investigation steps and action principles.

The result is hybrid models, which allow a high degree of automation despite the error-proneness of the underlying language model.

It is therefore particularly important to apply a business logic - implemented in various ways, by human or machine - as a validation layer around the new system.
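A minimal sketch of such a validation layer, under illustrative assumptions (the field names and the single consistency rule are examples, not a prescribed schema): extracted invoice data is checked against a simple arithmetic rule, and anything that fails is routed to human review instead of being passed on automatically.

```python
from decimal import Decimal

def validate_invoice(extraction: dict) -> list[str]:
    """Business-logic check on an extracted invoice. Returns a list of
    issues; an empty list means the result may pass on automatically,
    otherwise the document goes to human review (human in the loop)."""
    issues = []
    items = extraction.get("line_items", [])
    if not items:
        issues.append("no line items extracted")
    # Decimal avoids float rounding errors on monetary amounts
    total = Decimal(str(extraction.get("total", "0")))
    item_sum = sum(Decimal(str(item["amount"])) for item in items)
    if item_sum != total:
        # Catches both calculation errors and hallucinated amounts
        issues.append(f"line items sum to {item_sum}, total says {total}")
    return issues

ok = {"total": "119.00",
      "line_items": [{"amount": "100.00"}, {"amount": "19.00"}]}
bad = {"total": "120.00",
       "line_items": [{"amount": "100.00"}, {"amount": "19.00"}]}
print(validate_invoice(ok))   # []
print(validate_invoice(bad))  # one issue flagged for review
```

Rules like this can be maintained by humans or generated by an expert system; either way they form the deterministic shell that the probabilistic language model alone cannot provide.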

In order not to lose track of these concepts, the introduction of an all-in-one solution is a good idea: Konfuzio is a platform for intelligent document processing that combines principles of business logic and large language models as an interface between humans and AI. The flexible use of technologies such as computer vision or optical character recognition has been optimized over the years and adapts to the latest (multimodal) developments at any time.


With the ability to process multimodal content and combine it with text, large language models have reached a new dimension of generative AI. Boundaries that were previously drawn sharply around the area of natural language processing are being overcome. A multimodal LLM is not only capable of understanding images and videos, but also offers a greater degree of flexibility in language processing. This is ensured by novel methods such as instruction tuning, which is not limited to individual tasks and thus in many cases makes subsequent supervised training unnecessary.

This innovation offers particularly great potential for intelligent document processing. Until now, this has been heavily dependent on fine-tuning and combination with specialized business applications and vision models. However, multimodal LLMs cannot yet completely replace this approach. Separate validation mechanisms are still necessary to prevent inaccuracies and errors. Everything else is likely to be a question of time, which will soon be answered by developments already underway.

Would you like to learn more about the new possibilities offered by multimodal LLMs? Please feel free to contact us.
