Today we take on the view of Dan Lucarini, who, in his role as a leading analyst for IDP (Intelligent Document Processing), argues that terminological diversity in our field causes more confusion than it helps. However, as is so often the case, there are two sides to every coin.
"I suspect this marketing word salad is a consequence of the massive FOMO (fear of missing out) infecting the C-suite." - Dan Lucarini, "Documents, content, files, records, semi-structured or unstructured data: do labels really matter anymore?"
First, it is important to emphasize that we fully understand Dan's concerns. He argues that the terms we use to describe the types of data we process - be it "documents", "content", "files", "records", "semi-structured data" or "unstructured data" - cause confusion and are ultimately of little significance. This point of view is understandable.
The problem, Dan says, arises when these terms are used in an uninformed and inflationary manner. Industry jargon, when misused or overused, can become buzzwords that create confusion and dilute the original intent of the terms.
We agree with this in part. However, it is important to remember that technical terms in science and technology often exist for a good reason: they enable precise and clear communication between experts. When they are taken out of their original context and used in an inflationary manner, however, they can indeed become a kind of "buzzword bingo" in which the true meaning of the terms is lost.
This article was written in German, automatically translated into other languages and editorially reviewed. We welcome feedback at the end of the article.
A plain explanation: OCR and how some companies present it
Optical Character Recognition (OCR) is basically a technology that allows computers to "read" printed or handwritten text from images or printed documents.
Imagine you have a photo of a sign that says "Open from 9am to 6pm." You could use Tesseract OCR to digitize this text.
Here is the command you could type on your command line to run Tesseract (see the Installation Guide), assuming the image is called "shield.jpg":
tesseract shield.jpg output
This command tells Tesseract to take the image "shield.jpg" and write the recognized text to a file named "output.txt".
If you then open the resulting "output.txt" file, you might see the following text:
Open from 9 to 18
This is now "raw" text that you can process further; depending on the engine, the optical position of the letters is also returned alongside the raw text (see bounding boxes). But remember that Tesseract (or any other OCR software) does not automatically recognize that these are opening hours or that "9 to 18" represents specific times of day. Such interpretation and analysis go beyond the basic functions of pure OCR.
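To make this concrete: turning the raw OCR output into structured opening hours requires a separate interpretation step. A minimal sketch in Python - the function name, the regular expression and the output format are illustrative assumptions, not part of any OCR engine:

```python
import re

def parse_opening_hours(raw_text):
    """Interpret OCR raw text such as 'Open from 9 to 18' as opening hours.

    This interpretation step is NOT part of OCR itself; the pattern and
    the dictionary layout are illustrative assumptions.
    """
    match = re.search(r"from\s+(\d{1,2})\s+to\s+(\d{1,2})", raw_text)
    if not match:
        return None  # the text contains no recognizable opening hours
    return {"opens": int(match.group(1)), "closes": int(match.group(2))}

print(parse_opening_hours("Open from 9 to 18"))  # {'opens': 9, 'closes': 18}
```

Even this toy example shows where OCR ends and "intelligence" begins: the regular expression, not the character recognition, carries the domain knowledge about what opening hours look like.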
Soon, however, OCR became the miracle cure
This is the basic function of OCR. In the early days of the technology, this was quite a feat, as it saved a lot of manual work and made it possible to edit and search for text in digital form.
Over time, however, some companies have significantly expanded the presentation of OCR and marketed it as a sort of miracle cure for a variety of data and document management challenges. They have positioned "OCR" as a solution for tasks such as data extraction, text analysis, automatic categorization of documents and much more.
In reality, however, many of these advanced features are not really part of OCR technology itself, but the result of integrating OCR with other technologies such as artificial intelligence, machine learning, or natural language processing. Thus, even newer models such as LayoutLM, R-CNN or Pegasus still rely on OCR as a basis.
Recent research even holds out the prospect of OCR-free approaches that eliminate the separate step between image and text processing altogether, cf. the DONUT paper.
While these advanced solutions are undoubtedly valuable and can provide significant benefits, it's important to remember that "OCR" in and of itself is only one piece of the puzzle. It enables machines to "see" and recognize text, but the additional features often marketed under the term "OCR" require additional technologies and capabilities.
Do we still need technical terms at all?
"Whatever you send it, AI breaks it all down into machine-digestible components of text, layout, image, page count, etc." - Dan Lucarini, "Documents, content, files, records, semi-structured or unstructured data: do labels really matter anymore?"
We very much appreciate Dan's somewhat exaggerated remarks. However, we have to disagree with him on one particular point, namely his statement: "First, GPT and other basic LLMs don't care what generic label we use for the 'stuff' we've given it to understand and analyze. An AI model doesn't distinguish between a 'structured', 'semi-structured' or 'unstructured' document/content/data/file; that's a human way of categorizing our stuff. Whatever you send it, AI breaks it all down into machine-digestible components of text, layout, image, page count, etc."
It is true that Large Language Models (LLMs) like GPT-3 can process content at a very basic level, but on their own they are not capable of performing complex tasks such as page segmentation or deep, context-based processing of text. LLMs are a powerful tool, but they are not the only solution for all types of document processing.
Choose words clearly, but don't oversimplify!
Various studies, especially in page segmentation, have shown that the best processing quality is currently achieved by splitting documents contextually. This means that the model takes into account not only the text itself, but also the structure and layout of the document. The use of visual context helps to better understand and process the document. For example, a table in a document is not just a collection of running text, but a clearly structured block of information that should be interpreted in a certain way.
Newer LLMs can also benefit from contextual processing. The plain text information that an LLM processes can be greatly enhanced by contextual information such as "This text is in a table." Understanding the context can guide the model to interpret the text in a way that is closer to human interpretation.
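One simple way to pass such context to an LLM is to label each text segment with its layout role before building the prompt. The role names and the bracket notation below are illustrative assumptions, not a fixed standard:

```python
def with_layout_context(segments):
    """Prefix each text segment with its layout role.

    `segments` is a list of (role, text) pairs; the roles and the
    "[role] text" notation are illustrative assumptions.
    """
    return "\n".join(f"[{role}] {text}" for role, text in segments)

segments = [
    ("heading", "Opening hours"),
    ("table-cell", "Mon-Fri: 9 to 18"),
    ("paragraph", "Closed on public holidays."),
]
print(with_layout_context(segments))
```

A model that sees "[table-cell] Mon-Fri: 9 to 18" has strictly more to work with than one that only sees the bare string, which is the whole argument for preserving structural labels instead of discarding them.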
Overall, then, we should not underestimate the importance of domain-specific concepts. They are not only a human idiosyncrasy, but can also help make AI models more effective and accurate. Processing "structured", "semi-structured" or "unstructured" documents may well differ and produce different results, depending on the exact method used. Different approaches are suitable depending on the application scenario and specific requirements.
Perhaps the solution is not to do away with technical terms altogether, but to use them more consciously and carefully. Education and understanding are key here. It is our responsibility as experts to ensure that we not only use the right terms, but also convey the meaning behind them.
As much as we appreciate Dan's critique of the overuse of technical terms, we believe that the suggestion to leave the division of knowledge and context to AI altogether is problematic. After all, it is our job as experts to make complex concepts understandable while still remaining precise and scientifically correct.
Let's avoid buzzword bingo. Only knowingly used technical terms create knowledge and remain meaningful. In this way, we can ensure that our communication in the industry is not only precise, but also understandable.
But even our editorial team has certainly used one or the other word too often or failed to define it precisely. If you notice something, contact us and we will fix any buzzword.