As a data scientist or annotation manager, you may face the following problem: you want to extract specific data from a large number of PDFs or other documents, or generate granular data for training optical or semantic AI. A labeling tool lets you target and extract the semantic entities you want, such as "price", "seller" or "tax". With Konfuzio, you can combine such NLP (Natural Language Processing) applications with CV (Computer Vision) labeling of images. Whether you work with receipts, contracts, financial documents or invoices, automated data extraction via AI can increase the efficiency and productivity of your business at a fraction of the cost.
However, none of this is possible without text annotation. Analyzing structured documents such as invoices, receipts and contracts is a complicated undertaking even for modern AI. You therefore need a labeling tool that lets you selectively label and extract individual areas of a document. Konfuzio offers an all-in-one labeling tool for extracting data from text and images.
This article was written in German, automatically translated into other languages and editorially reviewed. We welcome feedback at the end of the article.
Model-centric vs. data-centric
If you've worked on data science projects, you may be familiar with some of the steps in a typical ML model build. These used to look like this:
- Collect data
- Clean data
- Try out several models
- Tune the model parameters
- Deploy to production
- Monitor the model
The main focus was on the third and fourth steps, with ML models in the foreground. Data scientists devoted little to no time to the data itself. Under this model-centric approach, advances in storage and computing power led to the development of modern algorithms, while the most fundamental part of the process, the data, was neglected.
Data for ML algorithms is like food for us humans. Therefore, we need to provide our algorithms with the best possible data quality to achieve the best performance. The data-centric approach focuses primarily on providing quality data. This means that in addition to focusing on algorithm selection, we need to spend time capturing and annotating data, correcting mislabeled data, augmenting data, and scaling these types of processes. You can master these tasks with Konfuzio's annotation and labeling tools.
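One of the data-centric tasks mentioned above, correcting mislabeled data, can be illustrated with a minimal sketch. It assumes labeled examples are available as plain `(text, label)` pairs; the sample data and the function name `find_label_conflicts` are hypothetical and not part of any Konfuzio interface. The idea is simply to flag texts that received more than one label so that a human can review them.

```python
from collections import defaultdict

# Hypothetical labeled examples: (text span, label assigned by an annotator).
labeled_spans = [
    ("19.99 EUR", "price"),
    ("ACME GmbH", "seller"),
    ("19.99 EUR", "tax"),  # same text, different label: worth a second look
    ("3.19 EUR", "tax"),
    ("ACME GmbH", "seller"),
]


def find_label_conflicts(spans):
    """Return texts that were annotated with more than one distinct label."""
    labels_by_text = defaultdict(set)
    for text, label in spans:
        labels_by_text[text].add(label)
    return {
        text: sorted(labels)
        for text, labels in labels_by_text.items()
        if len(labels) > 1
    }


conflicts = find_label_conflicts(labeled_spans)
for text, labels in conflicts.items():
    print(f"Review {text!r}: labeled as {labels}")
```

A conflict found this way is only a hint, not proof of an error: the same amount can legitimately appear as both a price and a tax on different invoices, so the final decision stays with the reviewer.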
What is an annotation tool?
You may have used an online translator such as Google Translate or DeepL. Such applications use NLP (Natural Language Processing). This AI technology helps machines understand human language so that, for example, translations or automatic spell checking are possible. NLP is widely used for information retrieval in unstructured texts. Analyzing structured documents such as invoices, receipts and contracts, however, is a bit more complicated.
First, there is not much context surrounding the areas of a document we want to extract. Individual entities such as price, seller, or tax usually stand alone, with no other text in their immediate surroundings, although such surrounding text is exactly what helps when training an NLP model. Second, the layout often changes from one invoice to the next. As a result, conventional NLP works poorly on structured documents.
Since most receipts and invoices are scanned or in PDF format, we need a labeling tool that supports OCR parsing and annotation directly on native PDFs and images. An annotation refers to a character, word, or paragraph selected in a document for extraction. By creating annotations, you train the AI to extract data from your documents correctly. OCR stands for "optical character recognition", a technology that allows a computer to recognize and extract text. Unfortunately, most labeling tools that support OCR annotation are either exorbitantly expensive or incomplete, requiring you to perform the OCR step externally before annotation. With Konfuzio, however, you get an all-in-one solution.
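To make the term "annotation" more concrete, here is a minimal sketch of how such a record could be represented in code. The class and field names (`Annotation`, `BoundingBox`, `page`), the pixel coordinate system and the sample values are assumptions chosen for illustration; they do not describe Konfuzio's actual data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    """Rectangle on a page; pixel coordinates with a top-left origin (assumed)."""
    x0: float
    y0: float
    x1: float
    y1: float


@dataclass(frozen=True)
class Annotation:
    """A selected piece of text together with its label and position."""
    label: str  # e.g. "price", "seller" or "tax"
    text: str  # the text recognized by OCR inside the box
    page: int  # zero-based page index (assumed convention)
    box: BoundingBox


# Hypothetical example: a price found near the bottom of the first page.
example = Annotation(
    label="price",
    text="19.99 EUR",
    page=0,
    box=BoundingBox(x0=410.0, y0=620.0, x1=480.0, y1=634.0),
)
print(example.label, example.text, example.box)
```

Keeping the text and its position in one record is what allows a model to learn from both the content and the layout of a document.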
Labeling Tool from Konfuzio - the end-to-end solution
Konfuzio provides an end-to-end solution that allows you to annotate native PDF files, scanned images or images from your smartphone directly without losing document layout information. After all, text order and spatial information are equally important, for example in invoice extraction. All you need to do is upload your PDF, JPG or PNG directly and start annotating. Using state-of-the-art OCR technology, Konfuzio analyzes the text or handwriting of your documents and extracts all tokens with their bounding boxes. Konfuzio is your all-in-one tool for automatic document processing. You don't need any additional applications.
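Why tokens need their bounding boxes can be shown with a small sketch: OCR output does not necessarily arrive in reading order, and the coordinates are what make it possible to reconstruct that order. The `Token` type, the `line_tolerance` parameter and the sample coordinates below are hypothetical; this is a simple heuristic for illustration, not the method Konfuzio uses.

```python
from typing import NamedTuple


class Token(NamedTuple):
    """A word recognized by OCR plus its bounding box (top-left origin assumed)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def reading_order(tokens, line_tolerance=5.0):
    """Group tokens into lines by vertical position, then sort each line left to right."""
    lines = []  # each entry is the list of tokens that share one text line
    for token in sorted(tokens, key=lambda t: (t.y0, t.x0)):
        # Compare with the first token of the current line to decide on a line break.
        if lines and abs(lines[-1][0].y0 - token.y0) <= line_tolerance:
            lines[-1].append(token)
        else:
            lines.append([token])
    return [sorted(line, key=lambda t: t.x0) for line in lines]


# Hypothetical OCR output in arbitrary order.
tokens = [
    Token("Total", 50, 300, 90, 312),
    Token("19.99", 400, 301, 440, 313),
    Token("Invoice", 50, 40, 110, 55),
    Token("EUR", 445, 300, 470, 312),
    Token("No.", 115, 41, 140, 55),
    Token("4711", 145, 40, 180, 55),
]

text_lines = [" ".join(t.text for t in line) for line in reading_order(tokens)]
print(text_lines)
```

A fixed tolerance is a deliberate simplification; skewed scans or multi-column layouts would need a more robust grouping.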
How to annotate PDFs and other documents with the Konfuzio Annotation Tool
- Provide the right tools to the data labeling team
The Konfuzio Data Labeling Tool offers the right solution for both texts and images. When labeling data sets from different sources or in different formats, a data labeling solution that supports all different file formats can make the data labeler's job easier.
Beyond their feature set, your data labeling tools should also have an optimized and intuitive user interface. This is the only way to keep an overview across different data contexts.
- Create an annotation
You can create an annotation by clicking and dragging the cursor to draw a rectangle over the area you want. When you save the annotation, Konfuzio recognizes the text inside the selected area.
When you click Edit again, you will see the red box that you used to select the text, which you can move and resize. If you select an area that does not contain text, the red box represents the so-called bounding box used for AI training. If you prefer finer control over the selection, you can also create an annotation by individually clicking the words you want to select.
- Label the annotations
After the annotation has been created, click on "Annotations". There you will see a summary of all annotations. If you click on the annotation, you will be redirected to the document and the annotation you just created. You can also click on the link to the label. In the following example, each annotation of the label "Change Date" is labeled as a date value. After you save the label, you can preview the result on the annotations page.
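The relationship between a drawn rectangle and the recognized text can be sketched as follows. The sketch assumes that every OCR token has a bounding box and that a token counts as selected when its box lies fully inside the drawn rectangle; the names `Box`, `contains` and `union` and the coordinates are hypothetical and only illustrate the idea.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Rectangle in page coordinates (top-left origin assumed)."""
    x0: float
    y0: float
    x1: float
    y1: float


def contains(outer, inner):
    """True if `inner` lies completely inside `outer`."""
    return (
        outer.x0 <= inner.x0
        and outer.y0 <= inner.y0
        and outer.x1 >= inner.x1
        and outer.y1 >= inner.y1
    )


def union(boxes):
    """Smallest box that encloses all given boxes (requires at least one box)."""
    return Box(
        min(b.x0 for b in boxes),
        min(b.y0 for b in boxes),
        max(b.x1 for b in boxes),
        max(b.y1 for b in boxes),
    )


# Hypothetical OCR tokens on one line of an invoice.
tokens = [
    ("Total", Box(50, 300, 90, 312)),
    ("19.99", Box(400, 300, 440, 312)),
    ("EUR", Box(445, 300, 470, 312)),
]

# Rectangle drawn by the user around the amount.
selection = Box(395, 295, 475, 316)

selected = [(text, box) for text, box in tokens if contains(selection, box)]
selected_text = " ".join(text for text, _ in selected)
# If no token was hit, the drawn rectangle itself serves as the bounding box.
tight_box = union([box for _, box in selected]) if selected else selection

print(selected_text, tight_box)
```

Selecting words by clicking them individually corresponds to building the `selected` list by hand and taking the `union` of those boxes.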
- Automate the annotation process
After an extraction AI is trained and evaluated, it creates annotations in all documents associated with the test and training dataset. This is especially helpful if you failed to annotate information in one document but did so in others.
Once the annotation is created, it has the status "Feedback required". If you see a green box or a red cross, you can provide feedback (see 1). Within a document, you can use the filter to show all annotations that require human feedback (see 2).
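The feedback loop described here can be pictured as a simple status workflow. In the sketch below, only the status text "Feedback required" comes from this article; the other status names, the `Annotation` class and the helper functions are assumptions made for illustration and do not reflect Konfuzio's actual implementation.

```python
from dataclasses import dataclass, replace
from enum import Enum


class Status(Enum):
    FEEDBACK_REQUIRED = "Feedback required"
    ACCEPTED = "Accepted"  # assumed name for confirmed annotations
    REJECTED = "Rejected"  # assumed name for declined annotations


@dataclass(frozen=True)
class Annotation:
    label: str
    text: str
    status: Status = Status.FEEDBACK_REQUIRED


def needs_feedback(annotations):
    """Filter for annotations that still wait for a human decision."""
    return [a for a in annotations if a.status is Status.FEEDBACK_REQUIRED]


def apply_feedback(annotation, is_correct):
    """Return a copy of the annotation with the reviewer's decision recorded."""
    new_status = Status.ACCEPTED if is_correct else Status.REJECTED
    return replace(annotation, status=new_status)


# Hypothetical annotations created by a trained extraction AI.
annotations = [
    Annotation("price", "19.99 EUR"),
    Annotation("seller", "ACME GmbH", Status.ACCEPTED),
    Annotation("tax", "Invoice"),  # a wrong suggestion the reviewer will decline
]

pending = needs_feedback(annotations)
reviewed = [apply_feedback(pending[0], True), apply_feedback(pending[1], False)]
for annotation in reviewed:
    print(annotation.label, annotation.status.value)
```

Both accepted and rejected suggestions are useful: each decision is one more checked example for the next training run.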
Konfuzio - your all-in-one tool for data extraction
With the user-friendly Konfuzio API you can train NLP models without much effort. There is no need to process your images beforehand with external APIs or add rules for pre-annotation. You simply upload, label and export your documents.
- NER Labeling: Identify and label key information in texts.
- Image and Document Classification: Assign categories to documents and images to facilitate their management, search, filtering or analysis.
- Bounding Box Image Labeling: Identify and locate objects in images.
- User management: Important for highly sensitive data and large teams in regulated companies. Can be operated as SaaS and on-premises installation.
- Unique: Konfuzio combines the visual with the semantic in one UI (user interface). You no longer need separate tools for CV (Computer Vision) and NLP (Natural Language Processing).
Data labeling identifies raw data (images, text files, videos, etc.) and tags it with one or more meaningful and informative labels. This creates context so that an AI (artificial intelligence) can learn from it. For example, the labels can indicate whether an invoice contains information such as "date," "price," or "seller." Labeling data is also required for a variety of other use cases, including natural language processing and speech recognition.
Customization to your business needs gives you an edge over your competitors. A labeling tool makes this possible by improving automated decision making. By automating your data extraction, you need only minimal human intervention to make important decisions.
Computer vision is a field within artificial intelligence (AI) that enables computers and systems to extract meaningful information from digital images, videos and other visual inputs - and take action or make recommendations based on that information. If AI enables computers to think, computer vision enables them to see, observe and understand.
Natural Language Processing (NLP) attempts to capture natural language and process it in a computer-based manner using rules and algorithms. NLP uses various methods and results from linguistics and combines them with modern computer science and artificial intelligence. The goal is to enable the broadest possible communication between humans and computers via language, so that both machines and applications can be controlled and operated by speech.