Document AI Extraction - How To III

Consistent training data: Theory and practice

After you have learned the basics in Tutorial 1 and worked through the sections in Tutorial 2, you should be ready to work on your own project.

Typical questions from our customers

We have already supported many customer projects, and we get asked the following questions again and again:

  • What should the training data set look like?
  • What exactly should I label? 
  • Should I label the euro symbol together with a monetary amount?
  • Should I label a date every time it appears if it is mentioned more than once?
  • Should I include the commas in enumerations? 

Due to the great versatility of Konfuzio, many answers depend on the individual case, and your own questions will certainly differ from these to some extent. However, most questions can be answered simply by understanding how our AI thinks and works. We answer the questions here on that basis.

We also show you practical tips for a successful training process.

Questions & Answers

How does the AI think?

The Konfuzio AI is not rule-based but result-oriented. It treats the training data as the desired result and derives its own rules, which it then applies to new documents to achieve a corresponding result. For it to recognize clear structures in this process, manual labeling should also follow a clearly structured approach. Irregularities that make no difference to the human brain cause the AI to search for rules and structures that do not exist, which makes it harder for it to make the right decisions.

What should the training data set look like?

The more uniform the documents are, the more accurate the results. Standardized documents are optimal. However, documents are usually not standardized, and this is outside your control. In principle, this is not a problem for Konfuzio, but the more heterogeneous the documents are, the more the quality and quantity of the training data matter.

What exactly should I label?

The short answer: Label what you want to read later, but do so consistently.

Should I label currencies together with monetary amounts?

For monetary amounts, for example, you should either always label the currency (e.g. the euro symbol) or always omit it. It does not matter which way you choose. What matters is that you do this consistently across all documents and also within each document. Of course, this also applies to other units such as kg or m² and to other composite information.
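The "always or never" rule for currencies can be checked mechanically once the labeled text spans are exported. The following is a minimal sketch, not a Konfuzio feature: the list of labeled strings, the function names, and the set of currency markers are all assumptions made for illustration.

```python
import re

# Assumption: these markers cover the currencies in your documents.
CURRENCY_PATTERN = re.compile(r"[€$£]|\b(?:EUR|USD|GBP)\b")


def currency_label_styles(labeled_amounts):
    """Group labeled amount strings by whether they include a currency marker."""
    styles = {"with_currency": [], "without_currency": []}
    for text in labeled_amounts:
        if CURRENCY_PATTERN.search(text):
            styles["with_currency"].append(text)
        else:
            styles["without_currency"].append(text)
    return styles


def is_consistent(labeled_amounts):
    """Return True if all amounts follow the same style (all with or all without)."""
    styles = currency_label_styles(labeled_amounts)
    return not (styles["with_currency"] and styles["without_currency"])


# Hypothetical labels copied from a few annotated documents.
labels = ["1.200,00 €", "35,10 €", "99,90"]
if not is_consistent(labels):
    print("Mixed styles:", currency_label_styles(labels))
```

A check like this only reports the mix; which of the two styles to keep is still a decision for the project team.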

Should I label a date several times if it is mentioned more than once in the document?

Let's take the following example. All pages of a document type contain the date in the upper right corner. Does the date need to be marked on all pages? In a document with many pages, this can become quite time-consuming. Typically, annotators still do this in the first document; in the second document they mark the date only on the first 3-4 pages, and in the third document only on the first page.

This is where the following problem occurs. The AI will look for a reason why the date on the 5th page of the first document was relevant, but the one on the second page of the third document was not. But since there is no meaningful reason here, the AI will be "confused", in human terms, which has a negative effect on the results. 

To prevent this, the key word is once again consistency. Either always label the repeating information on all pages, or always label it only on the first page.
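The drift described above (all pages, then the first few pages, then only the first page) can be detected with a small script. This is a sketch under stated assumptions: each document is represented as a hypothetical pair of labeled page numbers and total page count, and the function names are invented for illustration rather than taken from Konfuzio.

```python
def repetition_policy(labeled_pages, page_count):
    """Classify how a repeating field was labeled in a single document."""
    labeled = set(labeled_pages)
    if page_count == 1 and labeled == {1}:
        # A one-page document fits both policies.
        return "either"
    if labeled == set(range(1, page_count + 1)):
        return "all_pages"
    if labeled == {1}:
        return "first_page_only"
    return "inconsistent"


def project_is_consistent(documents):
    """Return True if every document follows the same labeling policy."""
    policies = {repetition_policy(pages, count) for pages, count in documents}
    policies.discard("either")
    return "inconsistent" not in policies and len(policies) <= 1


# The scenario from the text: three six-page documents labeled with less care each time.
drifting = [([1, 2, 3, 4, 5, 6], 6), ([1, 2, 3], 6), ([1], 6)]
print(project_is_consistent(drifting))  # False
```

Single-page documents are treated as compatible with both policies, so they never cause a false alarm.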

Should I include punctuation?

For consistency, do not include commas, periods, brackets, or other punctuation marks when labeling individual words in running text. Always mark only the actual content that you want to read. Punctuation marks usually stem from the sentence structure; from the point of view of the training data they are fairly arbitrary and therefore not a suitable basis for predictions. If you include them, the AI will in future look for a comma at the end of the word to be read, even though the comma has nothing to do with the information sought.
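If annotations are available as character offsets, surrounding punctuation can be trimmed automatically before training. The sketch below assumes a hypothetical representation of an annotation as a start and end offset into the page text; the function name and the set of trimmed characters are illustrative choices, not part of Konfuzio.

```python
# Assumption: these characters never belong to the labeled content itself.
# Currency symbols are deliberately left out so that amounts are not affected.
TRIM_CHARS = ",.;:!?()[]\"' \t\n"


def trim_span(text, start, end):
    """Shrink the span [start, end) so it excludes surrounding punctuation."""
    while start < end and text[start] in TRIM_CHARS:
        start += 1
    while end > start and text[end - 1] in TRIM_CHARS:
        end -= 1
    return start, end


sentence = "Delivered to Berlin, Hamburg and Munich."
start = sentence.index("Berlin,")
end = start + len("Berlin,")
new_start, new_end = trim_span(sentence, start, end)
print(sentence[new_start:new_end])  # Berlin
```

Note that a trailing period is also removed from abbreviations such as "Inc.", so the character set should be adapted to the labels in your project.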

Tips for a successful training process

Now that you understand the theory behind creating high-quality training data through consistency, we'd like to share a few practical tips for putting this theory into practice in your project.

Create a Labeling Guide

A labeling guide is a document that contains both general and special rules for labeling a document type. It describes what has to be labeled and how, and is often supported by screenshots. When several people work on the same documents, they often label them in different ways. The goal is the consistency described above, which is achieved when everyone involved follows these guidelines. For simple documents and only a few people, verbal agreements are often sufficient. In large projects where several people are involved in labeling, a labeling guide has often proved necessary. For a complex project, we therefore recommend our template.

Feel free to contact us via the Contact form to receive a template.

Use the four-eyes method

Review your training data. Mistakes happen, even to experienced users. To minimize errors, you should ideally have at least one other person review your annotations for accuracy and consistency. This way, careless errors and deviations from the labeling guide can be detected and corrected. In particular, incorrect section assignments can significantly lower the quality of the AI model. You can see how to check this in Tutorial 2.

For an efficient distribution of review tasks, you can also use the following method. When person 1 has labeled a document, they add it to the Preparation Data Set. This way, person 2 knows that it is ready for review. After person 2 adds the document to the Training Data Set after review, everyone involved knows that it has been reviewed.
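The hand-off between the two data sets can be thought of as a small state machine. The following sketch only models that idea outside of Konfuzio, for example in a team's own tracking script; the class, the status names, and the functions are hypothetical and do not correspond to a Konfuzio API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    LABELING = "labeling"        # still being labeled by person 1
    PREPARATION = "preparation"  # labeled, waiting for review
    TRAINING = "training"        # reviewed, used for training


@dataclass
class DocumentRecord:
    name: str
    status: Status = Status.LABELING
    labeled_by: Optional[str] = None
    reviewed_by: Optional[str] = None


def submit_for_review(doc, labeler):
    """Person 1 marks the document as ready for review."""
    if doc.status is not Status.LABELING:
        raise ValueError(f"{doc.name} is not in the labeling stage")
    doc.labeled_by = labeler
    doc.status = Status.PREPARATION


def approve(doc, reviewer):
    """Person 2 confirms the labels; the reviewer must differ from the labeler."""
    if doc.status is not Status.PREPARATION:
        raise ValueError(f"{doc.name} is not waiting for review")
    if reviewer == doc.labeled_by:
        raise ValueError("four-eyes rule: the labeler cannot review their own document")
    doc.reviewed_by = reviewer
    doc.status = Status.TRAINING


invoice = DocumentRecord("invoice_001.pdf")
submit_for_review(invoice, labeler="person_1")
approve(invoice, reviewer="person_2")
print(invoice.status.value)  # training
```

Refusing a review by the original labeler is the one rule that makes the process a four-eyes check rather than a simple status flag.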

Book a weekly check-in

A weekly meeting helps your team build a common understanding. We recommend retraining the model before this meeting (see Step 6 in Tutorial 1). In the meeting, you can analyze the evaluation of the latest model and discuss possible errors in the test and training data that were identified automatically. A Konfuzio expert can contribute valuable tips and tricks directly in the meeting.

Any questions? We are constantly working to improve our instructions so that you can use Konfuzio as quickly and easily as possible. Please let us know if you have any unanswered questions so we can provide you with the best possible solution. Thank you!

Maximilian Schneider
