
Data extraction from documents - How To II

Christopher Helm

How to optimize your project by using templates

Now that you have read about the basics in the first tutorial, we can go one step further.

In this tutorial, we will again use our data set of receipts. This time, however, we will deal with the individual services listed on them. To label them sensibly, we will use sections for the first time. Here, it is important that we teach the AI not only which entities belong to which label but also how the annotations relate to each other.

When we read the price of a product, this information is only useful if we also know which product it belongs to. The same applies to the quantity and to all other information we want to extract. Accordingly, all annotations that belong to one product, and thus to each other, are grouped in a section.

In this example, the sections correspond to the products and, at the formatting level, to the rows. The labels correspond to the properties of the products and, at the formatting level, to the columns.

This way, every relevant entity is assigned two pieces of information: the section and the label. This is visualized in the image by the colored tags.

[Figure: sections and labels]

This principle is needed for lists and tables, among other things. We will show you how to teach it to the AI with this example.
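
To make the idea concrete, here is a minimal sketch in plain Python of how each annotation carries both a section and a label. It does not use the Konfuzio API; the `Annotation` class, its field names, the `group_by_section` helper, and the sample values are assumptions made up for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Annotation:
    """One labeled entity; a hypothetical structure, not a Konfuzio type."""
    text: str     # the text read from the document
    label: str    # the property (column), e.g. "Quantity"
    section: int  # the product (row) the entity belongs to


def group_by_section(annotations):
    """Return {section number: {label: text}} so each product becomes one record."""
    rows = defaultdict(dict)
    for annotation in annotations:
        rows[annotation.section][annotation.label] = annotation.text
    return dict(rows)


# Made-up sample values for two products on a receipt.
annotations = [
    Annotation("2", "Quantity", 1),
    Annotation("Coffee", "Description", 1),
    Annotation("2.50", "Unit price", 1),
    Annotation("1", "Quantity", 2),
    Annotation("Cake", "Description", 2),
    Annotation("3.20", "Unit price", 2),
]

rows = group_by_section(annotations)
for section, properties in sorted(rows.items()):
    print(section, properties)
```

Without the section number, the two prices could not be matched to their products; with it, each section collapses into one record per product.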


Step-by-step tutorial

  1. Create project

    We use the same project as in the first tutorial. If you want to create a new one, you can refer to the first tutorial again for how to create a project.

  2. Create new labels

    Click HOME > Labels > +Add and add your labels there.
    In our example these are: "Quantity", "Description", "Unit price", "Subtotal" and "VAT code".

  3. Create a template

    A template is a group of labels that are logically related to each other. It is therefore the abstract blueprint for the sections. Click HOME > Templates > +Add to create a new template. Name your template (here: "Individual services"). Select the associated project (here: "Receipts"). Check the box "Has multiple Sections". Then click "Save and continue editing" to get to the next step. Here you can add the labels you just created to the template using the arrow keys. Click "Save" to save the template.

  4. Create training data

    Sections are groups of related information in a document. They are the concrete manifestations of the templates. In our example, the first section contains all information of the first product, i.e. the top line or the first individual service of the receipt.
    To label the first section, we create an annotation that belongs to the first section. After clicking on the right entity, we can define the properties of the annotation in the annotation bar on the right side using two tabs. In the upper tab, we select the template that corresponds to the section and in the lower tab, we select the label that should be assigned to the entity.
    We select "Individual services (New)" at the top and "Quantity" at the bottom. We then label the rest of the section, with the first section now being displayed as "Individual services". We repeat this for the next sections. The sections are then listed in the tab, numbered from top to bottom. To create an additional section, select "Individual services (New)".

    We repeat this process for all training documents. Create your training data according to our example. Because use cases vary, your documents may differ from ours. For example, sections do not always have to correspond to rows.

  5. Review the training data

    You can check that the labels are correct, as they are displayed above the annotations. However, it is equally important for the learning success of the AI to verify that the labels are assigned to the correct sections. To do this, proceed as follows:
    In the upper right corner of the annotation bar, select the first section in the "Sections" tab under Filter (here: "Individual services"). Now only the labels of the first section should be visible. Most of the time you can see at a glance whether they are correct (here: whether all labels are in one row). If you see an error, you can use "Edit" in the annotation bar to fix it. (Tip: You should also use this method when checking the results of the AI.)

  6. Evaluate results and give feedback

    The first tutorial shows how to split your documents into a training and a test data set and how to train the AI. It also shows how to give feedback to the AI.

  7. Export results

    The first tutorial also shows how to export and download your data.
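
The check from step 5 (whether all annotations of a section sit in one row) can also be sketched as a small script, provided annotation positions are available. The example below is plain Python and only an illustration: the `(section, label, y)` tuples, the `tolerance` value, and the function name `sections_spanning_rows` are hypothetical and do not reflect any Konfuzio export format. As noted in step 4, sections do not always correspond to rows, so this only makes sense for row-based layouts like our receipts.

```python
from collections import defaultdict


def sections_spanning_rows(annotations, tolerance=5.0):
    """Return the section numbers whose annotations are not on one row.

    annotations: iterable of (section, label, y) tuples, where y is the
    vertical position of the annotation (hypothetical units, e.g. pixels).
    """
    y_by_section = defaultdict(list)
    for section, _label, y in annotations:
        y_by_section[section].append(y)
    # A section is suspicious if its annotations are spread further apart
    # vertically than the tolerance allows.
    return sorted(
        section
        for section, ys in y_by_section.items()
        if max(ys) - min(ys) > tolerance
    )


# Made-up positions: the last annotation was assigned to the wrong section.
sample = [
    (1, "Quantity", 100.0),
    (1, "Description", 101.5),
    (1, "Subtotal", 100.5),
    (2, "Quantity", 120.0),
    (2, "Description", 120.0),
    (2, "Subtotal", 141.0),
]

suspicious = sections_spanning_rows(sample)
print(suspicious)
```

Sections reported by such a check would be the ones to open with the section filter and correct via "Edit".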


Any questions? We are constantly working to improve our instructions so that you can use Konfuzio as quickly and easily as possible. Please let us know if you have any unanswered questions so we can provide you with the best possible solution. Thank you!
