Tutorial 1

Document AI Extraction - How To I

Maximilian Schneider

This article was written in German, automatically translated into other languages and editorially reviewed. We welcome feedback at the end of the article.

How to start your project with Konfuzio

To learn the basics of the Konfuzio platform, we recommend this tutorial, which shows you how to train your own AI in just a few minutes using only 5 training documents. You can either watch the video on YouTube or follow the step-by-step guide below.


 


Document AI Step-by-Step Guide

 

  1. Create a new project

    Click HOME > Projects > Add Project + to create a new project. Name your project; in our example it is called "Quittungen" (German for "receipts"). Click "Save" to save the project. You can invite additional users to your project via HOME > Project Invitations > Add +.

  2. Create a label

    Click HOME > Labels > Add Label + to create a label. Name your label; in our example it is called "Bruttobetrag" (German for "gross amount"). Add it to your project via the tab (here: "Quittungen") and click "Save".

    Click HOME > Templates to access the templates. Click on the template that has the name of your project (here: "Quittungen"). Add the created label to the template by using the arrow buttons to move it from "available Labels" to "chosen Labels". Click "Save" to save the template. In the next tutorial, you will learn how to use templates to read complex documents.

  3. Upload documents

    Click on DOCUMENTS. Here you can upload your local files via drag and drop or the file browser. After the upload, click the Reload button to refresh the page. The OCR process then starts; depending on the file size, this may take a moment. In our example, we upload 9 receipts (5 training and 4 test documents).

  4. Labeling

    Once the OCR process is complete, you can open your document via "Smartview". The OCR will have divided the information in your document into entities. "Entities" are individual words or pieces of information that are outlined with dashed lines. When you click on one, its background turns green. "Annotations" are the relevant pieces of information in a document that should be retrieved or used. They are entities that have been assigned a label, either manually by a human or automatically by the AI. Use the lasso if you want to assign multiple entities to one label. To do this, hold down the mouse button and drag the red lasso that appears over the entities you want to select.

    Click on an entity you want to mark (here e.g. "48,60"). In the annotation bar on the right, you can see the content of the entity as read by the OCR. Click on "Save" to assign the created label (here: "Bruttobetrag") to the entity and thus convert it into an annotation.

    In a more complicated project, you would now need to select what type of template it is and what section of the document it is in. This is what the top tab is for. In this tutorial, however, we will only cover the basics, which is why you only have one label to choose from.

    Repeat step 4 for all uploaded documents. Use the arrows to switch between the documents.
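
    The relationship between entities, labels and annotations described in this step can be pictured as a small data model. The following Python sketch is purely illustrative: the class names `Entity` and `Annotation`, their fields, the bounding-box values and the helper `annotate` are assumptions made for this example and do not represent Konfuzio's actual data model or API.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Entity:
    """A piece of text found by OCR (hypothetical structure)."""
    text: str
    bbox: Tuple[int, int, int, int]  # assumed (x0, y0, x1, y1) page coordinates


@dataclass(frozen=True)
class Annotation:
    """One or more entities that have been assigned a label."""
    label: str
    entities: Tuple[Entity, ...]
    created_by: str  # "human" or "ai"

    @property
    def text(self) -> str:
        # Several entities (e.g. selected with the lasso) are joined by spaces.
        return " ".join(entity.text for entity in self.entities)


def annotate(entities: List[Entity], label: str, created_by: str = "human") -> Annotation:
    """Turn the selected entities into an annotation (hypothetical helper)."""
    if not entities:
        raise ValueError("select at least one entity")
    return Annotation(label=label, entities=tuple(entities), created_by=created_by)


# The entity "48,60" from the example receipt becomes a "Bruttobetrag" annotation.
amount = Entity(text="48,60", bbox=(410, 720, 470, 740))
gross = annotate([amount], label="Bruttobetrag")
print(gross.label, gross.text, gross.created_by)
```

    In this picture, an entity on its own carries no meaning for the extraction; only the assignment of a label turns it into an annotation.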

  5. Division into training and test data

    After all documents have been labeled, you can now split them into training and test data. 

    The training data set contains manually labeled documents, on the basis of which the AI learns how to label documents itself. The test data set also contains manually labeled documents. Here, the AI attempts to label them using what it has learned from the training data set. Afterwards, the annotations created by the AI are compared with those created by humans and statistically evaluated.

    In the document view, you can now check the box to the left of each file name to select the documents. In our example, we select 5 documents and choose the action "Add to training data set" in the action tab at the bottom and click on "Go". Then we select the remaining 4 documents and repeat the step but with the action "Add to test data set". 
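
    The split itself is done by hand in the document view, but the idea can be sketched in a few lines of Python. The file names and the function `split_documents` below are hypothetical, and the random shuffle is an assumption of this sketch; in the tutorial you simply tick the documents you want in each set.

```python
import random
from typing import List, Tuple


def split_documents(names: List[str], n_train: int, seed: int = 0) -> Tuple[List[str], List[str]]:
    """Split document names into a training list and a test list."""
    if not 0 < n_train < len(names):
        raise ValueError("n_train must leave at least one document in each set")
    shuffled = list(names)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    return sorted(shuffled[:n_train]), sorted(shuffled[n_train:])


# 9 labeled receipts, as in the example: 5 for training, 4 for testing.
documents = [f"receipt_{i}.pdf" for i in range(1, 10)]
training, test = split_documents(documents, n_train=5)
print(len(training), "training documents,", len(test), "test documents")
```

    The important property is that no document ends up in both sets; otherwise the evaluation on the test data would be overly optimistic.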

  6. Start retraining and evaluate results

    Click HOME > Projects. Find your project and mark it with a check mark. In the Action tab, select "Retrain AI Model" and click "Go". A banner appears that says "AI model re-training has been started. This may take up to 24 hours." In a small project like this example, training should finish within a few minutes.

    To check whether the newly trained AI model is ready, click HOME > AI Models. There, the AI model is listed together with its quantitative evaluation based on the test data.
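
    The tutorial does not state which figures the quantitative evaluation contains. As a rough illustration of how AI annotations can be compared with human annotations on the test data, the following Python sketch computes precision, recall and F1, which are common choices for this kind of comparison. The metric selection, the function `evaluate` and all example values are assumptions made for this sketch, not Konfuzio's documented evaluation.

```python
from typing import Dict, Set, Tuple

# An annotation is identified here by (document, label, text); this is an assumption.
AnnotationKey = Tuple[str, str, str]


def evaluate(human: Set[AnnotationKey], ai: Set[AnnotationKey]) -> Dict[str, float]:
    """Compare AI annotations with human annotations on the test documents."""
    true_pos = len(human & ai)    # found by the AI and confirmed by humans
    false_pos = len(ai - human)   # found by the AI but not labeled by humans
    false_neg = len(human - ai)   # labeled by humans but missed by the AI
    precision = true_pos / (true_pos + false_pos) if ai else 0.0
    recall = true_pos / (true_pos + false_neg) if human else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Invented example data for the 4 test receipts.
human = {
    ("receipt_6.pdf", "Bruttobetrag", "48,60"),
    ("receipt_7.pdf", "Bruttobetrag", "12,99"),
    ("receipt_8.pdf", "Bruttobetrag", "7,50"),
    ("receipt_9.pdf", "Bruttobetrag", "103,20"),
}
ai = {
    ("receipt_6.pdf", "Bruttobetrag", "48,60"),
    ("receipt_7.pdf", "Bruttobetrag", "12,99"),
    ("receipt_8.pdf", "Bruttobetrag", "1,50"),  # wrong value picked by the AI
}
scores = evaluate(human, ai)
print({name: round(value, 3) for name, value in scores.items()})
```

    With so few test documents, a single wrong annotation moves these figures considerably, so they should be read as a rough indication only.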

  7. Give feedback

    Upload a new document as described in step 3. Once it has gone through the OCR process, click on "Smartview". Here you can revise the annotations produced by the AI. Confirm correct suggestions by clicking the green tick and reject incorrect ones by deleting them with the red "X". Also add any missing annotations.

    You can now add this document to the training data set as in step 5 to enlarge it and thus improve the AI model, or you can export the information. If you get no results or very poor results, check whether you did everything correctly in steps 4 to 6, or increase the number of training documents.

  8. Export your results

    Select the documents whose data you want to download by ticking them. If you select multiple documents, they will be combined into one CSV file. Select the action "Get human revised data as a CSV file" in the action tab and click on "Go". The download of the CSV file should start automatically. CSV files can be opened with spreadsheet programs such as Microsoft Excel or Google Sheets.
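
    Besides spreadsheet programs, the exported CSV file can also be processed with a script. The following Python sketch sums up the gross amounts of several receipts. The column layout of the export is not described in this tutorial, so the sample content, the column names and the comma delimiter below are assumptions; check them against your actual file. The helper `parse_german_amount` is likewise hypothetical and handles amounts written with a decimal comma, such as "48,60".

```python
import csv
import io
from decimal import Decimal

# Hypothetical export content; the real file may use other columns or delimiters.
SAMPLE_EXPORT = '''document,Bruttobetrag
receipt_1.pdf,"48,60"
receipt_2.pdf,"1.203,95"
'''


def parse_german_amount(text: str) -> Decimal:
    """Convert an amount such as "1.203,95" into Decimal("1203.95")."""
    # Remove thousands separators, then turn the decimal comma into a point.
    return Decimal(text.strip().replace(".", "").replace(",", "."))


def total_gross(csv_file, column: str = "Bruttobetrag") -> Decimal:
    """Sum one amount column over all rows, skipping empty cells."""
    reader = csv.DictReader(csv_file)
    amounts = (parse_german_amount(row[column]) for row in reader if row[column])
    return sum(amounts, Decimal("0"))


total = total_gross(io.StringIO(SAMPLE_EXPORT))
print(total)
```

    To use this with a downloaded file, replace `io.StringIO(SAMPLE_EXPORT)` with a file opened via `open(path, newline="", encoding="utf-8")`, where `path` points to your export.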


Any questions? We are constantly working to improve our instructions so that you can use Konfuzio as quickly and easily as possible. Please let us know if you have any unanswered questions so we can provide you with the best possible solution. Thank you!

Photo by Brandon Montrone from Pexels
