Classification of documents and document separation by AI

One of the most overlooked and difficult problems in document automation, and a daily annoyance in operations, is automatically separating a batch of scanned pages into individual, meaningful documents and assigning each one to a document class. In traditional scanning processes, this is often done by manually preparing the paper and placing a barcode as a document separator on each first page. This is labor-intensive and error-prone. In addition, as processes become more digital, we typically no longer have access to the paper, even in paper-based processes. The goal is therefore to scan the entire stack and let an intelligent algorithm separate it.

Fortunately, this is already available today, for example as an integrated feature of the Konfuzio technology stack. That does not mean it is easy: managing multiple interdependent classification and separation steps in a stable and reliable way takes considerable experience and infrastructure. This is exactly what Konfuzio provides out of the box.

How does document structuring work in principle? Exactly the same way (our credo!) a human would do it. Go through the stack page by page and determine what type of page it is, whether it belongs to the previous page, or whether a new topic or form is starting. Then check the page numbers, if present, as a safeguard. If in doubt, go back one or more pages to check, then decide whether to separate.
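The page-by-page pass described above can be sketched in a few lines of plain Python. This is only an illustration, not Konfuzio's implementation: the ScannedPage class, its fields, and the two rules (a change of page type, or a printed page number that restarts at 1) are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ScannedPage:
    """Hypothetical page record used only for this sketch."""
    page_type: str                        # e.g. "invoice" or "contract"
    printed_number: Optional[int] = None  # page number found on the page, if any


def split_points(pages: List[ScannedPage]) -> List[int]:
    """Return the indices of the pages that start a new document."""
    starts = []
    for index, page in enumerate(pages):
        previous = pages[index - 1] if index > 0 else None
        if previous is None or page.page_type != previous.page_type:
            # the first page of the stack, or a new topic/form begins
            starts.append(index)
        elif page.printed_number == 1:
            # same page type, but the printed numbering restarts
            starts.append(index)
    return starts


stack = [
    ScannedPage("invoice", 1),
    ScannedPage("invoice", 2),
    ScannedPage("invoice", 1),
    ScannedPage("contract"),
    ScannedPage("contract"),
]
print(split_points(stack))
```

A trained model replaces these hand-written rules with learned ones, but the control flow stays the same: walk the stack once and decide for each page whether it opens a new document.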

Document separation

Split file into documents

In AI classification, this is integrated into a sequence of algorithms. The system is trained on a sample that is already correctly separated. For each page, Konfuzio learns whether it is a first, middle, last, or single page. The user does not need to specify this explicitly: the Konfuzio AI derives it from the samples and hides this complexity from the user. The training interface only requires that the individual documents be dropped into the training set. It is not necessary to define an exact number of pages (or a range) for each document type, because Konfuzio automatically takes into account that page counts may vary. However, if you know the page count, you can limit the allowed number of pages, for example for forms that always have exactly one page.
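To illustrate why no explicit labeling is needed, the following sketch derives the four page roles from nothing but the page counts of correctly separated training documents. The function name page_roles and the label strings are assumptions for this example and do not come from the Konfuzio SDK.

```python
from typing import List


def page_roles(page_counts: List[int]) -> List[str]:
    """Label every page of a correctly separated sample by its position.

    `page_counts` holds the number of pages of each training document,
    in the order in which the documents appear in the stack.
    """
    roles: List[str] = []
    for count in page_counts:
        if count < 1:
            raise ValueError("every document needs at least one page")
        if count == 1:
            roles.append("single")
        else:
            # first page, then (count - 2) middle pages, then the last page
            roles.extend(["first"] + ["middle"] * (count - 2) + ["last"])
    return roles


# Three training documents with 1, 3 and 2 pages
print(page_roles([1, 3, 2]))
```

Because the labels follow directly from where each document starts and ends, dropping separated documents into the training set is enough to produce the training targets.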

Konfuzio will then learn the structure and apply it at runtime to the entire stack of unseparated individual pages on

Document splitting - A detailed guide to training an AI

Document automation can be tedious and error-prone, especially when it comes to automatically separating a batch of documents into individual documents and assigning them to a document class. Traditionally, scanning processes achieve this by manually preparing the paper and placing a barcode as a separator on each first page. However, this is time-consuming and error-prone. In addition, the people processing the documents usually no longer have access to the paper, even in paper-based processes. The goal is therefore to simply scan the entire stack and have it split by an intelligent algorithm.

Fortunately, this is possible today with Konfuzio technology. The Konfuzio SDK provides a pre-built class called SplittingAI, which works with an instance of a trained ContextAwareFileSplittingModel. Context-aware in this case means a rule-based approach that looks for common strings between the first pages of all documents in a category. To predict whether a page is a potential separator (i.e., whether it is a first page), the algorithm compares the content of the page with these common first-page strings. If at least one of these strings occurs on the page, the page is marked as a first page, which means that it is a breakpoint.
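The rule-based idea can be sketched without the SDK. The example below is a simplified illustration and not the ContextAwareFileSplittingModel itself: it treats each text line as one string, and it additionally drops strings that also occur on non-first pages. Both choices, as well as the function names, are assumptions of this sketch.

```python
from typing import List, Set


def fit_first_page_strings(documents: List[List[str]]) -> Set[str]:
    """Collect lines that occur on the first page of every document.

    Each document is a list of page texts; index 0 is its first page.
    Assumption of this sketch: lines that also occur on a non-first page
    are dropped, so that generic strings do not trigger a split.
    """
    first_pages = [set(doc[0].splitlines()) for doc in documents]
    other_lines = {
        line for doc in documents for page in doc[1:] for line in page.splitlines()
    }
    common = set.intersection(*first_pages) if first_pages else set()
    return {line for line in common - other_lines if line.strip()}


def is_first_page(page_text: str, first_page_strings: Set[str]) -> bool:
    """A page is a split point if it contains at least one known string."""
    return any(line in first_page_strings for line in page_text.splitlines())


training_documents = [
    ["Invoice\nNo. 1\nACME Ltd.", "Terms\nACME Ltd."],
    ["Invoice\nNo. 2\nACME Ltd.", "Terms\nACME Ltd."],
]
strings = fit_first_page_strings(training_documents)
print(strings)
print(is_first_page("Invoice\nNo. 3\nACME Ltd.", strings))
print(is_first_page("Terms\nACME Ltd.", strings))
```

In this toy data set only "Invoice" survives, because "ACME Ltd." also appears on the second pages and the invoice numbers differ between documents.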

In this deep dive, we will explain how to use the Konfuzio SDK to train a model capable of automatically splitting documents into multiple documents. We will use the SplittingAI class and an instance of a trained ContextAwareFileSplittingModel to automatically split a file into multiple documents.

File splitting

Extract single documents from a file

Let's see how we can use the Konfuzio SDK to automatically split a file into multiple documents. We use the prebuilt SplittingAI class and an instance of a trained ContextAwareFileSplittingModel. The latter uses a context-sensitive logic. Context-sensitive in this case means a rule-based approach that looks for common strings between the first pages of all documents in a category. In predicting whether a page is a potential separator (i.e., whether it is a first page), the algorithm compares the content of the page with these common strings of first pages. If at least one of these strings occurs on the page, the page is marked as a first page, which means that it is a breakpoint.

This tutorial can also be used with the MultimodalFileSplittingModel. The only difference in the initialization is that no tokenizers have to be specified explicitly.

How to train a document splitting AI with the Konfuzio SDK

Document automation is an important part of today's digital workplace. In this context, automatically separating batches of documents into individual, meaningful documents and assigning them to a document class is an overlooked and difficult task that can be very troublesome in day-to-day document processing. In the traditional scanning process, this is often accomplished by manually preparing the paper and placing a barcode as a document separator on each first page. However, this is time-consuming and error-prone. With increasing digitization, the people processing the documents often no longer have access to the paper. The goal, then, is to simply scan the entire stack and separate it automatically using an intelligent algorithm.

Konfuzio provides a solution to this problem through the Python Konfuzio SDK. The Konfuzio SDK provides a pre-built class called SplittingAI that uses an instance of a trained ContextAwareFileSplittingModel. The model uses context-aware logic that looks for common strings between the first pages of all documents in a category to decide whether or not a page is a potential split point. If at least one such string is present, the page is marked as a first page (indicating that it is a split point).

In this tutorial, you will learn how to train and use a ContextAwareFileSplittingModel with the Konfuzio SDK to automatically split documents.

Step 1: Set up the Konfuzio project and create the test document

First, you need to set up a Konfuzio project and select a test document. To do this, import the required classes from the Konfuzio SDK and initialize a Project object with your project ID:

from konfuzio_sdk.data import Page, Category, Project
from konfuzio_sdk.trainer.file_splitting import SplittingAI
from konfuzio_sdk.trainer.file_splitting import ContextAwareFileSplittingModel
from konfuzio_sdk.trainer.information_extraction import load_model
from konfuzio_sdk.tokenizer.regex import ConnectedTextTokenizer
# Initialize a project and retrieve a test document of your choice
project = Project(id_=YOUR_PROJECT_ID)
test_document = project.get_document_by_id(YOUR_DOCUMENT_ID)

Step 2: Initialize and customize the ContextAwareFileSplittingModel

Next, you need to initialize and customize the ContextAwareFileSplittingModel. To do this, you must specify the categories and the tokenizer for the model. The tokenizer splits the text of each page into the strings that the model works with.

file_splitting_model = ContextAwareFileSplittingModel(categories=project.categories, tokenizer=ConnectedTextTokenizer())

In this example, we use the ConnectedTextTokenizer, a regex-based tokenizer that groups text connected by single whitespace characters into strings. Other tokenizers can also be used. Because the ConnectedTextTokenizer keeps connected text together, it can be helpful in identifying separable parts of the document.

The complete example starts from the same setup as Step 1: we initialize a project and select a test document. We assume that the Konfuzio SDK is already installed and that we have access to a running Konfuzio system.

from konfuzio_sdk.data import Page, Category, Project
from konfuzio_sdk.trainer.file_splitting import ContextAwareFileSplittingModel
from konfuzio_sdk.tokenizer.regex import ConnectedTextTokenizer
# initialize a Project and fetch a test Document of your choice
project = Project(id_=YOUR_PROJECT_ID)
test_document = project.get_document_by_id(YOUR_DOCUMENT_ID)

We then initialize an instance of the ContextAwareFileSplittingModel with the categories of our project and fit it. We also need to specify the tokenizer; in this case we use the ConnectedTextTokenizer.

# initialize a Context Aware File Splitting Model and fit it
file_splitting_model = ContextAwareFileSplittingModel(categories=project.categories, tokenizer=ConnectedTextTokenizer())
# fit the model
file_splitting_model.fit(allow_empty_categories=True)

Now we can use the trained model to split an input document into multiple documents. We assume that we have a document that consists of several documents and therefore needs to be separated.

# run the prediction with the Context Aware File Splitting Model
new_documents = []
current_document = None
for page in test_document.pages():
    pred = file_splitting_model.predict(page)
    # start a new Document at every predicted first Page; also start one if
    # the very first Page of the file was not predicted as a first Page
    if pred.is_first_page or current_document is None:
        if current_document is not None:
            new_documents.append(current_document)
        current_document = project.create_document(category_id=YOUR_CATEGORY_ID)
    # add the Page to the current Document
    current_document.add_page(page)
# add the last Document to the list (skipped if the file had no Pages)
if current_document is not None:
    new_documents.append(current_document)

The variable new_documents now contains a list of separate documents extracted from the original input document.
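The grouping logic of the loop above can be tried out without a Konfuzio project. The following sketch uses plain Python lists in place of SDK objects; the function name group_pages and the boolean flags that stand in for the model's predictions are assumptions for this illustration.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def group_pages(pages: Sequence[T], first_flags: Sequence[bool]) -> List[List[T]]:
    """Group consecutive pages into documents, opening a new group at each first page."""
    if len(pages) != len(first_flags):
        raise ValueError("pages and first_flags must have the same length")
    documents: List[List[T]] = []
    for page, is_first in zip(pages, first_flags):
        # open a new group at a predicted first page, and always for the
        # very first page of the stack so that no page is left without a group
        if is_first or not documents:
            documents.append([])
        documents[-1].append(page)
    return documents


print(group_pages(["p1", "p2", "p3", "p4", "p5"], [True, False, True, True, False]))
```

Separating the grouping from the prediction makes it easy to test edge cases, such as a stack whose first page is not recognized as a first page.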

Validation through Human-in-the-Loop UI

Document processing is an important component in various industries, such as banking, insurance, and legal services. Scanning and filing multiple documents can be time-consuming, and there is often a need for an automated solution that can divide and organize these documents. However, automated solutions are not always perfect and can miss important details or misidentify a document type. This is where a human-in-the-loop interface, the Document Validation UI (DV UI), comes into play.

The DV UI is a tool that allows people to interact with an automated document processing system to validate and correct its output. It provides an interface for users to review and verify automated document splitting and file organization and to make corrections as necessary. Users can also train the AI system to recognize new document types to ensure the accuracy of future splits and organization.

Using the DV UI to fine-tune the AI that automatically splits scans and files can significantly improve the accuracy and efficiency of document processing. The AI system can learn from the user's corrections and update its algorithms to better recognize document types and split them accordingly. The more corrections the AI system receives from the user, the more accurate it becomes, so less manual intervention is required over time.

In addition, by using the DV UI for document processing, companies can save time and money, as the system can process more documents in less time with fewer errors. This means that companies can focus their resources on other important tasks, such as analyzing the data extracted from the processed documents, instead of manually organizing and splitting them.

In summary, the DV UI is an indispensable tool for fine-tuning an AI that automatically splits scans and files containing multiple documents. Its use can significantly improve the accuracy and efficiency of document processing, reduce the need for manual intervention, and allow organizations to focus on other important tasks.

Conclusion: document separation and document classification

In this example, we have shown how to use a trained model from the Konfuzio SDK to split an input file into multiple separate documents. Automatic document splitting is particularly useful for processing batches of paper documents or PDF files that contain multiple separate documents. With the Konfuzio SDK, developers can quickly and easily train their own models and apply them to new documents.

Register or visit our technical documentation to learn more

Elizaveta Ezhergina