Create a data parsing tool with Python, the SROIE dataset and machine learning.

If you are a Python developer and want to create a data parsing tool, this tutorial is for you. We'll show you how to create an efficient document parsing tool using Python and the SROIE dataset, and introduce you to the capabilities of the Konfuzio SDK. Let's get started!

A short introduction to string parsing in Python can be found here.
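
As a quick refresher, this is what basic string parsing looks like in Python. The receipt line and the regular expression below are invented purely for illustration:

import re

# Extract the amount from a single receipt line
line = "TOTAL: RM 12.50"
match = re.search(r"TOTAL:\s*(?:RM\s*)?(\d+\.\d{2})", line)
if match:
    total = float(match.group(1))
    print("Parsed total:", total)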

Step 1: Setting up your environment

Before we begin, make sure that Python and all the required libraries are installed. To do this, run the following command in your command line or terminal:

# Install Python packages
pip install opencv-python tensorflow numpy scikit-learn
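
To confirm that the installation worked, you can import each package and print its version:

# Verify that the required packages are importable
import cv2
import numpy as np
import sklearn
import tensorflow as tf

print("OpenCV:", cv2.__version__)
print("NumPy:", np.__version__)
print("scikit-learn:", sklearn.__version__)
print("TensorFlow:", tf.__version__)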

Step 2: Load the SROIE dataset

The SROIE dataset consists of receipt images and the associated annotation files. This is how you can load them:

import os
import cv2

# Directories for SROIE images and annotations
image_folder = "path/to/SROIE/image/folder"
annotation_folder = "path/to/SROIE/annotation/folder"

def load_sroie_dataset(image_folder, annotation_folder):
    """Load images and corresponding text from the SROIE dataset."""
    images, extracted_texts = [], []
    for filename in os.listdir(annotation_folder):
        if not filename.endswith(".txt"):
            continue
        # Read the annotation line that holds the extracted text
        with open(os.path.join(annotation_folder, filename), "r") as file:
            extracted_text = file.readlines()[1].strip()
        # Load the receipt image that belongs to this annotation file
        image_path = os.path.join(image_folder, filename.replace(".txt", ".jpg"))
        image = cv2.imread(image_path)
        if image is None:
            continue  # Skip annotations without a matching image
        images.append(image)
        extracted_texts.append(extracted_text)
    return images, extracted_texts

images, extracted_texts = load_sroie_dataset(image_folder, annotation_folder)
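
To check that the loading worked, you can print a few basic facts about the result; the snippet reuses the variables defined above:

# Quick sanity check of the loaded data
print("Number of samples:", len(images))
print("First image shape:", images[0].shape)
print("First extracted text:", extracted_texts[0])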

Step 3: Prepare the dataset

Now we preprocess the images from the SROIE dataset so they can be used for training:

output_folder = "path/to/output/folder"

def preprocess_images(image_folder, annotation_folder, output_folder):
    """Preprocess images and save them to the output directory."""
    os.makedirs(output_folder, exist_ok=True)
    for filename in os.listdir(annotation_folder):
        if not filename.endswith(".txt"):
            continue
        image_path = os.path.join(image_folder, filename.replace(".txt", ".jpg"))
        image = cv2.imread(image_path)
        if image is None:
            continue  # Skip annotations without a matching image
        # Here you can add image preprocessing steps
        output_path = os.path.join(output_folder, filename.replace(".txt", ".jpg"))
        cv2.imwrite(output_path, image)

preprocess_images(image_folder, annotation_folder, output_folder)
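
The comment "Here you can add image preprocessing steps" is where your own cleanup logic goes. As a minimal sketch, a common choice for receipts is grayscale conversion followed by adaptive thresholding; the block size and offset below are example values, not tuned for SROIE:

def binarize_receipt(image):
    """Example preprocessing step: grayscale conversion plus adaptive thresholding."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    binary = cv2.adaptiveThreshold(
        gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
        cv2.THRESH_BINARY, 31, 11  # Example block size and offset
    )
    return binary

You could call binarize_receipt(image) inside preprocess_images right before the file is saved.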

Another exciting dataset is FUNSD+; it is worth a look as well.

Step 4: Fine-tuning and evaluation with the Donut model

For accurate document parsing, we now train a model on our preprocessed data. The example below uses a simple CNN for text recognition as a lightweight stand-in for the pre-trained Donut model:

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Target input size for the CNN (example values, adjust as needed)
height, width, channels = 256, 256, 3

# Resize images to a common shape and scale pixel values to [0, 1]
X = np.array([cv2.resize(image, (width, height)) for image in images], dtype="float32") / 255.0

# Encode the extracted texts as integer class labels
# Note: for this simple demo, each distinct text string becomes its own class
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(extracted_texts)
num_classes = len(label_encoder.classes_)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

def create_donut_model(input_shape, num_classes):
    """Create the Donut CNN model for text recognition."""
    model = models.Sequential([
        layers.Conv2D(32, (3, 3), activation='relu', input_shape=input_shape),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation='relu'),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation='relu'),
        layers.Dense(num_classes, activation='softmax')
    ])
    return model

model = create_donut_model((height, width, channels), num_classes)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.1)
test_loss, test_accuracy = model.evaluate(X_test, y_test)
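
After training, you can run the model on a held-out receipt and map the predicted class index back to a text label. This short sketch reuses X_test, label_encoder and test_accuracy from above:

# Predict the class of the first test image and decode it back to text
probabilities = model.predict(X_test[:1])
predicted_index = int(np.argmax(probabilities, axis=1)[0])
predicted_text = label_encoder.inverse_transform([predicted_index])[0]
print(f"Test accuracy: {test_accuracy:.2%}")
print("Predicted text:", predicted_text)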

Summary

Creating a data parsing tool with Python and the SROIE dataset can be straightforward with the right approach. For advanced features and production-ready document AI functionality, consider integrating the Konfuzio SDK into your projects. Take a look at the Konfuzio documentation for comprehensive insights.

Have fun programming!
