From PDF to text - step by step guide

The PDF file format has long been the standard for distributing documents digitally. Every device, whether PC, smartphone or tablet, can display PDFs. Viewers have been free of charge from the very beginning, which has contributed massively to the format's widespread use.

The format is also continuously being extended, for example to support fillable forms and digital signatures. The problems start, however, when you want to process the information inside a PDF document further. Manual copy and paste rarely works well, because text is often stored as image information, and it is also time-consuming. Vendors have approached this problem in various ways so that PDF to text conversion can be done in a structured, automated and intelligent way. We take a closer look at these approaches here.

PDF to Text Conversion Background

PDF-to-text conversion generally refers to the process by which a converter automatically extracts the text content of a PDF document and converts it into an editable text format. This is useful when you want to process the text further or use it in another application. Many tools and services automate this process.

Some of these tools can also preserve layout and formatting elements such as tables or paragraphs, so that the text is rendered as faithfully as possible in the new application. PDF-to-text conversion also helps when you want to extract text from a scanned PDF document in which the text is only present as image information. For this purpose, OCR (Optical Character Recognition) technologies are used, which interpret the image information and generate the text.

Depending on the user's skills and the goal of the conversion, Konfuzio offers different ways to extract the text (e.g. as a TXT or Word file). At a glance:

  1. The manual process: For occasional conversions, PDF files can be uploaded manually to conversion platforms, and then the extracted text can be returned as a download for further individual processing.
  2. Via application programming interface (API): For larger document volumes, flexible APIs can be used to automate PDF-to-text conversion efficiently with a little programming effort.
  3. Via program libraries: The conversion functionality is available through program libraries directly in the source code of your own application. Libraries for the programming language Python are very popular here.

Variant 1 - Instructions for the manual process

  1. Open a public PDF to text converter. These often allow free conversions.
  2. Follow the instructions of the platform to upload your PDF file from your local computer to the platform.
  3. After the upload, the platform generates the plain text, which is usually still unstructured. The text is either shown in a text field on the web page for copying, or the application generates a text file for download.
  4. Copy the text from the generated file or web page directly into the TextView of the Konfuzio platform.
  5. There, apply labels (annotations) directly to the still unstructured data to train the artificial intelligence on the specific form of the data.

Manual annotation via TextView
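
To make the idea of a label on unstructured text more concrete, the following is a minimal sketch of how an annotation could be represented as a labeled character span. The Annotation class, the annotate helper and the label names are hypothetical illustrations and not Konfuzio's actual data model.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    """A labeled character span in the extracted text (hypothetical structure)."""
    label: str
    start: int  # offset of the first character of the span
    end: int    # offset one past the last character of the span


def annotate(text, label, value):
    """Create an annotation for the first occurrence of `value` in `text`."""
    start = text.find(value)
    if start == -1:
        raise ValueError(f"{value!r} not found in text")
    return Annotation(label, start, start + len(value))


# Unstructured text as it might come out of a converter.
text = "Invoice date: 2023-01-15\nAccount number: 12345678"
annotations = [
    annotate(text, "invoice_date", "2023-01-15"),
    annotate(text, "account_number", "12345678"),
]
for annotation in annotations:
    print(annotation.label, "->", text[annotation.start:annotation.end])
```

Storing offsets rather than copied strings keeps each label tied to its exact position in the text, which is the kind of information a model needs as training material.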

Variant 2 - Instructions for PDF to text via API

This variant requires programming knowledge. You should also have a clear idea of which categories of documents to expect, so that further processing of the text after conversion is as efficient as possible.

  1. The document is uploaded to the server via an API command. The category and the associated project are specified in this call. (Try it here)
  2. Once the upload is successful, the document appears in the Konfuzio administrator interface, already assigned to the correct category.
  3. At this point Konfuzio has already read the text information from the PDF, and it is available to the program. (Try it here)
  4. The file can now be opened via the administrator interface in order to assign annotations to the text information and categorize it. With a trained AI, this step is also automated.

PDF to text using Konfuzio API
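
The upload step described above can be sketched with the Python standard library alone. This is only an illustration under stated assumptions: the URL is a placeholder, and the form field names (project, category, data_file) as well as the token-based Authorization header are assumptions that need to be checked against the Konfuzio API documentation. The sketch builds the request but does not send it.

```python
import uuid
from urllib.request import Request

# Placeholder endpoint (assumption): replace with the URL from the API documentation.
API_URL = "https://konfuzio.example/api/documents/"


def build_upload_request(pdf_bytes, filename, project_id, category_id, token):
    """Build (but do not send) a multipart/form-data POST request for one PDF."""
    boundary = uuid.uuid4().hex
    parts = []
    # Plain form fields: the project and category the document belongs to.
    # The field names are assumptions for illustration.
    for name, value in (("project", project_id), ("category", category_id)):
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"'
            f"\r\n\r\n{value}\r\n".encode()
        )
    # The file part carrying the raw PDF bytes.
    file_header = (
        f'--{boundary}\r\nContent-Disposition: form-data; name="data_file"; '
        f'filename="{filename}"\r\nContent-Type: application/pdf\r\n\r\n'
    )
    parts.append(file_header.encode() + pdf_bytes + b"\r\n")
    # Closing boundary that terminates the multipart body.
    parts.append(f"--{boundary}--\r\n".encode())
    headers = {
        "Authorization": f"Token {token}",  # assumed authentication scheme
        "Content-Type": f"multipart/form-data; boundary={boundary}",
    }
    return Request(API_URL, data=b"".join(parts), headers=headers, method="POST")


request = build_upload_request(
    b"%PDF-1.4 sample bytes", "invoice.pdf",
    project_id=123, category_id=456, token="YOUR_TOKEN",
)
print(request.get_method(), request.full_url)
```

Once the placeholder URL and field names are replaced with the documented ones, the request could be sent with urllib.request.urlopen(request), and the response would then describe the uploaded document.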

A comprehensive introduction to working with the API is shown in a clear video here.

Variant 3 - Instructions for using the Python SDK

For developers who are already familiar with Python, the Konfuzio Python SDK is a particularly easy way to integrate the conversion and processing functionality as a code library. The API calls are already wrapped in a user-friendly way. The SDK is very powerful, but we focus on PDF to text conversion here.

  1. If this is the first use of the SDK, the developer must first download and install it in their own development environment. (More here)
  2. Import the SDK program library into your own source code:

    from konfuzio_sdk.data import Project

  3. The PDF document to be loaded must be accessible to the program. It can be online or on the local hard disk. The file is opened and its binary data is saved for processing on the target environment.
  4. In the next step, the Konfuzio upload command upload_file_konfuzio_api is configured with the necessary parameters (filename, ID of the project, status) and executed.
  5. Once the upload is complete, the platform has already converted the PDF to text via OCR. The text can easily be read out via the project object.

The code looks like this in the overview:

from pathlib import Path

import requests
from konfuzio_sdk.api import upload_file_konfuzio_api
from konfuzio_sdk.data import Project

# Load the Konfuzio project that will receive the document.
project = Project(id_=11957)

# Download the sample PDF and save it to the local hard disk.
filename = Path('energiezertifikat.pdf')
url = 'https://www.energieausweis-online-erstellen.de/app/uploads/2016/09/muster-bedarfsausweis.pdf'
response = requests.get(url)
print(response.status_code)
if response.status_code == 200:
    with open(filename, 'wb') as pdf_object:
        pdf_object.write(response.content)
    print(f'{filename} was successfully saved!')

# Upload the file to the project; the platform converts it to text.
request = upload_file_konfuzio_api(filename, project_id=project.id_, dataset_status=2)

# Refresh the local project data, then read the document statuses and the text.
project.get(update=True)
print([document.status for document in project.documents])
print(project.documents[-1].text)

The processing of images is very similar. Many more examples and instructions on how to use the Konfuzio Python SDK can be found here.

Text extracted and now what?

As the examples show, converting the documents is not enough. The effort of conversion only pays off if the text can also be used further. With Konfuzio, types of data can be labeled manually (for example, the date of the invoice or the account number). But this is only the very first step, because an artificial intelligence works in the background and analyzes all new documents. The manual labels serve as training material for the AI. It quickly takes over and becomes increasingly capable of identifying and classifying data within the texts itself. It can also learn different document types. This means that even large volumes of documents are analyzed quickly, the data is structured and the information is prepared for further use. In this way, data from PDF files can be integrated into downstream business processes with significantly reduced manual effort and processed automatically.
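
What "structured" means for extracted text can be illustrated with a small sketch. It uses fixed regular expressions as a rule-based stand-in, not the trained AI described above, and the field names, patterns and sample text are assumptions chosen for illustration.

```python
import re

# Rule-based stand-in: a trained model would learn such fields from labeled examples.
PATTERNS = {
    "invoice_date": re.compile(r"Invoice date:\s*(\d{4}-\d{2}-\d{2})"),
    "account_number": re.compile(r"Account number:\s*(\d+)"),
}


def extract_fields(text):
    """Return the first captured value for each pattern, or None if it is absent."""
    result = {}
    for label, pattern in PATTERNS.items():
        match = pattern.search(text)
        result[label] = match.group(1) if match else None
    return result


# Plain text as it might be returned by a PDF to text conversion.
sample = "Invoice date: 2023-01-15\nAccount number: 12345678\nTotal: 120.50 EUR"
fields = extract_fields(sample)
print(fields)
```

Fixed patterns like these break as soon as the wording or layout of a document changes, which is why learning from manual labels scales better across different document types.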

Daniel Weissmann