PDF to JSON conversion for intelligent text processing

Daniel Weissmann

Many technologies have become a natural part of everyday life. Videos are enhanced automatically. Online stores know what we want to buy before we do. A picture (JPG or PNG) or a PDF file of a Greek menu is enough for an automatic translation. We assume that all of this has to do with artificial intelligence and a lot of computing power. We only question these technologies when something doesn't work the way we think it should. But computers are only as intelligent as the algorithms and data we give them.

And this is where JSON comes into play, a structure that makes processing possible in the first place. If the goal is only to transfer PDFs in a practical format, serializing the binary PDF file into JSON is sufficient; reference 1 at the end of this article gives an illustrative overview. But what if data is to be read out of the PDFs in a structured way so that it can be processed further? This article explains that case in more detail.
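For the simple transfer case, a minimal sketch in Python could look like the following. It assumes Base64 as the text encoding for the binary content; the field names `filename` and `content_base64` and both helper functions are illustrative choices, not part of any standard.

```python
import base64
import json


def pdf_bytes_to_json(filename: str, pdf_bytes: bytes) -> str:
    """Wrap raw PDF bytes in a JSON document (Base64 keeps the payload text-safe)."""
    payload = {
        "filename": filename,  # hypothetical field name
        "content_base64": base64.b64encode(pdf_bytes).decode("ascii"),
    }
    return json.dumps(payload)


def json_to_pdf_bytes(document: str) -> bytes:
    """Recover the original bytes from the JSON wrapper."""
    payload = json.loads(document)
    return base64.b64decode(payload["content_base64"])


# A tiny stand-in for real file content; in practice, read the file in binary mode.
sample = b"%PDF-1.7\n%\xe2\xe3\xcf\xd3\n"
wrapped = pdf_bytes_to_json("menu.pdf", sample)
restored = json_to_pdf_bytes(wrapped)
```

This only transports the file: the receiver gets the same bytes back, but nothing inside the PDF has been interpreted. Base64 also enlarges the payload by roughly one third.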

What is JSON exactly

JSON stands for JavaScript Object Notation. It provides a clear format to store structured data. Structured data is for example the personal data on an application form. Or it is the list of food and drinks on the menu. Originally, developers used JSON for data transfer in IT systems. An example of this is data that users enter into a mobile application. To transfer it to the server, the program translates the data into JSON format. The advantage here is that JSON files are also human readable. Before the use of JSON, XML files were used for this purpose. However, these are more difficult to read and require significantly more storage space and thus transmission time.

Let's look at an example. Let's assume a PDF file that contains the data in this image.

Figure: PDF sample form for further processing into JSON format

The PDF contains personal data in different categories. The data and the categories of the file can be put into the following structure when converting to JSON:

  {
    "attorney": {
      "name": "john doe & jane doe",
      "first name": null,
      "address": {
        "street number": "1234",
        "Street Name": "ABC Street",
        "city": "San Francisco",
        "state": "CA",
        "ZIP Code": "94102"
      },
      "Telephone No": "415-123-4567",
      "e-mail": "[email protected]",
      "Fax No": null
    }
  }

The structure of the JSON document can be chosen freely according to the requirements. All fields can be mapped. Even complex tables can be represented well by nesting. Empty values (null) are possible, and values can be typed as text or numbers. The complexity in the conversion arises when the data no longer comes from an online form, but from a PDF.
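To make nesting, empty values and typed values concrete, here is a small sketch using Python's standard `json` module. The record is a shortened variant of the form data above; the `line items` table is a made-up addition that only illustrates how a table can be nested.

```python
import json

record_text = """
{
  "attorney": {
    "name": "john doe & jane doe",
    "first name": null,
    "address": {"city": "San Francisco", "state": "CA", "ZIP Code": "94102"},
    "Fax No": null
  },
  "line items": [
    {"description": "Consultation", "hours": 2, "rate": 150.0},
    {"description": "Filing fee", "hours": 0, "rate": 35.5}
  ]
}
"""

record = json.loads(record_text)

# Nested objects become nested dictionaries.
city = record["attorney"]["address"]["city"]

# JSON null becomes Python None, so empty fields stay distinguishable from "".
fax_missing = record["attorney"]["Fax No"] is None

# A table is a list of objects; numbers keep their numeric type.
total = sum(item["hours"] * item["rate"] for item in record["line items"])
```

Note that the ZIP code is deliberately kept as text: it is an identifier, not a quantity, and a numeric type would drop leading zeros.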

Why not XML or HTML for the conversion of the PDFs

XML also offers the possibility of mapping the complex structures. Great efforts were made to formalize these structures through XML schemas and definitions. However, this did not change the fact that the transmitted files contain a great deal of redundant information. A significant part of the transmission consists of the XML structure, not the read data.
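The overhead can be illustrated with a rough comparison in Python. This is a sketch under simple assumptions: a flat three-field record, compact output without indentation, and an XML layout with one element per field. Other XML layouts, for example attributes instead of elements, would give different numbers.

```python
import json
import xml.etree.ElementTree as ET

record = {"city": "San Francisco", "state": "CA", "zip": "94102"}

# JSON: compact separators so whitespace does not distort the comparison.
as_json = json.dumps(record, separators=(",", ":"))

# XML: one element per field under a root element named "address".
root = ET.Element("address")
for key, value in record.items():
    ET.SubElement(root, key).text = value
as_xml = ET.tostring(root, encoding="unicode")

json_size = len(as_json.encode("utf-8"))
xml_size = len(as_xml.encode("utf-8"))
```

For this record the XML version is noticeably larger, because every field name appears twice: once in the opening tag and once in the closing tag.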

HTML is closely related to XML: both are markup languages that describe content with nested tags. However, HTML was not designed for the exchange of data. Instead, it is used to define the structure and presentation of web pages. HTML structures the pages and allows the inclusion of graphical information or interactive functions via scripting languages and libraries such as JavaScript or Vue.js. Using HTML to exchange data is like eating soup with chopsticks: it's possible but tedious.

How PDF to JSON conversion works

PDF files can contain a clearly defined form. We see this for example with official forms, e.g. from the registry office or tax office. But beyond that, there are also many free-form formats. Letters typically contain an address, a date, or possibly banking information. However, the format, font, position or completeness can vary greatly.

For both kinds of PDF, text recognition (OCR) is used. It recognizes characters and converts them into text that the computer can read. The choice of the right software matters even at this stage, so that errors do not creep in at the very first step. Is the telephone number recognized correctly? Can handwritten entries be recognized? Are there smudges or hard-to-read printouts? Sophisticated algorithms, supported by artificial intelligence, help to overcome these hurdles.

Convert known PDF structures

For PDFs in known formats, the algorithm can recognize the context from the position of the recognized text. It thus identifies names, addresses, etc. and can convert this information directly into a JSON structure. For this purpose there is a template, a predefined format, into which the found data is entered. The program can also quickly identify missing data.
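The idea of a position-based template can be sketched in a few lines of Python. Everything here is illustrative: the `Word` class stands in for whatever the OCR step returns, and the region coordinates in `TEMPLATE` are invented for the example rather than taken from a real form.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float  # left edge of the word on the page
    y: float  # top edge of the word on the page


# Hypothetical template: field name -> (x_min, y_min, x_max, y_max) region.
TEMPLATE = {
    "name": (100, 50, 400, 70),
    "city": (100, 90, 250, 110),
    "state": (260, 90, 300, 110),
    "fax": (100, 130, 400, 150),
}


def extract_fields(words, template):
    """Assign each recognized word to the template region that contains it."""
    result = {}
    for field, (x0, y0, x1, y1) in template.items():
        inside = [w for w in words if x0 <= w.x <= x1 and y0 <= w.y <= y1]
        inside.sort(key=lambda w: (w.y, w.x))  # reading order: top-down, left-right
        result[field] = " ".join(w.text for w in inside) or None
    return result


def missing_fields(record):
    """Fields the template expects but the page did not provide."""
    return [field for field, value in record.items() if value is None]


words = [
    Word("San", 105, 95), Word("Francisco", 140, 95), Word("CA", 265, 95),
    Word("John", 105, 55), Word("Doe", 150, 55),
]
record = extract_fields(words, TEMPLATE)
```

Because every expected field is listed in the template, a field with no words in its region ends up as `None`, which is how missing data can be spotted immediately.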

Convert complex PDFs to JSON

PDF files with complex content or unknown structures require more intelligence. Categorization can no longer be done through manually defined rules. Instead, an artificial intelligence approach is necessary. The algorithm is trained for a class of documents, e.g. invoices, and thus learns to recognize the relevant information. It learns from many PDF examples what an address or bank information can look like. It learns that a date can have different formats (January 18, 2023, 18-01-2023, or 2023/01/18). This creates categories in the AI network, which can then be mapped to the JSON format. Fallback logic can additionally be implemented in case the AI is not yet able to identify certain categories with certainty.
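What the normalized result and a simple fallback might look like can be sketched as follows. This is a rule-based stand-in, not a trained model: it only handles the three date layouts mentioned above, assumes day-first order for `18-01-2023`, and relies on English month names (Python's default locale). The output keys `date`, `raw` and `needs_review` are hypothetical.

```python
from datetime import date, datetime
from typing import Optional

# The three layouts named in the text; day-first is assumed for "18-01-2023".
KNOWN_FORMATS = ("%B %d, %Y", "%d-%m-%Y", "%Y/%m/%d")


def normalize_date(raw: str) -> Optional[date]:
    """Try each known layout in turn; return None if none of them matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue
    return None


def date_field(raw: str) -> dict:
    """Build the JSON fragment for a date; unparsed values are flagged for review."""
    parsed = normalize_date(raw)
    if parsed is None:
        # Fallback: keep the raw text and let a person decide.
        return {"date": None, "raw": raw, "needs_review": True}
    return {"date": parsed.isoformat(), "raw": raw, "needs_review": False}


examples = ["January 18, 2023", "18-01-2023", "2023/01/18", "next Tuesday"]
fields = [date_field(text) for text in examples]
```

All three spellings end up as the same ISO date in the JSON output, while the unrecognized value is kept in raw form and flagged instead of being guessed.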

Use Python to convert PDF to JSON

For processing, there are several libraries, products and vendors that offer very good text recognition and AI support. A very popular programming language for using these capabilities, training the AI algorithms and converting the input files to JSON is Python. Python is a simple but powerful scripting language that has long been widely used, especially in artificial intelligence applications. Because so many programming libraries are available specifically for Python, integrating the conversion algorithms is straightforward.

At Konfuzio, a simple example of working with an extraction AI in Python (as an excerpt) looks like this:

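# Import paths assumed from the Konfuzio SDK package layout; verify against your installed version.
from konfuzio_sdk.data import Project
from konfuzio_sdk.trainer.information_extraction import RFExtractionAI

# OFFLINE_PROJECT is the path to a locally stored project; it is defined outside this excerpt.
project = Project(id_=None, project_folder=OFFLINE_PROJECT)
category = project.get_category_by_id(63)

# Configure the extraction pipeline for this category and evaluate it on the test documents.
pipeline = RFExtractionAI(use_separate_labels=True)
pipeline.category = category
pipeline.test_documents = category.test_documents()
evaluation = pipeline.evaluate_full()

# Save the pipeline to the project's model folder.
pipeline_path = pipeline.save(output_dir=project.model_folder)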

Step-by-step from PDF to JSON format

The process of PDF to JSON conversion can be summarized like this:

  1. The training of the AI

    First, the application must be configured and, most importantly, the system's artificial intelligence must be trained on the relevant document formats. This means that sample documents are loaded into the application and the system thus learns to recognize which information is relevant and how to find it.

  2. Uploading the PDF file

    After successful training, the interface that allows uploading is built. This can be a mobile application or a website, for example. Automatic processes can also be implemented, for example to check incoming e-mails for PDFs. The system then automatically uploads these to the processing server.

  3. Data extraction

    The application then automatically begins, with the help of AI, to recognize the learned data fields and convert them into text in the previously defined JSON format. Documents that were not readable, or for which the results carry a high level of uncertainty, are flagged and can be reviewed by a human. With each document, the AI continues to learn.

  4. Process the JSON file

    The complete JSON file is usually not processed further manually. Instead, other systems consume it and use the extracted data automatically for analyses, business processes, or database updates.
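The hand-over between steps 3 and 4 can be sketched as a small routing function. It assumes that the extraction step returns a confidence score between 0 and 1 for each field; the threshold of 0.8, the function name `route_document` and the output layout are hypothetical choices for this illustration.

```python
import json

REVIEW_THRESHOLD = 0.8  # assumed cut-off; tune per document class


def route_document(extractions, threshold=REVIEW_THRESHOLD):
    """Split extracted fields into accepted JSON data and fields needing review.

    `extractions` maps a field name to a (value, confidence) pair.
    """
    accepted, review = {}, []
    for field, (value, confidence) in extractions.items():
        if value is None or confidence < threshold:
            accepted[field] = None  # do not pass uncertain values downstream
            review.append(field)
        else:
            accepted[field] = value
    return {"data": accepted, "needs_review": sorted(review)}


extractions = {
    "name": ("john doe & jane doe", 0.97),
    "Telephone No": ("415-123-4567", 0.91),
    "Fax No": (None, 0.0),
    "city": ("San Franc1sco", 0.42),
}
result = route_document(extractions)
output = json.dumps(result, indent=2)
```

Downstream systems can then process `data` directly and send only the entries listed in `needs_review` to a person.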

More details and step-by-step instructions with code examples can be found here.

The guide also shows that the choice of provider depends on the quality of the functions but also on the usability. The functions must be well documented, even for beginners, so that the performance can actually be exploited.

The overall picture counts

Another advantage of converting PDF to JSON is high compatibility with other applications. Almost all providers of data applications support JSON. Thus, one is not dependent on a single provider in the processing chain. The further processing of the extracted data can be taken over by cloud solutions or local applications, e.g. to write the information into the right databases, convert tables into Excel files, automatically generate reply letters or perform bank transfers. In this way, the company can always choose the best products and solutions for each work step (best-of-breed) and can also replace individual components in the future without having to invest in a completely new infrastructure.
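As one example of such further processing, the following sketch writes an extracted record into a database using Python's built-in `sqlite3` module. The table name `attorneys`, its columns and the in-memory database are assumptions made for the illustration; a production system would target its own schema.

```python
import json
import sqlite3

document = json.loads("""
{"attorney": {"name": "john doe & jane doe",
              "address": {"city": "San Francisco", "state": "CA", "ZIP Code": "94102"},
              "Telephone No": "415-123-4567"}}
""")

attorney = document["attorney"]
address = attorney["address"]

connection = sqlite3.connect(":memory:")  # in-memory database for the sketch
connection.execute(
    "CREATE TABLE attorneys (name TEXT, city TEXT, state TEXT, zip TEXT, phone TEXT)"
)
# Parameter placeholders keep extracted text from being interpreted as SQL.
connection.execute(
    "INSERT INTO attorneys VALUES (?, ?, ?, ?, ?)",
    (attorney["name"], address["city"], address["state"],
     address["ZIP Code"], attorney["Telephone No"]),
)
connection.commit()

rows = connection.execute("SELECT name, city, zip FROM attorneys").fetchall()
```

The same parsed dictionary could just as well be handed to a spreadsheet export or a letter generator, which is the interchangeability the paragraph above describes.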

How to convert PDF to JSON?

There is a wide range of tools that perform this task. In a nutshell, the point is to use the most intelligent tools possible (artificial intelligence) so that low error rates increase processing throughput. This also requires good training of the AI. More about this can be found in this article.

How can I convert JSON to PDF?

For the conversion (back) to PDF there are also good solutions. Layout templates are used here. These define the appearance of the result file. A conversion program then inserts the data, which is available in JSON format, into this template and creates a new PDF file.
This can also be achieved with other file formats (e.g. Word or Excel).
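The template-filling step can be sketched with Python's standard `string.Template`. The letter text and the field names are invented for the example, and the sketch stops at plain text: rendering the filled template as an actual PDF would be done by a separate PDF library, which is not shown here.

```python
import json
from string import Template

# Layout template with named placeholders; a real setup would use a PDF or Word layout.
LETTER = Template(
    "To: $name\n"
    "$city, $state $zip\n"
    "\n"
    "We have received your documents and will reply by phone at $phone.\n"
)

data = json.loads("""
{"name": "john doe & jane doe", "city": "San Francisco",
 "state": "CA", "zip": "94102", "phone": "415-123-4567"}
""")

# substitute() raises KeyError when the JSON lacks a field the template needs.
letter_text = LETTER.substitute(data)
```

Failing loudly on a missing field is a deliberate choice here: a reply letter with a blank address is usually worse than no letter at all.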

What other formats are suitable for conversion to JSON?

In general, modern OCR programs read all formats that contain pictorial text information and convert them to JSON. These are besides PDF also image formats like TIFF, PNG or JPEG.
It is important that the image file is not compressed too strongly. This avoids compression artifacts and misinterpreted characters. Files generated by document scanners usually have sufficient resolution and quality. With today's OCR solutions, even photos from mobile devices are sufficient for correct analysis. The better the text recognition, the more successful the conversion to JSON.

  1. Information about serializing PDFs using JSON: https://wikis.ec.europa.eu/download/attachments/36701338/Mooney-Binary-Encodings.pdf?version=1&modificationDate=1633696451409&api=v2
  2. Detailed definition of JSON: https://www.w3schools.com/js/js_json_intro.asp
  3. Overview of XML from w3schools: https://www.w3schools.com/xml/xml_whatis.asp

