FUNSD+ | A larger and revised FUNSD dataset

When creating the FUNSD+ dataset, we aimed to enlarge FUNSD. In addition, we wanted to set up the labeling tool so that the FUNSD+ dataset can be copied, i.e. "forked", allowing other researchers to inspect, edit or expand the FUNSD+ annotations visually or via code; see Live Document Example.

How to get access to the FUNSD+ dataset?

  1. Register at app.konfuzio.com

  2. Create a Support Ticket

    Request access to FUNSD+: Provide the correct e-mail that is linked to your app.konfuzio.com account.

  3. You will receive an invite by e-mail.

    We will send you an invitation email to access the dataset via the Konfuzio platform. You need to register an account.

  4. Use the Konfuzio Python SDK to download the data.

    You can explore the dataset from the platform in read-only mode and then download it by using the Konfuzio SDK. Install it via pip install konfuzio_sdk and initialize it in the folder where you want to download the data with konfuzio_sdk init. Then download the dataset with konfuzio_sdk export_project 11984.

  5. Errors

    If you can't use the SDK, we will prepare another download for you, but we don't have it ready yet as we provide the download with the SDK by default.
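
The commands from step 4 can also be scripted. The following is a minimal sketch, assuming the pip and konfuzio_sdk commands quoted above are available on your PATH and that the script is started from the folder where the data should be stored. The helper names download_commands and run_download are hypothetical and not part of the SDK. By default the script only prints the commands (dry run), because konfuzio_sdk init may prompt for your credentials.

```python
import shlex
import subprocess

FUNSD_PLUS_PROJECT_ID = 11984  # project ID quoted in step 4


def download_commands(project_id=FUNSD_PLUS_PROJECT_ID):
    """Return the three commands from step 4 as argument lists."""
    return [
        ["pip", "install", "konfuzio_sdk"],
        ["konfuzio_sdk", "init"],
        ["konfuzio_sdk", "export_project", str(project_id)],
    ]


def run_download(project_id=FUNSD_PLUS_PROJECT_ID, dry_run=True):
    """Print each command; execute them in order only when dry_run is False."""
    for command in download_commands(project_id):
        print(shlex.join(command))
        if not dry_run:
            # check=True stops the sequence at the first failing command
            subprocess.run(command, check=True)


if __name__ == "__main__":
    run_download(dry_run=True)
```

Call run_download(dry_run=False) once you have confirmed that the printed commands match what you would type by hand.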


Background FUNSD dataset

We highly value the FUNSD dataset by Jaume et al. (2019) for form understanding in noisy scanned documents. Guillaume Jaume published the dataset on his homepage. It is licensed for non-commercial, research and educational purposes; see license. The FUNSD dataset is a subset of the documents published as RVL-CDIP, which was introduced by Harley et al. (2015).

Approximate number of open-access papers mentioning the dataset in the last five years.

Numbers are based on Papers with Code.

To build the FUNSD dataset, we manually checked the 25,000 images from the form category. We discarded unreadable and similar forms, resulting in 3,200 eligible documents, out of which we randomly sampled 199 to annotate.

Jaume et al. (2019)

Even though the FUNSD dataset belongs to a niche of AI, i.e. Document AI, about 200 people search for "FUNSD" every month.

Search volume of FUNSD. How many times per month people search "FUNSD" on Google.

FUNSD vs. FUNSD+

While annotating the single-page documents, we incorporated the latest research. Vu et al. (2020) report several inconsistencies in labeling, which might impede the applicability of FUNSD to the key-value extraction problem.

FUNSD+ provides access to more documents

Besides the increase from 199 to 1113 documents, we summarize the characteristics of both datasets below. Statistics of the FUNSD dataset are taken from the paper by Jaume et al. (2019).

| | FUNSD | FUNSD+ |
|---|---|---|
| Documents | 199 | 1113 |
| headers | 563 | 1604 |
| questions | 4343 | 14695 |
| answers | 3623 | 12154 |
| questions with no answers | 720 (16.6%) | 2691 (18.3%) |
| answers without questions* | 0 | 114 (0.9%) |

Table 1: FUNSD vs. FUNSD+ statistics

* Answers without questions are essentially Independent Checkboxes (see Table 2).
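
The percentages in Table 1 and the per-document averages it implies can be recomputed from the raw counts. The sketch below uses only the numbers from the table; the dictionary layout and the function name summarize are illustrative choices.

```python
# Counts copied from Table 1.
TABLE_1 = {
    "FUNSD": {"documents": 199, "headers": 563, "questions": 4343,
              "answers": 3623, "questions_no_answer": 720,
              "answers_no_question": 0},
    "FUNSD+": {"documents": 1113, "headers": 1604, "questions": 14695,
               "answers": 12154, "questions_no_answer": 2691,
               "answers_no_question": 114},
}


def summarize(counts):
    """Derive per-document averages and the two shares shown in Table 1."""
    docs = counts["documents"]
    return {
        "headers_per_doc": round(counts["headers"] / docs, 1),
        "questions_per_doc": round(counts["questions"] / docs, 1),
        "answers_per_doc": round(counts["answers"] / docs, 1),
        "pct_questions_no_answer": round(
            100 * counts["questions_no_answer"] / counts["questions"], 1),
        "pct_answers_no_question": round(
            100 * counts["answers_no_question"] / counts["answers"], 1),
    }


for name, counts in TABLE_1.items():
    print(name, summarize(counts))
```

On these counts, FUNSD averages about 21.8 questions and 18.2 answers per document, while FUNSD+ averages about 13.2 and 10.9.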

FUNSD+ revises how documents are annotated

As described in Table 1, the average number of headers, questions and answers per document differs. In Table 2 we summarize the main differences when annotating the documents. Afterwards, we will demonstrate a selected number of documents using screenshots of the Annotation UI.

| | FUNSD | FUNSD+ |
|---|---|---|
| Handwritten answers | Yes, usually good quality | Yes when good OCR, otherwise document excluded |
| Signatures | Included even when unreadable | Yes when good OCR, otherwise left blank (we declare it as unreadable by omission) |
| Checkboxes | All answers included, plus the checkmark sign | Only correct answer linked to the question. This provides a clean Question-Answer pair without further postprocessing needed. |
| Independent Checkboxes | Marks the checkmark as the answer and the textual response as a question. The uncheckmarked answers are questions without answers. | Only the checkmarked answer is annotated as an answer, the rest is given label "Other" as it doesn't answer any question |
| Tables | Links all rows of a table to the same column, so it's impossible to differentiate between multiple rows | Left unannotated and labeled as "Other". In a next version, the proper AnnotationSet structure would have "Table column/row header" labels associated to a single cell with label "Table Cell Answer". |
| Headers | Full | No brackets, considered as comments to the headers |
| Trailing colons | Yes | No |
| Irrelevant text/comments included in answers/questions | Yes, fully annotated | No, only clean information from Question-Answer pairs |
| Edge cases / ambiguous cases | Sometimes many items interconnected, with a structure that cannot be understood | Document excluded from the dataset |

Table 2: FUNSD vs. FUNSD+ annotation differences
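
One practical consequence of the "Trailing colons" row is that question text from the two datasets will not always match character for character. The following is a minimal sketch of a normalization step one might apply before comparing question strings; the function name is hypothetical and it is not part of either dataset's tooling.

```python
def strip_trailing_colon(text):
    """Drop trailing whitespace and any trailing colons from a question string."""
    return text.rstrip().rstrip(":").rstrip()


print(strip_trailing_colon("DATE:"))    # DATE
print(strip_trailing_colon("Name : "))  # Name
```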

Live Document Example

JSON

JSON formatting example: Visit https://git.konfuzio.com/-/snippets/33

Document UI

Visit https://app.konfuzio.com/d/303962/

FUNSD vs. FUNSD+ visual Examples

Multiple rows

FUNSD links all rows of a table to the same column, so it's impossible to differentiate between multiple rows. We did not annotate tables for now. However, we could expand the dataset and annotate tables using the concept of Label sets.

FUNSD to FUNSD+ side by side comparison

Use of headers

FUNSD links headers to questions inconsistently. FUNSD+ tries to reduce the number of headers and only annotates headers that clearly relate to the content next to them.

FUNSD to FUNSD+ side by side comparison

Annotating the answer

FUNSD links all of the possible answers to a question, even including the checkmark symbol, and thus does not provide clean information about the right answer.

FUNSD to FUNSD+ side by side comparison
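
To make this difference concrete, here is a sketch of how question-answer pairs could be read from linked entities. It assumes the entity layout of the original FUNSD JSON (a list of entities with id, text, label and linking fields); the FUNSD+ export may be structured differently, and the sample entities below are invented for illustration.

```python
from collections import defaultdict


def question_answer_pairs(entities):
    """Map each question text to the texts of all answers linked to it."""
    by_id = {e["id"]: e for e in entities}
    pairs = defaultdict(list)
    for entity in entities:
        if entity["label"] != "question":
            continue
        for source_id, target_id in entity["linking"]:
            target = by_id[target_id]
            if source_id == entity["id"] and target["label"] == "answer":
                pairs[entity["text"]].append(target["text"])
    return dict(pairs)


# FUNSD-style: every option and the checkmark are linked to the question.
funsd_style = [
    {"id": 0, "label": "question", "text": "Approved:",
     "linking": [[0, 1], [0, 2], [0, 3]]},
    {"id": 1, "label": "answer", "text": "Yes", "linking": [[0, 1]]},
    {"id": 2, "label": "answer", "text": "X", "linking": [[0, 2]]},
    {"id": 3, "label": "answer", "text": "No", "linking": [[0, 3]]},
]

# FUNSD+-style: only the selected option is linked.
# The label of the unselected option is an assumption made for this example.
funsd_plus_style = [
    {"id": 0, "label": "question", "text": "Approved", "linking": [[0, 1]]},
    {"id": 1, "label": "answer", "text": "Yes", "linking": [[0, 1]]},
    {"id": 2, "label": "other", "text": "No", "linking": []},
]

print(question_answer_pairs(funsd_style))       # {'Approved:': ['Yes', 'X', 'No']}
print(question_answer_pairs(funsd_plus_style))  # {'Approved': ['Yes']}
```

With the first layout, a consumer still has to work out which of the three linked texts is the selected option; with the second, the pair can be used directly.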

Checkmarks

FUNSD annotates the checkmark as the answer and the textual response as a question (Independent Checkboxes). FUNSD+ annotates the text of the selected checkbox.

FUNSD to FUNSD+ side by side comparison

Exclude text with OCR errors

FUNSD includes unreadable signatures, whereas FUNSD+ does not annotate text that cannot be recognized correctly by the OCR.

FUNSD to FUNSD+ side by side comparison

Reduce number of annotations

FUNSD includes some edge cases / ambiguous cases in which many items are interconnected in a structure that cannot be understood. FUNSD+ prefers not to annotate ambiguous cases.

FUNSD to FUNSD+ side by side comparison

Access to the dataset

The data can be downloaded via our Python SDK or can be custom hosted as an instance of the Konfuzio Server in your environment. In addition, our labeling interface allows you to easily define custom Annotations and entity relation structures beyond the Key Value Pair labeling used in FUNSD. This way you can build and maintain individual datasets. You can find more examples for invoices, remittance advice or car registration documents on our homepage.

How to cite?

Zagami, D., & Helm, C. (2022, October 18). FUNSD+: A larger and revised FUNSD dataset. Retrieved November 5, 2022, from http://konfuzio.com/en/funsd-plus/

@misc{zagami_helm_2022,
title = {FUNSD+: A larger and revised FUNSD dataset},
author = {Zagami, Davide and Helm, Christopher},
year = 2022,
month = {Oct},
journal = {FUNSD+ | A larger and revised FUNSD dataset},
publisher = {Helm \& Nagel GmbH},
url = {http://konfuzio.com/funsd-plus/}
}

References

Harley, A. W., Ufkes, A., & Derpanis, K. G. (2015, August). Evaluation of deep convolutional nets for document image classification and retrieval. In 2015 13th International Conference on Document Analysis and Recognition (ICDAR) (pp. 991-995). IEEE.

Jaume, G., Ekenel, H. K., & Thiran, J.-P. (2019). FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents. CoRR, abs/1905.13538.

Vu, H., & Nguyen, D. (2020). Revising FUNSD dataset for key-value detection in document images.
