When creating the FUNSD+ dataset, we aimed to enlarge FUNSD. In addition, we wanted to set up the labeling tool so that the FUNSD+ dataset can be copied, i.e. "forked", allowing other researchers to inspect, edit, or expand the FUNSD+ annotations visually or via code, see Live Document Example.
How to get access to the FUNSD+ dataset?
- Register at app.konfuzio.com
- Create a Support Ticket
Request access to FUNSD+ and provide the e-mail address that is linked to your app.konfuzio.com account.
- You will receive an invite by e-mail.
We will send you an invitation e-mail to access the dataset via the Konfuzio platform.
- Use the Konfuzio Python SDK to download the data.
You can explore the dataset on the platform in read-only mode and then download it using the Konfuzio SDK. Install it via `pip install konfuzio_sdk`, initialize it in the folder where you want to download the data with `konfuzio_sdk init`, and download the dataset with `konfuzio_sdk export_project 11984`. A short Python sketch for working with the export follows below.
- Errors
If you can't use the SDK, we will prepare another download for you, but we don't have it ready yet as we provide the download with the SDK by default.
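If you prefer to work with the data programmatically after the export, here is a minimal sketch using the SDK's Python API. It assumes `konfuzio_sdk init` was run in the working folder and that project 11984 was exported as described above; attribute names follow the SDK's documented data layer and may vary between SDK versions.

```python
# Minimal sketch: load the exported FUNSD+ project with the Konfuzio SDK.
# Assumes `konfuzio_sdk init` and `konfuzio_sdk export_project 11984` were
# run in this folder; attribute names may differ between SDK versions.
from konfuzio_sdk.data import Project

project = Project(id_=11984)  # the FUNSD+ project ID used in the export

for document in project.documents:
    annotations = document.annotations()
    print(document.id_, len(annotations))
    for annotation in annotations:
        # Each Annotation links a text span to a Label such as "question".
        print(annotation.label.name, annotation.offset_string)
```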
Background: The FUNSD dataset
We highly value the FUNSD dataset by Jaume et al. (2019) for form understanding in noisy scanned documents. Guillaume Jaume published the dataset on his homepage. It is licensed for non-commercial, research, and educational purposes, see license. The FUNSD dataset is a subset of the documents published as RVL-CDIP, which was introduced by Harley et al. (2015).
Figure: Approximate number of open-access papers mentioning the dataset in the last five years. Numbers are based on Papers with Code.
"To build the FUNSD dataset, we manually checked the 25,000 images from the form category. We discarded unreadable and similar forms, resulting in 3,200 eligible documents, out of which we randomly sampled 199 to annotate." (Jaume et al., 2019)
Even though the FUNSD dataset relates to a niche of AI, i.e. Document AI, about 200 people search for "FUNSD" on Google every month.

Figure: How many times per month people search for "FUNSD" on Google.
FUNSD vs. FUNSD+
While annotating the single-page documents, we incorporated the latest research. Vu et al. (2020) report having found several inconsistencies in the labeling, which might impede FUNSD's applicability to the key-value extraction problem.
FUNSD+ provides access to more documents
Besides the increase from 199 to 1,113 documents, we summarize the characteristics of both datasets in Table 1 below. Statistics of the FUNSD dataset are taken from the paper by Jaume et al. (2019).
Table 1: Statistics of the FUNSD and FUNSD+ datasets.

|  | FUNSD | FUNSD+ |
| --- | --- | --- |
| Documents | 199 | 1,113 |
| Headers | 563 | 1,604 |
| Questions | 4,343 | 14,695 |
| Answers | 3,623 | 12,154 |
| Questions with no answers | 720 (16.6%) | 2,691 (18.3%) |
| Answers without questions* | 0 | 114 (0.9%) |
* These are essentially the Independent Checkboxes described in Table 2 below.
FUNSD+ follows revised annotation guidelines
As shown in Table 1, the average number of headers, questions, and answers per document differs between the two datasets. In Table 2 we summarize the main differences in how the documents were annotated. Afterwards, we demonstrate a selection of documents using screenshots of the Annotation UI.
Table 2: Differences in annotation guidelines between FUNSD and FUNSD+.

|  | FUNSD | FUNSD+ |
| --- | --- | --- |
| Handwritten answers | Yes, usually good quality | Yes, when the OCR is good; otherwise the document is excluded |
| Signatures | Included even when unreadable | Yes, when the OCR is good; otherwise left blank (declared unreadable by omission) |
| Checkboxes | All answers included, plus the checkmark sign | Only the correct answer is linked to the question, providing a clean question-answer pair without further postprocessing |
| Independent Checkboxes | The checkmark is annotated as the answer and the textual response as a question; unchecked answers become questions without answers | Only the checked answer is annotated as an answer; the rest is labeled "Other", as it does not answer any question |
| Tables | All rows of a table are linked to the same column, so multiple rows cannot be differentiated | Left unannotated and labeled "Other". In a next version, a proper Annotation Set structure would associate "Table column/row header" labels with a single cell labeled "Table Cell Answer" |
| Headers | Annotated in full | Annotated without bracketed text, which is considered a comment to the header |
| Trailing colons | Yes | No |
| Irrelevant text/comments included in answers/questions | Yes, fully annotated | No, only clean information from question-answer pairs |
| Edge cases / ambiguous cases | Sometimes many interconnected items with a structure that cannot be understood | Document excluded from the dataset |
Live Document Example
JSON view of an example FUNSD+ document.
Document UI view of the same document in the Annotation UI.
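To reproduce something like the JSON view locally, you can serialize a document's annotations yourself. The following is only a sketch building on the export above; the label names used in FUNSD+ (e.g. "question", "answer", "header") are an assumption here and should be confirmed via `project.labels`.

```python
# Illustrative sketch: build a simple label -> texts mapping for one document.
# Label names such as "question" or "answer" are assumptions; check
# project.labels for the names actually used in FUNSD+.
import json
from collections import defaultdict

from konfuzio_sdk.data import Project

project = Project(id_=11984)
document = project.documents[0]  # any document works for a quick look

by_label = defaultdict(list)
for annotation in document.annotations():
    by_label[annotation.label.name].append(annotation.offset_string)

print(json.dumps(by_label, indent=2, default=str))
```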

FUNSD vs. FUNSD+: Visual Examples
Multiple rows
FUNSD links all rows of a table to the same column, so it is impossible to differentiate between multiple rows. We did not annotate tables for now; however, the dataset could be expanded to annotate tables using the concept of Label Sets, as sketched below.
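To make the Label Sets idea more concrete, the snippet below sketches how one table row could be grouped, following the future structure described in Table 2. All label names and values here are hypothetical placeholders, not part of the current FUNSD+ release.

```python
# Hypothetical sketch: one table row grouped via a Label Set, following the
# future structure described in Table 2. Label names and values are
# placeholders; in the current FUNSD+ release, tables are labeled "Other".
table_row = {
    "label_set": "Table Row",  # hypothetical Label Set grouping one row
    "annotations": [
        {"label": "Table column header", "text": "Column A"},
        {"label": "Table Cell Answer", "text": "Value A"},
        {"label": "Table column header", "text": "Column B"},
        {"label": "Table Cell Answer", "text": "Value B"},
    ],
}

# Pair each column header with the cell that follows it.
cells = table_row["annotations"]
for header, cell in zip(cells[::2], cells[1::2]):
    print(header["text"], "->", cell["text"])
```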

Use of headers
FUNSD links headers to questions inconsistently. FUNSD+ tries to reduce the number of headers and only annotates headers that clearly relate to the content next to them.

Annotating the answer
FUNSD links multiple answers to a question, even including the checkmark symbol, and thus does not provide clean information about the correct answer.

Checkmarks
FUNSD annotates the checkmark as the answer and the textual response as a question (Independent Checkboxes). FUNSD+ annotates the text of the selected checkbox.

Exclude text with OCR errors
FUNSD includes unreadable signatures; FUNSD+ does not annotate text that cannot be recognized correctly by the OCR.

Reduce number of annotations
FUNSD includes some edge cases and ambiguous cases, where many interconnected items form a structure that cannot be understood. FUNSD+ prefers not to annotate such ambiguous cases.

Access to the dataset
The data can be downloaded via our Python SDK, or it can be custom-hosted as an instance of the Konfuzio Server in your environment. Besides that, our labeling interface allows you to easily define custom Annotations and entity-relation structures beyond the key-value pair labeling used in FUNSD, so you can build and maintain individual datasets. You can find more examples for invoices, remittance advice, or car registration documents on our homepage.
How to cite?
Zagami, D., & Helm, C. (2022, October 18). FUNSD+: A larger and revised FUNSD dataset. Retrieved November 5, 2022, from http://konfuzio.com/en/funsd-plus/
@misc{zagami_helm_2022,
title = {FUNSD+: A larger and revised FUNSD dataset},
author = {Zagami, Davide and Helm, Christopher},
year = 2022,
month = {Oct},
journal = {FUNSD+ | A larger and revised FUNSD dataset},
publisher = {Helm & Nagel GmbH},
url = {http://konfuzio.com/funsd-plus/}
}
References
Harley, A. W., Ufkes, A., & Derpanis, K. G. (2015, August). Evaluation of deep convolutional nets for document image classification and retrieval. In 2015 13th International Conference on Document Analysis and Recognition (ICDAR) (pp. 991-995). IEEE.
Jaume, G., Ekenel, H. K., & Thiran, J.-P. (2019). FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents. CoRR, abs/1905.13538.
Vu, H., & Nguyen, D. (2020). Revising FUNSD dataset for key-value detection in document images.