FUNSD+ | A larger and revised FUNSD dataset


When creating the FUNSD+ dataset we aimed to enlarge the FUNSD dataset. In addition, we wanted to set up the labeling tool so that the FUNSD+ dataset can be copied, i.e. "forked", allowing other researchers to inspect, edit or expand the FUNSD+ Annotations visually or via code, see the Live Document Example below.


Background FUNSD dataset

We highly value the FUNSD dataset by Jaume et al. (2019) for form understanding in noisy scanned documents. Guillaume Jaume published the dataset on his homepage. It is licensed for non-commercial, research and educational purposes, see the license. The FUNSD dataset is a subset of the documents published as RVL-CDIP, which was introduced by Harley et al. (2015).

Approximate number of open-access papers mentioning the dataset in the last five years. Numbers are based on Papers with Code.

"To build the FUNSD dataset, we manually checked the 25,000 images from the form category. We discarded unreadable and similar forms, resulting in 3,200 eligible documents, out of which we randomly sampled 199 to annotate." (Jaume et al., 2019)

Even though the FUNSD dataset relates to a niche of AI, namely Document AI, about 200 people search for "FUNSD" every month.

Search volume of FUNSD: how many times per month people search for "FUNSD" on Google.

FUNSD vs. FUNSD+

While annotating the single-page documents we incorporated the latest research. Vu and Nguyen (2020) report several labeling inconsistencies that might impede the applicability of FUNSD to the key-value extraction problem.

FUNSD+ provides access to more documents

Besides the increase from 199 to 1,113 documents, we summarize the characteristics of both datasets below. Statistics of the FUNSD dataset are taken from the paper by Jaume et al. (2019).

Metric | FUNSD | FUNSD+
Documents | 199 | 1,113
Headers | 563 | 1,604
Questions | 4,343 | 14,695
Answers | 3,623 | 12,154
Questions with no answers | 720 (16.6%) | 2,691 (18.3%)
Answers without questions* | 0 | 114 (0.9%)

Table 1: FUNSD vs. FUNSD+ statistics

* These are basically Independent Checkboxes, see Table 2 below.
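For reference, the FUNSD side of Table 1 can be recomputed from the original FUNSD release with a few lines of Python. The snippet below is a minimal sketch assuming the original FUNSD annotation layout (one JSON file per document containing a "form" list of entities with "id", "label" and "linking" fields); the annotation directory path is a placeholder.

```python
import json
from collections import Counter
from pathlib import Path

# Placeholder path: point this at the extracted FUNSD "annotations" folder.
ANNOTATION_DIR = Path("dataset/training_data/annotations")

counts = Counter()
questions_without_answer = 0
answers_without_question = 0

for json_file in sorted(ANNOTATION_DIR.glob("*.json")):
    entities = json.loads(json_file.read_text(encoding="utf-8"))["form"]
    by_id = {e["id"]: e for e in entities}
    counts["documents"] += 1
    for e in entities:
        counts[e["label"]] += 1  # "header", "question", "answer" or "other"

    # "linking" holds [from_id, to_id] pairs (header->question, question->answer).
    links = {tuple(pair) for e in entities for pair in e["linking"]}
    for e in entities:
        if e["label"] == "question":
            has_answer = any(
                src == e["id"] and by_id[dst]["label"] == "answer"
                for src, dst in links
            )
            questions_without_answer += not has_answer
        elif e["label"] == "answer":
            has_question = any(
                dst == e["id"] and by_id[src]["label"] == "question"
                for src, dst in links
            )
            answers_without_question += not has_question

print(counts)
print("questions with no answers:", questions_without_answer)
print("answers without questions:", answers_without_question)
```

Running this over both the training and testing splits should roughly reproduce the FUNSD column of Table 1.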

Annotation differences between FUNSD and FUNSD+

As shown in Table 1, the average number of headers, questions and answers per document differs. In Table 2 we summarize the main differences in how the documents were annotated. Afterwards, we demonstrate a selection of documents using screenshots of the Annotation UI.

Aspect | FUNSD | FUNSD+
Handwritten answers | Yes, usually good quality | Yes when the OCR is good; otherwise the document is excluded
Signatures | Included even when unreadable | Yes when the OCR is good; otherwise left blank (we declare it as unreadable by omission)
Checkboxes | All answers included, plus the checkmark sign | Only the correct answer is linked to the question, giving a clean question-answer pair without further postprocessing
Independent Checkboxes | The checkmark is the answer and the textual response is a question; unchecked answers become questions without answers | Only the checked answer is annotated as an answer; the rest is labeled "Other" as it does not answer any question
Tables | All rows of a table are linked to the same column, so multiple rows cannot be differentiated | Left unannotated and labeled "Other"; in a next version, a proper Annotation Set structure would attach "Table column/row header" Labels to a single cell with the Label "Table Cell Answer"
Headers | Annotated in full | No brackets (brackets are considered comments to the headers)
Trailing colons | Yes | No
Irrelevant text/comments included in answers/questions | Yes, fully annotated | No, only clean information from question-answer pairs
Edge cases / ambiguous cases | Sometimes many interconnected items with a structure that cannot be understood | Document excluded from the dataset

Table 2: Annotation differences between FUNSD and FUNSD+

Live Document Example

JSON

JSON formatting example: Visit https://git.konfuzio.com/-/snippets/33

Document UI

Visit https://app.konfuzio.com/d/303962/

FUNSD vs. FUNSD+ visual Examples

Multiple rows

FUNSD links all rows of a table to the same column, so it is impossible to differentiate between multiple rows. We did not annotate tables for now. However, the dataset could be expanded to annotate tables using the concept of Label Sets, as sketched below.

FUNSD to FUNSD+ side by side comparison
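To illustrate the idea from Table 2, one table row could be captured as a single Annotation Set whose Labels separate the column headers from the cell values. The snippet below is a hypothetical plain-Python sketch of that structure, not the actual Konfuzio data model or JSON schema:

```python
# Hypothetical sketch of one table row modeled as an Annotation Set.
# The Label names follow the wording in Table 2 ("Table column/row header",
# "Table Cell Answer"); the dict layout is illustrative only.
table_row_annotation_set = {
    "label_set": "Table Row",
    "annotations": [
        {"label": "Table column header", "text": "Quantity"},
        {"label": "Table Cell Answer", "text": "12"},
    ],
}
```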

Use of headers

FUNSD links headers to questions inconsistently. FUNSD+ tries to reduce the number of headers and only annotates headers that clearly relate to the content next to them.

FUNSD to FUNSD+ side by side comparison

Annotating the answer

FUNSD links multiple answers to a question, even including the checkmark symbol, and thus does not provide clean information about the right answer.

FUNSD to FUNSD+ side by side comparison

Checkmarks

FUNSD annotates the checkmark as the answer and the textual response as a question (Independent Checkboxes). FUNSD+ annotates the text of the selected checkbox.

FUNSD to FUNSD+ side by side comparison
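To make the difference concrete, the entities below sketch how a single checked option ("[x] Yes") might look in the original FUNSD schema, where the checkmark itself is the answer and the option text the question. In FUNSD+ only the selected text "Yes" would be annotated as the answer. The entity values are hypothetical and only illustrate the labeling convention:

```python
# Hypothetical FUNSD-style entities for one checked checkbox ("[x] Yes").
# FUNSD: the checkmark "x" is the answer linked to the option text "Yes".
funsd_entities = [
    {"id": 7, "label": "question", "text": "Yes", "linking": [[7, 8]]},
    {"id": 8, "label": "answer", "text": "x", "linking": [[7, 8]]},
]
```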

Exclude text with OCR errors

FUNSD includes unreadable signatures; FUNSD+ does not annotate text that cannot be recognized correctly by the OCR.

FUNSD to FUNSD+ side by side comparison

Reduce number of annotations

FUNSD includes some edge and ambiguous cases, where many interconnected items form a structure that cannot be understood. FUNSD+ prefers not to annotate such ambiguous cases.

FUNSD to FUNSD+ side by side comparison

Access to the dataset

The data can be downloaded via our Python SDK or hosted as a custom instance of the Konfuzio Server in your environment. Besides that, our labeling interface allows you to easily define custom Annotations and entity relation structures beyond the key-value pair labeling used in FUNSD. Thereby you can build and maintain individual datasets. You can find more examples for invoices, remittance advice or car registration documents on our homepage.
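As a rough sketch of the SDK route, access could look like the snippet below. The project ID is a placeholder, and the class and attribute names follow the konfuzio_sdk documentation at the time of writing; they may differ between SDK versions.

```python
# Sketch only: assumes the konfuzio_sdk package is installed and credentials
# have been configured (e.g. via `konfuzio_sdk init`). Attribute names are
# taken from the public SDK docs and may differ between versions.
from konfuzio_sdk.data import Project

project = Project(id_=123)  # placeholder: use the project ID you were granted
for document in project.documents:
    for annotation in document.annotations():
        print(document.id_, annotation.label.name, annotation.offset_string)
```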

How to cite?

Zagami, D., & Helm, C. (2022, October 18). FUNSD+: A larger and revised FUNSD dataset. Retrieved November 5, 2022, from https://konfuzio.com/en/funsd-plus/

@misc{zagami_helm_2022,
title = {FUNSD+: A larger and revised FUNSD dataset},
author = {Zagami, Davide and Helm, Christopher},
year = 2022,
month = {Oct},
journal = {FUNSD+ | A larger and revised FUNSD dataset},
publisher = {Helm & Nagel GmbH},
url = {https://konfuzio.com/funsd-plus/}
}

References

Harley, A. W., Ufkes, A., & Derpanis, K. G. (2015, August). Evaluation of deep convolutional nets for document image classification and retrieval. In 2015 13th International Conference on Document Analysis and Recognition (ICDAR) (pp. 991-995). IEEE.

Jaume, G., Ekenel, H. K., & Thiran, J.-P. (2019). FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents. CoRR, abs/1905.13538.

Vu, H., & Nguyen, D. (2020). Revising FUNSD dataset for key-value detection in document images.
