FUNSD Plus – Form Understanding in Noisy Scanned Documents

Konfuzio

We highly value the FUNSD contribution to form understanding in noisy scanned documents. Because FUNSD is limited to 199 documents, we created a dataset of 1113 documents. While labeling, we reflected on the revision of the FUNSD dataset and added example screenshots that show how we changed the labeling.

Jaume, G., Ekenel, H. K., & Thiran, J.-P. (2019). FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents. CoRR, abs/1905.13538.

Vu, H., & Nguyen, D. (2020). Revising FUNSD dataset for key-value detection in document images.

Please contact us if you want to use the FUNSD+ dataset.

FUNSD vs. FUNSD+

| | FUNSD | FUNSD+ |
|---|---|---|
| Handwritten answers | Yes, usually good quality | Yes when good OCR, otherwise document excluded |
| Signatures | Included even when unreadable | Yes when good OCR, otherwise left blank (we declare it as unreadable by omission) |
| Checkboxes | All answers included, plus the checkmark sign | Only correct answer linked to the question. This provides a clean Question-Answer pair without further postprocessing needed. |
| Independent Checkboxes | Marks the checkmark as the answer and the textual response as a question. The uncheckmarked answers are questions without answers. | Only the checkmarked answer is annotated as an answer, the rest is given label "Other" as it doesn’t answer any question |
| Tables | Links all rows of a table to the same column, so it’s impossible to differentiate between multiple rows | Left unannotated and labeled as "Other". In a next version, the proper AnnotationSet structure would have "Table column/row header" Labels associated to a single cell with Label "Table Cell Answer" |
| Headers | Full | No brackets, considered as comments to the headers |
| Trailing colons | Yes | No |
| Irrelevant text/comments included in answers/questions | Yes, fully annotated | No, only clean information from Question-Answers pairs |
| Edge cases / ambiguous cases | Sometimes many items interconnected, with a structure which is not able to be understood | Document excluded from the dataset |
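To make the "Trailing colons" row concrete, here is a minimal sketch of how question text could be normalized when moving from FUNSD-style to FUNSD+-style labels. It assumes entities shaped like the original FUNSD JSON (`id`, `label`, `text`, `linking`); whether the FUNSD+ export uses the same field names is an assumption, and the helper names `strip_trailing_colon` and `clean_questions` are hypothetical. Applying the rule only to questions is also an assumption made for illustration.

```python
import copy


def strip_trailing_colon(text: str) -> str:
    """Remove trailing colons and surrounding whitespace, e.g. 'Date :' -> 'Date'."""
    return text.rstrip().rstrip(":").rstrip()


def clean_questions(entities):
    """Return a copy of the entities with trailing colons removed from questions.

    Answers are left untouched, so a value such as '10:30' keeps its colon.
    """
    cleaned = copy.deepcopy(entities)
    for entity in cleaned:
        if entity["label"] == "question":
            entity["text"] = strip_trailing_colon(entity["text"])
    return cleaned


# Hand-made example in the assumed FUNSD-style layout (not taken from the dataset).
example = [
    {"id": 0, "label": "question", "text": "Date:", "linking": [[0, 1]]},
    {"id": 1, "label": "answer", "text": "10:30", "linking": [[0, 1]]},
]
cleaned = clean_questions(example)
```

The function works on a deep copy, so the original annotations stay available for comparison.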

FUNSD+ provides access to more documents

| | FUNSD | FUNSD+ |
|---|---|---|
| docs | 199 | 1113 |
| headers | 563 | 1604 |
| questions | 4343 | 14695 |
| answers | 3623 | 12154 |
| questions with no answers | 720 (16.6%) | 2691 (18.3%) |
| answers without questions* | 0 | 114 (0.9%) |

* These are essentially the Independent Checkboxes described in the table above.
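The percentages in the table can be reproduced from the raw counts. The short sketch below does this with the numbers copied from the table; the helper name `share_percent` is hypothetical and only serves the illustration.

```python
def share_percent(part: int, total: int) -> float:
    """Return part/total as a percentage, rounded to one decimal place."""
    return round(100 * part / total, 1)


# Counts copied from the table above.
funsd_unanswered = share_percent(720, 4343)        # questions with no answers, FUNSD
funsd_plus_unanswered = share_percent(2691, 14695)  # questions with no answers, FUNSD+
funsd_plus_orphans = share_percent(114, 12154)      # answers without questions, FUNSD+

print(funsd_unanswered, funsd_plus_unanswered, funsd_plus_orphans)
```

Each share is relative to its own total: unanswered questions over all questions, and answers without questions over all answers.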

Examples, screenshots

Example: FUNSD links all rows of a table to the same column, so it’s impossible to differentiate between multiple rows

We did not annotate tables for now. However, we could expand the dataset and annotate tables using the concept of Label Sets.

FUNSD to FUNSD+ side by side comparison

Example: FUNSD links headers to questions inconsistently

We tried to reduce the number of headers and only annotated headers that clearly relate to the content next to them.

FUNSD to FUNSD+ side by side comparison

Example: FUNSD links all possible answers to a question, even including the checkmark symbol, and thus does not provide clean information about the right answer

FUNSD to FUNSD+ side by side comparison

Example: FUNSD annotates the checkmark as the answer and the textual response as a question (Independent Checkboxes)

FUNSD to FUNSD+ side by side comparison
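The Independent Checkboxes convention is the reason FUNSD+ contains answers that are not linked to any question. As a rough sketch, unlinked questions and answers could be counted as follows. This assumes the entity layout of the original FUNSD JSON, where `linking` holds pairs of entity ids; the function name `count_unlinked` and the sample page are hypothetical and not taken from the dataset.

```python
def count_unlinked(entities):
    """Count questions with no linked answer and answers with no linked question."""
    label_of = {e["id"]: e["label"] for e in entities}
    partners = {e["id"]: set() for e in entities}

    # Treat every link as undirected so it is visible from both ends.
    for e in entities:
        for a, b in e.get("linking", []):
            if a in partners and b in partners:
                partners[a].add(b)
                partners[b].add(a)

    def lacks(entity_id, wanted_label):
        return not any(label_of[p] == wanted_label for p in partners[entity_id])

    questions_without_answers = sum(
        1 for i, label in label_of.items()
        if label == "question" and lacks(i, "answer")
    )
    answers_without_questions = sum(
        1 for i, label in label_of.items()
        if label == "answer" and lacks(i, "question")
    )
    return questions_without_answers, answers_without_questions


# Hand-made page: one linked pair, one open question, and one independent
# checkbox whose checked option is an answer and whose other option is "other".
page = [
    {"id": 0, "label": "question", "text": "Name", "linking": [[0, 1]]},
    {"id": 1, "label": "answer", "text": "J. Smith", "linking": [[0, 1]]},
    {"id": 2, "label": "question", "text": "Date", "linking": []},
    {"id": 3, "label": "answer", "text": "Approved", "linking": []},
    {"id": 4, "label": "other", "text": "Rejected", "linking": []},
]
unlinked = count_unlinked(page)
```

A question that is linked only to a header still counts as unanswered here, because the check looks at the label of the linked entity rather than at the mere presence of a link.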

Example: FUNSD includes unreadable signatures

FUNSD to FUNSD+ side by side comparison

Example: FUNSD includes some edge cases / ambiguous cases in which many items are interconnected and the structure cannot be understood

FUNSD to FUNSD+ side by side comparison

The data can be downloaded via our Python SDK or hosted as an instance of the Konfuzio Server in your own environment. In addition, our labeling interface lets you define custom annotation structures beyond the key-value pair labeling used in FUNSD, so you can build and maintain your own datasets. You can find more examples for invoices, remittance advice, or car registration documents on our homepage.
