We highly value the FUNSD contribution to form understanding in noisy scanned documents. Because FUNSD is restricted to 199 documents, we created a dataset of 1113 documents instead. While labeling, we reflected on the revision of the FUNSD dataset and added example screenshots showing how we changed the labeling.
Jaume, G., Ekenel, H. K., & Thiran, J.-P. (2019). FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents. CoRR, abs/1905.13538.
Vu, H., & Nguyen, D. (2020). Revising FUNSD Dataset for Key-Value Detection in Document Images.
Please contact us if you want to use the FUNSD+ dataset.
FUNSD vs. FUNSD+
| | FUNSD | FUNSD+ |
| --- | --- | --- |
| Handwritten answers | Yes, usually good quality | Yes if the OCR is good; otherwise the document is excluded |
| Signatures | Included even when unreadable | Yes if the OCR is good; otherwise left blank (we declare them unreadable by omission) |
| Checkboxes | All answers included, plus the checkmark sign | Only the correct answer is linked to the question, yielding a clean question-answer pair without further postprocessing |
| Independent checkboxes | The checkmark is annotated as the answer and the textual response as a question; unchecked answers become questions without answers | Only the checked answer is annotated as an answer; the rest is labeled "Other", as it does not answer any question |
| Tables | All rows of a table are linked to the same column, so multiple rows cannot be differentiated | Left unannotated and labeled "Other". In a next version, a proper Annotation Set structure would attach "Table column/row header" labels to single cells labeled "Table Cell Answer" |
| Headers | Full | No brackets; brackets are considered as comments to the headers |
| Irrelevant text/comments included in answers/questions | Yes, fully annotated | No, only clean information from question-answer pairs |
| Edge cases / ambiguous cases | Sometimes many interconnected items, with a structure that cannot be understood | Document excluded from the dataset |
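To make the checkbox policy concrete, here is a minimal, self-contained sketch of how clean question-answer pairs can be read from FUNSD-style JSON, where each entity carries a `label` and a `linking` list of `[question_id, answer_id]` pairs. The sample record and the helper function are illustrative, not part of either dataset release.

```python
# FUNSD-style annotation: entities with "id", "label", "text", "linking".
# Under the FUNSD+ policy, only the checked answer is linked to the question;
# the unchecked option is labeled "other" and carries no link.
sample = {
    "form": [
        {"id": 0, "label": "question", "text": "Smoker?", "linking": [[0, 2]]},
        {"id": 1, "label": "other", "text": "[ ] No", "linking": []},
        {"id": 2, "label": "answer", "text": "[x] Yes", "linking": [[0, 2]]},
    ]
}

def question_answer_pairs(form):
    """Pair each question with its linked answers via the "linking" field."""
    by_id = {entity["id"]: entity for entity in form}
    pairs = []
    for entity in form:
        if entity["label"] != "question":
            continue
        for q_id, a_id in entity["linking"]:
            target = by_id[a_id]
            if target["label"] == "answer":
                pairs.append((entity["text"], target["text"]))
    return pairs

print(question_answer_pairs(sample["form"]))  # [('Smoker?', '[x] Yes')]
```

Because the unchecked option is never linked, the pairing needs no postprocessing to discard wrong answers.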
FUNSD+ provides access to more documents
| | FUNSD | FUNSD+ |
| --- | --- | --- |
| Documents | 199 | 1113 |
| Questions with no answers | 720 (16.6%) | 2691 (18.3%) |
| Answers without questions* | 0 | 114 (0.9%) |
\* These are the Independent Checkboxes described in the table above.
Example: FUNSD links all rows of a table to the same column, so it’s impossible to differentiate between multiple rows
We have not annotated tables for now. However, we could expand the dataset and annotate tables using the concept of Label Sets.
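The Label Set idea can be sketched as follows: each table row becomes its own annotation set, so cells from different rows can no longer be confused. All names used here ("Table Row", "Table Column Header", "Table Cell Answer") are assumptions about a possible future schema, not a released format.

```python
# Hypothetical Label Set structure for table annotation (illustrative only).
# One annotation set per table row keeps the cells of different rows apart,
# unlike linking every row to the same column header.
table_rows = [
    {
        "label_set": "Table Row",
        "annotations": [
            {"label": "Table Column Header", "text": "Item"},
            {"label": "Table Cell Answer", "text": "Pen"},
        ],
    },
    {
        "label_set": "Table Row",
        "annotations": [
            {"label": "Table Column Header", "text": "Item"},
            {"label": "Table Cell Answer", "text": "Pencil"},
        ],
    },
]

def cells_by_row(rows):
    """Collect the cell answers of each row separately."""
    return [
        [a["text"] for a in row["annotations"] if a["label"] == "Table Cell Answer"]
        for row in rows
    ]

print(cells_by_row(table_rows))  # [['Pen'], ['Pencil']]
```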
Example: FUNSD links headers to questions inconsistently
We tried to reduce the number of headers and only annotated headers that clearly relate to the content next to them.
Example: FUNSD links multiple answers to a question, even including the checkmark symbol, and thus does not provide clean information about the correct answer
Example: FUNSD annotates the checkmark as the answer and the textual response as a question (Independent Checkboxes)
Example: FUNSD includes unreadable signatures
Example: FUNSD includes some edge cases / ambiguous cases, where many items are interconnected in a structure that cannot be understood
The data can be downloaded via our Python SDK or hosted as a custom instance of the Konfuzio Server in your own environment. In addition, our labeling interface lets you easily define custom annotation structures beyond the key-value pair labeling used in FUNSD, so you can build and maintain individual datasets. You can find more examples for invoices, remittance advice, or car registration documents on our homepage.