Document Layout Analysis bridges the gap between unstructured data and its meaningful use by extracting structured information while respecting the layout of the original documents.
From deciphering complex magazine and newspaper formats to processing technical manuals, Document Layout Analysis can help uncover information that would otherwise stay hidden.
Analyzing and structuring documents efficiently is a key factor in numerous areas - from automating administrative tasks to improving information accessibility.
In this article we will guide you through the maze of Document Layout Analysis, LayoutParser and DocLayNet and explain the background.
This article was written in German, automatically translated into other languages and editorially reviewed. We welcome feedback at the end of the article.
Clarification and explanation of the "Document Layout" concept
Document Layout is the spatial arrangement and design of content on a page or in a digital document.
This includes elements such as text blocks, headings, images, charts, tables, and other graphical components. The layout of a document significantly influences how the information is presented and perceived by the reader.
Document Layout Analysis involves the recognition and interpretation of visual and spatial information in documents to achieve an in-depth understanding of a document's structure and meaning.
Significant factors of document layout and their influence on text interpretation
There are a number of factors that determine the layout of a document and influence the interpretation of text. These include the position and size of text blocks, the arrangement of images and graphics, the use of colors and fonts, and the hierarchical structure of information. A well-designed document layout guides the reader's eye, emphasizes important points, and improves understanding of the content.
For example, headings and subheadings can help divide text into manageable sections and clarify the structure of the document. Images and diagrams can visually represent information and facilitate text interpretation. Colors can be used to highlight specific areas or indicate different categories of information. In Document Layout Analysis, these and other factors are analyzed to provide a comprehensive picture of a document's structure and meaning.
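To make these factors concrete, the following minimal sketch represents a page as a list of typed regions with a position and size, sorts them into a naive reading order, and uses headings to split the page into sections. The class name, field names and coordinates are assumptions chosen for illustration, and the top-to-bottom ordering only suits simple single-column pages.

```python
from dataclasses import dataclass


@dataclass
class LayoutElement:
    """One region on a page; all names here are illustrative."""
    kind: str      # e.g. "heading", "text", "image", "table"
    x: float       # left edge
    y: float       # top edge (origin at the top-left corner of the page)
    width: float
    height: float


def reading_order(elements):
    """Sort elements top-to-bottom, then left-to-right (single column only)."""
    return sorted(elements, key=lambda e: (e.y, e.x))


def split_into_sections(elements):
    """Start a new section at every heading, following reading order."""
    sections = []
    for element in reading_order(elements):
        if element.kind == "heading" or not sections:
            sections.append([])
        sections[-1].append(element)
    return sections


# Made-up page: two headings, one text block and one image, in random order.
page = [
    LayoutElement("text", 50, 120, 500, 200),
    LayoutElement("heading", 50, 40, 500, 30),
    LayoutElement("image", 50, 420, 240, 180),
    LayoutElement("heading", 50, 340, 500, 30),
]
sections = split_into_sections(page)
```

Real layout analysis replaces the hand-written list with regions predicted by a model, but the downstream logic often works on a structure similar to this one.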
Approaches for Document Layout Analysis
Here we summarize typical approaches that professionals use in Document Layout Analysis:
- Synthetic dataset and model ensemble: One approach is to create a synthetic image dataset and use ensemble models such as YOLOv8 and DINO for layout prediction. To improve performance, an additional classification model is trained to categorize samples into document categories. Models are optimized using the Tree-Structured Parzen Estimator (TPE) and results are combined with Weighted Boxes Fusion (WBF).
- Image augmentation and object detection: Another approach relies on image augmentation techniques such as scaling and mosaicking methods and trains object detection models such as YOLOv5 and YOLOv8 for layout prediction. The final predictions are an ensemble of multiple detectors for superior performance.
- Mask prediction: In addition, various experts have already used models such as MaskDINO that introduce a mask prediction branch to achieve better feature alignment between detection and segmentation. Inference is then performed using the Weighted Boxes Fusion (WBF) technique on multiple scales of the same input image.
- Use of pre-trained models: Another approach is to use pre-trained models such as VSR and LayoutLMv3. The prediction results of both models are merged in the inference phase.
- Training variations of existing models: Experts have trained different versions of Cascade Mask R-CNN models based on a DiT backbone and fused the prediction results of the different models.
- Baseline approach: The YOLOv5 model provides a simple baseline. It can be trained from scratch with default settings, and standard augmentation techniques such as mosaicking, scaling, flipping, rotation, mix-up and image-level adjustments improve the results.
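Several of the approaches above combine the predictions of multiple detectors with Weighted Boxes Fusion. The following is a simplified, single-class sketch of that idea, not the reference implementation: boxes are clustered by overlap, each cluster is merged into one box whose coordinates are the confidence-weighted average, and the confidence is reduced when only some of the models found the region. The function names and the sample boxes are assumptions for illustration, and scores are assumed to be greater than zero.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def fuse(cluster):
    """Confidence-weighted average box and mean score of one cluster."""
    total = sum(score for _, score in cluster)
    box = tuple(
        sum(b[i] * score for b, score in cluster) / total for i in range(4)
    )
    return box, total / len(cluster)


def weighted_boxes_fusion(predictions, iou_threshold=0.55):
    """Fuse single-class predictions; one list of (box, score) per model."""
    n_models = len(predictions)
    candidates = sorted(
        (p for model in predictions for p in model),
        key=lambda p: p[1],
        reverse=True,
    )
    clusters = []
    for box, score in candidates:
        # Find the cluster whose current fused box overlaps this box the most.
        best, best_iou = None, iou_threshold
        for cluster in clusters:
            overlap = iou(box, fuse(cluster)[0])
            if overlap > best_iou:
                best, best_iou = cluster, overlap
        if best is None:
            clusters.append([(box, score)])
        else:
            best.append((box, score))
    fused = []
    for cluster in clusters:
        box, score = fuse(cluster)
        # Down-weight regions that only some of the models predicted.
        score *= min(len(cluster), n_models) / n_models
        fused.append((box, score))
    return fused


# Made-up predictions from two detectors for the same page.
model_a = [((10, 10, 110, 60), 0.9)]
model_b = [((12, 8, 108, 62), 0.6), ((300, 300, 400, 350), 0.8)]
result = weighted_boxes_fusion([model_a, model_b])
```

In this example the two overlapping boxes are merged into one region, while the box found by only one detector is kept with half its confidence. Unlike non-maximum suppression, no prediction is simply discarded.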
What is the so-called LayoutParser?
LayoutParser is a Python library that provides a wide range of pre-trained deep learning models to recognize the layout of a document image. This library uses state-of-the-art machine learning models to provide detailed and accurate analysis of document layout.
The advantage of LayoutParser is that it is really easy to implement. In fact, you only need a few lines of code to capture the layout of your document image. We will discuss the exact steps to do this in the next section.
With LayoutParser, you can benefit from pre-trained deep learning models that have been trained on various datasets. These include PubLayNet, HJDataset, PrimaLayout, Newspaper Navigator and TableBank, among others. These models have been specifically trained to recognize and interpret complex layout structures, enabling accurate and efficient document layout analysis.
If the layout of your document image is similar to any of the datasets listed above, you will have significant advantages with LayoutParser. It allows not only efficient layout recognition, but also deeper analysis and understanding of the document content.
In addition, LayoutParser offers the flexibility to create and train customized models to meet specific requirements. This makes it a powerful and customizable tool for Document Layout Analysis.
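As a rough sketch of what such a detection can look like, the following function loads a detector pre-trained on PubLayNet from the LayoutParser model zoo and returns the detected regions. It assumes that layoutparser is installed with its Detectron2 backend together with OpenCV, and that the model weights can be downloaded; the function names, the score threshold of 0.8 and the helper for grouping the results are choices made for this illustration.

```python
from collections import defaultdict


def detect_layout(image_path):
    """Detect layout regions on one page image with a PubLayNet model.

    Needs layoutparser (Detectron2 backend) and OpenCV; they are imported
    here so that the helper below can be used without them.
    """
    import cv2
    import layoutparser as lp

    image = cv2.imread(image_path)[..., ::-1]  # OpenCV loads BGR, convert to RGB
    model = lp.Detectron2LayoutModel(
        "lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config",
        extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.8],
        label_map={0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"},
    )
    layout = model.detect(image)
    # Each detected block carries a type and (x1, y1, x2, y2) coordinates.
    return [(block.type, block.coordinates) for block in layout]


def group_by_type(blocks):
    """Group (type, coordinates) pairs by their layout type."""
    grouped = defaultdict(list)
    for block_type, coordinates in blocks:
        grouped[block_type].append(coordinates)
    return dict(grouped)
```

A call such as group_by_type(detect_layout("page.png")), where page.png is a placeholder file name, would then yield all text, title, list, table and figure regions of that page.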
Comparison and differentiation between a layout parser and LayoutParser
A layout parser, as a generic term, is an application that analyzes the structure and layout of documents. It identifies and classifies text blocks, tables, images and other elements within a document. The application fields range from data extraction and information retrieval to automated document processing.
The term LayoutParser refers to a Python-based document layout analysis tool. It provides functions for detecting and classifying text and non-text elements, segmenting pages, and creating layout diagrams. LayoutParser can be used in a variety of areas, including text mining, data visualization, and machine learning.
Practical use cases and examples for the use of these tools
Both tools can be used in automated document processing, for example, to extract information from a large number of documents quickly and efficiently. This can be of great benefit in areas such as accounting, human resources or customer management.
Another application area is data extraction and information retrieval. With these tools you can extract structured data from unstructured documents, which can be useful, for example, in scientific research or when creating reports and analyses.
In addition, these tools can be used in the areas of text mining and for preliminary data visualization. They can help prepare information in documents in such a way as to identify patterns and trends in large amounts of text. This can be useful in a variety of fields, from market analysis to social research.
What is DocLayNet?
DocLayNet is a human-annotated document layout segmentation dataset that contains 80,863 pages from six document categories, predominantly in English. This extensive dataset has been hand-annotated by well-trained experts, making it a gold standard in layout segmentation through human recognition and interpretation of each page layout.
DocLayNet provides page-by-page layout segmentation ground truth using bounding boxes for 11 different class labels on 80,863 unique pages from 6 document categories. It has some unique features compared to related work such as PubLayNet or DocBank:
- Human annotation: As mentioned above, DocLayNet has been annotated by hand by well-trained experts. This ensures very high accuracy in the annotations.
- Wide layout variability: DocLayNet contains diverse and complex layouts from a wide range of public sources in the fields of finance, science, patents, tenders, legal texts and manuals.
- Detailed label set: DocLayNet defines 11 class labels to distinguish layout features in high detail.
- Redundant annotations: A portion of the pages in DocLayNet are doubly or triply annotated, which makes it possible to estimate annotation uncertainty and provides an upper bound on the prediction accuracy achievable with ML models.
- Predefined training, testing and validation sets: DocLayNet provides fixed sets for each to ensure proportional representation of class labels and avoid leakage of unique layout styles across sets.
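The idea behind the redundant annotations can be illustrated with a small agreement measure: if two annotators label the same page, the share of regions on which they agree indicates how much uncertainty is inherent in the task. The sketch below is a deliberately simple stand-in and not the metric used by the DocLayNet authors; the function names, the IoU threshold of 0.5 and the sample boxes are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def agreement(annotator_a, annotator_b, iou_threshold=0.5):
    """Fraction of A's regions that B also marked with the same label.

    Each annotation is a (label, box) pair; every region of B can be
    matched at most once.
    """
    if not annotator_a:
        return 0.0
    unmatched = list(annotator_b)
    matches = 0
    for label, box in annotator_a:
        for candidate in unmatched:
            if candidate[0] == label and iou(box, candidate[1]) >= iou_threshold:
                unmatched.remove(candidate)
                matches += 1
                break
    return matches / len(annotator_a)


# Made-up annotations of the same page by two annotators.
first = [("Text", (0, 0, 100, 50)), ("Table", (0, 60, 100, 160))]
second = [("Text", (2, 0, 100, 52)), ("Picture", (0, 60, 100, 160))]
score = agreement(first, second)
```

Here the annotators agree on the text block but label the second region differently, so the agreement is 0.5. A model evaluated against one annotator can hardly be expected to score much higher than a second human annotator would.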
DocLayNet dataset details
The DocLayNet dataset is available on Hugging Face at ds4sd/DocLayNet.
The dataset contains four types of data resources: PNG images of all pages resized to square 1025 x 1025px, bounding box annotations in COCO format for each PNG image, individual PDF pages corresponding to each PNG image, and a JSON file corresponding to each PDF page providing the digital text cells with coordinates and content.
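To give an impression of the COCO annotation format, the following sketch parses a tiny hand-written sample and collects the labeled regions per page. The file name, category ids and box values are made up for illustration; only the general COCO structure (images, categories, annotations with a bbox given as x, y, width, height) is relied on.

```python
import json
from collections import Counter

# Tiny hand-written sample in COCO layout; all values are made up.
SAMPLE = json.loads("""
{
  "images": [{"id": 1, "file_name": "page_0001.png", "width": 1025, "height": 1025}],
  "categories": [{"id": 9, "name": "Table"}, {"id": 10, "name": "Text"}],
  "annotations": [
    {"id": 1, "image_id": 1, "category_id": 10, "bbox": [80, 100, 400, 60]},
    {"id": 2, "image_id": 1, "category_id": 10, "bbox": [80, 180, 400, 120]},
    {"id": 3, "image_id": 1, "category_id": 9, "bbox": [80, 320, 860, 300]}
  ]
}
""")


def to_corners(bbox):
    """Convert a COCO box [x, y, width, height] to (x1, y1, x2, y2)."""
    x, y, width, height = bbox
    return (x, y, x + width, y + height)


def regions_per_page(coco):
    """Map each image file name to its list of (label, corner box) pairs."""
    names = {c["id"]: c["name"] for c in coco["categories"]}
    files = {i["id"]: i["file_name"] for i in coco["images"]}
    pages = {file_name: [] for file_name in files.values()}
    for ann in coco["annotations"]:
        region = (names[ann["category_id"]], to_corners(ann["bbox"]))
        pages[files[ann["image_id"]]].append(region)
    return pages


pages = regions_per_page(SAMPLE)
label_counts = Counter(label for regions in pages.values() for label, _ in regions)
```

For a real annotation file, the sample string would be replaced by the content loaded with json.load; counting the labels in this way is a quick check of how the class labels are distributed.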
However, the DocLayNet data set has limitations. For example, the operating instructions shown are not part of the DocLayNet data set. If you would like to extend the data set, we offer the appropriate services and tools.
For more details about DocLayNet, including the structure of the dataset, the data format, and the COCO annotations, see the project's official readme.
For more technical details and a comprehensive analysis of DocLayNet, we refer to the related scientific paper: "DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis" (KDD 2022). This paper is available at the following link on ArXiv: https://arxiv.org/abs/2206.01062.
Models trained on DocLayNet are able to identify a variety of elements, including text blocks, headings, images, tables and other visual components. Their predictions can also be used to analyze the spatial relationships between these elements and to create a structured representation of the document layout.
Use of DocLayNet in Document Layout Analysis
In the context of Document Layout Analysis, DocLayNet has significant value. It makes it possible to train algorithms that better understand document layouts, which can significantly improve the efficiency and accuracy of data extraction and information retrieval.
Human Annotated Datasets as a Treasure Chest of Data
Human Annotated Datasets are a valuable resource in many areas of machine learning and artificial intelligence. They consist of raw data that has been reviewed by humans and enriched with additional information, the annotations. These annotations can include a variety of information, such as categories, labels, tags, or other descriptions that provide additional context or meaning to the data. Human annotated datasets often serve as training data for machine learning algorithms that aim to recognize patterns in data and make predictions.
Why Human Annotated Datasets are indispensable for Document Layout Analysis
Human Annotated Datasets play a crucial role in Document Layout Analysis. They enable machine learning algorithms to understand the complexity and diversity of document layouts and learn how to identify and interpret different elements within a document. Without these human annotated training datasets, it would be difficult for machine learning models to make accurate and reliable predictions.
Practical examples of the benefits of Human Annotated Datasets using FUNSD data
A good example of the usefulness of Human Annotated Datasets in Document Layout Analysis is the FUNSD (Form Understanding in Noisy Scanned Documents) dataset. This dataset consists of scanned forms that have been annotated by humans to identify elements such as headers, questions (field labels), answers (field values) and other text, together with the links between them.
By training with the FUNSD dataset, machine learning models can learn how to identify these elements in similar documents and how to interpret the relationships between them. In practice, this can be used, for example, in the automation of forms processing, where machine learning-based models analyze scanned forms, extract important information, and make this information available for further processing or analysis.
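As an illustration of how such annotations can be used, the following sketch extracts question and answer pairs from a FUNSD-style annotation. The sample form is made up and shortened (for example, the word-level entries are omitted); it assumes the usual structure in which every entity has an id, a text, a box, a label and a list of links given as pairs of entity ids.

```python
def question_answer_pairs(annotation):
    """Return (question text, answer text) pairs from a FUNSD-style form."""
    entities = {entity["id"]: entity for entity in annotation["form"]}
    pairs, seen = [], set()
    for entity in annotation["form"]:
        for source_id, target_id in entity["linking"]:
            # A link is usually listed on both of its entities, so skip repeats.
            if (source_id, target_id) in seen:
                continue
            seen.add((source_id, target_id))
            source, target = entities[source_id], entities[target_id]
            if source["label"] == "question" and target["label"] == "answer":
                pairs.append((source["text"], target["text"]))
    return pairs


# Made-up, shortened annotation of one scanned form.
SAMPLE_FORM = {
    "form": [
        {"id": 0, "text": "Name:", "box": [40, 100, 90, 115],
         "label": "question", "linking": [[0, 1]]},
        {"id": 1, "text": "Jane Doe", "box": [100, 100, 180, 115],
         "label": "answer", "linking": [[0, 1]]},
        {"id": 2, "text": "APPLICATION FORM", "box": [40, 40, 300, 70],
         "label": "header", "linking": []},
        {"id": 3, "text": "Date:", "box": [40, 130, 85, 145],
         "label": "question", "linking": [[3, 4]]},
        {"id": 4, "text": "05/12/1998", "box": [100, 130, 190, 145],
         "label": "answer", "linking": [[3, 4]]},
    ]
}
pairs = question_answer_pairs(SAMPLE_FORM)
```

A trained model would predict the labels and links for unseen forms; the same extraction step then turns its output into key-value data for further processing.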
This article gave an overview of Document Layout Analysis. It emphasized that the layout of a document plays an essential role in the interpretation of its text. Layout parsers in general and the LayoutParser library in particular were discussed, highlighting their specific features and applications. Furthermore, the DocLayNet dataset was presented, along with its properties and its relevance for Document Layout Analysis. Lastly, the crucial role of Human Annotated Datasets in Document Layout Analysis was discussed, focusing in particular on the FUNSD dataset.
Emerging Trends and Advances in Document Layout Analysis
There are notable trends and advances in Document Layout Analysis that are worth highlighting. Continued development in Artificial Intelligence and Machine Learning promises further improvements in document layout analysis. One can expect to see significant advances in the areas of automated document processing, text mining, and data visualization in particular. In addition, it is foreseeable that access to human annotated datasets will continue to increase, favoring the development and improvement of models for document layout analysis.
Concluding remarks and invitation to exchange
This article was intended to provide a detailed overview of the world of Document Layout Analysis. It can be seen that these technologies have the potential to fundamentally change the way document processing and analysis are performed.
We encourage you to share your thoughts, questions, or experiences with these technologies. Your insights are valuable in furthering the understanding and development of these technologies. We are interested in a factual and informative exchange.
We are happy to adapt the latest research to your use case and can create ready-made environments so that you can run Artificial Intelligence on your own servers or in your cloud.
Bakkali, S., Ming, Z., Coustaty, M., Rusiñol, M., & Terrades, O. R. (2022). VLCDoC: Vision-language contrastive pre-training model for cross-modal document classification. arXiv preprint arXiv:2205.12029.
Pfitzmann, B., Auer, C., Dolfi, M., Nassar, A. S., & Staar, P. (2022, August). DocLayNet: A large human-annotated dataset for document-layout segmentation. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (pp. 3743-3751).
Huang, Y., Lv, T., Cui, L., Lu, Y., & Wei, F. (2022, October). LayoutLMv3: Pre-training for document AI with unified text and image masking. In Proceedings of the 30th ACM International Conference on Multimedia (pp. 4083-4091).
Jaume, G., Ekenel, H. K., & Thiran, J. P. (2019, September). FUNSD: A dataset for form understanding in noisy scanned documents. In 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW) (Vol. 2, pp. 1-6).
Li, J., Xu, Y., Lv, T., Cui, L., Zhang, C., & Wei, F. (2022, October). DiT: Self-supervised pre-training for document image transformer. In Proceedings of the 30th ACM International Conference on Multimedia (pp. 3530-3539).
Shen, Z., Zhang, R., Dell, M., Lee, B. C. G., Carlson, J., & Li, W. (2021). Layout parser: A unified toolkit for deep learning based document image analysis. In Document Analysis and Recognition-ICDAR 2021: 16th International Conference, Lausanne, Switzerland, September 5-10, 2021, Proceedings, Part I 16. (pp. 131-146). Springer International Publishing.
Yu, Y., Li, Y., Zhang, C., Zhang, X., Guo, Z., Qin, X. & Wang, J. (2023). StrucTexTv2: Masked visual-textual prediction for document image pre-training. arXiv preprint arXiv:2303.00289.