Which annotation tool is best for annotating documents in your next Natural Language Processing project?
Document annotations pose challenges
Many annotation tools are available for free. A recent scientific article presents several dozen of them. We complement that article with the requirements that NLP annotation tools face in an enterprise context. In the second part, we take a practical look at some of the available tools.
Extract text from documents in different data formats
A Hacker News article describes the high complexity of PDF processing. The 700 comments below it reflect the interest in PDF documents as a data basis for NLP training. If you want to know more, "What's so hard about PDF text extraction?" gives a good overview. In short, it is difficult for enterprise users to access text in PDFs or images to train NLP models.
Annotate documents for NER and dependency parsing
Only an understanding of dependencies lets NLP add value in the business world. From a business point of view, it is often not enough to recognize a person's first or last name. The context of that person must be annotated and later learned by the NLP model. For example, it matters whether a first name belongs to the seller or to the buyer.
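To make the seller and buyer example concrete, here is a minimal sketch of how an annotation could store both the entity and its context. The class names, labels, and sample sentence are assumptions for illustration and do not reflect the data model of any tool discussed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive
    label: str  # entity type, e.g. "FIRST_NAME" (hypothetical label set)


@dataclass(frozen=True)
class Relation:
    head: Span   # the role the entity belongs to
    child: Span  # the entity itself
    label: str   # kind of dependency (hypothetical)


text = "Seller: Anna Schmidt. Buyer: Ben Weber."


def span_of(phrase: str, label: str) -> Span:
    """Build a span for the first occurrence of phrase in the sample text."""
    start = text.index(phrase)
    return Span(start, start + len(phrase), label)


seller = span_of("Seller", "ROLE")
buyer = span_of("Buyer", "ROLE")
anna = span_of("Anna", "FIRST_NAME")
ben = span_of("Ben", "FIRST_NAME")

# The relations carry the business context that plain NER labels lack.
relations = [
    Relation(seller, anna, "has_first_name"),
    Relation(buyer, ben, "has_first_name"),
]


def first_name_of(role: str) -> str:
    """Return the first name linked to the given role via a relation."""
    for rel in relations:
        head_text = text[rel.head.start:rel.head.end]
        if head_text == role and rel.label == "has_first_name":
            return text[rel.child.start:rel.child.end]
    raise KeyError(role)
```

Both names carry the same entity label. Only the relation tells a downstream model which one belongs to the seller.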
Annotate with humans in the loop
In companies, different people contribute to a high-quality data set. Even during data collection, different departments or persons usually provide data. During annotation, too, different users support the NLP experts in creating the NLP data. Experienced annotators should be able to review and revise the annotations of less experienced users. This process can improve data quality and accelerate organizational learning.
Automated document annotations
Once an expert has trained an NLP model, annotators should use it to save time. The model generates new annotations automatically, so annotators review suggestions instead of creating every annotation themselves. Automated annotations help annotators stay focused and annotate more raw data. Suggested annotations shorten raw data processing time, as humans correct incorrect annotations faster than they add missing ones. Even with less accurate models, Data Scientists can still create good datasets through manual correction.
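The suggest-then-review loop can be sketched in a few lines. This is an illustration under stated assumptions: a regular expression stands in for the trained model, and the `source` and `status` fields are hypothetical names, not the schema of any particular tool.

```python
import re
from dataclasses import dataclass


@dataclass
class Annotation:
    start: int
    end: int
    label: str
    source: str = "model"      # who created it: "model" or "human"
    status: str = "suggested"  # "suggested", "accepted", or "rejected"


# Stand-in for a trained model: proposes anything shaped like an ISO date.
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def suggest(text: str) -> list[Annotation]:
    """Create automatic annotations that still need human review."""
    return [Annotation(m.start(), m.end(), "DATE")
            for m in DATE_PATTERN.finditer(text)]


def review(annotations: list[Annotation], accept) -> list[Annotation]:
    """Apply the reviewer's decision to each suggestion; return accepted ones."""
    for ann in annotations:
        ann.status = "accepted" if accept(ann) else "rejected"
    return [a for a in annotations if a.status == "accepted"]


text = "Invoice dated 2021-03-15, order number 1999-12-99."
suggestions = suggest(text)


def plausible(ann: Annotation) -> bool:
    """Simulated reviewer: rejects dates with an impossible month or day."""
    month = int(text[ann.start + 5:ann.start + 7])
    day = int(text[ann.start + 8:ann.end])
    return 1 <= month <= 12 and 1 <= day <= 31


accepted = review(suggestions, plausible)
```

Because every annotation records its source and review status, automatically created annotations can later be filtered from reviewed ones.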
Get bounding boxes from text data
Unlike in tweets, the position of text within a business document carries information. For example, contact phone numbers are usually listed in the upper right corner. The annotation tool should be able to convert any text sequence into a bounding box and page number. Such visual features complement the NLP features and can increase the accuracy of the model.
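A minimal sketch of that conversion, assuming the text extraction step already yields one box per character. The `CharBox` structure and the coordinates are hypothetical; real PDF or OCR layers expose this information in their own formats.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CharBox:
    char: str
    page: int
    x0: float  # left
    y0: float  # top
    x1: float  # right
    y1: float  # bottom


def span_to_bbox(chars: list[CharBox], start: int, end: int) -> dict:
    """Return the page and enclosing box of the characters in [start, end)."""
    selected = chars[start:end]
    if not selected:
        raise ValueError("empty span")
    if len({c.page for c in selected}) != 1:
        raise ValueError("span crosses a page break")
    return {
        "page": selected[0].page,
        "x0": min(c.x0 for c in selected),
        "y0": min(c.y0 for c in selected),
        "x1": max(c.x1 for c in selected),
        "y1": max(c.y1 for c in selected),
    }


# Hypothetical layout: one line on page 1, each character 6 units wide.
line = "Tel: 0123 456"
chars = [
    CharBox(ch, 1, 400.0 + 6 * i, 50.0, 406.0 + 6 * i, 60.0)
    for i, ch in enumerate(line)
]

start = line.index("0123")
bbox = span_to_bbox(chars, start, len(line))
```

A span that wraps across several lines would produce one large enclosing box here. Splitting such a span into one box per line is a likely refinement.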
Free document annotation tools
The following tools are free, browser-based, and installable. These free annotation tools have already proven their value to many Data Scientists. At Konfuzio, we have great respect for the developers who created these tools. However, we will still challenge these tools based on the needs of enterprise users. We have tested all the tools after installation and configuration.
The tool brat is browser-based and allows the annotation of text files. It highlights relations between annotations. The setup makes it possible to label highly complex annotations within the text. When a text file is uploaded, the text loses its formatting, at least in the interface. Also, annotating text across more than one line often leads to errors. brat is available for download under the MIT license on its own homepage.
Doccano is a browser-based annotation tool for categorizing, translating, and annotating sequences. The setup via Docker allows easy deployment. Doccano is available on GitHub under the MIT license. Currently, only text files can be annotated, and individual annotations cannot be grouped. Annotations can be added automatically via the API. Unfortunately, users cannot filter for automated vs. revised annotations, which makes manual control of automatically created annotations almost impossible.
As the successor to WebAnno, INCEpTION offers a sophisticated but complex solution. The tool, which originates from scientific research, offers documentation and a live demo. To host confidential data, the application can be set up on a dedicated server. INCEpTION uses the open-source Apache License v2.0. Annotating PDFs seems to be possible via PDF.js. Unfortunately, the text conversion in the PDF viewer loses the layout of the text. According to the documentation, annotations cannot be created automatically.
This tool focuses on annotating PDFs and provides a web interface for that purpose. Only one user at a time can create annotations in a document. Collaboration with others is only possible by importing or exporting the data. The tool relies on PDF.js to render the PDF. Since PDF.js loads the entire PDF before editing can start, annotating larger PDFs results in long loading times. The GitHub project (MIT license) is archived.
You can find all tools for annotations at this link.
Annotate documents with Konfuzio
Free annotation tools are great and mostly focus on the individual end user, e.g. a Data Scientist working alone on an NLP project. In an enterprise context, these tools cover the requirements only unevenly. This led us to develop Konfuzio in 2018. Our goal is to enable companies to build NLP models quickly, collaboratively, and on any data source. We are happy to take reviews for other tools as well. Feel free to contact us via [email protected]. Our tool for annotations in documents combines the visual layer and the text. However, this annotation tool for NLP models is only a small part of our AI Software Studio.