Free NLP Document Annotation Tools in 2021

Team Konfuzio

Which annotation tool is the best fit for annotating documents in your next Natural Language Processing (NLP) project?

Document annotations pose challenges

Many annotation tools are available for free. A recent scientific article presents several dozen of them. We complement that article with the requirements that enterprises place on NLP annotation tools. In the second section, we take a practical look at some of the available tools.

Extract text from documents in different data formats

A Hacker News post describes the high complexity of PDF processing. Its 700 comments reflect the interest in PDF documents as a data source for NLP training. The article "What's so hard about PDF text extraction?" gives a good overview. In summary, it is difficult for enterprise users to access the text in PDFs or images in order to train NLP models.

Annotate documents for NER and dependency parsing

In the business world, NLP adds value only when it understands dependencies. From a business point of view, it is often not enough to recognize the first name or last name of a person. The context of this person must be annotated so that the NLP model can learn it later. For example, it matters whether a first name belongs to the seller or to the buyer.
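The difference between a bare entity and an entity in context can be sketched in a few lines of Python. This is a minimal illustration, not the data model of any tool discussed here: the names Span, Relation, span_of and first_name_of, the labels, and the example sentence are all assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str  # entity type, e.g. "FIRST_NAME" or "ROLE"


@dataclass(frozen=True)
class Relation:
    head: Span  # the context, here the role of a party
    tail: Span  # the entity that belongs to that context
    label: str  # relation type


# Hypothetical example sentence with two first names in different roles.
TEXT = "Seller: Anna Schmidt. Buyer: Ben Maier."


def span_of(substring: str, label: str) -> Span:
    """Locate the first occurrence of substring in TEXT and label it."""
    start = TEXT.index(substring)
    return Span(start, start + len(substring), label)


def covered_text(span: Span) -> str:
    """Return the characters of TEXT that the span covers."""
    return TEXT[span.start:span.end]


# Each relation ties a first name to the role it belongs to.
relations = [
    Relation(span_of("Seller", "ROLE"), span_of("Anna", "FIRST_NAME"), "has_first_name"),
    Relation(span_of("Buyer", "ROLE"), span_of("Ben", "FIRST_NAME"), "has_first_name"),
]


def first_name_of(role: str) -> str:
    """Return the first name linked to the given role via a relation."""
    for relation in relations:
        if relation.label == "has_first_name" and covered_text(relation.head) == role:
            return covered_text(relation.tail)
    raise KeyError(role)


print(first_name_of("Seller"))  # the name linked to the seller
print(first_name_of("Buyer"))   # the name linked to the buyer
```

Both names carry the same entity label. Only the relation tells them apart, which is why an annotation tool needs to capture relations and not just spans.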

Annotate with humans in the loop 

In companies, different people contribute to a high-quality data set. Even during data collection, different departments or persons usually provide data. During annotation, too, different users support the NLP experts in creating the NLP data. Experienced annotators need to be able to review and revise the annotations of less experienced users. This process improves data quality and accelerates organizational learning.

Automated document annotations

Once an expert has trained an NLP model, annotators should use it to save time. The model generates new annotations automatically, so annotators review suggestions instead of creating every annotation themselves. Automated annotations help annotators stay focused and annotate more raw data. Suggested annotations shorten processing time, because humans correct incorrect annotations faster than they add missing ones. Even when the model is still inaccurate, Data Scientists can help create a good dataset by correcting its suggestions manually.
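The suggest-then-review loop can be sketched as follows. This is a minimal sketch under stated assumptions: a regular expression stands in for the trained model, and the names Annotation, suggest and review, as well as the status values, are hypothetical and not taken from any of the tools below.

```python
import re
from dataclasses import dataclass


@dataclass
class Annotation:
    start: int
    end: int
    label: str
    source: str = "model"      # who created it: "model" or "human"
    status: str = "suggested"  # "suggested", "accepted" or "rejected"


# Stand-in for a trained NLP model (assumption): a pattern for ISO-like dates.
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def suggest(text: str) -> list:
    """Generate annotations automatically; every one starts as a suggestion."""
    return [Annotation(m.start(), m.end(), "DATE") for m in DATE_PATTERN.finditer(text)]


def review(annotation: Annotation, accept: bool) -> Annotation:
    """Record the human decision on a suggested annotation."""
    annotation.status = "accepted" if accept else "rejected"
    return annotation


text = "Invoice dated 2021-03-15, order number 2020-11-30."
suggestions = suggest(text)

# A human reviewer confirms the real date and rejects the order number.
review(suggestions[0], accept=True)
review(suggestions[1], accept=False)

# Only reviewed and accepted annotations go into the training data.
training_data = [a for a in suggestions if a.status == "accepted"]
print([text[a.start:a.end] for a in training_data])
```

Keeping source and status on every annotation is what makes it possible to filter automated annotations from revised ones later.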

Get bounding boxes from text data

Unlike in a tweet, the position of text within a business document carries information. For example, contact phone numbers are usually listed in the upper right corner. The annotation tool should be able to convert any text sequence into a bounding box and page number. Visual features complement the NLP features and can increase the accuracy of the model.
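Converting a text sequence into a bounding box can be sketched like this, assuming that the text extraction step (OCR or a PDF parser) delivers one box per character. The names CharBox, BoundingBox and span_to_bbox and the coordinates are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CharBox:
    char: str
    page: int
    x0: float  # left
    y0: float  # top
    x1: float  # right
    y1: float  # bottom


@dataclass(frozen=True)
class BoundingBox:
    page: int
    x0: float
    y0: float
    x1: float
    y1: float


def span_to_bbox(chars: list, start: int, end: int) -> BoundingBox:
    """Return the smallest box enclosing the characters chars[start:end]."""
    selected = chars[start:end]
    if not selected:
        raise ValueError("empty span")
    if len({c.page for c in selected}) != 1:
        raise ValueError("span crosses a page boundary")
    return BoundingBox(
        page=selected[0].page,
        x0=min(c.x0 for c in selected),
        y0=min(c.y0 for c in selected),
        x1=max(c.x1 for c in selected),
        y1=max(c.y1 for c in selected),
    )


# Hypothetical layout: one line near the upper right corner of page 1,
# every character 10 units wide and 12 units high.
line = "Tel: 0123"
chars = [
    CharBox(ch, 1, 400.0 + 10.0 * i, 20.0, 400.0 + 10.0 * (i + 1), 32.0)
    for i, ch in enumerate(line)
]

start = line.index("0123")
bbox = span_to_bbox(chars, start, start + len("0123"))
print(bbox)
```

A span that wraps over several lines would yield one large box covering unrelated text, so a real tool would more likely return one box per line.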

Free document annotation tools

The following tools are free, browser-based, and installable. These free annotation tools have already proven their value to many Data Scientists. At Konfuzio, we have great respect for the developers who created these tools. However, we will still challenge these tools based on the needs of enterprise users. We have tested all the tools after installation and configuration.

brat

The tool brat is browser-based and allows the annotation of text files. It highlights relations between annotations. The setup allows complex annotations to be labeled within the text. When a text file is uploaded, the text loses its formatting, at least in the interface. Also, annotating text across more than one line often leads to errors. brat is available for download under the MIT license on its own homepage.

Doccano

Doccano is a browser-based annotation tool for categorizing, translating, and annotating sequences. The setup via Docker allows easy deployment. Doccano is available on GitHub under the MIT license. Currently, only text files can be annotated. It is not possible to group individual annotations. Annotations can be added automatically via the API. Unfortunately, users cannot filter for automated vs. revised annotations, which makes manual control of automatically created annotations almost impossible.

INCEpTION

As the successor to WebAnno, INCEpTION offers a sophisticated but complex solution. The tool, which originates from scientific research, provides documentation and a live demo. To host confidential data, the application can be set up on a dedicated server. INCEpTION uses the open-source Apache License v2.0. Editing PDFs seems to be possible via PDF.js. Unfortunately, the text conversion in the PDF viewer loses the layout of the text. According to the documentation, annotations cannot be created automatically.

PDFAnno

PDFAnno focuses on annotating PDFs and provides a web interface for this purpose. Only one user can create annotations in a document at a time. Collaboration with others is only possible by importing or exporting the data. The tool relies on PDF.js to render the PDF. Since PDF.js loads the entire PDF before editing can start, annotating larger PDFs results in long loading times. The GitHub project (MIT license) is archived.

You can find all annotation tools at this link.

Annotate documents with Konfuzio

Free annotation tools are great, but they mostly focus on the individual end user, e.g. a Data Scientist working alone on an NLP project. In an enterprise context, these tools cover the requirements only unevenly. This led us to develop Konfuzio in 2018. Our goal is to enable companies to build NLP models quickly, collaboratively, and on any data source. We are happy to add reviews of other tools as well. Feel free to contact us via [email protected]. Our tool for annotations in documents combines the visual layer and the text. However, this annotation tool for NLP models is only a small part of our AI Software Studio.

Functions compared across brat, Doccano, INCEpTION, PDFAnno and Konfuzio: data formats, context, team-first, automation, and visual characteristics.
Create annotations in documents, images and in text with these tools.
