A recurring topic in machine learning (ML), whenever performance improvements are sought, is the cost of preparing data. Generating high-quality data to raise prediction accuracy is usually difficult, expensive, or simply not possible. In addition, data is often treated as less important than the AI model itself: ML engineers tend to modify the code of a model rather than clean up the data set. Yet examining the data matters especially for AI applications, because such a system is built on both code and data, whereas traditional software is powered by code alone. Although it is commonly assumed that 80% of the work in machine learning goes into data cleaning, ensuring data quality (the data-centric view) is still not seen as being as important as working on the model (the model-centric view).
But why is no one using a data-centric approach?
A number of factors play a role here. On the one hand, a data-centric approach is seen as time-consuming and tedious; on the other hand, it is difficult to define what exactly constitutes a good data set. However, working on a good data set will only become more important, as the amount of unlabeled data keeps growing due to ever cheaper ways of collecting and storing it. Good models require good data, and good data comes at a price: “(...) annotated data is hard and expensive to obtain, notably in specialized domains where only experts whose time is scarce and precious can provide reliable labels.” In reaction to this situation, researchers have started to raise awareness of the gap between data acquisition and model building. This is exactly where the discipline of active learning comes in.
How does active learning contribute to getting better data?
Active learning integrates human knowledge into machine learning: it significantly reduces data requirements while improving the model’s predictions. “It aims to select the most useful samples from the unlabeled dataset and hand it over to the oracle (e.g., human annotator) for labeling, so as to reduce the cost of labeling as much as possible while still maintaining performance.” Active learning is especially useful when the amount of data is too large to be labeled in full, or when labeling effort has to be prioritized intelligently. Natural language processing (NLP) is one of the most popular areas where active learning comes into play, mainly because NLP applications require large amounts of labeled data, which makes labeling very costly. Using active learning in NLP reduces the amount of data an expert has to label during model training: the model labels data on its own and asks for feedback only when it is not sufficiently sure of the right label.
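The selection idea in the quote above can be sketched with a simple least-confidence score: the lower the model's top class probability, the more useful the sample is for the oracle. This is a minimal illustration; the function names and the example probabilities are made up for the sketch.

```python
import numpy as np

def least_confidence_scores(probs):
    """Uncertainty per sample: 1 minus the highest predicted class probability.
    Higher scores mean the model is less sure and human feedback is more valuable."""
    return 1.0 - probs.max(axis=1)

def select_for_labeling(probs, k):
    """Return the indices of the k most uncertain samples to hand to the oracle."""
    scores = least_confidence_scores(probs)
    return np.argsort(scores)[-k:][::-1]

# Example: three unlabeled samples, two classes
probs = np.array([[0.95, 0.05],   # confident -> keep the model's own label
                  [0.55, 0.45],   # uncertain -> ask the human annotator
                  [0.80, 0.20]])
print(select_for_labeling(probs, 1))  # -> [1]
```

Other informativeness scores (margin sampling, entropy) plug into the same selection step without changing the loop around it.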
Research papers distinguish three types of active learning. The most commonly used, pool-based sampling, ranks all instances in the unlabeled pool by an informativeness score and queries the human for labels on the top-ranked ones. Stream-based selective sampling evaluates the informativeness of each unlabeled data point one at a time and decides whether to query the human or to assign a label itself. Membership query synthesis is a method where the model generates its own instances from an underlying natural distribution. Whatever the prioritization strategy, active learning on an unlabeled data set follows the same general steps. First, a very small subsample of the data is labeled manually, and the model is trained on it. The model then predicts the classes of the unlabeled data points, computes a score from each prediction, and passes the data points that need labeling to the human. This training regime is a form of semi-supervised learning: only a small amount of labeled data is available alongside a large amount of unlabeled data. After receiving the human feedback, the model is retrained on the enlarged labeled data set, improving its performance with each cycle.
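The general steps above can be sketched as a minimal pool-based loop with scikit-learn. This is an illustration under simplified assumptions: synthetic data, a logistic regression model, and the ground-truth labels standing in for the human oracle.

```python
# Minimal pool-based active learning loop (a sketch; in a real setting the
# "oracle" is a human annotator, not the ground-truth array).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X, y_true = make_classification(n_samples=500, n_features=10, random_state=0)

# Step 1: a very small manually labeled seed set
labeled = list(rng.choice(len(X), size=10, replace=False))
pool = [i for i in range(len(X)) if i not in labeled]

model = LogisticRegression(max_iter=1000)
for _ in range(5):  # five query cycles
    # Step 2: train on what is labeled so far
    model.fit(X[labeled], y_true[labeled])
    # Step 3: score the unlabeled pool by least confidence
    probs = model.predict_proba(X[pool])
    uncertainty = 1.0 - probs.max(axis=1)
    # Step 4: "query the oracle" for the 10 most uncertain samples
    queried = np.argsort(uncertainty)[-10:]
    for idx in sorted(queried, reverse=True):  # pop high indices first
        labeled.append(pool.pop(idx))

print(f"labeled set size: {len(labeled)}")  # 10 seed + 5 cycles * 10 = 60
```

Each cycle enlarges the labeled set with exactly the samples the current model is least sure about, which is what makes the retraining data-efficient.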
How we integrated active learning to get better prediction performance
In the following part, we explain how we implemented the active learning approach in the AI platform Konfuzio. We use stream-based selective sampling, but extend it for our application, which calls for a data-centric approach suited to continuous learning. Our goal is not to train once on a big dataset and use the resulting model unchanged, but to start with a small set of data and continuously increase prediction performance through retraining. A related method is incremental learning, which aims at extending the model’s knowledge over time as new training data becomes available. Thus, our active learning approach spans a longer period of time and takes place regularly. So, what are the steps for improving outcomes with active learning in a new project? Below you can find a visual representation of the process and a description of each step.
Active learning takes place within the “feedback” step, where interaction with the user is needed to improve the model’s predictions. This is a “human-in-the-loop” implementation that combines our machine learning model with human interaction. It allows us to retrain the model frequently and quickly, because we benefit from continuous human feedback. Through this approach, our goal of regularly increasing prediction performance from a data-centric perspective can be fulfilled over time, as shown in the graph below.
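Frequent retraining of this kind can be sketched with scikit-learn's `partial_fit` interface, which updates a model on new batches without restarting from scratch. This is a minimal illustration of the incremental learning idea, not the Konfuzio implementation; the data and model choice are placeholders.

```python
# Sketch of incremental retraining: each feedback cycle delivers a small batch
# of newly labeled samples and the model is updated in place.
import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # all classes must be declared up front for partial_fit

rng = np.random.default_rng(0)
for _ in range(3):  # three feedback cycles, 20 newly labeled samples each
    X_batch = rng.normal(size=(20, 5))
    y_batch = (X_batch[:, 0] > 0).astype(int)  # placeholder labels
    model.partial_fit(X_batch, y_batch, classes=classes)

# The updated model serves predictions between cycles
print(model.predict(np.zeros((1, 5))).shape)
```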
How does Active Learning work in practice?
Active Learning in practice using the example of documents
- Select problem
The user selects a use case for the processing of documents by AI. A clear definition of labels to be extracted is the starting point for the AI active learning process.
- Collect data
The user uploads files to the Konfuzio platform consisting of documents for training and validation as well as documents for testing (e.g. in a ratio of 70/30). The training data then has to be labeled, either manually by the user or pre-labeled with an open-source AI model.
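The 70/30 split mentioned above could look like this (a sketch with scikit-learn; the document IDs are placeholders, not the platform's data model):

```python
# Split uploaded documents into a training/validation set and a test set.
from sklearn.model_selection import train_test_split

document_ids = list(range(100))  # placeholder for uploaded document IDs
train_ids, test_ids = train_test_split(document_ids, test_size=0.3, random_state=42)
print(len(train_ids), len(test_ids))  # 70 30
```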
- Check data
The consistency of the labeled training data is checked and, if necessary, corrected or extended. In a separate technical article, we show how we control the quality of our training data by summarizing it in an automated way. Read this article.
- Train AI
Automated training on the labeled training data is then initiated, and the best model architecture for the specific use case is selected.
- Test AI
After training, the trained AI is automatically tested on the uploaded test dataset.
The resulting report shows whether the “new” AI model performs better on the test dataset than the previous model. If not, the previous model continues to be used for the next step.
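The comparison step can be sketched as a champion/challenger check: the newly trained model is deployed only if it beats the previous one on the held-out test set. The function and class names below are illustrative, not Konfuzio internals.

```python
# Keep whichever model scores higher on the test set (ties keep the old one).
from sklearn.metrics import accuracy_score

def pick_deployment_model(previous_model, new_model, X_test, y_test):
    """Return the model with the higher test accuracy."""
    new_score = accuracy_score(y_test, new_model.predict(X_test))
    old_score = accuracy_score(y_test, previous_model.predict(X_test))
    return new_model if new_score > old_score else previous_model

class ConstantModel:
    """Toy stand-in for a trained model that always predicts one class."""
    def __init__(self, value):
        self.value = value
    def predict(self, X):
        return [self.value] * len(X)

X_test, y_test = [[0], [1], [2]], [1, 1, 0]
old, new = ConstantModel(0), ConstantModel(1)
chosen = pick_deployment_model(old, new, X_test, y_test)
print(chosen is new)  # the new model wins: accuracy 2/3 vs 1/3
```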
- Deployed AI
If the “new” model is better than the previous one, Konfuzio updates the API to serve the latest trained AI.
- Real World
Now new "real world" data/documents come into play and are processed with the deployed AI, which calculates predictions for the new labels. Documents can be processed directly via the API, or the Python SDK can be used to access the deployed AI.
All detected data in a document is displayed to the user. The user interface lets the user give feedback on the predictions by displaying them visually on the respective document: a correct label can be confirmed by clicking a green check mark, and an incorrect label discarded by clicking a red X. Additionally, the user can add labels that were missed. Human feedback should preferentially be given for documents that could not be extracted with the automatically generated rules (see step 3).
- Collect data
The user’s feedback is then fed back into the dataset in the repeated “collect data” step. An accepted label merely confirms correctness; rejecting or adding a label increases prediction accuracy through preprocessing and retraining of the model. The “real world” data from before is thus inserted into the training loop.
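Folding the three feedback actions back into the training set can be sketched as follows. The data structures and action names are illustrative, not the Konfuzio format: confirmed predictions keep their label, rejected ones are discarded, and added labels enter the set for the next retraining cycle.

```python
# Merge human feedback into the labeled dataset for the next "collect data" step.
def apply_feedback(dataset, feedback):
    """dataset: {doc_id: label}; feedback: list of (doc_id, action, label)."""
    for doc_id, action, label in feedback:
        if action in ("confirm", "correct", "add"):
            dataset[doc_id] = label       # keep or set the (corrected) label
        elif action == "reject":
            dataset.pop(doc_id, None)     # discard the wrong label
    return dataset

data = {"doc1": "invoice_date", "doc2": "amount"}
feedback = [("doc1", "confirm", "invoice_date"),  # green check mark
            ("doc2", "reject", None),             # red X
            ("doc3", "add", "iban")]              # missed label added by the user
print(apply_feedback(data, feedback))  # {'doc1': 'invoice_date', 'doc3': 'iban'}
```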
Active learning helps to continuously improve the predictions of AI and machine learning models. The benefits of incorporating an active learning approach include reduced costs, a small data set sufficing for the first release of an AI, increased reliability, and ever-increasing robustness through continuous testing of the AI’s accuracy. The shift to data-centric development makes AI accessible to more user teams and many more use cases. Especially for models with smaller data sets, data quality significantly improves AI performance.
For data scientists, we also offer a Python SDK in addition to the web interface.
Human-in-the-Loop in a Colab Notebook
Sagar, R. (2021). Big Data to Good Data: Andrew Ng Urges ML Community to Be More Data-Centric and Less Model-Centric. Analytics India Magazine.
Konyushkova, K., Sznitman, R., & Fua, P. (2017). Learning Active Learning from Data. arXiv preprint arXiv:1703.03365.
Ren, P., Xiao, Y., Chang, X., Huang, P.-Y., Li, Z., Chen, X., & Wang, X. (2020). A Survey of Deep Active Learning. arXiv preprint arXiv:2009.00236.
Castro, F. M., Marín-Jiménez, M. J., Guil, N., Schmid, C., & Alahari, K. (2018). End-to-End Incremental Learning. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 233-248).
Photo by Andrea Piacquadio from Pexels