Adaptive Boosting - From weak learners to top decision-makers


Not every machine learning model delivers satisfactory results straight away. Although training and implementation can often be carried out quickly, the price is usually a corresponding lack of accuracy. This can lead to errors, particularly in predictive data analysis. Methods are therefore needed that retain the simplicity and generalizability of these "weak learners" while delivering higher performance. A widespread approach is to combine several models with each other; this is referred to as Ensemble Learning.

The individual approaches from the AI bag of tricks fall mainly into the categories of Bagging, Stacking and Boosting, each of which comprises different learning methods. What they all have in common is that they combine several models into "strong learners", which together significantly reduce the probability of error, much like swarm intelligence.

Compared to the other ensemble methods, Boosting is characterized by the fact that the individual models are trained one after the other and typically use weak learners with high bias and low variance. Adaptive Boosting (AdaBoost for short) is particularly popular and powerful, promising high accuracy for classifiers.

Adaptive Boosting - definition and basics


AdaBoost is an ensemble learning algorithm that relies on the sequential training of several Decision Trees or similar classifiers. The technique therefore has similarities with Random Forest, in which the models learn in parallel through bagging. AdaBoost, on the other hand, trains the individual weak learners one after the other, as is usual with Boosting. These can be kept relatively simple - right up to Decision Stumps, which only predict binary classes due to their one-level structure. AdaBoost relies on supervised learning, which requires prepared, labeled training data.

Learning from data, minimizing errors and making well-founded decisions are the basic principles of machine learning.

Here, this is achieved by training each classifier on the same data, except that the errors made in each round are weighted more heavily for the next one. In this sequential order, each weak learner learns from the inaccuracies of its predecessor, so that the final classification, decision or prediction is as accurate as possible. Of particular importance is the iterative repetition of this process until the desired prediction accuracy is achieved.
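
To make the contrast between a single weak learner and a boosted ensemble concrete, here is a minimal sketch using scikit-learn; the synthetic data set and parameter values are illustrative assumptions, not part of the original article.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (illustrative assumption).
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# A single one-level tree (decision stump) acts as the weak learner.
stump = DecisionTreeClassifier(max_depth=1).fit(X_train, y_train)

# AdaBoost chains many such stumps, each focusing on its predecessor's errors.
ensemble = AdaBoostClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

print("Single stump accuracy:     ", stump.score(X_test, y_test))
print("AdaBoost ensemble accuracy:", ensemble.score(X_test, y_test))
```

Typically, the boosted ensemble scores noticeably higher on the held-out test data than the single stump.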

How AdaBoost works

As usual in machine learning, the adaptation of weights plays a key role in modeling the learning process. By chaining several models together, AdaBoost extends this principle and gives the individual classifiers different relevance for the overall prediction; this amounts to a weighted vote. To achieve the optimum weighting, a few individual steps must first be carried out. A prerequisite is a suitably prepared data set.

  1. Set up basic algorithm

    Initially, a suitable weak learner is required that can provide simple predictions about data points. Decision stumps - single-level decision trees - are particularly common. They split the data set at a single decision node. Comparing the Gini impurity of candidate splits on individual features helps to identify the stump with the best performance. This stump forms the base model of the ensemble, which is then followed by others.

  2. Error calculation

    Weights are then introduced for each instance of the data set so that samples can be drawn and the correctness of the classified target variable can be checked in each case. The misclassified data points identified in this way can then be given a higher weight in the subsequent process.

  3. Weighting of the classifier

    Based on the error rate, AdaBoost can calculate the new significance of individual classifiers for the ensemble model. The higher the accuracy or the fewer the errors, the higher the "say" in the overall prediction. This also forms an important mathematical basis for the following step.

  4. Update sampling weights

    The originally equal sample weights are now also adjusted based on the data points the first classifier got wrong. These are given greater consideration in the subsequent rounds, while correctly predicted samples receive a lower weight. This intensifies the training effect and progressively minimizes the error function.

  5. Iteration

    Using a classifier optimized by this error minimization and/or a data set resampled according to the updated sample weights, the steps can be repeated several times. This yields increasingly effective models of higher accuracy, which are finally aggregated into an ensemble model, taking their weights - that is, their say - into account. AdaBoost can then apply this ensemble to new data and thus classify previously unknown data points. A minimal code sketch of these five steps follows below.
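
The following is a hypothetical from-scratch sketch of the five steps, assuming scikit-learn decision stumps as weak learners and class labels of -1 and +1; the function and variable names are illustrative only.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=50):
    """Train AdaBoost with decision stumps; y must contain labels -1 and +1."""
    y = np.asarray(y)
    n_samples = X.shape[0]
    weights = np.full(n_samples, 1.0 / n_samples)   # equal initial sample weights
    stumps, says = [], []

    for _ in range(n_rounds):
        # Step 1: train a decision stump on the currently weighted data set.
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=weights)
        pred = stump.predict(X)

        # Step 2: weighted error = sum of sample weights of misclassified points.
        err = np.clip(weights[pred != y].sum(), 1e-10, 1 - 1e-10)

        # Step 3: amount of say of this classifier.
        say = 0.5 * np.log((1 - err) / err)

        # Step 4: raise weights of errors, lower weights of correct predictions.
        weights *= np.exp(-say * y * pred)
        weights /= weights.sum()                    # renormalize to a distribution

        stumps.append(stump)
        says.append(say)
    return stumps, says

def adaboost_predict(stumps, says, X):
    # Step 5: weighted vote of all weak learners according to their say.
    scores = sum(say * stump.predict(X) for stump, say in zip(stumps, says))
    return np.sign(scores)
```

In practice, ready-made implementations such as scikit-learn's AdaBoostClassifier cover these steps, including multi-class extensions.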

AdaBoost scheme for weak classifiers: this diagram briefly explains how Adaptive Boosting works. Source: Wikipedia

Example according to scheme

Problem Statement: In this example, the AI models are represented as an ensemble of models {h1, h2, h3, h4} and their predictions as a set of rectangles. The size of the rectangles (predictions) is proportional to their corresponding weights (i.e. their importance).

This is how AdaBoost works in this example: h1 makes a series of predictions. The errors of h1 (by errors we mean the predictions that deviate from the ground truth) are given greater weight. h2 must then concentrate more on correctly predicting the errors of h1, and so on for h3, h4, ...

In the end, there are 4 AI models that correct each other's errors. The final ensemble classifier is called "h".

Mathematical background

The mathematical difficulty behind AdaBoost lies primarily in the adjustment of the weights, which are initially set to 1/N for all N data points.

Initial sample weight: w_i = 1/N

The value corresponds to the probability that this data point will be drawn for a sample. This is what creates the training set used for each iteration.

An exponential loss function is used to calculate the error; it is convex and grows exponentially for negative arguments, i.e. for misclassifications.

Exponential loss: L(y_i, C_m(x_i)) = e^(-y_i · C_m(x_i))
y_i: true target value of data point x_i
e: Euler's number
C_m(x_i): prediction of classifier m for data point x_i

It models the difference between predicted and true target values and thus reflects the inaccuracy of the model. The aim of machine learning in general, and Boosting in particular, is always to minimize this function. The resulting error - the sum of the sample weights of all incorrectly predicted data points - also determines the prioritization of the individual decision trees, i.e. their say in the overall model. This error takes values between 0 and 1.

Amount of say: say_m = ½ · ln((1 - error_m) / error_m)
The share of a weak learner in the overall prediction thus depends on its susceptibility to errors: the lower its weighted error, the greater its say.

The reweighting of the samples is also based on detected (in)accuracies, which, however, follow directly from incorrect predictions rather than from complex calculations. For each incorrectly classified sample, AdaBoost multiplies its weight by Euler's number raised to the power of the classifier's say: New Sample Weight = Old Sample Weight × e^(say). For each correctly classified sample, the negative say is used instead: New Sample Weight = Old Sample Weight × e^(-say).

Example: The say of a decision stump is 0.96. With a total of 8 samples, each is initially weighted 1/N = 1/8 = 0.125. One sample was classified incorrectly; its new weight is therefore 0.125 × e^(0.96) ≈ 0.33, which is higher than before.

For a correctly classified sample, on the other hand, the negative say results in a new weight of about 0.125 × e^(-0.96) ≈ 0.05. In this way, AdaBoost learns from the mistakes it makes, while correct predictions are increasingly downweighted.
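
A small numeric check of this worked example, assuming the values from the text (8 samples, one of them misclassified):

```python
import numpy as np

n_samples = 8
initial_weight = 1 / n_samples                       # 1/N = 0.125 per sample
error = initial_weight                               # one misclassified sample
say = 0.5 * np.log((1 - error) / error)              # roughly 0.96-0.97

new_weight_wrong = initial_weight * np.exp(say)      # misclassified sample grows
new_weight_correct = initial_weight * np.exp(-say)   # correct sample shrinks

print(round(say, 2), round(new_weight_wrong, 2), round(new_weight_correct, 2))
# prints approximately: 0.97 0.33 0.05
```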

AdaBoost variants

There are subtle differences between different variants of AdaBoost due to the individual mathematical approach.

  • Real AdaBoost is characterized by the use of decision trees with class probability as output and the application of the weighted least squares error.
  • Gentle AdaBoost is a further development that uses a limited step size to regulate the oscillations of the algorithm.
  • AdaBoostRegressor, also known as AdaBoost.R2, applies several regressors to the original data set, which are adjusted after each error calculation (see the sketch after this list).
  • Logitboost differs in that the logistic loss is minimized instead of the exponential loss.
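
As a hedged illustration of the regression variant listed above, the following sketch uses scikit-learn's AdaBoostRegressor, which implements AdaBoost.R2; the synthetic data and parameter values are assumptions for demonstration purposes.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor

# Synthetic regression data (illustrative assumption).
X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)

# AdaBoost.R2 refits regressors on reweighted data after each error calculation;
# the loss driving the reweighting can be 'linear', 'square' or 'exponential'.
reg = AdaBoostRegressor(n_estimators=100, loss="linear", random_state=0)
reg.fit(X, y)
print("R^2 on training data:", reg.score(X, y))
```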

Challenges and solutions

Despite its high accuracy, AdaBoost is not an ideal model either. Thanks to its intensive training on misclassified data points and the resulting confident handling of underrepresented classes, it copes well with imbalanced data. However, this ability only extends so far.

Low data quality

If there is a strong imbalance in the sample distribution, even the otherwise robust Adaptive Boosting can tend towards overfitting. From a mathematical point of view, this is due to the loss function used, which is susceptible to outliers because of its exponential sensitivity to negative values. Noisy data can also become a problem if meaningless additional information influences the predictions.

Solution approach: Feature Engineering

Many weaknesses in the data used can be eliminated even before AdaBoost is executed. Feature engineering is a generic term for such optimizations and includes various techniques aimed at structuring classes and features as sensibly as possible. This increases the subsequent performance of the machine learning model, but requires a high level of expertise.
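
As a rough illustration of such preprocessing, the following sketch places a simple imputation and encoding step in front of AdaBoost using scikit-learn; the column names and parameter choices are hypothetical.

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

numeric_features = ["age", "income"]         # hypothetical numeric columns
categorical_features = ["contract_type"]     # hypothetical categorical column

# Clean up missing values and encode categories before boosting.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric_features),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
])

model = Pipeline([
    ("features", preprocess),
    ("adaboost", AdaBoostClassifier(n_estimators=100, random_state=0)),
])
# model.fit(X_train, y_train) would then train on the cleaned feature matrix.
```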

Weak learners are too weak

Errors can also occur if classifiers are selected that perform too poorly, for example because the Gini impurity was neglected during their construction. The use of too similar weak learners can likewise reduce the accuracy of the ensemble, which ultimately depends on diverse decisions from its individual members.

Solution approach: Pruning

This can be thought of as a hedge trimmer for decision trees. The technique allows weak branches or even entire trees to be removed if their performance falls below a critical threshold. In some cases, this initially reduces accuracy on the training data; however, the same applies to possible overfitting, so that the quality of subsequent predictions on unknown data points increases.
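
One way to approximate this idea in code is scikit-learn's cost-complexity pruning via the ccp_alpha parameter; the sketch below assumes a recent scikit-learn version (where the AdaBoost parameter is called estimator) and an illustrative threshold value.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

# A non-zero ccp_alpha removes branches whose contribution falls below the
# threshold, trading a little training accuracy for better generalization.
pruned_tree = DecisionTreeClassifier(max_depth=3, ccp_alpha=0.01)

ensemble = AdaBoostClassifier(estimator=pruned_tree, n_estimators=50, random_state=1)
ensemble.fit(X, y)
```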

Application and development

In principle, Adaptive Boosting can be applied as an ensemble method to various machine learning models such as Naive Bayes. Decision trees are the most common choice because, even for complex applications, a weak learner with an error probability of just under 50% is sufficient. Automated classifications play a role in many analysis processes, but also in end-user applications.
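
As a hedged example of a non-tree base learner, the sketch below plugs a Naive Bayes classifier into AdaBoost, again assuming a recent scikit-learn version and synthetic data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=1000, n_features=20, random_state=2)

# Any classifier that accepts sample weights can act as the weak learner;
# it only needs to perform slightly better than random guessing.
nb_ensemble = AdaBoostClassifier(estimator=GaussianNB(), n_estimators=50, random_state=2)
nb_ensemble.fit(X, y)
print("Training accuracy:", nb_ensemble.score(X, y))
```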

Face recognition

AdaBoost was first used for face detection back in 2001, just a few years after its creation. However, this only concerns the binary decision as to whether a face is present or not. Significantly more complex technologies are required to identify individual people, but AdaBoost can at least hand the detected faces over to them. Important here is an additional test phase, which can lead to further iterations until the very high accuracy required is reached.

Face detection: the identification of individual faces is now handled by Computer Vision, among other technologies.

Bioinformatics

Adaptive Boosting also brings its pronounced strengths in accurate classification to bear in interdisciplinary areas, even with a distorted data basis: determining the subcellular location of proteins, for example, is a task that requires such a high degree of accuracy. To this end, the researchers Fan and Wang (2011) extended the algorithm to include a multi-class feature and combined it with the pseudoamino acid composition, which had previously been the standard calculation method. The result: a significant increase in prediction accuracy.

Boosting and deep learning

The latest AI developments focus mainly on deep learning, a sub-area of machine learning defined by the use of multi-layered neural networks. Naturally, attempts are also being made to further improve existing techniques, and this applies to Adaptive Boosting as well: instead of decision trees, it can now be applied to Convolutional Neural Networks. This allows scaling to very large data sets, making the advantages of both approaches usable for big data and Enterprise AI. After all, business now closely follows the latest state of AI development, which is visible even in individual business processes:

The ensemble for intelligent document management

The aggregation of a powerful ensemble does not only work for simple boosting models. It can be extended to high-end technologies built on top of it, each of which can benefit from AdaBoost. The proof: Konfuzio. This software for intelligent document management comprises AI technologies with very different technical approaches. Only their combined application results in holistic Document Understanding down to the last detail:

  • Computer Vision: This AI technology uses machine and deep learning to enable the automated analysis of visual content, in this case image and layout information in documents. AdaBoost can help with the image classification involved.
  • Optical Character Recognition: As an optical character recognition system, it is responsible for recognizing text. The fact that such text exists at all is a prediction that can be optimized by AdaBoost.
  • Natural Language Processing: The extracted text should ideally also be understood by a machine in order to make further decisions. This is best achieved using NLP based on neural networks. These can now also be linked sequentially using AdaBoost in order to increase accuracy.

Conclusion

Once again, it becomes clear how artificial intelligence makes use of fundamental principles of human reality: four eyes see more than two. Five classifiers make a better decision than just one. In this way, ensemble methods such as adaptive boosting have managed to remain relevant for decades. Particularly valuable is the ability to learn sequentially from detected inaccuracies and to minimize these errors iteratively. In this way, AdaBoost turns weak learners such as decision trees into powerful models that are even helpful in face recognition or bioinformatics.

The end of further development is not yet in sight: progress in deep learning is rapid, making Adaptive Boosting applicable to large volumes of business data. In addition, the availability and quality of this data play a significant role in model accuracy. Feature engineering and data science are therefore also highly relevant influences, which in turn are intertwined with developments in document processing. Given these many factors, it is worth keeping an eye on Adaptive Boosting in the future and familiarizing yourself with the latest possibilities.

Would you like to find out more about the potential of aggregating classifiers and other AI techniques? Please feel free to send us a message.







