Naive Bayes: Custom made probabilistic classification

Tim Filzinger

Naive Bayes is a way of training AI models on labeled data. The corresponding algorithms assign objects to different classes by looking at certain features. A basic probabilistic principle, Bayes' theorem, determines which class is most likely. Given correct assignments in the form of training data, the method can achieve high prediction accuracy. This makes Naive Bayes an easy-to-use and popular machine learning technique.

What is Naive Bayes?

Naive Bayes is a probabilistic classification method that uses Bayes' theorem to determine the most probable class of an object on the basis of its properties. This principle is applied to AI models in the form of Naive Bayes classifiers, which, for example, distinguish text documents algorithmically based on the words they contain. The properties that give the algorithm information about class membership are called features. These variables can be continuous, discrete, categorical, or binary, depending on the nature of the input data.

The method is called "naive" because it assumes that the features are statistically independent of one another within each class. They are also all assumed to contribute equally to the final classification. Bayes' theorem, the underlying principle, goes back to the mathematician Thomas Bayes in the 18th century. It describes a formula for calculating conditional probability, that is, how likely an event A is given that an event B has already occurred. In mathematical terms, it looks like this:

$$P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$$

Here P(A|B) is the conditional probability of A given B, P(B|A) is the conditional probability of B given A, and P(A) and P(B) are the probabilities that A and B occur on their own. This simple principle allows the direction of a conclusion to be reversed, which is why it is sometimes described as inverse probability.

Example for the Bayes Theorem

Someone has received a positive rapid test result for coronavirus, and you want to know how likely it is that the person actually has the disease. Here A is the event that the person has the disease and B is the event that the test is positive, so P(A) is the probability of actual disease and P(B) is the probability of a positive test. P(A|B) is initially unknown, but P(B|A), namely how likely ill people are to get a positive test, can easily be determined from existing data, as can P(A) and P(B). Substituting these values into the theorem yields the conditional probability that the disease is present. For a single feature, the principle is thus quickly explained. With a larger number of features and classes it quickly becomes more complicated, which is why the work is usually left to algorithms.
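The substitution can be sketched in a few lines of Python. The numbers below are purely illustrative assumptions and not real test statistics: a prevalence of 5 %, a 90 % chance that an ill person tests positive, and a 2 % chance that a healthy person tests positive. The probability of a positive test is derived from these three values.

```python
def posterior(prior, likelihood, false_positive_rate):
    """Bayes' theorem: probability of disease given a positive test."""
    # Overall probability of a positive test (ill and healthy people combined).
    evidence = likelihood * prior + false_positive_rate * (1.0 - prior)
    return likelihood * prior / evidence

# Hypothetical values chosen only for illustration.
p_disease = 0.05             # P(disease)
p_pos_given_disease = 0.90   # P(positive test | disease)
p_pos_given_healthy = 0.02   # P(positive test | no disease)

p_disease_given_pos = posterior(p_disease, p_pos_given_disease, p_pos_given_healthy)
print(f"P(disease | positive test) = {p_disease_given_pos:.3f}")
```

With these assumed values, a positive result raises the probability of disease from 5 % to roughly 70 %, which shows how strongly the result depends on how common the disease is in the first place.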


What does Naive Bayes do for machine learning?

For AI models to deliver reliable results, they often apply basic statistical principles to large amounts of training data. In addition to regression and clustering, this also applies to Naive Bayes. The corresponding algorithms are called Naive Bayes classifiers and are often the first choice for the automated classification of objects, especially text. They are very versatile, ranging from binary categories as in the example above to text classification, where the occurrence of each word is a single feature. In principle, Naive Bayes is scalable: objects with any number of features can be divided into any number of classes. For each class, the algorithm multiplies the prior probability of the class, P(A), by the conditional probabilities P(B|A) of all observed features; according to Bayes' theorem, this product is proportional to P(A|B). Finally, the algorithm chooses the category with the highest resulting product.
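The procedure can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the toy data set and the helper names `train` and `predict` are hypothetical, probabilities are estimated as simple relative frequencies, the class prior is multiplied by the per-feature conditional probabilities, and no smoothing is applied.

```python
from collections import Counter, defaultdict

def train(samples):
    """samples: list of (feature_tuple, label). Returns class and feature counts."""
    class_counts = Counter(label for _, label in samples)
    # feature_counts[label][(position, value)] = number of occurrences
    feature_counts = defaultdict(Counter)
    for features, label in samples:
        for position, value in enumerate(features):
            feature_counts[label][(position, value)] += 1
    return class_counts, feature_counts

def predict(features, class_counts, feature_counts):
    """Return the most probable label and the unnormalized score of every label."""
    total = sum(class_counts.values())
    scores = {}
    for label, n_label in class_counts.items():
        score = n_label / total  # prior probability of the class
        for position, value in enumerate(features):
            # Relative frequency as an estimate of P(feature value | class).
            score *= feature_counts[label][(position, value)] / n_label
        scores[label] = score
    return max(scores, key=scores.get), scores

# Hypothetical toy data: (outlook, wind) -> decision
samples = [
    (("sunny", "calm"), "play"),
    (("sunny", "windy"), "play"),
    (("rainy", "calm"), "play"),
    (("rainy", "windy"), "stay"),
    (("rainy", "windy"), "stay"),
    (("sunny", "windy"), "stay"),
]

class_counts, feature_counts = train(samples)
label, scores = predict(("sunny", "calm"), class_counts, feature_counts)
print(label, scores)
```

Note that the score of "stay" is exactly zero here because "calm" never occurs together with "stay" in the toy data. A single unseen combination wipes out the whole product, which is the zero-probability problem that smoothing addresses.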

Since the decisions are "only" probabilistic predictions, the classifier requires training data, in particular to estimate the reverse probability P(B|A) as accurately as possible. The data includes correct assignments of objects to the corresponding classes. Logically, the more features and categories the algorithm has to consider, the more training data it needs. In addition to accuracy, precision and recall are important metrics. Precision is the proportion of objects classified as positive that really are positive; recall is the proportion of actually positive objects that were classified as positive. The F1 score is the harmonic mean of the two and summarizes them in a single number, but it does not show which of them needs to be optimized.
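As a small illustration of these metrics, the following sketch computes precision (share of predicted positives that are correct), recall (share of actual positives that were found) and the F1 score (their harmonic mean) from two label lists. The function name and the example labels are hypothetical.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Compute precision, recall and F1 for one class treated as 'positive'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical ground truth and predictions.
y_true = ["spam", "spam", "spam", "ham", "ham", "ham"]
y_pred = ["spam", "spam", "ham", "spam", "ham", "ham"]

precision, recall, f1 = precision_recall_f1(y_true, y_pred, positive="spam")
print(precision, recall, f1)
```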


Advantages

Naive Bayes classifiers are particularly impressive due to their simplicity. They can be trained and used quickly, yet can still be applied to complex cases. At the same time, they deliver comparatively accurate results, especially if the basic assumption of independent features actually holds. In this case, Naive Bayes can even outperform competitors such as logistic regression, whose parameters additionally have to be determined by optimization.


Disadvantages

In practice, the assumed independence often does not hold for every feature, which weakens the Naive Bayes approach in some cases. In addition, the training data must sufficiently cover each class, which can require a relatively large data set. For highly complex applications, Naive Bayes often loses out to neural networks, but it can at least serve as a simple baseline model.

3 Types of classifiers

Depending on the number or characteristics of the features and classes, different variants of classifiers are used, which differ primarily in their mathematical approach. Particularly popular are:

Multinomial Naive Bayes

This variant is especially suitable for integer input data such as counts and assumes a multinomial distribution, the generalization of the binomial distribution to more than two possible outcomes. The binomial distribution describes the total number of positive results of repeated Bernoulli experiments. For large numbers, it approximates the Gaussian distribution, for which a separate type of classifier can be used. The multinomial variant is often used for document and text classification, where it counts the frequency of individual words.

The most famous Bernoulli experiment is the coin toss
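The word-frequency approach of the multinomial variant can be sketched as follows. This is a simplified illustration, not a production implementation: the four training documents, the topic labels and the function names are hypothetical, tokenization is a plain whitespace split, and add-one smoothing (`alpha=1.0`) is assumed so that no logarithm of zero occurs. Logarithms are summed instead of multiplying probabilities to avoid numerical underflow.

```python
import math
from collections import Counter

def train_multinomial(docs, alpha=1.0):
    """docs: list of (text, label). Returns log priors and log word probabilities."""
    class_docs = Counter(label for _, label in docs)
    word_counts = {label: Counter() for label in class_docs}
    for text, label in docs:
        word_counts[label].update(text.lower().split())
    vocab = {word for counts in word_counts.values() for word in counts}
    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    log_likelihood = {}
    for c, counts in word_counts.items():
        total = sum(counts.values()) + alpha * len(vocab)
        log_likelihood[c] = {w: math.log((counts[w] + alpha) / total) for w in vocab}
    return log_prior, log_likelihood

def classify(text, log_prior, log_likelihood):
    """Return the class with the highest log score for the given text."""
    counts = Counter(text.lower().split())
    scores = {}
    for c in log_prior:
        # Each word contributes once per occurrence; unknown words are skipped.
        scores[c] = log_prior[c] + sum(
            n * log_likelihood[c][w] for w, n in counts.items() if w in log_likelihood[c]
        )
    return max(scores, key=scores.get)

# Hypothetical training documents.
docs = [
    ("goal match team goal", "sports"),
    ("team wins match", "sports"),
    ("market stocks fall", "finance"),
    ("stocks market rally stocks", "finance"),
]

log_prior, log_likelihood = train_multinomial(docs)
print(classify("team scores goal", log_prior, log_likelihood))
print(classify("stocks rally", log_prior, log_likelihood))
```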

Bernoulli Naive Bayes

This variant is very similar to the previous one but differs in the representation of the input data, which are treated as binary variables. It is also often used to classify text, but accordingly distinguishes only between the occurrence and the absence of words. Absence is treated as evidence in its own right, in contrast to Multinomial Naive Bayes, which only counts the words that occur; there, a word that never appeared with a class during training produces a zero probability unless smoothing is applied.
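The role of absent words can be made concrete with a short sketch. The per-class word probabilities below are invented for illustration, and the function name is hypothetical; in a real classifier they would be estimated from training data.

```python
import math

def bernoulli_log_likelihood(present_words, word_prob):
    """Log probability of a document given a class under the Bernoulli model.

    word_prob maps every vocabulary word to P(word occurs | class).
    """
    total = 0.0
    for word, p in word_prob.items():
        # Present words contribute p, absent words contribute (1 - p).
        total += math.log(p if word in present_words else 1.0 - p)
    return total

# Hypothetical occurrence probabilities for a three-word vocabulary.
spam = {"free": 0.8, "offer": 0.6, "meeting": 0.1}
ham = {"free": 0.2, "offer": 0.1, "meeting": 0.7}

doc = {"free", "offer"}  # "meeting" is absent, which also counts as evidence

spam_score = bernoulli_log_likelihood(doc, spam)
ham_score = bernoulli_log_likelihood(doc, ham)
print("spam" if spam_score > ham_score else "ham")
```

With equal class priors, comparing the two log likelihoods is enough to pick the class; otherwise the logarithm of each prior would be added to the respective score.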

Gaussian Naive Bayes

This is the variant already mentioned above, which can be used for large counts as well as for continuous (decimal) values. The main requirement is that the input variables within each class follow a normal distribution and can therefore be described with the Gaussian or bell curve. In practice, this is often at least approximately true.
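A minimal sketch of the Gaussian variant, assuming each feature is modeled per class by a mean and a variance estimated from the training data. The measurements, class labels and function names are hypothetical, and the sketch does not guard against a variance of zero.

```python
import math
from statistics import mean, pvariance

def fit_gaussian(samples):
    """samples: dict label -> list of feature vectors.

    Returns, for every label, a list of (mean, variance) pairs, one per feature.
    """
    params = {}
    for label, rows in samples.items():
        columns = list(zip(*rows))
        params[label] = [(mean(col), pvariance(col)) for col in columns]
    return params

def log_gaussian(x, mu, var):
    """Logarithm of the normal density with mean mu and variance var at x."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

def predict_gaussian(x, params, priors):
    """Return the label with the highest log prior plus summed log densities."""
    scores = {
        label: math.log(priors[label])
        + sum(log_gaussian(xi, mu, var) for xi, (mu, var) in zip(x, stats))
        for label, stats in params.items()
    }
    return max(scores, key=scores.get)

# Hypothetical measurements: (height in cm, weight in kg)
samples = {
    "child": [(120.0, 25.0), (130.0, 30.0), (125.0, 27.0)],
    "adult": [(170.0, 70.0), (180.0, 80.0), (175.0, 72.0)],
}
priors = {"child": 0.5, "adult": 0.5}

params = fit_gaussian(samples)
print(predict_gaussian((128.0, 29.0), params, priors))
```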


Optimization techniques for Naive Bayes

No model is perfect and so Naive Bayes also has its weaknesses, as it is often not optimally tailored to planned use cases even in its different variants. Therefore, to solve problems that arise and make the algorithm either more specific or more versatile for machine learning, various optimization and combination techniques are applied. Here are three important ones:


Smoothing

...solves the already mentioned problem of zero probabilities, which often occurs with categorical variants of Naive Bayes. When calculating relative frequencies, a small summand is added to each count in the numerator and the denominator is increased accordingly, which smooths the estimates. This helps the algorithm deal with feature values that were not seen for a class during training. If the summand is 1, one speaks of Laplace smoothing; if it is smaller, of Lidstone smoothing.
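The effect can be shown with a one-line estimator. This sketch uses the common additive form, in which the summand `alpha` is added to the count and `alpha` times the number of possible feature values is added to the total; the function name and the counts are hypothetical.

```python
def smoothed_probability(count, total, n_values, alpha=1.0):
    """Additive smoothing: (count + alpha) / (total + alpha * n_values)."""
    return (count + alpha) / (total + alpha * n_values)

# Hypothetical case: a word never seen in a class that contains 7 words overall,
# with a vocabulary of 8 distinct words.
unsmoothed = smoothed_probability(0, 7, 8, alpha=0.0)  # plain relative frequency
laplace = smoothed_probability(0, 7, 8, alpha=1.0)     # Laplace smoothing
lidstone = smoothed_probability(0, 7, 8, alpha=0.1)    # Lidstone smoothing

print(unsmoothed, laplace, lidstone)
```

Without smoothing the estimate is zero and would cancel the whole product of probabilities; with smoothing it stays small but positive.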

Feature Engineering

...does not optimize the algorithm itself, but leads to a significantly higher quality of the input features, on which Naive Bayes strongly depends. For this, features are converted, extracted, scaled and thus made "palatable" to the classifier. This ultimately leads to improved accuracy and minimizes errors.
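As one possible example of such a conversion step for text, the sketch below lowercases the input, strips punctuation, removes a few stop words and returns word counts that a classifier could use as features. The stop-word list and the function name are illustrative assumptions, and real pipelines usually go further.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "and", "is", "to", "of"}

def extract_features(text):
    """Turn raw text into a bag of words: lowercase, tokenize, drop stop words."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

features = extract_features("WIN a FREE offer: the offer of the year!")
print(features)
```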

Ensemble methods

Naive Bayes can be combined with other classifiers to improve overall performance. In so-called stacking, methods such as logistic regression are added. Each technique is trained and makes its predictions separately; a further model then learns how to combine these predictions into the final classification. Training several Naive Bayes classifiers on different randomly selected subsets of the training data and combining their predictions is called bagging, which mainly reduces variance.

Practical application possibilities

In keeping with their versatility, Naive Bayes classifiers are a popular choice from the AI bag of tricks. After all, classification is an important step in a wide variety of processes, and it probably plays its biggest role in classifying all kinds of text. Here are two specific use cases:

Spam filter

Probably the best-known case of text classification. Spam mails can often be identified by the frequent occurrence of certain words like "win", "offer", or "free", but also by certain spellings or links. A Naive Bayes classifier only needs training data containing both spam and legitimate mails. It can then calculate the conditional probability that a mail is spam from the frequency of these features. A similar approach is used in many other forms of text classification, as well as in natural language processing (NLP) applications.

The term "spam" originally comes from "spiced ham" and only got its meaning for mass repetition through a sketch by Monty Python

Document management

With a combination of some of the most sophisticated AI technologies, Konfuzio ensures holistic and fully automated document management. Of course, this would hardly be possible without precise classifications, for which Naive Bayes can also be used.


Conclusion

Naive Bayes is an easy-to-use and popular machine learning technique that uses probabilistic classification methods to assign objects to different classes based on their features. Although there are some drawbacks, such as the assumption of feature independence, Naive Bayes still provides high prediction accuracy and is versatile. There are three main types of classifiers: multinomial, Bernoulli and Gaussian Naive Bayes, which can be used depending on the application. Optimization techniques such as smoothing, feature engineering, and ensemble methods can further improve the performance of Naive Bayes. Practical use cases range from spam filtering to document management, and Naive Bayes is often used in combination with other AI technologies.
