With complex machine learning models, one quickly reaches the limits of interpretability: it is often not easy to explain why a particular decision or prediction was made. Logistic regression is a simple alternative that can be used to statistically investigate possible relationships between variables. As a result, it is a useful tool both for interpreting AI decisions and for making independent predictions.
What is logistic regression?
Logistic regression is a statistical analysis technique that models the relationship between one or more independent variables and a binary dependent variable. It is therefore suited to estimating the probabilities of the possible outcomes of an event - for example, a simple yes/no decision or the occurrence or non-occurrence of a certain scenario. Such variables are also called "dichotomous". The independent variables - and thus the influencing variables of the analyzed relationship - are numerically or continuously scaled. Categorical features must therefore be decomposed into binary dummy variables. A more detailed explanation of the individual feature types can be found in this Deep Dive.
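As a minimal sketch of this preprocessing step (pure Python, with hypothetical data), a categorical feature can be decomposed into binary dummy variables like this:

```python
# Sketch: "one-hot" encoding of a categorical feature into 0/1 dummy variables.
def to_dummies(values):
    """Map each category to a binary indicator column."""
    categories = sorted(set(values))
    return {cat: [1 if v == cat else 0 for v in values] for cat in categories}

# Hypothetical categorical feature with three levels:
colors = ["red", "blue", "red", "green"]
dummies = to_dummies(colors)
# dummies["red"] is [1, 0, 1, 0]: a binary variable usable as a regressor
```

In practice, libraries such as pandas (`get_dummies`) or scikit-learn (`OneHotEncoder`) perform this step; the sketch only illustrates the idea.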
Logistic regression is a special case of regression analysis, which also examines other types of variables. In addition, there are ordinal and multinomial extensions that allow for greater flexibility in terms of prediction: they make it possible to predict rankings or multiple categories. As a rule, however, logistic regression refers to the investigation of a binary target variable. Such procedures are also known under the generic term logit models. They use a corresponding logit function and make use of the concept of odds, which describe the ratio of the probability that an event will occur to the counter-probability. Another foundation is maximum likelihood estimation (MLE), which is used to estimate suitable model coefficients.
Examples of possible examinations
- What is the probability of a particular purchase decision depending on previous purchases?
- Can a discount code positively influence the decision?
- It becomes visible that a company takeover among listed companies is imminent. Will the share price of the buying company rise or fall?
- Is a person with certain characteristics creditworthy or not?
- Will it rain in New York tomorrow?
Even though some of these binary questions could equally be solved by other methods, they give an impression of binary probability modeling. The peculiarity of logistic regression is that, in addition to the yes or no answer, the confidence of this decision is also examined.
How does logistic regression work?
In principle, logistic regression evaluates historical data to determine the effect of the independent variables on the dependent variable. Like linear regression, it assumes a linear relationship, but the resulting score is transformed into a quantity between 0 and 1 by the logistic (sigmoid) function, which thus generates the corresponding probability of the event. The resulting function curve maps the existing database and is used for the predictions.
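The transformation can be sketched as follows - a minimal example with hypothetical coefficients, not a fitted model:

```python
import math

# Sketch: the logistic (sigmoid) function squeezes any linear score
# z = b0 + b1*x into a probability between 0 and 1.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients for a single feature x:
b0, b1 = -1.0, 0.5

def predict_proba(x):
    """Probability of the event given feature value x."""
    return sigmoid(b0 + b1 * x)

# At x = 2.0 the linear score is 0, so the probability is exactly 0.5:
# the decision boundary between the two outcomes.
```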
Graphically, it looks like this:
The typically S-shaped curve of logistic regression shows how, regardless of the input, values between 0 and 1 are output. The great advantage here is the interpretability and the possibility of estimating an error probability. For example, if f(x) takes the value 0.51, event 1 is more likely, but a slight deviation could change the decision; the result can therefore be treated with appropriate caution. Logistic regression is thus well suited for classifications - unlike, for example, linear regression, which merely interpolates between cases and thus only reveals the final decision.
The role of odds
For the practical use and interpretability of logistic regression, the odds, together with the odds ratio derived from them, contribute an important part. The odds are the ratio of the probability of an event to its counter-probability, while the odds ratio expresses the effect size between the variables. When performing a logistic regression with a statistical program such as SPSS, the odds ratios are therefore usually generated as an additional output - in addition to standard errors and error probabilities.
- Odds > 1 mean the event is more likely to occur than not to occur.
- Odds = 1 mean both outcomes are equally likely.
- Odds < 1 mean the event is more likely not to occur.
This also allows us to model how much a one-unit increase in the independent variable changes the probability of the event - for example, the increased risk of disease with each additional year of life.
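Numerically, this works as follows - a short sketch with a hypothetical coefficient for "age in years":

```python
import math

# Sketch: odds and the effect of a one-unit increase in a predictor.
def odds(p):
    """Ratio of the probability of an event to its counter-probability."""
    return p / (1.0 - p)

# In logistic regression, a coefficient b corresponds to the odds ratio
# exp(b): the multiplicative change in the odds per one-unit increase
# of the feature.
b_age = 0.08                      # hypothetical coefficient for age in years
odds_ratio = math.exp(b_age)      # roughly 1.083
# Interpretation: each additional year multiplies the odds of the
# disease by about 1.083, i.e. raises them by roughly 8 percent.
```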
Significance for Machine Learning
Like many other statistical methods, logistic regression lends itself to algorithmic use. In machine learning terms, it is a supervised, discriminative model. As an approach to classification problems, logistic regression also competes, for example, with the Naive Bayes classifier, which, however, works generatively. Compared to deep learning, it not only offers greater transparency but also allows more influence on the calculations; when these run opaquely, hardly any adjustments or observations are feasible.
While social scientists tend to inspect coefficients with the help of statistical programs to find explanations, in economics the aim is mainly to predict unknown data points. Although the actual computational work is done by an algorithm, analysts and researchers have quite a bit of work to do beforehand: historical training data has to be collected and, through so-called feature engineering, put into the form of suitable variables. Once a suitable training set has been created, the regression analysis itself can begin quickly. In addition to stand-alone forecasts, logistic regression can also be applied to complex models as part of interpretation techniques. In both cases, there are corresponding benefits:
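The training step itself can be sketched in a few lines. The following is an illustrative toy implementation that fits one coefficient by gradient descent on the (negative) log-likelihood; real projects would typically use a library such as scikit-learn or statsmodels, and the data here is invented:

```python
import math

def fit_logistic(xs, ys, lr=0.1, epochs=5000):
    """Fit a one-feature logistic regression by gradient descent."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += (p - y) / n          # gradient of the negative log-likelihood
            g1 += (p - y) * x / n
        b0 -= lr * g0
        b1 -= lr * g1
    return b0, b1

# Toy training set: larger x makes the event more likely
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0, 0, 0, 1, 1, 1]
b0, b1 = fit_logistic(xs, ys)
# b1 > 0: the fitted model assigns higher probabilities to larger x
```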
Advantages of the process
- Interpretability: In addition to the probability generated by the logistic function, coefficients and the odds derived from them represent the effect strength between variables. The calculation of error probabilities is also possible. For complex procedures and AI models with dozens of parameters, however, decision-making often resembles a "black box". Logistic regression can shed light on this by showing the exact relationship between individual variables of a construct.
- Simplicity: Because it yields data-based insights quickly, logistic regression can serve as a simple baseline model until a correspondingly more complex and more accurate model is set up. Its computational and data requirements are also comparatively low, although the data must still represent all variables sufficiently. Linear regression is even simpler in this respect, but delivers correspondingly less meaningful values.
- Robustness: Complex models tend to overinterpret small amounts of training data; this is called overfitting. New data is then given too little weight, even though its recency gives it high informative value. Logistic regression, on the other hand, always makes a statistically sound statement based on all available values. This makes it more robust to exogenous changes, although it can become susceptible to bias if too many variables are included.
Areas of application
- Medicine: Logistic regression is particularly well suited for identifying risk factors for the occurrence of a disease. This can be easily coded as a dichotomous dependent variable. Independent variables can be, for example, diet, lifestyle, age or gender.
- Social studies: To explain societal and social developments, scientists often examine different sociocultural and demographic factors that fit well into a logistic regression due to their characteristics. For example, the effects of social origin on education or occupation can be measured. Various other types of regression analysis are also used.
- Financial sector: In business, countless opportunities arise to generate valuable insights through logistic regression analysis. A good example is the financial industry, where one often has to deal with the estimation of risks. For example, lenders can determine how likely a default is to occur. In addition, certain activities can be classified as suspicious based on various characteristics.
The simplicity, robustness, and high interpretability of logistic regression make it a versatile analysis technique. Using the logistic function, it can calculate the probabilities of binary target variables based on historical data. Through its coefficients, this form of regression analysis also allows detailed insights into the effect sizes of the relationships under investigation. Thus, as an alternative or baseline model, it can provide more transparency than complex algorithms. In the long run, however, such complex models provide more accurate results because they take significantly more parameters into account.
In the complex landscape of machine learning, logistic regression is an essential methodology that brings clarity and precision to the analysis of data. Before you embark on using these or other statistical techniques in your project, we invite you to take advantage of the expertise and experience of our specialists. Our experts will be happy to contribute to your project in order to achieve your goals in the best possible way.