Hyperparameter Tuning for Machine Learning Model Optimization

Hyperparameter tuning - a question of settings

Tim Filzinger

Even though machine learning enables automated decisions, there are certain properties of a model that it cannot optimize on its own. It is precisely these so-called hyperparameters that have a considerable influence on subsequent performance. Developers therefore invest a lot of time and energy in determining the ideal settings right from the start. This makes hyperparameter tuning one of the most important processes in the preparation of AI projects. However, this only applies if the data is suitable too.

Of course, there are many other ways to optimize model accuracy and quality, for example Feature Engineering or Data Cleaning. The constant supply of high-quality data through human corrections (human-in-the-loop, HITL) is also a frequently used concept. In comparison, hyperparameter tuning stands out because it is usually carried out only once. The decisions made in the process therefore carry special weight.

Definition: Hyperparameter tuning describes the search for the optimal configuration of a machine learning model before training begins.

What hyperparameters are there?

The more complex a machine, the more adjusting screws influence the way it works. Machine learning is no different. Particular attention is paid to those factors that leave little room for later readjustment. It is not for nothing that these parameters bear the Greek prefix "hyper": their importance ranks above all others. The only exception is the choice of the model type itself, which must match the planned project and determines which properties can be influenced at all.

The following hyperparameters can play an important role for almost any machine learning model:

Learning rate

A central concept of machine learning is the iterative repetition of training predictions that result in a (neural) adaptation of the model. Typically, each prediction is compared with a defined target value that it should approach; this ultimately leads to the minimization of a loss function. The learning rate specifies the step size of these optimization steps and thus influences the speed and effectiveness of training.
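To make the step-size idea concrete, here is a minimal sketch of gradient descent on a toy one-dimensional loss; the loss function, starting value and learning rate are purely illustrative.

```python
# Minimal gradient descent sketch on a toy 1-D loss with minimum at w = 3.
# All values here are illustrative, not recommendations.

def loss(w):
    return (w - 3.0) ** 2

def grad(w):
    return 2.0 * (w - 3.0)  # derivative of the loss

learning_rate = 0.1  # the hyperparameter in question
w = 0.0              # initial weight

for step in range(50):
    w -= learning_rate * grad(w)  # each update moves proportionally to the learning rate

print(f"w after 50 steps: {w:.4f}, loss: {loss(w):.6f}")
```

A rate that is too small needs many more steps to get close to the minimum, while one that is too large can overshoot it and even diverge.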

Batch Size

This hyperparameter describes the number of samples that are processed before the model's weights are updated. The training data is divided into predefined subsets, so-called batches. If a batch comprises the entire data set, this is referred to as batch gradient descent; if it comprises only a single sample, it is called stochastic gradient descent. Batch sizes in between are called mini-batch gradient descent and often comprise 32, 64 or 128 samples.
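The following sketch shows how the batch size determines the number of weight updates per epoch; the data is random dummy data and the gradient step itself is only indicated by a comment.

```python
import numpy as np

# How batch size partitions the data: one weight update per batch.
# batch_size == len(X) -> batch gradient descent
# batch_size == 1      -> stochastic gradient descent
# in between           -> mini-batch gradient descent
X = np.random.rand(1000, 10)  # 1,000 dummy samples with 10 features
batch_size = 64

indices = np.random.permutation(len(X))  # shuffle once per epoch
for start in range(0, len(X), batch_size):
    batch = X[indices[start:start + batch_size]]
    # ... compute gradients on `batch` and update the weights here ...

updates = (len(X) + batch_size - 1) // batch_size
print(f"{updates} weight updates per epoch at batch size {batch_size}")
```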

Epochs

Regardless of the batch size, the number of times the entire data set is presented to the machine learning model also matters. Here, too, a careful balance must be struck between fit and generalizability during tuning. An epoch count of several hundred up to a thousand is within the normal range. Values that are too high improve performance only on the training data, at the risk of overfitting. Learning curves, which plot the model's fit over the course of training, help with this trade-off.

[Figure] General learning curve of a neural network. Source: Learning Curves in Machine Learning
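As a sketch of how such a learning curve is tracked, the loop below records a training and a validation loss per epoch; the loss values are synthetic stand-ins, shaped so that the validation loss eventually rises again, which is the classic overfitting signal.

```python
# Synthetic learning-curve sketch: the loss values are fabricated stand-ins
# for real training, shaped so validation loss rises again after a while.
history = {"train_loss": [], "val_loss": []}

for epoch in range(100):
    train_loss = 1.0 / (epoch + 1)                 # keeps falling
    val_loss = 1.0 / (epoch + 1) + 0.002 * epoch   # falls, then rises: overfitting
    history["train_loss"].append(train_loss)
    history["val_loss"].append(val_loss)

best_epoch = min(range(100), key=lambda e: history["val_loss"][e])
print(f"validation loss is lowest at epoch {best_epoch + 1}")
```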

The most powerful models currently available are based on deep learning and neural networks. In addition to the hyperparameters mentioned above, further hyperparameters become relevant here:

Number of layers and neurons

The way neural networks function depends heavily on their structural design, known as their architecture. Even though they are ultimately just complex non-linear functions, they can be represented as a spatial network. This network is composed of layers of neurons, which enable more complex computations as their number and interconnection increase. In natural language processing, for example, this allows longer contexts within natural language to be captured.
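A minimal sketch of these architecture hyperparameters, assuming TensorFlow/Keras is available; the layer count, layer width and the 20 input features are arbitrary example values.

```python
import tensorflow as tf

n_hidden_layers = 3     # architecture hyperparameter: depth
neurons_per_layer = 64  # architecture hyperparameter: width

layers = [tf.keras.layers.Dense(neurons_per_layer, activation="relu")
          for _ in range(n_hidden_layers)]
layers.append(tf.keras.layers.Dense(1, activation="sigmoid"))  # binary output

model = tf.keras.Sequential(layers)
model.build(input_shape=(None, 20))  # 20 input features, chosen for the example
model.summary()                      # shows how depth and width set the parameter count
```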

Regularization

This refers to a set of procedures that also influence the complexity of neural networks. Here, however, the aim is to temporarily reduce that complexity in order to avoid overfitting to the training data. This is achieved, for example, by L1 or L2 regularization, which shrinks the neurons' weights by adjusting the loss function. With dropout, on the other hand, complexity is reduced by randomly excluding individual neurons. Although regularization is particularly important for neural networks, it can also be applied to other models under certain circumstances.
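Both techniques can be sketched in a few lines, again assuming TensorFlow/Keras; the L2 strength and dropout rate below are example values that would themselves be subject to tuning.

```python
import tensorflow as tf

l2_strength = 0.01   # weight penalty added to the loss (L2 regularization)
dropout_rate = 0.3   # fraction of neurons randomly dropped during training

model = tf.keras.Sequential([
    tf.keras.layers.Dense(
        64, activation="relu",
        kernel_regularizer=tf.keras.regularizers.l2(l2_strength)),
    tf.keras.layers.Dropout(dropout_rate),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.build(input_shape=(None, 20))
model.summary()
```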

In addition to these more general categories, very specific hyperparameters can also be assigned to individual algorithms, for example the tree depth of a random forest or the kernel of a support vector machine.

Which techniques are used?

There are basically two different approaches to hyperparameter tuning. With manual tuning, various experiments are carried out with different hyperparameter settings. Comparing the respective results and performance reports ultimately leads to the selection of settings. A typical example is manual search, in which data scientists select and adjust values intuitively or based on experience.

Automated tuning, on the other hand, is characterized by the use of various algorithms designed to calculate an optimal combination of hyperparameters. However, at least the preselection of the search space is still done manually. The degree of control is somewhat lower with these algorithms, but so are the time and effort required. The following techniques are particularly suitable for this:

Random Search

The name says it all: a random selection of values is drawn from a predefined statistical distribution for each hyperparameter. The configurations derived from this are used to train the model, which is then assessed using various evaluation metrics. This forms the basis for continuous adjustments. Due to the random selection, the process is less computationally intensive than other methods. The results are nevertheless impressive.
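A sketch with scikit-learn's RandomizedSearchCV; the random forest, the value ranges and the budget of 20 configurations are example choices.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=500, random_state=0)  # dummy data

param_distributions = {
    "n_estimators": randint(50, 300),  # values are sampled, not enumerated
    "max_depth": randint(2, 20),
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions,
    n_iter=20,        # only 20 random configurations are evaluated
    cv=5,             # each one is cross-validated
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```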

Grid Search

To try out as many combinations of different parameter values as possible, it is a good idea to arrange them in a grid. This grid is searched systematically until settings of the desired quality are identified. Because the search can range from a small predefined grid to a completely exhaustive one, particularly powerful settings can be found. However, this also entails a correspondingly high computational cost.
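For comparison, a scikit-learn GridSearchCV sketch; the SVM and the 3 x 3 grid are example choices, and every one of the nine combinations is trained and evaluated.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)  # dummy data

param_grid = {
    "C": [0.1, 1, 10],        # 3 values ...
    "gamma": [0.01, 0.1, 1],  # ... x 3 values = 9 combinations, all evaluated
}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```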

Bayesian Optimization

This technique is based on Bayes' theorem, which is also used in naive Bayes classifiers. The starting point is a probabilistic surrogate function that is meant to approach the optimum, i.e. the ideal hyperparameter setting, step by step. An acquisition function, which weighs exploration of the search space against exploitation of known results, helps to select suitable configurations. Iterative evaluations of the resulting performance produce data that is used to update the probability model.
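One way to sketch this is with the Optuna library, whose default TPE sampler is a Bayesian-flavoured method; the model and search ranges are example choices.

```python
import optuna
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)  # dummy data

def objective(trial):
    # each trial proposes a configuration informed by all previous results
    n_estimators = trial.suggest_int("n_estimators", 50, 300)
    max_depth = trial.suggest_int("max_depth", 2, 20)
    model = RandomForestClassifier(
        n_estimators=n_estimators, max_depth=max_depth, random_state=0)
    return cross_val_score(model, X, y, cv=5).mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=30)  # evaluations update the probability model
print(study.best_params)
```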

Evolutionary optimization

So-called evolutionary algorithms are inspired by Darwinian principles and are therefore particularly suitable for optimization problems. Applied to hyperparameter tuning, these programs form populations of possible settings. Through mutation, recombination and selection, a gradually improved hyperparameter set emerges. Configurations that do not meet the previously defined fitness criterion are successively sorted out.
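The principle fits in a few lines of plain Python; the sketch below evolves a single hyperparameter (the learning rate) against a fabricated fitness function that stands in for a real validation score.

```python
import random

def fitness(lr):
    # fabricated stand-in for a validation score; pretends 0.01 is optimal
    return -(lr - 0.01) ** 2

# initial population of candidate learning rates
population = [random.uniform(0.0001, 0.5) for _ in range(20)]

for generation in range(30):
    population.sort(key=fitness, reverse=True)  # selection: fittest first
    survivors = population[:10]                 # weaker half is sorted out
    # mutation: each survivor spawns a randomly perturbed offspring
    offspring = [max(1e-5, lr * random.uniform(0.5, 1.5)) for lr in survivors]
    population = survivors + offspring

print(f"best learning rate found: {max(population, key=fitness):.5f}")
```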

Evaluate performance with cross-validation

In hyperparameter tuning, the most difficult issue is deciding on a specific model configuration. In addition to the techniques described, the process therefore also involves comparing different configurations in order to evaluate model performance on unseen data. Overfitting would have a negative impact on subsequent accuracy and flexibility. To avoid this, the resampling technique cross-validation is often used: the available data set is partitioned in such a way that parts of it approximate new, unseen data.

Typically, this is done by splitting the data into k different subsets (k folds). The model is trained on k-1 of these folds in turn and validated on the remaining one. Finally, metrics such as the F1 score are averaged across the iterations. The process can be repeated for any number of hyperparameter settings. Only when the data scientists and machine learning experts are satisfied with the results is the model ready for the actual training phase with a larger data set.

[Figure] Distribution of training and validation sets across the folds; k=5 or 10 folds are considered common. Source: Cross-validation
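A compact sketch of this procedure with scikit-learn, averaging the F1 score over k=5 folds; the model and data are example choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)  # dummy data
model = LogisticRegression(max_iter=1000)                  # example configuration

# k=5 folds: train on four, validate on the fifth, rotate, then average
scores = cross_val_score(model, X, y, cv=5, scoring="f1")
print(f"F1 per fold: {scores.round(3)}, mean: {scores.mean():.3f}")
```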

Challenges

Hyperparameter tuning is an extremely complex process that requires specialized algorithms and a high degree of expertise in their selection and application. Even experienced data scientists regularly encounter significant challenges. These include the trade-off between fit and generalizability, which is once again the typical core problem of machine learning.


Overfitting and overengineering

With all the countless adjustment options, it is easy to overdo the tuning. This inhibits generalizability in two ways: on the one hand, too many iterations can lead to over-adaptation to a small data set or even a single fold. On the other hand, overly specific settings reduce flexibility in the subsequent application. By definition, the model cannot overcome obstructive hyperparameters on its own.

Search space and resources

For many of the techniques used, the calculations require a great deal of computing power. The effort grows with the number of configurations examined, which in turn determines the quality of the results. This is why grid search, for example, with its exhaustive grid, is a highly effective but also expensive technique. Manual or random search, on the other hand, offer just about the best "price-performance ratio".

Dependencies

Many hyperparameters cannot be considered and optimized in isolation. Instead, adjusting a single parameter often influences others. These dependencies are particularly complex in neural networks. For example, the effective number of neurons and layers is affected by regularization, especially dropout, since the network is spatially thinned out. This once again highlights the need for a high level of expertise, experience and intuition.

What are the benefits of hyperparameter tuning?

The choice and tuning of suitable hyperparameters undoubtedly has a significant influence on the expected model performance. This can be seen, for example, in a study by the Saudi Arabian researchers Hoque and Aljamaan (2021): with the help of a Wilcoxon test, they compared the predictive accuracy of machine learning models for share prices, partly with and partly without tuning. The result: significantly more accurate forecasts after prior tuning of the hyperparameters. But: the basis was a high-quality data set that had been elaborately prepared using the sliding window technique.

Another study (Weegar et al. 2016) underscores this caveat: even simple modifications of the features in the data set allowed models to outperform supposedly better ones. Ultimately, even the most elaborate hyperparameter tuning does not outweigh the importance of sensibly structured and suitable data. What counts is the informative content and the correlations that a machine learning model is supposed to recognize during training. The real benefit only arises from the combination of both.

Application of optimized models

Since hyperparameter tuning is a fundamental concept of machine learning, there is no single use case to highlight. Every AI application requires finely tuned, working models. The Konfuzio Marketplace offers a large number of them, already tuned, trained and ready for immediate use. The industry-specific applications range from Medical NER and Real estate exposés to Securities settlements.

Our experts have taken care of the appropriate hyperparameters.

Register now and test for free

Conclusion

Choosing the right settings is the indispensable basis of every technology project. In machine learning, this process is known as hyperparameter tuning. The focus is on those properties that can no longer be changed in the course of the learning process, such as the learning rate, the number of neurons or the batch size. In addition to manual adjustment, typical techniques include grid search and Bayesian optimization.

Regardless of the specific method, a high level of experience and expertise on the part of the data scientists is always a prerequisite. The same applies to the quality of the data basis used in training, which strongly influences the achievable performance. Customized AI platforms such as Konfuzio are particularly useful when these resources are not available in-house.

Would you like to further optimize your own AI models? Feel free to send us a message. Our experts look forward to hearing from you.
