Downsize your AI model while maintaining performance

Introduction

The increasing demand for artificial intelligence (AI) calls for smaller, more efficient models that can run on devices with limited resources. Despite their reduced size, these models must achieve comparable test results to remain accurate and reliable. In this article, we consider four machine-learning optimization techniques that enable compact AI models without sacrificing performance: Model Distillation, Model Pruning, Model Quantization, and Dataset Distillation.


Model distillation

Definition: What is knowledge distillation?

Knowledge distillation is the process of transferring knowledge from a large model to a smaller model. In machine learning, large models have a higher knowledge capacity than small models, but this capacity may not be fully utilized. In knowledge distillation, knowledge is transferred from a large model to a smaller model without losing validity.

Process

The process of model distillation involves training a smaller student model to mimic the behavior of a larger teacher model. By using the knowledge that the teacher model possesses, the student model can achieve similar performance even though it is significantly smaller. In this process, the student model is usually trained using a combination of the original training data and the soft labels generated by the teacher model. By transferring the knowledge from the teacher model to the student model, we create a compact model that contains the essential information needed to make accurate predictions.

Figure: The teacher-student framework for knowledge distillation [1].
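
To make the process more tangible, here is a minimal sketch of a distillation training step in PyTorch. It assumes classification models and uses illustrative values for the softmax temperature and the loss weighting `alpha`; it is not the exact setup from [1], just a common formulation of the idea.

```python
# Minimal knowledge-distillation training step (sketch, PyTorch assumed).
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, x, labels, optimizer,
                      temperature=4.0, alpha=0.5):
    """One training step combining hard-label loss and soft-label loss."""
    teacher.eval()
    with torch.no_grad():
        teacher_logits = teacher(x)          # soft targets from the large model

    student_logits = student(x)

    # Standard cross-entropy on the original (hard) labels.
    hard_loss = F.cross_entropy(student_logits, labels)

    # KL divergence between temperature-softened distributions.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1),
        reduction="batchmean",
    ) * (temperature ** 2)

    loss = alpha * hard_loss + (1 - alpha) * soft_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Repeating this step over the training data yields a student that learns both from the ground-truth labels and from the teacher's softened output distribution.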

Model pruning

Definition: What is model pruning?

Model pruning is a technique in which unnecessary connections, parameters, or entire layers are removed from a pre-trained neural network. Pruning can be based on various criteria, such as the magnitude of the weights, sensitivity analysis, or structured sparsity. By eliminating redundant or less important components, we can significantly reduce the size of the model while maintaining its performance. In addition, pruning can lead to improved inference speed and reduced memory requirements, which makes it an attractive approach for deploying AI models on resource-constrained devices.
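
As a hedged illustration, the following sketch uses PyTorch's pruning utilities to apply magnitude-based (L1) unstructured pruning to the linear layers of a small model; the architecture and the 30% pruning ratio are arbitrary assumptions for the example.

```python
# Magnitude-based (L1) unstructured pruning with PyTorch's pruning utilities (sketch).
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(          # stand-in for a pre-trained network
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 10),
)

# Zero out the 30% of weights with the smallest absolute value in each Linear layer.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")   # make the pruning permanent

# The weight tensors now contain explicit zeros; sparse storage or
# structured pruning is needed to turn this into actual memory savings.
```

Note that unstructured pruning only introduces zeros into the weight tensors; realizing speed and memory benefits in practice usually requires sparse formats or structured (e.g. channel-level) pruning.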


Model quantization

Definition: What is model quantization?

Model quantization involves reducing the precision of numerical values in a neural network. Typically, deep learning models use 32-bit floating-point numbers (FP32) to represent weights and activations. By quantizing the model to representations with smaller bit widths, e.g. 8-bit integers (INT8), we can significantly reduce the model size and memory requirements.

Explanation

Reducing the number of bits means that the resulting model requires less memory, consumes less energy (at least in theory), and that operations such as matrix multiplication can be performed much faster using integer arithmetic. It also allows models to run on embedded devices that sometimes only support integer data types.

Although quantization can introduce some quantization error, modern techniques such as quantization-aware training can minimize the loss of accuracy. With proper calibration and optimization, quantized models can achieve similar performance to their full-precision counterparts while using fewer computational resources. See this article from NVIDIA [2] for more information on quantization-aware training.
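
To illustrate how little code quantization can require in practice, the following sketch applies post-training dynamic quantization (a simpler approach than the quantization-aware training described in [2]) to the linear layers of a small PyTorch model; the architecture shown is an arbitrary stand-in.

```python
# Post-training dynamic quantization of Linear layers to INT8 (sketch).
# Note: this is not the quantization-aware training described in [2];
# it simply illustrates the size reduction of INT8 weights.
import torch
import torch.nn as nn

float_model = nn.Sequential(     # stand-in for a trained FP32 model
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 10),
)

quantized_model = torch.quantization.quantize_dynamic(
    float_model, {nn.Linear}, dtype=torch.qint8
)

# The weights of the Linear layers are now stored as 8-bit integers and
# dequantized on the fly during inference.
```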

With 8-bit quantization, each weight and activation value in the model is limited to an 8-bit integer that can represent values from 0 to 255. This means that instead of a wide range of floating point values, we restrict the range to a discrete set of integer values. This reduction in precision allows for efficient storage and computation, since 8-bit integers require fewer bits than 32-bit floating-point numbers.
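
To make the mapping concrete, here is a minimal NumPy sketch of affine quantization to the unsigned 0-255 range described above; the helper names `quantize_uint8` and `dequantize` are illustrative, not part of any particular library.

```python
# Affine quantization of an FP32 tensor to unsigned 8-bit integers (sketch).
import numpy as np

def quantize_uint8(x):
    """Map float values to the discrete range 0..255 via a scale and zero point."""
    x_min, x_max = float(x.min()), float(x.max())
    scale = (x_max - x_min) / 255.0 if x_max > x_min else 1.0
    zero_point = int(np.clip(round(-x_min / scale), 0, 255))
    q = np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Approximate reconstruction of the original float values."""
    return (q.astype(np.float32) - zero_point) * scale

weights = np.random.randn(4, 4).astype(np.float32)
q, scale, zp = quantize_uint8(weights)
print(np.abs(weights - dequantize(q, scale, zp)).max())  # quantization error
```

The printed value is the quantization error mentioned above: the small gap between the original float values and their reconstruction from 8-bit integers.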

It should be noted that 8-bit quantization is only one example of quantization. There are other quantization techniques, such as 4-bit quantization, where the precision is further reduced to 4-bit integers. The basic idea remains the same - representing weights and activations with fewer bits to achieve smaller model sizes and lower memory requirements.


Dataset distillation

Definition: What is dataset distillation?

Dataset distillation is a technique in which a smaller model is trained using a carefully selected subset of the original training data. The goal is to create a distilled dataset that captures the essential patterns and features of the full dataset while significantly reducing its size. This distilled dataset serves as a proxy for the original dataset and allows training of models that achieve comparable performance while requiring less memory.

Figure: An overview of the dataset distillation process [3].

Process

The process of dataset distillation usually includes the following steps:

  1. Dataset Selection: The first step is to select a representative subset of the original training data. This subset should cover the data distribution and capture the most important patterns and features of the entire dataset. To ensure that the distilled dataset is diverse and representative, various techniques, such as clustering or stratified sampling, can be used (see the sketch after this list).
  2. Model Training: Once the distilled dataset is created, a smaller model is trained on this subset. The training process involves optimizing the parameters of the model to fit the distilled data set, similar to traditional training on the full data set. However, because the distilled dataset is smaller, the training process is typically faster and requires fewer computational resources.
  3. Performance Evaluation: After the smaller model is trained on the distilled dataset, its performance is evaluated to assess its effectiveness. This evaluation may involve measuring metrics such as accuracy, precision, recall, or F1 score, depending on the task and application. By comparing the performance of the distilled model to that of the full model, we can determine how successful the dataset distillation was.
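
The following sketch illustrates step 1 with a simple clustering-based selection using scikit-learn; the helper name `distill_by_clustering` and the number of prototypes are illustrative assumptions, and more sophisticated dataset distillation methods (see [3]) learn synthetic samples rather than selecting real ones.

```python
# Selecting a small, representative training subset via k-means clustering (sketch).
import numpy as np
from sklearn.cluster import KMeans

def distill_by_clustering(X, y, n_prototypes=1000, random_state=0):
    """Keep the real sample closest to each cluster centre as a 'distilled' subset."""
    kmeans = KMeans(n_clusters=n_prototypes, random_state=random_state, n_init=10)
    kmeans.fit(X)

    indices = []
    for c in range(n_prototypes):
        members = np.where(kmeans.labels_ == c)[0]
        if len(members) == 0:
            continue
        centre = kmeans.cluster_centers_[c]
        # Pick the member of this cluster that lies nearest to its centre.
        nearest = members[np.argmin(np.linalg.norm(X[members] - centre, axis=1))]
        indices.append(nearest)

    indices = np.array(indices)
    return X[indices], y[indices]

# Usage (hypothetical arrays):
# X_small, y_small = distill_by_clustering(X_train, y_train, n_prototypes=1000)
# A smaller model is then trained on (X_small, y_small), as in step 2.
```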

Disadvantages

  1. Information loss: Because dataset distillation selects a subset of the original training data, there is a possibility of information loss. The distilled dataset may not capture all the nuances and rare cases that are present in the full dataset, which may result in lower model performance in certain scenarios.
  2. Generalization to unseen data: The smaller model trained on the distilled data set may not generalize as well to unseen data as a model trained on the full data set. It is critical to carefully evaluate the performance of the distilled model on both the training and evaluation datasets to ensure that it maintains satisfactory performance across different data distributions.
  3. Bias of the data set: There is a possibility of bias in the selection of the distilled data set. If the distilled data set is not representative of the full data set, the trained model may exhibit biased behavior, affecting its fairness and accuracy. Careful consideration and evaluation of the distilled data set is necessary to mitigate such bias.

Conclusion

Efficiency and compactness are essential aspects when using AI models in resource-constrained environments. By using techniques such as model distillation, model pruning, model quantization, and dataset distillation, we can effectively reduce the size of AI models without sacrificing performance. These techniques provide practical solutions to optimize model size and enable deployment on end-user devices, mobile platforms, and other resource-constrained environments. In AI development, balancing model size and performance becomes critical for widespread adoption in various domains.


Literature

[1] J. Gou, B. Yu, S. J. Maybank, and D. Tao, "Knowledge Distillation: A Survey," International Journal of Computer Vision, 2021. arXiv:2006.05525 [cs.LG].

[2] N. Zmora, H. Wu, and J. Rodge, "Achieving FP32 Accuracy for INT8 Inference Using Quantization Aware Training with NVIDIA TensorRT," NVIDIA Developer Blog, Jul 20, 2021.

[3] R. Yu, S. Liu, and X. Wang, "Dataset Distillation: A Comprehensive Review," arXiv preprint arXiv:2301.07014, 2023.

