Activate only 3.8B parameters and match a full 7B model! Works for training and fine-tuning alike, from Microsoft

QbitAI
2024/07/18 13:28

| Official account QbitAI

Activating only about 60% of the parameters is enough to match the performance of a fully activated dense model.

A new study from Microsoft Research Asia achieves fully sparse activation of the model, significantly cutting the cost of inference.

And it is broadly applicable, whether you are training from scratch, continuing training, or fine-tuning.

The method, named Q-Sparse, sparsifies the model at the neuron level, a finer granularity than other approaches, and achieves better performance and a higher sparsity rate at the same inference cost.

The Q in the name stands for Quantization: besides ordinary models, the method is also compatible with quantization techniques and works with models quantized in various ways.

The authors further show that combining Q-Sparse with model quantization delivers even greater cost and efficiency gains.

In addition, while studying Q-Sparse, the team also conducted an in-depth exploration of the relationship between parameter size, sparsity rate and model performance, and found a "Scaling Law" suitable for model inference optimization.

Some netizens think the technique is genuinely good, calling it better than ReLU.

Someone else went into wish-making mode, saying it would be nice if AMD's ROCm supported the technology before Nvidia does.

The core operation of Q-Sparse is to apply the Top-K sparsity function to the input tensor.

Specifically, the Transformer architecture uses nn.Linear layers (matrix multiplications) for the projections in both the attention and feedforward layers, which can be expressed as Y = X·W^T (where X is the input tensor, W is the weight, and Y is the output tensor).

In Q-Sparse, for an input activation tensor X, the absolute values |X| are first computed and sorted to find the K elements with the largest absolute values.

Here K is a pre-set hyperparameter that determines the degree of sparsity.

Q-Sparse then creates a binary mask tensor M with the same shape as X: at the positions of the K elements of |X| with the largest absolute values, M is set to 1, and everywhere else it is set to 0.

Then, the input tensor X and the mask tensor M are combined via the Hadamard product (element-wise multiplication) to obtain the sparsified tensor X_sparse.

In the forward propagation process, the sparsified tensor X_sparse will replace the original input tensor X for subsequent calculations (e.g., matrix multiplication).

Since most of the elements in X_sparse are already zero, the amount of computation and the memory-bandwidth requirements can be significantly reduced.
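The steps above can be sketched in a few lines of NumPy. This is a minimal illustration of the Top-K masking idea, not the paper's implementation; the function name and the choice to apply Top-K over the whole tensor (rather than per token) are mine.

```python
import numpy as np

def topk_sparsify(x: np.ndarray, k: int):
    """Keep the k entries of x with the largest absolute value; zero the rest.

    Returns the sparsified tensor X_sparse and the binary mask M.
    """
    # Indices of the k entries with the largest |x|
    idx = np.argpartition(np.abs(x).ravel(), -k)[-k:]
    mask = np.zeros(x.size, dtype=x.dtype)
    mask[idx] = 1
    mask = mask.reshape(x.shape)
    # Hadamard (element-wise) product zeroes out the non-top-k entries
    return x * mask, mask

x = np.array([0.1, -2.0, 0.5, 3.0, -0.2, 1.5])
x_sparse, m = topk_sparsify(x, k=3)
# x_sparse keeps only the three largest-magnitude entries: -2.0, 3.0, 1.5
```

The sparsified `x_sparse` would then feed the matrix multiplication Y = X_sparse·W^T in place of X.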

In the process of backpropagation, Q-Sparse uses a Straight-Through Estimator (STE) to calculate the gradient of the Top-K function.

In conventional training, one computes the gradient of the loss function with respect to the network parameters and uses gradient descent to update the parameters and minimize the loss.

However, when the network contains non-differentiable operations such as quantization and Top-K, gradient computation runs into trouble: the gradient of these operations' output with respect to their input is zero almost everywhere, so gradients cannot propagate effectively.

STE avoids this vanishing-gradient problem by passing the gradient directly through to the tensor before sparsification.

In ordinary backpropagation, the gradient of the loss function L with respect to x is ∂L/∂x = ∂L/∂y · ∂y/∂x, but here it cannot be computed directly because of the non-differentiability.

STE's solution is to compute only the gradient of the loss with respect to the sparsified tensor y and copy it directly to the original tensor x, i.e., to take ∂L/∂y as an estimate of ∂L/∂x.
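The contrast can be made concrete without an autograd framework. In the sketch below (function names are mine), the "exact" backward pass of y = x ⊙ M zeroes the gradient wherever the mask is zero, while the STE backward pass simply forwards the upstream gradient unchanged:

```python
import numpy as np

def topk_mask(x: np.ndarray, k: int) -> np.ndarray:
    """Binary mask selecting the k largest-|x| entries."""
    idx = np.argpartition(np.abs(x).ravel(), -k)[-k:]
    m = np.zeros(x.size, dtype=x.dtype)
    m[idx] = 1
    return m.reshape(x.shape)

def sparsify_forward(x: np.ndarray, k: int):
    m = topk_mask(x, k)
    return x * m, m

def sparsify_backward_exact(grad_y: np.ndarray, mask: np.ndarray) -> np.ndarray:
    # True gradient of y = x * mask w.r.t. x (mask treated as constant):
    # zero wherever the mask is zero, so most of the signal is lost.
    return grad_y * mask

def sparsify_backward_ste(grad_y: np.ndarray, mask: np.ndarray) -> np.ndarray:
    # Straight-through estimator: pretend the Top-K op is the identity
    # and pass the upstream gradient through unchanged.
    return grad_y

x = np.array([0.1, -2.0, 0.5, 3.0])
y, m = sparsify_forward(x, k=2)
g = np.ones_like(x)                    # some upstream gradient dL/dy
exact = sparsify_backward_exact(g, m)  # mostly zeros
ste = sparsify_backward_ste(g, m)      # full gradient reaches every entry
```

With STE, even the entries that were masked out in the forward pass still receive a learning signal, which is what keeps training stable at high sparsity.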

△Gradient comparison with/without STE

For the feedforward layer, Q-Sparse replaces the regular ReLU activation with squared ReLU; the squaring further improves the sparsity of the activations (⊙ denotes the Hadamard product).

In addition, to accommodate quantized models, Q-Sparse quantizes the input tensor before applying Top-K sparsification, ensuring that the sparsification operation is compatible with the quantized representation. The function is expressed as follows:

Where ε is a small constant to avoid a situation where the denominator is zero.

In particular, for 1-bit quantized weights, Q-Sparse uses the following quantization function, where α is the mean absolute value of the weight tensor W.
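As a concrete illustration of this kind of weight quantization, here is a sketch of absmean ternary ("1.58-bit") quantization in the style of BitNet b1.58: scale by the mean absolute value α, then round and clip each weight to {-1, 0, +1}. This matches the description above, but the paper's exact function may differ in details.

```python
import numpy as np

def quantize_weights_ternary(w: np.ndarray, eps: float = 1e-6):
    """Absmean ternary quantization: each weight maps to -1, 0, or +1.

    alpha is the mean absolute value of W; eps avoids division by zero.
    """
    alpha = np.mean(np.abs(w))
    w_q = np.clip(np.rint(w / (alpha + eps)), -1, 1)
    return w_q, alpha

w = np.array([[0.8, -0.05, -1.2],
              [0.3,  1.1, -0.7]])
w_q, alpha = quantize_weights_ternary(w)
# Every entry of w_q is in {-1, 0, +1}; alpha is kept as the scale factor
```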

Comparative experiments show that Q-Sparse is significantly better than the previous ReLU method in terms of sparsity rate and model performance.

To evaluate Q-Sparse's actual effect, the authors tested its performance in three settings: training from scratch, continued training, and fine-tuning.

The model used in the training-from-scratch experiments was Llama; the results showed that on the 700M and 7B models, Q-Sparse with a top-K of 70% (i.e., 40% overall sparsity) achieved a training loss comparable to that of the dense baseline.

The goal of continued training is to sparsify an existing dense model; the subject here is Mistral-7B.

As a result, with 2.9B and 3.8B activated parameters, the model's scores on datasets such as ARC and MMLU showed no significant drop.

In the fine-tuning experiments, Q-Sparse showed similar results on the Qwen-7B and Mistral-7B models as in continued training: with about 60% of parameters activated, performance came very close to that of the dense models.

These results imply that, at equal performance, a sparsely activated model needs far fewer activated parameters at inference time than a dense one, which in turn reduces the FLOPS consumed.
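As a rough back-of-envelope check (my own arithmetic, not from the paper): if matrix-multiply FLOPs scale with the number of activated parameters, activating 3.8B of a 7B model's parameters cuts per-token compute by a bit under half.

```python
total_params = 7.0e9       # dense model size
activated_params = 3.8e9   # parameters activated by Q-Sparse
savings = 1 - activated_params / total_params
print(f"{savings:.0%}")    # prints 46%
```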

For quantized models, the team applied Q-Sparse to its own BitNet b1.58 model and trained and evaluated it on multiple datasets.

It can be seen that at the scale of 700M and 7B, the convergence speed and final loss function values of the quantization model using Q-Sparse are comparable to those of the quantization model without Q-Sparse (BitNet b1.58).

This shows that Q-Sparse can be seamlessly integrated into the quantization model without significantly affecting the training and convergence of the model.

Based on this, the authors believe that the combination of Q-Sparse and quantization technology can further improve the efficiency of large language models in the inference stage.

In addition to evaluating the performance of these models with sparse activation, the authors also explored the relationship between model performance, scale, and sparsity rate, and made some new discoveries.

Performance scaling law for sparse active models: The authors found that, similar to dense models, the performance of sparse active models follows a power-law scaling relationship.

Specifically, given the sparsity rate S, the model's loss at convergence, L(N,S), can be approximated by the following formula:

where N is the number of model parameters; E is a constant representing the model's loss in the limit of infinite model size; and A(S) is a scaling factor that depends on the sparsity rate S.

This scaling law shows that the performance of a sparsely activated model improves as the model grows, but with diminishing returns.

At the same time, the authors found that the performance of the model is also affected by the sparsity rate.

As mentioned in the section on the relationship between parameter size and performance, A(S) is a scaling factor related to the sparsity S, which can be approximated by the following formula:

where B and C are constants, and β is a parameter that controls the exponential decay rate.

This formula shows that as the sparsity rate S increases (the model becomes sparser), performance degrades, and the degradation is exponential.
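Putting the two descriptions together, one functional form consistent with the prose looks like the following. This is a sketch only: the exponent δ and the exact shape of the exponential term are my assumptions, and the paper's fitted expressions and constants may differ.

```latex
% Power law in model size N, with a sparsity-dependent coefficient A(S):
L(N, S) \;\approx\; E + \frac{A(S)}{N^{\delta}},
\qquad
% Coefficient growing exponentially as sparsity S increases:
A(S) \;\approx\; B + C\,e^{-\beta\,(1 - S)}
```

As N grows, the A(S)/N^δ term vanishes and the loss approaches E regardless of S, which matches the limiting behavior discussed below.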

Based on the above findings, the authors derive an inference-optimal sparsity rate S*, which minimizes the model's loss for a fixed inference budget (floating-point operations at inference time).

For the full-precision (FP32) model, the optimal sparsity rate is about 45.58%. The optimal sparsity rate of the low-precision (e.g., 1.58-bit) model is higher, about 61.25%.

The authors observed that as the model size increased, the performance gap between the sparse activation model and the dense model gradually narrowed.

This can be explained by the scaling law: when the model size N tends to infinity, the value of the loss function of the sparse activation model tends to L(∞,S)=E, while the value of the loss function of the dense model tends to L(∞,0)=E.

This means that at very large scales, the sparse activation model has the potential to achieve performance comparable to that of the dense model, which provides a useful reference for designing and training large-scale sparse activation models.

Paper: https://arxiv.org/abs/2407.10969
