80% fewer parameters, yet it still outperforms LoRA! Shanghai Jiao Tong University and Shanghai AI Lab launch FLoRA, an efficient fine-tuning framework

Qubits
2024/07/03 13:12

To make large models more effective in specific tasks and scenarios, LoRA-style methods that balance performance against computing resources have become increasingly popular among researchers.

However, many low-rank fine-tuning methods represented by LoRA (including derivatives such as DoRA, MoRA, and AdaLoRA) still share a problem:

They are generally suited only to low-dimensional, "straight-in, straight-out" parameter tensors such as those of Linear and Embedding layers, and overlook higher-dimensional, or general N-dimensional, tensors.

These methods can still fine-tune such parameters by first converting the high-dimensional tensor into a 2-D one (for example, LoRA reshapes the four-dimensional Conv2D weight into a two-dimensional matrix), but two challenges remain:

Disassembling the convolution kernel and reshaping it avoids a large-scale increase in parameters, but it destroys the structural characteristics of the kernel itself, which harms the local inductive bias required for dense prediction tasks.

As the tensor dimension increases, the reshape-to-2D approach causes a sharp increase in the number of parameters, which runs against the original goal of parameter-efficient fine-tuning.

To solve these two problems, researchers from Shanghai Jiao Tong University and Shanghai AI Lab proposed the FLoRA method (the name plays on "flora", plant life in general, hinting at broad coverage).

Taking vision tasks as an example, FLoRA can match LoRA's results with 80% fewer parameters.

The authors argue that parameter updates along each dimension should be carried out through a global low-rank core space, one that preserves the topological relationships and interactions among the different dimensions of the original parameters.

Specifically, the authors use Tucker decomposition to construct this low-rank core space, deriving the N-dimensional form of low-rank fine-tuning from a unified perspective. This extends low-rank fine-tuning to various common layers such as Conv2D, Embedding, and Linear layers. The authors also found that, by adjusting its parameters, FLoRA can degenerate into several different existing low-rank fine-tuning methods.

Convolutions have an inductive bias toward local learning. For a convolutional layer with 10 output channels, 1 input channel, and a 3×3 kernel, the parameter shape is [10, 1, 3, 3], and the last two dimensions [3, 3] form a filter with a square structure.

Splitting this tensor involves both permute and reshape operations, which scatter the originally adjacent filter entries. This makes it harder for the learned parameters to model the original local structure.
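As a minimal sketch (the shapes are illustrative, and the exact flattening varies by implementation), the following shows how one common permute-and-reshape spreads a single square filter across several rows of the resulting 2-D matrix:

```python
import numpy as np

# Tiny illustrative conv weight: d_out=2, d_in=1, k=3.
d_out, d_in, k = 2, 1, 3
w = np.arange(d_out * d_in * k * k).reshape(d_out, d_in, k, k)

# One common LoRA-style flattening: permute to (d_out, k, d_in, k),
# then reshape into a (d_out*k, d_in*k) matrix.
w2d = w.transpose(0, 2, 1, 3).reshape(d_out * k, d_in * k)

# The 9 entries of the first filter w[0, 0] now span 3 separate rows
# of w2d, so a low-rank factorization of w2d no longer treats the
# square filter as a single spatial unit.
rows = sorted(set(np.where(np.isin(w2d, w[0, 0]))[0].tolist()))
print(rows)  # [0, 1, 2]
```

A rank-r factorization of `w2d` therefore models each filter row-by-row rather than as one 3×3 unit, which is the loss of local structure described above.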

In a convolutional layer, the weight tensor has four dimensions: W ∈ R^(d_out×d_in×k×k).

If this weight is split into LoRA's B·A form by reshaping it to (d_out, d_in·k·k), the factors are B ∈ R^(d_out×r) and A ∈ R^(r×d_in·k·k).

If it is instead reshaped to (d_out·k, d_in·k), the factors are B ∈ R^(d_out·k×r) and A ∈ R^(r×d_in·k).

The former has r·(d_out + d_in·k²) parameters, the latter r·k·(d_out + d_in).

When k > 1, the d_in·k² term of the former grows quadratically in k, while the latter grows only linearly, so the former generally introduces far more parameters. The switch to the latter is therefore a compromise between structural integrity and parameter count.
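The two parameter counts can be checked directly; the numbers below use the example values that appear later in the text (d_in = 256, d_out = 512, k = 3, r = 32):

```python
# Parameter counts for the two LoRA-style splits of a Conv2D weight
# of shape (d_out, d_in, k, k).
d_in, d_out, k, r = 256, 512, 3, 32

# Split 1: reshape to (d_out, d_in*k*k) -> B: (d_out, r), A: (r, d_in*k*k)
params_split1 = r * (d_out + d_in * k * k)

# Split 2: reshape to (d_out*k, d_in*k) -> B: (d_out*k, r), A: (r, d_in*k)
params_split2 = r * k * (d_out + d_in)

print(params_split1, params_split2)  # 90112 73728
```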

Tucker decomposition is a tensor factorization method. An N-dimensional tensor W ∈ R^(I1×I2×⋯×IN) can be expressed as the product of a core tensor G ∈ R^(J1×J2×⋯×JN) and a factor matrix U(n) ∈ R^(In×Jn) along each dimension, where Jn is the reduced channel size of the nth dimension. It can be written as:

W = G ×1 U(1) ×2 U(2) ⋯ ×N U(N)

where ×n is the mode-n product, the multiplication of a tensor and a matrix along the nth dimension.
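A minimal NumPy sketch of the mode-n product and a Tucker reconstruction (all names and sizes here are illustrative):

```python
import numpy as np

# Mode-n product: multiply tensor T by matrix U along axis n.
def mode_n_product(T, U, n):
    # Contract axis n of T (size J_n) with the second axis of U (I_n, J_n),
    # then move the new I_n axis back into position n.
    return np.moveaxis(np.tensordot(U, T, axes=(1, n)), 0, n)

# Tucker reconstruction of a 3-way tensor from a core G and factor matrices.
J1, J2, J3 = 2, 3, 2          # core (reduced) sizes
I1, I2, I3 = 4, 5, 6          # full sizes
rng = np.random.default_rng(0)
G = rng.standard_normal((J1, J2, J3))
Us = [rng.standard_normal((I, J)) for I, J in [(I1, J1), (I2, J2), (I3, J3)]]

W = G
for n, U in enumerate(Us):
    W = mode_n_product(W, U, n)

print(W.shape)  # (4, 5, 6)
```

Applying the mode products in sequence reconstructs the full-size tensor; the core stays small while each factor matrix restores one dimension.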

In Tucker decomposition, the core tensor captures the interactions between different dimensions, while the factor matrices act like the principal components of each dimension. By relying on the core tensor to learn the relationships across dimensions, and on each dimension's factor matrix to learn that dimension's intrinsic characteristics, the learning process can be optimized while preserving the topology of the N-dimensional tensor.

Based on the above introduction to Tucker decomposition, the authors bring it into parameter-efficient fine-tuning. Specifically, where LoRA learns the update

ΔW = s·B·A, with B ∈ R^(d_out×r) and A ∈ R^(r×d_in),

FLoRA unifies the update of an N-dimensional tensor as:

ΔW = s · G ×1 A(1) ×2 A(2) ⋯ ×N A(N)

where G ∈ R^(r1×⋯×rN) is the core tensor, s is an adjustable scale coefficient, and A(n) ∈ R^(In×rn) is the low-rank matrix of the nth dimension, with rank rn ≪ In.

For a convolutional kernel parameter with 4 dimensions, this becomes

ΔW = s · G ×1 A(1) ×2 A(2) ×3 A(3) ×4 A(4)

where G ∈ R^(r1×r2×r3×r4), A(1) ∈ R^(d_out×r1), A(2) ∈ R^(d_in×r2), and A(3) ∈ R^(k×r3), A(4) ∈ R^(k×r4).

r3 and r4 generally take the same value, which is smaller than the convolution kernel size k. Based on the equation above, the authors argue that fine-tuning convolution parameters involves a Convolution Core, and FLoRA is responsible for finding this core and weighting the different dimensions. Compared with LoRA, FLoRA allows a larger rank r under the same parameter budget; at the same rank, FLoRA greatly reduces the parameter count.
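A sketch of how the FLoRA convolution update could be assembled with a single einsum, following the shapes defined above (the concrete sizes are illustrative):

```python
import numpy as np

# FLoRA update for a Conv2D weight of shape (d_out, d_in, k, k):
# delta_W = s * G x1 A1 x2 A2 x3 A3 x4 A4.
d_out, d_in, k = 8, 4, 3
r1, r2, r3, r4 = 6, 6, 2, 2
s = 4.0                                     # illustrative scale

rng = np.random.default_rng(0)
G = rng.standard_normal((r1, r2, r3, r4))   # convolution core
A1 = rng.standard_normal((d_out, r1))
A2 = rng.standard_normal((d_in, r2))
A3 = rng.standard_normal((k, r3))
A4 = rng.standard_normal((k, r4))

# All four mode products collapsed into one einsum.
delta_W = s * np.einsum('abcd,ia,jb,kc,ld->ijkl', G, A1, A2, A3, A4)
print(delta_W.shape)  # (8, 4, 3, 3)
```

The update has the same four-dimensional shape as the convolution weight, so it can be added to the frozen kernel without any reshaping.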

For example, let k = 3, r3 = r4 = 2, r1 = r2 = r = 32, d_in = 256, d_out = 512.

The number of FLoRA parameters is: r1·r2·r3·r4 + d_out·r1 + d_in·r2 + k·(r3 + r4) = 4096 + 16384 + 8192 + 12 ≈ 28.7K.

The number of LoRA parameters is: r·k·(d_in + d_out) = 73728 ≈ 73.7K.

For FLoRA to use the same number of parameters as LoRA, r1 = r2 could be raised to about r = 70.
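The arithmetic of this worked example can be verified directly:

```python
# Worked example: k=3, r3=r4=2, r1=r2=r=32, d_in=256, d_out=512.
k, r3, r4, r = 3, 2, 2, 32
d_in, d_out = 256, 512

flora = r * r * r3 * r4 + d_out * r + d_in * r + k * (r3 + r4)
lora = r * k * (d_in + d_out)   # (d_out*k, d_in*k) reshape split
print(flora, lora)              # 28684 73728

# Largest r1 = r2 = r' that fits within LoRA's parameter budget.
for rp in range(1, 200):
    if rp * rp * r3 * r4 + (d_out + d_in) * rp + k * (r3 + r4) > lora:
        print(rp - 1)           # 70
        break
```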

For a linear layer parameter with 2 dimensions, this becomes

ΔW = s · G ×1 A(1) ×2 A(2) = s · A(1) · G · A(2)ᵀ

where G ∈ R^(r×r), A(1) ∈ R^(d_out×r), and A(2) ∈ R^(d_in×r). Analogous to the convolution core in the 4-dimensional case, G here is the corresponding linear core.

Following the example above, at the same r, FLoRA's extra cost over LoRA is only the r×r core, an overhead of r²/(r·(d_in + d_out)) = r/(d_in + d_out); for the example values this is 32/768 ≈ 4.17%.
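For the 2-D case the update collapses to plain matrix products; a minimal sketch with the example shapes (the value of s is illustrative):

```python
import numpy as np

# FLoRA on a 2-D linear weight: delta_W = s * A1 @ G @ A2.T,
# where G is the r x r "linear core".
d_out, d_in, r, s = 512, 256, 32, 0.04
rng = np.random.default_rng(0)
G = rng.standard_normal((r, r))
A1 = rng.standard_normal((d_out, r))
A2 = rng.standard_normal((d_in, r))

delta_W = s * A1 @ G @ A2.T
print(delta_W.shape)  # (512, 256)

# Overhead vs. LoRA is just the core: r^2 / (r*(d_in+d_out)) = r/(d_in+d_out).
print(round(100 * r / (d_in + d_out), 2))  # 4.17
```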

In practice, thanks to the core tensor, the effective r1 and r2 can be smaller than LoRA's r, so FLoRA can match or even exceed LoRA's results at the same or even a smaller parameter scale.

In LoRA, the value of s is determined by r and another hyperparameter α (s = α/r), and is usually fixed, for example at s = 2.

In FLoRA, this value is set directly as a fixed hyperparameter, removing the need for α; s essentially replaces α, so no additional hyperparameters are introduced compared to LoRA.

For the selection of s, the authors found that the best values differ across parameter scales and model types (i.e., parameter spaces of different dimensions), but show certain regularities. For convolutional models, s should be as large as possible within a certain range; it is set to 4 when fine-tuning with ConvNeXt-L as the backbone. For linear models, s should be as small as possible; it is set to 0.04 when fine-tuning InternViT-6B and LLaVA-7B.

The authors conducted experiments on visual, language, and multimodal tasks, covering two model types (Conv and ViT) and four parameter scales (DeBERTav3-base: 184M, ConvNeXt-large: 196M, InternViT-6B, LLaVA-v1.5-7B), across 18 datasets.

Experimental results show that FLoRA achieves significant performance improvements across various visual tasks, matching LoRA's results even with 80% fewer parameters. This suggests that modeling cross-dimensional relationships with a core tensor, instead of destroying the tensor's topology, benefits the fine-tuning of multi-dimensional parameters.

On language tasks, the authors likewise achieved significant performance gains at all tunable-parameter scales.

On multimodal tasks, the authors also evaluated visual instruction tuning based on LLaVA-v1.5-7B, which likewise showed better results than LoRA.

The authors also fine-tuned the diffusion model and gave a comparison of the generated results.

The authors also explain the differences between FLoRA and LoRA in terms of training time and memory overhead.


This article is from Xinzhi self-media and does not represent the views and positions of Business Xinzhi.

If there is any suspicion of infringement, please contact the administrator of the Business News Platform.

Contact: system@shangyexinzhi.com
