
Xiaomi's new framework for improving the efficiency of large models: training is up to 34% faster, and inference is up to 52%! Produced in collaboration with the father of Kaldi

Qubits 2024/06/24 16:32
Contributed by Xiaomi AI Lab
Qubits | Official account QbitAI

Large-model inference speed is improved by more than 50%, while few-shot learning performance is preserved!

Xiaomi's large-model team has proposed SUBLLM (Subsampling-Upsampling-Bypass Large Language Model), with guidance from Daniel Povey, the internationally renowned speech AI expert and father of the open-source speech recognition toolkit Kaldi.

Compared to models such as LLaMA, SUBLLM shows significant improvements in training and inference speed, as well as lower memory usage.

In large-model training, SUBLLM is 26% faster and memory per GPU is reduced by 10 GB. In inference, it is 37% faster and uses 1 GB less memory per GPU.

Training and inference speeds can be increased by up to 34% and 52%, respectively.


SUBLLM makes the model more efficient in training and inference by intelligently selecting and processing data: the subsampling module discards unnecessary information, the upsampling module restores the integrity of the sequence, and the bypass module speeds up the learning process.


Picking the most crucial 500 words out of 10,000

Today, large models in the cloud typically require as many as eight GPUs to handle very long text tasks, which is both time-consuming and expensive. If a large model is compared to the human brain, its operating power consumption is more than 100 times that of the brain.

Previously, Daniel Povey proposed Zipformer for speech recognition, which can downsample the internal frame rate by as much as 16x while matching or even exceeding the recognition accuracy of larger models, accomplishing the speech-recognition equivalent of "moving a thousand pounds with four ounces."

Xiaomi's large-model team set out to extend this idea to large language models, achieving more efficient computation without compromising performance.

In general, SUBLLM works by introducing subsampling, upsampling, and bypass modules to dynamically allocate computing resources, thereby reducing the computational burden of redundant tokens and accelerating both training and inference.

It is like picking the most critical 500 words out of 10,000 words, keeping the necessary parts of the text and cutting out the redundancy, so that the text that the large model needs to process is shorter.


In terms of implementation, the subsampling module filters tokens based on their importance scores, keeping the important tokens and discarding the unimportant ones.

Subsequently, the upsampling module restores the subsampled sequences to their original lengths to ensure the sequential consistency of the language model when generating tokens.

At the same time, the bypass module further improves the convergence speed of the model by combining the sequences before and after subsampling. This design not only significantly reduces the computational cost, but also maintains the semantic integrity of the input sequence.
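The paper's exact implementation is not reproduced here, but the following PyTorch-style sketch illustrates what the three modules described above might look like. The Subsampling, Upsampling, and Bypass classes, the linear scoring head, and the keep_ratio parameter are illustrative assumptions, not the official code.

```python
import torch
import torch.nn as nn


class Subsampling(nn.Module):
    """Score each token and keep only the highest-scoring fraction (illustrative)."""

    def __init__(self, dim: int, keep_ratio: float = 0.5):
        super().__init__()
        self.score = nn.Linear(dim, 1)      # hypothetical importance scorer
        self.keep_ratio = keep_ratio

    def forward(self, x: torch.Tensor):
        # x: (batch, seq_len, dim)
        scores = self.score(x).squeeze(-1)                             # (batch, seq_len)
        k = max(1, int(x.size(1) * self.keep_ratio))
        keep_idx = scores.topk(k, dim=1).indices.sort(dim=1).values    # preserve token order
        kept = torch.gather(x, 1, keep_idx.unsqueeze(-1).expand(-1, -1, x.size(-1)))
        return kept, keep_idx                                          # shortened sequence + kept positions


class Upsampling(nn.Module):
    """Scatter the processed short sequence back to the original length."""

    def forward(self, short, keep_idx, original):
        restored = original.clone()          # dropped tokens keep their earlier states
        restored.scatter_(1, keep_idx.unsqueeze(-1).expand(-1, -1, short.size(-1)), short)
        return restored


class Bypass(nn.Module):
    """Blend pre-subsampling and post-upsampling states with a learned weight."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Parameter(torch.full((dim,), 0.5))   # learnable mixing weight

    def forward(self, before, after):
        return self.gate * after + (1.0 - self.gate) * before
```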

SUBLLM can be thought of as a skilled editor: just as our brain picks out the main points, it quickly identifies which words are critical and which are less important when reading a long passage of text. SUBLLM retains the important words and ignores the less important ones, which greatly reduces the amount of information that needs to be processed.

SUBLLM can then restore the condensed information to its original integrity, ensuring the coherence and completeness of the whole text. While processing information, SUBLLM also finds the best way to express it more quickly.

Next, let's take a closer look at the model structure of SUBLLM.

What does SUBLLM look like?

Not long ago, Google DeepMind proposed the Mixture-of-Depths (MoD) model structure, which operates under a static compute budget, uses a router in each block to select which tokens receive computation, and optimizes FLOP usage by routing tokens either through the self-attention and MLP blocks or through a residual connection.

Earlier, the classic CoLT5 paper used conditional routing to decide whether a given token passes through the feed-forward and attention layers via a light or a heavy branch, allocating more resources to the important tokens.
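For intuition, here is a minimal, unofficial sketch of this kind of per-token conditional routing (not the MoD or CoLT5 reference implementation): a router scores each token, only the top-scoring tokens pass through the heavy branch, and the rest take the identity path. The RoutedBlock class, its capacity parameter, and the simple MLP standing in for the heavy attention+MLP block are assumptions for illustration.

```python
import torch
import torch.nn as nn


class RoutedBlock(nn.Module):
    """Per-token conditional routing: heavy compute for a few tokens, identity for the rest."""

    def __init__(self, dim: int, capacity: float = 0.25):
        super().__init__()
        self.router = nn.Linear(dim, 1)     # per-token routing score
        # A simple MLP stands in for the full self-attention + MLP block here.
        self.heavy = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.capacity = capacity            # fraction of tokens given full compute

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        scores = self.router(x).squeeze(-1)                          # (batch, seq_len)
        k = max(1, int(x.size(1) * self.capacity))
        idx = scores.topk(k, dim=1).indices.sort(dim=1).values       # keep token order
        selected = torch.gather(x, 1, idx.unsqueeze(-1).expand(-1, -1, x.size(-1)))
        processed = selected + self.heavy(selected)                  # heavy branch (with residual)
        out = x.clone()                                              # light path: identity
        out.scatter_(1, idx.unsqueeze(-1).expand(-1, -1, x.size(-1)), processed)
        return out
```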

Similar to these models, SUBLLM uses principles that are close to the human brain's information processing mechanisms.

The human brain has two modes of thinking: a low-power fast mode and a high-power slow mode. The two have a clear division of labor, yet they share the same brain regions.

The SUBLLM authors therefore approached the question of how to distribute a large model's computing power from the perspective of this information-processing pattern: important tokens receive full compute, while relatively unimportant tokens receive less.

Specifically, SUBLLM is based on a decoder-only large language model architecture, with upgrades at a few specific layers that leave the original model structure unchanged.


To manage the number of tokens to be processed, subsampling and upsampling modules are integrated between Transformer blocks.

First, the model processes the complete sequence using several Transformer blocks, capturing a comprehensive representation of the token sequence.

The subsampling modules then temporarily remove non-critical tokens, reducing the length of the sequence that must be processed.

The reduced sequence is then subsampled again, i.e., the reductions are nested, and the highest degree of sequence compression occurs in the middle-most Transformer blocks of the network.

Subsequently, the sequence length is gradually recovered using the upsampling modules. These modules merge the shorter processed sequences with the original pre-subsampling sequences, returning them to full length.

This mechanism allows the decoder-only model to operate as a language model, generating tokens sequentially while guaranteeing that the input and output sequences have the same length.

In addition, a bypass connection module is integrated after each upsampling step; it uses the embeddings from before the corresponding subsampling step to improve learning across the subsample-upsample path.
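Putting the pieces together, a rough sketch of the forward pass described above might look as follows. It reuses the illustrative Subsampling, Upsampling, and Bypass modules from the earlier sketch; the number of blocks, the single level of nesting, and the use of nn.TransformerEncoderLayer (with causal masking omitted for brevity) are assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn


class SUBLLMSketch(nn.Module):
    """One level of subsample -> process -> upsample -> bypass (illustrative)."""

    def __init__(self, dim: int = 512, n_outer: int = 2, n_inner: int = 4):
        super().__init__()

        def block():
            # Stand-in decoder block; causal masking is omitted for brevity.
            return nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)

        self.pre = nn.ModuleList([block() for _ in range(n_outer)])    # full-length blocks
        self.down = Subsampling(dim, keep_ratio=0.5)
        self.mid = nn.ModuleList([block() for _ in range(n_inner)])    # shortened-sequence blocks
        self.up = Upsampling()
        self.bypass = Bypass(dim)
        self.post = nn.ModuleList([block() for _ in range(n_outer)])   # full-length blocks

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for blk in self.pre:                  # capture a representation of the full sequence
            x = blk(x)
        before = x                            # saved for upsampling and the bypass
        short, idx = self.down(x)             # drop low-importance tokens
        for blk in self.mid:                  # most of the compute runs on the short sequence
            short = blk(short)
        x = self.up(short, idx, before)       # restore the original length
        x = self.bypass(before, x)            # blend pre- and post-subsampling states
        for blk in self.post:
            x = blk(x)
        return x
```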

Subsequent experiments confirm that this method significantly improves the convergence efficiency.

Compared to the LLaMA model, SUBLLM achieves 26% and 37% speed improvements in training and inference, respectively, while maintaining performance and significantly reducing memory cost.

The paper provides a detailed analysis of computational efficiency in both the pre-training and inference stages.

— END —
