Running large models on mobile phones, 4-5x faster! Microsoft Research Asia open-sources a new technique that needs only a CPU

Qubits
2024/08/09 13:20

Qubits | Official account QbitAI

With a CPU, you can run large models, and the performance even exceeds that of NPU/GPU!

That's right: to optimize on-device model deployment, Microsoft Research Asia has proposed a new technique called T-MAC.

The technique focuses on cost-effectiveness: it not only makes on-device models run faster, but also consumes fewer resources.

How??

Generally speaking, in order to use large language models on mobile phones, PCs, Raspberry Pis, and other devices, we need to solve storage and computing problems.

A common approach is model quantization, which quantizes the parameters of the model to a lower number of bits, such as 4 bits, 3 bits, or even lower, so that the storage space and computing resources required by the model are reduced.
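As a minimal illustration of what quantization does (a generic sketch, not the scheme any particular model uses), symmetric per-group 4-bit quantization stores each weight as a small integer plus a shared scale:

```python
import numpy as np

def quantize_4bit(w, group_size=32):
    """Symmetric per-group 4-bit quantization sketch: each group of
    weights shares one float scale; values are stored in [-8, 7]."""
    w = w.reshape(-1, group_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0  # per-group scale
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.standard_normal(64).astype(np.float32)
q, scale = quantize_4bit(w)
w_hat = (q * scale).reshape(-1)  # dequantized approximation of w
```

Each stored value now needs 4 bits instead of 16 or 32, at the cost of a small rounding error per weight.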

However, this also means that mixed-precision matrix multiplication (mpGEMM) is required to perform inference, i.e., with low-precision weights and high-precision activation vectors.

Existing systems and hardware, however, do not natively support this mixed-precision matrix multiplication, so they often have to convert the low-precision weights back to high precision first, a process called dequantization.

This approach is not only inefficient, but also yields no further performance gains as the bit width shrinks.
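To see why, here is a sketch of the dequantization path (illustrative names, not any real kernel): the low-bit weights are expanded back to float before an ordinary GEMV runs, so the number of multiplies stays the same no matter how few bits the weights use:

```python
import numpy as np

def dequant_gemv(q_weights, scales, x):
    """Dequantization-based mixed-precision GEMV sketch: int4 weights
    are expanded back to float, then a full-precision matrix-vector
    product runs. Shrinking the weight bit width saves memory but
    not multiplies, since the float GEMV below is unchanged."""
    w = q_weights.astype(np.float32) * scales  # dequantize first
    return w @ x                               # then a normal GEMV

rng = np.random.default_rng(1)
q = rng.integers(-8, 8, size=(4, 16)).astype(np.int8)
scales = np.full((4, 1), 0.1, dtype=np.float32)
x = rng.standard_normal(16).astype(np.float32)
y = dequant_gemv(q, scales, x)
```

The dequantization step is pure overhead, and it grows relatively larger as the weights get smaller.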

To address this, the new T-MAC technique adopts a look-up table (LUT) based computing paradigm, which needs no dequantization and supports mixed-precision matrix multiplication directly.

In this way, T-MAC not only improves the inference performance, but also makes the model more unified and scalable, especially suitable for deployment on resource-constrained devices.

In addition, T-MAC does not rely on dedicated hardware accelerator NPUs or GPUs, enabling the model to be deployed using only the CPU. In some cases, it can even outpace dedicated accelerators for inference.

The key innovation of T-MAC is the adoption of a look-up table (LUT) based computing paradigm in place of the traditional multiply-accumulate (MAC) paradigm.

T-MAC leverages lookup tables to directly support low-bit calculations, eliminating the dequantization operations necessary in other systems and significantly reducing the number of multiplication and addition operations.

In experiments, T-MAC showed excellent performance:

On Surface AI PCs equipped with the latest Qualcomm Snapdragon X Elite chipset, the 3B BitNet-b1.58 model can generate up to 48 tokens per second, the 2-bit 7B llama model can generate up to 30 tokens per second, and the 4-bit 7B llama model can generate up to 20 tokens per second.

This even surpasses the performance of the NPU!

When deploying the llama-2-7B-4bit model, the NPU generates 10.4 tokens per second, while the CPU with T-MAC reaches 12.6 tokens per second using only two cores, and can even reach 22 tokens per second.

Both far exceed the average human reading speed, and are 4-5x faster than the original llama.cpp framework.

△BitNet on T-MAC (based on LUTs) vs llama.cpp (based on dequantization)

Even on lower-end devices such as the Raspberry Pi 5, T-MAC achieves a generation rate of 11 tokens per second with the 3B BitNet-b1.58 model.

At the same time, the T-MAC also has significant power consumption advantages:

To achieve the same generation rate, the number of cores required for T-MAC is only 1/4 to 1/6 of the original llama.cpp, reducing energy consumption and leaving computing resources for other applications.

It is worth noting that T-MAC's computational performance scales up linearly as the bit width decreases, something difficult to observe in dequantization-based GPUs and NPUs.

This further enables T-MAC to reach 10 tokens per second on a single core and 28 tokens per second on four cores at 2 bits, far exceeding NPU performance.

With the results out of the way, let's move on to the technical details of T-MAC.

For low-bit parameters (weights), T-MAC groups the bits (e.g., into groups of 4 bits), precomputes all possible partial sums of these bit groups multiplied with the activation vector, and stores them in look-up tables (LUTs).

T-MAC then uses shift and accumulate operations to scale from 1 bit up to 4 bits.
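The steps above can be sketched in plain Python (a conceptual model with made-up names, not T-MAC's actual SIMD kernels): build one 16-entry table per group of 4 activations, turn each dot product into table lookups, and handle multi-bit weights by running the 1-bit kernel once per bit plane and shift-accumulating:

```python
import numpy as np

def build_lut(x_group):
    """Precompute the dot product of 4 activations with every possible
    4-bit weight pattern (2**4 = 16 entries). Built once per activation
    vector, then reused by every weight row."""
    lut = np.zeros(16)
    for pattern in range(16):
        lut[pattern] = sum(x_group[i] for i in range(4) if (pattern >> i) & 1)
    return lut

def lut_gemv_1bit(w_bits, x):
    """w_bits: (rows, K) array of 0/1 weight bits; x: (K,) activations.
    Each K-length dot product becomes K/4 table lookups, no multiplies."""
    rows, K = w_bits.shape
    luts = [build_lut(x[g:g + 4]) for g in range(0, K, 4)]
    y = np.zeros(rows)
    for r in range(rows):
        for gi, g in enumerate(range(0, K, 4)):
            idx = int(w_bits[r, g] + 2 * w_bits[r, g + 1]
                      + 4 * w_bits[r, g + 2] + 8 * w_bits[r, g + 3])
            y[r] += luts[gi][idx]
    return y

def lut_gemv_nbit(w_int, nbits, x):
    """Multi-bit weights: one pass per bit plane, then shift-accumulate."""
    y = np.zeros(w_int.shape[0])
    for b in range(nbits):
        y += lut_gemv_1bit((w_int >> b) & 1, x) * (1 << b)
    return y

rng = np.random.default_rng(2)
w = rng.integers(0, 4, size=(3, 16))  # unsigned 2-bit weights
x = rng.standard_normal(16)
```

One pass over a bit plane costs K/4 lookups per dot product, so total work grows linearly with the number of weight bits, matching the linear scaling noted earlier.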

In this way, T-MAC replaces the CPU's inefficient FMA (fused multiply-add) instructions with the lower-power, more efficient TBL/PSHUF (table-lookup) instructions.

△ Mixed-precision GEMV based on the existing dequantization paradigm vs. T-MAC based on the new look-up-table paradigm

Traditional dequantization-based computation is actually a data-type-centric computation, which needs to be customized for each different data type.

Each bit-width combination of activations and weights, such as W4A16 (int4 weights, float16 activations) or W2A8, requires a specific weight layout and compute kernel.

For example, the W3 layout requires packing the 2-bit part and the remaining 1-bit part separately, using different interleaving or shuffling tricks for memory alignment and fast decoding.

The corresponding compute kernels then have to unpack this specific layout into hardware-supported data types for execution.
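As a toy illustration of such a layout (a deliberately simplified packing, not the real kernel's), 3-bit weights can be split into a 2-bit plane and a 1-bit plane stored separately:

```python
import numpy as np

def pack_w3(w):
    """Split 3-bit weights (values 0..7) into a 2-bit plane and a
    1-bit plane. Real kernels additionally pack several values per
    byte and interleave them for fast SIMD decoding; this sketch
    keeps one value per array element for clarity."""
    w = np.asarray(w, dtype=np.uint8)
    low2 = w & 0b11           # the two low bits
    high1 = (w >> 2) & 0b1    # the remaining top bit
    return low2, high1

def unpack_w3(low2, high1):
    """The compute kernel must reverse the layout before it can
    operate on hardware-supported data types."""
    return (high1 << 2) | low2

w = np.array([0, 3, 5, 7, 2, 6], dtype=np.uint8)
lo, hi = pack_w3(w)
```

Every such bit-width combination needs its own pack/unpack pair, which is exactly the per-data-type customization T-MAC avoids.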

T-MAC, on the other hand, views the low-bit matrix from the perspective of individual bits: it only needs to design the optimal data structure for a single bit, then scales to 2/3/4 bits by stacking.

Meanwhile, for activation vectors of different precisions (float16/float32/int8), only the table-building step changes; the table lookup itself needs no different data structures.

△ Bit-centric look-up tables computing mixed-precision GEMV

Meanwhile, when traditional dequantization-based methods drop from 4 bits to 3/2/1 bits, memory usage shrinks but the amount of computation does not, and performance may even get worse because the cost of dequantization rises rather than falls.

T-MAC's computation, by contrast, shrinks linearly as the bit width decreases, delivering better speedups at lower bits and providing an efficient deployment path for the latest 2-bit models released by BitNet, EfficientQAT, and others.

For example, the following diagram illustrates:

(1) Using a single CPU core on various devices, T-MAC's mixed-precision GEMV kernel at 4 down to 1 bits is 3-11x faster than llama.cpp.

(2) T-MAC's GEMM latency decreases linearly with the bit width, which dequantization-based llama.cpp cannot achieve (the 1-bit llama.cpp kernel's performance is estimated from its 2-bit implementation).

In summary, bit-based computing has many advantages, but it still has its own challenges when implemented on CPUs:

In contrast to the sequential memory accesses of activations and weights, accesses to the table are random.

Keeping the tables resident in fast on-chip memory is critical for peak inference performance; however, on-chip memory is limited, and the look-up table (LUT) method increases on-chip memory usage compared with traditional mpGEMV.

This is because the lookup table needs to hold the result of multiplying the activation vector with all possible bit patterns, which is much more than the activation itself.
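A quick back-of-the-envelope check makes the blow-up concrete (assumed parameters: groups of g activations and float16 table entries): for g activations, the table must hold 2**g precomputed sums.

```python
def lut_bytes(g, bytes_per_entry=2):
    """On-chip memory for one activation group vs. its lookup table,
    assuming float16 (2-byte) entries and a group size of g."""
    act_bytes = g * bytes_per_entry           # the activations themselves
    table_bytes = (2 ** g) * bytes_per_entry  # all 2**g partial sums
    return act_bytes, table_bytes

print(lut_bytes(4))  # a group of 4 fp16 activations: 8 bytes of
                     # activations vs. 32 bytes of table entries
```

The table grows exponentially with the group size, which is why fitting it into limited on-chip memory becomes the central design problem.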

△ T-MAC's computational data flow vs. llama.cpp's

To this end, researchers at Microsoft Research Asia dug into the data flow of table-based computation and designed an efficient data structure and computation flow for this paradigm, including:

1. Store the LUTs in on-chip memory, using the CPU's table-lookup vector instructions (TBL/PSHUF) to improve random-access performance.

2. Change the order of the matrix-axis loops to maximize data reuse of the limited LUTs held in on-chip memory.

3. Design an optimal matrix tiling scheme specifically for table lookup, and search for the best tiling parameters with autotvm.

4. Layout optimization of the weight parameters:

a. Weights are rearranged to enable as many sequential accesses as possible and improve the cache hit rate

b. Weights are interleaved to improve decoding efficiency

5. Targeted optimizations for Intel/ARM CPUs, including:

a. Register-level rearrangement to build lookup tables quickly

b. Fast 8-bit accumulation using averaging instructions

The researchers applied these optimizations step by step on top of a basic implementation, eventually achieving a significant speedup over SOTA low-bit kernels.

For example, with all optimizations in place, T-MAC's 4-bit kernel ends up significantly faster than llama.cpp's.

Finally, T-MAC is now open source, and the related paper has been published on arXiv for those who want to learn more.
