
【Semiconductor】GPU training of Llama 3.1 crashes constantly, while a major vendor runs a hundred-billion-parameter large model on CPU servers?

Artificial Intelligence Industry Chain Alliance 2024/08/03 20:32





It's time to run a hundred-billion-parameter large model on a general-purpose CPU server!

Musk built the world's largest supercomputer out of 100,000 H100s in 19 days and has thrown it fully into the training of Grok 3.

At the same time, foreign media reported that the next supercomputing cluster jointly built by OpenAI and Microsoft will consist of 100,000 GB200s.

In this AI race, major technology companies are investing heavily in GPUs, as if having more and more powerful GPUs would make them invincible.

However, this frenetic pursuit of high-end GPUs is not a flawless solution in all cases.


According to the father of PyTorch, the Llama 3.1 technical report hides many interesting details about the infrastructure, including how the training was parallelized and how the system was made more reliable.

In terms of stability, over the 54 days of Llama 3.1 training, Meta's 16,000-GPU H100 cluster experienced a total of 419 unexpected interruptions, an average of one every three hours.

Of those, 148 (30.1%) were due to various GPU failures.

In comparison, only 2 interruptions were caused by CPU failures.


On the other hand, if you want to run Llama 3.1 405B, you need to pair it with two 8×H100 DGX nodes, that is, 1,280 GB of GPU memory.

One brave soul tried running it on a single RTX 4090 and waited 30 minutes for the model to slowly spit out a "The".


A complete reply took a full 20 hours.

Anyone familiar with model training and inference knows that these things are not surprising at all.

Cluster construction (GPU configuration, network design, rail optimization, etc.) and cluster management (real-time monitoring, troubleshooting, etc.) are all "roadblocks".

What about companies that lack the relevant experience and capital?


Recently, R&D engineers at Inspur Information got the 100-billion-parameter "Yuan 2.0" model running on a general-purpose server with only 4 CPUs!

Given the task of writing a program in Java, Yuan 2.0 returned a result very quickly.


Give it another reasoning question: a rope ladder hangs over the side of a boat, 2 meters above the sea surface; the sea rises half a meter per hour; after how many hours will the water submerge the ladder?

Again, the AI gave detailed steps and the answer with almost zero delay.


Running a large model with hundreds of billions of parameters on a general-purpose server is unprecedented; there was no prior accumulation in this area and no experience to draw on.

How does Inspur Information do it?

With 4 CPUs, leveraging a 100-billion-parameter large model

Inference for a hundred-billion-parameter model on a single server involves two main stages, each of which places hard demands on the hardware.

First, there is the prefill phase, also known as the forward-pass phase.

This phase involves the processing of the input data and the first reading of the model parameters.

For example, when you enter the prompt "Write me an article about AI", all the tokens of the prompt and the model parameters enter the computation at once during the prefill stage.

Sometimes, this input may be a few words, or thousands of words, or a book.

How much computation the first stage requires depends mainly on the length of the input.

When computing the first token, because the model is being loaded for the first time, all the weight parameters, the KV cache, and other data are stored in memory.

This is 2-3 times the memory space occupied by the model parameters themselves.

For a 100-billion-parameter model, a huge number of parameters and input data must be processed by a powerful compute unit, which therefore needs to support vectorized and matrix-computation instruction sets to carry out large volumes of matrix multiplications and tensor operations.

The second is the decoding phase, the stage where the model starts producing output after the full input has been processed.

At this stage, the only requirement on the large model is to produce output as fast as possible. The challenge is no longer compute power; it is "data handling".

This "data handling" has two parts: reading the model weights from memory into the compute units, and reading and updating the KV cache.

These transfers play a decisive role in the inference speed of large models: the faster the data moves, the faster the LLM can produce words.

LLM output is generated token by token through the KV cache, storing the key-value vectors of the newly generated token after each step.

Therefore, real-time inference for a hundred-billion-parameter model requires the server to have both high compute power and high data-transfer efficiency from the storage units to the compute units.

In short, the two stages of large model inference have completely different computing characteristics, and need to be co-optimized in terms of software and hardware.
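To make the two stages concrete, here is a minimal PyTorch sketch of prefill versus decode: the whole prompt is processed in one pass, while decoding generates one token at a time by appending to a KV cache. The tiny `TinyAttention` module and its dimensions are illustrative stand-ins, not Yuan 2.0's actual architecture.

```python
import torch

# Illustrative single-head attention with a KV cache; a toy stand-in,
# not the real Yuan 2.0 implementation.
class TinyAttention(torch.nn.Module):
    def __init__(self, d_model: int = 64):
        super().__init__()
        self.qkv = torch.nn.Linear(d_model, 3 * d_model, bias=False)
        self.out = torch.nn.Linear(d_model, d_model, bias=False)
        self.d = d_model

    def forward(self, x, kv_cache=None):
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        if kv_cache is not None:                     # decode: append to the cache
            k = torch.cat([kv_cache[0], k], dim=1)
            v = torch.cat([kv_cache[1], v], dim=1)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d ** 0.5, dim=-1)
        return self.out(attn @ v), (k, v)

model = TinyAttention()
prompt = torch.randn(1, 16, 64)                      # 16 prompt tokens

# Prefill: the whole prompt goes through the model once; compute-bound.
with torch.no_grad():
    y, kv = model(prompt)

# Decode: one token per step; every step re-reads the weights and the growing
# KV cache, so speed is limited by memory bandwidth rather than FLOPs.
next_token = y[:, -1:, :]
with torch.no_grad():
    for _ in range(8):
        next_token, kv = model(next_token, kv_cache=kv)
```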

GPUs aren't a panacea

Traditionally, GPUs have been the first choice for AI training and inference due to their superior parallel processing capabilities.

Cost

However, high-end GPU servers are often in short supply in the market and are extremely difficult to obtain.

Only well-funded tech giants, such as Microsoft and Google, can afford the cost.

On the other hand, many players not only can't afford to buy them, they can't afford to run them either.

GPU-based cloud rentals are costly for inference tasks, so researchers and application vendors who need better cost-effectiveness have to look elsewhere.

Memory

In addition, one of the biggest disadvantages of GPUs is that they have limited memory capacity.

At present, the network architecture of LLMs in the industry has gradually shifted from GPT to MoE, and the parameter scale of the large models leading toward AGI will only grow exponentially.

This means that the size of closed-source/open-source mainstream models will only get bigger, and hundreds of billions of parameters, or even trillions, of parameters will become mainstream.

For a model with tens of billions of parameters, 20-30 GB of GPU memory is sufficient. To run hundreds of billions of parameters, however, you need roughly 200-300 GB of memory.

At present, mainstream AI chips usually have only tens of GB of memory, which obviously cannot hold such a large model. (As of this writing, the strongest AI chip has not yet reached 200 GB.)
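A rough, weight-only back-of-the-envelope check of these figures (real deployments also need space for the KV cache and activations, so actual footprints are higher):

```python
# Rough weight-only memory estimates for a 100-billion-parameter model.
params = 100e9
for fmt, bytes_per_param in [("FP32", 4), ("BF16/FP16", 2), ("INT8", 1), ("NF4", 0.5)]:
    print(f"{fmt:>9}: {params * bytes_per_param / 1e9:7.0f} GB")
# BF16/FP16 gives ~200 GB, in line with the 200-300 GB figure above,
# and far beyond the tens of GB available on a single mainstream AI chip.
```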


Underrated general-purpose servers

If the GPU doesn't work, then start with the CPU.

Although it cannot train models at scale, a general-purpose server has great advantages for inference tasks.

In practice, Inspur Information's engineers have cleared one "roadblock" after another, at both the hardware-resource level and the algorithm level.

Large memory + high-speed bandwidth

In terms of computing power, the leading server CPUs are already equipped with AI acceleration functions.

Similar to a GPU's Tensor Cores, AMX (Advanced Matrix Extensions) accelerates low-precision matrix computation; it is built into the instruction set of the CPU cores and executed by dedicated matrix units.

In terms of algorithms, Inspur Information's general-purpose server supports mainstream AI frameworks such as PyTorch and TensorFlow, as well as popular development tools such as DeepSpeed, meeting users' needs for a mature, easy-to-deploy, and convenient open ecosystem.
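As a hedged illustration of what this looks like from the software side, the sketch below loads a Hugging Face causal LM on the CPU in bfloat16 with PyTorch; on Xeon CPUs with AMX, PyTorch's oneDNN backend can route these bf16 matrix multiplications to the AMX units without further code changes. The model id is a placeholder, not the actual Yuan 2.0 deployment stack.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model id; the deployment described above uses Yuan 2.0-102B
# with Inspur's own optimizations, which are not reproduced here.
model_id = "gpt2"

tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

inputs = tok("Write me an article about AI", return_tensors="pt")

# On CPUs with AMX, the bf16 matmuls inside generate() can be accelerated
# by oneDNN simply by choosing the bf16 dtype.
with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=32)
print(tok.decode(out[0], skip_special_tokens=True))
```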

In terms of communication, the full-link UPI (Ultra Path Interconnect) bus interconnect design enables efficient data transmission between CPUs:

  1. Allows direct data transfer between any two CPUs, reducing communication latency
  2. High transfer rates of up to 16GT/s (Giga Transfers per second) are available


In addition, Inspur's R&D engineers have also optimized the routing path and impedance continuity between CPUs and between CPUs and memory.

Based on 3D simulation results, they adjusted the via arrangement to reduce signal crosstalk to below -60 dB, a 50% reduction compared with the previous generation.

In addition, through DOE (design of experiments) matrix simulation, they found the combined optimum across all corners of the channel, so that compute performance can be fully exploited.

As for memory, it is arguably the general-purpose server's biggest advantage.

For a 4-socket server, plugging eight 32 GB DIMMs into each CPU already gives 1 TB. It can even be expanded to as much as 16 TB, enough to support models with up to a trillion parameters.

With DDR5 memory, the theoretical bandwidth is 4800 MT/s × 8 Bytes per channel × 8 channels × 4 CPUs ÷ 1024 ≈ 1200 GB/s.

The measured results show that the read bandwidth is 995 GB/s, the write bandwidth is 423 GB/s, and the read and write bandwidth is 437 GB/s.
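The theoretical figure comes from straightforward arithmetic, which the short check below reproduces using the numbers quoted above:

```python
# Theoretical DDR5 bandwidth for the 4-socket configuration described above:
# 4800 MT/s x 8 bytes per channel (64-bit bus) x 8 channels x 4 CPUs.
mt_per_s = 4800e6          # transfers per second per channel
bytes_per_transfer = 8     # 64-bit channel width
channels_per_cpu = 8
cpus = 4

bw = mt_per_s * bytes_per_transfer * channels_per_cpu * cpus   # bytes/s
print(f"theoretical: {bw / 1e9:.0f} GB/s")   # ~1229 GB/s (~1200 GB/s as quoted)
# Measured, per the text: ~995 GB/s read, ~423 GB/s write, ~437 GB/s mixed.
```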

These numbers hold up well against some GPUs and accelerator cards equipped with GDDR memory.


But hardware alone is not enough

Hardware innovation alone is not enough, though: CPUs still struggle with the massively parallel computation that large-model algorithms demand.

As mentioned at the beginning, large models place very high demands on communication bandwidth, whether between compute units or between compute units and memory.

At BF16 precision, if you want the per-token latency of a 100-billion-parameter model to stay under 100 ms, the bandwidth between memory and the compute units must reach at least 2 TB/s (each generated token requires reading roughly 100B × 2 Bytes = 200 GB of weights, and 200 GB ÷ 0.1 s = 2 TB/s).

Moreover, large AI models are designed around accelerator cards that excel at massively parallel computing, and the processors in general-purpose servers are not a natural fit for them.

The reason is obvious: the latter, while having versatile and high-performance compute cores, lack a massively parallel working environment.

Normally, a general-purpose server loads the model weights onto one CPU first and then relays them to the other CPUs in series.

However, because a running large model must frequently shuttle weights between memory and the CPUs, the result is low CPU-memory bandwidth utilization and huge communication overhead.


How to solve the problem? Innovate with algorithms

To solve these problems, Inspur Information introduced two technical innovations, tensor parallelism and NF4 quantization, and achieved real-time inference for the 100-billion-parameter model Yuan 2.0-102B.

The performance profile clearly shows how computation time is distributed across the different parts of the model:

linear layers account for about 50% of the runtime, convolutions 20%, collective communication 20%, and other computation 10%.

Note that across the whole inference process, computation accounts for 80% of the time!

This is in stark contrast to AI accelerator setups using multiple PCIe cards, where communication overhead can reach 50%, wasting a great deal of compute.


Yuan2.0-102B model inference performance analysis results

Tensor parallelism

Tensor parallelism first splits the model's operators, including the convolution operator, and then distributes the matrix weights of the attention and feed-forward layers across the memory of multiple processors.

In this way, the four CPUs in the general-purpose server can fetch the weights simultaneously and accelerate the computation.

However, tensor parallelism slices the model parameters at a fine granularity, which requires the CPUs to synchronize data after every tensor computation.

The aforementioned full-link UPI bus interconnect fully meets this requirement (with link speeds of up to 16 GT/s).

In the end, this coordinated parallel work directly improved computing efficiency by a factor of 4!
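The sketch below illustrates the core idea on a single machine, with no distributed runtime: the weight matrix of one linear layer is split column-wise across four workers standing in for the four CPU sockets, each computes its slice of the output, and the slices are gathered back together. It is a toy illustration of tensor parallelism, not Inspur's implementation.

```python
import torch

torch.manual_seed(0)
d_in, d_out, n_workers = 512, 1024, 4        # 4 workers stand in for 4 CPU sockets

x = torch.randn(8, d_in)                      # a batch of activations
W = torch.randn(d_in, d_out)                  # full weight of one linear layer

# Column-parallel split: each worker holds d_out / n_workers output columns.
shards = W.chunk(n_workers, dim=1)

# Each worker computes its slice of the output independently...
partial_outputs = [x @ w for w in shards]

# ...and the slices are gathered (over UPI, in the real system) into the full output.
y_parallel = torch.cat(partial_outputs, dim=1)

y_reference = x @ W
print(torch.allclose(y_parallel, y_reference, atol=1e-5))   # True
```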


NF4 quantization

As for the problem of insufficient memory bandwidth, the model needs to be "slimmed down", that is, quantized, without compromising accuracy.

The benefit is twofold: quantizing the LLM parameters to low-bit data makes the weights smaller, and smaller weights mean less data has to be moved during computation.

Here, Inspur Information adopted a relatively uncommon quantile quantization method: NF4 (4-bit NormalFloat).


The NF4 quantization method can compress the size of Yuan2.0-102B to 1/4 of the original size

Specifically, the core idea of NF4 is to make sure each quantization bin contains an equal number of values from the input tensor.

This property suits LLM weights well, since they follow an approximately normal distribution.

Because the standard deviation can be adjusted to fit the range of the quantized data type, NF4 achieves higher accuracy than traditional 4-bit integer or 4-bit floating-point quantization.

In this way, the quantized model not only meets the accuracy requirements but also greatly reduces the amount of data fetched from memory during large-scale parallel computing, satisfying the decoding requirements of real-time inference.
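A simplified sketch of the quantile idea follows: 16 levels are placed at equal-probability quantiles of a standard normal distribution, and a 64-element weight block is quantized against them with a per-block absmax scale. The actual NF4 codebook (as defined in QLoRA) differs in detail, so treat this purely as an illustration.

```python
import numpy as np
from scipy.stats import norm

def nf4_like_levels(n_levels: int = 16) -> np.ndarray:
    # Equal-probability quantiles of N(0, 1), rescaled to [-1, 1].
    probs = (np.arange(n_levels) + 0.5) / n_levels
    q = norm.ppf(probs)
    return q / np.abs(q).max()

def quantize_block(block: np.ndarray, levels: np.ndarray):
    scale = np.abs(block).max()             # per-block absmax scale
    normalized = block / scale              # map the block into [-1, 1]
    idx = np.abs(normalized[:, None] - levels[None, :]).argmin(axis=1)
    return idx.astype(np.uint8), scale      # 4-bit indices + one FP scale

def dequantize_block(idx, scale, levels):
    return levels[idx] * scale

levels = nf4_like_levels()
block = np.random.randn(64).astype(np.float32)   # block size = 64, as in the text
idx, scale = quantize_block(block, levels)
recon = dequantize_block(idx, scale, levels)
print("max abs error:", np.abs(block - recon).max())
```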


The value intervals of integer or floating-point quantization methods are usually uniformly or exponentially distributed

To further compress the weight parameters of the model, the team also used the Nested Quantization (Double Quant) technique.

This performs a second round of quantization on top of NF4.

NF4 quantization produces a large number of scale parameters, which would occupy a lot of memory if stored as 32-bit floating-point numbers (FP32).

For an LLM with 100 billion parameters, if every 64 parameters form a quantization block (block size = 64), storing the scale parameters alone requires an additional 6 GB or so of memory: (100B ÷ 64) × 4 Bytes ≈ 6 GB.

By quantizing these scale parameters to an 8-bit floating-point number (FP8), the team significantly reduced the amount of storage space required.

Using 256 as the second-level block size (block size = 256), the extra space needed to store all the scale parameters drops to just 1.57 GB: (100B ÷ 64 ÷ 256) × 4 Bytes + (100B ÷ 64) × 1 Byte ≈ 1.57 GB.
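These two storage figures can be verified with a few lines of arithmetic, using the block sizes and byte widths quoted above (small differences from 1.57 GB come down to rounding and GB conventions):

```python
# Storage needed for the quantization scale factors of a 100B-parameter model,
# following the block sizes and byte widths quoted in the text.
params = 100e9
block1, block2 = 64, 256          # NF4 block size, and block size for re-quantizing scales

n_scales = params / block1        # one scale per 64-weight block

fp32_scales_gb = n_scales * 4 / 1e9
print(f"FP32 scales:         {fp32_scales_gb:.2f} GB")   # ~6.25 GB (~6 GB in the text)

# Double quant: first-level scales stored as FP8 (1 byte), plus FP32
# second-level scales, one per 256 first-level scales.
double_quant_gb = (n_scales * 1 + (n_scales / block2) * 4) / 1e9
print(f"Double-quant scales: {double_quant_gb:.2f} GB")   # ~1.59 GB (~1.57 GB in the text)
```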

Through nested quantization, each weight parameter of the model ends up occupying only 4 bits of memory, a dramatic saving compared with the original format.

At the same time, it improves the efficiency of data transfer from memory to CPU by 4 times.

This optimization significantly reduces the limitation of memory bandwidth on the inference and decoding efficiency of the Yuan2.0-102B model, thereby further improving the inference performance of the model.

So-called "general-purpose" means everyone gets to use it

At this point, Inspur Information has successfully delivered!

Through system-level optimization, Inspur Information's NF8260G7 has become the first in the industry to run a 100-billion-parameter large model on general-purpose processors alone.

With this, the parameter scale of AI models that general-purpose computing can support has exceeded 100 billion, filling a gap in the industry and offering enterprises a new starting point for adopting AI.

Deploying 100-billion-parameter AI models now has a more powerful and cost-effective option, and large-model applications can be integrated more closely with the cloud, big data, and databases.


The ultimate goal of technological progress is to reach ordinary people.

Looking at the present, AIGC has spread into thousands of industries, and AI is making its way into every computing device at an astonishing rate.

From January to April 2024, the number of winning bids for domestic large-model projects exceeded the total for all of 2023, and the disclosed bid value reached 77% of 2023's total.

In the financial industry, hospital outpatient departments, and enterprise IT departments, practitioners have all come to the same realization: the computing infrastructure of traditional industries is no longer enough!

Today, 100-billion-parameter models are the key to the emergence of intelligence across thousands of industries, and whether general-purpose computing can run such models is the measure of whether it can support that emergence.

Inspur Information's initiative enables customers in the Internet, finance, healthcare, and other industries to deploy efficiently, with the initial investment saving more than 80% of construction costs.

From financial fraud prevention and financial data analysis to enterprise CRM marketing insights, intelligent medical diagnosis, personalized treatment plans, and education and training, all of these fields will witness the broad application of AI.

From now on, all computing is AI. 
