
Read in one article: What kind of network is needed in the AI era?

Terrific cloud computing 2024/08/20 14:31

Hello everyone, I'm an old cat.

Today, let's talk about data center networking.

When it comes to networking, the network is usually compared to a highway: the network card is the on-ramp and off-ramp, data packets are the cars carrying the data, and the transmission protocol is the traffic law.


Just as highways get congested, the network also runs into congestion on its data highway. This is especially true today, when the rapid development of artificial intelligence places much higher demands on data center networks.

Today, let's talk about what kind of network can meet the needs of the AI era.

▉ Why has the network become a bottleneck?

After so many years of network development, why has it suddenly become a frequent topic again? Why has the traditional network become the bottleneck of the modern data center?

There is no doubt that this is inseparable from compute-intensive scenarios such as AI and machine learning. According to IDC, global demand for computing power doubles every 3.5 months, far outpacing the current growth of computing power supply. To keep up with this demand, the data center network, one of the three core components of the data center, faces challenges in raising both compute utilization and communication performance.
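The scale of that claim is easier to feel with a quick back-of-the-envelope calculation (my own arithmetic, not from IDC): a doubling every 3.5 months compounds to roughly an eleven-fold increase per year.

```python
# If demand doubles every 3.5 months, the annual growth factor is 2**(12/3.5).
doubling_period_months = 3.5
growth_per_year = 2 ** (12 / doubling_period_months)
print(f"about {growth_per_year:.1f}x per year")  # about 10.8x per year
```

No hardware roadmap compounds at anywhere near that rate, which is why squeezing more out of the network matters.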


This is because, in the traditional von Neumann architecture, the network generally only transports data, while computing is centered on the CPU or GPU. When large, complex models such as ChatGPT and BERT distribute their workloads across a large number of GPUs for parallel computing, they generate large bursts of gradient traffic that easily lead to network congestion.

This is an inherent drawback of the traditional von Neumann architecture, and in an AI era of ever-growing computing power, neither increasing bandwidth nor reducing latency alone can solve it.

So how can we keep improving the performance of the data center network?

▉ Are there any new ways to improve network performance?

To improve network performance, there are two traditional methods: increase bandwidth or reduce latency. Both are easy to understand. It is just like transporting goods on a highway: either widen the road or raise the speed limit, and congestion eases.
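The limits of these two levers can be shown with a toy transfer-time model (a simplification of my own, not from the article): total time is fixed latency plus serialization time. For the small, bursty messages typical of gradient exchange, latency dominates, so piling on bandwidth quickly stops helping.

```python
def transfer_time_ms(size_bytes: int, bandwidth_gbps: float, latency_us: float) -> float:
    """Toy model: total time = fixed latency + time to serialize the bytes on the wire."""
    serialization_ms = size_bytes * 8 / (bandwidth_gbps * 1e9) * 1e3
    return latency_us / 1e3 + serialization_ms

# Quadrupling bandwidth barely moves a small 4 KB message with 10 us of latency:
for bw in (100, 400):
    print(f"{bw} Gb/s: {transfer_time_ms(4096, bw, 10):.4f} ms")
```

At 400 Gb/s the small message finishes only a few percent faster than at 100 Gb/s, because the 10 microseconds of latency, not the wire speed, sets the floor.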

When we run into a slow network in daily life, we use the same two approaches: pay for a faster broadband plan, or buy higher-performance network equipment.

However, both methods have their limits. Once bandwidth reaches a certain width and the equipment reaches a certain level, actual network performance becomes very hard to improve further. This is the main reason the network has become a bottleneck in the AI era.


Is there a better solution to improve the network?

The answer, of course, is yes. To accelerate model training and process large datasets, NVIDIA, the global leader in AI computing power, spotted the bottleneck of traditional networks long ago. Its answer was a new path: deploy computing around the data. Simply put, wherever the data is, the computing is there too: when the data is on the GPU, the computing is on the GPU; when the data travels through the network, the computing is in the network.

In short, the network not only delivers data transmission performance but also takes on part of the data processing itself.

With this new architectural approach, the CPU or GPU can concentrate on the computing tasks it is good at, while some infrastructure workloads are distributed to the nodes connected to the network, easing the bottleneck and packet-loss problems in network transmission. Reportedly, this approach can reduce network latency by more than a factor of ten.
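The benefit of offloading can be sketched with a toy throughput model (the workload numbers below are made up for illustration, not NVIDIA measurements): when infrastructure work runs on a separate unit in parallel, the host finishes its batch sooner.

```python
# Toy model of offloading, with made-up workload numbers.
app_work, infra_work = 800, 200   # arbitrary units of application vs infrastructure work
cpu_rate = dpu_rate = 100         # units processed per millisecond (assumed equal)

serial_ms = (app_work + infra_work) / cpu_rate                # CPU does everything itself
offload_ms = max(app_work / cpu_rate, infra_work / dpu_rate)  # infra work runs on the DPU in parallel
print(f"all on CPU: {serial_ms} ms, with offload: {offload_ms} ms")
```

The gain grows with the share of infrastructure work: the more packet processing, storage, and security chores the network nodes absorb, the closer the host gets to spending all of its time on the application itself.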

Infrastructure computing has therefore become one of the key technologies of the data-centric computing architecture.

▉ Why can the DPU improve the network?

Speaking of infrastructure computing, we have to mention the DPU. DPU stands for Data Processing Unit. It is the third main chip in the data center, and it exists mainly to take over from the CPU the data center infrastructure workloads that fall outside general-purpose computing.

NVIDIA is a global leader in the DPU field. In the first half of 2020, NVIDIA completed its $6.9 billion acquisition of the Israeli network chip company Mellanox Technologies and launched the BlueField-2 DPU the same year, defining it as the "third main chip" after the CPU and GPU and officially opening the curtain on DPU development.

So some people may ask: what role can this DPU actually play in the network?

Let me give you an example.

Think of running a restaurant. In the early days, with relatively few customers, the boss handles everything: purchasing, washing and chopping, plating, cooking, serving, and the cash register. This is just like the CPU, which not only performs arithmetic and logic operations but also manages peripherals, runs different tasks at different times, and switches between them to keep business applications running.


However, as the number of diners grows, the tasks need to be shared among different staff. Several assistants handle purchasing, washing, chopping, and plating to keep the chefs supplied; several chefs cook in parallel to speed up dish production; several waiters serve and deliver dishes to keep service quality high across many tables; and the boss only handles the cash register and management.

In this arrangement, the team of assistants and waiters is like the DPU, processing and moving data; the team of chefs is like the GPU, performing parallel computation on the data; and the boss is like the CPU, capturing business application requirements and delivering the results.


Working together, CPUs, GPUs, and DPUs each take on the workloads they do best, dramatically improving data center performance and energy efficiency and delivering a better return on investment.

▉ What DPU products have NVIDIA launched?

After launching the BlueField-2 DPU in 2020, NVIDIA released the next-generation data processor, the NVIDIA BlueField-3 DPU, in April 2021 to address the unique needs of AI workloads.

BlueField-3 is the first DPU designed for AI and accelerated computing. Reportedly, the BlueField-3 DPU can offload, accelerate, and isolate data center infrastructure workloads, freeing up valuable CPU resources to run business-critical applications.


Modern hyperscale cloud technology is driving data centers toward new architectures built on a new type of processor designed specifically for data center infrastructure software, one that offloads and accelerates the massive compute load generated by virtualization, networking, storage, security, and other cloud-native AI services. That is exactly where the BlueField DPU comes in.

As the industry's first 400G Ethernet and NDR InfiniBand DPU, BlueField-3 delivers outstanding network performance. By providing software-defined, hardware-accelerated data center infrastructure for demanding workloads, from AI and hybrid cloud to high-performance computing and 5G wireless networks, the BlueField-3 DPU redefines what is possible.

Even after releasing the BlueField-3 DPU, NVIDIA did not stop exploring. With the emergence and popularity of large models, NVIDIA found that improving the distributed computing performance and efficiency of GPU clusters, improving their horizontal scalability, and achieving per-tenant performance isolation on generative AI clouds had become common concerns for all large model vendors and AI service providers.


To this end, at the end of 2023 NVIDIA launched the BlueField-3 SuperNIC, which is optimized for east-west traffic. Derived from the BlueField DPU, it shares the same architecture but plays a different role: the DPU focuses on offloading infrastructure operations and accelerating and optimizing north-south traffic, while the BlueField SuperNIC combines InfiniBand technologies such as dynamic routing, congestion control, and performance isolation with the convenience of cloud Ethernet standards to meet the performance, scalability, and multi-tenancy needs of generative AI clouds.


In summary, the NVIDIA BlueField-3 networking platform currently includes two products: the BlueField-3 DPU, which accelerates software-defined networking, storage, and security tasks, and the BlueField SuperNIC, designed to power hyperscale AI clouds.

▉ What is the use of DOCA for DPUs?

When we talk about DPUs, we often talk about DOCA. So what is DOCA? What value does it have to the DPU?

From the above, we learned that NVIDIA has two products, BlueField-3 DPU and BlueField-3 SuperNIC, which can play a good role in accelerating the current surge in AI computing power.

At present, however, hardware products alone can hardly cover all the different application scenarios, so the power of software is needed as well.

In the computing power market, CUDA is the well-known software platform for GPUs. For the networking platform, NVIDIA adopted the same integrated software-and-hardware approach to acceleration: three years ago it launched DOCA, a software development platform tailored for DPUs, which now also supports the BlueField-3 SuperNIC.

NVIDIA DOCA, with its rich libraries, drivers, and APIs, provides a "one-stop-shop" for DOCA developers and is key to accelerating cloud infrastructure services.


NVIDIA DOCA software framework

As part of the full stack, DOCA is a key piece of the AI puzzle, tying compute, networking, storage, and security together. With DOCA, developers can meet the performance and security needs of the modern data center by creating software-defined, cloud-native, DPU- and SuperNIC-accelerated services with zero-trust protection.

Now, after three years of iterative upgrades, DOCA 2.7 expands the role of BlueField DPUs in offloading, accelerating, and isolating network, storage, security, and management infrastructure within the data center. This release also further enhances the AI Cloud Data Center and accelerates the NVIDIA Spectrum-X networking platform, delivering superior performance for AI workloads.

Let's take a look at the key role DOCA plays for GPUs and for NVIDIA BlueField-3 DPUs and BlueField-3 SuperNICs:


In summary, NVIDIA DOCA for DPUs and SuperNICs is like CUDA for GPUs. DOCA brings together a variety of powerful APIs, libraries, and drivers that can be used to program and accelerate modern data center infrastructure.

▉ Will DOCA development be the next blue ocean track?

There is no doubt that with the rise of AI, deep learning, the metaverse, and other technology scenarios, more and more enterprises need DOCA developers to join in and bring more innovations and ideas to fruition. The familiar cloud service providers have a growing demand for DPUs and need DOCA's hardware-acceleration technology to optimize data center performance.


DOCA is a tool for developers

As demand for efficient and secure data processing grows, DOCA development has also become a skill that gives cloud infrastructure engineers, cloud architects, network engineers, and similar roles a competitive advantage. Moreover, DOCA developers can create software-defined, cloud-native, DPU-accelerated services; taking part in DOCA development not only improves one's skills but also raises one's profile in the technical community.

At present, the number of DOCA developers falls far short of market demand. According to official figures, there are more than 14,000 DOCA developers worldwide, nearly half of them from China. That may sound like a lot, but compared with CUDA's 5 million developers worldwide, DOCA developers still have plenty of room to grow.

Then again, DOCA has only been public for a little over three years, while CUDA has been around since 2007. This also shows that DOCA is still in the early stage of its development, with plenty of potential ahead.

To attract more developers to DOCA, NVIDIA has been actively supporting them through a variety of activities in recent years, including building and launching the DOCA China developer community, holding online and offline DOCA developer training camps, and running DOCA developer hackathons.

In addition, in June 2024 the NVIDIA DPU Introductory Programming Course officially opened at Macau University of Science and Technology. The published syllabus shows that the content comprehensively introduces how the NVIDIA BlueField networking platform and the NVIDIA DOCA framework accelerate AI computing, helping college students gain a competitive edge in the AI era.

For developers looking to change direction and for college students about to graduate, DOCA development is a path many are optimistic about.


At the NVIDIA DOCA application code sharing event that concluded at the beginning of the year, many developers stood out and won awards, including quite a few college students. Chen Qin, a master's student in computer science and technology who won first prize, said: "Developing with DOCA has not only improved my abilities but also brought me potential job opportunities. I have also received a lot of recognition from senior members of the community, which has made me more confident in myself."

This article is from Xinzhi self-media and does not represent the views and positions of Business Xinzhi. If there is any suspicion of infringement, please contact the administrator of the Business News Platform. Contact: system@shangyexinzhi.com