<?xml version="1.0" encoding="US-ASCII"?>
<!DOCTYPE rfc SYSTEM "rfc2629.dtd">
<?rfc toc="yes"?>
<!-- You want a table of contents -->
<?rfc symrefs="yes"?>
<!-- Use symbolic labels for references -->
<?rfc sortrefs="yes"?>
<!-- This sorts the references -->
<?rfc iprnotified="no" ?>
<!-- Change to "yes" if someone has disclosed IPR for the draft -->
<?rfc compact="yes"?>
<!-- This defines the specific filename and version number of your draft (and inserts the appropriate IETF boilerplate) -->
<rfc category="std" docName="draft-li-rtgwg-distributed-lossless-framework-00"
     ipr="trust200902">
  <front>
    <title abbrev="Distributed AIDC Network Framework">Framework of
    Distributed AIDC Network</title>

    <author fullname="Cong Li" initials="C." surname="Li">
      <organization>China Telecom</organization>

      <address>
        <postal>
          <street>Beiqijia Town, Changping District</street>

          <city>Beijing</city>

          <code>102209</code>

          <country>China</country>
        </postal>

        <email>licong@chinatelecom.cn</email>
      </address>
    </author>

    <author fullname="Siwei Ji" initials="S." surname="Ji">
      <organization>China Telecom</organization>

      <address>
        <postal>
          <street>Beiqijia Town, Changping District</street>

          <city>Beijing</city>

          <code>102209</code>

          <country>China</country>
        </postal>

        <email>jisw@chinatelecom.cn</email>
      </address>
    </author>

    <author fullname="Keyi Zhu" initials="K." surname="Zhu">
      <organization>Huawei Technologies</organization>

      <address>
        <postal>
          <street>Huawei Campus, No.156 Beiqing Road</street>

          <city>Beijing</city>

          <code>100095</code>

          <country>China</country>
        </postal>

        <email>zhukeyi@huawei.com</email>
      </address>
    </author>

    <date day="21" month="October" year="2024"/>

    <area>Routing</area>

    <workgroup>Routing Area Working Group</workgroup>

    <keyword>AIDC</keyword>

    <keyword>Network framework</keyword>

    <abstract>
      <t>The rapid development of large language models places ever
      higher demands on the networking scale of data centers.
      Distributed model training has been proposed to shorten training
      time and to relieve the resource demand on any single data center.
      This document proposes a framework that addresses the challenges
      of efficient lossless interconnection and reliable data
      transmission between multiple data centers, allowing multiple data
      centers to be connected into a larger cluster over the network.
      The document further examines the key technologies and application
      scenarios of such a distributed AIDC network.</t>
    </abstract>
  </front>

  <middle>
    <section anchor="intro" title="Introduction">
      <t>In the realm of artificial intelligence (AI), the computational
      demands of training large models have grown exponentially, posing
      significant challenges in terms of hardware resources, data
      storage, and training time. For example, GPT-4 is reported to have
      more than 1.8 trillion parameters and to have been trained on some
      20,000 NVIDIA A100 GPUs. Such models clearly require immense
      computational power and memory capacity to perform effectively,
      which means that training a very large AI model requires a
      high-speed interconnected cluster of thousands or even hundreds of
      thousands of GPUs.</t>

      <t>From a technical perspective, building a single unified
      ultra-large-scale data center is the ideal solution, offering
      efficient data processing and storage. In practice, however, this
      approach encounters several challenges and constraints. The first
      is investment cost, which covers hardware procurement,
      infrastructure construction, software and hardware platform
      development, and so on; the construction period often ranges from
      several months to several years. Large-scale data centers also
      place high demands on floor space: for example, a 10,000-GPU
      cluster requires at least 5,000 square meters of data center
      space. In addition, there are energy challenges such as heat
      dissipation and power supply. A large model is truly a
      &ldquo;power consumer&rdquo;: placing more than 100,000 H100 GPUs
      in the same area could paralyze the local power grid.</t>

      <t>To make full use of the resources of smaller data centers,
      multiple data centers can be connected through the network to
      provide infrastructure services for AI tasks. To shorten training
      time and relieve the resource demand on a single data center,
      distributed model training has been proposed in the form of
      cloud-network coordination.</t>
    </section>

    <section title="Terminology">
      <t>The following terms are used in this document:</t>

      <t>AI: Artificial Intelligence</t>

      <t>AIDC: Artificial Intelligence Data Center</t>

      <t>DC: Data Center</t>

      <t>LLM: Large Language Model</t>

      <t>GPU: Graphics Processing Unit</t>

      <t>OTN: Optical Transport Network</t>

      <t>RDMA: Remote Direct Memory Access</t>
    </section>

    <section title="Scenarios">
      <t>Building a cluster of 10,000 or even 100,000 GPUs in a single
      data center faces the problems of insufficient space, power, and
      heat dissipation described above. Distributed networking of
      intelligent computing centers can instead connect multiple such
      centers into one large virtual computing cluster. At present, a
      lossless network between distributed AI data centers is mainly
      suitable for the following two types of scenarios.</t>

      <section anchor="uif" title="Scenario 1: Distributed Model Training">
        <t>At present, the scale of most AIDCs is between 100 and 300
        PFLOPS, which is insufficient for training large-scale models.
        Distributed model training coordinates the computing of multiple
        intelligent computing centers in a region, so that larger models
        can be trained without building a super-scale intelligent
        computing center. In practice, the computing power demanded by
        tenants is often inconsistent with the computing power actually
        deployed, leaving computing resources fragmented; some AI data
        centers therefore suffer from low resource utilization and
        wasted capacity. In this scenario, distributed AI data center
        networking can provide a lossless network connection between
        servers in remote data centers, making full use of the
        fragmented resources of data centers in different geographical
        locations to perform suitable model training tasks and improve
        overall resource utilization.</t>
      </section>

      <section title="Scenario 2: Distributed Storage and Computation">
        <t>High-performance, highly reliable storage is one of the most
        basic services of the public cloud. At present, public clouds
        widely adopt an architecture that separates storage from
        computing; that is, computing clusters and storage clusters may
        be located in different DCs within a region. The network
        connecting computing clusters and storage clusters then becomes
        the key to achieving high performance and high reliability for
        cloud storage services. The distributed AIDC network can connect
        computing clusters and storage clusters in a region while
        meeting data localization requirements and ensuring data
        security.</t>
      </section>
    </section>

    <section title="Framework">
      <t>This document proposes a framework to address the challenges of
      efficient lossless interconnection and reliable data transmission
      between multiple data centers, suitable for the scenarios above,
      such as large-model training and inference.</t>

      <section title="Overview">
        <t>The distributed AIDC lossless network architecture is
        composed of multiple independent data center networks
        interconnected through a wide area interconnection area, which
        together support workloads that span the participating AIDCs.
        The AI cluster network architecture is divided into five layers
        from bottom to top:</t>

        <t><figure>
            <artwork><![CDATA[                      +------------------------------+
Control layer         |          Controller          |
                      +-+-------+-------------+----+-+
------------------------|-------|-------------|----|------------------------                          
       +----------------+       |             |    + -------------+                                
       |                        |             |                   |
 +-----+-----+        +---------+-+     +-----+-----+       +-----+-----+           
 |    OTN    |        |    OTN    |     |    OTN    |       |    OTN    |
 +----+-+----+        +---+-+-----+     +----+-+----+       +----++-----+
      |  \               /  |                |  \               / |  Interconnection layer
      |   \             /   |                |   \             /  | 
      |    +-----------/----|--+             |    +-----------/---|---+    
      |   +-----------+     |  |             |   +-----------+    |   | 
 +----+---+--+        +-----+--+--+     +----+---+-+        +-----+---+-+
 |  S-Spine  |        |  S-Spine  |     |  S-Spine |        |   S-Spine | 
 +----+-+----+        +---+-+-----+     +----+-+---+        +----++-----+ 
      |  \               /  |                |  \               / |     Cluster egress layer
      |   \             /   |                |   \             /  | 
      |    +-----------/----|--+             |    +-----------/---|---+    
      |   +-----------+     |  |             |   +-----------+    |   | 
 +----+---+-+        +------+--+-+     +-----+---+-+        +-----+---+-+     
 |   Spine   |       |   Spine   |     |   Spine   |        |   Spine   |   
 +----+-+----+       +---+-+-----+     +----+-+----+        +----++-----+     
      |  \               /  |                |  \               / |  Aggregation layer
      |   \             /   |                |   \             /  | 
      |    +-----------/----|--+             |    +-----------/---|---+       
      |   +-----------+     |  |             |   +-----------+    |   |       
+-----+---+-+        +------+--++      +-----+---+-+        +-----+---+-+    
|    leaf   |        |    leaf  |      |    leaf   |        |    leaf   |    
+--+----+---+        +--+----+--+      +--+----+---+        +--+----+---+     
   |    |               |    |            |    |               |    |   Access layer        
   H1  H2               H3   H4           H1   H2              H3   H4                                      
          
            Cluster A                              Cluster N              
                                  
                Figure 1: Framework of distributed AIDC network
]]></artwork>
          </figure>&bull; The access layer is composed of Server Leaf
        switches and supports high-density access of AI servers; a 1:1
        uplink-to-downlink bandwidth convergence ratio is recommended.
        Each interface of an AI server is configured with an independent
        IP address and is connected to the Server Leaf switch over an
        independent link, without link bundling. The access layer
        supports an optical-module fault protection mechanism to avoid
        training interruptions caused by access-side link failures.</t>

        <t>&bull; The aggregation layer consists of Spine switches,
        which connect to Server Leaf switches downstream and to DCI
        gateways upstream. The number of Spine switches determines the
        total size of the node's AI cluster. Depending on the training
        model chosen, the aggregation layer may apply a certain
        convergence ratio.</t>

        <t>&bull; The cluster egress layer consists of DCI gateways. As
        the exit of the AI cluster, a DCI gateway is fully
        interconnected with multiple Spine switches downstream and with
        the OTN and other nodes upstream. The cluster egress layer may
        also apply a convergence ratio according to the chosen service
        model. In addition, this layer needs to support technologies
        such as compute-network service awareness and precise flow
        control, so as to realize network load balancing and
        long-distance lossless transmission and to provide the basic
        network guarantee for efficient LLM training.</t>

        <t>&bull; The wide area interconnection layer consists of
        routers, OTN equipment, and other devices. Multiple AIDCs are
        connected through a high-throughput OTN, which uses high-speed,
        large-capacity technology to provide high-quality,
        large-bandwidth connections and realize cross-DC interconnection
        of AI cluster training networks. This layer has intelligent
        operation and maintenance capabilities to ensure highly reliable
        interconnection and supports flexible setup and teardown of
        links according to service needs. On this basis, the distributed
        AI cluster network architecture can provide stable and efficient
        data transmission in long-distance, large-scale distributed
        computing environments.</t>

        <t>&bull; The control layer consists of controllers. Forwarding
        and control are separated in the distributed AIDC network. The
        controller collects the network topology and the network traffic
        reported by the network devices, together with service
        information reported by the servers, such as the model
        partitioning and parallelism mode. From these inputs the
        controller computes global traffic paths and distributes them to
        the network devices.</t>
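As an illustrative sketch of the control-layer workflow (the function names, data layout, and the greedy least-loaded strategy are assumptions of this example, not part of the framework), the controller could place each reported flow onto the least-loaded of its candidate equivalent paths:

```python
# Hypothetical controller path computation: given the flows reported by
# the servers and the candidate paths derived from the collected
# topology, greedily place each flow on the least-loaded path.

def assign_paths(flows, candidate_paths):
    """flows: list of (flow_id, bandwidth_gbps) tuples.
    candidate_paths: dict flow_id -> list of candidate path names."""
    path_load = {}    # path name -> bandwidth already assigned to it
    assignment = {}   # flow_id -> chosen path
    # Place the largest flows first so big flows do not collide later.
    for flow_id, bw in sorted(flows, key=lambda f: -f[1]):
        best = min(candidate_paths[flow_id],
                   key=lambda p: path_load.get(p, 0.0))
        assignment[flow_id] = best
        path_load[best] = path_load.get(best, 0.0) + bw
    return assignment, path_load
```

This is only one possible placement policy; a real controller would also account for path latency, failures, and the job's parallelism mode reported by the servers.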
      </section>

      <section title="Technical Requirements">
        <t>The distributed AIDC lossless network extends the DC lossless
        network from the data center into the wide area network. From
        the perspective of network operation, a distributed AIDC network
        should meet the following requirements.</t>

        <t>Requirement 1: Long-Distance Lossless Interconnection</t>

        <t>RDMA is used as the input/output protocol during large model
        training. Since RDMA is very sensitive to network congestion and
        packet loss, even a small amount of packet loss causes a sharp
        performance degradation. The underlying network must therefore
        provide lossless transmission, ensuring that no congestion or
        packet loss occurs during data transmission, so as to avoid
        degrading the performance of upper-layer protocols.</t>

        <t>Requirement 2: Large-Capacity Interconnection Bandwidth</t>

        <t>Large-capacity interconnection bandwidth ensures the rapid
        transfer of large amounts of data between distributed
        intelligent computing centers, accelerating the training and
        inference of AI models. As data volumes grow, data and model
        parameters must be synchronized efficiently among the
        distributed centers, which requires the network to provide
        sufficient throughput to avoid network congestion and
        performance degradation.</t>

        <t>Requirement 3: Ultra-High Reliability</t>

        <t>To ensure long-term stable training across distributed AI
        data centers and to prevent interruptions caused by external
        factors such as network failures, the transport network needs
        high reliability. For example, the network should recover
        quickly from a link failure so that the AI service is not
        interrupted, avoiding the rollback of training jobs and the loss
        of computational efficiency that a link interruption would
        otherwise cause.</t>

        <t>Requirement 4: Elasticity and Agility</t>

        <t>The distributed AIDC lossless network needs to flexibly set
        up clusters of different sizes and types according to the
        differing needs of multiple tenants. This means the network must
        be able to elastically and agilely set up and tear down
        connections on demand, adjusting quickly to changes in computing
        requirements and dynamically allocating large-bandwidth
        resources.</t>

        <t>Requirement 5: Intelligent Network Operation</t>

        <t>In model training scenarios, packet loss leads to a large
        drop in training performance. This places higher requirements on
        network quality, operation, and maintenance, and makes real-time
        monitoring of network links necessary. The lossless network of a
        distributed intelligent computing center therefore needs
        intelligent operation and maintenance capabilities that can
        quickly and accurately locate and resolve problems, improve the
        accuracy of fault localization, and ensure the stable operation
        of the network.</t>
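To make Requirement 2 concrete, the following back-of-envelope estimate (all figures here are illustrative assumptions, not measurements or recommendations) relates model size and inter-DC link capacity to the time needed for one full gradient exchange:

```python
# Illustrative estimate: how long one full gradient exchange takes over
# the DCI link. All inputs are hypothetical example values.

def sync_time_seconds(params_billion, bytes_per_param, link_gbps):
    """Volume of one full gradient exchange divided by link throughput."""
    volume_bits = params_billion * 1e9 * bytes_per_param * 8
    return volume_bits / (link_gbps * 1e9)

# A hypothetical 70B-parameter model with fp16 gradients (2 bytes each)
# over a 400 Gbps inter-DC link: about 2.8 seconds per exchange, before
# any compression or hierarchical reduction.
t = sync_time_seconds(70, 2, 400)
```

Even this simplified calculation shows why the interconnection bandwidth must be large: the exchange repeats every training iteration, so the link capacity directly bounds iteration time.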
      </section>

      <section title="Key Mechanisms">
        <section title="Collective Communication mechanism in heterogeneous networks">
          <t>The communication pattern of large model training is
          collective communication. Collective communication among GPUs
          refers to information exchange conducted by multiple GPU nodes
          following specific rules, using a series of predefined
          communication patterns such as all-reduce, all-gather, reduce,
          and broadcast to achieve data synchronization and sharing.
          Each iteration of LLM training synchronizes parameters through
          collective communication, and a single collective operation
          involves multiple rounds of data interaction and multiple
          long-distance exchanges. Long-distance links increase
          communication delay, which reduces the training efficiency of
          LLMs. The collective communication mechanism therefore needs
          to be redesigned for heterogeneous networks, with the goal of
          reducing both the amount of data transmitted on long-distance
          links and the number of such transmissions, thereby greatly
          reducing the possibility of long-distance link congestion.</t>
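One common way to realize such an optimization, shown here only as a sketch (the function name and the two-level topology are assumptions of this example, not a normative design), is a hierarchical all-reduce: reduce within each DC first, exchange only one partial sum per DC over the WAN, then broadcast the result locally, so long-distance traffic scales with the number of DCs rather than the number of GPUs:

```python
# Hypothetical hierarchical all-reduce across DCs. Step 2 is the only
# traffic that crosses the long-distance links.

def hierarchical_allreduce(clusters):
    """clusters: list of DCs, each a list of per-GPU gradient vectors."""
    # Step 1: intra-DC reduce over short, high-bandwidth links.
    partials = [[sum(col) for col in zip(*dc)] for dc in clusters]
    # Step 2: inter-DC exchange of one partial sum per DC (WAN traffic).
    global_sum = [sum(col) for col in zip(*partials)]
    # Step 3: intra-DC broadcast of the global result to every GPU.
    return [[list(global_sum) for _ in dc] for dc in clusters]
```

With N GPUs spread over M data centers, the WAN carries M partial sums per round instead of N gradient vectors, which is the reduction in long-distance volume the text calls for.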
        </section>

        <section title="Global load balancing">
          <t>Load balancing mainly addresses congestion and packet loss
          in non-faulty, homogeneous networks under intelligent
          computing workloads. LLM training traffic is highly
          synchronized, consists of large flows, and appears
          periodically; at the same time, flows are present on every
          equivalent path in the network. Traditional load balancing
          based on ECMP hashing cannot achieve a perfect balance across
          all paths. Network-level load balancing can instead pre-plan
          the traffic of the whole network based on these traffic
          characteristics, so that all paths are evenly loaded and
          conflict-free, avoiding congestion and packet loss.</t>
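The contrast between hash-based ECMP and controller-planned placement can be illustrated with a toy sketch (the flow names and the MD5-based hash are assumptions of this example, not a description of any device's actual hash function):

```python
# Toy comparison: hashing a handful of large flows onto equal-cost
# paths can pile several onto one path, while a pre-planned round-robin
# placement keeps every path at the same load.
import hashlib

def ecmp_load(flow_ids, n_paths):
    """Hash-based placement, as a conventional ECMP device might do."""
    load = [0] * n_paths
    for f in flow_ids:
        digest = int(hashlib.md5(f.encode()).hexdigest(), 16)
        load[digest % n_paths] += 1
    return load

def planned_load(flow_ids, n_paths):
    """Controller-planned placement: deterministic and conflict-free."""
    load = [0] * n_paths
    for i, _ in enumerate(flow_ids):
        load[i % n_paths] += 1
    return load
```

Because LLM training produces few, large, synchronized flows, the law of large numbers that normally rescues hashing does not apply, which is why the planned placement matters here.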
        </section>

        <section title="Precise flow-control technology">
          <t>To solve the performance degradation caused by packet loss
          in intelligent computing workloads, the distributed AIDC
          network architecture adopts a precise flow-control technology
          that moves the congestion point from the long-distance link to
          the network device closest to the transmitting node (the
          first-hop device), effectively alleviating network congestion
          by shortening the congestion feedback path. Specifically, a
          network device determines whether congestion is occurring on
          its links by checking network state such as the queue buildup
          and buffer usage of each port. If congestion occurs and the
          device is not the first-hop device for the congested traffic
          flows, it notifies the first-hop device with a congestion
          message. The first-hop device then runs an algorithm, based on
          the degree of congestion, to determine the proportion by which
          the congested flows should be limited. Finally, the first-hop
          device controls the rate of those flows by sending PFC/CNP or
          other flow-control protocol packets.</t>
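The feedback loop above can be sketched as follows; the occupancy threshold and the linear rate-reduction rule are illustrative assumptions of this example, not a normative algorithm:

```python
# Hypothetical first-hop rate computation: a congested device reports a
# congestion degree derived from buffer occupancy, and the first-hop
# device scales the offending flow's rate down proportionally (e.g. by
# pacing PFC/CNP frames toward the sender).

QUEUE_THRESHOLD = 0.7  # assumed occupancy above which a port is congested

def congestion_degree(buffer_used, buffer_size):
    """0.0 below the threshold, rising linearly to 1.0 at a full buffer."""
    occupancy = buffer_used / buffer_size
    return max(0.0, (occupancy - QUEUE_THRESHOLD) / (1.0 - QUEUE_THRESHOLD))

def first_hop_rate_limit(current_rate_gbps, degree):
    """New rate for the congested flow after applying the limit."""
    return current_rate_gbps * (1.0 - degree)
```

For example, a buffer that is 85% full yields a congestion degree of 0.5, halving the flow's rate at the first hop instead of letting the long-distance link drop packets.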
        </section>

        <section title="Packet loss detection">
          <t>The packet loss monitoring technology supports the following
          capabilities:</t>

          <t>(1) Fast fault localization: real-time monitoring of
          indicators such as traffic delay and packet loss;</t>

          <t>(2) Visualization: the centralized control plane of the
          network supports visualization of flow paths;</t>

          <t>(3) Packet loss statistics: within a given statistical
          period, the difference between all traffic entering the
          network and all traffic leaving it is counted; this difference
          is the number of packets lost by the carrying network in that
          period;</t>

          <t>(4) Delay statistics: within a given statistical period,
          the difference between the time at which a flow enters the
          network and the time at which the same flow leaves it is
          measured between two specified network nodes; this difference
          is the delay of the network in that period.</t>
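Statistics (3) and (4) amount to simple differences over per-flow counters and timestamps, sketched below (the data layout and units are assumptions of this example):

```python
# Minimal sketch of the loss and delay statistics: loss is the
# ingress/egress packet-count difference over one statistical period,
# and delay is the timestamp difference for the same flow between two
# measurement nodes.

def period_loss(ingress_counts, egress_counts):
    """Per-flow packet counts at network ingress and egress."""
    return {f: ingress_counts[f] - egress_counts.get(f, 0)
            for f in ingress_counts}

def period_delay_ms(ingress_ts, egress_ts):
    """Timestamps (seconds) of the same flow observed at two nodes."""
    return {f: (egress_ts[f] - ingress_ts[f]) * 1000.0
            for f in ingress_ts if f in egress_ts}
```

In practice these counters and timestamps would come from in-band telemetry or device counters; the computation itself is no more than the subtractions shown here.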
        </section>
      </section>
    </section>

    <section title="Conclusion">
      <t>This document proposes a distributed AIDC network framework and
      introduces the key technologies under this framework, including
      the collective communication mechanism, global load balancing, and
      precise flow control, in order to further promote the verification
      of distributed AIDC interconnection in the future.</t>
    </section>

    <section anchor="security" title="Security Considerations">
      <t>There is no additional security risk introduced by this design.</t>
    </section>

    <section title="IANA Considerations">
      <t>This document introduces no additional considerations for IANA.</t>
    </section>
  </middle>

  <back>
    <references title="Normative References">
      <?rfc include='reference.RFC.2119'?>

      <?rfc include="reference.RFC.8174"?>

    </references>

    <references title="Informative References">
      <?rfc include='reference.I-D.huang-rtgwg-wan-lossless-uc'?>
    </references>
  </back>
</rfc>
