<?xml version="1.0" encoding="US-ASCII"?>
<!DOCTYPE rfc SYSTEM "rfc2629.dtd">
<?rfc toc="yes"?>
<!-- You want a table of contents -->
<?rfc symrefs="yes"?>
<!-- Use symbolic labels for references -->
<?rfc sortrefs="yes"?>
<!-- This sorts the references -->
<?rfc iprnotified="no" ?>
<!-- Change to "yes" if someone has disclosed IPR for the draft -->
<?rfc compact="yes"?>
<!-- This defines the specific filename and version number of your draft (and inserts the appropriate IETF boilerplate) -->
<rfc category="std" docName="draft-li-rtgwg-distributed-lossless-framework-00"
     ipr="trust200902">
  <front>
    <title abbrev="Distributed AIDC Network Framework">Framework of
    Distributed AIDC Network</title>

    <author fullname="Cong Li" initials="C." surname="Li">
      <organization>China Telecom</organization>

      <address>
        <postal>
          <street>Beiqijia Town, Changping District</street>

          <city>Beijing</city>

          <code>102209</code>

          <country>China</country>
        </postal>

        <email>licong@chinatelecom.cn</email>
      </address>
    </author>

    <author fullname="Siwei Ji" initials="S." surname="Ji">
      <organization>China Telecom</organization>

      <address>
        <postal>
          <street>Beiqijia Town, Changping District</street>

          <city>Beijing</city>

          <code>102209</code>

          <country>China</country>
        </postal>

        <email>jisw@chinatelecom.cn</email>
      </address>
    </author>

    <author fullname="Keyi Zhu" initials="K." surname="Zhu">
      <organization>Huawei Technologies</organization>

      <address>
        <postal>
          <street>Huawei Campus, No.156 Beiqing Road</street>

          <city>Beijing</city>

          <code>100095</code>

          <country>China</country>
        </postal>

        <email>zhukeyi@huawei.com</email>
      </address>
    </author>

    <date day="21" month="October" year="2024"/>

    <area>Routing</area>

    <workgroup>Routing Area Working Group</workgroup>

    <keyword>AIDC</keyword>

    <keyword>Network framework</keyword>

    <abstract>
      <t>The rapid development of large language models places ever
      higher demands on the networking scale of data centers.
      Distributed model training has been proposed to shorten training
      time and to relieve the resource demand on any single data center.
      This document proposes a framework that addresses the challenges
      of efficient lossless interconnection and reliable data
      transmission between multiple data centers, allowing multiple data
      centers to be connected into a larger cluster over the network.
      The document further examines the key technologies and application
      scenarios of such a distributed AIDC network.</t>
    </abstract>
  </front>

  <middle>
    <section anchor="intro" title="Introduction">
      <t>In the realm of artificial intelligence (AI), the computational
      demands of training large models have grown exponentially, posing
      significant challenges in terms of hardware resources, data
      storage, and training time. For example, GPT-4 is reported to have
      more than 1.8 trillion parameters and to have been trained on some
      20,000 NVIDIA A100 GPUs. Such models clearly require immense
      computational power and memory capacity to perform effectively,
      which means that training a very large AI model requires a
      high-speed interconnected cluster of thousands or even hundreds of
      thousands of GPUs.</t>

      <t>From a technical perspective, building a single unified
      ultra-large-scale data center is the ideal solution, offering
      efficient data processing and storage. In practice, however, this
      approach encounters several challenges and constraints. The first
      is investment cost, which covers hardware procurement,
      infrastructure construction, software and hardware platform
      development, and so on; the construction period often ranges from
      several months to several years. Large-scale data centers also
      place high demands on floor space: for example, a 10,000-GPU
      cluster requires at least 5,000 square meters of data center
      space. In addition, there are energy challenges such as heat
      dissipation and power supply. A large model is truly a
      &ldquo;power consumer&rdquo;: placing more than 100,000 H100 GPUs
      in the same area could paralyze the local power grid.</t>

      <t>To make full use of the resources of smaller data centers,
      multiple data centers can be connected through the network to
      provide infrastructure services for AI tasks. To shorten training
      time and relieve the resource demand on a single data center,
      distributed model training has been proposed in the form of
      cloud-network coordination.</t>
    </section>

    <section title="Terminology">
      <t>The following terms are used in this document:</t>

      <t>AI: Artificial Intelligence</t>

      <t>AIDC: Artificial Intelligence Data Center</t>

      <t>DC: Data Center</t>

      <t>LLM: Large Language Model</t>

      <t>GPU: Graphics Processing Unit</t>

      <t>OTN: Optical Transport Network</t>

      <t>RDMA: Remote Direct Memory Access</t>
    </section>

    <section title="Scenarios">
      <t>Building a cluster of 10,000 or even 100,000 GPUs in a single
      data center faces the problems of insufficient space, power, and
      heat dissipation described above. Distributed networking of
      intelligent computing centers can instead connect multiple such
      centers into one large virtual computing cluster. At present, a
      lossless network between distributed AI data centers is mainly
      suitable for the following two types of scenarios.</t>

      <section anchor="uif" title="Scenario 1: Distributed Model Training">
        <t>At present, the scale of most AIDCs is between 100 and 300
        PFLOPS, which is insufficient for training large-scale models.
        Distributed model training coordinates the computing of multiple
        intelligent computing centers in a region, so that larger models
        can be trained without building a super-scale intelligent
        computing center. In practice, the computing power demanded by
        tenants is often inconsistent with the computing power actually
        deployed, leaving computing resources fragmented; some AI data
        centers therefore suffer from low resource utilization and
        wasted capacity. In this scenario, distributed AI data center
        networking can provide a lossless network connection between
        servers in remote data centers, making full use of the
        fragmented resources of data centers in different geographical
        locations to perform suitable model training tasks and improve
        overall resource utilization.</t>
      </section>

      <section title="Scenario 2: Distributed Storage and Computation">
        <t>High-performance, highly reliable storage is one of the most
        basic services of the public cloud. At present, public clouds
        widely adopt an architecture that separates storage from
        computing; that is, computing clusters and storage clusters may
        be located in different DCs within a region. The network
        connecting computing clusters and storage clusters then becomes
        the key to achieving high performance and high reliability for
        cloud storage services. The distributed AIDC network can connect
        computing clusters and storage clusters in a region while
        meeting data localization requirements and ensuring data
        security.</t>
      </section>
    </section>

    <section title="Framework">
      <t>This document proposes a framework to address the challenges of
      efficient lossless interconnection and reliable data transmission
      between multiple data centers, suitable for the scenarios above,
      such as large-model training and inference.</t>

      <section title="Overview">
        <t>The distributed AIDC lossless network architecture is
        composed of multiple independent data center networks
        interconnected through a wide area interconnection area, which
        together support workloads that span the participating AIDCs.
        The AI cluster network architecture is divided into five layers
        from bottom to top:</t>

        <t><figure>
            <artwork><![CDATA[                      +------------------------------+
Control layer         |          Controller          |
                      +-+-------+-------------+----+-+
------------------------|-------|-------------|----|------------------------                          
       +----------------+       |             |    + -------------+                                
       |                        |             |                   |
 +-----+-----+        +---------+-+     +-----+-----+       +-----+-----+           
 |    OTN    |        |    OTN    |     |    OTN    |       |    OTN    |
 +----+-+----+        +---+-+-----+     +----+-+----+       +----++-----+
      |  \               /  |                |  \               / |  Interconnection layer
      |   \             /   |                |   \             /  | 
      |    +-----------/----|--+             |    +-----------/---|---+    
      |   +-----------+     |  |             |   +-----------+    |   | 
 +----+---+--+        +-----+--+--+     +----+---+-+        +-----+---+-+
 |  S-Spine  |        |  S-Spine  |     |  S-Spine |        |   S-Spine | 
 +----+-+----+        +---+-+-----+     +----+-+---+        +----++-----+ 
      |  \               /  |                |  \               / |     Cluster egress layer
      |   \             /   |                |   \             /  | 
      |    +-----------/----|--+             |    +-----------/---|---+    
      |   +-----------+     |  |             |   +-----------+    |   | 
 +----+---+-+        +------+--+-+     +-----+---+-+        +-----+---+-+     
 |   Spine   |       |   Spine   |     |   Spine   |        |   Spine   |   
 +----+-+----+       +---+-+-----+     +----+-+----+        +----++-----+     
      |  \               /  |                |  \               / |  Aggregation layer
      |   \             /   |                |   \             /  | 
      |    +-----------/----|--+             |    +-----------/---|---+       
      |   +-----------+     |  |             |   +-----------+    |   |       
+-----+---+-+        +------+--++      +-----+---+-+        +-----+---+-+    
|    leaf   |        |    leaf  |      |    leaf   |        |    leaf   |    
+--+----+---+        +--+----+--+      +--+----+---+        +--+----+---+     
   |    |               |    |            |    |               |    |   Access layer        
   H1  H2               H3   H4           H1   H2              H3   H4                                      
          
            Cluster A                              Cluster N              
                                  
                Figure 1: Framework of distributed AIDC network
]]></artwork>
          </figure>&bull; The access layer is composed of Server Leaf
        switches and supports high-density access of AI servers; a 1:1
        uplink-to-downlink bandwidth convergence ratio is recommended.
        Each interface of an AI server is configured with an independent
        IP address and is connected to the Server Leaf switch over an
        independent link, without link bundling. The access layer
        supports an optical-module fault protection mechanism to avoid
        training interruptions caused by access-side link failures.</t>

        <t>&bull; The aggregation layer consists of Spine switches,
        which connect to Server Leaf switches downstream and to DCI
        gateways upstream. The number of Spine switches determines the
        total size of the node's AI cluster. Depending on the training
        model chosen, the aggregation layer may apply a certain
        convergence ratio.</t>

        <t>&bull; The cluster egress layer consists of DCI gateways. As
        the exit of the AI cluster, a DCI gateway is fully
        interconnected with multiple Spine switches downstream and with
        the OTN and other nodes upstream. The cluster egress layer may
        also apply a convergence ratio according to the chosen service
        model. In addition, this layer needs to support technologies
        such as compute-network service awareness and precise flow
        control, so as to realize network load balancing and
        long-distance lossless transmission and to provide the basic
        network guarantee for efficient LLM training.</t>

        <t>&bull; The wide area interconnection layer consists of
        routers, OTN equipment, and other devices. Multiple AIDCs are
        connected through a high-throughput OTN, which uses high-speed,
        large-capacity technology to provide high-quality,
        large-bandwidth connections and realize cross-DC interconnection
        of AI cluster training networks. This layer has intelligent
        operation and maintenance capabilities to ensure highly reliable
        interconnection and supports flexible setup and teardown of
        links according to service needs. On this basis, the distributed
        AI cluster network architecture can provide stable and efficient
        data transmission in long-distance, large-scale distributed
        computing environments.</t>

        <t>&bull; The control layer consists of controllers. Forwarding
        and control are separated in the distributed AIDC network. The
        controller collects the network topology and the network traffic
        reported by the network devices, together with service
        information reported by the servers, such as the model
        partitioning and parallelism mode. From these inputs the
        controller computes global traffic paths and distributes them to
        the network devices.</t>
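As an illustrative sketch of the control-layer workflow (the function names, data layout, and the greedy least-loaded strategy are assumptions of this example, not part of the framework), the controller could place each reported flow onto the least-loaded of its candidate equivalent paths:

```python
# Hypothetical controller path computation: given the flows reported by
# the servers and the candidate paths derived from the collected
# topology, greedily place each flow on the least-loaded path.

def assign_paths(flows, candidate_paths):
    """flows: list of (flow_id, bandwidth_gbps) tuples.
    candidate_paths: dict flow_id -> list of candidate path names."""
    path_load = {}    # path name -> bandwidth already assigned to it
    assignment = {}   # flow_id -> chosen path
    # Place the largest flows first so big flows do not collide later.
    for flow_id, bw in sorted(flows, key=lambda f: -f[1]):
        best = min(candidate_paths[flow_id],
                   key=lambda p: path_load.get(p, 0.0))
        assignment[flow_id] = best
        path_load[best] = path_load.get(best, 0.0) + bw
    return assignment, path_load
```

This is only one possible placement policy; a real controller would also account for path latency, failures, and the job's parallelism mode reported by the servers.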
      </section>

      <section title="Technical Requirements">
        <t>The distributed AIDC lossless network extends the DC lossless
        network from the data center into the wide area network. From
        the perspective of network operation, a distributed AIDC network
        should meet the following requirements.</t>

        <t>Requirement 1: Long-Distance Lossless Interconnection</t>

        <t>RDMA is used as the input/output protocol during large model
        training. Since RDMA is very sensitive to network congestion and
        packet loss, even a small amount of packet loss causes a sharp
        performance degradation. The underlying network must therefore
        provide lossless transmission, ensuring that no congestion or
        packet loss occurs during data transmission, so as to avoid
        degrading the performance of upper-layer protocols.</t>

        <t>Requirement 2: Large-Capacity Interconnection Bandwidth</t>

        <t>Large-capacity interconnection bandwidth ensures the rapid
        transfer of large amounts of data between distributed
        intelligent computing centers, accelerating the training and
        inference of AI models. As data volumes grow, data and model
        parameters must be synchronized efficiently among the
        distributed centers, which requires the network to provide
        sufficient throughput to avoid network congestion and
        performance degradation.</t>

        <t>Requirement 3: Ultra-High Reliability</t>

        <t>To ensure long-term stable training across distributed AI
        data centers and to prevent interruptions caused by external
        factors such as network failures, the transport network needs
        high reliability. For example, the network should recover
        quickly from a link failure so that the AI service is not
        interrupted, avoiding the rollback of training jobs and the loss
        of computational efficiency that a link interruption would
        otherwise cause.</t>

        <t>Requirement 4: Elasticity and Agility</t>

        <t>The distributed AIDC lossless network needs to flexibly set
        up clusters of different sizes and types according to the
        differing needs of multiple tenants. This means the network must
        be able to elastically and agilely set up and tear down
        connections on demand, adjusting quickly to changes in computing
        requirements and dynamically allocating large-bandwidth
        resources.</t>

        <t>Requirement 5: Intelligent Network Operation</t>

        <t>In model training scenarios, packet loss leads to a large
        drop in training performance. This places higher requirements on
        network quality, operation, and maintenance, and makes real-time
        monitoring of network links necessary. The lossless network of a
        distributed intelligent computing center therefore needs
        intelligent operation and maintenance capabilities that can
        quickly and accurately locate and resolve problems, improve the
        accuracy of fault localization, and ensure the stable operation
        of the network.</t>
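To make Requirement 2 concrete, the following back-of-envelope estimate (all figures here are illustrative assumptions, not measurements or recommendations) relates model size and inter-DC link capacity to the time needed for one full gradient exchange:

```python
# Illustrative estimate: how long one full gradient exchange takes over
# the DCI link. All inputs are hypothetical example values.

def sync_time_seconds(params_billion, bytes_per_param, link_gbps):
    """Volume of one full gradient exchange divided by link throughput."""
    volume_bits = params_billion * 1e9 * bytes_per_param * 8
    return volume_bits / (link_gbps * 1e9)

# A hypothetical 70B-parameter model with fp16 gradients (2 bytes each)
# over a 400 Gbps inter-DC link: about 2.8 seconds per exchange, before
# any compression or hierarchical reduction.
t = sync_time_seconds(70, 2, 400)
```

Even this simplified calculation shows why the interconnection bandwidth must be large: the exchange repeats every training iteration, so the link capacity directly bounds iteration time.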
      </section>

      <section title="Key Mechanisms">
        <section title="Collective Communication mechanism in heterogeneous networks">
          <t>The communication pattern of large model training is
          collective communication. Collective communication among GPUs
          refers to information exchange conducted by multiple GPU nodes
          following specific rules, using a series of predefined
          communication patterns such as all-reduce, all-gather, reduce,
          and broadcast to achieve data synchronization and sharing.
          Each iteration of LLM training synchronizes parameters through
          collective communication, and a single collective operation
          involves multiple rounds of data interaction and multiple
          long-distance exchanges. Long-distance links increase
          communication delay, which reduces the training efficiency of
          LLMs. The collective communication mechanism therefore needs
          to be redesigned for heterogeneous networks, with the goal of
          reducing both the amount of data transmitted on long-distance
          links and the number of such transmissions, thereby greatly
          reducing the possibility of long-distance link congestion.</t>
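One common way to realize such an optimization, shown here only as a sketch (the function name and the two-level topology are assumptions of this example, not a normative design), is a hierarchical all-reduce: reduce within each DC first, exchange only one partial sum per DC over the WAN, then broadcast the result locally, so long-distance traffic scales with the number of DCs rather than the number of GPUs:

```python
# Hypothetical hierarchical all-reduce across DCs. Step 2 is the only
# traffic that crosses the long-distance links.

def hierarchical_allreduce(clusters):
    """clusters: list of DCs, each a list of per-GPU gradient vectors."""
    # Step 1: intra-DC reduce over short, high-bandwidth links.
    partials = [[sum(col) for col in zip(*dc)] for dc in clusters]
    # Step 2: inter-DC exchange of one partial sum per DC (WAN traffic).
    global_sum = [sum(col) for col in zip(*partials)]
    # Step 3: intra-DC broadcast of the global result to every GPU.
    return [[list(global_sum) for _ in dc] for dc in clusters]
```

With N GPUs spread over M data centers, the WAN carries M partial sums per round instead of N gradient vectors, which is the reduction in long-distance volume the text calls for.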
        </section>

        <section title="Global load balancing">
          <t>Load balancing mainly addresses congestion and packet loss
          in non-faulty, homogeneous networks under intelligent
          computing workloads. LLM training traffic is highly
          synchronized, consists of large flows, and appears
          periodically; at the same time, flows are present on every
          equivalent path in the network. Traditional load balancing
          based on ECMP hashing cannot achieve a perfect balance across
          all paths. Network-level load balancing can instead pre-plan
          the traffic of the whole network based on these traffic
          characteristics, so that all paths are evenly loaded and
          conflict-free, avoiding congestion and packet loss.</t>
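The contrast between hash-based ECMP and controller-planned placement can be illustrated with a toy sketch (the flow names and the MD5-based hash are assumptions of this example, not a description of any device's actual hash function):

```python
# Toy comparison: hashing a handful of large flows onto equal-cost
# paths can pile several onto one path, while a pre-planned round-robin
# placement keeps every path at the same load.
import hashlib

def ecmp_load(flow_ids, n_paths):
    """Hash-based placement, as a conventional ECMP device might do."""
    load = [0] * n_paths
    for f in flow_ids:
        digest = int(hashlib.md5(f.encode()).hexdigest(), 16)
        load[digest % n_paths] += 1
    return load

def planned_load(flow_ids, n_paths):
    """Controller-planned placement: deterministic and conflict-free."""
    load = [0] * n_paths
    for i, _ in enumerate(flow_ids):
        load[i % n_paths] += 1
    return load
```

Because LLM training produces few, large, synchronized flows, the law of large numbers that normally rescues hashing does not apply, which is why the planned placement matters here.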
        </section>

        <section title="Precise flow-control technology">
          <t>To solve the performance degradation caused by packet loss
          in intelligent computing workloads, the distributed AIDC
          network architecture adopts a precise flow-control technology
          that moves the congestion point from the long-distance link to
          the network device closest to the transmitting node (the
          first-hop device), effectively alleviating network congestion
          by shortening the congestion feedback path. Specifically, a
          network device determines whether congestion is occurring on
          its links by checking network state such as the queue buildup
          and buffer usage of each port. If congestion occurs and the
          device is not the first-hop device for the congested traffic
          flows, it notifies the first-hop device with a congestion
          message. The first-hop device then runs an algorithm, based on
          the degree of congestion, to determine the proportion by which
          the congested flows should be limited. Finally, the first-hop
          device controls the rate of those flows by sending PFC/CNP or
          other flow-control protocol packets.</t>
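The feedback loop above can be sketched as follows; the occupancy threshold and the linear rate-reduction rule are illustrative assumptions of this example, not a normative algorithm:

```python
# Hypothetical first-hop rate computation: a congested device reports a
# congestion degree derived from buffer occupancy, and the first-hop
# device scales the offending flow's rate down proportionally (e.g. by
# pacing PFC/CNP frames toward the sender).

QUEUE_THRESHOLD = 0.7  # assumed occupancy above which a port is congested

def congestion_degree(buffer_used, buffer_size):
    """0.0 below the threshold, rising linearly to 1.0 at a full buffer."""
    occupancy = buffer_used / buffer_size
    return max(0.0, (occupancy - QUEUE_THRESHOLD) / (1.0 - QUEUE_THRESHOLD))

def first_hop_rate_limit(current_rate_gbps, degree):
    """New rate for the congested flow after applying the limit."""
    return current_rate_gbps * (1.0 - degree)
```

For example, a buffer that is 85% full yields a congestion degree of 0.5, halving the flow's rate at the first hop instead of letting the long-distance link drop packets.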
        </section>

        <section title="Packet loss detection">
          <t>The packet loss monitoring technology supports the following
          capabilities:</t>

          <t>(1) Fast fault localization: real-time monitoring of
          indicators such as traffic delay and packet loss;</t>

          <t>(2) Visualization: the centralized control plane of the
          network supports visualization of flow paths;</t>

          <t>(3) Packet loss statistics: within a given statistical
          period, the difference between all traffic entering the
          network and all traffic leaving it is counted; this difference
          is the number of packets lost by the carrying network in that
          period;</t>

          <t>(4) Delay statistics: within a given statistical period,
          the difference between the time at which a flow enters the
          network and the time at which the same flow leaves it is
          measured between two specified network nodes; this difference
          is the delay of the network in that period.</t>
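Statistics (3) and (4) amount to simple differences over per-flow counters and timestamps, sketched below (the data layout and units are assumptions of this example):

```python
# Minimal sketch of the loss and delay statistics: loss is the
# ingress/egress packet-count difference over one statistical period,
# and delay is the timestamp difference for the same flow between two
# measurement nodes.

def period_loss(ingress_counts, egress_counts):
    """Per-flow packet counts at network ingress and egress."""
    return {f: ingress_counts[f] - egress_counts.get(f, 0)
            for f in ingress_counts}

def period_delay_ms(ingress_ts, egress_ts):
    """Timestamps (seconds) of the same flow observed at two nodes."""
    return {f: (egress_ts[f] - ingress_ts[f]) * 1000.0
            for f in ingress_ts if f in egress_ts}
```

In practice these counters and timestamps would come from in-band telemetry or device counters; the computation itself is no more than the subtractions shown here.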
        </section>
      </section>
    </section>

    <section title="Conclusion">
      <t>This document proposes a distributed AIDC network framework and
      introduces the key technologies under this framework, including
      the collective communication mechanism, global load balancing, and
      precise flow control, in order to further promote the verification
      of distributed AIDC interconnection in the future.</t>
    </section>

    <section anchor="security" title="Security Considerations">
      <t>There is no additional security risk introduced by this design.</t>
    </section>

    <section title="IANA Considerations">
      <t>This document introduces no additional considerations for IANA.</t>
    </section>
  </middle>

  <back>
    <references title="Normative References">
      <?rfc include='reference.RFC.2119'?>

      <?rfc include="reference.RFC.8174"?>

    </references>

    <references title="Informative References">
      <?rfc include='reference.I-D.huang-rtgwg-wan-lossless-uc'?>
    </references>
  </back>
</rfc>
