HPCC++: Enhanced High Precision Congestion Control

The link speed in data center networks has grown from 1 Gbps to 100 Gbps in the past decade, and this growth is continuing. Ultra-low latency and high bandwidth, which are demanded by more and more applications, are two critical requirements in today's and future high-speed networks. Given that traditional software-based network stacks in hosts can no longer sustain these latency and bandwidth requirements, offloading network stacks into hardware is an inevitable direction in high-speed networks. As an example, large-scale networks with RDMA (remote direct memory access) often use hardware-offloading solutions. In some cases, RDMA networks still face fundamental challenges in reconciling low latency, high bandwidth utilization, and high stability.

This document describes a new congestion control mechanism, HPCC++ (Enhanced High Precision Congestion Control), for large-scale, high-speed networks. The key idea behind HPCC++ is to leverage the precise link load information signaled through inband telemetry to compute accurate flow rate updates. Unlike existing approaches that often require a large number of iterations to find the proper flow rates, HPCC++ requires only one rate update step in most cases.

Using precise information from inband telemetry enables HPCC++ to address the limitations of current congestion control schemes. First, HPCC++ senders can quickly ramp up flow rates for high utilization and ramp down flow rates for congestion avoidance. Second, HPCC++ senders can quickly adjust the flow rates to keep each link's output rate slightly lower than the link's capacity, which prevents queues from building up while preserving high link utilization. Finally, since sending rates are computed precisely from direct measurements at switches, HPCC++ requires only three independent parameters, which are used to tune fairness and efficiency.

HPCC++ is an enhanced version of HPCC. It takes into account system constraints and aims to reduce the design overhead and further improve performance. Later sections describe the proposed design enhancements and guidelines in detail. This document also describes the architecture changes in switches and end-hosts needed to support the transmission of inband telemetry and its consumption, which improves the efficiency of handling network congestion.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 when, and only when, they appear in all capitals, as shown here.

The figure below shows the end-to-end system that HPCC++ operates in. As a packet travels from the sender to the receiver, each switch along the path inserts inband telemetry that reports the current state of the packet's egress port, including timestamp (ts), queue length (qLen), transmitted bytes (txBytes), and the link bandwidth capacity (B), together with switch_ID and port_ID. When the receiver gets the packet, it may copy all the inband telemetry recorded from the network into the ACK message it sends back to the sender, and the sender then decides how to adjust its flow rate each time it receives an ACK with network load information. Alternatively, the receiver may calculate the flow rate based on the inband telemetry information and feed the calculated rate back to the sender. The notification packets would include delayed ACK information as well. Note that there are also network nodes along the reverse (potentially uncongested) path that the feedback reports traverse. Those network nodes are not shown in the figure for the sake of brevity.

   +--------+  Data   +-------+  Data   +-------+  Data   +---------+
   |        |-------->|       |-------->|       |-------->|         |
   | Sender |=========|Switch1|=========|Switch2|=========| Receiver|
   +--------+  Link-0 +-------+  Link-1 +-------+  Link-2 +---------+
      /|\                                                      |
       |                                                       |
       +-------------------------------------------------------+
                       Notification Packets/ACKs

Data sender: responsible for controlling inflight bytes. HPCC++ is a window-based congestion control scheme that controls the number of inflight bytes. The inflight bytes are the amount of data that has been sent but not yet acknowledged. Controlling inflight bytes has an important advantage compared to controlling rates. In the absence of congestion, the inflight bytes and the rate are interchangeable through the equation inflight = rate * T, where T is the base propagation RTT. The rate can be calculated locally or obtained from the notification packet. The sender may further use a data pacing mechanism, potentially implemented in hardware, to limit the rate accordingly.
Network nodes: responsible for inserting the inband telemetry information into the data packet. The inband telemetry information reports the current load of the packet's egress port, including timestamp (ts), queue length (qLen), transmitted bytes (txBytes), and link bandwidth capacity (B). In addition, the inband telemetry contains switch_ID and port_ID to identify a link.
Data receiver: responsible for either reflecting back the inband telemetry information carried in the data packet, or calculating the proper flow rate from the network congestion information in the inband telemetry and sending notification packets back to the sender.
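As an illustration of the per-hop record and of the relation inflight = rate * T described above, the following is a minimal Python sketch. The class name HopTelemetry, the helper function names, and the choice of units (seconds, bytes, bytes per second) are assumptions made for this example and are not defined by this document.

```python
from dataclasses import dataclass


@dataclass
class HopTelemetry:
    """One per-hop record for a packet's egress port (illustrative layout)."""
    switch_id: int
    port_id: int
    ts: float        # timestamp, in seconds (unit is an assumption)
    q_len: float     # queue length, in bytes
    tx_bytes: float  # accumulated transmitted bytes
    b: float         # link bandwidth capacity, in bytes per second


def window_from_rate(rate: float, base_rtt: float) -> float:
    """inflight = rate * T, valid in the absence of congestion."""
    return rate * base_rtt


def rate_from_window(window: float, base_rtt: float) -> float:
    """The inverse relation, rate = inflight / T."""
    return window / base_rtt


# A 100 Gbps link (12.5e9 bytes/s) with an assumed 8 us base RTT
# corresponds to roughly 100 KB of inflight data.
example_window = window_from_rate(12.5e9, 8e-6)
```

The switch_id and port_id pair identifies the link, so a sender can match the records of consecutive ACKs hop by hop.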

HPCC++ is a window-based congestion control algorithm. The key design choice of HPCC++ is to rely on network nodes to provide fine-grained load information, such as queue size and accumulated tx/rx traffic, in order to compute precise flow rates. This has two major benefits: (i) HPCC++ can quickly converge to proper flow rates that highly utilize bandwidth while avoiding congestion; and (ii) HPCC++ can consistently maintain a close-to-zero queue for low latency. This section introduces the list of notations and describes the core congestion control algorithm.

This section summarizes the list of variables and parameters used in the HPCC++ algorithm. It also includes default values for the algorithm parameters, chosen either to represent a typical setting in practical applications or based on theoretical and simulation studies.

The HPCC++ algorithm can be outlined as below:

    1: Function MeasureInflight(ack)
    2:    u = 0;
    3:    for each link i on the path do
    4:       txRate = (ack.L[i].txBytes - L[i].txBytes)
                      / (ack.L[i].ts - L[i].ts);
    5:       u' = min(ack.L[i].qlen, L[i].qlen) / (ack.L[i].B * T)
                  + txRate / ack.L[i].B;
    6:       if u' > u then
    7:          u = u'; tau = ack.L[i].ts - L[i].ts;
    8:    tau = min(tau, T);
    9:    U = (1 - tau/T)*U + tau/T*u;
   10:    return U;

   11: Function ComputeWind(U, updateWc)
   12:    if U >= eta or incStage >= maxStage then
   13:       W = Wc / (U/eta) + W_ai;
   14:       if updateWc then
   15:          incStage = 0; Wc = W;
   16:    else
   17:       W = Wc + W_ai;
   18:       if updateWc then
   19:          incStage++; Wc = W;
   20:    return W

   21: Procedure NewACK(ack)
   22:    if ack.seq > lastUpdateSeq then
   23:       W = ComputeWind(MeasureInflight(ack), True);
   24:       lastUpdateSeq = snd_nxt;
   25:    else
   26:       W = ComputeWind(MeasureInflight(ack), False);
   27:    R = W/T; L = ack.L;

The above illustrates the overall process of CC at the sender side for a single flow. Each newly received ACK message triggers the procedure NewACK at Line 21. At Line 22, the variable lastUpdateSeq is used to remember the first packet sent with a new Wc, and the sequence number in the incoming ACK should be larger than lastUpdateSeq to trigger a new sync between Wc and W (Lines 14-15 and 18-19). The sender also remembers the pacing rate and the current inband telemetry information at Line 27. The sender computes a new window size W at Line 23 or Line 26, depending on whether Wc is to be updated, using the functions MeasureInflight and ComputeWind.

Function MeasureInflight estimates the normalized inflight bytes with Eqn (2) at Line 5. First, it computes the txRate of each link from the current and last accumulated transmitted bytes txBytes and timestamp ts (Line 4). It also uses the minimum of the current and last qlen to filter out noise in qlen (Line 5). The loop from Line 3 to Line 7 selects max_i(U_i) in Eqn (3). Instead of directly using max_i(U_i), we use an EWMA (Exponentially Weighted Moving Average) to filter the noise from timer inaccuracy and transient queues (Line 9).

Function ComputeWind combines multiplicative increase/decrease (MI/MD) and additive increase (AI) to balance reaction speed and fairness. If a sender finds that it should increase the window size, it first tries AI for maxStage times with the step W_ai (Line 17). If it still finds room to increase after maxStage rounds of AI, or if the normalized inflight bytes are above eta, it calls Eqn (4) once to quickly ramp the window size up or down (Lines 12-13).
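To make the sender-side procedure more concrete, the following is a minimal Python sketch of MeasureInflight, ComputeWind, and NewACK for a single flow. It is an illustration under stated assumptions rather than a reference implementation: the class and field names, the initial value of U, the W_ai value, the handling of the very first ACK (which has no previous telemetry to compare against), and the guard against a non-increasing timestamp are all choices made for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Hop:
    ts: float        # egress timestamp, seconds
    qlen: float      # queue length, bytes
    tx_bytes: float  # accumulated transmitted bytes
    b: float         # link capacity, bytes per second


@dataclass
class HpccSender:
    T: float                 # base propagation RTT, seconds
    w_init: float            # initial window, bytes
    eta: float = 0.95
    max_stage: int = 5
    w_ai: float = 80.0       # AI step in bytes (illustrative value)
    U: float = 1.0           # EWMA state (initial value is an assumption)
    inc_stage: int = 0
    last_update_seq: int = 0
    L: List[Hop] = field(default_factory=list)  # telemetry of the last ACK

    def __post_init__(self) -> None:
        self.W = self.w_init      # current window
        self.Wc = self.w_init     # reference window
        self.R = self.W / self.T  # pacing rate

    def measure_inflight(self, hops: List[Hop]) -> float:
        u, tau = 0.0, 0.0
        for cur, last in zip(hops, self.L):
            dt = cur.ts - last.ts
            if dt <= 0:           # guard added for this sketch
                continue
            tx_rate = (cur.tx_bytes - last.tx_bytes) / dt
            u_i = min(cur.qlen, last.qlen) / (cur.b * self.T) + tx_rate / cur.b
            if u_i > u:           # keep the most loaded link
                u, tau = u_i, dt
        tau = min(tau, self.T)
        self.U = (1 - tau / self.T) * self.U + (tau / self.T) * u
        return self.U

    def compute_wind(self, U: float, update_wc: bool) -> float:
        if U >= self.eta or self.inc_stage >= self.max_stage:
            W = self.Wc / (U / self.eta) + self.w_ai   # MI/MD plus AI
            if update_wc:
                self.inc_stage = 0
                self.Wc = W
        else:
            W = self.Wc + self.w_ai                    # AI only
            if update_wc:
                self.inc_stage += 1
                self.Wc = W
        return W

    def new_ack(self, seq: int, hops: List[Hop], snd_nxt: int) -> float:
        if not self.L:            # first ACK: only record telemetry (assumption)
            self.L = hops
            return self.W
        if seq > self.last_update_seq:
            self.W = self.compute_wind(self.measure_inflight(hops), True)
            self.last_update_seq = snd_nxt
        else:
            self.W = self.compute_wind(self.measure_inflight(hops), False)
        self.R = self.W / self.T
        self.L = hops
        return self.W
```

With T = 10 us, a 12.5e9 bytes/s link, and a link that transmitted at twice its capacity during the last RTT, U becomes 2 and the window is cut in a single step to about Wc * eta / 2 plus W_ai.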

HPCC++ has three easy-to-set parameters: eta, maxStage, and W_ai.

eta controls a simple tradeoff between utilization and transient queue length (due to the temporary collision of packets caused by their random arrivals). We set it to 95% by default, which gives up only 5% of the bandwidth but achieves an almost zero queue.

maxStage controls a simple tradeoff between steady-state stability and the speed of reclaiming free bandwidth. We find that maxStage = 5 is conservatively large for stability, while the speed of reclaiming free bandwidth is still much faster than traditional additive increase, especially in high-bandwidth networks.

W_ai controls the tradeoff between the maximum number of concurrent flows on a link that can sustain near-zero queues and the speed of convergence to fairness. Note that none of the three parameters is reliability-critical.

HPCC++'s design brings advantages to short-lived flows by allowing flows to start at line rate and by separating utilization convergence from fairness convergence. HPCC++ achieves fast utilization convergence to mitigate congestion in almost one round-trip time, while allowing flows to gradually converge to fairness. This design feature of HPCC++ is especially helpful for the workload of datacenter applications, where flows are usually short and latency-sensitive.

Normally we set a very small W_ai to support a large number of concurrent flows on a link, because slower fairness convergence is not critical. A rule of thumb is to set W_ai = W_init*(1-eta) / N, where N is the expected or receiver-reported maximum number of concurrent flows on a link. The intuition is that the total additive increase in every round (N*W_ai) should not exceed the bandwidth headroom, so that no queue forms. Even if the actual number of concurrent flows on a link exceeds N, the CC is still stable and achieves full utilization, but it cannot maintain zero queues.
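The rule of thumb for W_ai can be checked with a few lines of Python. This is a small sketch; the function name and the example numbers (a 125000-byte W_init and N = 100) are assumptions chosen for illustration only.

```python
def additive_increase_step(w_init: float, eta: float, n_flows: int) -> float:
    """Rule of thumb: W_ai = W_init * (1 - eta) / N."""
    if n_flows < 1:
        raise ValueError("n_flows must be at least 1")
    return w_init * (1.0 - eta) / n_flows


# Assumed example: W_init = 125000 bytes, eta = 0.95, N = 100 flows.
w_ai = additive_increase_step(125000.0, 0.95, 100)   # about 62.5 bytes
headroom = 125000.0 * (1.0 - 0.95)                   # about 6250 bytes
total_increase_per_round = 100 * w_ai                # stays within headroom
```

With these numbers each flow adds about 62.5 bytes per round, and the 100 flows together add about 6250 bytes, which matches the 5% headroom left by eta.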

There are three components that HPCC++ needs to implement: telemetry padding, congestion notification, and rate update.

The specifications of switch padding for inband telemetry can be found in the specification of the telemetry protocol in use, such as IFA, IETF IOAM, or P4.org INT.

HPCC++ uses congestion notification to fetch network congestion information from switches for proper rate updates at end-hosts. Although the basic algorithm described above adds inband telemetry information to every data packet for optimal performance, HPCC++ supports flexible implementation choices in order to work seamlessly with transport protocol stacks. We consider congestion notification choices in both the forward and reverse directions of the traffic.

The forward direction is the traffic direction of data packets that experience bandwidth contention and possible network congestion. The function of congestion notification in the forward direction is to fetch inband telemetry from switches. HPCC++ defines two approaches for doing this.

1. Inband with data packet. This is the basic algorithm setting, in which the end-host inserts an inband telemetry header into data packets. Switches along the path detect the inband telemetry header and correspondingly add inband telemetry information to the data packet, so that the sender can react to congestion as soon as the very first packet observes it. This is especially helpful for reducing the risk of severe congestion in incast scenarios during the first round-trip time. In addition, the original HPCC algorithm introduces Wc to solve the over-reaction issue that arises from this per-packet response. Unlike the basic algorithm, the end-host can choose to use every data packet or only a subset of data packets to reduce the overhead. To insert the telemetry header, different telemetry protocols (IFA, IETF IOAM, and P4.org INT) have their own specific settings.

2. Probe packet. Because having switches touch every data packet to insert inband telemetry may raise security or performance concerns, HPCC++ also supports an ``out-of-band'' approach that uses specially generated probe packets at end-hosts to fetch inband telemetry from switches. The probe packets should therefore take the same routing path and QoS queue as the data packets. End-hosts can generate probe packets less frequently than data packets; we recommend once per round-trip time. That is, the end-host sends a new probe packet once it receives the response to the previous one. In addition, the end-host issues probe packets only when it has data packets in flight.
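The probe pacing rule of the second approach (at most one outstanding probe, and probes only while data is in flight) can be sketched as follows. The class and method names are hypothetical, and the sketch deliberately leaves out a timeout for lost probes or lost responses, which a real implementation would need.

```python
class ProbePacer:
    """Allows one outstanding probe at a time, only while data is in flight."""

    def __init__(self) -> None:
        self.probe_outstanding = False

    def should_send_probe(self, inflight_bytes: int) -> bool:
        """Return True if a new probe may be sent now, and mark it outstanding."""
        if inflight_bytes > 0 and not self.probe_outstanding:
            self.probe_outstanding = True
            return True
        return False

    def on_probe_response(self) -> None:
        """The response arrived, so the next probe may be sent."""
        self.probe_outstanding = False
```

Since a response arrives roughly one round-trip time after its probe was sent, this gating naturally yields about one probe per RTT.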

In the reverse direction, the receiver conveys inband telemetry back to the traffic sender for rate updates. Similar to the forward direction, there are both inband and out-of-band approaches.

1. Inband with ACK packet. HPCC++ supports using the ACK packets of transport protocols to convey the inband telemetry. TCP generates an ACK packet for every data packet or for every few data packets. With ACK packets, the receiver sends the accumulated inband telemetry back to the sender for rate updates.

2. Notification packet. Using ACK packets for inband telemetry notification requires transport stack modification and sometimes delays the notification when a delayed acknowledgment mechanism is used. Hence, HPCC++ allows the receiver to use specially generated notification packets to deliver inband telemetry. A notification packet is generated for each probe packet or data packet that carries inband telemetry.

The figure below shows an HPCC++ implementation on a NIC. The NIC provides an HPCC++ module that resides on the data path of the NIC; the module realizes both the sender and the receiver roles.

   +----------------------------------------------------------------+
   | +--------+                +-----------+         +-----------+  |
   | |        |--------------->| Scheduler |-------> |Tx pipeline|--+->
   | |        |  rate update   +-----------+         +-----------+  |
   | | HPCC++ |                                            ^        |
   | |        |                            inband telemetry|        |
   | | module |                                            |        |
   | |        |                                      +-----+-----+  |
   | |        |<-------------------------------------|Rx pipeline|<-+--
   | +--------+   telemetry response event           +-----------+  |
   +----------------------------------------------------------------+

1. Sender side flow

The HPCC++ module runs the HPCC CC algorithm on the sender side for every flow in the NIC. A flow can be defined by transport parameters such as the 5-tuple, the destination QP (queue pair), etc. The module receives per-flow inband telemetry response events, which are generated by the RX pipeline, adjusts the sending window and rate, and updates the scheduler with the rate and window of the flow.

The scheduler contains a pacing mechanism that determines the flow rate from the value it gets from the algorithm. It also maintains the current sending window size for active flows. If the pacing mechanism and the flow's sending window permit, the scheduler issues a PktSend command for the flow to the TX pipeline.

The TX pipeline implements packet processing. Once it receives the PktSend event with the flow ID from the scheduler, it generates the corresponding packet and delivers it to the network. If a sent packet should collect telemetry on its way, the TX pipeline may add indications or headers that trigger the network elements to add telemetry data according to the inband telemetry protocol in use. The telemetry can be collected by the data packets or by dedicated probe packets generated in the TX pipeline.

The RX pipeline parses the incoming packets from the network and identifies whether telemetry is embedded in the parsed packet. On receiving a telemetry response packet, the RX pipeline extracts the network status from the packet and passes it to the HPCC++ module for processing. A telemetry response packet can be an ACK containing inband telemetry, or a dedicated telemetry response probe packet.

2. Receiver side flow

On receiving a packet containing inband telemetry, the RX pipeline extracts the network status and the flow parameters from the packet and passes them to the TX pipeline. The packet can be a data packet containing inband telemetry, or a dedicated telemetry request probe packet. The TX pipeline may process and edit the telemetry data, and then sends the data back to the sender using either an ACK packet of the flow or a dedicated telemetry response packet.

Note that the window/rate calculation can be implemented at either the data sender or the data receiver. If ACK packets already exist for reliability purposes, the inband telemetry information can be echoed back to the sender via ACK self-clocking. Not all ACK packets need to carry the inband telemetry information. To reduce the Packet Per Second (PPS) overhead, the receiver may examine the inband telemetry information and adopt the technique of delayed ACKs, sending out only one ACK for every few received packets. To reduce PPS even further, one may implement the algorithm at the receiver and feed the calculated window back in an ACK packet once every RTT. The receiver-based algorithm, Rx-HPCC, is based on int.L, which is the inband telemetry information in the packet header. The receiver performs the same functions, except that it uses int.L instead of ack.L. The new procedure NewINT(int.L) replaces NewACK.

   28: Procedure NewINT(int.L)
   29:    if now > (lastUpdateTime + T) then
   30:       W = ComputeWind(MeasureInflight(int), True);
   31:       send_ack(W)
   32:       lastUpdateTime = now;
   33:    else
   34:       W = ComputeWind(MeasureInflight(int), False);

Here, since the receiver does not know the starting sequence number of a burst, it simply records lastUpdateTime. If time T has passed since lastUpdateTime, the algorithm recalculates Wc as in Line 30 and sends out the ACK packet, which includes the W information. Otherwise, it just updates W locally. This reduces the amount of traffic that needs to be fed back to the data sender. Note that the receiver can also measure the number of outstanding flows, N, if the last hop is the congestion point, and use this information to dynamically adjust W_ai to achieve better fairness. This improvement allows flows to quickly converge to fairness without causing large swings under heavy load.
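The time-based gating used by the receiver can be sketched in Python as below. The sketch only models the decision of when to update Wc and send feedback; the window computation itself is passed in as a callback so that the example stays self-contained. The names RxFeedbackGate, on_int, and compute_wind are hypothetical.

```python
from typing import Callable, Optional


class RxFeedbackGate:
    """Decides, once per base RTT, when the receiver feeds W back."""

    def __init__(self, base_rtt: float, now: float = 0.0) -> None:
        self.T = base_rtt
        self.last_update_time = now

    def on_int(self, now: float,
               compute_wind: Callable[[bool], float]) -> Optional[float]:
        """Process one packet carrying telemetry.

        compute_wind(update_wc) stands in for
        ComputeWind(MeasureInflight(int), update_wc).
        Returns W when an ACK carrying W should be sent, otherwise None.
        """
        if now > self.last_update_time + self.T:
            w = compute_wind(True)
            self.last_update_time = now
            return w              # caller sends an ACK that carries w
        compute_wind(False)       # local update only, no feedback
        return None
```

A caller would send an ACK carrying the returned window whenever on_int returns a value, and stay silent otherwise.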

HPCC++ can be adopted as the CC algorithm by a wide range of transport protocols such as TCP and UDP, as well as others that may run on top of them, such as iWARP, RoCE, etc. It requires a window limit and congestion feedback through ACK self-clocking, which naturally conforms to the paradigm of TCP design. On top of that, HPCC++ introduces a scheme to measure the total inflight bytes for more precise congestion control. To run over UDP, some incremental modifications are needed to enforce the window limit and to collect congestion feedback via probe packets.

We describe a reference implementation on RDMA RoCEv2. This is an implementation of ``Sender-based HPCC++'' (see section 6.3.1.) that uses dedicated probe packets to collect the telemetry. The HPCC++ module in the sender triggers the sending of a ``telemetry request packet'' for a given flow, and the NIC then sends the probe packet. The packet has the same IP and UDP headers as the data packets of the given flow. Such a packet is expected to be sent every RTT; see section 6 for more details. On receiving a telemetry request packet, the receiver NIC extracts the telemetry of all the links along the path from the sender. Its HPCC++ module chooses the link with the highest inflight bytes and sends that link's telemetry (queue length, timestamp, and tx bytes) back to the sender in a dedicated ``telemetry response packet''. On receiving a telemetry response packet, the sender NIC extracts the telemetry and passes it to the HPCC++ module, which uses this information to run the rate update scheme.

Taking the benefit of precise congestion control for TCP is a natural next step. Since TCP segmentation at the TX side (e.g., TSO) and coalescing at the RX side (e.g., GRO) happen in the NIC hardware or in the lower layers of the TCP/IP stack, carrying per-packet inband telemetry information between the TCP congestion control engine and the network fabric would have to work with TSO and GRO. Instead, one way to adopt HPCC++ for TCP is to use special probe and notification packets to retrieve inband telemetry information. The sender generates a probe packet when it is actively sending data. The probe packet has the same 5-tuple (source and destination addresses, source and destination ports, and protocol number) as the data packets and carries the inband telemetry header. The switches along the path identify the probe packet by its inband telemetry header and insert the inband telemetry. Once it receives the probe packet with inband telemetry, the receiver replies to the sender with a response packet that piggybacks the inband telemetry. Note that both probe and response packets use a special DSCP value so that they can bypass TSO and GRO on each side.

This document makes no request of IANA.

Although the discussion above mainly focuses on the data center environment, HPCC++ can be adopted in the Internet at large. There are several security considerations one should be aware of.

Privacy concerns may arise when the telemetry information is conveyed across Autonomous Systems (ASes) and back to end-users. The link load information captured in telemetry can potentially reveal the provider's network capacity, route utilization, scheduling policy, etc. These are usually considered sensitive data by network providers. Hence, certain actions may be taken to anonymize the telemetry data and to convey only the relative ratio needed for rate adaptation across ASes, without revealing the actual network load.

Another consideration is the security of receiving telemetry information. The rate adaptation mechanism in HPCC++ relies on feedback from the network. As such, it is vulnerable to attacks where feedback messages are hijacked, replaced, or intentionally injected with misleading information, resulting in denial of service, similar to the attacks that can affect TCP. It is therefore RECOMMENDED that the notification feedback message is at least integrity checked. In addition, the potential risk of a receiver providing misleading congestion feedback information, and the mechanisms for mitigating such risks, have been discussed elsewhere.
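As one possible way to integrity check a notification feedback message, the sketch below appends an HMAC-SHA256 tag to the feedback payload and verifies it at the sender. This document does not specify any such mechanism; the pre-shared key, the tag placement, and the function names are assumptions made for this example only.

```python
import hashlib
import hmac
from typing import Optional

TAG_LEN = 32  # size of an HMAC-SHA256 tag in bytes


def protect(payload: bytes, key: bytes) -> bytes:
    """Append an HMAC-SHA256 tag to a feedback payload."""
    tag = hmac.new(key, payload, hashlib.sha256).digest()
    return payload + tag


def verify(message: bytes, key: bytes) -> Optional[bytes]:
    """Return the payload if the tag is valid, otherwise None."""
    if len(message) < TAG_LEN:
        return None
    payload, tag = message[:-TAG_LEN], message[-TAG_LEN:]
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    # compare_digest avoids leaking the mismatch position through timing
    return payload if hmac.compare_digest(tag, expected) else None
```

A sender that receives feedback for which verify returns None would simply ignore it instead of feeding it into the rate update.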

HPCC++ falls in the general category of switch-assisted congestion control. However, HPCC++ includes a few unique design choices that are different from other switch-assisted approaches.

First, HPCC++ implements a primal-mode algorithm that requires only the ``write-to-packet'' operation from switches, which is already supported by telemetry protocols like INT or IOAM. Please note that this is very different from dual-mode algorithms such as XCP and RCP, where switches take an active role in determining flows' rates.
Second, HPCC++ achieves a fast utilization convergence by decoupling it from fairness convergence, which is inspired by XCP.
Third, HPCC++ enables switch-guided multiplicative increase (MI) by defining the ``inflight byte'' to quantify the link load. The inflight byte precisely indicates both underload and overload of the link, and thus allows the flow to increase or decrease its rate multiplicatively and safely. By contrast, traditional approaches that use the queue length or RTT as feedback cannot guide the rate increase and instead have to rely on additive increase (AI) with heuristics. As link speeds continue to grow, this becomes increasingly slow in reclaiming unused bandwidth. Besides, queue-based feedback mechanisms are subject to latency inflation.
Last, HPCC++ uses the TX rate instead of the RX rate used by XCP and RCP. We view the TX rate as more precise, because the RX rate and the queue length overlap and thus cause oscillation.

When QoS (Quality of Service) priority queuing is used in switches, the length of a flow's own queue cannot tell the actual queuing time or the exact extent of congestion. Although general approaches for running congestion control with QoS queuing are out of the scope of this document, we provide a few hints for running HPCC++ together with QoS queuing. In this case, HPCC++ can leverage the packet sojourn time (the egress timestamp minus the ingress timestamp) instead of the queue length to quantify the packet's actual queuing delay. In addition, operators typically use Deficit Weighted Round Robin (DWRR) instead of strict priority (SP) as their QoS scheduling to prevent traffic starvation. DWRR provides a minimum bandwidth guarantee for each queue, which HPCC++ can leverage for precise rate updates to avoid congestion.

HPCC++ allows switches and end-hosts to share precise information about network utilization, which suggests a framework for path selection and rate control at end-hosts. The framework that HPCC++ enables leverages each switch to report its link load information via inband telemetry. The end-host fetches inband telemetry along the traffic routes and makes timely and accurate decisions on path selection and traffic admission.

The authors would like to thank RTGWG members for their valuable review comments and helpful input to this specification.

The following individuals have contributed to the implementation and evaluation of the proposed scheme, and therefore have helped to validate and substantially improve this specification: Pedro Y. Segura, Roberto P. Cebrian, Robert Southworth and Malek Musleh.