
Why AI Training Clusters Need Lossless Ethernet — And What That Means for Your Fabric

Mario Geovanny Campos·2026-05-08·8 min read

The Problem No One Talks About in the Marketing Decks

When NVIDIA or Meta talks about training a large language model, they talk about GPU count, FLOPS, and model parameters. What they don't put in the headline is this: a distributed training job across 1,000 GPUs will fail or slow to a crawl if your network drops a single packet at the wrong moment.

That's not an exaggeration. It's the fundamental reason AI networking is different from everything else we've built in data centers.

Why Training Is Different from Inference

Most network engineers have a working model of general DC traffic patterns — bursty east-west flows, some north-south, stateless HTTP, database traffic. Training clusters break all of those assumptions.

Inference is stateless. A request comes in, a model runs, a response goes out. A dropped packet just means a slower response or a retry. The job doesn't die.

Training is stateful and collective. Every GPU in the cluster participates in collective operations like AllReduce — a synchronization primitive in which every node contributes its gradient updates and receives the combined result. The operation can't complete until every participant finishes. One slow or stalled GPU holds up the entire job.

This changes the network design requirement fundamentally. You're not designing for average throughput. You're designing for tail latency at scale.
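
To put numbers on that: in a ring AllReduce, each GPU sends and receives roughly 2 × (N − 1)/N times the gradient size per step, and the step finishes only when the slowest participant does. A back-of-the-envelope sketch (the GPU count, fp16 gradient size, and 400 Gb/s link speed are illustrative assumptions, not a reference design):

```python
# Back-of-the-envelope: per-GPU traffic for one ring AllReduce step.
# Illustrative assumptions: fp16 gradients, 400 Gb/s per GPU, no overlap.

num_gpus = 1024                      # participants in the collective
grad_bytes = 70e9 * 2                # 70B parameters x 2 bytes (fp16)
link_gbps = 400                      # per-GPU network bandwidth

# Ring AllReduce: each rank sends/receives 2*(N-1)/N of the gradient volume.
per_gpu_bytes = 2 * (num_gpus - 1) / num_gpus * grad_bytes

ideal_seconds = per_gpu_bytes * 8 / (link_gbps * 1e9)
print(f"per-GPU traffic: {per_gpu_bytes / 1e9:.1f} GB")
print(f"ideal transfer time: {ideal_seconds:.2f} s per AllReduce")

# The step completes at the pace of the slowest rank: a single stalled link
# or retransmission storm delays all 1,024 GPUs, not just one.
```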

What Happens When You Drop a Packet in RoCEv2

RoCEv2 (RDMA over Converged Ethernet, version 2) carries RDMA — Remote Direct Memory Access — over routed UDP/IP. The GPU writes directly into the memory of another GPU across the Ethernet fabric, bypassing the CPU entirely. This is what gives you the microsecond-level latency you need for collective operations.

The problem: the RDMA transport has no built-in congestion control. Unlike TCP, which backs off gracefully when it detects loss, RoCE NICs have traditionally handled a drop with go-back-N retransmission: one lost packet forces the sender to replay everything sent after it, making the congestion that caused the drop even worse.

The solution the industry landed on is lossless Ethernet — a fabric that simply never drops packets due to congestion. Instead of dropping, switches signal senders to pause.

The Three Technologies That Make It Work

Priority Flow Control (PFC) — IEEE 802.1Qbb

PFC is a per-priority pause mechanism. When a switch's ingress buffer for a particular priority class crosses its pause threshold, the switch sends a PFC pause frame to the upstream device, telling it to stop sending on that priority.

The key design decision: you isolate RoCE traffic into its own priority class (typically priority 3). PFC only pauses that class — not your regular IP traffic. This is what people mean by a lossless lane: a no-drop traffic class that never discards under congestion.
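
A practical consequence: the switch needs enough headroom buffer on that priority to absorb the bytes still in flight between the moment it decides to pause and the moment the upstream device actually stops. A rough sketch of the calculation (cable length, response delay, and MTU values here are illustrative; vendors publish their own worksheets):

```python
# Rough PFC headroom estimate for one port/priority (illustrative values).
# Headroom must cover the round trip on the wire plus the sender's reaction
# time, plus a maximum-size frame in flight in each direction.

link_gbps = 400            # port speed
cable_m = 100              # cable length
prop_delay_ns_per_m = 5    # ~5 ns per meter in fiber
response_ns = 1000         # pause-frame processing + sender reaction (assumed)
mtu_bytes = 9216           # jumbo frames for RoCE payloads

rtt_ns = 2 * cable_m * prop_delay_ns_per_m + response_ns
in_flight_bytes = link_gbps * 1e9 / 8 * rtt_ns * 1e-9
headroom_bytes = in_flight_bytes + 2 * mtu_bytes

print(f"in-flight during pause round trip: {in_flight_bytes / 1024:.0f} KiB")
print(f"headroom per port/priority:        {headroom_bytes / 1024:.0f} KiB")
```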

Explicit Congestion Notification (ECN)

ECN is the early warning system. Before buffers actually fill up and PFC kicks in, switches mark packets with a Congestion Experienced (CE) bit. Endpoints see the marked packets and reduce their transmission rate proactively.

In RoCEv2 environments, this is handled by DCQCN (Data Center Quantized Congestion Notification) — a rate-based algorithm that responds to ECN marks by reducing the send rate, then slowly ramps back up.
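
On the switch side, the marking behavior is a RED-style curve: nothing is marked below K_min, everything above K_max, and the probability ramps linearly in between. A minimal sketch, with threshold values that are purely illustrative rather than recommendations:

```python
def ecn_mark_probability(queue_kb: float, k_min: float = 200.0,
                         k_max: float = 1000.0, p_max: float = 0.1) -> float:
    """RED-style ECN marking curve commonly paired with DCQCN.

    Below K_min nothing is marked; between K_min and K_max the marking
    probability ramps linearly up to P_max; above K_max every packet gets
    the CE mark. Thresholds here are illustrative, not recommendations.
    """
    if queue_kb <= k_min:
        return 0.0
    if queue_kb >= k_max:
        return 1.0
    return p_max * (queue_kb - k_min) / (k_max - k_min)

for depth in (100, 300, 600, 900, 1200):   # queue depth in KB
    print(f"{depth:>5} KB -> mark probability {ecn_mark_probability(depth):.3f}")
```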

The Design Principle: PFC is the emergency brake, ECN is the cruise control.

You want ECN doing most of the work. If PFC is firing constantly, your ECN tuning is wrong.

DCQCN (Data Center Quantized Congestion Notification)

DCQCN is the end-to-end algorithm that ties ECN marking together with rate-based congestion control. Developed by Microsoft and Mellanox, and now the de facto standard in RoCEv2 deployments, it runs in the RDMA NIC (RNIC) rather than in the OS network stack.

Parameters you'll tune: K_min and K_max (the switch-side ECN marking thresholds), g (the NIC-side rate-reduction gain), and the timers that govern how fast the rate ramps back up. Tune them too loosely and the fabric is under-reactive (PFC storms); too aggressively and it's over-reactive (throughput collapse).
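
To make g concrete, here is a stripped-down toy model of the rate-decrease side of DCQCN, following the update rules from the published algorithm (recovery and rate-increase phases omitted; the line rate and g value are illustrative):

```python
class DcqcnSender:
    """Toy model of DCQCN's sender-side (reaction point) rate control.

    Simplified from the published algorithm: alpha estimates how congested
    the path looks, each CNP cuts the current rate in proportion to alpha,
    and alpha decays during quiet periods. Recovery phases are omitted.
    """

    def __init__(self, line_rate_gbps: float = 400.0, g: float = 1 / 256):
        self.rc = line_rate_gbps     # current rate
        self.rt = line_rate_gbps     # target rate (snapshot before a cut)
        self.alpha = 1.0             # congestion estimate
        self.g = g                   # how fast alpha reacts (tunable)

    def on_cnp(self):
        """Congestion Notification Packet received: cut the rate."""
        self.rt = self.rc
        self.rc = self.rc * (1 - self.alpha / 2)
        self.alpha = (1 - self.g) * self.alpha + self.g

    def on_alpha_timer(self):
        """No CNP seen for a timer period: let alpha decay."""
        self.alpha = (1 - self.g) * self.alpha


sender = DcqcnSender()
sender.on_cnp()
print(f"heavy congestion cut: {sender.rc:5.1f} Gb/s (alpha={sender.alpha:.2f})")

for _ in range(100):              # quiet period: alpha decays, cuts get gentler
    sender.on_alpha_timer()
sender.on_cnp()
print(f"later, milder cut:    {sender.rc:5.1f} Gb/s (alpha={sender.alpha:.2f})")
```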

What This Means for Fabric Design

A lossless fabric isn't just a configuration change. It constrains several design decisions:

Topology matters more. An oversubscribed access layer introduces congestion points that PFC has to deal with. Non-blocking Clos topologies with proper buffering are the baseline assumption for AI clusters.
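
The non-blocking claim is easy to sanity-check with port math. A sketch assuming 64-port switches, one NIC per GPU, and a two-tier leaf-spine (the port counts are illustrative, not a bill of materials):

```python
# Non-blocking two-tier Clos port math (illustrative 64-port switches).
leaf_ports = 64
downlinks_per_leaf = leaf_ports // 2      # ports facing GPUs
uplinks_per_leaf = leaf_ports // 2        # ports facing spines (1:1)

spine_ports = 64
spines_needed = uplinks_per_leaf          # one uplink from each leaf to each spine
max_leaves = spine_ports                  # each spine port takes one leaf

max_gpus = max_leaves * downlinks_per_leaf
oversubscription = downlinks_per_leaf / uplinks_per_leaf

print(f"spines: {spines_needed}, leaves: {max_leaves}, GPUs: {max_gpus}")
print(f"oversubscription at the leaf: {oversubscription:.1f}:1")
```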

Buffer sizing changes. Deep-buffer switches (Arista's 7800R3 series, Cisco's R-series Nexus platforms) absorb congestion events with far less PFC pausing than shallow-buffer switches. When pauses cascade hop by hop toward the senders — pause propagation, which in the worst case becomes PFC deadlock — shallow buffers make the problem worse.

Every link in the path must be configured consistently. One switch in the path with the wrong QoS policy breaks the lossless guarantee. In multi-vendor environments, this is where deployments most often go wrong.
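
One way to catch that class of mistake early is to diff the intended QoS settings against what each hop actually reports. A toy sketch of the idea (device names, DSCP value, and queue numbers are hypothetical placeholders; a real check would pull this state from telemetry or config management):

```python
# Toy consistency check: every hop must map RoCE traffic to the same
# lossless queue with PFC enabled, or the end-to-end guarantee is gone.
expected = {"roce_dscp": 26, "queue": 3, "pfc_enabled": True}

# Hypothetical per-device state, as collected from telemetry.
hops = {
    "leaf-101":  {"roce_dscp": 26, "queue": 3, "pfc_enabled": True},
    "spine-201": {"roce_dscp": 26, "queue": 3, "pfc_enabled": False},  # drifted
    "leaf-102":  {"roce_dscp": 24, "queue": 3, "pfc_enabled": True},   # drifted
}

for device, qos in hops.items():
    mismatches = {k: v for k, v in qos.items() if expected[k] != v}
    if mismatches:
        print(f"{device}: lossless path broken -> {mismatches}")
```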

Monitoring becomes critical. PFC pause counters, ECN mark rates, DCQCN statistics — you need telemetry from every device in the path. This is one reason Arista dominates AI DC deployments — their streaming telemetry gives you per-queue, per-second visibility.

The CCDE Design Question

If you were sitting the CCDE AI Infrastructure scenario and the question was "design a 128-GPU training fabric," the answer isn't just "use RoCEv2 and enable PFC." That's implementation, not design.

The design question is: given the constraints of this environment, what fabric topology, buffer architecture, and congestion control approach gives you the best combination of performance, operational simplicity, and blast radius when something goes wrong?

That's what we'll dig into in the next post — sizing and topology decisions for GPU clusters at different scales.


Mario Geovanny Campos is a CCIE EI pursuing CCDE AI Infrastructure. He writes about network fabric design and AI infrastructure at fluxfabric.net.
