Why MRC Represents a New Paradigm for AI Datacenter Networking
In 1983, when the TCP/IP protocol suite replaced NCP as the standard on ARPANET, the fundamental paradigm of internet interconnection underwent a seismic shift. Over the following four decades, TCP/IP served as the universal network language, enabling countless heterogeneous systems to communicate seamlessly and spawning the entire internet economy. Today, in the realm of AI infrastructure, we may be witnessing a similar paradigm shift.
In May 2026, OpenAI, together with AMD, Broadcom, Intel, Microsoft, and NVIDIA, open-sourced the Multipath Reliable Connection (MRC) protocol through the Open Compute Project (OCP). Designed for 100K+ GPU-scale AI training clusters, this network transport protocol compresses failure recovery time from seconds to microseconds, achieving a qualitative leap in network reliability. As NVIDIA Senior VP Gilad Shainer described it: "MRC extends the routing 'brain' to the host."
This is not merely the debut of a new protocol — it represents a new paradigm for AI datacenter networking, redefining how large-scale GPU clusters communicate with each other and how they respond to failures.
MRC Technical Architecture Deep Dive
1. SRv6 Source Routing: Moving Routing Decisions from Switches to NICs
Traditional datacenter networks rely on dynamic routing protocols like BGP, allowing each switch to independently compute packet forwarding paths. This design works well in small-to-medium-scale networks but faces severe challenges at 100K+ GPU hyperscale:
- Long convergence time: When a link fails, dynamic routing protocols require multiple RTTs to complete global route recomputation, potentially taking tens of seconds
- ECMP complexity: In high-radix two-tier topologies, each T0 switch must maintain nearly 256 ECMP group entries, approaching the total number of T0 switches in the cluster
- Diagnostic difficulty: Complex interactions in dynamic routing make fault localization extremely difficult
MRC takes a fundamentally different approach: it disables dynamic routing and adopts SRv6 (IPv6 Segment Routing) source routing. The implementation works as follows (a minimal sketch follows the list):
- Path encoded in packets: The sender embeds the complete path (a sequence of switch identifiers) into the packet's destination address; each switch simply forwards according to the address, requiring no autonomous decision-making
- Static forwarding tables: Switch forwarding tables are configured at initialization and remain largely unchanged, dramatically reducing switch complexity
- Microsecond-level failure response: When a path fails, MRC immediately deactivates that path and switches to an alternate, without waiting for routing protocol convergence
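The forwarding inversion is simple enough to sketch in a few lines. The toy Python model below is illustrative only, not OpenAI's implementation; the class and field names are invented, and the segment list stands in for what SRv6 actually carries in the IPv6 destination address and Segment Routing Header:

```python
from dataclasses import dataclass

@dataclass
class Packet:
    # In real SRv6 the remaining path lives in the IPv6 destination
    # address / Segment Routing Header; here it is just a list of
    # switch identifiers, nearest switch first.
    segments: list
    payload: bytes = b""

class Switch:
    """Forwards purely on the sender-chosen segment list."""
    def __init__(self, switch_id: str, ports: dict):
        self.switch_id = switch_id
        self.ports = ports  # static table: next switch ID -> output port

    def forward(self, pkt: Packet):
        # No routing protocol and no local path computation: consume
        # our own segment, then look up the next one in a static table.
        assert pkt.segments and pkt.segments[0] == self.switch_id
        pkt.segments.pop(0)
        if not pkt.segments:
            return "deliver-to-host"
        return self.ports[pkt.segments[0]]

# The sender pins the full path at send time: T0 -> T1 -> T0'.
pkt = Packet(segments=["t0-12", "t1-3", "t0-47"], payload=b"rdma-chunk")
```

Because the per-switch logic is a pure static lookup, there is nothing to reconverge when topology changes; only the sender's choice of path lists needs updating.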
According to OpenAI's published paper, under this architecture, the longest path traverses only 3 switches instead of 5-7 in traditional networks. This directly reduces end-to-end latency by 30%-50%.
2. Multipath Packet Spraying: Redefining Load Balancing
MRC's core innovation lies in distributing a single RDMA transport across hundreds of paths simultaneously, rather than binding it to a single path as traditional protocols do.
Entropy Value (EV) Mechanism:
- Each MRC packet carries a 32-bit entropy value (EV), distributed across the UDP source port and IPv6 flow label fields
- At QP (Queue Pair) initialization, the sender generates 128-256 EVs forming an EV set
- During transmission, different EVs are rotated, causing packets to be sprayed across different paths (see the sketch after this list)
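A minimal sketch of the EV mechanism, assuming a round-robin rotation policy and a 16/16 split of the 32 bits between the UDP source port and the IPv6 flow label (the text above fixes only the total width and the two carrier fields; the rotation policy and exact bit allocation here are assumptions):

```python
import secrets

def make_ev_set(n: int = 256) -> list[int]:
    """Generate the per-QP entropy-value set at QP initialization."""
    return [secrets.randbits(32) for _ in range(n)]

def split_ev(ev: int) -> tuple[int, int]:
    """Spread one 32-bit EV across the two carrier fields.
    Assumed split: low 16 bits -> UDP source port,
    high 16 bits -> IPv6 flow label (20 bits are available there)."""
    udp_sport = ev & 0xFFFF
    flow_label = (ev >> 16) & 0xFFFF
    return udp_sport, flow_label

evs = make_ev_set()
for seq in range(4):
    ev = evs[seq % len(evs)]   # rotate EVs from packet to packet
    sport, label = split_ev(ev)
    print(f"pkt {seq}: sport={sport:5d} flow_label={label:#06x}")
```

Since switches hash these header fields into their path choice, each distinct EV maps deterministically to one network path, which is what lets the sender address paths individually.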
ECN-Driven Adaptive Load Balancing:
- Switches enable ECN (Explicit Congestion Notification) as a load balancing signal
- The receiver echoes ECN signals back to the sender, indicating congestion levels on specific paths
- The sender temporarily avoids congested paths, keeping network queues stable (illustrated in the sketch below)
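One simple way to realize this avoidance behavior is a multiplicative penalty with slow additive recovery per EV, sketched below; the penalty and recovery constants are illustrative and not taken from the paper:

```python
import random

class EvScheduler:
    """Penalize an EV (path) when the receiver echoes ECN marks on it."""
    def __init__(self, evs, penalty=0.5, recovery=0.01):
        self.weights = {ev: 1.0 for ev in evs}
        self.penalty = penalty
        self.recovery = recovery

    def on_ecn_echo(self, ev):
        # Congestion reported on this path: steer traffic away from it.
        self.weights[ev] *= self.penalty

    def pick(self):
        # Weights drift back toward 1.0, so avoided paths get retried
        # once congestion has had time to drain.
        for ev in self.weights:
            self.weights[ev] = min(1.0, self.weights[ev] + self.recovery)
        evs, weights = zip(*self.weights.items())
        return random.choices(evs, weights=weights, k=1)[0]

sched = EvScheduler(evs=range(256))
sched.on_ecn_echo(7)      # receiver echoed an ECN mark seen on EV 7
next_ev = sched.pick()    # EV 7 is now half as likely to be chosen
```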
Key design: MRC disables PFC (Priority Flow Control), adopting a "best-effort" mode. PFC creates head-of-line blocking when a single flow spans hundreds of paths, severely impacting tail latency. OpenAI's production measurements show that in large-scale synchronous training, this design improves network utilization by 15%-20%.
3. Multi-Plane Architecture: 800Gb/s Becomes 8×100Gb/s
Traditional architectures treat an 800Gb/s NIC as a single link, requiring 3-4 tiers of switches to support a 100K GPU cluster. MRC introduces a revolutionary multi-plane design:
Core concept: Split an 800Gb/s NIC into 8 independent 100Gb/s ports, connecting to 8 parallel 100Gb/s network planes.
Architecture advantages:
| Comparison | Traditional 3-Tier 800Gb/s | MRC 2-Tier 8×100Gb/s Multi-Plane |
|---|---|---|
| Switch tiers for 100K GPUs | 3-4 tiers | 2 tiers |
| Per T0 switch ports | 64 ports @ 800Gb/s | 512 ports @ 100Gb/s |
| Maximum cluster scale | ~64K NICs | 131,072 GPUs |
| Maximum path hop count | 5-7 hops | 3 hops |
| Optics requirement | 100% | 66% |
| Switch count | 100% | 60% |
| Single link failure impact | 12.5% bandwidth loss | 3% bandwidth loss (100Gb/s plane) |
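The table's 131,072 figure falls out of standard two-tier Clos arithmetic, assuming a full-bisection fabric built from the 512-port switches in the second row:

```python
radix = 512                 # 100Gb/s ports per switch (from the table)
hosts_per_t0 = radix // 2   # half the T0 ports face NICs, half face T1
num_t1 = radix // 2         # each T0 keeps one uplink to every T1
num_t0 = radix              # each T1 port connects a distinct T0
endpoints_per_plane = hosts_per_t0 * num_t0
print(endpoints_per_plane)  # 131072
# Each 800Gb/s NIC contributes one 100Gb/s port to each of the 8 planes,
# so the per-plane endpoint count is also the maximum GPU count.
```

The same picture explains the hop count: within a plane, any two NICs are at most NIC -> T0 -> T1 -> T0 -> NIC apart, i.e. three switches.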
This design yields significant scale benefits:
- Cost reduction: At full bisection bandwidth, optics reduced by 1/3, switches reduced by 2/5
- Power reduction: Fewer switch tiers mean fewer powered devices and lower overall network power draw
- Higher redundancy: T0-T1 link failure loses only 3% bandwidth instead of 12.5%
4. Microsecond-Level Failure Recovery Mechanism
In 100K GPU synchronous training, any delay can cause millions of dollars in training losses. MRC achieves true microsecond-level failure recovery:
Fast Selective Retransmission (SACK):
- The receiver precisely indicates received packets via SACK (Selective Acknowledgment)
- The sender retransmits only the lost packets, not the entire window (a toy sketch follows)
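A toy illustration of selective retransmission; the packet numbers and helper names are invented for the example:

```python
def sack_ranges(received: set) -> list:
    """Receiver side: summarize received packet numbers as SACK blocks."""
    blocks, start, prev = [], None, None
    for pn in sorted(received):
        if start is None:
            start = prev = pn
        elif pn == prev + 1:
            prev = pn
        else:
            blocks.append((start, prev))
            start = prev = pn
    if start is not None:
        blocks.append((start, prev))
    return blocks

def missing(blocks: list, max_pn: int) -> list:
    """Sender side: compute only the gaps, not the whole window."""
    lost, next_expected = [], 0
    for s, e in blocks:
        lost.extend(range(next_expected, s))
        next_expected = e + 1
    lost.extend(range(next_expected, max_pn + 1))
    return lost

blocks = sack_ranges({0, 1, 2, 4, 5, 6, 8, 9})   # packets 3 and 7 lost
print(blocks)              # [(0, 2), (4, 6), (8, 9)]
print(missing(blocks, 9))  # [3, 7] -- only these two are resent
```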
Packet Trimming:
- Congested switches trim packet payloads, forwarding only headers to the destination
- The receiving NIC generates a NACK to trigger fast retransmission
- Distinguishes between congestion-induced and link-failure-induced packet loss (sketched below)
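A rough sketch of the trimming idea; the queue threshold, packet representation, and function names are invented for illustration:

```python
TRIM_THRESHOLD = 0.9  # illustrative occupancy cut-off, not from the spec

def enqueue(queue: list, pkt: dict, capacity: int) -> None:
    """Congested switch: cut the payload but still forward the header,
    so the receiver learns exactly which packet was affected."""
    if len(queue) >= capacity * TRIM_THRESHOLD:
        queue.append({"psn": pkt["psn"], "ev": pkt["ev"], "trimmed": True})
    else:
        queue.append(pkt)

def on_receive(pkt: dict, send_nack) -> None:
    if pkt.get("trimmed"):
        # Header arrived but payload did not: congestion loss, not a
        # dead link, so NACK for fast retransmission (on another EV).
        send_nack(pkt["psn"], pkt["ev"])
```

Because a trimmed header still reaches the receiver, "queue overflowed" is distinguishable from "path is dead", which is exactly the distinction the last bullet relies on.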
Path Health State:
- Each EV maintains a small number of path health state bits
- Upon packet loss, the EV is immediately deactivated and a backup path is used
- Background probe packets confirm whether the path has recovered (see the state-machine sketch below)
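The per-EV health logic can be pictured as a tiny state machine; the back-off constant below is invented for illustration:

```python
import time

class PathHealth:
    """Per-EV health bits: deactivate on loss, reactivate after a probe."""
    PROBE_AFTER = 100e-6  # illustrative 100 microsecond back-off

    def __init__(self):
        self.active = True
        self.deactivated_at = 0.0

    def on_loss(self):
        # NACK or timeout observed on this EV: fail over immediately.
        self.active = False
        self.deactivated_at = time.monotonic()

    def due_for_probe(self) -> bool:
        return (not self.active and
                time.monotonic() - self.deactivated_at >= self.PROBE_AFTER)

    def on_probe_ack(self):
        self.active = True  # background probe succeeded; path restored

def pick_ev(health: dict, preferred: int) -> int:
    """Use the preferred EV unless its path is marked unhealthy
    (assumes at least one EV in the set is still active)."""
    if health[preferred].active:
        return preferred
    return next(ev for ev, h in health.items() if h.active)
```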
Production Validation: OpenAI's production measurements show that during large-scale synchronous pretraining:
- Multiple T0-T1 link flaps per minute have no measurable impact on training jobs
- Four T1 switches were rebooted during training without notifying the training team, and jobs continued uninterrupted
- GPU-NIC link failure causes only 1/8 bandwidth loss; tasks continue running
MRC vs Traditional Approaches: RDMA/RoCE/InfiniBand Technical Comparison
| Dimension | MRC | InfiniBand | RoCE v2 |
|---|---|---|---|
| Base Architecture | Extended RoCEv2, multipath spraying | Dedicated IB network | Standard Ethernet + RDMA |
| Single Cluster Scale | 131,000+ GPUs (theoretical) | ~10K nodes (typical) | ~1K-10K nodes |
| Network Topology | 2-tier multi-plane | Multi-tier Fat-tree | Depends on network design |
| Failure Recovery Time | Microsecond-level | Millisecond-level | Millisecond-level |
| Congestion Control | ECN + adaptive spraying | IB congestion control (CCA) | PFC + ECN |
| Path Utilization | Hundreds of paths sprayed simultaneously | Single path (typically) | ECMP (limited multipath) |
| Latency | Very low (3 hops) | Lowest | Low |
| Hardware Cost | Ethernet pricing | Dedicated IB equipment, most expensive | Ethernet pricing |
| Ecosystem Openness | OCP open-source specification | NVIDIA-dominated | IBTA standard |
| Target Use Case | 100K+ GPU hyperscale | 10K-scale HPC/AI | 1K-scale mixed workloads |
Core Differences Explained:
MRC vs InfiniBand: MRC is not intended to completely replace InfiniBand. InfiniBand still holds advantages in single-path latency and determinism, but MRC offers generational advantages in scalability and operational simplicity at hyperscale. More importantly, MRC is based on standard Ethernet, reducing procurement and maintenance costs.
MRC vs RoCE: Traditional RoCE employs lossless Ethernet design, requiring complex PFC configuration and suffering from performance limitations under high-radix ECMP. MRC's packet spraying mechanism fundamentally solves the flow collision problem, and ECN-driven adaptive load balancing is more intelligent than traditional ECMP.
MRC vs UEC (Ultra Ethernet Consortium): UEC is another multi-vendor Ethernet transport standard initiative. MRC's advantage lies in its production validation at OpenAI and Microsoft, while UEC is still being refined. NVIDIA has stated that the two will coexist, with different hyperscalers choosing the solution that fits their needs.
MRC vs veRoCE: Choosing Between Two Approaches
Highly Similar Technical Paths
veRoCE (ByteDance's Enhanced RoCE) and MRC essentially solve the same core problems that RoCEv2 faces in large-scale GPU clusters: PFC storms, ECMP conflicts, and single-path bottlenecks. The two approaches show remarkable similarities in technical implementation:
| Feature | veRoCE | MRC |
|---|---|---|
| Multi-path Transmission | Modified source entropy + switch spray | Packet Spray |
| Out-of-order Processing | DDP (Direct Data Placement) | SACK + out-of-order reception |
| Selective Retransmission | SACK + lazy SACK | SACK + NACK |
| Congestion Control | Path-level + connection-level dual mode | NSCC (based on UEC 1.0 spec) |
| PFC Independence | No lossless network dependency | PFC-free |
| Slow Path Detection | Sequence number based fast exclusion | Microsecond-level failover |
| RoCEv2 Compatibility | Auto-fallback to RoCEv2 mode | Preserves RDMA semantics |
Key Differences: A Divergence in Architectural Philosophy
Despite highly similar technical implementations, the two approaches diverge significantly in architectural philosophy:
1. Architectural Philosophy: Revolutionaries vs. Reformists
- MRC is the "Revolutionary" — redesigns the forwarding plane using SRv6 source routing, pushing routing decisions to the NIC, disabling dynamic routing
- veRoCE is the "Reformist" — enhances RoCEv2 while preserving traditional routing architecture
2. Standardization Path
- MRC follows the OCP open-source route, driven by the OpenAI + AMD + NVIDIA + Intel + Broadcom + Microsoft consortium
- veRoCE currently uses a ByteDance proprietary + vendor collaboration model, released via the Volcano Engine developer platform
3. Multi-plane Support
- MRC natively designs multi-plane 2-tier architecture, supporting 8 independent 100Gb/s network planes
- veRoCE focuses on optimization within existing 3-tier fat-tree
4. Deployment Scale
- MRC targets 100K+ GPUs, validated by Oracle Abilene/Microsoft Fairwater deployments
- veRoCE currently at 128-GPU validation stage
5. Hardware Implementation
- MRC already has AMD Pensando Pollara 400/Vulcano 800 NIC implementations
- veRoCE is being adapted for NVIDIA/AMD/Broadcom NICs
6. Ecosystem Openness
- MRC fully open-source specification via OCP
- veRoCE released via Volcano Engine developer platform, openness level remains to be seen
Performance Data Comparison
| Metric | veRoCE | MRC |
|---|---|---|
| Validation Scale | 128 GPU cluster | 100K+ GPU cluster |
| LLM Training Speed Improvement | 11.2% | - |
| AlltoAll Throughput Improvement | 48.4% | - |
| Effective Throughput at 2% Packet Loss | 95.7% | - |
| Failure Recovery Time | - | Microsecond-level (compressed from seconds) |
| Switch Tiers | 3-4 tiers | 2 tiers |
Conclusion: Two Routes, Different Use Cases
For Chinese enterprises, veRoCE's compatibility-focused approach may be more pragmatic — it doesn't require network architecture reconstruction and can be progressively deployed on existing RoCEv2 infrastructure. veRoCE's fallback mechanism also provides stronger compatibility guarantees.
For hyperscale training clusters (50K+ GPUs), MRC's SRv6 architecture offers long-term advantages — it fundamentally solves dynamic routing convergence problems, and the 2-tier architecture provides generational advantages in cost and latency.
Long-term, these two protocols may converge — MRC's SRv6 forwarding plane combined with veRoCE's congestion control algorithms could be the optimal combination for future AI networking. This also aligns with the standardization direction promoted by UEC (Ultra Ethernet Consortium).
Strategic Significance of the Six-Company Joint Release
Why OpenAI Leads
OpenAI is the enterprise with the greatest need for network reliability today. Its training jobs consume hundreds of millions of dollars in GPU compute, and a single network failure can crash an entire training run, costing millions.
OpenAI's Sachin Katti (Head of Industrial Compute) stated: "At meaningful scale, that reliability and efficiency is not a nice-to-have; it is part of what makes synchronous frontier model training possible."
Strategic considerations for OpenAI leading MRC development:
- Demand-driven: Internal urgent need for 100K+ GPU network reliability
- Technical expertise: Team possesses years of large-scale cluster operations experience
- Standard-setting influence: Open-sourcing avoids single-vendor lock-in
- Ecosystem building: Attracting hardware vendors to co-build, expanding influence
Why AMD/NVIDIA/Intel All Participate
The logic behind three major chip vendors participating simultaneously:
NVIDIA: Spectrum-X is its core AI networking platform. MRC strengthens Spectrum-X's competitiveness, positioning it as the "optimal MRC execution platform." NVIDIA emphasizes its differentiation lies in deep hardware telemetry and intelligent fabric control.
AMD: AMD Pollara and Vulcano NICs support MRC, expanding its AI networking market share. AMD's participation signals serious commitment to the AI infrastructure market.
Intel: Participating through IPU-side driver development, Intel is repositioning its role in AI infrastructure.
Broadcom: Thor Ultra NIC and Tomahawk 5/6 switch silicon natively support MRC, serving as the core network silicon contributor.
Microsoft: As a cloud provider and OpenAI compute supplier, Microsoft deploys MRC in its Fairwater supercomputers, providing production environment validation.
Key Insight: This represents a classic "co-opetition" pattern — NVIDIA and AMD compete fiercely in the GPU market but choose to collaborate on network protocols. This reflects the new game theory in AI infrastructure scaling.
Deployed Cases: Oracle Abilene and Microsoft Fairwater
Oracle Cloud Infrastructure — Abilene Data Center
Oracle Abilene datacenter is a key component of OpenAI's compute infrastructure. This facility uses NVIDIA GB200 systems, running frontier model training tasks powering ChatGPT and Codex.
Deployment Results:
- Successfully runs large-scale synchronous pretraining at 75K GPU level
- Network idle wait time reduced by 90%+
- GPU effective compute utilization significantly improved
Microsoft Fairwater Supercomputer
Fairwater is Microsoft's supercomputer built for AI training, located in Atlanta and Wisconsin.
Deployment Results:
- Two-tier multi-plane architecture supports 100K+ GPU clusters
- Switch maintenance can be performed in service, without affecting training
- True "zero-interruption operations" achieved
Real-World Data: MRC Performance from the Paper
OpenAI's published paper "Resilient AI Supercomputer Networking using MRC and SRv6" provides detailed production measurements:
- Startup packet loss rate: During 75K GPU job startup, the packet loss rate drops rapidly within 2 minutes, eventually stabilizing at fewer than 1 loss per second per NIC (~1 in 25 million packets at 800Gb/s; see the arithmetic sketch after this list)
- Link flap tolerance: T0-T1 links flapping multiple times per minute have no measurable impact on synchronous pretraining
- Switch reboot impact: Four T1 switches were rebooted during training without human intervention; the job continued running
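The per-NIC loss figure is easy to sanity-check with back-of-the-envelope arithmetic, assuming an average packet size around 4 KB (the paper's exact framing is not given here):

```python
line_rate_bps = 800e9        # one 800Gb/s NIC at line rate
pkt_bytes = 4096             # assumed average packet size
pkts_per_sec = line_rate_bps / 8 / pkt_bytes
print(f"{pkts_per_sec:,.0f} packets/s")  # ~24,414,063
# At 1 lost packet per second, the loss rate is roughly 1 in 24.4
# million, consistent with the quoted "~1 in 25 million" figure.
```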
Profound Impacts on the Industry Chain
Network Equipment Vendors: Arista/Cisco/Juniper
Challenges:
- Dynamic routing and complex configurations become redundant in MRC architecture
- Must support SRv6 micro-segment routing and MRC forwarding mode
- Higher hardware performance requirements: 512-port @ 100Gb/s switches become standard
Opportunities:
- Multi-plane architecture increases total switch demand
- SRv6 support becomes a differentiator
- Deep collaboration with chip vendors becomes essential
Arista has already partnered with OpenAI to implement SRv6 in EOS. Other vendors need to accelerate their response.
Chip Vendors: NVIDIA/Mellanox vs AMD/Pensando/Broadcom
NVIDIA/Mellanox:
- Spectrum-X becomes the flagship MRC execution platform, strengthening its brand advantage
- ConnectX-8 SuperNIC natively supports MRC
- Risk: Open protocols may erode InfiniBand premium pricing
AMD:
- Pollara and Vulcano NIC support MRC, expanding AI networking market share
- ROCm ecosystem and MRC co-optimization opportunities
Broadcom:
- Thor Ultra NIC supports 2/4/8-plane architectures, distributing across 128 paths
- Tomahawk 5 (51.2Tbps) and Tomahawk 6 (102.4Tbps) become core switching silicon
Intel/Pensando:
- IPU/DPU MRC support provides differentiated value
- Pensando DSC (Distributed Services Card) combines with MRC in SmartNIC market
Cloud Providers: Azure/AWS/GCP/OCI
Azure:
- Fairwater supercomputers are MRC production validation models
- Multi-tenant GPU cloud services can leverage MRC to improve utilization by 30%-50%
- Offer more reliable training services to Azure AI customers
OCI:
- Abilene datacenter operational experience becomes core competitive advantage
- Attract more AI customers to GPU cloud services
AWS/GCP:
- TPU/Trainium platforms face similar network bottlenecks
- Industry prediction: adaptation actions expected within 12 months
Hyperscale Datacenter Construction Costs Reduced 20-30%
MRC's TCO impact is comprehensive:
- Hardware cost: Two-tier switch architecture reduces switches by 2/5, optics by 1/3
- Power cost: A 100K GPU cluster saves roughly 230M RMB annually in electricity (assuming GPU utilization improves to 95%+)
- Operations cost: Microsecond-level failure recovery reduces manual intervention, ops teams can manage larger clusters
- Opportunity cost: Training interruption losses dramatically reduced
Enterprise Decision Recommendations
AI Labs and Frontier Research Institutions
- Evaluate timing for adopting MRC as next-generation training cluster network standard
- Participate in OCP community, drive protocol evolution
- Follow Spectrum-X, Mellanox, Broadcom Thor and other MRC-supported hardware platforms
Cloud Service Providers
- Incorporate MRC into AI cloud service technical selection
- Evaluate utilization improvement potential for multi-tenant GPU pools
- Partner with chip vendors to optimize MRC performance for specific workloads
Enterprise AI Teams
- Monitor MRC applicability in smaller clusters (<1000 GPUs)
- Evaluate migration cost-benefit from RoCE v2
- Maintain communication with technology suppliers, track product roadmaps
Network Equipment and Chip Vendors
- Accelerate SRv6 feature support
- Collaborate with OCP community on interoperability testing
- Differentiation competition focuses on hardware telemetry and intelligent control planes
Conclusion: A New Paradigm for AI Networking Has Arrived
MRC's release marks AI datacenter networking entering a new phase. It demonstrates that at hyperscale AI training, the network is no longer a "dumb pipe" but requires specialized "intelligent infrastructure" design.
The network standard for future million-GPU clusters may be taking shape today. Enterprises need to seriously evaluate MRC and its impact on their AI strategies now. Early movers will gain advantages in future AI infrastructure competition.
Why This Protocol Matters to the Industry
Network Evolves from "Dumb Pipe" to "Intelligent Infrastructure"
In traditional datacenters, the network was viewed as a data "pipeline." But MRC proves that at 100K+ GPU synchronous training scales, the network IS part of the compute pipeline.
Microsecond Failure Recovery Redefines Reliability Standards
MRC compresses failure recovery from seconds to microseconds, meaning training jobs achieve true "always-on" status.
Open Source Breaks Vendor Lock-in
Through OCP, MRC avoids becoming any single vendor's differentiation tool, becoming shared industry infrastructure.
Decision Recommendations (by Role)
AI Labs and Frontier Research Institutions
- Act now: Evaluate MRC as next-gen training cluster network standard
- Technical preparation: Contact hardware vendors (Spectrum-X, Broadcom Thor)
- Community engagement: Join OCP community
- Talent development: Build teams with SRv6 and MRC capabilities
Cloud Service Providers
- Strategic assessment: Evaluate MRC's impact on competitiveness
- Performance validation: Verify MRC improvements in test environments
- Cost modeling: Calculate TCO changes from RoCE v2 migration
Enterprise AI Teams
- Watchful waiting: Monitor MRC applicability in smaller clusters
- Vendor dialogue: Discuss MRC support timelines
Network Equipment and Chip Vendors
- Product acceleration: Accelerate SRv6 feature development
- Interoperability testing: Collaborate with OCP community
6-12 Month Impact Predictions
Market Impact
- AWS/GCP follow-up: Expected within 12 months
- MRC ecosystem expansion: Over 30 vendors will announce MRC support
- InfiniBand pressure: NVIDIA premium pricing will face pressure
Technology Evolution
- MRC 2.0: Protocol optimizations expected late 2026
- UEC convergence: MRC and UEC may converge on certain features
- Million GPU support: Research for 1M GPU clusters will begin
Industry Chain Changes
- Switch architecture: 512-port @ 100Gb/s switches become mainstream
- Operations transformation: From "firefighting" to "planning" mode
- Cost reduction: Hyperscale datacenter costs drop 20-30%