AMD
2026-05-06
Architecture Shift | Impact: Major | Strength: High | Confidence: 85%

AMD and OpenAI Introduce MRC, a Next-Gen Transport Protocol for AI Training

Summary

AMD, in collaboration with OpenAI, Microsoft, and other industry leaders, has released the specification for the Multipath Reliable Connection (MRC) protocol. MRC addresses performance bottlenecks of RoCEv2 in hyperscale AI training clusters through intelligent packet spraying, selective retransmission, and network-signaled congestion control, aiming to improve bandwidth utilization and job resilience.

Key Takeaways

AMD's official blog announces the development of the MRC transport protocol in collaboration with OpenAI and others, targeting the challenges of training trillion-parameter AI models. MRC introduces key enhancements to address RoCEv2's limitations: single-path communication, slow link-failure recovery, and inefficient go-back-N retransmission.
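The cost of go-back-N retransmission can be made concrete with a short sketch (illustrative only, not RoCEv2 or MRC code): when one early packet in a large in-flight window is lost, go-back-N resends everything from the loss onward, while a selective scheme resends only what the receiver reported missing.

```python
# Illustrative sketch of go-back-N vs. selective retransmission.
# Hypothetical helper names; not drawn from the RoCEv2 or MRC specs.

def go_back_n_retransmits(window: list[int], lost: set[int]) -> list[int]:
    """Go-back-N: retransmit every packet from the first loss onward."""
    first_loss = min(lost)
    return [seq for seq in window if seq >= first_loss]

def selective_retransmits(window: list[int], lost: set[int]) -> list[int]:
    """Selective (SACK-style): retransmit only the reported losses."""
    return [seq for seq in window if seq in lost]

window = list(range(1, 101))   # 100 in-flight packets
lost = {3}                     # a single early loss

print(len(go_back_n_retransmits(window, lost)))   # 98 packets resent
print(len(selective_retransmits(window, lost)))   # 1 packet resent
```

At hyperscale, with thousands of flows and deep in-flight windows, that gap is the "network overhead" the announcement refers to.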

Key innovations include:
1. Intelligent packet-spray load balancing that leverages ECMP to transmit across multiple paths simultaneously.
2. Selective packet retransmission based on SACK/NACK, which resends only lost packets and reduces network overhead.
3. Network-Signaled Congestion Control (NSCC), based on the UEC specification, which replaces PFC to prevent congestion from spreading at large scale.
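The packet-spray idea can be sketched as follows (an assumption-laden illustration, not the MRC wire format): classic ECMP hashes a flow's identifier once, pinning all of its packets to one path, while per-packet spraying folds the sequence number into the hash so a single flow fans out across every available path.

```python
# Illustrative sketch of single-path ECMP vs. per-packet spraying.
# Path count, flow naming, and hash choice are hypothetical.
import hashlib
from collections import Counter

NUM_PATHS = 4

def ecmp_path(flow_id: str) -> int:
    """Classic ECMP: hash the flow identifier once; every packet follows it."""
    digest = hashlib.sha256(flow_id.encode()).digest()
    return digest[0] % NUM_PATHS

def sprayed_path(flow_id: str, seq: int) -> int:
    """Packet spraying: include the sequence number so packets fan out."""
    digest = hashlib.sha256(f"{flow_id}:{seq}".encode()).digest()
    return digest[0] % NUM_PATHS

flow = "gpu0->gpu7:rdma"
single = Counter(ecmp_path(flow) for _ in range(1000))
sprayed = Counter(sprayed_path(flow, seq) for seq in range(1000))
print(len(single))    # 1 — the whole flow rides one path
print(len(sprayed))   # 4 — traffic spread across all paths
```

Spraying raises link utilization but delivers packets out of order, which is exactly why it must be paired with the SACK/NACK-based selective retransmission in point 2.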

AMD contributed the NSCC congestion control algorithm and extended IB/RDMA transport semantics for backward compatibility. Its Pensando Pollara 400 AI NIC has been validated with MI350/MI355 GPU clusters, and the next-generation 'Vulcano' 800 NIC is being qualified for the MI400 series.
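The actual NSCC algorithm is defined by the UEC specification; the general shape of network-signaled control, as opposed to PFC pausing the link hop by hop, can be hedged into a minimal sender-side sketch (hypothetical function, constants, and gain; not AMD's algorithm):

```python
# Hedged sketch of sender-side reaction to explicit congestion signals
# from the network, in place of PFC link pausing. All names are illustrative.

def adjust_window(cwnd: float, congestion_signaled: bool,
                  min_cwnd: float = 1.0, gain: float = 1.0) -> float:
    """Additive increase when the fabric is clear; multiplicative
    decrease when a switch signals congestion back to the sender."""
    if congestion_signaled:
        return max(min_cwnd, cwnd / 2)   # sender backs off locally
    return cwnd + gain                    # probe for more bandwidth

cwnd = 8.0
for signaled in [False, False, True, False]:
    cwnd = adjust_window(cwnd, signaled)
print(cwnd)  # 8 -> 9 -> 10 -> 5 -> 6.0
```

The key contrast with PFC: the sender slows its own injection rate, so a congested switch never has to pause an entire upstream link and spread head-of-line blocking across unrelated flows.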

Why It Matters

This signals an architectural shift at the networking layer of AI infrastructure. If MRC becomes an industry standard, it will reshape the design paradigm for hyperscale AI clusters, moving from reliance on lossless network hardware (PFC) towards more intelligent, software-defined transport protocols, directly impacting the cost and efficiency of future AI compute clusters.

PRO Decision

**Control Layer Shift**
**Vendors:** Assess opportunities to integrate or support the MRC protocol in smart NICs or switches, aiming to control the emerging AI-optimized transport software stack. Inaction risks losing relevance in future AI data center networking standards.
**Enterprises:** When planning future AI infrastructure, treat the network transport protocol (e.g., MRC vs. RoCEv2) as a key evaluation criterion, since it will affect cluster scale limits and operational complexity. Conduct technical validation within 12-18 months.
**Investors:** Monitor the shift in value from traditional network hardware (e.g., PFC-enabled switches) toward AI-optimized network software stacks and smart NICs/DPUs. Watch for MRC adoption signals from the Ultra Ethernet Consortium (UEC) and major cloud providers.
Source: blog
