AMD and OpenAI Contribute MRC Protocol to OCP for Scalable AI Networking
Summary
Key Takeaways
The MRC protocol addresses challenges in large-scale AI training clusters, such as congestion, latency variation, and slow failure recovery in traditional single-path networks. It distributes packets across multiple paths to smooth traffic and enable near-real-time rerouting around failures, which helps keep GPUs synchronized.
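To make the contrast between single-path and multipath delivery concrete, here is a minimal Python sketch. It is a toy illustration only, not the MRC specification or AMD's implementation: the function names (`single_path`, `spray`), the round-robin policy, and the path labels are all hypothetical, and real multipath transports also handle reordering, congestion signals, and retransmission, which are omitted here.

```python
from collections import Counter
from typing import Iterable

# Hypothetical path labels; a real fabric would expose many more.
PATHS = ["path-0", "path-1", "path-2", "path-3"]


def single_path(flow_id: int, num_packets: int, paths: list[str]) -> Counter:
    """Traditional behavior (assumed): every packet of a flow uses one
    path, picked here by a simple modulo on the flow ID."""
    chosen = paths[flow_id % len(paths)]
    return Counter({chosen: num_packets})


def spray(num_packets: int, paths: list[str],
          failed: Iterable[str] = ()) -> Counter:
    """Multipath behavior (assumed): packets are spread round-robin
    over the paths that are currently healthy."""
    healthy = [p for p in paths if p not in failed]
    if not healthy:
        raise RuntimeError("no healthy path available")
    load = Counter()
    for seq in range(num_packets):
        load[healthy[seq % len(healthy)]] += 1
    return load


single = single_path(flow_id=7, num_packets=1000, paths=PATHS)
sprayed = spray(1000, PATHS)
rerouted = spray(1000, PATHS, failed=["path-2"])

print("single path:", dict(single))    # all 1000 packets on one path
print("sprayed:    ", dict(sprayed))   # 250 packets on each of 4 paths
print("rerouted:   ", dict(rerouted))  # path-2 skipped, traffic still flows
```

In the single-path case one link carries the whole flow, so a hot spot or failure on that link stalls it. In the sprayed case the load is even, and when `path-2` is marked failed the remaining three paths absorb the traffic immediately, which is the intuition behind the smoother traffic and faster recovery described above.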
AMD played a formative role by co-authoring the MRC specification and contributing advanced congestion control technology. Critically, AMD has already deployed a pre-standard implementation of MRC on its programmable Pensando Pollara 400 AI NIC in test clusters with a major cloud provider, paving the way for its next-gen 800G Vulcano AI NIC.
Why It Matters
The contribution signals a shift in AI infrastructure networking from chasing peak bandwidth to prioritizing "productive compute" and operational resilience. Releasing MRC as an open specification through OCP could accelerate Ethernet's role as a programmable control plane for AI clusters, challenging proprietary fabrics and merging networking with compute architecture.
PRO Decision
Vendors: Networking and DPU vendors must assess MRC's potential impact on existing RoCE/InfiniBand ecosystems and consider integrating support via hardware programmability or software stacks to avoid marginalization in the next-gen AI networking market.
Enterprises: Organizations planning large-scale AI clusters should add network resilience and protocol openness to their infrastructure evaluation criteria, and assess GPU vendors' network interoperability and failure recovery capabilities.
Investors: Monitor the migration of value from proprietary networking hardware toward programmable data planes (DPU/SmartNIC) and software stacks that support open standards, and track MRC adoption by major cloud providers.