AMD
2026-04-30
Architecture Shift | Impact: Important | Strength: High | Conf: 80%

AMD Proposes New AI Infrastructure Networking Paradigm: From Lossless Fabrics to Intelligent Endpoints

Summary

AMD published a blog outlining seven key questions for building large-scale AI infrastructure, arguing that traditional lossless Ethernet or InfiniBand architectures face cost and complexity bottlenecks. It advocates shifting network intelligence and reliability functions from expensive, specialized switches to intelligent NICs, enabling reliable transport over standard (potentially lossy) Ethernet to reduce TCO and simplify operations.

Key Takeaways

AMD identifies the network as the core bottleneck for massive AI clusters (tens of thousands of GPUs). Traditional solutions (InfiniBand or complex RoCE configurations) rely on expensive and complex lossless network fabrics to meet AI workloads' stringent demands for low jitter, high bandwidth, and uninterrupted data transfer.

AMD proposes a new paradigm: an 'endpoint-intelligent networking architecture,' in which intelligent NICs at the endpoints implement reliable transport protocols over standard (potentially lossy) Ethernet fabrics. This eliminates the complexity of managing lossless fabrics and, per AMD's internal analysis, can reduce network costs by up to 58%. The architecture emphasizes millisecond-level fault detection and isolation, comprehensive network observability, and support for open ecosystems and software programmability.
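The core idea can be illustrated with a minimal simulation: if the endpoint tracks un-ACKed sequence numbers and selectively retransmits, the fabric is free to drop packets without any lossless configuration (PFC, credit-based flow control). The sketch below is a hypothetical illustration of this endpoint-reliability pattern, not AMD's actual protocol; the function name, parameters, and loss model are all assumptions for demonstration.

```python
import random

def send_over_lossy_fabric(packets, loss_rate=0.05, seed=0, max_rounds=50):
    """Endpoint-driven reliability sketch: the sender (conceptually, an
    intelligent NIC) tracks un-ACKed sequence numbers and selectively
    retransmits them, so the fabric itself may drop packets freely."""
    rng = random.Random(seed)
    delivered = {}                           # receiver buffer: seq -> payload
    outstanding = set(range(len(packets)))   # seqs the sender has not seen ACKed
    rounds = 0
    while outstanding and rounds < max_rounds:
        rounds += 1
        for seq in sorted(outstanding):
            if rng.random() >= loss_rate:    # packet survives the lossy hop
                delivered[seq] = packets[seq]
        # Receiver ACKs everything it holds; sender shrinks its retransmit set.
        outstanding -= set(delivered)
    ordered = [delivered[i] for i in sorted(delivered)]
    return ordered, rounds

# Even with 20% loss, the endpoint logic recovers the full stream in a few rounds.
data, rounds = send_over_lossy_fabric([f"chunk{i}" for i in range(100)],
                                      loss_rate=0.2)
```

The point of the sketch is the division of labor: all retransmission state lives at the endpoint, so the switches in between need only forward best-effort Ethernet frames.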

Why It Matters

This represents a potential architectural shift at the network layer for AI infrastructure. The control point is moving from expensive, specialized switching hardware towards intelligent NICs and software, potentially reshaping the economics of large-scale AI clusters and the vendor landscape. Widespread industry adoption would lower enterprise deployment barriers and challenge incumbent networking leaders.

PRO Decision

**Vendors**: Assess opportunities to establish control points in intelligent NICs and endpoint reliability software layers. Networking equipment vendors must address the impact of the 'dumb switch + smart endpoint' architecture on demand for high-end switches, or position themselves in the new layer through software and ecosystem partnerships.
**Enterprises**: Re-evaluate network architecture choices for large-scale AI clusters. When planning 10k+ GPU clusters, consider 'intelligent endpoints + standard Ethernet' as a cost-comparison option and conduct proof-of-concepts. Monitor the shift in network operations from managing complex lossless fabrics to automation based on intelligent endpoints.
**Investors**: Monitor the migration of value from specialized network hardware (e.g., high-end lossless switches) to intelligent NICs and AI networking software stacks. Watch for adoption of similar architectures by major cloud providers and large AI labs as a key signal for validating this paradigm.
Source: blog
