AMD MLPerf 6.0: MI350 GPUs Achieve 3.5x Leap with MXFP4, Debut Multi-Node Training
Summary
Key Takeaways
AMD's MLPerf Training 6.0 submission highlights three milestones:
- First production-ready MXFP4 training recipe on LLM benchmarks (Llama 2-70B, Llama 3.1-8B), powered by the CDNA 4-based MI355X GPU (3 nm, 185B transistors, 288 GB HBM3E), which delivers up to 10 PFLOPS of MXFP4 compute and supports models of up to 520B parameters on a single GPU.
- AMD Primus software debuts. Combined with ROCm optimizations, it enables a 3.5x generational leap on Llama 2-70B over the MI300X, plus an additional 16-19% uplift within the MI350 family over 7 months.
- First multi-node submission: FLUX.1 on 64 nodes (512 GPUs) via Oracle Cloud Infrastructure, matching NVIDIA's largest submission. Results from 10 partners (Dell, HPE, Cisco, Supermicro, and others) land within 6% of AMD's official numbers.
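As a rough sanity check on the 520B-parameters-per-GPU figure above, the sketch below estimates the weight footprint of a model stored in MXFP4. It assumes the OCP Microscaling layout (4-bit elements plus one shared 8-bit scale per block of 32 elements), counts weights only, and ignores activations, KV cache, and optimizer state. The function name is illustrative and not part of any AMD tool.

```python
def mxfp4_weight_bytes(num_params, block_size=32, scale_bits=8):
    """Approximate bytes needed to hold num_params weights in MXFP4.

    Assumption: 4-bit elements with one shared scale per block,
    as in the OCP Microscaling (MX) format.
    """
    # 4 bits per element, plus the shared scale amortized over the block.
    bits_per_param = 4 + scale_bits / block_size
    return num_params * bits_per_param / 8


# 288 GB HBM3E in decimal gigabytes (capacity figure from the article).
HBM_BYTES = 288e9

weights = mxfp4_weight_bytes(520_000_000_000)
print(f"{weights / 1e9:.2f} GB of {HBM_BYTES / 1e9:.0f} GB")
print("fits" if weights <= HBM_BYTES else "does not fit")
```

Under these assumptions the weights alone occupy most of the 288 GB, which suggests the single-GPU figure describes how large a model's weights can fit rather than a practical training configuration.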
Why It Matters
AMD's MLPerf submission reads as a strategic encirclement of NVIDIA: it uses the MXFP4 training recipe and Primus software to build CUDA-style software lock-in. As a distributed training framework, Primus will gradually reduce users' flexibility to work with open-source frameworks such as PyTorch FSDP. AMD also downplays the precision risks of MXFP4: low-precision training may cause convergence problems for models at the 520B scale. Multi-node training relies on InfiniBand, whereas NVIDIA's NVLink plus NVSwitch provides lower latency. AMD's UBB8 nodes face tail-latency and PFC/ECN congestion limits. Finally, the 10 partner results within 6% may not reflect identical software versions, which risks fragmentation and lock-in.
PRO Decision
【Vendors】NVIDIA should counter by showcasing the advantages of NVFP4 plus NVLink 5.0 in the next MLPerf round, emphasizing how domain interconnect bandwidth (900 GB/s vs. AMD's 400 Gb/s InfiniBand, which is 50 GB/s per port) reduces tail latency. It should also use native FP4 support in CUDA 12.x and open-source FSDP2 to undermine Primus lock-in.
【Enterprises】CIOs must demand independent validation of MXFP4 training convergence accuracy and test InfiniBand vs NVLink throughput in multi-node setups. Audit Primus version compatibility and cloud portability; keep PyTorch FSDP as fallback.
【Investors】Look beyond PR: AMD's software ecosystem maturity and network interconnect remain weak. Watch for RoCEv2 breakthroughs and Primus community adoption. Short-term boost possible, but long-term risk from NVIDIA B300/B400 retaliation.
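The interconnect comparison in the vendor note mixes gigabytes and gigabits per second, so a quick unit conversion helps put the two numbers on the same scale. The sketch below uses only the two figures quoted in the article; the helper name is illustrative. It compares a GPU-to-GPU link with a single network port, so the ratio is a rough indication and not a like-for-like benchmark.

```python
def gbit_to_gbyte_per_s(gbit_per_s):
    """Convert gigabits per second to gigabytes per second (8 bits per byte)."""
    return gbit_per_s / 8


# Figures as quoted in the article.
NVLINK_GBYTE_PER_S = 900.0      # NVLink, in GB/s
INFINIBAND_GBIT_PER_S = 400.0   # InfiniBand port, in Gb/s

infiniband_gbyte_per_s = gbit_to_gbyte_per_s(INFINIBAND_GBIT_PER_S)
ratio = NVLINK_GBYTE_PER_S / infiniband_gbyte_per_s

print(f"InfiniBand: {infiniband_gbyte_per_s:.0f} GB/s per port")
print(f"NVLink is {ratio:.0f}x the per-port figure")
```

The gap between a scale-up link and a scale-out port is why the enterprise recommendation to measure multi-node throughput directly matters more than either headline number.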