AMD
2026-06-17
Product Launch. Impact: Important. Confidence: 85%

AMD MLPerf 6.0: MI350 GPUs Achieve 3.5x Leap with MXFP4, Debut Multi-Node Training

Summary

AMD submitted its most comprehensive MLPerf Training 6.0 results to date, including its first multi-node training submission (FLUX.1 on 512 GPUs) and an MXFP4 training recipe. The MI355X delivers a 3.5x generational gain over the MI300X on Llama 2-70B and lands within 5% of NVIDIA's B200. Ten ecosystem partners validated reproducibility.

Key Takeaways

AMD's MLPerf Training 6.0 submission highlights three milestones:

  • The first production-ready MXFP4 training recipe on LLM benchmarks (Llama 2-70B, Llama 3.1-8B), running on the CDNA 4-based MI355X GPU (3nm, 185B transistors, 288GB HBM3E), which delivers up to 10 PFLOPS of MXFP4 compute and supports models of up to 520B parameters on a single GPU.
  • The debut of AMD Primus software, which, combined with ROCm optimizations, enables a 3.5x generational gain on Llama 2-70B over the MI300X, plus a further 16-19% uplift within the MI350 family over 7 months.
  • The first multi-node submission: FLUX.1 on 64 nodes (512 GPUs) via Oracle Cloud Infrastructure, matching NVIDIA's largest submission. Results from 10 partners (Dell, HPE, Cisco, Supermicro, etc.) came within 6% of AMD's official results.
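
The figures in the list above can be sanity-checked with simple arithmetic. The sketch below is only a back-of-the-envelope estimate, not AMD's sizing method. It assumes 8 GPUs per node (implied by 64 nodes and 512 GPUs), decimal gigabytes, and weights-only storage in MXFP4, where each 32-element block shares one 8-bit scale, giving 4.25 bits per parameter. Activations, optimizer state, and KV cache are ignored, so the real headroom is smaller.

```python
# Back-of-the-envelope check of the quoted figures.
# Assumptions (not stated in the source): decimal GB, weights only,
# MXFP4 = 4-bit elements + one shared 8-bit scale per 32-element block.

GPUS_PER_NODE = 8          # implied by 64 nodes -> 512 GPUs
NODES = 64
HBM_GB = 288               # MI355X HBM3E capacity quoted in the text
PARAMS = 520e9             # largest single-GPU model size quoted in the text
MXFP4_BITS_PER_PARAM = 4 + 8 / 32   # 4.25 bits including the shared scale


def total_gpus(nodes: int, gpus_per_node: int) -> int:
    """Total GPU count for a cluster of identical nodes."""
    return nodes * gpus_per_node


def weight_footprint_gb(params: float, bits_per_param: float) -> float:
    """Memory needed to hold the weights alone, in decimal gigabytes."""
    return params * bits_per_param / 8 / 1e9


footprint = weight_footprint_gb(PARAMS, MXFP4_BITS_PER_PARAM)
print(f"GPUs in the FLUX.1 run: {total_gpus(NODES, GPUS_PER_NODE)}")
print(f"520B weights in MXFP4: {footprint:.2f} GB of {HBM_GB} GB")
print(f"Remaining HBM: {HBM_GB - footprint:.2f} GB")
```

Under these assumptions the weights fit, but with little room to spare, which suggests the 520B figure is best read as a capacity limit rather than a practical training configuration.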

Why It Matters

AMD's MLPerf submission is a strategic encirclement of NVIDIA: it uses the MXFP4 training recipe and Primus software to build software lock-in similar to CUDA's. As a distributed training framework, Primus will gradually reduce users' flexibility to work with open-source alternatives such as PyTorch FSDP. AMD also downplays MXFP4 precision risks: low-precision training may cause convergence issues for models at the 520B scale. Multi-node training relies on InfiniBand, whereas NVIDIA's NVLink and NVSwitch provide lower latency. AMD's UBB8 nodes face tail latency and PFC/ECN congestion limits. The 10 partner results within 6% may not reflect identical software versions, which risks fragmentation and lock-in.
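
To make the precision concern concrete, here is a minimal sketch of block-wise MXFP4 rounding, based on a reading of the OCP Microscaling format: 4-bit E2M1 elements with one shared power-of-two scale per 32-element block. It is an illustration only. AMD's actual training recipe is not described in the source, the function name `quantize_block` is hypothetical, and tie handling is simplified.

```python
import math

# Magnitudes representable by an FP4 E2M1 element (sign handled separately).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
E2M1_EMAX = 2      # exponent of the largest power of two in the grid (4.0 = 2**2)
BLOCK_SIZE = 32    # number of elements that share one scale


def quantize_block(block):
    """Round one block to MXFP4 and return the dequantized values."""
    max_abs = max(abs(v) for v in block)
    if max_abs == 0.0:
        return [0.0] * len(block)
    # Shared scale: a power of two derived from the block's largest magnitude.
    scale = 2.0 ** (math.floor(math.log2(max_abs)) - E2M1_EMAX)
    out = []
    for v in block:
        # Values beyond the largest grid point saturate at 6.0 * scale.
        scaled = min(abs(v) / scale, E2M1_GRID[-1])
        # Simplification: ties resolve to the smaller grid point.
        nearest = min(E2M1_GRID, key=lambda g: abs(g - scaled))
        out.append(math.copysign(nearest * scale, v))
    return out


# A ramp of small values: every element must land on one of 8 magnitudes.
block = [0.01 * i for i in range(BLOCK_SIZE)]
restored = quantize_block(block)
worst = max(abs(a - b) for a, b in zip(block, restored))
print(f"largest absolute rounding error in the block: {worst:.4f}")
```

Because every value in a block is forced onto eight magnitudes scaled by a single factor, small values that share a block with a large outlier lose most of their resolution. This is one reason independent convergence checks are worth requesting.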

PRO Decision

【Vendors】NVIDIA should counter by showcasing the advantages of NVFP4 + NVLink 5.0 in the next MLPerf round, emphasizing how in-domain interconnect bandwidth (900 GB/s vs. AMD's 400 Gb/s InfiniBand, which is roughly 50 GB/s) reduces tail latency. It can use native FP4 support in CUDA 12.x and open-source FSDP2 to undermine Primus lock-in.
【Enterprises】CIOs must demand independent validation of MXFP4 training convergence accuracy and test InfiniBand vs. NVLink throughput in multi-node setups. Audit Primus version compatibility and cloud portability, and keep PyTorch FSDP as a fallback.
【Investors】Look beyond the PR: AMD's software ecosystem maturity and network interconnect remain weak. Watch for RoCEv2 breakthroughs and Primus community adoption. A short-term boost is possible, but NVIDIA's B300/B400 response poses a long-term risk.
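
The bandwidth comparison quoted for vendors mixes gigabytes and gigabits per second, so it helps to put both on one unit before judging the gap. The short sketch below does only that conversion, using the two figures as quoted in the text. It does not model latency, topology, or the fact that NVLink is a scale-up link inside a node domain while InfiniBand is a scale-out network between nodes.

```python
# Unit normalization for the quoted interconnect figures.
# Both inputs are taken from the text as-is; nothing here is measured.

NVLINK_GBYTE_PER_S = 900.0      # quoted NVLink figure, in GB/s
INFINIBAND_GBIT_PER_S = 400.0   # quoted InfiniBand figure, in Gb/s


def gbit_to_gbyte_per_s(gbit_per_s: float) -> float:
    """Convert gigabits per second to gigabytes per second (8 bits per byte)."""
    return gbit_per_s / 8


infiniband_gbyte_per_s = gbit_to_gbyte_per_s(INFINIBAND_GBIT_PER_S)
ratio = NVLINK_GBYTE_PER_S / infiniband_gbyte_per_s

print(f"InfiniBand: {infiniband_gbyte_per_s:.0f} GB/s per port")
print(f"Quoted NVLink figure is {ratio:.0f}x higher")
```

The raw ratio overstates the practical difference for workloads that overlap communication with compute, so it is a starting point for throughput tests rather than a conclusion.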

Source: blog