AMD
2026-04-02
Category: Technology Integration · Impact: Important · Strength: High · Confidence: 85%

AMD Announces Breakthrough MLPerf Inference 6.0 Results, Showcasing Multinode Scaling and Multimodal Capabilities

Summary

AMD's MLPerf Inference 6.0 submission, powered by Instinct MI355X GPUs, surpassed 1 million tokens per second for the first time on models like Llama 2 70B and GPT-OSS-120B. The results highlight efficient multinode scaling, rapid enablement of new workloads (e.g., text-to-video model Wan-2.2-t2v), and reproducible performance across a broad partner ecosystem.

Key Takeaways

AMD's MLPerf 6.0 submission showcases several key advancements. On Llama 2 70B, the MI355X GPU delivered a 3.1x performance uplift over the previous-generation MI325X and, at multinode scale (11 nodes, 87 GPUs), achieved over 1M tokens/sec in both the Offline and Server scenarios and 785k tokens/sec in the Interactive scenario, with 93%-98% scaling efficiency.
The first-time GPT-OSS-120B submission showed competitive single-node performance against B200/B300 and was successfully scaled across nodes. AMD also expanded into multimodal AI with a first-time submission on the text-to-video model Wan-2.2-t2v. The results emphasize the maturity of the ROCm software stack and partner ecosystem reproducibility.
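As a rough illustration of the scaling-efficiency figure cited above, the metric is conventionally computed as achieved multinode throughput divided by ideal linear scaling (single-node throughput times node count). The sketch below assumes a hypothetical single-node baseline; AMD's actual per-node figures are not given in the announcement.

```python
# Sketch of how multinode scaling efficiency is typically derived.
# The single-node baseline below is illustrative, not AMD's measured figure.

def scaling_efficiency(multinode_tps: float, single_node_tps: float, nodes: int) -> float:
    """Ratio of achieved multinode throughput to ideal linear scaling."""
    ideal = single_node_tps * nodes
    return multinode_tps / ideal

# Hypothetical example: one node at ~98k tokens/sec, 11 nodes at 1.0M tokens/sec.
eff = scaling_efficiency(1_000_000, 98_000, 11)
print(f"{eff:.1%}")  # ~92.8% under these assumed numbers
```

Efficiencies in the reported 93%-98% band mean the cluster delivers nearly all of the throughput that perfect linear scaling would predict, i.e. interconnect and coordination overheads consume only a few percent.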

Why It Matters

This submission signals a shift in AI inference infrastructure competition from single-node performance to multinode cluster efficiency and rapid model enablement. AMD demonstrates full-stack capability for high-performance, scaled inference, paving the way for future rack-scale deployments (e.g., AMD Helios) and intensifying multi-vendor pressure in enterprise AI infrastructure procurement.

Source: AMD Newsroom