NVIDIA
2026-06-16
Technology Integration | Impact: Major | Conf: 95%

NVIDIA Blackwell Sweeps MLPerf: NVLink and NVFP4 Redefine AI Training Economics

Summary

NVIDIA Blackwell dominates MLPerf Training 6.0, submitting across all seven benchmarks including MoE workloads. GB300 NVL72 delivers up to 1.6x faster training than GB200, with fifth-gen NVLink unifying 72 GPUs as one giant GPU. NVFP4 low-precision training and massive scale (8,192 GPUs) set new industry standards.

Key Takeaways

NVIDIA submitted results for all seven MLPerf Training 6.0 benchmarks, including new MoE workloads DeepSeek-V3 671B and GPT-OSS-20B. The Blackwell platform uses GB200 NVL72 and GB300 NVL72 rack-scale systems, with fifth-gen NVLink Switch connecting 72 GPUs into a unified compute and memory pool, acting as a single giant GPU.
NVLink's high bandwidth addresses the all-to-all communication bottleneck in MoE training, where every GPU exchanges routed tokens with the other GPUs hosting experts, at a level traditional networking cannot match. NVIDIA also demonstrated NVFP4 low-precision training, which it used to pretrain the 550-billion-parameter Nemotron 3 Ultra model while meeting accuracy requirements.
GB300 NVL72 delivers up to 1.6x faster training than GB200, driven by higher compute density, expanded memory, and a higher power ceiling. The DeepSeek-V3 submission scaled to 8,192 GPUs, the largest Blackwell cluster in MLPerf. Networking options include Quantum InfiniBand and Spectrum-X Ethernet, with CoreWeave achieving the fastest time using Spectrum-X. Reliability rests on manufacturing screening, self-healing engines, and NVRx fault recovery.
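
To make the all-to-all point concrete, here is a rough back-of-the-envelope sketch in Python. It assumes uniform routing across expert-parallel ranks and counts one dispatch and one combine per MoE layer. The function names, the model dimensions, and the two bandwidth figures are hypothetical placeholders chosen for illustration, not values from the MLPerf submissions or from any product specification.

```python
def moe_all_to_all_bytes(tokens_per_gpu, hidden_size, top_k, ep_ranks, bytes_per_elem=2):
    """Estimate bytes one GPU sends per MoE layer (dispatch plus combine).

    Assumes tokens are routed uniformly, so a fraction (ep_ranks - 1) / ep_ranks
    of the routed activations leaves the local GPU.
    """
    routed = tokens_per_gpu * top_k * hidden_size * bytes_per_elem
    off_device = routed * (ep_ranks - 1) / ep_ranks
    return 2 * off_device  # dispatch to experts, then combine results back


def transfer_ms(num_bytes, gigabytes_per_second):
    """Idealized transfer time in milliseconds, ignoring latency and congestion."""
    return num_bytes / (gigabytes_per_second * 1e9) * 1e3


if __name__ == "__main__":
    # Hypothetical workload: 8,192 tokens per GPU, hidden size 4,096,
    # top-8 routing, 64 expert-parallel ranks, 2-byte activations.
    volume = moe_all_to_all_bytes(8192, 4096, 8, 64)
    print(f"per-layer all-to-all volume: {volume / 1e9:.2f} GB per GPU")

    # Placeholder per-GPU bandwidths (GB/s), assumed only to show the ratio.
    for label, bandwidth in [("scale-up link (assumed)", 1000.0),
                             ("scale-out NIC (assumed)", 50.0)]:
        print(f"{label}: {transfer_ms(volume, bandwidth):.2f} ms per layer")
```

Because the volume is paid on every MoE layer in both the forward and backward pass, the gap between the two assumed links compounds quickly, which is the mechanism behind the bandwidth argument above.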

Why It Matters

NVIDIA's blog post reports more than a benchmark victory; it signals that the control point in AI training has shifted from standalone GPUs to rack-scale interconnected systems. NVIDIA's fifth-gen NVLink and NVSwitch bind 72 GPUs into one logical unit, a shift that locks users into NVIDIA's proprietary interconnect and removes the flexibility to mix GPU vendors or network fabrics.
The hidden lock-in: once you adopt GB200 or GB300 NVL72 racks, future upgrades must stay within NVIDIA's rack-scale ecosystem, because NVLink does not interoperate with standard InfiniBand or RoCEv2. Users lose network choice.
NVIDIA downplays the accuracy risks of NVFP4 low-precision training, which may require extra tuning for MoE models. The massive 8,192 GPU cluster also relies on Spectrum-X or Quantum InfiniBand networks, both NVIDIA-dominated, further deepening vendor lock-in. For enterprises seeking heterogeneous AI infrastructure, this full-stack approach sacrifices architectural flexibility.
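
The accuracy concern can be illustrated with a simplified sketch of block-scaled 4-bit quantization in the spirit of NVFP4. It assumes the commonly described layout of E2M1 values (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) shared across 16-element blocks, and it deliberately keeps the per-block scale in full precision instead of FP8, so it is not NVIDIA's implementation. All function names are hypothetical.

```python
import random

# Magnitudes representable by a 4-bit E2M1 float; the sign is handled separately.
E2M1_LEVELS = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16  # assumed block size for the shared scale


def quantize_block(block):
    """Round each value to the nearest E2M1 level after per-block scaling."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return list(block)
    # Simplification: the scale stays in full precision here.
    scale = amax / E2M1_LEVELS[-1]
    out = []
    for x in block:
        target = abs(x) / scale
        mag = min(E2M1_LEVELS, key=lambda lvl: abs(target - lvl))
        out.append((mag if x >= 0 else -mag) * scale)
    return out


def fake_quantize(values, block_size=BLOCK_SIZE):
    """Quantize and immediately dequantize a flat list, block by block."""
    out = []
    for i in range(0, len(values), block_size):
        out.extend(quantize_block(values[i:i + block_size]))
    return out


def relative_error(original, quantized):
    """L2 norm of the quantization error divided by the L2 norm of the input."""
    num = sum((a - b) ** 2 for a, b in zip(original, quantized)) ** 0.5
    den = sum(a * a for a in original) ** 0.5
    return num / den if den else 0.0


if __name__ == "__main__":
    random.seed(0)
    weights = [random.gauss(0.0, 1.0) for _ in range(4096)]  # synthetic data
    print(f"relative error: {relative_error(weights, fake_quantize(weights)):.4f}")
```

A measurement of this kind, run on real tensors and followed by end-to-end evaluation, is one way to check whether a given MoE model needs the extra tuning mentioned above.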

PRO Decision

【Vendors】(Competitors like AMD, Intel, Google, Arista)

  • AMD should accelerate its Infinity Architecture and ROCm ecosystem, emphasizing the flexibility of open standards (CXL, Ethernet) against NVIDIA's proprietary NVLink lock-in, and offer composable heterogeneous training.
  • Intel should leverage Gaudi and Xeon with IPU and Ethernet, highlighting TCO advantages and freedom from the vendor lock-in of NVIDIA's rack-scale systems.
  • Google should promote the openness of its OCS network and the large-scale reliability of its TPU and JAX ecosystem, and attack NVFP4's accuracy trade-offs.

【Enterprises】(CIOs, Architects)
  • Conduct a supplier concentration risk audit: assess dependence on NVIDIA NVLink and Spectrum-X. Demand proof of cross-platform portability, e.g., model performance on AMD/Intel clusters.
  • In contracts for NVIDIA rack-scale systems, include unbundling clauses that allow a future mix of different network protocols (e.g., RoCEv2) or GPUs, and specify NVFP4 accuracy degradation metrics for specific models.

【Investors】
  • Beware the rising switching costs NVIDIA creates through rack-scale systems, which boost its long-term pricing power but also its antitrust risk. Monitor AMD and Intel for open interconnect alternatives.
  • Stay bullish on NVIDIA in the short term, but in the mid term assess whether open standards like CXL and UALink can weaken NVLink lock-in.

Source: NVIDIA News Center