N
NVIDIA
2026-06-09
Technology Integration Impact: Major Conf: 85%

NVIDIA NVFP4: Native 4-Bit Training Boosts Throughput 1.73x, Locks Blackwell Ecosystem

Summary

NVIDIA introduces NVFP4, a native 4-bit format on Blackwell, enabling lossless mixed-precision pretraining in JAX/MaxText. Achieves 1.73x throughput gain over FP8 on Llama 3.1 405B (GB300). Techniques like micro-block scaling and Random Hadamard Transform boost performance but lock users into NVIDIA hardware.

Key Takeaways

NVIDIA integrates NVFP4 training recipe in MaxText for Blackwell (GB300) native hardware. Key innovations:

  • Micro-block scaling: 16-element blocks (vs MXFP4's 32), reducing outlier impact.
  • E4M3 block scaling: Mantissa bits instead of MXFP4's power-of-two; 8B experiment shows MXFP4 needs 36% more tokens to match NVFP4 loss.
  • Random Hadamard Transform: Applied only to WGRAD GEMM inputs to Gaussianize outliers.
  • 2D weight scaling: One FP8 scale per 16x16 weight block for FPROP/DGRAD consistency.
  • Stochastic rounding: Native Blackwell instructions.

Performance: Llama 3 8B on GB200: 2017 TFLOPS/GPU (1.35x), GB300: 2301 TFLOPS (1.31x); Llama 3.1 405B on GB200: 2241 TFLOPS (1.44x), GB300: 3633 TFLOPS (1.73x). Loss curve tracks FP8 with only +0.026 nats gap.

Why It Matters

NVFP4 is a defensive move against AMD/Intel, locking users into NVIDIA's ecosystem via proprietary format. Once users adopt the NVFP4 recipe in MaxText, migration to non-NVIDIA hardware (e.g., AMD MI300X) becomes costly due to lack of native support.

Hidden cost: Complex scaling and Hadamard Transform are not portable; users must buy more NVIDIA GPUs to match performance, increasing vendor lock-in.

Engineering limitations: NVFP4 only applies to MLP layers; attention remains high-precision, limiting gains for attention-heavy models. Random Hadamard Transform adds overhead that may cause tail latency jitter in large clusters.

PRO Decision

【Vendors】(AMD, Intel, Google TPU) Accelerate development of own 4-bit formats (e.g., AMD FP4, MXFP4) and promote open standards. Benchmark against NVFP4 to highlight lock-in risks. Collaborate with PyTorch for cross-platform compatibility.
【Enterprises】CIOs/architects demand independent benchmarks of NVFP4 for convergence under varied conditions. Assess cross-cloud portability cost if migrating to non-NVIDIA hardware. Insert format openness clauses in procurement contracts.
【Investors】NVFP4 is a tactic to entrench NVIDIA's training monopoly. If open standards like MXFP4 gain traction, NVIDIA's proprietary format becomes a liability. Monitor competitor progress in 4-bit training, especially AMD ROCm and Intel oneAPI.

Source: blog
View Original →

Get 3-5 key AI infrastructure signals weekly →

💬 Comments (0)