NVIDIA AgentPerf Benchmark: Blackwell Ultra Delivers 20x More Agents per Megawatt vs Hopper
Summary
Key Takeaways
NVIDIA has announced the first results from AgentPerf, the inaugural benchmark for agentic AI infrastructure. The GB300 NVL72 platform runs up to 20x more concurrent agents per megawatt than the HGX H200 when serving DeepSeek V4 Pro, a large mixture-of-experts (MoE) model.
Developed by Artificial Analysis, AgentPerf simulates real coding-agent trajectories (reading files, editing code, executing commands) across more than 12 programming languages. It measures how many concurrent agent tasks a system can support while meeting service-level objectives (SLOs) of 20 and 60 tokens/s. Tool calls are simulated with CPU time so that the results isolate accelerator performance.
NVIDIA attributes the performance to full-stack co-design: GB300 NVL72 connects 72 GPUs in a rack-scale system for efficient MoE distribution; CUDA kernels overlap communication with compute; and TensorRT LLM separates input processing (prefill) from output generation (decode). NVIDIA also states that the Vera Rubin architecture is now in full production.
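The SLO-bounded capacity metric described above can be sketched in a few lines. This is a minimal illustration, not Artificial Analysis's actual methodology: the function names, the throughput curve, and the system power figure are all hypothetical assumptions, chosen only to show how "agents per megawatt at a given tokens/s SLO" could be derived from measurements.

```python
from typing import Sequence, Tuple


def max_agents_under_slo(
    curve: Sequence[Tuple[int, float]], slo_tokens_per_s: float
) -> int:
    """Largest tested concurrency whose per-agent decode rate meets the SLO.

    curve holds (concurrent_agents, tokens_per_s_per_agent) pairs.
    Returns 0 if no tested concurrency level satisfies the SLO.
    """
    passing = [agents for agents, rate in curve if rate >= slo_tokens_per_s]
    return max(passing, default=0)


def agents_per_megawatt(agents: int, system_power_kw: float) -> float:
    """Normalize an agent count by system power, expressed in megawatts."""
    return agents / (system_power_kw / 1000.0)


# Hypothetical measurements: per-agent speed drops as concurrency rises.
curve = [(64, 95.0), (128, 70.0), (256, 48.0), (512, 24.0), (1024, 11.0)]
SYSTEM_POWER_KW = 125.0  # assumed power draw, not a published figure

for slo in (20.0, 60.0):  # the two SLOs named in the article
    agents = max_agents_under_slo(curve, slo)
    density = agents_per_megawatt(agents, SYSTEM_POWER_KW)
    print(f"SLO {slo:.0f} tok/s: {agents} agents, {density:.0f} agents/MW")
```

With these made-up numbers the stricter 60 tokens/s SLO admits far fewer agents than the 20 tokens/s SLO, which is why a result should always be read together with the SLO it was measured under.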
Why It Matters
NVIDIA's AgentPerf benchmark is a defensive move to contain AMD Instinct and Intel Gaudi in the agentic AI inference race, and to encircle Google TPU by establishing agents per megawatt as the new industry metric. If the metric sticks, enterprise procurement will be judged against NVIDIA's own efficiency yardstick.
Hidden lock-in: TensorRT LLM and CUDA create a software dependency that makes agentic workloads hard to migrate. NVLink and NVSwitch rack-scale integration further ties customers to NVIDIA hardware.
Omitted limitations: GB300 NVL72 is a rack-scale system that requires liquid cooling, which makes it incompatible with most air-cooled data centers. The 20x figure is a ratio relative to the H200, not an absolute agent count. Simulating tool calls ignores real-world tail latency and the PFC/ECN congestion bottlenecks that degrade performance in production networks.
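The gap between a relative gain and an absolute agent count can be made concrete with simple arithmetic. The sketch below is illustrative only: the baseline densities and the power budget are hypothetical, and the helper name is an assumption. The 20x multiplier is the only number taken from the announcement.

```python
def absolute_agents(agents_per_mw: float, power_budget_mw: float) -> float:
    """Agents a deployment could host, ignoring rack granularity and cooling limits."""
    return agents_per_mw * power_budget_mw


SPEEDUP = 20            # relative gain cited in the article
POWER_BUDGET_MW = 2.0   # assumed facility power budget

# Two hypothetical H200 baselines, in agents per megawatt.
scenarios = {"low baseline": 50.0, "high baseline": 400.0}

results = {}
for name, baseline_per_mw in scenarios.items():
    base = absolute_agents(baseline_per_mw, POWER_BUDGET_MW)
    new = absolute_agents(baseline_per_mw * SPEEDUP, POWER_BUDGET_MW)
    results[name] = (base, new)
    print(f"{name}: {base:.0f} -> {new:.0f} agents (ratio {new / base:.0f}x)")
```

Both scenarios report the same 20x ratio, yet the absolute capacity differs by a factor of eight, so the ratio alone says little about how many agents a given site can actually run.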
PRO Decision
[Vendors] (AMD, Intel): Collaborate with third-party benchmark organizations (e.g., MLCommons, the body behind MLPerf) to create agentic AI tests that use diverse models (Llama 4, Grok) and real network latency, exposing NVIDIA's weaknesses in air-cooled environments. Promote ROCm and OpenVINO for agent workloads and emphasize cross-platform portability to break CUDA lock-in.
[Enterprises] (CIOs, Architects): Demand absolute agent counts (not just per-megawatt ratios) and run independent benchmarks under your existing data center conditions. Evaluate AMD MI400 or Intel Gaudi 3 under the same power budget. Beware of TensorRT LLM version lock-in, which can degrade performance for legacy models.
[Investors]: AgentPerf reinforces NVIDIA's inference moat, but the 20x gain is partly generational (H200→GB300). Watch for AMD and Intel dedicated agentic AI accelerators and whether open benchmarks (e.g., MLPerf Agent) dilute NVIDIA's metric control.