Huawei
2026-06-24
Technology Integration | Impact: Major | Confidence: 85%

Huawei and Hubei Mobile Validate AI Inference Acceleration: External KV Cache Boosts Throughput 372%

Summary

Huawei and Hubei Mobile completed what they describe as the first AI inference acceleration trial on a telecom operator's network. The setup pairs OceanStor A800 storage and an Ascend A3 supernode with UCM to move KV Cache out to PB-level external storage, and the companies report up to 372% higher TPS for long-context inference on the GLM-5.1 and MiniMax M2.5 models.

Key Takeaways

At MWC Shanghai 2026, Huawei and Hubei Mobile announced the completion of what they describe as the first operator trial of AI inference acceleration. The core stack consists of Huawei OceanStor A800 flash storage, an Ascend A3 supernode, and UCM (Unified Cache Manager) for inference memory management.

The trial deployed vLLM-Ascend on Hubei Mobile's live network and tested MiniMax M2.5 and GLM-5.1 with sequences of 8K to 190K tokens. Key metrics: tokens per second (TPS) improved by 372% on GLM-5.1 at 128K context and by 58% on MiniMax M2.5 at 64K context.
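Percentage gains are easy to misread, so a small worked example may help. A 372% improvement means throughput of 4.72 times the baseline, not 3.72 times. The sketch below converts the reported percentages into multipliers; the 100 TPS baseline is a hypothetical figure chosen for illustration, since the announcement as summarized here gives no absolute throughput numbers.

```python
def improvement_to_multiplier(improvement_pct: float) -> float:
    """Convert a relative improvement in percent to a throughput multiplier."""
    return 1.0 + improvement_pct / 100.0


def new_tps(baseline_tps: float, improvement_pct: float) -> float:
    """Throughput after applying a relative improvement to a baseline."""
    return baseline_tps * improvement_to_multiplier(improvement_pct)


# Reported gains from the trial. The 100 TPS baseline is hypothetical.
HYPOTHETICAL_BASELINE_TPS = 100.0
for label, pct in [("GLM-5.1 @ 128K", 372.0), ("MiniMax M2.5 @ 64K", 58.0)]:
    multiplier = improvement_to_multiplier(pct)
    tps = new_tps(HYPOTHETICAL_BASELINE_TPS, pct)
    print(f"{label}: {multiplier:.2f}x baseline, {tps:.0f} TPS")
```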

Technically, UCM offloads KV Cache from the accelerator's HBM (on Ascend, an NPU rather than a GPU) to external storage on OceanStor A800. This provides PB-level capacity with tiered lifecycle management, lifting the on-chip memory limit for long-context inference at lower cost.
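The general idea of tiered KV Cache offload can be sketched in a few lines. This is not UCM's API or implementation, which the announcement does not describe; `TieredKVCache`, its method names, and the least-recently-used spill policy are all assumptions made for illustration. A small fast tier stands in for accelerator HBM and a dictionary stands in for external storage.

```python
from collections import OrderedDict
from typing import Optional


class TieredKVCache:
    """Toy two-tier cache: a small fast tier spills LRU blocks to a large slow tier."""

    def __init__(self, fast_capacity_blocks: int) -> None:
        self.fast_capacity = fast_capacity_blocks
        self.fast: "OrderedDict[str, bytes]" = OrderedDict()  # stands in for HBM
        self.external: dict = {}                              # stands in for external storage
        self.stats = {"fast_hits": 0, "external_hits": 0, "misses": 0}

    def put(self, prefix_hash: str, kv_block: bytes) -> None:
        """Insert a block into the fast tier, spilling old blocks if it is full."""
        self.fast[prefix_hash] = kv_block
        self.fast.move_to_end(prefix_hash)
        while len(self.fast) > self.fast_capacity:
            old_key, old_block = self.fast.popitem(last=False)  # least recently used
            self.external[old_key] = old_block

    def get(self, prefix_hash: str) -> Optional[bytes]:
        """Look up a block, promoting it to the fast tier on an external hit."""
        if prefix_hash in self.fast:
            self.stats["fast_hits"] += 1
            self.fast.move_to_end(prefix_hash)
            return self.fast[prefix_hash]
        if prefix_hash in self.external:
            self.stats["external_hits"] += 1
            block = self.external.pop(prefix_hash)
            self.put(prefix_hash, block)
            return block
        self.stats["misses"] += 1  # caller would have to recompute this block
        return None


cache = TieredKVCache(fast_capacity_blocks=2)
for key in ("a", "b", "c"):
    cache.put(key, key.encode())
cache.get("a")        # spilled earlier, served from the external tier
cache.get("c")        # still in the fast tier
cache.get("missing")  # never stored
print(cache.stats)
```

In a real system the slow path is where the cost lies: every external hit is a storage read rather than an on-chip memory access.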

Why It Matters

Huawei's move reads as a defensive ecosystem play aimed at NVIDIA and AMD. By moving KV Cache out of accelerator HBM and into its proprietary OceanStor A800 and UCM, Huawei builds a closed inference data path that pushes customers toward full-stack lock-in.

The second-order effect is vendor lock-in: adopting UCM ties the inference data lifecycle, caching, and failover to Huawei's stack, so a later migration to NVIDIA H100/B200 would be costly and would likely carry a heavy performance penalty.

The press release does not report tail latency for external storage access. Reading KV Cache from OceanStor A800 over PCIe or NVMe over Fabrics is slower than reading it from on-chip HBM, which is likely to worsen Time-To-First-Token (TTFT) in real-time applications. The announcement leaves this critical gap unaddressed.
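A back-of-the-envelope estimate shows why the omission matters. The sketch below sizes a KV Cache with the standard formula for multi-head or grouped-query attention (keys plus values, per layer, per KV head) and divides by link bandwidth to get a lower bound on load time. Every number in it is hypothetical: the layer count, head configuration, and bandwidths are illustrative and are not the published specifications of GLM-5.1, MiniMax M2.5, or OceanStor A800. Models that compress their KV Cache would need a different formula.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """KV Cache size for one sequence; the leading 2 covers keys and values."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


def transfer_seconds(num_bytes: float, bandwidth_gb_per_s: float) -> float:
    """Lower bound on transfer time, ignoring protocol overhead and queuing."""
    return num_bytes / (bandwidth_gb_per_s * 1e9)


# Hypothetical model configuration, NOT taken from any real model card.
size = kv_cache_bytes(num_layers=64, num_kv_heads=8, head_dim=128,
                      seq_len=128 * 1024, bytes_per_elem=2)
print(f"KV Cache for one 128K sequence: {size / 2**30:.1f} GiB")

# Hypothetical effective bandwidths for the external storage path.
for bandwidth in (10.0, 25.0, 50.0):
    seconds = transfer_seconds(size, bandwidth)
    print(f"  at {bandwidth:.0f} GB/s: at least {seconds:.2f} s to load")
```

Under these assumed figures, a single long-context sequence carries tens of gibibytes of cache, so the load time added to TTFT depends heavily on the achieved bandwidth of the storage path. That is exactly the number a buyer would want measured rather than inferred.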

PRO Decision

【Vendors】Competitors (NVIDIA, AMD, Intel): Publish independent benchmarks that highlight the TTFT and tail latency disadvantages of Huawei's solution in real-time interactive AI scenarios. Promote KV Cache offload built on open interconnect and storage standards (e.g., CXL or NVMe over Fabrics) to counter Huawei's proprietary UCM barrier, emphasizing cross-platform portability.

【Enterprises】CIOs & Architects: Run independent technical audits that take no vendor figure on trust. Demand TTFT, P99 latency, and failover time metrics from Huawei, and test equivalent workloads on non-Huawei GPU platforms (e.g., NVIDIA H100) to quantify lock-in costs. Prioritize open-source KV Cache management (e.g., vLLM native support) to maintain architectural flexibility.
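For teams collecting their own measurements, P99 can be computed directly from raw latency samples. The sketch below uses the nearest-rank method, which is one of several common percentile definitions; the function name and the sample values are hypothetical and do not come from the trial.

```python
import math
from typing import Sequence


def percentile_nearest_rank(samples: Sequence[float], p: float) -> float:
    """Return the p-th percentile (0 < p <= 100) using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical TTFT samples in milliseconds; one slow request dominates the tail.
ttft_ms = [180.0, 190.0, 200.0, 210.0, 2400.0]
print("P50 TTFT (ms):", percentile_nearest_rank(ttft_ms, 50))
print("P99 TTFT (ms):", percentile_nearest_rank(ttft_ms, 99))
```

With so few samples the P99 is simply the worst request, which is why audits should insist on large sample counts and on the raw distribution, not only an average.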

【Investors】Capital Markets: Be wary of Huawei using this PR signal to mask Ascend ecosystem weaknesses in per-chip compute and software maturity. The approach essentially compensates for compute gaps by adding storage; long-term TCO may exceed that of pure-GPU solutions because of storage bandwidth and latency bottlenecks. Monitor how NVIDIA's NVLink ecosystem and CXL-based designs respond to such disaggregation approaches.

Source: Huawei (official)