Huawei
Industry Signal | Impact: Major | Confidence: 85%

Huawei Ascend 910C Trains 1.6T-Parameter MoE Model: First Full Pipeline on Domestic AI Chips

Summary

Huawei, in collaboration with research institutes, completed full-parameter post-training of DeepSeek-V4-Pro (1.6 trillion parameters, MoE) on an Ascend 910C cluster. Key metrics: 1,500 stable training steps on 1,000 cards, 30% compute utilization, a 14% operator efficiency gain, and zero reliance on foreign GPUs. The report describes this as the first end-to-end trillion-parameter training loop on Chinese domestic chips.

Key Takeaways

Huawei, in partnership with Hetao College, Harbin Institute of Technology (Shenzhen), and Shenzhen Big Data Research Institute, has completed full-parameter post-training of DeepSeek-V4-Pro (1.6 trillion parameters, MoE architecture) on an Ascend 910C cluster. According to the report, this is the first time a trillion-parameter MoE model has been trained end to end on purely domestic computing hardware. Key metrics: 1,500 stable training steps on a 1,000-card cluster; compute utilization of 30% (compared with ~40% for equivalent foreign GPU clusters); a 14% improvement in core operator efficiency; zero dependence on foreign GPUs.
Technical breakthroughs include a trillion-parameter memory sharding technique that distributes the 1.6T parameters across thousands of 910C chips; Huawei's proprietary MindSpeed distributed acceleration suite with global dynamic load balancing; and a cluster management system that supports real-time fault isolation and minute-level checkpoint recovery. The milestone is presented as proof that domestic AI chips can handle the full training pipeline, from pre-training to deep post-training. The Ascend 950 supernode, expected in 2027, is positioned to become the dedicated compute base for China's top-tier large models, completing the domestic hardware-framework-model ecosystem.
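To give a sense of why sharding is unavoidable at this scale, the back-of-envelope sketch below estimates per-card memory if the training state were split evenly across the 1,000-card cluster. Only the parameter count and card count come from the report. The bytes-per-parameter figures are assumptions (BF16 weights, and a common mixed-precision Adam layout), the helper name is hypothetical, and the report does not describe Huawei's actual sharding scheme.

```python
def per_card_memory_gb(n_params, n_cards, bytes_per_param):
    """Memory per card in GB (1 GB = 1e9 bytes), assuming perfectly even sharding."""
    return n_params * bytes_per_param / n_cards / 1e9


N_PARAMS = 1.6e12  # from the report: 1.6 trillion parameters
N_CARDS = 1_000    # from the report: 1,000-card cluster

# Assumption: weights stored in BF16, i.e. 2 bytes per parameter.
weights_only = per_card_memory_gb(N_PARAMS, N_CARDS, 2)

# Assumption: mixed-precision Adam at roughly 16 bytes per parameter
# (BF16 weight 2 + BF16 gradient 2 + FP32 master weight 4 + two FP32 moments 8).
full_state = per_card_memory_gb(N_PARAMS, N_CARDS, 16)

print(f"weights only: {weights_only:.1f} GB per card")
print(f"weights + grads + optimizer: {full_state:.1f} GB per card")
```

Under these assumptions the full training state is about eight times the size of the weights alone, before counting activations, communication buffers, or any replication, none of which this estimate includes.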

Why It Matters

This signals a shift in ecosystem control from NVIDIA's CUDA stack to Huawei's MindSpeed and CANN stack. By demonstrating trillion-parameter training, Huawei aims to lock enterprises into its proprietary software: deep operator optimization and memory management dependencies make later migration away from Ascend expensive. The report downplays the Ascend 910C's lower HBM bandwidth and SRAM capacity relative to the H100, which contribute to 30% compute utilization versus roughly 40%. Lower utilization means more cards for equivalent throughput, which erodes any TCO advantage. MoE All-to-All communication may also suffer tail latency and congestion on HCCS and RoCEv2 networks. Crucially, only 1,500 steps were shown, with no convergence curves or loss metrics, so training quality remains unverified.
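The utilization argument can be made concrete with a rough calculation. The sketch below estimates how many cards at 30% utilization would match the effective throughput of a baseline cluster at 40%. The two utilization figures come from the report. The function name is hypothetical, and the default `peak_ratio` of 1.0 (equal peak FLOPS per card on both platforms) is a simplifying assumption that almost certainly does not hold in practice, so treat the result as illustrative only.

```python
import math


def cards_for_equal_throughput(baseline_cards, baseline_util, target_util, peak_ratio=1.0):
    """Cards the target platform needs to match the baseline's effective throughput.

    peak_ratio is the target card's peak FLOPS divided by the baseline card's
    peak FLOPS. The default of 1.0 is an assumption, not a measured value.
    """
    effective_baseline = baseline_cards * baseline_util
    return math.ceil(effective_baseline / (target_util * peak_ratio))


# 40% vs 30% utilization, assuming equal per-card peak FLOPS.
needed = cards_for_equal_throughput(1_000, 0.40, 0.30)
print(needed)  # 1334, about one third more cards than the baseline
```

If the target card also has a lower peak than the baseline card (a `peak_ratio` below 1.0), the required card count grows further, which is the TCO concern raised above.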

PRO Decision

Vendors (NVIDIA, AMD, Intel): Immediately release benchmark comparisons of convergence speed and loss for equivalent MoE models on H100/B200, highlighting NVLink's low tail latency in All-to-All communication. Accelerate open-source CUDA alternatives such as Triton to reduce MindSpeed lock-in.
Enterprises (CIOs & Architects): Conduct a zero-trust audit. Demand full loss curves and convergence reports from Huawei for the 1,500-step run, and run A/B comparisons against H100 clusters. Evaluate HCCS interconnect congestion control under MoE workloads, and request RoCEv2 PFC/ECN configuration whitepapers. Prioritize cross-cloud portability by avoiding deep MindSpeed integration and maintaining compatibility with native PyTorch APIs.
Investors: Treat this as PR-driven; the 30% utilization versus the H100's 40% implies a TCO disadvantage. Monitor whether the Ascend 950's HBM3e bandwidth and SRAM capacity truly close the gap. In the short term, the domestic substitution theme benefits Huawei suppliers (e.g., SMIC, CXMT), but long-term validation of convergence quality and software ecosystem maturity is critical.
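The A/B loss-curve comparison recommended to enterprises above could be scripted along the following lines. This is a minimal sketch, not a validated methodology: the function names, the trailing moving-average smoothing, and the default window of 50 steps are all assumptions, and it expects two per-step loss lists logged from otherwise identical runs. No loss data has been published for the run described here, so the curves in the usage lines are synthetic placeholders.

```python
def moving_average(values, window):
    """Trailing moving average; early points average over the steps seen so far."""
    smoothed = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]
        smoothed.append(running / min(i + 1, window))
    return smoothed


def max_relative_gap(loss_a, loss_b, window=50):
    """Largest relative gap between two smoothed loss curves of equal length."""
    if len(loss_a) != len(loss_b):
        raise ValueError("curves must cover the same steps")
    smooth_a = moving_average(loss_a, window)
    smooth_b = moving_average(loss_b, window)
    return max(abs(a - b) / abs(b) for a, b in zip(smooth_a, smooth_b))


# Synthetic placeholder curves, purely to show the call shape.
curve_a = [2.0] * 10
curve_b = [1.0] * 10
print(max_relative_gap(curve_a, curve_b, window=3))  # 1.0
```

A small, stable gap across all steps would support the claim that the two platforms converge comparably; what counts as acceptable is a judgment the auditing team has to set for its own workload.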

Source: 拾光与亮