Part 1: Event Overview
The May 2026 investment created several firsts: NVIDIA and AMD collaborating directly at the investment level for the first time, Intel CEO Lip-Bu Tan and Broadcom CEO Hock Tan personally joining as angel investors, and OpenAI co-founder John Schulman's involvement, hinting at top AI labs' strategic focus on inference infrastructure.
The core asset, SGLang, is an inference framework that evolved from LightLLM and vLLM. Its core innovation, RadixAttention, achieves a qualitative leap in KV cache management, which is the fundamental reason all three chipmakers are betting on it simultaneously.
Part 2: Deep Technical Analysis
2.1 KV Cache Management: Radix Tree vs Block Table
vLLM's PagedAttention uses a block-table scheme: a hash table plus fixed-size memory blocks. When handling multi-turn conversations or RAG scenarios, each request's prefix requires independent storage. Prefix hash matching enables some reuse, but the matching granularity is limited by block boundaries, and the KV cache is released immediately after a request completes.
SGLang's RadixAttention replaces the block table with a radix tree. When 100 users query the same document simultaneously, RadixAttention computes the prefill only once; the other 99 requests directly reuse the cached KV data.
Key difference: vLLM's reuse granularity is limited by block size, while SGLang can reuse a prefix of any length.
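To make the granularity difference concrete, here is a minimal sketch of radix-tree prefix matching at token granularity. It is illustrative only, not SGLang's actual implementation: `RadixNode` and `match_prefix` are hypothetical names, and real RadixAttention stores GPU KV-cache handles at each node with LRU eviction.

```python
class RadixNode:
    def __init__(self):
        self.children = {}     # first token of edge -> (edge_tokens, child_node)
        self.kv_handle = None  # placeholder for a cached KV-cache segment

def match_prefix(root, tokens):
    """Return how many leading tokens of `tokens` are already cached."""
    node, matched = root, 0
    while matched < len(tokens) and tokens[matched] in node.children:
        edge, child = node.children[tokens[matched]]
        i = 0
        while i < len(edge) and matched + i < len(tokens) and edge[i] == tokens[matched + i]:
            i += 1
        matched += i
        if i < len(edge):  # diverged mid-edge: reuse stops at token granularity
            break
        node = child
    return matched

# Two requests sharing a cached 3-token prefix: only the suffix needs prefill.
root = RadixNode()
root.children[101] = ([101, 102, 103], RadixNode())
assert match_prefix(root, [101, 102, 103, 7, 8]) == 3  # prefill only [7, 8]
```

A block table, by contrast, rounds the match down to the nearest block boundary, so up to `block_size - 1` trailing tokens of a shared prefix get recomputed.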
2.2 Performance Data
| Scenario | SGLang | vLLM | Improvement | Confidence |
|---|---|---|---|---|
| H100 throughput | ~16,200 tok/s | ~12,500 tok/s | ~29% | ⚠️Vendor claim |
| B200 decode (v0.5) | - | - | 2.25x | ⚠️Vendor claim |
| Prefix-heavy (RAG/multi-turn) | - | - | 6.4x | ⚠️Vendor claim |
| DeepSeek V3 | - | - | 3.1x | ⚠️High confidence |
SGLang v0.5 brings a 2.25x decode-throughput improvement on NVIDIA's B200 ⚠️Vendor claim. On H100, SGLang achieves ~16,200 tok/s versus vLLM's ~12,500 tok/s, a ~29% gap ⚠️Vendor claim.
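(Arithmetic check: 16,200 / 12,500 ≈ 1.296, a ~29.6% gap, consistent with the rounded ~29% figure.)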
2.3 DeepSeek V4/MoE Optimization
SGLang achieves 52,300 input tok/s and 22,300 output tok/s on a 96-GPU H100 cluster ⚠️Vendor claim. Multi-Token Prediction (MTP) support accelerates decode by 1.8x ⚠️Vendor claim.
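Why MTP helps decode: instead of one full forward pass per token, the model drafts several tokens cheaply and verifies them in a single pass. The sketch below captures the greedy draft-and-verify loop in spirit only; `draft_next_k` and `verify` are hypothetical stand-ins, not SGLang's API.

```python
def mtp_decode_step(model, context, k=4):
    draft = model.draft_next_k(context, k)   # cheap MTP head proposes k tokens
    verified = model.verify(context, draft)  # ONE full forward pass scores all k positions
    accepted = []
    for proposed, checked in zip(draft, verified):
        accepted.append(checked)             # the verifier's token is always valid output
        if proposed != checked:              # first mismatch: later draft tokens are stale
            break
    return accepted                          # 1..k tokens per full forward pass
```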
DeepEP implements differentiated optimization: a normal mode for prefill (throughput priority) and a low-latency mode for decode (latency priority), as sketched below. This split makes SGLang the preferred inference framework for DeepSeek V3/V4.
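The routing decision itself is simple. This sketch assumes a hypothetical `EPMode` wrapper rather than DeepEP's real API; only the decision logic is the point.

```python
from enum import Enum

class EPMode(Enum):
    NORMAL = "normal"            # throughput-optimized all-to-all kernels, large batches
    LOW_LATENCY = "low_latency"  # latency-optimized kernels for small decode batches

def pick_ep_mode(is_prefill: bool) -> EPMode:
    # Prefill chews through long prompts and is throughput-bound.
    # Decode emits one token per request per step and is latency-bound.
    return EPMode.NORMAL if is_prefill else EPMode.LOW_LATENCY
```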
2.4 xgrammar
SGLang natively integrates xgrammar, achieving a claimed 10x speedup for JSON structured output versus other open-source solutions ⚠️Vendor claim. This is a fundamental architectural shift: structural constraints are enforced during token sampling rather than parsed after generation.
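A minimal sketch of the principle (not xgrammar's actual API): invalid tokens are masked out of the logits before sampling, so the output can never leave the grammar. Here `allowed_tokens` stands in for a compiled grammar's per-state mask.

```python
import math

def constrained_sample_step(logits, state, allowed_tokens):
    """Greedily pick the best token that keeps the output inside the grammar."""
    valid = allowed_tokens(state)  # token ids legal in the current grammar state
    masked = [l if t in valid else -math.inf for t, l in enumerate(logits)]
    return max(range(len(masked)), key=masked.__getitem__)

# Toy example: only tokens 0 and 2 are grammatically valid in this state.
best = constrained_sample_step([0.1, 9.9, 0.5], state=None, allowed_tokens=lambda s: {0, 2})
assert best == 2  # token 1 has the highest logit but is masked out
```

Much of the claimed speedup comes from precomputing and caching these per-state masks instead of re-scanning the vocabulary at every step, something a parse-after-generation design cannot retrofit, which is why it counts as architectural.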
Part 3: Three Giants Strategy Breakdown
3.1 NVIDIA: Fortifying the Ecosystem Moat
SGLang maximizes GPU utilization, which means higher ROI on NVIDIA's premium products. Investing in SGLang reinforces NVIDIA's CUDA dominance by ensuring first-class support for NVIDIA hardware.
3.2 AMD: Breaking CUDA Dependency
ROCm's ecosystem maturity still lags CUDA's, and SGLang's open-source, cross-platform nature gives AMD a path to compete: if the inference framework is good enough and neutral enough, hardware choice may shift from CUDA lock-in to cost-performance considerations.
3.3 Intel: A Showcase for Xe Architecture
The Intel CEO's personal angel investment is a strategic endorsement. Intel's Xe GPUs need a killer application to prove themselves, and SGLang's open-source nature lets Intel showcase its hardware capabilities without relying on commercial partners.
Part 4: Impact on Inference Engine Landscape
- vLLM: largest ecosystem, with roughly 3x more GitHub contributors than SGLang
- SGLang: fastest growth, 27K+ stars, 400K+ GPU deployments, with users including Google, Microsoft, and Oracle
Chipmakers now view inference engines as hardware-capability amplifiers: investing in inference frameworks is essentially investing in their own market competitiveness.
Part 5: Weaknesses Analysis
5.1 Fragmentation Risk
Each chipmaker is pushing its own inference optimizations, and inference engines risk repeating the fragmentation that AI training frameworks went through.
5.2 Export-Circumvention Risk
SGLang's open-source, cross-platform nature could be used to circumvent chip export restrictions ⚠️High confidence.
5.3 Defense: Standardization Significance
Inference-layer standardization improves AI safety: when inference engines are mature and neutral, the predictability and auditability of AI system behavior improve.
Part 6: Forward-Looking Predictions
Short-term (3 months)
- DeepSeek V4 names SGLang as its official inference engine ⚠️High confidence
- The vLLM community comes under strain, and some core developers may shift to SGLang
- NVIDIA internally accelerates an SGLang-first GPU optimization roadmap
Medium-term (6 months)
- AMD Instinct + ROCm + SGLang forms a competitive alternative in the inference market
- NVIDIA GPU + SGLang ecosystem tightens, possible NVIDIA-exclusive optimizations
- Intel Xe GPU gains first large-scale production deployment via SGLang
Long-term (12 months)
- The inference engine layer becomes a new chipmaker moat: whoever controls inference optimization influences model deployment choices
- Open-source neutrality faces a test: can SGLang maintain independence after taking investment from three chipmakers?
- Inference-layer standardization lands on the agenda, driven by chipmakers or regulators
Key Data Summary
| Data Point | Value | Confidence |
|---|---|---|
| Total Investment | $155M | ✅Verified |
| Valuation | $400M | ✅Verified |
| GitHub Stars | 27K+ | ✅Verified |
| GPU Deployments | 400K+ | ⚠️Vendor claim |
| DeepSeek V3 Improvement | 3.1x | ⚠️High confidence |
| B200 Decode Improvement | 2.25x | ⚠️Vendor claim |
| JSON Structured Output | 10x | ⚠️Vendor claim |
| H100 Throughput | 16,200 tok/s | ⚠️Vendor claim |
| Prefix-heavy Improvement | 6.4x | ⚠️Vendor claim |