Cloudflare's Trio of Patches Breaks ClickHouse Partition Bloat Lock Contention
Summary
Key Takeaways
Cloudflare stores over 100 PB in ClickHouse. Its Ready-Analytics system keeps data in a single massive table that was partitioned by day, but per-namespace retention requirements forced a switch to (namespace, day) partitioning in January 2025. Because queries filter by namespace, the team expected no performance impact, yet billing jobs slowed in March 2025. The investigation found that the total part count had grown to 160k per replica, causing severe contention on the MergeTreeData mutex: every query-planning thread acquired an exclusive lock and copied the entire parts list. CPU flame graphs attributed 45% of CPU time to filterPartsByPartition, while real-time traces showed that more than 50% of query duration was spent waiting for that mutex. Three patches fixed the problem: (1) switching to std::shared_lock so reads can proceed concurrently; (2) deferring the vector copy by using a shared cache; and (3) binary-searching the parts, which are sorted by namespace, to skip irrelevant ones. Query latency dropped by more than 50% and no longer depends on part count. The patches were merged upstream (PR #85535, v25.11).
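The three ideas can be illustrated with a small, self-contained C++ sketch. This is not ClickHouse's code: `Part`, `PartsRegistry`, `PartsSnapshot`, and `partsForNamespace` are hypothetical names, and the "shared cache" is modeled here as an immutable snapshot behind a `std::shared_ptr`, which is an assumption about one way to avoid per-query copies.

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <mutex>
#include <shared_mutex>
#include <string>
#include <tuple>
#include <utility>
#include <vector>

// Hypothetical stand-in for a data part; not ClickHouse's actual type.
struct Part {
    std::string ns;  // namespace component of the partition key
    int day;         // day component of the partition key
};

// Immutable list of parts, shared by all readers instead of copied per query.
using PartsSnapshot = std::shared_ptr<const std::vector<Part>>;

class PartsRegistry {
public:
    // Writers take the exclusive lock and publish a new sorted snapshot.
    void addPart(Part part) {
        std::unique_lock lock(mutex_);
        auto next = std::make_shared<std::vector<Part>>(*snapshot_);
        auto pos = std::upper_bound(next->begin(), next->end(), part, byKey);
        next->insert(pos, std::move(part));
        snapshot_ = std::move(next);
    }

    // Idea 1: readers take a shared lock, so they do not block each other.
    // Idea 2: only a pointer is copied, never the whole vector.
    PartsSnapshot snapshot() const {
        std::shared_lock lock(mutex_);
        return snapshot_;
    }

private:
    static bool byKey(const Part & a, const Part & b) {
        return std::tie(a.ns, a.day) < std::tie(b.ns, b.day);
    }

    mutable std::shared_mutex mutex_;
    PartsSnapshot snapshot_ = std::make_shared<const std::vector<Part>>();
};

// Heterogeneous comparator so a Part can be compared against a namespace name.
struct NamespaceLess {
    bool operator()(const Part & p, const std::string & ns) const { return p.ns < ns; }
    bool operator()(const std::string & ns, const Part & p) const { return ns < p.ns; }
};

// Idea 3: binary search over the sorted parts returns the half-open index
// range [first, last) for one namespace in O(log n), skipping all others.
std::pair<std::size_t, std::size_t> partsForNamespace(
    const std::vector<Part> & parts, const std::string & ns) {
    auto range = std::equal_range(parts.begin(), parts.end(), ns, NamespaceLess{});
    return {static_cast<std::size_t>(range.first - parts.begin()),
            static_cast<std::size_t>(range.second - parts.begin())};
}
```

In this sketch a query planner would call `snapshot()` once and then `partsForNamespace(*snap, ns)`, so the time spent holding the lock does not grow with the number of parts. The trade-off is that writers pay for a full copy when they publish a new snapshot, which is reasonable only if part changes are far rarer than queries.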