Cloudflare Uncovers Critical ClickHouse Bottleneck That Nearly Disrupted $100M+ Billing Pipeline

Breaking: Hidden Bottleneck in ClickHouse Slows Cloudflare's Billing to a Crawl

Cloudflare has revealed that a previously undetected bottleneck deep within ClickHouse's internal architecture nearly derailed the company's multi-hundred-million-dollar billing pipeline. After a routine migration, daily aggregation jobs ran significantly slower, threatening the timely issuance of invoices and the operation of fraud detection systems.

Source: blog.cloudflare.com

"This was a high-stakes situation," said a senior engineer at Cloudflare. "Our billing pipeline powers hundreds of millions of dollars in usage revenue. A delay of even a few hours can make invoice reconciliation extremely difficult."

The incident, which the engineering team described as urgent, prompted a deep dive into ClickHouse's inner workings. All standard diagnostic checks—I/O, memory, rows scanned, parts read—appeared normal, making the bottleneck particularly elusive.

The Root Cause: A One-Size-Fits-All Retention Policy

Cloudflare's ClickHouse infrastructure stores over a hundred petabytes of data across dozens of clusters. To simplify onboarding for internal teams, the company built Ready-Analytics—a system that streams data into a single massive table, with each record sharing a standard schema and disambiguated by a namespace.

"The system was popular, but it had a critical flaw: a fixed 31-day retention policy," explained the engineer. "Some teams needed years of data for legal reasons, while others only needed a few days. This restriction forced many to use a more complex setup."

The fixed retention was implemented via partition drops, a legacy approach predating ClickHouse's native TTL features. However, during a migration, this retention logic inadvertently created a hidden bottleneck in ClickHouse's internal processing.
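Retention by partition drop can be sketched roughly as follows. This is an illustration only, not Cloudflare's actual job: it assumes daily partitions identified by YYYYMMDD strings, and the table name `ready_analytics` is hypothetical. It only builds the `ALTER TABLE ... DROP PARTITION ID` statements; it does not connect to a database.

```python
from datetime import date, timedelta

RETENTION_DAYS = 31  # the fixed retention window described in the article


def partitions_to_drop(existing_partitions, today, retention_days=RETENTION_DAYS):
    """Return partition IDs (YYYYMMDD strings) that fall before the retention cutoff.

    Assumes one partition per day; this layout is an assumption for the sketch.
    """
    cutoff_id = (today - timedelta(days=retention_days)).strftime("%Y%m%d")
    # YYYYMMDD strings sort lexicographically in date order.
    return sorted(p for p in existing_partitions if p < cutoff_id)


def drop_statements(table, partitions):
    """Build one DROP PARTITION statement per expired partition."""
    return [f"ALTER TABLE {table} DROP PARTITION ID '{p}'" for p in partitions]


if __name__ == "__main__":
    expired = partitions_to_drop(
        ["20241230", "20241231", "20250101"], today=date(2025, 1, 31)
    )
    for statement in drop_statements("ready_analytics", expired):  # hypothetical table
        print(statement)
```

Because a drop removes a whole partition at once, every namespace stored in that partition shares the same cutoff, which is why this approach yields a single retention window for all tenants.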

Discovery: The Hidden Culprit

After exhaustive investigation, engineers traced the slowdown to ClickHouse's partition pruning mechanism. In the Ready-Analytics table, the primary key (namespace, indexID, timestamp) caused the database to scan far more parts than necessary when applying the retention policy.
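To see why the number of parts that survive pruning matters, consider a toy model of min/max pruning. It is a deliberate simplification, not ClickHouse's implementation: the `Part` type, its fields, and the integer timestamps are all hypothetical. A part can be skipped only when its timestamp range does not overlap the range being queried.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Part:
    """Hypothetical stand-in for a data part with a recorded timestamp range."""
    name: str
    min_ts: int
    max_ts: int


def surviving_parts(parts, query_start, query_end):
    """Keep the parts whose [min_ts, max_ts] range overlaps the query range."""
    return [p for p in parts if p.max_ts >= query_start and p.min_ts <= query_end]


# Parts that each cover a narrow time slice: most can be skipped.
narrow = [Part("a", 0, 9), Part("b", 10, 19), Part("c", 20, 29)]
# Parts that each span the full time range: none can be skipped.
wide = [Part("x", 0, 29), Part("y", 0, 29), Part("z", 0, 29)]

narrow_hits = surviving_parts(narrow, 12, 15)
wide_hits = surviving_parts(wide, 12, 15)
```

In this model the same query touches one part in the first layout and all three in the second, even though the amount of matching data could be identical. A change in data distribution of that kind would not necessarily show up as more rows scanned.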

"We discovered that ClickHouse was not efficiently skipping partitions during the retention job," said the engineer. "It was a subtle interaction between the schema design and the aging of data." The issue manifested only after the migration because the data distribution changed.

Three Patches to the Rescue

Cloudflare's team wrote three patches to address the bottleneck. The first optimized the partition pruning logic to consider namespace-specific retention. The second improved the primary key's sorting order to minimize scans. The third enhanced ClickHouse's memory management during retention operations.

"Each patch was a surgical fix," the engineer noted. "Together, they restored the pipeline's performance and even improved it in some cases."

Background

Cloudflare has relied for years on ClickHouse, an open-source column-oriented OLAP database. The Ready-Analytics system, launched in early 2022, grew to over 2 PiB of data by December 2024, ingesting millions of rows per second. Its architecture was designed to reduce onboarding friction for hundreds of applications, but the rigid retention policy was a known limitation.

The migration that triggered the bottleneck was part of a broader infrastructure upgrade. The team had to act quickly to prevent revenue impact.

What This Means

This incident highlights how seemingly minor configuration decisions—like a fixed retention policy—can have cascading effects on large-scale systems. For ClickHouse users, it underscores the importance of monitoring internal metrics beyond standard query performance indicators.

"This case shows that even mature systems can harbor hidden bottlenecks," said the engineer. "We've since added custom monitoring for partition pruning efficiency and plan to upstream our patches."

For Cloudflare, the fix ensures the billing pipeline remains robust. The company also intends to introduce per-namespace retention capabilities, addressing the original limitation that forced many teams into complex setups.
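Per-namespace retention could look something like the sketch below. It is a guess at the general shape of such a feature, not Cloudflare's design: the override table, the namespace names, and the function names are all hypothetical, and only the 31-day default comes from the article.

```python
from datetime import date, timedelta

DEFAULT_RETENTION_DAYS = 31  # the existing fixed window, used as the fallback


def cutoff_for(namespace, today, overrides):
    """Return the oldest day that is still retained for this namespace."""
    days = overrides.get(namespace, DEFAULT_RETENTION_DAYS)
    return today - timedelta(days=days)


def is_expired(namespace, row_day, today, overrides):
    """A row is expired when its day falls before the namespace's cutoff."""
    return row_day < cutoff_for(namespace, today, overrides)


# Hypothetical overrides: one tenant keeps years of data, another only days.
OVERRIDES = {"legal_audit": 3 * 365, "debug_logs": 3}
```

Unlike whole-partition drops, a rule like this has to be evaluated per namespace, so rows written on the same day can expire at different times.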
