How to Diagnose and Fix a CUBIC Congestion Control Bug in QUIC
Introduction
Congestion control is the backbone of reliable internet communication. CUBIC, the default congestion controller in Linux (RFC 9438), governs how TCP and QUIC connections probe for bandwidth and react to loss. When the app-limited exclusion logic from a Linux kernel fix was ported to Cloudflare's open-source QUIC implementation (quiche), a subtle bug came with it: after a congestion collapse event, the congestion window (cwnd) could become permanently stuck at its minimum. This guide walks you through identifying, understanding, and fixing that bug, a journey that ends with a near-one-line code change.

What You Need
- Basic understanding of TCP/QUIC congestion control algorithms (loss-based).
- Access to a QUIC implementation using CUBIC (e.g., quiche).
- A test environment capable of simulating heavy early packet loss (e.g., network emulator like netem).
- Familiarity with Rust (the language quiche is written in) and with the Linux kernel's congestion control logic in C.
- Debugging tools (logs, CWND monitoring, packet capture).
Step-by-Step Guide
Step 1: Recognize the Symptom – A Flaky Integration Test
Start by noticing a pattern: integration tests that evaluate CUBIC under heavy packet loss early in the connection fail unpredictably (61% of the time in the original bug). The failure mode is that the connection never recovers from a congestion collapse; throughput stays near zero. This is not a typical steady-state or growth-phase test. It stresses the corner case where cwnd reaches its minimum and should then recover. Logging shows cwnd remaining at the minimum (typically two packets; RFC 9002 recommends a minimum window of twice the maximum datagram size) for the rest of the connection.
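To turn the symptom into something you can check automatically, a small trace scanner helps. The sketch below is illustrative only: the trace format (pairs of elapsed seconds and cwnd in bytes), the function name, and the 1200-byte datagram size are assumptions, not quiche's actual logging output.

```python
from typing import Sequence, Tuple


def cwnd_pinned_at_floor(
    samples: Sequence[Tuple[float, int]],
    min_cwnd: int,
    tail_fraction: float = 0.5,
) -> bool:
    """Return True if every sample in the trailing part of the trace
    sits at or below min_cwnd, i.e. the window never recovered."""
    if not samples:
        return False
    start = int(len(samples) * (1.0 - tail_fraction))
    tail = samples[start:]
    return all(cwnd <= min_cwnd for _, cwnd in tail)


# Hypothetical traces: (seconds since start, cwnd in bytes).
MIN_CWND = 2 * 1200  # assumed floor: two 1200-byte datagrams
stuck = [(0.0, 12000), (0.1, 6000), (0.2, 2400), (0.5, 2400), (1.0, 2400), (2.0, 2400)]
healthy = [(0.0, 12000), (0.1, 6000), (0.2, 2400), (0.5, 4800), (1.0, 9600), (2.0, 14400)]

print("stuck trace pinned:", cwnd_pinned_at_floor(stuck, MIN_CWND))
print("healthy trace pinned:", cwnd_pinned_at_floor(healthy, MIN_CWND))
```

Checking only the tail of the trace keeps the initial collapse itself from counting as a failure; what matters is whether cwnd ever leaves the floor afterwards.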
Step 2: Refresh on CUBIC's Core Logic
CUBIC, like all loss-based congestion control algorithms (CCAs), increases cwnd while there is no loss (probing for bandwidth) and decreases it when loss is detected (on the assumption that capacity was exceeded). The congestion window is the sender's limit on outstanding bytes. After a loss event, CUBIC cuts cwnd and then grows it back along a cubic function of the time since the cut. If loss is severe (congestion collapse), cwnd may drop to its floor value. The bug prevented this recovery growth from ever happening.
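The cubic growth curve can be written down directly from RFC 9438: W_cubic(t) = C * (t - K)^3 + W_max, with K = cbrt((W_max - cwnd_epoch) / C). The sketch below evaluates it with the RFC's constants (C = 0.4, beta = 0.7). It is a simplified illustration, not quiche's code: windows are in segments, the function names are made up for this example, and the Reno-friendly region is left out.

```python
# Constants from RFC 9438; windows are in segments, time in seconds.
C = 0.4
BETA_CUBIC = 0.7


def cubic_k(w_max: float, cwnd_epoch: float) -> float:
    """Time (seconds) for the cubic curve to climb back to w_max."""
    # Clamp at zero so the cube root is always taken of a non-negative number.
    return (max(w_max - cwnd_epoch, 0.0) / C) ** (1.0 / 3.0)


def w_cubic(t: float, w_max: float, cwnd_epoch: float) -> float:
    """Target window t seconds into the current congestion-avoidance epoch."""
    k = cubic_k(w_max, cwnd_epoch)
    return C * (t - k) ** 3 + w_max


# Example: the window was 100 segments when loss hit, then was cut by beta.
w_max = 100.0
cwnd_epoch = w_max * BETA_CUBIC
for t in (0.0, 2.0, 4.0, 6.0):
    print(f"t={t:.0f}s  target={w_cubic(t, w_max, cwnd_epoch):.1f} segments")
```

The curve starts at the reduced window, flattens as it approaches W_max (the concave region), and accelerates again beyond it (the convex region). A controller that never evaluates this function after recovery stays wherever the loss left it.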
Step 3: Understand the App-Limited Exclusion (RFC 9438 §4.2-12)
RFC 9438 describes an “app-limited” exclusion: when the sender is limited by the application rather than by cwnd (e.g., the application is idle and has no data to send), CUBIC should not count that period toward window growth. A Linux kernel change implemented this properly for TCP. However, when the same logic was ported to QUIC, where app-limited state is tracked differently, it introduced a state machine bug: after a recovery period, the sender may still be falsely considered “app-limited”, which prevents cwnd from growing.
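One common way to express the exclusion is to skip window growth whenever the sender is not actually filling the window. The sketch below assumes that definition (bytes in flight below cwnd) and uses a simple Reno-style increment as a placeholder for the real CUBIC growth step; the function names are hypothetical and do not come from quiche or the Linux kernel.

```python
def is_app_limited(bytes_in_flight: int, cwnd: int) -> bool:
    """Assumed definition: the sender is not using the whole window."""
    return bytes_in_flight < cwnd


def on_ack(cwnd: int, bytes_in_flight: int, acked_bytes: int, mss: int) -> int:
    """Return the new cwnd after an ACK, skipping growth when app-limited."""
    if is_app_limited(bytes_in_flight, cwnd):
        # The window was not the bottleneck, so this ACK says nothing
        # about available capacity: leave cwnd unchanged.
        return cwnd
    # Placeholder growth step (Reno-style), standing in for CUBIC's update.
    return cwnd + mss * acked_bytes // cwnd


print(on_ack(cwnd=12000, bytes_in_flight=3000, acked_bytes=1200, mss=1200))
print(on_ack(cwnd=12000, bytes_in_flight=12000, acked_bytes=1200, mss=1200))
```

The gate is correct only as long as the app-limited signal is accurate. If it reads true while the sender is in fact window-limited, every ACK takes the early return and cwnd never moves.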
Step 4: Identify the Precise Bug – CWND Pinned at Minimum
The root cause lies in a condition that checks whether the flow is app-limited after a loss recovery. If the flag (e.g., app_limited) remains set incorrectly, CUBIC's growth function (convex or concave) is skipped, and cwnd stays at its lowest value indefinitely. This only surfaces when the connection experiences heavy loss early, because that triggers recovery, after which the flag is never cleared. In normal operation, app-limited status is cleared when new data is sent, but in this scenario the sender may not have pending data immediately after recovery because of how QUIC schedules stream data.

Step 5: Locate the Culprit Code – The One-Line Fix
In the quiche code (or similar CUBIC implementation), look for the function that handles congestion window growth (e.g., cubic_update). Find the condition that checks if the flow is app-limited. The bug is that after completing recovery, the app-limited flag is not reset to false. The fix: add a line that sets app_limited = false when the recovery state ends (e.g., after entering the “open” state). In Linux TCP, this reset happened implicitly via other logic, but in QUIC it was missing.
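The effect of the missing reset can be reproduced in a toy model. Everything in the sketch below is hypothetical: the class, its fields, and the slow-start-like growth step are stand-ins chosen for illustration, not quiche's actual data structures. The only thing it aims to show is how a stale flag pins cwnd and how clearing it on recovery exit releases it.

```python
from dataclasses import dataclass

MSS = 1200            # assumed datagram size in bytes
MIN_CWND = 2 * MSS    # assumed window floor


@dataclass
class ToyController:
    clear_on_recovery_exit: bool      # True models the fix
    cwnd: int = 10 * MSS
    in_recovery: bool = False
    app_limited: bool = False

    def on_collapse(self) -> None:
        """Heavy loss: window drops to the floor and recovery starts."""
        self.cwnd = MIN_CWND
        self.in_recovery = True
        # Assumed trigger: nothing was pending when recovery began.
        self.app_limited = True

    def on_recovery_exit(self) -> None:
        self.in_recovery = False
        if self.clear_on_recovery_exit:
            self.app_limited = False  # the hypothetical one-line fix

    def on_ack(self, acked_bytes: int) -> None:
        if self.in_recovery or self.app_limited:
            return                    # growth skipped
        self.cwnd += acked_bytes      # stand-in for CUBIC growth


def run(clear_on_recovery_exit: bool) -> int:
    cc = ToyController(clear_on_recovery_exit)
    cc.on_collapse()
    cc.on_recovery_exit()
    for _ in range(5):
        cc.on_ack(MSS)
    return cc.cwnd


print("without reset:", run(False))
print("with reset:   ", run(True))
```

In a real implementation the flag would be set again as soon as the sender is genuinely app-limited, so clearing it at the recovery boundary only removes the stale value.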
Step 6: Apply the Fix and Test
Add the one line of code that clears the app-limited flag upon entering the open state from recovery. Recompile your QUIC implementation. Re-run the same integration test that previously had a 61% failure rate, enough times to be confident in the result. Confirm that the test now passes consistently (100% success). Monitor the cwnd logs: after a congestion collapse event, cwnd should now follow the cubic growth curve and climb back toward a normal level instead of staying flat at the floor.
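Because the original failure was intermittent, a single green run proves little. A repeat-and-count harness along these lines gives a failure rate you can compare before and after the change. The two test functions here are deterministic stand-ins invented for this example; in practice you would replace them with a function that launches your real integration test and returns True on success.

```python
from typing import Callable


def failure_rate(run_once: Callable[[int], bool], runs: int) -> float:
    """Run a pass/fail check `runs` times and return the fraction that failed."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    failures = sum(1 for i in range(runs) if not run_once(i))
    return failures / runs


# Hypothetical stand-ins for the real integration test.
def flaky_before_fix(run_index: int) -> bool:
    return run_index % 5 >= 3  # passes 2 runs out of every 5


def stable_after_fix(run_index: int) -> bool:
    return True


print(f"before: {failure_rate(flaky_before_fix, 100):.0%} failures")
print(f"after:  {failure_rate(stable_after_fix, 100):.0%} failures")
```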
Step 7: Verify Under Varying Conditions
Test edge cases: light loss, no loss, multiple loss events, app-limited scenarios (idle periods). Ensure that the fix does not break normal CUBIC behavior. Also test with real traffic if possible. The fix is minimal and should be safe, but always validate.
Tips and Best Practices
- Monitor cwnd closely: Add logging for cwnd and app-limited flags during development.
- Test corner cases: Many congestion control bugs hide outside steady state; stress early loss, idle periods, and recovery.
- Understand protocol differences: TCP and QUIC have different semantics for “app-limited”; don't blindly port code.
- Use RFC references: Re-read RFC 9438 Section 4.2-12 and related updates to avoid misinterpretation.
- Contribute upstream: If you find a similar bug in an open-source project, share your fix.
- Check for similar bugs: The same pattern may appear in other CUBIC ports (e.g., FreeBSD, Windows).
This guide is based on a real bug discovered in Cloudflare's quiche. The lesson: even a well-tested algorithm like CUBIC can have hidden behaviors when ported across transport protocols. Always test in the exact environment where the code will run.