A small tweak to the Linux kernel's congestion control logic ended up breaking an entire QUIC implementation. At Cloudflare, our open-source QUIC library, quiche, relies on the CUBIC congestion controller, the same algorithm behind most TCP connections on the internet. When a kernel optimization aimed at fixing a TCP issue was ported to quiche, it triggered a bizarre bug: after a congestion collapse, the congestion window (cwnd) got permanently stuck at its minimum value. This article unpacks the hunt for this elusive bug, from intermittent test failures to the near-one-line fix that resolved it. Here are five things to know about how it happened and how it was fixed.
1. Understanding CUBIC: The Default Linux Congestion Controller
CUBIC, standardized in RFC 9438, is the default congestion control algorithm in Linux. It governs how most TCP and QUIC connections probe for available bandwidth, detect loss, and recover. At its core, CUBIC adjusts the congestion window (cwnd): a cap on the number of bytes that can be in flight, i.e. sent but not yet acknowledged. A larger cwnd allows more data per round trip; a smaller cwnd throttles the sender. Like other loss-based algorithms, CUBIC grows cwnd while the network appears healthy and shrinks it when loss occurs, aiming to maximize throughput without overwhelming the path. This works well in steady state, but unusual regimes, such as recovery after severe loss, can expose hidden corner cases, as we'll see.
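Concretely, RFC 9438 defines the window growth curve as W_cubic(t) = C*(t - K)^3 + W_max, where W_max is the window size at the last loss event and K is the time the curve takes to climb back to W_max. A minimal sketch of that curve (in Rust, with abstract window units; illustrative, not quiche's implementation):

```rust
// Illustrative sketch of CUBIC's window growth curve from RFC 9438,
// not quiche's actual code. Window sizes are in abstract units.
const C: f64 = 0.4; // CUBIC scaling constant (RFC 9438)
const BETA_CUBIC: f64 = 0.7; // multiplicative decrease factor

/// Target congestion window `t` seconds after the last loss event,
/// given `w_max`, the window size when that loss was detected.
fn w_cubic(t: f64, w_max: f64) -> f64 {
    // K: time for the curve to return to w_max after backing off.
    let k = (w_max * (1.0 - BETA_CUBIC) / C).cbrt();
    C * (t - k).powi(3) + w_max
}

fn main() {
    let w_max = 100.0;
    // Immediately after loss, the target is beta * w_max...
    assert!((w_cubic(0.0, w_max) - BETA_CUBIC * w_max).abs() < 1e-6);
    // ...and the curve returns to w_max after K seconds.
    let k = (w_max * (1.0 - BETA_CUBIC) / C).cbrt();
    assert!((w_cubic(k, w_max) - w_max).abs() < 1e-6);
}
```

The flat region of the cubic around t = K is what makes CUBIC cautious near the window size where loss last occurred, and aggressive again once it has probed past it.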

2. The Mysterious Test Failures: 61% Failure Rate
Our investigation began when the ingress proxy integration test pipeline started failing 61% of the time. The failing tests simulated heavy packet loss early in a QUIC connection using CUBIC. Congestion controllers are usually exercised in steady-state growth, but this scenario drove the algorithm down to its minimum cwnd after a congestion collapse. The bug left cwnd permanently pinned at that minimum, never recovering even after the simulated network improved. Such a failure is rare but critical: a congestion controller's whole job is to recover after collapse, and this one could not whenever a specific corner case was hit. The intermittent failures hinted at a race condition or state machine issue in quiche's CUBIC implementation.
3. The Linux Kernel Change That Broke Things
The root cause traces back to a Linux kernel change that aligned CUBIC with RFC 9438's treatment of application-limited flows. If the sender is application-limited (not sending a full window's worth of data), the congestion window should not be increased, because the current window has never been fully used or validated. The kernel fix added a check for this app-limited state to avoid growing cwnd improperly. That made sense for TCP, but when the same logic was ported to quiche's QUIC implementation, it interacted poorly with QUIC's different acknowledgment handling. In QUIC, acknowledgments can be delayed or paced differently, which caused the app-limited detection to fire incorrectly after a congestion event. As a result, cwnd stayed at its minimum: the growth path was gated behind a check that never let it run.
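A hypothetical sketch of how such a guard can misfire (the names and structure here are ours, not quiche's internals): if "app-limited" is inferred from bytes in flight at ACK time, a sender climbing out of a collapse can look app-limited on every ACK, even though it has plenty of queued data.

```rust
// Hypothetical sketch, not quiche's internals: an app-limited guard
// inferred from in-flight bytes at the moment an ACK is processed.
struct Cubic {
    cwnd: usize,
    bytes_in_flight: usize,
}

impl Cubic {
    /// Only grow cwnd if the sender is actually using the whole window.
    fn is_cwnd_limited(&self) -> bool {
        self.bytes_in_flight >= self.cwnd
    }

    fn on_ack(&mut self, acked: usize) {
        self.bytes_in_flight = self.bytes_in_flight.saturating_sub(acked);
        // The ported guard: skip growth when the flow looks app-limited.
        // After a collapse, pacing can keep bytes_in_flight just under
        // cwnd, so this branch fires on every ACK and cwnd never grows.
        if !self.is_cwnd_limited() {
            return;
        }
        self.cwnd += acked; // simplified growth step
    }
}

fn main() {
    let mut cc = Cubic { cwnd: 2400, bytes_in_flight: 2399 };
    for _ in 0..10 {
        cc.on_ack(100);
        cc.bytes_in_flight += 100; // paced sending keeps flight < cwnd
    }
    // cwnd is stuck at its floor despite the sender having data queued.
    assert_eq!(cc.cwnd, 2400);
}
```

The point of the sketch is the ordering hazard: the guard samples state at ACK time, and whether it trips depends on when ACKs arrive relative to sends.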
4. Why QUIC Exposed the Bug Differently Than TCP
TCP and QUIC handle acknowledgments and congestion control differently. In TCP, the kernel has fine-grained control over packet transmission and ACK processing, while QUIC runs in userspace and does its own pacing and ACK handling. The app-limited check assumed ACK timing that matched TCP's in-kernel stack, but in QUIC, ACKs can arrive less predictably, especially under loss. When the bug manifested, quiche's CUBIC state machine concluded the connection was still app-limited even after data started flowing again, which blocked cwnd growth. That discrepancy is why the bug was invisible in TCP tests but surfaced in QUIC. The fix required rethinking how quiche detects app-limited periods without relying on TCP-specific assumptions.
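One timing-robust alternative, sketched below, is to record the app-limited state at send time rather than inferring it from in-flight bytes at ACK time. This is in the spirit of BBR-style app-limited tracking and is purely illustrative; it is not quiche's actual API.

```rust
// Illustrative sketch (assumed names, not quiche's API): mark packets
// sent during an app-limited period, instead of guessing from flight
// size when the ACK arrives.
struct Sender {
    // Highest packet number sent while the application had no more data.
    app_limited_until: Option<u64>,
}

impl Sender {
    fn on_packet_sent(&mut self, pkt_num: u64, app_had_more_data: bool) {
        if !app_had_more_data {
            self.app_limited_until = Some(pkt_num);
        }
    }

    /// An ACK only counts as app-limited if it covers a packet sent
    /// during the limited period; later ACKs clear the state.
    fn ack_is_app_limited(&mut self, acked_pkt: u64) -> bool {
        match self.app_limited_until {
            Some(limit) if acked_pkt <= limit => true,
            Some(_) => {
                self.app_limited_until = None;
                false
            }
            None => false,
        }
    }
}

fn main() {
    let mut s = Sender { app_limited_until: None };
    s.on_packet_sent(1, false); // packet 1 sent while app-limited
    assert!(s.ack_is_app_limited(1)); // its ACK is marked app-limited
    assert!(!s.ack_is_app_limited(2)); // a later ACK clears the state
    assert!(!s.ack_is_app_limited(1));
}
```

Because the decision is captured when the packet leaves the sender, it no longer depends on how the peer delays or aggregates its ACKs.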

5. The Fix: A Near-One-Line Solution That Restored Stability
The happy ending came in the form of an elegant, near-one-line fix. Engineers realized that the problem was a state machine transition that permanently locked cwnd at its minimum when certain flags were set after a congestion collapse. By removing a single condition that incorrectly prevented cwnd increase during recovery, the bug disappeared. The fix broke the loop: after loss, the congestion window could now grow again once ACKs confirmed new data delivery. Testing showed a dramatic reduction in failures: from 61% down to 0%. This simple change highlights how a small kernel optimization can have unintended consequences when ported across protocols. The story serves as a reminder to test edge cases in congestion control—especially the recovery from collapse—and to question assumptions from TCP when implementing in QUIC.
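A toy model (assumed names and constants, not quiche code) of the post-fix behavior we want: severe loss drives cwnd down to its floor, and once growth is no longer gated by the stale condition, subsequent ACKs lift it again.

```rust
// Toy model, not quiche code: after a collapse pins cwnd at its floor,
// ACKs of newly delivered data must be able to grow it again.
const MIN_CWND: usize = 2 * 1200; // assumed floor: 2 packets of 1200 bytes

struct Cc {
    cwnd: usize,
}

impl Cc {
    fn on_congestion_event(&mut self) {
        // Multiplicative decrease with beta = 0.7, clamped to the floor.
        self.cwnd = (self.cwnd * 7 / 10).max(MIN_CWND);
    }

    fn on_ack(&mut self, acked: usize) {
        // Post-fix: growth is no longer blocked by the stale condition.
        self.cwnd += acked;
    }
}

fn main() {
    let mut cc = Cc { cwnd: 80_000 };
    for _ in 0..20 {
        cc.on_congestion_event(); // severe loss: congestion collapse
    }
    assert_eq!(cc.cwnd, MIN_CWND); // pinned at the floor
    cc.on_ack(1200); // new data acknowledged
    assert!(cc.cwnd > MIN_CWND); // ...and cwnd recovers
}
```

The pre-fix behavior was exactly this model minus the growth step: `on_ack` returned early, so the first assertion held forever and the window never left the floor.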
In conclusion, the CUBIC bug discovered in Cloudflare's quiche library underscores the complexity of porting network algorithms between protocols. A well-intentioned Linux kernel fix for TCP introduced a subtle side effect in QUIC that permanently pinned the congestion window after severe loss. The solution was surprisingly simple—a single line change—but finding it required deep understanding of both CUBIC's state machine and QUIC's unique ACK dynamics. This experience has made our testing more robust and our implementation more resilient. For anyone working on congestion control, the lesson is clear: always test the uncommon regimes, because that's where bugs hide.