Cloudflare QUIC Death Spiral: How a Linux Kernel CUBIC Idle Optimization Broke QUIC Connections

Published: 2026-05-13 • Reading: 12 min • Tags: QUIC, Linux Kernel, CUBIC, Cloudflare, Congestion Control, Network Performance, Kernel Bug

In May 2026, Cloudflare engineers published a deep investigation into a bizarre failure mode affecting their quiche QUIC implementation. Their integration tests were failing 61% of the time — not due to packet loss or hardware failure, but because a Linux kernel optimization designed to handle idle periods was actively breaking QUIC connections.

The result was a genuine death spiral: CUBIC's congestion window (cwnd) got permanently pinned at the minimum floor of 2700 bytes (two full-sized packets), oscillating between congestion recovery and avoidance states 999 times in just 6.7 seconds — roughly one transition per round-trip time. This article breaks down exactly what happened, why it matters, and what it teaches us about the fragility of porting kernel-level network optimizations to userspace protocols like QUIC.

The Symptom: A Test That Fails 61% of the Time

The investigation began when Cloudflare's ingress proxy integration test pipeline started showing erratic failures. The test setup was straightforward:

Quiche HTTP/3 client and server running on localhost
RTT = 10ms
A 10 MB file download over HTTP/3 using CUBIC congestion control
30% random packet loss injected during the first two seconds
After two seconds, zero packet loss
10-second timeout to complete the download (expected: 4-5 seconds)

The expected behavior was simple: CUBIC would reduce its cwnd during the loss phase, then steadily ramp back up once loss stopped. Instead, around 60% of test runs failed to complete within the generous 10-second timeout. The same test with Reno congestion control passed 100% of the time.

The Anomaly: 999 State Transitions With Zero Loss

Cloudflare instrumented quiche's qlog output to visualize what was happening inside the congestion controller. The results were startling: after the 2-second loss period ended (with zero packet loss thereafter), the connection entered a rapid oscillation between congestion avoidance (normal operation) and recovery (loss recovery mode) — 999 transitions in approximately 6.7 seconds.

Dividing 6.7 seconds by 999 transitions gives roughly 6.7ms per transition, or about 13 to 14ms for each full cycle into recovery and back out, suspiciously close to the connection's 10ms RTT. Throughout this entire period, cwnd was locked at 2700 bytes: two full-size packets. The congestion window simply never recovered.

The key clue was the oscillation period. Because this was a download (server-to-client), ACKs traveled client-to-server once per round trip. Every time those ACKs landed, bytes_in_flight dropped to zero, the server sent its next two-packet burst, and the cycle repeated. The death spiral was locked to the ACK clock itself.
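A quick back-of-the-envelope calculation shows why a cwnd pinned at 2700 bytes cannot finish the download in time. The sketch below assumes that "10 MB" means 10,000,000 bytes and that exactly one congestion window is delivered per 10ms round trip; the function names are illustrative and not part of quiche.

```rust
use std::time::Duration;

/// Upper bound on throughput when at most `cwnd_bytes` can be delivered
/// per round trip (assumption: one full window per RTT, no other limits).
fn throughput_ceiling_bytes_per_sec(cwnd_bytes: u64, rtt: Duration) -> f64 {
    cwnd_bytes as f64 / rtt.as_secs_f64()
}

/// Minimum time needed to move `file_bytes` at that ceiling.
fn min_transfer_time(file_bytes: u64, cwnd_bytes: u64, rtt: Duration) -> Duration {
    let rate = throughput_ceiling_bytes_per_sec(cwnd_bytes, rtt);
    Duration::from_secs_f64(file_bytes as f64 / rate)
}
```

Under these assumptions the ceiling is 270,000 bytes per second, so the transfer needs about 37 seconds, far beyond the 10-second timeout.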

Root Cause: When "Idle" Isn't Idle

The 2017 Linux Kernel Change

To understand the bug, we need to go back to 2017. A Linux kernel commit by Eric Dumazet, Yuchung Cheng, and Neal Cardwell addressed a real problem in TCP CUBIC: what happens when an application goes idle and then resumes sending?

CUBIC's growth function W_cubic(delta_t) is parameterized by delta_t = now - epoch_start. If the application is idle for seconds or minutes, delta_t becomes huge, producing an absurdly large target window. The fix was elegant: shift the epoch forward by the idle duration rather than resetting it — preserving the growth curve shape while accounting for the idle gap.
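To make the effect of a large delta_t concrete, here is a minimal sketch of the textbook CUBIC window function, W_cubic(t) = C * (t - K)^3 + W_max, using the constants from the CUBIC RFC (C = 0.4, beta = 0.7). It works in segments and floating-point seconds for readability, which is an assumption of this example; the Linux kernel uses scaled integer arithmetic, and the helper target_after_idle is hypothetical.

```rust
const C: f64 = 0.4; // CUBIC scaling constant
const BETA: f64 = 0.7; // multiplicative decrease factor

/// W_cubic(t) = C * (t - K)^3 + W_max, in segments, with t in seconds
/// since the start of the current epoch.
fn w_cubic(t_secs: f64, w_max: f64) -> f64 {
    // K is the time the curve needs to climb back to W_max.
    let k = (w_max * (1.0 - BETA) / C).cbrt();
    C * (t_secs - k).powi(3) + w_max
}

/// Target window at time `now` for an epoch that began at `epoch_start`
/// and contained `idle` seconds of application idle time.
/// With `shift_epoch` set, the epoch is moved forward by the idle time.
fn target_after_idle(now: f64, epoch_start: f64, idle: f64, w_max: f64, shift_epoch: bool) -> f64 {
    let epoch = if shift_epoch { epoch_start + idle } else { epoch_start };
    w_cubic(now - epoch, w_max)
}
```

With W_max = 100 segments, one second of activity, and a 60-second pause, the unshifted target lands in the tens of thousands of segments, while the shifted target is still below W_max.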

The kernel does this via the CA_EVENT_TX_START callback, triggered when bytes_in_flight transitions from 0 to non-zero after an idle period. In TCP, this is reliable because the kernel tracks all socket state internally.

The Port to QUIC: Where It Breaks

When CUBIC was ported to quiche, this idle-period adjustment was included. But QUIC runs in userspace — there's no kernel-level CA_EVENT_TX_START callback. Instead, the quiche implementation checks for the idle condition inside on_packet_sent():

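// cubic.rs - on_packet_sent() (simplified)
fn on_packet_sent(&mut self, bytes_in_flight: usize, now: Instant, ...) {
    // Nothing in flight is taken to mean the sender has been idle.
    if bytes_in_flight == 0 {
        // Time since the previous send, treated as the idle duration.
        let delta = now - self.last_sent_time;
        // Shift the recovery epoch forward by that duration.
        self.congestion_recovery_start_time += delta;
    }
    self.last_sent_time = now;
}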

This check, if bytes_in_flight == 0, is the trap. In kernel TCP, the corresponding event is meant to signal a genuine idle period. In quiche, once cwnd collapses to the minimum (2 packets), every incoming ACK drives bytes_in_flight to zero, so the first send of every burst triggers the "idle adjustment" and pushes congestion_recovery_start_time forward, into the future, on every round trip.

The Followup Fix That Never Made It

A second kernel commit about a week after the original fix acknowledged this issue:

"tcp_cubic: do not set epoch_start in the future. Tracking idle time in bictcp_cwnd_event() is imprecise, as epoch_start is normally set at ACK processing time, not at send time."

The kernel fix was to clamp epoch_start so it's never pushed into the future. But the quiche port inherited the original buggy behavior — pushing recovery_start_time forward on every send, creating the death spiral oscillation.
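The clamp described in the follow-up commit can be sketched in a few lines. This is a Rust illustration of the idea, not the kernel's C code, and the function name shift_epoch_clamped is hypothetical.

```rust
use std::time::{Duration, Instant};

/// Shift `epoch_start` forward by the idle gap, but never past `now`.
fn shift_epoch_clamped(epoch_start: Instant, idle: Duration, now: Instant) -> Instant {
    // Instant is totally ordered, so min() picks the earlier of the two.
    (epoch_start + idle).min(now)
}
```

If the shifted epoch would land after the current time, the function returns the current time instead, so the elapsed time fed into the CUBIC curve can never be negative.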

The Death Spiral Mechanism

Here's the complete chain of events:

cwnd collapses to minimum floor (2 packets = 2700 bytes) during early loss phase
Server sends 2 packets, bytes_in_flight = 2700
Client receives packets, sends ACK
ACK arrives at server, application processes it, reads data from socket
bytes_in_flight drops to 0 between ACK processing and next send
Server has more data to send, calls on_packet_sent()
bytes_in_flight == 0 check triggers: congestion_recovery_start_time += delta
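This pushes the recovery start time forward, past the send time of the new burst
Packets sent at or before congestion_recovery_start_time count as part of the recovery period, so the connection is reported as in recovery and their ACK does not grow cwnd
The next ACK arrives, the state flips back to congestion avoidance, and bytes_in_flight drains to 0 again
The server sends another 2 packets with bytes_in_flight == 0, which triggers the idle adjustment again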
Repeat 999 times

This only triggers when all three conditions align: (1) cwnd is at its minimum floor, (2) the application always has data ready to send, and (3) every ACK drains bytes_in_flight to zero. Outside this regime, bytes_in_flight rarely reaches zero between sends, so the bug stays invisible in most production scenarios.
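The chain of events can be reproduced with a toy model. The sketch below is not quiche code: the Sender struct is hypothetical, each two-packet burst is collapsed into a single send event, and recovery membership is decided the way RFC 9002's pseudocode does it (a packet sent at or before the recovery start time belongs to the recovery period, and its ACK does not grow cwnd).

```rust
use std::time::{Duration, Instant};

/// Hypothetical, stripped-down sender state; field names follow the snippet above.
struct Sender {
    last_sent_time: Instant,
    congestion_recovery_start_time: Instant,
}

impl Sender {
    /// Buggy idle adjustment: any send with nothing in flight counts as "idle".
    fn on_packet_sent(&mut self, bytes_in_flight: usize, now: Instant) {
        if bytes_in_flight == 0 {
            let delta = now - self.last_sent_time;
            self.congestion_recovery_start_time += delta;
        }
        self.last_sent_time = now;
    }

    /// A packet sent at or before the recovery start is part of recovery.
    fn in_congestion_recovery(&self, sent_time: Instant) -> bool {
        sent_time <= self.congestion_recovery_start_time
    }
}

/// Runs `rounds` ACK-clocked round trips at minimum cwnd and returns how many
/// of them ended with the ACKed burst still classified as "in recovery".
fn stalled_rounds(rounds: u32, rtt: Duration) -> u32 {
    let start = Instant::now();
    // Loss is detected one RTT after the last send; recovery starts then.
    let mut s = Sender {
        last_sent_time: start,
        congestion_recovery_start_time: start + rtt,
    };
    let mut now = start + rtt;
    let mut stalled = 0;
    for _ in 0..rounds {
        let sent_time = now;
        s.on_packet_sent(0, now); // the previous ACK drained bytes_in_flight to 0
        now += rtt; // the ACK for this burst arrives one RTT later
        if s.in_congestion_recovery(sent_time) {
            stalled += 1;
        }
    }
    stalled
}
```

In this model every send shifts the recovery start by one RTT, so it always stays ahead of the burst that was just sent, and every round trip ends without cwnd growth.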

The Fix: A Nearly One-Line Change

The fix was elegant in its simplicity. Rather than treating every send with bytes_in_flight == 0 as the end of an idle period (which can happen on every round trip when cwnd is tiny), the fix adds a minimum threshold: only treat the connection as "idle" if the gap since the last send exceeds a minimum duration (e.g., a small multiple of the RTT):

// Fixed version — only adjust if genuinely idle
fn on_packet_sent(&mut self, bytes_in_flight: usize, now: Instant, ...) {
    if bytes_in_flight == 0 {
        let delta = now - self.last_sent_time;
        // Only apply if idle gap exceeds minimum threshold
        if delta > self.min_idle_threshold {
            self.congestion_recovery_start_time += delta;
        }
    }
    self.last_sent_time = now;
}

This breaks the death spiral: when cwnd is at its minimum and the sender is ACK-clocked, the delta between one burst and the next is about one RTT, which stays below a threshold set to a small multiple of the RTT. The congestion recovery start time stays in the past, CUBIC's state machine stays in congestion avoidance, and cwnd can grow normally.
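The threshold guard can be checked in isolation. The following sketch is a free-standing variant of the snippet above, written as a pure function so it is easy to test; adjust_recovery_start and min_idle_threshold are hypothetical names, and setting the threshold to three RTTs in the usage note is an assumption, not a value taken from quiche.

```rust
use std::time::{Duration, Instant};

/// Returns the recovery start time, shifted by the idle gap only when the
/// gap since the last send exceeds `min_idle_threshold`.
fn adjust_recovery_start(
    recovery_start: Instant,
    last_sent_time: Instant,
    now: Instant,
    bytes_in_flight: usize,
    min_idle_threshold: Duration,
) -> Instant {
    if bytes_in_flight == 0 {
        // saturating_duration_since returns zero instead of panicking
        // if `now` is earlier than `last_sent_time`.
        let delta = now.saturating_duration_since(last_sent_time);
        if delta > min_idle_threshold {
            return recovery_start + delta;
        }
    }
    recovery_start
}
```

With a 10ms RTT and a threshold of three RTTs, a gap of one RTT leaves the recovery start untouched, while a one-second pause shifts it by the full second.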

Alternative approaches considered by the engineers included:

Tracking a separate idle timer independent of bytes_in_flight
Moving the check to ACK processing time (matching the kernel's approach more closely)
Simply clamping recovery_start_time to never exceed now

Lessons for High-Performance Network Services

This bug offers several important lessons for anyone building or operating high-performance network services:

1. Kernel Assumptions Don't Survive Userspace

The kernel has visibility into socket state that userspace simply doesn't. CA_EVENT_TX_START is reliable in kernel TCP because the kernel tracks all socket state transitions. QUIC in userspace works with a different timing model — the gap between ACK processing and the next send can be microseconds, and bytes_in_flight == 0 is a normal transient state, not a signal of idleness.

2. Protocol Differences Matter at the Microsecond Scale

QUIC's multiplexing over a single connection means that even when one stream is flow-controlled, another stream may have data ready to send. This changes the idle semantics entirely compared to TCP, where a single stream's idle period is unambiguous.

3. The Congestion Collapse Recovery Regime Is Poorly Tested

Most congestion control testing exercises steady-state behavior. The regime where cwnd has collapsed to minimum floor and must climb back out is rarely tested end-to-end — but it's exactly the regime that matters during recovery. As Cloudflare demonstrated, bugs here are invisible in throughput dashboards and only surface under deliberate stress testing.

4. CPU Idle Management and Network Performance

While this specific bug was algorithmic, the broader class of issues it represents — where kernel idle optimizations interact poorly with network protocol behavior — includes well-known problems with CPU C-states. Modern x86 CPUs enter deep sleep states (C6, C7, C8) that can have wake-up latencies of 100-400 microseconds. For latency-sensitive protocols like QUIC, where the difference between a successful retransmission and a timeout is measured in milliseconds, deep C-state entry can trigger cascading failures:

CPU enters C-state during brief network idle
Incoming packet arrives → wake-up latency adds 100-400 µs
Packet processing delayed → QUIC PTO (Probe Timeout) logic misfires
Retransmissions triggered → more CPU work → more C-state transitions
Between these short bursts of work the cores go idle again and may drop back into deep C-states
Feedback loop: more retransmissions at higher latency → worse throughput → more C-state transitions

For high-performance QUIC deployments, engineers commonly tune kernel idle parameters:

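# Disable deep C-states for network-facing cores
# On GRUB command line (takes effect after reboot):
intel_idle.max_cstate=1 processor.max_cstate=1

# Or via sysfs at runtime (intel_idle's max_cstate parameter is read-only
# after boot, so disable individual states; numbering varies by platform):
for f in /sys/devices/system/cpu/cpu*/cpuidle/state[2-9]/disable; do echo 1 > "$f"; done

# Set the CPU frequency governor to performance
cpupower frequency-set -g performance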

These adjustments keep the CPU out of the deep C-states whose wake-up latency can disturb QUIC's timing-sensitive loss detection, at the cost of higher idle power draw.
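Before disabling anything, it helps to see which idle states a machine actually exposes and what exit latency the kernel reports for each. The sketch below reads the name and latency files under a cpuidle sysfs directory such as /sys/devices/system/cpu/cpu0/cpuidle; the function names and the latency budget are illustrative assumptions, and the directory is absent on systems without a cpuidle driver.

```rust
use std::fs;
use std::path::Path;

/// Reads (state name, exit latency in microseconds) pairs from a cpuidle
/// directory. Returns an empty list if the directory is missing or unreadable.
fn read_idle_states(cpuidle_dir: &Path) -> Vec<(String, u64)> {
    let mut states = Vec::new();
    let entries = match fs::read_dir(cpuidle_dir) {
        Ok(entries) => entries,
        Err(_) => return states,
    };
    for entry in entries.flatten() {
        let dir = entry.path();
        let name = fs::read_to_string(dir.join("name"));
        let latency = fs::read_to_string(dir.join("latency"));
        // Skip entries that lack either file or hold a non-numeric latency.
        if let (Ok(name), Ok(latency)) = (name, latency) {
            if let Ok(us) = latency.trim().parse::<u64>() {
                states.push((name.trim().to_string(), us));
            }
        }
    }
    states.sort_by_key(|(_, us)| *us);
    states
}

/// Names of states whose exit latency exceeds `budget_us`.
fn states_over_budget(states: &[(String, u64)], budget_us: u64) -> Vec<&str> {
    states
        .iter()
        .filter(|(_, us)| *us > budget_us)
        .map(|(name, _)| name.as_str())
        .collect()
}
```

The states returned by states_over_budget are the candidates for the per-state disable files; which budget is appropriate depends on the RTTs the service has to handle.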

Broader Implications for QUIC Performance Optimization

The QUIC death spiral highlights a fundamental tension in modern network protocol design. QUIC was designed to be userspace-friendly — running in application space rather than the kernel gives deployability and iteration speed. But it also loses the tight integration with kernel subsystems (congestion control, scheduler, idle management) that TCP enjoyed.

For production QUIC deployments, this means:

CPU pinning — Dedicate cores to QUIC processing, prevent them from entering deep C-states
Busy polling — Use SO_BUSY_POLL to reduce wake-up latency on network sockets
IRQ affinity — Bind network IRQs to the same cores handling QUIC processing
Congestion control tuning — Test CUBIC's edge cases with synthetic loss patterns
Userspace networking — Consider DPDK or XDP for extreme latency sensitivity

Conclusion

Cloudflare's QUIC death spiral is a cautionary tale about the gap between kernel and userspace networking. A well-intentioned 2017 kernel optimization for CUBIC's idle handling — perfectly correct in TCP — became an active liability when ported to QUIC's userspace implementation, because the semantics of "idle" are fundamentally different between the two environments.

The fix itself was simple: a threshold check to distinguish genuine idle periods from the normal transient state of a congested connection flushing and refilling its cwnd. But discovering the bug required deliberate stress testing of a rarely-visited corner of the congestion control state space, and the analysis revealed subtle interactions between the CUBIC algorithm, QUIC's userspace timing, and the ACK-clocked nature of window-based congestion control.

For the broader community, this bug serves as a reminder that network protocol implementations carry implicit assumptions about their execution environment. Porting kernel networking code to userspace — whether for QUIC or any other protocol — requires careful re-examination of every assumption about timing, state visibility, and the definition of "idle."

The death spiral is closed. But the lessons it teaches about the fragility of kernel-to-userspace protocol porting will remain relevant for as long as network protocols migrate from kernel modules to application-space implementations.

Reference: When "idle" isn't idle: how a Linux kernel optimization became a QUIC bug — Cloudflare Blog
