Skip to content

cloudwatchlogs: fix lost wakeup that leaves the force-flush timer dead#2166

Open
musa-asad wants to merge 2 commits into
aws:mainfrom
musa-asad:fix/flush-timer-lost-wakeup
Open

cloudwatchlogs: fix lost wakeup that leaves the force-flush timer dead#2166
musa-asad wants to merge 2 commits into
aws:mainfrom
musa-asad:fix/flush-timer-lost-wakeup

Conversation

@musa-asad

@musa-asad musa-asad commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Background

The CloudWatch Agent tails log files and publishes them to CloudWatch Logs. Each monitored file gets its own stream with a per-stream force-flush timer. This timer fires every force_flush_interval (default 5s) to send whatever events have accumulated in the batch, even if the batch has not reached the 1MB size limit. Without it, low-volume streams would only flush when full -- potentially holding events for minutes or longer.

The timer is a one-shot: after it fires, something must reset ("rearm") it for the next cycle. The rearm signal travels through an internal Go channel called resetTimerCh.

Key concepts

Term Meaning
Force-flush timer Per-stream timer that fires every force_flush_interval to flush pending events regardless of batch size.
Rearm Resetting the timer after it fires so it will fire again. Without rearming, the timer is spent and never fires again.
Unbuffered channel A Go pipe that transfers a message only when sender and receiver are ready simultaneously. If the receiver is not waiting, a non-blocking send is silently skipped.
Buffered channel (size 1) A Go pipe that holds one message even if no receiver is waiting yet. The next receiver picks it up whenever it is ready.

Problem

resetFlushTimer() sends the rearm signal via non-blocking select:

select {
case q.resetTimerCh <- struct{}{}:
default: // silently skipped if no receiver is ready
}

resetTimerCh is unbuffered. Two goroutines are involved: manageFlushTimer (owns the timer, listens on resetTimerCh) and the main loop (calls resetFlushTimer() after each send). The race:

  1. Timer fires. manageFlushTimer signals the main loop via flushCh.
  2. Main loop receives, calls send(), then calls resetFlushTimer().
  3. manageFlushTimer has not yet returned to its select -- it is still completing the flushCh send.
  4. resetTimerCh is unbuffered and manageFlushTimer is not listening. The rearm hits default: and is lost.
  5. manageFlushTimer returns to its select. Channel is empty. Timer is never rearmed.
sequenceDiagram
    participant TM as manageFlushTimer
    participant CH as resetTimerCh (unbuffered)
    participant S as main loop

    Note over TM: Timer fires
    TM->>S: flushCh signal ("flush now")
    Note over TM: Has not returned to select yet
    Note over S: send() completes, calls resetFlushTimer()
    S->>CH: try to send rearm signal
    Note over CH: Unbuffered: needs receiver ready NOW
    Note over CH: manageFlushTimer not listening
    CH--xS: DROPPED (hits default)
    Note over TM: Returns to select... channel empty
    Note over TM: Timer dead. Never fires again.
    Note over S: Events accumulate with no timer to flush them.
    Note over S: ~33 min later, batch hits 1MB size limit.
Loading

The batch sits silently until a size-flush or an agent restart. No error or warning is logged because the default: path produces no output.

Fix

Two changes:

1. Buffer resetTimerCh by 1.

// Before:
resetTimerCh: make(chan struct{})    // unbuffered -- rearm lost if receiver not ready

// After:
resetTimerCh: make(chan struct{}, 1) // buffer parks the rearm until receiver returns

One pending rearm is sufficient. The signal is idempotent: it always means "reset to force_flush_interval from now." If one is already buffered, a second is redundant and safely dropped by default:.

2. Add a rate-limited flush-stall warning.

The silent failure was unacceptable -- the agent gave no indication that a stream was stuck. Added maybeWarnFlushStalled() which emits a WARN when a batch has been pending for 2x the flush interval:

W! cloudwatchlogs flush stalled for (group/stream): oldest of 3 pending events
   has waited 12s (>= 2x flush_interval 5s) without being flushed;
   the flush timer may have failed to rearm.

Rate-limited to once per max(5 * flush_interval, 1 minute) per stream. With the buffer fix applied this warning should never fire. It exists as a safety net so that if a future change reintroduces a flush stall, the symptom surfaces immediately in the agent log.

Testing

Test What it verifies
TestResetTimerChIsBuffered cap(resetTimerCh) == 1. Fails deterministically on unbuffered.
TestForceFlushTimerRearmsAfterHeldSend Real queue/sender/batch: burst then held size-send then silence. Trailing batch force-flushes within a small multiple of flush interval.
TestForceFlushTimer_SingleLowVolumeEventFlushes Control: single event flushes at the expected interval (passes on both old and new code alike).
TestQueueFlushStallWarn Warn fires when batch stalls, respects the rate limit. Does not fire when flush is healthy.

Also validated end-to-end on a running agent with many low-volume file streams under logrotate copytruncate. Under identical scheduler stress, the unfixed build holds streams indefinitely. The fixed build flushes every stream within seconds.

Test evidence

Reverting the buffer fix causes the regression test to fail:

=== RUN   TestResetTimerChIsBuffered
    lostwakeup_test.go:35:
            Error:          Not equal:
                            expected: 1
                            actual  : 0
            Messages:       resetTimerCh must be buffered (cap 1) so a force-flush-timer
                            rearm issued while manageFlushTimer is outside its select is
                            parked, not dropped; cap=0 reintroduces the lost-wakeup bug
--- FAIL: TestResetTimerChIsBuffered (0.00s)

With the fix, all tests pass:

=== RUN   TestResetTimerChIsBuffered
--- PASS: TestResetTimerChIsBuffered (0.00s)
=== RUN   TestForceFlushTimerRearmsAfterHeldSend
REARM-RESULT trailing batch flushed 301ms after release (flushTimeout=300ms sends=2)
--- PASS: TestForceFlushTimerRearmsAfterHeldSend (0.74s)
=== RUN   TestForceFlushTimer_SingleLowVolumeEventFlushes
CONTROL-RESULT single-event flushed in 301ms (flushTimeout=300ms)
--- PASS: TestForceFlushTimer_SingleLowVolumeEventFlushes (0.30s)
=== RUN   TestQueueFlushStallWarn
--- PASS: TestQueueFlushStallWarn (0.00s)
PASS
ok      github.com/aws/amazon-cloudwatch-agent/plugins/outputs/cloudwatchlogs/internal/pusher   1.055s

@musa-asad musa-asad requested review from jefchien and okankoAMZ June 19, 2026 00:16
@musa-asad musa-asad self-assigned this Jun 19, 2026
@musa-asad musa-asad force-pushed the fix/flush-timer-lost-wakeup branch from 976f9fe to 950d65b Compare June 19, 2026 00:17
@musa-asad musa-asad added the ready for testing Indicates this PR is ready for integration tests to run label Jun 19, 2026
@musa-asad musa-asad marked this pull request as ready for review June 19, 2026 00:18
@musa-asad musa-asad requested a review from a team as a code owner June 19, 2026 00:18
… timer

Make resetTimerCh a buffered channel (cap 1) so a timer-reset signal is not
lost when the start() goroutine is busy, which could leave the force-flush
timer unarmed and delay flushing of a low-volume batch.

Also add a rate-limited warn log as a defense-in-depth signal: when a batch
has pending events but has not been flushed for at least 2x the configured
flush interval, the pusher emits a warn identifying the affected log group
and stream, the pending event count, and the batch age. The check is a
read-only observer on the event-arrival path (single-goroutine owned, no
locking) and is rate-limited so a persistently stalled stream logs at most
once per interval. Covered by TestQueueFlushStallWarn.
@musa-asad musa-asad force-pushed the fix/flush-timer-lost-wakeup branch from 95dff7b to 90c6a3d Compare June 19, 2026 22:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready for testing Indicates this PR is ready for integration tests to run

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant