cloudwatchlogs: fix lost wakeup that leaves the force-flush timer dead#2166
Open
musa-asad wants to merge 2 commits into
Open
cloudwatchlogs: fix lost wakeup that leaves the force-flush timer dead#2166musa-asad wants to merge 2 commits into
musa-asad wants to merge 2 commits into
Conversation
976f9fe to
950d65b
Compare
… timer Make resetTimerCh a buffered channel (cap 1) so a timer-reset signal is not lost when the start() goroutine is busy, which could leave the force-flush timer unarmed and delay flushing of a low-volume batch. Also add a rate-limited warn log as a defense-in-depth signal: when a batch has pending events but has not been flushed for at least 2x the configured flush interval, the pusher emits a warn identifying the affected log group and stream, the pending event count, and the batch age. The check is a read-only observer on the event-arrival path (single-goroutine owned, no locking) and is rate-limited so a persistently stalled stream logs at most once per interval. Covered by TestQueueFlushStallWarn.
95dff7b to
90c6a3d
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Background
The CloudWatch Agent tails log files and publishes them to CloudWatch Logs. Each monitored file gets its own stream with a per-stream force-flush timer. This timer fires every
force_flush_interval(default 5s) to send whatever events have accumulated in the batch, even if the batch has not reached the 1MB size limit. Without it, low-volume streams would only flush when full -- potentially holding events for minutes or longer.The timer is a one-shot: after it fires, something must reset ("rearm") it for the next cycle. The rearm signal travels through an internal Go channel called
resetTimerCh.Key concepts
force_flush_intervalto flush pending events regardless of batch size.Problem
resetFlushTimer()sends the rearm signal via non-blocking select:resetTimerChis unbuffered. Two goroutines are involved:manageFlushTimer(owns the timer, listens onresetTimerCh) and the main loop (callsresetFlushTimer()after each send). The race:manageFlushTimersignals the main loop viaflushCh.send(), then callsresetFlushTimer().manageFlushTimerhas not yet returned to itsselect-- it is still completing theflushChsend.resetTimerChis unbuffered andmanageFlushTimeris not listening. The rearm hitsdefault:and is lost.manageFlushTimerreturns to itsselect. Channel is empty. Timer is never rearmed.sequenceDiagram participant TM as manageFlushTimer participant CH as resetTimerCh (unbuffered) participant S as main loop Note over TM: Timer fires TM->>S: flushCh signal ("flush now") Note over TM: Has not returned to select yet Note over S: send() completes, calls resetFlushTimer() S->>CH: try to send rearm signal Note over CH: Unbuffered: needs receiver ready NOW Note over CH: manageFlushTimer not listening CH--xS: DROPPED (hits default) Note over TM: Returns to select... channel empty Note over TM: Timer dead. Never fires again. Note over S: Events accumulate with no timer to flush them. Note over S: ~33 min later, batch hits 1MB size limit.The batch sits silently until a size-flush or an agent restart. No error or warning is logged because the
default:path produces no output.Fix
Two changes:
1. Buffer
resetTimerChby 1.One pending rearm is sufficient. The signal is idempotent: it always means "reset to
force_flush_intervalfrom now." If one is already buffered, a second is redundant and safely dropped bydefault:.2. Add a rate-limited flush-stall warning.
The silent failure was unacceptable -- the agent gave no indication that a stream was stuck. Added
maybeWarnFlushStalled()which emits a WARN when a batch has been pending for 2x the flush interval:Rate-limited to once per
max(5 * flush_interval, 1 minute)per stream. With the buffer fix applied this warning should never fire. It exists as a safety net so that if a future change reintroduces a flush stall, the symptom surfaces immediately in the agent log.Testing
TestResetTimerChIsBufferedcap(resetTimerCh) == 1. Fails deterministically on unbuffered.TestForceFlushTimerRearmsAfterHeldSendTestForceFlushTimer_SingleLowVolumeEventFlushesTestQueueFlushStallWarnAlso validated end-to-end on a running agent with many low-volume file streams under logrotate copytruncate. Under identical scheduler stress, the unfixed build holds streams indefinitely. The fixed build flushes every stream within seconds.
Test evidence
Reverting the buffer fix causes the regression test to fail:
With the fix, all tests pass: