docs(dlq): help on-call find what's failing when a DLQ alert fires#210

Open
scotwells wants to merge 1 commit into main from docs/dlq-loki-triage

scotwells commented Jun 25, 2026

What this does

When a Dead Letter Queue (DLQ) alert fires, whoever's on call needs to find which activity policy is failing, and why, quickly. Today the runbooks send them to a `kubectl logs --tail | grep` command that only shows the last few minutes of logs from whichever pods happen to be running right now. The failures that set off these alerts are usually older than that, so the most useful clues are often already gone, and there was no guidance on how to search our log history instead.

This gives on-call a clear, copy-paste path to the right logs, so they can go from "an alert fired" to "this specific policy is broken on this field" in a couple of steps.

What changed

  • A new short guide that walks through, per DLQ alert, exactly what to look for and how to find it in Grafana — including how to tell a genuine backlog apart from the same few events looping, and a couple of gotchas that have caught people out before.
  • Each DLQ runbook now points to that guide and shows the right query for its alert, replacing the old "tail the logs and grep" step that misses older failures.
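
To illustrate the backlog-versus-looping distinction mentioned above, here is a minimal Python sketch. It is not taken from the guide: it assumes you have already exported the `auditID` value of each failure log line from Grafana, and the function name `summarize_failures` and the `loop_threshold` cutoff are hypothetical choices made for illustration only.

```python
from collections import Counter


def summarize_failures(audit_ids, loop_threshold=3):
    """Summarize DLQ failure lines by the event (auditID) they refer to.

    Many distinct auditIDs suggest a genuine backlog; a few auditIDs
    repeated many times suggest the same events being retried in a loop.
    loop_threshold is an illustrative cutoff, not a value from the guide.
    """
    counts = Counter(audit_ids)
    # IDs seen at least loop_threshold times are flagged as likely loops.
    looping = sorted(a for a, n in counts.items() if n >= loop_threshold)
    return {
        "total": sum(counts.values()),
        "distinct": len(counts),
        "looping_ids": looping,
    }


if __name__ == "__main__":
    # Five failure lines, but only two distinct events: "a" is looping.
    print(summarize_failures(["a", "a", "a", "a", "b"]))
```

If `total` is much larger than `distinct`, the alert is more likely driven by a few events retrying than by a growing backlog. The right cutoff depends on the retry policy, which this PR description does not state.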

Documentation only — no behavior changes.

Related

Part of the larger effort to fix the DLQ failures that were silently dropping activities — and to make them easy to see and triage:

The DLQ runbooks pointed at `kubectl logs --tail | grep`, which misses DLQ
failures that predate the current pods (the common case), can't aggregate or
count across time, and used a stale pod selector. There was no Loki/LogQL
guidance anywhere in docs/.

- Add docs/runbooks/dlq/querying-logs-with-loki.md: how to reach Loki, the base
  selector and structured log fields, triage recipes mapped to each DLQ alert,
  and gotchas (line truncation, query-window/retention limits, depth-is-a-metric).
- Replace the kubectl-logs steps in the five DLQ runbooks with LogQL queries
  that filter on the stable structured fields (errorType, policy, err, auditID).
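
As a rough sketch of what filtering on the structured fields can look like, the Python helper below assembles LogQL strings. The field names `errorType` and `policy` come from the commit message; everything else is an assumption made for illustration: the stream labels and values, the use of the `json` parser (the logs may use a different format), and the helper names themselves. The actual queries are in the runbooks, not shown here.

```python
import json


def logql_selector(labels):
    """Render a Loki stream selector such as {namespace="x"}."""
    # json.dumps gives a double-quoted, escaped string literal.
    parts = [f"{key}={json.dumps(value)}" for key, value in sorted(labels.items())]
    return "{" + ", ".join(parts) + "}"


def dlq_failure_query(labels, filters):
    """Log query: select streams, parse JSON lines, keep matching fields."""
    stages = "".join(
        f" | {field}={json.dumps(value)}" for field, value in filters.items()
    )
    return f"{logql_selector(labels)} | json{stages}"


def failures_by_policy(labels, window="24h"):
    """Metric query: count failure lines per policy over a time window."""
    inner = f'{logql_selector(labels)} | json | errorType!=""'
    return f"sum by (policy) (count_over_time({inner} [{window}]))"


if __name__ == "__main__":
    # Hypothetical label and field values, for illustration only.
    labels = {"namespace": "example-namespace"}
    print(dlq_failure_query(labels, {"errorType": "example-error"}))
    print(failures_by_policy(labels, "6h"))
```

Unlike `kubectl logs`, a query like this runs against stored log history, so it can find failures from pods that no longer exist and can aggregate counts over time, within the query-window and retention limits the guide calls out.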
scotwells force-pushed the docs/dlq-loki-triage branch from 4370700 to 6d95771 (June 25, 2026 18:38)
scotwells changed the title from "docs(dlq): add Loki log-query guidance for DLQ triage" to "docs(dlq): help on-call find what's failing when a DLQ alert fires" (Jun 25, 2026)
scotwells marked this pull request as ready for review (June 25, 2026 18:49)