Report root cause when a stack update fails #3

Open
deorus wants to merge 1 commit into main from better-failure-reporting

deorus commented Jun 20, 2026

Addresses Doist/platform-backlog#734 — failed deploys currently surface only a single, often-wrong error line, leaving you to dig through the AWS Console. This action is the standard deploy step across ~30 Doist repos, so the improvement applies broadly, not just to Todoist.

What changed

  • On failure, scan the stack events for this update's ClientRequestToken and collect all genuinely-failed resources (any *_FAILED status, skipping cancellation cascades). Report the earliest failure as the root cause — the previous heuristic captured the newest failed event, which is usually a downstream symptom.
  • Under GitHub Actions, write a summary table to the job summary: Time (UTC) / Resource / Type / Status / Reason, plus a deep link to the stack's events tab in the AWS Console. Remaining failures are emitted as ::error:: annotations.
  • Falls back to a generic message + console link when no resource-level reason is available.
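
The selection rule in the list above can be sketched roughly as follows. This is a minimal illustration, not the action's code: `stackEvent`, `isFailure`, and `rootCause` are hypothetical simplified names, and detecting cancellation cascades by looking for "cancelled" in the reason is an assumption made for the sketch.

```go
package main

import (
	"sort"
	"strings"
	"time"
)

// stackEvent is a simplified, hypothetical stand-in for a CloudFormation
// stack event; the real action works with the AWS SDK types.
type stackEvent struct {
	LogicalID string
	Status    string
	Reason    string
	Timestamp time.Time
}

// isFailure reports whether an event is a genuine failure: any *_FAILED
// status, except cancellation cascades. Matching on the word "cancelled" in
// the reason is an assumption made for this sketch.
func isFailure(e stackEvent) bool {
	if !strings.HasSuffix(e.Status, "_FAILED") {
		return false
	}
	return !strings.Contains(strings.ToLower(e.Reason), "cancelled")
}

// rootCause returns the earliest genuine failure, or false if there is none.
func rootCause(events []stackEvent) (stackEvent, bool) {
	var failed []stackEvent
	for _, e := range events {
		if isFailure(e) {
			failed = append(failed, e)
		}
	}
	if len(failed) == 0 {
		return stackEvent{}, false
	}
	// Oldest first: the earliest failure is treated as the root cause.
	sort.SliceStable(failed, func(i, j int) bool {
		return failed[i].Timestamp.Before(failed[j].Timestamp)
	})
	return failed[0], true
}
```

When `rootCause` returns false, the caller would fall back to the generic message and console link.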

Notes

  • No new IAM permission — cloudformation:DescribeStackEvents was already required.
  • Tests cover failure detection, root-cause ordering, Markdown-cell escaping, and the console URL.

On a failed deploy, scan the stack events for this update and surface the
earliest failed resource as the error. Under GitHub Actions, write a summary
table of all failed resources plus a console deep link to the job summary.
deorus requested a review from artyom June 22, 2026 06:42
deorus marked this pull request as ready for review June 22, 2026 06:42

doistbot left a comment

This PR improves CloudFormation deploy failure reporting by scanning stack events for the update's ClientRequestToken, identifying the earliest failed resource as the root cause, and writing a summary table with console links to the GitHub Actions job summary.

A few things worth tightening:

  • Stack-level failures get dropped: The continue statement that skips stack events also hides pre-flight failures (missing IAM capabilities, export overlaps, etc.) where no resource fails, resulting in a generic error instead of the real cause. Remove the continue so stack-level _FAILED events are still collected — since stack events are newer than resource failures, sortedFailures will still place the true root cause first.
  • Edge cases in the root-cause message: A *_FAILED event with no ResourceStatusReason produces a malformed message (… STATUS: — see stack events). Treat an empty root.reason as missing and fall back to the generic message + console link. Separately, the console URL hardcodes console.aws.amazon.com, which is wrong for GovCloud and China partitions — derive the host from the stack ARN/partition instead.
  • Nondeterministic root-cause selection: Events collected into a map then sorted only by timestamp can shuffle equal-timestamp failures, making the reported root cause vary run to run. Use EventId for dedup instead of the composite key, and add a deterministic tiebreaker or preserve scan order so failures[0] is stable.
  • Polling and testing gaps: On busy stacks, pagination now continues through the full one-hour cutoff even after rollback — break early once the matching UPDATE_IN_PROGRESS event for this token is found. Also, the two highest-value functions (reportFailure and writeStepSummary) are untested despite being pure functions with an existing test harness; adding coverage for both the populated and empty cases would guard against regressions in the core logic this PR introduces.

I also included a few optional follow-up notes in the details below.

Optional follow-up note (1)
  • [P3] main.go:146: Since the stack's terminal event is always newer than the resource failures that caused it, it will be encountered first in this newest-to-oldest stream. You can avoid redundantly populating the failures map on every tick during a rollback by checking if terminal != "" && isFailure(...). This safely defers all map operations and string parsing until the final tick when the map is actually used.

Comment thread main.go
        types.ResourceStatusRollbackFailed:
        terminal = evt.ResourceStatus
    }
    continue

🟠 P1 The continue statement prevents any stack-level event from being evaluated by isFailure. While this hides the redundant "The following resource(s) failed" stack event during typical deployments, it also drops stack-level pre-flight failures (e.g., missing IAM capabilities, export overlaps, or unresolvable parameters) where no underlying resources fail. As a result, the action returns a generic UPDATE_ROLLBACK_COMPLETE error and drops the actual root cause, which is a regression from the previous behavior.

Remove continue so that stack-level _FAILED events are still collected. Because stack-level events occur after resource failures, sortedFailures will still correctly place the true resource root cause first when both exist.

Comment thread main.go
        log.Printf("%s%s (%s) %s: %s", githubErrPrefix, e.logicalID, e.resType, e.status, oneLine(e.reason))
    }
    root := failures[0]
    return fmt.Errorf("%s (%s) %s: %s — see stack events: %s", root.logicalID, root.resType, root.status, oneLine(root.reason), consoleURL)

🟡 P2 ResourceStatusReason is optional in CloudFormation events, so a *_FAILED event can legitimately arrive with no reason. In that case this now returns Resource (Type) STATUS: — see stack events, even though the PR description says to fall back to a generic message when no resource-level reason is available. Please treat an empty root.reason as missing here and either omit the : %s fragment or fall back to the terminal-status message + console link.
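
One way to handle the empty-reason case is to build the message in two steps and append the reason only when it is present. The sketch below takes the "omit the fragment" option; `rootCauseError` is a hypothetical helper, the struct is a trimmed-down stand-in for `failedEvent`, and the `;` separator is an arbitrary choice for the example rather than the PR's wording.

```go
package main

import (
	"fmt"
	"strings"
)

// failedEvent is a trimmed-down stand-in for the PR's type.
type failedEvent struct {
	logicalID, resType, status, reason string
}

// rootCauseError formats the root failure, dropping the reason fragment when
// CloudFormation supplied none, so the message never ends in a dangling colon.
func rootCauseError(root failedEvent, consoleURL string) error {
	head := fmt.Sprintf("%s (%s) %s", root.logicalID, root.resType, root.status)
	if reason := strings.TrimSpace(root.reason); reason != "" {
		head += ": " + reason
	}
	return fmt.Errorf("%s; see stack events: %s", head, consoleURL)
}
```

Falling back to the terminal-status message instead would be a one-line change in the empty branch.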

Comment thread main.go
}

func eventsConsoleURL(region, stackID string) string {
    return fmt.Sprintf("https://%[1]s.console.aws.amazon.com/cloudformation/home?region=%[1]s#/stacks/events?stackId=%s",

🟡 P2 This hardcodes the commercial AWS Console domain, so the deep link is wrong for GovCloud and China regions even though the action itself can run there. Please derive the console host from the stack ARN/partition (or otherwise map the region to the correct console domain) instead of always using *.console.aws.amazon.com.
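
A possible shape for the partition-aware link, assuming the stack ID is a full ARN: read the partition from the ARN's second field and map it to a console domain. `partitionOf` and `consoleDomain` are hypothetical helpers, and the GovCloud and China domains below are my understanding of the console hosts, so they should be checked against AWS documentation before use.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// partitionOf extracts the partition from an ARN such as
// "arn:aws-cn:cloudformation:...". Anything else is treated as commercial AWS.
func partitionOf(stackID string) string {
	parts := strings.SplitN(stackID, ":", 3)
	if len(parts) == 3 && parts[0] == "arn" && parts[1] != "" {
		return parts[1]
	}
	return "aws"
}

// consoleDomain maps a partition to its console domain.
// Assumption: these domain values should be verified before relying on them.
func consoleDomain(partition string) string {
	switch partition {
	case "aws-us-gov":
		return "console.amazonaws-us-gov.com"
	case "aws-cn":
		return "console.amazonaws.cn"
	default:
		return "console.aws.amazon.com"
	}
}

// eventsConsoleURL builds the deep link to the stack's events tab.
func eventsConsoleURL(region, stackID string) string {
	return fmt.Sprintf(
		"https://%[1]s.%[2]s/cloudformation/home?region=%[1]s#/stacks/events?stackId=%[3]s",
		region, consoleDomain(partitionOf(stackID)), url.QueryEscape(stackID))
}
```

For a stack ARN in the `aws-cn` partition and region `cn-north-1`, this yields a link under `cn-north-1.console.amazonaws.cn` rather than the commercial domain.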

Comment thread main.go
        reason:    unptr(evt.ResourceStatusReason),
        timestamp: unptr(evt.Timestamp),
    }
    failures[fe.logicalID+"\x00"+fe.timestamp.String()] = fe

🟡 P2 Replace this synthetic composite key with the event's unique ID (unptr(evt.EventId)). CloudFormation guarantees a stable EventId for every stack event, which provides a safer, allocation-free way to deduplicate events without relying on stringified timestamps.

Comment thread main.go

func sortedFailures(m map[string]failedEvent) []failedEvent {
    out := slices.Collect(maps.Values(m))
    slices.SortFunc(out, func(a, b failedEvent) int { return a.timestamp.Compare(b.timestamp) })

🟡 P2 This only sorts by timestamp, but the failures were first collected into a map, so any two events with the same timestamp end up in arbitrary order. Since reportFailure() treats failures[0] as the root cause, the reported root failure can change run to run for equal-timestamp events. Keep scan order in a slice (using a separate set only for dedupe if needed), or add a deterministic tiebreaker before selecting the first failure.
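
A sketch of how this and the EventId suggestion could fit together: keep failures in a slice, use a set keyed by event ID only for dedup, and sort with explicit tiebreakers. `failureSet` is a hypothetical type, and breaking ties by logical ID and then event ID is one arbitrary but deterministic choice, not necessarily the ordering the author wants.

```go
package main

import (
	"cmp"
	"slices"
	"time"
)

// failedEvent is a trimmed-down stand-in for the PR's type, with an eventID
// field added for deduplication.
type failedEvent struct {
	eventID   string
	logicalID string
	timestamp time.Time
}

// failureSet keeps failures in scan order and uses eventID only for dedup.
type failureSet struct {
	seen   map[string]struct{}
	events []failedEvent
}

// add records fe unless an event with the same ID was already recorded.
func (s *failureSet) add(fe failedEvent) {
	if s.seen == nil {
		s.seen = make(map[string]struct{})
	}
	if _, dup := s.seen[fe.eventID]; dup {
		return
	}
	s.seen[fe.eventID] = struct{}{}
	s.events = append(s.events, fe)
}

// sorted returns a copy ordered oldest first. Equal timestamps are ordered by
// logical ID, then event ID, so the first element is the same on every run.
func (s *failureSet) sorted() []failedEvent {
	out := slices.Clone(s.events)
	slices.SortFunc(out, func(a, b failedEvent) int {
		return cmp.Or(
			a.timestamp.Compare(b.timestamp),
			cmp.Compare(a.logicalID, b.logicalID),
			cmp.Compare(a.eventID, b.eventID),
		)
	})
	return out
}
```

Because event IDs are unique within the set, the comparison never returns zero for two distinct events, so the result does not depend on map iteration or input order.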

Comment thread main.go
    return cmp.Or(likelyRootCause, fmt.Errorf("%v, see AWS CloudFormation Console for more details", evt.ResourceStatus))
case types.ResourceStatusUpdateComplete:
    return nil
case types.ResourceStatusUpdateRollbackComplete,

🟡 P2 This change stops returning immediately on rollback and now keeps paginating DescribeStackEvents until the one-hour cutoff. On stacks with a lot of recent history, that means extra API pages for unrelated older operations right on the failure path. Since events are newest-first, consider breaking once you've reached the current update's start event for this ClientRequestToken (for example the matching stack UPDATE_IN_PROGRESS event) instead of scanning the whole cutoff window.
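
The early exit could look roughly like this. The sketch uses a hypothetical `event` type and a `nextPage` callback in place of the real DescribeStackEvents pagination, and assumes pages arrive newest first.

```go
package main

// event is a simplified, hypothetical stand-in for a stack event.
type event struct {
	LogicalID, ResourceType, Status, Token string
}

const stackType = "AWS::CloudFormation::Stack"

// scanUpdate walks pages of events, newest first, and collects the events that
// carry token. It stops as soon as it sees the stack-level UPDATE_IN_PROGRESS
// event for that token, since everything older belongs to earlier operations.
// nextPage returns the next page and whether more pages remain.
func scanUpdate(token string, nextPage func() ([]event, bool)) (matched []event, pages int) {
	for {
		page, more := nextPage()
		pages++
		for _, e := range page {
			if e.Token != token {
				continue
			}
			matched = append(matched, e)
			if e.ResourceType == stackType && e.Status == "UPDATE_IN_PROGRESS" {
				return matched, pages
			}
		}
		if !more {
			return matched, pages
		}
	}
}
```

The time-based cutoff would still be needed as a backstop for the case where the start event is never found.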

Comment thread main.go
// table plus per-resource error annotations — and returns the most likely root
// cause as an error. The earliest failure is the root cause; later ones usually
// cascade from it.
func reportFailure(region, stackID string, terminal types.ResourceStatus, failures []failedEvent) error {

🟡 P2 reportFailure is the core of this PR — it picks failures[0] as the root cause, formats the error message, logs failures[1:] as annotations, and falls back to a generic message when there are no resource-level failures. None of this is tested. A regression here (wrong element selected, broken fallback, malformed message) would directly reintroduce the wrong-root-cause problem this PR fixes. It's a pure function in the same package — with GITHUB_STEP_SUMMARY unset, a test can pass a []failedEvent and assert the returned error string for both the populated and empty cases.

Comment thread main.go
        region, url.QueryEscape(stackID))
}

func writeStepSummary(path, consoleURL string, terminal types.ResourceStatus, failures []failedEvent) error {

🟡 P2 writeStepSummary produces the primary new user-facing output — the Markdown table written to ~30 repos' job summaries. The table header, separator row, per-row timestamp formatting, and console link assembly are all untested logic that mdCell/eventsConsoleURL tests alone don't cover. It's testable with a temp file: call writeStepSummary with the temp path, read it back, and assert the rendered Markdown. A regression would produce garbled summaries across all consuming repos.
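
To illustrate the temp-file approach, here is a self-contained sketch. The `writeStepSummary` and `mdCell` bodies are stand-ins written for this example (the real implementations are not shown in this thread), the heading text and time format are made up, and `terminal` is a plain string instead of `types.ResourceStatus`. `renderToTemp` is a hypothetical helper showing the write-then-read-back step a test would perform.

```go
package main

import (
	"fmt"
	"os"
	"strings"
	"time"
)

// failedEvent is a trimmed-down stand-in for the PR's type.
type failedEvent struct {
	logicalID, resType, status, reason string
	timestamp                          time.Time
}

// mdCell is a stand-in escaper: it escapes pipes and collapses whitespace so
// a value cannot break out of its table cell.
func mdCell(s string) string {
	s = strings.ReplaceAll(s, "|", `\|`)
	return strings.Join(strings.Fields(s), " ")
}

// writeStepSummary is a stand-in that appends a Markdown table to path.
func writeStepSummary(path, consoleURL, terminal string, failures []failedEvent) error {
	var b strings.Builder
	fmt.Fprintf(&b, "### Stack update failed: %s\n\n", terminal)
	b.WriteString("| Time (UTC) | Resource | Type | Status | Reason |\n")
	b.WriteString("| --- | --- | --- | --- | --- |\n")
	for _, f := range failures {
		fmt.Fprintf(&b, "| %s | %s | %s | %s | %s |\n",
			f.timestamp.UTC().Format("15:04:05"),
			mdCell(f.logicalID), mdCell(f.resType), mdCell(f.status), mdCell(f.reason))
	}
	fmt.Fprintf(&b, "\n[Stack events in the AWS Console](%s)\n", consoleURL)

	out, err := os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		return err
	}
	if _, err := out.WriteString(b.String()); err != nil {
		out.Close()
		return err
	}
	return out.Close()
}

// renderToTemp writes a summary to a temporary file and reads it back, which
// is the step a test would perform before asserting on the Markdown.
func renderToTemp(consoleURL, terminal string, failures []failedEvent) (string, error) {
	tmp, err := os.CreateTemp("", "summary-*.md")
	if err != nil {
		return "", err
	}
	path := tmp.Name()
	tmp.Close()
	defer os.Remove(path)

	if err := writeStepSummary(path, consoleURL, terminal, failures); err != nil {
		return "", err
	}
	data, err := os.ReadFile(path)
	return string(data), err
}
```

In an actual `_test.go` file, `t.TempDir()` would replace the manual temp-file handling, and the assertions would cover both a populated and an empty `failures` slice.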
