
UID2-7223 Use 1 SSM session in AWS AMI build #2575

Merged: swibi-ttd merged 5 commits into main from swi-fix-ami-timeout on Jun 2, 2026

@swibi-ttd swibi-ttd commented Jun 2, 2026

Fixes the AWS operator AMI build (build-uid2-ami.yaml), which has been hanging or failing since about 28 May.

Likely cause: a GitHub runner-image update slowed per-task SSM session setup by roughly 3–4×. Packer recycles an SSM session for each Ansible task, so a task intermittently either exceeds the become-prompt timeout (Timeout waiting for privilege escalation prompt) or leaves the session idle long enough to hit SSM's 20-minute idle reap, after which StartSession returns 403 and no artifacts are produced.

Fix (scripts/aws/uid2-operator-ami/ansible.cfg): run the whole playbook over one persistent SSM session instead of recycling one per task, using pipelining = True, SSH ControlMaster/ControlPersist, and keepalives. timeout = 60 is kept as a backstop.
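
For readers unfamiliar with these options, a minimal sketch of what such an ansible.cfg could look like follows. The option names come from the description above; the specific values for ControlPersist, ServerAliveInterval, and ServerAliveCountMax are assumptions chosen for illustration and are not taken from this PR.

```ini
# Illustrative sketch only: option values other than timeout and pipelining
# are assumptions, not the settings merged in this PR.

[defaults]
# Seconds Ansible waits for the connection and the become (sudo) prompt.
# Kept as a backstop in case session setup is still slow.
timeout = 60

[ssh_connection]
# Run modules over the already-open connection instead of copying files
# to the host first, which cuts the number of round trips per task.
pipelining = True

# ControlMaster/ControlPersist multiplex every task over one SSH connection.
# ServerAlive* sends periodic keepalives so a long-running task does not
# leave the underlying session idle. (Values below are assumed.)
ssh_args = -o ControlMaster=auto -o ControlPersist=30m -o ServerAliveInterval=30 -o ServerAliveCountMax=10
```

Note that pipelining generally requires that sudo on the target does not enforce requiretty, so that is worth checking on the base image if privilege escalation fails after enabling it.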

Test run 26807032561: both UID2 and EUID AMIs built and passed E2E, with PLAY RECAP ok=49 unreachable=0 (per-task cadence dropped from ~45s to ~4–5s).

swibi-ttd and others added 5 commits June 2, 2026 14:37
The latest AL2023 base image (most_recent=true) regressed SSM session
stability: the Packer build hangs on a task transition until SSM's 20-min
idle timeout fires, then dies with StartSession 403 (no AMI produced).

Pin source_ami per region to the May 25 green build:
  us-east-1 (UID2)    ami-0236922087fa98b6e
  eu-central-1 (EUID) ami-08b013271cfc23534

The most_recent filter is left commented in source.pkr.hcl for an easy
revert once the SSM connection regression is resolved.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Pinning the base image to the May-25 known-good AL2023 (run 26799083499)
still failed identically (Timeout waiting for privilege escalation prompt,
20-min SSM idle reap). That exonerates the base image. Packer (1.15.3) and
the session-manager-plugin (1.2.814.0) are also identical between the green
and failing runs, so the regression is in the GitHub runner image / AWS SSM
session-setup path, not the AMI. Reverting the pin to restore most_recent and
drop the per-region kernel-line inconsistency it introduced.

This reverts commit 3d35d185b.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ist)

The runner-image regression slowed per-task SSM session setup ~3-4x; recycling
a session per task is what intermittently hangs the build. Enable Ansible
pipelining and SSH ControlMaster/ControlPersist so the whole playbook runs over
a single persistent connection (one SSM session instead of ~49), plus
ServerAlive keepalives so a slow task's session isn't reaped at SSM's 20-min
idle timeout. Keeps timeout=60 as a backstop.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@swibi-ttd swibi-ttd changed the title from "raise ansible timeout to 60s" to "Use 1 SSM session in AWS AMI build" Jun 2, 2026
@swibi-ttd swibi-ttd marked this pull request as ready for review June 2, 2026 23:35
@swibi-ttd swibi-ttd changed the title from "Use 1 SSM session in AWS AMI build" to "UID2-7223 Use 1 SSM session in AWS AMI build" Jun 2, 2026
@swibi-ttd swibi-ttd merged commit 23eef5c into main Jun 2, 2026
9 checks passed
@swibi-ttd swibi-ttd deleted the swi-fix-ami-timeout branch June 2, 2026 23:59