feat(aws-test-infra): add AWS test infrastructure provisioning action #139
sowmyav27 wants to merge 1 commit into
Composite action that provisions and tears down AWS test infra (VPC + subnet + IGW + route table + security group + EC2 instances) for e2e workflows. Built as a Go binary using aws-sdk-go-v2 with hand-rolled mocks for unit testing. Replaces ~150 lines of duplicated Bash + aws-cli inline in loft-sh/vcluster-pro's e2e-selinux-support-matrix.yaml and prerelease-vcluster.yaml workflows.

Build-from-source: the action runs `go build` from src/ on every invocation. This mirrors run-ginkgo's pattern and assumes the consumer has Go available (via actions/setup-go). No separate release artifact lifecycle.

Two subcommands:
- provision: VPC/subnet/IGW/route table/SG/EC2/SSM-wait, with optional AMI architecture/virtualization-type filter for safety-net AMI lookup.
- cleanup: best-effort direct teardown by ID, plus tag-based sweep that catches resources from runs that failed before exporting IDs. Sweep errors are logged + swallowed by default to match `set +e` semantics of the original Bash teardown; -strict-sweep opts back into hard fail.

Outputs: vpc-id, igw-id, subnet-id, route-table-id, route-assoc-id, security-group-id, ami-id, primary-public-ip, instance-ids (CSV), named per-role instance IDs (primary/worker1/worker2), and a JSON map output (instance-id-by-role) for consumers using arbitrary role names or non-three-instance counts.

Tests (29 top-level / 71 cases / 62.9% coverage) cover API call ordering, tag application on every resource, the partial-failure ResourceIDs contract that lets cleanup tear down failed provisions, tag-based sweep correctness, dependency-order strict checks for disassoc-before-delete pairs, ingress encoding round-trip, flag validation, and output format wiring.
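As an illustration of the two instance output shapes described above (the CSV `instance-ids` and the `instance-id-by-role` JSON map), here is a minimal sketch. The `instance` type, the `instanceOutputs` helper, and the sample IDs are hypothetical and are not taken from the action's source; the sketch only shows how both formats could be derived from one list.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// instance pairs a role name with an EC2 instance ID.
// Hypothetical type for this sketch only.
type instance struct {
	Role string
	ID   string
}

// instanceOutputs renders the comma-separated ID list and the
// role-to-ID JSON map from the same slice of instances.
func instanceOutputs(instances []instance) (string, string, error) {
	ids := make([]string, 0, len(instances))
	roles := make(map[string]string, len(instances))
	for _, inst := range instances {
		ids = append(ids, inst.ID)
		roles[inst.Role] = inst.ID
	}
	// json.Marshal emits map keys in sorted order, so the output is stable.
	raw, err := json.Marshal(roles)
	if err != nil {
		return "", "", err
	}
	return strings.Join(ids, ","), string(raw), nil
}

func main() {
	csv, byRole, err := instanceOutputs([]instance{
		{Role: "primary", ID: "i-0aaa"},
		{Role: "worker1", ID: "i-0bbb"},
	})
	if err != nil {
		panic(err)
	}
	fmt.Println("instance-ids=" + csv)
	fmt.Println("instance-id-by-role=" + byRole)
}
```

A consumer with arbitrary role names would read the JSON map, while the fixed per-role outputs cover the common three-instance layout.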
    if err == nil {
        online := 0
        for _, info := range out.InstanceInformationList {
            if info.PingStatus == "Online" {
`if info.PingStatus == ssmtypes.PingStatusOnline {` is more accurate.
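A minimal sketch of the suggestion, assuming the SDK models the ping status as a typed string enum. To keep the example runnable without the SDK, `PingStatus`, its constants, and `InstanceInformation` are local stand-ins that mirror that shape; `countOnline` is a hypothetical helper.

```go
package main

import "fmt"

// PingStatus is a local stand-in for the SDK's typed string enum.
type PingStatus string

const (
	PingStatusOnline         PingStatus = "Online"
	PingStatusConnectionLost PingStatus = "ConnectionLost"
)

// InstanceInformation is a stand-in holding only the field used here.
type InstanceInformation struct {
	PingStatus PingStatus
}

// countOnline counts the instances whose agent reports Online.
// Comparing against the typed constant means a misspelled status
// fails at compile time instead of silently never matching.
func countOnline(list []InstanceInformation) int {
	online := 0
	for _, info := range list {
		if info.PingStatus == PingStatusOnline {
			online++
		}
	}
	return online
}

func main() {
	list := []InstanceInformation{
		{PingStatus: PingStatusOnline},
		{PingStatus: PingStatusConnectionLost},
		{PingStatus: PingStatusOnline},
	}
	fmt.Println(countOnline(list)) // prints 2
}
```

The string-literal comparison also compiles, since an untyped constant converts to the enum type, so the change is about robustness rather than behavior.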
    - name: AWS login (OIDC)
      uses: aws-actions/configure-aws-credentials@v5.1.1
      with:
        role-to-assume: arn:aws:iam::084374023943:role/e2e-test-executor
I would replace the account ID with some placeholder. It's a public repository and we should not expose any internal infra identifiers.
    // Values=<runID>. The order — instances → SGs → route tables → subnets →
    // IGWs → VPCs — matches the dependency chain so deletes don't fail because
    // of in-use checks.
    func sweepByTag(ctx context.Context, logger *slog.Logger, c EC2API, waiter EC2Waiter, runID string) error {
If `DescribeSecurityGroups` fails, the function returns early and we never clean up route tables, subnets, IGWs, or VPCs. A possible solution:

func sweepByTag(ctx context.Context, logger *slog.Logger, c EC2API, waiter EC2Waiter, runID string) error {
	logger.Info("running tag-based sweep", "run_id", runID)
	tagFilter := []types.Filter{{Name: aws.String("tag:RunID"), Values: []string{runID}}}
	// Collect errors instead of returning early, so later resource types are still swept.
	var errs []error

	// Instances
	instOut, err := c.DescribeInstances(ctx, &ec2.DescribeInstancesInput{
		Filters: append(append([]types.Filter{}, tagFilter...), types.Filter{
			Name:   aws.String("instance-state-name"),
			Values: []string{"pending", "running", "stopping", "stopped"},
		}),
	})
	if err != nil {
		errs = append(errs, fmt.Errorf("describe-instances (sweep): %w", err))
	} else {
		// ... terminate discovered instances (unchanged)
	}

	// Security groups
	sgOut, err := c.DescribeSecurityGroups(ctx, &ec2.DescribeSecurityGroupsInput{Filters: tagFilter})
	if err != nil {
		errs = append(errs, fmt.Errorf("describe-security-groups (sweep): %w", err))
	} else {
		// ... delete discovered SGs (unchanged)
	}

	// ... same pattern for RouteTables, Subnets, IGWs, VPCs

	// errors.Join returns nil when errs is empty.
	return errors.Join(errs...)
}
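The accumulate-and-continue pattern in the suggestion can be shown in isolation. This is a sketch under stated assumptions: `step`, `runAll`, and `errThrottled` are hypothetical names standing in for the per-resource describe-and-delete stages, and `errors.Join` requires Go 1.20 or later.

```go
package main

import (
	"errors"
	"fmt"
)

// step is a hypothetical stand-in for one stage of the sweep.
type step struct {
	name string
	run  func() error
}

// runAll executes every step even when an earlier one fails and
// returns the joined errors, or nil if no step failed.
func runAll(steps []step) error {
	var errs []error
	for _, s := range steps {
		if err := s.run(); err != nil {
			errs = append(errs, fmt.Errorf("%s (sweep): %w", s.name, err))
		}
	}
	return errors.Join(errs...)
}

// errThrottled is a made-up failure used to simulate an API error.
var errThrottled = errors.New("throttled")

func main() {
	vpcsSwept := false
	err := runAll([]step{
		{"describe-security-groups", func() error { return errThrottled }},
		{"describe-vpcs", func() error { vpcsSwept = true; return nil }},
	})
	// The later step still ran, and the original error stays matchable.
	fmt.Println(vpcsSwept, errors.Is(err, errThrottled)) // prints true true
}
```

Because the joined error unwraps to its parts, callers can still use `errors.Is` or `errors.As` on individual failures.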
    }

    func runCleanup(ctx context.Context, logger *slog.Logger, name string, args []string) error {
        fs := flag.NewFlagSet(name, flag.ExitOnError)
I'm not sure about `flag.ExitOnError`. For a flag that is missing its value, or an unexpected flag such as `-typo=foo`, the Go stdlib calls `os.Exit(2)`, which makes line 52 dead code. Maybe `flag.ContinueOnError` would make more sense here.
Same for `runProvision` in provision.go.
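A minimal sketch of the `flag.ContinueOnError` variant. `parseCleanupFlags` is a hypothetical reduction of the flag handling, not the PR's code; it registers only `-strict-sweep`, the one cleanup flag named in the description.

```go
package main

import (
	"flag"
	"fmt"
	"io"
)

// parseCleanupFlags parses args and returns the -strict-sweep value.
// With flag.ContinueOnError, Parse returns the error to the caller
// instead of terminating the process.
func parseCleanupFlags(name string, args []string) (bool, error) {
	fs := flag.NewFlagSet(name, flag.ContinueOnError)
	// Silence the package's own usage output; the caller reports the error.
	fs.SetOutput(io.Discard)
	strict := fs.Bool("strict-sweep", false, "fail when the tag-based sweep fails")
	if err := fs.Parse(args); err != nil {
		return false, fmt.Errorf("%s: parse flags: %w", name, err)
	}
	return *strict, nil
}

func main() {
	// An unknown flag now surfaces as a returned error.
	_, err := parseCleanupFlags("cleanup", []string{"-typo=foo"})
	fmt.Println("error returned:", err != nil) // prints error returned: true
}
```

Returning the error keeps exit-code handling and logging in one place in the caller, and it makes flag validation testable without spawning a subprocess.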
        return nil
    }

    func tagPairs(ts []types.TagSpecification) string {
This function looks obsolete and is never used.
      with:
        go-version-file: .github/actions/aws-test-infra/src/go.mod
    - name: Run tests
      run: go test ./...
go test -v -race -count=1 ./...
        cd $(ACTIONS_DIR)/linear-release-sync/src && go test -v ./...

    test-aws-test-infra: ## run aws-test-infra unit tests
        cd $(ACTIONS_DIR)/aws-test-infra/src && go test -v ./...
go test -v -race -count=1 ./...
Summary
Fixes: ENGQA-702
A composite action that provisions and tears down AWS test infrastructure (VPC + subnet + IGW + route table + security group + EC2 instances) for e2e workflows. Built as a Go binary using `aws-sdk-go-v2`, unit-tested with hand-rolled mocks.

Replaces ~150 lines of duplicated Bash + `aws-cli` that previously lived inline in `loft-sh/vcluster-pro`'s `e2e-selinux-support-matrix.yaml` and `prerelease-vcluster.yaml` workflows. A separate vcluster-pro PR migrates those workflows to use this action.

Why
`actions/setup-go` in place.

Design
Build-from-source: the action runs `go build` from `src/` on every invocation. This mirrors `run-ginkgo`'s pattern and assumes the consumer has Go available (via `actions/setup-go`). No separate release artifact lifecycle, no SHA-256 dance, no two-step "merge then tag then PR" coordination. Tag `aws-test-infra/v1` is usable immediately after merge.

Two subcommands:
- `provision`: VPC/subnet/IGW/route table/SG/EC2/SSM-wait. Optional `-ami-architecture` and `-ami-virtualization-type` filter to preserve the safety net the original Bash had (defaults to `x86_64` + `hvm` in `action.yml`).
- `cleanup`: best-effort direct teardown by ID, plus a tag-based sweep that catches resources from runs that failed before exporting IDs. Sweep errors are logged + swallowed by default to match `set +e` semantics of the original Bash teardown; `-strict-sweep` opts back into hard failure.

Outputs:
- `vpc-id`, `igw-id`, `subnet-id`, `route-table-id`, `route-assoc-id`, `security-group-id`, `ami-id`, `primary-public-ip`, `instance-ids` (CSV).
- `primary-instance-id`, `worker1-instance-id`, `worker2-instance-id` (covers the common 3-instance case).
- `instance-id-by-role` (JSON map): for consumers using arbitrary role names or non-three-instance counts.

Tests
29 top-level / 71 cases / 62.9% coverage. Cover:
`if: always()` cleanup steps

Test plan
- `test-aws-test-infra.yaml`: `go test ./...` + build verification)
- `aws-test-infra/v1` so the consumer PR (vcluster-pro engqa-aws-mig) can reference it
- `workflow_dispatch` on `e2e-selinux-support-matrix.yaml` in vcluster-pro after the consumer PR merges (matrix runs on all 3 distros, cleanup leaves no orphans)
- `workflow_dispatch` on `prerelease-vcluster.yaml` (Kind shared-HA + EC2 standalone paths both pass, teardown succeeds)
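The call-ordering checks mentioned under Tests (for example, disassociate before delete) can be sketched with a hand-rolled mock that records the order of calls. Everything here is an assumption made for illustration: `routeAPI` is a narrowed, simplified interface rather than the action's `EC2API`, and `teardownRoutes` and `calledBefore` are hypothetical helpers.

```go
package main

import "fmt"

// routeAPI is a hypothetical, narrowed interface with simplified signatures.
type routeAPI interface {
	DisassociateRouteTable(assocID string) error
	DeleteRouteTable(rtbID string) error
}

// mockRoutes is a hand-rolled mock that records each call in order.
type mockRoutes struct {
	calls []string
}

func (m *mockRoutes) DisassociateRouteTable(assocID string) error {
	m.calls = append(m.calls, "DisassociateRouteTable")
	return nil
}

func (m *mockRoutes) DeleteRouteTable(rtbID string) error {
	m.calls = append(m.calls, "DeleteRouteTable")
	return nil
}

// teardownRoutes is the code under test: it must disassociate first.
func teardownRoutes(api routeAPI, assocID, rtbID string) error {
	if err := api.DisassociateRouteTable(assocID); err != nil {
		return err
	}
	return api.DeleteRouteTable(rtbID)
}

// calledBefore reports whether the first occurrence of a call named
// first precedes the first occurrence of a call named second.
func calledBefore(calls []string, first, second string) bool {
	firstIdx, secondIdx := -1, -1
	for i, c := range calls {
		if c == first && firstIdx == -1 {
			firstIdx = i
		}
		if c == second && secondIdx == -1 {
			secondIdx = i
		}
	}
	return firstIdx != -1 && secondIdx != -1 && firstIdx < secondIdx
}

func main() {
	m := &mockRoutes{}
	if err := teardownRoutes(m, "rtbassoc-123", "rtb-123"); err != nil {
		panic(err)
	}
	fmt.Println(calledBefore(m.calls, "DisassociateRouteTable", "DeleteRouteTable")) // prints true
}
```

Recording calls in a slice keeps the mock dependency-free and lets one helper assert any disassociate-before-delete pair.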