K8SPG-943: check operator panic in e2e test#1441
Conversation
Signed-off-by: Mayank Shah <mayank.shah@percona.com>
There was a problem hiding this comment.
Pull request overview
This PR adds operator panic detection to E2E test cleanup phases. The change introduces a new check_operator_panic function that searches operator logs for panic messages and calls this function before destroying the operator in test cleanup scripts.
Changes:
- Added
check_operator_panicfunction toe2e-tests/functionsthat checks operator logs for "Observed a panic" messages - Updated 29 E2E test cleanup files to call
check_operator_panicbeforedestroy_operator
Reviewed changes
Copilot reviewed 30 out of 30 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| e2e-tests/functions | Adds new check_operator_panic function to detect panics in operator logs |
| e2e-tests/tests//99-.yaml | Updates 29 test cleanup scripts to check for operator panics before destroying the operator |
| check_operator_panic() { | ||
| local operator_pod=$(get_operator_pod) | ||
| if kubectl logs -n "${OPERATOR_NS:-$NAMESPACE}" "$operator_pod" -c operator | grep -q "Observed a panic"; then | ||
| echo "Detected panic in operator" | ||
| exit 1 | ||
| fi | ||
| } |
There was a problem hiding this comment.
The function doesn't verify that the operator pod exists before checking for panics. If get_operator_pod returns an empty string (no operator pod found), the kubectl logs command will fail, but the error will be masked by the pipeline to grep, causing the function to silently succeed without checking for panics.
Add validation that operator_pod is non-empty before attempting to retrieve logs. Consider also handling the case where kubectl logs fails due to the pod not existing or not being ready yet.
e2e-tests/functions
Outdated
| check_operator_panic() { | ||
| local operator_pod=$(get_operator_pod) | ||
| if kubectl logs -n "${OPERATOR_NS:-$NAMESPACE}" "$operator_pod" -c operator | grep -q "Observed a panic"; then | ||
| echo "Detected panic in operator" |
There was a problem hiding this comment.
detected panic but what is the panic? grep -q will suppress any output. we need to print the panic and the stacktrace if possible
Signed-off-by: Mayank Shah <mayank.shah@percona.com>
| check_operator_panic() { | ||
| local operator_pod=$(get_operator_pod) | ||
| local panic_log | ||
| panic_log=$(kubectl logs -n "${OPERATOR_NS:-$NAMESPACE}" "$operator_pod" -c operator | grep -A 100 "Observed a panic") | ||
| if [ -n "$panic_log" ]; then | ||
| echo "Detected panic in operator:" | ||
| echo "$panic_log" | ||
| exit 1 | ||
| fi | ||
| } |
There was a problem hiding this comment.
- Problem: Several test cleanup files in this repository were not updated with
check_operator_panicwhile all other similar test directories were. The following cleanup files are missing the newcheck_operator_paniccall that was added to 29 other test suites:
e2e-tests/tests/backup-enable-disable/99-remove-cluster-gracefully.yamle2e-tests/tests/cert-manager-tls/99-remove-cluster-gracefully.yamle2e-tests/tests/demand-backup-offline-snapshot/99-cleanup.yamle2e-tests/tests/dynamic-configuration/99-remove-cluster-gracefully.yamle2e-tests/tests/pg-tde/99-remove-cluster-gracefully.yamle2e-tests/tests/backup-enable-disable/98-remove-datasource-cluster-gracefully.yaml
-
Why it matters: These test suites will not detect operator panics, creating inconsistent coverage.
-
Fix: Add
check_operator_panicbeforedestroy_operatorin each of these cleanup files, matching the pattern used in all other test suites in this PR.
e2e-tests/functions
Outdated
| check_operator_panic() { | ||
| local operator_pod=$(get_operator_pod) | ||
| local panic_log | ||
| panic_log=$(kubectl logs -n "${OPERATOR_NS:-$NAMESPACE}" "$operator_pod" -c operator | grep -A 100 "Observed a panic") |
There was a problem hiding this comment.
-
Problem: The
grep -A 100hard limit may silently truncate panic output if a stack trace exceeds 100 lines, which is common in Go panics with deep call stacks or concurrent goroutine dumps. -
Why it matters: A truncated panic log could make it harder to diagnose the root cause of the panic, and any goroutine dump lines beyond line 100 would be lost.
-
Fix: Use a larger context value (e.g.,
-A 500) or remove the limit entirely by piping throughawk '/Observed a panic/{found=1} found{print}'to capture everything from the match to the end of the log.
| panic_log=$(kubectl logs -n "${OPERATOR_NS:-$NAMESPACE}" "$operator_pod" -c operator | grep -A 100 "Observed a panic") | |
| panic_log=$(kubectl logs -n "${OPERATOR_NS:-$NAMESPACE}" "$operator_pod" -c operator | awk '/Observed a panic/{found=1} found{print}') |
Signed-off-by: Mayank Shah <mayank.shah@percona.com>
Signed-off-by: Mayank Shah <mayank.shah@percona.com>
commit: ecd71ad |
CHANGE DESCRIPTION
Adds a check in e2e tests to ensure no panics are detected.
CHECKLIST
Jira
Needs Doc) and QA (Needs QA)?Tests
Config/Logging/Testability