Improve AWS S3 source error logging and diagnostics#24959
saliagadotcom wants to merge 8 commits into vectordotdev:master
Conversation
…ing and non-retryable message deletion
…nd detect total parse failures
I have read the CLA Document and I hereby sign the CLA. You can retrigger this bot by commenting `recheck` in this Pull Request. Posted by the CLA Assistant Lite bot.
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 25b6c6236a
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
src/sources/aws_s3/sqs.rs
Outdated
    error_kind,
    actionable_message: &actionable_message,
});
self.state.delete_failed_message && !class.is_retryable()
Avoid deleting S3 messages on unclassified client errors
This branch deletes failed S3 notifications whenever delete_failed_message is enabled and class.is_retryable() is false, but the new classifier maps all non-whitelisted 4xx responses to RequestError (and other unmatched cases to Unknown), both treated as non-retryable. That means transient or newly introduced AWS 4xx error codes can now cause immediate SQS message deletion instead of retry, leading to permanent data loss for affected objects. Restrict auto-delete to explicitly terminal errors (for example AccessDenied/NoSuchKey/NoSuchBucket) and keep RequestError/Unknown retryable.
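The reviewer's suggestion can be sketched as follows. This is a minimal, self-contained illustration with hypothetical names (`ErrorClass`, `classify_code`, `should_delete`), not the PR's actual code: only explicitly terminal AWS error codes permit auto-deletion, while unrecognized errors stay retryable.

```rust
// Hypothetical sketch of the suggested policy: auto-delete the SQS
// notification only for explicitly terminal AWS error codes; treat
// everything unrecognized (including generic 4xx) as retryable so a
// transient error can never cause permanent data loss.
#[derive(Debug, PartialEq)]
enum ErrorClass {
    Terminal,  // e.g. AccessDenied, NoSuchKey, NoSuchBucket
    Retryable, // throttling, connectivity, unclassified errors, ...
}

fn classify_code(aws_error_code: Option<&str>) -> ErrorClass {
    match aws_error_code {
        Some("AccessDenied") | Some("NoSuchKey") | Some("NoSuchBucket") => ErrorClass::Terminal,
        // Unmatched codes (and missing codes) default to retryable.
        _ => ErrorClass::Retryable,
    }
}

fn should_delete(delete_failed_message: bool, class: &ErrorClass) -> bool {
    delete_failed_message && *class == ErrorClass::Terminal
}

fn main() {
    assert!(should_delete(true, &classify_code(Some("NoSuchKey"))));
    assert!(!should_delete(true, &classify_code(Some("SomeNew4xxCode"))));
    assert!(!should_delete(false, &classify_code(Some("AccessDenied"))));
}
```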
So a message should only be deleted in two scenarios: when it is successfully processed, and when the sink rejects the batch and delete_failed_message is enabled. I will revert this.
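The author's reverted policy can be sketched as a simple decision function. The names here (`ProcessingOutcome`, `should_delete_message`) are hypothetical, illustrating only the two deletion scenarios described above.

```rust
// Hypothetical sketch of the deletion policy the author describes:
// an SQS message is deleted only after successful processing, or when
// the sink rejects the batch and delete_failed_message is enabled.
enum ProcessingOutcome {
    Success,
    SinkRejected,
    OtherFailure,
}

fn should_delete_message(outcome: &ProcessingOutcome, delete_failed_message: bool) -> bool {
    match outcome {
        ProcessingOutcome::Success => true,
        ProcessingOutcome::SinkRejected => delete_failed_message,
        // All other failures leave the message on the queue for retry.
        ProcessingOutcome::OtherFailure => false,
    }
}

fn main() {
    assert!(should_delete_message(&ProcessingOutcome::Success, false));
    assert!(should_delete_message(&ProcessingOutcome::SinkRejected, true));
    assert!(!should_delete_message(&ProcessingOutcome::SinkRejected, false));
    assert!(!should_delete_message(&ProcessingOutcome::OtherFailure, true));
}
```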
Summary
The AWS S3 source has opaque error logging across several failure modes. Users see generic messages like "Failed to process SQS message: service error" with no AWS error codes, no actionable guidance, and incorrect error type classification (`PARSER_FAILED` for network/auth errors). The retry backoff loop logs at `trace!` level, making persistent failures invisible in production.
Changes
- Error classification utilities (`src/aws/error.rs`, new file): reusable `extract_error_context()` and `classify_error()` utilities that pull AWS error codes, HTTP status, request IDs, and dispatch failure kind from any `SdkError`, classifying errors into Auth/NotFound/Throttling/Connectivity/Configuration/ServiceError.
- S3 object fetch failures (`sqs.rs`, `aws_sqs.rs`): new `S3ObjectGetFailed` internal event with actionable messages per error kind (NoSuchKey, AccessDenied, NoSuchBucket, etc.). Non-retryable errors now delete the SQS message when `delete_failed_message = true`, stopping infinite retry loops for deleted or inaccessible objects. `S3ObjectProcessingFailed` promoted from `debug!` to `warn!`, with key and error fields added.
- Startup logging (`mod.rs`): log `client_concurrency`, `compression`, `multiline`, `delete_failed_message`, and `acknowledgements` at startup alongside existing fields. Improved Snafu display strings for `ConfigMissing` and `InvalidNumberOfMessages`.
- Region mismatch warning (`sqs.rs`): now includes both the event region and the configured region, with remediation guidance.
- Deserialization diagnostics (`sqs.rs`): S3 object bucket/key added to deserialize error warnings. Detects when all frames fail deserialization (0 events produced) and emits a dedicated warning.
- SQS loop diagnostics (`sqs.rs`): backoff retry log promoted from `trace!` to `warn!`, with `queue_url` added. Poison message detection via `ApproximateReceiveCount > 5`. S3 test event promoted from `debug!` to `info!`. SNS envelope unwrapping logged at `debug!`.
Vector configuration
Standard AWS S3 source configuration with SQS-based notifications:
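A minimal sketch of such a configuration, with placeholder bucket/queue values; option names follow Vector's `aws_s3` source with SQS notifications, and `delete_failed_message` is the option discussed in this PR:

```toml
# Hypothetical example; region, queue URL, and component name are placeholders.
[sources.s3_logs]
type = "aws_s3"
region = "us-east-1"
compression = "auto"

[sources.s3_logs.sqs]
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/example-queue"
delete_message = true
delete_failed_message = true
```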
How did you test this PR?
- `make check-clippy`
- `make check-fmt`
- `make test`
Is this a breaking change?
Does this PR include user facing changes?
If not, add the `no-changelog` label to this PR.