-
-
Notifications
You must be signed in to change notification settings - Fork 2
Add outbox retry timeout #862
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #862 +/- ##
==========================================
- Coverage 66.53% 66.51% -0.02%
==========================================
Files 382 382
Lines 20882 20896 +14
Branches 2714 2720 +6
==========================================
+ Hits 13893 13900 +7
- Misses 6022 6027 +5
- Partials 967 969 +2 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Enkidu93
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, Peter!
I believe the subscription should be waiting for an insert to the collection, not an update or deletion. If it isn't, I think that's a bug, but I could be wrong.
Also, I may be misunderstanding, but this back-off you've implemented, won't it affect the processing of all messages? I would have thought that we'd want to have a retry mechanism per message group so that we don't hold up all builds if one is having problems. It may be that I'm just not following the code correctly.
@Enkidu93 reviewed 2 files and all commit messages, and made 1 comment.
Reviewable status:complete! all files reviewed, all discussions resolved (waiting on @ddaspit).
ddaspit
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ddaspit reviewed 2 files and all commit messages, and made 1 comment.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @pmachapman).
src/ServiceToolkit/src/SIL.ServiceToolkit/Services/OutboxDeliveryService.cs line 170 at r1 (raw file):
{ // log error await messages.UpdateAsync(m => m.Id == message.Id, b => b.Inc(m => m.Attempts, 1));
Is Attempts no longer needed? Can we remove it?
04a6b85 to
dd1c724
Compare
pmachapman
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe the subscription should be waiting for an insert to the collection, not an update or deletion. If it isn't, I think that's a bug, but I could be wrong.
The other components that use a MongoSubscription, such as BuildService.GetNewerRevisionAsync() depend on watching for updates to a document, so I figured it was best to not modify MongoSubscription, and just stop the OutboxDeliveryService from modifying the document. It also makes conceptual sense to have a subscription fire when the objects are updated, I think, as the collection of objects in memory should match the objects in MongoDB, including any updates or deletions.
Also, I may be misunderstanding, but this back-off you've implemented, won't it affect the processing of all messages?
Only when the status code is Unavailable, Unauthenticated, PermissionDenied, or Cancelled (these result from some form of service outage). For all others, any new message arriving in the outbox will begin processing (including retrying the failed ones), as the timeout does not halt the subscription.
I would have thought that we'd want to have a retry mechanism per message group so that we don't hold up all builds if one is having problems. It may be that I'm just not following the code correctly.
I started down that path, but due the complexity of storing per-message group values and re-triggering processing from these values, I thought it was better to leave that change for a later replacement of the outbox system, than over-invest in the current implementation only to have it replaced by RabbitMQ (or similar) later this year. I tried to keep the existing message group processing (the break statements in ProcessMessagesAsync cause the next message group to begin processing after specific failure codes).
My main goals in this implementation are:
- To fix the bug that caused messages to retry every second (8b2916f)
- To retry message processing that has failed due to service outage, in a timely manner with a backoff (41e5e3f)
- To not have failed messages block the queue, unless the failure was from a service outage
- To continue to have messages processed when they arrive in the outbox
@pmachapman made 2 comments.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @ddaspit).
src/ServiceToolkit/src/SIL.ServiceToolkit/Services/OutboxDeliveryService.cs line 170 at r1 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
Is
Attemptsno longer needed? Can we remove it?
Done. Thank you!
Enkidu93
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other components that use a
MongoSubscription, such asBuildService.GetNewerRevisionAsync()depend on watching for updates to a document, so I figured it was best to not modifyMongoSubscription, and just stop theOutboxDeliveryServicefrom modifying the document. It also makes conceptual sense to have a subscription fire when the objects are updated, I think, as the collection of objects in memory should match the objects in MongoDB, including any updates or deletions.
Ok, I think I was trying to ask a different question than you are answering haha. I was looking at the code in the MongoSubscription and it should be listening for inserts if there are no documents when the subscription is started. I just wanted to make sure that your initial comment didn't mean that the subscription wasn't working properly or was getting stuck in situations where it should be catching inserts. But re-reading now, I don't think that's what you were saying, so nevermind 😅 - thank you for bearing with me!
Only when the status code is
Unavailable,Unauthenticated,PermissionDenied, orCancelled(these result from some form of service outage). ...
Right, OK, that makes sense.
I started down that path, but due the complexity of storing per-message group values and re-triggering processing from these values, I thought it was better to leave that change for a later ...
I see - that makes sense. That would have been hairy to implement for sure! And likely, there aren't many cases besides the cases of a general service outage where we would want to retry anyways.
@Enkidu93 reviewed 2 files and all commit messages, and made 1 comment.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @ddaspit).
ddaspit
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should be handling retries on a per message group basis, but I understand that it adds a lot of complexity. Can we add an issue to call out this deficiency in the outbox delivery service? Thanks.
@ddaspit reviewed 2 files and all commit messages, made 1 comment, and resolved 1 discussion.
Reviewable status:complete! all files reviewed, all discussions resolved (waiting on @pmachapman).
Fixes #834
To test locally, stop the
serval-machine-enginecontainer, and start a build (for example using the API example). Messages will then retry with an exponential back off.Restarting
serval-machine-enginewill cause the messages succeed on retry when the timeout is hit. Previously, messages would not retry in a service outage like this, blocking the queue.Due to the way the subscription appears to work (watching the first message in the queue for update or deletion), the timeout is in effect as long as there are messages in the queue, and the sending of the first message in the queue has failed. This works correctly now that the attempts counter for that item is no longer incremented (which triggered the subscription, which ran process messages, which failed, which triggered the subscription... ad infinitum).
When the queue is empty, the arrival of the first message in the queue will commence processing, as per before.
This change is