Skip to content

Conversation

@pmachapman
Copy link
Collaborator

@pmachapman pmachapman commented Jan 29, 2026

Fixes #834

To test locally, stop the serval-machine-engine container, and start a build (for example using the API example). Messages will then retry with an exponential back off.

Restarting serval-machine-engine will cause the messages succeed on retry when the timeout is hit. Previously, messages would not retry in a service outage like this, blocking the queue.

Due to the way the subscription appears to work (watching the first message in the queue for update or deletion), the timeout is in effect as long as there are messages in the queue, and the sending of the first message in the queue has failed. This works correctly now that the attempts counter for that item is no longer incremented (which triggered the subscription, which ran process messages, which failed, which triggered the subscription... ad infinitum).

When the queue is empty, the arrival of the first message in the queue will commence processing, as per before.


This change is Reviewable

@codecov-commenter
Copy link

codecov-commenter commented Jan 29, 2026

Codecov Report

❌ Patch coverage is 67.85714% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 66.51%. Comparing base (4bafb5f) to head (dd1c724).

Files with missing lines Patch % Lines
...L.ServiceToolkit/Services/OutboxDeliveryService.cs 67.85% 7 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #862      +/-   ##
==========================================
- Coverage   66.53%   66.51%   -0.02%     
==========================================
  Files         382      382              
  Lines       20882    20896      +14     
  Branches     2714     2720       +6     
==========================================
+ Hits        13893    13900       +7     
- Misses       6022     6027       +5     
- Partials      967      969       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Collaborator

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, Peter!

I believe the subscription should be waiting for an insert to the collection, not an update or deletion. If it isn't, I think that's a bug, but I could be wrong.

Also, I may be misunderstanding, but this back-off you've implemented, won't it affect the processing of all messages? I would have thought that we'd want to have a retry mechanism per message group so that we don't hold up all builds if one is having problems. It may be that I'm just not following the code correctly.

@Enkidu93 reviewed 2 files and all commit messages, and made 1 comment.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @ddaspit).

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ddaspit reviewed 2 files and all commit messages, and made 1 comment.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @pmachapman).


src/ServiceToolkit/src/SIL.ServiceToolkit/Services/OutboxDeliveryService.cs line 170 at r1 (raw file):

    {
        // log error
        await messages.UpdateAsync(m => m.Id == message.Id, b => b.Inc(m => m.Attempts, 1));

Is Attempts no longer needed? Can we remove it?

Copy link
Collaborator Author

@pmachapman pmachapman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the subscription should be waiting for an insert to the collection, not an update or deletion. If it isn't, I think that's a bug, but I could be wrong.

The other components that use a MongoSubscription, such as BuildService.GetNewerRevisionAsync() depend on watching for updates to a document, so I figured it was best to not modify MongoSubscription, and just stop the OutboxDeliveryService from modifying the document. It also makes conceptual sense to have a subscription fire when the objects are updated, I think, as the collection of objects in memory should match the objects in MongoDB, including any updates or deletions.

Also, I may be misunderstanding, but this back-off you've implemented, won't it affect the processing of all messages?

Only when the status code is Unavailable, Unauthenticated, PermissionDenied, or Cancelled (these result from some form of service outage). For all others, any new message arriving in the outbox will begin processing (including retrying the failed ones), as the timeout does not halt the subscription.

I would have thought that we'd want to have a retry mechanism per message group so that we don't hold up all builds if one is having problems. It may be that I'm just not following the code correctly.

I started down that path, but due the complexity of storing per-message group values and re-triggering processing from these values, I thought it was better to leave that change for a later replacement of the outbox system, than over-invest in the current implementation only to have it replaced by RabbitMQ (or similar) later this year. I tried to keep the existing message group processing (the break statements in ProcessMessagesAsync cause the next message group to begin processing after specific failure codes).

My main goals in this implementation are:

  • To fix the bug that caused messages to retry every second (8b2916f)
  • To retry message processing that has failed due to service outage, in a timely manner with a backoff (41e5e3f)
  • To not have failed messages block the queue, unless the failure was from a service outage
  • To continue to have messages processed when they arrive in the outbox

@pmachapman made 2 comments.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @ddaspit).


src/ServiceToolkit/src/SIL.ServiceToolkit/Services/OutboxDeliveryService.cs line 170 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Is Attempts no longer needed? Can we remove it?

Done. Thank you!

Copy link
Collaborator

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other components that use a MongoSubscription, such as BuildService.GetNewerRevisionAsync() depend on watching for updates to a document, so I figured it was best to not modify MongoSubscription, and just stop the OutboxDeliveryService from modifying the document. It also makes conceptual sense to have a subscription fire when the objects are updated, I think, as the collection of objects in memory should match the objects in MongoDB, including any updates or deletions.

Ok, I think I was trying to ask a different question than you are answering haha. I was looking at the code in the MongoSubscription and it should be listening for inserts if there are no documents when the subscription is started. I just wanted to make sure that your initial comment didn't mean that the subscription wasn't working properly or was getting stuck in situations where it should be catching inserts. But re-reading now, I don't think that's what you were saying, so nevermind 😅 - thank you for bearing with me!

Only when the status code is Unavailable, Unauthenticated, PermissionDenied, or Cancelled (these result from some form of service outage). ...
Right, OK, that makes sense.

I started down that path, but due the complexity of storing per-message group values and re-triggering processing from these values, I thought it was better to leave that change for a later ...
I see - that makes sense. That would have been hairy to implement for sure! And likely, there aren't many cases besides the cases of a general service outage where we would want to retry anyways.

:lgtm:

@Enkidu93 reviewed 2 files and all commit messages, and made 1 comment.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @ddaspit).

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be handling retries on a per message group basis, but I understand that it adds a lot of complexity. Can we add an issue to call out this deficiency in the outbox delivery service? Thanks. :lgtm:

@ddaspit reviewed 2 files and all commit messages, made 1 comment, and resolved 1 discussion.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @pmachapman).

@pmachapman pmachapman merged commit 214b1e1 into main Feb 2, 2026
3 checks passed
@pmachapman pmachapman deleted the outbox_retry branch February 2, 2026 18:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Outbox is incorrectly retrying to process messages

4 participants