
Properly check for expunged status during deletes #10257

Open
jmpesp wants to merge 1 commit into oxidecomputer:main from jmpesp:properly_check_for_sled_and_disk_expunged

Conversation

@jmpesp (Contributor) commented Apr 9, 2026

When sending a delete request to a remote service during a saga, the usual pattern is to provide a "gone" check to bail out of the retry loop: if the remote service was impacted by an expunge then it's pointless to continue retrying a delete request, as something backing that service is gone.

There were TWO places not doing this correctly:

  • when detaching a volume from a pantry
  • when deleting local storage datasets

In these cases, correct this by matching on the returned error of either `ProgenitorOperationRetry::run` or `retry_operation_while_indefinitely` (from progenitor-extras) and checking `e.is_gone`. If that matches, then the saga can proceed.

Additionally, there are places where the gone check only checked if the sled was still in-service, but in reality needed to also check if the physical disk backing a zpool was expunged. Fix that too.

Also! The check functions bailed with an error if a sled or disk was not in-service, instead of returning a boolean. This meant that if a sled or disk was not in-service, the retry loop would exit with a `GoneCheckError` instead of a `Gone` value. This would not properly match for call sites that checked `is_gone` on a returned error, leading to saga unwinds that were ultimately not necessary. Change those functions to return a bool corresponding to whether or not the thing is in-service, and map that into a `GoneCheckResult` inside the gone check functions.

Fixes #10224
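The retry-with-gone-check pattern described above can be sketched as follows; `GoneCheckResult`, `RetryError`, and the retry loop shape are simplified stand-ins, not the actual omicron/progenitor-extras API:

```rust
// Simplified sketch of the "gone" check pattern: retry an operation
// against a remote service, but stop once a gone check reports that the
// service's backing resources were expunged. Names are illustrative.

#[derive(Debug, PartialEq)]
enum GoneCheckResult {
    StillAlive,
    Gone,
}

#[derive(Debug)]
enum RetryError {
    // The gone check determined the remote service no longer exists.
    Gone,
    // The gone check itself failed (e.g. a database query error).
    GoneCheckError(String),
}

impl RetryError {
    fn is_gone(&self) -> bool {
        matches!(self, RetryError::Gone)
    }
}

// Retry `operation` until it succeeds, consulting `gone_check` after
// each failure. Real code would also back off between attempts.
fn retry_until_gone<F, G>(mut operation: F, gone_check: G) -> Result<(), RetryError>
where
    F: FnMut() -> Result<(), ()>,
    G: Fn() -> Result<GoneCheckResult, String>,
{
    loop {
        if operation().is_ok() {
            return Ok(());
        }
        match gone_check() {
            Ok(GoneCheckResult::StillAlive) => continue, // keep retrying
            Ok(GoneCheckResult::Gone) => return Err(RetryError::Gone),
            Err(msg) => return Err(RetryError::GoneCheckError(msg)),
        }
    }
}

fn main() {
    // A delete whose target was expunged: the request always fails, and
    // the gone check reports the backing service is gone.
    let result = retry_until_gone(|| Err(()), || Ok(GoneCheckResult::Gone));

    // The call site matches on `is_gone` and lets the saga proceed.
    match result {
        Ok(()) => println!("deleted"),
        Err(e) if e.is_gone() => println!("already gone, proceeding"),
        Err(e) => println!("saga unwinds: {:?}", e),
    }
}
```

The bug described below is exactly the distinction between the two `Err` variants here: an expunged resource must surface as `Gone` (which call sites treat as success for deletes), not as `GoneCheckError` (which unwinds the saga).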

Comment on lines +69 to +75
let sled_in_service =
    Self::check_sled_in_service_on_connection(
        &conn,
        disk.sled_id(),
    )
    .await
    .map_err(|txn_error| txn_error.into_diesel(&err))?;
Collaborator:
To confirm, this change is "functionally identical", right? just forced by the new signature of check_sled_in_service_on_connection?

(as in: old behavior if sled is not in service -> throw an error, and that's the new behavior too?)

Contributor Author:
Correct - the `TransactionError::CustomError` returned right after this highlighted line is what happened previously, and is in fact checked by the `physical_disk_cannot_insert_to_expunged_sled` test

Comment on lines +142 to +145
// Do not match on `e.is_gone` here: if the Pantry is gone, then return an
// error. The attach may have succeeded for one of the previous calls but if
// the retry loop bails out after determining that the Pantry is gone, that
// attachment is now gone too.
Collaborator:
I know you've scattered a bunch of these "Do not match on is_gone" comments through this PR, but I'm finding them a bit confusing.

It seems kinda like this is "obviously true" that:

  • If you are trying to allocate a resource on top of some service X
  • ... and "X" is removed from existence, which throws an error
  • ... you should not match on that error and say "allocation succeeded" anyway

Similarly, the other commentary about unwinding seems true, but more a property of "how we're using sagas" than the specifics of this call site, if that makes sense?

I guess, put another way: the situation where we do:

  1. Try to delete a resource
  2. Special-case the error to say "if the resource is already gone", return success anyway

Seems like it's the special-case that deserves commentary - not the other way around.

)
.await
.map_err(|e| {
// Do not match on `e.is_gone` here: if the ensure fails due to the sled
Collaborator:

See my note above; I would omit this comment.

)
.await
.map_err(|e| {
// Do not match on `e.is_gone` here: If the ensure retry loop bailed out
Collaborator:

See my note above; I would omit this comment.

osagactx
.datastore()
.check_sled_in_service(&opctx, sled_id)
sled_out_of_service_gone_check(osagactx.datastore(), &opctx, sled_id)
Collaborator:

This is "functionally identical", just using a slightly different API to accomplish the same thing, right?

Contributor Author:

It is accomplishing the same thing, and also this is one of the bug fixes: previously, `check_sled_in_service` would return `Ok(())` if a sled is in service, and an error if it isn't. This would bubble up as a `GoneCheckError` instead of a `Gone`, and code matching on `e.is_gone` wouldn't execute. Now it returns `Ok(true)` if the sled is in service, `Ok(false)` if it isn't, and an `Err` when there's a legitimate query error.
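A minimal sketch of that signature change; the real functions run database queries, so the bodies here are stand-ins that take the sled state and a simulated query failure directly:

```rust
// Sketch of the fix: the in-service check returns a bool, and the gone
// check wrapper maps it into a result type, reserving `Err` for genuine
// query failures. Function names mirror the PR; bodies are stand-ins.

#[derive(Debug, PartialEq)]
enum GoneCheckResult {
    StillAlive,
    Gone,
}

// New shape: `Ok(true)` if in service, `Ok(false)` if not, and `Err`
// only for a legitimate query error (simulated here by `query_fails`).
fn check_sled_in_service(in_service: bool, query_fails: bool) -> Result<bool, String> {
    if query_fails {
        return Err("database query failed".to_string());
    }
    Ok(in_service)
}

// The gone check maps the boolean into `GoneCheckResult`, so an
// expunged sled surfaces as `Gone` rather than as an error.
fn sled_out_of_service_gone_check(
    in_service: bool,
    query_fails: bool,
) -> Result<GoneCheckResult, String> {
    Ok(if check_sled_in_service(in_service, query_fails)? {
        GoneCheckResult::StillAlive
    } else {
        GoneCheckResult::Gone
    })
}

fn main() {
    assert_eq!(
        sled_out_of_service_gone_check(true, false),
        Ok(GoneCheckResult::StillAlive)
    );
    assert_eq!(
        sled_out_of_service_gone_check(false, false),
        Ok(GoneCheckResult::Gone)
    );
    // Only a real query failure produces an error now.
    assert!(sled_out_of_service_gone_check(false, true).is_err());
}
```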

// In this case, if the particular disk hosting this local storage was
// expunged, or if the sled was expunged, then proceed with the rest of
// the saga.
Err(e) if e.is_gone() => Ok(()),
Collaborator:

Should this be:

Suggested change
Err(e) if e.is_gone() => Ok(()),
Err(e) if e.is_gone() || e.is_not_found() => Ok(()),

Does local_storage_dataset_delete return a 404 if called twice in a row?

Contributor Author:

It should be idempotent for DELETEs.
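The call-site special case being discussed in these threads can be sketched like this; the types are illustrative (the real code matches on the progenitor client error):

```rust
// Sketch of the call-site special case: for a delete, "the backing
// service is gone" is treated as success, since the resource being
// deleted cannot outlive the service that hosted it.

#[derive(Debug)]
enum DeleteError {
    // The sled or disk backing the service was expunged.
    Gone,
    // Any other failure, which should unwind the saga.
    Other(String),
}

impl DeleteError {
    fn is_gone(&self) -> bool {
        matches!(self, DeleteError::Gone)
    }
}

// Treat a "gone" error as success so the saga can proceed; everything
// else propagates and unwinds the saga.
fn handle_delete_result(result: Result<(), DeleteError>) -> Result<(), String> {
    match result {
        Ok(()) => Ok(()),
        Err(e) if e.is_gone() => Ok(()),
        Err(e) => Err(format!("delete failed: {:?}", e)),
    }
}

fn main() {
    assert!(handle_delete_result(Ok(())).is_ok());
    assert!(handle_delete_result(Err(DeleteError::Gone)).is_ok());
    assert!(handle_delete_result(Err(DeleteError::Other("io error".into()))).is_err());
}
```

This is the shape of the reviewer's point above: the `Err(e) if e.is_gone() => Ok(())` arm is the special case that deserves a comment, because it deliberately converts a failure into success.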


// The pantry is stateless: if it is gone, then the Volume was
// destroyed, and we can proceed as if it was detached.
Err(e) if e.is_gone() => Ok(()),
Collaborator:

Should this be:

Suggested change
Err(e) if e.is_gone() => Ok(()),
Err(e) if e.is_gone() || e.is_not_found() => Ok(()),

Does detach return a 404 if called multiple times in a row?

Contributor Author:

Same, it should be idempotent for detach.

@smklein (Collaborator) commented Apr 13, 2026

I'm trying to review this carefully, but it looks like it's mixing together a bunch of small changes that overlap. I'm trying to be diligent here and make sure I understand what is fixing what.

When sending a delete request to a remote service during a saga, the usual pattern is to provide a "gone" check to bail out of the retry loop: if the remote service was impacted by an expunge then it's pointless to continue retrying a delete request, as something backing that service is gone.

There were TWO places not doing this correctly:

  • when detaching a volume from a pantry
  • when deleting local storage datasets

This is in:

  • `call_pantry_detach` and
  • `sdd_delete_local_storage`, correct?

And it's referring to the "match-on-error, do an extra gone-check to allow deletion to succeed anyway, rather than stopping the saga", correct?

In these cases correct this by matching on the returned error of either ProgenitorOperationRetry::run or retry_operation_while_indefinitely (from progenitor-extras), and checking e.is_gone. If that matches then the saga can proceed.

Additionally, there are places where the gone check only checked if the sled was still in-service, but in reality needed to also check if the physical disk backing a zpool was expunged. Fix that too.

These are places in:

  • `sdd_delete_local_storage` and
  • `sis_ensure_local_storage`

@smklein (Collaborator) commented Apr 13, 2026

One more thing:

As described in my comment above, we have the four big behavior changes in this PR:

2x "let the gone check return Ok()"
2x "check the zpool status in addition to the sled for the gone check"

It would be great to get test coverage for these! They're all behavior changes that should be controllable if we can set up the DB correctly, right?

@jmpesp (Contributor Author) commented Apr 16, 2026

I'm trying to review this carefully, but it looks like it's mixing together a bunch of small changes that overlap. I'm trying to be diligent here and make sure I understand what is fixing what.

When sending a delete request to a remote service during a saga, the usual pattern is to provide a "gone" check to bail out of the retry loop: if the remote service was impacted by an expunge then it's pointless to continue retrying a delete request, as something backing that service is gone.
There were TWO places not doing this correctly:

  • when detaching a volume from a pantry
  • when deleting local storage datasets

This is in:

* `call_pantry_detach` and

* `sdd_delete_local_storage`, correct?

And it's referring to the "match-on-error, do an extra gone-check to allow deletion to succeed anyway, rather than stopping the saga", correct?

In these cases correct this by matching on the returned error of either ProgenitorOperationRetry::run or retry_operation_while_indefinitely (from progenitor-extras), and checking e.is_gone. If that matches then the saga can proceed.

Correct!

Additionally, there are places where the gone check only checked if the sled was still in-service, but in reality needed to also check if the physical disk backing a zpool was expunged. Fix that too.

These are places in:

* `sdd_delete_local_storage` and

* `sis_ensure_local_storage`

Also correct! :)



Development

Successfully merging this pull request may close these issues.

Unable to delete local disks on expunged sled

2 participants