
Properly check for expunged status during deletes #10257

Open
jmpesp wants to merge 1 commit into oxidecomputer:main from jmpesp:properly_check_for_sled_and_disk_expunged

Conversation

@jmpesp (Contributor) commented Apr 9, 2026

When sending a delete request to a remote service during a saga, the usual pattern is to provide a "gone" check to bail out of the retry loop: if the remote service was impacted by an expunge then it's pointless to continue retrying a delete request, as something backing that service is gone.

There were TWO places not doing this correctly:

  • when detaching a volume from a pantry
  • when deleting local storage datasets

In these cases, correct this by matching on the returned error of either `ProgenitorOperationRetry::run` or `retry_operation_while_indefinitely` (from progenitor-extras) and checking `e.is_gone`. If that matches, then the saga can proceed.

Additionally, there are places where the gone check only checked if the sled was still in-service, but in reality needed to also check if the physical disk backing a zpool was expunged. Fix that too.

Also! The check functions bailed with an error if a sled or disk was not in-service, instead of returning a boolean. This meant that if a sled or disk was not in-service, the retry loop would exit with a `GoneCheckError` instead of a `Gone` value. This would not properly match for call sites that checked `is_gone` on a returned error, leading to saga unwinds that were ultimately not necessary. Change those functions to return a bool corresponding to whether or not the thing is in-service, and map that into a `GoneCheckResult` inside the gone check functions.

Fixes #10224
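The retry-with-gone-check pattern described above can be sketched as follows; `GoneCheckResult`, `RetryError`, and the retry loop shape are simplified stand-ins, not the actual omicron/progenitor-extras API:

```rust
// Simplified sketch of the "gone" check pattern: retry an operation
// against a remote service, but stop once a gone check reports that the
// service's backing resources were expunged. Names are illustrative.

#[derive(Debug, PartialEq)]
enum GoneCheckResult {
    StillAlive,
    Gone,
}

#[derive(Debug)]
enum RetryError {
    // The gone check determined the remote service no longer exists.
    Gone,
    // The gone check itself failed (e.g. a database query error).
    GoneCheckError(String),
}

impl RetryError {
    fn is_gone(&self) -> bool {
        matches!(self, RetryError::Gone)
    }
}

// Retry `operation` until it succeeds, consulting `gone_check` after
// each failure. Real code would also back off between attempts.
fn retry_until_gone<F, G>(mut operation: F, gone_check: G) -> Result<(), RetryError>
where
    F: FnMut() -> Result<(), ()>,
    G: Fn() -> Result<GoneCheckResult, String>,
{
    loop {
        if operation().is_ok() {
            return Ok(());
        }
        match gone_check() {
            Ok(GoneCheckResult::StillAlive) => continue, // keep retrying
            Ok(GoneCheckResult::Gone) => return Err(RetryError::Gone),
            Err(msg) => return Err(RetryError::GoneCheckError(msg)),
        }
    }
}

fn main() {
    // A delete whose target was expunged: the request always fails, and
    // the gone check reports the backing service is gone.
    let result = retry_until_gone(|| Err(()), || Ok(GoneCheckResult::Gone));

    // The call site matches on `is_gone` and lets the saga proceed.
    match result {
        Ok(()) => println!("deleted"),
        Err(e) if e.is_gone() => println!("already gone, proceeding"),
        Err(e) => println!("saga unwinds: {:?}", e),
    }
}
```

The bug described below is exactly the distinction between the two `Err` variants here: an expunged resource must surface as `Gone` (which call sites treat as success for deletes), not as `GoneCheckError` (which unwinds the saga).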

Comment on lines +69 to +75
let sled_in_service =
    Self::check_sled_in_service_on_connection(
        &conn,
        disk.sled_id(),
    )
    .await
    .map_err(|txn_error| txn_error.into_diesel(&err))?;
Collaborator:
To confirm, this change is "functionally identical", right? just forced by the new signature of check_sled_in_service_on_connection?

(as in: old behavior if sled is not in service -> throw an error, and that's the new behavior too?)

Contributor Author:
Correct - the `TransactionError::CustomError` returned right after this highlighted line is what happened previously, and is in fact checked by the `physical_disk_cannot_insert_to_expunged_sled` test

Comment on lines +142 to +145
// Do not match on `e.is_gone` here: if the Pantry is gone, then return an
// error. The attach may have succeeded for one of the previous calls but if
// the retry loop bails out after determining that the Pantry is gone, that
// attachment is now gone too.
Collaborator:
I know you've scattered a bunch of these "Do not match on is_gone" comments through this PR, but I'm finding them a bit confusing.

It seems kinda like this is "obviously true" that:

  • If you are trying to allocate a resource on top of some service X
  • ... and "X" is removed from existence, which throws an error
  • ... you should not match on that error and say "allocation succeeded" anyway

Similarly, the other commentary about unwinding seems true, but more a property of "how we're using sagas" than the specifics of this call site, if that makes sense?

I guess, put another way: the situation where we do:

  1. Try to delete a resource
  2. Special-case the error to say "if the resource is already gone", return success anyway

Seems like it's the special-case that deserves commentary - not the other way around.

)
.await
.map_err(|e| {
// Do not match on `e.is_gone` here: if the ensure fails due to the sled
Collaborator:

See my note above; I would omit this comment.

)
.await
.map_err(|e| {
// Do not match on `e.is_gone` here: If the ensure retry loop bailed out
Collaborator:

See my note above; I would omit this comment.

osagactx
.datastore()
.check_sled_in_service(&opctx, sled_id)
sled_out_of_service_gone_check(osagactx.datastore(), &opctx, sled_id)
Collaborator:

This is "functionally identical", just using a slightly different API to accomplish the same thing, right?

Contributor Author:

It is accomplishing the same thing, and also this is one of the bug fixes: previously, `check_sled_in_service` would return `Ok(())` if a sled is in service, and an error if it isn't. This would bubble up as a `GoneCheckError` instead of a `Gone`, and code matching on `e.is_gone` wouldn't execute. Now it returns `Ok(true)` if the sled is in service, `Ok(false)` if it isn't, and an `Err` when there's a legitimate query error.
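A minimal sketch of that signature change; the real functions run database queries, so the bodies here are stand-ins that take the sled state and a simulated query failure directly:

```rust
// Sketch of the fix: the in-service check returns a bool, and the gone
// check wrapper maps it into a result type, reserving `Err` for genuine
// query failures. Function names mirror the PR; bodies are stand-ins.

#[derive(Debug, PartialEq)]
enum GoneCheckResult {
    StillAlive,
    Gone,
}

// New shape: `Ok(true)` if in service, `Ok(false)` if not, and `Err`
// only for a legitimate query error (simulated here by `query_fails`).
fn check_sled_in_service(in_service: bool, query_fails: bool) -> Result<bool, String> {
    if query_fails {
        return Err("database query failed".to_string());
    }
    Ok(in_service)
}

// The gone check maps the boolean into `GoneCheckResult`, so an
// expunged sled surfaces as `Gone` rather than as an error.
fn sled_out_of_service_gone_check(
    in_service: bool,
    query_fails: bool,
) -> Result<GoneCheckResult, String> {
    Ok(if check_sled_in_service(in_service, query_fails)? {
        GoneCheckResult::StillAlive
    } else {
        GoneCheckResult::Gone
    })
}

fn main() {
    assert_eq!(
        sled_out_of_service_gone_check(true, false),
        Ok(GoneCheckResult::StillAlive)
    );
    assert_eq!(
        sled_out_of_service_gone_check(false, false),
        Ok(GoneCheckResult::Gone)
    );
    // Only a real query failure produces an error now.
    assert!(sled_out_of_service_gone_check(false, true).is_err());
}
```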

// In this case, if the particular disk hosting this local storage was
// expunged, or if the sled was expunged, then proceed with the rest of
// the saga.
Err(e) if e.is_gone() => Ok(()),
Collaborator:

Should this be:

Suggested change
Err(e) if e.is_gone() => Ok(()),
Err(e) if e.is_gone() || e.is_not_found() => Ok(()),

Does local_storage_dataset_delete return a 404 if called twice in a row?

Contributor Author:

It should be idempotent for DELETEs.
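The call-site special case being discussed in these threads can be sketched like this; the types are illustrative (the real code matches on the progenitor client error):

```rust
// Sketch of the call-site special case: for a delete, "the backing
// service is gone" is treated as success, since the resource being
// deleted cannot outlive the service that hosted it.

#[derive(Debug)]
enum DeleteError {
    // The sled or disk backing the service was expunged.
    Gone,
    // Any other failure, which should unwind the saga.
    Other(String),
}

impl DeleteError {
    fn is_gone(&self) -> bool {
        matches!(self, DeleteError::Gone)
    }
}

// Treat a "gone" error as success so the saga can proceed; everything
// else propagates and unwinds the saga.
fn handle_delete_result(result: Result<(), DeleteError>) -> Result<(), String> {
    match result {
        Ok(()) => Ok(()),
        Err(e) if e.is_gone() => Ok(()),
        Err(e) => Err(format!("delete failed: {:?}", e)),
    }
}

fn main() {
    assert!(handle_delete_result(Ok(())).is_ok());
    assert!(handle_delete_result(Err(DeleteError::Gone)).is_ok());
    assert!(handle_delete_result(Err(DeleteError::Other("io error".into()))).is_err());
}
```

This is the shape of the reviewer's point above: the `Err(e) if e.is_gone() => Ok(())` arm is the special case that deserves a comment, because it deliberately converts a failure into success.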


// The pantry is stateless: if it is gone, then the Volume was
// destroyed, and we can proceed as if it was detached.
Err(e) if e.is_gone() => Ok(()),
Collaborator:

Should this be:

Suggested change
Err(e) if e.is_gone() => Ok(()),
Err(e) if e.is_gone() || e.is_not_found() => Ok(()),

Does detach return a 404 if called multiple times in a row?

Contributor Author:

Same, it should be idempotent for detach.

@smklein (Collaborator) commented Apr 13, 2026

I'm trying to review this carefully, but it looks like it's mixing together a bunch of small changes that overlap. I'm trying to be diligent here and make sure I understand what is fixing what.

When sending a delete request to a remote service during a saga, the usual pattern is to provide a "gone" check to bail out of the retry loop: if the remote service was impacted by an expunge then it's pointless to continue retrying a delete request, as something backing that service is gone.

There were TWO places not doing this correctly:

  • when detaching a volume from a pantry
  • when deleting local storage datasets

This is in:

  • `call_pantry_detach` and
  • `sdd_delete_local_storage`, correct?

And it's referring to the "match-on-error, do an extra gone-check to allow deletion to succeed anyway, rather than stopping the saga", correct?

In these cases correct this by matching on the returned error of either ProgenitorOperationRetry::run or retry_operation_while_indefinitely (from progenitor-extras), and checking e.is_gone. If that matches then the saga can proceed.

Additionally, there are places where the gone check only checked if the sled was still in-service, but in reality needed to also check if the physical disk backing a zpool was expunged. Fix that too.

These are places in:

  • `sdd_delete_local_storage` and
  • `sis_ensure_local_storage`

@smklein (Collaborator) commented Apr 13, 2026

One more thing:

As described in my comment above, we have the four big behavior changes in this PR:

2x "let the gone check return Ok()"
2x "check the zpool status in addition to the sled for the gone check"

It would be great to get test coverage for these! They're all behavior changes that should be controllable if we can set up the DB correctly, right?

@jmpesp (Contributor Author) commented Apr 16, 2026

I'm trying to review this carefully, but it looks like it's mixing together a bunch of small changes that overlap. I'm trying to be diligent here and make sure I understand what is fixing what.

When sending a delete request to a remote service during a saga, the usual pattern is to provide a "gone" check to bail out of the retry loop: if the remote service was impacted by an expunge then it's pointless to continue retrying a delete request, as something backing that service is gone.
There were TWO places not doing this correctly:

  • when detaching a volume from a pantry
  • when deleting local storage datasets

This is in:

* `call_pantry_detach` and

* `sdd_delete_local_storage`, correct?

And it's referring to the "match-on-error, do an extra gone-check to allow deletion to succeed anyway, rather than stopping the saga", correct?

In these cases correct this by matching on the returned error of either ProgenitorOperationRetry::run or retry_operation_while_indefinitely (from progenitor-extras), and checking e.is_gone. If that matches then the saga can proceed.

Correct!

Additionally, there are places where the gone check only checked if the sled was still in-service, but in reality needed to also check if the physical disk backing a zpool was expunged. Fix that too.

These are places in:

* `sdd_delete_local_storage` and

* `sis_ensure_local_storage`

Also correct! :)



Development

Successfully merging this pull request may close these issues.

Unable to delete local disks on expunged sled

2 participants