Properly check for expunged status during deletes #10257
jmpesp wants to merge 1 commit into oxidecomputer:main from
Conversation
When sending a delete request to a remote service during a saga, the usual pattern is to provide a "gone" check to bail out of the retry loop: if the remote service was impacted by an expunge then it's pointless to continue retrying a delete request, as something backing that service is gone.

There were TWO places not doing this correctly:

- when detaching a volume from a pantry
- when deleting local storage datasets

In these cases, correct this by matching on the returned error of either `ProgenitorOperationRetry::run` or `retry_operation_while_indefinitely` (from progenitor-extras), and checking `e.is_gone`. If that matches then the saga can proceed.

Additionally, there are places where the gone check only checked if the sled was still in-service, but in reality needed to also check if the physical disk backing a zpool was expunged. Fix that too.

Also! The check functions bailed with an error if a sled or disk was not in-service, instead of returning a boolean. This meant that if a sled or disk was not in-service, the retry loop would exit with a `GoneCheckError` instead of a `Gone` value. This would not properly match for call sites that checked `is_gone` on a returned error, leading to saga unwinds that were ultimately not necessary. Change those functions to return a bool corresponding to whether or not the thing is in-service, and map that into a `GoneCheckResult` inside the gone check functions.

Fixes oxidecomputer#10224
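As a rough illustration of the pattern described above, a saga node matches on the retry error and treats "gone" as success. The types and names below are hypothetical stand-ins, not the actual omicron or progenitor-client API:

```rust
// Hedged sketch of the gone-check pattern; `RetryError` and the
// functions below are illustrative stand-ins only.

#[derive(Debug)]
struct RetryError {
    gone: bool,
}

impl RetryError {
    fn is_gone(&self) -> bool {
        self.gone
    }
}

// Pretend "remote delete with retry": in the real code this loops,
// re-running the request and the gone check each iteration; here it is
// collapsed to a single decision.
fn delete_with_retry(service_expunged: bool) -> Result<(), RetryError> {
    if service_expunged {
        // The gone check fired: stop retrying and report "gone".
        Err(RetryError { gone: true })
    } else {
        Ok(())
    }
}

// Saga node: a delete against an expunged service is treated as done,
// because there is nothing left to delete.
fn saga_delete_node(service_expunged: bool) -> Result<(), String> {
    match delete_with_retry(service_expunged) {
        Ok(()) => Ok(()),
        // The special case this PR adds at the broken call sites.
        Err(e) if e.is_gone() => Ok(()),
        Err(e) => Err(format!("delete failed: {e:?}")),
    }
}
```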
```rust
let sled_in_service =
    Self::check_sled_in_service_on_connection(
        &conn,
        disk.sled_id(),
    )
    .await
    .map_err(|txn_error| txn_error.into_diesel(&err))?;
```
To confirm, this change is "functionally identical", right? Just forced by the new signature of `check_sled_in_service_on_connection`?
(As in: the old behavior if a sled is not in service was to throw an error, and that's the new behavior too?)
Correct - the `TransactionError::CustomError` returned right after this highlighted line is what happened previously, and is in fact checked by the `physical_disk_cannot_insert_to_expunged_sled` test
```rust
// Do not match on `e.is_gone` here: if the Pantry is gone, then return an
// error. The attach may have succeeded for one of the previous calls but if
// the retry loop bails out after determining that the Pantry is gone, that
// attachment is now gone too.
```
I know you've scattered a bunch of these "Do not match on is_gone" comments through this PR, but I'm finding them a bit confusing.
It seems kinda like this is "obviously true" that:
- If you are trying to allocate a resource on top of some service X
- ... and "X" is removed from existence, which throws an error
- ... you should not match on that error and say "allocation succeeded" anyway
Similarly, the other commentary about unwinding seems true, but more a property of "how we're using sagas" than the specifics of this call site, if that makes sense?
I guess, put another way: the situation where we do:
- Try to delete a resource
- Special-case the error to say "if the resource is already gone", return success anyway
Seems like it's the special-case that deserves commentary - not the other way around.
```rust
)
.await
.map_err(|e| {
    // Do not match on `e.is_gone` here: if the ensure fails due to the sled
```
See my note above; I would omit this comment.
```rust
)
.await
.map_err(|e| {
    // Do not match on `e.is_gone` here: If the ensure retry loop bailed out
```
See my note above; I would omit this comment.
```diff
-osagactx
-    .datastore()
-    .check_sled_in_service(&opctx, sled_id)
+sled_out_of_service_gone_check(osagactx.datastore(), &opctx, sled_id)
```
This is "functionally identical", just using a slightly different API to accomplish the same thing, right?
It is accomplishing the same thing, and also this is one of the bug fixes: previously, `check_sled_in_service` would return `Ok(())` if a sled is in service, and an error if it isn't. This would bubble up as a `GoneCheckError` instead of a `Gone`, and code matching on `e.is_gone` wouldn't execute. Now it returns `Ok(true)` if the sled is in service, `Ok(false)` if it isn't, and an `Err` when there's a legitimate query error.
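A minimal sketch of that before/after change, with hypothetical signatures standing in for omicron's real ones (`StillRunning` is an invented variant name):

```rust
// Before (sketch): the check itself errored when the sled was out of
// service, which surfaced as a check error rather than a "gone" result.
fn check_sled_in_service_old(in_service: bool) -> Result<(), String> {
    if in_service {
        Ok(())
    } else {
        Err("sled not in service".to_string())
    }
}

// After (sketch): the result type the retry loop actually matches on.
enum GoneCheckResult {
    Gone,
    StillRunning, // hypothetical variant name
}

// The check now reports a bool; Err is reserved for genuine query errors.
fn check_sled_in_service_new(in_service: bool) -> Result<bool, String> {
    Ok(in_service)
}

// The gone-check wrapper maps the bool into a GoneCheckResult.
fn sled_gone_check(in_service: bool) -> Result<GoneCheckResult, String> {
    match check_sled_in_service_new(in_service)? {
        true => Ok(GoneCheckResult::StillRunning),
        false => Ok(GoneCheckResult::Gone),
    }
}
```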
```rust
// In this case, if the particular disk hosting this local storage was
// expunged, or if the sled was expunged, then proceed with the rest of
// the saga.
Err(e) if e.is_gone() => Ok(()),
```
Should this be:

```diff
-Err(e) if e.is_gone() => Ok(()),
+Err(e) if e.is_gone() || e.is_not_found() => Ok(()),
```

Does `local_storage_dataset_delete` return a 404 if called twice in a row?
It should be idempotent for DELETEs.
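"Idempotent for DELETEs" here means a repeated delete still succeeds rather than returning a 404, because removing an already-absent entry is not an error. A toy sketch (an in-memory map standing in for the dataset records, not the real endpoint):

```rust
use std::collections::HashMap;

// Toy stand-in for the dataset store; an idempotent delete succeeds
// whether or not the entry still exists.
fn delete_dataset(
    store: &mut HashMap<u32, &'static str>,
    id: u32,
) -> Result<(), String> {
    // `remove` returns None for a missing key; we deliberately do not
    // treat that as an error.
    store.remove(&id);
    Ok(())
}
```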
```rust
// The pantry is stateless: if it is gone, then the Volume was
// destroyed, and we can proceed as if it was detached.
Err(e) if e.is_gone() => Ok(()),
```
Should this be:

```diff
-Err(e) if e.is_gone() => Ok(()),
+Err(e) if e.is_gone() || e.is_not_found() => Ok(()),
```

Does detach return a 404 if called multiple times in a row?
Same, it should be idempotent for detach.
I'm trying to review this carefully, but it looks like it's mixing together a bunch of small changes that overlap. I'm trying to be diligent here and make sure I understand what is fixing what.
This is in:
And it's referring to the "match-on-error, do an extra gone-check to allow deletion to succeed anyway, rather than stopping the saga", correct?
These are places in:
One more thing: as described in my comment above, we have the four big behavior changes in this PR: 2x "let the gone check return Ok()". It would be great to get test coverage for these! They're all behavior changes that should be controllable if we can set up the DB correctly, right?
Correct!
Also correct! :)