Rollback Fixes#2213
Conversation
Similar to ostree, if we find any staged deployment while performing a rollback, we'll get rid of the staged deployment. The staged deployment still exists on disk and will be GC'd later Fixes: bootc-dev#2208 Signed-off-by: Pragyan Poudyal <pragyanpoudyal41999@gmail.com>
Add a case in the rollback test which tests rollback when there is a staged deployment present. This is a test for bootc-dev#2208 Signed-off-by: Pragyan Poudyal <pragyanpoudyal41999@gmail.com>
There was a problem hiding this comment.
Code Review
This pull request implements dropping staged deployments on rollback to match ostree's behavior, updates boot entry sorting logic to sort based on the bootloader type (by sort-key for systemd-boot and by filename for GRUB), and adds corresponding unit and integration tests. The reviewer identified a robustness issue in rollback.rs where stale staged entries might not be cleaned up if host.status.staged is None and the rollback could fail unnecessarily if the staged deployment file is already missing. Additionally, a maintainability issue was found in status.rs where the comments in the GRUB sorting test contradict the actual assertions and sorting behavior.
| // Ostree will drop any staged deployment on rollback | ||
| // We follow the same approach for now | ||
| if let Some(..) = &host.status.staged { | ||
| println!("Removing currently staged deployment"); | ||
|
|
||
| boot_dir | ||
| .remove_dir_all(TYPE1_ENT_PATH_STAGED) | ||
| .context("Removing staged entries")?; | ||
|
|
||
| let transient_dir = | ||
| Dir::open_ambient_dir(COMPOSEFS_TRANSIENT_STATE_DIR, ambient_authority()) | ||
| .context("Opening transient dir")?; | ||
|
|
||
| transient_dir | ||
| .remove_file(COMPOSEFS_STAGED_DEPLOYMENT_FNAME) | ||
| .context("Removing staged deployment file")?; | ||
| } |
There was a problem hiding this comment.
Robustness & Correctness Issue
- Stale Staged Entries: If
loader/entries.staged(TYPE1_ENT_PATH_STAGED) already exists on disk (e.g., from a previously interrupted update or rollback) buthost.status.stagedisNone, we do not clean it up before writing the new rollback BLS configs. This can cause stale entries to be carried over and swapped into the activeloader/entriesdirectory duringrename_exchange_bls_entries. - Ignorable Deletion Failure: If the staged deployment file
COMPOSEFS_STAGED_DEPLOYMENT_FNAMEis missing or already deleted,remove_filewill fail withNotFound, unnecessarily failing the entire rollback operation.
Recommendation
Always clean up TYPE1_ENT_PATH_STAGED if it exists using remove_all_optional, and ignore NotFound errors when deleting the staged deployment file.
// Clean up any existing staged entries directory to avoid carrying over stale files
boot_dir
.remove_all_optional(TYPE1_ENT_PATH_STAGED)
.context("Removing staged entries")?;
// Ostree will drop any staged deployment on rollback
// We follow the same approach for now
if let Some(..) = &host.status.staged {
println!("Removing currently staged deployment");
let transient_dir =
Dir::open_ambient_dir(COMPOSEFS_TRANSIENT_STATE_DIR, ambient_authority())
.context("Opening transient dir")?;
if let Err(e) = transient_dir.remove_file(COMPOSEFS_STAGED_DEPLOYMENT_FNAME) {
if e.kind() != std::io::ErrorKind::NotFound {
return Err(anyhow::Error::from(e)).context("Removing staged deployment file");
}
}
}There was a problem hiding this comment.
boot_dir
.remove_all_optional(TYPE1_ENT_PATH_STAGED)
.context("Removing staged entries")?;
I think we would definitely want to return an error if we have a staged deployment and this doesn't exist. Basically means the system is in a weird state. It is the same with COMPOSEFS_STAGED_DEPLOYMENT_FNAME, host.staged is only set if COMPOSEFS_STAGED_DEPLOYMENT_FNAME exists. Also, in the staged op, writing COMPOSEFS_STAGED_DEPLOYMENT_FNAME is the very last operation we do
There was a problem hiding this comment.
Hmm a few principles here. First off we should be able to recover from most unexpected state ideally.
It'd actually be a good exercise (LLM assisted even) to look at just deleting files in the storage or in /run state and ensure that at least bootc switch is able to recover.
I didn't dig into this but on the staged state bits ensuring we e.g. log to the journal and continue seems like a good idea instead of just using ?.
Update `get_sorted_type1_boot_entries_helper` to implement sorting
logic based on bootloader type
- systemd-boot: Sort by sort-key (using BLSConfig::cmp which handles
sort-key ascending, then version descending)
- GRUB: Sort by filename in descending order (ignoring sort-key fields)
Unit Tests generated by ClaudeCode (Opus)
Signed-off-by: Pragyan Poudyal <pragyanpoudyal41999@gmail.com>
Instead of blindly selecting the "second" one in a list of sorted boot entries as the rollback and failing if there are more than one rollback candidate, sort the rollback candidates in the same order as the boot entries and take the first one as rollback. All the remaining deployments become `other_deployments`. This is especially useful if and when we implement pinned deployments for composefs Signed-off-by: Pragyan Poudyal <pragyanpoudyal41999@gmail.com>
a266a4c to
e07ffbd
Compare
|
GHA seems to be having issues |
| // Ostree will drop any staged deployment on rollback | ||
| // We follow the same approach for now | ||
| if let Some(..) = &host.status.staged { | ||
| println!("Removing currently staged deployment"); |
There was a problem hiding this comment.
I'd like to try to centralize reporting more in the future and get away from just random println!. This isn't the only case of course.
One thing we could probably do instead here is log to the journal, which would have other benefits.
| // Ostree will drop any staged deployment on rollback | ||
| // We follow the same approach for now | ||
| if let Some(..) = &host.status.staged { | ||
| println!("Removing currently staged deployment"); | ||
|
|
||
| boot_dir | ||
| .remove_dir_all(TYPE1_ENT_PATH_STAGED) | ||
| .context("Removing staged entries")?; | ||
|
|
||
| let transient_dir = | ||
| Dir::open_ambient_dir(COMPOSEFS_TRANSIENT_STATE_DIR, ambient_authority()) | ||
| .context("Opening transient dir")?; | ||
|
|
||
| transient_dir | ||
| .remove_file(COMPOSEFS_STAGED_DEPLOYMENT_FNAME) | ||
| .context("Removing staged deployment file")?; | ||
| } |
There was a problem hiding this comment.
Hmm a few principles here. First off we should be able to recover from most unexpected state ideally.
It'd actually be a good exercise (LLM assisted even) to look at just deleting files in the storage or in /run state and ensure that at least bootc switch is able to recover.
I didn't dig into this but on the staged state bits ensuring we e.g. log to the journal and continue seems like a good idea instead of just using ?.
cfs/rollback: Remove staged entry on rollback
Similar to ostree, if we find any staged deployment while performing
a rollback, we'll get rid of the staged deployment. The staged
deployment still exists on disk and will be GC'd later
Fixes: #2208
cfs/status: Implement bootloader-specific sorting
Update
get_sorted_type1_boot_entries_helperto implement sortinglogic based on bootloader type
sort-key ascending, then version descending)
Unit Tests generated by ClaudeCode (Opus)