Rollback Fixes #2213

Open
Johan-Liebert1 wants to merge 4 commits into bootc-dev:main from Johan-Liebert1:rollback-fix

@Johan-Liebert1
Collaborator

cfs/rollback: Remove staged entry on rollback

Similar to ostree, if we find a staged deployment while performing
a rollback, we drop it. The staged deployment's content still exists
on disk and will be GC'd later.

Fixes: #2208


cfs/status: Implement bootloader-specific sorting

Update get_sorted_type1_boot_entries_helper to implement sorting
logic based on bootloader type

  • systemd-boot: Sort by sort-key (using BLSConfig::cmp which handles
    sort-key ascending, then version descending)
  • GRUB: Sort by filename in descending order (ignoring sort-key fields)

Unit Tests generated by ClaudeCode (Opus)
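
The two orderings described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `Entry`, `Bootloader`, `cmp_systemd`, and `sort_entries` are hypothetical names, and the sketch assumes sort-keys and versions compare as plain strings, which may differ from what `BLSConfig::cmp` actually does.

```rust
use std::cmp::Ordering;

/// Hypothetical stand-in for a parsed Type 1 boot entry (not bootc's `BLSConfig`).
#[derive(Debug, Clone, PartialEq, Eq)]
struct Entry {
    filename: String,
    sort_key: String,
    version: String,
}

/// Hypothetical bootloader selector used only in this sketch.
#[derive(Debug, Clone, Copy)]
enum Bootloader {
    Systemd,
    Grub,
}

/// systemd-boot style ordering: sort-key ascending, then version descending.
/// Assumption: both fields are compared as plain strings.
fn cmp_systemd(a: &Entry, b: &Entry) -> Ordering {
    a.sort_key
        .cmp(&b.sort_key)
        .then_with(|| b.version.cmp(&a.version))
}

/// Sort entries in place according to the bootloader type.
fn sort_entries(entries: &mut [Entry], bootloader: Bootloader) {
    match bootloader {
        Bootloader::Systemd => entries.sort_by(cmp_systemd),
        // GRUB style ordering: filename descending, sort-key ignored.
        Bootloader::Grub => entries.sort_by(|a, b| b.filename.cmp(&a.filename)),
    }
}
```

With two entries `a.conf` (sort-key `0`) and `b.conf` (sort-key `1`), the systemd-boot ordering puts `a.conf` first while the GRUB ordering puts `b.conf` first, which is why the same list can yield a different "first" entry per bootloader.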

Similar to ostree, if we find any staged deployment while performing
a rollback, we'll get rid of the staged deployment. The staged
deployment still exists on disk and will be GC'd later

Fixes: bootc-dev#2208

Signed-off-by: Pragyan Poudyal <pragyanpoudyal41999@gmail.com>

Add a case in the rollback test which tests rollback when there is a
staged deployment present. This is a test for bootc-dev#2208

Signed-off-by: Pragyan Poudyal <pragyanpoudyal41999@gmail.com>
Johan-Liebert1 requested a review from cgwalters May 26, 2026 10:14
Johan-Liebert1 added the ci/tier-1 (Run CI for tier-1 OS (centos-10) only) label May 26, 2026
bootc-bot (Bot) requested a review from jeckersb May 26, 2026 10:14
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request implements dropping staged deployments on rollback to match ostree's behavior, updates boot entry sorting logic to sort based on the bootloader type (by sort-key for systemd-boot and by filename for GRUB), and adds corresponding unit and integration tests. The reviewer identified a robustness issue in rollback.rs where stale staged entries might not be cleaned up if host.status.staged is None and the rollback could fail unnecessarily if the staged deployment file is already missing. Additionally, a maintainability issue was found in status.rs where the comments in the GRUB sorting test contradict the actual assertions and sorting behavior.

Comment on lines +149 to +165

    // Ostree will drop any staged deployment on rollback
    // We follow the same approach for now
    if let Some(..) = &host.status.staged {
        println!("Removing currently staged deployment");

        boot_dir
            .remove_dir_all(TYPE1_ENT_PATH_STAGED)
            .context("Removing staged entries")?;

        let transient_dir =
            Dir::open_ambient_dir(COMPOSEFS_TRANSIENT_STATE_DIR, ambient_authority())
                .context("Opening transient dir")?;

        transient_dir
            .remove_file(COMPOSEFS_STAGED_DEPLOYMENT_FNAME)
            .context("Removing staged deployment file")?;
    }

Contributor

high

Robustness & Correctness Issue

  1. Stale Staged Entries: If loader/entries.staged (TYPE1_ENT_PATH_STAGED) already exists on disk (e.g., from a previously interrupted update or rollback) but host.status.staged is None, we do not clean it up before writing the new rollback BLS configs. This can cause stale entries to be carried over and swapped into the active loader/entries directory during rename_exchange_bls_entries.
  2. Ignorable Deletion Failure: If the staged deployment file COMPOSEFS_STAGED_DEPLOYMENT_FNAME is missing or already deleted, remove_file will fail with NotFound, unnecessarily failing the entire rollback operation.

Recommendation

Always clean up TYPE1_ENT_PATH_STAGED if it exists using remove_all_optional, and ignore NotFound errors when deleting the staged deployment file.

    // Clean up any existing staged entries directory to avoid carrying over stale files
    boot_dir
        .remove_all_optional(TYPE1_ENT_PATH_STAGED)
        .context("Removing staged entries")?;

    // Ostree will drop any staged deployment on rollback
    // We follow the same approach for now
    if let Some(..) = &host.status.staged {
        println!("Removing currently staged deployment");

        let transient_dir =
            Dir::open_ambient_dir(COMPOSEFS_TRANSIENT_STATE_DIR, ambient_authority())
                .context("Opening transient dir")?;

        if let Err(e) = transient_dir.remove_file(COMPOSEFS_STAGED_DEPLOYMENT_FNAME) {
            if e.kind() != std::io::ErrorKind::NotFound {
                return Err(anyhow::Error::from(e)).context("Removing staged deployment file");
            }
        }
    }

Collaborator Author

    boot_dir
        .remove_all_optional(TYPE1_ENT_PATH_STAGED)
        .context("Removing staged entries")?;

I think we would definitely want to return an error if we have a staged deployment and this directory doesn't exist; that basically means the system is in a weird state. The same applies to COMPOSEFS_STAGED_DEPLOYMENT_FNAME: host.staged is only set if COMPOSEFS_STAGED_DEPLOYMENT_FNAME exists. Also, in the staged op, writing COMPOSEFS_STAGED_DEPLOYMENT_FNAME is the very last operation we do.

Collaborator

Hmm a few principles here. First off we should be able to recover from most unexpected state ideally.

It'd actually be a good exercise (LLM assisted even) to look at just deleting files in the storage or in /run state and ensure that at least bootc switch is able to recover.

I didn't dig into this but on the staged state bits ensuring we e.g. log to the journal and continue seems like a good idea instead of just using ?.
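
As a rough illustration of the "log and continue" idea, the sketch below removes a file on a best-effort basis using only the standard library. The helper names `remove_file_optional` and `drop_staged_marker` are hypothetical, `eprintln!` merely stands in for whatever journal logging the project would use, and whether non-`NotFound` errors should also be tolerated is an assumption made here, not something decided in this thread.

```rust
use std::fs;
use std::io::{self, ErrorKind};
use std::path::Path;

/// Remove `path`, treating "already gone" as success.
/// Returns Ok(true) if the file was removed, Ok(false) if it did not exist.
fn remove_file_optional(path: &Path) -> io::Result<bool> {
    match fs::remove_file(path) {
        Ok(()) => Ok(true),
        Err(e) if e.kind() == ErrorKind::NotFound => Ok(false),
        Err(e) => Err(e),
    }
}

/// Best-effort cleanup: log the outcome and let the caller carry on,
/// instead of propagating the error with `?`.
/// `eprintln!` is a placeholder for real journal logging.
fn drop_staged_marker(path: &Path) {
    match remove_file_optional(path) {
        Ok(true) => eprintln!("Removed staged deployment marker {}", path.display()),
        Ok(false) => eprintln!(
            "Staged deployment marker {} already absent; continuing",
            path.display()
        ),
        Err(e) => eprintln!("Failed to remove {}: {e}; continuing", path.display()),
    }
}
```

The trade-off discussed above still applies: swallowing errors makes rollback more resilient to unexpected on-disk state, but it also hides the "weird state" signal unless the log message is surfaced somewhere visible.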

Comment thread crates/lib/src/bootc_composefs/status.rs
Update `get_sorted_type1_boot_entries_helper` to implement sorting
logic based on bootloader type

- systemd-boot: Sort by sort-key (using BLSConfig::cmp which handles
    sort-key ascending, then version descending)
- GRUB: Sort by filename in descending order (ignoring sort-key fields)

Unit Tests generated by ClaudeCode (Opus)

Signed-off-by: Pragyan Poudyal <pragyanpoudyal41999@gmail.com>
Instead of blindly selecting the "second" entry in the list of sorted boot
entries as the rollback and failing if there is more than one rollback
candidate, sort the rollback candidates in the same order as the boot
entries and take the first one as the rollback.

All the remaining deployments become `other_deployments`. This is
especially useful if and when we implement pinned deployments for
composefs.

Signed-off-by: Pragyan Poudyal <pragyanpoudyal41999@gmail.com>
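
The selection described in this commit message could look roughly like the sketch below. It is illustrative only: `Deployment`, `split_rollback`, and the idea of passing the boot order as a slice of ids are assumptions for this example, not the types or signatures used in bootc.

```rust
/// Hypothetical deployment record; the real type in bootc carries far more state.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Deployment {
    id: String,
}

/// Order the rollback candidates the same way as the sorted boot entries,
/// take the first as the rollback, and return the rest as "other" deployments.
///
/// `boot_order` lists deployment ids in sorted boot-entry order. Candidates
/// that do not appear in it sort last (an assumption made for this sketch).
fn split_rollback(
    mut candidates: Vec<Deployment>,
    boot_order: &[&str],
) -> (Option<Deployment>, Vec<Deployment>) {
    // Stable sort by position in the boot-entry order.
    candidates.sort_by_key(|d| {
        boot_order
            .iter()
            .position(|id| *id == d.id.as_str())
            .unwrap_or(usize::MAX)
    });
    let mut rest = candidates.into_iter();
    let rollback = rest.next();
    (rollback, rest.collect())
}
```

Because no candidate count is assumed, the same function covers zero, one, or several candidates, which is what would make a future pinned-deployments feature fit without special-casing.
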
@Johan-Liebert1
Collaborator Author

GHA seems to be having issues

    // Ostree will drop any staged deployment on rollback
    // We follow the same approach for now
    if let Some(..) = &host.status.staged {
        println!("Removing currently staged deployment");

Collaborator

I'd like to try to centralize reporting more in the future and get away from just random println!. This isn't the only case of course.

One thing we could probably do instead here is log to the journal, which would have other benefits.

Development

Successfully merging this pull request may close these issues.

[ComposeFS backend] Series of rebases/updates into bootc rollback breaks bootc (persisting through reboots)
