Skip to content

Delete sitreps non-transactionally with improved pagination#10210

Open
smklein wants to merge 5 commits intomainfrom
sitrep-delete-better
Open

Delete sitreps non-transactionally with improved pagination#10210
smklein wants to merge 5 commits intomainfrom
sitrep-delete-better

Conversation

@smklein
Copy link
Copy Markdown
Collaborator

@smklein smklein commented Apr 2, 2026

Follow-up to #10143

Adds pagination during sitrep garbage collection.

While I was there, I realized that we actually don't need transactions on the delete pathway anymore.

  • The delete operation (within the garbage collection pass) deletes sitreps by deleting out-of-history fm_sitrep rows, which immediately orphans all other sub-tables within the sitrep.
  • All the "concurrent reads" of sitreps go through fm_sitrep_read_on_conn, which reads metadata from fm_sitrep last. If this sitrep can be read: it has not been deleted yet. If this sitrep cannot be read: it has been deleted, and prior reads can be discarded.
  • All the "concurrent inserts" of sitreps are contingent on the parent sitrep not being stale. So: because sitrep insert writes to the fm_sitrep table first, the child rows are "not orphaned" (won't be GC-ed). This protection lasts for the duration of fm_sitrep_insert, OR until the parent sitrep is marked stale - at which point insert should fail anyway.

All this is to say: the "read" and "insert" pathways function fine if a fm_sitrep row is deleted non-atomically before subsequent child rows. Therefore: no transaction is necessary here.

@smklein smklein requested review from hawkw and mergeconflict April 2, 2026 20:49
@hawkw
Copy link
Copy Markdown
Member

hawkw commented Apr 2, 2026

All this is to say: the "read" and "insert" pathways function fine if a fm_sitrep row is deleted non-atomically before subsequent child rows. Therefore: no transaction is necessary here.

zoom zoom!

.get_result_async::<(i64, Option<Uuid>)>(&*conn)
.await
.map_err(|e| {
public_error_from_diesel(e, ErrorHandler::Server)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: would kinda like an internal_context here saying which step we were trying to do when we failed?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call, added

.get_result_async::<(i64, Option<Uuid>)>(&*conn)
.await
.map_err(|e| {
public_error_from_diesel(e, ErrorHandler::Server)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similarly, it would be kinda nice if this said what went wrong...

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added!

Comment on lines +2892 to +2902
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
write!(
f,
"reads_ok={}, reads_err={}, inserts={}, \
gc_runs={}",
self.reads_ok.load(Ordering::Relaxed),
self.reads_err.load(Ordering::Relaxed),
self.inserts.load(Ordering::Relaxed),
self.gc_runs.load(Ordering::Relaxed),
)
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would love to use exhaustive destructuring here, so we get an error if we add another counter...

Suggested change
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
write!(
f,
"reads_ok={}, reads_err={}, inserts={}, \
gc_runs={}",
self.reads_ok.load(Ordering::Relaxed),
self.reads_err.load(Ordering::Relaxed),
self.inserts.load(Ordering::Relaxed),
self.gc_runs.load(Ordering::Relaxed),
)
}
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
let Self { reads_ok, reads_err, inserts, gc_runs } = self;
write!(
f,
"reads_ok={}, reads_err={}, inserts={}, \
gc_runs={}",
reads_ok.load(Ordering::Relaxed),
reads_err.load(Ordering::Relaxed),
inserts.load(Ordering::Relaxed),
gc_runs.load(Ordering::Relaxed),
)
}

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines +2883 to +2889
#[derive(Clone)]
struct Stats {
reads_ok: Arc<AtomicUsize>,
reads_err: Arc<AtomicUsize>,
inserts: Arc<AtomicUsize>,
gc_runs: Arc<AtomicUsize>,
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not a big deal, but...why are these all Arced individually? why not Arc the whole Stats type?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

eh just test-writing laziness. I'll update it to use a single Arc

let guard = live_sitreps.lock().await;
if guard.is_empty() {
drop(guard);
tokio::task::yield_now().await;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the yield_now doing here? it might be nice if there was a comment explaining what this is intended for?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add a comment. The TL;DR, is "there is nothing to read yet, because writer tasks haven't run".

Comment on lines +3092 to +3095
&log, "GC pass complete";
"sitreps_deleted" => result.sitreps_deleted,
"child_tables" => ?result.child_tables,
);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
&log, "GC pass complete";
"sitreps_deleted" => result.sitreps_deleted,
"child_tables" => ?result.child_tables,
);
&log,
"GC pass complete";
"sitreps_deleted" => result.sitreps_deleted,
"child_tables" => ?result.child_tables,
);

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

Comment on lines +3116 to +3120
// Join all tasks — a panic in any task (from assert_sitreps_eq)
// means we detected a torn read.
for handle in handles {
handle.await.expect("task panicked");
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i believe this may not actually be necessary if the tests are being compiled with panic = "abort"? but probably good to do anyway

Comment on lines +2964 to +2978
// Insert an empty child so the original isn't
// current and can be deleted.
let child_id = SitrepUuid::new_v4();
let child = fm::Sitrep {
metadata: fm::SitrepMetadata {
id: child_id,
parent_sitrep_id: Some(sitrep_id),
time_created: Utc::now(),
creator_id: OmicronZoneUuid::new_v4(),
comment: "child".to_string(),
inv_collection_id: CollectionUuid::new_v4(),
},
cases: Default::default(),
ereports_by_id: Default::default(),
};
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it perhaps be more accurate to real life if we also stuffed cases into the child sitreps?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call, I'll use make_sitrep_with_cases

Comment on lines -2257 to -2258

/// Test that deeply orphaned child rows (whose fm_sitrep metadata
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unintentional deletion?

/// *complete* sitrep (no torn reads). Errors (e.g. `NotFound`) are
/// expected and fine — partial data is not.
///
/// Writers race with each other, causing `ParentNotCurrent` failures.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we expect that one of the racing writers will always win and the rest will fail with ParentNotCurrent or whatever, is that right? Is there any possibility that all writers will fail, such that the test wouldn't make progress?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if exactly one racing writer doesn't always win, then we have much worse problems :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants