Add mark-and-sweep GC for shared OCI cache #199
Conversation
The shared OCI cache at data_dir/system/oci-cache grew without bound
because neither the pull path nor the registry push path had a cleanup
hook. The image retention controller only touches data_dir/images, so
manifests and layer blobs that were no longer referenced lived forever.
This change adds a new lib/ocicachegc package that walks index.json and
every referenced manifest to build the live set of blob digests, then
deletes any file under blobs/sha256/ that is not in that set. Blobs
whose mtime is within the configured min_blob_age are kept; this grace
period is what lets the sweep run safely alongside concurrent pulls
(which write layer blobs before updating index.json) and registry
pushes.
Disabled by default. Enable via:
images:
  oci_cache_gc:
    enabled: true
    interval: 1h
    min_blob_age: 1h
Previously walkDescriptor added subject.Digest to the live set as a leaf without descending, so the subject manifest's own config and layers could be swept. Recurse like manifests[] so the full referrer chain stays marked. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The registry stores manifest and layer blobs for cache/* pushes in the shared OCI blob dir but skips triggerConversion, so those blobs are never rooted in index.json. With GC enabled this caused the sweep to delete cache blobs the registry was still serving from its in-memory tag map, breaking BuildKit cache exports. Track cache/* tag -> manifest digest in the registry and expose the set via LiveCacheManifestDigests. The GC takes a RootsProvider; on every sweep it walks those manifests' configs and layers as additional roots alongside index.json. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
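A minimal sketch of how the extra-roots hookup described in this commit might look. Only the names RootsProvider and LiveCacheManifestDigests come from the commit message; the method signature, the cacheTagTracker type, and RecordPush are hypothetical stand-ins for the registry's in-memory tag map.

```go
package main

import (
	"sort"
	"strings"
	"sync"
)

// RootsProvider supplies extra manifest digests that the GC should treat as
// roots in addition to index.json. The signature here is an assumption.
type RootsProvider interface {
	LiveCacheManifestDigests() []string
}

// cacheTagTracker is a hypothetical stand-in for the registry's tag map.
type cacheTagTracker struct {
	mu   sync.RWMutex
	tags map[string]string // "cache/<name>:<tag>" -> manifest digest
}

func newCacheTagTracker() *cacheTagTracker {
	return &cacheTagTracker{tags: map[string]string{}}
}

// RecordPush remembers the manifest digest for cache/* repositories only.
// Re-pushing a tag overwrites the previous digest, so the old manifest stops
// being a root and becomes collectable once nothing else references it.
func (t *cacheTagTracker) RecordPush(repo, tag, digest string) {
	if !strings.HasPrefix(repo, "cache/") {
		return
	}
	t.mu.Lock()
	defer t.mu.Unlock()
	t.tags[repo+":"+tag] = digest
}

// LiveCacheManifestDigests returns the deduplicated, sorted set of manifest
// digests currently pointed at by a cache/* tag.
func (t *cacheTagTracker) LiveCacheManifestDigests() []string {
	t.mu.RLock()
	defer t.mu.RUnlock()
	seen := map[string]bool{}
	var out []string
	for _, d := range t.tags {
		if !seen[d] {
			seen[d] = true
			out = append(out, d)
		}
	}
	sort.Strings(out)
	return out
}
```

Passing the tracker to the collector as an interface keeps the GC package free of a direct dependency on the registry, which is presumably why the commit introduces a provider rather than a concrete type.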
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Reviewed by Cursor Bugbot for commit 8507ca0.
OCI v1.1 lets the index itself carry a subject descriptor. liveBlobs only iterated index.Manifests, so a blob reachable solely via the index-level subject was never marked and could be swept. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds a tracer to the collector with spans around the sweep, the mark phase, and the blob sweep loop, plus span attributes capturing live, scanned, deleted, and skipped-recent counts. Records live blob count per successful sweep as a histogram metric so cache size is observable from metrics alone. Demotes the per-blob delete log to DEBUG (an ongoing maintenance event) and promotes the sweep summary to INFO only when blobs were actually deleted, leaving idle sweeps at DEBUG.

Summary
The shared OCI cache at data_dir/system/oci-cache currently grows without bound: neither the pull path (layout.AppendImage) nor the registry push path (BlobStore.Put) ever removes blobs, and the image retention controller only touches data_dir/images. Over time this accumulates dead manifest, config, and layer blobs that are no longer reachable from index.json.
This change adds a new lib/ocicachegc package that walks index.json and every referenced manifest to build the set of live blob digests, then deletes any file under blobs/sha256/ that isn't in that set.
Blobs whose mtime is within the configured min_blob_age are always kept; that grace period is what lets the sweep run safely alongside concurrent pulls (which write layer blobs before updating index.json) and registry pushes (which rename <hex>.tmp → <hex> before the manifest trigger).
Config
Disabled by default. Opt in via:
images:
  oci_cache_gc:
    enabled: true
    interval: 1h
    min_blob_age: 1h
How it decides what's live
- Start from the descriptors listed in index.json.
- If the blob is a manifest or manifest index, recurse into its config, layers, manifests, and subject references.
- Unparseable or missing referenced blobs are treated as opaque leaves: they remain "live" but we don't descend into them. The collector never deletes a blob it cannot prove is dead.
- .tmp files and anything whose name is not a 64-hex-char blob digest are ignored by the sweep entirely.
Metrics
- hypeman_oci_cache_gc_sweeps_total (counter, status)
- hypeman_oci_cache_gc_sweep_duration_seconds (histogram)
- hypeman_oci_cache_gc_deleted_blobs_total (counter)
- hypeman_oci_cache_gc_deleted_bytes_total (counter)
Test plan
- go test ./lib/ocicachegc/... passes (live set kept, orphans deleted, grace period honored, tmp/non-blob filenames ignored, manifest index traversal)
- go test ./cmd/api/config/... passes (new duration validators)
- go test ./lib/imageretention/... passes (unchanged)
- go build ./cmd/api/... clean
- go vet ./... clean
Manual validation
- deft-kernel-dev, ran the real hypeman binary from a fresh scratch clone with images.oci_cache_gc.enabled: true and an isolated temp data_dir.
- data_dir/system/oci-cache with one live manifest/config/layer set, one old orphan blob, and one recent orphan blob.
- "oci cache gc enabled", "oci cache gc started", and "deleted unreferenced oci blob" for the old orphan digest.
- go mod download, make oapi-generate, make build, go run ./cmd/test-prewarm, go test -count=1 -tags containers_image_openpgp -timeout=20m ./... (pass, 300s).
Note
Medium Risk
Introduces a new background garbage collector that deletes files from data_dir/system/oci-cache, which could remove needed blobs if liveness/age rules are wrong or misconfigured. Mitigated by opt-in config defaults, grace period (min_blob_age), conservative parsing behavior, and extensive tests.
Overview
Adds an opt-in mark-and-sweep garbage collector for the shared OCI cache (data_dir/system/oci-cache) via new images.oci_cache_gc config (enabled, interval, min_blob_age) with validation and updated example configs.
Implements lib/ocicachegc to compute a live blob set by walking index.json (plus optional extra roots) and deleting unreferenced blobs/sha256/<digest> files older than the grace period, with OTel metrics/tracing and a single-sweep-at-a-time lock.
Wires the collector into cmd/api/main.go as a background service when enabled, and extends lib/registry to track BuildKit cache/* tag digests in-memory (LiveCacheManifestDigests) so GC treats them as additional roots; adds targeted unit tests for config, wiring, registry cache-tag tracking, and GC behavior.
Reviewed by Cursor Bugbot for commit 5554c9d.