
feat: support ignore_cache for workflow cache and add lock-protected shared storage #4640

Open
huanghongbo-hhb wants to merge 10 commits into koderover:main from huanghongbo-hhb:feat/shared-storage-cache-ignore-lock-v1


@huanghongbo-hhb (Contributor) commented Apr 20, 2026

Background

Workflow task execution already supports cache usage for both object storage and shared storage backends, but ignore_cache behavior was not consistently supported across them.

This PR adds:

  • ignore_cache support for object storage cache
  • ignore_cache support for shared storage cache
  • a versioned shared storage cache model that is safe under concurrent publish
  • Kubernetes Lease based coordination for shared cache current pointer updates

Expected behavior:

  • ignore_cache=false: restore previous cache and publish new cache
  • ignore_cache=true: skip restoring previous cache, but still publish new cache

Changes

Object storage cache

Added full ignore_cache support for object storage cache in workflow task execution:

  • when ignore_cache=true, skip download archive
  • still keep the later tar archive / cache upload logic

Covered job types:

  • build job
  • testing job
  • scanning job

Shared storage cache

Added dedicated restore/publish steps for shared storage cache and upgraded it to a versioned model:

  • add shared_cache_restore
  • add shared_cache_publish
  • store shared cache as:
    • current.json
    • snapshots/<version>

Behavior:

  • ignore_cache=false
    • run restore
    • run publish
  • ignore_cache=true
    • skip restore
    • still run publish
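
The behavior above can be sketched as a small step planner. This is an illustration only: the step names come from this description, while `planSharedCacheSteps` is a hypothetical helper and not a function from this PR.

```go
package main

import "fmt"

// Step names as listed in the PR description.
const (
	stepRestore = "shared_cache_restore"
	stepPublish = "shared_cache_publish"
)

// planSharedCacheSteps is a hypothetical helper: it returns the shared cache
// steps to run for a job. ignore_cache only suppresses restore; publish
// always runs.
func planSharedCacheSteps(ignoreCache bool) []string {
	steps := make([]string, 0, 2)
	if !ignoreCache {
		steps = append(steps, stepRestore)
	}
	return append(steps, stepPublish)
}

func main() {
	fmt.Println(planSharedCacheSteps(false))
	fmt.Println(planSharedCacheSteps(true))
}
```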

Implementation details:

  • shared cache is mounted to /zadig/cache-store
  • restore reads the snapshot referenced by current.json
  • publish creates a new snapshot version first
  • only the short critical section that validates base version and updates current.json is protected by a Kubernetes Lease
  • stale base validation prevents outdated tasks from rolling back the current cache pointer
  • old snapshots are cleaned up after publish
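
As a rough sketch of the restore lookup described above: read current.json, then resolve the snapshot directory it names. The JSON shape (a single `version` field) and both function names are assumptions made for illustration; the real schema is not shown in this description.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// currentPointer is an ASSUMED shape for current.json.
type currentPointer struct {
	Version string `json:"version"`
}

// snapshotDirFromPointer parses the contents of current.json and returns the
// snapshot directory under root. ok is false when no version is recorded.
func snapshotDirFromPointer(root string, data []byte) (dir string, ok bool, err error) {
	var cur currentPointer
	if err := json.Unmarshal(data, &cur); err != nil {
		return "", false, fmt.Errorf("parse current.json: %w", err)
	}
	if cur.Version == "" {
		return "", false, nil
	}
	return filepath.Join(root, "snapshots", cur.Version), true, nil
}

// resolveSnapshotDir reads <root>/current.json from disk. A missing file is
// treated as "no cache published yet", not as an error.
func resolveSnapshotDir(root string) (string, bool, error) {
	data, err := os.ReadFile(filepath.Join(root, "current.json"))
	if os.IsNotExist(err) {
		return "", false, nil
	}
	if err != nil {
		return "", false, err
	}
	return snapshotDirFromPointer(root, data)
}

func main() {
	// /zadig/cache-store is the mount path named in this description.
	dir, ok, err := resolveSnapshotDir("/zadig/cache-store")
	fmt.Println(dir, ok, err)
}
```

Treating a missing current.json as an empty cache keeps the first run of a job from failing on restore.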

Kubernetes Lease coordination

This PR uses Kubernetes coordination.k8s.io/v1 Lease instead of injecting Redis credentials into workflow task pods for the shared cache publish lock.

The publish flow is:

  1. create snapshots/<version> without holding the lock
  2. acquire the Kubernetes Lease
  3. load restore metadata and current metadata
  4. validate baseVersion
  5. update current.json if the base is still current
  6. cleanup old snapshots
  7. release the Lease

This keeps the Lease critical section short and avoids holding a distributed lock while copying potentially large cache directories.
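
The flow can be modeled in memory to show why a stale base never moves the pointer. In this sketch a `sync.Mutex` stands in for the Kubernetes Lease and a map stands in for the snapshots directory; the `store` type, its fields, and the cleanup policy (drop only the replaced snapshot) are assumptions, not the PR's implementation.

```go
package main

import (
	"fmt"
	"sync"
)

// store is an in-memory stand-in for the shared cache directory.
type store struct {
	lease     sync.Mutex      // stands in for the Kubernetes Lease
	fs        sync.Mutex      // models the filesystem's own write safety
	current   string          // version recorded in current.json
	snapshots map[string]bool // snapshots/<version>
}

// publish reports whether the current pointer was updated.
func (s *store) publish(baseVersion, newVersion string) bool {
	// Step 1: create the snapshot without holding the lease.
	s.fs.Lock()
	s.snapshots[newVersion] = true
	s.fs.Unlock()

	// Steps 2 and 7: hold the lease only for the short pointer update.
	s.lease.Lock()
	defer s.lease.Unlock()

	// Steps 3-4: the restored base must still be the current version.
	if s.current != baseVersion {
		return false // stale base: leave the pointer untouched
	}
	// Step 5: move the pointer.
	s.current = newVersion
	// Step 6: clean up the replaced snapshot (assumed retention policy).
	s.fs.Lock()
	delete(s.snapshots, baseVersion)
	s.fs.Unlock()
	return true
}

func main() {
	s := &store{current: "v1", snapshots: map[string]bool{"v1": true}}
	results := make([]bool, 2)
	var wg sync.WaitGroup
	for i, v := range []string{"v2", "v3"} {
		wg.Add(1)
		go func(i int, v string) {
			defer wg.Done()
			// Both tasks restored from v1 and race to publish.
			results[i] = s.publish("v1", v)
		}(i, v)
	}
	wg.Wait()
	fmt.Println("updated:", results, "current:", s.current)
}
```

Whichever goroutine takes the lock first wins; the other sees that `current` no longer matches its base and skips the update.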

Workflow runtime RBAC

Workflow task pods already run with workflow-cm-sa. This PR extends the existing workflow-cm-manager Role with the minimum Lease permissions required by the jobexecutor:

- apiGroups: ["coordination.k8s.io"]
  resources: ["leases"]
  verbs: ["get", "create", "update"]

The initialization paths also update existing workflow-cm-manager Roles, so that upgraded clusters, not only newly initialized ones, receive the new Lease permission.

To cover long-lived namespaces that already existed before this change, workflow job creation now also performs an idempotent namespace-scoped RBAC repair when a job includes shared cache restore/publish steps:

  • ensure workflow-cm-manager exists in the target namespace
  • append the Lease rule if the Role already exists but is missing it
  • do nothing if the Role is already up to date

This avoids requiring manual RBAC patching for every historical environment namespace before shared cache publish can use Kubernetes Lease successfully.
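
The append-if-missing check can be sketched as follows. To stay dependency-free, `policyRule` is a local type mirroring the three fields of the Kubernetes RBAC rule shown earlier; `ensureLeaseRule` and its helpers are hypothetical names, the pre-existing rule in `main` is a placeholder, and wildcard (`*`) rules are deliberately not handled.

```go
package main

import "fmt"

// policyRule mirrors the RBAC rule fields used in this description.
type policyRule struct {
	APIGroups []string
	Resources []string
	Verbs     []string
}

func contains(list []string, want string) bool {
	for _, item := range list {
		if item == want {
			return true
		}
	}
	return false
}

// grantsLease reports whether one rule already allows get, create, and
// update on leases in coordination.k8s.io.
func grantsLease(r policyRule) bool {
	if !contains(r.APIGroups, "coordination.k8s.io") || !contains(r.Resources, "leases") {
		return false
	}
	for _, verb := range []string{"get", "create", "update"} {
		if !contains(r.Verbs, verb) {
			return false
		}
	}
	return true
}

// ensureLeaseRule appends the Lease rule only when it is missing. The second
// result tells the caller whether the Role needs to be written back.
func ensureLeaseRule(rules []policyRule) ([]policyRule, bool) {
	for _, r := range rules {
		if grantsLease(r) {
			return rules, false
		}
	}
	lease := policyRule{
		APIGroups: []string{"coordination.k8s.io"},
		Resources: []string{"leases"},
		Verbs:     []string{"get", "create", "update"},
	}
	return append(rules, lease), true
}

func main() {
	// Placeholder for whatever rules the Role already carries.
	rules := []policyRule{{APIGroups: []string{""}, Resources: []string{"configmaps"}, Verbs: []string{"get"}}}
	rules, changed := ensureLeaseRule(rules)
	fmt.Println(len(rules), changed)
	rules, changed = ensureLeaseRule(rules)
	fmt.Println(len(rules), changed)
}
```

Returning a changed flag lets the caller skip the Role update entirely when nothing is missing, which is what makes the repair safe to run on every job creation.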

Redis runtime env cleanup

The workflow task pod no longer needs Redis envs for shared cache locking, so this PR removes the extra Redis env injection from workflow job pods:

  • REDIS_HOST
  • REDIS_PORT
  • REDIS_USERNAME
  • REDIS_PASSWORD
  • REDIS_COMMON_CACHE_DB

ignore_cache behavior

ignore_cache only affects reading historical cache content.

It does not affect publishing the new cache result:

  • object storage still uploads cache
  • shared storage still publishes cache

Why Kubernetes Lease

Shared cache publish coordination happens inside workflow task pods running in Kubernetes. A Kubernetes Lease is a better fit for this cluster-local coordination than passing Redis credentials into every workflow task pod only for a lock.

The Lease is namespace-scoped, governed by Kubernetes RBAC, and only requires minimal permissions on leases.coordination.k8s.io.

The correctness of the shared cache publish flow still relies on the versioned snapshot model and baseVersion validation. The Lease only serializes the short current.json pointer update section.

Validation

Object storage cache

Verified that:

  • ignore_cache=false keeps both download archive and tar archive
  • ignore_cache=true skips download archive
  • tar archive is still preserved

Shared storage cache behavior

Verified that:

  • ignore_cache=false runs shared cache restore and publish
  • ignore_cache=true skips shared cache restore and still runs publish
  • Redis envs are no longer required by workflow task pods for shared cache locking
  • shared cache PVC is mounted correctly
  • shared cache directory contains:
    • current.json
    • snapshots/
  • restore reads the same version currently referenced by current.json

Shared storage concurrency validation

Triggered multiple concurrent tasks and observed that:

  • multiple tasks started from the same previous cache version
  • only one task could update the current pointer at a time
  • stale base validation skipped outdated current pointer updates
  • current was not rolled back by concurrent outdated tasks

Local package validation

Ran:

/usr/local/go/bin/go test ./pkg/microservice/aslan/core/common/service/kube ./pkg/microservice/aslan/core ./pkg/microservice/jobexecutor/core/service/step ./pkg/tool/kube/lease

Result:

  • passed

Notes

A cache publish failure or a skipped current pointer update does not fail the main workflow job. The shared cache is still treated as an acceleration mechanism rather than a hard requirement for overall job success.



huanghongbo added 3 commits April 15, 2026 18:28
Signed-off-by: huanghongbo <huanghongbo@koderover.com>
Signed-off-by: huanghongbo <huanghongbo@koderover.com>
Signed-off-by: huanghongbo <huanghongbo@koderover.com>
Signed-off-by: huanghongbo <huanghongbo@koderover.com>
@huanghongbo-hhb changed the title from "Feat/shared storage cache ignore lock v1" to "feat: support ignore_cache for workflow cache and add lock-protected shared storage" Apr 20, 2026
@landylee007 landylee007 requested a review from PetrusZ April 22, 2026 16:36
huanghongbo added 3 commits April 24, 2026 18:12
Signed-off-by: huanghongbo <huanghongbo@koderover.com>
Signed-off-by: huanghongbo <huanghongbo@koderover.com>
Signed-off-by: huanghongbo <huanghongbo@koderover.com>
@huanghongbo-hhb
Contributor Author

Validation update for the Lease-based shared cache publish flow:

I ran a concurrent workflow test against the same shared cache key and verified the expected stale-base protection behavior.

Observed result:

  • Two tasks restored from the same base version task-74-job-0-0-0-build and entered publish concurrently.
  • One task successfully published task-75-job-0-0-0-build.
  • The other task logged:
    Shared cache publish skipped current pointer update because base version task-74-job-0-0-0-build is stale, latest version is task-75-job-0-0-0-build.
  • Then another two tasks restored from the same base version task-75-job-0-0-0-build and entered publish concurrently.
  • One task successfully published task-77-job-0-0-0-build.
  • The other task logged:
    Shared cache publish skipped current pointer update because base version task-75-job-0-0-0-build is stale, latest version is task-77-job-0-0-0-build.

Conclusion:

  • Multiple tasks can run and reach publish concurrently.
  • Lease serializes the critical section.
  • baseVersion validation correctly prevents stale tasks from rolling back current.json.
  • Stale publish attempts are skipped without failing the main job.

This validates the intended behavior for the shared cache publish race scenario.

huanghongbo added 3 commits April 28, 2026 14:55
Signed-off-by: huanghongbo <huanghongbo@koderover.com>
Signed-off-by: huanghongbo <huanghongbo@koderover.com>
Signed-off-by: huanghongbo <huanghongbo@koderover.com>