feat: support ignore_cache for workflow cache and add lock-protected shared storage #4640
Open
huanghongbo-hhb wants to merge 10 commits into koderover:main from
added 3 commits
April 15, 2026 18:28
Signed-off-by: huanghongbo <huanghongbo@koderover.com>
added 3 commits
April 24, 2026 18:12
Validation update for the Lease-based shared cache publish flow: I ran a concurrent workflow test against the same shared cache key and verified the expected stale-base protection behavior.
Observed result:
Conclusion:
This validates the intended behavior for the shared cache publish race scenario.
added 3 commits
April 28, 2026 14:55
Background
Workflow task execution already supports cache usage for both object storage and shared storage backends, but `ignore_cache` behavior was not consistently supported across them.
This PR adds:
- `ignore_cache` support for object storage cache
- `ignore_cache` support for shared storage cache
Expected behavior:
- `ignore_cache=false`: restore previous cache and publish new cache
- `ignore_cache=true`: skip restoring previous cache, but still publish new cache
Changes
Object storage cache
Added full `ignore_cache` support for object storage cache in workflow task execution:
- with `ignore_cache=true`, skip `download archive`
- `tar archive` / cache upload logic is preserved
Covered job types:
Shared storage cache
Added dedicated restore/publish steps for shared storage cache and upgraded it to a versioned model:
- `shared_cache_restore`
- `shared_cache_publish`
- `current.json`
- `snapshots/<version>`
Behavior:
- `ignore_cache=false`: run shared cache restore and publish
- `ignore_cache=true`: skip shared cache restore, still run publish
Implementation details:
- `/zadig/cache-store`
- `current.json`
- `current.json` is protected by a Kubernetes Lease
Kubernetes Lease coordination
This PR uses a Kubernetes `coordination.k8s.io/v1` Lease instead of injecting Redis credentials into workflow task pods for the shared cache publish lock.
The publish flow is:
1. write the new snapshot to `snapshots/<version>` without holding the lock
2. acquire the Lease and check `baseVersion`
3. update `current.json` if the base is still current
This keeps the Lease critical section short and avoids holding a distributed lock while copying potentially large cache directories.
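The publish flow above can be sketched in plain Go. This is an illustrative in-memory model, not the PR's implementation: `Store`, `Publish`, and `ErrStaleBase` are hypothetical names, and a `sync.Mutex` stands in for the Kubernetes Lease.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrStaleBase reports that the pointer moved after the publisher read its base.
var ErrStaleBase = errors.New("stale base: current pointer changed since restore")

// Store is an in-memory stand-in for the shared cache directory.
// Assumption: the lease mutex only imitates the Kubernetes Lease.
type Store struct {
	lease     sync.Mutex          // hypothetical stand-in for the Lease
	snapMu    sync.Mutex          // protects the snapshots map only
	snapshots map[string][]string // version -> files, like snapshots/<version>
	current   string              // version named by current.json
}

// NewStore returns an empty store with no current version.
func NewStore() *Store {
	return &Store{snapshots: map[string][]string{}}
}

// Current returns the version the pointer names right now.
func (s *Store) Current() string {
	s.lease.Lock()
	defer s.lease.Unlock()
	return s.current
}

// Publish writes the snapshot outside the critical section, then moves the
// pointer under the lease only if baseVersion is still the current version.
func (s *Store) Publish(baseVersion, newVersion string, files []string) error {
	// Step 1: the potentially slow copy happens without the lease.
	s.snapMu.Lock()
	s.snapshots[newVersion] = append([]string(nil), files...)
	s.snapMu.Unlock()

	// Steps 2 and 3: short critical section around the pointer update.
	s.lease.Lock()
	defer s.lease.Unlock()
	if s.current != baseVersion {
		return ErrStaleBase
	}
	s.current = newVersion
	return nil
}

func main() {
	s := NewStore()
	fmt.Println(s.Publish("", "v1", []string{"a"}))        // base matches the empty pointer
	fmt.Println(s.Publish("", "v2", []string{"b"}))        // base is stale, pointer stays on v1
	fmt.Println(s.Publish("v1", "v3", []string{"a", "c"})) // base is still current
	fmt.Println(s.Current())
}
```

In this model a stale publisher leaves its snapshot behind but does not move the pointer, which is the property the `baseVersion` check is meant to provide.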
Workflow runtime RBAC
Workflow task pods already run with `workflow-cm-sa`. This PR extends the existing `workflow-cm-manager` Role with the minimum Lease permissions required by the jobexecutor:
The initialization paths also update existing `workflow-cm-manager` Roles so upgraded clusters receive the new Lease permission instead of only newly initialized clusters.
To cover long-lived namespaces that already existed before this change, workflow job creation now also performs an idempotent namespace-scoped RBAC repair when a job includes shared cache restore/publish steps:
- `workflow-cm-manager` exists in the target namespace
This avoids requiring manual RBAC patching for every historical environment namespace before shared cache publish can use Kubernetes Lease successfully.
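An idempotent repair of this kind might look like the sketch below. It uses trimmed local structs rather than the real Kubernetes client types, `EnsureLeaseRule` is a hypothetical helper, and the verb list `get`, `create`, `update` is an assumption for illustration; this description does not spell out the exact verbs.

```go
package main

import "fmt"

// PolicyRule is a trimmed local stand-in for a Kubernetes RBAC rule.
type PolicyRule struct {
	APIGroups []string
	Resources []string
	Verbs     []string
}

// Role is a trimmed local stand-in for a namespaced Role.
type Role struct {
	Name  string
	Rules []PolicyRule
}

func contains(list []string, want string) bool {
	for _, item := range list {
		if item == want {
			return true
		}
	}
	return false
}

// covers reports whether rule grants every verb on the group and resource.
// Wildcard entries ("*") are not interpreted in this sketch.
func covers(rule PolicyRule, group, resource string, verbs []string) bool {
	if !contains(rule.APIGroups, group) || !contains(rule.Resources, resource) {
		return false
	}
	for _, v := range verbs {
		if !contains(rule.Verbs, v) {
			return false
		}
	}
	return true
}

// EnsureLeaseRule appends a Lease rule when none covers the verbs.
// It returns true only when the Role was changed, so repeated calls are no-ops.
func EnsureLeaseRule(role *Role, verbs []string) bool {
	for _, r := range role.Rules {
		if covers(r, "coordination.k8s.io", "leases", verbs) {
			return false
		}
	}
	role.Rules = append(role.Rules, PolicyRule{
		APIGroups: []string{"coordination.k8s.io"},
		Resources: []string{"leases"},
		Verbs:     append([]string(nil), verbs...),
	})
	return true
}

func main() {
	role := &Role{
		Name: "workflow-cm-manager",
		Rules: []PolicyRule{
			{APIGroups: []string{""}, Resources: []string{"configmaps"}, Verbs: []string{"get", "update"}},
		},
	}
	verbs := []string{"get", "create", "update"} // assumed verbs, for illustration only
	fmt.Println(EnsureLeaseRule(role, verbs))    // Role is changed on the first call
	fmt.Println(EnsureLeaseRule(role, verbs))    // second call finds the rule and does nothing
	fmt.Println(len(role.Rules))
}
```

Returning whether the Role changed lets a caller skip the API update when nothing needs repair, which keeps the check cheap enough to run at job creation.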
Redis runtime env cleanup
The workflow task pod no longer needs Redis envs for shared cache locking, so this PR removes the extra Redis env injection from workflow job pods:
- `REDIS_HOST`
- `REDIS_PORT`
- `REDIS_USERNAME`
- `REDIS_PASSWORD`
- `REDIS_COMMON_CACHE_DB`
ignore_cache behavior
`ignore_cache` only affects reading historical cache content.
It does not affect publishing the new cache result:
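A minimal sketch of that rule, using the step names from this PR. `PlanCacheSteps` is a hypothetical helper, not a function from the codebase.

```go
package main

import "fmt"

// PlanCacheSteps returns the shared cache steps for a job, in order.
// The restore step is dropped when ignoreCache is set; publish always runs.
func PlanCacheSteps(ignoreCache bool) []string {
	steps := []string{}
	if !ignoreCache {
		steps = append(steps, "shared_cache_restore")
	}
	steps = append(steps, "shared_cache_publish")
	return steps
}

func main() {
	fmt.Println(PlanCacheSteps(false)) // restore, then publish
	fmt.Println(PlanCacheSteps(true))  // publish only
}
```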
Why Kubernetes Lease
Shared cache publish coordination happens inside workflow task pods running in Kubernetes. A Kubernetes Lease is a better fit for this cluster-local coordination than passing Redis credentials into every workflow task pod only for a lock.
The Lease is namespace-scoped, governed by Kubernetes RBAC, and only requires minimal permissions on `leases.coordination.k8s.io`.
The correctness of the shared cache publish flow still relies on the versioned snapshot model and `baseVersion` validation. The Lease only serializes the short `current.json` pointer update section.
Validation
Object storage cache
Verified that:
- `ignore_cache=false` keeps both `download archive` and `tar archive`
- `ignore_cache=true` skips `download archive`
- `tar archive` is still preserved
Shared storage cache behavior
Verified that:
- `ignore_cache=false` runs shared cache restore and publish
- `ignore_cache=true` skips shared cache restore and still runs publish
- `current.json`
- `snapshots/`
- `current.json`
Shared storage concurrency validation
Triggered multiple concurrent tasks and observed that:
Local package validation
Ran:
/usr/local/go/bin/go test ./pkg/microservice/aslan/core/common/service/kube ./pkg/microservice/aslan/core ./pkg/microservice/jobexecutor/core/service/step ./pkg/tool/kube/lease
Result:
Notes
Cache publish failure or skipped current pointer update does not fail the main workflow job. Shared cache is still treated as an acceleration mechanism rather than a hard requirement for overall job success.
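One way to express this "acceleration, not requirement" policy is to mark cache steps as best-effort in the step runner. The sketch below is an assumption about shape only: `Step`, `BestEffort`, and `RunSteps` are hypothetical names, not the jobexecutor's API.

```go
package main

import (
	"errors"
	"fmt"
)

// errPublish is a sample failure used by the demo in main.
var errPublish = errors.New("publish failed")

// Step is a hypothetical unit of work in a job.
type Step struct {
	Name       string
	BestEffort bool // when true, a failure is recorded but does not fail the job
	Run        func() error
}

// RunSteps runs steps in order. A failing best-effort step becomes a warning;
// a failing required step stops the job and is returned as the error.
func RunSteps(steps []Step) (warnings []string, err error) {
	for _, s := range steps {
		runErr := s.Run()
		if runErr == nil {
			continue
		}
		if s.BestEffort {
			warnings = append(warnings, s.Name+": "+runErr.Error())
			continue
		}
		return warnings, fmt.Errorf("step %s failed: %w", s.Name, runErr)
	}
	return warnings, nil
}

func main() {
	warnings, err := RunSteps([]Step{
		{Name: "build", Run: func() error { return nil }},
		{Name: "shared_cache_publish", BestEffort: true, Run: func() error { return errPublish }},
	})
	fmt.Println(warnings, err) // the publish failure is reported as a warning only
}
```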