
[SPARK-57075][INFRA] Share precompile Coursier cache with test/pyspark/sparkr jobs #56118

Draft
zhengruifeng wants to merge 1 commit into apache:master from zhengruifeng:share-precompile-coursier-cache-dev6

@zhengruifeng
Contributor

What changes were proposed in this pull request?

Add the precompile-coursier- cache as a restore-key fallback for the
test, pyspark, and sparkr jobs in .github/workflows/build_and_test.yml,
so they can reuse the dependency JARs already resolved by the precompile
job when their own Coursier cache misses.

Concretely, the Cache Coursier local repository step in each of the three
jobs now lists two additional fallback restore-keys after the existing
prefix fallback (the primary key and the existing prefix fallback are unchanged):

restore-keys: |
  <existing-prefix>-coursier-
  precompile-coursier-${{ hashFiles('**/pom.xml', '**/plugins.sbt') }}
  precompile-coursier-
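
To illustrate the ordering, here is a minimal Python sketch of how a restore-key list like the one above is resolved. It is a simplified model of the documented actions/cache lookup (exact match on the primary key, then prefix matches tried in order, newest entry winning) and ignores branch scoping and cache versions. `CacheEntry`, `resolve_cache`, the `abc123`/`zzz999` hashes, and the timestamps are all hypothetical names and values chosen for the example, not part of the workflow.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class CacheEntry:
    key: str
    created_at: int  # hypothetical timestamp; larger means newer


def resolve_cache(
    primary_key: str,
    restore_keys: Sequence[str],
    entries: Sequence[CacheEntry],
) -> Optional[CacheEntry]:
    """Simplified model of cache lookup: exact hit first, then ordered prefixes."""
    # 1. An exact hit on the primary key always wins.
    for entry in entries:
        if entry.key == primary_key:
            return entry
    # 2. Otherwise try the primary key as a prefix, then each restore-key in
    #    the order listed; within one prefix the newest entry is chosen.
    for prefix in [primary_key, *restore_keys]:
        matches: List[CacheEntry] = [e for e in entries if e.key.startswith(prefix)]
        if matches:
            return max(matches, key=lambda e: e.created_at)
    return None


# Hypothetical keys for the pyspark job; "abc123" stands in for the hashFiles value.
restore_keys = [
    "pyspark-coursier-",
    "precompile-coursier-abc123",
    "precompile-coursier-",
]
cold_branch = [
    CacheEntry("precompile-coursier-abc123", created_at=1),
    CacheEntry("precompile-coursier-zzz999", created_at=2),
]
hit = resolve_cache("pyspark-coursier-abc123", restore_keys, cold_branch)
print(hit.key if hit else "miss")
```

In this model, a job with no cache of its own falls through to the precompile entry whose hash matches the current build files, even when a newer precompile entry with a different hash exists. As soon as any `pyspark-coursier-` entry exists, it takes precedence over both precompile fallbacks.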

Why are the changes needed?

The precompile job already resolves the full superset of dependencies
(it builds with -Phadoop-3 -Pyarn -Pspark-ganglia-lgpl -Phadoop-cloud -Phive -Pkubernetes -Pjvm-profiler -Pkinesis-asl -Phive-thriftserver -Pdocker-integration-tests -Pvolcano) and populates ~/.cache/coursier,
but writes that cache under the key prefix precompile-coursier-. The
downstream test jobs read from ${matrix.java}-${matrix.hadoop}-coursier-,
pyspark-coursier-, and sparkr-coursier- respectively, so they cannot
see the precompile job's cache.

The precompile artifact tarball only bundles target/ directories
(.class files and assemblies); it does not include the resolved JARs.
So when a test job's own Coursier cache is cold (new branch, modified
pom.xml / plugins.sbt), SBT and Coursier still have to re-resolve
and re-download the dependencies from scratch even though the
precompile job already downloaded them in this same workflow.

Adding the precompile cache as a restore-key fallback lets the test
jobs benefit from that work in the cold-cache case. The change is
purely additive: existing per-job caches still take precedence via the
primary key and the first restore-key entry.

Does this PR introduce any user-facing change?

No. CI-only.

How was this patch tested?

The workflow YAML was validated with python3 -c "import yaml; yaml.safe_load(...)".
The effectiveness of the cache fallback can only be observed on actual GHA
runs and will be evaluated by the CI runs on this PR.
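
Beyond a pure syntax check, a slightly stronger local check could confirm that the fallback key is actually present in each job. The following is a rough sketch under stated assumptions, not the check used for this PR: it assumes the cache step is named exactly "Cache Coursier local repository" and keeps its restore-keys under `with`, and the job ids are passed in by the caller because the ids in the workflow file may differ from the job names used above. `jobs_missing_fallback` and `load_workflow` are hypothetical helper names; `load_workflow` needs the third-party PyYAML package.

```python
from typing import Any, Dict, List

STEP_NAME = "Cache Coursier local repository"  # step name as quoted in the description
FALLBACK_PREFIX = "precompile-coursier-"


def jobs_missing_fallback(workflow: Dict[str, Any], job_ids: List[str]) -> List[str]:
    """Return the job ids whose Coursier cache step lacks the precompile fallback."""
    missing = []
    for job_id in job_ids:
        steps = workflow.get("jobs", {}).get(job_id, {}).get("steps", [])
        has_fallback = False
        for step in steps:
            if step.get("name") != STEP_NAME:
                continue
            # A `restore-keys: |` block scalar parses to one newline-separated string.
            restore_keys = step.get("with", {}).get("restore-keys", "")
            lines = [line.strip() for line in restore_keys.splitlines()]
            if FALLBACK_PREFIX in lines:
                has_fallback = True
        if not has_fallback:
            missing.append(job_id)
    return missing


def load_workflow(path: str) -> Dict[str, Any]:
    """Parse a workflow file; PyYAML is imported lazily so the check above needs no extras."""
    import yaml  # third-party dependency (PyYAML)

    with open(path, encoding="utf-8") as handle:
        return yaml.safe_load(handle)
```

A call such as `jobs_missing_fallback(load_workflow(".github/workflows/build_and_test.yml"), job_ids)` would return an empty list when every listed job carries the fallback. Like the syntax check, this only verifies the file's structure and says nothing about whether the cache is actually hit on a runner.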

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Claude Opus 4.7)

@zhengruifeng zhengruifeng changed the title [INFRA] Share precompile Coursier cache with test/pyspark/sparkr jobs [SPARK-57075[INFRA] Share precompile Coursier cache with test/pyspark/sparkr jobs May 26, 2026
@zhengruifeng zhengruifeng changed the title [SPARK-57075[INFRA] Share precompile Coursier cache with test/pyspark/sparkr jobs [SPARK-57075][INFRA] Share precompile Coursier cache with test/pyspark/sparkr jobs May 26, 2026
@zhengruifeng zhengruifeng force-pushed the share-precompile-coursier-cache-dev6 branch from dac4a74 to b759a65 Compare May 27, 2026 07:05
Add the `precompile-coursier-` cache as a restore-key fallback for the
`test`, `pyspark`, and `sparkr` jobs in `build_and_test.yml` so they can
reuse the dependency JARs already resolved by the `precompile` job
instead of re-downloading them when their own Coursier cache misses.

Generated-by: Claude Code (Claude Opus 4.7)
@zhengruifeng zhengruifeng force-pushed the share-precompile-coursier-cache-dev6 branch from b759a65 to fc6274b Compare May 27, 2026 12:06