Skip to content

Clone specific software-layer-commit and implement CI to check merged status#1353

Open
casparvl wants to merge 22 commits intoEESSI:mainfrom
casparvl:improve_software_layer_scripts_workflow
Open

Clone specific software-layer-commit and implement CI to check merged status#1353
casparvl wants to merge 22 commits intoEESSI:mainfrom
casparvl:improve_software_layer_scripts_workflow

Conversation

@casparvl
Copy link
Collaborator

@casparvl casparvl commented Jan 13, 2026

This PR is an initial step in creating a workflow where we can use PRs to software-layer-scripts directly, and then once they are merged, just update the SHA checksum to a (Github-signed) merge commit, rebuild, and be done :)

Edit: better description...

This PR contains three changes:

  1. bot/build.sh now clones a particular commit, which is specified in bot/commit_sha
  2. CI job Verify software-layer-scripts / check_bot_build_checksum verifies the checksum of bot/build.sh against a reference hardcoded in the workflow file. This way, a malicious contributor would have to modify both bot/build.sh and the workflow file, which would (hopefully) stand out to a reviewer.
  3. CI job Verify software-layer-scripts / check_software_layer_scripts_commit which check that bot/commit_sha is a commit that is part of the upstream https://github.com/EESSI/software-layer-scripts , is on the main branch (i.e. has been merged), is a merge commit, and is signed with the public Github GPG for the web interface.

We might need to update the commit_sha already (not sure if there have been more merges to software-layer-scripts since I started this) before we merge this to make sure this PR doesn't actually revert us to using an older version.

@casparvl
Copy link
Collaborator Author

Let's do a small test build to see if the new build.sh works, i.e. if it correctly clones the software-layer-scripts repo from a given commit.

@casparvl
Copy link
Collaborator Author

Perfect. CI run on 2cd6082 shows

Commit c0a3ff09a3a38737af5a922fdf581aa7b2dd6c88 is NOT merged into origin/main.
Error: Process completed with exit code 1.

as expected, since this commit is on a feature branch but is not merged. Then, using a merge commit as in 6d954c4 the CI now passes.

@casparvl
Copy link
Collaborator Author

TODO: I guess in this same PR we should still add a check that verifies that the SHA-checksum of bot/build.sh itself has remained unchanged (there should be no reason to change it, since the sha-checksum is external to this file).

@casparvl
Copy link
Collaborator Author

Perfect. As expected, after changing the bot/build.sh in f1fdcca and fixing a typo in the workflow in c4b1f9a I get:

Computed checksum: bb805939ae22f3ca2e6fc85d13613aeb9b3fc81974a2e1ef3bfc85a7f3ae8a0f
Reference checksum: 9d33368cac2e38e10147eeb0aafc321651ebaa5912387ecef97683570906773a
ERROR: Checksum mismatch! The file bot/build.sh has been modified.

Changing the bot/build.sh back to it's original version in 0494884 and having the CI run on a subsequent merge commit 72fbb29 I now get

Computed checksum: 9d33368cac2e38e10147eeb0aafc321651ebaa5912387ecef97683570906773a
Reference checksum: 9d33368cac2e38e10147eeb0aafc321651ebaa5912387ecef97683570906773a
Checksum for bot/build.sh matches the reference value

@casparvl
Copy link
Collaborator Author

Ok, as expected, both Verify software-layer-scripts / check-bot_build_checksum and Verify software-layer-scripts / check_software_layer_scripts_commit pass on bce9bbc

Let's test again by changing the sha checksum...

@casparvl
Copy link
Collaborator Author

Ok, on bee1d29 we again have the expected failure:

Commit c0a3ff09a3a38737af5a922fdf581aa7b2dd6c88 is NOT merged into origin/main.
Error: Process completed with exit code 1.

Let's change the sha checksum back, and now change something in bot/build.sh to test the other CI job.

…f this causes the associated CI job to fail
@casparvl
Copy link
Collaborator Author

Again, we get the expected failure:

Computed checksum: 93705d4ae3517d9dfcac79d4e7a113e62977d9187b1af9bd4797c03367a9cdfb
Reference checksum: 9d33368cac2e38e10147eeb0aafc321651ebaa5912387ecef97683570906773a
ERROR: Checksum mismatch! The file bot/build.sh has been modified.
Error: Process completed with exit code 1.

@casparvl casparvl marked this pull request as ready for review January 19, 2026 14:54
@casparvl
Copy link
Collaborator Author

casparvl commented Jan 20, 2026

Discussed in support meeting. One issue came up - though it doesn't affect this specific PR, nor the CI checks in it.

The issue is related to how we might use the commit_sha introduced in this PR to make the bot refuse to deploy. The rationale is: if a bot has tarballs with a commit_sha that do not match the bot/commit_sha, those are invalid and should not be deployed. However, what may happen is this:

  • software-layer-scripts PR 1 introduces a change (e.g. to eb_hooks), commit X
  • software-layer PR 1 uses it to build some software, and proves that the new eb_hooks solves some issue
  • software-layer-scripts PR 1 is merged, creating a merge-commit Y
  • software-layer PR 1 updates bot/commit_sha to Y and rebuilds all tarballs
  • software-layer-scripts PR 2 introduces a different change, which gets accepted and merged with merge commit Z
  • software-layer PR 2 uses merge commit Z to build all tarballs
  • software-layer PR 2 gets deployed and merged
  • software-layer PR 1 gets synced with main. This means bot/commit_sha should be updated to Z (as Z happened after Y, and we don't want a merge of software-layer PR 1 to 'downgrade' the bot/commit_sha to an older version than the one on main). This now invalidates all builds done with bot/commit_sha=Y - even though those are perfectly valid builds!

I'm not yet sure how to avoid this. It is not possible to retain Y in SW PR1, as a merge of PR1 would then cause main to contain bot/commit_sha=Y, which we don't want if Z is newer than Y. So we have to come up with a way that if we then accept Z in bot/commit_sha in the PR 1, that builds with Y are still considered 'valid'.

Note that I don't think we should hold off the current PR: it's a first step that'll provide us with the ability to build in software-layer from PRs to software-layer-scripts, which makes this workflow smoother already. And it guards at least against merging this PR if the commit_sha doesn't belong to a signed merge commit. Those two aspects in themselves add value - even without any modifications to the bot-deploy procedure.

@casparvl
Copy link
Collaborator Author

Ok, just did a brainstorm with @bedroge . What we want is this:

  • Commit-sha in software-layer main branch always points to the tip of the main branch for software-layer-scripts. This ensures that a contributor who knows nothing about hooks (and doesn't need hook updates), gets the latest versions.
  • It should be clear which commit sha was used in a given bot build. I.e. the bot should report the commit sha back in its build report messages.
  • The bot status command should aggregate those commit SHAs that are reported in the build report messages, so that reviewers can easily check if all builds where done using the same commit SHA.
  • We should be able to do builds with commit SHA's of unmerged PRs (to prove they work), but we should not be able to deploy those.
  • Right before deploying, the bot should check if the commit sha that was used to create an artefact was indeed a merge commit on the main software-layer-scripts repo. It does NOT need to be the last merge commit. The reason is that if person A created a merge commit that fixes a build for software-layer PR 1, and person B later created a merge commit that fixes a build for software-layer PR 2, we don't want person A to have to redo all their builds. Note that this is also our current practice: the hooks may have been updated even after builds for a given software-layer PR were done, and before they were deployed. That essentially also means we are deploying something that was built with a version of the hooks not matching the current tip of the main branch of software-layer-scripts - and that's fine.

To get to this stage, we need the following:

  • Software-layer CI: IF commit-sha has changed, then it needs to be changed to the latest commit on software-layer-scripts main before it gets merged
  • Bot build command change: read commit_sha from software-layer-scripts clone in job dir, and report it in the build messages.
  • Bot status command change: aggregate reported commit_sha’s, so that we can (for now manually?) check that all are the same.
  • Bot deploy command expand with pre-deploy step. That step should read commit_sha from software-layer-scripts clone in job dir, check that this is a merge commit to software-layer-scripts main. Otherwise, don’t deploy. This should probably be done somewhere around here https://github.com/EESSI/eessi-bot-software-layer/blob/167da4aea201fdde72bee38ed7db3edd5cbe3f06/tasks/deploy.py#L677 . This same pre-check could later check if all reported commit_sha’s are the same (by running an internal bot:status, without reporting to the actual GH issue?).

@trz42
Copy link
Collaborator

trz42 commented Mar 10, 2026

I wonder what kind of problem this PR is trying to solve (didn't find some kind of problem description or issue related to this PR):

  • Is it a problem that causes inconsistencies in how we build software within a PR and/or across PRs? How often have we hit that problem before? (If we have no data on how often, I think a low-effort change could be to implement changes that collect the necessary data.)
  • Is it more a convenience type of problem that exists because the software-layer repo was split, and it has become a bit cumbersome to first update the software-layer-scripts contents and then later use that in PRs to software-layer?

To me the changes suggested are substantial (not so much in this PR itself, but particularly for bot commands). I would be reluctant to do these changes (now) if they address a mostly convenience type of issue.

Besides all the necessary changes in software-layer-scripts and eessi-bot-software-layer, would this PR mean that one has to change the commit_sha every time software-layer-scripts contents change?

@casparvl
Copy link
Collaborator Author

casparvl commented Mar 11, 2026

All of the above ;-)

  • Convenience: we've all seen how messy things are when easystacks are temporarily added to software-layer-scripts for testing. Personally, this annoys me enough to put in the effort, but the problem is indeed more extensive.
  • Tracability of builds: We claim very good tracibility of our builds, but it's actually very hard to figure out which version of the hooks was used for which build. The only way to know is to go into the reprod dirs and get the hooks file from there (but which git commit this corresponded to, is probably impossible to figure out). Aside from the hooks, it also means that the state of things like installation scripts (like EESSI-install-software.sh) can no longer be retreived.
  • Reviewing: There are very weird situations in software-layer PRs where you see a succesful build, right after a failing one. As a reviewer, we can currently only assume that it became succesful because of a change in the eb_hooks.py in the CVMFS repo. But we cannot check exactly what change was made, and whether that makes sense.
  • Scalability: with more and more people working on the software-layer-scripts we've had multiple cases where multiple people where modifying hooks at the same time. It's hard to make sure that a deployment from PR 1 doesn't override the changes from PR 2 if PR 2 got deployed first. With the proposed approach, this becomes a lot more explicit, as you'd get a git merge conflict in PR 1, that you simply have to resolve in your feature branch. This then also forces you to do a rebuild, as the CI that checks for consistency between the deployed eb_hooks and the one in the feature branch will fail until you redeploy.

Besides all the necessary changes in software-layer-scripts and eessi-bot-software-layer, would this PR mean that one has to change the commit_sha every time software-layer-scripts contents change?

Yes. If you need a newer version of software-layer-scripts to be used, you have to be explicit about it - that's the whole point, we want it to be explicit about which version of software-layer-scripts was used, so we have to tell it. But note that the commit_sha in software-layer main branch will always point to the tip of software-layer-scripts's main branch. So a naive user who doesn't know about this won't have to be bothered to change / update the commit_sha.

# This workflow verifies that the correct version of software-layer-scripts is used.
#
# First, check_bot_build_checksums checks if the bot/build.sh code that clones software-layer-scripts is untouched,
# as this normally shouldn't change (a change could mean a contributor is trying to inject something
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be toned down, changing the bot/build.sh script doesn't necessarily imply that something malicious is going on.

It's weird/unusual, sure, but let's not brand every possible change to bot/build.sh script as malicious up front

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's especially ironic that this very PR is also making changes to bot/build.sh ;-)

with:
fetch-depth: 1 # We only need the current revision to read bot/commit_sha
- name: Checkout software-layer-scripts (full history)
uses: actions/checkout@v4
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should also use specific commit, as above

uses: actions/checkout@v4
with:
repository: EESSI/software-layer-scripts
path: upstream-scripts
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why make it difficult, just call this EESSI-software-layer-scripts ?

TOPDIR=$(dirname $(realpath $0))

# Clone a the commit from software-layer-script that corresponds to `bot/commit_sha`
commit_sha=$(cat ${TOPDIR}/commit_sha)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please rename this file so it's clear from the filename what this refers to:

Suggested change
commit_sha=$(cat ${TOPDIR}/commit_sha)
software_layer_scripts_commit=$(cat ${TOPDIR}/software_layer_scripts_commit)

@boegel
Copy link
Contributor

boegel commented Mar 12, 2026

Besides all the necessary changes in software-layer-scripts and eessi-bot-software-layer, would this PR mean that one has to change the commit_sha every time software-layer-scripts contents change?

Yes. If you need a newer version of software-layer-scripts to be used, you have to be explicit about it - that's the whole point, we want it to be explicit about which version of software-layer-scripts was used, so we have to tell it. But note that the commit_sha in software-layer main branch will always point to the tip of software-layer-scripts's main branch. So a naive user who doesn't know about this won't have to be bothered to change / update the commit_sha.

This is the critical point for me: as long as the approach we're taking here doesn't affect most contributors, it's OK imho.

And I would say that's indeed the case here: only if you need updated hooks (or other scripts), you need to take action.

I like the approach being proposed here, as it removes a great deal of uncertainty: no more guessing which commit of the scripts was used, it's crystal clear.

I wouldn't hurt to make bot/build.sh print out which commit of software-layer-scripts it's using BTW.

We should also update our docs to reflect these changes, with a specific subsection that outlines the procedure if changes to hooks are required.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants