Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Re-work of mucoll-spack to enable build of releases fully on github #17

Merged
merged 102 commits into from
Jan 10, 2025

Conversation

madbaron
Copy link
Contributor

@madbaron madbaron commented Nov 25, 2024

In preparation for the next software release, this PR updates mucoll-spack to enable the complete build of releases within the github workflows.

This relies on an intermediate spack environment (key4hep-dev-base) by @tmadlener that contains the external dependencies of the key4hep stack. This environment is used to build an intermediate docker image (called mucoll-minimal) that is subsequently picked up to build the muon collider stack.

The CI is configured so that whenever the dockerfiles, or the spack configuration is changed, only the affected images (and those downstream) will be re-built.
The resulting docker images are automatically published and linked to the github page.

The complete workflow takes about 8 hours to complete on the github runners.

Please note that deploying the workflows to re-build the releases requires (without using external triggers) the dockerfiles to be bundled together with mucoll-spack, which is different from what we did with 2.9.
Personally I think this is ok (perhaps even a little better for discoverability) but please let me know if you disagree.

Adding a few people as reviewers!


Some notes for the future:

  • the preparation of the mucoll-minimal image (which is really a key4hep-minimal) would be best moved to key4hep-spack as it is completely generic
  • we should set up a cron-triggered workflow to run as nightly build with the latest versions of all packages
  • the previous concretize workflows have been removed, as they are part of the docker image building process. They could be split off to a separate step of the workflow if useful
  • we can review the list of packages to be compiled (i.e. all event generators) to make the resulting image a little slimmer

@kkrizka
Copy link
Contributor

kkrizka commented Dec 24, 2024

@madbaron Do you have instructions on what you did to setup the GITHUB_TOKEN to push to your own GHRC? I've made sure that " Read and write permissions" are set under "Settings -> Actions -> General" and I saw you already set the permissions:packages field in the workflow YAML. However I still keep getting a 403 Forbidden when trying to push (example). Googling so far hasn't been help. I've managed to get it working using a classic token, but I don't think that's compatible with your GITHUB_TOKEN setup.

@@ -0,0 +1,77 @@
name: Re-build and publish MuColl
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this works in your fork, but I don't understand how. Don't you first need to build the base image everytime as the FROM base image tag depends on the SHA of this commit?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's the branch name, not the SHA. But don't we need something to trigger the building of the first mucoll-spack image in any new branch?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can I propose that we disable the partial image build now and only enable the full image build for now? It will make testing changes that only affect the last image easier, since the partial builds fail by design.

We then revisit this once we have a series of central base images that branches can start from.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty open to any problem, but can you elaborate a bit more?
I don't think I understand:

  • what the problem is with the current workflow
  • why only doing the full build would make it easier to debug things

Copy link
Contributor

@kkrizka kkrizka Jan 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem with the workflow can be illustrated by my test brach. For illustration, I've "upgraded" the lcgeo package version. Only the sim:test image build is triggered as the change only affects that stage. However that depends on the minimal:test image, which does not exist. You can see the triggered and failed actions here.

This is the problem with the current workflow that I'm trying to highlight in this thread. It assumes the branch name in the new image tag for all images, even when the base image (e.g. minimal) does not (and is not) have to be rebuild. Long term, this has plenty of solutions (e.g. we fix a dependency on a key4hep external image tag or try to do something smart like using a main base image if the branch one is not found). But those are not available right now, which is why I'm proposing to stick to the long full builds for now to get something that works 100%. Do let me know if I missed something.

Regarding the "easier to debug things". Due to the problem described above, doing the full build is currently the only setup that can pass by design. Thus it is the only way to test that the committed contents work.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I got your point much better now.

I think the only argument towards keeping the partial builds around was debug time once a branch exists (anyway none of these workflows can be listed as a requirement in PRs because the github tokens would not get write access) but that is likely largely solved by your buildcache updates.

I'll experiment with implementing the "fall back to main base image". If it turns out to be easy, I think it will be nice to have around anyway to save a few extra minutes.

@madbaron
Copy link
Contributor Author

madbaron commented Dec 24, 2024

The GitHub token setup should be ok (ok as in should work out if the box) as is.

However, if the registry contains an image with the same name but uploaded from another repository, the token won't get write access.
Could this be your case?

In that case you need to get rid of the previous image (I couldn't find a way to get the GitHub token to write over it).

@kkrizka
Copy link
Contributor

kkrizka commented Dec 24, 2024

Yes, the problem turned out to be an existing mucoll-spackpackage from my old MuonCollider-docker fork. The other way to solve it was to add the new repository to the package (instructions).

@kkrizka
Copy link
Contributor

kkrizka commented Jan 1, 2025

@madbaron @tmadlener Have you considered using buildcaches to prevent rebuilding the common system packages? I have a prototype mostly working in my buildcache branch. It currently fails building py-onnxruntime due to running out of space. But it correctly pulls packages built in previous attempts and thus is much quicker. For example, build-minimal takes 17 min when all packages are found in the cache.

@madbaron
Copy link
Contributor Author

madbaron commented Jan 4, 2025

Hi @kkrizka,
yes, @tmadlener brought this up a few times (this is how the EIC builds their images - but using some remote S3 storage).
When looking at the free buildcache hosted by github it was not obvious it would be enough, so we postponed looking into it while we were ironing out the multi-step container workflow.

But I see the workflow in your build-minimal example was successful until the end - so I guess we can pick it up?

I think we should get things moving. Shall we merge this one first (after addressing the remaining discussion points) and follow-up just after with your buildcache additions?
Or would you rather push your changes into this PR directly?

@kkrizka
Copy link
Contributor

kkrizka commented Jan 4, 2025

Hi @madbaron, yes it would be better to merge this and proceed on using buildcache in a separate PR. There is still some work left and there might be some discussion on the details (e.g. do we run spack standalone first and then again for docker?)

@kkrizka
Copy link
Contributor

kkrizka commented Jan 7, 2025

This now seems like a good start for the updated workflow. Shall we merge this now and follow up on any additions via smaller PRs?

@madbaron
Copy link
Contributor Author

madbaron commented Jan 7, 2025

Thanks.
I'll merge this as soon as the workflow finishes successfully in my branch (I got one of the checksums wrong when tagging the packages).

@kkrizka
Copy link
Contributor

kkrizka commented Jan 7, 2025

Actually might be worth waiting until key4hep/key4hep-spack#676 is merged so the ref hash becomes permanent. The current ref is broken as for 30 min ago as @tmadlener has been busy. It sounds like the merging of that is close though?

@madbaron madbaron merged commit c2bf588 into MuonColliderSoft:master Jan 10, 2025
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants