Notes from EasyBuild maintainer summit 2021
- Åke
- Alan
- Alex
- Bart
- Bob
- Caspar
- Damian
- Davide
- Fotis
- Kenneth
- Lars
- Miguel
- Mikael
- Sam
- Sebastian
- Simon
- Adam
- Åke
- Alan
- Alex
- Bart
- Bob
- Fotis
- Kenneth
- Lars
- Miguel
- Mikael
- Sebastian
- Sam
- Simon
- raised by Alan
- mostly due to visualisation stuff that's done very differently at JSC
- creates problems w.r.t. contributing back
- OpenGL wrapper around Mesa (next to X11)
- plan was to create a PR for that (but person responsible didn't find time yet)
- (Mikael) quite a complex bundle of stuff, could/should be broken up?
- (Alan) some additional magic on top to ensure right stuff gets used
- LINK TO PR?
- (Alan) also: Julia (see CSCS setup)
- not upstream yet, also used at JSC
- (Bart) work done by CSCS involved too much custom stuff, we ended up with this:
```
modluafooter = '''
append_path("JULIA_DEPOT_PATH", ":")
append_path("JULIA_LOAD_PATH", ":")
'''
```
- (Bart) there is a JuliaPackage easyblock to install Julia packages as extensions: https://github.com/easybuilders/CSCS/tree/master/easybuild/easyblocks has three Julia-related easyblocks: `julia.py`, `juliapackage.py`, and `juliabundle.py`.
- (Damian) also Jupyter
- (Kenneth) why aren't these things being contributed back?
- (Damian) partially due to differences in dependencies in local setups
- sticking closer to upstream could help
- people are mostly focused on what's needed for own site
- (Bart) ComputeCanada has a couple of customized easyblocks
- contributing those changes back takes time and effort
- internal stuff is sometimes not "good enough" for upstream
- example: GPU offloading in GCC, used to live in custom easyconfigs, now integrated in main gcc.py easyblock, so available by default in GCCcore.
- (Kenneth) could pairing up with a maintainer from another site help to get stuff upstreamed?
- Bart + CSCS for Julia?
- Mikael + JSC for visualisation stuff?
- once you fork, it gradually becomes more difficult to contribute back
- "Perfect is the enemy of good"
- (Fotis) see easybuild.experimental repository
- (Sebastian) asked people at JSC to open PRs to upstream stuff
- nobody actually did, only got requests to help out with other stuff
- there's definitely a barrier there...
- people are not familiar with GitHub integration (they're more familiar with GitLab)
- extending GitHub integration with support for GitLab could help
- quite different toolchains at JSC
- partially due to Parastation MPI (could be custom toolchain upstream)
- different names/versions for standard toolchains like gompi/gomkl
- raises the bar for easy contribution upstream
- (Damian) differences in toolchains grew historically
- toolchain versions are tied to JSC stages
- (Damian) JSC could work on sticking closer to upstream toolchain definitions
- toolchains seem to be more up-to-date upstream currently?
- (Caspar) similar problems at SURFsara before
- changing internal workflow a lot helped
- open PRs more quickly, use --from-pr to install (don't wait for merge)
- any reason why JSC doesn't take this approach?
- (Damian) probably just workflow we're used to
- new stage is an opportunity to change this and increase overlap with upstream
- (Sebastian) also thinking about custom toolchain for AMD
- AOCC, BLIS/libFLAME, ...
- (Damian) biggest concern is easyconfigs using compiler-only toolchain like GCCcore
- (Damian) things like SciPy-bundle are handled differently in JSC to avoid using MPI toolchain
- (Mikael) relates to "diamond" toolchains with compiler+BLAS/LAPACK subtoolchain (no MPI)
- (Bart) FFT MPI is actually rarely used
- (Åke) VASP?
- (Bart) checked in detail, seems not
- could be a separate package with only FFTW wrappers for MKL
- (Damian) MKL is installed in two places: system + full toolchain (incl. FFTW wrappers)
- same version, so doesn't really cause problems
- (Kenneth) can we make changes centrally to make it easier for sites to contribute back?
- (Caspar) more structured/faster discussion on new toolchains?
- they're very fundamental, quicker turnaround time needed there
- (Damian) UCX is at system level at JSC (along with CUDA)
- so using system compiler to build UCX
- also related to Parastation, where vendor tests with OS compiler
- (Åke) hooks are the place to implement more or less simple site-specific changes
- could be done such that no changes are needed to easyconfig, so they can be sent upstream easily
- (Alex) two types of divergence: easyconfigs vs toolchains/easyblocks
- too hard to follow easyconfig differences
- chasing down more significant changes, such as easyblock and toolchain differences, is more feasible
- (Kenneth) working group on defining common toolchains
- try to stick to original timeframe (a by Jan, b by July)
- should we decouple version of GCCcore subtoolchain from GCC version?
- allow easier updating to different GCC while still in `develop`
- allow easier divergence for sites that want to
- already done at JSC to some extent with MPI with module naming scheme (OpenMPI/4.1)
- does create some tension with reproducibility
- (Damian) plan to use FlexiBLAS for future (foss) toolchains?
- what about ScaLAPACK and FFTW?
- any problems with mixing BLIS for BLAS (via FlexiBLAS) vs FFTW via MKL
- different library names, so they don't conflict
- (Bart) planning to look into support for MKL backend for FlexiBLAS
- (Alan) FlexiBLAS provides a lot of flexibility
- MPI thing doesn't seem to be as mature?
- Alan reached out to them, no response?
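
To illustrate Åke's point above about hooks, here is a minimal sketch of a site-specific hooks file that adds the Julia module footer shown earlier without touching the easyconfig itself. `parse_hook` is one of the hook names EasyBuild picks up from a file passed via `--hooks`; the file name `site_hooks.py`, the `JULIA_LUA_FOOTER` constant, and the choice to apply it to every easyconfig named `Julia` are assumptions made for this example.

```python
# site_hooks.py (hypothetical file name), used as: eb --hooks=site_hooks.py ...

# Assumption: Lua snippet a site wants in every Julia module file
JULIA_LUA_FOOTER = '''
append_path("JULIA_DEPOT_PATH", ":")
append_path("JULIA_LOAD_PATH", ":")
'''


def parse_hook(ec, *args, **kwargs):
    """Tweak easyconfig parameters right after parsing; the easyconfig file stays untouched."""
    if ec['name'] == 'Julia':
        # append rather than overwrite, in case the easyconfig already sets a footer
        ec['modluafooter'] = (ec['modluafooter'] or '') + JULIA_LUA_FOOTER
```

Because the customisation lives in the hooks file, the easyconfig itself can stay identical to the upstream one, which is what makes it easy to send upstream.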
- some PRs take significantly more time
  - new contributors
  - complex software
- will most likely break our easyconfig PR record again this year
- 85% are through our GitHub integration
  - about 500 open for last few months
- 100 different contributors for those... so no silver bullet
- allow reviewer to make "trivial" changes to PRs to get them merged
  - fixing code style issues, adding sanity check command, ...
  - anything that doesn't have impact on how the software is installed
  - contributors usually don't mind
- auto-closing inactive PRs => stalebot GitHub Action (https://github.com/probot/stale)
  - right timeframes?
  - auto-tagging PRs as "stale"
- be stricter about PRs using system toolchain?
  - relies on OS dependencies, limited value in EB ecosystem
- active easyconfig maintainers are ~10
  - finding more people with the right skills and level of attention is hard
- would it have helped to have a "rulebook" for maintainers?
  - (Sebastian...as a recently integrated maintainer) didn't feel that would be necessary
  - Had contributor experience
  - Attended biweekly meeting
  - Open to asking other maintainers questions
  - (Caspar) would be nice to have a checklist for merging PRs
  - Can take time to get back into things when you have not done maintenance for a while
  - The "rules" may change over time (cfr. CUDAcore/CUDA, Python versionsuffix)
  - (xx) make the bot spew out a --review-pr output
  - only comparing against most similar easyconfig in develop
- deprecate toolchains older than 2019
  - make EB produce a warning
  - start closing PRs that use the toolchains
- make --new-pr ask the user to submit a test report?
- let maintainer mark a PR as approved for testing
  - let bot auto-test in "standard" environments
- have bot mark PRs stale
- do we need extra labels to make it easier to find PRs to work on?
  - single 'status' label, partially managed automatically by the bot?
- extra labels to mark PR ready for auto testing by generoso?
- any maintainers not using Octobox yet should start using it
- CI should fail over patches with no comments on top
- `--new-pr` can/should check a couple of common things (checksum, top comment patch)
  - trigger an automatic `--check-contrib`?
  - does not do exactly the same as the CI... but we should fix that!
- test suite errors could be improved sometimes
  - see failure message when code style check fails
- auto-reply from bot with checklist of things for contributor
  - request test report after 1 day (if CI is passing)
- (Adam) Real application testing
  - sanity check commands are only scratching the surface
  - test job on generoso
  - ReFrame?
  - work is being done on a shared library of tests
  - buildtest?
  - collection of links to application benchmarks: https://github.com/boegel/scicobe (needs to be updated)
  - (Sam) more collections: https://c4science.ch/source/scitas-examples
  - would require input from application developers/experts?
  - examples:
  - working correctly
  - some cases raised by @Flamefire
  - performance issues
  - NAMD
  - TensorFlow: https://github.com/easybuilders/easybuild-easyblocks/issues/2577
  - Fotis: PRACE benchmark apps https://repository.prace-ri.eu/git/UEABS/ueabs
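
The idea raised above of checking patch files for a descriptive comment on top, both in CI and from `--new-pr`, could be prototyped along these lines. This is only a sketch: `patch_has_description` is a hypothetical helper, not an existing EasyBuild function, and treating any non-blank line before the first `diff `, `--- ` or `Index: ` line as "a comment" is an assumption about what the check should accept.

```python
def patch_has_description(patch_text):
    """Return True if some non-blank text precedes the first diff header line."""
    # assumption: these prefixes mark the start of the actual diff
    diff_markers = ('diff ', '--- ', 'Index: ')
    for line in patch_text.splitlines():
        if line.startswith(diff_markers):
            # reached the diff itself without seeing any description
            return False
        if line.strip():
            # found descriptive text before the diff starts
            return True
    # empty file, or only blank lines
    return False
```

Sharing one such function between the CI test suite and the `--new-pr` code path would be one way to make sure both perform exactly the same check.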
- Reviewing is (usually) more time-consuming
- Should we have a CI check that verifies that a new feature appears in documentation?
- Need to document requirements for maintainers to know when it's ok to merge
- Would training help?
- It's hard to find your way around
- An overview of the structure
- Workflow from eb command to parsing easyconfig to easyblock, etc.
- People who could help out with this (in order of being familiar with framework): Kenneth, Alan, Bart, Åke
- potential topics
- workflow from eb command to installation
- general overview
- example of implementing new configuration option (+ test to go along with it)
- toolchain support
- overview of framework tests
- session on reviewing PRs perhaps
- implementing easyblocks: refresh of https://easybuilders.github.io/easybuild-tutorial/2021-lust/implementing_easyblocks
- the QA system and its features
- (Alex) sometimes reviewer requests additional changes that are "out of scope" for that PR
- What about easyblocks?
- What are the requirements for test reports of easyblock PRs?
- You cannot (currently) ask the bot to test an easyblock PR
- Nice to know what easyconfigs are touched by an easyblock
- `ambertools` was a case that used the `amber` easyblock and broke when this got updated
- bot could check this for non-generic easyblocks?
- Have a webpage or similar function where you can enter an easyblock name, which will then generate a list of the other easyblocks that use it as their base
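
The "enter an easyblock name, get the easyblocks built on top of it" idea above can be sketched with plain Python class introspection. The classes below are toy stand-ins invented for the example (they are not the real easyblocks), and `derived_classes` is a hypothetical helper name.

```python
# Toy stand-ins for an easyblock class hierarchy (all names are made up)
class ToyEasyBlock:
    pass

class ToyConfigureMake(ToyEasyBlock):
    pass

class EB_Foo(ToyConfigureMake):
    pass

class EB_FooTools(EB_Foo):
    pass

class EB_Bar(ToyEasyBlock):
    pass


def derived_classes(base):
    """Return the sorted names of all classes deriving (directly or indirectly) from base."""
    found = set()
    todo = [base]
    while todo:
        for sub in todo.pop().__subclasses__():
            if sub.__name__ not in found:
                found.add(sub.__name__)
                todo.append(sub)
    return sorted(found)


print(derived_classes(ToyConfigureMake))
```

Note that `__subclasses__()` only knows about classes that have already been imported, so a real tool would first have to import every easyblock module (or parse the sources instead).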
- Slack is not searchable so we should keep away from using it for issues
- (Adam) issue template is a good thing
- (Kenneth) should be optional
- add suggestion to ask on Slack in issue template
- Things to request in template
- eb --show-config
- name of easyconfig
- ...
- Add to docco to use the general easybuild repo for new issues, we will move them into the correct repo
- Document better how to interpret build logs and how to find the actual problem
- Search for _step etc
- See troubleshooting in the docco
- (Kenneth) arch-tagging? will it help
- (Kenneth) Maintainer sprint sessions bi-weekly, on non-eb-bi-weekly weeks
- Docs not being updated when we add features (in framework in particular)
- Current syntax is RST, differences with markdown are enough to be annoying
- Workflow with readthedocs is also a bit annoying if you want to preview
- Tutorial uses mkdocs... and is in a separate repository
- Allows for easy and instant local preview
- Should we also be hosting the docs on GitHub?
- Fotis: GitLab does direct rendering of .rst
- Alan: but the Sphinx stuff it won't deal with
- Move to another format is the biggest jump
- rst-to-myst looks like a good help here
- Starting point
- create a repo and do a page or two
- then do a cry for help, looking for volunteers for 1 page at a time
- Need a decent starting point
- Should have decent CI and contribution docs in place
- Will also need to port automated docs
- ACTION: Look for volunteers to help kickstart this
- ARM and POWER are secondary platforms
  - Don't hold back a PR if Intel/AMD work
  - Open an issue though for non-working archs
  - Do have access to both archs (ARM through EESSI)
  - Doing these checks introduces additional latency
  - Don't have to require these, but can add the capability to the bot to request tests there
  - Can we do a Gentoo-style tagging so we know what works where
  - Keeping track of this in easyconfigs is a maintenance nightmare
  - Can use regression tests to at least document this
  - Alex: what about blacklisting stuff
  - keeping track of known issues
  - in easyconfigs? maintenance burden
  - from regression tests as part of the release in the same way that indexing is done?
  - Will delay a release
  - Could do this step afterwards as part of the docs
  - Still need the ability for EB to pick that up
  - Would documentation be a better place to keep track of known issues?
  - cfr. FlexiBLAS trouble on POWER
- Consensus on:
  - treating Arm and POWER as secondary platforms
  - don't block PRs because of test failures on Arm/POWER
  - document known issues (on non-x86 platforms) and let 'eb' pick them up and print warnings?
- Can we get some (cloud) resources to support arch testing?
  - JUSUF Cloud is perhaps an option for AMD, Sebastian will investigate
  - Can we ask vendors for hardware?
  - Admin is the challenge there
  - Cloud credits are perhaps a better option
  - Would cover ARM as well
- Fotis: How do other projects manage builds for multiple architectures?
  - see upcoming packaging-con
- expand boegelbot to test PRs
  - POWER9 (emulated) at OSUOSL
  - Graviton2 aarch64 @ AWS (using EESSI credits)
  - aarch64 @ fosshost (to be requested)
  - AMD + GPU @ JSC (JUSUF Cloud) [Sebastian]
  - bot account @ Mikael's infrastructure?
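
The consensus item above on documenting known issues for non-x86 platforms and letting `eb` print warnings could look roughly like the sketch below. Nothing here exists in EasyBuild: the `KNOWN_ISSUES` table, its format, and the `known_issue_warnings` helper are all assumptions for illustration; the only entry is the FlexiBLAS-on-POWER case mentioned in the discussion.

```python
import platform

# Hypothetical table: (software name, CPU architecture) -> short note
KNOWN_ISSUES = {
    ('FlexiBLAS', 'ppc64le'): "trouble reported on POWER",
}


def known_issue_warnings(software_names, arch=None):
    """Return warning messages for software with a documented issue on the given architecture."""
    # default to the architecture of the machine we are running on
    arch = arch or platform.machine()
    return [
        "WARNING: %s on %s: %s" % (name, arch, KNOWN_ISSUES[(name, arch)])
        for name in software_names
        if (name, arch) in KNOWN_ISSUES
    ]


for msg in known_issue_warnings(['GCC', 'FlexiBLAS'], arch='ppc64le'):
    print(msg)
```

Keeping the table outside the easyconfigs (for example generated from regression test results, as suggested above) avoids the maintenance burden of tagging individual easyconfig files.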
- start working on this in a 5.x branch?
- Opportunity to change things we are not happy with
- HMNS has a couple of issues
- robot can be broken if people do not have easyconfigs in the robot search path
- building your own software on top of someone else's stack is cumbersome, you need to fiddle with `MODULEPATH`
- HMNS is non-unique which makes handling something like `gomkl` difficult (module clashes with `foss`)
- Non-unique names in HMNS for things like OpenBLAS (stuff installed with foss vs gomkl toolchain) => separate HMNS with one extra level for math libs or add versionsuffix for non-foss toolchains
- (Bart) Having a bootstrap location for dependencies required for bootstrapping toolchains
- (Bart) Kill incomplete implementation of support for .yeb easyconfigs (YAML syntax)
- (Bart) cleanup in:
- easyblocks (old software versions)
- 32-bit support in framework
- macOS support
- Kenneth: let's not, basic functionality works
- Deprecating Python 2.7 makes sense but dropping support right now is probably too much
- Can also deprecate 3.5
- What about Lmod?
- Deprecating Lmod 7 might be a good idea
- Can we default to use `depends_on` for dependencies (with Lmod)?
- Drop support for ancient Tcl-only implementation?
- new features for 5.0
- (Alex) versionless dependencies, let EB use what's available
- (Åke) not a fan of this...
- (Kenneth) adding the feature in framework and using it in easyconfigs in the central repo are two different things
- Can we leverage some of the code of `--try-update-deps`?
- (Sam) support for only specifying partial versions (Python 3.6.*)
- should that be reflected in generated module file, or not?
- (Alex) separate metadata files for easyconfigs (homepage, checksums, etc)
- metadata and checksums should be separate files (due to updates needed in PRs)
- (Kenneth) could help with maintenance?
- (Mikael) will it really?
- having things across multiple files may cause trouble in some contexts
- example: checksum added in checksums.json, only copying easyconfig file
- (Fotis) reduce the need for conditionals in easyblocks
- missing feature that results in lots of if statements
- some kind of lookup table to avoid if/else blocks
- exploiting repetitive patterns can be indication of a missing feature
- (Simon) cleaning up of easyconfigs for old software versions
- cfr. bintray cleanup
- mostly stuff with system toolchain?
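
As a small illustration of Fotis' lookup-table suggestion above, the sketch below replaces a chain of version-dependent `if`/`else` statements with a table that is scanned from newest to oldest. The option strings, the table name `CONFIGOPTS_BY_MIN_VERSION`, and the helper functions are invented for the example and do not correspond to any existing easyblock.

```python
def version_tuple(version):
    """Turn a purely numeric version string like '3.1.2' into (3, 1, 2)."""
    return tuple(int(part) for part in version.split('.'))


# Hypothetical table: minimum version -> configure options, newest first
CONFIGOPTS_BY_MIN_VERSION = [
    ((3, 0), ['--enable-new-api', '--with-zlib']),
    ((2, 0), ['--with-zlib']),
    ((0, 0), []),
]


def configopts_for(version):
    """Return the configure options of the first table entry the version satisfies."""
    ver = version_tuple(version)
    for min_ver, opts in CONFIGOPTS_BY_MIN_VERSION:
        if ver >= min_ver:
            # copy, so callers cannot modify the table by accident
            return list(opts)
    return []


print(configopts_for('2.5'))
```

Supporting a new software version then means adding one row to the table rather than another branch in the code; versions with non-numeric parts would need a more careful parser than `version_tuple`.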
- Contacts
- separate committee
- documented
- group contact + individual contacts
- Most of the details are in the slides
- Bot is currently installing in shared space
- Should do an install in `/tmp` first
- Should automatically do the installation in the shared space after the PR is merged
- automation is hard as there are a lot of corner cases
- can we use singularity to do the installation in an overlay?
- EESSI uses fuse overlay (not singularity overlay)
- would like to start testing in singularity containers
- should create a repository where we define these test environments (hosted on GitHub)
- will allow us to test in multiple environments/OSes
- (Åke) have Ubuntu Focal minimal containers available
- would like to have access to logs
- Crush the curve of open (easyconfig) PRs
  - Deprecate old toolchains (< 2019)
  - Close PRs for deprecated toolchains
  - Set up bot to tag/auto-close stale issues/PRs (https://github.com/probot/stale)
- Try to empower contributors more to make PRs ready
  - Improve errors for failing CI tests (cfr. code style for easyconfigs)
  - Make CI fail over common issues (like missing comments on top of patches)
- Make life of maintainers easier
  - document requirements to merge PRs (in different repos)
  - Make boegelbot add comment with output of `eb --review-pr` (single easyconfig)
  - auto-label with PR status (new, CI passes, ...)
- working group for migrating docs to mkdocs
  - early starting point: see `mkdocs` branch in https://github.com/easybuilders/easybuild
- expand farm of test platforms (boegelbot)
- code-of-conduct
  - Alan & Kenneth follow up
  - odd number of committee members (3?)
  - PR for code-of-conduct that all maintainers should agree on
- EasyBuild 5.0
  - project to track progress on major targets to tackle for EasyBuild 5.0
  - set up 5.x branches
- issue template for reporting bugs/questions/...
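
For the "deprecate old toolchains (< 2019)" action item above, a warning could be derived from the toolchain version alone, as in this sketch. It assumes year-based versions such as `2018b` or `2016.04` and deliberately ignores versions like `8.3.0` that do not encode a year; `deprecation_warning` and the other names are hypothetical, not part of EasyBuild.

```python
import re

DEPRECATED_BEFORE_YEAR = 2019

# assumption: year-based toolchain versions look like '2018b', '2019a' or '2016.04'
YEAR_VERSION_REGEX = re.compile(r'^(\d{4})(?:[ab]|\.\d+)?$')


def toolchain_year(version):
    """Return the year encoded in a toolchain version, or None if it is not year-based."""
    match = YEAR_VERSION_REGEX.match(version)
    return int(match.group(1)) if match else None


def deprecation_warning(name, version):
    """Return a warning message for toolchains older than the cutoff year, else None."""
    year = toolchain_year(version)
    if year is not None and year < DEPRECATED_BEFORE_YEAR:
        return "WARNING: toolchain %s/%s is deprecated (older than %d)" % (
            name, version, DEPRECATED_BEFORE_YEAR)
    return None


print(deprecation_warning('foss', '2018b'))
```

The same predicate could be reused by a bot to label or close PRs that target deprecated toolchains, so that the warning and the PR policy stay in sync.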