Sync meeting 2024 02 13
Next meetings:
- Tue 12 March 2024
- Tue 9 April 2024
Attending:
- Caspar van Leeuwen, Satish Kamath, Maxim Masterov (SURF)
- Alan O'Cais (CECAM, UB)
- Kenneth Hoste, Lara Peeters (HPC-UGent)
- Thomas Röblitz (UiB)
- Julián Morillo (BSC)
- Danilo Gonzalez, Helena Vela Beltran, Susana Hernandez, Nadia Martins, Elisabeth Ortega (HPCNow!)
- Pedro Santos Neves, Bob Dröge (UGroningen)
- Neja Šamec (NIC)
project planning overview: https://github.com/orgs/multixscale/projects/1
- M12 deliverables have all been submitted
- available via https://www.multixscale.eu/deliverables
- Technical Report B: finalized?
- Financial report still needs to be finalized
- Preparing presentations for review
- "Scientific software development of MultiXscale CoE" (WP2+WP3+WP4)
- "Technical aspects of MultiXscale CoE" (WP1+WP5)
- "Training" (WP6)
- "Dissemination" (WP7)
- "Impact" (WP8)
- Practice presentations this Friday
- All MultiXscale-related software in EESSI
- Stable NVIDIA GPU support
- Monitoring for CVMFS repositories in place
- E.g. e-mails to relevant people when something goes down.
- Should include performance of Stratum 1s
- Could use the CASTIEL2 CI infra for this, to scrape JSON from the monitoring page & run performance tests (see the sketch after this list)
- Prometheus on Stratum 1?
- Clear maintenance policy for EESSI infrastructure
- What is done, when, and by whom?
- Use of host MPI tested and documented
- Functionality is brittle, e.g. issues when hwloc versions don't match
- May be a reason to add an MPICH-based/compatible toolchain?
- LUMI could be a relevant use case to try
- Performance evaluation vs local software stacks (?)
- Publish in docs
- OS jitter aspect of CernVM-FS
- If we see poor EESSI performance, that is also valuable
- E.g. do you recover performance if you inject host MPI?
- Initial steps for AMD GPU support
- What is needed? Do we need similar steps as for NVIDIA, or different?
- Initial steps for RISC-V support
- Can we get a bot on some RISC-V hardware? Can we build a toolchain? Does CVMFS work? Can we build a compat layer?
- Technical development in improving/expanding RISC-V software ecosystem (compilers, MPI, low level libs)
- Focused on LLVM compiler to improve vectorization support
- OpenMPI is working on RISC-V
- Julián is looking into answering these questions, using HiFive Unmatched hardware @ BSC
- D1.3 "Report on stable, shared software stack" (due M24, end of 2024)
- "Final report on the development of a stable, shared software stack"
- should cover: included software, where EESSI is available, monitoring (!), ...
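As a starting point for the Stratum 1 monitoring item above, a minimal Python sketch (placeholder hostnames; it assumes the usual .cvmfspublished field codes, 'S' for revision and 'T' for publish timestamp) that reports how far each replica lags behind the Stratum 0 and times the fetch as a crude performance probe:

```python
import time
import urllib.request

# Placeholder endpoints: replace with the actual Stratum 0 / Stratum 1 hosts serving software.eessi.io
ENDPOINTS = {
    'stratum0': 'http://stratum0.example.org/cvmfs/software.eessi.io/.cvmfspublished',
    'stratum1-a': 'http://stratum1-a.example.org/cvmfs/software.eessi.io/.cvmfspublished',
}

MAX_REVISION_LAG = 1  # how many revisions a replica may trail before we alert


def fetch_manifest(url):
    '''Download .cvmfspublished and return (revision, publish timestamp, fetch time in seconds).'''
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=10) as resp:
        raw = resp.read()
    elapsed = time.monotonic() - start
    fields = {}
    for line in raw.decode('utf-8', errors='ignore').splitlines():
        if line.startswith('--'):  # signature section starts here, stop parsing
            break
        if line:
            fields[line[0]] = line[1:]
    # Assumed field codes: 'S' = revision number, 'T' = publish timestamp (unix time)
    return int(fields.get('S', -1)), int(fields.get('T', 0)), elapsed


def main():
    results = {name: fetch_manifest(url) for name, url in ENDPOINTS.items()}
    ref_revision = results['stratum0'][0]
    for name, (revision, timestamp, elapsed) in results.items():
        lag = ref_revision - revision
        print(f'{name}: revision {revision} (lag {lag}), '
              f'published {time.ctime(timestamp)}, fetched in {elapsed:.2f}s')
        if lag > MAX_REVISION_LAG:
            # hook up e-mail / chat notification to the relevant people here
            print(f'WARNING: {name} is {lag} revisions behind the Stratum 0')


if __name__ == '__main__':
    main()
```

Running something like this from the CASTIEL2 CI infra would cover the basic availability/lag check; dedicated performance tests would still run separately.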
- installing MultiXscale apps in EESSI (planning issue #3)
- target milestone moved to M15
- only 2-3 out of 5 installed in software.eessi.io
- OpenFOAM v10 failing, v11 installed
- Didn't v11 have the same issue @Bob?
- problem with OpenFOAM example case, but underlying cause is bug in OpenMPI (PR for workaround is in place)
- OpenFOAM v10 + LAMMPS (only) installed in EESSI pilot
- walBerla is WIP, hanging sanity check (MPI issue)
- @Pedro: can we do header-only for now? Then plan meeting with Leonardo to hear how they use it / if this is useful at all?
- See also this software-layer issue
- Do we always compile natively?
- What about cross-compiling on non-GPU node?
- How do we deal with CPU + GPU architecture combinations we don't have access to?
- We can at least build OSU with GPU support
- We can also try GROMACS with GPU support and see where we get
- Low level GPU tests issue
- OSU
- CUDA samples
- GPU burn?
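For the low-level GPU tests listed above, a minimal ReFrame-style sketch (hypothetical test, assuming a recent ReFrame release like the one used by the EESSI test suite) that just checks GPUs are visible before heavier OSU or GPU-burn runs:

```python
import reframe as rfm
import reframe.utility.sanity as sn


@rfm.simple_test
class GPUVisibilityCheck(rfm.RunOnlyRegressionTest):
    '''Hypothetical smoke test: fail early if no NVIDIA GPU is visible on the node.'''

    valid_systems = ['*']
    valid_prog_environs = ['*']
    executable = 'nvidia-smi'
    executable_opts = ['-L']  # prints one "GPU <index>: <name> (UUID: ...)" line per device

    @sanity_function
    def assert_gpu_listed(self):
        return sn.assert_found(r'^GPU \d+:', self.stdout)
```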
- need for a separate EESSI development repository (dev.eessi.io), on top of the software.eessi.io repository
- => should have a dedicated planning issue on this
Deliverables this year: 1.3, "Report on stable, shared software stack" (UB)
- Tests implemented for all MultiXscale key software
- Expanding overall application test coverage
- Separate (multinode) test step in build-and-deploy workflow
- Could use dev.eessi.io, before deploying to software.eessi.io
- Decide how we deal with security updates in compat layer
- We could provide a new compat layer. Decide on what the default is: existing, or new?
- Dashboard showing test results from regular runs on (at least) Vega and Karolina
- Connect better with scientific WPs
- POC for one of the key applications using the github action in CI (?)
- POC for getting devs of key applications to update their software in EESSI (?)
- Improvements to the bot (as needed)
- Handling issues from regular support
- Start setting up dashboard #10
- Set up meeting with SURF visualization (and regular meetings for follow-up?)
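A possible first ingredient for the dashboard: a small Python sketch that summarizes ReFrame run reports (as written with --report-file) per system, e.g. from the regular runs on Vega and Karolina. The key names used here ('runs', 'testcases', 'system', 'result') are assumptions about the report layout and may need adjusting:

```python
import json
import sys
from collections import Counter, defaultdict


def summarize(report_path):
    '''Count test results per system from a ReFrame JSON run report.'''
    with open(report_path) as fp:
        report = json.load(fp)
    per_system = defaultdict(Counter)
    for run in report.get('runs', []):
        for case in run.get('testcases', []):
            system = case.get('system', 'unknown')
            per_system[system][case.get('result', 'unknown')] += 1
    return per_system


if __name__ == '__main__':
    # usage: python summarize_report.py <reframe-report.json> [...]
    for path in sys.argv[1:]:
        print(f'== {path} ==')
        for system, counts in summarize(path).items():
            total = sum(counts.values())
            print(f'{system}: {dict(counts)} ({total} test cases)')
```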
- Set up infra for periodic testing #36
- Waiting for Castiel to facilitate this
- may be heading towards EESSI for CD + Jacamar CI?
- may not work on all sites (like BSC which is reluctant to allow outgoing connections)
- Integrate testing in build-and-deploy workflow #133
- Implemented in software-layer PR #467
- Needs some feedback from @Thomas on @Kenneth's review
- next steps for bot implementation
- GitHub App (v0.3): improve ingestion workflow (planning issue #99)
- cfr. effort in NESSI to bundle tarballs before deploying
- support for additional deployment actions like 'overwrite', 'remove'
- make bot fully agnostic to EESSI
- no hardcoding of EESSI software-layer in deploy step
- separate test step (in between build + deploy)
- also support GitLab repos (would be interesting in scope of EuroHPC, CASTIEL2, CI/CD)
- additional effort on bot fits under Task 5.4 (providing support to EESSI)
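Purely as an illustration of the "fully agnostic to EESSI" and "additional deployment actions" points above (this is not the actual bot code; all names are invented), a Python sketch where the target repository comes from configuration and actions are looked up in a dispatch table:

```python
from typing import Callable, Dict

# The target CVMFS repository comes from configuration, so nothing EESSI-specific is hardcoded.
TARGET_REPO = 'software.eessi.io'  # example value, read from the bot config in practice


def deploy(tarball: str, repo: str) -> None:
    print(f'ingesting {tarball} into {repo}')


def overwrite(tarball: str, repo: str) -> None:
    print(f'replacing existing files in {repo} with the contents of {tarball}')


def remove(tarball: str, repo: str) -> None:
    print(f'removing the paths listed in {tarball} from {repo}')


# Dispatch table: supporting a new deployment action only requires registering a handler here.
ACTIONS: Dict[str, Callable[[str, str], None]] = {
    'deploy': deploy,
    'overwrite': overwrite,
    'remove': remove,
}


def handle_action(action: str, tarball: str, repo: str = TARGET_REPO) -> None:
    try:
        handler = ACTIONS[action]
    except KeyError:
        raise ValueError(f'unsupported deployment action: {action!r}') from None
    handler(tarball, repo)


if __name__ == '__main__':
    handle_action('deploy', 'example-build.tar.gz')
```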
- None of the three ISC tutorial submissions were accepted, despite positive reviews
- EESSI tutorial: two reviews 7/7 and 6/7, one review 3/7 (motivation: overlaps with CVMFS)
- EuroHPC summit:
- poster on MultiXscale (Eli)
- Vega, Karolina, maybe Meluxina have half day demo-labs. Trying to make sure EESSI is part of that. (https://www.eurohpcsummit.eu/demo-lab)
- maybe also an informal "BoF" session during lunch break (Barbara @ Vega informed us this may be possible)
- speaker on user support (Kenneth has been put forward)
- Deliverables this year: 6.2, "Training Activity Technical Support Infrastructure" (UiB) (due M24)
- Magic Castle, EESSI, Lhumos training portal, ...
- cfr. Alan's tutorial last week, incl. snakemake
- Starting task 7.1: Scientific applications provisioned on demand (ends in M48)
- Use Terraform as a cloud-agnostic solution to provision instances on supported cloud providers
- Starting task 7.4: Industry oriented training activities (ends in M48)
- Two training sessions targeted to HPC industry
- Task 7.3 starts in Q3: Sustainability (ends in M42)
- Business plan
- Certification program
- "Design of certification program for external partners that would be providing official support"
Deliverables this year: 7.2, "Intermediate report on Dissemination, Communication and Exploitation" (HPCNOW) [M24]
- First periodic report #130
- Deliverables this year: 8.5, "Project Data Management Plan - final" (NIC)
- ongoing discussion on financial liability for COLA
- Alan and Kenneth are involved with meeting with EuroHPC hosting sites via CASTIEL2
- Meetings focus on figuring out CI/CD, leaning towards EESSI for CD
- next meeting to be planned (probably before EuroHPC Summit), trying to get CernVM-FS dev team involved in it
- Alan & Kenneth will present an overview of the required effort/infrastructure, and a workaround if a non-native CernVM-FS installation is required
- SURF has ISO certification for Snellius and has CernVM-FS deployed!
- see https://www.surf.nl/files/2022-11/iso-certificaat-surf-2022.pdf (in Dutch) and "SURF services under ISO 27001 certification" at https://www.surf.nl/en/information-security-surf-services