
Sync meeting 2024-02-13

Next meetings

  • Tue 12 March 2024
  • Tue 9 April 2024

Agenda/notes 2024-02-13

Attending:

  • Caspar van Leeuwen, Satish Kamath, Maxim Masterov (SURF)
  • Alan O'Cais (CECAM, UB)
  • Kenneth Hoste, Lara Peeters (HPC-UGent)
  • Thomas Röblitz (UiB)
  • Julián Morillo (BSC)
  • Danilo Gonzalez, Helena Vela Beltran, Susana Hernandez, Nadia Martins, Elisabeth Ortega (HPCNow!)
  • Pedro Santos Neves, Bob Dröge (UGroningen)
  • Neja Šamec (NIC)

project planning overview: https://github.com/orgs/multixscale/projects/1

EU 1st period report & Review 19-02-2024

  • M12 deliverables have all been submitted
  • Technical Report B
    • Finalized?
      • Financial part still needs to be finalized
  • Preparing presentations for review
    • "Scientific software development of MultiXscale CoE" (WP2+WP3+WP4)
    • "Technical aspects of MultiXscale CoE" (WP1+WP5)
    • "Training" (WP6)
    • "Dissemination" (WP7)
    • "Impact" (WP8)
    • Practice presentations this Friday

WP1

WP 1 Goals for 2024

  • All MultiXscale-related software in EESSI
  • Stable NVIDIA GPU support
  • Monitoring for CVMFS repositories in place
    • E.g. e-mails to relevant people when something goes down.
    • Should include performance of Stratum 1s
    • Could use the CASTIEL2 CI infrastructure for this, to scrape JSON from the monitoring page & run performance tests (see the sketch after this list)
    • Prometheus on Stratum 1?
  • Clear maintenance policy for EESSI infrastructure
    • What is done, when, and by whom?
  • Use of host MPI tested and documented
    • Functionality is brittle, e.g. issues with non-matching hwloc versions
    • May be a reason to add an MPICH-based/compatible toolchain?
    • LUMI could be a relevant use case to try
  • Performance evaluation vs local software stacks (?)
    • Publish in docs
    • OS jitter aspect of CernVM-FS
    • If we see poor EESSI performance, that is also valuable
      • E.g. do you recover performance if you inject host MPI?
  • Initial steps for AMD GPU support
    • What is needed? Do we need similar steps as for NVIDIA, or different?
  • Initial steps for RISC-V support
    • Can we get a bot on some RISC-V hardware? Can we build a toolchain? Does CVMFS work? Can we build a compat layer?
    • Technical development in improving/expanding the RISC-V software ecosystem (compilers, MPI, low-level libraries)
      • Focused on LLVM compiler to improve vectorization support
      • OpenMPI is working on RISC-V
    • Julián is looking into answering these questions, using HiFive Unmatched hardware @ BSC
  • D1.3 "Report on stable, shared software stack" (due M24, end of 2024)
    • "Final report on the development of a stable, shared software stack"
    • should cover: included software, where EESSI is available, monitoring (!), ...
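
As a rough illustration of the "scrape JSON from the monitoring page & alert" idea mentioned in the monitoring item above, a minimal Python sketch is given below. The monitoring URL, the JSON layout (`stratum0`, `stratum1`, `revision`, `reachable`) and the lag threshold are assumptions for illustration only, not the actual EESSI monitoring format.

```python
# Minimal sketch of the monitoring idea above: fetch JSON from a (hypothetical)
# status page, flag Stratum 1 servers that are unreachable or lagging behind
# Stratum 0, and print alerts that could be turned into e-mails.
import json
import urllib.request

MONITORING_URL = "https://status.eessi.example/stratum1.json"  # placeholder URL
MAX_REVISION_LAG = 5  # arbitrary threshold: alert when a Stratum 1 lags this many revisions


def check_stratum1_status(url: str = MONITORING_URL) -> list[str]:
    """Return human-readable alerts for Stratum 1 servers that look unhealthy."""
    with urllib.request.urlopen(url, timeout=30) as response:
        status = json.load(response)

    alerts = []
    stratum0_revision = status["stratum0"]["revision"]  # assumed JSON layout
    for server in status["stratum1"]:                   # assumed JSON layout
        if not server.get("reachable", True):
            alerts.append(f"{server['name']} is unreachable")
            continue
        lag = stratum0_revision - server["revision"]
        if lag > MAX_REVISION_LAG:
            alerts.append(f"{server['name']} is {lag} revisions behind Stratum 0")
    return alerts


if __name__ == "__main__":
    for alert in check_stratum1_status():
        # This is where e-mails to the relevant people would be sent.
        print("ALERT:", alert)
```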

Issues

  • installing MultiXscale apps in EESSI (planning issue #3)

    • target milestone moved to M15
    • only 2-3 out of 5 installed in software.eessi.io
      • OpenFOAM v10 failing, v11 installed
        • Didn't v11 have the same issue @Bob?
        • problem with the OpenFOAM example case, but the underlying cause is a bug in OpenMPI (a PR with a workaround is in place)
    • OpenFOAM v10 + LAMMPS (only) installed in EESSI pilot
    • waLBerla is WIP, hanging sanity check (MPI issue)
      • @Pedro: can we do header-only for now? Then plan meeting with Leonardo to hear how they use it / if this is useful at all?
  • Expanding on GPU support

    • See also this software-layer issue
    • Do we always compile natively?
    • What about cross-compiling on non-GPU node?
    • How do we deal with CPU + GPU architecture combinations we don't have access to?
    • We can at least build OSU with GPU support
    • We can also try GROMACS with GPU support and see where we get
  • Low level GPU tests issue

    • OSU
    • CUDA samples
    • GPU burn?
  • need for a separate EESSI development repository (dev.eessi.io)

    • on top of software.eessi.io repository
    • => should have a dedicated planning issue on this

Deliverables this year: 1.3, "Report on stable, shared software stack" (UB)

WP5

WP 5 Goals for 2024

  • Tests implemented for all MultiXscale key software
  • Expanding overall application test coverage
  • Separate (multinode) test step in build-and-deploy workflow
    • Could use dev.eessi.io, before deploying to software.eessi.io
  • Decide how we deal with security updates in compat layer
    • We could provide a new compat layer. Decide on what the default is: existing, or new?
  • Dashboard showing test results from regular runs on (at least) Vega and Karolina
  • Connect better with scientific WPs
    • POC for one of the key applications using the github action in CI (?)
    • POC for getting devs of key applications to update their software in EESSI (?)
  • Improvements to the bot (as needed)
  • Handling issues from regular support

Issues

  • Start setting up dashboard #10

    • Set up meeting with SURF visualization (and regular meetings for follow-up?)
  • Set up infra for periodic testing #36

    • Waiting for CASTIEL2 to facilitate this
    • may be heading towards EESSI for CD + Jacamar CI?
    • may not work on all sites (like BSC, which is reluctant to allow outgoing connections)
  • Integrate testing in build-and-deploy workflow #133

  • next steps for bot implementation

    • GitHub App (v0.3): improve ingestion workflow (planning issue #99)
      • cfr. effort in NESSI to bundle tarballs before deploying
      • support for additional deployment actions like 'overwrite', 'remove'
    • make bot fully agnostic to EESSI
      • no hardcoding of EESSI software-layer in deploy step
    • separate test step (in between build + deploy)
    • also support GitLab repos (would be interesting in scope of EuroHPC, CASTIEL2, CI/CD)
    • additional effort on bot fits under Task 5.4 (providing support to EESSI)

WP6

  • None of the three ISC tutorial submissions were accepted, despite positive reviews
    • EESSI tutorial: two reviews 7/7 and 6/7, one review 3/7 (motivation: overlaps with CVMFS)
  • EuroHPC summit:
    • poster on MultiXscale (Eli)
    • Vega, Karolina, and maybe Meluxina have half-day demo labs. Trying to make sure EESSI is part of that. (https://www.eurohpcsummit.eu/demo-lab)
    • maybe also an informal "BoF" session during lunch break (Barbara @ Vega informed us this may be possible)
    • speaker on user support (Kenneth has been put forward)
  • Deliverables this year: 6.2, "Training Activity Technical Support Infrastructure" (UiB) (due M24)
    • Magic Castle, EESSI, Lhumos training portal, ...
    • cfr. Alan's tutorial last week, incl. snakemake

WP7

  • Starting task 7.1: Scientific applications provisioned on demand (ends in M48)
    • Use Terraform as a cloud-agnostic solution to provision instances on supported cloud providers (see the sketch after this list)
  • Starting task 7.4: Industry-oriented training activities (ends in M48)
    • Two training sessions targeted at the HPC industry
  • Task 7.3 starts in Q3: Sustainability (ends in M42)
    • Business plan
    • Certification program
      • "Design of certification program for external partners that would be providing official support"

Deliverables this year: 7.2, "Intermediate report on Dissemination, Communication and Exploitation" (HPCNOW) [M24]

WP8

  • First periodic report #130
  • Deliverables this year: 8.5, "Project Data Management Plan - final" (NIC)

CASTIEL2

  • ongoing discussion on financial liability for COLA
  • Alan and Kenneth are involved in meetings with EuroHPC hosting sites via CASTIEL2