
Sync meeting 2024 10 08

Caspar van Leeuwen edited this page Nov 11, 2024 · 1 revision

MultiXscale WP1+WP5 sync meetings


Next meetings

  • Tue 12 Nov 2024 10:00 CET (prep for special project review?)
  • Tue 10 Dec 2024 10:00 CET (post-mortem of special project review?)

Agenda/notes 2024-10-08

attending:

  • Kenneth (HPC-UGent)
  • Caspar, Xin, Maksim, Satish (SURF)
  • Nadia, Susana, Pedro Fernandez, Eli (HPCNow!)
  • Neja (NIC)
  • Alan (CECAM)
  • Pedro Santos Neves, Bob (RUG)
  • Julian (BSC)
  • Richard (UiB)
  • 2024Q3 quarterly report
    • Deadline 25th Oct
    • Every institute: please fill in the hours and bullet-point work done tables
    • w.r.t. project review
      • important to show that we're "on target" w.r.t. PM effort in 2024
      • each partner should report PM efforts for Jan-Sept 2024 to Neja (incl. breakdown per WP) by 20 Oct
        • still missing from BSC, HPCNow, UB, RUG
        • Alan will know on Thu for UB
  • Upcoming deliverables (M24):
    • D1.3, M24: Report on stable, shared software stack (UB)
      • who: Kenneth (UGent), Caspar (SURF), Pedro (RUG), UiB (Richard/Thomas), UB (Alan)
      • TODO:
        • GPU support -> see tiger team
        • monitoring -> see ongoing effort by RUG
        • compare the performance against on-premise software stacks to identify potential performance limitations
          • mostly on Snellius @ SURF, since that's easier
        • stability -> zero reports so far of EESSI "network" being down
        • @Alan: ask UK guy for quote "this is a game changer for small sites" (feedback during EESSI intro on 4 Oct'24)
    • D6.2, M24: Training Activity Technical Support Infrastructure (UiB)
      • dev.eessi.io should be covered here (more than in D1.3)
      • => @Thomas/Richard to pull this
    • D7.2, M24: Intermediate report on Dissemination, Communication and Exploitation (HPCNow)
    • D8.5, M24: Project Data Management Plan - final (NIC)
  • Upcoming milestones (M24):
    • Milestone 4, M21: First training event supported by the Ambassador Program. [WP1/WP5/WP6] (UB)
    • Milestone 5, M24: WP4 Pre-exascale workload executed on EuroHPC architecture. [WP2/WP3/WP4] (NIC)
      • probably via ESPResSo (Jean-Noël?), cf. ongoing development
      • do we need to request a reservation for this? (Vega?)
        • may be useful to iterate quickly
      • doesn't necessarily need to be on a EuroHPC system
  • WP status updates
    • [SURF] WP1 Developing a Central Platform for Scientific Software on Emerging Exascale Technologies
      • [UGent] T1.1 Stable (EESSI) - D1.3 due M24 (Dec'24)
        • dev.eessi.io: Tiger team is making very good progress. See meeting notes
          • Key results:
            • Building specific commit of ESPResSo works, see test PR #1
          • Key TODOs:
            • get ingestion of builds into dev.eessi.io CernVM-FS repository to work [Pedro+Bob]
            • documentation
            • let Jean-Noël/Rudolph play with it (test setup)
            • also make it work for GROMACS, see test PR #2
          • What do we REALLY need from this before the project review?
            • ESPResSo development builds set up in dev.eessi.io?
        • NVIDIA GPU support: Tiger team is making really good progress. See meeting notes
          • Key results:
            • Bot uses accel:nvidia/cc80 arguments to install in correct prefix.
            • First builds in accel prefix: CUDA, UCX-CUDA, UCC-CUDA, NCCL, CUDA-Samples, OSU-Microbenchmarks, LAMMPS w/ CUDA, ESPResSo w/ CUDA
            • Check for missing installations in accel builds in CI
          • TODO:
            • Actual GPU nodes in build cluster (now cross-compiling, not running GPU tests)
            • Adapt bot to accept arguments to allocate/build on GPU nodes
            • cuDNN (strip non-redistributable files + support local installation in host_injections)
            • Decide on and expand combinations of CPU & GPU architectures
            • enhance script(s) in software-layer repo
              • auto-detect GPU model/architecture (enhance archdetect)
              • pick up accel directive from the bot and change software installation prefix accordingly
              • install GPU software in proper location: ESPResSo (?), LAMMPS, MetalWalls (?), TensorFlow, PyTorch, ...
          • proper NVIDIA GPU support is due by M24 (deliverable D1.3)
            • => we shouldn't wait for dev.eessi.io being operational
          • we need to plan who will actively contribute, and how [Kenneth,Lara]
        • need to review description of Task 1.1, make sure all subtasks are covered
        • "we will benchmark software from the shared software stack and compare the performance against on-premise software stacks to identify potential performance limitations, ..."
          • ESPResSo + LAMMPS + OpenFOAM + ALL(?) (MultiXscale), GROMACS (BioExcel)
          • Who does what, and on which system?
        • "increase stability of the shared software stack ... pro-actively by developing monitoring tools"
          • proper monitoring for CVMFS network (S0 + S1s)
          • active work-in-progress by RUG, see also meeting notes
      • [RUG] T1.2 Extending support - D1.4 due M30 (June'25)
        • Zen4 almost on par with the rest.
        • optimized installations for AMD Genoa Zen4 (~64% done) + A64FX (~23% done) are still a work-in-progress
          • Intel Sapphire Rapids & NVIDIA Grace (for JUPITER) to start
            • Who, When, Where?
        • AMD ROCm (see planning issue #31 + support issue #71)
          • effort led by Pedro/Bob (RUG)
            • Any progress to mention?
      • [SURF] T1.3 Test suite - D1.5 due M30 (June'25)
        • V0.4.0 released.
          • New tests: CP2K, LAMMPS, PyTorch
          • Tutorial: mpi4py, also in docs
        • WIP: test for MetalWalls
        • WIP: use an eessi_mixin class to make test development for the EESSI test suite easier, as implemented in this PR. Many of the steps in the docs are mandatory and could be inherited from a mixin class.
      • [BSC] T1.4 RISC-V (due M48, D1.6)
        • Julian is working on getting CernVM-FS deployed natively on the RISC-V hardware they have at BSC => Progress?
        • ... Other updates?
      • [SURF] T1.5 Consolidation (starts M25 - Jan'25)
        • continuation of effort on EESSI (T1.1, etc.) (not started yet)
    • [UGent] WP5 Building, Supporting and Maintaining a Central Shared Stack of Optimized Scientific Software Installations
      • [SURF] T5.2 Monitoring/testing, D5.3 due M30 (June'25)
        • dashboard to present test results is work-in-progress @ SURF
        • Script for ingesting test suite results into an Elasticsearch instance is kept in a private repo
          • Test suite runs on Vega, Karolina, AWS, Azure, ... are now ingested on a daily basis using that script
        • Deployment of dashboard:
          • Long term: Could be in SURF HPCV 'internal' services, but this is still being set up. Would support authentication, but requires a backend change in the dashboard. Should not be too invasive.
          • Short term: can we deploy somewhere in a VM and whitelist?
          • Alternative: if we can share all the data, things are much easier and we can deploy anywhere, in any VM, tomorrow (well, figure of speech)
      • [UGent] T5.4 support/maintenance - D5.4 due M48 (Dec'26)
        • Changed meeting format a bit: based on a board
        • Total: 86 issues (28 open, 58 closed)
    • [UB] WP6 Community outreach, education, and training
      • deliverables due: D6.2 (M24 - Dec'24), D6.3 (M30 - June'25)
      • HPCWire nomination in category "Best HPC Programming Tool or Technology"
      • overview of systems in EESSI docs
      • [Alan] invited speaker for Nordic Industry Days (early Sept'24, Copenhagen)
        • ... How did it go?
      • [Thomas] presentation @ CernVM workshop on EESSI (16-18 Sept 2024, Geneva)
        • ... How did it go?
      • [Richard] public webinar Introduction to EESSI (3 Oct 2024)
        • ... How did it go?
      • [Alan] First ambassador event: "Introduction to EESSI" on 4 Oct 2024 (see news post on MultiXscale website)
        • ... How did it go?
      • CECAM webinar (17 Oct 2024)
      • EuroHPC User Days (22-23 Oct 2024, Amsterdam)
        • link to agenda
        • attending: Kenneth/Lara (UGent), Thomas/Richard (UiB), Bob?/Pedro? (RUG), Caspar (SURF)
        • paper submitted to get a talk slot
          • Tue 22 Oct 14:00-15:30 (IJ LOUNGE)
        • in touch with organisation w.r.t. participation in CoE session
          • "Walk-in networking sessions focusing on specific EuroHPC user needs: provide your feedback and get some advice"
          • Wed 23 Oct 10:30-12:00
          • EESSI demos, printed handouts
          • Raspberry Pi prize
            • "How much faster is Vega than a cluster of Raspberry Pis?"
        • bring your MultiXscale T-shirt!
        • Caspar will bring a monitor (from home :)) for EESSI demo
        • Kenneth will print flyers at UGent.
          • One flyer on MultiXscale => Kenneth / Lara will make them
          • One on EESSI itself => Kenneth / Lara will make them
          • One on each of the scientific use cases => Kenneth will contact the scientific WPs
      • Netherlands eScience Center (Dutch national center of expertise for research software, ~60 RSEs) got in touch with Bob to give a talk (31 Oct'24, Amsterdam)
        • unclear if that's a public event, but can do a write-up afterwards
        • Caspar: one of my colleagues also got in touch with them (unrelated to the event) to see if EESSI could be interesting to them
      • [Eli/HPCNow!] EESSI Birds-of-a-Feather session accepted at Supercomputing'24 (Atlanta, US)
        • can reuse material from BoF session @ ISC'24 in Hamburg
      • [Pedro] submitted talk for SURF Advanced Computing Days (12 Dec'24, Utrecht)
        • talk not accepted yet
      • [Eli?] EESSI tutorial at HiPEAC 2025 accepted (20-22 Jan'25)
        • we need to start promoting this
      • [Jean-Noël] ESPResSo summer school
        • 2025?
    • [HPCNow] WP7 Dissemination, Exploitation & Communication
      • podcast interview for EuroHPC podcast
        • Kenneth can ask Jothi @ NCC Belgium to be interviewee
        • Susana will ask Apostolos what the turnaround time is once recording is provided
        • sound recording tutorial on Mon 14 Oct 2024 (Communications coffee break)
      • T7.1 Scientific applications provisioned on demand (lead: HPCNow) (started M13, finished M48)
        • Updates ... (Pedro, HPCNow)?
      • Task 7.2 - Dissemination and communication activities (lead: NIC)
        • Updates ... ?
      • Task 7.3 - Sustainability (lead: NIC, started M18, due M42)
        • Updates ... ?
      • Task 7.4 - Industry-oriented training activities (lead: HPCNow)
        • Updates ... ?
    • [NIC] WP8 (Management and Coordination)
      • Amendment (@Neja / @Alan, can you summarize the key points of what was submitted?)
        • Submitted 10th of September, EuroHPC has 45 days to respond
        • Travel budget: move part of PMs to travel budget for some partners
        • CI/CD
          • added to task in WP1
          • effort for this was reallocated from WP5
        • Tweaks to the descriptions of the scientific WPs
          • Removed OpenFOAM from the table of applications
          • Added ALL to the table of applications (going into LAMMPS)
          • Clearer picture that Kokkos will be the key library to achieve scalability
      • next General Assembly meeting
        • 23-24 Jan'25 in Barcelona/Sitges
          • coupled to HiPEAC'25 (20-22 Jan 2025)
          • We need to promote the workshop at HiPEAC more!
          • registration is quite pricey, so we'll need to limit who actually attends?
      • Project review
        • Discuss the plan, TODOs, etc.
        • agenda is available on shared drive
        • (internal) deadline for presentations: 30 Oct
        • internal review round/meeting shortly after
        • presentations have to be submitted to project reviewers ~1 week before review

Other topics

  • CI/CD call for EuroHPC
    • is 100% funded (not 50/50 EU/countries)
    • not published yet
  • request for success story by CASTIEL2
    • status: rounds of editing going on, should be published soon [Neja,Alan,Caspar]
    • @Neja: do you know if this has been published by now?

Notes of previous meetings
