PyTorch working group

Alex Domingo edited this page Oct 9, 2025 · 5 revisions

EasyBuild PyTorch working group

Goal

How can we streamline the process of supporting new PyTorch versions?

Action points

(based on discussion on 20251009 - by KH)

  • [Kenneth] implement support for --test-step-mode in framework
    • skip, minimal, basic (default), full as possible values?
      • --test-step-mode=skip as replacement for --skip-test-step?
    • [Alexander] make PyTorch easyblock aware of it
      • at least for basic mode
      • current PyTorch test step would then correspond to full
  • [Alex?] define clear policy w.r.t. merging PyTorch easyconfig PRs
    • minimal set of test installations that should pass
    • break up process to add support for new PyTorch versions
      • experimental easyconfig (passing basic test step)
      • mature easyconfig (passing full test step)
      • report problems in issue, follow up in subsequent PRs (adding more patch files, etc.)
  • [Kenneth] implement support for experimental easyconfigs
    • experimental easyconfig parameter that can be set to True
      • don't include those easyconfig files in EasyBuild release?
    • auto-add -EXPERIMENTAL to module name + auto-hide module file
    • for PyTorch: passing basic test step is sufficient to merge as experimental easyconfig file
  • [Loris? Alex?] evaluate performance benefits of from-source PyTorch installation
    • vs container images
    • vs pip install torch installation
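
The proposed test step modes could be represented roughly as follows. This is only a sketch: `--test-step-mode` is a proposal from this meeting and does not exist in the EasyBuild framework yet, and the names `StepMode`, `parse_test_step_mode` and `legacy_skip` are hypothetical, chosen for illustration.

```python
from enum import Enum


class StepMode(Enum):
    """Hypothetical test step modes, as listed in the action points."""
    SKIP = "skip"
    MINIMAL = "minimal"
    BASIC = "basic"
    FULL = "full"


# 'basic' is the proposed default mode
DEFAULT_MODE = StepMode.BASIC


def parse_test_step_mode(value=None, legacy_skip=False):
    """Map an option value to a mode.

    legacy_skip stands in for the existing --skip-test-step option,
    which would become equivalent to the 'skip' mode.
    """
    if legacy_skip:
        return StepMode.SKIP
    if value is None:
        return DEFAULT_MODE
    try:
        return StepMode(value.lower())
    except ValueError:
        known = ", ".join(mode.value for mode in StepMode)
        raise ValueError(f"unknown test step mode '{value}' (known: {known})") from None
```

An easyblock such as the PyTorch one could then branch on the returned mode, with the current test step corresponding to `StepMode.FULL`.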

Meetings

20251113

(Thu 13 Nov'25 15:00 CET)

attendees:

  • ...

Notes

  • ...

20251009 - kickoff meeting

attendees:

  • Alex Domingo (Vrije Universiteit Brussel, Belgium)
  • Loris Ercole (CECAM)
  • Alexander Grund, a.k.a. Flamefire (ZIH, TU Dresden, Germany)
  • Kenneth Hoste (HPC-UGent, Belgium)
  • Adam Huffman (University of Oxford, UK)
  • Emmanuel Kiefer (LuxProvide, Luxembourg)
  • Jure Pečar (EMBL, Germany)
  • Lara Peeters (HPC-UGent, Belgium)
  • Jörg Saßmannshausen (Imperial College London, UK)

Summary of current situation

  • PyTorch test suite is a PITA
    • it's taking forever...
      • can be over 24h on older systems (AMD Rome @ HPC-UGent)
      • @ TU Dresden: over 33h on 64-core AMD Rome with 6x H100
    • handful of flaky tests (sometimes even hanging tests)
    • failing tests shouldn't be ignored, because they can indicate real problems
      • can point to problem with wrong dependency version (or to a bug in a dependency being used)
  • growing complexity in build
    • cf. Triton (which by itself is a small nightmare)
    • depends on specific versions of NCCL & cuDNN
  • PyTorch easyblock is quite complex
    • easyblock PR #3803 helps a bit by adding CI for get_test_results function used by PyTorch easyblock

Impact

  • lots of wasted time
    • failing tests => not completing installations
    • work to try and fix tests
  • no recent PyTorch versions supported by EasyBuild

Ideas/questions

  • does it really make sense to run full PyTorch test suite to verify an installation?
    • something more lightweight?
      • => "integration test" scripts found by Alex
      • run more reasonable battery of tests by default
      • support for running full test suite for those that want to
      • only "end-to-end" tests, maybe in combination with small part of PyTorch test suite
    • many tests in PyTorch test suite are:
      • only testing a niche feature
      • flaky/poorly written/make assumptions
    • it's pretty easy to run PyTorch test suite for an existing PyTorch installation
    • PyTorch test suite consists of a bunch of groups of tests
      • we could identify ones that are reasonable to run in default mode
      • try to focus on core features
    • framework feature to opt-in to running full PyTorch test suite
      • eb --test-step-mode=intense PyTorch.eb
      • --skip-test-step could (eventually) be replaced with --test-step-mode=skip
      • list set of tests in separate file like test-pytorch-2.7-intense.yml
        • eb --test-step-input=PyTorch:/tmp/test-pytorch-2.7-intense.yml
        • test_step_input = {
            'basic': 'basic.yml',
            'intense': 'intense.yml',
          }
    • support for specifying a max. time for test step
      • eb --test-step-max-time=1h PyTorch.eb
      • where would we get reasonably accurate info on this?
        • depends on hardware, available resources, PyTorch version, etc.
      • collecting timing info for tests would be helpful, so each site can figure out for themselves which excessively long tests to skip
    • maybe update PyTorch easyblock to allow:
      • python -m easybuild.easyblocks.pytorch run-test-suite
    • set clear target to include new PyTorch easyconfig in test suite
      • at least 2 (common) GPU generations
        • jsc-zen3 test bot w/ A100
        • H100 somewhere?
      • don't block PR when a couple of PyTorch tests fail for some people
    • streamline process to get PyTorch easyconfigs merged
      • PyTorch is becoming a common dependency, so lots of other easyconfig PRs are being blocked...
      • clear policy on what should be achieved before merge would help
      • try to get more people up to speed on how to maintain PyTorch easyblock/easyconfigs
      • have a way to quickly merge updated PyTorch easyconfig in repo, but not included in EasyBuild release
        • separate experimental/easyconfigs folder?
        • only let EasyBuild pick up on it when it's told to be allowed
          • eb --use-experimental-easyconfigs
          • also -EXPERIMENTAL as versionsuffix?
            • add automatically to module name being installed?
            • also make it a hidden module file?
          • experimental = True in easyconfig file
            • don't include these in EasyBuild release
          • also auto-add -EXPERIMENTAL to install path?
  • initially use pre-built wheels for PyTorch
    • from PyPI? from NVIDIA (for CUDA-aware installs)?
    • PyTorch wrapper to allow for in-place update from wheel to from-source installation?
      • only really works for pure Python packages that only do import torch
      • Alex' experiment with torchvision
        • see https://github.com/easybuilders/easybuild/issues/921#issuecomment-3386620931
        • playing with wheels vs from-source installations of PyTorch/torchvision
        • using pre-built wheel of PyTorch with torchvision from-source on top works, but you then can't swap to a from-source PyTorch
          • only affects stuff that links to libtorch.so
          • so swapping a PyTorch install from a pre-built wheel for a from-source build implies also reinstalling torchvision & co
          • bundle stuff together that links to PyTorch library
            • PyTorch-bundle-PyPI (wheel installs) vs PyTorch-bundle-EasyBuild (from source builds)
          • how can we identify things that link to PyTorch library?
        • using pre-built wheel didn't show any significant performance degradation
  • how well does our from-source installation perform vs prebuilt binary wheels?
  • can we get input from PyTorch developers?
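
The idea of combining a max. time for the test step with collected timing info could look roughly like this. It is a sketch under assumptions: `--test-step-max-time` is only a proposal, the functions `parse_duration` and `split_by_budget` are hypothetical, and the timings passed in would have to come from each site's own measurements.

```python
import re

# seconds per supported unit suffix
_UNITS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(text):
    """Convert a string like '90m' or '1h' to a number of seconds."""
    match = re.fullmatch(r"(\d+)([smh])", text.strip())
    if match is None:
        raise ValueError(f"cannot parse duration '{text}'")
    return int(match.group(1)) * _UNITS[match.group(2)]


def split_by_budget(timings, max_time):
    """Split test groups into (run, skip) so that 'run' fits within max_time.

    timings maps a test group name to its measured duration in seconds.
    Groups are taken shortest first, so the longest-running groups are
    the first ones to be skipped.
    """
    budget = parse_duration(max_time)
    run, skip, used = [], [], 0
    for name, seconds in sorted(timings.items(), key=lambda item: item[1]):
        if used + seconds <= budget:
            run.append(name)
            used += seconds
        else:
            skip.append(name)
    return run, skip
```

For example, with placeholder timings `{"group_a": 600, "group_b": 2400, "group_c": 7200}` and a budget of `"1h"`, `group_a` and `group_b` would be run and `group_c` skipped. Shortest-first is just one possible policy; a site might instead prefer to always keep specific core groups regardless of their duration.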

Notes

  • issue with having to use other libuv @ LuxProvide
    • not using internal libuv in tensorpipe
    • for specific problems/questions: open issue/discuss in Slack
  • is there a way to figure out which versions of dependencies PyTorch expects?
    • yes, sort of, see comments in recent PyTorch PRs like easyconfigs PR #23923
    • can we extract versions that were used from pre-built wheels?
      • maybe a range of versions?
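
One way to approach the question of extracting versions from pre-built wheels is to read the dependency declarations from the wheel's metadata. A wheel is a zip archive containing a `*.dist-info/METADATA` file with `Requires-Dist` lines. The sketch below only shows Python-level dependencies declared there; libraries bundled inside the wheel are not listed, and whether a given PyTorch wheel pins its CUDA-related packages this way is an assumption to verify per wheel. The function name `wheel_requirements` is hypothetical.

```python
import zipfile
from email.parser import Parser


def wheel_requirements(wheel_path):
    """Return the Requires-Dist entries declared in a wheel's METADATA file."""
    with zipfile.ZipFile(wheel_path) as whl:
        metadata_names = [n for n in whl.namelist() if n.endswith(".dist-info/METADATA")]
        if not metadata_names:
            raise ValueError(f"no .dist-info/METADATA found in {wheel_path}")
        text = whl.read(metadata_names[0]).decode("utf-8")
    # METADATA uses an email-header style format, so the stdlib parser can read it
    return Parser().parsestr(text).get_all("Requires-Dist") or []
```

Each returned entry is a requirement string with its version specifier (exact pin or range), which could be compared against the dependency versions used in an easyconfig.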

Action points

  • (Kenneth/Alex) implement support for --test-step-mode in framework
    • skip, minimal, basic (default), full as possible values?
      • --test-step-mode=skip as replacement for --skip-test-step?
    • (Alexander) make PyTorch easyblock aware of it
      • at least for basic mode
      • current PyTorch test step would then correspond to full
  • define clear policy w.r.t. merging PyTorch easyconfig PRs
    • minimal set of test installations that should pass
    • break up process to add support for new PyTorch versions
      • experimental easyconfig (passing basic test step)
      • mature easyconfig (passing full test step)
      • report problems in issue, follow up in subsequent PRs (adding more patch files, etc.)
  • implement support for experimental easyconfigs
    • experimental easyconfig parameter that can be set to True
      • don't include those easyconfig files in EasyBuild release?
    • auto-add -EXPERIMENTAL to module name + auto-hide module file
    • for PyTorch: passing basic test step is sufficient to merge as experimental easyconfig file
  • evaluate performance benefits of from-source PyTorch installation
    • vs container images
    • vs pip install torch installation
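
The auto-naming part of the experimental easyconfigs proposal could be sketched as below. This is not EasyBuild code: the `experimental` parameter is only proposed, the function `experimental_module_name` is hypothetical, and the toolchain part of the module name is left out for brevity. Hiding relies on the convention in Lmod and Environment Modules that a version starting with `.` is treated as hidden.

```python
def experimental_module_name(name, version, versionsuffix="", experimental=False):
    """Build a 'name/version' style module name.

    For experimental installations, append -EXPERIMENTAL to the version
    suffix and hide the module by prefixing the version with a dot.
    """
    if experimental:
        if not versionsuffix.endswith("-EXPERIMENTAL"):
            versionsuffix += "-EXPERIMENTAL"
        # a leading '.' in the version makes the module file hidden
        version = "." + version
    return f"{name}/{version}{versionsuffix}"
```

With a placeholder version, `experimental_module_name("PyTorch", "2.7", experimental=True)` gives `PyTorch/.2.7-EXPERIMENTAL`, while the same call without `experimental=True` gives `PyTorch/2.7`.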

Next meeting

  • Thu 13 Nov'25 15:00 CET
    • OK for Xavier, Alex, Jörg (TBC: Alexander)
