CI move to ALPS (daint-gpu -> alps_gh200) #1225

Merged
merged 27 commits into from
Feb 11, 2025

Conversation

rasolca
Collaborator

@rasolca rasolca commented Dec 2, 2024

Still open points:

  • C_API slowness (can be moved to a separate PR; for now it is barely enough to pass)
  • MPI pool and MPI mode
    Quick test:
    • 30+0 hung in BacktransformationBandToTridiagTestMC
    • 30+1, 31+0 and 31+1 went through
      Note: we are seeing some hangs on eiger. I didn't track them down, but IIRC at least a couple of them were in the same backtransformation.
  • HDF5 -> separate PR

@rasolca rasolca self-assigned this Dec 2, 2024
@rasolca rasolca changed the base branch from master to ci/alps_1 December 3, 2024 14:21
@rasolca rasolca marked this pull request as ready for review December 4, 2024 12:54
@rasolca rasolca marked this pull request as draft December 4, 2024 12:55
@rasolca
Collaborator Author

rasolca commented Dec 4, 2024

cscs-ci run

Base automatically changed from ci/alps_1 to master December 6, 2024 16:21
@rasolca rasolca force-pushed the ci/alps_2 branch 2 times, most recently from fe656de to cbef80a Compare December 10, 2024 14:42
@rasolca
Collaborator Author

rasolca commented Dec 12, 2024

cscs-ci run

@rasolca
Collaborator Author

rasolca commented Dec 13, 2024

cscs-ci run

@rasolca
Collaborator Author

rasolca commented Dec 17, 2024

cscs-ci run

@rasolca
Collaborator Author

rasolca commented Dec 17, 2024

cscs-ci run

@rasolca
Collaborator Author

rasolca commented Dec 17, 2024

cscs-ci run

@rasolca
Collaborator Author

rasolca commented Dec 17, 2024

cscs-ci run

@rasolca
Collaborator Author

rasolca commented Jan 29, 2025

cscs-ci run

@rasolca
Collaborator Author

rasolca commented Jan 30, 2025

cscs-ci run

@rasolca rasolca force-pushed the ci/alps_2 branch 2 times, most recently from a9d1caf to 3219200 Compare January 30, 2025 10:43
@rasolca
Collaborator Author

rasolca commented Jan 30, 2025

cscs-ci run

@rasolca
Collaborator Author

rasolca commented Jan 30, 2025

cscs-ci run

@rasolca
Collaborator Author

rasolca commented Jan 30, 2025

cscs-ci run

@rasolca
Collaborator Author

rasolca commented Jan 30, 2025

cscs-ci run

@rasolca
Collaborator Author

rasolca commented Jan 30, 2025

cscs-ci run

@rasolca
Collaborator Author

rasolca commented Jan 30, 2025

cscs-ci run

@rasolca
Collaborator Author

rasolca commented Jan 30, 2025

cscs-ci run

@rasolca rasolca requested a review from RMeli February 4, 2025 13:38
@codecov-commenter

codecov-commenter commented Feb 4, 2025

⚠️ Please install the Codecov app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 95.06%. Comparing base (28bc43f) to head (9e773eb).

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1225   +/-   ##
=======================================
  Coverage   95.06%   95.06%           
=======================================
  Files         139      139           
  Lines        8573     8573           
  Branches     1107     1107           
=======================================
  Hits         8150     8150           
  Misses        236      236           
  Partials      187      187           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@rasolca
Collaborator Author

rasolca commented Feb 4, 2025

cscs-ci run

@msimberg
Collaborator

msimberg commented Feb 5, 2025

C_API slowness (can be moved to separate PR as now is barely enough to pass)

Just to add a bit of info on my investigation of this from today:

Yes, the pika runtime does seem to take some time to start and stop. However, the biggest chunk of time comes from creating and destroying cuSOLVER handles (of which we create 16 by default). I can reproduce the slow timings simply by creating and destroying n cuSOLVER handles, without starting or stopping the pika runtime. In particular, it's slower:

  • with multiple ranks on the same node, even if they use different GPUs
  • with MPS on the same node, with ranks sharing GPUs
  • with more cuSOLVER handles
  • when using MPS, though this seems to mostly be the first call to CUDA (i.e. likely just a single startup overhead, not every time a cuSOLVER handle is created)
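
The reproduction described above (creating and destroying n cuSOLVER handles without the pika runtime) could be timed with a harness along these lines. This is a minimal sketch, not the code used in the investigation: `time_handle_churn`, `FakeHandle` and `FakeBackend` are hypothetical names, and the stand-in backend exists only so the harness runs without a GPU.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Times the creation and destruction of `n` handles; returns elapsed seconds.
// `create()` must return a handle, `destroy(handle)` must release it.
template <class Create, class Destroy>
double time_handle_churn(std::size_t n, Create create, Destroy destroy) {
  using clock = std::chrono::steady_clock;
  using Handle = decltype(create());

  std::vector<Handle> handles;
  handles.reserve(n);

  const auto start = clock::now();
  for (std::size_t i = 0; i < n; ++i)
    handles.push_back(create());  // all handles are alive at the same time
  for (auto& h : handles)
    destroy(h);
  const auto stop = clock::now();

  return std::chrono::duration<double>(stop - start).count();
}

// Hypothetical stand-in for a GPU library handle, so the sketch runs anywhere.
struct FakeHandle {
  int id;
};

// Hypothetical backend that only counts create/destroy calls.
struct FakeBackend {
  int created = 0;
  int destroyed = 0;
  FakeHandle create() { return FakeHandle{created++}; }
  void destroy(FakeHandle) { ++destroyed; }
};

// With CUDA available, the callables would wrap the real cuSOLVER calls:
//   [] { cusolverDnHandle_t h; cusolverDnCreate(&h); return h; }
//   [](cusolverDnHandle_t h) { cusolverDnDestroy(h); }
// (error checking omitted; include <cusolverDn.h> and link against cusolver).
```

Running the cuSOLVER variant with n = 16 on one rank, then on several ranks per node with and without MPS, would be one way to compare the cases listed above.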

@msimberg
Collaborator

msimberg commented Feb 7, 2025

I've opened two PRs related to the slowness of API tests: #1268 and #1269. The former attempts to address starting and stopping the runtime (which includes initializing/finalizing DLA-Future and CUDA pools/cuSOLVER handles) unnecessarily frequently. The latter leaves one core free for non-pika threads, which I later found also has a big impact on test times.
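
The two ideas (initialize once instead of per test, and keep one core free) can be illustrated as follows. This is only a sketch under assumptions and does not reflect the actual contents of #1268 or #1269: `SharedRuntime` and `worker_threads` are hypothetical names, and the guard is not thread-safe.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical guard: runs `init` on the first acquire() and `fini` on the
// last release(), so repeated users share a single expensive start/stop.
class SharedRuntime {
public:
  SharedRuntime(std::function<void()> init, std::function<void()> fini)
      : init_(std::move(init)), fini_(std::move(fini)) {}

  void acquire() {
    if (users_++ == 0)
      init_();  // only the first user pays the startup cost
  }

  void release() {
    assert(users_ > 0);
    if (--users_ == 0)
      fini_();  // only the last user triggers the shutdown
  }

private:
  std::function<void()> init_;
  std::function<void()> fini_;
  int users_ = 0;
};

// Number of worker threads that leaves one hardware thread free for
// non-worker threads, while never returning less than one.
inline unsigned worker_threads(unsigned hardware_threads) {
  return hardware_threads > 1 ? hardware_threads - 1 : 1u;
}
```

In a test binary, `acquire()` and `release()` would typically be called from suite-level setup and teardown, and `worker_threads` would be fed `std::thread::hardware_concurrency()` or the number of cores assigned to the rank.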

@rasolca
Collaborator Author

rasolca commented Feb 10, 2025

cscs-ci run

@rasolca
Collaborator Author

rasolca commented Feb 11, 2025

cscs-ci run

@rasolca
Collaborator Author

rasolca commented Feb 11, 2025

cscs-ci run

@rasolca
Collaborator Author

rasolca commented Feb 11, 2025

cscs-ci run

@rasolca rasolca merged commit b14ffba into master Feb 11, 2025
4 checks passed
@rasolca rasolca deleted the ci/alps_2 branch February 11, 2025 22:57
@rasolca rasolca restored the ci/alps_2 branch February 11, 2025 23:06
@rasolca rasolca deleted the ci/alps_2 branch February 11, 2025 23:06
@msimberg msimberg restored the ci/alps_2 branch February 12, 2025 10:24
@msimberg msimberg deleted the ci/alps_2 branch February 12, 2025 10:24