
New strategy for defining architecture in distributed tests #3880

Open · wants to merge 41 commits into main
Conversation

@simone-silvestri (Collaborator) commented Oct 29, 2024

closes #3897

@glwagner glwagner marked this pull request as ready for review October 29, 2024 16:01
@glwagner (Member) commented Nov 5, 2024

I learned that adding the Manifest won't help for the tests, because the test environment generates a new manifest every time.

@simone-silvestri (Collaborator, Author) commented

Ok, good to know. I am trying different approaches, but everything seems to fail; at least I can remove the Manifest.

@simone-silvestri (Collaborator, Author) commented

Finally, this works. It would need approval to get the tests back online.

@ali-ramadhan (Member) left a comment

Looks good to me!

  commands:
    - "srun julia -O0 --color=yes --project -e 'using Pkg; Pkg.test()'"
  agents:
-   slurm_mem: 120G
+   slurm_mem: 8G
Member:

why?

@simone-silvestri (Collaborator, Author) commented Nov 7, 2024

120G is much more than we need for these tests. I noticed that the agent starts much more quickly when we request a smaller memory amount, so I deduce that the tests run on shared nodes rather than exclusive ones, and requesting fewer resources lets us squeeze in even when the cluster is busy.
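To illustrate, this is a minimal sketch of what the reduced request might look like in a Buildkite step; everything except slurm_mem (the step label, queue name) is hypothetical and not taken from this PR:

```yaml
# Hypothetical Buildkite step; only slurm_mem reflects the change discussed above.
steps:
  - label: "distributed tests"    # hypothetical label
    commands:
      - "srun julia -O0 --color=yes --project -e 'using Pkg; Pkg.test()'"
    agents:
      slurm_mem: 8G   # smaller request -> scheduled faster on busy shared nodes
```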


  [targets]
- test = ["DataDeps", "Enzyme", "SafeTestsets", "Test", "TimesDates"]
+ test = ["DataDeps", "SafeTestsets", "Test", "Enzyme", "MPIPreferences", "TimesDates"]
Member:

Was this the crucial part?

@glwagner (Member) left a comment

This looks good. For future generations, can you please write a little about what you tried and what ended up working? I can't tell whether all the changes are necessary, though the end result is fairly clean. Mostly I am wondering about slurm_mem. I'm also curious why we cannot call precompile_runtime inside runtests.jl, and instead must call it before Pkg.test(); this has implications for the CI of other packages.

@simone-silvestri (Collaborator, Author) commented Nov 7, 2024

I think it is equivalent; I am trying to precompile inside runtests.jl now.

By the way, regaining access to the distributed GPU tests highlighted a bug in the set! function specific to distributed architectures, which I am fixing in this PR.
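For reference, the pattern under discussion (precompiling the GPU runtime before the test suite rather than inside runtests.jl) might look like the sketch below. This is an assumption about the pipeline, not a quote from the PR; the srun flags mirror the earlier snippet, and whether the precompilation call truly must precede Pkg.test() in the same invocation is exactly the open question raised above:

```yaml
# Hypothetical pipeline fragment: precompile the CUDA runtime, then run the tests.
commands:
  - "srun julia -O0 --color=yes --project -e 'using CUDA; CUDA.precompile_runtime(); using Pkg; Pkg.test()'"
agents:
  slurm_mem: 8G
```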


Successfully merging this pull request may close these issues.

The MPI we use in the distributed tests is not CUDA-aware
3 participants