-
Notifications
You must be signed in to change notification settings - Fork 194
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New strategy for defining architecture in distributed tests #3880
base: main
Are you sure you want to change the base?
Conversation
…nto ss/fix-gpu-tests
Adding the Manifest won't help for the tests because the test environment generates a new manifest every time, I learned |
Ok, good to know. I am trying different approaches, but everything seems to fail; at least I can remove the manifest. |
Finally, this works. It would need approval to get the tests back online. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me!
commands: | ||
- "srun julia -O0 --color=yes --project -e 'using Pkg; Pkg.test()'" | ||
agents: | ||
slurm_mem: 120G | ||
slurm_mem: 8G |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
120G is much more than we need for those tests. I noticed that the agent starts much quicker by requesting a smaller memory amount. So I am deducing that the tests run on shared nodes instead of exclusive ones, and requesting lower resources allows us to squeeze in also when the cluster is busy.
|
||
[targets] | ||
test = ["DataDeps", "Enzyme", "SafeTestsets", "Test", "TimesDates"] | ||
test = ["DataDeps", "SafeTestsets", "Test", "Enzyme", "MPIPreferences", "TimesDates"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was this the crucial part?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good. For future generations, can you please write a little bit about what you tried and what ended up working? I can't tell if all the changes are necessary, though the end result is fairly clean. Mostly I am wondering about slurm_mem
. I'm also curious why we cannot call precompile_runtime
inside runtests.jl
and it is necessary to call it before Pkg.test()
. This has implications for the CI of other packages.
I think it is equivalent. I am trying to precompile inside the runtests. Btw, having again access to GPU distributed tests highlighted a bug related to distributed architectures specifically for the |
closes #3897