
New strategy for defining architecture in distributed tests #3880

Open · wants to merge 41 commits into main
Conversation

@simone-silvestri (Collaborator) commented Oct 29, 2024

closes #3897

@glwagner glwagner marked this pull request as ready for review October 29, 2024 16:01
@glwagner (Member) commented Nov 5, 2024

I learned that adding the Manifest won't help for the tests, because the test environment generates a new manifest every time.

@simone-silvestri (Collaborator, Author) commented

Ok, good to know. I am trying different approaches, but everything seems to fail; at least I can remove the Manifest.

@simone-silvestri (Collaborator, Author) commented

Finally, this works. It would need approval to get the tests back online.

@ali-ramadhan (Member) left a comment

Looks good to me!

  commands:
    - "srun julia -O0 --color=yes --project -e 'using Pkg; Pkg.test()'"
  agents:
-   slurm_mem: 120G
+   slurm_mem: 8G
Member:

why?

@simone-silvestri (Collaborator, Author) commented Nov 7, 2024

120G is much more than we need for these tests. I noticed that the agent starts much more quickly when we request a smaller memory amount, so I deduce that the tests run on shared nodes rather than exclusive ones, and requesting fewer resources lets us squeeze in even when the cluster is busy.
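To illustrate, this is a minimal sketch of what the reduced request might look like in a Buildkite step; everything except slurm_mem (the step label, queue name) is hypothetical and not taken from this PR:

```yaml
# Hypothetical Buildkite step; only slurm_mem reflects the change discussed above.
steps:
  - label: "distributed tests"    # hypothetical label
    commands:
      - "srun julia -O0 --color=yes --project -e 'using Pkg; Pkg.test()'"
    agents:
      slurm_mem: 8G   # smaller request -> scheduled faster on busy shared nodes
```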


  [targets]
- test = ["DataDeps", "Enzyme", "SafeTestsets", "Test", "TimesDates"]
+ test = ["DataDeps", "SafeTestsets", "Test", "Enzyme", "MPIPreferences", "TimesDates"]
Member:

Was this the crucial part?

@glwagner (Member) left a comment

This looks good. For future generations, can you please write a little about what you tried and what ended up working? I can't tell whether all the changes are necessary, though the end result is fairly clean. Mostly I am wondering about slurm_mem. I'm also curious why we cannot call precompile_runtime inside runtests.jl, and instead must call it before Pkg.test(); this has implications for the CI of other packages.

@simone-silvestri (Collaborator, Author) commented Nov 7, 2024

I think it is equivalent; I am trying to precompile inside runtests.jl now.

By the way, regaining access to the distributed GPU tests highlighted a bug in the set! function specific to distributed architectures, which I am fixing in this PR.
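For reference, the pattern under discussion (precompiling the GPU runtime before the test suite rather than inside runtests.jl) might look like the sketch below. This is an assumption about the pipeline, not a quote from the PR; the srun flags mirror the earlier snippet, and whether the precompilation call truly must precede Pkg.test() in the same invocation is exactly the open question raised above:

```yaml
# Hypothetical pipeline fragment: precompile the CUDA runtime, then run the tests.
commands:
  - "srun julia -O0 --color=yes --project -e 'using CUDA; CUDA.precompile_runtime(); using Pkg; Pkg.test()'"
agents:
  slurm_mem: 8G
```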


Successfully merging this pull request may close these issues.

The MPI we use in the distributed tests is not CUDA-aware
3 participants