Replies: 2 comments 2 replies
-
To make it easier to debug this, you could try running an equivalent workload with just … Once you get your setup working, https://wlandau.github.io/crew.cluster/ + …
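For example, a minimal sketch of such an equivalent workload submitted directly through {clustermq}, bypassing {targets} entirely (the scheduler option and template path are assumptions about your setup):

```r
# Sketch only: submit the same scale of work straight through {clustermq}
# to see whether the failure reproduces without {targets} in the loop.
library(clustermq)
options(
  clustermq.scheduler = "slurm",
  clustermq.template  = "slurm.tmpl"  # path to your SLURM template (assumed)
)

# 1600 tiny tasks on 1600 workers, mirroring the failing configuration.
res <- Q(function(x) sum(rnorm(1e6)), x = seq_len(1600), n_jobs = 1600)
```

If this reproduces the failure, the problem is in the clustermq/scheduler layer rather than in the pipeline itself.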
-
I think I finally got to the bottom of this. It does seem to be a memory issue on the R session running the targets pipeline. Thank you for helping me diagnose this issue, and sorry for not catching it before posting here!
-
Help
Description
Dear all,
My setup is {[email protected]} + {[email protected]} on SLURM on an AWS ParallelCluster.
My pipeline is:
I use a minimally modified version of the default template (sh -> bash, job name).
If I run this with a low number of workers, say around 100,
targets::tar_make_clustermq(reporter = "summary", workers = 100L)
this works great, although a bit slowly. If I run this with many more workers,
targets::tar_make_clustermq(reporter = "summary", workers = 1600L)
the same pipeline has a fairly high chance of failing at some point. I suspect that the high number of individual small workers is maxing out either networking or I/O (no Lustre FS) on the launching instance, but I cannot pin it down exactly.
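One thing worth ruling out first (an assumption on my part, not something I have confirmed for this cluster): each connected worker holds a TCP connection, i.e. one open file descriptor, in the R session running the pipeline, and 1600 workers would exceed a common default soft limit of 1024 open files. A quick check before launching:

```shell
# Show the current soft limit on open file descriptors for this shell:
ulimit -n
# Raise the soft limit up to the hard limit before starting the pipeline:
ulimit -n "$(ulimit -Hn)"
```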
I tried to fix the situation with … but the same issue occurs.
My next attempt was to leverage the rep_workers argument of tarchetypes::tar_map_rep() to see whether that reduces the worker-host communication and gets rid of the problem, but this leads to …
I can, however, open PSOCK clusters manually via
parallel::makePSOCKcluster(workers)
on the compute nodes. I would appreciate any ideas on how to diagnose this further.
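For reference, a minimal sketch of the rep_workers approach mentioned above (the target name, command, and batch/rep counts are illustrative, and run_one_rep() is a placeholder):

```r
# Sketch only: rep_workers runs several reps of each batch in parallel local
# processes inside one worker, so fewer workers talk back to the host session.
library(targets)
library(tarchetypes)

tar_map_rep(
  name = sim,               # illustrative target name
  command = run_one_rep(),  # run_one_rep() is a placeholder function
  batches = 100,            # 100 dynamic branches ...
  reps = 16,                # ... each running 16 reps
  rep_workers = 4           # 4 local R processes per worker
)
```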