[help] Dynamic branched targets hang, never complete #1360

rsangole · 2024-10-28T21:33:13Z

Help

I understand and agree to https://books.ropensci.org/targets/help.html.

Description

Hi Will,

Hope you're doing well. I've been spinning my wheels on this for the past several days and I'm stuck. I don't have a reprex for you (run_bespoke_model() in the example below is a proprietary function), so I'm doing my best to explain my situation.

The objective of the code below is to let me define any number of model specifications in models_list (think hyperparameters, for example). This list is split into a dynamic target, one_model_spec, which feeds into a second dynamic target, list_model_results. The model results are generated by an internally built package containing run_bespoke_model(). run_bespoke_model() can take a long time to run, which is why I've set things up this way: everything remains cached, so I can quickly add new options to models_list and execute the pipeline without rerunning every model each time.

The result of run_bespoke_model() for each model spec is quite a large nested list-of-lists object.

Note: After much trial and error, I realized that:

Without format = "qs", the pipeline simply didn't work for the large list_model_results: targets would keep spinning its wheels.
storage = "worker" and retrieval = "worker" are needed; otherwise the run times grow many times over.
Not sure about the impact of memory = "transient"
Not sure about the impact of garbage_collection = TRUE, but probably doesn't hurt?
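
For comparison with the options listed above, here is a minimal sketch of setting the storage-related options on the one heavy target instead of globally in tar_option_set(). It assumes the qs package is installed. The stand-in run_bespoke_model() and the target name specs are hypothetical and exist only to make the file self-contained; this is not the pipeline from the post.

```r
# _targets.R (sketch under the assumptions stated above)
library(targets)

# Hypothetical stand-in for the proprietary run_bespoke_model().
run_bespoke_model <- function(spec) {
  list(spec = spec, draws = replicate(3, rnorm(10), simplify = FALSE))
}

list(
  # iteration = "list" makes downstream branches receive one list element each.
  tar_target(
    specs,
    list(model1 = list(foo = "bar"), model2 = list(foo = "baz")),
    iteration = "list"
  ),
  tar_target(
    list_model_results,
    run_bespoke_model(specs),
    pattern = map(specs),
    iteration = "list",
    format = "qs",         # serialize the large nested list with qs
    storage = "worker",    # parallel workers save their own results
    retrieval = "worker",  # parallel workers load their own dependencies
    memory = "transient"   # do not keep the value in memory for the whole run
  )
)
```

With the options scoped to a single target, the small upstream targets keep the defaults, which may make it easier to tell which setting actually affects run time.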

The issue

If I run this entire pipeline as a script outside targets, it runs perfectly fine and finishes in roughly 2-3 minutes plus 1-10 minutes per model spec.
If I run this in targets, 5 or 6 branches of the list_model_results target complete fine, and then the remaining ones just "run forever". I've waited 20-40 minutes without any results, even when I know that the particular model spec shouldn't take more than 1-2 minutes.
I can see on my CPU consumption graph that many workers show 100% activity, but the pipeline never finishes.

I'm not sure why this is happening, and I'm lost as to how to fix things.

library(targets)
library(crew)

targets::tar_option_set(
    packages = c(...),
    controller = crew::crew_controller_local(
        workers = future::availableCores() - 1
    ),
    storage = "worker",
    retrieval = "worker",
    garbage_collection = TRUE,
    memory = "transient",
    format = "qs"
)

list(
    # A static target, length 1
    # This list defines all the model specifications 
    # 1, 2, ...., n : n is on the order of 20 model specs
    tar_target(
        models_list,
        {
            list(
                model1 = list("foo" = "bar", "and" = "other_model_params"),
                model2 = list("foo" = "bar", "and" = "other_model_params"),
                # ...
                modeln = list("foo" = "bar", "and" = "other_model_params")
            )
        }
    ),
    # A static target, length 1
    # Splitting 1 list into a list of lists
    tar_target(
        list_of_lists,
        purrr::map(seq_along(models_list), ~ models_list[.x])
    ),
    # Dynamic target, length n
    # Create dynamic targets, one for each model-specification
    tar_target(
        one_model_spec,
        list_of_lists,
        iteration = "list",
        pattern = map(list_of_lists)
    ),
    # Dynamic target, length n
    # Run the model for each model-specification
    tar_target(
        list_model_results,
        run_bespoke_model(one_model_spec),
        iteration = "list",
        pattern = map(one_model_spec)
    ),
    # Further processing
    # ....
)
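
To make the splitting step concrete, the following base-R sketch shows the shape that list_of_lists takes and therefore what each branch of one_model_spec receives. It swaps purrr::map() for lapply(), which is an assumption made only to keep the example dependency-free, and the two model specs are placeholders.

```r
# Placeholder specs mirroring the structure of models_list in the post.
models_list <- list(
  model1 = list(foo = "bar", and = "other_model_params"),
  model2 = list(foo = "bar", and = "other_model_params")
)

# Base-R equivalent of purrr::map(seq_along(models_list), ~ models_list[.x]).
# Single-bracket indexing keeps the name, so each element is a
# one-element named list such as list(model1 = list(...)).
list_of_lists <- lapply(seq_along(models_list), function(i) models_list[i])

# With iteration = "list", each dynamic branch receives one of these elements.
first_branch <- list_of_lists[[1]]
str(first_branch)
```

Each branch therefore passes a length-1 named list to run_bespoke_model(), not the bare parameter list, so the function has to unwrap it or rely on the name.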

Example

See the execution time increase significantly from seconds... to 20 minutes... to now fully "hung" execution (over 1 hour).

Session Info

Running in a Docker container on an M1 Ultra machine (20 cores, 128 GB RAM).

r$> sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: aarch64-redhat-linux-gnu
Running under: Red Hat Enterprise Linux 9.4 (Plow)

Matrix products: default
BLAS/LAPACK: FlexiBLAS OPENBLAS-OPENMP;  LAPACK version 3.9.0

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8        LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8    LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C           LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

time zone: America/Los_Angeles
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] dplyr_1.1.4   targets_1.8.0

loaded via a namespace (and not attached):
 [1] vctrs_0.6.5       cli_3.6.3         knitr_1.48        rlang_1.1.4       xfun_0.48         processx_3.8.4    renv_1.0.11       generics_0.1.3    data.table_1.16.2 glue_1.8.0        backports_1.5.0   clipr_0.8.0       ps_1.8.0          fansi_1.0.6       tibble_3.2.1      base64url_1.4    
[17] yaml_2.3.10       lifecycle_1.0.4   compiler_4.4.1    codetools_0.2-20  igraph_2.1.1      fs_1.6.4          pkgconfig_2.0.3   rstudioapi_0.17.1 R6_2.5.1          reprex_2.1.1      tidyselect_1.2.1  utf8_1.2.4        pillar_1.9.0      callr_3.7.6       magrittr_2.0.3    withr_3.0.1      
[33] tools_4.4.1       secretbase_1.0.3

rsangole · 2024-10-29T16:07:22Z

Author

Some additional context: this isn't new code. It used to run fine, set up this way, roughly a month to a month and a half ago. I did have intermittent hang-ups, but I managed them by re-running the pipeline. After recently updating all my packages, I have this issue.

13 replies

wlandau Nov 4, 2024
Maintainer

You could do it that way if run_bespoke_model() itself needs to run in parallel but you don't want to split up the iterations into individual targets. It just depends on what kinds of efficiency tradeoffs you want.

By the way, how many targets are in your pipeline? Around the ~5000-20000 range is often comfortable for a pipeline, but at 100k targets, you may hit a lot of overhead.

> ps: I tried this by not specifying anything for controller in tar_option_set, and also by setting deployment = "main" for the target in question, but this didn't work.

What specifically went wrong, and what would you have wanted to see?

rsangole Nov 4, 2024
Author

> You could do it that way if run_bespoke_model() itself needs to run in parallel but you don't want to split up the iterations into individual targets. It just depends on what kinds of efficiency tradeoffs you want.

I think for my use case at the moment, this approach works fine.

> By the way, how many targets are in your pipeline? Around the ~5000-20000 range is often comfortable for a pipeline, but at 100k targets, you may hit a lot of overhead.

My pipeline isn't that big at all. On the order of 200 targets at the moment.

> What specifically went wrong, and what would you have wanted to see?

I expected to see the 2nd graph when running run_bespoke_model() within a targets pipeline. It shows me that all my cores are being actively used by furrr::future_map.

In my comment above:

The 1st graph shows the CPU utilization if I call run_bespoke_model() within a targets pipeline: CPU usage remains flat and the target never completes. (For the entirety of that screenshot, the targets pipeline was running.)
The 2nd graph shows the CPU utilization if I call run_bespoke_model() within a normal R script - CPU usage is high as expected, the function completes within minutes. (The red dotted line is where I execute the R script.)

run_bespoke_model <- function(...){
    # Create workers
    future::plan(future::multisession, workers = num_cores)
    # Execute all model specs
    furrr::future_map(....)
    # Close workers
    future::plan(future::sequential)
}


## _targets.R

...
 tar_target(
        model_results,
        run_bespoke_model(all_model_specs)
    ),
...
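
As a side note on the function above, one common way to write the "create workers, run, close workers" pattern is to restore the previous plan with on.exit(), so the plan is reset even if future_map() throws an error. The sketch below assumes the future and furrr packages are installed; run_specs_in_parallel() and fit_one() are hypothetical names, not part of the proprietary package, and nothing here is claimed to fix the hang described in this thread.

```r
library(future)
library(furrr)

# Hypothetical per-spec function; stands in for the real model fit.
fit_one <- function(spec) spec * 2

run_specs_in_parallel <- function(specs, workers = 2L) {
  # plan() returns the previous plan, so it can be restored afterwards.
  old_plan <- future::plan(future::multisession, workers = workers)
  # Restore the previous plan on exit, including when an error occurs.
  on.exit(future::plan(old_plan), add = TRUE)
  # Run every spec on the multisession workers.
  furrr::future_map(specs, fit_one)
}

results <- run_specs_in_parallel(list(a = 1, b = 2))
```

Restoring the previous plan, instead of hard-coding future::sequential, leaves whatever plan the caller had set untouched.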

wlandau Nov 4, 2024
Maintainer

> I expected to see the 2nd graph when running run_bespoke_model() within a targets pipeline. It shows me that all my cores are being actively used by furrr::future_map.

Do you have a reprex for that piece?

rsangole Nov 18, 2024
Author

@wlandau I've been working on developing a reprex. But haven't had luck getting it to do what I need it to do. Will keep you posted.

wlandau Nov 19, 2024
Maintainer

Just curious: does the development version of targets (>= 1.9.0) solve any of the issues you mentioned?
