queue start: worker doesn't process all experiments automatically #10562

lefos99 · 2024-09-20T09:04:30Z

Bug Report

Description

When using dvc queue to queue multiple experiments, the worker picks up one or two experiments, runs them successfully, and then stops without processing the remaining experiments in the queue. I have to manually run dvc queue start again to resume processing the remaining queued experiments.

Example:

dvc queue status

Task     Name                                 Created       Status
9294987  tumor_dysp_weight_hsv_rgb_min_mean   Sep 19, 2024  Queued
22c83f3  tumor_dysp_weight_hsv_rgb_mean_mean  Sep 19, 2024  Queued
6f3c8e9  tumor_dysp_weight_hsv_rgb_min_min    Sep 19, 2024  Queued
f1ce1d2  tumor_dysp_weight_hard_min_mean      Sep 19, 2024  Queued
3775500  tumor_dysp_weight_hard_mean_mean     Sep 19, 2024  Success
10f8332  tumor_dysp_weight_hard_min_min       Sep 19, 2024  Success

Worker status: 0 active, 0 idle

If I run dvc queue start again it picks them up e.g.

dvc queue status

Task     Name                                 Created       Status
9294987  tumor_dysp_weight_hsv_rgb_min_mean   Sep 19, 2024  Running
22c83f3  tumor_dysp_weight_hsv_rgb_mean_mean  Sep 19, 2024  Queued
6f3c8e9  tumor_dysp_weight_hsv_rgb_min_min    Sep 19, 2024  Queued
f1ce1d2  tumor_dysp_weight_hard_min_mean      Sep 19, 2024  Success
3775500  tumor_dysp_weight_hard_mean_mean     Sep 19, 2024  Success
10f8332  tumor_dysp_weight_hard_min_min       Sep 19, 2024  Success

Worker status: 1 active, 0 idle

Reproduce

Queue multiple experiments (e.g., 6 experiments) with dvc exp run --queue.
Run dvc queue start to process the experiments.
Observe that only a couple of experiments are processed, while others remain queued.
Manually run dvc queue start again to continue processing the remaining experiments.

Expected

The worker should automatically continue to process all queued experiments without requiring manual intervention after the first batch.

Environment information

Output of dvc doctor:

DVC version: 3.55.2 (pip)
-------------------------
Platform: Python 3.10.14 on Linux-5.15.0-122-generic-x86_64-with-glibc2.35
Subprojects:
        dvc_data = 3.16.5
        dvc_objects = 5.1.0
        dvc_render = 1.0.2
        dvc_task = 0.4.0
        scmrepo = 3.3.7
Supports:
        http (aiohttp = 3.10.5, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.10.5, aiohttp-retry = 2.8.3),
        s3 (s3fs = 2024.9.0, boto3 = 1.35.16),
        ssh (sshfs = 2024.6.0)
Config:
        Global: /fastdata/users/lefos/.config/dvc
        System: /etc/xdg/dvc
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/mapper/fastdatagroup-fastdatavolume
Caches: local
Remotes: local, s3
Workspace directory: ext4 on /dev/mapper/fastdatagroup-fastdatavolume
Repo: dvc (subdir), git
Repo.site_cache_dir: /var/tmp/dvc/repo/6a4a3cd6bb6194a5a62079d63dd05e18

Additional Information (if any):

Running dvc queue status shows the remaining experiments as queued even after some have been successfully processed.
The issue is resolved temporarily by running dvc queue start again, but the worker doesn't continue automatically.

The text was updated successfully, but these errors were encountered:

shcheklein · 2024-09-21T23:50:40Z

Is there a small repo to reproduce this? How long does it take to run a single experiment? Does dvc exp show show them running by chance (can it be a reporting problem like this #10427 ?)

I've tried the following:

dvc.yaml:

stages:
  run:
    cmd: sleep 1; echo ${i} > metric.json
    outs:
      - metric.json

params.yaml:

i: 1

Running like:

dvc exp run --queue -S "i=range(0,10,1)"

then:

dvc queue start

And after a while, I'm getting all the experiments done.

Please, if you can reproduce it - try to come up with a simple repo that we could use - otherwise it's quite hard to understand why it is happening. Thanks!

lefos99 · 2024-09-23T08:55:28Z

Thanks for the reply, @shcheklein ! 😸

How long does it take to run a single experiment?

In order to run a single experiment it takes around 46 mins.

Does dvc exp show show them running by chance?

I didn't check dvc exp show, but dvc queue shows them in the queue, like the experiments are in the queue but the worker is not running. I also checked and the experiments indeed didn't run. (the expected output files were not there.)

dvc queue status

Task     Name                                 Created       Status
9294987  tumor_dysp_weight_hsv_rgb_min_mean   Sep 19, 2024  Queued
22c83f3  tumor_dysp_weight_hsv_rgb_mean_mean  Sep 19, 2024  Queued
6f3c8e9  tumor_dysp_weight_hsv_rgb_min_min    Sep 19, 2024  Queued
f1ce1d2  tumor_dysp_weight_hard_min_mean      Sep 19, 2024  Queued
3775500  tumor_dysp_weight_hard_mean_mean     Sep 19, 2024  Success
10f8332  tumor_dysp_weight_hard_min_min       Sep 19, 2024  Success

Worker status: 0 active, 0 idle

Please, if you can reproduce it - try to come up with a simple repo that we could use - otherwise it's quite hard to understand why it is happening. Thanks!

I totally understand, but the problem is that sometimes it happens and sometimes it is not. Nevertheless, I will try to reproduce it with a small repo and report back.

lefos99 · 2024-09-23T10:49:00Z

Actually, reproducing it was easier than I thought.

mkdir dvc_playground && cd dvc_playground
git init && dvc init && git add -A && git commit -m "First commit"
mkdir out data
echo lala >> data/dummy.txt && dvc add data/dummy.txt

echo "from time import perf_counter
from time import sleep
start = perf_counter()
print('Hello everyone')
sleep(2)
end = perf_counter()
print('I am done')
with open('out/how_much_time.txt', 'w') as f:
    f.write(str(f'Time has passed is {end-start}'))
" > run_train.py

echo "stages:
  train:
      cmd: python run_train.py
      deps:
        - data/dummy.txt
      outs: 
        - out/how_much_time.txt
"> dvc.yaml

git add -A && git commit -m 'Add DVC stuff'

Then one can run in order to remove old experiments in the queue and start 30 new experiments with 6 workers:

dvc queue remove --all && repeat 30 {dvc exp run -f --queue} && dvc queue start -j 6 && watch dvc queue status

What I observed is that the unexpected behavior not always happens, but it happens pretty often. I ran the above command 10 times and it failed 5/10 (50%).
So I ended up with


Task     Name        Created    Status
46b886b  muddy-fare  12:47 PM   Queued
0741a83  sandy-bunk  12:47 PM   Queued
234edb7  nowed-pupa  12:47 PM   Queued
d419c3f  dorty-wain  12:47 PM   Queued
3ad8d1b  cynic-toff  12:47 PM   Queued
9bca7a1  beery-gyms  12:47 PM   Queued
5e41e1c  massy-ewes  12:47 PM   Queued
bfdf673  smoky-jeer  12:47 PM   Queued
81d3057  salic-sena  12:47 PM   Queued
c359d35  quack-runs  12:47 PM   Queued
2f715a1  trade-libs  12:47 PM   Queued
facc3ac  so-so-wage  12:47 PM   Queued
4f6950e  silty-hits  12:47 PM   Queued
39e29b0  gawky-merk  12:47 PM   Queued
771ca8c  fifty-dabs  12:47 PM   Queued
d86614a  weird-lure  12:47 PM   Queued
035dcd8  coaly-prob  12:47 PM   Queued
e6dc2b5  dolce-kaka  12:47 PM   Queued
f9439ae  gusty-daks  12:47 PM   Queued
df2fdf4  goosy-iron  12:47 PM   Queued
a4bf7ec  tangy-dees  12:47 PM   Queued
82f5770  natal-sand  12:47 PM   Queued
5f1401b  cedar-flip  12:47 PM   Queued
e6d98b4  mousy-sinh  12:47 PM   Queued
56116a0  dosed-knap  12:46 PM   Success
bf66773  pricy-buhl  12:46 PM   Success
0293e5f  sarky-snip  12:46 PM   Success
5833eb6  swish-heir  12:46 PM   Success
275e77a  hexed-cats  12:46 PM   Success
9c1bf71  hyoid-soul  12:47 PM   Success

Worker status: 0 active, 0 idle

I also managed to reproduce this behavior on 3 different machines.

shcheklein · 2024-09-24T00:49:48Z

Yep, I think it is really the same as #10427 as @nablabits has done an excellent research there. I see after the first run a bunch of messages left in the broker:

-rw-r--r--   1 ivan  staff   462B Sep 23 17:08 1079279054_db04a69e-e33c-4d26-9632-66031a0bac9f.dvc-exp-394d7a-6@localhost.celery.pidbox.msg
-rw-r--r--   1 ivan  staff   462B Sep 23 17:08 1079279054_4cb7a50e-739a-44bb-a110-15d371487f21.dvc-exp-394d7a-4@localhost.celery.pidbox.msg
-rw-r--r--   1 ivan  staff   462B Sep 23 17:08 1079279054_b9843cfa-d030-43f6-a257-4fabbb72030e.dvc-exp-394d7a-1@localhost.celery.pidbox.msg
-rw-r--r--   1 ivan  staff   462B Sep 23 17:08 1079279054_a73b40a0-0f35-4654-a1e1-d72bdad04fe4.dvc-exp-394d7a-5@localhost.celery.pidbox.msg
-rw-r--r--   1 ivan  staff   462B Sep 23 17:08 1079279054_ea17010c-08c9-4eb9-8199-0ee00cf50f59.dvc-exp-394d7a-2@localhost.celery.pidbox.msg
-rw-r--r--   1 ivan  staff   462B Sep 23 17:08 1079279054_c8b92dad-5f42-4587-a5bb-0cacae0c6980.dvc-exp-394d7a-4@localhost.celery.pidbox.msg
-rw-r--r--   1 ivan  staff   462B Sep 23 17:08 1079279054_b8dd4dc2-fbc9-4567-ba8c-5db8adcab2c4.dvc-exp-394d7a-3@localhost.celery.pidbox.msg
-rw-r--r--   1 ivan  staff   462B Sep 23 17:08 1079279054_d02287e4-d922-405e-b862-efa33f2c122a.dvc-exp-394d7a-6@localhost.celery.pidbox.msg
-rw-r--r--   1 ivan  staff   462B Sep 23 17:08 1079279054_faa62e23-582d-4c0f-ad6c-c665f935585e.dvc-exp-394d7a-5@localhost.celery.pidbox.msg
-rw-r--r--   1 ivan  staff   462B Sep 23 17:08 1079279055_1f0f890e-1df7-46aa-8640-1295b055c12d.dvc-exp-394d7a-4@localhost.celery.pidbox.msg
-rw-r--r--   1 ivan  staff   462B Sep 23 17:08 1079279082_c8db1eda-a0ad-4076-8174-d8ff08b42679.dvc-exp-394d7a-1@localhost.celery.pidbox.msg
-rw-r--r--   1 ivan  staff   463B Sep 23 17:08 1079279082_8e112e0d-a26d-4da1-8d7d-5051bd02aecb.dvc-exp-394d7a-1@localhost.celery.pidbox.msg
-rw-r--r--   1 ivan  staff   462B Sep 23 17:08 1079279082_532603b3-99b3-48cd-ab39-a715cd5482a6.dvc-exp-394d7a-2@localhost.celery.pidbox.msg
-rw-r--r--   1 ivan  staff   463B Sep 23 17:08 1079279082_51c237de-69bf-424a-ac1a-04654830fede.dvc-exp-394d7a-2@localhost.celery.pidbox.msg
-rw-r--r--   1 ivan  staff   462B Sep 23 17:08 1079279082_710efc24-7180-4166-95de-cd68131720e3.dvc-exp-394d7a-3@localhost.celery.pidbox.msg
-rw-r--r--   1 ivan  staff   463B Sep 23 17:08 1079279082_d1ad85ec-1cb3-4c6c-9c45-511535fa542a.dvc-exp-394d7a-3@localhost.celery.pidbox.msg
-rw-r--r--   1 ivan  staff   462B Sep 23 17:08 1079279082_1f04ba84-227d-49ae-81b6-b1148ba704e6.dvc-exp-394d7a-6@localhost.celery.pidbox.msg
-rw-r--r--   1 ivan  staff   463B Sep 23 17:08 1079279082_715cefcc-1949-42a2-baf9-d8d8008f67e9.dvc-exp-394d7a-6@localhost.celery.pidbox.msg
-rw-r--r--   1 ivan  staff   462B Sep 23 17:08 1079279083_b2ede287-0e7f-4de4-8a1d-19703c75faf3.dvc-exp-394d7a-5@localhost.celery.pidbox.msg
-rw-r--r--   1 ivan  staff   463B Sep 23 17:08 1079279083_db51942d-1413-47c2-8283-8be41cac4537.dvc-exp-394d7a-5@localhost.celery.pidbox.msg
-rw-r--r--   1 ivan  staff   462B Sep 23 17:08 1079279083_07a9a385-9d51-4c60-8872-99681d5c99e4.dvc-exp-394d7a-4@localhost.celery.pidbox.msg
-rw-r--r--   1 ivan  staff   463B Sep 23 17:08 1079279083_2fdec533-2a87-45ef-987f-04a0f1818ce6.dvc-exp-394d7a-4@localhost.celery.pidbox.msg
-rw-r--r--   1 ivan  staff   677B Sep 23 17:09 1079320217_f910a3d0-2f10-49de-8513-7c6488ed4219.dvc-exp-394d7a-1@localhost.celery.pidbox.msg
-rw-r--r--   1 ivan  staff   677B Sep 23 17:09 1079320217_bcccba75-5245-4dfe-a80c-121f3dc4ba46.dvc-exp-394d7a-2@localhost.celery.pidbox.msg
-rw-r--r--   1 ivan  staff   677B Sep 23 17:09 1079320217_4a878906-2e3d-4203-a2b4-95cb1570625d.dvc-exp-394d7a-3@localhost.celery.pidbox.msg
-rw-r--r--   1 ivan  staff   677B Sep 23 17:09 1079320217_03ced2d5-3e34-4ef2-88cb-56ca4aaa97b0.dvc-exp-394d7a-6@localhost.celery.pidbox.msg
-rw-r--r--   1 ivan  staff   677B Sep 23 17:09 1079320218_070cb665-5c5e-4e61-9ebf-5c98e6bfd267.dvc-exp-394d7a-5@localhost.celery.pidbox.msg
-rw-r--r--   1 ivan  staff   677B Sep 23 17:09 1079320218_1e9d9b78-c16d-4ae5-b57e-019d89a64294.dvc-exp-394d7a-4@localhost.celery.pidbox.msg
-rw-r--r--   1 ivan  staff   678B Sep 23 17:09 1079322839_1c42780f-af91-4847-9272-e24a6a2c6263.dvc-exp-394d7a-1@localhost.celery.pidbox.msg
-rw-r--r--   1 ivan  staff   678B Sep 23 17:09 1079322839_4652a208-95b5-45c7-9689-de7c4b7aabd1.dvc-exp-394d7a-2@localhost.celery.pidbox.msg
-rw-r--r--   1 ivan  staff   678B Sep 23 17:09 1079322840_46d804a4-943a-4b52-a98a-221cd279e50c.dvc-exp-394d7a-3@localhost.celery.pidbox.msg
-rw-r--r--   1 ivan  staff   678B Sep 23 17:09 1079322840_b2bf062b-d473-4872-be13-90730e620b15.dvc-exp-394d7a-6@localhost.celery.pidbox.msg
-rw-r--r--   1 ivan  staff   678B Sep 23 17:09 1079322840_72714df1-101f-4996-b6ab-ddddc150aa00.dvc-exp-394d7a-5@localhost.celery.pidbox.msg

In the /Users/ivan/Projects/dvc_playground/.dvc/tmp/exps/celery/broker/in

each messages:

{"method": "shutdown", "arguments": {}, "destination": null, "pattern": null, "matcher": null}

so, next time it starts some workers can go down probably because of these messages.

lefos99 · 2024-09-24T10:36:50Z

Ah, nice reference and awesome work by @nablabits indeed.

By following the thread, using dvc exp clean seems to me like a nice workaround for now! 👍
I also confirmed it that running this command before, so basically:

dvc exp clean && dvc queue remove --all && repeat 30 {dvc exp run -f --queue} && dvc queue start -j 6 && watch dvc queue status

resolves the issue, which is quite nice already! 😸

shcheklein · 2024-09-24T18:08:58Z

Closing in favor of #10427 then. Thanks @lefos99 for confirming.

shcheklein added awaiting response we are waiting for your reply, please respond! :) triage Needs to be triaged A: experiments Related to dvc exp labels Sep 21, 2024

shcheklein added bug Did we break something? and removed awaiting response we are waiting for your reply, please respond! :) triage Needs to be triaged labels Sep 24, 2024

shcheklein closed this as completed Sep 24, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

queue start: worker doesn't process all experiments automatically #10562

queue start: worker doesn't process all experiments automatically #10562

lefos99 commented Sep 20, 2024 •

edited

Loading

shcheklein commented Sep 21, 2024

lefos99 commented Sep 23, 2024

lefos99 commented Sep 23, 2024 •

edited

Loading

shcheklein commented Sep 24, 2024

lefos99 commented Sep 24, 2024 •

edited

Loading

shcheklein commented Sep 24, 2024

queue start: worker doesn't process all experiments automatically #10562

queue start: worker doesn't process all experiments automatically #10562

Comments

lefos99 commented Sep 20, 2024 • edited Loading

Bug Report

Description

Reproduce

Expected

Environment information

shcheklein commented Sep 21, 2024

lefos99 commented Sep 23, 2024

lefos99 commented Sep 23, 2024 • edited Loading

shcheklein commented Sep 24, 2024

lefos99 commented Sep 24, 2024 • edited Loading

shcheklein commented Sep 24, 2024

lefos99 commented Sep 20, 2024 •

edited

Loading

lefos99 commented Sep 23, 2024 •

edited

Loading

lefos99 commented Sep 24, 2024 •

edited

Loading