queue start: worker doesn't process all experiments automatically #10562
Comments
Is there a small repo to reproduce this? How long does it take to run a single experiment? Does I've tried the following:
stages:
  run:
    cmd: sleep 1; echo ${i} > metric.json
    outs:
    - metric.json

i: 1

Running like:
then:
And after a while, all the experiments are done. If you can reproduce it, please try to come up with a simple repo that we could use; otherwise it's quite hard to understand why it is happening. Thanks!
Thanks for the reply, @shcheklein ! 😸
A single experiment takes around 46 minutes to run.
I didn't check
I totally understand, but the problem is that sometimes it happens and sometimes it does not. Nevertheless, I will try to reproduce it with a small repo and report back.
Actually, reproducing it was easier than I thought.
mkdir dvc_playground && cd dvc_playground
git init && dvc init && git add -A && git commit -m "First commit"
mkdir out data
echo lala >> data/dummy.txt && dvc add data/dummy.txt
echo "from time import perf_counter
from time import sleep
start = perf_counter()
print('Hello everyone')
sleep(2)
end = perf_counter()
print('I am done')
with open('out/how_much_time.txt', 'w') as f:
    f.write(str(f'Time has passed is {end-start}'))
" > run_train.py
echo "stages:
  train:
    cmd: python run_train.py
    deps:
    - data/dummy.txt
    outs:
    - out/how_much_time.txt
"> dvc.yaml
git add -A && git commit -m 'Add DVC stuff'

Then, to remove old experiments from the queue and start 30 new experiments with 6 workers, one can run:

dvc queue remove --all && repeat 30 {dvc exp run -f --queue} && dvc queue start -j 6 && watch dvc queue status

The unexpected behavior does not always happen, but it happens pretty often: I ran the above command 10 times and it failed 5 times (50%).
I also managed to reproduce this behavior on 3 different machines.
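The one-liner above relies on `repeat`, which is a zsh builtin and is not available in bash. As a rough, shell-independent sketch, the same sequence can be driven from Python. The function names `build_repro_commands` and `run_repro` are hypothetical helpers introduced here for illustration; only the `dvc` commands and flags themselves come from the report.

```python
import shlex
import subprocess


def build_repro_commands(n_experiments=30, n_workers=6):
    """Return the command sequence from the report as argument lists.

    Hypothetical helper: it only mirrors the commands quoted in the thread.
    """
    # Clear whatever is still sitting in the queue.
    commands = [["dvc", "queue", "remove", "--all"]]
    # Queue the same experiment n_experiments times (-f forces a re-run).
    commands += [["dvc", "exp", "run", "-f", "--queue"] for _ in range(n_experiments)]
    # Start the workers.
    commands.append(["dvc", "queue", "start", "-j", str(n_workers)])
    return commands


def run_repro(n_experiments=30, n_workers=6, dry_run=False):
    """Run (or, with dry_run=True, just print) the reproduction sequence."""
    for cmd in build_repro_commands(n_experiments, n_workers):
        if dry_run:
            print(shlex.join(cmd))
        else:
            # check=True stops at the first failing dvc command.
            subprocess.run(cmd, check=True)
```

Calling `run_repro(dry_run=True)` inside the playground repo prints the commands without executing them; afterwards `dvc queue status` can be watched by hand as in the original command.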
Yep, I think it is really the same as #10427, where @nablabits has done excellent research. After the first run I see a bunch of messages left in the broker:
In each of the messages:
So the next time it starts, some workers probably go down because of these messages.
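To check whether messages are left behind between runs, one could list the files in the broker's inbox directory before starting the workers again. This is only a sketch: the default path below is an assumption about where DVC keeps its file-based Celery broker and is not taken from this thread, so pass the directory explicitly if your layout differs. `leftover_messages` is a hypothetical helper name.

```python
from pathlib import Path

# ASSUMPTION: default location of the file-based broker inbox inside a DVC repo.
# This path is not confirmed by the thread; adjust it to your setup.
ASSUMED_BROKER_DIR = Path(".dvc/tmp/exps/celery/broker/in")


def leftover_messages(broker_dir=ASSUMED_BROKER_DIR):
    """Return the sorted list of files currently present in broker_dir.

    Returns an empty list when the directory does not exist.
    """
    broker_dir = Path(broker_dir)
    if not broker_dir.is_dir():
        return []
    return sorted(p for p in broker_dir.iterdir() if p.is_file())
```

A non-empty result after all experiments have finished would match the observation above that messages stay in the broker after the first run.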
Ah, nice reference and awesome work by @nablabits indeed. By following the thread, using
resolves the issue, which is quite nice already! 😸
Bug Report
Description
When using dvc queue to queue multiple experiments, the worker picks up one or two experiments, runs them successfully, and then stops without processing the remaining experiments in the queue. I have to manually run dvc queue start again to resume processing the remaining queued experiments.
Example:
If I run dvc queue start again, it picks them up, e.g.
Reproduce
1. Queue multiple experiments with dvc exp run --queue.
2. Run dvc queue start to process the experiments.
3. Run dvc queue start again to continue processing the remaining experiments.
Expected
The worker should automatically continue to process all queued experiments without requiring manual intervention after the first batch.
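Until the worker drains the queue on its own, the manual "run dvc queue start again" step can be automated. The sketch below is a workaround, not a fix, and rests on stated assumptions: that each task row printed by `dvc queue status` ends with one of the labels `Queued`, `Running`, `Success` or `Failed`, and that calling `dvc queue start` again is harmless. The helpers `count_statuses`, `dvc` and `drain_queue` are hypothetical names introduced for illustration.

```python
import subprocess
import time

# ASSUMPTION: status labels shown in the last column of `dvc queue status`.
STATUS_WORDS = ("Queued", "Running", "Success", "Failed")


def count_statuses(status_text):
    """Count task rows per status, using the last word of each line."""
    counts = {word: 0 for word in STATUS_WORDS}
    for line in status_text.splitlines():
        fields = line.split()
        if fields and fields[-1] in counts:
            counts[fields[-1]] += 1
    return counts


def dvc(*args):
    """Run a dvc subcommand and return its standard output."""
    result = subprocess.run(
        ["dvc", *args], check=True, capture_output=True, text=True
    )
    return result.stdout


def drain_queue(jobs=1, poll_seconds=30, max_restarts=20, run=dvc, sleep=time.sleep):
    """Restart the workers whenever tasks are queued but none is running.

    Returns the number of restarts that were needed.
    """
    restarts = 0
    while True:
        counts = count_statuses(run("queue", "status"))
        if counts["Queued"] == 0 and counts["Running"] == 0:
            return restarts  # nothing left to process
        if counts["Running"] == 0:
            # Tasks are waiting but no worker is busy: the stalled state.
            if restarts >= max_restarts:
                raise RuntimeError("queue still stalled after max_restarts")
            run("queue", "start", "-j", str(jobs))
            restarts += 1
        sleep(poll_seconds)
```

The `run` and `sleep` parameters exist only so the loop can be exercised without a real DVC repo; in normal use `drain_queue(jobs=6)` would be called from the repository root.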
Environment information
Output of dvc doctor:
Additional Information (if any):