dvc queue status doesn't report active workers #10427
Comments
Hi @shcheklein & @dberenbaum, this issue looks like a great learning opportunity. Are you happy for me to pick it up (assuming @RaW-Git is not interested in taking it themselves)?
@nablabits sure! Please give it a try.
Hi @shcheklein, I've tried this and it appears to me that we fixed it in 3.50.2, but it's not clear to me how (diff). I tested this with the examples repository; this is what I did:
Am I missing something?
@nablabits how do you install it? I mean DVC: are you using a virtualenv, and are you completely destroying it between runs? Also, the repo state: are you running it on the same state or a clean one? Just to make sure.
@shcheklein well, I just cloned the examples repository and installed the requirements in a virtual environment as explained in the readme. Then I ran through this section of the documentation to get familiar with the process. After running it for the first time, I realised that I could try to reproduce the issue. So, after that, I demoted dvc to 3.50.1. Looking at the diff between both tags didn't reveal anything obvious to me, so I set out to find the tag that solved the issue, which happily was the next one (3.50.2). Let me know what you think; in the meantime I will run a check with a fully clean repo pointing to 3.50.1.
Just a quick update on this: I have run a fair amount of experiments on the same version (3.50.1) and found that the issue sometimes appears and sometimes doesn't, seemingly at whim. I'll keep investigating until I can reproduce the error consistently.
Another quick update on this, so it won't fall into oblivion. I've been creating a bunch of experiments on my local version of dvc built from that tag.
I've finally managed to reproduce the issue consistently at will. This is what I did:
dvc exp run --queue --set-param "train.fine_tune_args.epochs=20,21" && \
dvc queue start -j 2
rm -rf .dvc/cache/runs && \
dvc exp run --queue --set-param "train.fine_tune_args.epochs=20,21" && \
dvc queue start -j 2
Next up: I've checked the queue state before the task succeeds; the worker doesn't show up, but the task seems to be there, as it will eventually succeed, which agrees with what was stated in the main description. In non-failing runs, this shows up as expected.
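For anyone who wants to poke at this themselves, here is a minimal sketch of how one might ask celery which workers and tasks are actually alive, independently of what dvc queue status prints. The app construction and broker URL are placeholders, not dvc-task's actual wiring:

from celery import Celery

# Placeholder app; dvc-task configures its own Celery application and broker.
app = Celery(broker="memory://")

inspect = app.control.inspect(timeout=1.0)

# Workers that answer the ping are alive, whatever the CLI reports.
print("alive workers:", inspect.ping())

# Tasks currently executing / reserved, keyed by worker name.
print("active tasks:", inspect.active())
print("reserved tasks:", inspect.reserved())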
Hi there, I couldn't work on this issue for a few weeks, but this week has been somewhat fruitful. TL;DR:
Thanks for the update @nablabits!
Just another brief update here. TL;DR on what we have so far:
Shared Data Directories
One of the things that I tried is to set different data directories. I have more personal annotations for myself in case anyone is interested, but the above are the main findings. Thanks for the patience 🙏
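If I understand the setup correctly, the "data directories" are the folders kombu's filesystem transport uses to exchange message files. A rough sketch of that configuration follows; the paths are placeholders, not the ones dvc-task actually derives:

from celery import Celery

# Sketch only: with the filesystem broker, messages are plain files written to
# data_folder_out and read from data_folder_in, so producer and consumer must
# agree on the directories.
app = Celery(broker="filesystem://")
app.conf.broker_transport_options = {
    "data_folder_in": "/tmp/dvc-task-broker",
    "data_folder_out": "/tmp/dvc-task-broker",
}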
Hi @shcheklein, so I finally got to the bottom of this issue 🐌. The thing is as follows:
So, the obvious solution is to add that extra second of timeout to this check, although I'm unsure of the side effects that may have. Do you want me to open an issue in dvc-task?
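Just to illustrate the shape of that change, here is a minimal sketch of an expiration check with a small grace period; the function name and the 2-second constant are illustrative, not dvc-task's actual code:

import time

# kombu stamps message expirations about 1 second in the future, so comparing
# them directly against the current time can be off by that second. A small
# grace period absorbs the offset.
EXPIRATION_GRACE_SECONDS = 2

def message_is_expired(expires_at: float, now: float | None = None) -> bool:
    """Treat a message as expired only once its expiry plus the grace period has passed."""
    now = time.time() if now is None else now
    return now > expires_at + EXPIRATION_GRACE_SECONDS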
As per iterative/dvc#10427 investigation we realised that kombu creates expiration dates for messages 1" in the future which affects cleaning the directories on shutdown.
@nablabits excellent research, thanks, and sorry for the delay. Before I can review the PR, could you remind me / give a little bit more detail (I need this to refresh my knowledge about this part of the code; it seems you now have more knowledge than I do):
Why do we create 4 messages (not 2)? Can you point to the code as well? Also, why do we use two queues? (Why can't it be a single queue and 2 workers?)
How can it execute the jobs then?
Hey @shcheklein, hope you are doing great.
Ha! Don't get too excited 🙃
Ahh, sorry, I may have been sloppy, maybe because the terminology is confusing.
Well, I hadn't gone so far as to nail down the cause of the 4 messages, but I've been researching this now, and it appears to me that there are three exchanges with different purposes.
The second one is the one in charge of delivering shutdown messages (and, in general, every other message, as it feels like the exchange bound to the actual workers), and it delivers them to every worker. This is because when the second exchange is created here it uses a fanout mailbox, as per this definition. I have the call stacks of the three exchanges if you want more details. Hope all this makes sense to you.
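To make the fanout point concrete, here is a small sketch; the broker URL and worker name are placeholders. Celery's control commands go through that broadcast mailbox, so a plain shutdown reaches every worker unless a destination is given:

from celery import Celery

# Placeholder app; dvc-task configures its own broker.
app = Celery(broker="memory://")

# Broadcast: the control message goes through the fanout mailbox,
# so every running worker receives it and shuts down.
app.control.shutdown()

# Targeted: only the worker whose hostname matches receives the message.
app.control.shutdown(destination=["celery@worker-1"])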
Thanks! I'll look a bit closer when I have some time. A few questions:
Hi @shcheklein
I've just tested your proposal and it worked like a charm ✨, thanks for the suggestion. I will update #10552 and iterative/dvc-task#142 accordingly. Do we want to do something about the extra 2", given that the messages kombu creates expire 1" in the future? Not a big deal in itself, but it may save a few headaches in the future, as debugging these is a bit time consuming.
Good question. What are the scenarios in which we run it (I mean GC)?
Pretty limited to be fair: just when the workers shut down.
Does it happen after the message is sent (the link to the source code is probably wrong)?
Yes, sorry for the lack of clarity; I actually meant that link to point to where shutdown starts. A number of things happen between the worker sending the shutdown message and the directories being cleaned up.
I can provide more details if that helps.
Thanks @nablabits! So, correct me if I'm wrong: AFAIU each worker pretty much calls it on shutdown. My concern here is that if we make GC a bit more aggressive, is there a chance that some legitimate shutdown messages are cleaned up before workers have time to consume them? Why can't we run cleanup when we start the queue again (0 workers and we launch a new one)?
Hi @shcheklein, sorry for the late reply. My understanding was that only the first worker will call it (edit from 03/10: that struck-through statement may not be quite right). Cleaning the queues before starting is something that I have also thought about, but it may require some more investigation, which I'm more than happy to carry out. Would it be worth opening a separate issue for this, so we can ship the fix for the current one?
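For what it's worth, a rough sketch of the "clean before starting" idea: only purge leftovers when no worker answers a ping. The function is an assumption for illustration; dvc-task's actual cleanup deals with the broker's on-disk message directories rather than celery's purge:

from celery import Celery

def cleanup_if_idle(app: Celery, timeout: float = 1.0) -> bool:
    """Purge leftover task messages, but only when no worker is alive."""
    replies = app.control.inspect(timeout=timeout).ping() or {}
    if replies:          # at least one worker answered, so leave the queue alone
        return False
    app.control.purge()  # discard all waiting task messages
    return True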
Yes, agreed, let's ship the simplest version (w/o modifying the timestamps, GC time thresholds, etc.) and then create a new issue / PR to keep the discussion / research going! Thanks @nablabits.
This reverts commit f15acd7 as per iterative/dvc#10427 (comment)
Hi @shcheklein, I've updated the PR and added that follow-up ticket. As said, I will be happy to carry out that investigation, as I'm greatly enjoying this project, but I see that there are some P1s that you folks may want to prioritise, so let me know your ideas 🙂
* dvc-10427 Add 2" delay to expirations check As per iterative/dvc#10427 investigation we realised that kombu creates expiration dates for messages 1" in the future which affects cleaning the directories on shutdown. * Add a tip to run a single test to the documentation * Exclude pyenv virtual environments * dvc-10427 Send the shutdown message to the appropriate worker. Please check: iterative/dvc#10427 (comment) * Revert "dvc-10427 Add 2" delay to expirations check" This reverts commit f15acd7 as per iterative/dvc#10427 (comment)
Closed by iterative/dvc-task#142
Bug Report
Description
I have two experiments queued up in my dvc queue:
Now I do dvc queue start -j 2. The 2 experiments are running (I can see that via the CPU and GPU usage). Also, dvc queue status shows them as running. But the workers reported by dvc queue status don't show up:
Reproduce
Expected
See the active running workers.
Environment information
Output of dvc doctor:
Additional Information (if any):