
dvc queue status doesn't report active workers #10427

Closed

RaW-Git opened this issue May 14, 2024 · 25 comments
Labels
A: experiments Related to dvc exp bug Did we break something? p2-medium Medium priority, should be done, but less important

Comments


RaW-Git commented May 14, 2024

Bug Report

Description

I have two experiments queued up in my dvc queue:
[Screenshot 2024-05-14 at 12:31:33: dvc queue status output showing the two queued experiments]

Now I run dvc queue start -j 2. The two experiments are running (I can see that via the CPU and GPU usage), and dvc queue status shows them as running. But the workers reported by dvc queue status don't show up:
[Screenshot 2024-05-14 at 12:31:40: dvc queue status output with the experiments running but no workers listed]

Reproduce

  1. dvc exp run --queue --name exp-1
  2. dvc exp run --queue --name exp-2
  3. dvc queue start -j 2
  4. dvc queue status

Expected

See the active running workers.

Environment information

Output of dvc doctor:

DVC version: 3.50.1 (pip)
-------------------------
Platform: Python 3.11.0rc1 on Linux-5.15.0-105-generic-x86_64-with-glibc2.35
Subprojects:
        dvc_data = 3.15.1
        dvc_objects = 5.1.0
        dvc_render = 1.0.2
        dvc_task = 0.4.0
        scmrepo = 3.3.2
Supports:
        azure (adlfs = 2024.4.1, knack = 0.11.0, azure-identity = 1.16.0),
        http (aiohttp = 3.9.5, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.9.5, aiohttp-retry = 2.8.3)
Config:
        Global: /home/raw/.config/dvc
        System: /etc/xdg/dvc
Cache types: symlink
Cache directory: ext4 on /dev/sdb1
Caches: local
Remotes: azure
Workspace directory: ext4 on /dev/nvme0n1p3
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/1c67604caf3156e9fea7df37dda80d5f

Additional Information (if any):

@shcheklein shcheklein added the A: experiments Related to dvc exp label May 19, 2024
@dberenbaum dberenbaum added bug Did we break something? p2-medium Medium priority, should be done, but less important labels May 20, 2024
@nablabits

Hi @shcheklein & @dberenbaum, this issue looks like a great learning opportunity. Are you happy for me to pick it up (assuming @RaW-Git is not interested in taking it themselves)?

@shcheklein (Member)

@nablabits sure! please give it a try.


nablabits commented Jun 12, 2024

Hi @shcheklein, I've tried this and it appears to me that we fixed it in 3.50.2, but it's not clear to me how (diff).

I tested this with the examples repository, this is what I did:

  • Enqueue a couple of jobs: dvc exp run --queue --set-param "train.batch_size=16,24"
  • Check the status: they appeared as queued
  • Run them: dvc queue start -j 2
  • Check the status: in 3.50.1 I saw what @RaW-Git described, but in 3.50.2 I didn't.

Am I missing something?

@shcheklein (Member)

@nablabits how do you install it (I mean DVC)? Are you using a virtualenv, and are you completely destroying it between runs? Also, the repo state: are you running it on the same state or a clean one? Just to make sure.

@nablabits

@shcheklein well, I just cloned the examples repository and installed the requirements in a virtual environment as explained in the readme. Then, I ran through this section of the documentation to get familiar with the process.

After running it for the first time, I realised I could try to reproduce the issue with dvc exp run --queue --set-param "train.batch_size=16,24", and found that it was behaving OK for me.

So, after that, I downgraded DVC to 3.50.1 with pip install dvc==3.50.1 and confirmed that the issue was there. I then upgraded again to the latest 3.51.2 just to double-check, and the issue was not there.

Looking at the diff between both tags didn't reveal anything obvious to me, so I set out to find the tag that solved the issue, which happily turned out to be the very next one, 3.50.2. But maybe what you say about the cleanliness of the repo state had something to do with it 🤔

Let me know what you think. In the meantime I will run a check with a fully clean repo pinned to 3.50.2 to rule out that scenario. 🙂

@nablabits

Just a quick update on this: I have run a fair number of experiments on the same version (3.50.1) and found that the issue sometimes appears and sometimes doesn't. I'll keep investigating until I can reproduce the error consistently.

@nablabits

Another quick update so this doesn't fall into oblivion. I've been creating a bunch of experiments on my local build of DVC at tag 3.50.1 using a basic sklearn algorithm, but I couldn't reproduce the issue. Next up I will try to build something more sophisticated using deep learning, to rule out the possibility that the GPU is somehow spoiling the party behind the scenes.

@nablabits

I've finally managed to reproduce the issue consistently at will. This is what I did:

dvc exp run --queue --set-param "train.fine_tune_args.epochs=20,21" && \
dvc queue start -j 2
  • This first run should show the active workers as usual.
  • Clean the run cache, so you can run the same experiment without retrieving a cached copy, then enqueue and run the experiment again:
rm -rf .dvc/cache/runs && \
dvc exp run --queue --set-param "train.fine_tune_args.epochs=20,21" && \
dvc queue start -j 2
  • This second run will reproduce the error. What's more, subsequent runs will alternate between working fine and raising the error.

Next up: I've checked dvc-exp-worker-1.out and found something promising. In failed runs we get a

[2024-07-11 11:28:09,150: WARNING/MainProcess] Got shutdown from remote

before the task succeeds, but the task still seems to be there, as it will eventually succeed, which agrees with what was stated in the main description. In non-failing runs, this Got shutdown from remote happens after the task has succeeded. This may well be the key to the problem, as we retrieve these active workers by means of celery.control.inspect, which appears to use this remote under the hood. So I will investigate that avenue.
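
For reference, a minimal sketch of what that inspection amounts to, using Celery's public control/inspect API. The app setup and broker options below are illustrative assumptions, not DVC's actual wiring:

    from celery import Celery

    # Illustrative setup only: DVC/dvc-task wire this up internally.
    # The broker directory and transport options here are assumptions.
    app = Celery("dvc-exp", broker="filesystem://")
    app.conf.broker_transport_options = {
        "data_folder_in": ".dvc/tmp/exps/celery/broker/in",
        "data_folder_out": ".dvc/tmp/exps/celery/broker/in",
        "processed_folder": ".dvc/tmp/exps/celery/broker/processed",
        "store_processed": True,
    }

    # inspect() broadcasts a request over the pidbox exchange and collects
    # replies; a worker that has already accepted a shutdown request may no
    # longer reply, so it silently drops out of this view.
    replies = app.control.inspect(timeout=1.0).active() or {}
    for worker_name, tasks in replies.items():
        print(worker_name, "->", len(tasks), "active task(s)")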

@nablabits

Hi there, for a few weeks I couldn't work on this issue, but this week has been somewhat fruitful.

TL;DR

  • It seems that something is left behind after one command runs, and it triggers a shutdown during the next run
  • Running dvc exp clean before enqueuing and running a new experiment fixes the issue

pidbox Log

After enabling debug logs for Celery, I realised that working runs got this after the task is accepted:

pidbox received method active(safe=None) [reply_to:{'exchange': 'reply.celery.pidbox', 'routing_key': '6b303b21-8b49-3917-a76b-0b9636a45c9e'} ticket:206dc759-5f0a-42f4-b14f-da3595117808]

Whereas faulty ones got:

[2024-08-01 08:35:13,267: DEBUG/MainProcess] pidbox received method shutdown() [reply_to:None ticket:None]

This is not because of ThreadPoolExecutor or CmdQueueStop either; those are the two explicit places where shutdown() is called.

Clean Experiments

I also realised that running dvc exp clean before running experiments solves the issue, so this may be a potential workaround. In the documentation I note that "This is done automatically when running queued experiments", which led me to play a bit with this --clean flag (source). Unfortunately, adding it selectively to one worker, or even to all of them, didn't seem to fix the issue. Still, I will keep investigating along this line unless someone advises otherwise.

Thanks for the patience 🙏

@shcheklein (Member)

thanks for the update @nablabits !

@nablabits

Just another brief update here

TL;DR on what We Have So Far

  • This is potentially an issue in dvc-task, as clean seems not to clean everything, or something is still created after clean is called. Removing the conditional logic so that this clean always runs, and putting a sleep(60) here, fixes the issue. But of course we don't want such a sloppy solution 😉
  • Specifically, there's a directory that controls what is to be processed, .dvc/tmp/exps/celery/broker/in/, and there are residual files that remain there after the tasks are done. Removing these files solves the issue (see the small listing sketch right after this list).
  • I also noticed that the issue only takes place when one runs a couple of jobs with -j 2; if one enqueues two experiments but only selects one job, it behaves normally. Running with one job still creates those residuals in in/, so I'm exploring the differences between the two approaches.
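
As a side note, here is a tiny stdlib-only sketch for eyeballing those leftovers. It is not part of any fix; it just lists whatever files are still sitting in that in/ directory:

    from datetime import datetime
    from pathlib import Path

    # Directory named in the bullet above (adjust if your layout differs).
    broker_in = Path(".dvc/tmp/exps/celery/broker/in")

    if broker_in.exists():
        # Print any message files still left behind after a run, oldest first.
        leftovers = sorted(
            (p for p in broker_in.iterdir() if p.is_file()),
            key=lambda p: p.stat().st_mtime,
        )
        for msg in leftovers:
            mtime = datetime.fromtimestamp(msg.stat().st_mtime)
            print(f"{mtime:%Y-%m-%d %H:%M:%S}  {msg.name}")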

Shared Data Directories

One of the things I tried was to set a different out directory here, to check whether that had something to do with it and to debug the inner workings of kombu and dvc-task, but this created all sorts of issues.

I have more personal notes in case anyone is interested, but the above are the main findings.

Thanks for the patience 🙏


nablabits commented Sep 1, 2024

Hi @shcheklein, I finally got to the bottom of this issue 🐌. The situation is as follows:

  • Both workers will put shutdown messages for both queues once the queues are empty, that is, 4 shutdown files in total.
  • Each worker will pick up the first set of shutdown messages, only processing the one relevant to its own queue.
  • These shutdown messages are then moved to processed once they are done.
  • When the clean kicks in, it finds the other pair of shutdown messages still to be cleaned up in the in directory.
  • _gc won't clean them, as _delete_expired will detect that the message is not expired yet (source). These are two sample timestamps from one of the experiments I ran: 1725093643.3807564 > 1725093642.829151, where we can see that the expiration is about 1" in the future, which is precisely the timeout that this line of code adds when the message is created.
  • _clean_pidbox won't clean them either, as the shutdown messages belong to the celery.pidbox exchange, whereas the exchange being cleaned is reply.celery.pidbox.
  • In the second iteration, each queue will start off with the shutdown messages left over from the previous one, reproducing the issue, which is in agreement with what I described a few weeks ago.
  • dvc exp clean will naturally clean the files, as by then it finds them expired, which is also in agreement.

So, the obvious solution is to add that extra second to this check, although I'm unsure of the side effects that might have.
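
To make the timing problem concrete, here is a toy model (deliberately simplified, not the actual dvc-task/kombu code) of why the clean-on-shutdown pass skips these messages and what a couple of seconds of grace would change:

    import time

    KOMBU_FUTURE_OFFSET = 1.0  # kombu stamps message expiry about 1" ahead

    def make_shutdown_message():
        # Simplified stand-in for a leftover pidbox shutdown message on disk.
        return {"expires": time.time() + KOMBU_FUTURE_OFFSET}

    def gc_can_delete(msg, grace=0.0):
        # Toy version of the "is it expired yet?" check: the GC only deletes
        # a message once its expiration has passed. A grace window treats
        # "expires within the next `grace` seconds" as already expired.
        return msg["expires"] <= time.time() + grace

    msg = make_shutdown_message()
    print(gc_can_delete(msg))             # False: clean-on-shutdown skips it
    print(gc_can_delete(msg, grace=2.0))  # True: an extra 2" lets GC remove it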

Do you want me to open an issue in dvc-task for that?

nablabits added a commit to nablabits/dvc-task that referenced this issue Sep 11, 2024
As per iterative/dvc#10427 investigation we realised that kombu creates expiration dates for messages 1" in the future which affects cleaning the directories on shutdown.
@shcheklein (Member)

@nablabits excellent research, thanks, and sorry for the delay. Before I can review the PR, could you remind me / give a little bit more detail (I need this to refresh my knowledge about this part of the code - it seems you now have more knowledge):

Both workers will put shutdown messages for both queues once the queues are empty, that is, 4 shutdown files in total.

Why do we create 4 messages (not 2)? Can you point to the code as well?

Also, why do we use two queues? (Why can't it be a single queue and 2 workers?)

In the second iteration, each queue will start off with the shutdown messages left over from the previous one, reproducing the issue, which is in agreement with what I described a few weeks ago.

How can it execute the jobs then?

@nablabits

Hey @shcheklein, hope you are doing great.

it seems you now have more knowledge

Ha! Don't get too excited 🙃

Also, why do we use two queues? (Why can't it be a single queue and 2 workers?)

Ahh, sorry, I may have been sloppy, maybe because the terminology is confusing: put talks about queues, which as far as I can tell are used interchangeably with exchanges (source).
Let me try to unpack this to the best of my knowledge. When we queue jobs, they all effectively go to the same data_folder_in, which is then iterated in order of arrival (source). This can effectively be understood as a single queue; that's why all the messages prepend a sort of timestamp to their filenames (source). However, when iterating over the messages, each worker only acts on its own messages and overlooks everything else.
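
If it helps, here is a toy model of that idea (mimicking the concept, not kombu's actual file naming): messages are files whose names start with an arrival stamp, so a sorted directory listing recovers arrival order, and each worker filters on the queue name embedded in the filename.

    import time
    import uuid
    from pathlib import Path

    # Toy broker directory standing in for data_folder_in.
    data_folder_in = Path("/tmp/toy-broker-in")
    data_folder_in.mkdir(exist_ok=True)

    def put(queue_name, body):
        # Arrival-order prefix plus the target queue name in the filename
        # (invented format, only to illustrate the concept).
        stamp = time.time_ns()
        name = f"{stamp}_{uuid.uuid4().hex}.{queue_name}.msg"
        (data_folder_in / name).write_text(body)

    put("dvc-exp-worker-1", "run experiment A")
    put("dvc-exp-worker-2", "run experiment B")

    my_queue = "dvc-exp-worker-1"                   # what one worker consumes
    for path in sorted(data_folder_in.iterdir()):   # "order of arrival"
        queue_name = path.name.rsplit(".", 2)[-2]   # embedded queue name
        if queue_name == my_queue:                  # overlook other queues
            print("processing:", path.read_text())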

Why do we create 4 messages (not 2)? Can you point to the code as well?

Well, I hadn't gone so far as to nail down the cause of the 4 messages, but I've been researching it now, and it appears to me that there are three exchanges, each with a different exchange_type:

  • celery, type direct
  • celery.pidbox, type fanout
  • reply.celery.pidbox, type direct

The second one is the exchange in charge of putting the shutdown messages (and, in general, every other control message, as it feels like the exchange bound to the actual workers), and it delivers to every worker because it uses _put_fanout (source). The iterator we see in that method lists both workers' mailboxes.

And this is because, when the second exchange is created here, it uses a fanout mailbox as per this definition. I have the call stacks of the three exchanges if you want more details.
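
In other words, a toy model of the arithmetic (not the real code): each of the two workers broadcasts one shutdown over the fanout pidbox, and the fanout copies it into every worker's mailbox, hence 4 files.

    # Toy model of why 2 workers broadcasting "shutdown" over a fanout
    # exchange leave 4 message files behind (names are illustrative).
    workers = ["dvc-exp-worker-1", "dvc-exp-worker-2"]
    mailboxes = {w: [] for w in workers}  # one pidbox mailbox per worker

    def broadcast_shutdown(sender):
        # A fanout exchange copies the message into every mailbox,
        # including the sender's own.
        for w in workers:
            mailboxes[w].append(f"shutdown (sent by {sender})")

    for w in workers:  # each worker broadcasts once when its queue runs dry
        broadcast_shutdown(w)

    total = sum(len(msgs) for msgs in mailboxes.values())
    print(total)  # 4; each worker consumes one, so two files stay in in/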

Hope all this makes sense to you

@shcheklein (Member)

Thanks! I'll take a closer look when I have some time. A few questions:

  • Each worker runs self.app.control.shutdown(), which creates 2 messages (one per worker), so 4 in total - is that correct?
  • Is there a way to pass a specific worker name into the shutdown, something like:

self.app.control.shutdown(destination=nodename) ?


nablabits commented Sep 25, 2024

Hi @shcheklein

  • Yes, and
  • Yes

I've just tested your proposal and it worked like a charm ✨, thanks for the suggestion. I will update #10552 and iterative/dvc-task#142 accordingly.
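
For the record, roughly what the change boils down to, as a sketch under the assumption of an app already wired to the queue's broker (the node name is illustrative, not dvc-task's actual value):

    from typing import Optional
    from celery import Celery

    def shutdown_workers(app: Celery, nodename: Optional[str] = None) -> None:
        # Sketch of the two shutdown styles discussed above.
        if nodename is None:
            # Broadcast: travels over the fanout pidbox, so every worker's
            # mailbox gets a copy. With 2 workers each doing this when their
            # queue drains, that is the 4 shutdown files discussed above.
            app.control.shutdown()
        else:
            # Targeted: only the named worker's mailbox receives the message,
            # so no stray shutdown file is left for the other worker.
            app.control.shutdown(destination=[nodename])

    # e.g. shutdown_workers(app, nodename="celery@dvc-exp-worker-1")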

Do we want to do something about the extra 2" as well, given that the messages kombu creates expire 1" in the future? Not a big deal in itself, but it may save a few headaches in the future, as debugging these messages is a bit time-consuming.

nablabits added a commit to nablabits/example-get-started-experiments that referenced this issue Sep 25, 2024
nablabits added a commit to nablabits/dvc-task that referenced this issue Sep 25, 2024
@shcheklein (Member)

Do we want to do something about the extra 2" as well, given that the messages kombu creates expire 1" in the future? Not a big deal in itself, but it may save a few headaches in the future, as debugging these messages is a bit time-consuming.

Good question. What are the scenarios when we run it (I mean gc)?

@nablabits

What are the scenarios when we run it (I mean gc)?

Pretty limited, to be fair: just when we call clean (source), which happens automatically on shutdown (source) or manually via dvc exp clean. The latter will rarely be affected, as one would have to run it within 1" of the message being created (last bullet point here).

@shcheklein (Member)

which happens automatically on shutdown (source)

Does it happen after the message is sent? (The link to the source code is probably wrong.)

@nablabits

Yes, sorry for the lack of clarity. I actually meant that link to point to where the shutdown starts; a number of things happen between the worker sending shutdown (source) and clean actually being called (source), but in a nutshell:

  • shutdown is called
  • Shutdown messages are created in the in directory
  • clean is called

I can provide the clean call stack and the logs between these two events if you need them.

@shcheklein (Member)

thanks @nablabits !

So, correct me if I'm wrong - AFAIU each worker pretty much calls clean when it's done (after it has consumed the shutdown message)? Or is it only a single worker?

My concern here is that if we make the GC a bit more aggressive, is there a chance that some legitimate shutdown messages get cleaned up before the workers have time to consume them?

Why can't we run the cleanup when we start the queue again (0 workers and we launch a new one)?


nablabits commented Oct 2, 2024

Hi @shcheklein, sorry for the late reply.

~~My understanding is that only the first worker will call clean as per this line~~

Edit from 03/10: the struck-through text may not be quite right, as dvc exp clean cleans everything that is expired. The follow-up investigation may shed some more light on this.
Your concern makes sense to me, but not for shutdown messages, as a worker can't delete messages belonging to the other worker because of the queues argument (source). What may happen is that some message gets into the queue after the shutdown message is created, due to a race condition, and clean will wipe it. I vaguely remember this happening in some run, but take it with a grain of salt, as the logs at that point are a bit messy and I didn't get into the weeds of checking whether the actual experiment was successfully carried out.

Cleaning the queues before starting is something I have also thought about, but it may require some more investigation, which I'm more than happy to carry out. Would it be worth opening a separate issue for that, so we can ship the fix for the current one?

@shcheklein (Member)

Yes, agreed, let's ship the simplest version (without modifying the timestamps, GC time thresholds, etc.) and then create a new issue / PR to keep the discussion / research going! Thanks @nablabits.

@nablabits

Hi @shcheklein,

I've updated the PR and added that follow-up ticket. As I said, I'll be happy to carry out that investigation, as I'm greatly enjoying this project, but I see there are some P1s that you folks may want to prioritise, so let me know your thoughts 🙂

shcheklein pushed a commit to iterative/dvc-task that referenced this issue Oct 3, 2024
* dvc-10427 Add 2" delay to expirations check

As per iterative/dvc#10427 investigation we realised that kombu creates expiration dates for messages 1" in the future which affects cleaning the directories on shutdown.

* Add a tip to run a single test to the documentation

* Exclude pyenv virtual environments

* dvc-10427 Send the shutdown message to the appropriate worker.

Please check: iterative/dvc#10427 (comment)

* Revert "dvc-10427 Add 2" delay to expirations check"

This reverts commit f15acd7 as per iterative/dvc#10427 (comment)
@shcheklein (Member)

Closed by iterative/dvc-task#142
