
dvc queue status doesn't report active workers #10427

Closed

RaW-Git opened this issue May 14, 2024 · 25 comments
Labels
A: experiments Related to dvc exp bug Did we break something? p2-medium Medium priority, should be done, but less important

Comments


RaW-Git commented May 14, 2024

Bug Report

Description

I have two experiments queued up in my dvc queue:
[Screenshot 2024-05-14 at 12:31:33: dvc queue status output showing the two queued experiments]

Now I run dvc queue start -j 2. The two experiments are running (I can see that via the CPU and GPU usage), and dvc queue status shows them as running. But the workers reported by dvc queue status don't show up:
[Screenshot 2024-05-14 at 12:31:40: dvc queue status output with the experiments running but no workers listed]

Reproduce

  1. dvc exp run --queue --name exp-1
  2. dvc exp run --queue --name exp-2
  3. dvc queue start -j 2
  4. dvc queue status

Expected

See the active running workers.

Environment information

Output of dvc doctor:

DVC version: 3.50.1 (pip)
-------------------------
Platform: Python 3.11.0rc1 on Linux-5.15.0-105-generic-x86_64-with-glibc2.35
Subprojects:
        dvc_data = 3.15.1
        dvc_objects = 5.1.0
        dvc_render = 1.0.2
        dvc_task = 0.4.0
        scmrepo = 3.3.2
Supports:
        azure (adlfs = 2024.4.1, knack = 0.11.0, azure-identity = 1.16.0),
        http (aiohttp = 3.9.5, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.9.5, aiohttp-retry = 2.8.3)
Config:
        Global: /home/raw/.config/dvc
        System: /etc/xdg/dvc
Cache types: symlink
Cache directory: ext4 on /dev/sdb1
Caches: local
Remotes: azure
Workspace directory: ext4 on /dev/nvme0n1p3
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/1c67604caf3156e9fea7df37dda80d5f

Additional Information (if any):

@shcheklein shcheklein added the A: experiments Related to dvc exp label May 19, 2024
@dberenbaum dberenbaum added bug Did we break something? p2-medium Medium priority, should be done, but less important labels May 20, 2024
@nablabits

Hi @shcheklein & @dberenbaum, this issue looks like a great learning opportunity. Are you happy for me to pick it up (assuming @RaW-Git is not interested in taking it themselves)?

@shcheklein (Member)

@nablabits sure! please give it a try.


nablabits commented Jun 12, 2024

Hi @shcheklein, I've tried this and it appears to me that we fixed it in 3.50.2, but it's not clear to me how (diff).

I tested this with the examples repository, this is what I did:

  • Enqueue a couple of jobs: dvc exp run --queue --set-param "train.batch_size=16,24"
  • Check the status: they appeared as queued
  • Run them: dvc queue start -j 2
  • Check the status: in 3.50.1 I saw what @RaW-Git described, but in 3.50.2 I didn't.

Am I missing something?

@shcheklein (Member)

@nablabits how do you install it (I mean DVC)? Are you using a virtualenv, and are you completely destroying it between runs? Also, the repo state: are you running it on the same state or a clean one? Just to make sure.

@nablabits

@shcheklein well, I just cloned the examples repository and installed the requirements in a virtual environment as explained in the readme. Then, I ran through this section of the documentation to get familiar with the process.

After running it for the first time, I realised I could try to reproduce the issue with dvc exp run --queue --set-param "train.batch_size=16,24", and found that it was behaving OK for me.

So, after that, I downgraded DVC to 3.50.1 with pip install dvc==3.50.1 and confirmed that the issue was there. I then upgraded again to the latest 3.51.2 just to double-check, and the issue was not there.

Looking at the diff between both tags didn't reveal anything obvious to me, so I set out to find the tag that solved the issue, which happily turned out to be the very next one, 3.50.2. But maybe what you say about the cleanliness of the repo state had something to do with it 🤔

Let me know what you think. In the meantime I will run a check with a fully clean repo pinned to 3.50.2 to rule out that scenario. 🙂

@nablabits

Just a quick update on this: I have run a fair number of experiments on the same version (3.50.1) and found that the issue sometimes appears and sometimes doesn't. I'll keep investigating until I can reproduce the error consistently.

@nablabits

Another quick update so this doesn't fall into oblivion. I've been creating a bunch of experiments on my local build of DVC at tag 3.50.1 using a basic sklearn algorithm, but I couldn't reproduce the issue. Next up I will try to build something more sophisticated using deep learning, to rule out the possibility that the GPU is somehow spoiling the party behind the scenes.

@nablabits

I've finally managed to reproduce the issue consistently at will. This is what I did:

dvc exp run --queue --set-param "train.fine_tune_args.epochs=20,21" && \
dvc queue start -j 2
  • This first run should show the active workers as usual.
  • Clean the run cache, so you can run the same experiment without retrieving a cached copy, then enqueue and run the experiment again:
rm -rf .dvc/cache/runs && \
dvc exp run --queue --set-param "train.fine_tune_args.epochs=20,21" && \
dvc queue start -j 2
  • This second run will reproduce the error. What's more, subsequent runs will alternate between working fine and raising the error.

Next up: I've checked dvc-exp-worker-1.out and found something promising. In failed runs we get a

[2024-07-11 11:28:09,150: WARNING/MainProcess] Got shutdown from remote

before the task succeeds, but the task still seems to be there, as it will eventually succeed, which agrees with what was stated in the main description. In non-failing runs, this Got shutdown from remote happens after the task has succeeded. This may well be the key to the problem, as we retrieve these active workers by means of celery.control.inspect, which appears to use this remote under the hood. So I will investigate that avenue.
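
For reference, a minimal sketch of what that inspection amounts to, using Celery's public control/inspect API. The app setup and broker options below are illustrative assumptions, not DVC's actual wiring:

    from celery import Celery

    # Illustrative setup only: DVC/dvc-task wire this up internally.
    # The broker directory and transport options here are assumptions.
    app = Celery("dvc-exp", broker="filesystem://")
    app.conf.broker_transport_options = {
        "data_folder_in": ".dvc/tmp/exps/celery/broker/in",
        "data_folder_out": ".dvc/tmp/exps/celery/broker/in",
        "processed_folder": ".dvc/tmp/exps/celery/broker/processed",
        "store_processed": True,
    }

    # inspect() broadcasts a request over the pidbox exchange and collects
    # replies; a worker that has already accepted a shutdown request may no
    # longer reply, so it silently drops out of this view.
    replies = app.control.inspect(timeout=1.0).active() or {}
    for worker_name, tasks in replies.items():
        print(worker_name, "->", len(tasks), "active task(s)")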

@nablabits

Hi there, for a few weeks I couldn't work on this issue, but this week has been somewhat fruitful.

TL;DR

  • It seems that something is left behind after one command runs, and it triggers a shutdown during the next run
  • Running dvc exp clean before enqueuing and running a new experiment fixes the issue

pidbox Log

After enabling debug logs for Celery, I realised that working runs got this after the task is accepted:

pidbox received method active(safe=None) [reply_to:{'exchange': 'reply.celery.pidbox', 'routing_key': '6b303b21-8b49-3917-a76b-0b9636a45c9e'} ticket:206dc759-5f0a-42f4-b14f-da3595117808]

Whereas faulty ones got:

[2024-08-01 08:35:13,267: DEBUG/MainProcess] pidbox received method shutdown() [reply_to:None ticket:None]

This is not because of ThreadPoolExecutor or CmdQueueStop either; those are the two explicit places where shutdown() is called.

Clean Experiments

I also realised that running dvc exp clean before running experiments solves the issue, so this may be a potential workaround. In the documentation I note that "This is done automatically when running queued experiments", which led me to play a bit with this --clean flag (source). Unfortunately, adding it selectively to one worker, or even to all of them, didn't seem to fix the issue. Still, I will keep investigating along this line unless someone advises otherwise.

Thanks for the patience 🙏

@shcheklein (Member)

thanks for the update @nablabits !

@nablabits

Just another brief update here

TL;DR on what We Have So Far

  • This is potentially an issue in dvc-task, as clean seems not to clean everything, or something is still created after clean is called. Removing the conditional logic so that this clean always runs, and putting a sleep(60) here, fixes the issue. But of course we don't want such a sloppy solution 😉
  • Specifically, there's a directory that controls what is to be processed, .dvc/tmp/exps/celery/broker/in/, and there are residual files that remain there after the tasks are done. Removing these files solves the issue (see the small listing sketch right after this list).
  • I also noticed that the issue only takes place when one runs a couple of jobs with -j 2; if one enqueues two experiments but only selects one job, it behaves normally. Running with one job still creates those residuals in in/, so I'm exploring the differences between the two approaches.
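
As a side note, here is a tiny stdlib-only sketch for eyeballing those leftovers. It is not part of any fix; it just lists whatever files are still sitting in that in/ directory:

    from datetime import datetime
    from pathlib import Path

    # Directory named in the bullet above (adjust if your layout differs).
    broker_in = Path(".dvc/tmp/exps/celery/broker/in")

    if broker_in.exists():
        # Print any message files still left behind after a run, oldest first.
        leftovers = sorted(
            (p for p in broker_in.iterdir() if p.is_file()),
            key=lambda p: p.stat().st_mtime,
        )
        for msg in leftovers:
            mtime = datetime.fromtimestamp(msg.stat().st_mtime)
            print(f"{mtime:%Y-%m-%d %H:%M:%S}  {msg.name}")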

Shared Data Directories

One of the things I tried was to set a different out directory here, to check whether that had something to do with it and to debug the inner workings of kombu and dvc-task, but this created all sorts of issues.

I have more personal notes in case anyone is interested, but the above are the main findings.

Thanks for the patience 🙏


nablabits commented Sep 1, 2024

Hi @shcheklein, I finally got to the bottom of this issue 🐌. The situation is as follows:

  • Both workers will put shutdown messages for both queues once the queues are empty, that is, 4 shutdown files in total.
  • Each worker will pick up the first set of shutdown messages, only processing the one relevant to its own queue.
  • These shutdown messages are then moved to processed once they are done.
  • When the clean kicks in, it finds the other pair of shutdown messages still to be cleaned up in the in directory.
  • _gc won't clean them, as _delete_expired will detect that the message is not expired yet (source). These are two sample timestamps from one of the experiments I ran: 1725093643.3807564 > 1725093642.829151, where we can see that the expiration is about 1" in the future, which is precisely the timeout that this line of code adds when the message is created.
  • _clean_pidbox won't clean them either, as the shutdown messages belong to the celery.pidbox exchange, whereas the exchange being cleaned is reply.celery.pidbox.
  • In the second iteration, each queue will start off with the shutdown messages left over from the previous one, reproducing the issue, which is in agreement with what I described a few weeks ago.
  • dvc exp clean will naturally clean the files, as by then it finds them expired, which is also in agreement.

So, the obvious solution is to add that extra second to this check, although I'm unsure of the side effects that might have.
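
To make the timing problem concrete, here is a toy model (deliberately simplified, not the actual dvc-task/kombu code) of why the clean-on-shutdown pass skips these messages and what a couple of seconds of grace would change:

    import time

    KOMBU_FUTURE_OFFSET = 1.0  # kombu stamps message expiry about 1" ahead

    def make_shutdown_message():
        # Simplified stand-in for a leftover pidbox shutdown message on disk.
        return {"expires": time.time() + KOMBU_FUTURE_OFFSET}

    def gc_can_delete(msg, grace=0.0):
        # Toy version of the "is it expired yet?" check: the GC only deletes
        # a message once its expiration has passed. A grace window treats
        # "expires within the next `grace` seconds" as already expired.
        return msg["expires"] <= time.time() + grace

    msg = make_shutdown_message()
    print(gc_can_delete(msg))             # False: clean-on-shutdown skips it
    print(gc_can_delete(msg, grace=2.0))  # True: an extra 2" lets GC remove it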

Do you want me to open an issue in dvc-task for that?

nablabits added a commit to nablabits/dvc-task that referenced this issue Sep 11, 2024
As per iterative/dvc#10427 investigation we realised that kombu creates expiration dates for messages 1" in the future which affects cleaning the directories on shutdown.
@shcheklein (Member)

@nablabits excellent research, thanks, and sorry for the delay. Before I can review the PR, could you remind me / give a little bit more detail (I need this to refresh my knowledge about this part of the code - it seems you now have more knowledge):

Both workers will put shutdown messages for both queues once the queues are empty, that is, 4 shutdown files in total.

Why do we create 4 messages (not 2)? Can you point to the code as well?

Also, why do we use two queues? (Why can't it be a single queue and 2 workers?)

In the second iteration, each queue will start off with the shutdown messages left over from the previous one, reproducing the issue, which is in agreement with what I described a few weeks ago.

How can it execute the jobs then?

@nablabits

Hey @shcheklein, hope you are doing great.

it seems you now have more knowledge

Ha! Don't get too excited 🙃

Also, why do we use two queues? (Why can't it be a single queue and 2 workers?)

Ahh, sorry, I may have been sloppy, maybe because the terminology is confusing: put talks about queues, which as far as I can tell are used interchangeably with exchanges (source).
Let me try to unpack this to the best of my knowledge. When we queue jobs, they all effectively go to the same data_folder_in, which is then iterated in order of arrival (source). This can effectively be understood as a single queue; that's why all the messages prepend a sort of timestamp to their filenames (source). However, when iterating over the messages, each worker only acts on its own messages and overlooks everything else.
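
If it helps, here is a toy model of that idea (mimicking the concept, not kombu's actual file naming): messages are files whose names start with an arrival stamp, so a sorted directory listing recovers arrival order, and each worker filters on the queue name embedded in the filename.

    import time
    import uuid
    from pathlib import Path

    # Toy broker directory standing in for data_folder_in.
    data_folder_in = Path("/tmp/toy-broker-in")
    data_folder_in.mkdir(exist_ok=True)

    def put(queue_name, body):
        # Arrival-order prefix plus the target queue name in the filename
        # (invented format, only to illustrate the concept).
        stamp = time.time_ns()
        name = f"{stamp}_{uuid.uuid4().hex}.{queue_name}.msg"
        (data_folder_in / name).write_text(body)

    put("dvc-exp-worker-1", "run experiment A")
    put("dvc-exp-worker-2", "run experiment B")

    my_queue = "dvc-exp-worker-1"                   # what one worker consumes
    for path in sorted(data_folder_in.iterdir()):   # "order of arrival"
        queue_name = path.name.rsplit(".", 2)[-2]   # embedded queue name
        if queue_name == my_queue:                  # overlook other queues
            print("processing:", path.read_text())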

Why do we create 4 messages (not 2)? Can you point to the code as well?

Well, I hadn't gone so far as to nail down the cause of the 4 messages, but I've been researching it now, and it appears to me that there are three exchanges, each with a different exchange_type:

  • celery, type direct
  • celery.pidbox, type fanout
  • reply.celery.pidbox, type direct

The second one is the exchange in charge of putting the shutdown messages (and, in general, every other control message, as it feels like the exchange bound to the actual workers), and it delivers to every worker because it uses _put_fanout (source). The iterator we see in that method lists both workers' mailboxes.

And this is because, when the second exchange is created here, it uses a fanout mailbox as per this definition. I have the call stacks of the three exchanges if you want more details.
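
In other words, a toy model of the arithmetic (not the real code): each of the two workers broadcasts one shutdown over the fanout pidbox, and the fanout copies it into every worker's mailbox, hence 4 files.

    # Toy model of why 2 workers broadcasting "shutdown" over a fanout
    # exchange leave 4 message files behind (names are illustrative).
    workers = ["dvc-exp-worker-1", "dvc-exp-worker-2"]
    mailboxes = {w: [] for w in workers}  # one pidbox mailbox per worker

    def broadcast_shutdown(sender):
        # A fanout exchange copies the message into every mailbox,
        # including the sender's own.
        for w in workers:
            mailboxes[w].append(f"shutdown (sent by {sender})")

    for w in workers:  # each worker broadcasts once when its queue runs dry
        broadcast_shutdown(w)

    total = sum(len(msgs) for msgs in mailboxes.values())
    print(total)  # 4; each worker consumes one, so two files stay in in/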

Hope all this makes sense to you

@shcheklein (Member)

Thanks! I'll take a closer look when I have some time. A few questions:

  • Each worker runs self.app.control.shutdown(), which creates 2 messages (one per worker), so 4 in total - is that correct?
  • Is there a way to pass a specific worker name into the shutdown, something like:

self.app.control.shutdown(destination=nodename) ?


nablabits commented Sep 25, 2024

Hi @shcheklein

  • Yes, and
  • Yes

I've just tested your proposal and it worked like a charm ✨, thanks for the suggestion. I will update #10552 and iterative/dvc-task#142 accordingly.
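
For the record, roughly what the change boils down to, as a sketch under the assumption of an app already wired to the queue's broker (the node name is illustrative, not dvc-task's actual value):

    from typing import Optional
    from celery import Celery

    def shutdown_workers(app: Celery, nodename: Optional[str] = None) -> None:
        # Sketch of the two shutdown styles discussed above.
        if nodename is None:
            # Broadcast: travels over the fanout pidbox, so every worker's
            # mailbox gets a copy. With 2 workers each doing this when their
            # queue drains, that is the 4 shutdown files discussed above.
            app.control.shutdown()
        else:
            # Targeted: only the named worker's mailbox receives the message,
            # so no stray shutdown file is left for the other worker.
            app.control.shutdown(destination=[nodename])

    # e.g. shutdown_workers(app, nodename="celery@dvc-exp-worker-1")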

Do we want to do something about the extra 2" as well, given that the messages kombu creates expire 1" in the future? Not a big deal in itself, but it may save a few headaches in the future, as debugging these messages is a bit time-consuming.

nablabits added a commit to nablabits/example-get-started-experiments that referenced this issue Sep 25, 2024
nablabits added a commit to nablabits/dvc-task that referenced this issue Sep 25, 2024
@shcheklein (Member)

Do we want to do something about the extra 2" as well, given that the messages kombu creates expire 1" in the future? Not a big deal in itself, but it may save a few headaches in the future, as debugging these messages is a bit time-consuming.

Good question. What are the scenarios when we run it (I mean gc)?

@nablabits

What are the scenarios when we run it (I mean gc)?

Pretty limited, to be fair: just when we call clean (source), which happens automatically on shutdown (source) or manually via dvc exp clean. The latter will rarely be affected, as one would have to run it within 1" of the message being created (last bullet point here).

@shcheklein (Member)

which happens automatically on shutdown (source)

Does it happen after the message is sent? (The link to the source code is probably wrong.)

@nablabits

Yes, sorry for the lack of clarity. I actually meant that link to point to where the shutdown starts; a number of things happen between the worker sending shutdown (source) and clean actually being called (source), but in a nutshell:

  • shutdown is called
  • Shutdown messages are created in the in directory
  • clean is called

I can provide the clean call stack and the logs between these two events if you need them.

@shcheklein (Member)

thanks @nablabits !

So, correct me if I'm wrong - AFAIU each worker pretty much calls clean when it's done (after it has consumed the shutdown message)? Or is it only a single worker?

My concern here is that if we make the GC a bit more aggressive, is there a chance that some legitimate shutdown messages get cleaned up before the workers have time to consume them?

Why can't we run the cleanup when we start the queue again (0 workers and we launch a new one)?


nablabits commented Oct 2, 2024

Hi @shcheklein, sorry for the late reply.

~~My understanding is that only the first worker will call clean as per this line~~

Edit from 03/10: the struck-through text may not be quite right, as dvc exp clean cleans everything that is expired. The follow-up investigation may shed some more light on this.
Your concern makes sense to me, but not for shutdown messages, as a worker can't delete messages belonging to the other worker because of the queues argument (source). What may happen is that some message gets into the queue after the shutdown message is created, due to a race condition, and clean will wipe it. I vaguely remember this happening in some run, but take it with a grain of salt, as the logs at that point are a bit messy and I didn't get into the weeds of checking whether the actual experiment was successfully carried out.

Cleaning the queues before starting is something I have also thought about, but it may require some more investigation, which I'm more than happy to carry out. Would it be worth opening a separate issue for that, so we can ship the fix for the current one?

@shcheklein (Member)

Yes, agreed, let's ship the simplest version (without modifying the timestamps, GC time thresholds, etc.) and then create a new issue / PR to keep the discussion / research going! Thanks @nablabits.

@nablabits

Hi @shcheklein,

I've updated the PR and added that follow-up ticket. As I said, I'll be happy to carry out that investigation, as I'm greatly enjoying this project, but I see there are some P1s that you folks may want to prioritise, so let me know your thoughts 🙂

shcheklein pushed a commit to iterative/dvc-task that referenced this issue Oct 3, 2024
* dvc-10427 Add 2" delay to expirations check

As per iterative/dvc#10427 investigation we realised that kombu creates expiration dates for messages 1" in the future which affects cleaning the directories on shutdown.

* Add a tip to run a single test to the documentation

* Exclude pyenv virtual environments

* dvc-10427 Send the shutdown message to the appropriate worker.

Please check: iterative/dvc#10427 (comment)

* Revert "dvc-10427 Add 2" delay to expirations check"

This reverts commit f15acd7 as per iterative/dvc#10427 (comment)
@shcheklein (Member)

Closed by iterative/dvc-task#142
