exp-workers failing without logs #10673

Open
@nv-pipo

Bug Report

DVC EXP workers dying

Running multiple workers results in failed experiments and no logs

Description

Launching dvc queue start with -j greater than 1 causes some experiments to fail that should not fail, and the failed experiments have no logs. In addition, the exp-worker sometimes dies along with the failed experiments.

Reproduce

Example:

params.yaml

value: 1

dvc.yaml

stages:
  experiment_candles:
    cmd: sleep 5 ; echo DONE
    params:
      - params.yaml:
  1. git init
  2. dvc init
  3. Copy dvc.yaml
  4. Copy params.yaml
  5. git add *.yaml
  6. git commit -m "initial commit"
  7. Queue experiments
dvc exp run \
    --queue \
    --set-param "value=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99"
  8. Start 20 jobs
dvc queue start -j 20
  9. Check for failed jobs
dvc queue status | grep Failed
  10. Check logs of failed jobs
dvc queue logs ...
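
The last two steps can be combined into a small helper that looks up the log of every failed task. This is only a sketch: it assumes that dvc queue status prints one row per task, with the task ID in the first column and the status in the last column. The function names failed_tasks and show_failed_logs are hypothetical.

```shell
# Print the task IDs of failed experiments, one per line.
# Reads `dvc queue status` output on stdin.
# Assumption: task ID is the first field, status is the last field.
failed_tasks() {
    awk '$NF == "Failed" { print $1 }'
}

# Show the log of every failed task, with a header line per task.
show_failed_logs() {
    dvc queue status | failed_tasks | while read -r task; do
        echo "=== $task ==="
        dvc queue logs "$task"
    done
}
```

Running show_failed_logs inside the repository should make it easy to see which failed tasks produce an empty log.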

Note: it doesn't always fail, so you may have to repeat the process starting at step 7.
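
Because the failure is intermittent, repeating steps 7 to 9 by hand gets tedious. The sketch below wraps them in two functions. It assumes it is run inside the repository created in steps 1 to 6, and that tasks still in progress show up as Queued or Running in dvc queue status. The function names, the 5-second poll interval and the 120-poll limit are arbitrary choices for illustration.

```shell
# Build the comma-separated list "first,first+1,...,last"
# so the 100 values do not have to be typed by hand.
build_values() {
    i=$1
    out=$1
    while [ "$i" -lt "$2" ]; do
        i=$((i + 1))
        out="$out,$i"
    done
    printf '%s\n' "$out"
}

# Queue 100 experiments, start $1 workers, and wait for the queue to drain.
# Returns 0 if at least one task is reported as Failed, 2 if a dvc command
# itself could not be run, and non-zero otherwise.
reproduce_once() {
    dvc exp run --queue --set-param "value=$(build_values 0 99)" || return 2
    dvc queue start -j "$1" || return 2
    polls=0
    # Assumption: unfinished tasks are listed as Queued or Running.
    while dvc queue status | grep -Eq 'Queued|Running'; do
        polls=$((polls + 1))
        # Stop waiting after 120 polls in case a dead worker leaves
        # tasks stuck in the Running state.
        [ "$polls" -ge 120 ] && break
        sleep 5
    done
    dvc queue status | grep -q Failed
}
```

Calling reproduce_once 20 repeatedly until it returns 0 corresponds to iterating from step 7 with -j 20.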

Output sample

Image

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 3.59.0 (brew)
--------------------------
Platform: Python 3.13.1 on macOS-15.2-arm64-arm-64bit-Mach-O
Subprojects:
        dvc_data = 3.16.7
        dvc_objects = 5.1.0
        dvc_render = 1.0.2
        dvc_task = 0.40.2
        scmrepo = 3.3.9
Supports:
        azure (adlfs = 2024.12.0, knack = 0.12.0, azure-identity = 1.19.0),
        gdrive (pydrive2 = 1.21.3),
        gs (gcsfs = 2024.12.0),
        hdfs (fsspec = 2024.12.0, pyarrow = 18.1.0),
        http (aiohttp = 3.11.11, aiohttp-retry = 2.9.1),
        https (aiohttp = 3.11.11, aiohttp-retry = 2.9.1),
        oss (ossfs = 2023.12.0),
        s3 (s3fs = 2024.12.0, boto3 = 1.35.93),
        ssh (sshfs = 2024.9.0),
        webdav (webdav4 = 0.10.0),
        webdavs (webdav4 = 0.10.0),
        webhdfs (fsspec = 2024.12.0)
Config:
        Global: /Users/pichurri/Library/Application Support/dvc
        System: /Users/pichurri/homebrew/share/dvc
Cache types: <https://error.dvc.org/no-dvc-cache>
Caches: local
Remotes: None
Workspace directory: apfs on /dev/disk3s3s1
Repo: dvc, git
Repo.site_cache_dir: /Users/pichurri/homebrew/var/cache/dvc/repo/7b5c17002f7a7963a4dc1afee2b961e2
