Bug Report
DVC EXP workers dying
Running multiple workers results in failed experiments and no logs
Description
Launching dvc queue start with the -j parameter set to a value greater than 1 causes some experiments to fail that shouldn't, and these experiments have no logs. Furthermore, the exp-worker sometimes dies along with the failed experiments.
Reproduce
Example:
params.yaml
value: 1
dvc.yaml
stages:
  experiment_candles:
    cmd: sleep 5 ; echo DONE
    params:
      - params.yaml:
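For the "Copy dvc.yaml" and "Copy params.yaml" steps, the two files can also be written from a shell. This is only a sketch: write_repro_files is a hypothetical helper name, and the YAML indentation is an assumption based on the usual dvc.yaml layout.

```shell
#!/usr/bin/env bash
# Sketch: write the two files from this report into a directory.
# write_repro_files is a hypothetical helper, not part of DVC.
set -euo pipefail

write_repro_files() {
  local dir="${1:-.}"
  mkdir -p "$dir"

  # params.yaml: the single parameter that the queued experiments override
  cat > "$dir/params.yaml" <<'EOF'
value: 1
EOF

  # dvc.yaml: one stage that sleeps and tracks all of params.yaml
  # (indentation assumed, following the usual dvc.yaml layout)
  cat > "$dir/dvc.yaml" <<'EOF'
stages:
  experiment_candles:
    cmd: sleep 5 ; echo DONE
    params:
      - params.yaml:
EOF
}

# Only write files when a target directory is passed explicitly
if [ "$#" -gt 0 ]; then
  write_repro_files "$1"
fi
```

Saved under any name, the script can be run with the repository directory as its only argument, after git init and dvc init.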
1. git init
2. dvc init
3. Copy dvc.yaml
4. Copy params.yaml
5. git add *.yaml
6. git commit -m "initial commit"
7. Queue experiments
   dvc exp run \
     --queue \
     --set-param "value=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99"
8. Start 20 jobs
   dvc queue start -j 20
9. Check for failed jobs
   dvc queue status | grep Failed
10. Check logs of failed jobs
    dvc queue logs ...
Note that it doesn't always fail, so you may have to repeat the procedure starting at step 7.
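Because the failure is intermittent, the queue/start/check part can be wrapped in a loop. The sketch below uses only the dvc commands already shown in this report. The function names, MAX_ROUNDS and MAX_POLLS are hypothetical, and it assumes that unfinished tasks appear as "Queued" or "Running" in the dvc queue status output.

```shell
#!/usr/bin/env bash
# Sketch: repeat the queue/start/check steps until a failed task shows up.
# Function names, MAX_ROUNDS and MAX_POLLS are hypothetical.
set -euo pipefail

# Build "0,1,...,N" for --set-param
value_list() { seq 0 "$1" | paste -sd, -; }

# Count lines containing "Failed" in status text read from stdin
count_failed() { grep -c 'Failed' || true; }

# Assumption: unfinished tasks are listed as "Queued" or "Running"
has_pending() { grep -Eq 'Queued|Running'; }

repro_round() {
  # Queue the experiments and start 20 workers, as in the report;
  # dvc's own output goes to stderr so only the count reaches stdout
  dvc exp run --queue --set-param "value=$(value_list 99)" >&2
  dvc queue start -j 20 >&2

  # Poll a bounded number of times so a dead worker cannot hang the loop
  local polls=0
  while has_pending <<<"$(dvc queue status)" \
      && [ "$polls" -lt "${MAX_POLLS:-120}" ]; do
    sleep 5
    polls=$((polls + 1))
  done

  dvc queue status | count_failed
}

main() {
  local round failed
  for round in $(seq 1 "${MAX_ROUNDS:-5}"); do
    failed="$(repro_round)"
    echo "round $round: $failed failed task(s)"
    if [ "$failed" -gt 0 ]; then
      # Show the failed tasks so their logs can be checked by hand
      dvc queue status | grep 'Failed'
      return 0
    fi
  done
  echo "no failures after ${MAX_ROUNDS:-5} round(s)"
}

# Run the loop only when asked to, e.g. "bash repro.sh run"
if [ "${1:-}" = "run" ]; then
  main
fi
```

The poll limit matters here because a worker that dies can leave tasks in the queue indefinitely, which is part of what this report describes.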
Output sample

Environment information
Output of dvc doctor:
$ dvc doctor
DVC version: 3.59.0 (brew)
--------------------------
Platform: Python 3.13.1 on macOS-15.2-arm64-arm-64bit-Mach-O
Subprojects:
    dvc_data = 3.16.7
    dvc_objects = 5.1.0
    dvc_render = 1.0.2
    dvc_task = 0.40.2
    scmrepo = 3.3.9
Supports:
    azure (adlfs = 2024.12.0, knack = 0.12.0, azure-identity = 1.19.0),
    gdrive (pydrive2 = 1.21.3),
    gs (gcsfs = 2024.12.0),
    hdfs (fsspec = 2024.12.0, pyarrow = 18.1.0),
    http (aiohttp = 3.11.11, aiohttp-retry = 2.9.1),
    https (aiohttp = 3.11.11, aiohttp-retry = 2.9.1),
    oss (ossfs = 2023.12.0),
    s3 (s3fs = 2024.12.0, boto3 = 1.35.93),
    ssh (sshfs = 2024.9.0),
    webdav (webdav4 = 0.10.0),
    webdavs (webdav4 = 0.10.0),
    webhdfs (fsspec = 2024.12.0)
Config:
    Global: /Users/pichurri/Library/Application Support/dvc
    System: /Users/pichurri/homebrew/share/dvc
Cache types: <https://error.dvc.org/no-dvc-cache>
Caches: local
Remotes: None
Workspace directory: apfs on /dev/disk3s3s1
Repo: dvc, git
Repo.site_cache_dir: /Users/pichurri/homebrew/var/cache/dvc/repo/7b5c17002f7a7963a4dc1afee2b961e2