push: hangs after data transfer to s3 compatible remote #10374

Open
@tanguy-s

Bug Report

push: hangs after data transfer to s3 compatible remote

Description

When pushing files to S3-compatible storage (a remote configured with endpointurl) using:
dvc push -vv --show-stack
or
dvc push -j 32 -vv --show-stack

DVC intermittently hangs after the data has been pushed to storage:

2024-03-26 10:56:55,858 DEBUG: Preparing to transfer data from '/root/workdir/.dvc/cache' to 's3://my-s3-bucket/'
2024-03-26 10:56:55,858 DEBUG: Preparing to collect status from 'my-s3-bucket/'
2024-03-26 10:56:55,858 DEBUG: Collecting status from 'my-s3-bucket/'   
2024-03-26 10:56:55,860 DEBUG: Querying 3 oids via object_exists                 
2024-03-26 10:56:56,544 DEBUG: Querying 0 oids via object_exists                 
2024-03-26 10:56:57,582 DEBUG: Preparing to transfer data from '/root/workdir/.dvc/cache/files/md5' to 's3://my-s3-bucket/files/md5'                   
2024-03-26 10:56:57,583 DEBUG: Preparing to collect status from 'my-s3-bucket/files/md5'
2024-03-26 10:56:57,591 DEBUG: Collecting status from 'my-s3-bucket/files/md5'
2024-03-26 10:56:57,592 DEBUG: Querying 2 oids via object_exists                 
2024-03-26 10:56:58,639 DEBUG: Estimated remote size: 749568 files               
2024-03-26 10:56:58,639 DEBUG: Querying 28882 oids via traverse                  
2024-03-26 10:58:02,859 DEBUG: Preparing to collect status from '/root/workdir/.dvc/cache/files/md5'                                                            
2024-03-26 10:58:02,866 DEBUG: Collecting status from '/root/workdir/.dvc/cache/files/md5'
2024-03-26 10:58:03,559 DEBUG: transfer dir: md5: e73f784c9d7a9d79aa8ddbdef314e12d.dir with 18343 files                                                           
Pushing                                                |0.00 [01:07,     ?file/s]
100%|█████████▉|Pushing to s3              18.3k/18.3k [19:42<00:00,  19.8file/s]

There is no further verbose output after this point. Pressing Ctrl+C consistently produces the following traceback:

2024-03-26 12:00:14,267 ERROR: interrupted by the user                                             
Traceback (most recent call last):                                                                 
  File "/usr/local/lib/python3.10/dist-packages/dvc/repo/push.py", line 144, in push
    push_transferred, push_failed = ipush(
  File "/usr/local/lib/python3.10/dist-packages/dvc_data/index/push.py", line 84, in push
    result = transfer(
  File "/usr/local/lib/python3.10/dist-packages/dvc_data/hashfile/transfer.py", line 224, in transfer
    failed = _do_transfer(
  File "/usr/local/lib/python3.10/dist-packages/dvc_data/hashfile/transfer.py", line 93, in _do_transfer
    dir_fails = _add(src, dest, bound_file_ids, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/dvc_data/hashfile/transfer.py", line 165, in _add
    dest.add(
  File "/usr/local/lib/python3.10/dist-packages/dvc_data/hashfile/db/__init__.py", line 111, in add
    transferred = super().add(
  File "/usr/local/lib/python3.10/dist-packages/dvc_objects/db.py", line 188, in add
    generic.transfer(
  File "/usr/local/lib/python3.10/dist-packages/dvc_objects/fs/generic.py", line 319, in transfer
    copy(
  File "/usr/local/lib/python3.10/dist-packages/dvc_objects/fs/generic.py", line 87, in copy
    return _put(
  File "/usr/local/lib/python3.10/dist-packages/dvc_objects/fs/generic.py", line 172, in _put
    for i, result in enumerate(fut.result()):
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 453, in result
    self._condition.wait(timeout)
  File "/usr/lib/python3.10/threading.py", line 320, in wait
    waiter.acquire()
KeyboardInterrupt
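
The traceback above covers only the main thread, which is waiting on a future; it does not show what the worker threads are doing. As a generic, standard-library illustration (not DVC code), the sketch below dumps the stack of every thread with faulthandler. The names blocked_worker and dump_all_threads are hypothetical, and how to hook such a dump into a running dvc process is left open here.

```python
import faulthandler
import tempfile
import threading

ready = threading.Event()
stop = threading.Event()

def blocked_worker():
    # Hypothetical stand-in for a worker thread stuck waiting on I/O.
    ready.set()
    stop.wait()

def dump_all_threads():
    """Return the current stack of every running thread as text."""
    # faulthandler writes to a file descriptor, so a real temporary file is used.
    with tempfile.TemporaryFile(mode="w+") as f:
        faulthandler.dump_traceback(file=f, all_threads=True)
        f.seek(0)
        return f.read()

worker = threading.Thread(target=blocked_worker, name="worker")
worker.start()
ready.wait()                 # make sure the worker is inside blocked_worker
report = dump_all_threads()  # includes the worker's stack, not just the main thread's
stop.set()                   # release the worker so the script can exit
worker.join()
print(report)
```

In the printed report, the worker thread's frames appear alongside the main thread's, which is the information missing from the Ctrl+C traceback.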

It seems that some futures from the underlying S3FileSystem used by dvc-objects never return, and that they are awaited without a timeout.
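
The blocking pattern in the traceback (fut.result() ending in waiter.acquire()) can be reproduced without DVC or S3. The following is a minimal standard-library sketch, not the dvc-objects implementation: stuck_upload is a hypothetical stand-in for an upload whose response never arrives, and wait_with_timeout is an illustrative helper. It shows that Future.result() waits indefinitely when no timeout is given, whereas passing a timeout raises concurrent.futures.TimeoutError.

```python
import concurrent.futures
import threading

# Hypothetical stand-in for an upload whose response never arrives.
release = threading.Event()

def stuck_upload():
    release.wait()  # blocks until the event is set
    return "done"

def wait_with_timeout(fut, seconds):
    """Return the future's result, or None if it is not ready in time."""
    try:
        return fut.result(timeout=seconds)
    except concurrent.futures.TimeoutError:
        return None

with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
    fut = pool.submit(stuck_upload)
    # A bare fut.result() here would block for as long as the worker is stuck.
    first = wait_with_timeout(fut, 0.2)
    release.set()  # unblock the worker so the pool can shut down
    second = wait_with_timeout(fut, 5)

print(first, second)
```

A timeout only lets the caller regain control; the stuck worker itself keeps running until whatever it is waiting on completes or fails.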

Running dvc push -j 1 -vv works, but status collection seems considerably slower (approx. 30 min for 18k files).

Environment information

x86 Ubuntu 22.04 Docker

aiobotocore==2.12.1
awscli==1.32.51
awscli-plugin-endpoint==0.4
boto3==1.34.51
botocore==1.34.51
dvc==3.36.0
dvc-data==3.2.0
dvc-http==2.32.0
dvc-objects==3.0.6
dvc-render==1.0.1
dvc-s3==3.1.0
dvc-studio-client==0.20.0
dvc-task==0.4.0
fsspec==2024.3.1
s3fs==2024.3.1
s3transfer==0.10.1

Output of dvc doctor:

DVC version: 3.36.0 (pip)
-------------------------
Platform: Python 3.10.12 on Linux-6.5.0-26-generic-x86_64-with-glibc2.35
Subprojects:
	dvc_data = 3.2.0
	dvc_objects = 3.0.6
	dvc_render = 1.0.1
	dvc_task = 0.4.0
	scmrepo = 2.1.1
Supports:
	http (aiohttp = 3.9.3, aiohttp-retry = 2.8.3),
	https (aiohttp = 3.9.3, aiohttp-retry = 2.8.3),
	s3 (s3fs = 2024.3.1, boto3 = 1.34.51)
Config:
	Global: /root/.config/dvc
	System: /etc/xdg/dvc
Cache types: <https://error.dvc.org/no-dvc-cache>
Caches: local
Remotes: s3, s3
Workspace directory: overlay on overlay
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/da3f4f6485fee7c550b5a6ccb3e96e47

Same issue using the latest DVC 3.48.

Output of dvc config -l:

remote.dvc-cache.url=s3://my-s3-bucket/
remote.dvc-cache.endpointurl=https://endpoint.url
remote.dvc-cache.profile=scw
remote.scans.url=s3://my-s3-bucket-2/
remote.scans.endpointurl=https://endpoint.url
remote.scans.profile=scw
cache.type=hardlink
core.remote=dvc-cache
core.hardlink_lock=false

Labels: A: data-sync, bug, triage
