Conversation


@lhoestq lhoestq commented Oct 29, 2025

The "fork" start method in multiprocessing doesn't work well with the instances cache.

Indeed contrary to "spawn" which pickles the instances and repopulates the cache, "fork" doesn't repopulate the instances. However "fork" does keep the old cache but it's unable to reuse the old instances because the fs_token used to identify instances changes in subprocesses.
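
As an illustration (a minimal, self-contained script, not the actual fsspec/HfFileSystem internals), a module-level cache survives in a forked child but starts empty in a spawned one:

```python
import multiprocessing as mp

_CACHE = {}  # stands in for the instances cache

def report(method):
    print(method, "-> cache keys in child:", list(_CACHE))

if __name__ == "__main__":
    _CACHE["token-from-parent"] = object()
    for method in ("fork", "spawn"):  # "fork" is only available on Unix
        ctx = mp.get_context(method)
        p = ctx.Process(target=report, args=(method,))
        p.start()
        p.join()
    # "fork" prints the parent's key (inherited memory); "spawn" prints an
    # empty cache because the module is re-imported in the child and the
    # __main__ guard above doesn't run there.
```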

I fixed that by making the fs_token independent of the process id. This required improving an fsspec metaclass 😬

Finally, I improved the multithreading case: an instance created in a new thread can now reuse the cache from the main thread.
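
To illustrate the idea (a rough sketch with made-up names like CachedMeta and DummyFileSystem, not the actual fsspec/HfFileSystem code): the cache key is derived only from the class and the constructor arguments, never from the process or thread id, so forked processes and new threads hit the same entries:

```python
import json
import threading

class CachedMeta(type):
    """Illustrative caching metaclass with a process/thread-independent token."""

    _cache = {}
    _lock = threading.Lock()

    def __call__(cls, *args, **kwargs):
        # Key on the class and constructor arguments only; mixing in
        # os.getpid() or threading.get_ident() would invalidate the cache
        # after a fork or in a new thread.
        token = json.dumps(
            [cls.__qualname__, list(args), sorted(kwargs.items())], default=str
        )
        with CachedMeta._lock:
            if token not in CachedMeta._cache:
                instance = super().__call__(*args, **kwargs)
                instance.fs_token = token
                CachedMeta._cache[token] = instance
            return CachedMeta._cache[token]

class DummyFileSystem(metaclass=CachedMeta):
    def __init__(self, endpoint="https://huggingface.co", token=None):
        self.endpoint = endpoint
        self.token = token

# Same arguments -> same token -> same cached instance, in any process/thread
assert DummyFileSystem() is DummyFileSystem()
```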

Minor: I needed to make HfHubHTTPError picklable / deepcopyable.
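
For reference, one common way to make an exception with extra constructor arguments survive pickle and deepcopy is to define __reduce__ (sketch with a made-up subclass; the real HfHubHTTPError signature may differ):

```python
import copy
import pickle

class CustomHTTPError(Exception):
    """Illustrative stand-in, not the actual HfHubHTTPError."""

    def __init__(self, message, server_message=None):
        super().__init__(message)
        self.server_message = server_message

    def __reduce__(self):
        # Rebuild the exception from the class and its constructor arguments,
        # so pickle and deepcopy (which goes through __reduce_ex__) don't
        # depend on how the base exception stores its state.
        return (self.__class__, (self.args[0], self.server_message))

err = CustomHTTPError("404 Client Error", server_message="Repo not found")
assert pickle.loads(pickle.dumps(err)).server_message == "Repo not found"
assert copy.deepcopy(err).server_message == "Repo not found"
```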

TODO:

  • tests

related to #3443

This improvement avoids calling the API again in DataLoader workers when using "fork": they can reuse the data files list from the parent process (see https://huggingface.co/blog/streaming-datasets)
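
As a usage sketch (the repo path is just an example, and the cache-hit behavior described in the comments assumes this PR's changes), a forked worker is expected to get the parent's cached HfFileSystem back:

```python
import multiprocessing as mp
from huggingface_hub import HfFileSystem

REPO_PATH = "datasets/hf-internal-testing/dataset_with_data_files"  # example path

def worker():
    fs = HfFileSystem()
    # With a process-independent fs_token, this resolves to the parent's cached
    # instance, so the listing below should come from memory, not a new API call.
    fs.ls(REPO_PATH)

if __name__ == "__main__":
    HfFileSystem().ls(REPO_PATH)  # populate the cache in the parent process
    p = mp.get_context("fork").Process(target=worker)  # "fork" is Unix-only
    p.start()
    p.join()
```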

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@lhoestq lhoestq changed the title [HfFileSystem] improve cache for multiprocessing fork [HfFileSystem] improve cache for multiprocessing fork and multithreading Oct 29, 2025
@lhoestq lhoestq marked this pull request as ready for review October 30, 2025 13:03
@lhoestq lhoestq force-pushed the improve-hffs-cache-for-mp-fork branch from 177b790 to 82a58ee on October 30, 2025 15:01
@lhoestq lhoestq requested a review from Wauplin October 30, 2025 16:10

Wauplin commented Oct 31, 2025

@lhoestq, I started to have a look at this PR, which looks good overall. However, would you be up for using __new__ instead of a metaclass? (A metaclass feels a bit old school to me.)
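
For illustration, a __new__-based cache could look roughly like this (illustrative names, not a concrete patch proposal):

```python
class CachedFS:
    _cache = {}

    def __new__(cls, *args, **kwargs):
        # Cache instances keyed by class and constructor arguments.
        key = (cls, args, tuple(sorted(kwargs.items())))
        if key not in cls._cache:
            cls._cache[key] = super().__new__(cls)
        return cls._cache[key]

    def __init__(self, endpoint="https://huggingface.co"):
        self.endpoint = endpoint

assert CachedFS() is CachedFS()
```

One practical difference: with __new__-based caching, __init__ still re-runs on every call, whereas a metaclass __call__ can return the cached instance without re-initializing it.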


lhoestq commented Oct 31, 2025

Sure, let me try that.
edit: actually this may not be ideal, given that we need to "cancel" the AbstractFileSystem metaclass anyway, and we probably want to stay aligned with how AbstractFileSystem works for ease of maintenance.


lhoestq commented Nov 3, 2025

Let me know if it looks good to you; I'll update the blog post once this is out.

This is also affecting the datasets CI:

FAILED tests/test_upstream_hub.py::TestPushToHub::test_push_dataset_dict_to_hub_num_proc - httpx.ReadError: [SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:2651)
FAILED tests/test_iterable_dataset.py::test_iterable_dataset_from_hub_torch_dataloader_parallel[2] - ConnectionError: Couldn't reach 'hf-internal-testing/dataset_with_data_files' on the Hub (LocalEntryNotFoundError)
FAILED tests/test_upstream_hub.py::TestPushToHub::test_push_dataset_dict_to_hub_iterable_num_proc - httpx.ReadError: [Errno 104] Connection reset by peer

@Wauplin Wauplin left a comment

Finally got some time to review this PR. I'm fine with the current logic but left some comments about implementation details.

All of these fsspec hacks and magic behaviors reinforce my feeling that we should stay vague about the "overhead" the framework brings compared to HfApi. It's definitely true that in some cases it's more efficient (e.g. streaming?), but there is a lot of complexity hidden between fsspec's implementation and ours. Hence it's better to stay vague rather than exhaustively list the differences between HfApi and HfFileSystem. (related to #3177 (comment))

@Wauplin Wauplin left a comment

Thanks! Looks good :)


lhoestq commented Nov 4, 2025

I took all your comments into account :)

@lhoestq lhoestq merged commit d116de0 into main Nov 4, 2025
32 of 39 checks passed
@lhoestq lhoestq deleted the improve-hffs-cache-for-mp-fork branch November 4, 2025 10:48