-
Notifications
You must be signed in to change notification settings - Fork 840
[HfFileSystem] improve cache for multiprocessing fork and multithreading #3500
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
177b790 to
82a58ee
Compare
|
@lhoestq , I started to have a look at this PR which looks good overall. However would you be up for using |
|
Sure let me try that |
|
Let me know if it looks good to you, I'll update the blog post once it's out This is also affecting the FAILED tests/test_upstream_hub.py::TestPushToHub::test_push_dataset_dict_to_hub_num_proc - httpx.ReadError: [SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:2651)
FAILED tests/test_iterable_dataset.py::test_iterable_dataset_from_hub_torch_dataloader_parallel[2] - ConnectionError: Couldn't reach 'hf-internal-testing/dataset_with_data_files' on the Hub (LocalEntryNotFoundError)
FAILED tests/test_upstream_hub.py::TestPushToHub::test_push_dataset_dict_to_hub_iterable_num_proc - httpx.ReadError: [Errno 104] Connection reset by peer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Finally got some time to review this PR. I'm fine with current logic but left some comments about implementation details.
All of these fsspec hacks and magic behavior comfort me in the fact that we should be vague on the "overhead" that the framework brings compared to HfApi. Definitely true that in some cases it's more efficient (e.g. streaming?) but there is a lot of complexity hidden between fsspec's implementation and ours. Hence better to be vague rather than exhaustively listing differences between HfApi and HfFileSystem. (related to #3177 (comment))
Co-authored-by: Lucain <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Looks good :)
|
I took all your comments into account :) |
The "fork" start method in multiprocessing doesn't work well with the instances cache.
Indeed contrary to "spawn" which pickles the instances and repopulates the cache, "fork" doesn't repopulate the instances. However "fork" does keep the old cache but it's unable to reuse the old instances because the
fs_tokenused to identify instances changes in subprocesses.I fixed that by making
fs_tokenindependent from the process id. This implied improving a fsspec metaclass 😬Finally I improved the multithreading case: it can now reuse the cache from the main thread in a new instance.
Minor: I needed to make HfHubHTTPError picklable / deepcopyable.
TODO:
related to #3443
This improvement will avoid calling the API again in DataLoader workers when using "fork" and reuse the data files list from the parent process (see https://huggingface.co/blog/streaming-datasets)