[WIP RFC] OS-specific shared memory #89

Draft

@crusaderky crusaderky commented Mar 21, 2023

This PR is an alternative to the changes to File in #80. You still need the changes to Buffer from #80.

In #80, you have an OS-agnostic memory mapping which sits on an OS-specific tmpfs, which is available on Linux only.
Lifecycle management of the memory is guaranteed by the Nanny, even if the worker dies suddenly.

In this PR, you have OS-specific access to non-POSIX shared memory APIs, which are available on Windows and Linux but not on other POSIX systems (crucially for dask, not on MacOSX).
Unlike multiprocessing.shared_memory, which is a thin wrapper around the POSIX shm_open on all OSes except Windows, these APIs perform reference counting: a shared memory buffer is released automatically once all the processes holding a reference to it have died (gracefully or not).

| Design | #80 | #89 | multiprocessing.shared_memory |
|---|---|---|---|
| Works on Linux | ✔️ | ✔️ | ✔️ |
| Works on Windows | | ✔️ | ✔️ |
| Works on MacOSX | | | ✔️ |
| Could be extended to scatter/gather | ✔️ | ✔️ | ✔️ |
| Free from OS configuration | | ✔️ | |
| Resilient to worker crashes on Linux | ✔️ | ✔️ | |
| Resilient to worker crashes on Windows | n/a | ✔️ | ✔️ |
| Resilient to worker crashes on MacOSX | n/a | n/a | |
| Track total shared memory size on Linux | ✔️ | note 1 | note 2 |
| Track total shared memory size on Windows | n/a | note 2 | note 2 |
| Track total shared memory size on MacOSX | n/a | n/a | note 2 |

Notes

1 You can straightforwardly calculate total shared memory, without duplication, if you know the PIDs of all the workers on the host. The PIDs, in turn, can be figured out without info from the scheduler as long as all worker processes were forked/spawned from the same parent and didn't secede (e.g. as in the dask worker CLI). This requires kernel calls costing O(n), where n is the total number of replicated shared memory buffers on the host, but from early benchmarking it looks fast enough not to be of concern. This is not implemented in this PR (yet?)

2 You could implement OS-agnostic tracking of the total shared memory through a bespoke service (distributed.core.Server) that is informed by the various workers every time they acquire or release a buffer. This service would then report directly to the scheduler via a heartbeat. Since it's just a meter and not what actually holds the references to the memory, you need not worry about race conditions and leaks: workers would asynchronously inform the tracker of any new events, when time allows. This feels like a clean design, although there's legwork involved around deployment (dask worker CLI, LocalCluster, etc. would need to spawn a new Server and inform all workers of the server's address).
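
The bookkeeping such a tracker would need can be sketched in a few lines. This is only the in-memory accounting, not a `distributed.core.Server`: the class name `SharedMemoryMeter`, its methods, and the idea of identifying buffers by an opaque `buffer_id` are all assumptions made for illustration, and the RPC handlers and scheduler heartbeat are left out.

```python
from collections import defaultdict


class SharedMemoryMeter:
    """Count each shared buffer once, however many workers hold it.

    Hypothetical sketch: the meter only observes acquire/release events
    and never owns the memory itself.
    """

    def __init__(self):
        self._sizes = {}  # buffer_id -> size in bytes
        self._holders = defaultdict(set)  # buffer_id -> worker addresses

    def acquire(self, worker, buffer_id, nbytes):
        """Record that `worker` now holds a reference to `buffer_id`."""
        self._sizes[buffer_id] = nbytes
        self._holders[buffer_id].add(worker)

    def release(self, worker, buffer_id):
        """Record that `worker` dropped `buffer_id`; unknown ids are ignored."""
        holders = self._holders.get(buffer_id)
        if holders is None:
            return
        holders.discard(worker)
        if not holders:  # last holder gone: stop counting the buffer
            del self._holders[buffer_id]
            self._sizes.pop(buffer_id, None)

    def total(self):
        """Total bytes of shared memory currently held on the host."""
        return sum(self._sizes.values())
```

Because events may arrive late, the total reported by such a meter is only eventually accurate, which is acceptable for a meter that does not hold any references itself.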

crusaderky commented Mar 21, 2023

CC @jakirkham @martindurant @fjetter

@crusaderky force-pushed the memfd_create branch 2 times, most recently from 17f4768 to 130ba41, on March 24, 2023 11:07
@crusaderky crusaderky self-assigned this Apr 20, 2023