[WIP RFC] OS-specific shared memory #89

crusaderky · 2023-03-21T11:09:22Z

Alternative to File-based shared memory model #80
Alternative to [WIP, RFC] Add sharedmem #86

This PR is an alternative to the changes to File in #80. You still need the changes to Buffer from #80.

In #80, you have an OS-agnostic memory mapping that sits on an OS-specific tmpfs, which is available on Linux only.
Lifecycle management of the memory is guaranteed by the Nanny, even if the worker dies suddenly.

In this PR, you have OS-specific access to a non-POSIX shared memory API, available on Windows and Linux but not on other POSIX systems (crucially for dask, not on MacOSX).
Unlike multiprocessing.shared_memory, which is a thin wrapper around the POSIX shm_open on all OSes except Windows, this API crucially performs reference counting, automatically releasing a shared memory buffer when all the processes holding a reference to it die (gracefully or not).

| Design | #80 | #89 | multiprocessing.shared_memory |
|---|---|---|---|
| Works on Linux | ✔️ | ✔️ | ✔️ |
| Works on Windows | ❌ | ✔️ | ✔️ |
| Works on MacOSX | ❌ | ❌ | ✔️ |
| Could be extended to scatter/gather | ✔️ | ✔️ | ✔️ |
| Free from OS configuration | ❌ | ✔️ | ❌ |
| Resilient to worker crashes on Linux | ✔️ | ✔️ | ❌ |
| Resilient to worker crashes on Windows | n/a | ✔️ | ✔️ |
| Resilient to worker crashes on MacOSX | n/a | n/a | ❌ |
| Track total shared memory size on Linux | ✔️ | ❌¹ | ❌² |
| Track total shared memory size on Windows | n/a | ❌² | ❌² |
| Track total shared memory size on MacOSX | n/a | n/a | ❌² |

Notes

¹ You can straightforwardly calculate total shared memory, without duplication, if you know the PIDs of all the workers on the host. Those PIDs, in turn, are straightforward to figure out without info from the scheduler, as long as all worker processes were forked/spawned from the same parent and didn't secede (e.g. as in the dask worker CLI). This requires kernel calls costing O(n), where n is the total number of replicated shared memory buffers on the host, but from early benchmarking it looks fast enough not to be of concern. This is not implemented in this PR (yet?).

² You could implement OS-agnostic tracking of the total shared memory through a bespoke service (distributed.core.Server) that is informed by the various workers every time they acquire/release a buffer. This service would then communicate directly with the scheduler via a heartbeat. Since it's just a meter and not what actually holds the references to the memory, you need not worry about race conditions and leaks: workers would asynchronously inform the tracker of any new events, when time allows. This feels like a clean design, although there's legwork involved around the deployment (dask worker CLI, LocalCluster, etc. would need to spawn a new Server and inform all workers of the server's address).
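
The bookkeeping such a tracker would need can be sketched in a few lines. This is an illustration only: `SharedMemoryMeter` and its methods are hypothetical names, and the networking part (the Server, the worker notifications, the heartbeat to the scheduler) is deliberately left out.

```python
from collections import defaultdict


class SharedMemoryMeter:
    """Track unique shared buffers reported by workers (hypothetical sketch)."""

    def __init__(self):
        self._sizes = {}                  # buffer_id -> nbytes
        self._holders = defaultdict(set)  # buffer_id -> workers holding it

    def acquire(self, worker, buffer_id, nbytes):
        """Record that a worker now holds a reference to a buffer."""
        self._sizes[buffer_id] = nbytes
        self._holders[buffer_id].add(worker)

    def release(self, worker, buffer_id):
        """Record that a worker dropped its reference to a buffer."""
        holders = self._holders.get(buffer_id)
        if holders is None:
            return  # unknown buffer: late or duplicate event, ignore it
        holders.discard(worker)
        if not holders:
            # Last known holder is gone, so stop counting the buffer
            del self._holders[buffer_id]
            del self._sizes[buffer_id]

    @property
    def total(self):
        """Total bytes of shared memory, each buffer counted once."""
        return sum(self._sizes.values())


meter = SharedMemoryMeter()
meter.acquire("worker-1", "buf-a", 100)
meter.acquire("worker-2", "buf-a", 100)  # same buffer, second holder
meter.acquire("worker-1", "buf-b", 50)
```

Since the meter never owns the memory, a lost or reordered event only makes the reported number temporarily wrong; it cannot leak a buffer.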

crusaderky · 2023-03-21T11:37:52Z

CC @jakirkham @martindurant @fjetter

crusaderky force-pushed the memfd_create branch 2 times, most recently from 17f4768 to 130ba41 Compare March 24, 2023 11:07

crusaderky force-pushed the memfd_create branch from 130ba41 to 0ceffef Compare March 29, 2023 23:51

crusaderky force-pushed the memfd_create branch from 0ceffef to f883663 Compare April 20, 2023 22:01

WIP: shared memory without tmpfs

2b9a840

crusaderky force-pushed the memfd_create branch from f883663 to 2b9a840 Compare April 20, 2023 22:04

crusaderky self-assigned this Apr 20, 2023
