Fast parallel compressed wksp checkpt/restore #3034

kbowers-jump · 2024-10-03T21:00:51Z

Very few top level changes:

The raw style is now called the v1 style (the raw style macro still exists for backward compat) but is otherwise unchanged (i.e. it should be backward compatible with existing wksp checkpts).
Added v2 (uncompressed) and v3 (compressed) styles.
Preview function API refined for more general usages across all versions (this required minor changes to the one place outside wksp where preview was getting called). Updated apps accordingly.

Under the hood, v2 and v3 formats have many useful properties for fast checkpt / restore performance and for long term archival purposes (these semantics are also usable for ultra high performance snapshot distribution and recovery).

v2/v3 support writing a checkpt with an arbitrary number of parallel threads and restoring with an arbitrary and potentially different number of parallel threads. Thus performance can be scaled out to theoretical memory or network bandwidth (v2) and compression library (v3) limits. (Currently only thread parallelization of v2 restore is implemented but that is by far the most important case practically.)
The v2 and v3 wksp allocation data frames will further be bit level identical regardless of the number of threads used on checkpt / restore.
While v2/v3 metadata (which store information about the environment in which the checkpt was made among other things) obviously can vary from run-to-run and host-to-host, this information can be quickly identified and ignored without having to process the whole checkpt.
These two features make it much easier to have multiple hosts create what should be bit-level identical checkpt files and then distribute them torrent style from multiple servers to multiple clients concurrently (and thus avoid having a network hot spot on a single server with the "blessed" checkpt).
Thus, at one extreme, a huge v3 (compressed) checkpoint can be written directly out to a network socket zero copy / single pass / single threaded / O(1) scratch memory and read from an archival copy of the checkpoint in DRAM via zero copy memory mapped I/O with as many parallel threads as it takes to restore. And similarly for the other extreme (and all intermediate combinations).
Current checkpt implementation load balances over multiple parallel restore threads via a high performance approximation to a greedy load balance algorithm. Restore implementation uses a taskq model for further load balancing restores.
Tweaked wksp allocation to always use fully trimmed partitions for an allocation such that there isn't a lot of waste in a wksp checkpt. (Previous behavior would allow an allocation request to use an untrimmed or partially trimmed partition if part_max was inadequate. But these can be arbitrarily sized which then can bloat a checkpt if checkpointing a wksp with all partitions full.)
Added a supported-styles command to fd_wksp_ctl to identify which styles are supported on the target.
Based on the recent checkpt API additions.
Minor whitespace cleanups and fixed missing return in fd_wksp_usage.
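The load balancing mentioned above can be illustrated with a plain greedy assignment: each partition goes to whichever thread currently has the least work. This is only a sketch of the textbook greedy algorithm, not the PR's implementation (which is described as a high performance approximation to it). The function name `greedy_balance` and its parameters are hypothetical and do not appear in fd_wksp.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: assign each partition (described only by its
   byte size) to the thread with the smallest load so far.  On return,
   thread_idx[i] is the thread chosen for partition i and load[t] is
   the total number of bytes assigned to thread t.  Ties go to the
   lowest numbered thread. */

static void
greedy_balance( unsigned long const * part_sz,    /* indexed [0,part_cnt) */
                size_t                part_cnt,
                size_t                thread_cnt, /* must be positive */
                size_t *              thread_idx, /* out, indexed [0,part_cnt) */
                unsigned long *       load ) {    /* out, indexed [0,thread_cnt) */
  assert( thread_cnt>0 );

  for( size_t t=0; t<thread_cnt; t++ ) load[t] = 0UL;

  for( size_t i=0; i<part_cnt; i++ ) {
    size_t best = 0;
    for( size_t t=1; t<thread_cnt; t++ ) if( load[t]<load[best] ) best = t;
    thread_idx[i] = best;
    load[best]   += part_sz[i];
  }
}
```

The inner scan is O(thread_cnt) per partition, which is presumably part of why a real implementation would approximate it. Because the assignment depends only on the partition sizes and the thread count, it is deterministic for a given wksp.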

Lower level APIs were improved to support implementing this cleanly. This includes checkpt API improvements (including optimizing handling of checkpoint metadata versus bulk data) and io API improvements (including portable memory mapped IO APIs). See the included commits for more details.

Provides coverage of a case that has long been missing from test_alloc (but was already well covered in application level testing). Not run by default. This was made a few months ago to help FD devs isolate an allocation issue (that was not in fd_alloc, alas). Doesn't really belong in this PR but also isn't really worth a separate PR. But I'm tired of it lying around in my local copy. And it probably shouldn't be thrown away as it is a very stringent stress tester when the free matching an alloc happens on a different thread (e.g. pipelining with alloc on the "source" thread and matching free on the "sink" thread, potentially in a different process). So here it is.

Useful for writing robust high level functionality.

- Split fd_checkpt_buf into two functions, fd_checkpt_{meta,data}, and similarly for fd_restore_buf. The meta functions are optimized for metadata / control while the data functions are optimized for bulk data. That is, fd_{checkpt,restore}_meta are meant for small, often temporary, buffers formed on the fly when creating a checkpt and that are needed immediately when executing a restore (e.g. the byte size of the next data buffer in a checkpt frame, a control signal to tell the restore there are no more data buffers in the current frame, ...). Accordingly, the size of these buffers is limited to at most FD_{CHECKPT,RESTORE}_META_MAX (64 KiB) and these buffers can be read / written / freed immediately on return. Conversely, fd_{checkpt,restore}_data are meant for large persistent buffers used after the restore completes. These can have (practically) arbitrary size. Buffers passed to these cannot be read / written / freed until the corresponding frame is closed. Splitting these functions makes it much simpler to implement non-trivial object level checkpt/restore functions while retaining zero copy efficiency and high compression ratio. (E.g. it is much easier to write an optimized parallel compressed wksp checkpt/restore with these semantics.) Under the hood, this piggybacks on the small buffer gather/scatter optimizations already done to improve the LZ4 compression ratio when checkpointing a lot of tiny metadata buffers. Other frame styles are free to use this distinction as they wish (they just have to respect the buffer lifetime rules).
- Renamed frame_{open,close} to just {open,close} to make the API easier to call.
- Added fd_restore_sz and fd_restore_seek to help with parallel checkpt/restore.
- Added fd_restore_{open,close}_advanced APIs that mirror the existing checkpt advanced APIs. These expose the restore frame offsets to support better high level validation of restores. As part of this, restore tracks offsets under the hood and has strict semantics about the meaning of the offset across mmio, streaming mode with seekable files and streaming mode with streams / pipes.
- Added a frame_style_is_supported API to help with cross-platform restores.
- Added is_mmio and various accessors to make it easier to clone checkpt and restore objects for thread parallelization.
- Made can_open and in_frame public to help with cleaning up after a deeply nested error.
- Other minor cleanups (gbuf_cursor init, checkpt/restore public APIs grouped together).
- Updated unit test coverage accordingly and also added tests to stress the compressor doing non-trivial gather/scatter operations (e.g. contiguous regions on checkpt to discontiguous regions on restore and vice versa) and to use the new-fangled fd_io_seek API.
- Updated documentation (typo corrections, etc).
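To make the meta / data distinction concrete, here is a minimal sketch of a frame layout in which a small "meta" record (an 8 byte size) precedes each bulk "data" buffer and a sentinel meta value signals that the frame has no more buffers. It deliberately does not use the fd_checkpt / fd_restore APIs; every name here (`sketch_put`, `sketch_end`, `sketch_get`, `SKETCH_END_OF_FRAME`) is hypothetical, the frame is just an uncompressed byte array, and the actual on-disk format of the checkpt styles is not shown in this PR text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sentinel meaning "no more data buffers in this frame" */
#define SKETCH_END_OF_FRAME UINT64_MAX

/* Append one buffer to the frame at out+off: a small meta record (the
   payload byte size) followed by the bulk data.  Returns the offset
   just past the appended bytes.  Caller provides enough room. */

static size_t
sketch_put( unsigned char * out, size_t off, void const * data, uint64_t sz ) {
  assert( sz!=SKETCH_END_OF_FRAME );
  memcpy( out+off, &sz,  sizeof(sz)  ); off += sizeof(sz);
  memcpy( out+off, data, (size_t)sz ); off += (size_t)sz;
  return off;
}

/* Append the control record that terminates the frame. */

static size_t
sketch_end( unsigned char * out, size_t off ) {
  uint64_t end = SKETCH_END_OF_FRAME;
  memcpy( out+off, &end, sizeof(end) );
  return off + sizeof(end);
}

/* Read the next meta record at in+*off.  If it describes a data buffer,
   point *data at the payload inside the frame (no copy), set *sz,
   advance *off past the payload and return 1.  Return 0 at the end of
   frame record. */

static int
sketch_get( unsigned char const * in, size_t * off,
            unsigned char const ** data, uint64_t * sz ) {
  uint64_t meta;
  memcpy( &meta, in+*off, sizeof(meta) ); *off += sizeof(meta);
  if( meta==SKETCH_END_OF_FRAME ) return 0;
  *data = in + *off;
  *sz   = meta;
  *off += (size_t)meta;
  return 1;
}
```

The reader needs each meta value immediately (to know how far to advance), while the payload pointer stays valid only as long as the frame does. That mirrors, loosely, the different lifetime rules described for the meta and data functions.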

Very few top level changes:

- The raw style is now called the v1 style (the raw style macro still exists for backward compat) but is otherwise unchanged (i.e. it should be backward compatible with existing wksp checkpts).
- Added v2 (uncompressed) and v3 (compressed) styles.
- Preview function API refined for more general usages across all versions (this required minor changes to the places outside wksp where preview was getting called and tweaking the minimum part_max used by topo). Updated fd_firedancer.c accordingly.

Under the hood, v2 and v3 formats have many useful properties for fast checkpt / restore performance and for long term archival purposes (these semantics are also usable for ultra high performance snapshot distribution and recovery).

- v2/v3 support writing a checkpt with an arbitrary number of parallel threads and restoring with an arbitrary and potentially different number of parallel threads. Thus performance can be scaled out to theoretical memory or network bandwidth (v2) and compression library (v3) limits. (Currently only thread parallelization of v2 restore is implemented but that is by far the most important case practically.)
- The v2 and v3 wksp allocation data frames will further be bit level identical regardless of the number of threads used on checkpt / restore.
- While v2/v3 metadata (which store information about the environment in which the checkpt was made among other things) obviously can vary from run-to-run and host-to-host, this information can be quickly identified and ignored without having to process the whole checkpt.
- These two features make it much easier to have multiple hosts create what should be bit-level identical checkpt files and then distribute them torrent style from multiple servers to multiple clients concurrently (and thus avoid having a network hot spot on a single server with the "blessed" checkpt).
- Thus, at one extreme, a huge v3 (compressed) checkpoint can be written directly out to a network socket zero copy / single pass / single threaded and read from an archival copy of the checkpoint in DRAM via zero copy memory mapped I/O with as many parallel threads as it takes to restore. And similarly for the other extreme (and all intermediate combinations).
- Current checkpt implementation load balances over multiple parallel restore threads via a high performance approximation to a greedy load balance algorithm. Current restore uses a task queue to dynamically load balance further. Note that parallelization is at partition granularity. If an application just allocates the entire wksp, all checkpt/restore will behave single threaded regardless of the number of threads available to checkpt/restore.
- Tweaked wksp allocation to always use fully trimmed partitions for allocations such that there is minimal waste in a wksp checkpt. (Previous behavior would allow an allocation request to use an untrimmed or partially trimmed partition if part_max was inadequate. But these can be arbitrarily sized which then can bloat a checkpt when checkpointing a completely full wksp.)
- Added a supported-styles command to fd_wksp_ctl to identify which styles are supported on the target.
- Based on the recent checkpt API additions.
- Minor whitespace cleanups and fixed missing return in fd_wksp_usage.
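The task queue idea and the partition granularity caveat can be sketched with a shared atomic counter from which each restore thread claims the next unprocessed partition. This is an assumption about the general technique, not the PR's code: the type `sketch_taskq_t` and the functions `sketch_taskq_init` and `sketch_taskq_claim` are hypothetical names, and the actual restore may organize its queue differently.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical shared work queue over partition indices [0,part_cnt) */

typedef struct {
  atomic_size_t next;     /* index of the next unclaimed partition */
  size_t        part_cnt; /* total number of partitions to restore */
} sketch_taskq_t;

static void
sketch_taskq_init( sketch_taskq_t * q, size_t part_cnt ) {
  atomic_init( &q->next, 0 );
  q->part_cnt = part_cnt;
}

/* Claim the next partition.  Returns 1 and sets *idx on success and
   returns 0 once every partition has been claimed.  The atomic fetch
   and add makes this safe to call from many threads concurrently, and
   each partition is handed to exactly one caller. */

static int
sketch_taskq_claim( sketch_taskq_t * q, size_t * idx ) {
  size_t i = atomic_fetch_add( &q->next, 1 );
  if( i>=q->part_cnt ) return 0;
  *idx = i;
  return 1;
}
```

Each worker would loop on `sketch_taskq_claim` until it returns 0, so a thread that finishes a small partition early simply picks up the next one. With a single partition (e.g. an application that allocates the entire wksp in one allocation), only one claim ever succeeds, which is why extra threads give no speedup in that case.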

kbowers-jump added 7 commits October 3, 2024 14:53

Minor util/checkpt cleanups

a8c9de5

- Type and comment cleanup in fd_checkpt.h.
- Eliminated redundant test in fd_restore.c.

util header linting

07c45ea

- fd_wksp.h brings in fd_checkpt.h in anticipation of checkpt based wksp checkpointing.
- Swept through and cleaned up other util includes in the process.

fd_io seek and sz APIs

b8b0230

Needed for writing parallel compressed restore from a file descriptor.

Low level portable memory mapped I/O API

f558a0d

Useful for all sorts of things, including parallel wksp checkpt/restore implementations.

kbowers-jump force-pushed the kbowers-jump/wksp-checkpt branch from 14538f8 to 4924d6e Compare October 4, 2024 10:59
