Fast parallel compressed wksp checkpt/restore #3034

Open
wants to merge 7 commits into
base: main

kbowers-jump
Contributor

Very few top level changes:

  • The raw style is now called the v1 style (the raw style macro still exists for backward compat) but is otherwise unchanged (i.e. it should be backward compatible with existing wksp checkpts).

  • Added v2 (uncompressed) and v3 (compressed) styles.

  • Preview function API refined for more general usages across all versions (this required minor changes to the one place outside wksp where preview was getting called). Updated apps accordingly.

Under the hood, v2 and v3 formats have many useful properties for fast checkpt / restore performance and for long term archival purposes (these semantics are also usable for ultra high performance snapshot distribution and recovery).

  • v2/v3 support writing a checkpt with an arbitrary number of parallel threads and restoring with an arbitrary and potentially different number of parallel threads. Thus performance can be scaled out to theoretical memory or network bandwidth (v2) and compression library (v3) limits. (Currently only thread parallelization of v2 restore is implemented but that is by far the most important case practically.)

  • The v2 and v3 wksp allocation data frames will further be bit level identical regardless of the number of threads used on checkpt / restore.

  • While v2/v3 metadata (which stores information about the environment in which the checkpt was made among other things) obviously can vary from run-to-run and host-to-host, this information can be quickly identified and ignored without having to process the whole checkpt.

  • These two features make it much easier to have multiple hosts create what should be bit-level identical checkpt files and then distribute them torrent style from multiple servers to multiple clients concurrently (and thus avoid having a network hot spot on a single server with the "blessed" checkpt).

  • Thus, at one extreme, a huge v3 (compressed) checkpoint can be written directly out to a network socket zero copy / single pass / single threaded / O(1) scratch memory and read from an archival copy of the checkpoint in DRAM via zero copy memory mapped I/O with as many parallel threads as it takes to restore. And similarly for the other extreme (and all intermediate combinations).

  • The current checkpt implementation load balances over multiple parallel restore threads via a high performance approximation to a greedy load balance algorithm. The restore implementation uses a taskq model to further load balance restores dynamically.

  • Tweaked wksp allocation to always use fully trimmed partitions for an allocation such that there isn't a lot of waste in a wksp checkpt. (Previous behavior would allow an allocation request to use an untrimmed or partially trimmed partition if part_max was inadequate. But these can be arbitrarily sized which then can bloat a checkpt if checkpointing a wksp with all partitions full.)

  • Added a supported-styles command to fd_wksp_ctl to identify which styles are supported on the target.

  • Based on the recent checkpt API additions.

  • Minor whitespace cleanups and fixed missing return in fd_wksp_usage.
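
To illustrate the load balancing idea mentioned in the list above, here is a minimal sketch of a plain greedy assignment (largest partition first, always to the least loaded thread). This is not the implementation in this PR, which uses a high performance approximation to such an algorithm. The names `part_t`, `part_cmp` and `greedy_assign` are hypothetical and do not exist in the codebase.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical descriptor for one wksp partition to checkpt. */
typedef struct { size_t idx; size_t sz; } part_t;

/* Sort larger partitions first.  Ties are broken by index so the
   resulting assignment is deterministic across qsort implementations. */
static int
part_cmp( void const * a, void const * b ) {
  part_t const * pa = (part_t const *)a;
  part_t const * pb = (part_t const *)b;
  if( pa->sz!=pb->sz ) return pa->sz<pb->sz ? 1 : -1;
  return (pa->idx>pb->idx) - (pa->idx<pb->idx);
}

/* Assign part_cnt partitions to thread_cnt threads.  On success,
   owner[i] is the thread for partition i and load[t] is the total byte
   size assigned to thread t.  Returns 0 on success, -1 on alloc failure. */
static int
greedy_assign( size_t const * part_sz, size_t part_cnt, size_t thread_cnt,
               size_t * owner, size_t * load ) {
  assert( thread_cnt>0 );
  part_t * part = malloc( (part_cnt ? part_cnt : 1)*sizeof(part_t) );
  if( !part ) return -1;
  for( size_t i=0; i<part_cnt; i++ ) { part[i].idx = i; part[i].sz = part_sz[i]; }
  qsort( part, part_cnt, sizeof(part_t), part_cmp );

  for( size_t t=0; t<thread_cnt; t++ ) load[t] = 0;
  for( size_t i=0; i<part_cnt; i++ ) {
    size_t best = 0; /* least loaded thread so far (lowest index on ties) */
    for( size_t t=1; t<thread_cnt; t++ ) if( load[t]<load[best] ) best = t;
    owner[ part[i].idx ] = best;
    load[ best ] += part[i].sz;
  }
  free( part );
  return 0;
}
```

The inner scan is O(thread_cnt) per partition, which is fine for a sketch. A production version would likely replace it with a heap or a cheaper approximation.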

Lower level APIs were improved to support implementing this cleanly. This includes checkpt API improvements (including optimizing handling of checkpoint metadata versus bulk data) and io API improvements (including portable memory mapped IO APIs). See the included commits for more details.

- Type and comment cleanup in fd_checkpt.h.
- Eliminated redundant test in fd_restore.c.
- fd_wksp.h brings in fd_checkpt.h in anticipation of checkpt based
  wksp checkpointing.
- Swept through and cleaned up other util includes in the process.
Provides coverage of a case that has long been missing from test_alloc
(though it was already well covered in application level testing).

Not run by default.

This was made a few months ago to help FD devs isolate an allocation
issue (that was not in fd_alloc, alas).  Doesn't really belong in this
PR but also isn't really worth a separate PR.  But I'm tired of it lying
around in my local copy.  And it probably shouldn't be thrown away as it
is a very stringent stress tester when the free matching an alloc
happens on a different thread (e.g. pipelining with alloc on the
"source" thread and matching free on the "sink" thread, potentially in a
different process).  So here it is.
Needed for writing parallel compressed restore from a file descriptor.
Useful for all sorts of things, including parallel wksp checkpt/restore
implementations.
Useful for writing robust high level functionality.

- Split fd_checkpt_buf into two functions, fd_checkpt_{meta,data}, and
  similarly for fd_restore_buf.  The meta functions are optimized for
  metadata / control while the data functions are optimized for bulk
  data.

  That is, fd_{checkpt,restore}_meta are meant for small often temporary
  buffers formed on the fly when creating a checkpt and that are needed
  immediately when executing a restore (e.g. the byte size of the next
  data buffer in a checkpt frame, a control signal to tell the restore
  there are no more data buffers in the current frame, ...).
  Accordingly, the size of these buffers is limited to at most
  FD_{CHECKPT,RESTORE}_META_MAX (64 KiB) and these buffers can be read /
  written / freed immediately on return.

  Conversely, fd_{checkpt,restore}_data are meant for large persistent
  buffers used after the restore completes.  These can have
  (practically) arbitrary size.  Buffers passed to these cannot be read
  / written / freed until the corresponding frame is closed.

  Splitting these functions makes it much simpler to implement
  non-trivial object level checkpt/restore functions while retaining
  zero copy efficiency and high compression ratio.  (E.g. it is much
  easier to write an optimized parallel compressed wksp checkpt/restore
  with these semantics.)

  Under the hood, this piggybacks on the small buffer gather/scatter
  optimizations already done to improve the LZ4 compression ratio when
  checkpointing a lot of tiny metadata buffers.  Other frame styles are free
  to use this distinction as they wish (just have to respect the buffer
  lifetime rules).

- Renamed frame_{open,close} to just {open,close} to make API easier
  to call.

- Added fd_restore_sz and fd_restore_seek to help with parallel
  checkpt/restore.

- Added fd_restore_{open,close}_advanced APIs that mirror the existing
  checkpt advanced APIs.  These expose the restore frame offsets to
  support better high level validation of restores.  As part of this,
  restore tracks offsets under the hood and has strict semantics about
  the meaning of the offset between mmio, streaming mode with seekable
  files and streaming mode with streams / pipes.

- Added a frame_style_is_supported API to help with cross-platform
  restores.

- Added is_mmio and various accessors to make it easier to clone
  checkpt and restore objects for thread parallelization.

- Made can_open and in_frame public to help with cleaning up after
  a deeply nested error.

- Other minor cleanups (gbuf_cursor init, checkpt/restore public APIs
  grouped together).

- Updated unit test coverage accordingly and also added tests to stress
  out the compressor doing non-trivial gather/scatter operations (e.g.
  contiguous regions on checkpt to discontiguous regions on restore and
  vice versa) and use the new-fangled fd_io_seek API.

- Updated documentation (typo corrections, etc).
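
As an illustration of the meta / data split described above, the following sketch frames each bulk buffer with a small size header and ends the frame with a control value, so a reader can consume the small metadata immediately while referring to the bulk data in place. It is a simplified, uncompressed stand-in and does not use the fd_checkpt / fd_restore APIs. The names `FRAME_END`, `frame_put`, `frame_put_end` and `frame_get`, and the 8-byte host-endian header layout, are assumptions made for this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical control value: no more data buffers in this frame. */
#define FRAME_END UINT64_MAX

/* Append a size header ("meta") followed by sz bytes of bulk data at
   offset off.  Returns the new offset, or 0 if the frame lacks room. */
static size_t
frame_put( uint8_t * frame, size_t cap, size_t off, void const * data, uint64_t sz ) {
  if( off>cap || cap-off<sizeof(uint64_t) ) return 0;
  if( sz>cap-off-sizeof(uint64_t) ) return 0;
  memcpy( frame+off, &sz, sizeof(uint64_t) ); off += sizeof(uint64_t);
  if( sz ) memcpy( frame+off, data, (size_t)sz );
  return off + (size_t)sz;
}

/* Append the end-of-frame control value.  Returns the new offset or 0. */
static size_t
frame_put_end( uint8_t * frame, size_t cap, size_t off ) {
  uint64_t end = FRAME_END;
  if( off>cap || cap-off<sizeof(uint64_t) ) return 0;
  memcpy( frame+off, &end, sizeof(uint64_t) );
  return off + sizeof(uint64_t);
}

/* Read the next entry at *off.  The header is copied out right away
   (small, needed immediately) while *data points into the frame itself
   (bulk, not copied).  Returns 1 if a data buffer was found, 0 at the
   end marker and -1 if the frame is malformed. */
static int
frame_get( uint8_t const * frame, size_t cap, size_t * off,
           uint8_t const ** data, uint64_t * sz ) {
  uint64_t hdr;
  if( *off>cap || cap-*off<sizeof(uint64_t) ) return -1;
  memcpy( &hdr, frame+*off, sizeof(uint64_t) ); *off += sizeof(uint64_t);
  if( hdr==FRAME_END ) return 0;
  if( hdr>cap-*off ) return -1;
  *data = frame+*off; *sz = hdr; *off += (size_t)hdr;
  return 1;
}
```

A reader would call `frame_get` in a loop until it returns 0. The pointer it hands back stays valid only as long as the frame memory does, which loosely mirrors the buffer lifetime rules described above.
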
Very few top level changes:

- The raw style is now called the v1 style (the raw style macro still
  exists for backward compat) but is otherwise unchanged (i.e. it should
  be backward compatible with existing wksp checkpts).

- Added v2 (uncompressed) and v3 (compressed) styles.

- Preview function API refined for more general usages across all
  versions (this required minor changes to the places outside wksp where
  preview was getting called and tweaking the number of minimal part_max
  used by topo).  Updated fd_firedancer.c accordingly.

Under the hood, v2 and v3 formats have many useful properties for
fast checkpt / restore performance and for long term archival purposes
(these semantics are also usable for ultra high performance snapshot
distribution and recovery).

- v2/v3 support writing a checkpt with an arbitrary number of parallel
  threads and restoring with an arbitrary and potentially different
  number of parallel threads.  Thus performance can be scaled out to
  theoretical memory or network bandwidth (v2) and compression library
  (v3) limits.  (Currently only thread parallelization of v2 restore is
  implemented but that is by far the most important case practically.)

- The v2 and v3 wksp allocation data frames will further be bit level
  identical regardless of the number of threads used on checkpt /
  restore.

- While v2/v3 metadata (which stores information about the
  environment in which the checkpt was made among other things)
  obviously can vary from run-to-run and host-to-host, this information
  can be quickly identified and ignored without having to process the
  whole checkpt.

- These two features make it much easier to have multiple hosts create
  what should be bit-level identical checkpt files and then distribute
  them torrent style from multiple servers to multiple clients
  concurrently (and thus avoid having a network hot spot on a single
  server with the "blessed" checkpt).

- Thus, at one extreme, a huge v3 (compressed) checkpoint can be written
  directly out to a network socket zero copy / single pass / single
  threaded and read from an archival copy of the checkpoint in DRAM via
  zero copy memory mapped I/O with as many parallel threads as it takes
  to restore.  And similarly for the other extreme (and all intermediate
  combinations).

- Current checkpt implementation load balances over multiple parallel
  restore threads via a high performance approximation to a greedy load
  balance algorithm.  Current restore uses a task queue to dynamically
  load balance further.  Note that parallelization is at partition
  granularity.  If an application just allocates the entire wksp, all
  checkpt/restore will behave single threaded regardless of the number
  of threads available to checkpt/restore.

- Tweaked wksp allocation to always use fully trimmed partitions for
  allocations such that there is minimal waste in a wksp checkpt.
  (Previous behavior would allow an allocation request to use an
  untrimmed or partially trimmed partition if part_max was inadequate.
  But these can be arbitrarily sized which then can bloat a checkpt if
  checkpointing a completely full wksp.)

- Added a supported-styles command to fd_wksp_ctl to identify which
  styles are supported on the target.

- Based on the recent checkpt API additions.

- Minor whitespace cleanups and fixed missing return in fd_wksp_usage.
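
The task queue style of dynamic load balancing mentioned above can be sketched with a shared atomic counter from which each restore thread claims the next unprocessed partition. This is only an illustration of the general technique, not the restore code in this commit. `taskq_t`, `taskq_init`, `taskq_claim` and `restore_worker` are hypothetical names, and the "restore" here just tallies bytes.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical task queue: tasks are partition indices in [0,cnt). */
typedef struct { atomic_size_t next; size_t cnt; } taskq_t;

static void
taskq_init( taskq_t * q, size_t cnt ) {
  atomic_init( &q->next, 0 );
  q->cnt = cnt;
}

/* Claim the next unclaimed task.  Returns its index, or q->cnt if the
   queue is exhausted.  Safe to call from multiple threads at once. */
static size_t
taskq_claim( taskq_t * q ) {
  size_t idx = atomic_fetch_add( &q->next, 1 );
  return idx<q->cnt ? idx : q->cnt;
}

/* Worker loop run by thread tid: keep claiming partitions until none
   remain.  Records which thread handled each partition in done_by and
   returns the number of bytes this thread "restored". */
static size_t
restore_worker( taskq_t * q, size_t const * part_sz, size_t * done_by, size_t tid ) {
  size_t bytes = 0;
  for(;;) {
    size_t idx = taskq_claim( q );
    if( idx>=q->cnt ) break;
    done_by[ idx ] = tid;      /* stand-in for restoring partition idx */
    bytes += part_sz[ idx ];
  }
  return bytes;
}
```

Because threads pull work only when idle, a thread stuck on a large partition does not hold up the others. As noted above, this only helps when the wksp has more than one partition to hand out.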