Fast parallel compressed wksp checkpt/restore #3034

Open
wants to merge 7 commits into
base: main

Commits on Oct 3, 2024

  1. Minor util/checkpt cleanups

    - Type and comment cleanup in fd_checkpt.h.
    - Eliminated redundant test in fd_restore.c.
    kbowers-jump committed Oct 3, 2024
    a8c9de5
  2. util header linting

    - fd_wksp.h brings in fd_checkpt.h in anticipation of checkpt based
      wksp checkpointing.
    - swept through and cleaned up other util includes in the process.
    kbowers-jump committed Oct 3, 2024
    07c45ea
  3. Added pipelined alloc/free test coverage

    Provides coverage of a case that has long been missing from
    test_alloc (though it was already well covered in application-level
    testing).
    
    Not run by default.
    
    This was made a few months ago to help FD devs isolate an allocation
    issue (which was not in fd_alloc, alas).  Doesn't really belong in this
    PR but also isn't really worth a separate PR.  But I'm tired of it lying
    around in my local copy.  And it probably shouldn't be thrown away as it
    is a very stringent stress tester when the free matching an alloc
    happens on a different thread (e.g. pipelining with alloc on the
    "source" thread and matching free on the "sink" thread, potentially in a
    different process).  So here it is.
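
The pipelining pattern described above (alloc on a "source" thread, matching free on a "sink" thread) can be sketched roughly as follows.  This is not the test added by this commit: it assumes plain malloc/free standing in for fd_alloc, POSIX threads and C11 atomics, and every demo_* name is hypothetical.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>

#define DEMO_DEPTH 64UL /* ring depth, must be a power of 2 (hypothetical) */

typedef struct {
  void *        slot[ DEMO_DEPTH ]; /* buffers in flight from source to sink */
  atomic_ulong  tail;               /* number of buffers published by the source */
  atomic_ulong  head;               /* number of buffers freed by the sink */
  unsigned long cnt;                /* total number of buffers to pipeline */
  unsigned long bad;                /* corrupted bytes seen by the sink */
} demo_pipe_t;

static void * demo_source( void * arg ) {
  demo_pipe_t * p = (demo_pipe_t *)arg;
  for( unsigned long i=0UL; i<p->cnt; i++ ) {
    unsigned long   sz  = 1UL + (i & 255UL);
    unsigned char * buf = (unsigned char *)malloc( sz ); /* alloc on the source thread */
    if( !buf ) abort();
    memset( buf, (int)(i & 255UL), sz );
    /* Wait for a free ring slot, then publish the buffer to the sink */
    while( i - atomic_load_explicit( &p->head, memory_order_acquire ) >= DEMO_DEPTH ) sched_yield();
    p->slot[ i & (DEMO_DEPTH-1UL) ] = buf;
    atomic_store_explicit( &p->tail, i+1UL, memory_order_release );
  }
  return NULL;
}

static void * demo_sink( void * arg ) {
  demo_pipe_t * p = (demo_pipe_t *)arg;
  for( unsigned long i=0UL; i<p->cnt; i++ ) {
    /* Wait until the source has published buffer i */
    while( atomic_load_explicit( &p->tail, memory_order_acquire ) <= i ) sched_yield();
    unsigned char * buf = (unsigned char *)p->slot[ i & (DEMO_DEPTH-1UL) ];
    unsigned long   sz  = 1UL + (i & 255UL);
    for( unsigned long j=0UL; j<sz; j++ ) if( buf[j]!=(unsigned char)(i & 255UL) ) p->bad++;
    free( buf ); /* matching free on the sink thread */
    atomic_store_explicit( &p->head, i+1UL, memory_order_release );
  }
  return NULL;
}

/* Runs the pipeline for cnt buffers and returns the number of corrupted
   bytes the sink observed (0 is the expected result). */
static unsigned long demo_pipelined_alloc_free( unsigned long cnt ) {
  demo_pipe_t p;
  atomic_init( &p.tail, 0UL );
  atomic_init( &p.head, 0UL );
  p.cnt = cnt;
  p.bad = 0UL;
  pthread_t source, sink;
  if( pthread_create( &source, NULL, demo_source, &p ) ) abort();
  if( pthread_create( &sink,   NULL, demo_sink,   &p ) ) abort();
  pthread_join( source, NULL );
  pthread_join( sink,   NULL );
  return p.bad;
}
```

Calling demo_pipelined_alloc_free with a large count keeps up to DEMO_DEPTH allocations in flight between the two threads, which is the cross-thread free case the commit message describes.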
    kbowers-jump committed Oct 3, 2024
    df124a3
  4. fd_io seek and sz APIs

    Needed for writing parallel compressed restore from a file descriptor.
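
As a rough illustration of why size and seek matter here: a restore thread typically needs to know how large the file is and to jump straight to its own region.  The sketch below uses only POSIX calls (fstat, lseek, read).  The demo_* names are hypothetical and are not the fd_io signatures added by this commit.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: byte size of a regular file.  Returns 0 on
   success and -1 on failure (e.g. fd is a pipe or socket). */
static int demo_fd_sz( int fd, unsigned long * sz ) {
  assert( sz );
  struct stat st;
  if( fstat( fd, &st ) )         return -1;
  if( !S_ISREG( st.st_mode ) )   return -1; /* streams have no meaningful size */
  *sz = (unsigned long)st.st_size;
  return 0;
}

/* Hypothetical helper: seek to absolute byte offset off. */
static int demo_fd_seek( int fd, unsigned long off ) {
  return lseek( fd, (off_t)off, SEEK_SET )==(off_t)-1 ? -1 : 0;
}

/* Hypothetical helper: read bytes [off,off+sz) of the file at path into
   dst using a private file descriptor, as one restore thread among
   several might.  Returns 0 on success and -1 on failure. */
static int demo_file_read_at( char const * path, unsigned long off, void * dst, unsigned long sz ) {
  int fd = open( path, O_RDONLY );
  if( fd<0 ) return -1;
  unsigned long file_sz;
  int err = demo_fd_sz( fd, &file_sz );
  if( !err && (off>file_sz || sz>file_sz-off) ) err = -1; /* region out of bounds */
  if( !err ) err = demo_fd_seek( fd, off );
  unsigned char * p = (unsigned char *)dst;
  while( !err && sz ) { /* read can return short counts */
    ssize_t n = read( fd, p, sz );
    if( n<=0 ) err = -1;
    else       { p += n; sz -= (unsigned long)n; }
  }
  close( fd );
  return err;
}
```

Because each caller opens its own descriptor, the file offset is private to that caller and several threads can read disjoint regions concurrently.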
    kbowers-jump committed Oct 3, 2024
    b8b0230
  5. Low level portable memory mapped I/O API

    Useful for all sorts of things, including parallel wksp checkpt/restore
    implementations.
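
For readers unfamiliar with the idea, a minimal memory mapped read path on a POSIX target might look like the sketch below.  It is an assumption-laden illustration built directly on mmap.  The demo_mmio_* names are hypothetical and do not reflect the API introduced by this commit.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

typedef struct {
  unsigned char const * base; /* first byte of the mapping */
  unsigned long         sz;   /* mapping size in bytes */
} demo_mmio_t;

/* Map the file at path read-only.  Returns 0 on success, -1 on failure. */
static int demo_mmio_init( demo_mmio_t * m, char const * path ) {
  assert( m && path );
  int fd = open( path, O_RDONLY );
  if( fd<0 ) return -1;
  struct stat st;
  if( fstat( fd, &st ) || st.st_size<=0 ) { close( fd ); return -1; }
  void * p = mmap( NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, (off_t)0 );
  close( fd ); /* the mapping stays valid after the descriptor is closed */
  if( p==MAP_FAILED ) return -1;
  m->base = (unsigned char const *)p;
  m->sz   = (unsigned long)st.st_size;
  return 0;
}

/* Zero copy view of bytes [off,off+sz).  Returns NULL if out of bounds. */
static unsigned char const * demo_mmio_view( demo_mmio_t const * m, unsigned long off, unsigned long sz ) {
  if( off>m->sz || sz>m->sz-off ) return NULL;
  return m->base + off;
}

/* Unmap.  Returns 0 on success, -1 on failure. */
static int demo_mmio_fini( demo_mmio_t * m ) {
  return munmap( (void *)m->base, (size_t)m->sz );
}
```

Since a view is just a pointer into the mapping, any number of threads can consume disjoint views without copying or coordinating on a shared file offset.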
    kbowers-jump committed Oct 3, 2024
    f558a0d
  6. fd_checkpt API improvements and cleanups

    Useful for writing robust high level functionality.
    
    - Split fd_checkpt_buf into two functions, fd_checkpt_{meta,data}, and
      similarly for fd_restore_buf.  The meta functions are optimized for
      metadata / control while the data functions are optimized for bulk
      data.
    
      That is, fd_{checkpt,restore}_meta are meant for small often temporary
      buffers formed on the fly when creating a checkpt and that are needed
      immediately when executing a restore (e.g. the byte size of the next
      data buffer in a checkpt frame, a control signal to tell the restore
      there are no more data buffers in the current frame, ...).
      Accordingly, the size of these buffers is limited to at most
      FD_{CHECKPT,RESTORE}_META_MAX (64 KiB) and these buffers can be read /
      written / freed immediately on return.
    
      Conversely, fd_{checkpt,restore}_data are meant for large persistent
      buffers used after the restore completes.  These can have
      (practically) arbitrary size.  Buffers passed to these cannot be read
      / written / freed until the corresponding frame is closed.
    
      Splitting these functions makes it much simpler to implement
      non-trivial object level checkpt/restore functions while retaining
      zero copy efficiency and high compression ratio.  (E.g. it is much
      easier to write an optimized parallel compressed wksp checkpt/restore
      with these semantics.)
    
      Under the hood, this piggybacks on the small buffer gather/scatter
      optimizations already done to improve the LZ4 compression ratio when
      checkpointing a lot of tiny metadata buffers.  Other frame styles are free
      to use this distinction as they wish (just have to respect the buffer
      lifetime rules).
    
    - Renamed frame_{open,close} to just {open,close} to make API easier
      to call.
    
    - Added fd_restore_sz and fd_restore_seek to help with parallel
      checkpt/restore.
    
    - Added fd_restore_{open,close}_advanced APIs that mirror the existing
      checkpt advanced APIs.  These expose the restore frame offsets to
      support better high level validation of restores.  As part of this,
      restore tracks offsets under the hood and has strict semantics about
      the meaning of the offset between mmio, streaming mode with seekable
      files and streaming mode with streams / pipes.
    
    - Added a frame_style_is_supported API to help with cross-platform
      restores.
    
    - Added is_mmio and various accessors to make it easier to clone
      checkpt and restore objects for thread parallelization.
    
    - Made can_open and in_frame public to help with cleaning up after
      a deeply nested error.
    
    - Other minor cleanups (gbuf_cursor init, checkpt/restore public APIs
      grouped together).
    
    - Updated unit test coverage accordingly and also added tests that
      stress the compressor with non-trivial gather/scatter operations (e.g.
      contiguous regions on checkpt to discontiguous regions on restore and
      vice versa) and use the new-fangled fd_io_seek API.
    
    - Updated documentation (typo corrections, etc).
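
The meta / data buffer lifetime rules described in the first item above can be illustrated with a toy in-memory frame writer.  This is only a sketch of the semantics: it does no compression, the demo_* names and DEMO_DATA_MAX are hypothetical, and the real fd_checkpt signatures are not reproduced here.

```c
#include <assert.h>
#include <string.h>

#define DEMO_META_MAX 65536UL /* mirrors the 64 KiB meta limit described above */
#define DEMO_DATA_MAX 16UL    /* max data buffers per frame in this sketch (assumption) */

typedef struct {
  unsigned char * out;      /* frame output region */
  unsigned long   out_max;  /* capacity of out in bytes */
  unsigned long   out_sz;   /* bytes reserved in out so far */
  struct { void const * buf; unsigned long sz; unsigned long off; } data[ DEMO_DATA_MAX ];
  unsigned long   data_cnt; /* data buffers whose copy is still pending */
} demo_frame_t;

static void demo_open( demo_frame_t * f, void * out, unsigned long out_max ) {
  assert( f && out );
  f->out = (unsigned char *)out; f->out_max = out_max; f->out_sz = 0UL; f->data_cnt = 0UL;
}

/* Small control buffer: copied immediately, so the caller may modify or
   free buf as soon as this returns. */
static int demo_meta( demo_frame_t * f, void const * buf, unsigned long sz ) {
  if( sz>DEMO_META_MAX || sz>f->out_max - f->out_sz ) return -1;
  memcpy( f->out + f->out_sz, buf, sz );
  f->out_sz += sz;
  return 0;
}

/* Bulk buffer: only the pointer is recorded here, so buf must stay valid
   and unmodified until demo_close. */
static int demo_data( demo_frame_t * f, void const * buf, unsigned long sz ) {
  if( f->data_cnt>=DEMO_DATA_MAX || sz>f->out_max - f->out_sz ) return -1;
  f->data[ f->data_cnt ].buf = buf;
  f->data[ f->data_cnt ].sz  = sz;
  f->data[ f->data_cnt ].off = f->out_sz;
  f->data_cnt++;
  f->out_sz += sz;
  return 0;
}

/* Closing the frame performs the deferred bulk copies and returns the
   frame size in bytes. */
static unsigned long demo_close( demo_frame_t * f ) {
  for( unsigned long i=0UL; i<f->data_cnt; i++ )
    memcpy( f->out + f->data[i].off, f->data[i].buf, f->data[i].sz );
  f->data_cnt = 0UL;
  return f->out_sz;
}
```

A typical caller writes the byte size of the next bulk buffer with demo_meta from a temporary on the stack and then hands the bulk buffer itself to demo_data.  The temporary can be reused right away, while the bulk buffer has to outlive the frame.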
    kbowers-jump committed Oct 3, 2024
    9346e8e

Commits on Oct 4, 2024

  1. Fast parallel compressed wksp checkpt/restore

    Very few top level changes:
    
    - The raw style is now called the v1 style (the raw style macro still
      exists for backward compat) but is otherwise unchanged (i.e. it should
      be backward compatible with existing wksp checkpts).
    
    - Added v2 (uncompressed) and v3 (compressed) styles.
    
    - Preview function API refined for more general use across all
      versions (this required minor changes to the places outside wksp where
      preview was getting called and tweaking the minimal part_max used by
      topo).  Updated fd_firedancer.c accordingly.
    
    Under the hood, v2 and v3 formats have many useful properties for
    fast checkpt / restore performance and for long term archival purposes
    (these semantics are also usable for ultra high performance snapshot
    distribution and recovery).
    
    - v2/v3 support writing a checkpt with an arbitrary number of parallel
      threads and restoring with an arbitrary and potentially different
      number of parallel threads.  Thus performance can be scaled out to
      theoretical memory or network bandwidth (v2) and compression library
      (v3) limits.  (Currently only thread parallelization of v2 restore is
      implemented but that is by far the most important case practically.)
    
    - The v2 and v3 wksp allocation data frames will further be bit level
      identical regardless of the number of threads used on checkpt /
      restore.
    
    - While v2/v3 metadata (which stores information about the
      environment in which the checkpt was made among other things)
      obviously can vary from run-to-run and host-to-host, this information
      can be quickly identified and ignored without having to process the
      whole checkpt.
    
    - These two features make it much easier to have multiple hosts create
      what should be bit-level identical checkpt files and then distribute
      them torrent style from multiple servers to multiple clients
      concurrently (and thus avoid having a network hot spot on a single
      server with the "blessed" checkpt).
    
    - Thus, at one extreme, a huge v3 (compressed) checkpoint can be written
      directly out to a network socket zero copy / single pass / single
      threaded and read from an archival copy of the checkpoint in DRAM via
      zero copy memory mapped I/O with as many parallel threads as it takes
      to restore.  And similarly for the other extreme (and all intermediate
      combinations).
    
    - The current checkpt implementation load balances over multiple parallel
      restore threads via a high performance approximation to a greedy load
      balance algorithm.  The current restore uses a task queue to dynamically
      load balance further.  Note that parallelization is at partition
      granularity.  If an application just allocates the entire wksp as a
      single allocation, checkpt/restore will behave single threaded
      regardless of the number of threads available to checkpt/restore.
    
    - Tweaked wksp allocation to always use fully trimmed partitions for
      allocations such that there is minimal waste in a wksp checkpt.
      (Previous behavior would allow an allocation request to use an
      untrimmed or partially trimmed partition if part_max was inadequate.
      But these can be arbitrarily sized, which can then bloat a checkpt
      when checkpointing a completely full wksp.)
    
    - Added a supported-styles command to fd_wksp_ctl to identify which
      styles are supported on the target.
    
    - Based on the recent checkpt API additions.
    
    - Minor whitespace cleanups and fixed missing return in fd_wksp_usage.
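
To make the load balancing item above concrete, here is the textbook greedy heuristic (largest partition first, each assigned to the currently least loaded thread).  The commit describes a "high performance approximation" to such an algorithm, so this sketch is the plain reference version and not the implementation in this PR.  All demo_* names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct { unsigned long idx; unsigned long sz; } demo_part_t;

/* Sort by size descending, breaking ties by ascending partition index so
   the result is deterministic. */
static int demo_part_cmp( void const * _a, void const * _b ) {
  demo_part_t const * a = (demo_part_t const *)_a;
  demo_part_t const * b = (demo_part_t const *)_b;
  if( a->sz!=b->sz ) return a->sz>b->sz ? -1 : 1;
  return (a->idx>b->idx) - (a->idx<b->idx);
}

/* Assign part_cnt partitions with byte sizes part_sz to thread_cnt
   threads.  On return, owner[i] is the thread for partition i and
   load[t] is the total bytes assigned to thread t.  Returns the largest
   per-thread load. */
static unsigned long
demo_greedy_assign( unsigned long const * part_sz, unsigned long part_cnt,
                    unsigned long thread_cnt,
                    unsigned long * owner, unsigned long * load ) {
  assert( thread_cnt );
  demo_part_t * part = (demo_part_t *)malloc( (part_cnt ? part_cnt : 1UL)*sizeof(demo_part_t) );
  assert( part );
  for( unsigned long i=0UL; i<part_cnt; i++ ) { part[i].idx = i; part[i].sz = part_sz[i]; }
  qsort( part, part_cnt, sizeof(demo_part_t), demo_part_cmp );

  for( unsigned long t=0UL; t<thread_cnt; t++ ) load[t] = 0UL;
  for( unsigned long i=0UL; i<part_cnt; i++ ) {
    unsigned long t_min = 0UL; /* least loaded thread, lowest index on ties */
    for( unsigned long t=1UL; t<thread_cnt; t++ ) if( load[t]<load[t_min] ) t_min = t;
    owner[ part[i].idx ] = t_min;
    load[ t_min ] += part[i].sz;
  }

  unsigned long max_load = 0UL;
  for( unsigned long t=0UL; t<thread_cnt; t++ ) if( load[t]>max_load ) max_load = load[t];
  free( part );
  return max_load;
}
```

With a single partition every thread but one ends up with zero load, which matches the note above that a wksp holding one giant allocation checkpoints and restores effectively single threaded.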
    kbowers-jump committed Oct 4, 2024
    4924d6e