Compute release 2024-09-25 #9151

Merged 20 commits on Sep 25, 2024

Commits on Sep 24, 2024

  1. chore(#9077): cleanups & code dedup (#9082)

    Punted from #9077
    problame authored Sep 24, 2024
    a65d437
  2. Move the patch to compute (#9120)

    ## Problem
    All the other patches were moved to the compute directory, and only one
    was left in the patches subdirectory in the root directory.
    
    ## Summary of changes
    The patch was moved to the compute directory, like the others.
    a-masterov authored Sep 24, 2024
    b224a5a
  3. test: Make test_hot_standby_feedback more forgiving of slow initialization (#9113)
    
    Don't start waiting for the index to appear in the secondary until it
    has been created in the primary. Before, if the "pgbench -i" step took
    more than 60 s, we would give up.
    
    There was a flaky test failure along those lines at:
    https://neon-github-public-dev.s3.amazonaws.com/reports/pr-9105/10997477941/index.html#suites/950eff205b552e248417890b8b8f189e/73cf4b5648fa6f74/
    Hopefully, this avoids such failures in the future.
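
The ordering described above could look roughly like the sketch below. `wait_until`, `wait_for_index`, the predicate arguments, and the timeout parameters are hypothetical names for illustration, not the test suite's actual helpers. The point is only that the clock for the secondary starts after the primary has finished creating the index.

```python
import time
from typing import Callable


def wait_until(predicate: Callable[[], bool], timeout_s: float, interval_s: float = 0.01) -> float:
    """Poll `predicate` until it returns True; return the seconds waited.

    Raises TimeoutError if the predicate is still false after `timeout_s`.
    """
    start = time.monotonic()
    while True:
        if predicate():
            return time.monotonic() - start
        if time.monotonic() - start >= timeout_s:
            raise TimeoutError(f"condition not met within {timeout_s} s")
        time.sleep(interval_s)


def wait_for_index(index_on_primary, index_on_secondary, init_timeout_s, replay_timeout_s):
    # Phase 1: slow initialization on the primary gets its own budget.
    wait_until(index_on_primary, init_timeout_s)
    # Phase 2: only now start the clock for the index to reach the secondary.
    wait_until(index_on_secondary, replay_timeout_s)
```

With this shape, a slow "pgbench -i" only eats into the first budget and cannot exhaust the time allowed for replication.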
    hlinnaka authored Sep 24, 2024
    70fe007
  4. test: Skip fsync when initdb'ing the storage controller db

    After initdb, we configure it with "fsync=off" anyway.
    hlinnaka committed Sep 24, 2024
    589594c
  5. test: Poll pageserver availability more aggressively at test startup

    Even with the 100 ms interval, on my laptop the pageserver always
    becomes available on second attempt, so this saves about 900 ms at
    every test startup.
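
The saving can be checked with a little arithmetic. The sketch below assumes the previous polling interval was 1 s (inferred from the quoted 900 ms saving, not stated above) and that the pageserver becomes ready 50 ms after startup (a made-up figure). `time_until_detected_ms` is a hypothetical helper, not part of the test suite.

```python
def time_until_detected_ms(ready_at_ms: int, interval_ms: int) -> int:
    """Return when a poller first sees the service as available.

    Attempts happen at t = 0, interval, 2 * interval, ...; the first attempt
    made at or after `ready_at_ms` succeeds.
    """
    attempts_failed = -(-ready_at_ms // interval_ms)  # ceiling division
    return attempts_failed * interval_ms


old = time_until_detected_ms(ready_at_ms=50, interval_ms=1000)  # assumed old interval
new = time_until_detected_ms(ready_at_ms=50, interval_ms=100)
print(old - new)  # 900
```

In both cases the second attempt is the one that succeeds; the shorter interval just makes that second attempt happen sooner.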
    hlinnaka committed Sep 24, 2024
    2f7ceca
  6. pageserver: handle decompression outside vectored read_blobs (#8942)

    Part of #8130.
    
    ## Problem
    
    Currently, decompression is performed within the `read_blobs`
    implementation, and the decompressed blob is appended to the end of
    the `BytesMut` buffer. We will lose the flexibility to extend the
    buffer when we switch to our own dio-aligned buffer (WIP in #8730).
    To facilitate the adoption of the aligned buffer, we need to refactor
    the code to perform decompression outside `read_blobs`.
    
    ## Summary of changes
    
    - `VectoredBlobReader::read_blobs` will return `VectoredBlob` without
    performing decompression and appending decompressed blob. It becomes the
    caller's responsibility to decompress the buffer.
    - Added a new `BufView` type that functions as `Cow<Bytes, &[u8]>`.
    - Perform decompression within `VectoredBlob::read` so that callers don't
    have to explicitly think about compression when using the reader
    interface.
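
The split of responsibilities can be illustrated with a small model. This is a sketch in Python rather than the pageserver's Rust, with `zlib` standing in for whatever compression the pageserver actually uses. The field names and the `read_blobs` signature here are assumptions made for the example, and `memoryview` only loosely mimics the borrowed side of `BufView`.

```python
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class VectoredBlob:
    """Location of one blob inside a shared read buffer (hypothetical layout)."""
    start: int
    end: int
    compressed: bool

    def read(self, buf: bytes) -> bytes:
        # Take a zero-copy view of the raw bytes out of the shared buffer.
        raw = memoryview(buf)[self.start:self.end]
        # Decompress here, so callers never have to check the compression flag.
        return zlib.decompress(raw) if self.compressed else bytes(raw)


def read_blobs(chunks: list[tuple[bytes, bool]]) -> tuple[bytes, list[VectoredBlob]]:
    """Pack raw blobs into one buffer and return their positions, undecoded."""
    buf = bytearray()
    blobs = []
    for payload, compressed in chunks:
        start = len(buf)
        buf += payload  # stored as-is: nothing is decompressed or appended later
        blobs.append(VectoredBlob(start, len(buf), compressed))
    return bytes(buf), blobs
```

Because `read_blobs` never grows the buffer after the read, the buffer can have a fixed, aligned size, which is the property the aligned-buffer work needs.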
    
    Signed-off-by: Yuchen Liang <[email protected]>
    yliang412 authored Sep 24, 2024
    4f67b02
  7. Catch Cancelled and don't print a warning for it (#9121)

    In the `imitate_synthetic_size_calculation_worker` function, we might
    obtain the `Cancelled` error variant instead of hitting the cancellation
    token based path. Therefore, catch `Cancelled` and handle it analogously
    to the cancellation case.
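
A minimal sketch of the idea, in Python rather than the pageserver's Rust. `Cancelled`, `CalculationError`, and `run_once` are hypothetical stand-ins for the real error variants and worker function, and `threading.Event` plays the role of the cancellation token.

```python
import logging
import threading

log = logging.getLogger("worker")


class Cancelled(Exception):
    """The calculation noticed shutdown and gave up (stand-in for the error variant)."""


class CalculationError(Exception):
    """Any other failure of the calculation."""


def run_once(calculate, cancel: threading.Event) -> str:
    # Cancellation-token path: shutdown was requested before we started.
    if cancel.is_set():
        return "cancelled"
    try:
        calculate()
    except Cancelled:
        # Same outcome as the token path: stop quietly, no warning logged.
        return "cancelled"
    except CalculationError as e:
        log.warning("synthetic size calculation failed: %s", e)
        return "failed"
    return "ok"
```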
     
    Fixes #8886.
    arpad-m authored Sep 24, 2024
    c47f355
  8. Fix compiler warnings on macOS (#9128)

    ## Problem
    
    Compiling the neon extension on macOS produces a warning, which `-Werror` turns into an error:
    ```
    pgxn/neon/neon_perf_counters.c:50:1: error: non-void function does not return a value [-Werror,-Wreturn-type]
    ```
    
    ## Summary of changes
    - Change the return type of `NeonPerfCountersShmemInit` to void
    bayandin authored Sep 24, 2024
    523cf71
  9. test: Make test_lfc_resize more robust (#9117)

    1. Increase statement_timeout. It defaults to 120 s, which is not quite
    enough on slow or busy systems with a debug build. On my laptop, the index
    creation takes about 100 s. On the buildfarm, we've seen failures, e.g.:
    https://neon-github-public-dev.s3.amazonaws.com/reports/pr-9084/10997888708/index.html#suites/821f97908a487f1d7d3a2a4dd1571e99/db1834bddfe8c5b9/
    
    2. Keep twiddling the LFC size through the whole test. Before, we would
    do it for the first 10 seconds, but that only covers a small part of the
    pgbench initialization phase. Change the loop so that the pgbench run
    time determines how long the test runs, and we keep changing the LFC for
    the whole time.
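
The changed loop structure could be sketched as follows. The names are hypothetical: `workload` stands in for the pgbench run (here a plain thread rather than a subprocess) and `resize` for whatever changes the LFC size. The loop ends when the workload ends, not after a fixed number of seconds.

```python
import itertools
import threading
import time


def twiddle_while_running(workload, resize, sizes, pause_s=0.01) -> int:
    """Run `workload` in a thread and call `resize` repeatedly until it exits.

    Returns the number of resize calls made.
    """
    worker = threading.Thread(target=workload)
    worker.start()
    resizes = 0
    for size in itertools.cycle(sizes):
        # The workload's lifetime, not a timer, decides when to stop.
        if not worker.is_alive():
            break
        resize(size)
        resizes += 1
        time.sleep(pause_s)
    worker.join()
    return resizes
```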
    
    In passing, also fix the bogus test description, which was copy-pasted from a
    completely unrelated test.
    hlinnaka authored Sep 24, 2024
    af5c54e
  10. Remove TenantState::Loading (#9118)

    The last real use was removed in commit de90bf4. It was still used in
    a few unit tests, but they can use Attaching too.
    hlinnaka authored Sep 24, 2024
    5cbf5b4
  11. chore(docker-compose): fix typo in readme (#9133)

    There was a typo in the readme inside the docker-compose folder.
    
    ## Summary of changes
    - Update the readme
    Damian972 authored Sep 24, 2024
    938b163
  12. fix(test): storage scrubber should only log to stdout with info (#9067)

    As @koivunej mentioned in the storage channel, for regression tests we
    don't need to create a log file for the scrubber, and we should reduce
    noisy logs.
    
    ## Summary of changes
    
    * Disable log file creation for storage scrubber
    * Only log at info level
    
    ---------
    
    Signed-off-by: Alex Chi Z <[email protected]>
    skyzh authored Sep 24, 2024
    5f2f31e

Commits on Sep 25, 2024

  1. storcon: add tags to scheduler logs (#9127)

    We log something at info level each time we schedule a shard to a
    non-secondary location.
    
    Might as well have context for it.
    VladLazar authored Sep 25, 2024
    a26cc29
  2. 7dcfccc
  3. storcon: do az aware scheduling (#9083)

    ## Problem
    
    Storage controller didn't previously consider AZ locality between
    compute and pageservers when scheduling nodes. Control plane has this
    feature, and, since we are migrating tenants away from it, we need
    feature parity to avoid perf degradations.
    
    ## Summary of changes
    
    The change itself is fairly simple:
    1. Thread az info into the scheduler
    2. Add an extra member to the scheduling scores
    
    Step (2) deserves some more discussion. Let's break it down by the shard
    type being scheduled:
    
    **Attached Shards**
    
    We wish for attached shards of a tenant to end up in the preferred AZ of
    the tenant, since that is where the compute is likely to be.
    
    The AZ member for `NodeAttachmentSchedulingScore` has been placed
    below the affinity score (so it's got the second biggest weight for
    picking the node). The rationale for going below the affinity score is
    to avoid having all shards of a single tenant placed on the same node
    in 2 node regions, since that would mean that one tenant can drive the
    general workload of an entire pageserver. I'm not 100% sure this is the
    right decision, so open to discussing hoisting the AZ up to first place.
    
    **Secondary Shards**

    We wish for secondary shards of a tenant to be scheduled in a different
    AZ from the preferred one for HA purposes.

    The AZ member for `NodeSecondarySchedulingScore` has been placed first,
    so nodes in different AZs from the preferred one will always be
    considered first. On small clusters, this can mean that all the
    secondaries of a tenant are scheduled to the same pageserver, but
    secondaries don't use up as many resources as the attached location,
    so IMO the argument made for attached shards doesn't hold.
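
The effect of member order on the outcome can be shown with tuples, which compare field by field much like a struct with a derived ordering. This is a sketch in Python, not the storage controller's Rust types. The field names, the `utilization` tie-breaker, and the convention that the lowest score wins are assumptions made for the example.

```python
from typing import NamedTuple


class AttachedScore(NamedTuple):
    affinity: int       # shards of the same tenant already on this node
    az_mismatch: bool   # True if the node is outside the tenant's preferred AZ
    utilization: int    # hypothetical tie-breaker


class SecondaryScore(NamedTuple):
    az_match: bool      # True if the node is inside the preferred AZ (bad for HA)
    affinity: int
    utilization: int


def pick(scores: dict) -> str:
    """Return the node whose score sorts lowest (earlier fields dominate)."""
    return min(scores, key=scores.get)


# Two-node region; the tenant prefers node-a's AZ and already has a shard there.
attached = {
    "node-a": AttachedScore(affinity=1, az_mismatch=False, utilization=10),
    "node-b": AttachedScore(affinity=0, az_mismatch=True, utilization=10),
}
print(pick(attached))  # node-b: affinity outranks AZ, so shards spread out

secondary = {
    "node-a": SecondaryScore(az_match=True, affinity=0, utilization=10),
    "node-b": SecondaryScore(az_match=False, affinity=3, utilization=10),
}
print(pick(secondary))  # node-b: AZ comes first, even with three shards already there
```

Hoisting the AZ member to first place in the attached score would flip the first result to node-a, which is the trade-off the description leaves open.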
    
    Related: #8848
    VladLazar authored Sep 25, 2024
    2cf47b1
  4. storage controller: make proxying of GETs to pageservers more robust (#9065)
    
    ## Problem
    
    These commits are split off from
    https://github.com/neondatabase/neon/pull/8971/commits where I was
    fixing this to make a better scale test pass -- Vlad also independently
    recognized these issues with cloudbench in
    #9062.
    
    1. The storage controller proxies GET requests to pageservers based on
    their intent, not the ground truth of where they're really attached.
    2. Proxied requests can race with scheduling to tenants, resulting in
    404 responses if the request hits the wrong pageserver.
    
    Closes: #9062
    
    ## Summary of changes
    
    1. If a shard has a running reconciler, then use the database
    generation_pageserver to decide who to proxy the request to
    2. If such a request gets a 404 response and its scheduled node has
    changed since the request was dispatched.
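
The first change can be sketched as a small selection function. This is a hedged illustration in Python, not the storage controller's code. `ShardState`, its fields other than `generation_pageserver`, and `proxy_target` are hypothetical names. The second change is not sketched, because the description above does not say what happens after the 404.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ShardState:
    intent_attached: Optional[int]        # node the scheduler wants the shard on
    generation_pageserver: Optional[int]  # node recorded in the database
    reconciler_running: bool


def proxy_target(shard: ShardState) -> Optional[int]:
    """Pick the pageserver node to which a GET should be proxied."""
    if shard.reconciler_running:
        # Mid-reconcile, the intent may not be realized yet; trust the database.
        return shard.generation_pageserver
    return shard.intent_attached
```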
    jcsp authored Sep 25, 2024
    4b711ca
  5. 518f598
  6. Build images for PG17 using Debian 12 "Bookworm" (#9132)

    This increases the support window of the OS used for PG17 by 2 years
    compared to the previous usage of Debian 11 "Bullseye".
    MMeent authored Sep 25, 2024
    c4f5736
  7. storcon: include timeline ID in LSN waiting logs (#9141)

    ## Problem
    It is hard to tell which timeline is holding up the migration.
    
    ## Summary of Changes
    Add the timeline ID to the log.
    VladLazar authored Sep 25, 2024
    c597238
  8. fix(pageserver): handle lsn lease requests for unnormalized lsns (#9137)

    Fixes #9098.
    
    ## Problem
    
    See
    #9098 (comment).
    
    ### Related
    
    A similar problem happened with branch creation, which was discussed
    [here](#2143 (comment))
    and fixed by #2529.
    
    ## Summary of changes
    
    - Normalize the LSN on the pageserver side upon an LSN lease request, and
    store the normalized LSN.
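
The description does not spell out what normalization does. The sketch below assumes the usual meaning for PostgreSQL WAL positions: an LSN that falls exactly on a WAL page boundary is moved past the page header, so that it points at where a record can actually start. The constants are the values of a default 64-bit PostgreSQL build (8 KiB WAL pages, 16 MiB segments, 24-byte short and 40-byte long page headers) and are assumptions here, as is the function itself, which is a Python illustration and not the pageserver's code.

```python
XLOG_BLCKSZ = 8192                   # WAL page size (assumed default)
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # WAL segment size (assumed default)
SHORT_PAGE_HEADER = 24               # header on ordinary WAL pages
LONG_PAGE_HEADER = 40                # header on the first page of a segment


def normalize_lsn(lsn: int, seg_size: int = WAL_SEGMENT_SIZE) -> int:
    """Move an LSN that sits on a page boundary past that page's header."""
    if lsn % XLOG_BLCKSZ != 0:
        return lsn  # already inside a page: nothing to do
    header = LONG_PAGE_HEADER if lsn % seg_size == 0 else SHORT_PAGE_HEADER
    return lsn + header


print(hex(normalize_lsn(0x2000)))     # 0x2018
print(hex(normalize_lsn(0x1000000)))  # 0x1000028
```

Storing the normalized value means that a lease requested at a boundary LSN and one requested at the equivalent normalized LSN refer to the same position.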
    
    Signed-off-by: Yuchen Liang <[email protected]>
    yliang412 authored Sep 25, 2024
    d447f49