Storage release 2024-12-09 #10053

vipvap · 2024-12-09T06:05:44Z

Storage release 2024-12-09

Please merge this Pull Request using 'Create a merge commit' button

Improves `wait_until` by: * Use `timeout` instead of `iterations`. This allows changing the timeout/interval parameters independently. * Make `timeout` and `interval` optional (default 20s and 0.5s). Most callers don't care. * Only output status every 1s by default, and add optional `status_interval` parameter. * Remove `show_intermediate_error`, this was always emitted anyway. Most callers have been updated to use the defaults, except where they had good reason otherwise.

## Problem We can't easily tell how far the state of shards is from their AZ preferences. This can be a cause of performance issues, so it's important for diagnosability that we can tell easily if there are significant numbers of shards that aren't running in their preferred AZ. Related: neondatabase/cloud#15413 ## Summary of changes - In reconcile_all, count shards that are scheduled into the wrong AZ (if they have a preference), and publish it as a prometheus gauge. - Also calculate a statistic for how many shards wanted to reconcile but couldn't. This is clearly a lazy calculation: reconcile all only runs periodically. But that's okay: shards in the wrong AZ is something that only matters if it stays that way for some period of time.

Fixes neondatabase/cloud#20973. This refactors `connect_raw` in order to return direct access to the delayed notices. I cannot find a way to test this with psycopg2 unfortunately, although testing it with psql does return the expected results.

…ures (#9950) ## Problem `if: ${{ github.event.schedule }}` gets skipped if a previous step has failed, but we want to run the step for both `success` and `failure` ## Summary of changes - Add `!cancelled()` to notification step if-condition, to skip only cancelled jobs

## Problem The credentials providers tries to connect to AWS STS even when we use plain Redis connections. ## Summary of changes * Construct the CredentialsProvider only when needed ("irsa").

The spec was written for the buggy protocol which we had before the one more similar to Raft was implemented. Update the spec with what we currently have. ref #8699

Fix the HTTP AuthMethod to accomodate the JWT authorization method. Introduces the JWT issuer as an additional field in the parquet logs

On reconfigure, we no longer passed a port for the extension server which caused us to not write out the neon.extension_server_port line. Thus, Postgres thought we were setting the port to the default value of 0. PGC_POSTMASTER GUCs cannot be set at runtime, which causes the following log messages: > LOG: parameter "neon.extension_server_port" cannot be changed without restarting the server > LOG: configuration file "/var/db/postgres/compute/pgdata/postgresql.conf" contains errors; unaffected changes were applied Fixes: #9945 Signed-off-by: Tristan Partin <[email protected]>

## Problem After enabling LFC in tests and lowering `shared_buffers` we started having more problems with `test_pg_regress`. ## Summary of changes Set `shared_buffers` to 1MB to both exercise getPage requests/LFC, and still have enough room for Postgres to operate. Everything smaller might be not enough for Postgres under load, and can cause errors like 'no unpinned buffers available'. See Konstantin's comment [1] as well. Fixes #9956 [1]: #9956 (comment)

## Problem I was touching `test_storage_controller_node_deletion` because for AZ scheduling work I was adding a change to the storage controller (kick secondaries during optimisation) that made a FIXME in this test defunct. While looking at it I also realized that we can easily fix the way node deletion currently doesn't use a proper ScheduleContext, using the iterator type recently added for that purpose. ## Summary of changes - A testing-only behavior in storage controller where if a secondary location isn't yet ready during optimisation, it will be actively polled. - Remove workaround in `test_storage_controller_node_deletion` that previously was needed because optimisation would get stuck on cold secondaries. - Update node deletion code to use a `TenantShardContextIterator` and thereby a proper ScheduleContext

#9899) Before this PR, the storcon_cli didn't have a way to show the tenant-wide information of the TenantDescribeResponse. Sadly, the `Serialize` impl for the tenant config doesn't skip on `None`, so, the output becomes a bit bloated. Maybe we can use `skip_serializing_if(Option::is_none)` in the future. => #9983

… metrics (#9870) This PR - fixes smgr metrics #9925 - adds an additional startup log line logging the current batching config - adds a histogram of batch sizes global and per-tenant - adds a metric exposing the current batching config The issue described #9925 is that before this PR, request latency was only observed *after* batching. This means that smgr latency metrics (most importantly getpage latency) don't account for - `wait_lsn` time - time spent waiting for batch to fill up / the executor stage to pick up the batch. The fix is to use a per-request batching timer, like we did before the initial batching PR. We funnel those timers through the entire request lifecycle. I noticed that even before the initial batching changes, we weren't accounting for the time spent writing & flushing the response to the wire. This PR drive-by fixes that deficiency by dropping the timers at the very end of processing the batch, i.e., after the `pgb.flush()` call. I was **unable to maintain the behavior that we deduct time-spent-in-throttle from various latency metrics. The reason is that we're using a *single* counter in `RequestContext` to track micros spent in throttle. But there are *N* metrics timers in the batch, one per request. As a consequence, the practice of consuming the counter in the drop handler of each timer no longer works because all but the first timer will encounter error `close() called on closed state`. A failed attempt to maintain the current behavior can be found in #9951. So, this PR remvoes the deduction behavior from all metrics. I started a discussion on Slack about it the implications this has for our internal SLO calculation: https://neondb.slack.com/archives/C033RQ5SPDH/p1732910861704029 # Refs - fixes #9925 - sub-issue #9377 - epic: #9376

## Problem The extensions for Postgres v17 are ready but we do not test the extensions shipped with v17 ## Summary of changes Build the test image based on Postgres v17. Run the tests for v17. --------- Co-authored-by: Anastasia Lubennikova <[email protected]>

## Problem We don't have good observability for memory usage. This would be useful e.g. to debug OOM incidents or optimize performance or resource usage. We would also like to use continuous profiling with e.g. [Grafana Cloud Profiles](https://grafana.com/products/cloud/profiles-for-continuous-profiling/) (see neondatabase/cloud#14888). This PR is intended as a proof of concept, to try it out in staging and drive further discussions about profiling more broadly. Touches #9534. Touches neondatabase/cloud#14888. Depends on #9779. Depends on #9780. ## Summary of changes Adds a HTTP route `/profile/heap` that takes a heap profile and returns it. Query parameters: * `format`: output format (`jemalloc` or `pprof`; default `pprof`). Unlike CPU profiles (see #9764), heap profiles are not symbolized and require the original binary to translate addresses to function names. To make this work with Grafana, we'll probably have to symbolize the process server-side -- this is left as future work, as is other output formats like SVG. Heap profiles don't work on macOS due to limitations in jemalloc.

## Problem `test_sharded_ingest` ingests a lot of data, which can cause shutdown to be slow e.g. due to local "S3 uploads" or compactions. This can cause test flakes during teardown. Resolves #9740. ## Summary of changes Perform an immediate shutdown of the cluster.

… deduction for smgr latency metrics (#9962) ## Problem In the batching PR - #9870 I stopped deducting the time-spent-in-throttle fro latency metrics, i.e., - smgr latency metrics (`SmgrOpTimer`) - basebackup latency (+scan latency, which I think is part of basebackup). The reason for stopping the deduction was that with the introduction of batching, the trick with tracking time-spent-in-throttle inside RequestContext and swap-replacing it from the `impl Drop for SmgrOpTimer` no longer worked with >1 requests in a batch. However, deducting time-spent-in-throttle is desirable because our internal latency SLO definition does not account for throttling. ## Summary of changes - Redefine throttling to be a page_service pagestream request throttle instead of a throttle for repository `Key` reads through `Timeline::get` / `Timeline::get_vectored`. - This means reads done by `basebackup` are no longer subject to any throttle. - The throttle applies after batching, before handling of the request. - Drive-by fix: make throttle sensitive to cancellation. - Rename metric label `kind` from `timeline_get` to `pagestream` to reflect the new scope of throttling. To avoid config format breakage, we leave the config field named `timeline_get_throttle` and ignore the `task_kinds` field. This will be cleaned up in a future PR. ## Trade-Offs Ideally, we would apply the throttle before reading a request off the connection, so that we queue the minimal amount of work inside the process. However, that's not possible because we need to do shard routing. The redefinition of the throttle to limit pagestream request rate instead of repository `Key` rate comes with several downsides: - We're no longer able to use the throttle mechanism for other other tasks, e.g. image layer creation. However, in practice, we never used that capability anyways. - We no longer throttle basebackup.

## Problem Sharded tenants should be run in a single AZ for best performance, so that computes have AZ-local latency to all the shards. Part of #8264 ## Summary of changes - When we split a tenant, instead of updating each shard's preferred AZ to wherever it is scheduled, propagate the preferred AZ from the parent. - Drop the check in `test_shard_preferred_azs` that asserts shards end up in their preferred AZ: this will not be true again until the optimize_attachment logic is updated to make this so. The existing check wasn't testing anything about scheduling, it was just asserting that we set preferred AZ in a way that matches the way things happen to be scheduled at time of split.

## Problem Since #9423 the non-zero shards no longer need SLRU content in order to do GC. This data is now redundant on shards >0. One release cycle after merging that PR, we may merge this one, which also stops writing those pages to shards > 0, reaping the efficiency benefit. Closes: #7512 Closes: #9641 ## Summary of changes - Avoid storing SLRUs on non-zero shards - Bonus: avoid storing aux files on non-zero shards

## Problem We saw a peculiar case where a pageserver apparently got a 0-tenant response to `/re-attach` but we couldn't see the request landing on a storage controller. It was hard to confirm retrospectively that the pageserver was configured properly at the moment it sent the request. ## Summary of changes - Log the URL to which we are sending the request - Log the NodeId and metadata that we sent

Keeping the `mock` postgres cplane adaptor using "stock" tokio-postgres allows us to remove a lot of dead weight from our actual postgres connection logic.

## Problem The Pageserver signal handler would only respond to a single signal and initiate shutdown. Subsequent signals were ignored. This meant that a `SIGQUIT` sent after a `SIGTERM` had no effect (e.g. in the case of a slow or stalled shutdown). The `test_runner` uses this to force shutdown if graceful shutdown is slow. Touches #9740. ## Summary of changes Keep responding to signals after the initial shutdown signal has been received. Arguably, the `test_runner` should also use `SIGKILL` rather than `SIGQUIT` in this case, but it seems reasonable to respond to `SIGQUIT` regardless.

## Problem There is no information on session being cancelled in 2 minutes at the moment ## Summary of changes The timeout being logged for the user

proxy doesn't ever provide multiple hosts/ports, so this code adds a lot of complexity of error handling for no good reason. (stacked on #9990)

Support tenant manifests in the storage scrubber: * list the manifests, order them by generation * delete all manifests except for the two most recent generations * for the latest manifest: try parsing it. I've tested this patch by running the against a staging bucket and it successfully deleted stuff (and avoided deleting the latest two generations). In follow-up work, we might want to also check some invariants of the manifest, as mentioned in #8088. Part of #9386 Part of #8088 --------- Co-authored-by: Christian Schwarz <[email protected]>

…fig (#9992) Before this PR, some override callbacks used `.default()`, others used `.setdefault()`. As of this PR, all callbacks use `.setdefault()` which I think is least prone to failure. Aligning on a single way will set the right example for future tests that need such customization. The `test_pageserver_getpage_throttle.py` technically is a change in behavior: before, it replaced the `tenant_config` field, now it just configures the throttle. This is what I believe is intended anyway.

## Problem ``` 2024-12-03T15:42:46.5978335Z + poetry run python /__w/neon/neon/scripts/ingest_perf_test_result.py --ingest /__w/neon/neon/test_runner/perf-report-local 2024-12-03T15:42:49.5325077Z Traceback (most recent call last): 2024-12-03T15:42:49.5325603Z File "/__w/neon/neon/scripts/ingest_perf_test_result.py", line 165, in <module> 2024-12-03T15:42:49.5326029Z main() 2024-12-03T15:42:49.5326316Z File "/__w/neon/neon/scripts/ingest_perf_test_result.py", line 155, in main 2024-12-03T15:42:49.5326739Z ingested = ingest_perf_test_result(cur, item, recorded_at_timestamp) 2024-12-03T15:42:49.5327488Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2024-12-03T15:42:49.5327914Z File "/__w/neon/neon/scripts/ingest_perf_test_result.py", line 99, in ingest_perf_test_result 2024-12-03T15:42:49.5328321Z psycopg2.extras.execute_values( 2024-12-03T15:42:49.5328940Z File "/github/home/.cache/pypoetry/virtualenvs/non-package-mode-_pxWMzVK-py3.11/lib/python3.11/site-packages/psycopg2/extras.py", line 1299, in execute_values 2024-12-03T15:42:49.5335618Z cur.execute(b''.join(parts)) 2024-12-03T15:42:49.5335967Z psycopg2.errors.InvalidTextRepresentation: invalid input syntax for type numeric: "concurrent-futures" 2024-12-03T15:42:49.5336287Z LINE 57: 'concurrent-futures', 2024-12-03T15:42:49.5336462Z ^ ``` ## Summary of changes - `test_page_service_batching`: save non-numeric params as `labels` - Add a runtime check that `metric_value` is NUMERIC

…rks (#9993) This is the first step towards batching rollout. Refs - rollout plan: neondatabase/cloud#20620 - task #9377 - uber-epic: #9376

## Problem We currently allow drain operations to proceed while the node policy is paused. ## Summary of changes Return a precondition failed error in such cases. The orchestrator is updated in neondatabase/infra#2544 to skip drain and fills if the pageserver is paused. Closes: #9907

…rd (#10035) ## Problem With the current metrics we can't identify which shards are ingesting data at any given time. ## Summary of changes Add a metric for the number of wal records received for processing by each shard. This is per (tenant, timeline, shard).

Adds an improved bulk insert benchmark, including S3 uploads. Touches #9789.

Signed-off-by: Tristan Partin <[email protected]>

## Problem part of #9114, stacked PR over #9809 The compaction scheduler now schedules partial compaction jobs. ## Summary of changes * Add the compaction job splitter based on size. * Schedule subcompactions using the compaction scheduler. * Test subcompaction scheduler in the smoke regress test. * Temporarily disable layer map checks --------- Signed-off-by: Alex Chi Z <[email protected]>

…9923) If the pageserver connection is lost while receiving the prefetch request, the prefetch queue is cleared. The error message prints the values from the prefetch slot, but because the slot was already cleared, they're all zeros: LOG: [NEON_SMGR] [shard 0] No response from reading prefetch entry 0: 0/0/0.0 block 0. This can be caused by a concurrent disconnect To fix, make local copies of the values. In the passing, also add a sanity check that if the receive() call succeeds, the prefetch slot is still intact.

## Problem The test is flaky. ## Summary of changes Disable the test. --------- Signed-off-by: Alex Chi Z <[email protected]>

…trics (#10042) ## Problem In #9962 I changed the smgr metrics to include time spent on flush. It isn't under our (=storage team's) control how long that flush takes because the client can stop reading requests. ## Summary of changes Stop the timer as soon as we've buffered up the response in the `pgb_writer`. Track flush time in a separate metric. --------- Co-authored-by: Yuchen Liang <[email protected]>

… detaches (#10011) ## Problem We saw a tenant get stuck when it had been put into Pause scheduling mode to pin it to a pageserver, then it was left idle for a while and the control plane tried to detach it. Close: #9957 ## Summary of changes - When changing policy to Detached or Secondary, set the scheduling policy to Active. - Add a test that exercises this - When persisting tenant shards, set their `generation_pageserver` to null if the placement policy is not Attached (this enables consistency checks to work, and avoids leaving state in the DB that could be confusing/misleading in future)

github-actions · 2024-12-09T07:01:42Z

7051 tests run: 6726 passed, 0 failed, 325 skipped (full report)

Flaky tests (2)

Postgres 17

test_compute_pageserver_connection_stress: debug-x86-64

Postgres 16

test_timeline_archive[4]: release-arm64

Code coverage* (full report)

functions: 31.4% (8334 of 26529 functions)
lines: 47.7% (65561 of 137366 lines)

* collected from Rust tests only

_{The comment gets automatically updated with the latest test results
6c349e7 at 2024-12-09T07:01:41.522Z :recycle:}

danieltprice · 2024-12-12T21:16:26Z

Reviewed for changelog

lubennikovaav and others added 30 commits December 2, 2024 10:10

Update pgvector to 0.8.0 (#9733)

45658cc

proxy: Create Elasticache credentials provider lazily (#9967)

1b60571

## Problem The credentials providers tries to connect to AWS STS even when we use plain Redis connections. ## Summary of changes * Construct the CredentialsProvider only when needed ("irsa").

Update consensus protocol spec (#9607)

fa909c2

The spec was written for the buggy protocol which we had before the one more similar to Raft was implemented. Update the spec with what we currently have. ref #8699

Bump OTel, tracing, reqwest crates (#9970)

243bca1

feat(proxy): emit JWT auth method and JWT issuer in parquet logs (#9971)

2dc238e

Fix the HTTP AuthMethod to accomodate the JWT authorization method. Introduces the JWT issuer as an additional field in the parquet logs

chore(proxy): remove postgres config parser and md5 support (#9990)

27a42d0

Keeping the `mock` postgres cplane adaptor using "stock" tokio-postgres allows us to remove a lot of dead weight from our actual postgres connection logic.

Improvement: add console redirect timeout warning (#9985)

3baef0b

## Problem There is no information on session being cancelled in 2 minutes at the moment ## Summary of changes The timeout being logged for the user

chore(proxy): enforce single host+port (#9995)

9ef0662

proxy doesn't ever provide multiple hosts/ports, so this code adds a lot of complexity of error handling for no good reason. (stacked on #9990)

page_service: enable batching in Rust & Python Tests + Python benchma…

8d93d02

…rks (#9993) This is the first step towards batching rollout. Refs - rollout plan: neondatabase/cloud#20620 - task #9377 - uber-epic: #9376

VladLazar and others added 9 commits December 6, 2024 12:51

test_runner/performance: add improved bulk insert benchmark (#9812)

14c4fae

Adds an improved bulk insert benchmark, including S3 uploads. Touches #9789.

Bump sql_exporter to 0.16.0 (#10041)

e4837b0

Signed-off-by: Tristan Partin <[email protected]>

test(pageserver): disable gc_compaction smoke test for now (#10045)

b1fd086

## Problem The test is flaky. ## Summary of changes Disable the test. --------- Signed-off-by: Alex Chi Z <[email protected]>

Storage release 2024-12-09

6c349e7

vipvap requested review from a team as code owners December 9, 2024 06:05

vipvap requested review from yliang412, conradludgate, knizhnik, piercypixel and mtyazici and removed request for a team December 9, 2024 06:05

VladLazar approved these changes Dec 9, 2024

View reviewed changes

bayandin approved these changes Dec 9, 2024

View reviewed changes

ololobus approved these changes Dec 9, 2024

View reviewed changes

VladLazar merged commit 5f4559e into release Dec 9, 2024
89 checks passed

VladLazar deleted the rc/release/2024-12-09 branch December 9, 2024 12:28

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Storage release 2024-12-09 #10053

Storage release 2024-12-09 #10053

vipvap commented Dec 9, 2024

github-actions bot commented Dec 9, 2024

Postgres 17

Postgres 16

danieltprice commented Dec 12, 2024

Storage release 2024-12-09 #10053

Storage release 2024-12-09 #10053

Conversation

vipvap commented Dec 9, 2024