
Compute release 2025-02-24 #10952

Merged · 51 commits · Feb 24, 2025

Conversation


@vipvap vipvap commented Feb 24, 2025

Compute release 2025-02-24

Please merge this Pull Request using the 'Create a merge commit' button.

jcgruenhage and others added 30 commits February 18, 2025 17:20
…#10865)

## Problem
If the deploy job on the release branch doesn't succeed, the preprod
deployment will not have happened. It was requested in #10662 that this
trigger a notification.

## Summary of changes
If we're on the release branch and the deploy job doesn't end up in
"success", notify the storage on-call on Slack.
## Problem

We log image compaction stats even when no image compaction happened.
This is logged every 10 seconds for every timeline.

## Summary of changes

Only log when we actually performed any image compaction.
## Problem

The usual workflow for me to debug read path errors in staging is:
download the tenant to my laptop, import, and then run some read tests.

With this patch, we can do this directly over staging pageservers.

## Summary of changes

* Add a new `touchpagelsn` API that performs a page read but does not return
the page contents.
* Allow reads at the latest record LSN from get/touchpagelsn.
* Add a `read_debug` config to the context.
* The read path reads this context config to decide whether to enable
read path tracing (see the sketch below).

Signed-off-by: Alex Chi Z <[email protected]>
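A minimal sketch of the idea, using hypothetical names (`RequestContext`, `get_page`, `touch_page_lsn`) rather than the pageserver's real types:

```rust
// Hypothetical names for illustration only; the real pageserver types differ.
#[derive(Clone, Default)]
struct RequestContext {
    /// Set from the request's `read_debug` parameter.
    read_debug: bool,
}

fn get_page(ctx: &RequestContext, key: u64, lsn: u64) -> Vec<u8> {
    if ctx.read_debug {
        // Read-path tracing is enabled per request via the context config.
        tracing::info!(key, lsn, "read path tracing enabled for this request");
    }
    Vec::new() // stand-in for the reconstructed page image
}

/// `touchpagelsn`-style call: perform the read, discard the page contents.
fn touch_page_lsn(ctx: &RequestContext, key: u64, lsn: u64) {
    let _ = get_page(ctx, key, lsn);
}
```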
…nchmarking workflow logs (#10874)

## Problem

We have links to deprecated dashboards in our logs.

Example:
https://github.com/neondatabase/neon/actions/runs/13382454571/job/37401983608#step:8:348

## Summary of changes

Use a link to the cross-service endpoint instead.

Example:
https://github.com/neondatabase/neon/actions/runs/13395407925/job/37413056148#step:7:345
## Problem

Read errors during repartitioning should be treated as critical errors.

## Summary of changes

~~We only have one call site~~ We have two call sites of
`repartition`: one during the initial image upload
optimization and another during image layer creation, so I added a
`critical!` there instead of inside `collect_keyspace` (see the sketch below).

---------

Signed-off-by: Alex Chi Z <[email protected]>
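A hedged, self-contained sketch of the call-site escalation described above (the real `critical!` macro in the codebase is different; this stand-in just logs at error level):

```rust
// Stand-in for the project's `critical!` macro: log loudly at the call site.
macro_rules! critical {
    ($($arg:tt)*) => {
        tracing::error!(critical = true, $($arg)*)
    };
}

// Stand-in for collect_keyspace + partitioning; read errors bubble up here.
fn repartition(lsn: u64) -> Result<Vec<u64>, String> {
    if lsn == 0 {
        Err("read error while collecting keyspace".into())
    } else {
        Ok(vec![lsn])
    }
}

fn create_image_layers(lsn: u64) {
    match repartition(lsn) {
        Ok(parts) => println!("creating image layers for {} partitions", parts.len()),
        // Escalate at the call site rather than inside `collect_keyspace`.
        Err(err) => critical!("failed to repartition at LSN {lsn}: {err}"),
    }
}
```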
## Problem
The nightly test discovered problems in the extensions upgrade test.
1. `PLv8` has different versions on PGv17 and PGv16 and a different test
set, which was not implemented correctly
[sample](https://github.com/neondatabase/neon/actions/runs/13382330475/job/37372930271)
2. The same for `semver`
[sample](https://github.com/neondatabase/neon/actions/runs/13382330475/job/37372930017)
3. `pgtap` interfered with the other tests, e.g. tables created by
other extensions caused its tests to fail.

## Summary of changes
The discovered problems were fixed.
1. The test list for `PLv8` is now generated using the original
Makefile.
2. The patches for `semver` are now split for PGv16 and PGv17.
3. `pgtap` is being tested in a separate database now.

---------

Co-authored-by: Mikhail Kot <[email protected]>
## Problem

We measure ingest performance for a few variants (stripe sizes,
pre-sharded, shard-split).
However, some phenomena in the pageserver (e.g. related to L0 compaction) can be
better observed and optimized with unsharded tenants.

## Summary of changes

- Allow creating projects with a policy that disables sharding
(`{"scheduling": "Essential"}`)
- Add a variant to ingest_benchmark that uses that policy for the new
project

## Test run
https://github.com/neondatabase/neon/actions/runs/13396325970
## Problem

Autosplits are crucial for bulk ingest performance. However, autosplits
were only attempted when there was no other pending work. This could
cause e.g. mass AZ affinity violations following Pageserver restarts to
starve out autosplits for hours.

Resolves #10762.

## Summary of changes

Always attempt autosplits in the background reconciliation loop,
regardless of other pending work.
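A rough sketch of the control-flow change, with illustrative function names rather than the storage controller's real API:

```rust
// Names below (`reconcile_all`, `autosplit_tenants`, `optimize_all`) are
// illustrative stand-ins, not the actual storage controller functions.
async fn background_reconcile_once() {
    let pending_reconciles = reconcile_all().await;

    // Previously this only ran when `pending_reconciles == 0`, so a large
    // backlog (e.g. AZ affinity fixes after restarts) could starve autosplits.
    autosplit_tenants().await;

    if pending_reconciles == 0 {
        optimize_all().await;
    }
}

async fn reconcile_all() -> usize { 0 } // spawn pending reconciliations, return count
async fn autosplit_tenants() {}         // split tenants that crossed the size threshold
async fn optimize_all() {}              // lower-priority placement optimizations
```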
## Problem
Our AWS account IDs are copy-pasted all over the place. A wrong paste
might only be caught late if we hardcode them, but will get flagged
instantly by actionlint if we access them from GitHub Actions variables.
Resolves #10787, follow-up
for #10613.

## Summary of changes
Access AWS account IDs using GitHub Actions variables.
## Problem

We've seen the previous default of 50 cause OOMs. Compacting many L0
layers at once now has limited benefit, since the cost is mostly linear
anyway. This is already being reduced to 20 in production settings.

## Summary of changes

Reduce `DEFAULT_COMPACTION_UPPER_LIMIT` to 20.

Once released, let's remove the config overrides.
## Problem

We'd like to enable CPU/heap profiling for storcon. This requires
jemalloc.

## Summary of changes

Use jemalloc as the global allocator, and enable heap sampling for
profiling.
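For reference, a minimal sketch of this setup, assuming the `tikv-jemallocator` crate; the sampling rate shown is illustrative, not necessarily the value storcon ships with:

```rust
use tikv_jemallocator::Jemalloc;

// Route all allocations through jemalloc.
#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;

// jemalloc reads the `malloc_conf` symbol at startup: turn on the profiler and
// sample allocation stacks roughly every 2^21 bytes so heap profiles can be served.
#[allow(non_upper_case_globals)]
#[export_name = "malloc_conf"]
pub static malloc_conf: &[u8] = b"prof:true,prof_active:true,lg_prof_sample:21\0";
```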
Adds CPU/heap profiling for storcon.

Also fixes allowlists to match on the path only, since profiling
endpoints take query parameters.

Requires #10892 for heap profiling.
…ines (#10877)

## Problem

A simpler version of #10812

## Summary of changes

Image layer creation will be preempted by L0 accumulated on other
timelines. We stop image layer generation if there's a pending L0
compaction request.

---------

Signed-off-by: Alex Chi Z <[email protected]>
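A simplified sketch of the preemption check (the names `pending_l0_compaction` and `ImageLayerCreationOutcome` are illustrative, not the actual pageserver code):

```rust
enum ImageLayerCreationOutcome {
    Done,
    /// Stopped early so a pending L0 compaction can run first.
    Preempted,
}

fn create_image_layers(
    partitions: &[u64],
    pending_l0_compaction: impl Fn() -> bool,
) -> ImageLayerCreationOutcome {
    for (i, partition) in partitions.iter().enumerate() {
        // Check between partitions so we never abandon a half-written layer.
        if pending_l0_compaction() {
            println!("preempting image layer creation after {i} partitions");
            return ImageLayerCreationOutcome::Preempted;
        }
        println!("creating image layer for partition {partition}");
    }
    ImageLayerCreationOutcome::Done
}
```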
This PR does the following things:

* The initial heartbeat round blocks the storage controller from
becoming online again. If all safekeepers are unresponsive, this can
cause storage controller startup to be very slow. The original intent of
#10583 was that heartbeats don't affect normal functionality of the
storage controller. So add a short timeout to prevent it from impeding
storcon functionality (see the sketch below).

* Fix the URL of the utilization endpoint.

* Don't send heartbeats to safekeepers which are decommissioned.

Part of #9011

context: https://neondb.slack.com/archives/C033RQ5SPDH/p1739966807592589
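The timeout idea from the first bullet, as a hedged sketch (the function name and the 5-second budget are assumptions, not the storage controller's actual values):

```rust
use std::time::Duration;

async fn initial_safekeeper_heartbeats() {
    let budget = Duration::from_secs(5);
    match tokio::time::timeout(budget, heartbeat_all_safekeepers()).await {
        Ok(responses) => tracing::info!("initial heartbeat round done: {responses:?}"),
        // On timeout, continue startup; heartbeats keep running in the background.
        Err(_) => tracing::warn!("initial heartbeats timed out after {budget:?}, continuing startup"),
    }
}

async fn heartbeat_all_safekeepers() -> Vec<bool> {
    Vec::new() // stand-in for fanning out utilization requests to safekeepers
}
```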
`pprof::symbolize()` used a regex to strip the Rust monomorphization
suffix from generic methods. However, the `backtrace` crate can do this
itself if formatted with the `:#` flag.

Also tighten up the code a bit.
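A small, self-contained example of the formatting behavior this relies on (not the pprof code itself):

```rust
fn print_current_backtrace() {
    backtrace::trace(|frame| {
        backtrace::resolve_frame(frame, |symbol| {
            if let Some(name) = symbol.name() {
                // `{:#}` is the alternate form: the demangled name without the
                // trailing `::h...` hash, so no regex post-processing is needed.
                println!("full: {name}  trimmed: {name:#}");
            }
        });
        true // keep walking the stack
    });
}
```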
…nning the build-tools image (#10890)

## Problem
Pinning build tools still replicated the ACR/ECR/Docker Hub login and
pushing, even though we have a reusable workflow for this. Was mentioned
as a TODO in #10613.

## Summary of changes
Reuse `_push-to-container-registry.yml` for pinning the build-tools
images.
…ion test (#10895)

## Problem

Background heatmap uploads and downloads were blocking the ones done
manually by the test.

## Summary of changes

Disable background heatmap uploads and downloads for the cold migration
test; the test does them explicitly.
Doing this to help debug offline safekeepers.

Part of #9011
Fix an issue caused by PR
#10891: we introduced the
concept of timeouts for heartbeats, where we would hang up on the other
side of the oneshot channel if a timeout happened (future gets
cancelled, receiver is dropped).

This hang-up would make the heartbeat task panic when it did obtain the
response, as we unwrap the result of the send operation. According to the
logs, that panic is the last sign of life we see from that process
invocation. I'm not sure what brings down the process; in theory tokio
[should
continue](https://docs.rs/tokio/latest/tokio/runtime/enum.UnhandledPanic.html#variant.Ignore),
but I don't know.

Alternative to #10901.
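A hedged sketch of the failure mode and fix, assuming a `tokio::sync::oneshot` reply channel (not the actual heartbeater code):

```rust
use tokio::sync::oneshot;

async fn heartbeat_task(reply: oneshot::Sender<bool>) {
    let response = true; // stand-in for the real heartbeat result

    // Before: `reply.send(response).unwrap()` — this panics if the requester
    // hit its timeout, was cancelled, and dropped the receiving end.
    if reply.send(response).is_err() {
        tracing::debug!("heartbeat requester timed out; dropping the response");
    }
}
```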
## Problem

Statistics are saved in a local file and so are lost on compute restart.

Persist them in the pageserver using the same AUX file mechanism used for
replication slots.

See more about motivation in
https://neondb.slack.com/archives/C04DGM6SMTM/p1703077676522789

## Summary of changes

Persist the pgstat file using the AUX mechanism


Postgres PRs:
neondatabase/postgres#547
neondatabase/postgres#446
neondatabase/postgres#445

Related to #6684 and #6228

Co-authored-by: Konstantin Knizhnik <[email protected]>
## Problem
We don't test `pg_repack`.
## Summary of changes
A test for `pg_repack` is added.
We already log a message for this during the L0 flush, so the additional
message is mostly noise.
Mark walredo errors as critical too.

Also pull the pattern matching out into the outer `match`.

Follows #10872.
ALTER SUBSCRIPTION requires an AccessExclusive lock,
which conflicts with iteration over pg_subscription when multiple
databases are present and operations are applied concurrently.

Fix by explicitly locking pg_subscription
at the beginning of the transaction in each database.

## Problem
neondatabase/cloud#24292
## Problem

The storage controller uses insecure HTTP for the pageserver API.

Closes: neondatabase/cloud#23734
Closes: neondatabase/cloud#24091

## Summary of changes

- Add an optional `listen_https_port` field to the storage controller's Node
state and its API (RegisterNode/ListNodes/etc).
- Allow updating `listen_https_port` on node registration, to gradually
add an https port for all nodes.
- Add a `use_https_pageserver_api` CLI option to the storage controller to
enable https (see the sketch below).
- The pageserver doesn't support https for now and always reports
`https_port=None`. This will be addressed in a follow-up PR.
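A hedged sketch of how the optional port might be consumed when building pageserver API URLs; the `Node` fields other than `listen_https_port` are simplified assumptions:

```rust
struct Node {
    listen_http_addr: String,
    listen_http_port: u16,
    /// Optional until every pageserver registers an https port.
    listen_https_port: Option<u16>,
}

impl Node {
    /// Base URL for the pageserver API, honoring `use_https_pageserver_api`.
    fn api_base_url(&self, use_https_pageserver_api: bool) -> Result<String, String> {
        if use_https_pageserver_api {
            let port = self
                .listen_https_port
                .ok_or_else(|| "node registered without an https port".to_string())?;
            Ok(format!("https://{}:{}", self.listen_http_addr, port))
        } else {
            Ok(format!("http://{}:{}", self.listen_http_addr, self.listen_http_port))
        }
    }
}
```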
…10797)

These hacks were added to appease the forward compatibility tests and
can be removed.

Signed-off-by: Tristan Partin <[email protected]>
## Problem

The interpreted SK <-> PS protocol does not guard against gaps (neither
does the Vanilla one, but that's beside the point).

## Summary of changes

Extend the protocol to include the start LSN of the PG WAL section from
which the records were interpreted.
Validation is enabled via a config flag on the pageserver and works as
follows:

**Case 1**: `raw_wal_start_lsn` is smaller than the requested LSN.
There can't be a gap here, but we check that the shard only received records
it hasn't seen before.

**Case 2**: `raw_wal_start_lsn` is equal to the requested LSN.
This is the happy case: no gap and nothing to check.

**Case 3**: `raw_wal_start_lsn` is greater than the requested LSN.
This is a gap.

To make Case 3 work I had to bend the protocol a bit.
We read chunks of WAL which aren't record aligned and feed them to the decoder.
The picture below shows a shard which subscribes at a position somewhere within Record 2.
We already have a WAL reader which is below that position, so we wait for it to catch up.
We read some WAL in Read 1 (all of Record 1 and some of Record 2). The new shard doesn't
need Record 1 (it has already processed it according to its starting position), but we read
past its starting position. When we do Read 2, we decode Record 2 and ship it off to the shard,
but the starting position of Read 2 is greater than the starting position the shard requested.
This looks like a gap.


![image](https://github.com/user-attachments/assets/8aed292e-5d62-46a3-9b01-fbf9dc25efe0)

To make it work, we extend the protocol to send an empty
`InterpretedWalRecords` to shards
if the WAL the records originated from ends the requested start
position. On the pageserver,
that just updates the tracking LSNs in memory (no-op really). This gives
us a workaround for
the fake gap.

As a drive-by, make `InterpretedWalRecords::next_record_lsn` mandatory
in the application-level definition.
It's always included.

Related: neondatabase/cloud#23935
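A condensed sketch of the three validation cases, with simplified types (the real `InterpretedWalRecords` and the pageserver's tracking state look different):

```rust
struct InterpretedWalRecords {
    /// Start LSN of the raw PG WAL section the records were interpreted from.
    raw_wal_start_lsn: u64,
    /// next_record_lsn of each record relevant to this shard (may be empty
    /// for the keep-alive batches described above).
    record_lsns: Vec<u64>,
}

/// `last_ingested` is the LSN up to which this shard has already applied WAL.
fn validate_batch(batch: &InterpretedWalRecords, last_ingested: u64) -> Result<(), String> {
    use std::cmp::Ordering::*;
    match batch.raw_wal_start_lsn.cmp(&last_ingested) {
        // Case 1: no gap possible, but every record must be new to this shard.
        Less => {
            if batch.record_lsns.iter().any(|&lsn| lsn <= last_ingested) {
                return Err("record at or below the last ingested LSN".into());
            }
            Ok(())
        }
        // Case 2: the happy case, nothing to check.
        Equal => Ok(()),
        // Case 3: the raw WAL section starts past our position, i.e. a gap.
        Greater => Err(format!(
            "WAL gap: expected to resume at {last_ingested}, got {}",
            batch.raw_wal_start_lsn
        )),
    }
}
```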
## Problem

Accidentally removed `(void)` from the definition of the
`CheckPointReplicationState` function.

## Summary of changes

Restore function prototype.

neondatabase/postgres#585
neondatabase/postgres#586

---------

Co-authored-by: Konstantin Knizhnik <[email protected]>
jcsp and others added 5 commits February 24, 2025 12:22
…ion (#10848)

## Problem

We failed to detect #10845
before merging, because the tests we run with a matrix of component
versions didn't include the ones that did live migrations.

## Summary of changes

- Do a live migration during the storage controller smoke test, since
this is a pretty core piece of functionality
- Apply a compat version matrix to the graceful cluster restart test,
since this is the functionality that we most urgently need to work
across versions to make deploys work.

I expect the first CI run of this to fail, because
#10845 isn't merged yet.
## Problem

There are a bunch of minor improvements that are too small and
insignificant on their own, so they are collected in one PR.

## Summary of changes
- Add runner arch to artifact name to make it easier to distinguish
files on S3
([ref](https://neondb.slack.com/archives/C059ZC138NR/p1739365938371149))
- Use `github.event.pull_request.number` instead of parsing
`$GITHUB_EVENT_PATH` file
- Update Allure CLI and `allure-pytest`
## Problem

#10788 introduced an API for
warming up attached locations
by downloading all layers in the heatmap. We intend to use it for
warming up timelines after unarchival too,
but it doesn't work. Any heatmap generated after the unarchival will not
include our timeline, so we've lost
all those layers.

## Summary of changes

Generate a cheeky heatmap on unarchival. It includes all the visible
layers.
Use that as the `PreviousHeatmap`, which feeds into actual heatmap
generation.

Closes: #10541
This upgrades the `proxy/` crate as well as the forked libraries in
`libs/proxy/` to edition 2024.

Also reformats the imports of those forked libraries via:

```
cargo +nightly fmt -p proxy -p postgres-protocol2 -p postgres-types2 -p tokio-postgres2 -- -l --config imports_granularity=Module,group_imports=StdExternalCrate,reorder_imports=true
```

It can be read commit-by-commit: the first commit has no formatting
changes, only changes to accommodate the new edition.

Part of #10918
@vipvap vipvap requested review from a team as code owners February 24, 2025 15:34
@vipvap vipvap requested review from hlinnaka, MMeent, skyzh, NanoBjorn, piercypixel, cloneable and sharnoff and removed request for a team February 24, 2025 15:34

If this PR added a GUC in the Postgres fork or neon extension,
please regenerate the Postgres settings in the cloud repo:

make NEON_WORKDIR=path/to/neon/checkout \
  -C goapp/internal/shareddomain/postgres generate

If you're an external contributor, a Neon employee will assist in
making sure this step is done.


@skyzh skyzh left a comment


LGTM, and please let me know when and which tenant has #6560 (the pg_stat persistence) enabled, I will keep an eye on it and fix any potential storage perf issues :)

@ololobus

> and please let me know when and which tenant has #6560 (the pg_stat persistence) enabled, I will keep an eye on it and fix any potential storage perf issues :)

I think we will enable it on staging for all tenants first


7689 tests run: 7311 passed, 0 failed, 378 skipped (full report)


Flaky tests (4): Postgres 17

Code coverage* (full report)

  • functions: 32.8% (8634 of 26289 functions)
  • lines: 48.7% (72648 of 149289 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
33e5930 at 2025-02-24T19:07:12.737Z :recycle:

@tristan957 tristan957 merged commit bcfc633 into release-compute Feb 24, 2025
97 checks passed
@tristan957 tristan957 deleted the rc/release-compute/2025-02-24 branch February 24, 2025 19:09