Compute release 2024-10-20 #9459

ololobus · 2024-10-20T11:17:06Z

This is a PR for compute-only release on Mon. It's needed for Neon Authorize release.
Slack discussion is here

I see that there are some recent compute fixes (like 62a3348), so I just took the main HEAD at the moment of PR creation.

part of #9114
## Summary of changes
gc-compaction may take a lot of disk space, and if it does, the caller should do a partial gc-compaction. This patch adds a space check for the compaction job.
---------
Signed-off-by: Alex Chi Z <[email protected]>

## Problem
Tenant deletion only removes the current shards from remote storage. Any stale parent shards (before splits) will be left behind. These shards are kept since child shards may reference data from the parent until new image layers are generated.
## Summary of changes
* Document a special case for pageserver tenant deletion that deletes all shards in remote storage when given an unsharded tenant ID, as well as any unsharded tenant data.
* Pass an unsharded tenant ID to delete all remote storage under the tenant ID prefix.
* Split out `RemoteStorage::delete_prefix()` to delete a bucket prefix, with additional test coverage.
* Add a `delimiter` argument to `assert_prefix_empty()` to support partial prefix matches (i.e. all shards starting with a given tenant ID).

## Problem
In #9259, we found that the `check_safekeepers_synced` fast path could result in a lower basebackup LSN than the `flush_lsn` reported by Safekeepers in `VoteResponse`, causing the compute to panic once on startup.
This would happen if the Safekeeper had unflushed WAL records due to a compute disconnect. The `TIMELINE_STATUS` query would report a `flush_lsn` below these unflushed records, while `VoteResponse` would flush the WAL and report the advanced `flush_lsn`. See #9259 (comment).
## Summary of changes
Flush the WAL if the compute disconnects during WAL processing.

Simple PR to log installed_extensions statistics in the following format:
```
2024-10-17T13:53:02.860595Z INFO [NEON_EXT_STAT] {"extensions":[{"extname":"plpgsql","versions":["1.0"],"n_databases":2},{"extname":"neon","versions":["1.5"],"n_databases":1}]}
```
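For anyone consuming these log lines downstream, a minimal sketch of pulling the statistics back out is shown here. It assumes only the format in the sample above (the `[NEON_EXT_STAT]` marker followed by a JSON object); the helper name `parse_ext_stat` is hypothetical and not part of the PR.

```python
import json
from typing import Optional

MARKER = "[NEON_EXT_STAT]"


def parse_ext_stat(line: str) -> Optional[dict]:
    """Return the JSON payload that follows the marker, or None if the marker is absent."""
    _, sep, payload = line.partition(MARKER)
    if not sep:
        return None
    return json.loads(payload.strip())


# Sample line copied from the PR description.
line = (
    '2024-10-17T13:53:02.860595Z INFO [NEON_EXT_STAT] '
    '{"extensions":[{"extname":"plpgsql","versions":["1.0"],"n_databases":2},'
    '{"extname":"neon","versions":["1.5"],"n_databases":1}]}'
)
stats = parse_ext_stat(line)
# Map each extension name to the number of databases it is installed in.
by_name = {e["extname"]: e["n_databases"] for e in stats["extensions"]}
```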

Part of the aux v1 retirement #8623
## Summary of changes
Remove the write/read path for aux v1, but keep the config item and the index part field for now.
---------
Signed-off-by: Alex Chi Z <[email protected]>

…he result of a bad generation (#9383)
## Problem
The pageserver generally trusts the storage controller/control plane to give it valid generations. However, sometimes it should be obvious that a generation is bad, and for defense in depth we should detect that on the pageserver.
This PR is part 1 of 2:
1. In this PR we detect and warn on such situations, but do not block starting up the tenant.
2. Once we have confidence that the check is not firing unexpectedly in the field, part 2 of 2 will introduce a condition that refuses to start a tenant in this situation, and a test for that (maybe, if we can figure out how to spoof an ancient mtime).
Related: #6951
## Summary of changes
- When loading an index older than 2 weeks, log an INFO message noting that we will check for other indices.
- When loading an index older than 2 weeks _and_ a newer-generation index exists, log a warning.
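The two rules in the summary above can be sketched roughly as follows. This is an illustration only, not the pageserver's Rust implementation: the function name, its parameters, and the choice to return a log level string are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, Optional

# Threshold taken from the PR description: indices older than 2 weeks are suspicious.
STALE_AFTER = timedelta(weeks=2)


def index_load_log_level(
    loaded_generation: int,
    loaded_mtime: datetime,
    other_generations: Iterable[int],
    now: datetime,
) -> Optional[str]:
    """Return the most severe log level the check would emit, or None for a fresh index."""
    if now - loaded_mtime <= STALE_AFTER:
        return None  # recent index: nothing to report
    if any(g > loaded_generation for g in other_generations):
        return "WARN"  # old index and a newer-generation index exists
    return "INFO"  # old index, but no newer generation was found


now = datetime(2024, 10, 20, tzinfo=timezone.utc)
old = now - timedelta(weeks=3)
```

In this sketch the tenant still starts in every case, matching part 1 of the plan, which only detects and warns.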

We keep the practice of keeping the compiler up to date, pointing to the latest release. This is done by many other projects in the Rust ecosystem as well. [Release notes](https://github.com/rust-lang/rust/blob/master/RELEASES.md#version-1820-2024-10-17). Also update mold. [release notes for 2.34.0](https://github.com/rui314/mold/releases/tag/v2.34.0), [release notes for 2.34.1](https://github.com/rui314/mold/releases/tag/v2.34.1). Prior update was in #8939.

The forever ongoing effort of juggling multiple versions of rustls :3 now with new crypto library aws-lc. Because of dependencies, it is currently impossible to not have both ring and aws-lc in the dep tree, therefore our only options are not updating rustls or having both crypto backends enabled... According to benchmarks run by the rustls maintainer, aws-lc is faster than ring in some cases too <https://jbp.io/graviola/>, so it's not without its upsides.

Fixes new lints from `cargo +nightly clippy` (`clippy 0.1.83 (798fb83f 2024-10-16)`)

This PR introduces a `/grants` endpoint which allows granting specific `privileges` on a given `schema` to a given `role`.
Related to #9344
Together, these endpoints will be used to configure the JWT extension and grant usage on its schema to the specific roles that need it.
---------
Co-authored-by: Conrad Ludgate <[email protected]>

…fter two opposite migrations (#9435)
## Problem
If we migrate A->B, then B->A, and the notification of A->B fails, then we might have retained state that makes us think "A" is the last state we sent to the compute hook, whereas when we migrate B->A we should really be sending a fresh notification in case our earlier failed notification has actually mutated the remote compute config.
Closes: #9417
## Summary of changes
- Add a reproducer for the bug (`test_storage_controller_compute_hook_revert`).
- Refactor compute hook code to represent remote state with `ComputeRemoteState`, which stores a boolean for whether the compute has fully applied the change as well as the request that the compute accepted.
- The actual bug fix: after sending a compute notification, if we got a 423 response then update our `ComputeRemoteState` to reflect that we have mutated the remote state. This way, when we later try to notify for our historic location, we will properly see that as a change and send the notification.
Co-authored-by: Vlad Lazar <[email protected]>
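A toy model of the fix described above may help follow the A->B, B->A scenario. It is a sketch in Python, not the storage controller's Rust code: the field names, the helper functions, and the decision to leave state untouched on other error statuses are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ComputeRemoteState:
    request: Optional[str] = None  # last request the compute accepted
    applied: bool = False          # whether the compute fully applied it


def needs_notify(state: ComputeRemoteState, request: str) -> bool:
    """Skip the notification only if the same request is known to be fully applied."""
    return not (state.request == request and state.applied)


def record_response(state: ComputeRemoteState, request: str, status: int) -> None:
    if 200 <= status < 300:
        state.request, state.applied = request, True
    elif status == 423:
        # The fix: a 423 means the remote state was mutated but not fully applied.
        state.request, state.applied = request, False
    # Other statuses leave the state unchanged (assumption made for this sketch).


state = ComputeRemoteState(request="A", applied=True)
record_response(state, "B", 423)       # A->B notification fails with 423
resend_a = needs_notify(state, "A")    # B->A must send a fresh notification
```

Without the 423 branch, `state` would still claim that "A" is fully applied and the B->A notification would be skipped, which is the bug the PR describes.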

Might make the test less flaky.

Includes a multidict patch release to fix build with newer cpython.

Adds endpoint to install extensions:
**POST** `/extensions`
```
{"extension":"pg_session_jwt","database":"neondb","version":"1.0.0"}
```
Will be used by `local-proxy`. For example, for JWT authentication to work, and for JWT to work in RLS policies, the database needs to have the pg_session_jwt extension.
---------
Co-authored-by: Conrad Ludgate <[email protected]>
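As a rough illustration of how a client might issue this call, the sketch below builds (but does not send) the request using Python's standard library. The base URL `http://localhost:3080` and the helper name are assumptions; only the path, method, and body fields come from the description above.

```python
import json
from urllib.request import Request


def build_install_request(base_url: str, extension: str, database: str, version: str) -> Request:
    """Build a POST /extensions request with the JSON body shown in the PR description."""
    body = json.dumps(
        {"extension": extension, "database": database, "version": version}
    ).encode("utf-8")
    return Request(
        f"{base_url}/extensions",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical address; the real host and port depend on the deployment.
req = build_install_request("http://localhost:3080", "pg_session_jwt", "neondb", "1.0.0")
```

Passing `req` to `urllib.request.urlopen` would send it; that step is left out so the sketch has no network side effects.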

Otherwise term history starting with 0/0 is streamed to safekeepers. ref #9434

## Problem
Consider the following sequence of events:
1. Shard location gets downgraded to secondary while there's a libpq connection in pagestream mode from the compute.
2. There's no active tenant, so we return `QueryError::Reconnect` from `PageServerHandler::handle_get_page_at_lsn_request`.
3. Error bubbles up to `PostgresBackendIO::process_message`, bailing us out of pagestream mode.
4. We instruct the client to reconnect, but continue serving the libpq connection. The client isn't yet aware of the request to reconnect and believes it is still in pagestream mode. Pageserver fails to deserialize get page requests wrapped in `CopyData` since it's not in pagestream mode.
## Summary of Changes
When we wish to instruct the client to reconnect, also disconnect from the server side after flushing the error.
Closes neondatabase/cloud#17336

Follow up on #9344. We want to install the extension automatically. We didn't want to couple the extension into compute_ctl so instead local_proxy is the one to issue requests specific to the extension. depends on #9344 and #9395

## Problem
Pageserver returns 409 (Conflict) if any of the shards are already deleting the timeline. This resulted in an error being propagated out of the HTTP handler and to the client. It's an expected scenario so we should handle it nicely.
This caused failures in `test_storage_controller_smoke` [here](https://neon-github-public-dev.s3.amazonaws.com/reports/pr-9435/11390431900/index.html#suites/8fc5d1648d2225380766afde7c428d81/86eee4b002d6572d).
## Summary of Changes
Instead of returning an error on 409s, we now bubble the status code up and let the HTTP handler code retry until it gets a 404 or times out.
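The retry behaviour described above can be sketched as a small loop. This is an illustrative Python model rather than the storage controller's handler: the function name, the default timeout and interval, and the decision to raise on statuses other than 404 and 409 are assumptions.

```python
import time
from typing import Callable


def delete_until_gone(
    delete_once: Callable[[], int],
    timeout_s: float = 30.0,
    interval_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Call delete_once() until it returns 404. Returns False if the timeout expires first."""
    deadline = clock() + timeout_s
    while True:
        status = delete_once()
        if status == 404:
            return True  # the timeline is gone
        if status != 409:
            # Assumption for this sketch: anything else is a real error.
            raise RuntimeError(f"unexpected status {status}")
        if clock() >= deadline:
            return False  # still deleting when the timeout expired
        sleep(interval_s)  # 409: deletion already in progress, try again


# Usage with a stubbed pageserver that reports "in progress" twice, then "gone".
responses = iter([409, 409, 404])
done = delete_until_gone(lambda: next(responses), sleep=lambda s: None)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.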

In neon_collector_autoscaling.jsonnet, the collector name is hardcoded to neon_collector_autoscaling. This issue manifests itself such that sql_exporter would not find the collector configuration. Signed-off-by: Tristan Partin <[email protected]>

In #9453, we want to remove the non-gzipped basebackup code in the computes, and always request gzipped basebackups.
However, right now the pageserver's page service only accepts basebackup requests in the following formats:
* `basebackup <tenant_id> <timeline_id>`, lsn is determined by the pageserver as the most recent one (`timeline.get_last_record_rlsn()`)
* `basebackup <tenant_id> <timeline_id> <lsn>`
* `basebackup <tenant_id> <timeline_id> <lsn> --gzip`
We add a fourth case, `basebackup <tenant_id> <timeline_id> --gzip`, to allow requesting a gzipped basebackup for the latest lsn as well.
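The four accepted forms can be summarised with a small parser. This is a sketch of the grammar listed above, written in Python for brevity; it is not the page service's Rust parser, and the type and function names are hypothetical.

```python
from typing import NamedTuple, Optional


class BasebackupCmd(NamedTuple):
    tenant_id: str
    timeline_id: str
    lsn: Optional[str]  # None means "latest", chosen by the pageserver
    gzip: bool


def parse_basebackup(query: str) -> BasebackupCmd:
    """Parse `basebackup <tenant_id> <timeline_id> [<lsn>] [--gzip]`."""
    parts = query.split()
    if len(parts) < 3 or parts[0] != "basebackup":
        raise ValueError(f"not a basebackup command: {query!r}")
    tenant_id, timeline_id, rest = parts[1], parts[2], parts[3:]
    gzip = bool(rest) and rest[-1] == "--gzip"
    if gzip:
        rest = rest[:-1]
    # At most one argument (the LSN) may remain, and it must not look like a flag.
    if len(rest) > 1 or (rest and rest[0].startswith("--")):
        raise ValueError(f"unexpected arguments: {rest!r}")
    lsn = rest[0] if rest else None
    return BasebackupCmd(tenant_id, timeline_id, lsn, gzip)


# The new fourth form: gzip without an explicit LSN.
latest_gzip = parse_basebackup("basebackup t1 tl1 --gzip")
```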

github-actions · 2024-10-20T12:08:26Z

5229 tests run: 5015 passed, 0 failed, 214 skipped (full report)

Flaky tests (4)

Postgres 17

test_pg_regress[None]: debug-x86-64

Postgres 16

test_node_status_after_restart: release-x86-64

Postgres 14

test_multixid_wraparound_import: release-x86-64
test_layer_download_timeouted: release-x86-64

Code coverage* (full report)

functions: 31.2% (7566 of 24247 functions)
lines: 48.8% (59921 of 122750 lines)

* collected from Rust tests only

The comment gets automatically updated with the latest test results: cc25ef7 at 2024-10-20T13:37:59.928Z

danieltprice · 2024-10-24T20:03:55Z

Reviewed for changelog

skyzh and others added 20 commits October 17, 2024 10:29

refactor(pageserver): remove aux v1 code path (#9424)

63b3491

Part of the aux v1 retirement #8623
## Summary of changes
Remove the write/read path for aux v1, but keep the config item and the index part field for now.
---------
Signed-off-by: Alex Chi Z <[email protected]>

2024 oct new clippy lints (#9448)

b8304f9

Fixes new lints from `cargo +nightly clippy` (`clippy 0.1.83 (798fb83f 2024-10-16)`)

Increase shared_buffers in test_subscriber_synchronous_commit. (#9427)

98fee7a

Might make the test less flaky.

Update ruff to much newer version (#9433)

15fecff

Includes a multidict patch release to fix build with newer cpython.

walproposer: immediately exit if sync-safekeepers collected 0/0. (#9442)

fecff15

Otherwise term history starting with 0/0 is streamed to safekeepers. ref #9434

ololobus requested review from a team as code owners October 20, 2024 11:17

ololobus requested review from yliang412, conradludgate and hlinnaka and removed request for a team October 20, 2024 11:17

ololobus requested review from piercypixel, davidgomes, lubennikovaav, devjv and tristan957 and removed request for a team October 20, 2024 11:17

bump pg-session-jwt version (#9455)

cc25ef7

forgot to bump this before

conradludgate approved these changes Oct 20, 2024

ololobus changed the title ~~Compute release 2024-10-21~~ Compute release 2024-10-20 Oct 20, 2024

ololobus merged commit fe1b181 into release Oct 20, 2024
172 checks passed

ololobus deleted the compute-rc-2024-10-20 branch October 20, 2024 14:12
