
Compute release 2025-02-24 #10952

Merged · 51 commits · Feb 24, 2025

Conversation


@vipvap vipvap commented Feb 24, 2025

Compute release 2025-02-24

Please merge this Pull Request using the 'Create a merge commit' button.

jcgruenhage and others added 30 commits February 18, 2025 17:20
…#10865)

## Problem
If the deploy job on the release branch doesn't succeed, the preprod
deployment will not have happened. It was requested in #10662 that this
trigger a notification.

## Summary of changes
If we're on the release branch and the deploy job doesn't end up in
"success", notify the storage on-call on Slack.
## Problem

We log image compaction stats even when no image compaction happened.
This is logged every 10 seconds for every timeline.

## Summary of changes

Only log when we actually performed any image compaction.
## Problem

The usual workflow for me to debug read path errors in staging is:
download the tenant to my laptop, import, and then run some read tests.

With this patch, we can do this directly over staging pageservers.

## Summary of changes

* Add a new `touchpagelsn` API that performs a page read but does not return
the page contents.
* Allow reads at the latest record LSN from get/touchpagelsn.
* Add a `read_debug` config to the context.
* The read path reads this context config to decide whether to enable
read path tracing (see the sketch below).

Signed-off-by: Alex Chi Z <[email protected]>
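A minimal sketch of the idea, using hypothetical names (`RequestContext`, `get_page`, `touch_page_lsn`) rather than the pageserver's real types:

```rust
// Hypothetical names for illustration only; the real pageserver types differ.
#[derive(Clone, Default)]
struct RequestContext {
    /// Set from the request's `read_debug` parameter.
    read_debug: bool,
}

fn get_page(ctx: &RequestContext, key: u64, lsn: u64) -> Vec<u8> {
    if ctx.read_debug {
        // Read-path tracing is enabled per request via the context config.
        tracing::info!(key, lsn, "read path tracing enabled for this request");
    }
    Vec::new() // stand-in for the reconstructed page image
}

/// `touchpagelsn`-style call: perform the read, discard the page contents.
fn touch_page_lsn(ctx: &RequestContext, key: u64, lsn: u64) {
    let _ = get_page(ctx, key, lsn);
}
```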
…nchmarking workflow logs (#10874)

## Problem

We have links to deprecated dashboards in our logs.

Example:
https://github.com/neondatabase/neon/actions/runs/13382454571/job/37401983608#step:8:348

## Summary of changes

Use a link to the cross-service endpoint instead.

Example:
https://github.com/neondatabase/neon/actions/runs/13395407925/job/37413056148#step:7:345
## Problem

Read errors during repartitioning should be treated as critical errors.

## Summary of changes

~~We only have one call site~~ We have two call sites of
`repartition`: one during the initial image upload
optimization and another during image layer creation, so I added a
`critical!` there instead of inside `collect_keyspace` (see the sketch below).

---------

Signed-off-by: Alex Chi Z <[email protected]>
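A hedged, self-contained sketch of the call-site escalation described above (the real `critical!` macro in the codebase is different; this stand-in just logs at error level):

```rust
// Stand-in for the project's `critical!` macro: log loudly at the call site.
macro_rules! critical {
    ($($arg:tt)*) => {
        tracing::error!(critical = true, $($arg)*)
    };
}

// Stand-in for collect_keyspace + partitioning; read errors bubble up here.
fn repartition(lsn: u64) -> Result<Vec<u64>, String> {
    if lsn == 0 {
        Err("read error while collecting keyspace".into())
    } else {
        Ok(vec![lsn])
    }
}

fn create_image_layers(lsn: u64) {
    match repartition(lsn) {
        Ok(parts) => println!("creating image layers for {} partitions", parts.len()),
        // Escalate at the call site rather than inside `collect_keyspace`.
        Err(err) => critical!("failed to repartition at LSN {lsn}: {err}"),
    }
}
```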
## Problem
The nightly test discovered problems in the extensions upgrade test.
1. `PLv8` has different versions on PGv17 and PGv16 and a different test
set, which was not implemented correctly
[sample](https://github.com/neondatabase/neon/actions/runs/13382330475/job/37372930271)
2. The same for `semver`
[sample](https://github.com/neondatabase/neon/actions/runs/13382330475/job/37372930017)
3. `pgtap` interfered with the other tests, e.g. tables created by
other extensions caused its tests to fail.

## Summary of changes
The discovered problems were fixed.
1. The test list for `PLv8` is now generated using the original
Makefile.
2. The patches for `semver` are now split for PGv16 and PGv17.
3. `pgtap` is being tested in a separate database now.

---------

Co-authored-by: Mikhail Kot <[email protected]>
## Problem

We measure ingest performance for a few variants (stripe sizes,
pre-sharded, shard-split).
However, some phenomena in the pageserver (e.g. related to L0 compaction) can be
better observed and optimized with unsharded tenants.

## Summary of changes

- Allow creating projects with a policy that disables sharding
(`{"scheduling": "Essential"}`)
- Add a variant to ingest_benchmark that uses that policy for the new
project

## Test run
https://github.com/neondatabase/neon/actions/runs/13396325970
## Problem

Autosplits are crucial for bulk ingest performance. However, autosplits
were only attempted when there was no other pending work. This could
cause e.g. mass AZ affinity violations following Pageserver restarts to
starve out autosplits for hours.

Resolves #10762.

## Summary of changes

Always attempt autosplits in the background reconciliation loop,
regardless of other pending work.
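A rough sketch of the control-flow change, with illustrative function names rather than the storage controller's real API:

```rust
// Names below (`reconcile_all`, `autosplit_tenants`, `optimize_all`) are
// illustrative stand-ins, not the actual storage controller functions.
async fn background_reconcile_once() {
    let pending_reconciles = reconcile_all().await;

    // Previously this only ran when `pending_reconciles == 0`, so a large
    // backlog (e.g. AZ affinity fixes after restarts) could starve autosplits.
    autosplit_tenants().await;

    if pending_reconciles == 0 {
        optimize_all().await;
    }
}

async fn reconcile_all() -> usize { 0 } // spawn pending reconciliations, return count
async fn autosplit_tenants() {}         // split tenants that crossed the size threshold
async fn optimize_all() {}              // lower-priority placement optimizations
```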
## Problem
Our AWS account IDs are copy-pasted all over the place. A wrong paste
might only be caught late if we hardcode them, but will get flagged
instantly by actionlint if we access them from GitHub Actions variables.
Resolves #10787, follow-up
for #10613.

## Summary of changes
Access AWS account IDs using GitHub Actions variables.
## Problem

We've seen the previous default of 50 cause OOMs. Compacting many L0
layers at once now has limited benefit, since the cost is mostly linear
anyway. This is already being reduced to 20 in production settings.

## Summary of changes

Reduce `DEFAULT_COMPACTION_UPPER_LIMIT` to 20.

Once released, let's remove the config overrides.
## Problem

We'd like to enable CPU/heap profiling for storcon. This requires
jemalloc.

## Summary of changes

Use jemalloc as the global allocator, and enable heap sampling for
profiling.
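For reference, a minimal sketch of this setup, assuming the `tikv-jemallocator` crate; the sampling rate shown is illustrative, not necessarily the value storcon ships with:

```rust
use tikv_jemallocator::Jemalloc;

// Route all allocations through jemalloc.
#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;

// jemalloc reads the `malloc_conf` symbol at startup: turn on the profiler and
// sample allocation stacks roughly every 2^21 bytes so heap profiles can be served.
#[allow(non_upper_case_globals)]
#[export_name = "malloc_conf"]
pub static malloc_conf: &[u8] = b"prof:true,prof_active:true,lg_prof_sample:21\0";
```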
Adds CPU/heap profiling for storcon.

Also fixes allowlists to match on the path only, since profiling
endpoints take query parameters.

Requires #10892 for heap profiling.
…ines (#10877)

## Problem

A simpler version of #10812

## Summary of changes

Image layer creation will be preempted by L0 accumulated on other
timelines. We stop image layer generation if there's a pending L0
compaction request.

---------

Signed-off-by: Alex Chi Z <[email protected]>
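A simplified sketch of the preemption check (the names `pending_l0_compaction` and `ImageLayerCreationOutcome` are illustrative, not the actual pageserver code):

```rust
enum ImageLayerCreationOutcome {
    Done,
    /// Stopped early so a pending L0 compaction can run first.
    Preempted,
}

fn create_image_layers(
    partitions: &[u64],
    pending_l0_compaction: impl Fn() -> bool,
) -> ImageLayerCreationOutcome {
    for (i, partition) in partitions.iter().enumerate() {
        // Check between partitions so we never abandon a half-written layer.
        if pending_l0_compaction() {
            println!("preempting image layer creation after {i} partitions");
            return ImageLayerCreationOutcome::Preempted;
        }
        println!("creating image layer for partition {partition}");
    }
    ImageLayerCreationOutcome::Done
}
```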
This PR does the following things:

* The initial heartbeat round blocks the storage controller from
becoming online again. If all safekeepers are unresponsive, this can
cause storage controller startup to be very slow. The original intent of
#10583 was that heartbeats don't affect normal functionality of the
storage controller. So add a short timeout to prevent it from impeding
storcon functionality (see the sketch below).

* Fix the URL of the utilization endpoint.

* Don't send heartbeats to safekeepers which are decommissioned.

Part of #9011

context: https://neondb.slack.com/archives/C033RQ5SPDH/p1739966807592589
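The timeout idea from the first bullet, as a hedged sketch (the function name and the 5-second budget are assumptions, not the storage controller's actual values):

```rust
use std::time::Duration;

async fn initial_safekeeper_heartbeats() {
    let budget = Duration::from_secs(5);
    match tokio::time::timeout(budget, heartbeat_all_safekeepers()).await {
        Ok(responses) => tracing::info!("initial heartbeat round done: {responses:?}"),
        // On timeout, continue startup; heartbeats keep running in the background.
        Err(_) => tracing::warn!("initial heartbeats timed out after {budget:?}, continuing startup"),
    }
}

async fn heartbeat_all_safekeepers() -> Vec<bool> {
    Vec::new() // stand-in for fanning out utilization requests to safekeepers
}
```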
`pprof::symbolize()` used a regex to strip the Rust monomorphization
suffix from generic methods. However, the `backtrace` crate can do this
itself if formatted with the `:#` flag.

Also tighten up the code a bit.
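A small, self-contained example of the formatting behavior this relies on (not the pprof code itself):

```rust
fn print_current_backtrace() {
    backtrace::trace(|frame| {
        backtrace::resolve_frame(frame, |symbol| {
            if let Some(name) = symbol.name() {
                // `{:#}` is the alternate form: the demangled name without the
                // trailing `::h...` hash, so no regex post-processing is needed.
                println!("full: {name}  trimmed: {name:#}");
            }
        });
        true // keep walking the stack
    });
}
```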
…nning the build-tools image (#10890)

## Problem
Pinning build tools still replicated the ACR/ECR/Docker Hub login and
pushing, even though we have a reusable workflow for this. Was mentioned
as a TODO in #10613.

## Summary of changes
Reuse `_push-to-container-registry.yml` for pinning the build-tools
images.
…ion test (#10895)

## Problem

Background heatmap uploads and downloads were blocking the ones done
manually by the test.

## Summary of changes

Disable background heatmap uploads and downloads for the cold migration
test; the test does them explicitly.
Doing this to help debug offline safekeepers.

Part of #9011
Fix an issue caused by PR
#10891: we introduced the
concept of timeouts for heartbeats, where we would hang up on the other
side of the oneshot channel if a timeout happened (future gets
cancelled, receiver is dropped).

This hang-up would make the heartbeat task panic when it did obtain the
response, as we unwrap the result of the send operation. According to the
logs, that panic is the last sign of life we see from that process
invocation. I'm not sure what brings down the process; in theory tokio
[should
continue](https://docs.rs/tokio/latest/tokio/runtime/enum.UnhandledPanic.html#variant.Ignore),
but I don't know.

Alternative to #10901.
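A hedged sketch of the failure mode and fix, assuming a `tokio::sync::oneshot` reply channel (not the actual heartbeater code):

```rust
use tokio::sync::oneshot;

async fn heartbeat_task(reply: oneshot::Sender<bool>) {
    let response = true; // stand-in for the real heartbeat result

    // Before: `reply.send(response).unwrap()` — this panics if the requester
    // hit its timeout, was cancelled, and dropped the receiving end.
    if reply.send(response).is_err() {
        tracing::debug!("heartbeat requester timed out; dropping the response");
    }
}
```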
## Problem

Statistics are saved in a local file and so are lost on compute restart.

Persist them in the pageserver using the same AUX file mechanism used for
replication slots.

See more about motivation in
https://neondb.slack.com/archives/C04DGM6SMTM/p1703077676522789

## Summary of changes

Persist the pgstat file using the AUX mechanism


Postgres PRs:
neondatabase/postgres#547
neondatabase/postgres#446
neondatabase/postgres#445

Related to #6684 and #6228

Co-authored-by: Konstantin Knizhnik <[email protected]>
## Problem
We don't test `pg_repack`.
## Summary of changes
A test for `pg_repack` is added.
We already log a message for this during the L0 flush, so the additional
message is mostly noise.
Mark walredo errors as critical too.

Also pull the pattern matching out into the outer `match`.

Follows #10872.
ALTER SUBSCRIPTION requires an AccessExclusive lock,
which conflicts with iteration over pg_subscription when multiple
databases are present and operations are applied concurrently.

Fix by explicitly locking pg_subscription
at the beginning of the transaction in each database.

## Problem
neondatabase/cloud#24292
## Problem

The storage controller uses insecure HTTP for the pageserver API.

Closes: neondatabase/cloud#23734
Closes: neondatabase/cloud#24091

## Summary of changes

- Add an optional `listen_https_port` field to the storage controller's Node
state and its API (RegisterNode/ListNodes/etc).
- Allow updating `listen_https_port` on node registration, to gradually
add an https port for all nodes.
- Add a `use_https_pageserver_api` CLI option to the storage controller to
enable https (see the sketch below).
- The pageserver doesn't support https for now and always reports
`https_port=None`. This will be addressed in a follow-up PR.
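A hedged sketch of how the optional port might be consumed when building pageserver API URLs; the `Node` fields other than `listen_https_port` are simplified assumptions:

```rust
struct Node {
    listen_http_addr: String,
    listen_http_port: u16,
    /// Optional until every pageserver registers an https port.
    listen_https_port: Option<u16>,
}

impl Node {
    /// Base URL for the pageserver API, honoring `use_https_pageserver_api`.
    fn api_base_url(&self, use_https_pageserver_api: bool) -> Result<String, String> {
        if use_https_pageserver_api {
            let port = self
                .listen_https_port
                .ok_or_else(|| "node registered without an https port".to_string())?;
            Ok(format!("https://{}:{}", self.listen_http_addr, port))
        } else {
            Ok(format!("http://{}:{}", self.listen_http_addr, self.listen_http_port))
        }
    }
}
```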
…10797)

These hacks were added to appease the forward compatibility tests and
can be removed.

Signed-off-by: Tristan Partin <[email protected]>
## Problem

The interpreted SK <-> PS protocol does not guard against gaps (neither
does the Vanilla one, but that's beside the point).

## Summary of changes

Extend the protocol to include the start LSN of the PG WAL section from
which the records were interpreted.
Validation is enabled via a config flag on the pageserver and works as
follows:

**Case 1**: `raw_wal_start_lsn` is smaller than the requested LSN.
There can't be a gap here, but we check that the shard only received records
it hasn't seen before.

**Case 2**: `raw_wal_start_lsn` is equal to the requested LSN.
This is the happy case: no gap and nothing to check.

**Case 3**: `raw_wal_start_lsn` is greater than the requested LSN.
This is a gap.

To make Case 3 work I had to bend the protocol a bit.
We read chunks of WAL which aren't record aligned and feed them to the decoder.
The picture below shows a shard which subscribes at a position somewhere within Record 2.
We already have a WAL reader which is below that position, so we wait for it to catch up.
We read some WAL in Read 1 (all of Record 1 and some of Record 2). The new shard doesn't
need Record 1 (it has already processed it according to its starting position), but we read
past its starting position. When we do Read 2, we decode Record 2 and ship it off to the shard,
but the starting position of Read 2 is greater than the starting position the shard requested.
This looks like a gap.


![image](https://github.com/user-attachments/assets/8aed292e-5d62-46a3-9b01-fbf9dc25efe0)

To make it work, we extend the protocol to send an empty
`InterpretedWalRecords` to shards
if the WAL the records originated from ends the requested start
position. On the pageserver,
that just updates the tracking LSNs in memory (no-op really). This gives
us a workaround for
the fake gap.

As a drive-by, make `InterpretedWalRecords::next_record_lsn` mandatory
in the application-level definition.
It's always included.

Related: neondatabase/cloud#23935
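A condensed sketch of the three validation cases, with simplified types (the real `InterpretedWalRecords` and the pageserver's tracking state look different):

```rust
struct InterpretedWalRecords {
    /// Start LSN of the raw PG WAL section the records were interpreted from.
    raw_wal_start_lsn: u64,
    /// next_record_lsn of each record relevant to this shard (may be empty
    /// for the keep-alive batches described above).
    record_lsns: Vec<u64>,
}

/// `last_ingested` is the LSN up to which this shard has already applied WAL.
fn validate_batch(batch: &InterpretedWalRecords, last_ingested: u64) -> Result<(), String> {
    use std::cmp::Ordering::*;
    match batch.raw_wal_start_lsn.cmp(&last_ingested) {
        // Case 1: no gap possible, but every record must be new to this shard.
        Less => {
            if batch.record_lsns.iter().any(|&lsn| lsn <= last_ingested) {
                return Err("record at or below the last ingested LSN".into());
            }
            Ok(())
        }
        // Case 2: the happy case, nothing to check.
        Equal => Ok(()),
        // Case 3: the raw WAL section starts past our position, i.e. a gap.
        Greater => Err(format!(
            "WAL gap: expected to resume at {last_ingested}, got {}",
            batch.raw_wal_start_lsn
        )),
    }
}
```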
## Problem

Accidentally removed `(void)` from the definition of the
`CheckPointReplicationState` function.

## Summary of changes

Restore function prototype.

neondatabase/postgres#585
neondatabase/postgres#586

---------

Co-authored-by: Konstantin Knizhnik <[email protected]>
jcsp and others added 5 commits February 24, 2025 12:22
…ion (#10848)

## Problem

We failed to detect #10845
before merging, because the tests we run with a matrix of component
versions didn't include the ones that did live migrations.

## Summary of changes

- Do a live migration during the storage controller smoke test, since
this is a pretty core piece of functionality
- Apply a compat version matrix to the graceful cluster restart test,
since this is the functionality that we most urgently need to work
across versions to make deploys work.

I expect the first CI run of this to fail, because
#10845 isn't merged yet.
## Problem

There are a bunch of minor improvements that are too small and
insignificant on their own, so they are collected in one PR.

## Summary of changes
- Add runner arch to artifact name to make it easier to distinguish
files on S3
([ref](https://neondb.slack.com/archives/C059ZC138NR/p1739365938371149))
- Use `github.event.pull_request.number` instead of parsing
`$GITHUB_EVENT_PATH` file
- Update Allure CLI and `allure-pytest`
## Problem

#10788 introduced an API for
warming up attached locations
by downloading all layers in the heatmap. We intend to use it for
warming up timelines after unarchival too,
but it doesn't work. Any heatmap generated after the unarchival will not
include our timeline, so we've lost
all those layers.

## Summary of changes

Generate a cheeky heatmap on unarchival. It includes all the visible
layers.
Use that as the `PreviousHeatmap`, which feeds into actual heatmap
generation.

Closes: #10541
This upgrades the `proxy/` crate as well as the forked libraries in
`libs/proxy/` to edition 2024.

Also reformats the imports of those forked libraries via:

```
cargo +nightly fmt -p proxy -p postgres-protocol2 -p postgres-types2 -p tokio-postgres2 -- -l --config imports_granularity=Module,group_imports=StdExternalCrate,reorder_imports=true
```

It can be read commit-by-commit: the first commit has no formatting
changes, only changes to accommodate the new edition.

Part of #10918
@vipvap vipvap requested review from a team as code owners February 24, 2025 15:34
@vipvap vipvap requested review from hlinnaka, MMeent, skyzh, NanoBjorn, piercypixel, cloneable and sharnoff and removed request for a team February 24, 2025 15:34

If this PR added a GUC in the Postgres fork or neon extension,
please regenerate the Postgres settings in the cloud repo:

make NEON_WORKDIR=path/to/neon/checkout \
  -C goapp/internal/shareddomain/postgres generate

If you're an external contributor, a Neon employee will assist in
making sure this step is done.


@skyzh skyzh left a comment


LGTM, and please let me know when and which tenant has #6560 (the pg_stat persistence) enabled, I will keep an eye on it and fix any potential storage perf issues :)

@ololobus

> and please let me know when and which tenant has #6560 (the pg_stat persistence) enabled, I will keep an eye on it and fix any potential storage perf issues :)

I think we will enable it on staging for all tenants first


7689 tests run: 7311 passed, 0 failed, 378 skipped (full report)


Flaky tests (4): Postgres 17

Code coverage* (full report)

  • functions: 32.8% (8634 of 26289 functions)
  • lines: 48.7% (72648 of 149289 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
33e5930 at 2025-02-24T19:07:12.737Z :recycle:

@tristan957 tristan957 merged commit bcfc633 into release-compute Feb 24, 2025
97 checks passed
@tristan957 tristan957 deleted the rc/release-compute/2025-02-24 branch February 24, 2025 19:09