
pageserver: batch InMemoryLayer puts, remove need to sort items by LSN during ingest #8591

Merged
merged 14 commits into main from jcsp/ingest-refactor-pt0
Aug 22, 2024

Conversation

@jcsp (Collaborator) commented Aug 2, 2024

Problem/Solution

TimelineWriter::put_batch is currently just a loop over individual puts. Each put acquires and releases locks and checks whether a new layer needs to be opened. Batching these is more efficient, but more importantly it unlocks future changes where we can pre-build serialized buffers much earlier in the ingest process, potentially even on the safekeeper (imagine a future model where some variant of DatadirModification lives on the safekeeper).

Ensuring that the values in put_batch are written to one layer also enables a simplification upstream: we no longer need to write values in LSN order. This saves us a sort, and it also simplifies follow-on refactors to DatadirModification: we can store metadata keys and data keys separately at that level without needing to zip them together in LSN order later.
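
To make the mechanism concrete, here is a minimal, self-contained sketch of a batched put that takes the layer lock once and appends values in whatever order they arrive, relying on a (key, lsn) index rather than LSN ordering. `SketchInMemoryLayer`, `Key`, and `Lsn` are simplified stand-ins, not the actual pageserver types.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

type Key = u64; // stand-in for the real compound key type
type Lsn = u64;

#[derive(Default)]
struct SketchInMemoryLayer {
    inner: Mutex<Inner>,
}

#[derive(Default)]
struct Inner {
    // (key, lsn) -> offset of the serialized value in `file`
    index: HashMap<(Key, Lsn), u64>,
    // stand-in for the ephemeral file's buffered writer
    file: Vec<u8>,
}

impl SketchInMemoryLayer {
    /// Write a whole batch under a single lock acquisition. Because every
    /// value in the batch lands in this layer and is addressed through the
    /// index, the batch does not need to be pre-sorted by LSN.
    fn put_batch(&self, batch: Vec<(Key, Lsn, Vec<u8>)>) {
        let mut inner = self.inner.lock().unwrap();
        for (key, lsn, serialized_value) in batch {
            let offset = inner.file.len() as u64;
            inner.file.extend_from_slice(&serialized_value);
            inner.index.insert((key, lsn), offset);
        }
    }

    /// Look up the file offset for a (key, lsn) pair, if present.
    fn get(&self, key: Key, lsn: Lsn) -> Option<u64> {
        self.inner.lock().unwrap().index.get(&(key, lsn)).copied()
    }
}

fn main() {
    let layer = SketchInMemoryLayer::default();
    // Values arrive out of LSN order; lookups still work via the index.
    layer.put_batch(vec![(1, 20, b"v2".to_vec()), (1, 10, b"v1".to_vec())]);
    assert!(layer.get(1, 10).is_some());
    assert!(layer.get(1, 20).is_some());
}
```

The point of the sketch is that the lock is taken once per batch, and visibility is determined by the index rather than by write order, so callers no longer need to sort by LSN.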

Why?

In this PR, these changes are simply optimizations, but they are motivated by evolving the ingest path in the direction of disentangling DatadirModification from Timeline. It may not be obvious how right now, but the general idea is that we'll end up with three phases of ingest (a rough sketch of these phases follows below):

  • A) Decode WAL records and build a DatadirModification with all the simple data contents already in a big serialized buffer, ready to write to an ephemeral layer <-- this part can be pipelined and parallelized, and done on a safekeeper!
  • B) Let that DatadirModification see a Timeline, so that it can also generate all the metadata updates that require a read-modify-write of existing pages
  • C) Dump the results of B into an ephemeral layer.

Related: #8452
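
Here is a hedged sketch of how those three phases could fit together. The types and function names (`WalRecord`, `SerializedBatch`, `MetadataUpdate`, `decode_and_serialize`, and so on) are hypothetical stand-ins for illustration, not the real pageserver API.

```rust
struct WalRecord(Vec<u8>);
struct SerializedBatch(Vec<u8>);
struct MetadataUpdate;
struct Timeline;

/// Phase A: a pure function of the WAL -- no Timeline needed, so it could run
/// pipelined/parallel, or even on a safekeeper.
fn decode_and_serialize(records: &[WalRecord]) -> SerializedBatch {
    let mut buf = Vec::new();
    for rec in records {
        // Pretend this is real decoding plus one-shot serialization.
        buf.extend_from_slice(&rec.0);
    }
    SerializedBatch(buf)
}

/// Phase B: needs a Timeline for read-modify-write of metadata pages
/// (relation sizes, directories, ...).
fn compute_metadata_updates(_timeline: &Timeline, _batch: &SerializedBatch) -> Vec<MetadataUpdate> {
    Vec::new()
}

/// Phase C: dump the pre-serialized data plus the metadata updates into an
/// ephemeral (in-memory) layer.
fn write_to_ephemeral_layer(batch: SerializedBatch, updates: Vec<MetadataUpdate>) {
    let _ = (batch, updates);
}

fn main() {
    let records = vec![WalRecord(vec![0u8; 8])];
    let batch = decode_and_serialize(&records);                 // A
    let timeline = Timeline;
    let updates = compute_metadata_updates(&timeline, &batch);  // B
    write_to_ephemeral_layer(batch, updates);                   // C
}
```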

Caveats

Building one big monolithic buffer of values to write to disk is ordinarily an anti-pattern: we prefer nice streaming I/O. However:

  • In future, when we do this first decode stage on the safekeeper, it would be inefficient to serialize a Vec of Value and then later deserialize it just to add blob size headers while writing into the ephemeral layer format. The idea is that for bulk write data, we will serialize exactly once (see the sketch after this list).
  • The monolithic buffer is a stepping stone to pipelining more of this: by serializing earlier (rather than at the final put_value), we will be able to parallelize the WAL decoding and bulk serialization of data page writes.
  • The ephemeral layer's buffered writer already stalls writes while it waits to flush: so while yes, we'll stall for a couple of milliseconds to write a couple of megabytes, we already have stalls like this, just distributed across smaller writes.
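
As an illustration of the "serialize exactly once" point, the following sketch builds a batch buffer directly in a length-prefixed blob shape, so it could be handed to the layer writer without re-serialization. The 4-byte little-endian size header is an assumption made for this sketch, not the real blob format.

```rust
/// Append one value to the batch buffer in its final on-disk shape and
/// return the offset of its header within the buffer.
fn push_blob(buf: &mut Vec<u8>, value: &[u8]) -> u64 {
    let offset = buf.len() as u64;
    buf.extend_from_slice(&(value.len() as u32).to_le_bytes()); // size header
    buf.extend_from_slice(value);                               // payload
    offset
}

fn main() {
    let mut batch = Vec::new();
    let off_a = push_blob(&mut batch, b"page image A");
    let off_b = push_blob(&mut batch, b"wal record B");
    // `batch` is now exactly the bytes the layer would write, and
    // (key, lsn) -> offset index entries can point at off_a / off_b.
    assert!(off_a < off_b);
    assert_eq!(batch.len(), 4 + 12 + 4 + 12);
}
```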

Benchmarks

This PR is primarily a stepping stone to safekeeper ingest filtering, but it also provides a modest efficiency improvement to the wal_recovery part of test_bulk_insert.

test_bulk_insert:

test_bulk_insert[neon-release-pg16].insert: 23.659 s
test_bulk_insert[neon-release-pg16].pageserver_writes: 5,428 MB
test_bulk_insert[neon-release-pg16].peak_mem: 626 MB
test_bulk_insert[neon-release-pg16].size: 0 MB
test_bulk_insert[neon-release-pg16].data_uploaded: 1,922 MB
test_bulk_insert[neon-release-pg16].num_files_uploaded: 8 
test_bulk_insert[neon-release-pg16].wal_written: 1,382 MB
test_bulk_insert[neon-release-pg16].wal_recovery: 18.981 s
test_bulk_insert[neon-release-pg16].compaction: 0.055 s

vs. tip of main:
test_bulk_insert[neon-release-pg16].insert: 24.001 s
test_bulk_insert[neon-release-pg16].pageserver_writes: 5,428 MB
test_bulk_insert[neon-release-pg16].peak_mem: 604 MB
test_bulk_insert[neon-release-pg16].size: 0 MB
test_bulk_insert[neon-release-pg16].data_uploaded: 1,922 MB
test_bulk_insert[neon-release-pg16].num_files_uploaded: 8 
test_bulk_insert[neon-release-pg16].wal_written: 1,382 MB
test_bulk_insert[neon-release-pg16].wal_recovery: 23.586 s
test_bulk_insert[neon-release-pg16].compaction: 0.054 s

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? If so, did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat commit message to not include the above checklist

@jcsp jcsp added c/storage/pageserver Component: storage: pageserver a/tech_debt Area: related to tech debt labels Aug 2, 2024

github-actions bot commented Aug 2, 2024

2198 tests run: 2134 passed, 0 failed, 64 skipped (full report)


Flaky tests (2)

Postgres 15

  • test_hot_standby_gc[True]: release
  • test_ondemand_wal_download_in_replication_slot_funcs: release

Code coverage* (full report)

  • functions: 32.4% (7241 of 22331 functions)
  • lines: 50.4% (58546 of 116115 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
67ef0e4 at 2024-08-22T10:14:09.856Z

@jcsp jcsp changed the title from "pageserver: avoid a spurious sort during ingest" to "pageserver: batch InMemoryLayer puts, remove need to sort items by LSN during ingest" on Aug 2, 2024
@jcsp jcsp force-pushed the jcsp/ingest-refactor-pt0 branch 3 times, most recently from d559ee2 to cb393b1 on August 6, 2024 14:58
@jcsp jcsp marked this pull request as ready for review August 14, 2024 09:57
@jcsp jcsp requested a review from a team as a code owner August 14, 2024 09:57
@jcsp jcsp requested review from skyzh, problame and VladLazar and removed request for skyzh August 14, 2024 09:57
Review threads (resolved):

  • pageserver/src/tenant/storage_layer/inmemory_layer.rs
  • pageserver/src/tenant/ephemeral_file.rs
  • pageserver/src/pgdatadir_mapping.rs
@jcsp jcsp requested a review from VladLazar August 15, 2024 19:20
@jcsp (Collaborator, Author) commented Aug 16, 2024

This should be good to go, but I plan on merging it after Monday's release branch so that it gets a week in staging.

@VladLazar (Contributor) left a comment

Looks good to me - just nits

Review threads (resolved):

  • pageserver/src/pgdatadir_mapping.rs
  • pageserver/src/tenant/storage_layer/inmemory_layer.rs
@jcsp jcsp enabled auto-merge (squash) August 22, 2024 09:45
@jcsp jcsp merged commit 7c74112 into main Aug 22, 2024
63 checks passed
@jcsp jcsp deleted the jcsp/ingest-refactor-pt0 branch August 22, 2024 10:04
jcsp added a commit that referenced this pull request Sep 3, 2024
…8621)

## Problem

Currently, DatadirModification keeps a key-indexed map of all pending
writes, even though we (almost) never need to read back dirty pages for
anything other than metadata pages (e.g. relation sizes).

Related: #6345

## Summary of changes

- commit() modifications before ingesting database-creation WAL records,
so that they are guaranteed to be able to get() everything they need
directly from the underlying Timeline.
- Split dirty pages in DatadirModification into pending_metadata_pages
and pending_data_pages. The data pages don't need to be in a
key-addressable format, so they just go in a Vec instead (a rough sketch
of this split follows below).
- Special-case handling of zero-page writes in DatadirModification,
putting them in a map which is flushed at the end of a WAL record. This
handles the case where, during ingest, we might first write a zero page
and then ingest a postgres write to that page. We used to do this via
the key-indexed map of writes, but in this PR we change the data page
write path to not bother indexing these by key.

My least favorite thing about this PR is that I needed to change the
DatadirModification interface to add the on_record_end call. This is not
very invasive because there's really only one place we use it, but it
changes the object's behaviour from being clearly an aggregation of many
records to having some per-record state. I could avoid this by
implicitly doing the work when someone calls set_lsn or commit -- I'm
open to opinions on whether that's cleaner or dirtier.
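
A simplified sketch of the split described above: the field and method names follow the description, but the types, the 8 KiB page size, and the bodies are stand-ins, not the real pageserver code.

```rust
use std::collections::HashMap;

type Key = u64;
type Lsn = u64;

#[derive(Default)]
struct SketchDatadirModification {
    /// Metadata pages (e.g. relation sizes) stay key-indexed because they may
    /// be read back and modified within the same batch.
    pending_metadata_pages: HashMap<Key, Vec<u8>>,
    /// Bulk data pages are write-only here, so an append-only Vec is enough;
    /// no key-addressable index is needed.
    pending_data_pages: Vec<(Key, Lsn, Vec<u8>)>,
    /// Zero pages written during the current WAL record; flushed at record
    /// end so a later write to the same page within the record wins.
    pending_zero_pages: HashMap<Key, Lsn>,
}

impl SketchDatadirModification {
    fn put_data_page(&mut self, key: Key, lsn: Lsn, img: Vec<u8>) {
        // A real write to a page supersedes a pending zero page for it.
        self.pending_zero_pages.remove(&key);
        self.pending_data_pages.push((key, lsn, img));
    }

    fn put_zero_page(&mut self, key: Key, lsn: Lsn) {
        self.pending_zero_pages.insert(key, lsn);
    }

    /// Called once per ingested WAL record: materialize any zero pages that
    /// were not overwritten by a later write in the same record.
    fn on_record_end(&mut self) {
        for (key, lsn) in self.pending_zero_pages.drain() {
            self.pending_data_pages.push((key, lsn, vec![0u8; 8192]));
        }
    }
}

fn main() {
    let mut m = SketchDatadirModification::default();
    m.put_zero_page(1, 10);
    m.put_data_page(1, 11, vec![1u8; 8192]); // overwrites the pending zero page
    m.on_record_end();
    assert_eq!(m.pending_data_pages.len(), 1);
}
```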

## Performance

There may be some efficiency improvement here, but the primary
motivation is to enable an earlier stage of ingest to operate without
access to a Timeline. The `pending_data_pages` part is the "fast path"
bulk write data that can in principle be generated without a Timeline,
in parallel with other ingest batches, and ultimately on the safekeeper.

`test_bulk_insert` on AX102 shows approximately the same results as in
the previous PR #8591:

```
------------------------------ Benchmark results -------------------------------
test_bulk_insert[neon-release-pg16].insert: 23.577 s
test_bulk_insert[neon-release-pg16].pageserver_writes: 5,428 MB
test_bulk_insert[neon-release-pg16].peak_mem: 637 MB
test_bulk_insert[neon-release-pg16].size: 0 MB
test_bulk_insert[neon-release-pg16].data_uploaded: 1,922 MB
test_bulk_insert[neon-release-pg16].num_files_uploaded: 8 
test_bulk_insert[neon-release-pg16].wal_written: 1,382 MB
test_bulk_insert[neon-release-pg16].wal_recovery: 18.264 s
test_bulk_insert[neon-release-pg16].compaction: 0.052 s
```