safekeeper: send `AppendResponse` on segment flush #9692

erikgrinaker · 2024-11-08T14:55:15Z

Problem

When processing pipelined AppendRequests, we explicitly flush the WAL every second and return an AppendResponse. However, the WAL is also implicitly flushed on segment bounds, but this does not result in an AppendResponse. Because of this, concurrent transactions may take up to 1 second to commit and writes may take up to 1 second before sending to the pageserver.

Separately, we should consider flushing the WAL on transaction commits -- see #9690.

Resolves #9688.

Summary of changes

Advance flush_lsn when a WAL segment is closed and flushed, and emit an AppendResponse. To accommodate this, track the flush_lsn in addition to the flush_record_lsn.

Note that this will result in more frequent commits during pipelined WAL ingestion, resulting in a control file flush (3 fsyncs) on every segment bound. We should address #9663 first, e.g. by taking control file flushes off of the ingest hot path.

Checklist before requesting a review

I have performed a self-review of my code.
If it is a core feature, I have added thorough tests.
Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?
If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

Do not forget to reformat commit message to not include the above checklist

erikgrinaker · 2024-11-08T14:56:04Z

@arssher Just submitting this for an early look to see if you agree with the overall approach. This needs a few tweaks and tests/benchmarks.

github-actions · 2024-11-08T15:57:00Z

5391 tests run: 5170 passed, 1 failed, 220 skipped (full report)

Failures on Postgres 17

test_cli_start_stop: release-arm64

# Run all failed tests locally:
scripts/pytest -vv -n $(nproc) -k "test_cli_start_stop[release-pg17]"

Test coverage report is not available

_{The comment gets automatically updated with the latest test results
4bfdf46 at 2024-11-13T10:49:55.004Z :recycle:}

erikgrinaker · 2024-11-09T15:55:02Z

Here are some benchmark results to illustrate the need for #9698 before merging this.

wal_acceptor_throughput/fsync=true/commit=true/size=1048576 pipelines 1 GB of AppendRequest with 1 MB logical messages, and sets commit_lsn to the previous message. commit_lsn gets clamped to flush_record_lsn by the Safekeeper (this is equivalent to the case where it lags behind the quorum, but the same holds for the quorum).

Before this change, flush_record_lsn was only advanced every second, and the compute only received an AppendResponse every second. Thus the Safekeeper only advanced its commit_lsn at the end of the 1 GB ingestion, which is when the control file is flushed.

With this change, the flush_record_lsn advances every segment flush, allowing the commit_lsn to advance every segment. This means that the control file is also flushed every segment.

Without #9698, this increases the number of control file flushes from 2 to 60 (each with 3 fsyncs on the ingest path), reducing throughput by 30%:

wal_acceptor_throughput/fsync=true/commit=true/size=1048576
                        time:   [2.3644 s 2.3874 s 2.4096 s]
                        thrpt:  [424.97 MiB/s 428.91 MiB/s 433.09 MiB/s]
                 change:
                        time:   [+41.215% +43.684% +45.926%] (p = 0.00 < 0.05)
                        thrpt:  [-31.472% -30.403% -29.186%]

60 control file flushes in 0.755s

With #9698, only 5 control file flushes happen, and more importantly, these happen off of the ingest hot path. Thus throughput remains unchanged:

wal_acceptor_throughput/fsync=true/commit=true/size=1048576
                        time:   [1.6705 s 1.6885 s 1.7071 s]
                        thrpt:  [599.85 MiB/s 606.46 MiB/s 612.98 MiB/s]
                 change:
                        time:   [-0.1404% +1.6188% +3.4530%] (p = 0.11 > 0.05)
                        thrpt:  [-3.3377% -1.5930% +0.1406%]

5 control file flushes in 0.064s

So #9698 is a necessary prerequisite to this PR.

erikgrinaker · 2024-11-11T10:35:49Z

This should be ready for review now. We don't have any tests for wal_storage afaict, so I haven't added any further tests. We should add a proper test suite for this separately.

The benchmarks in #9692 (comment) confirm that this results in more frequent commits and no performance regression (assuming #9698 merges first). I'll add a separate benchmark measuring commit latency as part of #9690.

arssher · 2024-11-12T11:18:32Z

Approach LGTM.

erikgrinaker · 2024-11-12T13:53:14Z

Thanks! I think this should be good for a final review -- anything you think is missing?

arssher · 2024-11-15T14:04:41Z

safekeeper/src/wal_storage.rs

+    /// The last LSN flushed to disk. May be in the middle of a record.
+    ///
+    /// NB: when the rest of the system refers to `flush_lsn`, it usually
+    /// actually refers to `flush_record_lsn`. This ambiguity can be dangerous


Let's proceed with renaming flush_lsn to flush_record_lsn everywhere in wal_storage related code? Plus leave the comment that other outside places might continue to call flush_record_lsn as flush_lsn because they care only about whole records.

Sure. I'll submit a follow-up PR, to avoid too much churn in this one.

arssher · 2024-11-15T14:06:21Z

safekeeper/src/wal_storage.rs

    async fn write_exact(&mut self, pos: Lsn, mut buf: &[u8]) -> Result<()> {
+        // TODO: this shouldn't be possible, except possibly with write_lsn == 0.


Let's do this todo :)

Will submit a follow-up PR.

arssher · 2024-11-15T15:31:52Z

safekeeper/src/wal_storage.rs

-                );
-            }
+            // We have unflushed data (write_lsn != flush_lsn), but no file. This
+            // shouldn't happen, since the segment is flushed on close.


This still might happen because write_in_segment temporarily takes the file; if fsync/rename fails it is not installed back. So either assert should be removed completely (and file reopened) or write_in_segment shouldn't take the file out.

This is not directly related to this PR and shouldn't be problem in practice because on error reconnection would do truncate_wal establishing self.flush_record_lsn == self.write_record_lsn, so I'm ok with leaving this out.

erikgrinaker requested a review from arssher November 8, 2024 14:55

erikgrinaker mentioned this pull request Nov 8, 2024

safekeeper: flush WAL on transaction commit #9694

Draft

5 tasks

erikgrinaker mentioned this pull request Nov 8, 2024

safekeeper: don't flush control file on WAL ingest path #9698

Merged

5 tasks

erikgrinaker force-pushed the erik/respond-segment-flush-lsn branch 2 times, most recently from 10223d1 to 1cf5b69 Compare November 9, 2024 15:40

safekeeper: send AppendResponse on segment flush

71f04f9

erikgrinaker force-pushed the erik/respond-segment-flush-lsn branch from 1cf5b69 to 71f04f9 Compare November 11, 2024 10:28

erikgrinaker marked this pull request as ready for review November 11, 2024 10:35

erikgrinaker requested a review from a team as a code owner November 11, 2024 10:35

Merge branch 'main' into erik/respond-segment-flush-lsn

4bfdf46

arssher approved these changes Nov 15, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

safekeeper: send `AppendResponse` on segment flush #9692

safekeeper: send `AppendResponse` on segment flush #9692

erikgrinaker commented Nov 8, 2024 •

edited

Loading

erikgrinaker commented Nov 8, 2024

github-actions bot commented Nov 8, 2024 •

edited

Loading

erikgrinaker commented Nov 9, 2024 •

edited

Loading

erikgrinaker commented Nov 11, 2024

arssher commented Nov 12, 2024

erikgrinaker commented Nov 12, 2024

arssher Nov 15, 2024

erikgrinaker Nov 15, 2024

arssher Nov 15, 2024

erikgrinaker Nov 15, 2024 •

edited

Loading

arssher Nov 15, 2024

		async fn write_exact(&mut self, pos: Lsn, mut buf: &[u8]) -> Result<()> {
		// TODO: this shouldn't be possible, except possibly with write_lsn == 0.

safekeeper: send AppendResponse on segment flush #9692

Are you sure you want to change the base?

safekeeper: send AppendResponse on segment flush #9692

Conversation

erikgrinaker commented Nov 8, 2024 • edited Loading

Problem

Summary of changes

Checklist before requesting a review

Checklist before merging

erikgrinaker commented Nov 8, 2024

github-actions bot commented Nov 8, 2024 • edited Loading

5391 tests run: 5170 passed, 1 failed, 220 skipped (full report)

Failures on Postgres 17

Test coverage report is not available

erikgrinaker commented Nov 9, 2024 • edited Loading

erikgrinaker commented Nov 11, 2024

arssher commented Nov 12, 2024

erikgrinaker commented Nov 12, 2024

arssher Nov 15, 2024

Choose a reason for hiding this comment

erikgrinaker Nov 15, 2024

Choose a reason for hiding this comment

arssher Nov 15, 2024

Choose a reason for hiding this comment

erikgrinaker Nov 15, 2024 • edited Loading

Choose a reason for hiding this comment

arssher Nov 15, 2024

Choose a reason for hiding this comment

safekeeper: send `AppendResponse` on segment flush #9692

safekeeper: send `AppendResponse` on segment flush #9692

erikgrinaker commented Nov 8, 2024 •

edited

Loading

github-actions bot commented Nov 8, 2024 •

edited

Loading

erikgrinaker commented Nov 9, 2024 •

edited

Loading

erikgrinaker Nov 15, 2024 •

edited

Loading