Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

safekeeper: send AppendResponse on segment flush #9692

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

erikgrinaker
Copy link
Contributor

@erikgrinaker erikgrinaker commented Nov 8, 2024

Problem

When processing pipelined AppendRequests, we explicitly flush the WAL every second and return an AppendResponse. However, the WAL is also implicitly flushed on segment bounds, but this does not result in an AppendResponse. Because of this, concurrent transactions may take up to 1 second to commit and writes may take up to 1 second before sending to the pageserver.

Separately, we should consider flushing the WAL on transaction commits -- see #9690.

Resolves #9688.

Summary of changes

Advance flush_lsn when a WAL segment is closed and flushed, and emit an AppendResponse. To accommodate this, track the flush_lsn in addition to the flush_record_lsn.

Note that this will result in more frequent commits during pipelined WAL ingestion, resulting in a control file flush (3 fsyncs) on every segment bound. We should address #9663 first, e.g. by taking control file flushes off of the ingest hot path.

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat commit message to not include the above checklist

@erikgrinaker
Copy link
Contributor Author

@arssher Just submitting this for an early look to see if you agree with the overall approach. This needs a few tweaks and tests/benchmarks.

Copy link

github-actions bot commented Nov 8, 2024

5391 tests run: 5170 passed, 1 failed, 220 skipped (full report)


Failures on Postgres 17

# Run all failed tests locally:
scripts/pytest -vv -n $(nproc) -k "test_cli_start_stop[release-pg17]"

Test coverage report is not available

The comment gets automatically updated with the latest test results
4bfdf46 at 2024-11-13T10:49:55.004Z :recycle:

@erikgrinaker erikgrinaker force-pushed the erik/respond-segment-flush-lsn branch 2 times, most recently from 10223d1 to 1cf5b69 Compare November 9, 2024 15:40
@erikgrinaker
Copy link
Contributor Author

erikgrinaker commented Nov 9, 2024

Here are some benchmark results to illustrate the need for #9698 before merging this.

wal_acceptor_throughput/fsync=true/commit=true/size=1048576 pipelines 1 GB of AppendRequest with 1 MB logical messages, and sets commit_lsn to the previous message. commit_lsn gets clamped to flush_record_lsn by the Safekeeper (this is equivalent to the case where it lags behind the quorum, but the same holds for the quorum).

Before this change, flush_record_lsn was only advanced every second, and the compute only received an AppendResponse every second. Thus the Safekeeper only advanced its commit_lsn at the end of the 1 GB ingestion, which is when the control file is flushed.

With this change, the flush_record_lsn advances every segment flush, allowing the commit_lsn to advance every segment. This means that the control file is also flushed every segment.

Without #9698, this increases the number of control file flushes from 2 to 60 (each with 3 fsyncs on the ingest path), reducing throughput by 30%:

wal_acceptor_throughput/fsync=true/commit=true/size=1048576
                        time:   [2.3644 s 2.3874 s 2.4096 s]
                        thrpt:  [424.97 MiB/s 428.91 MiB/s 433.09 MiB/s]
                 change:
                        time:   [+41.215% +43.684% +45.926%] (p = 0.00 < 0.05)
                        thrpt:  [-31.472% -30.403% -29.186%]

60 control file flushes in 0.755s

With #9698, only 5 control file flushes happen, and more importantly, these happen off of the ingest hot path. Thus throughput remains unchanged:

wal_acceptor_throughput/fsync=true/commit=true/size=1048576
                        time:   [1.6705 s 1.6885 s 1.7071 s]
                        thrpt:  [599.85 MiB/s 606.46 MiB/s 612.98 MiB/s]
                 change:
                        time:   [-0.1404% +1.6188% +3.4530%] (p = 0.11 > 0.05)
                        thrpt:  [-3.3377% -1.5930% +0.1406%]

5 control file flushes in 0.064s

So #9698 is a necessary prerequisite to this PR.

@erikgrinaker
Copy link
Contributor Author

This should be ready for review now. We don't have any tests for wal_storage afaict, so I haven't added any further tests. We should add a proper test suite for this separately.

The benchmarks in #9692 (comment) confirm that this results in more frequent commits and no performance regression (assuming #9698 merges first). I'll add a separate benchmark measuring commit latency as part of #9690.

@erikgrinaker erikgrinaker marked this pull request as ready for review November 11, 2024 10:35
@erikgrinaker erikgrinaker requested a review from a team as a code owner November 11, 2024 10:35
@arssher
Copy link
Contributor

arssher commented Nov 12, 2024

Approach LGTM.

@erikgrinaker
Copy link
Contributor Author

Thanks! I think this should be good for a final review -- anything you think is missing?

/// The last LSN flushed to disk. May be in the middle of a record.
///
/// NB: when the rest of the system refers to `flush_lsn`, it usually
/// actually refers to `flush_record_lsn`. This ambiguity can be dangerous
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's proceed with renaming flush_lsn to flush_record_lsn everywhere in wal_storage related code? Plus leave the comment that other outside places might continue to call flush_record_lsn as flush_lsn because they care only about whole records.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. I'll submit a follow-up PR, to avoid too much churn in this one.

async fn write_exact(&mut self, pos: Lsn, mut buf: &[u8]) -> Result<()> {
// TODO: this shouldn't be possible, except possibly with write_lsn == 0.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's do this todo :)

Copy link
Contributor Author

@erikgrinaker erikgrinaker Nov 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will submit a follow-up PR.

);
}
// We have unflushed data (write_lsn != flush_lsn), but no file. This
// shouldn't happen, since the segment is flushed on close.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still might happen because write_in_segment temporarily takes the file; if fsync/rename fails it is not installed back. So either assert should be removed completely (and file reopened) or write_in_segment shouldn't take the file out.

This is not directly related to this PR and shouldn't be problem in practice because on error reconnection would do truncate_wal establishing self.flush_record_lsn == self.write_record_lsn, so I'm ok with leaving this out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

safekeeper: eagerly return AppendResponse with flush_lsn on segment fsync
2 participants