Skip to content

Commit

Permalink
safekeeper: fix atomicity of WAL truncation (#9685)
Browse files Browse the repository at this point in the history
If WAL truncation fails in the middle it might leave some data on disk
above the write/flush LSN. In theory, concatenated with previous records
it might form bogus WAL (though very unlikely in practice because CRC
would protect from that). To protect from that, set
pending_wal_truncation flag: means before any WAL writes truncation must
be retried until it succeeds. We already did that in case of safekeeper
restart, now extend this mechanism for failures without restart. Also,
importantly, reset LSNs in the beginning of the operation, not in the
end, because once on disk deletion starts previous pointers are wrong.

All this most likely haven't created any problems in practice because
CRC protects from the consequences.

Tests for this are hard; simulation infrastructure might be useful here
in the future, but not yet.
  • Loading branch information
arssher authored Nov 14, 2024
1 parent 1280b70 commit d06bf4b
Showing 1 changed file with 47 additions and 19 deletions.
66 changes: 47 additions & 19 deletions safekeeper/src/wal_storage.rs
Original file line number Diff line number Diff line change
Expand Up @@ -127,23 +127,29 @@ pub struct PhysicalStorage {
/// - doesn't point to the end of the segment
file: Option<File>,

/// When false, we have just initialized storage using the LSN from find_end_of_wal().
/// In this case, [`write_lsn`] can be less than actually written WAL on disk. In particular,
/// there can be a case with unexpected .partial file.
/// When true, WAL truncation potentially has been interrupted and we need
/// to finish it before allowing WAL writes; see truncate_wal for details.
/// In this case [`write_lsn`] can be less than actually written WAL on
/// disk. In particular, there can be a case with unexpected .partial file.
///
/// Imagine the following:
/// - 000000010000000000000001
/// - it was fully written, but the last record is split between 2 segments
/// - after restart, `find_end_of_wal()` returned 0/1FFFFF0, which is in the end of this segment
/// - `write_lsn`, `write_record_lsn` and `flush_record_lsn` were initialized to 0/1FFFFF0
/// - it was fully written, but the last record is split between 2
/// segments
/// - after restart, `find_end_of_wal()` returned 0/1FFFFF0, which is in
/// the end of this segment
/// - `write_lsn`, `write_record_lsn` and `flush_record_lsn` were
/// initialized to 0/1FFFFF0
/// - 000000010000000000000002.partial
/// - it has only 1 byte written, which is not enough to make a full WAL record
/// - it has only 1 byte written, which is not enough to make a full WAL
/// record
///
/// Partial segment 002 has no WAL records, and it will be removed by the next truncate_wal().
/// This flag will be set to true after the first truncate_wal() call.
/// Partial segment 002 has no WAL records, and it will be removed by the
/// next truncate_wal(). This flag will be set to true after the first
/// truncate_wal() call.
///
/// [`write_lsn`]: Self::write_lsn
is_truncated_after_restart: bool,
pending_wal_truncation: bool,
}

impl PhysicalStorage {
Expand Down Expand Up @@ -208,7 +214,7 @@ impl PhysicalStorage {
flush_record_lsn: flush_lsn,
decoder: WalStreamDecoder::new(write_lsn, state.server.pg_version / 10000),
file: None,
is_truncated_after_restart: false,
pending_wal_truncation: true,
})
}

Expand Down Expand Up @@ -405,6 +411,13 @@ impl Storage for PhysicalStorage {
startpos
);
}
if self.pending_wal_truncation {
bail!(
"write_wal called with pending WAL truncation, write_lsn={}, startpos={}",
self.write_lsn,
startpos
);
}

let write_seconds = time_io_closure(self.write_exact(startpos, buf)).await?;
// WAL is written, updating write metrics
Expand Down Expand Up @@ -479,15 +492,34 @@ impl Storage for PhysicalStorage {
);
}

// Quick exit if nothing to do to avoid writing up to 16 MiB of zeros on
// disk (this happens on each connect).
if self.is_truncated_after_restart
// Quick exit if nothing to do and we know that the state is clean to
// avoid writing up to 16 MiB of zeros on disk (this happens on each
// connect).
if !self.pending_wal_truncation
&& end_pos == self.write_lsn
&& end_pos == self.flush_record_lsn
{
return Ok(());
}

// Atomicity: we start with LSNs reset because once on disk deletion is
// started it can't be reversed. However, we might crash/error in the
// middle, leaving garbage above the truncation point. In theory,
// concatenated with previous records it might form bogus WAL (though
// very unlikely in practice because CRC would guard from that). To
// protect, set pending_wal_truncation flag before beginning: it means
// truncation must be retried and WAL writes are prohibited until it
// succeeds. Flag is also set on boot because we don't know if the last
// state was clean.
//
// Protocol (HandleElected before first AppendRequest) ensures we'll
// always try to ensure clean truncation before any writes.
self.pending_wal_truncation = true;

self.write_lsn = end_pos;
self.write_record_lsn = end_pos;
self.flush_record_lsn = end_pos;

// Close previously opened file, if any
if let Some(unflushed_file) = self.file.take() {
self.fdatasync_file(&unflushed_file).await?;
Expand All @@ -513,11 +545,7 @@ impl Storage for PhysicalStorage {
fs::rename(wal_file_path, wal_file_partial_path).await?;
}

// Update LSNs
self.write_lsn = end_pos;
self.write_record_lsn = end_pos;
self.flush_record_lsn = end_pos;
self.is_truncated_after_restart = true;
self.pending_wal_truncation = false;
Ok(())
}

Expand Down

1 comment on commit d06bf4b

@github-actions
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

5491 tests run: 5253 passed, 1 failed, 237 skipped (full report)


Failures on Postgres 16

# Run all failed tests locally:
scripts/pytest -vv -n $(nproc) -k "test_sharded_ingest[release-pg16-github-actions-selfhosted-1]"
Flaky tests (2)

Postgres 17

Postgres 16

Code coverage* (full report)

  • functions: 31.8% (7889 of 24837 functions)
  • lines: 49.4% (62454 of 126324 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
d06bf4b at 2024-11-14T11:56:19.490Z :recycle:

Please sign in to comment.