
[STORAGE USE REDUCTION] Commitlog segment compression #1592

Open
cloutiertyler opened this issue Aug 15, 2024 · 7 comments · May be fixed by #2034
@cloutiertyler (Contributor)

No description provided.

@cloutiertyler cloutiertyler changed the title Commitlog segment compression [STORAGE USE REDUCTION] Commitlog segment compression Aug 15, 2024
@kim (Contributor) commented Oct 9, 2024

Please keep in mind that closed commitlog segments may need to be accessed by tx offset, at least until the most recent snapshot. We maintain an offset index, mapping tx- to byte-offsets, for that purpose.

So either we compress segments only when moving them to cold storage (i.e. only those from before a snapshot), or the offset index must remain functional. The latter can be achieved by compressing segments with zstd-seekable, which adds a seek table that allows seeking to the original (uncompressed) byte offsets. Since zstd also employs a magic byte sequence, we don't even need to change anything about the commitlog format.
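The offset-index seek described above can be sketched roughly as follows. This is an illustrative model only, not the actual SpacetimeDB commitlog API: the index is modeled as a sorted list of `(tx_offset, byte_offset)` pairs, and the lookup binary-searches for the last indexed position at or before the requested transaction.

```rust
// Hypothetical sketch of a tx-offset -> byte-offset index lookup.
// The pair layout and function name are illustrative, standing in for
// SpacetimeDB's real offset index. Entries must be sorted ascending
// by tx_offset.
fn seek_byte_offset(index: &[(u64, u64)], tx: u64) -> Option<u64> {
    // Find the first entry strictly after `tx`, then step back one:
    // that is the latest indexed position at or before `tx`, from
    // which a reader would scan forward to the exact transaction.
    let i = index.partition_point(|&(t, _)| t <= tx);
    i.checked_sub(1).map(|i| index[i].1)
}

fn main() {
    let index = [(0u64, 0u64), (10, 4096), (20, 8192)];
    // tx 15 is not indexed directly; start reading at the entry for tx 10.
    assert_eq!(seek_byte_offset(&index, 15), Some(4096));
    assert_eq!(seek_byte_offset(&index, 20), Some(8192));
    println!("ok");
}
```

With zstd-seekable, the byte offset returned here would be resolved through the seek table rather than a plain file seek, but the index lookup itself is unchanged.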

@cloutiertyler (Contributor, Author)

It was decided that we can do this later in a non-breaking way: look for either our own magic number or the zstd magic number to determine whether a given segment file is compressed.
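The detection described above could look roughly like this. The zstd frame magic number is 0xFD2FB528 (stored little-endian on disk as `28 B5 2F FD`); the commitlog magic used here is a placeholder, not the real SpacetimeDB value.

```rust
// Sketch of compressed-vs-uncompressed segment detection via magic bytes.
// ZSTD_MAGIC is the real zstd frame magic; COMMITLOG_MAGIC is a
// placeholder, illustrative only.
const ZSTD_MAGIC: [u8; 4] = [0x28, 0xB5, 0x2F, 0xFD];
const COMMITLOG_MAGIC: [u8; 4] = *b"SEGM"; // hypothetical value

#[derive(Debug, PartialEq)]
enum SegmentKind {
    Compressed,
    Uncompressed,
    Unknown,
}

fn classify_segment(header: &[u8]) -> SegmentKind {
    if header.starts_with(&ZSTD_MAGIC) {
        SegmentKind::Compressed
    } else if header.starts_with(&COMMITLOG_MAGIC) {
        SegmentKind::Uncompressed
    } else {
        SegmentKind::Unknown
    }
}

fn main() {
    assert_eq!(
        classify_segment(&[0x28, 0xB5, 0x2F, 0xFD, 0x00]),
        SegmentKind::Compressed
    );
    assert_eq!(classify_segment(b"SEGM...."), SegmentKind::Uncompressed);
    println!("ok");
}
```

Because the two magic sequences are disjoint, old binaries that only know the commitlog magic fail loudly on compressed segments rather than misreading them, which is what makes the change non-breaking to introduce later.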

@gefjon (Contributor) commented Nov 20, 2024

MVP / definition of done, as I see it:

  • Determine some time at which to compress old commitlog segments. Possibly:
    • Whenever a commitlog segment is filled and a new segment is started, compress the old segment. Don't worry about the snapshots.
    • After taking a snapshot, compress all segments older than the one containing the snapshotted TX. Keep the segment(s) needed to replay from the most recent snapshot uncompressed.
  • When replaying from the commitlog, if a segment is compressed, decompress it in-memory. Do not store the uncompressed version to disk.
  • Benchmark to ensure that replaying from the most recent snapshot is not catastrophically slower. We do not care if replaying from older snapshots is slow, and we can afford a small regression even on the most recent snapshot.
  • Test to ensure that replaying from an older snapshot, or no snapshot at all, is still possible, even if it is slow.
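The snapshot-based compression policy in the second bullet can be sketched as pure selection logic. Everything here is hypothetical naming, not SpacetimeDB code: segments are represented by their starting tx offsets, and any segment that a replay from the latest snapshot could touch is kept uncompressed.

```rust
// Hypothetical sketch: given the starting tx offset of each closed segment
// (sorted ascending) and the tx offset of the most recent snapshot, select
// the segments that are safe to compress. The segment containing the
// snapshotted tx, and every later segment, stay uncompressed so replay
// from the snapshot remains fast.
fn segments_to_compress(segment_start_offsets: &[u64], snapshot_tx: u64) -> Vec<u64> {
    // partition_point finds the first segment starting after snapshot_tx;
    // the segment *containing* snapshot_tx is the one just before it,
    // so it must be kept as well.
    let first_needed = segment_start_offsets
        .partition_point(|&start| start <= snapshot_tx)
        .saturating_sub(1);
    segment_start_offsets[..first_needed].to_vec()
}

fn main() {
    // Segments start at tx 0, 100, 200, 300; snapshot taken at tx 250.
    // Replay needs the segment starting at 200 (it contains tx 250) and
    // everything after it, so only the segments at 0 and 100 may be
    // compressed.
    assert_eq!(segments_to_compress(&[0, 100, 200, 300], 250), vec![0, 100]);
    println!("ok");
}
```

Compression itself would then run as a background task over the selected segments, decompressing in-memory on the rare replay that reaches that far back.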

@kim (Contributor) commented Nov 21, 2024

  • Test that traversing a compressed segment from an offset that is not the start of the segment is not catastrophically slower (e.g. by having to decompress and traverse from the start of the segment, instead of seeking using the offset index).

@gefjon (Contributor) commented Nov 21, 2024

> Test that traversing a compressed segment from an offset that is not the start of the segment is not catastrophically slower (e.g. by having to decompress and traverse from the start of the segment, instead of seeking using the offset index).

I contend that we don't actually care, as long as replaying from the most recent snapshot is still fast. I am not aware of any other performance-constrained case in which we traverse commitlog segments.

@kim (Contributor) commented Nov 21, 2024

@gefjon replication will need to be able to randomly seek in segments, at least back to the latest snapshot.

@gefjon (Contributor) commented Nov 21, 2024

> at least back to the latest snapshot.

Ack. This is a significantly weaker constraint than the one you wrote originally. E.g. I believe we would accept a solution where replaying from or seeking within a compressed segment was slow, but where the segment(s) after the most recent snapshot were kept uncompressed and were therefore fast.

@mamcx linked a pull request on Dec 3, 2024 that will close this issue.