Page cache improvements #112

Merged: 30 commits, Sep 5, 2023
Commits
a758d91  Fixes for #85, #86, #94, #99, #101. (tonyastolfi, Aug 10, 2023)
d51bc48  Fix for #87, #100; upgrade to batteries/0.44.1. (tonyastolfi, Aug 10, 2023)
81ed7f2  Fix for #87, #98, #100; upgrade to batteries/0.44.1. (tonyastolfi, Aug 10, 2023)
7db2e30  Fixes for #90, #95, #96, #97. (tonyastolfi, Aug 10, 2023)
11cb4fd  Merge remote-tracking branch 'upstream/main' into upgrade-batteries_v… (tonyastolfi, Aug 15, 2023)
72c6d37  wip - before major rewrite of VolumeTrimmer to simplify. (tonyastolfi, Aug 20, 2023)
24242c7  wip - save VolumeCommittedJobTracker before removing (just in case we… (tonyastolfi, Aug 22, 2023)
cce3b14  wip - new VolumeTrimmer. (tonyastolfi, Aug 24, 2023)
8e19c8a  wip - volume trimmer fixes (plus test fixes). (tonyastolfi, Aug 26, 2023)
4d28a73  (re-)Acquire job grant first on recover in trimmer test; this is (tonyastolfi, Aug 26, 2023)
d934d5f  Added cached pool for RingBuffer, fixed testdata bug (TrieTest). (tonyastolfi, Aug 28, 2023)
5e5ac2c  Upgrade to batteries/0.44.1. (tonyastolfi, Aug 28, 2023)
45a2e4f  Add caching of ring buffer mapped memory to slow resource growth. (tonyastolfi, Aug 28, 2023)
8f7341c  Use madvise; minimize calls to munmap. (tonyastolfi, Aug 28, 2023)
baf6f26  Fix for #108, Fix for #107. (tonyastolfi, Aug 28, 2023)
360cacd  Merge remote-tracking branch 'upstream/main' into ring-buffer-pool (tonyastolfi, Aug 28, 2023)
0832c6b  merge RingBuffer changes. (tonyastolfi, Aug 28, 2023)
e2e3804  Merge remote-tracking branch 'origin/ring-buffer-pool' into new-volum… (tonyastolfi, Aug 28, 2023)
e0ff0d8  Cleanup. (tonyastolfi, Aug 28, 2023)
0c9f517  Merge remote-tracking branch 'upstream/main' into page-cache-improvem… (tonyastolfi, Aug 28, 2023)
e6c3f13  Hand-merge refactoring of finalized_page_cache_job.* (tonyastolfi, Aug 28, 2023)
abe1a49  fix header order (tonyastolfi, Aug 28, 2023)
9771317  Merge remote-tracking branch 'origin/new-volume-trimmer' into page-ca… (tonyastolfi, Aug 28, 2023)
15b4a72  Remove dead code. (tonyastolfi, Aug 28, 2023)
6b69597  Merge remote-tracking branch 'origin/new-volume-trimmer' into page-ca… (tonyastolfi, Aug 28, 2023)
5c78c1a  Only clear one mirror when deallocating a RingBuffer::Impl. (tonyastolfi, Aug 29, 2023)
69ea4d8  memfd_create instead of tmpfile, normalized subpool sizes. (tonyastolfi, Aug 29, 2023)
961e2e1  Merge remote-tracking branch 'origin/ring-buffer-pool' into new-volum… (tonyastolfi, Aug 29, 2023)
b83b550  Merge remote-tracking branch 'origin/new-volume-trimmer' into page-ca… (tonyastolfi, Aug 29, 2023)
ba99620  Merge remote-tracking branch 'upstream/main' into page-cache-improvem… (tonyastolfi, Sep 5, 2023)
71 changes: 71 additions & 0 deletions doc/new_prepare_commit_design.md
@@ -0,0 +1,71 @@
# Proposal: Replicate Root Page Refs and User Data to Simplify Job Txns and Trimming

## Problem Statement

Currently, a multi-page transaction ("job" hereafter) is considered
durable/committed only once all of the following have occurred:

1. PackedPrepareJob slot written/flushed
2. All new Page data written
3. All PageAllocator ref count updates written/flushed
4. PackedCommitJob slot written

A PackedCommitJob is currently just a slot offset pointer to the
prepare slot it finalizes. Thus, in order to present a committed
job's user data to the application, we must reference information
stored in the earlier PackedPrepareJob slot. The same dependency on
both slots arises when the VolumeTrimmer trims the Volume root log,
since the main function of the VolumeTrimmer is to update ref counts
in response to trimmed root refs.
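
To make the dual-slot dependency in steps 1 and 4 concrete, here is a
minimal sketch of the two records as they stand today. All type and
field names below are illustrative stand-ins, not the actual packed
llfs definitions:

```cpp
#include <cstdint>
#include <vector>

// Step 1: written and flushed before any page data.
struct PrepareJobRecordSketch {
  std::vector<std::uint64_t> root_page_ids;  // root refs for the job
  std::vector<std::uint8_t> user_data;       // opaque application payload
  // ... plus new/deleted page ids and page device ids ...
};

// Step 4: written last; its only payload is a back-pointer.
struct CommitJobRecordSketch {
  std::uint64_t prepare_slot_offset;  // offset of the PackedPrepareJob slot
};
```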

The current design presents several significant problems:

1. Additional complexity when reading slots (currently handled by the
   VolumeSlotDemuxer class): in both the reading and trimming
   workflows, we must hold a map from slot offset to prepare job
   record until we see the corresponding commit (see the sketch after
   this list).
2. The dual-slot dependency poses a dilemma for the trimmer: should
   we allow trimming of a prepare slot, but not its later commit
   slot? If we do, we are left with a commit slot that is essentially
   useless from the application's standpoint. If we don't, trimming
   may suffer latency spikes, because "interleaved" jobs (prepare-1,
   prepare-2, commit-1, prepare-3, commit-2, prepare-4, etc.) can
   arrest trimming indefinitely, up to the full capacity of the log.
   This is possible even today, with serialized jobs, since we allow
   pipelining of the job txn protocol steps enumerated above.
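
A minimal sketch of the bookkeeping referenced in problem 1, assuming
a simplified interface (a hypothetical stand-in for what
VolumeSlotDemuxer must do, not its actual API):

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <utility>
#include <vector>

struct PrepareInfo {
  std::vector<std::uint64_t> root_page_ids;
  std::vector<std::uint8_t> user_data;
};

class DemuxerSketch {
 public:
  void on_prepare(std::uint64_t slot_offset, PrepareInfo info) {
    this->pending_.emplace(slot_offset, std::move(info));
  }

  // A commit slot carries only the prepare slot's offset; resolving it
  // requires the map entry, which may already have been trimmed away.
  std::optional<PrepareInfo> on_commit(std::uint64_t prepare_slot_offset) {
    auto iter = this->pending_.find(prepare_slot_offset);
    if (iter == this->pending_.end()) {
      return std::nullopt;  // prepare slot trimmed or not yet seen
    }
    PrepareInfo info = std::move(iter->second);
    this->pending_.erase(iter);
    return info;
  }

 private:
  // Grows with every in-flight job; an entry can only be dropped once the
  // matching commit slot is observed.
  std::map<std::uint64_t, PrepareInfo> pending_;
};
```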

As of 2023-08-23 (the time of writing), the VolumeTrimmer implements
further problematic behavior: it will happily trim the prepare slot
of an _ongoing_ transaction! When it does so, it attempts to roll
forward the information from the prepare slot that it cares about
(the root ref list), which has led to a complicated and buggy scheme
of Grant reservation and management. Given the complexity of this
system, it is very difficult to achieve high confidence in its
correctness.

Worse (and more to the point), the very scenario it presumes is
nonsensical: what use is it to trim a prepare _before the job has even
committed?_ Recall from above that the application is unable to
derive anything useful from just the commit slot, as it is just a
pointer to the slot offset of the prepare (where all useful
information is stored).

## Solution

The new design trades a small amount of write amplification for a
drastically simpler system: the PackedCommitJob record is extended to
include the (opaque) user data and root page refs from the prepare
record. Typically the user data is small (<100 bytes), and the
increased I/O is largely offset by removing the need for obscure
hacks like reserving extra space in the PackedPrepareJob record to
balance the trimmer's current roll-forward scheme (done so that the
Grant needed to roll prepare information forward can always be
reclaimed from the trimmed region itself). Since the commit no
longer depends on the earlier prepare slot, the new design eliminates
the need to track prepare slots as the root log is read, and the
trimmer no longer faces the dilemma described above: it can simply
treat commit slots as stand-alone records and take action accordingly
when they are trimmed. This means that log scanning gets faster, and
we can drop the `depends_on_slot` field from the `SlotParse`
structure, simplifying application code.
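
For contrast with the sketch in the Problem Statement, a minimal
illustration of the extended commit record (again with stand-in
names; the real PackedCommitJob is a packed, variable-length
structure):

```cpp
#include <cstdint>
#include <vector>

// Under the new design the commit slot is self-contained: readers and the
// trimmer never need to consult the earlier prepare slot.
struct NewCommitJobRecordSketch {
  std::vector<std::uint64_t> root_page_ids;  // replicated from the prepare
  std::vector<std::uint8_t> user_data;       // replicated opaque payload
};
```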
23 changes: 23 additions & 0 deletions src/llfs/appendable_job.cpp
@@ -45,4 +45,27 @@ PrepareJob prepare(const AppendableJob& appendable)
};
}

//==#==========+==+=+=++=+++++++++++-+-+--+----- --- -- - - - -
//
u64 AppendableJob::calculate_grant_size() const noexcept
{
  // The user data and root refs are counted twice below: the new design
  // replicates both in the commit slot as well as the prepare slot (see
  // doc/new_prepare_commit_design.md).
  const usize user_data_size = packed_sizeof(this->user_data);
  const usize root_refs_size = packed_array_size<PackedPageId>(this->job.root_count());

  return                                                                                 //
      packed_sizeof_slot_with_payload_size(                                              //
          sizeof(PackedPrepareJob)                                                       //
          + user_data_size                                                               //
          + root_refs_size                                                               //
          + packed_array_size<PackedPageId>(this->job.new_page_count())                  //
          + packed_array_size<PackedPageId>(this->job.deleted_page_count())              //
          + packed_array_size<little_page_device_id_int>(this->job.page_device_count())  //
          )                                                                              //
      + packed_sizeof_slot_with_payload_size(  //
            sizeof(PackedCommitJob)            //
            + user_data_size                   //
            + root_refs_size                   //
            );
}

} // namespace llfs
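
To illustrate what `calculate_grant_size()` accounts for, here is a
self-contained sketch of the same arithmetic using stand-in
constants. The real `packed_sizeof*` helpers compute exact packed
sizes, so every constant below is an assumption for illustration
only:

```cpp
#include <cstdint>
#include <cstdio>

// Stand-in sizes (assumptions; not the real llfs packed sizes).
constexpr std::uint64_t kSlotHeaderSize = 8;     // per-slot header overhead
constexpr std::uint64_t kArrayHeaderSize = 8;    // per-array count/header
constexpr std::uint64_t kPackedElementSize = 8;  // per-array-element size

std::uint64_t slot_size(std::uint64_t payload) {
  return kSlotHeaderSize + payload;
}

std::uint64_t array_size(std::uint64_t n) {
  return kArrayHeaderSize + n * kPackedElementSize;
}

int main() {
  const std::uint64_t user_data = 64;  // "typically small (<100 bytes)"
  const std::uint64_t roots = 2, new_pages = 3, deleted_pages = 0, devices = 1;

  // Prepare slot: record header + user data + all four arrays.
  const std::uint64_t prepare =
      slot_size(32 /* PackedPrepareJob stand-in */ + user_data +
                array_size(roots) + array_size(new_pages) +
                array_size(deleted_pages) + array_size(devices));

  // Commit slot: record header + the *replicated* user data and root refs.
  const std::uint64_t commit =
      slot_size(16 /* PackedCommitJob stand-in */ + user_data + array_size(roots));

  std::printf("grant = %llu + %llu = %llu bytes\n",
              static_cast<unsigned long long>(prepare),
              static_cast<unsigned long long>(commit),
              static_cast<unsigned long long>(prepare + commit));
  return 0;
}
```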
9 changes: 8 additions & 1 deletion src/llfs/appendable_job.hpp
@@ -9,7 +9,7 @@
#ifndef LLFS_APPENDABLE_JOB_HPP
#define LLFS_APPENDABLE_JOB_HPP

-#include <llfs/finalized_page_cache_job.hpp>
+#include <llfs/committable_page_cache_job.hpp>
#include <llfs/packable_ref.hpp>
#include <llfs/page_cache_job.hpp>
#include <llfs/status.hpp>
@@ -29,6 +29,13 @@ struct PrepareJob;
struct AppendableJob {
  CommittablePageCacheJob job;
  PackableRef user_data;

  //+++++++++++-+-+--+----- --- -- - - - -

  /** \brief Returns the total grant size needed to append both the PrepareJob and CommitJob events
   * for this job.
   */
  u64 calculate_grant_size() const noexcept;
};

// Construct an AppendableJob.