Feature Proposal: Token-based consensus for conditional PUT #3

martinsumner · 2024-11-07T15:29:03Z

martinsumner
Nov 7, 2024
Maintainer

Background

Riak contains support for conditional PUT logic - it is possible to make PUTs conditional via both the PB and HTTP APIs. There are two conditional headers:

if_not_modified - this is implemented using the bespoke Riak header "X-Riak-If-Not-Modified" in HTTP;
if_none_match.

With if_not_modified, the PUT will only be applied if the vector clock passed in the PUT is still the vector clock of the current object i.e. the object has not been changed by another actor since the object was fetched by this actor to prepare the PUT. if_none_match is used if the expectation is that this is the first write of the object - it will only PUT if the object is absent.

The test on these conditional PUTs does not handle parallel actors updating the same object at the same time. The check on the condition is made during the PUT process, prior to the PUT_FSM being started - so two PUT_FSMs may concurrently update the same object even with the condition set, even where both PUTs are sent to the same node. In this case, assuming allow_mult = true, siblings will be created. There is no serialisation of PUTs; the conditions are potentially checked in parallel.

Riak has historically had support for strong conditional PUTs using riak_ensemble. However, this is a radically different approach that is incompatible with the majority of other features, and it is also not fully consistent (as it doesn't support consistency across multiple clusters). The intention is to formally retire this feature.

There are Riak users who have certain buckets where they are concerned about the number of siblings that may occur due to parallel writes, and who consequently have bespoke mechanisms in their application to prevent parallelism. This can be the case when sibling resolution requires end-user interaction (or operator intervention) - siblings are acceptable when they are rare events, but problematic when they are frequent.

Proposal

The proposal is to introduce token-based conditional checks in Riak. When a bucket has token-based conditional checks, a PUT which has a conditional check must request a token (unique to the key associated with the PUT). If the token for that key is not available, the PUT will be temporarily deferred until the token is available ... once the token becomes available the PUT will proceed.

If the token, for some reason, does not become available within a timeout, the conditional PUT will continue with the condition validated prior to the PUT (as in the current implementation), so siblings may still occur. The database is still eventually consistent; it is just that the probability of siblings is reduced.
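
As a rough illustration of the prefer_token behaviour described above, the control flow can be sketched as follows. This is a minimal sketch and not the riak_kv implementation: the module name cond_put_flow and the three funs passed in (a token request, a PUT run within the token session, and the existing pre-PUT check path) are all hypothetical stand-ins.

```erlang
-module(cond_put_flow).
-export([conditional_put/4, demo/0]).

%% Hypothetical sketch of the prefer_token flow; none of these names are
%% part of the riak_kv API.
%%
%% RequestToken :: fun((Timeout) -> {ok, Session} | timeout)
%% TokenPut     :: fun((Session) -> Result)  % check + write in the session
%% ApiOnlyPut   :: fun(() -> Result)         % pre-PUT check, no token held
conditional_put(RequestToken, TokenPut, ApiOnlyPut, Timeout) ->
    case RequestToken(Timeout) of
        {ok, Session} ->
            %% Token secured: the condition is checked while holding it.
            {token, TokenPut(Session)};
        timeout ->
            %% Token not secured in time: fall back to the existing
            %% behaviour, so siblings remain possible.
            {api_only, ApiOnlyPut()}
    end.

%% Exercises both paths with stubbed funs.
demo() ->
    Granted = fun(_T) -> {ok, self()} end,
    Expired = fun(_T) -> timeout end,
    Put = fun(_S) -> ok end,
    Fallback = fun() -> ok end,
    {token, ok} = conditional_put(Granted, Put, Fallback, 100),
    {api_only, ok} = conditional_put(Expired, Put, Fallback, 100),
    ok.
```

The tagged return value is only there to make the chosen path visible in the sketch; a caller of a real conditional PUT would presumably see just the PUT result.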

Likewise with multi-cluster environments, tokens exist only within a cluster - parallel updates into different clusters will be protected from parallelism only at the present level (with a pre-PUT check).

The token requests are only made if conditions are applied to the PUT, and if the token-based mechanism has been configured. Any system that either doesn't use conditional PUTs, or doesn't enable the token protection would not be impacted.

The configuration schema is as proposed here, including the description for the end user:

%% @doc Mode for handling conditional checks on PUTs
%% Handling if-not-modified (vclock-based) and if-none-match conditionals on
%% PUTS, there are three possible modes:
%% - api_only (default)
%% - prefer_token
%% - mandate_token (not currently implemented)
%% 
%% In the api_only mode, a read will be done before the write within the API,
%% and if the read passes the condition the PUT will be allowed (even though
%% a parallel conditional PUT may be in-flight).
%% In the prefer_token mode a token must be requested for the key to be
%% updated, and the read and the write will be managed within the token
%% session. Only a single update will normally have access to the token,
%% requests will queue for use of the token.  When a token cannot be secured
%% within a timeout, then the api_only method will be used.
%% 
%% Future releases may support a mandate_token mode which will error on failure
%% to get a token, rather than proceed and accept eventual consistency (as with
%% prefer_token).
%%
%% Only conditional PUTs are impacted by this setting, non-conditional PUTs
%% being sent in parallel to conditional PUTs may cause siblings.  The
%% HTTP standard headers of If-Unmodified-Since and If-Match, are always
%% applied as api_only checks.  The riak-specific vector-clock based
%% if-not-modified header, and the HTTP-default if-none-match header are the
%% only conditional PUTs that will be subject to stronger, token-based
%% restrictions.
{mapping, "conditional_put_mode", "riak_kv.conditional_put_mode", [
  {default, api_only},
  {datatype, {enum, [api_only, prefer_token]}}
]}.

%% @doc Set the level of verification required on token access
%% When requesting access to a token, this can be done in three different
%% verification modes:
%% - head_only
%% - basic_consensus
%% - primary_consensus (default)
%%
%% The head_only mode will make the node currently at the head of each preflist
%% responsible for granting tokens in isolation. This is intended to meet
%% constraints only in healthy clusters (or single-node clusters).
%%
%% In a consensus mode, the token will be granted by the node at the head of
%% the preflist, and the issuance will be validated by up to two "downstream"
%% nodes.  This means that when a node recovers from failure, and becomes head
%% of the preflist it is prevented from making grants which are duplicates of
%% ones made by a downstream node during the failure.
%%
%% There are two forms of consensus - basic and primary.  With basic
%% consensus, any available unique nodes in the preflist (either primary or
%% fallback) can be used for consensus.  With primary consensus, the nodes
%% must be 3 of 5 primary nodes (and hence a target_n_val of at least 5 is 
%% required in this mode).
%%
%% With basic_consensus, tokens can still be granted in a wide range of failure
%% scenarios, but with a risk of duplicate grants, in particular should a
%% cluster be partitioned.
%%
%% No mode provides strict guarantees, including primary_consensus, especially
%% in complex partition scenarios where different nodes have alternative views
%% of node reachability.
{mapping, "token_request_mode", "riak_kv.token_request_mode", [
  {default, primary_consensus},
  {datatype, {enum, [head_only, basic_consensus, primary_consensus]}}
]}.
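
For illustration, the two mappings above translate into application environment settings along the following lines. This is a sketch of an advanced.config fragment derived from the mapping names in the schema; placing target_n_val under riak_core is an assumption about where that setting is configured, and the values shown are only one possible choice.

```erlang
%% Sketch of an advanced.config fragment (illustrative values only).
[
 {riak_kv, [
   %% riak.conf equivalent: conditional_put_mode = prefer_token
   {conditional_put_mode, prefer_token},
   %% riak.conf equivalent: token_request_mode = primary_consensus
   {token_request_mode, primary_consensus}
 ]},
 {riak_core, [
   %% Assumption: target_n_val is set via the riak_core environment.
   %% primary_consensus needs a target_n_val of at least 5.
   {target_n_val, 5}
 ]}
].
```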

The design adds a general token handling service. So whereas here it is used to enforce conditions on individual PUTs, it would be possible to also add a token API to Riak - where external actors could request and wait-on the availability of specific tokens.

Design

Each node will have a riak_kv_token_manager. The manager will receive token requests from riak_kv_token_session processes, which are local to that manager. A unique riak_kv_token_session process is created for every request. The riak_kv_token_manager monitors all riak_kv_token_session processes to which it has granted a token, and removes the grant when a session dies.

The riak_kv_token_session processes in effect act as a riak_client. When a session has been granted a token, a process may then use that session to run riak_client M/F/A requests (e.g. riak_client:put/2). A crash of the riak_client function will crash the session and prompt the token to be released by the manager.

As part of the change the conditional checks have been moved from the API, and put instead inside the PUT_FSM - this helps to ensure the behaviour between the APIs is identical. The API should wait on a token request if there are conditions present in the PUT, and if a token request is successful, use the session PID returned to make the riak_client:put/2 call. Once the PUT is complete the API releases the session, to release the grant.

The detail of riak_kv_token_manager implementation is described within the module: https://github.com/OpenRiak/riak_kv/blob/3a102b3ccc1e04b0a54b56cfc9d59acf0bca5587/src/riak_kv_token_manager.erl#L21-L101
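
The linked module is the authoritative description. Purely to illustrate the two ideas in this section (queueing requests while a token is held, and releasing a grant when the holder dies), here is a much-simplified, single-node toy. It is a sketch under assumptions: the module toy_token_manager and its functions are hypothetical, the requesting process itself stands in for a session, and there is no consensus with downstream nodes.

```erlang
-module(toy_token_manager).
-behaviour(gen_server).
-export([start_link/0, request/2, release/2, demo/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

%% State: #{Key => {HolderPid, MonitorRef, Queue of {Pid, From}}}

start_link() -> gen_server:start_link(?MODULE, [], []).

%% Blocks until the token for Key is granted to the calling process.
request(Mgr, Key) -> gen_server:call(Mgr, {request, Key, self()}, infinity).

release(Mgr, Key) -> gen_server:cast(Mgr, {release, Key, self()}).

init([]) -> {ok, #{}}.

handle_call({request, Key, Pid}, From, State) ->
    case maps:find(Key, State) of
        error ->
            Ref = erlang:monitor(process, Pid),
            {reply, granted, State#{Key => {Pid, Ref, queue:new()}}};
        {ok, {Holder, Ref, Waiting}} ->
            %% Token in use: queue the caller rather than rejecting it.
            Queued = queue:in({Pid, From}, Waiting),
            {noreply, State#{Key => {Holder, Ref, Queued}}}
    end.

handle_cast({release, Key, Pid}, State) ->
    case maps:find(Key, State) of
        {ok, {Pid, Ref, Waiting}} ->
            erlang:demonitor(Ref, [flush]),
            {noreply, grant_next(Key, Waiting, State)};
        _ ->
            %% Not the current holder: ignore.
            {noreply, State}
    end.

handle_info({'DOWN', Ref, process, _Pid, _Reason}, State) ->
    %% A holder died without releasing: pass its token(s) on.
    Held = [{K, W} || {K, {_, R, W}} <- maps:to_list(State), R =:= Ref],
    Next = lists:foldl(fun({K, W}, S) -> grant_next(K, W, S) end,
                       State, Held),
    {noreply, Next};
handle_info(_Other, State) ->
    {noreply, State}.

%% Grant the token to the next queued requester, if there is one.
grant_next(Key, Waiting, State) ->
    case queue:out(Waiting) of
        {empty, _} ->
            maps:remove(Key, State);
        {{value, {Pid, From}}, Rest} ->
            Ref = erlang:monitor(process, Pid),
            gen_server:reply(From, granted),
            State#{Key => {Pid, Ref, Rest}}
    end.

demo() ->
    {ok, Mgr} = start_link(),
    granted = request(Mgr, <<"k">>),
    Parent = self(),
    Waiter = spawn(fun() ->
                       granted = request(Mgr, <<"k">>),
                       Parent ! {got, self()},
                       receive stop -> ok end
                   end),
    %% The waiter must stay queued while we hold the token.
    receive {got, Waiter} -> error(too_early) after 100 -> ok end,
    release(Mgr, <<"k">>),
    receive {got, Waiter} -> ok after 1000 -> error(not_granted) end,
    %% The holder dies without releasing; the token is freed via 'DOWN'.
    exit(Waiter, kill),
    granted = request(Mgr, <<"k">>),
    ok.
```

Queueing in the manager means a contended request simply waits for its turn, instead of the client having to back off and retry.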

Alternative Design Ideas

Two alternative designs were explored:

Simply using the Erlang global module;
Forcing the PUT co-ordinator to be the head of the preflist, and applying the conditional check at the coordinator.

In implementing a bespoke token manager it was possible to queue requests when the token was not presently available - so generally token requests are not rejected, they are delayed until the token is available. This made the handling of contention much more efficient than using backoff/retry mechanisms.

Using the head of the preflist would lead to failure of the conditions in all the operational-change and failure scenarios. It should be a high-level design goal to make the behaviour of Riak as predictable as possible in these scenarios.

Overall it was felt that the additional complexity of the bespoke token manager was not overly burdensome, and it provided both a better answer to conditional PUTs, and a more flexible building block for further functionality to be added.

Testing

The primary functional test starts a 6-node cluster and 24 clients per node. All clients then attempt to make parallel changes to the same object, and do this repeatedly. The aim is to check that the end result reflects all the client additions, not just a subset (as would happen if the condition were not applied in some PUTs), without siblings being returned at any stage. At the commencement of each round there will be 144 requests: 1 will succeed and 143 will get a conditional PUT failure (indicating the object has been updated). In the next pass 1 will not participate (already updated), 1 will succeed and 142 will be rejected, and so on. It therefore takes 144 + 143 + ... + 1 = 10,440 PUT attempts (over 10K) for all 144 clients to succeed each round. In idealised test scenarios this whole process may still take only on the order of 1 second.

When using primary_consensus, a series of common failure scenarios is also covered, with parallel writes concurrent to these failures. Likewise with operational changes (e.g. joins and leaves). These failure/operational tests are:

Stop node;
Start node;
Leave node;
Join node;
Kill node;
Restart node;
ReKill node;
ReReStart node;

There is no intention to make this exhaustive with these tests. It is important to note that this is never expected to offer any guarantees of strong consistency.

A full volume test has also been run, with/without token-based checking of conditionals. The overall difference in throughput was <5%. The overhead was noticeable but not transformational.

Caveats

This does not represent any formal guarantee of consistency. Although the riak_test covers multiple failure scenarios, it is not a Jepsen test; there will be further edge cases of timing, even in these scenarios, where siblings will be generated. The token-based consensus provides an alternative path to eventual consistency, but one where siblings are less frequently encountered.

If automated sibling resolution is possible, this is still the preferred method, and would be a more efficient way of handling concurrent PUTs. The riak_test with 144 parallel clients indicates that automated resolution will be around 4x faster than forcing clients to back off and retry due to conditional PUT failures.

There also exists the concern of data-loss. If a client is forced to back-off and retry, then the data is not yet protected from loss in the presence of failure of that client.

Pull Requests

OpenRiak/riak_kv#35
OpenRiak/riak_test#19

Planned release for inclusion

Riak 3.4.0

Bob-The-Marauder · 2024-11-21T13:50:22Z

Bob-The-Marauder
Nov 21, 2024

Just to make sure I understand correctly, when the preflist owner is offline and primary_consensus is enabled, would that cause PUT requests to fail?

1 reply

martinsumner Nov 21, 2024
Maintainer Author

You need 3 of 5 primaries online, otherwise no token would be granted and it would fall back to eventual consistency.

This is why you need target_n_val of 5 when making your cluster for this mode (not the default of 4) ... and obviously, don't confuse target_n_val with n_val, all your buckets can still be n_val = 3. So as long as you give target_n_val = 5 when the cluster is built (or changed), primary_consensus will be token-based if no more than 2 nodes are down (and then only certain preflists will be impacted in that case).

Note the fallback is to eventual consistency. The PUT should still succeed, it is just that any conditional check will be applied prior to the PUT without that PUT first requesting an exclusive token.

tburghart · 2024-12-20T21:50:46Z

tburghart
Dec 20, 2024
Maintainer

We have implemented a global token (labelled lock) service at the cluster level, which we'll get around to contributing at some point, but I don't think there's any semantic conflict with this.

Ours is commonly accessed through a riak admin operation and is used by operations staff (and bots) to signal that a node is down for service, so please don't take another one down just now, with the state stored in cluster metadata. As such it's not entirely consistent - two requests for the lock on two different nodes within some small time window could potentially succeed - so race conditions are possible.

As this is proposed for 3.4, which we're nowhere near tinkering with yet, I don't see any conflict, but I'm flagging it here so we don't lose sight of it when we get our stuff pushed.

0 replies

WarpEngineer · 2024-12-22T00:43:02Z

WarpEngineer
Dec 22, 2024
Collaborator

I think this is a very well thought-out approach. I do wonder about the fallback mechanism becoming confusing if it gets triggered when users expect stronger guarantees, but I think clarifying this in the documentation would be sufficient. Otherwise, I think this looks great.

0 replies
