Feature Proposal: Token-based consensus for conditional PUT #3
Replies: 3 comments 1 reply
-
Just to make sure I understand correctly, when the preflist owner is offline and primary_consensus is enabled, would that cause PUT requests to fail? |
Beta Was this translation helpful? Give feedback.
-
We have implemented a global token (labelled lock) service at the cluster level, which we'll get around to contributing at some point, but I don't think there's any semantic conflict with this. Ours is commonly accessed through a As this is proposed for 3.4, which we're nowhere near tinkering with yet, I don't see any conflict, but I'm flagging it here so we don't lose sight of it when we get our stuff pushed. |
Beta Was this translation helpful? Give feedback.
-
I think this is a very well thoughtout approach. I do wonder about the fallback mechanism becoming confusing if it gets triggered if users expect stronger guarantees, but I think clarifying this in the documentation would be sufficient. Otherwise, I think this looks great. |
Beta Was this translation helpful? Give feedback.
-
Background
Riak contains support for conditional PUT logic - it is possible to make PUTs conditional on both PB and HTTP API. There are two conditional headers:
With if_not_modified, the PUT will only be applied if the vector clock passed in the PUT is still the vector clock of the current object i.e. the object has not been changed by another actor since the object was fetched by this actor to prepare the PUT. if_none_match is used if the expectation is that this is the first write of the object - it will only PUT if the object is absent.
The test on these conditional PUTs does not handle parallel actors updating the same object at the same time. The check on the condition is made at the during the PUT process prior to the PUT_FSM being started - so two PUT_FSMs may concurrently update the same object even with the condition set - even where both PUTs are sent to the same node. In this case, assuming
allow_mult = true
, siblings will be created. There is no serialisation of PUTs, the conditions are checked potentially in parallel.Riak has historically had support for strong conditional PUTs using
riak_ensemble
- however this is a radically different approach and is incompatible with the majority of other features, but also not fully consistent (as it doesn't support consistency across multiple clusters). The intention is to formally retire this feature.There are Riak users, who have certain buckets where they are concerned about the number of siblings that may occur due to parallel writes, who consequently have bespoke mechanisms in their application to prevent parallelism. This can be the the case when sibling resolution requires end-user interaction (or operator intervention) - siblings are acceptable when they are rare events, but problematic when they are frequent.
Proposal
The proposal is to introduce token-based conditional checks in Riak. When a bucket has token-based conditional checks, a PUT which has a conditional check must request a token (unique to the key associated with the PUT). If the token for that key is available the PUT will be temporarily deferred until a token is available ... once the token becomes available the PUT will proceed.
If the token, for some reason, never becomes available within a timeout the conditional PUT will continue with the condition validated prior to the PUT (as in the current implementation). So siblings may still occur. The database is still eventually consistent, it is just that the probability of siblings is reduced.
Likewise with multi-cluster environments, tokens exist only within a cluster - parallel updates into different clusters will be protected from parallelism only at the present level (with a pre-PUT check).
The token requests are only made if conditions are applied to the PUT, and if the token-based mechanism has been configured. Any system that either doesn't use conditional PUTs, or doesn't enable the token protection would not be impacted.
The configuration schema is as proposed here, including the description for the end user:
The design adds a general token handling service. So whereas here it is used to enforce conditions on individual PUTs, it would be possible to also add a token API to Riak - where external actors could request and wait-on the availability of specific tokens.
Design
Each node will have a
riak_kv_token_manager
. Thhe manager will receive token requests fromriak_kv_token_session
processes, which are local to that manager. A uniqueriak_kv_token_session
process is created for every request. Theriak_kv_token_manager
monitors allriak_kv_token_session
processes to which it has granted a token, and removes the grant when a session dies.The
riak_kv_token_session
processes in effect act as ariak_client
. When a session has been granted a process, a process then may use that session to run riak_client M/F/A requests (e.g.riak_client:put/2
). A crash of the riak_client function will crash the session and prompt the token to be released by the manager.As part of the change the conditional checks have been moved from the API, and put instead inside the PUT_FSM - this helps to ensure the behaviour between the APIs is identical. The API should wait on a token request if there are conditions present in the PUT, and if a token request is successful, use the session PID returned to make the
riak_client:put/2
call. Once the PUT is complete the API releases the session, to release the grant.The detail of
riak_kv_token_manager
implementation is described within the module: https://github.com/OpenRiak/riak_kv/blob/3a102b3ccc1e04b0a54b56cfc9d59acf0bca5587/src/riak_kv_token_manager.erl#L21-L101Alternative Design Ideas
Two alternative designs were explored:
global
module;In implementing a bespoke token manager it was possible to queue requests when the token was not presently available - so generally token requests are not rejected, they are delayed until the token is available. This made the handling of contention much more efficient than using backoff/retry mechanisms.
Using the head of the preflist would lead to failure of the conditions in all the operational-change and failure scenarios. It should be a high-level design goal to make the behaviour of Riak as predictable as possible in these scenarios.
Overall it was felt that the additional complexity of the bespoke token manager was not overly burdensome, and it provided both a better answer to conditional PUTs, and a more flexible building block for further functionality to be added.
Testing
The primary functional test starts a 6-node cluster and 24 clients per node. Then all clients attempt to make parallel changes to the same object ... and does this repeatedly. The aim is to check that the end-result reflects all the client additions, not just a subset (as would happen if the condition does not get applied in some PUTs), without at any stage siblings being returned. At commencement of each round there will be 144 requests, 1 will succeed and 143 will get a conditional PUT failure (to indicate updated), then next round 1 ill not participate (already updated), 1 will succeed and 142 will be rejected - so it will take over 100K PUT attempts for all 144 clients to succeed each round. In idealised test scenarios then this whole process still may take only o(1) seconds.
When using the
primary_consensus
a series of common failure scenarios are also covered, with parallel writes concurrent to these failures. Likewise with operational changes (e.g. joins and leaves). These failure/operational tests are:There is no intention to make this exhaustive with these tests. It is important to note that this is never expected to offer any guarantees of strong consistency.
A full volume test has also been run, with/without token-based checking of conditionals. The overall difference in throughput was <5%. The overhead was noticeable but not transformational.
Caveats
This does not represent any formal guarantee of consistency. Although the riak_test covers multiple failure scenarios, it is not a Jepsen test, there will be further edge cases of timing even in these scenarios where siblings will be generated. The token-based consensus provides an alternative path to eventual consistency, but one where siblings are less frequently encountered.
If automated sibling resolution is possible, this is still the preferred method, and would be a more efficient way of handling concurrent PUTs. The riak_test with 144 parallel clients indicates that resolving through automated resolution will be around 4 x faster than forcing clients to back-off and retry due to conditional PUT failures.
There also exists the concern of data-loss. If a client is forced to back-off and retry, then the data is not yet protected from loss in the presence of failure of that client.
Pull Requests
OpenRiak/riak_kv#35
OpenRiak/riak_test#19
Planned release for inclusion
Riak 3.4.0
Beta Was this translation helpful? Give feedback.
All reactions