There exist two primary ways of agreeing on metadata within a Riak cluster.
The ring:
For non-typed buckets, their properties are stored on the ring.
Otherwise the ring is reserved for information about the allocation of partitions within the cluster (and the currently known status of cluster members).
The ring is gossiped (using riak_core_gossip) and persisted on changes, and conflicting changes are merged.
Although changes to the cluster membership are made through the claimant, changes to bucket properties bypass the claimant and are made directly.
There may be issues with changing bucket properties while cluster membership changes are being planned.
When the ring is updated, the bucket properties are flushed and re-written to an ETS table, from which they are then read (sketched below).
After 90s without a ring change, read access to the latest ring is optimised using riak_core_mochiglobal (which essentially compiles the ring into a module) - but requests for bucket properties only ever use the ETS table.
Each node sends its ring to another random node every 60s, even if there are no changes, so every node should eventually find out about a new version of the ring, even if the node was down when a ring change was gossiped.
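As a rough illustration of the flush-and-read pattern above, here is a minimal sketch - not the riak_core code; the module, table and function names are invented - of bucket properties being flushed from a new ring into an ETS table and then read via ets:lookup/2:

```erlang
%% Minimal sketch of the ring-derived bucket-properties cache described
%% above. Illustrative only: names and shapes are assumptions, not the
%% actual riak_core implementation.
-module(bucket_props_cache_sketch).
-export([init/0, flush_from_ring/1, get/1]).

-define(TAB, bucket_props_cache).

init() ->
    ets:new(?TAB, [named_table, public, set, {read_concurrency, true}]).

%% Called whenever a new ring is installed; Props is assumed to be a
%% list of {BucketName, PropList} pairs extracted from the ring.
flush_from_ring(Props) ->
    ets:delete_all_objects(?TAB),
    ets:insert(?TAB, Props),
    ok.

get(Bucket) ->
    case ets:lookup(?TAB, Bucket) of
        [{Bucket, PropList}] -> {ok, PropList};
        [] -> {error, no_such_bucket}
    end.
```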
Cluster metadata:
For typed buckets and bucket types, their properties are stored in cluster metadata.
Configuration for riak_core_security is also stored in cluster metadata.
There are no other uses for cluster metadata.
Cluster metadata is entirely separate from the ring. It is gossiped using riak_core_broadcast, and stored within a DETS file, with a read-only version promoted to an ETS table.
Reading bucket properties for typed and non-typed buckets follows two different code paths (one is a read from the ring, the other a read from cluster metadata) - but both paths ultimately lead to an ets:lookup/2 (although in the case of a typed bucket two lookups are required).
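A hedged sketch of those two read paths - the table names, key layouts and merge rule below are purely illustrative, not the actual riak_core structures:

```erlang
%% Sketch of the two read paths: one lookup for a non-typed bucket
%% (ring-derived cache), two for a typed bucket (type properties plus
%% bucket-specific overrides, with the bucket's own values winning).
-module(bucket_read_sketch).
-export([get_props/1]).

get_props({Type, Bucket}) ->
    %% Typed bucket: two lookups, then merge with bucket-specific
    %% properties taking precedence over the type's defaults.
    TypeProps = lookup(metadata_cache, {bucket_type, Type}),
    BucketProps = lookup(metadata_cache, {bucket, Type, Bucket}),
    lists:ukeymerge(1, lists:ukeysort(1, BucketProps),
                    lists:ukeysort(1, TypeProps));
get_props(Bucket) ->
    %% Non-typed bucket: a single lookup against the ring-derived cache.
    lookup(ring_props_cache, Bucket).

lookup(Tab, Key) ->
    case ets:lookup(Tab, Key) of
        [{Key, Props}] -> Props;
        [] -> []
    end.
```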
If cluster metadata gets out of sync between nodes, the differences are detected by active anti-entropy (using the riak_core_metadata_hashtree module, which then uses hashtree and eleveldb).
For managing eventual consistency in cluster metadata the dvvset module is used. This is the only use of dvvset within riak_core.
Although typed bucket property changes do not use the ring, they are channeled via the claimant node.
Some notes:
Both the use of an ETS table as a read-only cache and the use of riak_core_mochiglobal seem anachronistic in the context of Erlang's persistent_term.
The most obvious change is to replace the use of riak_core_mochiglobal with persistent_term for storing the ring (and the ring epoch). There is already the delayed protection of promote_ring to prevent over-frequent updates.
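A minimal sketch of what that replacement could look like, assuming a persistent_term key of our own choosing (the module and function names below are illustrative, not existing riak_core functions):

```erlang
%% Sketch of storing the promoted ring in persistent_term rather than
%% compiling a module via riak_core_mochiglobal.
-module(ring_promote_sketch).
-export([promote_ring/1, get_ring/0]).

promote_ring(Ring) ->
    %% persistent_term:put/2 is expensive cluster-wide within the VM
    %% (it forces a scan of all processes), so - as with mochiglobal -
    %% promotion should only happen after the ring has been stable for
    %% a while (e.g. the existing 90s delay).
    persistent_term:put({?MODULE, ring}, Ring).

get_ring() ->
    %% Reads are constant-time and copy-free, like mochiglobal reads.
    persistent_term:get({?MODULE, ring}, undefined).
```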
If riak_core_security is not used, and buckets are either generally un-typed or limited in number, the riak_core_metadata approach seems excessive.
riak_core_metadata was added for a reason, though - is riak_core_metadata/riak_core_broadcast a more robust and scalable approach than riak_core_ring/riak_core_gossip?
The need for AAE in riak_core_metadata creates a dependency on eleveldb. eleveldb itself can be removed, but that doesn't eliminate the complexity of the process. Also, this PR changes the underlying store for hashtree - and so has an impact on those using AAE for KV anti-entropy.
Cluster metadata is also subject to a hard limit: the 2GB maximum size of DETS table files.
The need for AAE is to ensure eventual consistency (if a node was down and missed broadcasts, it must discover the discrepancy via AAE), but given a 2GB maximum size, perhaps simpler methods than hashtree could be used for comparisons, e.g. iterating over keys and comparing them directly.
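For example, a direct comparison could be as simple as a merge-walk over sorted {Key, Hash} lists produced by each node - an illustrative sketch only, not a drop-in replacement for riak_core_metadata_hashtree:

```erlang
%% Compare two key-sorted lists of {Key, Hash} pairs and report keys
%% that are missing on one side or whose hashes differ.
-module(metadata_compare_sketch).
-export([diff/2]).

%% Both inputs are assumed sorted by key (e.g. via lists:keysort/2).
diff(Local, Remote) ->
    diff(Local, Remote, []).

diff([], [], Acc) ->
    lists:reverse(Acc);
diff([{K, _} | L], [], Acc) ->
    diff(L, [], [{missing_remotely, K} | Acc]);
diff([], [{K, _} | R], Acc) ->
    diff([], R, [{missing_locally, K} | Acc]);
diff([{K, H} | L], [{K, H} | R], Acc) ->
    %% Same key, same hash - in sync, move on.
    diff(L, R, Acc);
diff([{K, HL} | L], [{K, _HR} | R], Acc) ->
    %% Same key, different hash - values have diverged.
    diff(L, R, [{differs, K, HL} | Acc]);
diff([{KL, _} | _] = L, [{KR, _} | R], Acc) when KL > KR ->
    diff(L, R, [{missing_locally, KR} | Acc]);
diff([{KL, _} | L], R, Acc) ->
    diff(L, R, [{missing_remotely, KL} | Acc]).
```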
There is the additional complexity of fixups (which is currently only used by riak_repl); fixups allow an application to register a need to filter all bucket property changes before they are applied (in the case of riak_repl this is used to ensure that the riak_repl post-commit hook is applied on every bucket).
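For illustration, a fixup can be thought of as a callback of roughly the shape Module:fixup(Bucket, Props) -> {ok, Props1}, folded over the properties before they are stored. The sketch below assumes that shape; the hook term is a placeholder, not the real riak_repl hook definition:

```erlang
%% Illustrative sketch of the fixup idea described above.
-module(fixup_sketch).
-export([fixup/2, apply_fixups/3]).

%% An example fixup: ensure a (placeholder) post-commit hook is present
%% in every bucket's properties.
fixup(_Bucket, Props) ->
    Hook = repl_postcommit_hook,   %% stand-in for the real hook term
    Hooks = proplists:get_value(postcommit, Props, []),
    case lists:member(Hook, Hooks) of
        true  -> {ok, Props};
        false -> {ok, lists:keystore(postcommit, 1, Props,
                                     {postcommit, [Hook | Hooks]})}
    end.

%% Fold every registered fixup over the properties before they are applied.
apply_fixups(Bucket, Props, FixupMods) ->
    lists:foldl(fun(Mod, AccProps) ->
                        {ok, NewProps} = Mod:fixup(Bucket, AccProps),
                        NewProps
                end, Props, FixupMods).
```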
The net effect of the different methods for holding cluster metadata is confusing.
There are no known issues with the current setup; it may be confusing, but it works.
Should there be a long-term plan to change this? To unify or simplify the methods of storing and gossiping information required across the cluster? Is there sufficient potential reward to justify the risk of any change (particularly with regard to the overheads of testing backwards compatibility)?
Thank you for putting this together. I had some similar thoughts when working around some of this code.
cluster_metadata is an implementation of Plumtree. riak_core_gossip creates a binary tree and messages a node's immediate children. Plumtree came much later, and I'm not sure what initiated the change in gossip algorithm, but I'd be interested in benchmarking how Plumtree performs when managing the ring.
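Roughly, that binary-tree fanout looks like the following - an illustrative sketch under assumed names, not the riak_core_gossip code:

```erlang
%% Treat the sorted node list as an implicit binary tree (1-indexed,
%% children of node i are 2i and 2i+1) and message only a node's
%% immediate children, which then recurse to theirs.
-module(tree_fanout_sketch).
-export([broadcast/2]).

broadcast(Msg, Nodes) ->
    Sorted = lists:usort(Nodes),
    fanout(Msg, Sorted, index_of(node(), Sorted)).

fanout(_Msg, _Nodes, not_found) ->
    ok;
fanout(Msg, Nodes, I) ->
    Children = [C || C <- [2 * I, 2 * I + 1], C =< length(Nodes)],
    [begin
         Child = lists:nth(C, Nodes),
         %% A real implementation sends via riak_core_gossip; this just
         %% shows the shape of the fanout to a registered process.
         {gossip_sketch, Child} ! {ring_update, Msg}
     end || C <- Children],
    ok.

index_of(X, L) ->
    case lists:member(X, L) of
        true  -> length(lists:takewhile(fun(E) -> E =/= X end, L)) + 1;
        false -> not_found
    end.
```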
I'm for keeping cluster_metadata or at least the idea of it. I like keeping the ring small, safe and updated infrequently. I'd even look to move some fields in meta over to cluster_metadata, like default bucket-types. Using the ring as an ad-hoc store means that business logic and usage can get in the way of a core component in Riak. If the ring and metadata share much of the same implementation, then it won't feel overkill.
As an aside, I'm curious how other Riak users use bucket-types and whether we can just remove the CLI and move it to riak.conf.