Add multi-database support to cluster mode #1671
base: unstable
Conversation
This commit introduces multi-database support in cluster mode while maintaining backward compatibility and requiring no API changes. Key features include:
- Database-agnostic hashing: the hashing algorithm is unchanged. Identical keys map to the same slot across all databases, and slot calculation is untouched. This ensures consistency in key distribution and maintains compatibility with existing single-database setups.
- The implementation is fully backward compatible, with no API changes.
- The core structure remains an array of databases, each containing a list of hashtables (one per slot).

Cluster management commands are global commands, except for GETKEYSINSLOT and COUNTKEYSINSLOT, which run in selected-DB context. MIGRATE also operates in a selected-DB context. Please note that the MIGRATE parameter destination-db is honored: when migrating keys, they can be migrated to a different database in the target, like in non-cluster mode.

The slot migration process changes when multiple databases are used (see the sketch below):
1. Iterate through all databases.
2. SELECT the database.
3. keys = CLUSTER GETKEYSINSLOT
4. MIGRATE source target keys

Valkey-cli has been updated to support resharding across all databases.

Signed-off-by: xbasel <[email protected]>
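To make the per-database resharding loop concrete, here is a minimal sketch in C using hiredis. It is illustrative only: the host, port, slot number, database count, and timeout are placeholder values, and real tooling such as valkey-cli adds error handling, repeats GETKEYSINSLOT until the slot is empty, and deals with AUTH/TLS.

```c
#include <hiredis/hiredis.h>

int main(void) {
    redisContext *src = redisConnect("127.0.0.1", 7000);   /* source node (placeholder) */
    const char *dst_host = "127.0.0.1";                    /* target node (placeholder) */
    int dst_port = 7001, slot = 1234, num_dbs = 16, timeout_ms = 5000;

    for (int db = 0; db < num_dbs; db++) {
        redisReply *r = redisCommand(src, "SELECT %d", db);
        freeReplyObject(r);

        /* Fetch up to 100 keys of this slot in the selected database.
         * A real tool repeats this until the slot is empty in this DB. */
        r = redisCommand(src, "CLUSTER GETKEYSINSLOT %d %d", slot, 100);
        for (size_t i = 0; i < r->elements; i++) {
            /* destination-db mirrors the source db, as in non-cluster mode. */
            redisReply *m = redisCommand(src, "MIGRATE %s %d %s %d %d",
                                         dst_host, dst_port,
                                         r->element[i]->str, db, timeout_ms);
            freeReplyObject(m);
        }
        freeReplyObject(r);
    }
    redisFree(src);
    return 0;
}
```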
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## unstable #1671 +/- ##
============================================
+ Coverage 70.97% 71.18% +0.20%
============================================
Files 121 123 +2
Lines 65238 65667 +429
============================================
+ Hits 46305 46745 +440
+ Misses 18933 18922 -11
src/db.c
Outdated
@@ -1728,12 +1714,6 @@ void swapMainDbWithTempDb(serverDb *tempDb) {
void swapdbCommand(client *c) {
    int id1, id2;

    /* Not allowed in cluster mode: we have just DB 0 there. */
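For context, the pre-patch guard that this comment precedes looks roughly like the following; this is sketched from the upstream source as I recall it, not copied from the diff:

```c
/* Pre-patch behavior: SWAPDB refuses to run when cluster mode is on. */
if (server.cluster_enabled) {
    addReplyError(c, "SWAPDB is not allowed in cluster mode");
    return;
}
```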
Would that be enough for swapdb to work in cluster mode? What will happen in a setup with 2 shards, each responsible for half of the slots in the DBs?
With this implementation SWAPDB must be executed in all primary nodes. There are three options:
- Allow SWAPDB and shift responsibility to the user – Risky, non-atomic, can cause temporary inconsistency and data corruption. Needs strong warnings.
- Keep SWAPDB disabled in cluster mode – Safest, avoids inconsistency.
- Make SWAPDB cluster-wide and atomic – Complex, unclear feasibility.
I think option 2 is the safest bet. @JoBeR007 wdyt?
Is SWAPDB replicated as a single command? Then it's atomic.
If it's risky, it's risky in standalone mode with replicas too, right?
I think we can allow it. Swapping the data can only be done in some non-realtime workloads anyway I think.
I think the risk from replication and the risk from having to execute SWAPDB on all primary nodes are unrelated: as a user you can't control the first, but the user is the main source of risk in the second case.
I would keep SWAPDB disabled in cluster mode, if we decide to continue with this implementation
In cluster mode, consistency is per slot.
Yes, FLUSHDB is very similar in this regard. If a failover happens just before this command has been propagated to replicas, it's a big thing, but it's no surprise I think. The client can use WAIT or check replication offset to make sure the FLUSHDB or SWAPDB was successful on the replicas.
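To illustrate the WAIT pattern mentioned above, a client can flush and then block until replicas acknowledge it. A minimal sketch in C with hiredis; the helper name and parameters are ours, not an existing API:

```c
#include <hiredis/hiredis.h>

/* Returns 1 if FLUSHDB reached at least `replicas` replicas within
 * `timeout_ms`, 0 otherwise. WAIT replies with the number of replicas
 * that have acknowledged the current primary replication offset. */
static int flushdb_and_wait(redisContext *ctx, int replicas, int timeout_ms) {
    redisReply *r = redisCommand(ctx, "FLUSHDB");
    if (r == NULL) return 0;
    freeReplyObject(r);

    r = redisCommand(ctx, "WAIT %d %d", replicas, timeout_ms);
    int ok = r && r->type == REDIS_REPLY_INTEGER && r->integer >= replicas;
    freeReplyObject(r);
    return ok;
}
```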
Regarding this, I think it is not just an issue of multi-database but is more related to atomic slot migration. If a shard is in a stable state (not undergoing slot migration), then flushdb/flushall/swapdb are safe. However, if slot migration is in progress, it might lead to data inconsistency.
I think this needs to be considered alongside atomic-slot-migration:
- During the ATM process, for slots being migrated, if we encounter flushall/flushdb, we can send a command like flushslot or flushslotall to the target shard
- As for swapdb, I recommend temporarily prohibiting execution during the ATM process

@PingXie @enjoy-binbin, please also take note of this.
Makes sense. @murphyjacob4 FYI
I made a comment on the issue about this, but it is also worth mentioning that it's hard to orchestrate SWAPDB. Even in steady state, flushdb and flushall are idempotent (you can send them multiple times) but swapdb isn't. If a command times out on one node, it's hard to reason about whether it was successful and how to retry it. I think we should continue to disable SWAPDB in cluster mode for now, unless we introduce an idempotent way to do the swap.
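The idempotency point can be made concrete with a toy model: treating each database as a single value, a repeated flush converges to the same state, while a repeated swap undoes itself. This is a self-contained illustration, not server code:

```c
#include <assert.h>

/* Two databases modeled as plain ints. */
typedef struct { int db0, db1; } toy_store;

static void flushdb0(toy_store *s) { s->db0 = 0; }  /* like FLUSHDB */
static void swapdb(toy_store *s) { int t = s->db0; s->db0 = s->db1; s->db1 = t; }

int main(void) {
    toy_store a = {42, 7}, b = {42, 7};

    flushdb0(&a); flushdb0(&a);   /* client retried after a timeout */
    flushdb0(&b);                 /* client sent it exactly once */
    assert(a.db0 == b.db0);       /* idempotent: same end state either way */

    toy_store c = {42, 7};
    swapdb(&c); swapdb(&c);       /* a blind retry silently undoes the swap */
    assert(c.db0 == 42 && c.db1 == 7);  /* back to the original state */
    return 0;
}
```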
Maybe introducing UUID tracking for SWAPDB requests would work.
disabling SWAPDB for now.
Signed-off-by: zhaozhao.zz <[email protected]>
Signed-off-by: zhaozhao.zz <[email protected]>
src/cluster.c
Outdated
@@ -1102,7 +1110,7 @@ getNodeByQuery(client *c, struct serverCommand *cmd, robj **argv, int argc, int
 * NODE <node-id>. */
int flags = LOOKUP_NOTOUCH | LOOKUP_NOSTATS | LOOKUP_NONOTIFY | LOOKUP_NOEXPIRE;
if ((migrating_slot || importing_slot) && !pubsubshard_included) {
-    if (lookupKeyReadWithFlags(&server.db[0], thiskey, flags) == NULL)
+    if (lookupKeyReadWithFlags(c->db, thiskey, flags) == NULL)
Here, I modified it to use c->db, so for most commands the key it wants to access can be correctly located. However, some cross-DB commands, such as COPY, still require additional checks. The ultimate solution is atomic-slot-migration, I believe. Once ATM is implemented, the TRYAGAIN issue will no longer occur.
I noticed that getNodeByQuery doesn't follow selects either, so this might not be the right database. If you for example have:
SELECT 0
GET FOO
SELECT 1
GET FOO
c->db won't be correct here either. COPY and MOVE have the same problem, as mentioned. I wonder if there is some way to make this correct without having ATM, so we can limit the breakage if you're moving from standalone to cluster.
Generally, c->db provides the correct context information. Are you referring to the scenario where the SELECT command is used within a transaction (MULTI/EXEC)?
I've noticed that getNodeByQuery is being invoked while queuing commands in a MULTI context. Is this intentional? It seems unnecessary to check for key existence before execution, as the database state can change and keys might be migrated. I would expect this check to happen when EXEC is executed instead. Any thoughts?
The only benefit of this early validation, as I see it, is detecting cross-slot keys sooner. I think key existence validation should happen during EXEC execution.
@madolson / @soloestoy
Can you check da1ee65?
Looks good, but COPY and MOVE still have the problem. A simple way is to refuse these commands during slot migration, or we can wait until atomic slot migration is finished so that we don't need to check the migrating status.
Yes, the commit above does not address COPY and MOVE.
I've merged a commit to reject these commands during slot migration.
I'm happy that we did "Unified db rehash method for both standalone and cluster #12848" when developing kvstore, which made the implementation of multi-database simpler.
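For readers following along, the layout referenced here (an array of databases, each holding one hashtable per slot) looks roughly like this; field names are illustrative and do not match the exact valkey definitions:

```c
#define CLUSTER_SLOTS 16384

typedef struct hashtable hashtable;    /* opaque per-slot key table */

/* A kvstore is a collection of hashtables; in cluster mode there is
 * one per slot, which is what makes GETKEYSINSLOT-style per-slot
 * iteration cheap in every database. */
typedef struct kvstore {
    hashtable *tables[CLUSTER_SLOTS];
} kvstore;

typedef struct serverDb {
    kvstore *keys;       /* as in the c->db->keys lookups quoted above */
    kvstore *expires;
} serverDb;

/* server.db is then an array with one serverDb per SELECT-able database. */
```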
Signed-off-by: xbasel <[email protected]>
Signed-off-by: xbasel <[email protected]>
We need to add history to the SWAPDB, SELECT, MOVE json files to indicate it's supported since 9.0.
@@ -0,0 +1,481 @@
# Tests multi-databases in cluster mode
This is the legacy clustering system. Ideally this test should be in unit/cluster
@@ -1,5 +1,12 @@
start_server {tags {"lazyfree"}} {
    test "UNLINK can reclaim memory in background" {

# The test framework invokes "flushall", replacing kvstores even if empty.
I would rather we did a sync flushall in the test framework then, so we don't have these random waits all over the place.
The wait here is only to allow lazy free to complete and for used_memory to update. We don't need to sleep after using FLUSHALL in other tests.
Additionally, once #1609 is merged, it's unlikely that this sleep will be necessary.
How are you guaranteeing we never need to wait for the FLUSHALL in other tests?
I don’t. If FLUSHALL is just wiping databases, there’s no need to wait. The wait here is only for observing memory impact. Why do you think we need to wait every time FLUSHALL is called?
I was thinking that other people might not be aware of this constraint, and might encounter similar issues to yours, where the memory is not behaving the way they expect. We actually recently made the change (FLUSHALL used to default to being sync until Valkey 8.0). So maybe we should be explicitly doing the FLUSHALL SYNC inside the test framework itself.
Also, now that I read this again, is this true? The cluster test framework invokes flushall; this test doesn't seem to invoke flushall at all. Is this still necessary?
… in cluster mode

Previously, key lookup validation in cluster mode was performed both when queuing and executing commands in a `MULTI/EXEC` transaction. However, this was unnecessary because:
1. If we check for key existence when queuing, the keys might not exist anymore when `EXEC` runs.
2. The only check that matters at queuing time is cross-slot validation, since commands in a transaction must operate within the same slot.
3. Key lookups should only happen at `EXEC` time when the command actually runs.

- Removed key lookup validation at queuing time, keeping only cross-slot validation.
- Modified `getNodeByQuery` to detect `SELECT` when scanning `MULTI` commands and update the database pointer accordingly.
- Now, key lookups are performed **only** at `EXEC` time, ensuring validation happens when the command actually executes.

- **Before:** Key lookups were performed both when queuing and executing `MULTI/EXEC`, which was redundant and could lead to incorrect assumptions.
- **Now:** Only cross-slot validation is done at queuing. Key lookups are performed at `EXEC`, ensuring accuracy and correctness.

Signed-off-by: xbasel <[email protected]>
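A rough sketch of the SELECT detection this commit message describes, with names loosely following the server internals (multiState, multiCmd, selectCommand are real identifiers; the surrounding logic here is illustrative, not the actual patch):

```c
/* Inside getNodeByQuery, while scanning queued MULTI commands: track
 * the database a queued SELECT switches to, so later key lookups in
 * the same transaction use the right DB instead of c->db. */
serverDb *db = c->db;
for (int i = 0; i < ms->count; i++) {
    struct multiCmd *mc = &ms->commands[i];
    if (mc->cmd->proc == selectCommand && mc->argc == 2) {
        long long id;
        if (string2ll(mc->argv[1]->ptr, sdslen(mc->argv[1]->ptr), &id) &&
            id >= 0 && id < server.dbnum) {
            db = &server.db[id];
        }
    }
    /* ...existing per-command slot/key checks continue, using `db`... */
}
```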
…stencies. Signed-off-by: xbasel <[email protected]>
Signed-off-by: xbasel <[email protected]>
unsigned int numkeys = maxkeys > keys_in_slot ? keys_in_slot : maxkeys;
addReplyArrayLen(c, numkeys);
kvstoreHashtableIterator *kvs_di = NULL;
-    kvs_di = kvstoreGetHashtableIterator(server.db->keys, slot, 0);
+    kvs_di = kvstoreGetHashtableIterator(c->db->keys, slot, 0);
I thought this change is not compatible with current user expectations.
Now CLUSTER GETKEYSINSLOT and CLUSTER COUNTKEYSINSLOT only return the values of db0; the change makes them return the sum across all DBs.
I think the better way is to add more parameters (such as a db number) to CLUSTER GETKEYSINSLOT and CLUSTER COUNTKEYSINSLOT for a specific db, and to add one more cluster command to get the keys of all DBs.
Maybe we need to discuss the details.
@hwware we are discussing two commands at #1319 (comment)
Multi-DB support in cluster mode affects slot migration: tools need to iterate over DBs.

Command-Level Changes
- GETKEYSINSLOT, COUNTKEYSINSLOT and MIGRATE run in selected-DB context.
- SWAPDB remains disabled in cluster mode due to its non-atomic nature and potential inconsistencies across primaries.

Behavior Changes

Transaction Handling Changes (MULTI/EXEC)
- getNodeByQuery key lookup behavior will be changed.

MIGRATE operates in a selected-DB context. Please note that the MIGRATE parameter destination-db is honored: when migrating keys, they can be migrated to a different database in the target, like in non-cluster mode.

Slot migration process changes when multiple databases are used: iterate through all databases, SELECT each one, fetch keys with GETKEYSINSLOT, and MIGRATE them to the target (see the sketch after the description at the top).

Valkey-cli has been updated to support resharding across all databases.

Implements #1319