
[NEW] Introduce First-Class Durability Support in Valkey #1355

Open

PingXie opened this issue Nov 26, 2024 · 5 comments

@PingXie
Member

PingXie commented Nov 26, 2024

Problem Statement

Valkey offers several options for users seeking to ensure data durability, but each has notable limitations.

A commonly used approach is the WAIT command, which ensures that changes are replicated to a specified number of replicas before confirming success to the client. However, this places the burden of managing durability on the application, introducing complexity and increasing the risk of errors.
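
For illustration, here is a minimal sketch of this client-managed approach, assuming a Python client such as redis-py (valkey-py exposes the same interface); the replica count and timeout values are arbitrary:

```python
# Client-managed durability with WAIT: the application, not the server,
# has to decide what to do when the durability target is missed.
import redis

client = redis.Redis(host="localhost", port=6379)

def durable_set(key, value, min_replicas=2, timeout_ms=1000):
    client.set(key, value)
    # WAIT blocks until `min_replicas` replicas have acknowledged the write,
    # or until the timeout expires; it returns the number of replicas reached.
    acked = client.execute_command("WAIT", min_replicas, timeout_ms)
    if acked < min_replicas:
        # The write has already executed on the primary, so the application
        # can only surface the missed durability target, not undo the write.
        raise RuntimeError(f"only {acked} replicas acknowledged the write")

durable_set("x", "y")
```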

Another option is to use Append-Only File (AOF) with ALWAYS fsync, while avoiding replicas altogether. This ensures that writes are persisted to disk before being acknowledged to the client. However, AOF is not a true write-ahead log (WAL). For instance, in cases where writes or fsync operations fail, the changes might still be appended to the AOF file even though the client was informed of a failure. This creates a mismatch between the client’s perception and the actual system state, leading to what can be described as durability inconsistency. Moreover, this approach provides durability guarantees only on a single node.

Similarly, the WAITAOF command, which ensures data is written to the AOF file before acknowledgment, shares the same challenges of client-side complexity and durability inconsistency. While it provides durability under specific conditions, it does not guarantee correctness in all failure scenarios.
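
Under the same client-library assumption, a comparable sketch of the AOF-based path; WAITAOF takes (numlocal, numreplicas, timeout) and returns how many local AOFs and replica AOFs have been synced:

```python
# AOF-based durability with WAITAOF; appendonly must be enabled or the
# command is rejected. The timeout value is arbitrary.
import redis

client = redis.Redis(host="localhost", port=6379)
client.config_set("appendonly", "yes")
client.config_set("appendfsync", "always")

client.set("x", "y")
local, replicas = client.execute_command("WAITAOF", 1, 0, 1000)
if local < 1:
    # As with WAIT, the command has already run on the primary; the client
    # can only report that the durability target was not met in time.
    raise RuntimeError("write was not fsynced to the local AOF in time")
```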

As a result of these limitations, Valkey currently lacks a mechanism to provide a robust durability contract. Specifically, there is no reliable way to guarantee that at least N copies of data are consistently maintained—where N is the user-configured replication factor—in the face of single-point failures such as a node outage or an availability zone (AZ) failure.

Proposal

This proposal introduces first-class support for synchronous replication and deterministic elections in Valkey, enabling users to maintain at least N consistent copies of their data even in the event of single-point failures. Here, N refers to the user-configured replication factor. Additionally, this proposal ensures that only replicas fully synchronized with the primary are eligible to contend for primaryship, guaranteeing predictable and reliable behavior across all failure scenarios.

Synchronous Replication

When operating in "durability" mode—a new server configuration—the primary node acknowledges a write command only after successfully replicating the change to the expected number of replicas and ensuring that those changes are committed.

To prevent unbounded delays, the primary enforces a configurable timeout for replica acknowledgments. If the number of responding replicas falls below the durability threshold after the timeout expires, the write will fail with a "not enough replicas" error. This ensures a balance between durability guarantees and write latency.
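
As a rough illustration of the intended flow (this is not Valkey source, and all names are hypothetical), the primary would acknowledge a write only once enough replica acknowledgments have arrived, or fail it at the timeout:

```python
# Sketch of the durability-mode acknowledgment path. Replica objects stand
# in for the per-replica state the primary already tracks via REPLCONF ACK.
import time
from dataclasses import dataclass

@dataclass
class Replica:
    acked_offset: int = 0  # last replication offset the replica confirmed

def wait_for_quorum(replicas, target_offset, needed, timeout_ms):
    """Return True once `needed` replicas have acknowledged `target_offset`;
    return False if the timeout expires first (the write is then failed)."""
    deadline = time.monotonic() + timeout_ms / 1000.0
    while time.monotonic() < deadline:
        acked = sum(1 for r in replicas if r.acked_offset >= target_offset)
        if acked >= needed:
            return True       # durable: acknowledge the client
        time.sleep(0.001)     # poll for further acknowledgments
    return False              # reply "not enough replicas" to the client

# Example: one of two replicas is behind, so the write times out.
replicas = [Replica(acked_offset=120), Replica(acked_offset=95)]
print(wait_for_quorum(replicas, target_offset=100, needed=2, timeout_ms=50))
```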

Deterministic Elections

A replica’s eligibility for election is explicitly controlled by the primary. All replicas start in an ineligible state, which prevents them from contending for the primary role.

Full synchronization with the primary is the first step for a replica to catch up. Once this is complete, the replica begins receiving the incremental updates accumulated during the sync. When the replica’s replication offset matches the primary’s replication offset, the primary explicitly grants it election eligibility by sending a specific command. Until this command is issued, the replica remains ineligible to contend for primaryship.

To ensure deterministic behavior, the primary temporarily pauses writes during this process—similar to a manual failover.

If a replica disconnects from the primary, it immediately loses its eligibility. Upon reconnection, the replica reverts to an ineligible state and must complete the same synchronization process before regaining eligibility. This ensures that only fully synchronized replicas can participate in elections, eliminating the risk of promoting an inconsistent primary.
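
A minimal state-machine sketch of these eligibility rules; the state names and the explicit grant message are labels for the proposal's concepts, not existing Valkey commands:

```python
# Replica eligibility as a small state machine: ineligible by default,
# eligible only after the primary's explicit grant, and revoked on any
# disconnection.
from enum import Enum, auto

class ReplicaState(Enum):
    INELIGIBLE = auto()  # initial state; cannot contend for primaryship
    SYNCING = auto()     # full sync done, incremental catch-up in progress
    ELIGIBLE = auto()    # primary has explicitly granted eligibility

class ReplicaEligibility:
    def __init__(self):
        self.state = ReplicaState.INELIGIBLE

    def on_full_sync_complete(self):
        self.state = ReplicaState.SYNCING

    def on_grant(self, replica_offset, primary_offset):
        # The primary pauses writes, confirms the offsets match, and only
        # then sends the (hypothetical) grant message.
        if self.state is ReplicaState.SYNCING and replica_offset == primary_offset:
            self.state = ReplicaState.ELIGIBLE

    def on_disconnect(self):
        # Eligibility is lost immediately; the replica must repeat the
        # synchronization process after reconnecting.
        self.state = ReplicaState.INELIGIBLE
```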

Faster Full Synchronization

In the event of a primary failure, one of its eligible replicas will assume the primary role. When the original primary restarts, it will rejoin as a replica and must quickly synchronize with the new primary before writes can resume. Minimizing downtime during this process is critical to maintaining availability.

Currently, when persistence is enabled, the restarting node reads from disk to reconstruct its data before synchronizing with the primary. This approach is slow and can introduce significant delays to write operations. With Valkey 8.0's support for dual-channel replication, it is now possible to always perform full synchronization directly from the new primary's memory. This significantly accelerates data transfer without adding too much load on the primary node.

Further performance improvements can be achieved by leveraging additional CPU resources during the full sync process. By parallelizing data transfer and processing, the system can dramatically reduce the time required for a restarting node to catch up with the new primary, thereby shortening the unavailability window.
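
Conceptually (this is not Valkey code, and `load_chunk` is a hypothetical placeholder), the idea is to fan the replication payload out to multiple workers instead of loading it on a single thread:

```python
# Parallel processing of full-sync payload chunks with a worker pool. In the
# server this would be threads splitting I/O and parsing; the sketch only
# shows the shape of the parallelism.
from concurrent.futures import ThreadPoolExecutor

def load_chunk(chunk: bytes) -> int:
    # Placeholder for deserializing one slice of the sync payload into the
    # keyspace; returns the number of entries loaded from this slice.
    return chunk.count(b"\n")

def parallel_load(chunks, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(load_chunk, chunks))

print(parallel_load([b"a\nb\n", b"c\n"], workers=2))
```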

Analysis

Steady State

In the steady state, where all replicas are healthy and eligible, a Valkey server running in durability mode ensures that every committed change is replicated to the user-configured number of replicas, in addition to the primary. This guarantees that the durability configuration set by the user is fully honored under normal operating conditions, assuming no more than single-point failures.

When a new replica is added, it does not participate in the write quorum until it has fully synchronized with the primary. During this synchronization period, the replica remains ineligible for elections. Once its replication offset matches the primary’s replication offset, the primary explicitly grants it eligibility and updates the expected number of replicas for the write quorum.

If a replica is removed from the cluster, it must first go through the "cluster forget" process. This ensures that the primary stops considering the replica as part of the quorum, preventing unnecessary write timeouts caused by waiting for acknowledgments from a node that is no longer present.
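
A tiny sketch of how the primary's expected write quorum could track these membership changes (hypothetical names, for illustration only):

```python
# The write quorum is simply the set of currently eligible replicas: a grant
# adds a replica, "cluster forget" removes it, and every write must be
# acknowledged by all members of the set.
class WriteQuorum:
    def __init__(self):
        self.eligible = set()

    def grant(self, replica_id):
        self.eligible.add(replica_id)      # replica caught up and was granted eligibility

    def forget(self, replica_id):
        self.eligible.discard(replica_id)  # CLUSTER FORGET: stop waiting for its acks

    def required_acks(self):
        return len(self.eligible)          # all eligible replicas must acknowledge
```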

Failure Scenarios

An eligible replica loses connection to the primary.

A write operation racing against this event may experience delays or timeouts, which is an intentional trade-off to ensure durability. The primary closes the client representing the disconnected replica, which then loses its eligibility. Upon reconnection, the replica must catch up with the primary before it can regain eligibility.

An eligible replica requires full synchronization.

This scenario is treated the same as a disconnection. The replica remains ineligible during the synchronization process and only regains eligibility after completing the synchronization and being explicitly granted eligibility.

Multiple eligible replicas require resynchronization.

If more than one eligible replica becomes disconnected or requires resynchronization, the behavior remains identical to the previous cases. Writes proceed only if the required number of eligible replicas is available. Otherwise, writes are delayed or rejected until the system can restore the desired durability guarantees.

The primary fails.

In the event of a primary failure, only replicas explicitly granted eligibility are allowed to participate in the election process. This ensures that no stale replica becomes the new primary. If some eligible replicas are also unavailable during this event, the election will proceed as long as at least one eligible replica remains. However, if the remaining number of eligible replicas is insufficient to meet the user-defined durability threshold, future writes will be rejected until the durability guarantee is restored.

Availability Trade-offs

This proposal prioritizes durability over availability by requiring all eligible replicas to acknowledge writes in durability mode. Once a replica is deemed eligible, it is expected to remain in that state and participate in all future replication synchronously. This strict acknowledgment process ensures that all replicas participating in elections are fully in sync, eliminating the risk of promoting an inconsistent or stale primary.

Allowing configurable durability thresholds, where only a subset of eligible replicas need to acknowledge writes, might appear to offer greater flexibility. However, this introduces significant challenges. Going with this option, if a replica fails to acknowledge a write, the system would need to dynamically revoke its eligibility. This revocation process complicates the protocol in several ways:

  • Performance Overhead: Revoking eligibility requires explicit and synchronous communication to replicas. This adds latency and increases the volume of internal messages during write operations.

  • Error-Handling Complexity: The revocation process introduces new failure scenarios. If the message revoking a replica's eligibility is lost or delayed, replicas may hold inconsistent views of eligibility status, leading to election ambiguity or delays.

By requiring strict acknowledgment from all eligible replicas, the proposed protocol avoids this complexity. The system guarantees durability and consistency at the cost of temporary unavailability during failure scenarios. This trade-off simplifies the protocol, reduces error-handling complexity, and ensures predictable behavior under both normal and failure conditions.

While this strict approach may lead to write rejections if the number of available eligible replicas drops below the user-configured replication factor, it is a necessary tradeoff that prioritizes data durability over availability.

Incremental Perfection

While the described in-memory solution guarantees durability in the face of single-point failures, it cannot prevent data loss if all nodes in a shard experience simultaneous downtime. In such extreme cases, a proper write-ahead log (WAL) implementation would be necessary to ensure data durability. The WAL acts as an extra layer of defense, complementing synchronous replication and deterministic elections.

A critical aspect of this proposal is the requirement that replicas must be explicitly granted permission by the primary to participate in elections. This design ensures that if all nodes in a shard go down and then recover, no replica can independently initiate an election unless it has previously received explicit eligibility from the old primary. Consequently, in the absence of the old primary granting voting permissions, no election can take place, and the old primary is guaranteed to resume its role once it comes back online.

When all nodes reload their WAL after a full shard downtime event:

  • The old primary’s WAL will contain the most complete and up-to-date state, making it the natural candidate to resume the primaryship.

  • Replicas remain ineligible to vote or become primary until explicitly granted eligibility by the old primary during normal operation.

This mechanism eliminates the risk of replicas contending for primaryship prematurely, ensuring data consistency. This deterministic behavior avoids ambiguity and split-brain scenarios during recovery. The trade-off inherent in this approach is reduced availability after a full shard down event. The shard remains unavailable until the old primary is operational, as no other node is authorized to assume the primary role. However, this unavailability is intentional and aligns with the proposal’s core goal of prioritizing durability and consistency over availability in failure scenarios.

Decentralized vs. Centralized WAL

Introducing a WAL to enhance resilience in Valkey presents two potential architectural approaches, each with its own trade-offs:

Decentralized WAL

In this approach, each node maintains its own independent WAL. This design aligns closely with Valkey’s self-sufficient and distributed nature, allowing nodes to manage durability independently. However, as described above, a decentralized WAL achieves durability at the potentially higher cost of availability. For instance, after a full shard down event, only the old primary can resume its primary role. This ensures data consistency but delays recovery until the old primary becomes operational.

Distributed Centralized WAL

A centralized WAL provides a unified log that all nodes in a shard share, but it is implemented in a distributed and highly available manner. This design simplifies recovery by offering a single source of truth, making it easier to reconcile state across nodes. Essentially, centralized WALs delegate deterministic election and synchronous replication to the log service, reducing the complexity required on the engine side. However, relying on a specialized log service introduces an external dependency, which makes Valkey less self-sufficient and potentially more complex to deploy and manage.

Why AOF Is Not a True WAL

AOF is often used to persist data in Valkey, but it is not a true WAL. A true WAL guarantees that every change is recorded before the corresponding data operation is performed, ensuring an exact and replayable sequence of events for recovery. AOF, however, has several limitations that make it unsuitable as a WAL:

  • Client-Perceived Inconsistency: AOF may persist a write operation to disk even if the client perceives the operation as failed. For example, consider a SET x y command. If the primary writes the command to the AOF file but crashes before acknowledging the client, the client assumes the operation failed. Upon recovery, the primary replays the AOF file, resulting in x being updated to y—a state inconsistent with the client’s expectation. This creates a durability inconsistency because the client’s view of the operation's success does not match the system’s actual state.

  • Lack of Separation Between Commit and Application: In a proper WAL, there is a clear distinction between logging a change (commit record) and applying the change to the in-memory data structure. AOF lacks this separation, meaning that partial updates can lead to ambiguous recovery behavior, especially in failure scenarios (a minimal sketch of this log-then-apply ordering follows this list).

  • Error Handling Limitations: AOF lacks mechanisms to address certain edge cases, such as partially written commands during disk failures. While checksum validation can mitigate some issues, it does not fully replicate the robust recovery guarantees provided by a true WAL.
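
To make the commit/apply separation concrete, here is a deliberately simplified log-then-apply sketch; it is illustrative only and not a proposal for Valkey's on-disk format:

```python
# Minimal write-ahead-log sketch: the change is recorded and fsynced before
# it is applied in memory, so recovery can replay exactly what was durable
# at the time the client was acknowledged.
import json
import os

class MiniWAL:
    def __init__(self, path):
        self.log = open(path, "ab")
        self.data = {}

    def set(self, key, value):
        # 1. Append the intended change and force it to stable storage.
        record = json.dumps({"op": "SET", "key": key, "value": value})
        self.log.write(record.encode() + b"\n")
        self.log.flush()
        os.fsync(self.log.fileno())
        # 2. Only after the record is durable, apply the change in memory
        #    and acknowledge the client.
        self.data[key] = value
        return "OK"
```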

@madolson
Member

This is a great conversation to have, and thanks for spending some time putting together this issue to build off of.

The first thing that comes to mind is that we had a lot of these conversations when we launched MemoryDB. The most specific one was: what problem were we really trying to solve? For caching, tail data loss is generally OK as long as most of the data is persisted. A better solution there might be bounded data-loss guarantees, where we stop accepting writes. Message queues are basically all-write workloads, and usually don't like the higher latency that comes with durability.

Full durability takes a lot of time to get right, so I guess I want us to be committed that this is a good long term path for us.

Some other ad hoc comments:

Why AOF Is Not a True WAL

It's hard to maintain Valkey semantics while building a WAL. Non-deterministic commands like SPOP or EVAL are hard to write ahead. Other databases solve this with MVCC, which is expensive and requires a lot of effort. Maybe there is a clever solution here, though. We don't need as fancy a solution as, say, PostgreSQL, which keeps a bunch of transactions in flight at a given time.

This proposal prioritizes durability over availability by requiring all eligible replicas to acknowledge writes in durability mode.

Your proposal prioritizes consistency over availability, I think. There are other algorithms, like Raft, which we can use to get durability and availability at the cost of consistency. For most workloads, I think this is the better tradeoff. For the single-replica case, it then becomes synchronous replication.

Although, RedisRaft exists and never got launched.

Decentralized vs. Centralized WAL

This feels more like a "should we take a dependency" conversation. We'll probably have this conversation with cluster v2 soon, since it requires durability as well. We're a database service, whether we like it or not; I think we should own our destiny with durability.

For example, consider a SET x y command. If the primary writes the command to the AOF file but crashes before acknowledging the client, the client assumes the operation failed. Upon recovery, the primary replays the AOF file, resulting in x being updated to y—a state inconsistent with the client’s expectation. This creates a durability inconsistency because the client’s view of the operation's success does not match the system’s actual state.

This is not an inconsistency. The same can happen if you simply lost the write because of a disconnect while sending it to a database, you can't assume it was committed one way or another. As long as the database doesn't serve any traffic until it recovers, it's in a consistent state.

Faster Full Synchronization

This section seems a bit like a tangent. Can't we do the improvements mentioned like parallel full sync regardless? Seems like maybe this deserves its own issue.

@soloestoy
Member

Thanks to @PingXie for bringing up this discussion, it's a very helpful idea.

For a long time, Valkey has been used as a cache. This isn't to say that Valkey can only function as a cache; rather, its high performance and ability to accelerate caching scenarios made it fit that role well. However, in recent years, we've encountered many users who use Valkey as a primary database and also have requirements for durability. They hope that Valkey can ensure no data loss, so enhancing durability is a very necessary discussion (it can be offered as an optional configuration, with the default still being asynchronous to prioritize performance).

This issue is very detailed, allowing us to discuss the specific implementation methods. As I understand it, the topic has roughly the following dimensions:

  1. Data persistence across multiple replicas and single-machine data persistence
  2. Ensuring data integrity during fault recovery
  3. Ensuring data integrity during failover (automatic or manual)

As it stands, Valkey's current replication mechanism seems unable to fully meet the demands for durability. Let's set aside 2PC and Raft for now. Considering Valkey's current capabilities, the scenarios described in this issue all pose significant challenges in implementation (please correct me if I'm wrong).

First, regarding the Synchronous Replication section:

When operating in "durability" mode—a new server configuration—the primary node acknowledges a write command only after successfully replicating the change to the expected number of replicas and ensuring that those changes are committed.

To prevent unbounded delays, the primary enforces a configurable timeout for replica acknowledgments. If the number of responding replicas falls below the durability threshold after the timeout expires, the write will fail with a "not enough replicas" error. This ensures a balance between durability guarantees and write latency.

Here the primary sends a command to the replicas, and if no response is received within a specified time, the write command is considered a failure. However, the command has already been executed on the primary, so how can it be rolled back?

The Deterministic Elections section and the analysis of Steady State and Failure Scenarios are closely related, and I have a few questions:

  1. Does this mean that as long as the full synchronization is completed, the replica is eligible to be elected? Or does it require entering the incremental synchronization state, and the replica's offset needs to be consistent with the primary's offset to be eligible for election?
  2. If it's the latter, for a write command, the current process is "execute on the primary first" -> "confirm any modifications" -> "increase its own offset" -> "then replicate to replicas." Theoretically, the replica should immediately revoke its election eligibility upon receiving new data from the primary until the primary explicitly sends an authorization command to regain election eligibility. However, if there is a network failure during this period and the replica does not receive the primary's authorization command, it would never be able to elect a new master, which would mean that automatic failure recovery cannot be completed?

Regarding the Faster Full Synchronization section, I simply understand it as not loading data when the replica restarts. This reminds me of a similar discussion we had in Redis before (redis/redis#12812), adding a switch to control whether data is loaded during startup. We could consider reopening this PR in Valkey and discussing it separately.

Regarding the final section on Why AOF Is Not a True WAL, I agree with @madolson

This is not an inconsistency. The same can happen if you simply lost the write because of a disconnect while sending it to a database, you can't assume it was committed one way or another. As long as the database doesn't serve any traffic until it recovers, it's in a consistent state.

I also don't think this is an inconsistency :)

Additionally, let's brainstorm a bit. It currently seems like durability is a server-side configuration, but in practical applications, different services using the same instance may have different expectations. I think it could be made at the client level, similar to MongoDB's write concern, where the client specifies whether the write should achieve durability to be considered successful.

@hwware
Member

hwware commented Nov 26, 2024

I finished reading half of this new proposal, and I just want to ask whether the term consistency is a better fit than durability. Before I read it, I thought it was about AOF or RDB, or even other storage.

@hwware
Member

hwware commented Nov 27, 2024

  1. If it's the latter, for a write command, the current process is "execute on the primary first" -> "confirm any modifications" -> "increase its own offset" -> "then replicate to replicas." Theoretically, the replica should immediately revoke its election eligibility upon receiving new data from the primary until the primary explicitly sends an authorization command to regain election eligibility. However, if there is a network failure during this period and the replica does not receive the primary's authorization command, it would never be able to elect a new master, which would mean that automatic failure recovery cannot be completed?

For this situation, it is mentioned here:

"Going with this option, if a replica fails to acknowledge a write, the system would need to dynamically revoke its eligibility." I think the way to do it is for the primary to send a revoke message to the replica. And Ping points out one issue with that: "The revocation process introduces new failure scenarios. If the message revoking a replica's eligibility is lost or delayed, replicas may hold inconsistent views of eligibility status, leading to election ambiguity or delays."

@hwware
Member

hwware commented Nov 27, 2024

Some notes I would like to write here:

  1. The 'durability mode' (a new server configuration) must be set on every node, whether standalone or cluster; otherwise the vote process will switch between 'non-durability mode' and 'durability mode'. This leads to an old issue: set once and apply everywhere.

  2. The second issue is also related to 'durability mode': the Sentinel election mechanism. Currently, Sentinel considers every replica a candidate for the new primary, and replicas can have offsets that differ from the primary's.
    Thus, the question is: does the proposed durability mode only apply to deployments without Sentinel?

  3. In 'durability mode', if multiple write commands have dependencies and one of the earlier write commands fails, how do we deal with the subsequent dependent write commands?
