Replication

Replication means keeping a copy of the same data on multiple machines that are connected via a network.

Replication can serve several purposes:

  • High availability

    Keeping the system running, even when one machine (or several machines, or an entire datacenter) goes down.

  • Disconnected operation

    Allowing an application to continue working when there is a network interruption.

  • Latency

    Placing data geographically close to users, so that users can interact with it faster.

  • Scalability

    Being able to handle a higher volume of reads than a single machine could handle, by performing reads on replicas.

In this chapter we will assume that your dataset is small enough that each machine can hold a copy of the entire dataset. Elsewhere we relax that assumption and discuss partitioning (sharding) of datasets that are too big for a single machine.

If the data that you’re replicating does not change over time, then replication is easy: you just need to copy the data to every node once, and you’re done. All of the difficulty in replication lies in handling changes to replicated data, and that’s what this chapter is about. We will discuss three popular algorithms for replicating changes between nodes: single-leader, multi-leader, and leaderless replication.

Each node that stores a copy of the database is called a replica. Every write to the database needs to be processed by every replica; otherwise, the replicas would no longer contain the same data.

1. Single-leader Replication

Clients send all writes to a single node (the leader), which sends a stream of data change events to the other replicas (followers). Reads can be performed on any replica, but reads from followers might be stale.
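
To make the flow concrete, here is a minimal sketch using a hypothetical in-memory leader and follower (the class names and the on-demand replicate() call are illustrative, not any real database's API); note how a follower read is stale until the change events have been applied:

```python
# Minimal sketch of single-leader replication (hypothetical in-memory model,
# not any particular database's implementation).

class Follower:
    def __init__(self):
        self.data = {}

    def apply(self, event):
        key, value = event
        self.data[key] = value


class Leader:
    def __init__(self, followers):
        self.data = {}
        self.log = []  # stream of data change events
        self.followers = followers

    def write(self, key, value):
        self.data[key] = value         # apply the write locally
        self.log.append((key, value))  # record the change event

    def replicate(self):
        # A real leader streams events continuously and asynchronously;
        # here we push the whole log on demand to keep the sketch short.
        for event in self.log:
            for follower in self.followers:
                follower.apply(event)


follower = Follower()
leader = Leader([follower])
leader.write("x", 1)
print(follower.data.get("x"))  # None: the follower is stale until replication runs
leader.replicate()
print(follower.data.get("x"))  # 1
```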

Used in:

  • many relational databases and some non-relational databases
  • distributed message brokers such as Kafka
  • network filesystems

1.1 Implementation of Replication Logs

  • Statement-based replication

In the simplest case, the leader logs every write request (statement) that it executes and sends that statement log to its followers. A downside is that nondeterministic statements (such as those calling NOW() or RAND()) may produce a different value on each replica.

  • Write-ahead log (WAL) shipping

We discussed how storage engines represent data on disk, and we found that usually every write is appended to a log.

We can use the exact same log to build a replica on another node: besides writing the log to disk, the leader also sends it across the network to its followers. When the follower processes this log, it builds a copy of the exact same data structures as found on the leader.
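
A minimal sketch of the log-shipping idea (illustrative in-memory nodes; a real WAL carries low-level byte changes tied to the storage engine's on-disk format):

```python
# Minimal sketch of log shipping: the leader appends every write to a log and
# ships the same entries to a follower, which replays them to rebuild the
# same state.
import json

class Node:
    def __init__(self):
        self.data = {}
        self.log = []  # append-only log of serialized write records

    def append(self, record):
        self.log.append(record)
        key, value = json.loads(record)
        self.data[key] = value

leader = Node()
follower = Node()

entry = json.dumps(["x", 42])
leader.append(entry)    # the leader writes the entry to its own log
follower.append(entry)  # ...and ships the identical entry to the follower

assert follower.data == leader.data
```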

2. Multi-leader Replication

A natural extension of the leader-based replication model is to allow more than one node to accept writes.

Clients send each write to one of several leader nodes, any of which can accept writes. The leaders send streams of data change events to each other and to any follower nodes.

2.1 Use cases

  • Clients with offline operation

Multi-leader replication is appropriate if you have an application that needs to continue to work while it is disconnected from the internet: consider, for example, the calendar apps on your mobile phone and laptop.

If you make any changes while you are offline, they need to be synced with a server and your other devices when the device is next online.

In this case, every device has a local database that acts as a leader (it accepts write requests), and there is an asynchronous multi-leader replication process (sync) between the replicas of your calendar on all of your devices.

  • Collaborative editing

Real-time collaborative editing applications allow several people to edit a document simultaneously.

When one user edits a document, the changes are instantly applied to their local replica (the state of the document in their web browser or client application) and asynchronously replicated to the server and any other users who are editing the same document.

If you want to guarantee that there will be no editing conflicts, the application must obtain a lock on the document before a user can edit it. If another user wants to edit the same document, they first have to wait until the first user has committed their changes and released the lock. This collaboration model is equivalent to single-leader replication with transactions on the leader.

However, for faster collaboration, you may want to make the unit of change very small (e.g., a single keystroke) and avoid locking. This approach allows multiple users to edit simultaneously, but it also brings all the challenges of multi-leader replication, including requiring conflict resolution.

2.2 Handling Write Conflicts

The biggest problem with multi-leader replication is that write conflicts can occur.

  • Synchronous versus asynchronous conflict detection

In a multi-leader setup, conflict detection is asynchronous: both writes succeed at their respective leaders, and the conflict is only detected some time later, when the writes have been replicated.

In principle, you could make the conflict detection synchronous—i.e., wait for the write to be replicated to all replicas before telling the user that the write was successful. However, by doing so, you would lose the main advantage of multi-leader replication: allowing each replica to accept writes independently. If you want synchronous conflict detection, you might as well just use single-leader replication.

  • Conflict avoidance

The simplest strategy for dealing with conflicts is to avoid them: if the application can ensure that all writes for a particular record go through the same leader, then conflicts cannot occur.
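
For example, writes could be routed by hashing the record key to a fixed "home" leader; a minimal sketch (the leader names are made up, and a real system would also need to reroute when a leader or datacenter fails):

```python
# Minimal sketch of conflict avoidance: deterministically route all writes for
# a given record to the same "home" leader, so two leaders never accept
# concurrent writes to the same record.
import hashlib

LEADERS = ["leader-eu", "leader-us", "leader-ap"]  # illustrative

def home_leader(record_key: str) -> str:
    digest = hashlib.sha256(record_key.encode()).digest()
    return LEADERS[digest[0] % len(LEADERS)]

print(home_leader("user:123"))  # the same leader on every call
```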

  • Converging toward a consistent state

    • Give each write a unique ID (e.g., a timestamp), pick the write with the highest ID as the winner, and throw away the other writes. This approach is known as last write wins (LWW) and is prone to data loss (see the sketch after this list).

    • Record the conflict in an explicit data structure that preserves all information, and write application code that resolves the conflict at some later time (perhaps by prompting the user).

  • Custom conflict resolution logic

As the most appropriate way of resolving a conflict may depend on the application, most multi-leader replication tools let you write conflict resolution logic using application code.
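
As a concrete illustration, here is a minimal sketch of a last-write-wins resolver written as application code (the timestamps and node names are invented; note that the losing write is silently discarded):

```python
# Minimal sketch of a last-write-wins (LWW) resolver: each write carries a
# unique ID (here a timestamp plus a node name as a tie-breaker), and the
# write with the highest ID wins.

def resolve_lww(versions):
    """versions: list of (unique_id, value) pairs for the same key."""
    return max(versions, key=lambda v: v[0])[1]

conflicting = [
    ((1700000001, "node-a"), "B"),  # written on leader A
    ((1700000002, "node-b"), "C"),  # written concurrently on leader B
]
print(resolve_lww(conflicting))  # "C" wins; "B" is thrown away
```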

3. Leaderless Replication

Clients send each write to several nodes, and read from several nodes in parallel in order to detect and correct nodes with stale data.

3.1 Writing to the database when a node is down

Both write and read requests are sent to several nodes in parallel.

When reading, the client may get different responses from different nodes, e.g., the up-to-date value from one node and a stale value from another. Version numbers are used to determine which value is newer.

The replication system should ensure that eventually all the data is copied to every replica. After an unavailable node comes back online, how does it catch up on the writes that it missed?

  • Read repair

When a client reads from several nodes in parallel, it can detect stale responses and write the newer value back to any out-of-date replicas (see the sketch after this list). This approach works well for values that are frequently read.

  • Anti-entropy process

In addition, some datastores have a background process that constantly looks for differences in the data between replicas and copies any missing data from one replica to another.
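
A minimal sketch of read repair with version numbers (illustrative in-memory replicas, not a real client library):

```python
# Minimal sketch of read repair: read from several replicas in parallel, use
# version numbers to pick the newest value, and write it back to any replica
# that returned a stale (or missing) value.

class Replica:
    def __init__(self, data):
        self.data = data  # key -> (version, value)

    def get(self, key):
        return self.data.get(key)

    def put(self, key, version, value):
        self.data[key] = (version, value)

def read_with_repair(replicas, key):
    responses = [(replica, replica.get(key)) for replica in replicas]
    version, value = max(resp for _, resp in responses if resp is not None)
    for replica, resp in responses:
        if resp is None or resp[0] < version:
            replica.put(key, version, value)  # repair the stale replica
    return value

r1 = Replica({"x": (2, "new")})
r2 = Replica({"x": (1, "old")})  # missed the latest write
print(read_with_repair([r1, r2], "x"))  # "new"
print(r2.get("x"))                      # (2, "new"): repaired
```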

3.2 Quorums for reading and writing

If there are n replicas, every write must be confirmed by w nodes to be considered successful, and we must query at least r nodes for each read. As long as w + r > n, we expect to get an up-to-date value when reading, because at least one of the r nodes we’re reading from must be up to date. Reads and writes that obey these r and w values are called quorum reads and writes.

The parameters n, w, and r are typically configurable. For example, a workload with few writes and many reads may benefit from setting w = n and r = 1. This makes reads faster, but has the disadvantage that just one failed node causes all database writes to fail.
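
The overlap claim can be checked by brute force: whenever w + r > n, every possible write quorum intersects every possible read quorum. A minimal sketch:

```python
# Brute-force check of the quorum condition: with n replicas, every possible
# write quorum of size w intersects every possible read quorum of size r
# exactly when w + r > n.
from itertools import combinations

def quorums_overlap(n, w, r):
    replicas = range(n)
    return all(set(ws) & set(rs)
               for ws in combinations(replicas, w)
               for rs in combinations(replicas, r))

print(quorums_overlap(3, 2, 2))  # True:  w + r = 4 > n = 3
print(quorums_overlap(3, 1, 1))  # False: disjoint quorums are possible
```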

4. Replication Lag

Replication can be synchronous or asynchronous. Although asynchronous replication can be fast when the system is running smoothly, it’s important to figure out what happens when replication lag increases.

4.1 Consistency models

Consistency models are helpful for describing how an application should behave under replication lag.

  • Read-after-write consistency

    Users should always see data that they submitted themselves.

  • Monotonic reads

    After users have seen the data at one point in time, they shouldn’t later see the data from some earlier point in time (one way to achieve this is sketched after this list).

  • Consistent prefix reads

    Users should see the data in a state that makes causal sense: for example, seeing a question and its reply in the correct order.
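
One way of achieving monotonic reads is to make sure that each user always reads from the same replica, for example chosen by a hash of the user ID. A minimal sketch (the replica names are illustrative):

```python
# Minimal sketch of one way to achieve monotonic reads: route all of a user's
# reads to the same replica, so the user never reads from a replica that is
# further behind than one they read from before.
import hashlib

REPLICAS = ["replica-0", "replica-1", "replica-2"]

def replica_for(user_id: str) -> str:
    digest = hashlib.sha256(user_id.encode()).digest()
    return REPLICAS[digest[0] % len(REPLICAS)]

print(replica_for("alice"))  # the same replica on every call
```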

References

Designing Data-Intensive Applications by Martin Kleppmann (O'Reilly Media, 2017)