Support for linearizable/quorum reads #207

cole-miller · 2022-09-19T14:43:44Z

Raft as described in the whitepaper guarantees linearizability by (among other things) having leaders respond to read-only requests only after contacting a majority of the cluster. This ensures that leaders who have been deposed can't respond to a read-only request with stale data. Our implementation of raft doesn't support this kind of checked read-only request, and dqlite leaders will answer queries without contacting a majority of the cluster first; as noted in the documentation, this means that queries can return stale data. I think it would be useful for dqlite (and go-dqlite) to provide an API for performing linearizable reads/queries that do check that the current leader hasn't been deposed.

One subtlety: dqlite leaders will run a barrier before performing queries if there are uncommitted entries in the leader's log. In that case, the barrier does the work of contacting a majority of the cluster, and the operation will be linearizable. But it's possible for the leader to have been deposed without having uncommitted entries in its log, so linearizability is not guaranteed in all circumstances.

cole-miller · 2022-09-19T14:45:19Z

With support for linearizable reads in go-dqlite, we could add this Jepsen test to our current battery:

http://jepsen-io.github.io/jepsen/jepsen.tests.linearizable-register.html

freeekanayaka · 2022-09-20T08:42:24Z

My understanding is that linearizability applies naturally to atomic/single operations (like reading a register). The SQL model actually has transactional/multi operations, where usually the most you can have is serializability.

In particular SQLite in WAL mode uses snapshot isolation. I'm not sure how can you avoid stale reads with snapshot isolation: it's always possible to start a read transaction and read a stale value that was overwritten by a write transaction just after the read transaction started. I don't think there's any raft modification that can avoid that, since it's a property of the SQL/snapshot-isolation mode.

For example, if you use the sqlite3 command line to create a database like:

sqlite3 foo "pragma journal_mode=WAL; begin; create table foo(a text, b int); insert into foo (a,b) values ('a', 1); insert into foo (a,b) values ('b', 2); commit;"

then in 2 separate shells you start 2 sqlite3 shells and run these commands in the following order:

# shell 1
sqlite> begin;
sqlite> select * from foo where a='a';
a|1

this starts a read transaction on shell 1, then:

# shell 2
sqlite> begin;
sqlite> update foo set b=3 where a='b';
sqlite> begin;

this overwrites the value 'b', which transaction 1 still has to read. Now:

# shell 1
sqlite> select * from foo where a='b';
b|2

which is a stale read. This is kind of non-linearizable "by design", in the sense that the snapshot isolation transaction model is serializable but not linearizable (and indeed in the SQL standard there's no isolation mode guarantees linearizability, because it doesn't quite match with multi-operation transactions).

I've used the sqlite3 command line, but with dqlite is the same.

cole-miller · 2022-09-20T18:54:09Z

My understanding is that linearizability applies naturally to atomic/single operations (like reading a register). The SQL model actually has transactional/multi operations, where usually the most you can have is serializability.

But as far as dqlite is concerned, the whole database is one big piece of state, right? Is the inclusion of transactions in the SQL model relevant when all query/exec operations are atomic and opaque to raft (i.e. they might as well be modifying and reading a single register)?

I'm not sure how can you avoid stale reads with snapshot isolation: it's always possible to start a read transaction and read a stale value that was overwritten by a write transaction just after the read transaction started.

I don't think the situation you describe is a stale read, is it? It's consistent with a sequencing of transactions where the read happens before the right, and linearizability (or strict serializability, I guess, since we're talking about transactions) isn't violated because each transaction appears to execute atomically at some point between invocation and return.

freeekanayaka · 2022-09-20T20:01:33Z

My understanding is that linearizability applies naturally to atomic/single operations (like reading a register). The SQL model actually has transactional/multi operations, where usually the most you can have is serializability.

But as far as dqlite is concerned, the whole database is one big piece of state, right? Is the inclusion of transactions in the SQL model relevant when all query/exec operations are atomic and opaque to raft (i.e. they might as well be modifying and reading a single register)?

It is relevant because a dqlite SQL transaction can't be quite mapped to a raft operation, because it's not atomic in terms of client interactions. Typically the client starts the transaction, sends SQL commands, gets back results, possibly sends other SQL commands, until commit. It's only at commit time that we can possibly start a raft operation. However, precisely because they involve multiple client interactions, transactions can interleave. And if a write and read transaction interleave you might get the effect described in the example, without any possibility of resolving it with raft.

I believe the only way would be to allow either N concurrent read transactions OR one single write transaction. At the moment rN ead transactions can proceed concurrently with one single write transactions thanks to the WAL and its snapshot isolation.

Another way to see it, is that although the whole database is one big piece of state as you say, the operations that are run through raft are not the SQL transactions, but rather the result of the SQL transactions (i.e. the WAL pages that get added to the WAL).

Not sure how well I've articulated this, if it's confusing I'm happy to elaborate further.

I'm not sure how can you avoid stale reads with snapshot isolation: it's always possible to start a read transaction and read a stale value that was overwritten by a write transaction just after the read transaction started.

I don't think the situation you describe is a stale read, is it? It's consistent with a sequencing of transactions where the read happens before the right, and linearizability (or strict serializability, I guess, since we're talking about transactions) isn't violated because each transaction appears to execute atomically at some point between invocation and return.

The thing is that the definition of linearizable refers to single objects. If you take just that definition and apply it to the single object "second row of table foo", then this is a stale read because it violates real-time ordering. Likewise if you take the stronger multi-object version of linearizability, called strict serializability, this example is still a stale read since it violates real-time ordering.

As you point, these two transactions can be logically sequenced (i.e. serialized). However real-time ordering is violated, which is the key of linearizability (or actually strict serializability for transactions) vs serializability.

cole-miller · 2022-09-21T17:12:45Z

As you point, these two transactions can be logically sequenced (i.e. serialized). However real-time ordering is violated, which is the key of linearizability (or actually strict serializability for transactions) vs serializability.

Hmm, I'm looking at this part from the jepsen.io page about strict serializability that you linked:

When Herlihy and Wing introduced linearizability, they defined strict serializability in terms of a serializable system which is compatible with real-time order.

A history is serializable if it is equivalent to one in which transactions appear to execute sequentially, i.e., without interleaving. A (partial) precedence order can be defined on non-overlapping pairs of transactions in the obvious way. A history is strictly serializable if the transactions’ order in the sequential history is compatible with their precedence order.

“The obvious way” means that transaction A precedes transaction B if A completes before B begins; this is the real-time constraint from linearizability.

Under that partial precedence order, the two transactions in your example wouldn't even be comparable, hence strict serializability doesn't impose any requirement on the order they should be applied -- right?

freeekanayaka · 2022-09-21T19:31:57Z

As you point, these two transactions can be logically sequenced (i.e. serialized). However real-time ordering is violated, which is the key of linearizability (or actually strict serializability for transactions) vs serializability.

Hmm, I'm looking at this part from the jepsen.io page about strict serializability that you linked:

When Herlihy and Wing introduced linearizability, they defined strict serializability in terms of a serializable system which is compatible with real-time order.

A history is serializable if it is equivalent to one in which transactions appear to execute sequentially, i.e., without interleaving. A (partial) precedence order can be defined on non-overlapping pairs of transactions in the obvious way. A history is strictly serializable if the transactions’ order in the sequential history is compatible with their precedence order.

“The obvious way” means that transaction A precedes transaction B if A completes before B begins; this is the real-time constraint from linearizability.

Under that partial precedence order, the two transactions in your example wouldn't even be comparable, hence strict serializability doesn't impose any requirement on the order they should be applied -- right?

If you call the read transaction in the example "A" and the write transaction "B", you have B completing before A in real-time ordering but the serialization (logical) ordering needs to be that A completes before B (because it does not see the modifications made by B). This violates strict linearizability.

The ordering between these 2 specific transactions can't be partial, because they operate on the same data.

In case you aren't yet convinced, it might be worth writing to jepsen's author Kyle Kingsbury (aphyr at jepsen.io) to confirm or provide more details. Canonical had paid for training/consulting from him back then, so I believe he will be happy to reply. I'm not nearly as expert as him.

MathieuBordere added the Feature New feature, not a bug label Jun 12, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support for linearizable/quorum reads #207

Support for linearizable/quorum reads #207

cole-miller commented Sep 19, 2022

cole-miller commented Sep 19, 2022

freeekanayaka commented Sep 20, 2022 •

edited

Loading

cole-miller commented Sep 20, 2022

freeekanayaka commented Sep 20, 2022

cole-miller commented Sep 21, 2022 •

edited

Loading

freeekanayaka commented Sep 21, 2022 •

edited

Loading

Support for linearizable/quorum reads #207

Support for linearizable/quorum reads #207

Comments

cole-miller commented Sep 19, 2022

cole-miller commented Sep 19, 2022

freeekanayaka commented Sep 20, 2022 • edited Loading

cole-miller commented Sep 20, 2022

freeekanayaka commented Sep 20, 2022

cole-miller commented Sep 21, 2022 • edited Loading

freeekanayaka commented Sep 21, 2022 • edited Loading

freeekanayaka commented Sep 20, 2022 •

edited

Loading

cole-miller commented Sep 21, 2022 •

edited

Loading

freeekanayaka commented Sep 21, 2022 •

edited

Loading