-
Notifications
You must be signed in to change notification settings - Fork 69
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support for linearizable/quorum reads #207
Comments
With support for linearizable reads in go-dqlite, we could add this Jepsen test to our current battery: http://jepsen-io.github.io/jepsen/jepsen.tests.linearizable-register.html |
My understanding is that linearizability applies naturally to atomic/single operations (like reading a register). The SQL model actually has transactional/multi operations, where usually the most you can have is serializability. In particular SQLite in WAL mode uses snapshot isolation. I'm not sure how can you avoid stale reads with snapshot isolation: it's always possible to start a read transaction and read a stale value that was overwritten by a write transaction just after the read transaction started. I don't think there's any raft modification that can avoid that, since it's a property of the SQL/snapshot-isolation mode. For example, if you use the
then in 2 separate shells you start 2 sqlite3 shells and run these commands in the following order:
this starts a read transaction on shell 1, then:
this overwrites the value 'b', which transaction 1 still has to read. Now:
which is a stale read. This is kind of non-linearizable "by design", in the sense that the snapshot isolation transaction model is serializable but not linearizable (and indeed in the SQL standard there's no isolation mode guarantees linearizability, because it doesn't quite match with multi-operation transactions). I've used the |
But as far as dqlite is concerned, the whole database is one big piece of state, right? Is the inclusion of transactions in the SQL model relevant when all query/exec operations are atomic and opaque to raft (i.e. they might as well be modifying and reading a single register)?
I don't think the situation you describe is a stale read, is it? It's consistent with a sequencing of transactions where the read happens before the right, and linearizability (or strict serializability, I guess, since we're talking about transactions) isn't violated because each transaction appears to execute atomically at some point between invocation and return. |
It is relevant because a dqlite SQL transaction can't be quite mapped to a raft operation, because it's not atomic in terms of client interactions. Typically the client starts the transaction, sends SQL commands, gets back results, possibly sends other SQL commands, until commit. It's only at commit time that we can possibly start a raft operation. However, precisely because they involve multiple client interactions, transactions can interleave. And if a write and read transaction interleave you might get the effect described in the example, without any possibility of resolving it with raft. I believe the only way would be to allow either N concurrent read transactions OR one single write transaction. At the moment rN ead transactions can proceed concurrently with one single write transactions thanks to the WAL and its snapshot isolation. Another way to see it, is that although the whole database is one big piece of state as you say, the operations that are run through raft are not the SQL transactions, but rather the result of the SQL transactions (i.e. the WAL pages that get added to the WAL). Not sure how well I've articulated this, if it's confusing I'm happy to elaborate further.
The thing is that the definition of linearizable refers to single objects. If you take just that definition and apply it to the single object "second row of table foo", then this is a stale read because it violates real-time ordering. Likewise if you take the stronger multi-object version of linearizability, called strict serializability, this example is still a stale read since it violates real-time ordering. As you point, these two transactions can be logically sequenced (i.e. serialized). However real-time ordering is violated, which is the key of linearizability (or actually strict serializability for transactions) vs serializability. |
Hmm, I'm looking at this part from the jepsen.io page about strict serializability that you linked:
Under that partial precedence order, the two transactions in your example wouldn't even be comparable, hence strict serializability doesn't impose any requirement on the order they should be applied -- right? |
If you call the read transaction in the example "A" and the write transaction "B", you have B completing before A in real-time ordering but the serialization (logical) ordering needs to be that A completes before B (because it does not see the modifications made by B). This violates strict linearizability. The ordering between these 2 specific transactions can't be partial, because they operate on the same data. In case you aren't yet convinced, it might be worth writing to jepsen's author Kyle Kingsbury (aphyr at jepsen.io) to confirm or provide more details. Canonical had paid for training/consulting from him back then, so I believe he will be happy to reply. I'm not nearly as expert as him. |
Raft as described in the whitepaper guarantees linearizability by (among other things) having leaders respond to read-only requests only after contacting a majority of the cluster. This ensures that leaders who have been deposed can't respond to a read-only request with stale data. Our implementation of raft doesn't support this kind of checked read-only request, and dqlite leaders will answer queries without contacting a majority of the cluster first; as noted in the documentation, this means that queries can return stale data. I think it would be useful for dqlite (and go-dqlite) to provide an API for performing linearizable reads/queries that do check that the current leader hasn't been deposed.
One subtlety: dqlite leaders will run a barrier before performing queries if there are uncommitted entries in the leader's log. In that case, the barrier does the work of contacting a majority of the cluster, and the operation will be linearizable. But it's possible for the leader to have been deposed without having uncommitted entries in its log, so linearizability is not guaranteed in all circumstances.
The text was updated successfully, but these errors were encountered: