introduction.tex

\section{Introduction}
\label{sec:introduction}

Large-scale web services and decentralized applications often rely on
geo-distributed state replication to meet their latency and availability
needs. An application's state is replicated asynchronously in such a
setting, meaning that operations are independently executed at each
replica, and updates are applied asynchronously at other replicas after a
possible network delay. Asynchronous execution complicates programming and
reasoning about distributed applications as it induces the possibility of
conflicting updates leading to non-convergence and application integrity
violations. Two basic approaches have been proposed to address this
problem. The first approach is to selectively strengthen the system
consistency to pre-empt the conflicting updates. The second is to redefine
the application state in terms of \emph{Replicated Data Types} (RDTs) that
are specially engineered to handle conflicting updates. Strengthening
system consistency necessarily entails inter-replica coordination,
therefore applications prefer to utilize RDTs as much as possible,
resorting to consistency strengthening only when it is necessary to maintain
application integrity. The common design principle guiding the development
of replicated data types is commutativity. The idea is that if the
replicated state is only updated by commutative operations, then updates
can be applied in any order and the replica states are still guaranteed to
converge. Indeed, there exist common use cases in web applications that can
be implemented using Commutative Replicated Data Types
(CRDTs)~\cite{crdts}. For instance, a video view counter on a streaming
application such as YouTube can be implemented as a Counter CRDT that
supports commutative increments.

In general, however, commutativity is not a common occurrence among
data types. Most common data type definitions come with at least a
pair of operations that do not commute. For instance \C{add} and
\C{remove} operations on a set do not commute if both are for the same
element.  Likewise two \C{insert} operations for the same position in
a list do not commute. Such non-commutative operations translate to
conflicting updates in a distributed setting, whose cumulative result
cannot be determined by the sequential specification of the data type
alone. To use a non-commutative data type in a replicated setting,
creative re-engineering of its internal representation and algorithms
is needed to turn it into a bonafide CRDT with a sensible distributed
semantics.
% In the case of set data type, that could involve changing the internal
% representation to maintain two sets -- a set containing the current
% elements and a \emph{tombstone} set that contains every element that
% has ever been deleted from the set. A \C{remove $k$} operation removes
% $k$ from the element set and adds it to the tombstone set, and an
% \C{insert $k$} operation adds $k$ to the element set only if it is not
% present in the tombstone set (i.e., only if $k$ was not previously
% deleted from the set). This results in a CRDT set called
% \emph{two-phase set} where an element once removed cannot be re-added. 
% A more complex re-engineering of the set data type could result in a set
% CRDT with a different (and arguably more useful) semantics.
% Sec.~\ref{sec:motivation} describes a \emph{remove-wins} CRDT set, which
% allows elements to be readded, but which prioritizes \C{remove}s over
% concurrent \C{insert}s. This transformation requires vector
% clocks~\cite{vectorclock} to keep track of casual ordering among
% \C{insert}s and \C{remove}s.
Indeed, most CRDTs have such carefully-crafted implementations that
rely on non-trivial technical devices to keep track of causal
dependencies and avoid or resolve conflicts. For instance, various
CRDT variants of the set abstract data type rely on \emph{vector
clocks} to identify conflicting \C{add}s and
\C{remove}s~\cite{zawirski-thesis, zhang}, Replicated Growable Array
(RGA) -- a CRDT for collaborative editing~\cite{rga}, and JSON CRDT --
a CRDT variant of the JSON storage format, both use \emph{Lamport
timestamps}~\cite{rga, json-crdt}; and TreeDoc, another CRDT for
collaborative editing, makes use of \emph{dense linear
orders}~\cite{treedoc}. Such advanced implementations makes it hard to
reason about basic correctness properties of CRDTs, such as
convergence. Indeed, formal verification of \emph{strong eventual
consistency} (i.e., eventual convergence) for few of the
aforementioned CRDTs required considerable effort on behalf of
multiple authors who are experts in the field~\cite{kleppmann2017}.
The extraordinary effort and the expertise required to build CRDTs is
a deterrent towards using type-safe abstractions with strong
guarantees in distributed applications. To overcome this deterrent,
there is a need for an alternative approach that lets developers reuse
their sequential abstractions in a distributed setting with little to
no additional overhead of reasoning about their correctness properties
such as convergence.

In this paper we describe an alternative approach to building
convergent replicated data types that realizes the aforementioned
virtues. Our approach is based on Mergeable Replicated Data Types
(MRDTs)~\cite{mrdt} -- an alternative take on RDTs that is inspired by
the Git version control system. MRDTs adopt a state-centric model of
replication based on version-controlled \emph{mergeable} states
instead of an operation-centric model based on commutative operations.
Unlike commutativity, mergeability does not require a data type
definition to be refactored to suit distributed execution. Instead,
the type definition only needs to be extended with a \C{merge}
function to reconcile concurrent versions of the type in presence of
their common ancestor version. However, mergeability itself doesn't
guarantee convergence; there exist MRDTs with well-defined and
intuitive merge semantics that nonetheless admit divergent executions.
While constraining the data type semantics to conform to algebraic
laws such as monotonicity might restore convergence, it would burden
the developer with the task of enforcing such constraints, which we
would like to avoid. We therefore propose an alternative approach to
convergence -- one that is based on runtime orchestration rather than
static enforcement. In particular, we extend MRDTs with a distributed
runtime that orchestrates only the \emph{well-formed} executions which
are guaranteed to converge. Notably, our approach guarantees
convergence of MRDTs \emph{regardless} of their merge semantics, thus
obviating the need to constrain the type semantics or their
implementations. The key insight behind our approach is the
observation that a data type's \C{merge} is a complete function, and
thefore can be safely invoked to merge concurrent versions of the type
notwithstanding their individual operation histories. Thus, a
replica's current version can always be made available for user
operations with the guarantee that it can be safely merged with a
concurrent version at a later point of time. The ability to
arbitrarily delay merges gives us the opportunity to orchestrate them
in a way that is \emph{effectively} linear, which inturn guarantees
convergence.

Note that it is easy to construct a trivial runtime that guarantees
convergence of all executions. Such a runtime would make an extensive
use of inter-replica coordination to synchronize the execution of
every operation and thus induce a sequentially-consistent behavior.
This would however defeat the purpose of state replication as the
latency incurred for synchronized execution of an operation is
prohibitively high, and the availability of the system during the
execution is correspondingly low. It is therefore important for a
distributed runtime of RDTs to \emph{not} interfere with the execution
of RDT operations. Orchestrating a well-formed convergent execution
without impacting latency and availability is possible despite the the
CAP theorem~\cite{cap}. While CAP theorem does rule out
linearizability, it does not prohibit enforcing weaker consistency
constraints on distributed executions. Our approach exploits this
observation by (\rom{1}).  Identifying a novel set of constraints on
distributed executions that guarantee their convergence, and
(\rom{2}). Localizing such constraints on (asynchronously executed)
state merges such that per-operation latency and system-wide
availability remain unaffected. The downside of interfering with
merges, however, is that the replicas may now take longer to converge
-- a phenomenon we quantify using a metric called \emph{staleness}
(Sec.~\ref{sec:eval}). Our approach thus brings to fore a fundamental
tradeoff that has not been explored in the context of RDTs.

\paragraph{Contributions} The contributions of this paper are summarized
below:
\begin{itemize}
  \item We present a runtime-assisted approach to MRDT state
    replication that sufficiently constrains distributed executions so
    as to guarantee their convergence. This is in contrast to the
    existing models of replication, where each data type is obligated
    to prove its convergence by discharging algebraic proof
    obligations.

  \item We formalize the aforementioned constraints in the context of
    an abstract machine (named \quark) that manifests distributed
    executions as version history graphs similar to Git.  We prove the
    \emph{convergence} and \emph{progress} properties of the \quark
    abstract machine that together guarantee its soundness. The
    formalization has also been mechanized in Ivy~\cite{ivy} and
    automatically verified with the help of Z3~\cite{z3}. To the best
    of our knowledge, this is the first time a formal system's
    meta-theory is proven completely inside an SMT solver.

  \item We formalize \quark distributed machine that implements the
    semantics of the \quark abstract machine in the context of an
    asynchronous distributed system. The formalization addresses
    several practical concerns of distribution and serves as a
    blueprint for an efficient implementation.

  \item We describe an implementation of \quark as a light-weight shim
    layer on top of Scylla -- an off-the-shelf distributed key-value
    store~\cite{scylla}. We describe an extensive evaluation to
    quantify the staleness metric and novel tradeoffs in the context
    of case studies involving a data structure benchmark and a
    collaborative editing application. 
\end{itemize}

While our focus in this paper is on RDTs, we believe our contributions
are also meaningful in the context of the long line of work on
programming models for shared-memory concurrency. In particular, this
work generalizes concurrent revisions~\cite{BBL+10} beyond fork-join
parallelism and server-client paradigm. Our approach demonstrates how
causal coherency-based semantics can be implemented more generally in
decentralized systems while guaranteeing the convergence of concurrent
executions.