rts_work_stealing2.tex


\section{Improved work stealing implementation}
\label{sec:rts_work_stealing2}

In this section we attempt to improve on our work stealing
implementation from Section~\ref{sec:rts_work_stealing}.
While we made these improvements,
we also found it useful to change how idle engines behave.
Although these two changes are conceptually distinct,
they were made together and their implementations are interlinked.
Therefore we will present and benchmark them together as one set of changes.
While our changes are an improvement in terms of design,
they do not always yield an improvement in performance.

\plan{Describe problems with associating stacks with contexts}
Sparks were originally stored on context-local deques.
This has the following significant problems.

\begin{description}

\item[There is a dynamic number of spark deques.]

The number of contexts changes during a program's execution,
therefore the number of spark deques also changes.
This means that we must have code to manage this changing number of
deques.
This code makes the runtime system more complicated than necessary,
both when stealing a spark and when creating or destroying a context.

\item[Locking is required.]

The management of contexts must be thread safe so that the set of
spark deques is not corrupted.
We store spark deques in a global array protected by a lock.
It may be possible to replace the array with a lock free data structure,
however it is better to remove the need for thread safety by using a
constant set of deques.

\item[A large number of contexts makes work stealing slower.]

In prior sections
we have shown that the number of contexts in use can often be very high,
much higher than the number of Mercury engines.
If there are $N$ engines and $M$ contexts,
then there can be at most $N$ contexts running and
at least $M-N$ contexts suspended (blocked and waiting to run).
A context can become suspended in one of two ways:
by blocking on a future's value in a call to \wait,
or by blocking on an incomplete conjunct in a call to \joinandcontinue.
We attempt to minimise the former case (see Chapter~\ref{chap:overlap})
and the latter case cannot occur
if the context has a spark on its local spark deque (it would run the
spark rather than block).
Therefore the large majority of the suspended contexts will not have
sparks on their deques,
and the probability of selecting a deque at random with a spark on its
stack is low.
Furthermore the value of $M$ can be very high in pathological cases.
If an engine does not successfully steal a spark from a deque,
it will continue by trying to steal a spark from a different deque.
An engine can exhaust all its attempts, even when there is work
available:
a spark may be placed on a deque after the engine has already attempted
to steal from it.
Upon exhausting all its attempts or all the deques,
the engine will sleep before making another round of attempts
(see \trystealspark in Algorithm~\ref{alg:try_steal_spark_initial} on
page~\pageref{alg:try_steal_spark_initial}).
Each round of attempts (regardless of success or failure) has a
complexity of $O(M)$.

\end{description}

\plan{We associate stacks with engines}
The approach we have taken to solving these problems is based on associating
spark deques with engines rather than with contexts.
A running Mercury program has a constant number of engines,
and therefore the number of spark deques will not vary.
This allows us to remove the code used to resize the deque array.
This makes context creation and destruction much simpler.
It also removes the need for the lock protecting the global array of spark
deques.
We can also remove the locking code in \trystealspark
(whose original version was shown in
Algorithm~\ref{alg:try_steal_spark_initial})
which was used to
ensure that the array is not changed while a thief is trying to steal a
spark.
The cost of work stealing also becomes linear in the number of engines
rather than the number of contexts.

\begin{algorithm}[tbp]
\begin{verbatim}
 1  MR_Code* MR_try_steal_spark(void) {
 2      int         victim_incex;
 3      MR_Deque    *deque;
 4      MR_Spark    spark
 5
 6      if ((MR_ENGINE(MR_eng_current_context) != NULL) ||
 7              (MR_num_outstanding_contexts < MR_max_contexts))
 8      {
 9          for (int attempt = 0; attempts < MR_num_engines; attempt++) {
10              victim_index = (MR_ENGINE(MR_eng_victim_counter) + attempt)
11                  % MR_num_engines;
12              if (victim_index == MR_ENGINE(MR_eng_id)) {
13                  continue;
14              }
15              if(MR_steal_spark(MR_spark_deques[victim_index], &spark)) {
16                  MR_ENGINE(MR_eng_victim_counter) = victim_index;
17                  MR_prepare_engine_for_spark(&spark)
18                  return spark.MR_spark_code;
19              }
20          }
21          MR_ENGINE(MR_eng_victim_counter) = victim_index;
22          return NULL;
23  }
\end{verbatim}
\caption{\trystealspark --- revised work stealing version}
\label{alg:try_steal_spark_revised}
\end{algorithm}

\begin{algorithm}[tbp]
\begin{verbatim}
void MR_prepare_engine_for_spark(MR_Spark *spark) {
    MR_Context *ctxt;

    ctxt = MR_ENGINE(MR_eng_current_context);
    if (ctxt == NULL) {
        ctxt = MR_get_free_context();
        if (ctxt == NULL) {
            ctxt = MR_create_context();
        }
        MR_load_context(ctxt);
    }
    MR_parent_sp = spark->MR_spark_parent_sp;
    ctxt->MR_ctxt_thread_local_mutables =
        spark->MR_spark_thread_local_mutables;
}
\end{verbatim}
\caption{\prepareengineforspark}
\label{alg:prepare_engine_for_spark}
\end{algorithm}

\plan{Show new \trystealspark}
We have made some other changes to \trystealspark,
whose new code is shown in Algorithm~\ref{alg:try_steal_spark_revised}.
Previously \trystealspark required its caller to check that stealing a spark
would not exceed the context limit;
this check has been moved into \trystealspark (lines 6--7).
Another change is that \trystealspark
will now call \prepareengineforspark (line 12)
which will load a context into the engine if it does not already have one
(Algorithm~\ref{alg:prepare_engine_for_spark}).
Finally \trystealspark will return the address of the spark's code;
the caller will jump to this address.
If \trystealspark could not find a spark it will return NULL,
in this case its caller will look for other work.

In the previous version,
the lock was also used to protect the victim counter;
it ensured that round-robin selection of the victim was maintained.
Now each engine independently performs its own round-robin selection of the
victim using a new field \code{MR\_victim\_counter} in the engine structure.
If the engine finds a spark,
\trystealspark will save the current victim index to this field,
as this deque may contain more sparks which the engine can try to
steal later.
We have not evaluated whether this policy is an improvement.
However when compared with the improvement due to removing the lock and
potentially large number of spark deques,
any difference in performance due to the selection of victims will be
negligible.
The two other changes are minor:
we have removed the configurable limit of the number of work stealing
attempts per round and also
the test for null array slots;
both are now unnecessary.

\begin{algorithm}[tbp]
\begin{verbatim}
 1  void MR_idle(void) {
 2      int                 engine_id = MR_ENGINE(MR_eng_id);
 3      MR_EngineSleepSync  *eng_data = MR_engine_sleep_data[engine_id];
 4      MR_Context          *ctxt;
 5      MR_Context          *current_context = MR_ENGINE(MR_eng_current_context);
 6      MR_Code             *resume;
 7      MR_Spark            spark;
 8
 9      eng_data->MR_es_state = MR_LOOKING_FOR_WORK;
10
11      // Try to find a runnable context.
12      MR_acquire_lock(&MR_runqueue_lock);
13      ctxt = MR_get_runnable_context();
14      MR_release_lock(&MR_runnable_lock);
15      if (ctxt != NULL) {
16          eng_data->MR_es_state = MR_WORKING;
17          resume = MR_prepare_engine_for_context(ctxt);
18          MR_GOTO(resume);
19      }
20
21      // Try to run a spark from our local spark deque.
22      if (MR_pop_spark(MR_ENGINE(MR_eng_spark_deque), &spark)) {
23          eng_data->MR_es_state = MR_WORKING;
24          MR_prepare_engine_for_spark(&spark);
25          MR_GOTO(spark->resume);
26      }
27
28      if (!(MR_compare_and_swap(&(eng_data->MR_es_state),
29              MR_LOOKING_FOR_WORK, MR_STEALING)))
30      {
31          // Handle the notification that we have received.
32      }
33      // Try to steal a spark
34      resume = MR_try_steal_spark();
35      if (resume != NULL) {
36          eng_data->MR_es_state = MR_WORKING;
37          MR_GOTO(resume);
38      }
39      MR_GOTO(MR_sleep);
40  }
\end{verbatim}
\caption{\idle --- improved work stealing version}
\label{alg:idle}
\end{algorithm}

We have also made changes to the code that idle engines execute.
An idle engine with a context that is free (it can be used for any spark) or
without a context will call \idle to acquire new work.
\idle has changed significantly;
the new version is shown in Algorithm~\ref{alg:idle}.
This algorithm uses a structure pointed to by \var{eng\_data} to receive
notifications;
we will discuss this structure and how it is used to send and receive
notifications later in this section.
For now we will discuss how \idle looks for work in lieu of any notification.

\idle tries to find a runnable context before trying to find a local
spark (lines 12--19).
A context that has already been created usually represents work that is
\emph{to the left of}
any work that is still a spark (Section~\ref{sec:rts_work_stealing}).
Executing $G_1$ of a parallel conjunction $G_1~\&~G_2$ can signal futures
that $G_2$ may otherwise block on.
Executing $G_1$ may also create sparks leading to more parallelism.
Once a context is resumed and its execution finishes,
then its memory can be used to execute a spark from the engine's local spark
deque.
Checking for runnable contexts first also helps to avoid the creation of new
contexts before existing contexts' computations are finished.
This partially avoids deadlocks that can occur when the context limit is
exceeded:
executing a waiting context rather than attempting to create a new context
for a spark does not increase the number of contexts that exist.

\begin{algorithm}
\begin{verbatim}
MR_Code* MR_prepare_engine_for_context(MR_Context *ctxt) {
    MR_Code *resume;

    if (MR_ENGINE(MR_eng_current_context) != NULL) {
        MR_release_context(MR_ENGINE(MR_eng_current_context));
    }
    MR_load_context(ctxt);
    resume = ctxt->MR_ctxt_resume;
    ctxt->MR_ctxt_resume = NULL;
    return resume;
}
\end{verbatim}
\caption{\prepareengineforcontext}
\label{alg:prepare_engine_for_context}
\end{algorithm}

\idle tries to run a context by attempting to take a context from the
context run queue while holding the context run queue lock (lines 12--14).
If it finds a context,
it will execute \prepareengineforcontext
(Algorithm~\ref{alg:prepare_engine_for_context}),
which checks for and unloads the engine's current context,
and then loads the new context.
It also clears the context's resume field and returns the field's previous
value so that \idle can jump to the resume address.
The context's resume field must be cleared here as
\joinandcontinue may use it later for synchronisation.
Previously the run queue's lock protected all of \idle;
this critical section has been made smaller,
which is probably more efficient.

\begin{figure}
\begin{center}
\begin{verbatim}
nested_par(...) ;-
    (
        p(..., X),
    &
        (
            q(X, ...)
        &
            r(..., Y)
        ),
        s(...)
    &
        t(Y, ...)
    ).
\end{verbatim}
\end{center}
\caption{Nested dependent parallelism.}
\label{fig:nested_dep_par}
\end{figure}

If \idle cannot find a context to resume then it
will attempt to run a spark from its local spark deque (lines 22--26).
If there is a spark to execute and the engine does not have a context,
a new one needs to be created;
this is handled by \prepareengineforspark above.
We considered creating a context and executing the spark only if the
context limit had not been reached,
however this can create a problem, which we explain using an example.
Consider Figure~\ref{fig:nested_dep_par},
which shows two nested parallel conjunctions and
two futures that create dependencies between the parallel
computations.
\var{X} creates a dependency between \code{p} and \code{q},
and \var{Y} creates a dependency between \code{r} and \code{t}.
When an engine, called E1,
executes this procedure,
it pushes the spark for the second and third
conjuncts of the outer conjunction onto its local spark deque.
E1 then begins the execution of \code{p}.
A second engine, called E2,
steals the spark from E1's spark deque and immediately creates two new
sparks; the first is for \code{t} and the second for \code{r}.
It pushes the sparks onto its deque,
therefore the spark for \code{t} is on the bottom (cold end).
E2 then begins the execution of \code{q}.
A third engine, E3,
wakes up and steals the spark for \code{t} from E2.
Next,
E2 attempts to read the value of \var{X} in \code{q},
but it has not yet been produced.
Therefore E2 suspends its context and calls \idle, which cannot find a context
to execute and therefore checks for and finds the local spark for \code{r}.
E2 needs a new context to execute \code{r};
if the context limit is exceeded then E2 cannot proceed.
Without E2's execution of \code{r} and production of \var{Y},
E3 cannot complete the execution of \code{t} and free up its context.
The system will not deadlock, as the execution of \code{p} by E1 can still
continue, and will eventually allow the execution of \code{q} to continue.
However, we still wish to avoid such situations,
therefore we do not check the context limit before running local contexts.
Without this check,
E2 is able to create a context and execute \code{r}.
This can never create more than a handful of extra contexts:
while the limit is exceeded,
engines will not steal work and will therefore execute sparks in a
left-to-right order;
dependencies and barriers cannot cause more contexts to become blocked.

The two choices we have made,
executing local sparks without checking the context limit and
attempting to resume contexts before attempting to execute sparks,
work together to prevent deadlocks that could be caused by
the context limit and dependencies.
The context limit is a work-around to prevent some kinds of worst case
behaviour and should not normally affect other algorithmic choices.
These algorithmic choices are valid and desirable,
even if we did not need or use the context limit,
as they keep the number of contexts low.
This is good for performance as contexts consume significant amounts of
memory and the garbage collector spends time scanning that memory.

If an engine executing \idle cannot find a local spark then it will
check for any notifications (lines 28--32).
If it has not been notified then it will try to steal a spark from another
engine using \trystealspark (above) (lines 34--38).
\trystealspark is a C function,
allowing us to call it from multiple places in the runtime system.
It returns the address of the spark's entry point to its caller,
which jumps to the entry point.
In this case the caller is \idle,
which is C code that uses the Mercury calling convention.
This convention passes all its arguments and results in the Mercury abstract
machine registers, including the return address.
Additionally \idle does not call any other Mercury procedures.
Therefore, it does not need to create a frame on any of the C or Mercury stacks:
any local C variables are stored on the C stack frame shared between all
Mercury procedures (and must be saved by the caller).
\idle has another difference with other Mercury procedures:
it never returns control to its caller.
Instead it continues execution by jumping to the code address of the next
instruction to execute,
such as the resume address of a context,
or the entry point of a spark.
If its callees, \trystealspark or \prepareengineforcontext were to make this
jump rather than returning and then letting \idle jump,
then this would leak the stack frame created for the C call.
Therefore, \trystealspark and \prepareengineforcontext must return the
address of the next instruction to execute and allow \idle to jump to this
code or jump to \sleep.

\pagebreak
\plan{Why we need notifications, their purpose}.
A sleeping engine may need to wake up occasionally and handle new parallel
work.
There are two ways to do this:

\begin{description}

    \item[polling:] engines can wake up occasionally and \emph{poll} for new
    work.

    \item[notification:] engines can sleep indefinitely and be woken up by
    some other engine when work becomes available.

\end{description}

\noindent
Each of these options has its own drawbacks and benefits.
Polling is very simple, but it has a number of drawbacks.
An engine that wakes up will often find that there is no parallel work,
and it will have to go to sleep again.
In first impression this uses only a very small amount of CPU time which
seems harmless,
however many idle processes on a system behaving in this way can add up to a
significant amount of CPU time.
%\paul{I know of an anecdote, I've asked my friend if he documented it so I can
%produce a citation.}
The other problem with polling is that a parallel task will not be executed
immediately as an engine's polling timer must first expire.
These two problems trade off against one another:
the more frequently an engine polls for work the quicker it can react but
the more CPU time it will consume.

Notification has neither of these problems, and has some additional
benefits.
An engine making a notification performs a deliberate action.
We take advantage of this by attaching additional information to the
notification.
We also send the notification to a particular engine,
allowing us to control which of several sleeping engines gets woken up and
begins the parallel work.
The major drawback of a notification system is that every time parallel work
is created a notification may need to be made,
and this adds to the overheads of creating parallel work.

The previous version of the runtime system used polling for work
stealing and notification for contexts.
Engines waiting for a POSIX condition variable to be notified about free
contexts would time out every two milliseconds and attempt to steal work.
The system did not support selecting which engine would be woken when a
context became runnable.
The decision between polling and notification can be hard to make,
and it may depend on the application and its work load.
We have chosen to support both methods in our new work stealing code;
notification is used by default when a spark is created,
but the user may elect to use polling.
Notification is always used when a context becomes runnable.

\begin{figure}
\begin{verbatim}
struct MR_engine_sleep_sync {
    volatile unsigned                   MR_es_state;
    volatile unsigned                   MR_es_action;
    union MR_engine_wake_action_data    MR_es_action_data;
    sem_t                               MR_es_sleep_sem;
    lock                                MR_es_lock;
};

union MR_engine_wake_action_data {
    MR_EngineId     MR_ewa_worksteal_engine;
    MR_Context      *MR_ewa_context;
};
\end{verbatim}
\caption{\enginesleepsync structure}
\label{fig:engine_sleep_sync}
\end{figure}

\plan{Normal state transitions and the \enginesleepsync structure.}
Notifications are implemented using the \enginesleepsync structure shown in
Figure~\ref{fig:engine_sleep_sync}.
A global array allows engines to access one another's \enginesleepsync
structures.
Each structure contains the following fields:
\code{MR\_es\_state} represents the engine's current state,
and is used by both the owner and other engines to read and write the state
of the engine;
\code{MR\_es\_action} and \code{MR\_es\_action\_data} are a closure describing
the next action that the engine should perform;
\code{MR\_es\_sleep\_sem} is a semaphore that the owner will wait on in order to
sleep and other engines will signal if they want the engine to wake up;
and \code{MR\_es\_lock} is used to prevent multiple engines trying to wake the same
engine at once.
The \idle code above manipulates the state field in its structure,
which is pointed to by \code{eng\_data}.

\begin{description}
    \item[\code{MR\_WORKING}:] The engine has work to do and is doing it.

    \item[\code{MR\_LOOKING\_FOR\_WORK}:] The engine is looking for work in
        the form of a runnable context or local spark.  It may have work to
        do in the form of a local spark.

    \item[\code{MR\_STEALING}:] The engine is currently trying to steal work.

    \item[\code{MR\_SLEEPING}:] The engine is sleeping; it could not find any
        work and is now waiting for a notification.

    \item[\code{MR\_BUSY}:] The engine state is busy: this is a special value
        used for synchronisation.

    \item[\code{MR\_NOTIFIED}:] The engine has been notified of some work to
        do but has not started the work yet.
\end{description}

\begin{figure}
\begin{center}
\includegraphics[width=0.8\textwidth]{pics/engine_state}
\end{center}
\caption{Engine states and transitions}
\label{fig:engine_states}
\end{figure}

\noindent
The states are shown in Figure~\ref{fig:engine_states} as oval-shaped nodes
in the graph except for \code{MR\_NOTIFIED},
which is a single value that represents all the notification types,
depicted as rectangular nodes in the graph.
The oval node with a dashed outline is illustrative only,
it exists in the graph to make it easier to visualise how work stealing may
fail.
An engine can move itself between any of the states connected by a black or
blue edge:
blue denotes a transition that is done with a compare and swap on the
\code{MR\_es\_state} field of the \enginesleepsync structure,
whilst other transitions are made with an assignment.
The parallel runtime system is under the most load when there are a large
number of sparks that represent small computations.
When this occurs, engines spend most of their execution time in the
\code{MR\_WORKING}, \code{MR\_LOOKING\_FOR\_WORK} and \code{MR\_STEALING}
states, or transitioning between them.
Therefore these transitions are \emph{busier}
and their edges in the graph are drawn with thicker lines.

\plan{Notification transitions}
When an engine creates a spark or makes a context runnable
it can notify a recipient using the recipient's
\enginesleepsync structure.
The sender will update the action closure and
\code{MR\_es\_state} fields.
This is modelled as a state transition of the recipient;
such transitions are shown with red and green edges in the graph.
Red edges denote transitions from the sleeping state:
the sender acquires the \code{MR\_es\_lock} in the
\enginesleepsync structure while updating the structure's contents and
signalling the sleep semaphore.
Green edges denote transitions from other states:
the sender uses a compare and swap to write the \code{MR\_BUSY} value to the
\code{MR\_es\_state} field,
then updates the action closure,
and finally writes \code{MR\_NOTIFIED} to the \code{MR\_es\_state} field.
A sender using this latter method must not use the semaphore.
Using compare and swap with the state field allows the recipient to also
use compare and swap, and in some cases to simply write to the field,
when changing states.
This allows us to make the busy transitions more efficient.
These are the transitions from
\code{MR\_LOOKING\_FOR\_WORK} to
\code{MR\_STEALING} and
\code{MR\_WORKING}.
Note that in the diagram all but one of the transitions from
\code{MR\_LOOKING\_FOR\_WORK} and \code{MR\_STEALING}
are protected,
% I think this semicolon is a supercomma.  If I understand grammer
% correctly.
the exceptions being transitions to \code{MR\_WORKING},
which is acceptable because we allow some notifications to be ignored if the
recipient has found some other work.
The other transition and state not listed above occurs when the system shuts
down;
this is another exception as an engine could not possibly find work if the
runtime were shutting down\footnote{
    Mercury supports exceptions and it is conceivable that an exception may
    make the system abort.
%    Supporting exceptions in the context of parallel Mercury is discussed in
%    as further work (Section~\ref{sec:further_work}).
}.

There are several different actions that a sender can specify when notifying
another engine.

\begin{description}

    \item[\code{MR\_ACTION\_CONTEXT}:]
    A context has become runnable and is pointed to by the
    \esactiondata field.
    We attach the context to the message so that we do not have to place it
    in the context run queue and then retrieve it again when we know that
    this engine is capable of executing the context.
    However this notification cannot be ignored as that would cause the
    context to be leaked.
    Leaking a context means dropping or losing the context;
    a context that must be executed may not be leaked as the computation it
    represents would get lost.
    Therefore we only ever send \code{MR\_ACTION\_CONTEXT} messages to engines in
    the \code{MR\_SLEEPING} state,
    the only state that does not have an owner-initiated transition from
    it.

    \item[\code{MR\_ACTION\_RUNQUEUE\_CHECK}:]
    A context has become runnable and was placed on the context run queue.
    Because we cannot use the \code{MR\_ACTION\_CONTEXT} message with
    non-sleeping engines,
    we have created this message to be used with non-sleeping engines.
    When a context can only be executed by a single engine,
    because that engine has a frame on its C stack that the context needs,
    then a deadlock would occur if the context were placed in the context run
    queue after the engine had checked the run queue,
    but before the engine had gone to sleep (and allowed the message above).
    By adding this message we can notify the engine that it should check the
    run queue before going to sleep.

    \item[\code{MR\_ACTION\_WORKSTEAL}:]
    A spark has been placed on the spark deque indicated by the
    \esactiondata field.
    This message is used to notify an engine of a new spark that it might be
    able to steal and execute.
    Informing the thief which spark deque it should check is an
    optimisation.
    Before this message is sent, the sender will check that the recipient
    would not exceed the context limit by stealing and executing the spark.

    \item[\code{MR\_ACTION\_SHUTDOWN}:]
    The runtime system is shutting down, this engine should exit.

\end{description}

\noindent
\plan{Behaviour of \idle WRT notifications.}
The code for \idle in Algorithm~\ref{alg:idle} shows several of these
transitions,
including the transition to the \code{MR\_STEALING} state on lines 28--29.
If this compare and swap fails,
it indicates that the current engine should
wait for \code{MR\_es\_state} to become a value other than \code{MR\_BUSY} and then,
depending on the action closure,
either re-check the run queue or exit.
We have not shown the code that interprets and acts on the action closure,
however it is similar to the switch statement in \sleep
(Algorithm~\ref{alg:sleep}).

When an engine creates a spark,
it attempts to notify other engines of the new spark.
However, looking for and notifying an idle engine on every spark creation
would be prohibitively expensive.
Therefore we maintain a counter of the number of engines in an \emph{idle}
state.
When this counter is zero, an engine creating a spark will not attempt to
notify any other engine.
Idle states are \code{MR\_STEALING} and \code{MR\_SLEEPING}
and are shown in Figure~\ref{fig:engine_states} as shaded ovals.
Only engines in these states are sent the \code{MR\_ACTION\_WORKSTEAL} message.
We considered including the \code{MR\_LOOKING\_FOR\_WORK} state as an idle
state and allowing engines in this state to receive the
\code{MR\_ACTION\_WORKSTEAL} message,
However we think that most engines that execute \idle will find a runnable
context or a spark from their own deque and therefore trying to steal work
would be counter productive (and such a message would often be ignored).
Sending such a message would be \emph{too} speculative
(and is still a little speculative when sent to an engine in the
\code{MR\_STEALING} state).
When the user selects polling for work stealing rather notification,
then an engine will never attempt to notify another engine during spark
creation.

\begin{algorithm}[tbp]
\begin{verbatim}
 1  void MR_sleep(void) {
 2      int                 engine_id = MR_ENGINE(MR_eng_id);
 3      int                 victim_id;
 4      MR_EngineSleepSync  *eng_data = MR_engine_sleep_data[engine_id];
 5      MR_bool             got_spark;
 6      MR_Spark            spark;
 7      MR_Code             *resume;
 8
 9      if (!MR_compare_and_swap(&(eng_data->MR\_es\_state),
10              MR_STEALING, MR_SLEEPING)) {
11          // Handle the notification that we have received.
12      }
13      MR_sem_wait(&(eng_data->MR_es_sleep_sem));
14      switch(eng_data->MR_es_action) {
15          case MR_ACTION_SHUTDOWN:
16              MR_GOTO(MR_do_shutdown);
17          case MR_ACTION_WORKSTEAL:
18              victim_id = eng_data->MR_es_action_data.MR_ewa_worksteal_engine;
19              do {
20                  MR_spin_loop_hint();
21                  got_spark = MR_steal_spark(MR_spark_deques[victim_id],
22                      &spark);
23              } while (got_spark == BUSY);
24              if (got_spark == SUCCESS) {
25                  eng_data->MR_es_state = MR_WORKING;
26                  MR_ENGINE(MR_eng_victim_counter) = victim_id;
27                  MR_prepare_engine_for_spark(&spark);
28                  MR_GOTO(spark->MR_spark_code);
29              } else {
30                  MR_ENGINE(MR_eng_victim_counter) = victim_id + 1;
31                  MR_GOTO(MR_idle);
32              }
33          case MR_ACTION_CONTEXT:
34              eng_data->state = MR_WORKING;
35              ctxt = eng_data->MR_es_action_data.MR_ewa_context;
36              MR_GOTO(MR_prepare_engine_for_context());
37          case MR_ACTION_RUNQUEUE_CHECK:
38              MR_GOTO(MR_idle);
39      }
40  }
\end{verbatim}
\caption{\sleep}
\label{alg:sleep}
\end{algorithm}

An engine that executes \idle and cannot find any work will jump to the
\sleep procedure,
which is shown in Algorithm~\ref{alg:sleep}.
\sleep is another Mercury procedure written in C;
it starts by attempting to put the engine into the sleeping state,
checking for any notifications using a compare and swap as we saw in \idle.
If there are no notifications it will wait on the sleep semaphore in the
\enginesleepsync structure.
When another engine signals \code{MR\_es\_sleep\_sem} then the engine is woken up.
The woken engine will retrieve and act on the action closure.
We have shown the full switch/case statement for interpreting the values of
\code{MR\_es\_action} and \code{MR\_es\_action\_data} in \sleep.
The implied switch/case statement in \idle is similar,
except that it does not need to cover some actions.
When polling is enabled the call to \code{MR\_sem\_wait} is used with a
timeout.
If the timeout expires,
then the engine will acquire the lock and attempt to
steal a spark from any other engine, if it finds a spark it moves to the
\code{MR\_WORKING} state.
It holds the lock until it either decides to sleep again or sets its state
so that, in the meantime, no context notification is lost,
as contexts always use notification rather than polling.

When an engine receives a \code{MR\_ACTION\_WORKSTEAL} message it executes
the code at lines 18--31.
As mentioned in Section~\ref{sec:rts_work_stealing},
a data race when attempting to steal a spark will result in
\steal returning an error.
In \trystealspark this error is ignored and the algorithm moves to the next
deque.
However,
when an engine has received a message telling it that there is a spark on
this deque, the engine will use the loop on lines 19--23 to retry when a
collision occurs.
It stops once either it has been successful or the queue is empty.
We made this decision because it is possible that many sparks may be placed
on the deque at the same time,
and even though our engine lost a race to another thief, there may be more
sparks available.
If the engine fails to steal a spark from this deque,
the engine will jump to \idle where it will check the run queue and then
recheck all the spark deques.
The other actions performed when receiving a message are simple to
understand from Figure~\ref{fig:engine_states} and
Algorithm~\ref{alg:sleep}.

\begin{algorithm}[tbp]
\begin{verbatim}
 1  void MR_join_and_continue(MR_SyncTerm *st, MR_Code *cont_label) {
 2      MR_bool     finished, got_spark;
 3      MR_Context  *current_context = MR_ENGINE(MR_eng_current_context);
 4      MR_Context  *orig_context = st->MR_st_orig_context;
 5      MR_Spark    spark;
 6
 7      finished = MR_atomic_dec_and_is_zero(&(st->MR_st_num_outstanding));
 8      if (finished) {
 9          if (orig_context != current_context) {
10              while (orig_context->MR_ctxt_resume != cont_label) {
11                  CPU_spin_loop_hint();
12              }
13              MR_release_context(current_context);
14              MR_load_context(orig_context);
15          }
16          MR_GOTO(cont_label);
17      } else {
18          if (MR_pop_spark(MR_ENGINE(MR_eng_spark_deque), &spark)) {
19              if ((orig_context == current_context) &&
20                      (spark.MR_spark_sync_term != st)) {
21                  MR_save_context(current_context);
22                  current_context->MR_ctxt_resume = cont_label;
23                  MR_ENGINE(MR_eng_current_context) = NULL;
24                  if (nonempty(MR_context_runqueue)) {
25                      MR_push_spark(MR_ENGINE(MR_eng_spark_deque), &spark);
26                      MR_GOTO(MR_idle);
27                  }
28              }
29              MR_prepare_engine_for_spark(&spark);
30              MR_GOTO(spark.MR_spark_code);
31          } else {
32              if (orig_context == current_context) {
33                  MR_save_context(current_context);
34                  current_context->MR_ctxt_resume = cont_label;
35                  MR_ENGINE(MR_eng_current_context) = NULL;
36              }
37              MR_GOTO(MR_idle);
38          }
39      }
40  }
\end{verbatim}
\caption{\joinandcontinue --- improved work stealing version}
\label{alg:join_and_continue_ws2}
\end{algorithm}

We have also had to make two changes to \joinandcontinue,
as shown in Algorithm~\ref{alg:join_and_continue_ws2}.
The first change is an optimisation on lines 13--14,
when the parallel conjunction's original context is suspended
and the engine's current context executes the last conjunct in the parallel
conjunction,
then the previous version of the algorithm would schedule the original
context by placing it on the run queue and then execute \idle as its current
context had finished.
Since we know that the current context is finished and the original one is
runnable,
we now resume the original context's execution directly.

The second change is on lines 19--28.
This new if statement checks whether the context that the engine holds is
compatible with the spark it has just popped from the deque.
A context is incompatible if it contains stack frames of a computation that
is not a parent of the computation the spark represents.
This means that the context is the original context for the parallel
conjunction it has been working on,
but the spark is not part of this parallel conjunction.
When this happens we must change the context
(we cannot choose a different spark without stealing and we try to minimise
stealing),
therefore we suspend the current context (lines 21--23).
Since a context switch is mandatory, then we know that it is better to
switch to a context that is already runnable,
so if the run queue is non-empty we put the spark back on the deque and jump
to \idle (lines 24--27);
otherwise we execute the spark.

\input{tab_work_stealing_revised3pt1}

\plan{Benchmark}
Table~\ref{tab:work_stealing_revised_mandelbrot} shows a comparison of the previous
work stealing implementation with this one using mandelbrot.
There are two row groups with three rows each.
The first group shows the results for right recursive mandelbrot while the
second shows the results for left recursive mandelbrot.
The first row in each group gives the results for the previous work stealing
system and
the second and third rows in each group show the results for the current
system using either notifications (second row) and polling (third row).
Each result is presented with its standard deviation in parentheses.
We compiled the right recursive mandelbrot program with the conjunct
reordering transformation from Section~\ref{sec:rts_reorder} disabled,
otherwise it would have had the same results as the left recursive version
of the same program.
It is still important to test right recursion such as in this example,
as the conjunct reordering transformation cannot reorder dependent parallel
conjunctions so right recursion can still occur.

The differences between the old and new results are minimal;
we have shown the standard deviation for each result to make it easier to
see where comparisons are valid.
In both left and right recursive mandelbrot programs using one Mercury
engine,
the new work stealing code with notifications appears to perform slightly worse than
the original work stealing code;
however the figures are so close that such a comparison may not be
valid.
Neither of the two configurations of the new implementation is faster than
the original implementation.
For right recursive mandelbrot both configurations of the new
implementation improved on
the old implementation.
In left recursive mandelbrot all three versions are comparable.
The old implementation used one spark deque per context, of which there are
many in right-recursive code;
whereas the new implementation uses one spark deque per engine, of which
there are a constant number.
Therefore,
we believe that this supports our decision to associate deque with engines
rather than contexts, in order to make work stealing simpler.

%Using only one engine means that work stealing will not occur but spark
%creation does, the engine will also execute \joinandcontinue often.
%We believe that this slow down is due to the notification code executed
%during each spark's creation.
%While no notifications are ever made, a check is performed to see if a
%notification should be made.
%This makes sense since the new implementation with notifications was slowest
%than the other two versions.
%The new implementation's changes to \joinandcontinue may also be having a
%performance impact;
%now \joinandcontinue must check that the current context is compatible with
%a spark before executing the spark.
%When we look at the cases for two or more Mercury engines making meaningful
%comparisons is still very difficult.
%The result of 8.1 seconds for right recursive mandelbrot with
%the new runtime system using notifications is an outlier:
%it has an unusually large standard deviation of 1.3s and we cannot use it
%for comparison.
%The right recursive results with four Mercury engines have low standard
%deviations and seem to suggest that the new runtime system is faster,
%especially when using polling.

% The left recursive mandelbrot program generates all its
% sparks at once and the right recursive one does not.
% The revised work stealing system may be slightly faster because it
% notifies an engine when there is a spark to execute in parallel,
% while in the original version, a sleeping engine polls the spark deques
% every two milliseconds.
% However, the more likely explanation is that the search through all the
% deques takes longer in the original version as the number of deques is equal
% to the number of contexts,
% and as we saw earlier in this chapter right recursion can create a large
% number of contexts.
% This is especially true with larger numbers of engines;
% fewer sparks are executed on the engine on which they were created where
% they can re-use their parent context.

\begin{figure}
\begin{center}
\hfill
\begin{minipage}[b]{0.39\textwidth}
\begin{center}
\subfigure[\fibs, no granularity control]{
\begin{tabular}{l}
\code{:- func fibs(int) = int.} \\
\\
\code{fibs(N) = F :-} \\
\code{~~~~( N < 2 ->} \\
\code{~~~~~~~~F = 1} \\
\code{~~~~;} \\
\code{~~~~~~~~(} \\
\code{~~~~~~~~~~~~F1 = fibs(N-1)} \\
\code{~~~~~~~~\&} \\
\code{~~~~~~~~~~~~F2 = fibs(N-2)} \\
\code{~~~~~~~~),} \\
\code{~~~~~~~~F = F1 + F2} \\
\code{~~~~).} \\
% Why is this the only way that I can add vertical space that works?!?
\\
\\
\\
\\
\\
\end{tabular}
\label{fig:fibs}}
\end{center}
\end{minipage}
%
\hfill
\begin{minipage}[b]{0.59\textwidth}
\begin{center}
\subfigure[\fibsgc, with granularity control]{
\begin{tabular}{l}
\code{:- func fibs\_gc(int, int) = int.} \\
\code{} \\
\code{fibsgc(N, Depth) = F :-} \\
\code{~~~~( N < 2 ->} \\
\code{~~~~~~~~F = 1} \\
\code{~~~~;} \\
\code{~~~~~~~~( Depth > 0 ->} \\
\code{~~~~~~~~~~~~(} \\
\code{~~~~~~~~~~~~~~~~F1 = fibs\_gc(N-1, Depth-1)} \\
\code{~~~~~~~~~~~~\&} \\
\code{~~~~~~~~~~~~~~~~F2 = fibs\_gc(N-2, Depth-1)} \\
\code{~~~~~~~~~~~~)} \\
\code{~~~~~~~~;} \\
\code{~~~~~~~~~~~~F1 = fibs\_seq(N-1),} \\
\code{~~~~~~~~~~~~F2 = fibs\_seq(N-2)} \\
\code{~~~~~~~~),} \\
\code{~~~~~~~~F = F1 + F2} \\
\code{~~~~).} \\
\end{tabular}
\label{fig:fibsgc}}
\end{center}
\end{minipage}
\hfill
%
\caption{Naive parallel Fibonacci with and without granularity control}
\end{center}
\end{figure}

\label{page:fibs}
To make it easier to see the effects of the new work stealing code,
we created a new benchmark named fibs.
Fibs calculates the value of the 43\textsuperscript{rd} Fibonacci number
using the naive algorithm:
summing the previous two Fibonacci numbers where these are defined
recursively.
This calculation is shown as the function \fibs in Figure~\ref{fig:fibs}.
    (The arity of a Mercury function is written as \code{N+1}.
    The \code{+1} refers to the return value,
    while \code{N} refers to the function's input arguments.)
The naive fibs program is a useful micro-benchmark for work stealing.
It creates many small tasks whose overheads dominate the program's execution
time,
allowing us easily measure the overheads of parallel execution.
However as most engines will be creating sparks,
engines are more likely to find sparks on their own deques and therefore will
attempt to steal work less frequently.
By comparison in the mandelbrot program, only one engine at a time creates
sparks,
and therefore all the other engines will steal work from it whenever they can.
Results for this version of \fibs
are shown in the ``without GC'' (fifth) row group of 
Table~\ref{tab:work_stealing_revised_fibs}.

%making the program slower than its sequential version;
%this can be seen in the table.

To avoid embarrassing parallelism and create a point for comparison,
we have also shown results for fibs using \emph{granularity control} (GC) to
create fewer but larger parallel tasks
(parallelism with coarser grains).
We can achieve this by introducing a new function \fibsgc,
which is shown in Figure~\ref{fig:fibsgc}.
\fibsgc takes an additional argument named \Depth which is
the number of recursion levels near the top of the call graph in
which parallel execution should be used.
Below this depth sequential execution is used.
The first four rows of
Table~\ref{tab:work_stealing_revised_fibs} show results using \fibsgc,
with varying initial values for \Depth.
%This is just one type of granularity control.
%Granularity control is any method used to avoid
%embarrassing parallelism by turning a computation with
%many small parallel tasks into one with fewer larger parallel tasks.

\input{tab_work_stealing_revised3pt2}

Fibs has better parallel performance when \Depth's initial value is
smaller,
demonstrating how granularity control improves performance.
In parallelised divide and conquer code such as this,
the context limit does not affect performance as much as it does with right
single recursive code.
This is because both conjuncts have similar costs compared to the different
costs in singly recursive code.
The original context is less likely to block on the completion of the second
conjunct,
and if it does it will not need to block for as long as it would
with right recursive code.
We do not compare the context limit behaviour of divide and conquer with that
of left recursion,
as work stealing solved the context limit problems in left recursive code
(Section~\ref{sec:rts_work_stealing}).
The coarse grained fibs results (smaller values for \Depth)
show little difference between the different work stealing implementations.
However,
in the fine grained fibs tests we can clearly see that the new work stealing
system has higher overheads,
both with notifications and with polling.
This makes sense for the notification tests
as each time an engine creates a spark it may need to notify another
engine.
Even when no notification is made,
the test to determine if a notification is needed adds overhead.
But the new results with polling are also slower than the old runtime
system.
In the cases with a \var{Depth} of 30 or 40,
the new implementation with notifications performs more slowly than the new
implementation with polling.
This is not a big surprise as the cost of notification might be occurred any
time a spark is created,
and the cost can vary depending on whether a notification needs to be made,
whereas the cost of polling only affects idle engines.
We cannot see this in the results as engines spend very little time idle
while running fibs,
and the impact on idle engines seldom has an effect on the elapsed execution
time of a program (it may affect CPU utilisation slightly).

When we consider the single engine cases,
there is also a clear difference between the old and new implementations,
even though no work stealing or notification occurs with a single engine.
Therefore this difference in performance comes from some code that is
executed when only one engine is used.
Work stealing itself, or the code in \idle is probably not the cause of the
slow down.
This leaves only \joinandcontinue,
whose new version contains an extra test to ensure that the engine's current
context is compatible with the spark that it found on its local deque.
Further tests could be performed to confirm this hypothesis.

The benefits of the new implementation appear to be mixed.
Notifications make the new system slower,
as does the new implementation of \joinandcontinue.
However,
the smaller number of spark deques make work stealing faster.
Overall,
the benefit of the new implementation depends on the program being executed.
However it is important to remember that fibs is a micro-benchmark and does
not represent typical programs,
especially the non-granularity controlled and under-granularity controlled
versions as they create embarrassing parallelism.

There are some other interesting trends in the results.
Most obviously,
as expected and explained above,
parallel fibs program with fine granularity
(high values for \Depth or disabled granularity control),
run more slowly than those with coarse granularity.
% This is easy to understand so we will not discuss it further.
However
when we consider the sequential results for varying amounts of granularity,
we see some strange results.
We expected to see the overheads of granularity control in these
results;
the results with large values for \Depth should be slower than the ones with
smaller values.
We also expected the results without granularity control to be the fastest,
but not significantly:
in practice \fibsgc contributes to a tiny part of the call graphs of tests
with a low \Depth value.
However the results are very different:
the sequential execution times for fibs with a \Depth of 40
are \emph{faster} than those with a \Depth of 10
and those without granularity control.
We are not sure of the cause,
however one hypothesis is that of the two procedures used in the granularity
control code,
\fibsseq and \fibsgc,
one is faster, namely \fibsgc, due to
some strange interaction of the processor's branch predictor and the code's
alignment.
This type of effect can be common with micro-benchmarks,
which is one reason why their use is often discouraged.
This could be tested by running several tests with randomised code layout
\citep{curtsinger:2012:stabilizer}.

\plan{Further potential work}
\plan{Steal half}
We have noticed that in many workloads,
such as a left recursive parallel loop,
one engine creates all the sparks.
In situations like this,
work stealing is more common than executing sparks from an engine's own
deque.
One processor would execute its own sparks while $P - 1$ processors act as
thieves,
and the original engine's work queue can become a bottleneck.
To avoid this behaviour,
a thief could steal more than one spark from a victim and put all but one of
the stolen sparks
onto its own deque, then it immediately executes the single spark which was
not placed on the deque.
With each stealing operation,
sparks become distributed among more of the engines,
which improves the probability that a given victim has work on its own deque.
Work stealing will become less frequent as engines try to execute their own
sparks before stealing other's,
this should improve the efficiency of the program as a whole.

We are currently using the deque data structure and algorithms described by
\citet{Chase_2005_wsdeque} (C.L.\ Deque).
It is a circular resizable array whose operations
(Section~\ref{sec:rts_work_stealing})
allow us to steal one item at a time.
It is based on \citet{arora:1998:work-stealing}.
We have considered using the deque described by
\citet{hendler:2002:stealhalf} (H.S.\ Deque),
this is also a deque using a circular array
based on \citet{arora:1998:work-stealing}.
It allows a thief to steal up to half the items on a victim's deque
in a single operation.
However, the H.S.\ Deque does not automatically resize,
a feature that we identified as important (Section~\ref{sec:rts_work_stealing}).
We should investigate using a deque similar to the H.S.\ Deque with resizing
support in the future.
At that time we will also have to carefully consider spark execution order.

\plan{memory hierarchy awareness}
Another potential improvement is in the selection of a thief's victim.
A thief may wish to prefer victims that are nearby in terms of memory
topology, so that communication of the data relevant to the spark is
cheaper.
Likewise, selecting an engine to wake up when creating a spark could also
consider memory topology.