This repository has been archived by the owner on Apr 10, 2019. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 2
/
rts_work_stealing.tex
865 lines (805 loc) · 34.4 KB
/
rts_work_stealing.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
\section{Initial work stealing implementation}
\label{sec:rts_work_stealing}
\plan{Refer to Peter Wang's conclusion of the early scheduling problem}
Work stealing addresses the premature spark scheduling problem that we
described in the previous section.
\citet{wang:2006:hons} recognised this problem and proposed work stealing as
its solution.
In a work stealing system,
sparks placed on a context's local spark stack
are not committed to running in that context;
they may be executed in a different context if they are stolen
(we describe below why we use a stack to store sparks).
This delays the decision of where to execute a spark until the moment
before it is executed.
\plan{Introduce work stealing}
Work stealing is a popular method for managing parallel work in a shared
memory multiprocessor system.
% XXX: There may be earlier papers by keller 1984 as cited by halstead.
Multilisp (\citet{halstead:1985:multilisp}) was one of the first systems to
use work stealing,
which Halstead calls ``an unfair scheduling policy''.
The term ``unfair'' is not an expression of the morals of stealing (work),
instead it refers to the unfairness of \emph{cooperative multitasking} when
compared to something like \emph{round-robin scheduling}.
Each processor in Multilisp has a currently running task,
and a stack of suspended tasks.
If the current task is suspended on a future or finishes,
the processor will execute the task at the top of its task stack.
If there is no such task,
it attempts to steal a task from the bottom of another processor's stack.
Halstead's ``unfair scheduling'' is motivated by his need to reduce the
problems of embarrassing parallelism.
Most significantly,
he says that the system can deadlock due to running out of resources but
needing resources to proceed.
We would like to point out that
this is not the same as our right recursion problem,
even though the two have similar symptoms.
Furthermore,
work stealing does not fix the right recursion problem.
Work stealing's other benefits can be summarised as follows.
\begin{itemize}
\item
It is quicker to schedule sparks because
a context can place sparks on its local spark stack more quickly
than it can place sparks on the global spark queue.
This is because there is less contention on a context's
local spark stack.
The only source of contention is work stealing,
which is balanced among all of the contexts.
Assume each of the $C$ contexts' spark stacks contains sparks.
Contention on any one of these stacks is $1/C$ times less than it would be
on a global queue.
\item
An idle worker can take work from its own queue quickly.
Again, there is less contention on a local queue.
\item
In ideal circumstances,
an idle worker rarely needs to steal work from another's queue.
Stealing work is more costly as it requires communication between
processors.
\end{itemize}
Mercury's initial work stealing implementation was built jointly by
Peter Wang and myself,
Wang contributed about 80\% of the work
and I contributed the remaining 20\%.
Wang's honours thesis \citep{wang:2006:hons} describes his proposal on which
our implementation is heavily based.
\begin{figure}
\begin{center}
\begin{minipage}[b][3in]{0.29\textwidth}
\subfigure[Nested parallel conjunctions]{%
\label{fig:nested_par_conjunctions}
\begin{tabular}{l}
\code{p(...) :-} \\
\code{~~~~(} \\
\code{~~~~~~~~q(Y::out, ...)} \\
\code{~~~~\&} \\
\code{~~~~~~~~r(Y::in, ...)} \\
\code{~~~~\&} \\
\code{~~~~~~~~s(...)} \\
\code{~~~~).} \\
\code{} \\
\code{q(Y::out, ...) :-} \\
\code{~~~~(} \\
\code{~~~~~~~~t(X::out, ...)} \\
\code{~~~~\&} \\
\code{~~~~~~~~u(X::in, Y::out)} \\
\code{~~~~).} \\
\end{tabular}
}
\hfill
\end{minipage}
\begin{minipage}[b][3in]{0.69\textwidth}
\begin{minipage}[b]{0.32\textwidth}
\subfigure[Step 1]{%
\label{fig:spark_stack_step1}
\picfigurenofloat{spark_stack_step1} }
\hfill
\end{minipage}
\begin{minipage}[b]{0.32\textwidth}
\subfigure[Step 2]{%
\label{fig:spark_stack_step2}
\picfigurenofloat{spark_stack_step2} }
\hfill
\end{minipage}
\begin{minipage}[b]{0.32\textwidth}
\subfigure[Step 3]{%
\label{fig:spark_stack_step3}
\picfigurenofloat{spark_stack_step3} }
\hfill
\end{minipage}
\vspace{0.5in}
\begin{minipage}[b]{0.32\textwidth}
\subfigure[Step 4]{%
\label{fig:spark_stack_step4}
\picfigurenofloat{spark_stack_step2} }
\hfill
\end{minipage}
\begin{minipage}[b]{0.32\textwidth}
\subfigure[Step 5]{%
\label{fig:spark_stack_step5}
\picfigurenofloat{spark_stack_step5} }
\hfill
\end{minipage}
\begin{minipage}[b]{0.32\textwidth}
\subfigure[Step 6]{%
\label{fig:spark_stack_step6}
\picfigurenofloat{spark_stack_step1} }
\hfill
\end{minipage}
\hfill
\end{minipage}
\end{center}
\caption{Spark execution order}
\label{fig:spark_execution_order}
\end{figure}
\plan{Mercury's needs for a deque}
Until now, each context has used a stack to manage sparks.
The last item pushed onto the stack is the first to be popped off the
stack.
This \emph{last-in-first-out} order is important for Mercury's parallel
conjunctions.
Parallel conjunctions may be nested, either in the same procedure or through
procedure calls such as in
Figure~\ref{fig:nested_par_conjunctions}.
Consider a context whose spark stack's contents are initially undefined.
The spark stack is represented by
Figure~\ref{fig:spark_stack_step1}.
When the context calls \code{p} it creates a spark
for the second and third conjuncts \code{r(\ldots)~\&~q(\ldots)}
and places the spark on its local stack
(Figure~\ref{fig:spark_stack_step2}).
It then calls \code{q}, where it creates another spark.
This spark represents the call to \code{u},
and now the spark stack looks like
Figure~\ref{fig:spark_stack_step3}.
Assuming that no work stealing has occurred,
when the context finishes executing \code{t} it pops the spark from the top
of its stack;
the popped spark represents the call to \code{u} because
the stack returns items in a last-in-first-out order.
This is desirable because it follows the same execution order as sequential
execution would.
We call this left-to-right execution since by default the left operand of a
conjunction is executed first.
The conjunction in \code{q} contains a shared variable named \code{X};
this execution order ensures that the \signal operation for \code{X}'s future
occurs before the \wait operation;
the context is not blocked by the call to \wait.
(Note that the mode annotations are illustrative,
they are not part of Mercury's syntax.)
Once the spark for \code{u} is popped off the spark stack then the spark
stack looks like Figure~\ref{fig:spark_stack_step4}.
After the execution of \code{u}, \code{q} returns control to \code{p},
which then continues the execution of its own parallel conjunction.
\code{p} pops a spark off the spark stack;
the spark popped off is the parallel conjunction
\code{r(\ldots) \& s(\ldots)}.
This immediately creates a spark for \code{s} and pushes it onto the spark
stack (Figure~\ref{fig:spark_stack_step5}) and
then executes \code{r}.
The execution of \code{r} at this point is also the same as what would occur
if sequential conjunctions were used.
This is straightforward because
the spark represented the \emph{rest} of the parallel
conjunction.
Finally \code{r} returns and \code{p} pops the spark for \code{s} off of the
spark stack and executes it.
At this point the context's spark stack looks like
Figure~\ref{fig:spark_stack_step6}.
If at step 3,
the context executed the spark for \code{r(\ldots) \& s(\ldots)} then
\code{r} would have blocked on the future for \code{Y}.
This would have used more contexts than necessary and more overheads than
necessary.
It may have also caused a deadlock:
since stealing a spark may create a context which may not be permitted due
to the context limit.
Furthermore
the original context would not be able to continue the execution in \code{p}
without special stack handling routines;
this would make the implementation more complicated than necessary.
%The system must be able to make progress by executing contexts from the
%global context run queue or by executing sparks from a context's own local
%spark stack.
Therefore from a context's perspective its local spark storage must behave
like a stack.
Keeping this sequential execution order for parallel conjunctions has
an additional benefit:
it ensures that a popped spark's data has a good chance of being hot in the
processor's cache.
Prior to work stealing,
context local spark stacks returned sparks in last-in-first-out
order,
the global spark queue returns sparks in \emph{first-in-first-out} order.
This does not encourage a left-to-right execution order,
however \citet{wang:2006:hons} proposes that this order may be better:
\begin{quote}
The global spark queue, however, is a queue because we assume that a
spark generated earlier will have greater amounts of work underneath it
than the latest generated spark, and hence is a better candidate for
parallel execution.
\end{quote}
\noindent
If we translate this execution order into the work stealing system,
it means that work is stolen from the bottom end of a context's local spark
stack.
Consider step 3 of the example above.
If a thief removes the spark for \code{r \& s}
then only the spark for \code{u} is on the stack, and
the original context can continue its execution by taking the spark to
\code{u}.
By comparison if the spark for \code{u} was stolen,
it would force the original context to try to execute \code{r \& s}
which, for the reasons above, is less optimal.
Therefore a thief steals work from the bottom of a context local stack
so that a first-in-first-out order is used for parallel tasks.
\citet{halstead:1985:multilisp} also uses processor-local task stacks,
so that when a processor executes a task from its own stack,
a last-in-first-out order is used.
He chose to have thieves steal work from the top of a task stack so
that a thief also executes work in the last-in-first-out order.
However, Halstead remarks that it may be better to steal work from the
bottom of a stack.
Halstead chose a last-in-first-out execution order for the same reasons
that we did:
so that when a processor runs its own task,
the task is executed in the same order as it would be if the program were
running sequentially.
We agree with his claim that this reduces resource usage overheads.
We also agree with his speculation that it results in more efficient cache
usage.
However,
Halstead also claims that this can improve performance for
embarrassingly parallel workloads.
This claim is only true insofar as work stealing reduces runtime costs
\emph{in general}.
No runtime method can completely eliminate the overheads of parallel
execution.
%therefore only during development or compile time can
%embarrassing parallelism be addressed effectively.
Furthermore, in our experience,
using sparks to represent computations whose execution has
not begun has had a more significant improvement in reducing runtime costs.
\plan{Describe the data structure used to implement these stacks and its
properties.}
To preserve the last-in-first-out behaviour of context-local spark
stacks,
and first-in-first-out behaviour of the global queue,
we chose to use double ended queues (deques) for context local spark
storage.
Since the deque will be used by multiple Mercury engines we must either
choose a data structure that is thread safe or use a mutex to protect a non
thread safe data structure.
The pop and push operations are always made by the same thread,%
\footnote{
Contexts move between engines by being suspended and resumed,
but this always involves the context run queue's lock or other
protection.
Therefore the context's data structures including the
spark deque are protected from concurrent access.}
therefore the stealing of work by another context is what creates the need
for synchronisation.
The deque described by \citet{Chase_2005_wsdeque} supports lock free,
nonblocking
operation, has low overheads (especially for pop and push operations),
and is dynamically resizable;
all of these qualities are very important to our runtime system's
implementation.
The deque provides the following operations.
\begin{description}
\item[\code{void MR\_push\_spark(deque *d, spark *s)}]
The deque's owner can call this to push an item onto the
top%
\footnote{
Note when we refer to the top of the deque
\citet{Chase_2005_wsdeque} refers to the bottom and vice-versa.
Most people prefer to imagine that stacks grow upwards,
and since the deque is most often used as a stack we consider this
end of the deque to be the top.
We also call this the \emph{hot end} as it is the busiest end of the
deque.}
or \emph{hot end} of the deque.
This can be done with only a \emph{memory barrier} for synchronisation
(see below).
If necessary,
\push will grow the array that is used to implement the
deque.
\item[\code{MR\_bool MR\_pop\_spark(deque *d, spark *s)}]
The deque's owner can call this to pop an item from the top of the
queue.
In the case that the deque contains only one item,
a compare and swap
operation is used to determine if the thread lost a race to another
thread attempting to steal the item from the deque (a \emph{thief}).
When this happens, the single item was stolen by the thief and the
owner's call to \pop returns false,
indicating that the deque was empty.
In the common case the compare and swap is not used.
In all cases a memory barrier is also used (see below).
Internally the deque is stored as an array of sparks, not spark
pointers.
This is why the second argument, in which the result is returned,
is not a double pointer as one might expect.
This implementation detail avoids memory allocation for sparks inside
the deque implementation.
A caller of any of these functions temporarily stores the spark on
its program stack.
\item[\code{MR\_ws\_result MR\_steal\_spark(deque *d, spark *s)}]
A thread other than the deque's owner can steal
items from the bottom of the deque.
This always uses an atomic compare and swap operation as multiple
thieves may call \steal on the same deque at the same time.
\steal can return one of three different values:
``success'', ``failure'' and ``abort''.
``abort'' indicates that the thief lost a race with either the owner
(\emph{victim}) or another thief.
\end{description}
%All the atomic compare and swap operations here are wait free,
%rather than looping \pop will return false and \steal will abort.
%\push and \pop are very fast, this is good as a context uses its own
%spark deque more often than calling \steal on another's.
%\steal is slightly slower than either \push or \pop,
%but is still quite fast.
%This is good because a context will
%\push and \pop sparks onto and off of its local deque
%more often than it will \steal a spark from another's deque.
%This means that the top of the deque is used more often than the bottom.
%Therefore, we refer to the top as the \emph{hot} end,
%and to the bottom as the \emph{cold} end.
%These terms are less confusing than the disagreement between the stack
%like behaviour of a deque from its context's perspective,
%and the terminology used by \citet{Chase_2005_wsdeque}.
\begin{algorithm}[tbp]
\begin{verbatim}
void MR_push_spark(MR_SparkDeque *dq, const MR_Spark *spark)
{
int bot;
int top;
volatile MR_SparkArray *arr;
int size;
bot = dq->MR_sd_bottom;
top = dq->MR_sd_top;
arr = dq->MR_sd_active_array;
size = top - bot;
if (size >= arr->MR_sa_max) {
/* grow array omitted */
}
MR_sa_element(arr, top) = *spark;
/*
** Make sure the spark data is stored before we store the value of
** bottom.
*/
__asm__ __volatile__ ("sfence");
dq->MR_sd_top = top + 1;
}
\end{verbatim}
\caption{\push with memory barrier}
\label{alg:MR_push_spark}
\end{algorithm}
\begin{algorithm}[tb]
\begin{verbatim}
volatile MR_Spark* MR_pop_spark(MR_SparkDeque *dq)
{
int bot;
int top;
int size;
volatile MR_SparkArray *arr;
bool success;
volatile MR_Spark *spark;
top = dq->MR_sd_top;
arr = dq->MR_sd_active_array;
top--;
dq->MR_sd_top = top;
/* top must be written before we read bottom. */
__asm__ __volatile__ ("mfence");
bot = dq->MR_sd_bottom;
size = top - bot;
if (size < 0) {
dq->MR_sd_top = bot;
return NULL;
}
spark = &MR_sa_element(arr, top);
if (size > 0) {
return spark;
}
/* size = 0 */
success = MR_compare_and_swap_int(&dq->MR_sd_bottom, bot, bot + 1);
dq->MR_sd_top = bot + 1;
return success ? spark : NULL;
}
\end{verbatim}
\caption{\pop with memory barrier}
\label{alg:MR_pop_spark}
\end{algorithm}
Mercury was already using \citet{Chase_2005_wsdeque}'s deques for spark
storage,
most likely because Wang had always planned to implement work stealing.
We did not need to replace the data structure used for a context local
spark storage,
but we will now refer to it as a deque rather than a stack.
We have removed the global spark queue,
as work stealing does not use one.
Consequently,
when a context creates a spark,
that spark is always placed on the context's local spark deque.
\citet{Chase_2005_wsdeque}'s paper uses Java to describe the deque structure
and algorithms.
Java has a strong memory model compared to C (the implementation language of
the Mercury runtime system).
C's \code{volatile} variable storage keyword prevents the compiler from
reordering operations on that variable with respect to other code and
prevents generated code from caching the value of the variable.
CPUs will generally reorder memory operations including executing
several operations concurrently with one another and other instructions.
Therefore, C's \code{volatile} keyword is insufficient when the order of
memory operations is important.
In contrast,
Java's \code{volatile} keyword has the constraints of C's keyword
plus it guarantees that memory operations become visible to other
processors in the order that they appear in the program.
% Java has a strong memory model;
% memory operations on \code{volatile} variables become visible to other
% processors in the order that they appear in the program.
% This is much stronger than C's \code{volatile} keyword which only controls
% how the compiler may order instructions and cache values.
\citet{Chase_2005_wsdeque} use Java's \code{volatile} qualifier in the
declaration for the \var{top} and \var{bottom} fields of the deque
structure.
When adding the deque algorithms to Mercury's runtime system,
we translated them to C.
To maintain correctness we had to introduce two \emph{memory barriers},\footnote{
Memory barriers are also known as memory fences.
In particular the x86/x86\_64 instructions contain the word fence.}
CPU instructions which place ordering constraints on memory operations.
First, we placed a store barrier in the \push procedure
(the inline assembly\footnote{
In our implementation a C macro is used to abstract away the assembly
code for different architectures;
we have shown our figures without the macro to aid the reader.}
in Algorithm~\ref{alg:MR_push_spark}).
This barrier ensures that the spark is written into the array before the pointer to
the top of the array is updated,
indicating that the spark is available.
Second, we placed a full barrier in the \pop procedure
(the inline assembly in Algorithm~\ref{alg:MR_pop_spark}).
This barrier requires that the procedure updates the pointer to the top of the
deque before it reads the pointer to the bottom (which it uses to
calculate the size).
This barrier prevents a race between an owner popping a spark and two
thieves stealing sparks.
It ensures that each of the threads sees the correct number of items on the
deque and therefore executes correctly.
Both these barriers have a significant cost,
in particular the full barrier in \pop.
This means that the \push and \pop functions are not as efficient as they
appear to be.
The deque algorithms are still efficient enough;
we have not needed to improve them further.
For more information about C's weak memory ordering see
\citet{Boehm:2005:threads-as-a-library}.
See also \citet{Adve:2010:memory-models} which discusses the Java memory
model and the new C++0x\footnote{C++0x is also known as C++11} memory model.
\begin{algorithm}[tbp]
\paul{XXX: Place these environments once we know what pagination will be
used}
\begin{verbatim}
1 void MR_join_and_continue(MR_SyncTerm *st, MR_Code *cont_label) {
2 MR_bool finished, got_spark;
3 MR_Context *current_context = MR_ENGINE(MR_eng_current_context);
4 MR_Context *orig_context = st->MR_st_orig_context;
5 MR_Spark spark;
6
7 finished = MR_atomic_dec_and_is_zero(&(st->MR_st_num_outstanding));
8 if (finished) {
9 if (orig_context == current_context) {
10 MR_GOTO(cont_label)
11 } else {
12 while (orig_context->MR_ctxt_resume != cont_label) {
13 CPU_spin_loop_hint();
14 }
15 MR_schedule_context(orig_context);
16 MR_GOTO{MR_idle};
17 }
18 } else {
19 got_spark = MR_pop_spark(current_context->MR_ctxt_spark_deque, &spark);
20 if (got_spark) {
21 MR_GOTO(spark.MR_spark_code);
22 } else {
23 if (orig_context == current_context) {
24 MR_save_context(current_context);
25 current_context->MR_ctxt_resume = cont_label;
26 MR_ENGINE(MR_eng_current_context) = NULL;
27 }
28 MR_GOTO(MR_idle);
29 }
30 }
31 }
\end{verbatim}
\caption{\joinandcontinue --- initial work stealing version}
\label{alg:join_and_continue_ws1}
\end{algorithm}
\plan{Barrier code}
A context accesses its own local spark queue in the \joinandcontinue barrier
introduced in Section~\ref{sec:rts_original_scheduling}.
The introduction of work stealing has allowed us to optimise
\joinandcontinue.
The new version of \joinandcontinue is shown in
Algorithm~\ref{alg:join_and_continue_ws1}, and
the previous version is shown in
Algorithm~\ref{alg:join_and_continue_peterw}
on page~\pageref{alg:join_and_continue_peterw}.
The first change to this algorithm
is that this version is lock free.
All the synchronisation is performed by atomic CPU instructions, memory
write ordering and one spin loop.
The number of outstanding contexts in the synchronisation term is
decremented and the result is checked for zero atomically.\footnote{
On x86/x86\_64 this is a \instruction{lock dec} instruction.
We read the zero flag to determine if the decrement caused the value to
become zero.}
This optimisation could have been made without introducing work stealing,
but it was convenient to make both changes at the same time.
Note that in the previous version, a single lock was shared between all of
the synchronisation terms in the system.
The next change prevents a race condition that would otherwise be possible
without locking, which could occur as follows.
A conjunction of two conjuncts is executed in parallel.
The original context, $C_{Orig}$,
enters the barrier, decrements the counter from two to one,
and because there is another outstanding conjunct,
it executes the else branch on lines 19--29.
At almost the same time another context, $C_{Other}$,
enters the barrier, decrements the counter and finds that there is no more
outstanding work.
$C_{Other}$ attempts to schedule $C_{Orig}$ on line 15.
However attempting to schedule $C_{Orig}$ before it has finished suspending
would cause an inconsistent state and memory corruption.
Therefore lines 12--14 wait until $C_{Orig}$ has been suspended.
The engine that is executing $C_{Orig}$ first suspends $C_{Orig}$ and then
indicates that it has been suspended by setting $C_{Orig}$'s resume label
(lines 24--25).
The spin loop on lines 12--14 includes a use of a macro named
\code{CPU\_spin\_loop\_hint},
This resolves to the \instruction{pause} instruction on x86 and x86\_64,
which instructs the CPU to pipeline the loop differently in order to reduce
memory traffic and allow the loop to exit without a pipeline stall
\citep{intel:pause}.
Lines 24--25 also include memory write ordering (not shown).
The other change is around line 20:
when the context pops a spark off its stack, it does not check if the spark
was created by a callee's parallel conjunction.
This check can be avoided
because all sparks are placed on the context local spark deques and
thieves never steal work from the hot end of the deque.
Hence if there is an outstanding conjunct then either
its spark will be at the hot end of the deque
or the deque will be empty and the spark will have been stolen.
Therefore any spark at the hot end of the deque cannot possibly be created
by a callee's parallel conjunction.
\begin{algorithm}[tbp]
\begin{verbatim}
1 void MR_idle() {
2 MR_Context *ctxt;
3 MR_Context *current_context = MR_ENGINE(MR_eng_current_context);
4 MR_Code *resume;
5 MR_Spark spark;
6
7 MR_acquire_lock(&MR_runqueue_lock);
8 while(MR_True) {
9 if (MR_exit_now) {
10 MR_release_lock(MR_runqueue_lock);
11 MR_destroy_thread(); // does not return.
12 }
13
14 // Try to run a context
15 ctxt = MR_get_runnable_context();
16 if (ctxt != NULL) {
17 MR_release_lock(&MR_runqueue_lock);
18 if (current_context != NULL) {
19 MR_release_context(current_context);
20 }
21 MR_load_context(ctxt);
22 resume = ctxt->MR_ctxt_resume;
23 ctxt->MR_ctxt_resume = NULL;
24 MR_GOTO(resume);
25 }
26
27 // Try to run a spark.
28 if ((current_context != NULL) ||
29 (MR_num_outstanding_contexts < MR_max_contexts))
30 {
31 if (MR_try_steal_spark(&spark)) {
32 MR_release_lock(&MR_runqueue_lock);
33 if (current_context == NULL) {
34 ctxt = MR_get_free_context();
35 if (ctxt == NULL) {
36 ctxt = MR_create_context();
37 }
38 MR_load_context(ctxt);
39 current_context = ctxt;
40 }
41 MR_parent_sp = spark.MR_spark_parent_sp;
42 current_context->MR_ctxt_thread_local_mutables =
43 spark.MR_spark_thread_local_mutables;
44 MR_GOTO(spark.MR_spark_code);
45 }
46 }
47
48 MR_timed_wait(&MR_runqueue_cond, &MR_runqueue_lock);
49 }
50 }
\end{verbatim}
\caption{\idle --- initial work stealing version}
\label{alg:MR_idle_wsinitial}
\end{algorithm}
\plan{Other changes to the idle loop?}
We have also modified \idle to use spark stealing;
its new code is shown in Algorithm~\ref{alg:MR_idle_wsinitial}.
We have replaced the old code which dequeued a spark from the global spark
queue
with a call to \trystealspark (see below).
Before attempting to steal a spark,
\idle must ensure that the context limit will not be exceeded
(this was previously done when choosing where to place a new spark).
This check is on lines 28--29; it is true if either
the engine already has a context (which is guaranteed to be available for a
new spark),
or the current number of contexts is lower than the limit.
When sparks are placed on a context's local deque,
the run queue's condition variable is not signalled,
as doing so would be wasteful.
Therefore engines must wake up periodically to check if there is any
parallel work available.
This is done on line 48 using a call to \code{MR\_timed\_wait} with a
configurable timeout (it defaults to 2ms).
We discuss this timeout later in Section~\ref{sec:rts_work_stealing2}.
\plan{Accessing the array of deques}
Engines that attempt to steal a spark must have access to all the spark
deques.
We do this with a global array of pointers to contexts' deques.
When a context is created,
the runtime system attempts to add the context's deque to this array by
finding an empty slot,
one containing a null pointer,
and writing the pointer to the context's deque to that index in the array.
If there is no unused slot in this array, the runtime system will resize the
array.
When the runtime system destroys a context,
it writes \NULL to that context's index in the deque array.
To prevent concurrent access from corrupting the array,
these operations are protected by \code{MR\_spark\_deques\_lock}.
\begin{algorithm}[tbp]
\begin{verbatim}
1 MR_bool MR_try_steal_spark(MR_Spark *spark) {
2 int max_attempts;
3 MR_Deque *deque;
4
5 MR_acquire_lock(&MR_spark_deques_lock);
6 max_attempts = MR_MIN(MR_max_spark_deques, MR_worksteal_max_attempts);
7 for (int attempt = 0; attempts < max_attempts; attempt++) {
8 MR_victim_counter++;
9 deque = MR_spark_deques[MR_victim_counter % MR_max_spark_deques];
10 if (deque != NULL) {
11 if (MR_steal_spark(deque, spark));
12 MR_release_lock(&MR_spark_deques_lock);
13 return MR_true;
14 }
15 }
16 }
17 MR_release_lock(&MR_spark_deques_lock);
18 return MR_false;
19 }
\end{verbatim}
\caption{\trystealspark --- initial work stealing version}
\label{alg:try_steal_spark_initial}
\end{algorithm}
\plan{Thief behaviour}
The algorithm for \trystealspark is shown in
Algorithm~\ref{alg:try_steal_spark_initial}.
It must also acquire the lock before using the array.
A thief may need to make several attempts before it successfully finds a
deque with work it can steal.
Line 6 sets the number of attempts to make, which is either the user
configurable \var{MR\_worksteal\_max\_attempts} or the size of the array,
whichever is smaller.
A loop (beginning on line 7) attempts to steal work until it succeeds or it
has made \var{max\_attempts}.
We use a global variable, \var{MR\_victim\_counter},
to implement round-robin selection of the victim.
On line 11 we attempt to steal work from a deque,
provided that its deque pointer in the array is non-null.
If the call to \steal succeeded,
it will have written spark data into the memory pointed to by
\var{spark},
then \trystealspark releases the lock and returns true.
Eventually \trystealspark may give up (lines 17--18).
If it does so it will release the lock and return false.
Whether or not a thief steals work or gives up,
the next thief will resume the round-robin selection where the previous
thief left off.
This is guaranteed because \var{MR\_victim\_counter} is protected by the
lock acquired on line 5.
\plan{Work stealing fixes the premature scheduling decision problem}
All sparks created by a context are placed on its local spark deque.
Sparks are only removed to be executed in parallel when an idle engine
executes \trystealspark.
Therefore
the decision to execute a spark in parallel is only made once an engine is
idle and able to run the spark.
We expect this to correct the premature scheduling decision problem we
described in Section~\ref{sec:rts_original_scheduling_performance}.
\input{tab_work_stealing_initial3}
\plan{Benchmark}
We benchmarked our initial work stealing implementation with the mandelbrot
program from previous sections.
Table~\ref{tab:work_stealing_initial} shows the results of our benchmarks.
The first and third row groups are the same results as the first and second
row groups from Table~\ref{tab:2009_left_nolimit} respectively.
They are included to allow for easy comparison.
The second and fourth row groups were generated with a newer version of
Mercury that implements work stealing as described above.\footnote{
The version of Mercury used is slightly modified,
due to an oversight we had forgotten to include the context limit
condition in \idle.
The version we benchmarked includes this limit and matches the
algorithms shown in this section.}
The figures in parenthesis are speedup figures.
The first conclusion that we can draw from the results is that
left recursion with work stealing is much faster than left recursion without
work stealing.
We can see that work stealing has solved the premature spark scheduling problem
for mandelbrot.
Given that left recursive mandelbrot represented a pathological case of this
problem,
we are confident that the problem has generally been solved.
We can also see that the context limit no longer affects performance
with left recursion.
There is no significant difference between the results for the various
settings of the context limit.
\paul{XXX: Consider always putting left and right in italics.}
When we consider the case for right recursion we can see that
work stealing has improved performance when there is a
high context limit.
This is somewhat expected since work stealing is known as a very
efficient way of managing parallel work on a shared memory system.
However we did not expect such a significant improvement in
performance.
In the non work stealing system,
a spark is placed on the global spark queue if both the context limit has
not been reached,
and there is an idle engine.
When a spark is being created there may not be an idle engine,
however an engine may become idle soon and wish to execute a spark.
With work stealing,
all sparks are placed on local deques and an idle
engine will attempt to steal work when it is ready ---
delaying the decision to run the spark in parallel.
This suggests that the premature scheduling problem was also affecting right
recursion.
But the effect was minor compared to the pathological case we saw with left
recursion.
There is a second observation we can make about right recursion.
In the cases for 256 or 512 contexts work stealing performs worse than
without work stealing.
We believe that the original results were faster because more work was done
in parallel.
When a context spawned off a spark,
it would often find that there was no free engine,
and place the spark on its local queue.
After completing the first conjunct of the parallel conjunction it would
execute \joinandcontinue.
\joinandcontinue would find that the context had a spark on its local
stack and would execute this spark directly.
When a spark is executed directly the existing context is re-used and so the
context limit does not increase as quickly.
The work stealing version does not do this,
it always places the spark on its local stack and almost always
another engine will steal that spark,
and the creation of a new context will bring the system closer to the
context limit.
The work stealing program is more likely to reach the context limit earlier,
where it will be forced into sequential execution.
%\paul{XXX: Consider verifying this point with instrumentation.}