\section{Improved work stealing implementation}
\label{sec:rts_work_stealing2}
In this section we attempt to improve on our work stealing
implementation from Section~\ref{sec:rts_work_stealing}.
While we made these improvements,
we also found it useful to change how idle engines behave.
Although these two changes are conceptually distinct,
they were made together and their implementations are interlinked.
Therefore we will present and benchmark them together as one set of changes.
While our changes are an improvement in terms of design,
they do not always yield an improvement in performance.
\plan{Describe problems with associating stacks with contexts}
Sparks were originally stored on context-local deques.
This has the following significant problems.
\begin{description}
\item[There is a dynamic number of spark deques.]
The number of contexts changes during a program's execution,
therefore the number of spark deques also changes.
This means that we must have code to manage this changing number of
deques.
This code makes the runtime system more complicated than necessary,
both when stealing a spark and when creating or destroying a context.
\item[Locking is required.]
The management of contexts must be thread safe so that the set of
spark deques is not corrupted.
We store spark deques in a global array protected by a lock.
It may be possible to replace the array with a lock-free data structure;
however, it is better to remove the need for thread safety altogether by
using a constant set of deques.
\item[A large number of contexts makes work stealing slower.]
In prior sections
we have shown that the number of contexts in use can often be very high,
much higher than the number of Mercury engines.
If there are $N$ engines and $M$ contexts,
then there can be at most $N$ contexts running and
at least $M-N$ contexts suspended (blocked and waiting to run).
A context can become suspended in one of two ways:
by blocking on a future's value in a call to \wait,
or by blocking on an incomplete conjunct in a call to \joinandcontinue.
We attempt to minimise the former case (see Chapter~\ref{chap:overlap})
and the latter case cannot occur
if the context has a spark on its local spark deque (it would run the
spark rather than block).
Therefore the large majority of the suspended contexts will not have
sparks on their deques,
and the probability of randomly selecting a deque that holds a spark
is low.
Furthermore the value of $M$ can be very high in pathological cases.
If an engine does not successfully steal a spark from a deque,
it will continue by trying to steal a spark from a different deque.
An engine can exhaust all its attempts, even when there is work
available:
a spark may be placed on a deque after the engine has already attempted
to steal from it.
Upon exhausting all its attempts or all the deques,
the engine will sleep before making another round of attempts
(see \trystealspark in Algorithm~\ref{alg:try_steal_spark_initial} on
page~\pageref{alg:try_steal_spark_initial}).
Each round of attempts (regardless of success or failure) has a
complexity of $O(M)$.
\end{description}
\plan{We associate stacks with engines}
The approach we have taken to solving these problems is based on associating
spark deques with engines rather than with contexts.
A running Mercury program has a constant number of engines,
and therefore the number of spark deques will not vary.
This allows us to remove the code used to resize the deque array.
This makes context creation and destruction much simpler.
It also removes the need for the lock protecting the global array of spark
deques.
We can also remove the locking code in \trystealspark
(whose original version was shown in
Algorithm~\ref{alg:try_steal_spark_initial})
which was used to
ensure that the array is not changed while a thief is trying to steal a
spark.
The cost of work stealing also becomes linear in the number of engines
rather than the number of contexts.
\begin{algorithm}[tbp]
\begin{verbatim}
 1  MR_Code* MR_try_steal_spark(void) {
 2      int         victim_index;
 3      MR_Deque    *deque;
 4      MR_Spark    spark;
 5
 6      if ((MR_ENGINE(MR_eng_current_context) != NULL) ||
 7          (MR_num_outstanding_contexts < MR_max_contexts))
 8      {
 9          for (int attempt = 0; attempt < MR_num_engines; attempt++) {
10              victim_index = (MR_ENGINE(MR_eng_victim_counter) + attempt)
11                  % MR_num_engines;
12              if (victim_index == MR_ENGINE(MR_eng_id)) {
13                  continue;
14              }
15              if (MR_steal_spark(MR_spark_deques[victim_index], &spark)) {
16                  MR_ENGINE(MR_eng_victim_counter) = victim_index;
17                  MR_prepare_engine_for_spark(&spark);
18                  return spark.MR_spark_code;
19              }
20          }
21          MR_ENGINE(MR_eng_victim_counter) = victim_index;
22      }
23      return NULL;
24  }
\end{verbatim}
\caption{\trystealspark --- revised work stealing version}
\label{alg:try_steal_spark_revised}
\end{algorithm}
\begin{algorithm}[tbp]
\begin{verbatim}
void MR_prepare_engine_for_spark(MR_Spark *spark) {
    MR_Context *ctxt;

    ctxt = MR_ENGINE(MR_eng_current_context);
    if (ctxt == NULL) {
        ctxt = MR_get_free_context();
        if (ctxt == NULL) {
            ctxt = MR_create_context();
        }
        MR_load_context(ctxt);
    }
    MR_parent_sp = spark->MR_spark_parent_sp;
    ctxt->MR_ctxt_thread_local_mutables =
        spark->MR_spark_thread_local_mutables;
}
\end{verbatim}
\caption{\prepareengineforspark}
\label{alg:prepare_engine_for_spark}
\end{algorithm}
\plan{Show new \trystealspark}
We have made some other changes to \trystealspark,
whose new code is shown in Algorithm~\ref{alg:try_steal_spark_revised}.
Previously \trystealspark required its caller to check that stealing a spark
would not exceed the context limit;
this check has been moved into \trystealspark (lines 6--7).
Another change is that \trystealspark
now calls \prepareengineforspark (line 17),
which loads a context into the engine if it does not already have one
(Algorithm~\ref{alg:prepare_engine_for_spark}).
Finally \trystealspark will return the address of the spark's code;
the caller will jump to this address.
If \trystealspark cannot find a spark it returns NULL,
in which case its caller will look for other work.
In the previous version,
the lock was also used to protect the victim counter;
it ensured that round-robin selection of the victim was maintained.
Now each engine independently performs its own round-robin selection of the
victim using a new field \code{MR\_victim\_counter} in the engine structure.
If the engine finds a spark,
\trystealspark will save the current victim index to this field,
as this deque may contain more sparks which the engine can try to
steal later.
We have not evaluated whether this policy is an improvement.
However when compared with the improvement due to removing the lock and
potentially large number of spark deques,
any difference in performance due to the selection of victims will be
negligible.
The two other changes are minor:
we have removed the configurable limit of the number of work stealing
attempts per round and also
the test for null array slots;
both are now unnecessary.
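The per-engine round-robin victim selection performed by \trystealspark can
be illustrated with a small self-contained model. The following C sketch is
not the runtime's code: deques are reduced to integer spark counts, and the
names \code{deques}, \code{victim\_counter} and \code{try\_steal} are
invented for illustration.

```c
#define NUM_ENGINES 4

/* Simplified model: each "deque" is just a count of available sparks. */
static int deques[NUM_ENGINES];

/* Per-engine victim counter, analogous to the MR_eng_victim_counter
 * field in the engine structure. */
static int victim_counter[NUM_ENGINES];

/* Try to steal one spark for engine `self`, scanning the other engines'
 * deques in round-robin order starting from the saved victim counter.
 * Returns the victim's index on success, or -1 if every deque was empty. */
int try_steal(int self)
{
    for (int attempt = 0; attempt < NUM_ENGINES; attempt++) {
        int victim = (victim_counter[self] + attempt) % NUM_ENGINES;
        if (victim == self) {
            continue;   /* never steal from ourselves */
        }
        if (deques[victim] > 0) {
            deques[victim]--;
            /* Remember this victim: its deque may hold more sparks
             * that we can try to steal later. */
            victim_counter[self] = victim;
            return victim;
        }
    }
    return -1;
}
```

Note that a successful steal saves the victim's index, so the next round of
stealing begins at the deque that most recently held work.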
\begin{algorithm}[tbp]
\begin{verbatim}
 1  void MR_idle(void) {
 2      int                 engine_id = MR_ENGINE(MR_eng_id);
 3      MR_EngineSleepSync  *eng_data = MR_engine_sleep_data[engine_id];
 4      MR_Context          *ctxt;
 5      MR_Context          *current_context = MR_ENGINE(MR_eng_current_context);
 6      MR_Code             *resume;
 7      MR_Spark            spark;
 8
 9      eng_data->MR_es_state = MR_LOOKING_FOR_WORK;
10
11      // Try to find a runnable context.
12      MR_acquire_lock(&MR_runqueue_lock);
13      ctxt = MR_get_runnable_context();
14      MR_release_lock(&MR_runqueue_lock);
15      if (ctxt != NULL) {
16          eng_data->MR_es_state = MR_WORKING;
17          resume = MR_prepare_engine_for_context(ctxt);
18          MR_GOTO(resume);
19      }
20
21      // Try to run a spark from our local spark deque.
22      if (MR_pop_spark(MR_ENGINE(MR_eng_spark_deque), &spark)) {
23          eng_data->MR_es_state = MR_WORKING;
24          MR_prepare_engine_for_spark(&spark);
25          MR_GOTO(spark.MR_spark_code);
26      }
27
28      if (!(MR_compare_and_swap(&(eng_data->MR_es_state),
29              MR_LOOKING_FOR_WORK, MR_STEALING)))
30      {
31          // Handle the notification that we have received.
32      }
33      // Try to steal a spark.
34      resume = MR_try_steal_spark();
35      if (resume != NULL) {
36          eng_data->MR_es_state = MR_WORKING;
37          MR_GOTO(resume);
38      }
39      MR_GOTO(MR_sleep);
40  }
\end{verbatim}
\caption{\idle --- improved work stealing version}
\label{alg:idle}
\end{algorithm}
We have also made changes to the code that idle engines execute.
An idle engine with a context that is free (it can be used for any spark) or
without a context will call \idle to acquire new work.
\idle has changed significantly;
the new version is shown in Algorithm~\ref{alg:idle}.
This algorithm uses a structure pointed to by \var{eng\_data} to receive
notifications;
we will discuss this structure and how it is used to send and receive
notifications later in this section.
For now we will discuss how \idle looks for work in lieu of any notification.
\idle tries to find a runnable context before trying to find a local
spark (lines 12--19).
A context that has already been created usually represents work that is
\emph{to the left of}
any work that is still a spark (Section~\ref{sec:rts_work_stealing}).
Executing $G_1$ of a parallel conjunction $G_1~\&~G_2$ can signal futures
that $G_2$ may otherwise block on.
Executing $G_1$ may also create sparks leading to more parallelism.
Once a context is resumed and its execution finishes,
then its memory can be used to execute a spark from the engine's local spark
deque.
Checking for runnable contexts first also helps to avoid the creation of new
contexts before existing contexts' computations are finished.
This partially avoids deadlocks that can occur when the context limit is
exceeded:
executing a waiting context rather than attempting to create a new context
for a spark does not increase the number of contexts that exist.
\begin{algorithm}
\begin{verbatim}
MR_Code* MR_prepare_engine_for_context(MR_Context *ctxt) {
    MR_Code *resume;

    if (MR_ENGINE(MR_eng_current_context) != NULL) {
        MR_release_context(MR_ENGINE(MR_eng_current_context));
    }
    MR_load_context(ctxt);
    resume = ctxt->MR_ctxt_resume;
    ctxt->MR_ctxt_resume = NULL;
    return resume;
}
\end{verbatim}
\caption{\prepareengineforcontext}
\label{alg:prepare_engine_for_context}
\end{algorithm}
\idle tries to run a context by attempting to take a context from the
context run queue while holding the context run queue lock (lines 12--14).
If it finds a context,
it will execute \prepareengineforcontext
(Algorithm~\ref{alg:prepare_engine_for_context}),
which checks for and unloads the engine's current context,
and then loads the new context.
It also clears the context's resume field and returns the field's previous
value so that \idle can jump to the resume address.
The context's resume field must be cleared here as
\joinandcontinue may use it later for synchronisation.
Previously the run queue's lock protected all of \idle;
this critical section has been made smaller,
which is probably more efficient.
\begin{figure}
\begin{center}
\begin{verbatim}
nested_par(...) :-
    (
        p(..., X)
    &
        (
            q(X, ...)
        &
            r(..., Y)
        ),
        s(...)
    &
        t(Y, ...)
    ).
\end{verbatim}
\end{center}
\caption{Nested dependent parallelism.}
\label{fig:nested_dep_par}
\end{figure}
If \idle cannot find a context to resume then it
will attempt to run a spark from its local spark deque (lines 22--26).
If there is a spark to execute and the engine does not have a context,
a new one needs to be created;
this is handled by \prepareengineforspark above.
We considered creating a context and executing the spark only if the
context limit had not been reached,
however this can create a problem, which we explain using an example.
Consider Figure~\ref{fig:nested_dep_par},
which shows two nested parallel conjunctions and
two futures that create dependencies between the parallel
computations.
\var{X} creates a dependency between \code{p} and \code{q},
and \var{Y} creates a dependency between \code{r} and \code{t}.
When an engine, called E1,
executes this procedure,
it pushes the spark for the second and third
conjuncts of the outer conjunction onto its local spark deque.
E1 then begins the execution of \code{p}.
A second engine, called E2,
steals the spark from E1's spark deque and immediately creates two new
sparks; the first is for \code{t} and the second for \code{r}.
It pushes the sparks onto its deque,
therefore the spark for \code{t} is on the bottom (cold end).
E2 then begins the execution of \code{q}.
A third engine, E3,
wakes up and steals the spark for \code{t} from E2.
Next,
E2 attempts to read the value of \var{X} in \code{q},
but it has not yet been produced.
Therefore E2 suspends its context and calls \idle, which cannot find a context
to execute and therefore checks for and finds the local spark for \code{r}.
E2 needs a new context to execute \code{r};
if the context limit is exceeded then E2 cannot proceed.
Without E2's execution of \code{r} and production of \var{Y},
E3 cannot complete the execution of \code{t} and free up its context.
The system will not deadlock, as the execution of \code{p} by E1 can still
continue, and will eventually allow the execution of \code{q} to continue.
However, we still wish to avoid such situations,
therefore we do not check the context limit before running local sparks.
Without this check,
E2 is able to create a context and execute \code{r}.
This can never create more than a handful of extra contexts:
while the limit is exceeded,
engines will not steal work and will therefore execute sparks in a
left-to-right order;
dependencies and barriers cannot cause more contexts to become blocked.
The two choices we have made,
executing local sparks without checking the context limit and
attempting to resume contexts before attempting to execute sparks,
work together to prevent deadlocks that could be caused by
the context limit and dependencies.
The context limit is a work-around to prevent some kinds of worst case
behaviour and should not normally affect other algorithmic choices.
These algorithmic choices are valid and desirable,
even if we did not need or use the context limit,
as they keep the number of contexts low.
This is good for performance as contexts consume significant amounts of
memory and the garbage collector spends time scanning that memory.
If an engine executing \idle cannot find a local spark then it will
check for any notifications (lines 28--32).
If it has not been notified, it will try to steal a spark from another
engine using \trystealspark (lines 34--38).
\trystealspark is a C function,
allowing us to call it from multiple places in the runtime system.
It returns the address of the spark's entry point to its caller,
which jumps to the entry point.
In this case the caller is \idle,
which is C code that uses the Mercury calling convention.
This convention passes all its arguments and results in the Mercury abstract
machine registers, including the return address.
Additionally \idle does not call any other Mercury procedures.
Therefore, it does not need to create a frame on any of the C or Mercury stacks:
any local C variables are stored on the C stack frame shared between all
Mercury procedures (and must be saved by the caller).
\idle has another difference with other Mercury procedures:
it never returns control to its caller.
Instead it continues execution by jumping to the code address of the next
instruction to execute,
such as the resume address of a context,
or the entry point of a spark.
If its callees, \trystealspark or \prepareengineforcontext were to make this
jump rather than returning and then letting \idle jump,
then this would leak the stack frame created for the C call.
Therefore, \trystealspark and \prepareengineforcontext must return the
address of the next instruction to execute and allow \idle to jump to this
code or jump to \sleep.
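This convention of returning the next code address rather than jumping from
the callee can be modelled with ordinary C function pointers. In the sketch
below (all names invented) a callee chooses what runs next but its caller
performs the transfer, so the callee's stack frame has been popped before
control moves on, analogous to how \trystealspark and
\prepareengineforcontext return an address for \idle to jump to.

```c
typedef void *(*next_fn)(void);   /* stands in for an MR_Code* address */

/* A small log so we can observe which action ran. */
static int  log_pos;
static char log_buf[16];

static void *work_done(void)   { log_buf[log_pos++] = 'w'; return 0; }
static void *go_to_sleep(void) { log_buf[log_pos++] = 's'; return 0; }

/* Callee: decides what to run next, but returns the address instead of
 * jumping itself, so its own frame is gone before control transfers. */
static next_fn find_work(int have_spark)
{
    return have_spark ? work_done : go_to_sleep;
}

/* Caller: the only place that actually transfers control, mirroring how
 * MR_idle jumps to the address returned by its helpers. */
static void idle_step(int have_spark)
{
    next_fn next = find_work(have_spark);
    next();   /* in the real runtime this is MR_GOTO(next) */
}
```

If \code{find\_work} called its target directly instead, each nested call
would leave a frame behind, which is exactly the leak the runtime avoids.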
\pagebreak
\plan{Why we need notifications, their purpose}.
A sleeping engine may need to wake up occasionally and handle new parallel
work.
There are two ways to do this:
\begin{description}
\item[polling:] engines can wake up occasionally and \emph{poll} for new
work.
\item[notification:] engines can sleep indefinitely and be woken up by
some other engine when work becomes available.
\end{description}
\noindent
Each of these options has its own drawbacks and benefits.
Polling is very simple, but it has a number of drawbacks.
An engine that wakes up will often find that there is no parallel work,
and it will have to go to sleep again.
At first glance this uses only a very small amount of CPU time and seems
harmless;
however, many idle processes on a system behaving in this way can add up to
a significant amount of CPU time.
The other problem with polling is that a parallel task will not be executed
immediately as an engine's polling timer must first expire.
These two problems trade off against one another:
the more frequently an engine polls for work the quicker it can react but
the more CPU time it will consume.
Notification has neither of these problems, and has some additional
benefits.
An engine making a notification performs a deliberate action.
We take advantage of this by attaching additional information to the
notification.
We also send the notification to a particular engine,
allowing us to control which of several sleeping engines gets woken up and
begins the parallel work.
The major drawback of a notification system is that every time parallel work
is created a notification may need to be made,
and this adds to the overheads of creating parallel work.
The previous version of the runtime system used polling for work
stealing and notification for contexts.
Engines waiting for a POSIX condition variable to be notified about free
contexts would time out every two milliseconds and attempt to steal work.
The system did not support selecting which engine would be woken when a
context became runnable.
The decision between polling and notification can be hard to make,
and it may depend on the application and its work load.
We have chosen to support both methods in our new work stealing code;
notification is used by default when a spark is created,
but the user may elect to use polling.
Notification is always used when a context becomes runnable.
\begin{figure}
\begin{verbatim}
union MR_engine_wake_action_data {
    MR_EngineId MR_ewa_worksteal_engine;
    MR_Context  *MR_ewa_context;
};

struct MR_engine_sleep_sync {
    volatile unsigned               MR_es_state;
    volatile unsigned               MR_es_action;
    union MR_engine_wake_action_data MR_es_action_data;
    sem_t                           MR_es_sleep_sem;
    lock                            MR_es_lock;
};
\end{verbatim}
\caption{\enginesleepsync structure}
\label{fig:engine_sleep_sync}
\end{figure}
\plan{Normal state transitions and the \enginesleepsync structure.}
Notifications are implemented using the \enginesleepsync structure shown in
Figure~\ref{fig:engine_sleep_sync}.
A global array allows engines to access one another's \enginesleepsync
structures.
Each structure contains the following fields:
\code{MR\_es\_state} represents the engine's current state,
and is used by both the owner and other engines to read and write the state
of the engine;
\code{MR\_es\_action} and \code{MR\_es\_action\_data} are a closure describing
the next action that the engine should perform;
\code{MR\_es\_sleep\_sem} is a semaphore that the owner will wait on in order to
sleep and other engines will signal if they want the engine to wake up;
and \code{MR\_es\_lock} is used to prevent multiple engines trying to wake the same
engine at once.
The \idle code above manipulates the state field in its structure,
which is pointed to by \code{eng\_data}.
\begin{description}
\item[\code{MR\_WORKING}:] The engine has work to do and is doing it.
\item[\code{MR\_LOOKING\_FOR\_WORK}:] The engine is looking for work in
the form of a runnable context or local spark. It may have work to
do in the form of a local spark.
\item[\code{MR\_STEALING}:] The engine is currently trying to steal work.
\item[\code{MR\_SLEEPING}:] The engine is sleeping; it could not find any
work and is now waiting for a notification.
\item[\code{MR\_BUSY}:] The engine state is busy: this is a special value
used for synchronisation.
\item[\code{MR\_NOTIFIED}:] The engine has been notified of some work to
do but has not started the work yet.
\end{description}
\begin{figure}
\begin{center}
\includegraphics[width=0.8\textwidth]{pics/engine_state}
\end{center}
\caption{Engine states and transitions}
\label{fig:engine_states}
\end{figure}
\noindent
The states are shown in Figure~\ref{fig:engine_states} as oval-shaped nodes
in the graph except for \code{MR\_NOTIFIED},
which is a single value that represents all the notification types,
depicted as rectangular nodes in the graph.
The oval node with a dashed outline is illustrative only;
it exists in the graph to make it easier to visualise how work stealing may
fail.
An engine can move itself between any of the states connected by a black or
blue edge:
blue denotes a transition that is done with a compare and swap on the
\code{MR\_es\_state} field of the \enginesleepsync structure,
whilst other transitions are made with an assignment.
The parallel runtime system is under the most load when there are a large
number of sparks that represent small computations.
When this occurs, engines spend most of their execution time in the
\code{MR\_WORKING}, \code{MR\_LOOKING\_FOR\_WORK} and \code{MR\_STEALING}
states, or transitioning between them.
Therefore these transitions are \emph{busier}
and their edges in the graph are drawn with thicker lines.
\plan{Notification transitions}
When an engine creates a spark or makes a context runnable
it can notify a recipient using the recipient's
\enginesleepsync structure.
The sender will update the action closure and
\code{MR\_es\_state} fields.
This is modelled as a state transition of the recipient;
such transitions are shown with red and green edges in the graph.
Red edges denote transitions from the sleeping state:
the sender acquires the \code{MR\_es\_lock} in the
\enginesleepsync structure while updating the structure's contents and
signalling the sleep semaphore.
Green edges denote transitions from other states:
the sender uses a compare and swap to write the \code{MR\_BUSY} value to the
\code{MR\_es\_state} field,
then updates the action closure,
and finally writes \code{MR\_NOTIFIED} to the \code{MR\_es\_state} field.
A sender using this latter method must not use the semaphore.
Using compare and swap with the state field allows the recipient to also
use compare and swap, and in some cases to simply write to the field,
when changing states.
This allows us to make the busy transitions more efficient.
These are the transitions from
\code{MR\_LOOKING\_FOR\_WORK} to
\code{MR\_STEALING} and
\code{MR\_WORKING}.
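The sender's side of a green-edge transition can be sketched with C11
atomics. This is a simplified model with invented names, not the runtime's
code: the state word is locked by swapping in the busy value, the action
closure is filled in, and the notified value is then published.

```c
#include <stdatomic.h>

/* Illustrative state values, mirroring the states described above. */
enum { WORKING, LOOKING_FOR_WORK, STEALING, SLEEPING, BUSY, NOTIFIED };

typedef struct {
    _Atomic unsigned state;
    unsigned         action;    /* stands in for the action closure */
} engine_sync;

/* Try to notify an engine observed in state `expected`. If the compare
 * and swap fails, the engine has already moved on by itself and the
 * notification is abandoned. */
int try_notify(engine_sync *es, unsigned expected, unsigned action)
{
    unsigned exp = expected;
    /* Lock the state word by swapping in the busy value... */
    if (!atomic_compare_exchange_strong(&es->state, &exp, BUSY)) {
        return 0;   /* engine changed state first; give up */
    }
    /* ...fill in the action closure while no one else may touch it... */
    es->action = action;
    /* ...then publish the notification. */
    atomic_store(&es->state, NOTIFIED);
    return 1;
}
```

The recipient sees either its old state or \code{MR\_NOTIFIED}, never a
half-written action, because the action is only written while the state
word holds the busy value.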
Note that in the diagram all of the transitions from
\code{MR\_LOOKING\_FOR\_WORK} and \code{MR\_STEALING}
are protected except those to \code{MR\_WORKING};
this is acceptable because we allow some notifications to be ignored if the
recipient has found some other work.
The remaining transition and state, not listed above, occur when the system
shuts down;
this is another exception, as an engine could not possibly find work if the
runtime were shutting down\footnote{
    Mercury supports exceptions and it is conceivable that an exception may
    make the system abort.}.
There are several different actions that a sender can specify when notifying
another engine.
\begin{description}
\item[\code{MR\_ACTION\_CONTEXT}:]
A context has become runnable and is pointed to by the
\esactiondata field.
We attach the context to the message so that we do not have to place it
in the context run queue and then retrieve it again when we know that
this engine is capable of executing the context.
However this notification cannot be ignored as that would cause the
context to be leaked.
Leaking a context means dropping or losing the context;
a context that must be executed may not be leaked as the computation it
represents would get lost.
Therefore we only ever send \code{MR\_ACTION\_CONTEXT} messages to engines in
the \code{MR\_SLEEPING} state,
the only state that does not have an owner-initiated transition from
it.
\item[\code{MR\_ACTION\_RUNQUEUE\_CHECK}:]
A context has become runnable and was placed on the context run queue.
Because the \code{MR\_ACTION\_CONTEXT} message cannot be used with
non-sleeping engines,
we created this message for them.
When a context can only be executed by a single engine,
because that engine has a frame on its C stack that the context needs,
then a deadlock would occur if the context were placed in the context run
queue after the engine had checked the run queue,
but before the engine had gone to sleep (and allowed the message above).
By adding this message we can notify the engine that it should check the
run queue before going to sleep.
\item[\code{MR\_ACTION\_WORKSTEAL}:]
A spark has been placed on the spark deque indicated by the
\esactiondata field.
This message is used to notify an engine of a new spark that it might be
able to steal and execute.
Informing the thief which spark deque it should check is an
optimisation.
Before this message is sent, the sender will check that the recipient
would not exceed the context limit by stealing and executing the spark.
\item[\code{MR\_ACTION\_SHUTDOWN}:]
The runtime system is shutting down, this engine should exit.
\end{description}
\noindent
\plan{Behaviour of \idle WRT notifications.}
The code for \idle in Algorithm~\ref{alg:idle} shows several of these
transitions,
including the transition to the \code{MR\_STEALING} state on lines 28--29.
If this compare and swap fails,
it indicates that the current engine should
wait for \code{MR\_es\_state} to become a value other than \code{MR\_BUSY} and then,
depending on the action closure,
either re-check the run queue or exit.
We have not shown the code that interprets and acts on the action closure,
however it is similar to the switch statement in \sleep
(Algorithm~\ref{alg:sleep}).
When an engine creates a spark,
it attempts to notify other engines of the new spark.
However, looking for and notifying an idle engine on every spark creation
would be prohibitively expensive.
Therefore we maintain a counter of the number of engines in an \emph{idle}
state.
When this counter is zero, an engine creating a spark will not attempt to
notify any other engine.
Idle states are \code{MR\_STEALING} and \code{MR\_SLEEPING}
and are shown in Figure~\ref{fig:engine_states} as shaded ovals.
Only engines in these states are sent the \code{MR\_ACTION\_WORKSTEAL} message.
We considered including the \code{MR\_LOOKING\_FOR\_WORK} state as an idle
state and allowing engines in this state to receive the
\code{MR\_ACTION\_WORKSTEAL} message.
However, we expect that most engines executing \idle will find a runnable
context or a spark on their own deque, so trying to steal work
would be counterproductive (and such a message would often be ignored).
Sending such a message would be \emph{too} speculative
(and is still a little speculative when sent to an engine in the
\code{MR\_STEALING} state).
When the user selects polling for work stealing rather than notification,
an engine will never attempt to notify another engine during spark
creation.
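The idle-engine counter can be sketched as follows, assuming C11 atomics;
all names are invented, and the search for an idle engine to notify is
abbreviated to a counter of notifications sent.

```c
#include <stdatomic.h>

/* Number of engines in an idle state (MR_STEALING or MR_SLEEPING). */
static _Atomic int num_idle_engines;
static int notifications_sent;

/* Engines maintain the counter as they enter and leave idle states. */
static void engine_becomes_idle(void) { atomic_fetch_add(&num_idle_engines, 1); }
static void engine_finds_work(void)   { atomic_fetch_sub(&num_idle_engines, 1); }

/* On spark creation, only pay for the notification path when some engine
 * might actually receive the message. */
void on_spark_created(void)
{
    /* Cheap early exit: no engine is idle, so do not search for one. */
    if (atomic_load(&num_idle_engines) == 0) {
        return;
    }
    /* Otherwise find an idle engine and send MR_ACTION_WORKSTEAL;
     * abbreviated here to incrementing a counter. */
    notifications_sent++;
}
```

The counter may briefly lag behind the true number of idle engines, which
is safe: a missed notification is recovered by the recipient's own stealing
or polling, while a spurious one is simply ignored.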
\begin{algorithm}[tbp]
\begin{verbatim}
1 void MR_sleep(void) {
2 int engine_id = MR_ENGINE(MR_eng_id);
3 int victim_id;
4 MR_EngineSleepSync *eng_data = MR_engine_sleep_data[engine_id];
5 MR_bool got_spark;
6 MR_Spark spark;
7 MR_Code *resume;
8
9 if (!MR_compare_and_swap(&(eng_data->MR\_es\_state),
10 MR_STEALING, MR_SLEEPING)) {
11 // Handle the notification that we have received.
12 }
13 MR_sem_wait(&(eng_data->MR_es_sleep_sem));
14 switch(eng_data->MR_es_action) {
15 case MR_ACTION_SHUTDOWN:
16 MR_GOTO(MR_do_shutdown);
17 case MR_ACTION_WORKSTEAL:
18 victim_id = eng_data->MR_es_action_data.MR_ewa_worksteal_engine;
19 do {
20 MR_spin_loop_hint();
21 got_spark = MR_steal_spark(MR_spark_deques[victim_id],
22 &spark);
23 } while (got_spark == BUSY);
24 if (got_spark == SUCCESS) {
25 eng_data->MR_es_state = MR_WORKING;
26 MR_ENGINE(MR_eng_victim_counter) = victim_id;
27 MR_prepare_engine_for_spark(&spark);
28 MR_GOTO(spark.MR_spark_code);
29 } else {
30 MR_ENGINE(MR_eng_victim_counter) = victim_id + 1;
31 MR_GOTO(MR_idle);
32 }
33 case MR_ACTION_CONTEXT:
34 eng_data->MR_es_state = MR_WORKING;
35 ctxt = eng_data->MR_es_action_data.MR_ewa_context;
36 MR_GOTO(MR_prepare_engine_for_context(ctxt));
37 case MR_ACTION_RUNQUEUE_CHECK:
38 MR_GOTO(MR_idle);
39 }
40 }
\end{verbatim}
\caption{\sleep}
\label{alg:sleep}
\end{algorithm}
An engine that executes \idle and cannot find any work will jump to the
\sleep procedure,
which is shown in Algorithm~\ref{alg:sleep}.
\sleep is another Mercury procedure written in C;
it starts by attempting to put the engine into the sleeping state,
checking for any notifications using a compare and swap as we saw in \idle.
If there are no notifications it will wait on the sleep semaphore in the
\enginesleepsync structure.
When another engine signals \code{MR\_es\_sleep\_sem}, the engine is woken up.
The woken engine then retrieves and acts on the action closure.
We have shown the full switch/case statement for interpreting the values of
\code{MR\_es\_action} and \code{MR\_es\_action\_data} in \sleep.
The implied switch/case statement in \idle is similar,
except that it does not need to cover some actions.
When polling is enabled, the call to \code{MR\_sem\_wait} is used with a
timeout.
If the timeout expires,
the engine will acquire the lock and attempt to
steal a spark from any other engine; if it finds a spark it moves to the
\code{MR\_WORKING} state.
It holds the lock until it either decides to sleep again or sets its state,
so that, in the meantime, no context notification is lost,
as contexts always use notification rather than polling.
When an engine receives a \code{MR\_ACTION\_WORKSTEAL} message it executes
the code at lines 18--31.
As mentioned in Section~\ref{sec:rts_work_stealing},
a data race when attempting to steal a spark will result in
\steal returning an error.
In \trystealspark this error is ignored and the algorithm moves to the next
deque.
However,
when an engine has received a message telling it that there is a spark on
this deque, the engine will use the loop on lines 19--23 to retry when a
collision occurs.
It stops only once it has either succeeded or found the deque empty.
We made this decision because it is possible that many sparks may be placed
on the deque at the same time,
and even though our engine lost a race to another thief, there may be more
sparks available.
If the engine fails to steal a spark from this deque,
the engine will jump to \idle where it will check the run queue and then
recheck all the spark deques.
The other actions performed when receiving a message are simple to
understand from Figure~\ref{fig:engine_states} and
Algorithm~\ref{alg:sleep}.
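The retry policy on lines 19--23 of Algorithm~\ref{alg:sleep} can be modelled
in isolation. The three-valued result type and the fake deque below are
stand-ins of our own invention for \code{MR\_steal\_spark} and the real spark
deque; they exist only to make the retry-on-collision behaviour concrete.

```c
#include <stdbool.h>

/* Three-valued steal result: a CAS collision with another thief
 * yields BUSY, an empty deque yields EMPTY. */
typedef enum { STEAL_SUCCESS, STEAL_BUSY, STEAL_EMPTY } steal_result;

/* A trivial stand-in for a spark deque: reports BUSY for a
 * configurable number of attempts to model lost races. */
typedef struct {
    int sparks_left;    /* sparks remaining on the deque */
    int busy_attempts;  /* attempts that will collide before success */
} fake_deque;

static steal_result fake_steal(fake_deque *d, int *spark_out) {
    if (d->busy_attempts > 0) {
        d->busy_attempts--;
        return STEAL_BUSY;   /* lost a race with another thief */
    }
    if (d->sparks_left == 0) {
        return STEAL_EMPTY;
    }
    d->sparks_left--;
    *spark_out = 42;         /* dummy spark payload */
    return STEAL_SUCCESS;
}

/* The policy from the text: since a notification told us this deque
 * had a spark, retry on BUSY and stop only on success or emptiness. */
static bool steal_with_retry(fake_deque *d, int *spark_out) {
    steal_result r;
    do {
        r = fake_steal(d, spark_out);
    } while (r == STEAL_BUSY);
    return (r == STEAL_SUCCESS);
}
```

The loop is justified because several sparks may have been pushed at once, so
losing one race does not mean the deque is empty.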
\begin{algorithm}[tbp]
\begin{verbatim}
1 void MR_join_and_continue(MR_SyncTerm *st, MR_Code *cont_label) {
2 MR_bool finished, got_spark;
3 MR_Context *current_context = MR_ENGINE(MR_eng_current_context);
4 MR_Context *orig_context = st->MR_st_orig_context;
5 MR_Spark spark;
6
7 finished = MR_atomic_dec_and_is_zero(&(st->MR_st_num_outstanding));
8 if (finished) {
9 if (orig_context != current_context) {
10 while (orig_context->MR_ctxt_resume != cont_label) {
11 MR_spin_loop_hint();
12 }
13 MR_release_context(current_context);
14 MR_load_context(orig_context);
15 }
16 MR_GOTO(cont_label);
17 } else {
18 if (MR_pop_spark(MR_ENGINE(MR_eng_spark_deque), &spark)) {
19 if ((orig_context == current_context) &&
20 (spark.MR_spark_sync_term != st)) {
21 MR_save_context(current_context);
22 current_context->MR_ctxt_resume = cont_label;
23 MR_ENGINE(MR_eng_current_context) = NULL;
24 if (nonempty(MR_context_runqueue)) {
25 MR_push_spark(MR_ENGINE(MR_eng_spark_deque), &spark);
26 MR_GOTO(MR_idle);
27 }
28 }
29 MR_prepare_engine_for_spark(&spark);
30 MR_GOTO(spark.MR_spark_code);
31 } else {
32 if (orig_context == current_context) {
33 MR_save_context(current_context);
34 current_context->MR_ctxt_resume = cont_label;
35 MR_ENGINE(MR_eng_current_context) = NULL;
36 }
37 MR_GOTO(MR_idle);
38 }
39 }
40 }
\end{verbatim}
\caption{\joinandcontinue --- improved work stealing version}
\label{alg:join_and_continue_ws2}
\end{algorithm}
We have also had to make two changes to \joinandcontinue,
as shown in Algorithm~\ref{alg:join_and_continue_ws2}.
The first change is an optimisation on lines 13--14.
When the parallel conjunction's original context is suspended
and the engine's current context executes the last conjunct in the parallel
conjunction,
the previous version of the algorithm would schedule the original
context by placing it on the run queue and then execute \idle, as its current
context had finished.
Since we know that the current context is finished and the original one is
runnable,
we now resume the original context's execution directly.
The second change is on lines 19--28.
This new if statement checks whether the context that the engine holds is
compatible with the spark it has just popped from the deque.
A context is incompatible if it contains stack frames of a computation that
is not a parent of the computation the spark represents.
This means that the context is the original context for the parallel
conjunction it has been working on,
but the spark is not part of this parallel conjunction.
When this happens we must change the context
(we cannot choose a different spark without stealing, and we try to minimise
stealing);
therefore we suspend the current context (lines 21--23).
Since a context switch is mandatory, we know that it is better to
switch to a context that is already runnable,
so if the run queue is non-empty we put the spark back on the deque and jump
to \idle (lines 24--27);
otherwise we execute the spark.
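The compatibility test on lines 19--20 of
Algorithm~\ref{alg:join_and_continue_ws2} can be restated as a small
predicate. The type and function names below are illustrative only; in the
real runtime the test is inlined and works on \code{MR\_SyncTerm} pointers.

```c
#include <stdbool.h>

/* Stand-in for the runtime's sync term structure. */
typedef struct sync_term { int id; } sync_term;

/* A popped spark may run on the engine's current context only if
 * either the context is not the original context of a parallel
 * conjunction (so no other computation depends on its frames), or the
 * spark belongs to that same conjunction (same sync term). */
static bool spark_compatible(bool ctx_is_original,
                             const sync_term *ctx_st,
                             const sync_term *spark_st) {
    return !ctx_is_original || (ctx_st == spark_st);
}
```

If the predicate is false, the engine must suspend the current context before
executing the spark, exactly as in lines 21--23 of the algorithm.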
\input{tab_work_stealing_revised3pt1}
\plan{Benchmark}
Table~\ref{tab:work_stealing_revised_mandelbrot} compares the previous
work stealing implementation with the current one using the mandelbrot benchmark.
There are two row groups with three rows each.
The first group shows the results for right recursive mandelbrot while the
second shows the results for left recursive mandelbrot.
The first row in each group gives the results for the previous work stealing
system,
and the second and third rows in each group show the results for the current
system using notifications (second row) or polling (third row).
Each result is presented with its standard deviation in parentheses.
We compiled the right recursive mandelbrot program with the conjunct
reordering transformation from Section~\ref{sec:rts_reorder} disabled,
otherwise it would have had the same results as the left recursive version
of the same program.
It is still important to test right recursion such as in this example,
as the conjunct reordering transformation cannot reorder dependent parallel
conjunctions so right recursion can still occur.
The differences between the old and new results are minimal;
we have shown the standard deviation for each result to make it easier to
see where comparisons are valid.
In both left and right recursive mandelbrot programs using one Mercury
engine,
the new work stealing code with notifications appears to perform slightly worse than
the original work stealing code;
however, the figures are so close that such a comparison may not be
valid.
With one engine, neither of the two configurations of the new implementation
is faster than the original implementation.
With more engines, both configurations of the new implementation improved on
the old implementation for right recursive mandelbrot,
while for left recursive mandelbrot all three versions are comparable.
The old implementation used one spark deque per context, of which there are
many in right-recursive code;
whereas the new implementation uses one spark deque per engine, of which
there are a constant number.
Therefore,
we believe that this supports our decision to associate deques with engines
rather than contexts, in order to make work stealing simpler.
%Using only one engine means that work stealing will not occur but spark
%creation does, the engine will also execute \joinandcontinue often.
%We believe that this slow down is due to the notification code executed
%during each spark's creation.
%While no notifications are ever made, a check is performed to see if a
%notification should be made.
%This makes sense since the new implementation with notifications was slowest
%than the other two versions.
%The new implementation's changes to \joinandcontinue may also be having a
%performance impact;
%now \joinandcontinue must check that the current context is compatible with
%a spark before executing the spark.
%When we look at the cases for two or more Mercury engines making meaningful
%comparisons is still very difficult.
%The result of 8.1 seconds for right recursive mandelbrot with
%the new runtime system using notifications is an outlier:
%it has an unusually large standard deviation of 1.3s and we cannot use it
%for comparison.
%The right recursive results with four Mercury engines have low standard
%deviations and seem to suggest that the new runtime system is faster,
%especially when using polling.
% The left recursive mandelbrot program generates all its
% sparks at once and the right recursive one does not.
% The revised work stealing system may be slightly faster because it
% notifies an engine when there is a spark to execute in parallel,
% while in the original version, a sleeping engine polls the spark deques
% every two milliseconds.
% However, the more likely explanation is that the search through all the
% deques takes longer in the original version as the number of deques is equal
% to the number of contexts,
% and as we saw earlier in this chapter right recursion can create a large
% number of contexts.
% This is especially true with larger numbers of engines;
% fewer sparks are executed on the engine on which they were created where
% they can re-use their parent context.
\begin{figure}
\begin{center}
\hfill
\begin{minipage}[b]{0.39\textwidth}
\begin{center}
\subfigure[\fibs, no granularity control]{
\begin{tabular}{l}
\code{:- func fibs(int) = int.} \\
\\
\code{fibs(N) = F :-} \\
\code{~~~~( N < 2 ->} \\
\code{~~~~~~~~F = 1} \\
\code{~~~~;} \\
\code{~~~~~~~~(} \\
\code{~~~~~~~~~~~~F1 = fibs(N-1)} \\
\code{~~~~~~~~\&} \\
\code{~~~~~~~~~~~~F2 = fibs(N-2)} \\
\code{~~~~~~~~),} \\
\code{~~~~~~~~F = F1 + F2} \\
\code{~~~~).} \\
% Why is this the only way that I can add vertical space that works?!?
\\
\\
\\
\\
\\
\end{tabular}
\label{fig:fibs}}
\end{center}
\end{minipage}
%
\hfill
\begin{minipage}[b]{0.59\textwidth}
\begin{center}
\subfigure[\fibsgc, with granularity control]{
\begin{tabular}{l}
\code{:- func fibs\_gc(int, int) = int.} \\
\code{} \\
\code{fibs\_gc(N, Depth) = F :-} \\
\code{~~~~( N < 2 ->} \\
\code{~~~~~~~~F = 1} \\
\code{~~~~;} \\
\code{~~~~~~~~( Depth > 0 ->} \\
\code{~~~~~~~~~~~~(} \\
\code{~~~~~~~~~~~~~~~~F1 = fibs\_gc(N-1, Depth-1)} \\
\code{~~~~~~~~~~~~\&} \\
\code{~~~~~~~~~~~~~~~~F2 = fibs\_gc(N-2, Depth-1)} \\
\code{~~~~~~~~~~~~)} \\
\code{~~~~~~~~;} \\
\code{~~~~~~~~~~~~F1 = fibs\_seq(N-1),} \\
\code{~~~~~~~~~~~~F2 = fibs\_seq(N-2)} \\
\code{~~~~~~~~),} \\
\code{~~~~~~~~F = F1 + F2} \\
\code{~~~~).} \\
\end{tabular}
\label{fig:fibsgc}}
\end{center}
\end{minipage}
\hfill
%
\caption{Naive parallel Fibonacci with and without granularity control}
\end{center}
\end{figure}
\label{page:fibs}
To make it easier to see the effects of the new work stealing code,
we created a new benchmark named fibs.
Fibs calculates the value of the 43\textsuperscript{rd} Fibonacci number
using the naive algorithm:
summing the previous two Fibonacci numbers where these are defined
recursively.
This calculation is shown as the function \fibs in Figure~\ref{fig:fibs}.
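For readers more familiar with C, both functions from the figure can be
sketched sequentially. Note this is only a sketch of the granularity-control
structure: the Mercury parallel conjunction (\code{\&}) is modelled here as
ordinary sequential recursion, so no actual parallelism is shown.

```c
/* Naive Fibonacci, purely sequential (F(0) = F(1) = 1). */
static int fibs_seq(int n) {
    return (n < 2) ? 1 : fibs_seq(n - 1) + fibs_seq(n - 2);
}

/* Depth-bounded granularity control: recurse "in parallel" (here,
 * sequentially) while Depth > 0, then fall back to the sequential
 * version so the smallest computations create no sparks. */
static int fibs_gc(int n, int depth) {
    if (n < 2) {
        return 1;
    }
    if (depth > 0) {
        /* In Mercury these two calls form a parallel conjunction. */
        int f1 = fibs_gc(n - 1, depth - 1);
        int f2 = fibs_gc(n - 2, depth - 1);
        return f1 + f2;
    }
    /* Below the depth bound: no further parallelism. */
    return fibs_seq(n - 1) + fibs_seq(n - 2);
}
```

Since the depth bound only switches implementations, \code{fibs\_gc} computes
the same value as \code{fibs\_seq} for any depth.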
(The arity of a Mercury function is written as \code{N+1}.
The \code{+1} refers to the return value,