This repository has been archived by the owner on Apr 10, 2019. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 2
/
rts_original_scheduling.tex
387 lines (357 loc) · 15.9 KB
/
rts_original_scheduling.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
\section{Original spark scheduling algorithm}
\label{sec:rts_original_scheduling}
\plan{Introduction}
In Sections~\ref{sec:backgnd_merpar} and~\ref{sec:backgnd_deppar},
we introduced parallelism in Mercury and described the runtime system in
general terms.
In this section we will explain how sparks were originally managed
prior to 2009,
when I began my Ph.D.\ candidature.
This will provide the background for the changes we have made to the
runtime system since then.
\plan{Global spark queue and contention WRT global queue}
Mercury (before 2009) uses a global spark queue.
The runtime system \emph{schedules} sparks for parallel execution by placing
them onto the end of the queue.
In this chapter we use the word `schedule' to mean the act of making a spark
available for parallel or sequential work.
An idle engine runs a spark by taking it from the beginning of the queue.
The global spark queue must be protected by a lock;
this prevents concurrent access from corrupting the queue.
The global spark queue and its lock can easily become a bottleneck when many
engines contend for access to the global queue.
\plan{Local spark stack --- relieves contention on global queue}
\citet{wang:2006:hons} anticipated this problem and created context local spark
stacks to avoid contention on the global queue.
Furthermore, the local spark stacks do not require locking.
When a parallel conjunction spawns off a spark,
it places the spark either at the end of the global spark queue or at the
top of its local spark stack.
\plan{Spark scheduling decision.}
The runtime system appends the spark to the end of the global queue if
an engine is idle, and
the number of contexts in use plus the number of sparks on the global queue
does not exceed the maximum number of contexts permitted.
If either part of the condition is false,
the runtime system pushes the spark onto the top of the context's local
spark stack.
This algorithm has two aims.
The first is to reduce contention on the global queue,
especially in the common case that there is enough parallel work.
The second aim is to reduce the amount of memory allocated
as contexts' stacks by reducing the number of contexts allocated.
Globally scheduled sparks may be converted into contexts,
so they are also included in this limit.
We explain this limit on the number of contexts in more detail
in Section~\ref{sec:rts_original_scheduling_performance},
after covering the background information in the current section.
Note that sparks placed on the global queue are executed in a
first-in-first-out manner;
sparks placed on a context's local stack are executed in a
last-in-first-out manner.
\begin{figure}
\begin{tabular}{rl}
1: & \code{~~MR\_SyncTerm ST;} \\
2: & \code{~~MR\_init\_syncterm(\&ST, 2);} \\
3: & \code{~~spawn\_off(\&ST, Spawn\_Label\_1);} \\
4: & \code{~~}$G_1$ \\
5: & \code{~~MR\_join\_and\_continue(\&ST, Cont\_Label);} \\
6: & \code{Spawn\_Label:} \\
7: & \code{~~}$G_2$ \\
8: & \code{~~MR\_join\_and\_continue(\&ST, Cont\_Label);} \\
9: & \code{Cont\_Label:} \\
\end{tabular}
\caption{Parallel conjunction implementation}
\label{fig:par_conj_impl_only}
\end{figure}
\plan{barrier code, this is used to explain the right recursion problem.}
In Section~\ref{sec:backgnd_merpar} we described how parallel
conjunctions are compiled
(Figure~\ref{fig:par_conj} shows an example).
Consider the compiled parallel conjunction in
Figure~\ref{fig:par_conj_impl_only}.
The context that executes the parallel conjunction,
let us call it $C_{Orig}$,
begins by
setting up the sync term,
spawning off $G_2$,
and executing $G_1$ (lines 1--4).
Then it executes the barrier \joinandcontinue,
shown in
Algorithm~\ref{alg:join_and_continue_peterw}.
The algorithms throughout this chapter use a macro named \code{MR\_ENGINE}
to access the fields of the structure representing the current engine.
Depending on how full the global run queue is,
and how parallel tasks are interleaved,
there are three important scenarios:
\begin{description}
\item[Scenario one:]~
In this scenario $C_{Orig}$ placed the spark for $G_2$ on the top of its
local spark stack.
Sparks placed on a context's local spark stack cannot be executed by any
other context.
Therefore when $C_{Orig}$ reaches the \joinandcontinue barrier
(line 5 in the example, Figure~\ref{fig:par_conj_impl_only}),
the context $G_2$ will be outstanding and
\var{st->MR\_st\_num\_outstanding} will be non-zero.
$C_{Orig}$ will execute the else branch on lines 17--42 of
\joinandcontinue,
where it will pop the spark for $G_2$ off the top of the spark stack.
It is not possible for some other spark to be on the top of the stack;
any sparks left on the stack by $G_1$ would have been popped off by
the \joinandcontinue barriers of the conjunctions that spawned off the
sparks.
This invariant requires a \emph{last-in-first-out} storage of sparks,
which is why each context uses a stack rather than a queue.
In Section~\ref{sec:rts_work_stealing} we explain in more detail
\emph{why} a \emph{last-in-first-out} order is important.
The check that the spark's stack pointer is equal to the current
parent stack pointer\footnote{
The code generator will ensure that \var{MR\_parent\_sp} is set
before the parallel conjunction is executed,
and that it is restored after.}
will succeed (line 11 of \joinandcontinue),
and the context will execute the spark.
After executing $G_2$,
$C_{Orig}$ will execute the second call to \joinandcontinue,
the one on line 8 of the example.
This time \var{st->MR\_st\_num\_outstanding} will be zero,
and $C_{Orig}$ will execute the then branch on lines 7--15 of
\joinandcontinue.
In the condition of the nested if-then-else,
$C_{Orig}$ will find that the current context is the original context,
and therefore continue execution at line 8.
This causes $C_{Orig}$ to jump to the continuation label,
line 9 in the example,
completing the execution of the parallel conjunction.
\begin{algorithm}[tbp]
\begin{verbatim}
1 void MR_join_and_continue(MR_SyncTerm *st, MR_Code *cont_label) {
2 MR_Context *current_context = MR_ENGINE(MR_eng_current_context);
3
4 MR_acquire_lock(&MR_SyncTermLock);
5 st->MR_st_num_outstanding--;
5 if (st->MR_st_num_outstanding == 0) {
6 if (st->MR_st_orig_context == current_context) {
7 MR_release_lock(&MR_SyncTermLock);
9 MR_GOTO(cont_label);
10 } else {
11 st->MR_st_orig_context->MR_ctxt_resume_label = cont_label;
12 MR_schedule(st->MR_st_orig_context);
13 MR_release_lock(&MR_SyncTermLock);
14 MR_GOTO(MR_idle);
15 }
16 } else {
17 MR_Spark spark;
18 MR_bool popped;
19
20 popped = MR_pop_spark(current_context->MR_ctxt_spark_stack, &spark);
21 if (popped && (spark.MR_spark_parent_sp == MR_parent_sp)) {
22 /*
23 ** The spark at the top of the stack is part of the same
24 ** parallel conjunction, we can exuecute it immediately.
25 */
26 MR_release_lock(&MR_SyncTermLock);
27 MR_GOTO(spark.MR_spark_code);
28 } else {
29 if (popped) {
30 /*
31 ** The spark is part of a different parallel conjunction,
32 ** put it back.
33 */
34 MR_push_spark(current_context->MR_ctxt_spark_stack, &spark);
35 }
36 if (st->MR_st_orig_context == current_context) {
37 MR_save_context(current_context)
38 MR_ENGINE(MR_eng_current_context) = NULL;
39 }
40 MR_release_lock(&MR_SyncTermLock);
41 MR_GOTO(MR_idle);
42 }
43 }
44 }
\end{verbatim}
\caption{\joinandcontinue --- original version}
\label{alg:join_and_continue_peterw}
\end{algorithm}
\item[Scenario two:]~
In this scenario $C_{Orig}$ placed the spark for $G_2$ on the
global spark queue,
where another context, $C_{Other}$, picked it up and executed it
in parallel.
Also, as distinct from scenario three below,
$C_{Other}$ reaches the barrier on line 8 of the example
(Figure~\ref{fig:par_conj_impl_only})
\emph{before}
$C_{Orig}$ reaches the barrier on line 5.
Even if they both seem to reach the barrier at the same time,
their barrier operations are performed in sequence because of the
lock protecting the barrier code.
When $C_{Other}$ executes \joinandcontinue,
It will find that \var{st->MR\_st\_num\_outstanding} is non-zero,
and will execute the else branch on lines 17--42 of \joinandcontinue.
It then attempts to pop a spark off its stack,
as in another scenario a spark on the stack might represent an
outstanding conjunction
(it cannot tell that the outstanding conjunct is $G_1$ executing in
parallel, and not some hypothetical $G_3$).
$C_{Other}$ took this spark from the global queue,
and was either empty or brand new before executing this spark,
meaning that it had no sparks on its stack before executing $G_2$.
Therefore $C_{Other}$ will not have any sparks of its own and
\code{MR\_pop\_spark} will fail.
$C_{Other}$ will continue to line 36 in \joinandcontinue
whose condition will also fail since $C_{Other}$
is not the original context ($C_{Orig}$).
The lock will be released and \joinandcontinue will determine what to do
next by jumping to \idle.
Eventually $C_{Orig}$ will execute its call to \joinandcontinue
(line 5 of the example),
or resume execution after waiting on the barrier's lock (line 4 of
\joinandcontinue),.
When this happens it will find that \var{st->MR\_st\_num\_outstanding}
is zero,
and execute the then branch beginning at line 7 of \joinandcontinue.
$C_{Orig}$ will test if it is the original context,
which it is,
and continue on line 8 of \joinandcontinue.
It then jumps to the continuation label on line 9 of the example,
completing the parallel execution of the conjunction.
\item[Scenario three:]~
As in scenario two, $C_{Orig}$ put the spark for $G_2$ on the global spark
queue,
where another context, $C_{Other}$, picked it up and executed it
in parallel.
However,
in this scenario
$C_{Orig}$ reaches the barrier on line 5 in the example
(Figure~\ref{fig:par_conj_impl_only})
\emph{before}
$C_{Other}$ reaches its barrier on line 8.
When $C_{Orig}$ executes \joinandcontinue,
it finds that \var{st->MR\_st\_num\_outstanding} is non-zero,
causing it to execute the else branch on lines 17--42 of \joinandcontinue.
$C_{Orig}$ will try to pop a spark of its local spark stack.
However the spark for $G_2$ was placed on the global
spark queue,
the only spark it might find is one created by an outer conjunction.
If a spark is found, the spark's parent stack pointer will not match the
current parent stack pointer,
and it will put the spark back on the stack.
$C_{Orig}$ executes the then branch (lines 37--38 of \joinandcontinue),
since this context is the original context.
This branch will suspend $C_{Orig}$,
and set the engine's context pointer to \NULL
before jumping to \idle.
When $C_{Other}$ reaches the barrier on line 8,
it will find that \var{st->MR\_st\_num\_outstanding} is zero,
and will execute the then branch of the if-then-else.
Within this branch it will test to see if it is the original
context,
the test will fail, and the else branch of the nested if-then-else
will be executed.
At this point we know that $C_{Orig}$ must be suspended because
there were no outstanding conjuncts and the current context is not
the original context;
this can only happen if $C_{Orig}$ is suspended.
The code wakes $C_{Orig}$ up by
setting its code pointer to the continuation label,
placing it on the global run queue,
and then jumping to \idle.
When $C_{Orig}$ resumes execution it executes the code on line 9,
which is the continuation label.
\end{description}
The algorithm includes an optimisation not shown here:
if the parallel conjunction has been executed by only one context,
then a version of the algorithm without locking is used.
We have not shown the optimisation because it is equivalent and not relevant
to our discussion;
we mention it only for completeness.
\begin{algorithm}[tbp]
\begin{verbatim}
void MR_idle() {
MR_Context *ctxt;
MR_Spark *spark;
MR_acquire_lock(&MR_runqueue_lock);
while(MR_True) {
if (MR_exit_now) {
MR_release_lock(MR_runqueue_lock);
MR_destroy_thread(); // does not return.
}
// Try to run a context
ctxt = MR_get_runnable_context();
if (ctxt != NULL) {
MR_release_lock(&MR_runqueue_lock);
if (MR_ENGINE(MR_eng_current_context) != NULL) {
MR_release_context(MR_ENGINE(MR_eng_current_context));
}
MR_load_context(ctxt);
MR_GOTO(ctxt->MR_ctxt_resume);
}
// Try to run a spark.
spark = MR_dequeue_spark(MR_global_spark_queue);
if (spark != NULL) {
MR_release_lock(&MR_runqueue_lock);
if (MR_ENGINE(MR_eng_current_context) == NULL) {
ctxt = MR_get_free_context();
if (ctxt == NULL) {
ctxt = MR_create_context();
}
MR_load_context(ctxt);
}
MR_parent_sp = spark->MR_spark_parent_sp;
ctxt->MR_ctxt_thread_local_mutables =
spark->MR_spark_thread_local_mutables;
MR_GOTO(spark->MR_spark_code);
}
MR_wait(&MR_runqueue_cond, &MR_runqueue_lock);
}
}
\end{verbatim}
\caption{\idle --- original version}
\label{alg:MR_idle_initial}
\end{algorithm}
\plan{Explain how work begins executing, for completeness.}
When an engine cannot get any local work it must search for global work.
Newly created engines, except for the first, also search for global work.
They do this by calling \idle,
whose code is shown in Algorithm~\ref{alg:MR_idle_initial}.
Only one of the idle engines can execute \idle at a time.
\var{MR\_runqueue\_lock} protects the context run queue and the
global spark queue from concurrent access.
After acquiring the lock,
engines execute a loop.
An engine exits the loop only when it finds some work to do or the
program is exiting.
Each iteration first checks if the runtime system is being shut down,
if so,
then this thread releases the lock,
and then destroys itself.
If the system is not being shut down,
the engine will search for a runnable context.
If it finds a context, it releases the run queue lock, loads the context
and jumps to the resume point for the context.
If it already had a context, then it first releases that context;
doing so is safe because \idle has a precondition that if an engine has a
context,
the context must not contain a spark or represent a suspended computation.
If no context was found, the engine attempts to take a spark from the global
spark queue.
If it finds a spark then it will need a context to execute that spark.
It will try to get a context from the free list;
if there is none it will create a new context.
Once it has a context,
it copies the context's copies of registers into the engine.
It also initialises the engine's parent stack pointer
register and the spark's thread local mutables
(which are set by the context that created the spark)
into the context.
If the engine does not find any work,
it will wait using a condition variable and the run queue lock.
The pthread wait function is able to unlock the lock and wait on the
condition variable atomically, preventing race conditions.
The condition variable is used to wake up the engine if either a spark is
placed on the global spark queue or a context is placed on the context run
queue.
When the engine wakes,
it will re-execute the loop.