Examiners_reports.org


Note that there may be OCR errors in this file, but I wanted it in text so I could annotate it.

Examiner 1

p.28 Top. future.wait/2

When explaining the working of future.wait/2, it would be nice to add a line explaining why there is no danger for deadlock.

diff --git a/backgnd_deppar.tex b/backgnd_deppar.tex
index a5c14f2..81eb62c 100644
--- a/backgnd_deppar.tex
+++ b/backgnd_deppar.tex
@@ -95,6 +95,9 @@
 acquiring the lock as well as after.
 If it is \code{MR\_PRODUCED} before the lock is acquired,
 then \wait can safely retrieve the future's value without using the lock.
+Note also that because Mercury's mode system ensures that variable
+dependencies can never form cycles,
+the compiler's use of futures cannot create a deadlock.
 A producer will call \signal to place a value into a future and wake up
 any suspended contexts.
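For readers who want a concrete picture of the wait/signal protocol described above, here is a small illustrative sketch in C. It is not the Mercury runtime's actual code: the type, field and function names are invented for this sketch, and the lock-free fast path omits the memory barriers a production implementation would pair with it. It shows the two points made above: \wait can skip the lock once the state is already produced, and because producer/consumer dependencies are acyclic, a wait can never be part of a cycle of waits.

#+BEGIN_SRC c
#include <pthread.h>

typedef enum { FUT_EMPTY, FUT_PRODUCED } fut_state;

typedef struct {
    fut_state        state;   /* written once, by the single producer */
    void            *value;
    pthread_mutex_t  lock;    /* assumed initialised elsewhere */
    pthread_cond_t   cond;
} future;

/* Consumer.  The fast path reads the state without taking the lock (the
 * real runtime pairs this with memory barriers; omitted here).  The slow
 * path re-checks under the lock before sleeping.  The mode system makes
 * producer/consumer dependencies acyclic, so this wait cannot be part of
 * a cycle of waits, hence no deadlock. */
void *future_wait(future *f)
{
    if (f->state == FUT_PRODUCED)          /* lock-free fast path */
        return f->value;
    pthread_mutex_lock(&f->lock);
    while (f->state != FUT_PRODUCED)
        pthread_cond_wait(&f->cond, &f->lock);
    pthread_mutex_unlock(&f->lock);
    return f->value;
}

/* Producer: store the value, publish the state, wake every waiter. */
void future_signal(future *f, void *value)
{
    pthread_mutex_lock(&f->lock);
    f->value = value;
    f->state = FUT_PRODUCED;
    pthread_cond_broadcast(&f->cond);
    pthread_mutex_unlock(&f->lock);
}
#+END_SRC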

p.30-31, Feedback

In your argumentation for the feedback-directed approach, you state yourself that a program is typically executed multiple times with respect to different inputs (p.31). While you use this argument in favour of the feedback approach (which is imo totally justified), it does raise the question of how well the feedback approach provides a model for all of these runs. Unfortunately, this question is rapidly put aside by your remark "Most variations in input data will not effect parallelisation enough to cause a significant difference in performance" (p.31), but this is quite a strong statement; is there any (experimental) proof of it? I think the chosen approach is a good one, but it should be introduced/discussed with a more appropriate level of objectiveness. See also my remark on the discussion w.r.t. static analysis (p.92).

This issue was discussed, but the discussion could have been better. The change below improves the discussion, making our reasoning clearer.

diff --git a/backgnd_autopar.tex b/backgnd_autopar.tex
index e5e0cb5..be01eae 100644
--- a/backgnd_autopar.tex
+++ b/backgnd_autopar.tex
@@ -96,21 +96,30 @@
 used with another,
 then the profile will not necessarily represent the actual use of the
 program,
 and that the optimisation may not be as good as it should be.
-Parallelisation decisions are binary,
+In practice parallelism is used in very few places within a program:
+in the Mercury Compiler itself there are only 34
+(Page~\pageref{page:conjs_in_mcc}) conjuncts in 28,448\footnote{
+    The number of \PS structures in the same profiling data that was used to
+    provide the figures on Page~\pageref{page:conjs_in_mcc}.}
+procedures that may be worth parallelising.
+This means that most differences in performance are not likely to affect the
+very small part of the program where parallelism may be applicable.
+Furthermore,
+as parallelisation decisions are binary,
 either `do not parallelise' to `parallelise'.
-Therefore,
-they will only change as the costs of computations cross some threshold.
-Unless alternative input data causes a computation's runtime to cross such a
-threshold,
-then the parallelisation based on the original input data is
-as equally valid as the parallelisation based on the alternative input data.
-Most variations in input data will not
-effect parallelisation enough to cause a significant difference in performance.
-
-Even in a case where automatic parallelisation does not produce optimal
-results,
-near optimal results are good enough.
-Automatic parallelisation will always be cheaper than manual
+a numerical difference in performance will not usually alter the binary
+parallelisation decision.
+So in most cases,
+auto-parallelisation based on the original input data is equivalent to
+auto-parallelisation based on slightly different input data.
+
+In cases where different input data causes the program to be parallelised
+differently,
+the result is often going to be near-optimal.
+In these cases the main benefits of automatic parallelisation are
+unaffected,
+which are:
+automatic parallelisation will always be cheaper than manual
 parallelisation,
 it requires much less time and does not require a company hire a
 parallelisation expert.

diff --git a/overlap.tex b/overlap.tex
index 3b853ed..a66eb8a 100644
--- a/overlap.tex
+++ b/overlap.tex
@@ -3121,6 +3121,7 @@
 whose effect is lost in the noise.
 % However, the raw times showed significant variability,
 % and this process does not entirely eliminate that variability.
+\label{page:conjs_in_mcc}
 To see the difference between naive and overlap,
 we need to look at larger programs.
 Our standard large test program is the Mercury compiler,
 which contains
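As a purely illustrative sketch of the threshold argument above (the symbols $c$ and $o$, and the numbers, are introduced here for this sketch and do not come from the dissertation): suppose a conjunct is parallelised exactly when its measured sequential cost $c$ exceeds the estimated spawning and synchronisation overhead $o$,
\[ \textit{parallelise} \iff c > o . \]
If the profiled input gives $c = 50\,o$ and a different input gives $c = 30\,o$, the measured cost changes considerably but the binary decision does not; only an input that moves $c$ across the threshold $o$ can change the parallelisation.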

p.34. Imprecise vocabulary

On the one hand you talk about "computations p and q", but further down the text they have become "conjuncts p and q".

Fixed by replacing the initial ‘computations’ with ‘conjuncts’

p.41 Notation

The line reads $G_{k-1},\ldots,G_n$. This must be $G_{k+1}$.

The examiner is correct. I’ve fixed this.

p.50. HBLKSIZE

What is "HBLKSIZE"? Is it some dark option to be set in the Boehm runtime? If you care to mention it, it should also be explained.

The examiner may mean that if we cannot explain it then we should not mention it. I've explained that it is inscrutable.

diff --git a/rts_gc.tex b/rts_gc.tex
index a01cdf5..7306619 100644
--- a/rts_gc.tex
+++ b/rts_gc.tex
@@ -510,10 +510,18 @@
 We anticipate that increasing the size of the local free lists will cause
 even less contention for global locks,
 allowing allocation intensive programs to have better parallel performance.
-The size of the local free list can be adjusted by increasing the
-\texttt{HBLKSIZE} tunable.
-Unfortunately this feature is experimental,
-and adjusting \texttt{HBLKSIZE} caused our programs to crash.
+We wanted to investigate if increasing the size of the local free lists could
+improve performance.
+We contacted the Boehm GC team about this and they advised us to experiment
+with the \texttt{HBLKSIZE} tunable.
+They did not say what this tunable controls.
+it is undocumented and the source code is very hard to understand.
+Our impression is that it only indirectly tunes the sizes of the local free
+lists.
+Unfortunately this feature is experimental:
+adjusting \texttt{HBLKSIZE} caused our programs to crash intermittently,
+such that we could not determine for which values (other than the default)
+the programs would run reliably.
 Therefore we cannot evaluate how \texttt{HBLKSIZE} affects our programs.
 Once this feature is no longer experimental,

p.60. Choice of stack

While I was wondering for several pages why this structure had to be a stack, the second half of the page provides the explanation (to deal with nested conjunctions). It would help the reader if this justification was moved to the spot where the stacks are introduced.

I cannot move this as there is a circular dependency in the explanation. I’ve provided a forward reference instead. I’ve also added a forward reference that was missing from an earlier chapter to the same explanation.

diff --git a/rts_original_scheduling.tex b/rts_original_scheduling.tex
index da8adda..30dcc74 100644
--- a/rts_original_scheduling.tex
+++ b/rts_original_scheduling.tex
@@ -161,6 +161,10 @@ there are three important scenarios:
     any sparks left on the stack by $G_1$ would have been popped off by the
     \joinandcontinue barriers of the conjunctions that spawned off the
     sparks.
+    This invariant requires a \emph{last-in-first-out} storage of sparks,
+    which is why each context uses a stack rather than a queue.
+    In Section~\ref{sec:rts_work_stealing} we explain in more detail
+    \emph{why} a \emph{last-in-first-out} order is important.
 
     The check that the spark's stack pointer is equal to the current parent
     stack pointer\footnote{

diff --git a/rts_work_stealing.tex b/rts_work_stealing.tex
index 2c6b9f1..ec273c5 100644
--- a/rts_work_stealing.tex
+++ b/rts_work_stealing.tex
@@ -10,7 +10,8 @@
 its solution.
 In a work stealing system,
 sparks placed on a context's local spark stack are not committed to running
 in that context;
-they may be executed in a different context if they are stolen.
+they may be executed in a different context if they are stolen
+(we describe below why we use a stack to store sparks).
 This delays the decision of where to execute a spark until the moment
 before it is executed.
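As an aside for readers following the spark-management details, the sketch below shows the shape of a per-context spark stack and why last-in-first-out order matters. It is illustrative only: the struct, field and function names are ours, not the runtime's MR_push_spark/MR_pop_spark implementation, and the real structure described later is a deque that also supports stealing from the other end.

#+BEGIN_SRC c
/* LIFO order means the spark popped at a \joinandcontinue barrier belongs
 * to the most recently entered (innermost) parallel conjunction, which is
 * what the invariant in the diff above requires. */
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    void (*code)(void *);   /* the spawned-off conjunct */
    void  *env;             /* its environment */
} spark;

typedef struct {
    spark  *items;
    size_t  size;
    size_t  top;            /* number of sparks currently on the stack */
} spark_stack;

static bool spark_push(spark_stack *s, spark sp)
{
    if (s->top == s->size)
        return false;       /* full; a real system would grow or spill */
    s->items[s->top++] = sp;
    return true;
}

static bool spark_pop(spark_stack *s, spark *out)
{
    if (s->top == 0)
        return false;       /* nothing left from this or nested conjunctions */
    *out = s->items[--s->top];   /* most recently pushed spark first (LIFO) */
    return true;
}
#+END_SRC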

NOOP p.71. remark

Just a remark, but algorithm 3.8 is quite basic, and the idea of reordering independent conjunctions seems not to be pushed very far.

p.78. XXX

In Figure 3.6, there is an “XXX” remaining.

Deleted (the revisit had already been acted upon).

p.79. Busy transitions

Figure: Seems strange to characterize some transitions as "busier" because "you think" (p.78) they occur most often. Is this relevant and, if it is, could it be better (experimentally) validated/justified? If it isn't, don't talk about it, as it makes one wonder whether some of the choices you made are based on intuition only.

Described our reasoning why these edges are busier:

diff --git a/rts_work_stealing2.tex b/rts_work_stealing2.tex
index bc4178e..def8dbe 100644
--- a/rts_work_stealing2.tex
+++ b/rts_work_stealing2.tex
@@ -530,8 +530,13 @@
 blue edge:
 blue denotes a transition that is done with a compare and swap on the
 \code{MR\_es\_state} field of the \enginesleepsync structure,
 whilst other transitions are made with an assignment.
-The edges drawn with thicker lines are \emph{busier}:
-these are the transitions that we think occur most often.
+The parallel runtime system is under the most load when there are a large
+number of sparks that represent small computations.
+When this occurs, engines spend most of their execution time in the
+\code{MR\_WORKING}, \code{MR\_LOOKING\_FOR\_WORK} and \code{MR\_STEALING}
+states, or transitioning between them.
+Therefore these transitions are \emph{busier}
+and their edges in the graph are drawn with thicker lines.
 
 \plan{Notification transitions}
 When an engine creates a spark or makes a context runnable
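To make the state-transition machinery concrete, here is an illustrative C11 sketch of the kind of compare-and-swap transition the blue edges denote. The struct, enum and helper names are ours (only the state names echo the dissertation's); the real \enginesleepsync structure and its transitions are more involved.

#+BEGIN_SRC c
#include <stdatomic.h>
#include <stdbool.h>

typedef enum {
    ENG_WORKING,
    ENG_LOOKING_FOR_WORK,
    ENG_STEALING,
    ENG_SLEEPING
} engine_state;

typedef struct {
    _Atomic engine_state state;
} engine_sleep_sync;

/* Attempt to move an engine from `expected' to `desired'.  If another
 * engine (for example, one sending a notification) changed the state
 * first, the CAS fails and the caller must re-read and decide again. */
static bool engine_try_transition(engine_sleep_sync *es,
                                  engine_state expected,
                                  engine_state desired)
{
    return atomic_compare_exchange_strong(&es->state, &expected, desired);
}

/* A typical busy path: finish a spark, look for work, then try to steal;
 * these are the frequent transitions drawn with thick edges. */
static void after_spark_finishes(engine_sleep_sync *es)
{
    if (engine_try_transition(es, ENG_WORKING, ENG_LOOKING_FOR_WORK)) {
        /* ... check the local spark stack here ... */
        engine_try_transition(es, ENG_LOOKING_FOR_WORK, ENG_STEALING);
    }
}
#+END_SRC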

p.92. Static analysis

When (re)introducing the general approach and justifying the feedback approach, the discussion on profiler feedback versus static analysis could be more detailed and more objective. You put a lot of emphasis on "representative input" (see also my remark concerning pp.30-31) that is chosen by the programmer, but why not let the user decide on what is "representative input" by providing, e.g., a specification of typical input (e.g. types and size of certain structures)? In the latter case, an approach using static analysis might be more useful than a profiler-based one. Just to be clear, I am not criticising your approach, nor am I asking to change it; I am only stating I feel it could be somewhat more objectively (with its strong and weak points) introduced and discussed.

To have this ‘specification of input’ you need a representative input, so both methods have the same requirements. Each method has its own strengths and may complement the other.

diff --git a/overlap.tex b/overlap.tex
index a66eb8a..50fbd11 100644
--- a/overlap.tex
+++ b/overlap.tex
@@ -276,14 +276,19 @@
 However, this will not be accurate;
 static analysis cannot take into account sizes of data terms,
 or other values that are only available at runtime.
 It may be possible to provide this data by some other means,
-such as by requiring the programmer to provide a specification of their
-program's likely input data.
-It has been shown that programmers are not good at estimating where their
-programs' hotspots are,
-likewise we think that a programmer's estimate of their program's likely
-input data will also be inaccurate.
-This conclusion is supported by the obvious reasoning that it is always best
-to experimentally measure something rather than estimate it is value.
+such as by requiring the programmer to provide a
+descriptions of the typical shapes and sizes of
+their program's likely input data.
+Programming folklore says that programmers are not good at estimating where
+their programs' hotspots are.
+Some of the reasons for this will affect a programmer's estimate of their
+program's likely input data, making it inaccurate.
+In fact, misunderstanding a program's typical input is one of the reasons
+why a programmer is likely to mis-estimate the location of the
+program's hotspots.
+Our argument is that an estimate,
+even a confident one, can only be verified by measurement,
+but a measurement never needs estimation to back it up.
 Therefore,
 our automatic parallelisation system uses profiler feedback information.
 This was introduced in Section~\ref{sec:backgnd_autopar},

I’ve also described this earlier in Chapter 2.

diff --git a/backgnd_autopar.tex b/backgnd_autopar.tex
index 6cfa3fb..2a3b85f 100644
--- a/backgnd_autopar.tex
+++ b/backgnd_autopar.tex
@@ -40,11 +40,16 @@
 against another computation.
 It is important not to create too much parallelism:
 The hardware is limited in how many parallel tasks it can execute,
 any more and the overheads of parallel execution will slow the program down.
-Therefore, it is not just sub-optimal to parallelise the search of the small list,
+Therefore, it is not just sub-optimal to parallelise the search of the small
+list,
 but detrimental.
-The only way we can know the actual cost of most pieces of code
-is by understanding their typical inputs,
-or measuring their runtime cost while operating on typical inputs.
+Using a description of a program's typical inputs one could
+calculate the execution times of the program's procedures.
+However it is more direct, more robust and much easier to simply use a
+profiler to measure the typical execution times of procedures in the program
+while the program is executing with typical inputs,
+especially when we have such a powerful profiler already available
+(Section~\ref{sec:backgnd_deep}).
 Therefore, profiling data should be used in auto-parallelisation;
 it allows us to predict runtime costs for computations whose

p.93 (end of section 4.2). Terminology

Terminology: one often uses "monovariant/polyvariant" to refer to the fact that a predicate/procedure is analysed/transformed/compiled one versus multiple times with respect to a somewhat different context.

I’ve rephrased this paragraph to use these terms (and explain them).

diff --git a/overlap.tex b/overlap.tex
index 97d03d0..4157fd0 100644
--- a/overlap.tex
+++ b/overlap.tex
@@ -383,11 +398,13 @@
 A procedure can contain several conjunctions with two or more goals
 that we consider parallelising,
 therefore multiple candidate parallelisations may be generated for different
 conjunctions in a procedure.
-The same procedure may also appear more than once in the call graph,
-and therefore multiple parallelisations may be generated for the same
-conjunctions within the procedure.
-We discuss how we resolve conflicting recommendations for the same procedure
-in Section~\ref{sec:overlap_pragmatic}.
+The same procedure may also appear more than once in the call graph.
+Each time it occurs in the call graph its conjunctions may be parallelised
+differently, or not at all,
+therefore it is said to be \emph{polyvariant} (having multiple forms).
+Currently our implementation compiles a single \emph{monovariant} procedure.
+We discuss how the implementation chooses which candidate parallelisations to
+include in Section~\ref{sec:overlap_pragmatic}.
 
 % \section{Traversing the call graph}
 % \label{sec:overlap_dfs}

p.106 (bottom of the page):

"The recursive call's cost at its average recursion depth is used by the algorithm." Is this (practically speaking) the best one can get, or would it be (theoretically) possible to obtain more precise results (e.g. by performing some fixpoint computation on the predicate)?

The examiner has understood the issue to some degree. I’ve emphasised the issue and added discussion about getting more precise results through analysis of recurrence relations.

diff --git a/conc.tex b/conc.tex
index b9e2ddc..0b49b5b 100644
--- a/conc.tex
+++ b/conc.tex
@@ -93,6 +93,7 @@
 and to adjust the values that represent the costs of parallel execution
 overheads in the cost model.
 
 \section{Further work}
+\label{sec:conc_further_work}
 
 Throughout this dissertation we have discussed further work that may apply
 to each contribution.

diff --git a/overlap.tex b/overlap.tex
index a0accd5..756d879 100644
--- a/overlap.tex
+++ b/overlap.tex
@@ -1713,22 +1730,39 @@
 times.
 In many cases,
 the conjunction given to Algorithm~\ref{alg:dep_par_conj_overlap_middle}
 will contain a recursive call.
-In these cases the recursive call's cost at its average recursion depth is
-used by the algorithm.
-This assumes that the recursive call
-calls the \emph{original, sequential} version of the procedure.
+In these cases,
+the algorithm uses the recursive call's cost at its average recursion depth
+in the sequential execution data gathered by the profiler.
+This is naive because it assumes that the recursive call
+calls the \emph{original, sequential} version of the procedure,
+however the call is recursive and so the parallelised procedure calls itself,
+the \emph{transformed parallel} procedure whose cost at its average recursion
+depth is going to be different from the sequential version's.
 When the recursive call calls the parallelised version,
-we can expect a similar saving (absolute time, not ratio)
+%we can expect a similar saving
+there may be a similar saving
+(absolute time, not ratio)
 on \emph{every} recursive invocation,
 provided that there are enough free CPUs.
 How this affects the expected speedup of the top level call depends on the
 structure of the recursion.
-Our current approach handles non-recursive cases correctly,
+
+It should be possible to estimate the parallel execution time of the top level
+call into the recursive procedure,
+including the parallelism created at each level of the recursion,
+provided that
+the recursion pattern is one that is understood by the algorithms in
+Section~\ref{sec:overlap_reccalls}.
+Before we implemented this it was more practical to improve the efficiency of
+recursive code
+(Chapter~\ref{chap:loop_control}).
+We have not yet returned to this problem,
+see Section~\ref{sec:conc_further_work}.
+Nevertheless,
+our current approach handles non-recursive cases correctly,
 which are the majority (78\%) of all cases;
 it handles a further 13\% of cases (single recursion) reasonably well
 (Section~\ref{sec:overlap_reccalls}).
-We do not currently do any further analysis when parallelising recursive
-code.
 Note that even better results for singly recursive procedures can be
 achieved because of the work in Chapter~\ref{chap:loop_control}.

p.120 (bottom of the page). Typo: "performed perform".

Fixed (almost) double word.

p.124. Typo: "that the each iteration"

Removed ‘the’ from the phrase.

Examiner 2

General

Scope outside of Mercury

I would have liked to see some discussion about how all the techniques proposed in this dissertation could be applied outside of Mercury (e.g., to Prolog? To functional languages?).

This was never within our intended scope; each of the other languages has limitations that affect the usefulness of automatic parallelisation. Some of these are described in the literature review. For example, Prolog does not have any efficient parallel implementations, and Haskell's laziness makes it very difficult to reason about performance. Many other languages (including Prolog) are not pure (which is required for auto-parallelism), and no other language has a profiler as powerful as the Mercury deep profiler.

I've added a small note in the introduction to this effect:

diff --git a/intro.tex b/intro.tex
index 3f6ae20..01785c5 100644
--- a/intro.tex
+++ b/intro.tex
@@ -218,8 +225,16 @@
 automatic parallelism.
 Our work is targeted towards Mercury.
 We choose to use Mercury because it already supports explicit parallelism of
 dependent conjunctions,
-and it provides powerful profiling tools which generate data for our profile
-feedback analyses.
+and it provides the most powerful profiling tool of any declarative language,
+which generates data for our profile feedback analyses.
+In some ways our work can be used with other programming languages,
+but most other languages have significant barriers.
+In particular automatic parallelism can only work reliably with declaratively
+pure languages,
+the language should also use a strict evaluation strategy to make it easy to
+reason about parallel performance,
+and in the case of a logic language, a strict and precise mode system is
+required to determine when variables are assigned their values.
 Mercury's support for parallel execution and the previous
 auto-parallelisation system \citep{bone:2008:hons} is described in
 Chapter~\ref{chap:backgnd}.

Benchmark diversity

Many of your considerations rely on two benchmarks, representing some fairly regular computations. How would you consider these representative? Or, more in general, I would have liked to see a much broader pool of diverse benchmarks being used throughout the dissertation.

We agree; however, we did not have the resources to find and construct more benchmarks. That said, Chapter 3 deals with these 'easy' benchmarks deliberately, to ensure that we can handle these computations efficiently. Chapter 4 contains a discussion about applying the overlap analysis to the Mercury compiler, and remarks that fewer conjunctions are parallelised (a good thing) but that the difference is lost in the noise. Chapter 5 deals with a pathological case that is best shown using the same benchmarks as Chapter 3, as these benchmarks exhibit the pathological behaviour without creating extra 'noise'.

The first paragraph in Chapter 3 made a similar point. I’ve extended that paragraph to make this point clearer:

diff --git a/rts.tex b/rts.tex
index b7f5096..5b33986 100644
--- a/rts.tex
+++ b/rts.tex
@@ -16,6 +16,14 @@
 we did not get the speedups that we expected.
 Therefore,
 we chose to address the performance problems
 before we worked on automatic parallelism.
+Throughout this chapter
+we continue to use these two benchmarks, along with a naive Fibonacci
+program (Page~\pageref{page:fibs}).
+These benchmarks are not diverse and they all create a lot of
+AND-parallelism,
+most of which is independent.
+We use these benchmarks deliberately to test that our runtime system can
+handle large amounts of parallelism efficiently.
 
 In this chapter we investigate and correct these performance problems.
 We start with the garbage collector in Section~\ref{sec:rts_gc};

diff --git a/rts_work_stealing2.tex b/rts_work_stealing2.tex
index b872029..a120b85 100644
--- a/rts_work_stealing2.tex
+++ b/rts_work_stealing2.tex
@@ -982,6 +982,7 @@
 rather than contexts, in order to make work stealing simpler
 \end{center}
 \end{figure}
 
+\label{page:fibs}
 To make it easier to see the effects of the new work stealing code,
 we created a new benchmark named fibs.
 Fibs calculates the value of the 43\textsuperscript{rd} Fibonacci number

Formal semantics

There are no formal considerations about the fact that the parallel implementations respect the "theoretical" operational semantics of the language (e.g., same observable behavior). Even though it is true, it would be a good idea to spell it out.

Each transformation individually respects the program semantics (the declarative semantics; if they respected the operational semantics then they wouldn't have transformed anything!). We have already said as much when presenting each transformation.

Chapter 1

Chapter 1 is supposed to set the context for the whole dissertation, and it does so in a good way. The chapter could be strengthened a bit by adding some citations (especially in the first few pages). Additionally:

Non-SMP

Considerations in this chapter ignore the new generations of architectures based on CUDA, NUMA (not SMP), etc.

These architectures aren't ignored; they're acknowledged, and then we say they're out of scope. We did not specifically mention CUDA or other GPGPU architectures, so I will acknowledge them:

diff --git a/intro.tex b/intro.tex
index 3f6ae20..0a9678f 100644
--- a/intro.tex
+++ b/intro.tex
@@ -87,9 +87,25 @@
 and slower access to the other processors' memories.
 %The benefit of NUMA is that it is easier to build large NUMA systems than
 %large SMP systems.
 %The drawback is that it is harder to program.
-SMP systems are currently vastly more common, so programmers are usually
-more interested in programming for them.
-Therefore, in this dissertation we are only concerned with SMP systems.
+A new type of architecture uses graphics programming units (GPUs) to
+perform general purpose computing,
+they are called GPGPU architectures.
+However they are not as general purpose as their name suggests:
+they work well for large regular data-parallel and compute-intensive
+workloads,
+but do not work well for more general symbolic processing.
+GPGPUs give programs access to small amounts of different types of memory
+of fixed sizes,
+however most symbolic programs rely on dynamic allocation of unpredictable
+amounts of memory.
+Additionally, symbolic programs often include code with many conditional
+branches;
+this type of code does not perform well on GPGPUs.
+GPGPU architectures are not as general purpose as SMP systems and
+SMP systems are vastly more common than NUMA systems.
+Therefore, in this dissertation we are only concerned with SMP systems as
+they are both more general and more common,
+making them more desirable targets for most programmers.
 Our approach will work with NUMA systems,
 but not optimally.
 
 %\plan{We need parallelism}

Pure/impure examples

I would suggest to add examples of pure and impure languages.

diff --git a/literature_review.tex b/literature_review.tex
index b41bbbb..4f19f55 100644
--- a/literature_review.tex
+++ b/literature_review.tex
@@ -36,6 +36,8 @@ We group languages into the following two classifications.
     but we will restrict our attention to the specific benefit that this
     makes it easy for both compilers and programmers to understand if it is
     safe to parallelise any particular computation.
+    Examples of pure declarative languages are Mercury, Haskell and
+    Clean.
 
     \item[Impure] programming languages are those that allow side effects.
     This includes imperative and impure declarative languages.
 
@@ -45,6 +47,9 @@ We group languages into the following two classifications.
     parallelism desirable.
     Thus parallelisation of programs written in impure languages is
     notoriously difficult.
+    Examples of impure languages are C, Java, Prolog and Lisp;
+    even though the last two of these are declarative languages,
+    they still allow side effects, and are therefore impure.
 
 \end{description}

Is the example in page 8 correct?

The example is correct; the text was not.

diff --git a/literature_review.tex b/literature_review.tex
index b41bbbb..2c17e59 100644
--- a/literature_review.tex
+++ b/literature_review.tex
@@ -557,14 +557,14 @@
 The \code{par} and \code{pseq} functions have the types:
 \begin{verbatim}
 par  :: a -> b -> b
 pseq :: a -> b -> b
 \end{verbatim}
-They both take two arguments and return their first.
+They both take two arguments and return their second.
 Their declarative semantics are identical;
 however their \emph{operational} semantics are different.
 The \code{par} function may spawn off a parallel task that evaluates its
-second argument to WHNF,
-and returns its first argument.
-The \code{pseq} function will evaluate its second argument to WHNF
-\emph{and then} return its first argument.
+first argument to WHNF,
+and returns its second argument.
+The \code{pseq} function will evaluate its first argument to WHNF
+\emph{and then} return its second argument.
 We can think of these functions as the \code{const} function with different
 evaluation strategies.

Logic programming scope (non SLD?)

Considerations on page 9 talk about "logic programming", but they are really focused on languages derived from Prolog (SLD-based, etc.). Logic programming is a much broader term, and the considerations on this page do not reach other LP languages (e.g., ASP-based).

Say explicitly that we restrict our attention to SLD-based languages.

diff --git a/literature_review.tex b/literature_review.tex
index b41bbbb..6bef1f2 100644
--- a/literature_review.tex
+++ b/literature_review.tex
@@ -611,7 +611,9 @@
 which threads will perform which computations.
 
 \subsubsection{Parallelism in logic languages}
 \label{sec:intro_par_logic}
 
-Logic programming languages use selective linear resolution with definite
+Different logic programming languages use different evaluation strategies,
+but we will restrict our attention to those that use
+selective linear resolution with definite
 clauses (SLD resolution) \citep{kowalski_sld}.
 SLD resolution attempts to answer a query by finding a Horn clause whose
 head has the same predicate name as the selected atom in the query,

WONT Dependent vs Independent

Hermenegildo used to stress that there is really no such thing as independent and dependent and-p, they are the same thing just seen at different levels of granularity (and I tend to agree with this).

Try to find something about this in the literature; if I don't find anything then no action needs to be taken.

I could not find any specific reference by Hermenegildo about this, and in any case I strongly disagree.

Research inheritance

My memory might be wrong, but the dependent and-p model of Pontelli and Gupta does not really build on [45] (they are completely independent). Furthermore, DDAS was the name of the system developed by Kish Shen, not by Pontelli and Gupta.

The examiner is correct; I've revised the discussion in question.

diff --git a/bib.bib b/bib.bib
index 41b7cff..081c61d 100644
--- a/bib.bib
+++ b/bib.bib
@@ -918,7 +918,7 @@ Misc{shapiro:flat_concur_prolog,
   address = {Leuven, Belgium}
 }
 
-@techreport{pontelli:1996:ddas,
+@techreport{pontelli:1996:nondet-and-par,
   author = {Enrico Pontelli and Gopal Gupta},
   title = {Non-determinate Dependent And-Parallelism Revisited},
   institution = {Laboratory for Logic, Databases, and Advanced

diff --git a/literature_review.tex b/literature_review.tex
index b41bbbb..dbade1d 100644
--- a/literature_review.tex
+++ b/literature_review.tex
@@ -874,10 +874,8 @@
 However clause bodies can still include tell unifications
 can may provide a variable instantiation that allows some other blocked
 computation to resume.
 Therefore most unifications still incur these extra costs.
 
-The Data Dependent AND-parallelism system (DDAS)
-of \citet{pontelli:1996:ddas} supports explicit
-dependent AND-parallelism in a non-deterministic language;
-it is based on the work of \citet{gupta:1991:ace}.
+\citet{pontelli:1996:nondet-and-par} describe a system that
+supports explicit dependent AND-parallelism in a non-deterministic language.
 Analyses determine a conservative set of shared variables in dependent
 conjunctions and then \emph{guess} which conjunct produces each variable
 (usually the leftmost one in which the variable appears)

Chapter 2

Detism stats

Can you provide a source for the various statistics mentioned in page 25?

diff --git a/backgnd_merpar.tex b/backgnd_merpar.tex
index 21d12ec..b6da6b4 100644
--- a/backgnd_merpar.tex
+++ b/backgnd_merpar.tex
@@ -144,10 +144,17 @@
 since its result may never be needed.
 Restricting parallelism to \ddet and \dccmulti code is not a significant
 limitation;
 since the design of Mercury strongly encourages deterministic code,
-in our experience, about 75 to 85\% of all Mercury procedures are \ddet,
-and most programs spend an even greater fraction of their time in \ddet code.
-\peter{Do we have any statistics regarding what proportion of execution time
-  is spent in det code? That would be a more relevant statistic.}
+in our experience, about 75 to 85\% of all Mercury procedures are
+\ddet.
+(This statistic was calculated by counting the different
+determinism declarations in the source code of the Mercury system.)
+Furthermore,
+we expect that most programs spend an even greater fraction of their time in
+\ddet code
+(we know from profiling data that the Mercury compiler does).
+% Table~\ref{tab:recursion_types}) shows that at least 95\% of the
+% compiler's time is spent in \ddet, \dccmulti, \dsemidet or \dccnondet
+% code, which is the most accuratly we have measured this.
 Existing algorithms for executing nondeterministic code in parallel have
 very significant overheads, generating slowdowns by integer factors.
 Thus we have given priority to parallelising deterministic code,

TRO and and-parallelism

How does the discussion on page 26 relate to some of the tail recursion optimizations developed for and-parallelism?

diff --git a/backgnd_merpar.tex b/backgnd_merpar.tex
index 21d12ec..05a3b03 100644
--- a/backgnd_merpar.tex
+++ b/backgnd_merpar.tex
@@ -250,6 +250,12 @@
 consists only of the final conjunct, $G_n$, and the context just executes
 it.
 Once each conjunct synchronises using {\joinandcontinue},
 the original context will continue execution after the parallel conjunction.
+The introduction of the barrier at the end of the conjunction can prevent
+the compiler from using tail recursion optimistion.
+This occurs when $G_n$ ended in a recursive call, and the whole conjunction
+was the last conjunction in a procedure's body.
+We discuss this problem in more detail and provide our solution in
+Chapter~\ref{chap:loop_control}.
 Figure~\ref{fig:par_conj} shows an example of the code generated to execute
 a parallel conjunction.
 In this example the first conjunct creates a spark that represents the
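For readers who want the tail-recursion problem spelled out, the following is an illustrative C analogue of the shape described in the added paragraph. It is not Mercury code and not the compiler's generated code; process, spawn_spark and join_and_continue are invented stand-ins for the runtime's operations.

#+BEGIN_SRC c
typedef struct list list;
struct list { int head; list *tail; };

/* Stand-ins only: in this sketch the "spark" runs immediately, which is
 * enough to show the control-flow shape. */
static void process(int x) { (void)x; /* work on one element */ }
static void spawn_spark(void (*f)(int), int arg) { f(arg); }
static void join_and_continue(void) { /* barrier: wait for spawned work */ }

void map_par(list *xs)
{
    if (xs == NULL)
        return;
    spawn_spark(process, xs->head);  /* G_1: may run on another engine */
    map_par(xs->tail);               /* G_n: the recursive call */
    join_and_continue();             /* runs after the call returns, so the
                                        compiler cannot turn the recursive
                                        call into a jump (no tail call) */
}
#+END_SRC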

Futures

I might have missed it, but lots of what I see on page 28 resembles the behavior of condition variables in POSIX threads.

diff --git a/backgnd_deppar.tex b/backgnd_deppar.tex
index a5c14f2..ec1a14d 100644
--- a/backgnd_deppar.tex
+++ b/backgnd_deppar.tex
@@ -109,6 +109,18 @@
 Because \signal has no outputs and is deterministic,
 it must be declared as impure so that the compiler will not optimise away
 calls to it.
 
+Some readers may wonder whether futures are similar to POSIX condition
+variables \citep{butenhof1997:pthreads}.
+While both name their operations \emph{wait} and \emph{signal},
+they are different in two significant ways.
+First,
+futures store a value as well as a state,
+POSIX condition variables store only their state.
+Second,
+when a future is signalled, all its consumers are woken,
+whereas only one of a POSIX condition variable's waiters is woken.
+A POSIX condition variable is more similar to a semaphore.
+
 To minimise waiting,
 the compiler pushes \signal operations on each future
 as far to the left into the producer conjunct as possible,

Evidence

I found some considerations on pages 30/31 a bit speculative (especially the last two paragraphs before 2.4.1); any evidence supporting these claims? In particular, evidence related to how unbalanced computations can become due to different inputs.

Examiner 1 raised this issue, and I've provided extra information in my response to that examiner above.

WONT Diagrams

The discussion in this Chapter could benefit from graphical representations of the data structures.

Which discussion? There are several, and many of them already include C-style structure definitions (Sparks, SyncTerms and Futures). I don't think anything more graphical is required.

Chapter 3

Proofread

I found several English errors and typos; please proofread.

diff --git a/macros.tex b/macros.tex
index db513dd..500081b 100644
--- a/macros.tex
+++ b/macros.tex
@@ -46,6 +46,7 @@
 \newcommand{\push}[0]{\code{MR\_push\_spark()}\xspace}
 \newcommand{\pop}[0]{\code{MR\_pop\_spark()}\xspace}
 \newcommand{\steal}[0]{\code{MR\_steal\_spark()}\xspace}
+\newcommand{\esactiondata}{\code{MR\_-es\_-action\_-data}\xspace}
 \newcommand{\findpartime}[0]{\code{find\_par\_time}\xspace}
 \newcommand{\findbestpartition}[0]{\code{find\_best\_partition}\xspace}

diff --git a/rts.tex b/rts.tex
index b7f5096..e30d014 100644
--- a/rts.tex
+++ b/rts.tex
@@ -7,11 +7,11 @@
 Early in the project we tested two manually parallelised programs:
 a raytracer and a mandelbrot image generator.
-Both of them have a single significant loop
+Both programs have a single significant loop
 whose iterations are independent of one another.
-We expect that automatic parallelisation would parallelise this loop
-as it is the best place to introduce parallelism.
-When we parallelised this loop manually we did
+We expect that a good automatic parallelisation system will parallelise this
+loop as it is the best place to introduce parallelism.
+When we parallelised this loop manually, we did
 not get the speedups that we expected.
 Therefore,
 we chose to address the performance problems
@@ -30,10 +30,12 @@
 spark scheduling.
 We address one of these problems by introducing work stealing in
 Section~\ref{sec:rts_work_stealing}.
 Then in Section~\ref{sec:rts_reorder} we reorder conjuncts in independent
-parallel conjunctions to work around the other spark scheduling problem.
-Finally, in Section~\ref{sec:rts_work_stealing2} we make further improvements to
-work stealing and change how idle engines look for work, sleep and are
-woken up.
+parallel conjunctions to work around the second spark scheduling problem.
+Finally, in Section~\ref{sec:rts_work_stealing2} we make further
+improvements to
+work stealing and change the data structures and algorithms used to manage
+idle engines,
+including how idle engines look for work, sleep and are woken up.
 
 \input{rts_gc}
 \input{rts_original_scheduling}

diff --git a/rts_gc.tex b/rts_gc.tex
index a01cdf5..2ffb2da 100644
--- a/rts_gc.tex
+++ b/rts_gc.tex
@@ -11,7 +11,7 @@
 Mercury does not allow destructive update.
 % Their use does not interfere with parallelism as they are used in
 % conjunction with impurity.
 % The compiler will not parallelise impure goals.}
-Therefore a call usually returns its value in newly allocated memory
+Therefore a call usually returns its results in newly allocated memory
 rather than modifying the memory of its parameters.
 Likewise, a call cannot modify data that may be aliased.
 This means that Mercury programs often have a high rate of allocation,
@@ -57,8 +57,9 @@
 for $P$ processors.
 Using four processors the theoretical best speedup is:
 %\paul{I prefer how \frac displays maths but in text the fonts become tiny.}
 $(1 + 19) / (1 + 19/4) = 3.48$.
-17\% of the parallel runtime ($1 + 19/4$) is collector time.
-If we use a machine with 100 processors then this becomes:
+The fraction of this new execution time ($1 + 19/4$) spent in the collector
+($1$) is 17\% (it was 0.5\% without parallelisation).
+If we use a machine with 100 processors then the speedup becomes:
 $(1 + 19) / (1 + 19/100) = 16.8$;
 with 84\% of the runtime spent in the collector.
 As the number of processors increases,
@@ -67,7 +68,7 @@
 collection to complete.
 
 \plan{Discuss predictions regarding parallel marking, locking and thread
 local heaps.}
-To reduce this problem,
+To reduce this effect,
 \citet{boehm:1988:gc} included parallel marking support in their collector.
 Ideally this would remove the bottleneck described above.
 However,
@@ -86,7 +87,7 @@
 parallel Mercury programs (Section~\ref{sec:backgnd_merpar}).
 The second is that memory allocation routines must use locking to protect
 shared data structures,
-which will slow down allocation.
+which slows down allocation.
 Boehm GC's authors recognised this problem and added support for
 thread-local resources such as free lists.
 Therefore,
@@ -103,7 +104,7 @@
 benchmark programs with different memory allocation requirements.
 We wanted to determine how memory allocation rates affect performance of
 parallel programs.
 Our first benchmark is a raytracer developed for the
-ICFP programming contest in 2000.
+ICFP programming contest in the year 2000.
 For each pixel in the image,
 the raytracer casts a ray into the scene to determine what colour to paint
 that pixel.
 Two nested loops build the pixels for the image:
@@ -184,15 +185,17 @@
 interference from any other performance problems.
 
 \plan{Describe performance in practice.}
 The raytracer program benefits from parallelism in both Mercury and the
 garbage collector.
-Using Mercury's parallelism only (four Mercury engines, and one GC thread)
-speeds the program up by a factor of 1.58,
+Using Mercury's parallelism only
+(four Mercury engines, and one GC thread)
+the program achieves a speedup of 1.58,
 compared to 1.29 when using the GC's parallelism only
 (one Mercury engine, and four GC threads).
 When using both Mercury and the GC's parallelism
 (four engines and four marker threads)
-the raytracer achieves a speedup of 2.73.
+it achieves a speedup of 2.73.
 These speedups are much lower than we might expect from such a program:
 either the mutator, the collector or both are not being parallelised well.
+
 mandelbrot\_lowalloc does not see any benefit from parallel marking.
 It achieves very good speedups from multiple Mercury engines.
 We know that this program has a low allocation rate
@@ -202,6 +205,7 @@
 these results support the hypothesis that heavy use of garbage collection
 makes it difficult to achieve good speedups when parallelising programs.
 The more time spend in garbage collection,
 the worse the speedup due to parallelism.
+
 mandelbrot\_highalloc, which stores its data on the heap,
 sees similar trends in performance as raytracer.
 It is also universally slower than mandelbrot\_lowalloc.
@@ -238,7 +242,7 @@
 respectively.
 
 \plan{Describe trends in either the mutator or GC time.}
 As we expected,
 mandelbrot\_lowalloc spends very little time running the collector.
-Typically, it ran the collector only twice during its execution.
+Typically, it runs the collector only twice during its execution.
 Also, total collector time was usually between 5 and 30 milliseconds.
 The other two programs ran the collector hundreds of times.
 As we increased the number of Mercury engines
@@ -291,7 +295,7 @@
 number of Mercury engines both the collector and mutator time decrease.
 This shows that parallelism in Mercury and in Boehm GC both contribute to
 the elapsed time speedup we saw in the diagonal path in Table~\ref{tab:gc}.
-As Table~\ref{tab:gc_amdahl} give us more detail,
+As Table~\ref{tab:gc_amdahl} gives us more detail,
 we can also see that collector time as a percentage of elapsed time
 increases as we add threads and engines to the collector and Mercury.
 This occurs for both the raytracer and mandelbrot\_highalloc.
@@ -350,8 +354,8 @@
 by definition, allocation cannot occur during collector time.
 mandelbrot\_highalloc has a higher allocation rate than raytracer.
 However,
-it spends less time doing collection, when measured either absolutely
-or by percentage of elapsed time.
+measuring either absolutely or as a percentage of elapsed time,
+mandelbrot\_highalloc spends less time doing collection than the raytracer.
 Garbage collectors are tuned for particular workloads.
 It is likely that Boehm GC handles mandelbrot\_highalloc's workload more
 easily than raytracer's workload.
@@ -387,7 +391,7 @@
 we have shown that speedups due to Mercury's parallel conjunction are
 limited by the garbage collector.
 Better speedups can be achieved by using the parallel marking feature in
 the garbage collector.
-We also attempted to improve performance further by modifying the initial
+Now we will show how performance can be improved by modifying the initial
 heap size of the program.
 Table~\ref{tab:gc_heapsize} shows the performance of the raytracer and
 mandelbrot\_highalloc programs with various initial heap sizes.
@@ -418,7 +422,7 @@
 If, after a collection, it still cannot satisfy the memory request then it
 will increase the size of the heap.
 In each collection,
 the collector must read all the stacks, global data, thread local data and
-all the in-use memory in the heap
+all the in-use memory in the heap
 (memory that is reachable from the stacks, global data and thread local
 data).
 This causes a lot of cache misses,
 especially when the heap is large.
@@ -438,8 +442,8 @@
 than with a heap size of 32MB.
 %remembering to care about normal caches much less TLBs.}
 64MB is much larger than the processor's cache size
 (this processor's L3 cache is 8MB) and
-covers more page mappings than its TLBs can hold (L2 TLB covers 2MB when
-using 4KB pages).
+covers more page mappings than its TLBs can hold
+(the L2 TLB covers 2MB when using 4KB pages).
 The collector's structures and access patterns may be slower at this size
 because of these hardware limitations,
 and the benefits of a 64MB heap are not enough to overcome the effects of
@@ -470,12 +474,12 @@ There are two reasons for this
 \item[Reason 1]
   In a parallel program with more than one Mercury engine,
   each collection must \emph{stop-the-world}:
-  All Mercury engines are stopped so that they do not modify the heap during
+  all Mercury engines are stopped so that they do not modify the heap during
   the marking phase.
   This requires synchronisation which reduces the performance of parallel
   programs.
-  Therefore, the less often collection occurs, the rarer stop-the-world
-  events are, and the less their synchronisation affects performance.
+  The less frequently collection occurs, the less performance is affected
+  by the synchronisation of stop-the-world events.
 
 \item[Reason 2]
   Another reason was suggested by Simon Marlow:
 
@@ -490,7 +494,7 @@ There are two reasons for this
   (P2) objects,
   causing a cache miss in P1's cache and invalidating the corresponding
   cache line in P2's cache.
-  Later, when collection is finished,
+  Later, when collection finishes,
   P2's process resumes execution and incurs a cache miss for the object
   that it had been using.
   % He never states this in either of his parallel GC papers.
@@ -502,15 +513,15 @@
 \plan{Discuss local heaps for threads, and their reliability problems.}
 We also investigated another area for increased parallel performance.
 Boehm GC maintains thread local free lists that allow memory to be allocated
-without contending for locks on global free lists.
-When a local free list cannot satisfy a memory request, the global free
-lists must be used.
-The local free lists amortise the costs of locking the global free lists.
+without contending for locks on the global free list.
+When a thread's local free list cannot satisfy a memory request, the global
+free list must be used.
+The local free lists amortise the costs of locking the global free list.
 We anticipate that increasing the size of the local free lists will cause
 even less contention for global locks,
 allowing allocation intensive programs to have better parallel performance.
-The size of the local free list can be adjusted by increasing the
+The size of the local free lists can be adjusted by increasing the
 \texttt{HBLKSIZE} tunable.
 Unfortunately this feature is experimental,
 and adjusting \texttt{HBLKSIZE} caused our programs to crash.

diff --git a/rts_original_scheduling.tex b/rts_original_scheduling.tex
index da8adda..1a8e80d 100644
--- a/rts_original_scheduling.tex
+++ b/rts_original_scheduling.tex
@@ -5,7 +5,7 @@
 \plan{Introduction}
 In Sections~\ref{sec:backgnd_merpar} and~\ref{sec:backgnd_deppar},
 we introduced parallelism in Mercury and described the runtime system in
-generic terms.
+general terms.
 In this section we will explain how sparks were originally managed
 prior to 2009,
 when I began my Ph.D.\ candidature.
@@ -46,7 +46,7 @@
 The second aim is to reduce the amount of memory allocated as contexts'
 stacks by reducing the number of contexts allocated.
 Globally scheduled sparks may be converted into contexts,
 so they are also included in this limit.
-We explain this limit on the number of context in more detail
+We explain this limit on the number of contexts in more detail
 in Section~\ref{sec:rts_original_scheduling_performance},
 after covering the background information in the current section.
 Note that sparks placed on the global queue are executed in a
@@ -137,7 +137,7 @@
 Then it executes the barrier \joinandcontinue,
 shown in Algorithm~\ref{alg:join_and_continue_peterw}.
 The algorithms throughout this chapter use a macro named \code{MR\_ENGINE}
-to access the fields of the current engine structure.
+to access the fields of the structure representing the current engine.
 Depending on how full the global run queue is,
 and how parallel tasks are interleaved,
 there are three important scenarios:
@@ -219,7 +219,7 @@ there are three important scenarios:
   Eventually $COrig$ will execute its call to \joinandcontinue
   (line 5 of the example),
-  or be executed after waiting on the barrier's lock (line 4 of
+  or resume execution after waiting on the barrier's lock (line 4 of
   \joinandcontinue),.
   When this happens it will find that
   \var{st->MR\_st\_num\_outstanding} is zero,
@@ -343,7 +343,7 @@
 They do this by calling \idle,
 whose code is shown in Algorithm~\ref{alg:MR_idle_initial}.
 Only one of the idle engines can execute \idle at a time.
 \var{MR\_runqueue\_lock} protects the context run queue and the
-global context queue from concurrent access.
+global spark queue from concurrent access.
 After acquiring the lock,
 engines execute a loop.
 An engine exits the loop only when it finds some work to do or the
@@ -373,8 +373,8 @@
 register and the spark's thread local mutables into the context.
 If the engine does not find any work,
 it will wait using a condition variable and the run queue lock.
-The pthreads wait function is able to unlock the lock and wait on the
-condition atomically, preventing race conditions.
+The pthread wait function is able to unlock the lock and wait on the
+condition variable atomically, preventing race conditions.
 The condition variable is used to wake up the engine if either a spark is
 placed on the global spark queue or a context is placed on the context run
 queue.

diff --git a/rts_original_scheduling_performance.tex b/rts_original_scheduling_performance.tex
index 450bf3d..bcfa6ea 100644
--- a/rts_original_scheduling_performance.tex
+++ b/rts_original_scheduling_performance.tex
@@ -30,7 +30,8 @@
 In Section~\ref{sec:rts_gc} we ran our benchmarks with a recent version of
 the runtime system.
 In the rest of this chapter we describe many of the improvements to the
-runtime system that improved parallel performance.
+runtime system that led to the improved parallel performance reported in
+that section.
 
 \plan{Introduce right recursion.}
 Figure~\ref{fig:map_right_and_left_recursive} shows two alternative, parallel
@@ -77,9 +78,10 @@
 We use the mandelbrot\_lowalloc program from Section~\ref{sec:rts_gc}.
 Using this program we can easily observe the speedup due to parallelism in
 Mercury without the effects of the garbage collector.
-The loop that iterates over the rows in the image uses right recursion.
-It is similar to \code{map/3}
-in Figure~\ref{fig:map_right_recursive}.
+The main loop of the renderer uses right recursion,
+and is similar to \code{map/3}
+(Figure~\ref{fig:map_right_recursive}).
+This is the loop that iterates over the image's rows.
 The leftmost column shows the maximum number of contexts permitted at any
 time.
 This is the limit that was mentioned in the previous section.
@@ -130,10 +132,9 @@
 After executing \code{P/2} the original context will call the
 \joinandcontinue barrier.
 It then attempts to execute a spark from its local spark stack,
 which will fail because the only spark was placed on the global spark queue.
-The original context will be needed after the other conjunct finishes,
-to execute the code after the parallel conjunction.
-The original context cannot proceed but will be needed later,
-therefore it is suspended until all the other 599 iterations of the loop
+The original context cannot proceed until after the other context finishes,
+but it will be needed then and must then be kept in memory until then.
+Therefore it is suspended until all the other 599 iterations of the loop
 have finished.
 Meanwhile the spark that was placed on the global run queue is converted
 into a new context.
@@ -142,7 +143,7 @@
 becomes blocked within the recursive instance of the same parallel
 conjunction;
 it must wait for the remaining 598 iterations of the loop.
 This process continues to repeat itself,
-allocating more contexts which can consume large amounts of memory.
+allocating more contexts.
 Each context consumes a significant amount of memory,
 much more than one stack frame.
 Therefore
@@ -264,12 +265,24 @@ Right and Left recursion shown with standard deviation}
 Table~\ref{tab:2009_left_nolimit} shows benchmark results using left
 recursion.
 The table is broken into three sections:
+
+\begin{enumerate}
+
+\item a copy of the right recursion data from Table~\ref{tab:right}, which is presented again for comparison;
+
+\item the left recursion data, which we will discuss now;
+
+\item and left recursion data with a modified context limit, which we will discuss in a moment.
+
+\end{enumerate}
+
+\noindent
 The figures in parenthesis are the standard deviation of the samples.
 
 The left recursive figures are underwhelming;
@@ -371,7 +384,7 @@
 this way.
 
 \plan{Reinforce that these results support the idea that scheduling
 decisions are made prematurely}
-Both of the left recursion problems have the same cause:
+Both of the problems involving left recursion have the same cause:
 the decision to execute a spark sequentially or in parallel is made too
 early.
 This decision is made when the spark is scheduled,

diff --git a/rts_reorder.tex b/rts_reorder.tex
index 16f9cd7..ccbcc29 100644
--- a/rts_reorder.tex
+++ b/rts_reorder.tex
@@ -59,7 +59,7 @@
 Figure~\ref{fig:map_left_recursive}
 (page~\pageref{fig:map_right_recursive}).
 \end{algorithm}
 
 \plan{Describe the transformation}
-We only attempted to reorder completely independent conjunctions.
+We only reorder completely independent conjunctions.
 It may be possible to find independent sections of dependent conjunctions
 and reorder them,
 but we have not needed to.
@@ -82,10 +82,11 @@
 for example a goal cannot be swapped with an impure goal,
 therefore \trypushconjlater may not always push a goal all the way to the
 end of the list.
 
-\plan{Discuss reasoning for not showing results}.
+\plan{Discuss reasoning for not showing results.}
 We have benchmarked and tested this to confirm that independent right
 recursion now performs the same as independent left recursion.
-This is because independent right recursion is transformed into independent
+We expected these programs to perform identically
+because independent right recursion is transformed into independent
 left recursion.
diff –git a/rts_work_stealing.tex b/rts_work_stealing.tex index 2c6b9f1..856b106 100644 — a/rts_work_stealing.tex +++ b/rts_work_stealing.tex @@ -18,7 +18,7 @@ before it is executed. Work stealing is a popular method for managing parallel work in a shared memory multiprocessor system. % XXX: There may be earlier papers by keller 1984 as cited by halstead. -Multilisp \citep{halstead:1985:multilisp} was one of the first systems to +Multilisp (\citet{halstead:1985:multilisp}) was one of the first systems to use work stealing, which Halstead calls “an unfair scheduling policy”. The term “unfair” is not an expression of the morals of stealing (work), @@ -70,7 +70,7 @@ Work stealing’s other benefits can be summarised as follows.

\end{itemize}

-The initial work stealing implementation was built jointly by +Mercury’s initial work stealing implementation was built jointly by Peter Wang and myself, Wang contributed about 80\% of the work and I contributed the remaining 20\%. @@ -349,7 +349,7 @@ The deque provides the following operations. is not a double pointer as one might expect. This implementation detail avoids memory allocation for sparks inside the deque implementation.

  • A callers of any of these functions temporarily stores the spark on
  • A caller of any of these functions temporarily stores the spark on its program stack.

    \item[\code{MR\_ws\_result MR\_steal\_spark(deque *d, spark *s)}]

@@ -486,7 +486,7 @@ processors in the order that they appear in the program. % processors in the order that they appear in the program. % This is much stronger than C’s \code{volatile} keyword which only controls % how the compiler may order instructions and cache values. -\citet{Chase_2005_wsdeque} uses Java’s \code{volatile} qualifier in the +\citet{Chase_2005_wsdeque} use Java’s \code{volatile} qualifier in the declaration for the \var{top} and \var{bottom} fields of the deque structure. When adding the deque algorithms to Mercury’s runtime system, diff –git a/rts_work_stealing2.tex b/rts_work_stealing2.tex index b872029..65cb5a5 100644 — a/rts_work_stealing2.tex +++ b/rts_work_stealing2.tex @@ -230,7 +230,8 @@ without a context will call \idle to acquire new work. the new version is shown in Algorithm~\ref{alg:idle}. This algorithm uses a structure pointed to by \var{eng\_data} to receive notifications; -we will discuss this structure and notifications this later in this section. +we will discuss this structure and how it is used to send and receive +notifications later in this section. For now we will discuss how \idle looks for work in lieu of any notification.

\idle tries to find a runnable context before trying to find a local @@ -275,7 +275,7 @@ If it finds a context, it will execute \prepareengineforcontext (Algorithm~\ref{alg:prepare_engine_for_context}), which checks for and unloads the engine’s current context, -and then loads a new context. +and then loads the new context. It also clears the context’s resume field and returns the field’s previous value so that \idle can jump to the resume address. The context’s resume field must be cleared here as @@ -544,7 +544,7 @@ This is modelled as a state transition of the recipient; such transitions are shown with red and green edges in the graph. Red edges denote transitions from the sleeping state: the sender acquires the \code{MR\_es\_lock} in the -\enginesleepsync structure while updating the structures contents and +\enginesleepsync structure while updating the structure’s contents and signalling the sleep semaphore. Green edges denote transitions from other states: the sender uses a compare and swap to write the \code{MR\_BUSY} value to the @@ -585,13 +585,13 @@ another engine.

\item[\code{MR\_ACTION\_CONTEXT}:] A context has become runnable and is pointed to by the

  • \code{MR\_es\_action\_data} field.
  • \esactiondata field. We attach the context to the message so that we do not have to place it in the context run queue and then retrieve it again when we know that this engine is capable of executing the context. However this notification cannot be ignored as that would cause the context to be leaked.
  • Leaking a context means not dropping or loosing the context;
  • Leaking a context means dropping or losing the context; a context that must be executed may not be leaked as the computation it represents would get lost. Therefore we only ever send \code{MR\_ACTION\_CONTEXT} messages to engines in

@@ -614,7 +614,7 @@ another engine.

\item[\code{MR\_ACTION\_WORKSTEAL}:] A spark has been placed on the spark deque indicated by the

  • \code{MR\_es\_action\_data} field.
  • \esactiondata field. This message is used to notify an engine of a new spark that it might be able to steal and execute. Informing the thief which spark deque it should check is an

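To make the two notification paths concrete, here is a hedged C sketch of how a sender might deliver a message to another engine. The structure, state names and fields (engine_sleep_sync, ES_SLEEPING, action_data) are invented for the illustration; the real runtime performs additional checks.

#+BEGIN_SRC c
#include <pthread.h>
#include <semaphore.h>
#include <stdbool.h>

enum { ES_SLEEPING, ES_IDLE, ES_BUSY };   /* illustrative engine states */

typedef struct {
    volatile int     state;        /* which state the engine is in      */
    int              action;       /* which notification was delivered  */
    void            *action_data;  /* e.g. a context or a deque pointer */
    pthread_mutex_t  lock;         /* taken when notifying a sleeper    */
    sem_t            sleep_sem;    /* a sleeping engine waits on this   */
} engine_sleep_sync;

/* Returns true if the notification was delivered to this engine. */
bool try_notify_engine(engine_sleep_sync *es, int action, void *data)
{
    int state = es->state;

    if (state == ES_SLEEPING) {
        /* "Red edge": lock the structure while updating its contents,
         * then wake the engine by signalling its sleep semaphore.
         * (A real implementation would re-check the state under the
         * lock before committing to this path.) */
        pthread_mutex_lock(&es->lock);
        es->action      = action;
        es->action_data = data;
        es->state       = ES_BUSY;
        sem_post(&es->sleep_sem);
        pthread_mutex_unlock(&es->lock);
        return true;
    }

    /* "Green edge": claim the engine by writing the busy value with a
     * compare and swap; if that fails another sender got there first. */
    if (__sync_bool_compare_and_swap(&es->state, state, ES_BUSY)) {
        es->action      = action;
        es->action_data = data;
        return true;
    }
    return false;
}
#+END_SRC

A sender that gets false back would typically try another engine; for a work stealing notification it may simply give up, since a spark that is not advertised is not lost, whereas a runnable context must always be delivered somewhere.
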
@@ -1089,10 +1089,10 @@ However, the smaller number of spark deques make work stealing faster. Overall, the benefit of the new implementation depends on the program being executed. -However it is important to remember that fibs is a microbenchmark and does +However it is important to remember that fibs is a micro-benchmark and does not represent typical programs, -especially granularity control as these versions create embarrassing -parallelism. +especially the non-granularity controlled and under-granularity controlled +versions as they create embarrassing parallelism.

There are some other interesting trends in the results. Most obviously, @@ -1104,11 +1104,11 @@ run more slowly than those with coarse granularity. However when we consider the sequential results for varying amounts of granularity, we see some strange results. -We should expect to see the overheads of granularity control in these +We expected to see the overheads of granularity control in these results; the results with large values for \Depth should be slower than the ones with smaller values. -We would also expect the results without granularity control to be fastest, +We also expected the results without granularity control to be the fastest, but not significantly: in practice \fibsgc contributes to a tiny part of the call graphs of tests with a low \Depth value. @@ -1117,8 +1117,8 @@ the sequential execution times for fibs with a \Depth of 40 are \emph{faster} than those with a \Depth of 10 and those without granularity control. We are not sure of the cause, -however one hypothesis is that of the two procedures used in the granularity control -code, +however one hypothesis is that of the two procedures used in the granularity +control code, \fibsseq and \fibsgc, one is faster, namely \fibsgc, due to some strange interaction of the processor’s branch predictor and the code’s @@ -1145,7 +1145,7 @@ the stolen sparks onto its own deque, then it immediately executes the single spark which was not placed on the deque. With each stealing operation, -sparks becomes distributed among more of the engines, +sparks become distributed among more of the engines, which improves the probability that a given victim has work on its own deque. Work stealing will become less frequent as engines try to execute their own sparks before stealing other’s,

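For readers who want a concrete picture of depth-bounded granularity control, the sketch below uses OpenMP tasks in C rather than Mercury's parallel conjunction; fib_gc and fib_seq play the roles of \fibsgc and \fibsseq described above. This is an illustration under those assumptions, not the benchmark code.

#+BEGIN_SRC c
/* Compile with: cc -O2 -fopenmp fibs_gc.c */
#include <stdio.h>

/* Plain sequential fib, used once the depth budget is exhausted. */
static long fib_seq(int n)
{
    return n < 2 ? n : fib_seq(n - 1) + fib_seq(n - 2);
}

/* Depth-bounded fib: only the top 'depth' levels of the call tree create
 * parallel tasks, which limits how many tiny tasks are created. */
static long fib_gc(int n, int depth)
{
    long a, b;
    if (n < 2)
        return n;
    if (depth <= 0)
        return fib_seq(n);            /* stop creating parallel work */
    #pragma omp task shared(a)
    a = fib_gc(n - 1, depth - 1);     /* may run on another core     */
    b = fib_gc(n - 2, depth - 1);     /* runs in the current task    */
    #pragma omp taskwait              /* join before combining       */
    return a + b;
}

int main(void)
{
    long result;
    #pragma omp parallel
    {
        #pragma omp single
        result = fib_gc(40, 10);      /* e.g. Depth = 10 */
    }
    printf("%ld\n", result);
    return 0;
}
#+END_SRC

With a small Depth the granularity-control code is reached in only a tiny fraction of calls, which is why its overhead should barely be visible in the sequential timings discussed above.
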
Amdahl’s law vs Gustafson-Barsis law

Amdahl’s law tends to be rather conservative; have you considered using something like Gustafson-Barsis instead?

Added a citation and a paragraph about why this is not applicable.

diff –git a/bib.bib b/bib.bib index 41b7cff..57dbefc 100644 — a/bib.bib +++ b/bib.bib @@ -1835,5 +1835,14 @@ p Enrico Pontelli}, address = {New York, NY, USA}, }

+
+@article{gustafson:88:reevaluating-amdahl,
+  author = {John L. Gustafson},
+  title = {Reevaluating Amdahl’s Law},
+  journal = {Communications of the ACM},
+  year = {1988},
+  volume = {31},
+  pages = {532--533}
+}

diff –git a/rts_gc.tex b/rts_gc.tex index a01cdf5..327cd26 100644 — a/rts_gc.tex +++ b/rts_gc.tex @@ -65,6 +65,13 @@ As the number of processors increases, the mutator threads spend a larger proportion of their time waiting for collection to complete.

+In some contexts +Gustafson-Barsis’ law +\citep{gustafson:88:reevaluating-amdahl} might be more appropriate than +Amdahl’s law. +This is true when either there is always more data than could be processed +by additional computation resources, +or the data can be split into smaller and smaller pieces +without affecting the granularity of parallel tasks. +These conditions can be true for large numeric computations; +however they are rarely true for symbolic computations, +and Mercury is typically used for symbolic computations. +Therefore we use Amdahl’s law as it is more applicable. + \plan{Discuss predictions regarding parallel marking, locking and thread local heaps.} -To reduce this effect, +To reduce the amount of time that mutator threads have to wait for the +collector,

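For reference, the standard formulations of the two laws (not taken from the dissertation) are, for $P$ processors: Amdahl's law, with $f$ the parallelisable fraction of a fixed-size workload, and Gustafson-Barsis' law, with $s$ the serial fraction measured on the parallel machine and the problem size assumed to grow with $P$:

\[ S_{\mathrm{Amdahl}}(P) = \frac{1}{(1 - f) + f/P}
   \qquad
   S_{\mathrm{GB}}(P) = s + (1 - s)P = P - s(P - 1) \]

The added paragraph argues that the scaled-workload assumption behind Gustafson-Barsis rarely holds for the symbolic computations Mercury is typically used for, which is why Amdahl's law is used throughout.
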
WONT Clarification/Discussion (Page 50)

Reason 2 page 50: would it be possible to test this hypothesis? By bounding/unbounding threads?

No such reference exists. There are no “reasons” on this page, and “aim 2” is both obvious given the reasoning on p50 and explained and measured throughout the dissertation.

Prose on page 56

I found page 56 rather poorly written and hard to follow.

I’ve made the prose clearer and included some extra clarifications to resolve potential ambiguities.

diff –git a/rts_original_scheduling_performance.tex b/rts_original_scheduling_performance.tex index 450bf3d..3b87ebd 100644 — a/rts_original_scheduling_performance.tex +++ b/rts_original_scheduling_performance.tex @@ -72,42 +72,43 @@ and therefore tend to write right recursive code.

\plan{Show performance figures.} Table~\ref{tab:right} shows average elapsed time in seconds for the -mandelbrot\_lowalloc program from 20 test runs. -We use the mandelbrot\_lowalloc program from Section~\ref{sec:rts_gc}. -Using this program we can easily observe the -speedup due to parallelism in Mercury without the effects of the garbage -collector. -The loop that iterates over the rows in the image uses right recursion. -It is similar to \code{map/3} -in Figure~\ref{fig:map_right_recursive}. -The leftmost column shows the maximum number of contexts permitted at -any time. -This is the limit that was mentioned in the previous section. +mandelbrot\_lowalloc program from Section~\ref{sec:rts_gc}. +We use this program because it is easy to observe the speedup due to parallelism +in Mercury as garbage collection does not affect its performance very much. +The loop that iterates over the rows in the image uses right recursion; +it is similar to the loop in Figure~\ref{fig:map_right_recursive}. +The leftmost column of the table shows the maximum number of contexts +that may exist at any time. +This is the limit that was introduced in the previous section. The next two columns give the elapsed execution time for a sequential -version of the program; -in this version of the program no parallel conjunctions were used. +version of the program, that is, +the program compiled without the use of the parallel conjunction operator. These two columns give results without and with thread safety. The following four columns give the elapsed execution times using one to four Mercury engines. -The numbers in parentheses are the ratio between the time and the -sequential thread safe time. +Each value in the table is a mean of 20 test runs. +The numbers in parentheses are the ratio between the mean time and the +mean sequential thread safe time.

\plan{Observations} In general we achieve more parallelism when we use more contexts, -up to a threshold of 601 contexts, -as there are 600 rows in the image and a base case each one consumes a -context. +up to a threshold of 601 contexts. +The threshold is at this point because +there are 600 rows in the image, meaning 600 iterations of the loop plus 1 +for the base case. +Each iteration may consume a context. This is why the program does not benefit greatly from a high limit such as 1024 or 2048 contexts. -This program may use fewer than 600 contexts as any sequentially executed -sparks use their parent context. +When a spark is executed in its parent context +(two iterations execute sequentially), +the program may use fewer than 600 contexts. This is possibly why a limit of 512 contexts also results in a good parallel speedup. When only 256 contexts are used, the four core version achieves a speedup of 1.30, compared to 3.55 for 512 or more contexts. -Given that mandelbrot uses independent parallelism there should never be any -need to suspend a context. +Given that mandelbrot uses independent parallelism, +ideally there should never be any need to suspend a context. Therefore the program should parallelise well enough when restricted to a small number of contexts (four to eight). Too many contexts are needed to execute this program at the level of @@ -119,8 +120,10 @@ parallelism. \plan{Describe the context limit problem.} %Right recursion uses a context for each iteration of the loop, %and then suspends that context. -To understand this problem we must consider how parallel conjunctions are +To understand this problem, we must consider how parallel conjunctions are executed (see Section~\ref{sec:backgnd_merpar}). +We will step through the execution of \code{map/3} from +Figure~\ref{fig:map_right_recursive}. The original context creates a spark for the second and later conjuncts and puts it on the global spark queue. It then executes the first conjunct \code{P/2}.

Chapter 6

WONT Please include more figures.

No. Due to bitrot in the upstream ThreadScope project, generating additional pictures requires extra work.

Bibliography

Several errors, please review your entries?

[46] has a spurious ‘p’

[45] appeared in a more complete form in some ICLP (perhaps 1994)

I believe Pontelli was an author in [47] - also it was published in 2001, not in 1995; on the other hand, 1995 saw the publication of Hermenegildo et al.’s paper on &ACE (which introduces many of the independent and-parallelism structures and optimizations)

[90] was published in ICLP’97