In this chapter we develop a more formal model for Futhark and provide an ideal cost model for the language in terms of the concepts of work and span. Before we present the cost model for the language, we present a simple type system for Futhark and an evaluation semantics. In the initial development, we shall not consider Futhark's more advanced features such as loops and uniqueness types, but we shall return to these constructs later in the chapter.
Futhark supports certain kinds of nested parallelism. For instance, Futhark can in many cases map two nested maps into fully parallel code. Consider the following Futhark function:
.. literalinclude:: src/multable.fut :lines: 10-13
In the case of this program, Futhark will flatten the code to make a single flat kernel. We shall return to the concept of flattening in a later chapter.
When we shall understand how efficient an algorithm is, we shall build
our analysis around the two concepts of work and span. These
concepts are defined inductively over the various Futhark language
constructs and we may therefore argue about work and span in a
compositional way. For instance, if we want to know about the work
required to execute the multable
function, we need to know about
how to compute the work for a call to the map
SOAC, how to compute
the work for the iota
operation, how to compute the work for the
multiply operation, and, finally, how to combine the work. The way to
determine the work for a map
SOAC instance is to multiply the size
of the argument array with the work of the body of the argument
function. Thus, we have
W(\fop{map}~(\lamK{j}{i * j})~(\fop{iota}~n)) = n+1
Applying a similar argument to the outer map, we get
W(\fop{map}~(\lamK{i}{$\cdots$})~(\fop{iota}~\kw{$n$})) = (n+1)^2
Most often we are interested in finding the asymptotical complexity of the algorithm we are analyzing, in which case we will simply write
W(\fop{map}~(\lamK{i}{$\cdots$}) (\fop{iota}~n) = O(n^2)
In a similar way we can derive that the span of a call multable n
,
written S(\kw{multable n}), is O(1).
In this section we present a simplified version of the Futhark language in terms of syntax, a type system for the language, and a strict evaluation semantics.
We assume a countable infinite number of program variables, ranged over by x and f. Binary infix scalar operators, first-order built-in operations, and second order array combinators are given as follows:
\id{binop} &::=~~ \fop{+} ~~|~~ \fop{-} ~~|~~ \fop{*} ~~|~~ \fop{/} ~~|~~ \cdots \\ \\ \id{op} &::=~~ \fop{-} ~~|~~ \fop{abs} ~~|~~ \fop{copy} ~~|~~ \fop{concat} ~~|~~ \fop{empty} \\ &~~|~~ \fop{iota} ~~|~~ \fop{partition} ~~|~~ \fop{rearrange} \\ &~~|~~ \fop{replicate} ~~|~~ \fop{reshape} \\ &~~|~~ \fop{rotate} ~~|~~ \fop{shape} ~~|~~ \fop{scatter} \\ &~~|~~ \fop{split} ~~|~~ \fop{transpose} ~~|~~ \fop{unzip} ~~|~~ \fop{zip} \\ \\ \id{soac} &::=~~ \fop{map} ~~|~~ \fop{reduce} \\ &~~|~~ \fop{scan} ~~|~~ \fop{filter} ~~|~~ \fop{partition}
In the grammar for the Futhark language below, we have eluded both the required explicit type annotations and the optional explicit type annotations. Also for simplicity, we are considering only "unnested" pattern matching and we do not, in this section, consider uniqueness types.
p &::=~~ x ~~|~~ \kw{(}x_1,...,x_n\kw{)} \\ \\ \id{ps} &::=~~ p_1 \cdots p_n \\ \\ F &::=~~ \lam{ps}{e} ~~|~~ e~\id{binop} ~~|~~ \id{binop}~e \\ \\ P &::=~~ \fw{let}~f~\id{ps}~\kw{=}~e ~~|~~ P_1 P_2 ~~|~~ \fw{let}~p~\kw{=}~e \\ \\ v &::=~~ \fop{true} ~~|~~ \fop{false} ~~|~~ n ~~|~~ r \\ &~~|~~ \kw{[}v_1,...,v_n\kw{]} ~~|~~ \kw{(}v_1,...,v_n\kw{)} \\ \\ e &::=~~ x ~~|~~ v ~~|~~ \fw{let}~\id{ps}~\kw{=}~e~\fw{in}~e' \\ &~~|~~ e\kw{[}e'\kw{]} ~~|~~ e\kw{[}e'\kw{:}e''\kw{]} \\ &~~|~~ \kw{[}e_1,...,e_n\kw{]} ~~|~~ \kw{(}v_1,...,v_n\kw{)} \\ &~~|~~ f e_1 ... e_n ~~|~~ \id{op}~e_1 ... e_n ~~|~~ e_1~\id{binop}~e_2 \\ &~~|~~ \fw{loop}~ p_1\kw{=}e_1,\cdots,p_n\kw{=}e_n ~\fw{for}~ x \kw{<} e ~\fw{do}~ e' \\ &~~|~~ \fw{loop}~ p_1\kw{=}e_1,\cdots,p_n\kw{=}e_n ~\fw{while}~ e ~\fw{do}~ e' \\ &~~|~~ \id{soac}~F~e_1~\cdots~e_n
Without considering Futhark's uniqueness type system, Futhark's type system is simple. Types (\tau) follow the following grammar-slightly simplified:
\tau & ::=~~ \kw{i32} ~~|~~ \kw{f32} ~~|~~ \kw{bool} ~~|~~ \kw{[]}\tau \\ & ~~|~~ \kw{(}\tau_1,\cdots,\tau_n\kw{)} ~~|~~ \tau \rarr \tau' ~~|~~ \alpha
We shall refer to the types i32
, f32
, and bool
as basic types. Futhark supports more basic types than those
presented here; consult :numref:`baselang` for a complete list.
In practice, Futhark requires a programmer to provide explicit
parameter types and an explicit result type for top-level function
declarations. Similarly, in practice, Futhark requires explicit types
for top-level let
bindings. In such explicit types, type variables
are not allowed; at present, Futhark does not allow for a programmer
to declare polymorphic functions.
Futhark's second order array combinators and some of its primitive operations do have polymorphic types, which we specify by introducing the concept of type schemes, ranged over by \sigma, which are basically quantified types with \alpha and \beta ranging over ordinary types. When \sigma=\forall\vec{\alpha}.\tau is some type scheme, we say that \tau' is an instance of \sigma, written \sigma \geq \tau' if there exists a substitution [\vec{\tau}/\vec{\alpha}] such that \tau[\vec{\tau}/\vec{\alpha}] = \tau'. We require all substitutions to be simple in the sense that substitutions do not allow for function types, product types, or type variables to be substituted. Other restrictions may apply, which will be specified using a type variable constraint \alpha \triangleright T, where T is a set of basic types.
The type schemes for Futhark's second-order array combinators are as follows:
\id{soac} & ~~:~~ \mathrm{TypeOf}(\id{soac}) \\ \hline \fop{filter} & ~~:~~ \forall \alpha. (\alpha \rarr \mathtt{bool}) \rarr []\alpha \rarr []\alpha\\ \fop{map} & ~~:~~ \forall \alpha_1\cdots\alpha_n\beta. (\alpha_1\rarr\cdots\rarr\alpha_n \rarr \beta) \\ & ~~~~~ \rarr []\alpha_1 \rarr\cdots\rarr []\alpha_n \rarr []\beta\\ \fop{reduce} & ~~:~~ \forall \alpha. (\alpha \rarr \alpha \rarr \alpha) \rarr \alpha \rarr []\alpha \rarr \alpha\\ \fop{scan} & ~~:~~ \forall \alpha. (\alpha \rarr \alpha \rarr \alpha) \rarr \alpha \rarr []\alpha \rarr []\alpha\\
The type schemes for Futhark's built-in first-order operations are as follows:
\id{op} & ~~:~~ \mathrm{TypeOf}(\id{op}) \\ \hline \fop{concat} & ~~:~~ \forall \alpha. []\alpha \rarr \cdots \rarr []\alpha \rarr []\alpha \\ \fop{empty} & ~~:~~ \forall \alpha. []\alpha \\ \fop{iota} & ~~:~~ \kw{int} \rarr []\kw{int} \\ \fop{replicate} & ~~:~~ \forall \alpha. \kw{int} \rarr \alpha \rarr []\alpha\\ \fop{rotate} & ~~:~~ \forall \alpha. \kw{int} \rarr []\alpha \rarr []\alpha\\ \fop{transpose} & ~~:~~ \forall \alpha. [][]\alpha \rarr [][]\alpha\\ \fop{unzip} & ~~:~~ \forall \alpha_1\cdots\alpha_n. [](\alpha_1,\cdots,\alpha_n) \\ & ~~~~~ \rarr ([]\alpha_1,\cdots,[]\alpha_n) \\ \fop{scatter} & ~~:~~ \forall \alpha. []\alpha \rarr []\kw{int} \rarr []\alpha \rarr []\alpha \\ \fop{zip} & ~~:~~ \forall \alpha_1\cdots\alpha_n. []\alpha_1\rarr\cdots\rarr[]\alpha_n \\ & ~~~~~ \rarr [](\alpha_1,\cdots,\alpha_n)
The type schemes for Futhark's built-in infix scalar operations are as follows:
\id{binop} & ~~:~~ \mathrm{TypeOf}(\id{binop}) \\ \hline \kw{+},\kw{-},\kw{*},\kw{/},\cdots & ~~:~~ \forall \alpha \triangleright \{\kw{i32},\kw{f32}\}. \alpha \rarr \alpha \rarr \alpha \\ \kw{==},\kw{!=},\kw{<},\kw{<=},\kw{>},\kw{>=} & ~~:~~ \forall \alpha \triangleright \{\kw{i32},\kw{f32}\}. \alpha \rarr \alpha \rarr \kw{bool}
We use \Gamma to range over type environments, which are finite maps mapping variables to types. We use \{\} to denote the empty type environment and \{x:\tau\} to denote a singleton type environment. When \Gamma is some type environment, we write \Gamma,x:\tau to denote the type environment with domain \Dom(\Gamma) \cup \{x\} and values (\Gamma,x:\tau)(y) = \tau if y=x and \Gamma(y), otherwise. Moreover, when \Gamma and \Gamma' are type environments, we write \Gamma + \Gamma' to denote the type environment with domain \Dom(\Gamma) \cup \Dom(\Gamma') and values (\Gamma + \Gamma')(x) = \Gamma'(x) if x \in \Dom(\Gamma') and \Gamma(x), otherwise.
Type judgments for values take the form \vd v : \tau, which are read “the value v has type \tau.” Type judgments for expressions take the form \Gamma \vd e : \tau, which are read “in the type environment \Gamma, the expression e has type \tau.” Finally, type judgments for programs take the form \Gamma \vd P : \Gamma', which are read “in the type environment \Gamma, the program P has type environment \Gamma'.”
Values ~~\sembox{ \vd v : \tau}
\fraccc{}{\vd r : \kw{f32}} \\[2mm]
\fraccc{}{\vd n : \kw{i32}} \\[2mm]
\fraccc{}{\vd \fop{true} : \kw{bool}} \\[2mm]
\fraccc{}{\vd \fop{false} : \kw{bool}} \\[2mm]
\fraccc{\vd v_i : \tau_i ~~~i = [1;n]}{\vd \kw{(}v_1,\cdots,v_n\kw{)} : \kw{(}\tau_1,\cdots,\tau_n\kw{)}} \\[2mm]
\fraccc{\vd v_i : \tau ~~~i = [1;n]}{\vd \kw{[}v_1,\cdots,v_n\kw{]} : \kw{[]}\tau}
Expressions ~~\sembox{\Gamma \vd e : \tau}
\fraccc{\Gamma(x) = \tau}{\Gamma \vd x : \tau} \\[2mm]
\fraccc{\Gamma \vd e : []\tau ~~~~\Gamma \vd e_i : \kw{int}~~i=[1,2]}{ \Gamma \vd e\kw{[}e_1:e_2\kw{]} : []\tau} \\[2mm]
\fraccc{\Gamma \vd e : \tau ~~~~ \Gamma, x:\tau \vd e' : \tau'}{\Gamma \vd \fw{let}~x~\kw{=} ~e~ \fw{in}~e' : \tau'} \\[2mm]
\fraccc{\Gamma \vd e : \kw{(}\tau_1,\cdots,\tau_n\kw{)} \\ \Gamma, x_1:\tau_1,\cdots,x_n:\tau_n \vd e' : \tau}{\Gamma \vd \fw{let}~\kw{(}x_1,\cdots,x_n\kw{)} ~\kw{=} ~e~ \fw{in}~e' : \tau} \\[2mm]
\fraccc{\Gamma \vd e_i : \tau_i ~~~i = [1;n]}{\Gamma \vd \kw{(}e_1,\cdots,e_n\kw{)} : \kw{(}\tau_1,\cdots,\tau_n\kw{)}} \\[2mm]
\fraccc{\Gamma \vd e_i : \tau ~~~i = [1;n]}{\Gamma \vd \kw{[}e_1,\cdots,e_n\kw{]} : \kw{[]}\tau} \\[2mm]
\fraccc{\vd v : \tau}{\Gamma \vd v : \tau} \\[2mm]
\fraccc{\Gamma(f) = \kw{(}\tau_1,\cdots,\tau_n\kw{)} \rarr \tau ~~~~ \Gamma \vd e_i : \tau_i ~~ i = [1;n]}{\Gamma \vd f~e_1\cdots e_n : \tau} \\[2mm]
\fraccc{\Gamma \vd e_i : \tau_i ~~ i = [1;2] \\ \mathrm{TypeOf}(\id{binop}) \geq \tau ~~~~ \tau = \tau_1 \rarr \tau_2 \rarr \tau'}{\Gamma \vd e_1 ~\id{binop}_\tau~ e_2 : \tau' } \\[2mm]
\fraccc{\Gamma \vd e_i : \tau_i ~~ i = [1;n] \\ \mathrm{TypeOf}(\id{op}) \geq \tau \\ \tau = \tau_1 \rarr \cdots \rarr \tau_n \rarr \tau'}{\Gamma \vd \id{op}_\tau~ e_1~\cdots~e_n : \tau' } \\[2mm]
\fraccc{\Gamma \vd e : []\tau ~~~~~~\Gamma \vd e' : \kw{int}}{ \Gamma \vd e\kw{[}e'\kw{]} : \tau} \\[2mm]
\fraccc{\Gamma \vd F : \tau_\mathrm{f} ~~~ \Gamma \vd e_i : \tau_i ~~ i = [1;n] \\ \mathrm{TypeOf}(\id{soac}) \geq \tau_\mathrm{f} \rarr \tau_1 \rarr \cdots \rarr \tau_n \rarr \tau}{ \Gamma \vd \id{soac} ~F~e_1~\cdots~e_n : \tau }
Functions ~~\sembox{\Gamma \vd F : \tau}
\fraccc{\Gamma,x_1:\tau_1~\cdots~x_n:\tau_n \vd e : \tau}{ \Gamma \vd \lam{x_1:\tau_1~\cdots~x_n:\tau_n}{e} : \tau_1 \rarr \cdots \rarr \tau_n \rarr \tau } \\[2mm]
\fraccc{\Gamma \vd e : \tau_1 \\ \mathrm{TypeOf}(\id{binop}) \geq \tau_1 \rarr \tau_2 \rarr \tau}{\Gamma \vd e ~\id{binop} : \tau_2 \rarr \tau } \\[2mm]
\fraccc{\Gamma \vd e : \tau_2 \\ \mathrm{TypeOf}(\id{binop}) \geq \tau_1 \rarr \tau_2 \rarr \tau}{\Gamma \vd \id{binop}~e : \tau_1 \rarr \tau }
Programs ~~\sembox{\Gamma \vd P : \Gamma'}
\fraccc{\Gamma \vd e : \tau ~~~~~ x \not \in \Dom(\Gamma)}{\Gamma \vd \fw{let}~x~\kw{=} ~e : \{x:\tau\}} \\[2mm]
\fraccc{\Gamma \vd P_1 : \Gamma_1 ~~~~ \Gamma + \Gamma_1 \vd P_2 : \Gamma_2} {\Gamma \vd P_1~P_2 : \Gamma_1 + \Gamma_2} \\[2mm]
\fraccc{\Gamma,x_1:\tau_1,\cdots,x_n:\tau_n \vd e : \tau ~~~~~ f \not \in \Dom(\Gamma)} {\Gamma \vd \fw{let}~f~\kw{(}x_1,\cdots,x_n\kw{)}~\kw{=} ~e : \{f:\kw{(}\tau_1,\cdots,\tau_n\kw{)} \rarr \tau\}}
For brevity, we have eluded some of the typing rules and we leave it
to the reader to create typing rules for rearrange
, shape
,
reshape
, loop-for
, loop-while
, and array ranging (e[i:j:o]
).
In this section we develop a simple evaluation semantics for Futhark programs. The semantics is presented as a big step evaluation function that takes as parameter an expression and gives as a result a value. A soundness property states that if a program P is well-typed and contains a function main of type () \rightarrow \tau, then, if evaluation of the program results in a value v, the value v has type \tau.
To ease the presentation, we treat the evaluation function as being implicitly parameterised by the program P.
The semantics of types yields their natural set interpretations:
\Eval{\kw{i32}} & =~~ \Z \\ \Eval{\kw{f32}} & =~~ \R \\ \Eval{\kw{bool}} & =~~ \{\fop{true},\fop{false}\} \\ \Eval{\kw{(}\tau_1,\cdots,\tau_n\kw{)}} & =~~ \Eval{\tau_1} \times \cdots \times \Eval{\tau_n} \\ \Eval{\kw{[]}\tau} & =~~ \N \rightarrow \Eval{\tau} \\ \Eval{\tau_1 \rightarrow \tau_2} & =~~ \Eval{\tau_1} \rightarrow \Eval{\tau_2}
For ease of presentation, we consider a syntactic vector value \kw{[}v_1,\cdots,v_n\kw{]} equal to the projection function on the vector, returning a default value of the underlying type for indexes greater than n-1 (zero-based interpretation).
For built-in operators \id{op}_\tau, annotated with their type instance \tau according to the typing rules, we assume a semantic function \Eval{\id{op}_\tau} : \Eval{\tau}. As an examples, we assume \Eval{\kw{+}_{\kw{i32}\rightarrow\kw{i32}\rightarrow\kw{i32}}} : \Z \rightarrow \Z \rightarrow \Z.
When e is some expression, we write e[v_1/x_1,\cdots,v_n/x_n] to denote the simultaneous substitution of v_1,\cdots,v_n for x_1,\cdots,x_n (after appropriate renaming of bound variables) in e.
Evaluation of an expression e is defined by an evaluation function \Eval{\cdot} : \mathrm{Exp} \rightarrow \mathrm{Val}. The function is defined in a mutually recursive fashion with an auxiliary utility function \extractF{F} for extracting SOAC function parameters. We first give the definition for \Eval{\cdot}:
\Eval{f~e_1~\cdots~e_n} & =~~ \Eval{e[\Eval{e_1}/x_1\cdots \Eval{e_n}/x_n]} \\ & ~ ~~~~~ \mathrm{where}~\fw{let}~f~x_1~ \cdots x_n ~\kw{=}~ e \in P \\ \Eval{v} & =~~ v \\ \Eval{e\kw{[}e'\kw{]}} & =~~ \Eval{e}(\Eval{e'}) \\ \Eval{\fw{let}~x~\kw{=}~e~\fw{in}~e'} & =~~ \Eval{e'[\Eval{e}/x]} \\ \Eval{\fw{let}~\kw{(}x_1,\cdots,x_n\kw{)}~\kw{=}~e~\fw{in}~e'} & =~~ \Eval{e'[v_1/x_1\cdots v_n/x_n]} \\ & ~ ~~~~~ \mathrm{where}~ \Eval{e} = \kw{(}v_1,\cdots,v_n\kw{)} \\ \Eval{\kw{[}e_1,\cdots,e_n\kw{]}} & =~~ \kw{[}\Eval{e_1},\cdots,\Eval{e_n}\kw{]} \\ \Eval{\kw{(}e_1,\cdots,e_n\kw{)}} & =~~ \kw{(}\Eval{e_1},\cdots,\Eval{e_n}\kw{)} \\ \Eval{e_1~\id{binop}_\tau~e_2} & =~~ \sem{\id{binop}_\tau}~\Eval{e_1}~\Eval{e_2} \\ \Eval{\id{op}_\tau~e_1\cdots e_n} & =~~ \sem{\id{op}_\tau}~\Eval{e_1}~\cdots~\Eval{e_n} \\ \Eval{\fop{map}~F~e_1\cdots e_m} & =~~ \Eval{\kw{[}e'[v_1^1/x_1\cdots v_1^m/x_m],\cdots,e'[v_n^1/x_n\cdots v_n^m/x_m]\kw{]}} \\ & ~ ~~~~~\mathrm{where}~\lambda x_1\cdots x_m . e' = \extractF{F} \\ & ~ ~~~~~\mathrm{and}~ \kw{[}v_1^i,\cdots,v_n^i\kw{]} = \Eval{e_i} ~~~ i=[1..m]
Given a SOAC function parameter F, we define the utility extraction function, \extractF{F}, as follows:
\extractF{\lam{x_1\cdots x_n}{e}} & =~~ \lambda x_1 \cdots x_n . e \\ \extractF{\id{binop}~e} & =~~ \lambda x . x~\id{binop}~v \\ & ~ ~~~~~\mathrm{where}~v = \Eval{e} \\ \extractF{e~\id{binop}} & =~~ \lambda x . v~\id{binop}~x \\ & ~ ~~~~~\mathrm{where}~v = \Eval{e}
Type soundness is expressed by the following proposition:
Proposition: Futhark Type Soundness
If ~\vd P : \Gamma and \Gamma(\id{main}) = () \rightarrow \tau and \Eval{\id{main}~\kw{()}} = v then ~\vd v : \tau.
Notice that we have glanced over the concept of bounds checking by assuming that arrays with elements of type \tau are implemented as total functions from N to \Eval{\tau}.
In this section we give a cost model for Futhark in terms of functions
for determining the total work done by a program, in terms of
operations done by the big-step evaluation semantics, and the span
of the program execution, in terms of the maximum depth of the
computation, assuming an infinite amount of parallelism in the SOAC
computations. The functions for work and span, denoted by W :
\mathrm{Exp} \rightarrow \N and S : \mathrm{Exp} \rightarrow
\N are given below. The functions are defined independently,
although they make use of the evaluation function
\Eval{\cdot}. We have given the definitions for the essential
SOAC functions, namely map
and reduce
. The definitions for
the remaining SOACs follow the same lines as the definitions for
map
and reduce
.
Work (W)
W(v) &=~~ 1 \\ W(\fw{let}~x~\kw{=}~e~\fw{in}~e') &=~~ W(e) + W(e'[\Eval{e}/x]) + 1 \\ W(\fw{let}~(x_1,...,x_n)~\kw{=}~e~\fw{in}~e') &=~~ \Let~[v_1,...,v_n] = \Eval{e} \\ & ~ ~~ \In~W(e) + W(e'[v_1/x_1,\cdots,v_n/x_n]) + 1 \\ W(\kw{[}e_1,\cdots,e_n\kw{]}) &=~~ W(e_1) + \ldots + W(e_n) + 1 \\ W(\kw{(}e_1,\cdots,e_n\kw{)}) &=~~ W(e_1) + \ldots + W(e_n) + 1 \\ W(f~e_1 \cdots e_n) &=~~ W(e_1) + \ldots + W(e_n) + W(e[\Eval{e_1}/x_1,\cdots \Eval{e_n}/x_n]) + 1 \\ & ~ ~~ \quad \mathrm{where} (\fw{let}~f~x_1~\cdots~x_n~=~e) ~\in~ P \\ W(e_1 \id{binop} e_2) &=~~ W(e_1) + W(e_2) + 1 \\ W(\fop{map}~F~e) &=~~ \Let~[v_1,\cdots,v_n] = \Eval{e} \\ & ~ ~~ \quad \lambda x. e' = \extractF{F} \\ & ~ ~~ \In~W(e) + W(e'[v_1/x]) + \ldots + W(e'[v_n/x]) \\ W(\fop{reduce}~F~e'~e'') &=~~ \Let~[v_1,\cdots,v_n] = \Eval{e''} \\ & ~ ~~ \quad \lambda x~x'. e = \extractF{F} \\ & ~ ~~ \In~W(e') + W(e'') + W(e[v_1/x,v_n/x']) \times n + 1 \\ & ~ ~~ \quad \mathrm{assuming} ~ W(e[v_1/x,v_n/x']) ~\mathrm{indifferent~to} ~v_1~ \mathrm{and} ~v_n \\ W(\fop{iota}~e) &=~~ W(e) + n \quad \mathrm{where} ~ n = \Eval{e}
Span (S)
S(v) &=~~ 1 \\ S(\fw{let}~x~=~e~\fw{in}~e') &=~~ S(e) + S(e'[\Eval{e}/x]) + 1 \\ S(\fw{let}~(x_1,...,x_n)~=~e~\fw{in}~e') &=~~ \Let~[v_1,...,v_n] = \Eval{e} \\ & ~ ~~ \In~S(e) + S(e'[v_1/x_1,\cdots,v_n/x_n]) \\ S([e_1,\cdots,e_n]) &=~~ S(e_1) + \ldots + S(e_n) + 1 \\ S((e_1,\cdots,e_n)) &=~~ S(e_1) + \ldots + S(e_n) + 1 \\ S(f e_1 \cdots e_n) &=~~ S(e_1) + \ldots + S(e_n) + S(e[\Eval{e_1}/x_1,\cdots \Eval{e_n}/x_n]) + 1 \\ & ~ ~~ \quad \mathrm{where}~(\fw{let}~f~x_1~\cdots~x_n~=~e) ~\in~ P \\ S(e_1~\id{binop}~e_2) &=~~ S(e_1) + S(e_2) + 1 \\ S(\fop{map}~F~e) &=~~ \Let~[v_1,\cdots,v_n] = \Eval{e} \\ &~ ~~ \quad \lambda x. e' = \extractF{F} \\ &~ ~~ \In~S(e) + \id{max}(S(e'[v_1/x]), \ldots , S(e'[v_n/x])) + 1 \\ S(\fop{reduce}~F~e'~e'') &=~~ \Let~[v_1,\cdots,v_n] = \Eval{e''} \\ & ~ ~~ \quad \lambda x~x'. e = \extractF{F} \\ & ~ ~~ \In~S(e') + S(e'') + S(e[v_1/x,v_n/x']) \times \mathrm{ln}\,n + 1 \\ & ~ ~~ \quad \mathrm{assuming} ~ S(e[v_1/x,v_n/x']) ~\mathrm{indifferent~to} ~v_1~ \mathrm{and} ~v_n \\ S(\fop{iota}~ e) &=~~ S(e) + 1
In this section, we shall investigate an implementation of reduction using the general concept of contraction, which is the general algorithmic trick of solving a particular problem by first making a contraction step, which simplifies the problem size, and then repeating the contraction algorithm until a final result is reached :cite:`algdesign:parseq2016`.
The reduction algorithm that we shall implement assumes an associative reduction operator \oplus : A \rightarrow A \rightarrow A, a neutral element of type A, and a vector v of size 2^n, containing elements of type A. If \mathrm{size}(v) = 1, the algorithm returns the single element. Otherwise, the algorithm performs a contraction by splitting the vector in two and applies the reduction operator elementwise on the two subvectors, thereby obtaining a contracted vector, which is then used as input to a recursive call to the algorithm. In Futhark, the function can be implemented as follows:
.. literalinclude:: src/reduce_contract.fut :lines: 13-16
The function specializes the reduction operator \oplus to be
+
and the neutral element to be 0
. The function first pads the
argument vector xs
with neutral elements to ensure that its size
is a power of two. It then implements a sequential loop with the
contraction step as its loop body, implemented by a parallel
\fop{map} over an appropriately split input vector.
The auxiliary function for padding the input vector is implemented by the following code:
.. literalinclude:: src/reduce_contract.fut :lines: 4-10
To determine the work and span of the algorithm red
, we first
determine the work and span for padpow2
, for which we again need
to determine the work and span for nextpow2
. From simple
inspection we have W(\kw{nextpow2 n}) = S(\kw{nextpow2 n}) =
O(\mathrm{log}\,\kw{n}). Now, from the definition of W and
S and because \kw{nextpow2 n} \leq 2\,\kw{n}, we have
W(\kw{padpow2 ne v}) = W(\fop{concat}~v~\kw{(}\fop{replicate}~\kw{(nextpow2 n - n) ne)}) = O(\kw{n})
and
S(\kw{padpow2 ne v}) = O(\log\,\kw{n})
where \kw{n} = \size~\kw{v}.
Each loop iteration in has span O(1). Because the loop is iterated at-most \log(2\,\kw{n}) times, we have (where \kw{n} = \size\,\kw{v})
W(\kw{red v}) & =~~ O(\kw{n}) + O(\kw{n/2}) + O(\kw{n/4}) + \cdots + O(1) ~~=~~ O(\kw{n}) \\ S(\kw{red v}) & =~~ O(\log\,\kw{n})
It is an exercise for the reader to compare the performance of the
reduction code to the performance of Futhark’s built-in reduce
SOAC (see :numref:`benchmarking`).
Another example of a contraction-based algorithm is radix-sort. Radix-sort is a non-comparison based sorting routine, which implements sorting by iteratively moving elements with a particular bit set to the beginning (or end) in the array. It turns out that this move of elements with the same bit set can be parallelised. Thus, for arrays containing 32-bit unsigned integers, the sorting routine needs only 32 loop-iterations to sort the array. A central property of each step is that elements with identical bit values will not shift position. Depending on whether the algorithm consistently moves elements with the bit set to the end of the array or to the beginning of the array results in the array being sorted in either ascending or descending order.
A radix-sort algorithm that sorts the argument vector in ascending order is shown below:
.. literalinclude:: src/rsort.fut :lines: 20-33
The function rsort_step
implements the contraction step that takes
care of moving all elements with the bitn
set to the end of the
array. The main function rsort
takes care of iterating the
contraction step until the array is sorted (i.e., when the contraction
step has been executed for all bits.) To appreciate the purpose of
each data-parallel operation in the function, the table below
illustrates how rsort_step
takes care of moving elements with a
particular bit set (bit 1) to the end of the array. The example
assumes the current array (xs
) contains the array
[2,0,6,4,2,1,5,9]
. Notice that the last three values all have
their 0-bit set whereas the first five values have not. The values
of xs
marked with \dagger are the ones with bit 1 set.
Variable | ||||||||
---|---|---|---|---|---|---|---|---|
xs |
2^\dagger | 0 | 2^\dagger | 4 | 2^\dagger | 1 | 5 | 9 |
bits1 |
1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
bits0 |
0 | 1 | 0 | 1 | 0 | 1 | 1 | 1 |
scan (+) 0 bits0 |
0 | 1 | 1 | 2 | 2 | 3 | 4 | 5 |
idxs0 |
0 | 1 | 0 | 2 | 0 | 3 | 4 | 5 |
idxs1 |
1 | 1 | 2 | 2 | 3 | 3 | 3 | 3 |
idxs1' |
6 | 6 | 7 | 7 | 8 | 8 | 8 | 8 |
idxs1'' |
6 | 0 | 7 | 0 | 8 | 0 | 0 | 0 |
idxs |
6 | 1 | 7 | 2 | 8 | 3 | 4 | 5 |
map (-1) idxs |
5 | 0 | 6 | 1 | 7 | 2 | 3 | 4 |
By a straightforward analysis, we can argue that
W(\kw{rsort}~\kw{v})
= O(\kw{n}), where n = \length\,\kw{v}; each of the operations in
has work O(\kw{n}) and rsort_step
is called a
constant number of times (i.e., 32 times). Similarly, we can argue that
S(\kw{rsort}~\kw{v}) = O(\log\,\kw{n}), dominated by the SOAC
calls in rsort_step
.
A variant of a contraction algorithm is an algorithm that first solves a smaller problem, recursively, and then uses this result to provide a solution to the larger problem. One such algorithm is a version of the Sieve of Eratosthenes that, to find the primes smaller than some n, first calculates the primes smaller than \sqrt n. It then uses this intermediate result for sieving away the integers in the range \sqrt n up to n that are multiples of the primes smaller than \sqrt n.
Unfortunately, Futhark does not presently support recursion, thus, one needs to use a \fw{loop} construct instead to implement the sieve. A Futhark program calculating the number of primes below some number n, also denoted in the literature as the \pi function, is shown below:
.. literalinclude:: src/primes.fut :lines: 18-
Notice that the algorithm applies a parallel sieve for each step, using a combination of maps and reductions. The best known sequential algorithm for finding the number of primes below some n takes time O(n\,\log\,\log\,n). Although the present algorithm is quite efficient in practice, it is not work effcient, since the work inside the loop is super-linear. The loop itself introduces a span with a \log\,\log\,n factor because the size of the problem is squared at each step, which is identical to doubling the exponent size at each step (i.e., the sequence 2^2, 2^4, 2^8, 2^{16}, \ldots, n, where n=2^{2^m}, for some positive m, has m = \log\,\log\,n elements.)
In :numref:`primes-by-expansion` we discuss the possibility of using a flattening approach to implement a work-efficient parallel Sieve-of-Erastothenes algorithm.