Lambda calculus with algebraic simplification for reduction parallelisation: Extended study

Abstract Parallel reduction is a major component of parallel programming and widely used for summarisation and aggregation. It is not well understood, however, what sorts of non-trivial summarisations can be implemented as parallel reductions. This paper develops a calculus named λAS, a simply typed lambda calculus with algebraic simplification. This calculus provides a foundation for studying a parallelisation of complex reductions by equational reasoning. Its key feature is δ abstraction. A δ abstraction is observationally equivalent to the standard λ abstraction, but its body is simplified before the arrival of its arguments using algebraic properties such as associativity and commutativity. In addition, the type system of λAS guarantees that simplifications due to δ abstractions do not lead to serious overheads. The usefulness of λAS is demonstrated on examples of developing complex parallel reductions, including those containing more than one reduction operator, loops with conditional jumps, prefix sum patterns and even tree manipulations.


Introduction
Functional programming is commonly regarded as a promising approach to parallel programming. A major reason is the freedom from side effects, which enables evaluation of independent subexpressions in parallel. For example, in the following recursive Fibonacci function: fib n = if n ≤ 1 then 1 else fib (n − 1) + fib (n − 2) it is syntactically clear that the two recursive calls, fib (n − 1) and fib (n − 2), can be evaluated simultaneously. For this reason, functional programming makes parallel programming easy and intuitive.
Another benefit of using functional programs in parallel programming is equational reasoning, which helps certify the correctness of parallel implementations. As an example, consider the following parallel implementation of the fib function in Haskell: fib n = if n ≤ 1 then 1 else par x (pseq y (x + y)) where x = fib (n − 1); y = fib (n − 2). In this program, par requests the evaluation of its first argument, x, in parallel with that of its second argument, and pseq forces the evaluation of its first argument, y, before the evaluation of its second argument. The correctness of this implementation immediately follows from the observational equalities of par and pseq, namely par x y ≡ y and pseq x y ≡ y. Such equational reasoning is useful not only for certification but also for the development of parallel implementations. For example, consider the following usual summation function, sum: sum [] = 0; sum (a : x) = a + sum x. Although this function does not appear to contain independent subexpressions, equational reasoning reveals its potential for parallel evaluation. It is not difficult to generalise the observation above to sum (l ++ r) = sum l + sum r, where ++ denotes the list concatenation operator. That is, sum can process the elements of the first half, l, and the remaining elements, r, in parallel. Such parallel summation is an instance of parallel reduction, also known as parallel summarisation or aggregation. Parallel reductions are used for calculating the total, maximum, average, and other results for huge data. Parallel reductions appear everywhere in real programs and are thus supported by most modern parallel programming environments, including MPI, OpenMP, Intel Threading Building Blocks, MapReduce (Dean & Ghemawat, 2004), Cilk++ (Frigo et al., 2009), Manticore (Fluet et al., 2008), Repa (REgular PArallel arrays) for Haskell (Keller et al., 2010) and Futhark (Henriksen et al., 2017).
Despite the importance and usefulness of parallel reductions, the current support for them is not satisfactory. Existing parallel programming environments support only specific patterns of parallel reductions, typically loops (or singly recursive functions) specified by using an associative operator. To see the problem, consider the following poly function, which calculates the value of a polynomial represented by a list of coefficients. Its formal definition is shown in Figure 1(a): poly x [a_0, a_1, . . . , a_n] = a_0 + a_1 x + · · · + a_n x^n. Although poly is a modest generalisation of sum (note that poly 1 = sum), it does not fit the parallel reduction pattern supported by existing environments because it involves more than one operator (namely, addition and multiplication). In fact, it does not have an immediate divide-and-conquer implementation: there is no operator ⊕ that satisfies poly x (l ++ r) = poly x l ⊕ poly x r. Therefore, its parallel implementation is non-trivial. A known parallel implementation uses the powers of x in addition to the value of poly. More formally, the parallel implementation is specified by the following function: pl x y = (poly x y, x^(length y)). This parallel implementation appears very different from the original poly function. We hope for parallel programming environments to support a wide variety of non-trivial reductions that real programs contain, including those with more than one operator like poly, those using control operators such as break (Figure 1(b)), those with prefix sum patterns that calculate not only the summary but also all intermediate results (Figure 1(c)) and those traversing non-linear structures such as trees (Figure 1(d)).
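The pair-based implementation can be sketched in Haskell. This is a sketch, not the paper's code: the names poly and pl follow the text, while the splitting strategy (halving the list) is an assumption chosen for illustration; the two recursive calls are independent and could be evaluated in parallel.

```haskell
-- Sequential specification: poly x [a0, a1, ..., an] = a0 + a1*x + ... + an*x^n
poly :: Integer -> [Integer] -> Integer
poly x = foldr (\a acc -> a + x * acc) 0

-- pl x y = (poly x y, x ^ length y): pairing the value with the power of x
-- admits a divide-and-conquer combination, because
-- poly x (l ++ r) = poly x l + x^(length l) * poly x r.
pl :: Integer -> [Integer] -> (Integer, Integer)
pl _ []  = (0, 1)
pl x [a] = (a, x)
pl x as  = (pL + eL * pR, eL * eR)
  where (l, r)   = splitAt (length as `div` 2) as
        (pL, eL) = pl x l   -- independent of the next call: parallelisable
        (pR, eR) = pl x r
```

For instance, pl 2 [1,2,3,4] yields the same polynomial value as poly 2 [1,2,3,4], paired with 2^4.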
Although there have been many studies on systematically developing parallel reductions (Fisher & Ghuloum, 1994; Suganuma et al., 1996; Hu et al., 1997, 1998; Chin et al., 1998; Gorlatch, 1999; Xu et al., 2004; Matsuzaki et al., 2005, 2006; Deitz et al., 2006; Morita et al., 2007; Morihata et al., 2009; Morihata & Matsuzaki, 2010; Emoto et al., 2010; Morihata & Matsuzaki, 2011; Sato & Iwasaki, 2011; Chi & Mu, 2011; Emoto et al., 2012; Raychev et al., 2015; Fedyukovich et al., 2017; Farzan & Nicolet, 2017; Jiang et al., 2018; Farzan & Nicolet, 2019), those studies consider only specific forms of reductions, and none of them can uniformly deal with all the kinds of reductions shown in Figure 1.
This paper introduces a calculus named λAS, a simply typed lambda calculus with algebraic simplification. It is designed to provide a foundation for systematically developing a variety of parallel reductions based on equational reasoning. The central idea is to regard a parallel reduction as a simplification of functions using algebraic properties such as associativity and commutativity. For example, consider calculating sum [a_0, . . . , a_n]. The sequential evaluation essentially corresponds to the following expression: a_0 + (· · · (a_{n−1} + (a_n + 0)) · · · ). This is not suitable for parallel evaluation because no independent subexpressions exist. It can be divided into a function and an argument, however, by inserting a lambda abstraction. Then, effective parallel evaluation is possible if the function part can be evaluated during evaluation of the argument. For this example, the function part can be simplified using the associativity of (+): a_0 + (· · · (a_{n−1} + (a_n + 0)) · · · ) = { inserting a lambda abstraction } (λx. a_0 + (· · · (a_{k−1} + (a_k + x)) · · · )) (a_{k+1} + (· · · (a_{n−1} + (a_n + 0)) · · · )) ⇒ { parallel evaluation } (λx. a_0^k + x) a_{k+1}^n, where a_0^k = Σ_{0≤i≤k} a_i and a_{k+1}^n = Σ_{k+1≤i≤n} a_i. This understanding of parallel reduction is not new. It has been used for developing parallel reduction loops (Callahan, 1992; Fisher & Ghuloum, 1994; Sato & Iwasaki, 2011; Raychev et al., 2015; Farzan & Nicolet, 2017; Jiang et al., 2018), parallel list/tree reductions (Chin et al., 1998; Xu et al., 2004; Matsuzaki et al., 2005, 2006; Morihata & Matsuzaki, 2010) and parallel querying on semi-structured databases (Buneman et al., 2006; Cong et al., 2007, 2012). Here, λAS integrates this idea into lambda calculi. λAS is a simply typed lambda calculus extended with a special abstraction syntax, namely, δ abstraction. In λAS, a lambda-abstracted term, λx.
e, is a value; in other words, the body e is not evaluated until its argument is passed. A δ-abstracted term, δx. e, is not a value, however, and its body e is simplified using algebraic properties before the arrival of its argument; such a term is thus immediately evaluated to its simplified form. Note that this evaluation may be performed at the same time as the evaluation of the argument, so applying a δ abstraction to an unevaluated argument has potential for parallel evaluation. It is non-trivial to provide a good strategy for simplifying complex expressions. For example, δx_1. δx_2. 8 × ((−1) × x_1 + x_2) + 5 × (x_1 × 3 + x_2 × (−2)) can be simplified to λx_1. λx_2. 7 × x_1 − 2 × x_2 by distributing × over +, whereas δx. x^3 + 3 × x^2 + 3 × x + 1 can be simplified to λx. (x + 1)^3 by factorisation. Even worse, an inappropriate simplification strategy may significantly decrease efficiency. For instance, δx_1. δx_2. · · · δx_n. (1 + x_1) × (1 + x_2) × · · · × (1 + x_n) may be 'simplified', by distributing × over +, to an exponentially large expression: the sum of all 2^n products over subsets of {x_1, . . . , x_n}. To provide a simple and effective simplification strategy, λAS requires that simplifications must result in linear polynomials. For example, δx. δy. x × y gets stuck because its body contains a product of x and y. This linearity requirement is somewhat restrictive but beneficial in several respects. First, simplifications can be easily achieved by distributing × over + and then merging terms that have a common variable. Second, the result of the simplification is commonly small because the size of a linear polynomial is at most proportional to the number of variables. Third, several studies (Xu et al., 2004; Matsuzaki et al., 2006; Emoto et al., 2010; Sato & Iwasaki, 2011; Emoto et al., 2012) pointed out that linear polynomials are expressive enough to capture a wide variety of parallel reductions.
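The normal form that the simplification targets can be modelled concretely. The following is a minimal sketch, assuming integer coefficients; the names Poly, con, var, add, scale and mul are hypothetical, and a stuck multiplication of two indeterminate-containing polynomials is modelled by Nothing.

```haskell
import qualified Data.Map as M

-- A linear polynomial c0 + c1*x1 + ... + cm*xm, kept in the normal form
-- simplification produces: a constant plus one coefficient per variable.
data Poly = Poly Integer (M.Map String Integer) deriving (Eq, Show)

con :: Integer -> Poly
con c = Poly c M.empty

var :: String -> Poly
var x = Poly 0 (M.singleton x 1)

-- Addition merges terms with a common variable.
add :: Poly -> Poly -> Poly
add (Poly c1 m1) (Poly c2 m2) = Poly (c1 + c2) (M.unionWith (+) m1 m2)

scale :: Integer -> Poly -> Poly
scale k (Poly c m) = Poly (k * c) (M.map (k *) m)

-- Multiplication preserves linearity only when one operand is a constant;
-- a product of two indeterminate-containing polynomials gets stuck,
-- which is exactly what the type system is designed to rule out.
mul :: Poly -> Poly -> Maybe Poly
mul (Poly c m) p | M.null m = Just (scale c p)
mul p (Poly c m) | M.null m = Just (scale c p)
mul _ _                     = Nothing
```

With this representation, 8 × (−x1 + x2) + 5 × (3 × x1 − 2 × x2) normalises to 7 × x1 − 2 × x2, while x × y has no linear normal form.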
Therefore, the linearity requirement can be regarded as a guideline for developing efficient parallel reductions by introducing δ abstractions. To support such development, λAS has a type system that checks the linearity requirement.
Formalising a new lambda calculus, λAS, should be an important step in developing a powerful reduction parallelisation method for practical programming languages. The existing studies on parallel reduction suggest the following hypothesis: parallel reductions rely on the algebraic properties and simplifications of the operators used, and are nearly independent of control structures or programming patterns. If this hypothesis is correct, it could be a valuable clue to a uniform approach for dealing with various language features and programming patterns used in practical programs. Typed lambda calculi are perfectly suitable for confirming this hypothesis: control structures can be encoded by higher-order expressions, whereas operators (for base-type values) are clearly distinguished from higher-order features.
This paper contains the following three major contributions:
• Systematic development of a wide variety of parallel reductions using λAS (Section 2): the paper discusses reduction patterns including all examples in Figure 1, and others as well.
• Design of λAS, a lambda calculus with algebraic simplification (Section 3): the type system of λAS guarantees progress, that is, the effectiveness of simplifications. Its operational semantics shows that any typed λAS term is observationally equivalent to the corresponding term of the simply typed lambda calculus.
• Extensive studies for strengthening λAS (Section 4): in particular, the paper discusses the possibilities of combining λAS with the fixed-point operator, algebraic structures other than a commutative semiring and control operators.

Developing complex parallel reductions by λAS
This section informally introduces λAS and demonstrates its effectiveness through examples. Figure 2 lists standard functions used in this section. Later, Section 3 develops the formalism.

Flavour of λAS
The following is the syntax of λAS. The type, τ, is the same as that of the simply typed lambda calculus.
λAS extends the simply typed lambda calculus with the semiring operators, ⊕ and ⊗, on the carrier set R (c ∈ R), and a δ abstraction, δx^R. e. Other features, such as conditionals, algebraic datatypes, and recursion, can be added if they are consistent with lambda calculi and do not manipulate semiring values of type R. In the following, such additional features are used where necessary and expressed in the syntax of Haskell.
A semiring abstracts the cooperation of two related operations such as addition and multiplication. For the time being, we consider the (commutative) semiring of addition and multiplication on integers, that is, (⊕) = (+), (⊗) = ×, and R = Z. Other semirings are introduced as needed.
The operational semantics of λAS is the standard call-by-value reduction except for δ abstractions. On one hand, a δ abstraction is observationally equivalent to a lambda abstraction, that is, δx^Z. e ≡ λx^Z. e. Here, two terms are said to be observationally equivalent if they are reduced to the same value under any surrounding context of the base type. On the other hand, the body of a δ abstraction, namely e in δx^Z. e, is evaluated before the argument, x, is specified. The δ-abstracted variable should have the semiring type, R. The type annotations for variables may be omitted if they are apparent from the context.
For example, as discussed in the introduction, once a δ abstraction has been inserted into a summation, the function and the argument can be evaluated in parallel. In the following, → denotes a reduction step (or possibly a series of them), and ⇒ is used instead to emphasise possibilities for parallel evaluation. To simplify the function part without knowing the value of the argument, an evaluation in λAS involves variables that are not yet bound to values. We call such variables indeterminates. λAS simplifies polynomials over indeterminates using the algebraic properties of the semiring.
So long as a δ abstraction is not involved, the evaluation is carried out as the usual call-by-value reduction. For example, a lambda-abstracted term is not evaluated any further unless the argument is passed, whereas a δ-abstracted term is not a value and is evaluated until its body is simplified. To provide a simple and effective simplification strategy for polynomials, λAS requires that an evaluation inside a δ abstraction must result in a linear polynomial over indeterminates. For example, δx. δy. x × y gets stuck because the body is non-linear: it involves a multiplication of the indeterminates x and y.
Note that the linearity is not a syntactic but a semantic requirement. For example, λx. δy. x × y does not get stuck if a constant (i.e., a value that contains no indeterminate) is supplied as the argument; however, it does get stuck if the argument contains indeterminates. λAS is associated with a type system that guarantees progress of computation. In other words, the type system of λAS rejects terms that may involve a multiplication of indeterminates. The rest of this section considers only typeable terms, which cause neither non-termination nor errors.
λAS is slightly more expressive. For instance, consider a term that binds x by let and then computes x + x × (7 + 3). As x + x × (7 + 3) depends on x, its parallel evaluation appears to be impossible. In λAS, by replacing the first let with plet, defined by plet x = e_1 in e_2 ≡ (δx. e_2) e_1, a parallel evaluation becomes possible: the δ-abstracted body simplifies to a linear polynomial (here, x × 11) during the evaluation of the bound expression. As seen in this example, the introduction of plet, or equivalently a δ abstraction, enables parallel evaluation regardless of data dependency. To see the effectiveness of plet, consider the following sequence of let expressions: let x_0 = a_0 in let x_1 = a_1 + x_0 in let · · · in let x_n = a_n + x_{n−1} in x_n. This program cannot gain any parallel speedup even using plet instead of let, because each right-hand-side expression cannot be simplified further. Nevertheless, by inserting plet, it can be transformed into an equivalent program that is more suitable for parallel evaluation: plet z = (let x_0 = a_0 in let x_1 = a_1 + x_0 in let · · · in let x_k = a_k + x_{k−1} in x_k) in (let x_{k+1} = a_{k+1} + z in let · · · in let x_n = a_n + x_{n−1} in x_n) ≡ (δz. (λx_{k+1}. (λx_{k+2}. (· · · (λx_n. x_n) · · · )) (a_{k+2} + x_{k+1})) (a_{k+1} + z)) ((λx_0. (λx_1. (· · · (λx_k. x_k) · · · )) (a_1 + x_0)) a_0) ⇒ (λx. a_{k+1}^n + x) a_0^k, where a_0^k = Σ_{0≤i≤k} a_i and a_{k+1}^n = Σ_{k+1≤i≤n} a_i. The introduction of plet thus breaks the data dependency and yields two terms that can be evaluated in parallel.

Parallel list reduction
Now let us consider parallel list reductions.
Example 1 (Summation). We start with the simplest example, sum. Given lists l and r, the goal is to calculate sum (l ++ r) by processing l and r independently. This can be achieved by inserting a δ abstraction: sum (l ++ r) = { let l = [a_0, a_1, . . . , a_m] } a_0 + (a_1 + (· · · (a_m + sum r) · · · )) = { introducing a δ abstraction } (δx. a_0 + (a_1 + (· · · (a_m + x) · · · ))) (sum r) = { introducing foldr } (δx. foldr (+) x l) (sum r). Since the left-hand-side expression contains l ++ r, which is not a pattern, the derived equation is not a valid function definition in Haskell. Yet, this equational reasoning suggests an implementation that processes l and r in parallel and guarantees the correctness of the implementation regardless of the strategy for splitting the given list into two sublists, l and r. Moreover, the function part, δx. foldr (+) x l, can be effectively simplified to a linear polynomial of the form a + x, where a is a constant and x is an indeterminate.
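The derived split can be transcribed directly into Haskell. This is a sketch: the two-argument sumSplit stands in for the non-pattern equation on l ++ r, and the plain lambda plays the role of the δ abstraction (whose body would be simplified to a + x).

```haskell
-- Sequential specification
sum' :: [Int] -> Int
sum' = foldr (+) 0

-- sum (l ++ r) = (δx. foldr (+) x l) (sum r): the function part and the
-- argument are independent and could be evaluated in parallel.
sumSplit :: [Int] -> [Int] -> Int
sumSplit l r = (\x -> foldr (+) x l) (sum' r)
```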
Example 2 (Polynomial). The parallel implementation of poly calculates different kinds of results for l and r. While the result for l is a linear polynomial that consists of two coefficients, the result for r contains only one value. This is not problematic. As the following equation shows, any number of independent sublists can be processed by this implementation, in which • denotes function composition: poly x (y_0 ++ y_1 ++ · · · ++ y_n) = ((δz. poly′ x z y_0) • (δz. poly′ x z y_1) • · · · • (δz. poly′ x z y_n)) (poly x [ ]), where poly′ x e w = foldr (λa. λy. a + x × y) e w. In the above expression, all applications of poly′ x can be evaluated in parallel.
Example 3 (Maximum Prefix Sum). Given a list of numbers, maximum prefix sum (Hu et al., 1997; Morita et al., 2007) is the problem of finding the largest among the summations of prefixes of the list. For example, the maximum prefix sum of [5, −2, 1, 6, −7, 3] is 5 + (−2) + 1 + 6 = 10. The function to compute the maximum prefix sum, mps, is defined as follows, where ↑ is the binary maximum operator: mps [ ] = 0; mps (a : x) = 0 ↑ (a + mps x). Note that ↑ and + form a semiring on Z ∪ {−∞}, where ↑ is the addition and + is the multiplication. Therefore, exactly the same process as in the case of poly gives the following parallel implementation of mps: mps (l ++ r) = (δz. foldr (λa. λy. 0 ↑ (a + y)) z l) (mps r).
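The derived equation can be checked concretely in Haskell; mpsSplit is a hypothetical name for the split form, and max plays the role of ↑.

```haskell
-- mps: the largest sum over the (possibly empty) prefixes of the list
mps :: [Int] -> Int
mps []      = 0
mps (a : x) = max 0 (a + mps x)   -- 0 ↑ (a + mps x)

-- mps (l ++ r) = (δz. foldr (λa. λy. 0 ↑ (a + y)) z l) (mps r)
mpsSplit :: [Int] -> [Int] -> Int
mpsSplit l r = foldr (\a y -> max 0 (a + y)) (mps r) l
```

On the example above, mpsSplit [5,-2,1] [6,-7,3] agrees with mps [5,-2,1,6,-7,3], both yielding 10.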

Conditional
Conditional branches often interfere with parallelisation. λAS provides a guideline for parallelising programs with conditionals.
As an example, consider the following sumP, which calculates the summation of all positive elements. A reasoning very similar to the case of sum leads to a divide-and-conquer implementation. This divide-and-conquer implementation passes the typechecking of λAS, which means that the conditional expression is harmless. The comparison operator, <, does not access polynomials, in particular, the indeterminate generated by the δ abstraction; therefore, the conditional does not interfere with the algebraic simplification. Note that this situation is different from that of the following mps′, which is equivalent to the mps discussed in Section 2.3 but uses a conditional branch instead of the binary maximum operator: if we try to derive a divide-and-conquer implementation of mps′, the comparison operator, >, will compare polynomials that may contain indeterminates. This is not allowed because it is impossible to effectively simplify expressions that contain comparisons between polynomials. Therefore, for parallelising mps′, we should replace the conditional branch by a semiring operator. Fortunately, in this case, the conditional can be replaced by the binary maximum operator, which forms a semiring with +. In this way, λAS enables us to distinguish harmful conditionals from harmless ones and thereby provides a guideline for reduction parallelisation.
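A sketch of the harmless case, under an assumed definition of sumP (the original definition is in the paper's figures); the comparison inspects only list elements, never the indeterminate, so the sum-style split still applies.

```haskell
-- Summation of the positive elements (assumed definition)
sumP :: [Int] -> Int
sumP = foldr (\a y -> if 0 < a then a + y else y) 0

-- The condition 0 < a never touches the accumulating polynomial,
-- so the split mirrors the one for sum:
sumPSplit :: [Int] -> [Int] -> Int
sumPSplit l r = foldr (\a y -> if 0 < a then a + y else y) (sumP r) l
```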

Loop
As the proposed approach is not specific to foldr, we next consider loops.
Example 4 (Summation by Loop). The first example is sumL, which calculates the summation by a loop, that is, by a left fold. The function can be reasoned about in the same way as before. As in the case of sum, the function part and the argument can be evaluated in parallel because the function part, δx. foldl (+) x r, can be simplified to a linear polynomial of the form x + a, where a is a constant and x is an indeterminate.
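A sketch in Haskell, assuming sumL is the usual left-fold summation:

```haskell
sumL :: [Int] -> Int
sumL = foldl (+) 0

-- foldl (+) e (l ++ r) = foldl (+) (foldl (+) e l) r, so the function
-- part (δx. foldl (+) x r) simplifies to the linear polynomial x + a:
sumLSplit :: [Int] -> [Int] -> Int
sumLSplit l r = (\x -> foldl (+) x r) (sumL l)
```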
Because of the associativity and commutativity of the addition, sumL is observationally equivalent to sum. Though this equivalence enables us to parallelise sumL, we did not use it because the use of case-specific properties makes the reasoning less scalable. Next, we will demonstrate that the same approach can deal with a more complicated example with jumps.
Example 5 (Loop with Jump). Consider sumB1 in Figure 1(b), which sums up all elements until encountering a negative element. This is a typical example of a loop with a break statement. The presence of the break causes the computation result to depend on the order of the elements. Nevertheless, its parallelisation is possible. Equational reasoning derives a term in which the left sublist, l, and the right sublist, r, are contained in independent subexpressions. Unfortunately, the function part cannot be simplified because it uses not a δ abstraction but a lambda abstraction; moreover, because the abstraction binds a function, it cannot be replaced by a δ abstraction. In fact, however, this problem is not essential. The lambda-abstracted function, f, is used last; therefore, it is sufficient to take the function after finishing the list processing. Then, the left sublist is processed by a variant, sumB1′, during processing of the right sublist, and the resulting parallel implementation works correctly. We should distinguish the case of sumB1 from the following case of sumB2, in which the computation terminates if the calculated value becomes negative. As in the case of mps′, λAS does not allow this program because the semiring value stored in e cannot be manipulated by <. Moreover, the conditional branch can be replaced by neither the maximum nor the minimum operator. In fact, sumB2 cannot be parallelised by a simple divide-and-conquer approach as in the case of sumB1.
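The key observation can be sketched as follows. The definition of sumB1 is assumed from the description of Figure 1(b), and sumB1L is a hypothetical name for the left-sublist pass; the left result is the linear polynomial s + c·x, where the coefficient c records whether the break occurred.

```haskell
-- Sums elements until a negative element is encountered
-- (assumed definition, following the description of Figure 1(b))
sumB1 :: [Int] -> Int
sumB1 []      = 0
sumB1 (a : x) = if a < 0 then 0 else a + sumB1 x

-- Processing the left sublist yields the linear polynomial s + c*x,
-- where c ∈ {0, 1} records whether a break occurred in the left part.
sumB1L :: [Int] -> (Int, Int)
sumB1L []      = (0, 1)
sumB1L (a : x) = if a < 0 then (0, 0)
                 else let (s, c) = sumB1L x in (a + s, c)

-- sumB1 (l ++ r) = s + c * sumB1 r, where (s, c) = sumB1L l;
-- sumB1L l and sumB1 r are independent and could run in parallel.
sumB1Split :: [Int] -> [Int] -> Int
sumB1Split l r = let (s, c) = sumB1L l in s + c * sumB1 r
```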
There is another known parallel prefix sum algorithm, which delays the prefix sum computation until the total sum is propagated. It is fairly easy to check that the program expressing this algorithm is observationally equivalent to the previous one: once we regard δ abstractions as λ abstractions, standard reasoning on a lambda calculus shows their equivalence. Nevertheless, this implementation shows a different behaviour from that of the previous one on, for example, psum [3,1,7,5,2,6,9,4]. Although rather complicated, this implementation also consists of three major steps. First, the summation is calculated for every sublist. Then, the calculated value is propagated globally by resolving the lambda abstraction. Finally, the prefix sum computation is applied to each sublist.
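The three steps can be sketched over an explicit chunking of the input; psumChunks is a hypothetical name, and the chunking itself stands in for the sublist structure in the text.

```haskell
-- Chunked prefix sum in three steps: (1) sum each chunk, (2) scan the
-- chunk sums to obtain per-chunk offsets, (3) scan each chunk locally.
-- Steps 1 and 3 are independent per chunk and could run in parallel.
psumChunks :: [[Int]] -> [Int]
psumChunks chunks = concat (zipWith local offsets chunks)
  where
    offsets   = scanl (+) 0 (map sum chunks)  -- global propagation
    local o c = tail (scanl (+) o c)          -- per-chunk prefix sums
```

For instance, psumChunks [[3,1],[7,5],[2,6],[9,4]] agrees with the sequential prefix sums of [3,1,7,5,2,6,9,4].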
The development for psum is not specific to summation. The same approach can be applied to any computation that records the intermediate results of a reduction expressed by a linear expression over a semiring. For example, similar algorithms can calculate all partial summations of a power series.
The discussions to this point demonstrate typical use scenarios for λAS. Several parallel implementations of reduction-related computations are developed by introducing δ abstractions, and their correctness is easily checked by equational reasoning. In this way, λAS supports reduction parallelisation rather than providing parallel reductions as a primitive parallel computation pattern.

Beyond list processing
Existing reduction parallelisation methods mainly consider list/array processing. By virtue of the expressiveness of λAS, it can also deal with programs that process data structures other than lists.
Example 6 (Bottom-Up Tree Processing). It appears straightforward to evaluate bottom-up tree processing in parallel. For example, the following sumTree can process independent subtrees in parallel: sumTree (Nd n l r) = n + sumTree l + sumTree r; sumTree (Lf n) = n. This kind of bottom-up approach has been commonly used for parallel tree processing.
While it is suitable for processing balanced trees, it cannot achieve sufficient parallel speedup if the input is a list-like tall tree. This limitation is not insignificant because practical tree structures, such as XML data and syntax trees, are very often list-like. λAS enables a bold approach. Given a tree t, consider dividing t in the middle such that t = c[t′], where t′ is a subtree of t, c is a tree context that has a unique 'hole' denoted by •, and c[t′] denotes the tree obtained by substituting t′ for the hole in c. The goal is to develop a function sumTree′ that satisfies sumTree (c[t′]) = (δx. sumTree′ c x) (sumTree t′). Equational reasoning easily leads to the definition of sumTree′; in the following, we use c[•] instead of c to express that c is not a tree but a context, and the remaining cases are similar. Because the computation of sumTree′ consists of additions, (δx. sumTree′ c x) can be computed independently of sumTree t′.
The definition of sumTree′ shows the possibility of processing independent subtrees in parallel, and subtrees can be recursively divided. Moreover, as further reasoning shows, even contexts can be recursively divided. Accordingly, this approach of dividing a tree into a context and a subtree can lead to substructures of similar sizes, thereby providing good load balancing even for list-like trees.
Note on tree division strategy. The approach, namely dividing a tree into a subtree and a context, is influenced by the theory of parallel tree contraction (Reid-Miller et al., 1993; Morihata et al., 2009; Morihata & Matsuzaki, 2011), which enables efficient parallel tree reduction regardless of the tree shape. It is a generalisation of the divide-and-conquer approach for parallel list processing. If x is a list, then x = c[x′] is equivalent to x = c ++ x′. Besides, the approach subsumes the bottom-up tree processing, which divides Nd n l r into Nd n [•] r and l.
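The division into a context and a subtree can be sketched as follows. The Tree constructors follow the text; the one-hole context type Ctx and the names plug and sumTree' are assumptions for illustration.

```haskell
data Tree = Lf Int | Nd Int Tree Tree

-- One-hole contexts: the hole replaces exactly one subtree.
data Ctx = Hole | NdL Int Ctx Tree | NdR Int Tree Ctx

sumTree :: Tree -> Int
sumTree (Lf n)     = n
sumTree (Nd n l r) = n + sumTree l + sumTree r

plug :: Ctx -> Tree -> Tree   -- plug c t realises c[t]
plug Hole        t = t
plug (NdL n c r) t = Nd n (plug c t) r
plug (NdR n l c) t = Nd n l (plug c t)

-- sumTree' c plays the role of (δx. ...): it simplifies to a + x,
-- where a is the sum of the labels occurring in the context c.
sumTree' :: Ctx -> Int -> Int
sumTree' Hole        x = x
sumTree' (NdL n c r) x = n + sumTree' c x + sumTree r
sumTree' (NdR n l c) x = n + sumTree l + sumTree' c x
```

By construction, sumTree (plug c t) equals sumTree' c (sumTree t), so the context and the subtree can be processed independently.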
As in the case of parallel list processing, the equations about sumTree′ developed so far do not correspond to valid function definitions in Haskell. Instead, they show the correctness of any parallel tree reduction in which each task corresponds to a subtree or a (one-hole) context. In other words, they correspond to several concrete implementations, including the reduction to parallel list processing (Reid-Miller et al., 1993) and the transformation to balanced tree processing (Morihata & Matsuzaki, 2011).
The approach is somewhat similar to lazy tree splitting (Bergstrom et al., 2012), which generates tasks in a by-need manner during tree traversal expressed by Huet's zippers (Huet, 1997). However, each task generated by lazy tree splitting corresponds to a subtree; hence, the lazy tree splitting approach cannot achieve sufficient parallel speed-up for list-like trees.
Example 7 (Complex Tree Processing with Accumulations). The next example is a more complex case of tree processing: rd, shown in Figure 1(d). The program expresses a reaching definition analysis of a simple imperative program. Assign v e, Seq s_1 s_2, If e s_1 s_2 and While e s, respectively, denote an assignment statement like v := e, a sequential statement like s_1; s_2, a conditional statement like if (e) s_1 else s_2 and a loop statement like while (e) s. The function remove v y removes definitions of variable v from the set of definitions, y.
The program of rd appears unsuitable for parallel processing. In the case of Seq, the computation of a subtree depends on the result of another subtree via an accumulation parameter. In λAS, however, the dependency can be broken by introducing a δ abstraction: rd (Seq s_1 s_2) y = (δy. rd s_2 y) (rd s_1 y).
Therefore, the dependency is not problematic if the computation of rd involves only semiring operators. Consider a semiring whose carrier is the set of bit vectors such that each bit corresponds to a variable in the program, and whose operators are the bitwise logical OR operator ∨ and the bitwise logical AND operator ∧. Then, the computation of rd can be expressed via this semiring: ∪ and remove v y can be regarded as ∨ and ¬v ∧ y, respectively, where ¬v is a bit vector with each bit set to 1 except for the bit corresponding to v. We regard ∨ and ∧ as the addition and multiplication, respectively; then, rd does not involve a non-linear multiplication because the left operand of a multiplication is always a constant, ¬v. In summary, the introduction of δ abstractions is safe and leads to parallel evaluation of rd.
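The affine structure over the (∨, ∧) semiring can be sketched with bit vectors; Transfer, apply and compose are hypothetical names, and the gen/mask encoding is an assumption consistent with the ¬v ∧ y reading above.

```haskell
import Data.Bits ((.&.), (.|.), complement)

type Defs = Int   -- a set of definitions as a bit vector

-- Each statement denotes the affine map y ↦ gen ∨ (mask ∧ y) over the
-- (∨, ∧) semiring; e.g. an assignment contributes its own definition in
-- gen and the mask ¬v that kills the variable's previous definitions.
data Transfer = Transfer { gen :: Defs, mask :: Defs }

apply :: Transfer -> Defs -> Defs
apply (Transfer g m) y = g .|. (m .&. y)

-- Seq s1 s2: the composite map is affine again, so the transfer
-- functions of s1 and s2 can be computed independently and combined.
compose :: Transfer -> Transfer -> Transfer
compose (Transfer g2 m2) (Transfer g1 m1) =
  Transfer (g2 .|. (m2 .&. g1)) (m2 .&. m1)
```

Because composition is again a pair of bit vectors, the simplified result stays linear, mirroring the guarantee the type system enforces.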
As in the case of sumTree, dividing a syntax tree into a context and a subtree may improve load balancing. The situation here is more difficult, however, than that of sumTree.
The reasoning to this point leads to an equation for the context case; the other cases can be dealt with similarly. It is not easy to understand the behaviour of the resulting implementation. Nevertheless, the equational reasoning certifies its correctness; moreover, the type system guarantees the linearity of polynomials and thereby the efficiency of its parallel evaluation.
Example 8 (Recurrence Equation). As a final example, consider a purely numerical computation: calculating a numerical sequence defined by a recurrence equation that generalises the calculation of the Fibonacci numbers. A well-known program provides a linear-time implementation; λAS can then be used for developing a divide-and-conquer implementation. The computational cost of the obtained recursive program is O(log n). This example shows the possibility of using λAS beyond parallel processing.
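The divide-and-conquer idea can be sketched with the standard matrix formulation. This is an illustration under the assumption that the recurrence has the form x(n+1) = a·x(n) + b·x(n−1); the names M2, mmul, mpow and fib are hypothetical.

```haskell
-- x(n+1) = a*x(n) + b*x(n-1): one step of the recurrence is the linear
-- map given by the 2x2 matrix [[a, b], [1, 0]]; n steps are its n-th
-- power, computed with O(log n) multiplications by repeated squaring.
type M2 = (Integer, Integer, Integer, Integer)   -- row-major 2x2 matrix

mmul :: M2 -> M2 -> M2
mmul (a, b, c, d) (e, f, g, h) =
  (a*e + b*g, a*f + b*h, c*e + d*g, c*f + d*h)

mpow :: M2 -> Int -> M2
mpow _ 0 = (1, 0, 0, 1)                          -- identity matrix
mpow m n
  | even n    = let h = mpow m (n `div` 2) in mmul h h
  | otherwise = mmul m (mpow m (n - 1))

-- Fibonacci (fib 0 = fib 1 = 1, as in the introduction) is the special
-- case a = b = 1: the top-left entry of [[1,1],[1,0]]^n is fib n.
fib :: Int -> Integer
fib n = let (x, _, _, _) = mpow (1, 1, 1, 0) n in x
```

The two halves of each squaring step are independent, which is where the divide-and-conquer parallelism (and the O(log n) sequential cost) comes from.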
For a semiring R = (S, ⊕, ⊗, 0, 1), we may use R and S interchangeably if the meaning is apparent from the context. For example, we may write s ∈ R, that is, 's is an element of R', instead of s ∈ S.
Given a set of indeterminates X (X should be disjoint from R) and a semiring R = (S, ⊕, ⊗, 0, 1), a polynomial of the form c_0 ⊕ (c_1 ⊗ x_1) ⊕ (c_2 ⊗ x_2) ⊕ · · · ⊕ (c_m ⊗ x_m), where c_0, c_1, . . . , c_m ∈ S and x_1, x_2, . . . , x_m ∈ X, is called a left-linear polynomial over (R, X). We may omit R and X if they are clear from the context. Similarly, a polynomial of the form c_0 ⊕ (x_1 ⊗ c_1) ⊕ (x_2 ⊗ c_2) ⊕ · · · ⊕ (x_m ⊗ c_m) is called a right-linear polynomial. When ⊗ is commutative, left- and right-linear polynomials coincide and are called linear polynomials.

Syntax and operational semantics
For simplicity, this section considers λAS defined by the following syntax; Section 4 discusses further extensions. A metavariable x is used to denote a variable (or indeterminate). R denotes the underlying semiring, and c is a value in R. Each base type, R, is annotated by either P (polynomial) or C (constant); Section 3.3 explains the meanings of these annotations. The reduction rules for the semiring operators evaluate both operands, possibly in parallel: e_1 ⊕ e_2 → e_1′ ⊕ e_2′ if e_1 → e_1′ and e_2 → e_2′, or if e_i′ = e_i and e_j → e_j′ (i, j ∈ {1, 2}, i ≠ j); the rule for ⊗ is analogous. Values in λAS are defined as follows. For now, ⊗ is assumed to be commutative, and thus only linear polynomials are considered; Section 4.1 extends the theory developed here to non-commutative semirings. Values are functions (λ abstractions) and linear polynomials. Constants are special cases of linear polynomials. Note that δ abstractions are not values.
The operational semantics is defined by the set of reduction rules shown in Figure 3, in which e[v/x] denotes the capture-avoiding substitution of v for x in e. The first four rules are the same as those of the usual call-by-value simply typed lambda calculus. The fifth and sixth rules simplify the body of a δ abstraction. A δ abstraction becomes a λ abstraction once the body is completely simplified. Linear polynomials are simplified according to the algebraic properties of the semiring. For simplicity, we assume that every linear polynomial contains the same set of indeterminates. Because an indeterminate can be introduced to a polynomial by associating it with a zero coefficient, this assumption is not restrictive. To keep linearity, at least one operand of a multiplication must be a constant. Figure 4 shows the typing rules of λAS. An environment Γ maps a variable to its type. Γ{x : τ} denotes the extension of Γ by a binding x : τ, that is, (Γ{x : τ})(x) = τ and (Γ{x : τ})(y) = Γ(y) if x ≠ y. A λAS term e is said to be typeable if there exist an environment Γ and a type τ such that Γ ⊢ e : τ. The typing rules contain two key differences from those of the simply typed lambda calculus. First, each base type is annotated by either P or C. In the rules, a metavariable α is used to denote P or C. A term of type R^C should be reduced to a constant that contains no indeterminate. The annotations are used for guaranteeing the safety of multiplication, in which the operands must contain a constant. Second, a special rule is prepared for δx^R. e. Because e is to be simplified before the argument is passed, e should be typeable even if x is an indeterminate and therefore of type R^P. Moreover, because δx^R. e is regarded as a usual function after the simplification of e, δx^R. e should have the same type as λx^R. e. Accordingly, the body, e, is typechecked twice. Note that the following simpler rule is safe but too restrictive:

Type system
{x : R P } e : τ δx R . e : R P → τ This rule regards nearly every δ-abstracted function as returning non-constants. For instance, it infers ∅ (δx R . x) : R P → R P and thus rejects apparently safe terms such as (δx R . x) 1 × (δx R . x) 1. Except for these two differences, the typing rules of λ as are the same as those of the simply typed lambda calculus. A λ as term containing no δ abstraction is typeable if and only if it is typeable in the simply typed lambda calculus.
The following discussion considers only typeable λAS terms.

Properties
λAS can be regarded as a call-by-value simply typed lambda calculus extended with a speculative simplification. In the following, we show that the speculative simplification of λAS is not problematic. The formalisation uses the notion of contexts. A context of λAS is a λAS term that contains exactly one occurrence of a special variable, •. The following gives the definition: Here, C[e] denotes the λAS term obtained by substituting e for • in C.
First, evaluation in λAS is terminating. Let →* be the reflexive transitive closure of →.
Theorem 1. For any λAS term e, there exists e′ such that e →* e′ and no e″ satisfies e′ → e″.
Proof. Consider another reduction relation, ⇒, defined as follows:

• e → e′ implies e ⇒ e′.
• e ⇒ e′ implies C[e] ⇒ C[e′] for any context C.

Because ⇒ enables more reductions than →, the termination of reductions by → follows from that of ⇒.

Note that ⇒ expresses β reductions combined with algebraic simplifications based on the semiring properties, and the algebraic simplification can be regarded as a convergent (i.e., confluent and terminating) rewriting system. Because a rewriting system obtained by combining convergent rules with the β reductions of the simply typed lambda calculus is convergent (Tannen, 1988; Okada, 1989; Tannen & Gallier, 1991), ⇒ is terminating.
Theorem 2 only considers successful evaluations that yield values. This is not a serious limitation as long as only typeable terms are considered. As Theorem 1 shows, evaluations of λAS terms terminate. Moreover, the evaluation of a typeable term never gets stuck, as we now prove.
First, the following lemma shows the relationship between types and the results of evaluations. Let fvs(e) denote the set of all free variables in e, that is, variables not bound by any λ abstraction or δ abstraction. We assume that every free variable is an indeterminate and thus has type R^P, because free variables other than indeterminates cannot be values in λAS.

Lemma 3. Assume that Γ ⊢ v : τ and that Γ(x) = R^P for any x ∈ fvs(v); then, the following equations hold.

Proof. The proof follows immediately from the typing rules of λAS.
Next, the following two theorems show that, for any typeable λAS term (without non-indeterminate free variables), its evaluation does not get stuck.

Theorem 4 (preservation). If Γ ⊢ e : τ and e → e′, then Γ ⊢ e′ : τ.

Proof. The proof follows straightforwardly from a case analysis over the rules of →. The only non-trivial case is the beta reduction, (λx^τ. e) v → e[v/x], which can be proved by a straightforward induction over the structure of e.
Theorem 5. If Γ ⊢ e : τ and Γ(x) = R^P for any x ∈ fvs(e), then there exists e′ such that e → e′ unless e is a value.
Proof. The proof uses an induction over the structure of e. Every case is easily proved using Lemma 3. Note that it is safe to regard any x ∈ fvs(e) as an indeterminate of a polynomial because Γ(x) = R^P.
Corollary 6. For any λAS term e and λAS context C such that ∅ ⊢ C[δx^R. e] : R^C, there exists c ∈ R such that C[δx^R. e] →* c and C[λx^R. e] →* c.

Encoding by Hindley-Milner typing
The type system of λAS is not satisfactory from a practical perspective: it is difficult to assign R^P and R^C appropriately.
As a simple example, consider dbl = λx. x + x (type annotations are omitted because no single annotation is appropriate). The possible types of dbl are Z^P → Z^P and Z^C → Z^C. In fact, both are inappropriate: if dbl has the former type, δx. (dbl 1) × x cannot be typechecked; if it has the latter, δx. dbl x cannot.
This problem is not specific to dbl. Most expressions can calculate both polynomials and constants, depending on the inputs (or free variables) passed, but this observation is not expressed in the type system. Instead, the type system blindly generates all possibilities and tries to find a consistent assignment. In particular, the rule for multiplication requires examining two possibilities, and the rule for a δ abstraction requires typechecking the body twice.
A possible solution to this problem is to introduce polymorphism, such as dbl : ∀α ∈ {Z^C, Z^P}. α → α. A particularly promising approach is to encode the type system of λAS into the standard Hindley-Milner type system. This approach is beneficial from several perspectives.
• It allows polymorphic types not only for base types.
• It enables us to encode the type system of λAS in widely used languages such as Haskell and OCaml.
• It avoids costly brute-force searches and makes it possible to use existing efficient type inference algorithms.
• It avoids inferring R^P for every base-type expression, because the Hindley-Milner type system can infer the principal type scheme.
The idea of using the Hindley-Milner type system is based on the following observation. Recall that dbl can take an argument of either Z^C or Z^P. This situation can be naturally expressed by let polymorphism. Moreover, we can similarly resolve the inefficiency of typechecking δ abstractions. The body of a δ abstraction is essentially evaluated twice, and the two evaluations may take arguments of different types. Using a let expression, we can informally express this situation by the following equation, in which x denotes an indeterminate rather than a variable: That is, δx^R. e first takes the indeterminate x as its argument and only afterwards takes the actual argument. Based on this observation, we can check the following expression instead of directly typechecking δx^R. e: That is, δx^R. e is essentially regarded as λx^R. e, but in addition, its applicability to an indeterminate x :: R^P is checked.⁸ The polymorphism of the Hindley-Milner type system can encode subtyping (Fluet & Pucella, 2006), which is useful for expressing the other typing rules of λAS. For example, the types of constants and ⊕ can be expressed by ∀α. R^α and ∀α. R^α → R^α → R^α, respectively.
Unfortunately, the rule for multiplications cannot be encoded in the Hindley-Milner type system. This problem is essential, because multiplications do not have a most general type scheme. For instance, the type of λx^R. λy^R. x ⊗ y is either ∀α. R^α → R^C → R^α or ∀α. R^C → R^α → R^α, and these two types are incomparable. A natural workaround is to use two kinds of type-annotated multiplications, (⊗_L) :: ∀α. R^C → R^α → R^α and (⊗_R) :: ∀α. R^α → R^C → R^α. This modification may, however, make some typeable terms untypeable. The discussion above shows a trade-off between advanced types (e.g., polymorphism) and the necessity of annotations. While advanced types are generally more expressive, the difficulty of their inference means that the type system often requires more type annotations from programmers. In the present case of encoding λAS into the Hindley-Milner type system, the cost seems acceptable relative to the benefit.
We introduced a monomorphic type system in Section 3.3 and did not regard the Hindley-Milner type system as the default choice, because adopting other kinds of type systems, including those equipped with structural subtyping, generalised algebraic data types, refinement types and full dependent types, could also be beneficial. They would enable us to count the number of indeterminates at the type level and would therefore be useful for inferring not only the possibility of getting stuck but also the overhead of simplifying polynomials. Nevertheless, as discussed above, their use may not come for free and may require additional type annotations. In summary, there is a design choice among type systems; further study remains as future work.

Non-commutative semiring
So far, we have considered commutative semirings. If ⊗ is not commutative, simplification becomes more difficult. For instance, neither c ⊗ x ⊗ c′ nor (c₁ ⊗ x ⊗ c₁′) ⊕ (c₂ ⊗ x ⊗ c₂′) (where c₁ ≠ c₁′ and c₂ ≠ c₂′) can be made simpler.⁹ Therefore, to guarantee the simplicity of polynomials, the operational semantics and the type system should distinguish left-linear and right-linear polynomials. Every left (right) operand of a multiplication should be either a constant or a right-linear (respectively, left-linear) polynomial, and addition or multiplication between a left-linear polynomial and a right-linear polynomial should be prohibited.
We can refine the operational semantics of additions and multiplications as follows: In the type system, R should be annotated by either LP (left-linear polynomial), RP (right-linear polynomial) or C (constant). We thus refine the typing rules for δ abstractions and multiplications as follows: The type system can be encoded in the Hindley-Milner type system. When using a non-commutative semiring R, every base type carries two kinds of annotations: left-linear L or right-linear R, and constant C or polynomial P. Accordingly, we can encode the types of semiring values, semiring operators and indeterminates; for example, an indeterminate has type x :: ∀α. R^{α,P}. As in the case of commutative semirings, this encoding rejects some typeable terms. For instance, (1 ⊗_L 1) ⊕ (1 ⊗_R 1) cannot pass the typechecking based on this encoding. Although these refinements make the whole calculus more complicated, they maintain the major properties, including Corollary 6.
Example 9 (list concatenation). As an application of non-commutative semirings, consider a function, pElem, which gathers the positive elements of a list: Reasoning similar to that for sumP, discussed in Section 2.4, leads to the following divide-and-conquer implementation: By regarding (++) as the multiplication operator, we can see that this program calculates left-linear polynomials and is hence parallelisable. Note that (++) cannot be the addition operator because of its non-commutativity; commutativity of the addition operator is essential. Otherwise, we could not simplify expressions like x₁ ⊕ x₂ ⊕ x₁, and the sizes of polynomials could not be bounded by the number of indeterminates.
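A Python rendering of this divide-and-conquer scheme (the function name pElem is from Example 9; the list representation and the splitting are my sketch):

```python
# Sketch of Example 9: pElem gathers the positive elements of a list.
# Because (++) is associative (though not commutative), the second half
# "pre-simplifies" to a left-linear polynomial  w ++ x  before the first
# half is known, so both halves can be processed in parallel.

def pElem(xs):
    """Sequential specification."""
    return [a for a in xs if a > 0]

def pElem_dc(xs):
    """Divide-and-conquer version (halves are conceptually parallel)."""
    if len(xs) <= 1:
        return pElem(xs)
    mid = len(xs) // 2
    # the right half simplifies to the polynomial  w ++ x, where
    # w = pElem(right) and x stands for the yet-unknown left result
    w = pElem_dc(xs[mid:])
    return pElem_dc(xs[:mid]) + w   # substitute the left result for x

data = [3, -1, 4, -1, -5, 9, 2, -6]
assert pElem_dc(data) == pElem(data) == [3, 4, 9, 2]
```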

Multiplication without associativity
Sometimes, we can even drop the associativity of multiplication.
As an example, consider the 'cons' operator, (:), for lists. While it is not associative, a related associative operator, (++), enables simplification. For example, a linear expression a : b : c : x, where x is an indeterminate, can be simplified to w ++ x, where w = [a, b, c]. In general, any linear expression written with the cons operator can be simplified to the form w ++ x, where w is a list and x is an indeterminate. In fact, we implicitly used this simplification strategy in Example 9.
Many practical operators have related associative operators that enable simplification.

Division:
For an indeterminate x and numbers a₁, a₂, a₃, the expression x / a₁ / a₂ / a₃ can be simplified to x / a, where a = a₁ × a₂ × a₃.
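This claim is easy to check concretely; a small Python sketch using exact rationals:

```python
# Sketch: a chain of divisions applied to an unknown x pre-simplifies to a
# single division,  x / a1 / a2 / a3  ==  x / (a1 * a2 * a3).
from fractions import Fraction

a1, a2, a3 = Fraction(2), Fraction(3), Fraction(5)
a = a1 * a2 * a3                       # the single combined divisor

for x in map(Fraction, [1, 7, -30]):   # sample values for the indeterminate
    assert x / a1 / a2 / a3 == x / a
```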

Algebraic data structures:
For an indeterminate x and constructors f₁, …, fₙ of algebraic data structures, the expression f₁ (⋯ (fₙ x) ⋯) can be simplified to C[x], where C is a data structure with a unique hole, • (Minamide, 1998).
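The hole-plugging operator can be sketched as follows (a Python rendering of Minamide-style one-hole contexts; the tuple representation is mine, with node tags borrowed from Example 11 below):

```python
# Sketch: a chain of constructor applications around an indeterminate is
# represented as a data structure with a unique hole, HOLE.  Plugging is
# the related associative operator: composing two contexts and then
# plugging equals plugging one after the other.

HOLE = ("hole",)

def node(tag, *children):
    return (tag,) + children

def contains_hole(t):
    return t == HOLE or any(contains_hole(c) for c in t[1:])

def plug(ctx, t):
    """Replace the unique HOLE in ctx by t."""
    if ctx == HOLE:
        return t
    return (ctx[0],) + tuple(plug(c, t) if contains_hole(c) else c
                             for c in ctx[1:])

leaf = node("N")
c1 = node("B", leaf, HOLE)   # B(N, •)
c2 = node("L", HOLE, leaf)   # L(•, N)
c3 = node("R", leaf, HOLE)   # R(N, •)
t = node("N")

# plugging is associative, so a series of contexts can be combined
# in any bracketing (e.g., by a divide-and-conquer reduction):
assert plug(plug(c1, c2), plug(c3, t)) == plug(c1, plug(c2, plug(c3, t)))
```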
Formally, we can apply this simplification strategy if the following properties hold (here we consider the case of left-linear expressions; that of right-linear ones is analogous): That is, ⊗ need not be associative; instead, a related operator, ⊙, should be able to simplify a series of ⊗ applications. Note that ⊙ must be nearly associative, as the following reasoning shows: In particular, this setting generalises the case of a non-commutative semiring, in which (⊙) = (⊗). The simplification strategy can be implemented by using the following reduction rule in place of the corresponding original one: The type system needs no modification from the case of non-commutative semirings, and this modification maintains the major properties, including Corollary 6.
Example 10 (nondeterministic finite automata). Let A = (Q, Σ, I, F, τ) be a nondeterministic finite automaton, where Q is the set of states, Σ is the alphabet, I ⊆ Q is the set of initial states, F ⊆ Q is the set of final states and τ ⊆ (Q × Σ × Q) is the transition relation. The following function, run_A, expresses the state transitions of A: One may expect that this program does not fit the parallelisation approach of λAS because there is no algebraic operator. In fact, it can be parallelised by reformulating the computation as vector-matrix multiplication. Let {q₁, q₂, …, q_m} = Q. First, s ⊆ Q can be regarded as a bit vector of size m whose i-th bit is set to 1 if q_i ∈ s. Second, each τ_a (a ∈ Σ) can be regarded as an m × m matrix whose (i, j) element is set to 1 if (q_j, q_i) ∈ τ_a. Then, {q′ | q ∈ s, (q, q′) ∈ τ_a} can be calculated by the vector-matrix multiplication τ_a s, where ∨ and ∧ serve as the addition and multiplication, respectively. Because vector-matrix multiplication has a related associative operator, namely matrix-matrix multiplication, run_A can be parallelised just like the examples discussed in Section 2.5.
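The reformulation can be sketched in Python (the two-state automaton and its transition relation are hypothetical; the vector-matrix and matrix-matrix products are over the Boolean semiring (∨, ∧)):

```python
# Sketch of Example 10: an NFA run as Boolean vector-matrix products.
# Matrix-matrix multiplication over (∨, ∧) is associative, so the
# per-letter matrices can be combined in any bracketing -- e.g. by a
# divide-and-conquer (parallel) reduction -- before applying the result
# to the initial state vector.
from functools import reduce

def mat_mul(A, B):          # standard product over the (∨, ∧) semiring
    n = len(A)
    return [[any(A[i][k] and B[k][j] for k in range(n))
             for j in range(n)] for i in range(n)]

def vec_mat(s, A):          # one transition step: (A s)[i] = ∨_j A[i][j] ∧ s[j]
    n = len(A)
    return [any(A[i][j] and s[j] for j in range(n)) for i in range(n)]

# Hypothetical NFA over {a, b} with states {q0, q1}, accepting words
# that end in 'a'.  tau[letter][i][j] is True iff (q_j, letter, q_i) ∈ τ.
tau = {"a": [[True, True], [True, True]],
       "b": [[True, True], [False, False]]}
init = [True, False]        # I = {q0}
final = [False, True]       # F = {q1}

def accepts(s):
    return any(f and x for f, x in zip(final, s))

def run_seq(word):          # letter-by-letter (sequential) run
    s = init
    for ch in word:
        s = vec_mat(s, tau[ch])
    return accepts(s)

def run_par(word):          # fold of matrices, then a single application
    if not word:
        return accepts(init)
    M = reduce(mat_mul, reversed([tau[ch] for ch in word]))
    return accepts(vec_mat(init, M))

for w in ["", "a", "b", "ab", "ba", "abab", "bbba"]:
    assert run_seq(w) == run_par(w)
```

The `reversed` call orders the product as τ_{w_n} · … · τ_{w_1}, matching the left-to-right sequential run.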
Example 11 (tree transformation). As mentioned above, constructors of algebraic data structures are equipped with a related associative operator, namely the substitution operator for data structures with holes. This view enables us to parallelise computations that construct data structures. As a concrete example, consider the following simple tree transformation. The input is a tree consisting of four kinds of nodes: B, L, R and N. The output is obtained by eliminating the right subtree of each L node and the left subtree of each R node. The following function, tt, performs the transformation: As discussed in Section 2.7, splitting the input tree into a subtree and a context may improve parallelism. Formally, we would like to derive tt′ satisfying the following equation: It is not difficult to derive the following definition of tt′: Because of the linearity of tt and the algebraic properties of data structures with holes, this program can be evaluated in parallel just like sumTree.
While the formalism discussed so far is sufficient for capturing typical parallel reductions, there is room for further generalisation. Any set of operators with a simplification strategy can be integrated into λAS if it satisfies desirable properties such as termination, confluence and efficiency. Nevertheless, it is generally non-trivial to develop a type system that ensures these properties. For example, the linearity requirement on polynomials is not sufficient to guarantee the efficiency of tree transformations, because using a tree with a hole more than once may require its duplication. Minamide (1998) provided a type system that guarantees the single use of each structure. While it seems possible to adapt this type system to λAS, a formal investigation is left for future work.

Other programming constructs
λAS is designed so that it can be extended with standard programming constructs such as conditionals, data structures and recursion. Note that Theorems 2, 4 and 5 do not depend on details of the calculus such as the evaluation order and termination. Accordingly, any construct can be added if it can be expressed in a lambda calculus (neglecting the evaluation strategy) and does not directly manipulate semiring values.
In Section 2.6, we already used list constructors that store semiring values. They are not problematic: data structures are merely containers, and hence their use causes no essential computation on semiring values. This intuition can be justified more formally by considering the Church encoding of the structures.
It is also possible to add a fixed-point operator to λAS. Note that Theorems 2, 4 and 5 deal only with terminating evaluations. If an evaluation terminates in n steps, we can use an n-fold unfolding operator instead of the fixed-point operator.¹⁰ Adding the n-fold unfolding operator does not break the properties of λAS, because it can be expressed in the simply typed lambda calculus. Therefore, δx^R. e is observationally equivalent to λx^R. e whenever the evaluations of these two terms terminate. Note, however, that they may have different termination behaviours. For instance, when ⊥ is non-terminating, (λx^τ. 1) (λz^R. ⊥) terminates, whereas (λx^τ. 1) (δz^R. ⊥) does not.

More than one semiring
It is not difficult to deal with programs that use more than one semiring, provided those semirings are clearly distinguished. In practice, however, multiple semirings may share operators and values. For example, integers and integer addition are used in both (Z, +, ×, 0, 1) and (Z ∪ {−∞}, ↑, +, −∞, 0). The type system should therefore distinguish these two semirings and detect problematic terms like δx. (−2) × (x ↑ 1), which mixes the operators ↑ and ×, which do not together form a semiring.
It is possible, but not satisfactory, to develop a type system that rejects every term in which a δ abstraction involves more than one semiring. A better approach is to provide a method for restructuring terms so that the body of each δ abstraction contains computation over at most one semiring. For instance, (δx. (−2) × (x ↑ 1)) e is equivalent to (δy. (−2) × y) ((δx. x ↑ 1) e), which is not problematic if e evaluates to a constant. Further investigation of this notion is left for future work.
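Why the term δx. (−2) × (x ↑ 1) resists simplification is easy to observe concretely (a Python check; ↑ is max):

```python
# Sketch: ↑ (max) and × do not together form a semiring, because × does
# not distribute over ↑ for negative coefficients.  Hence the body of
# δx. (−2) × (x ↑ 1) cannot be simplified to a linear polynomial over
# either of the two semirings.

up = max  # the "addition" ↑ of (Z ∪ {−∞}, ↑, +, −∞, 0)

for x in [5, 0, -3]:
    # distributing × over ↑ changes the result for the coefficient −2 ...
    assert (-2) * up(x, 1) != up((-2) * x, (-2) * 1)
    # ... while a non-negative coefficient would distribute fine:
    assert 2 * up(x, 1) == up(2 * x, 2 * 1)
```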

Delimited continuation for introducing more δ abstractions
We have introduced parallelism by changing λ abstractions to δ abstractions. It is natural to consider introducing more parallelism by splitting expressions through the insertion of δ abstractions. Formally, given an expression e₁[e₂], we would like to evaluate e₂ and its surrounding context e₁[•] in parallel by inserting a δ abstraction, that is, by rewriting the expression to (δx. e₁[x]) e₂. However, this is not always possible. For instance, consider e = (λx. x + ((λy. y + x) 3) + x) 5 = e₁[e₂], where e₁[•] = (λx. x + • + x) 5 and e₂ = (λy. y + x) 3; then e is not equivalent to (δz. e₁[z]) e₂, because e₂ contains a free variable, x, whose actual value, 5, is supplied inside e₁.
A remedy for this situation is to use delimited continuations. In the following, we use the shift/reset operators (Danvy & Filinski, 1990).
A continuation represents the computation that will be performed later. A reset operator, ⟨e⟩, delimits a continuation. A shift operator, S k. e, captures the current continuation up to the surrounding reset operator and binds it to the variable k. For example, ⟨3 + (S k. k (k 2)) + 4⟩ is evaluated as follows: the captured continuation is k = λv. ⟨3 + v + 4⟩, so k 2 evaluates to 9, k (k 2) to 16, and the whole term to 16. The usefulness of continuations for modelling concurrency and parallelism is well recognised (Giorgi & Métayer, 1990; Wand, 1999; Li et al., 2007; Fluet et al., 2008; Imam & Sarkar, 2014; Dolan et al., 2017). We specifically focus on one-shot (Bruggeman et al., 1996; Dolan et al., 2017) delimited continuations. Each one-shot continuation can be invoked at most once and hence corresponds to suspend/resume patterns. One-shot delimited continuations can express several concurrent/parallel programming constructs, including coroutines (de Moura & Ierusalimschy, 2009) and Multilisp's futures (Imam & Sarkar, 2014).
It is non-trivial to integrate delimited continuations into the type system and operational semantics of λAS. In particular, the notion of the current continuation is not well defined in parallel evaluation. For instance, when evaluating ⟨(S k. k) (2 + 3)⟩, it is unclear which of λy. ⟨y (2 + 3)⟩ or λy. ⟨y 5⟩ is bound to k. Even worse, the result of ⟨(S k. 1) (S k. 3)⟩ can be either 1 or 3, depending on the evaluation order. To avoid such pathological cases, we strictly follow suspend/resume patterns: continuations are used only for suspending subcomputations, and suspended computations must be resumed later; neither duplication nor cancellation of suspended computations is allowed. In other words, we use delimited continuations only for controlling the evaluation strategy.

Efficiency
We have discussed the usefulness of the type system of λAS for understanding the overhead introduced by simplification. However, what the type system actually guarantees is not efficiency but the linearity of polynomials. Linearity rules out obvious inefficiencies, including exponential blow-up; nevertheless, it does not guarantee that parallel evaluation will improve efficiency.
First, parallel evaluation may be useless because of insufficient or ill-balanced independent tasks. For instance, the following program is correct but achieves no parallel speedup, because the δ-abstracted subterm, δy^Z. a + y, cannot be simplified any further: Second, the speculative nature of simplification in λAS may provoke computations that are unnecessary in sequential evaluation. Recall the following example, discussed in Section 4.3, in which the δ abstraction forces the evaluation of ⊥: Third, the simplification of polynomials is slower than the usual evaluation. For instance, the parallel implementation of poly calculates the two coefficients of a linear polynomial and therefore does about twice as much work as the sequential implementation. In general, if a calculated linear polynomial contains k indeterminates, its simplification is about k + 1 times as slow as the usual evaluation. This overhead is often essential for reduction parallelisation, as in the case of poly.
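The k + 1 factor can be made concrete for the poly example (a Python sketch; the coefficient-pair representation is my rendering of the linear polynomials computed by a parallel implementation, with k = 1):

```python
# Sketch: Horner evaluation folds  acc ↦ c + x*acc.  Each step is a linear
# function of acc; representing it by its coefficient pair (a, b), i.e.
# acc ↦ a + b*acc, composition of pairs is associative, so the fold splits
# into independent halves -- at the cost of tracking two coefficients
# instead of one value (roughly twice the work: k + 1 with k = 1).

def poly_seq(cs, x):
    acc = 0
    for c in cs:                       # Horner: acc = c + x*acc
        acc = c + x * acc
    return acc

def compose(f, g):                     # f ∘ g, as coefficient pairs
    a1, b1 = f                         # f(acc) = a1 + b1*acc
    a2, b2 = g                         # g(acc) = a2 + b2*acc
    return (a1 + b1 * a2, b1 * b2)     # f(g(acc)) = (a1 + b1*a2) + b1*b2*acc

def poly_par(cs, x):
    if len(cs) == 1:
        return (cs[0], x)              # one step:  acc ↦ c + x*acc
    mid = len(cs) // 2
    l = poly_par(cs[:mid], x)          # the two halves are independent
    r = poly_par(cs[mid:], x)
    return compose(r, l)               # later steps compose on the outside

cs = [3, 0, 2, 5]                      # 3x^3 + 0x^2 + 2x + 5
a, b = poly_par(cs, 2)                 # the whole fold as  acc ↦ a + b*acc
assert a + b * 0 == poly_seq(cs, 2) == 33
```

Each `compose` performs two multiplications where a sequential Horner step performs one, which is exactly the overhead discussed above.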
In typical partial evaluation and simplification, the result of evaluation/simplification can be large, and its duplication therefore commonly introduces serious overheads. In λAS, however, a simplification only yields a linear polynomial whose size is bounded by the number of indeterminates. Therefore, if the number of indeterminates is at most a constant, the cost of duplicating linear polynomials instead of constants does not affect the asymptotic complexity.

Related work
The characteristic feature of λAS is its use of algebraic properties for simplifying functions. This feature is closely related to partial evaluation (Jones, 1996). Given a subset of the inputs, called static, partial evaluation generates a program specialised for the static inputs without knowing the other inputs, called dynamic. On one hand, online partial evaluation, in which the usual evaluation may invoke partial evaluation at runtime, can implement the function simplification of λAS: when the evaluator encounters δx. e, it regards x as a dynamic input and requests a partial evaluator to simplify e. On the other hand, δ abstraction can be used to express (semiring-based) online partial evaluation. For example, given a function f(s, d), where s and d are the static and dynamic inputs, respectively, its partial evaluation with the static input s fixed to 1 can be expressed by (λs. δd. f(s, d)) 1. From this perspective, one of the most closely related studies is the parallel partial evaluation of Consel & Danvy (1992). While λAS makes use of algebraic properties and requires linearity over indeterminates in order to model efficient parallel reductions, their approach imposes no requirements on programs and therefore provides no support for parallel programming, especially for developing efficient parallel reductions.
Several studies have shown the usefulness of partial evaluation or function simplification for developing parallel reductions, including those on deriving parallel reductions on arrays/lists and trees (Callahan, 1992; Fisher & Ghuloum, 1994; Hu et al., 1998; Chin et al., 1998; Matsuzaki et al., 2005; Morihata & Matsuzaki, 2010; Raychev et al., 2015; Farzan & Nicolet, 2017; Jiang et al., 2018; Farzan & Nicolet, 2019) and those on parallel querying of semi-structured databases (Buneman et al., 2006; Cong et al., 2007, 2012). They generally focus on specific reduction patterns so as to enable automatic reduction parallelisation. λAS is designed as a foundation for exploring automatically parallelisable reduction patterns. As discussed in Section 2, the higher-order calculus enables us to express several reduction patterns and to study their parallelisation uniformly. In exchange for this generality, λAS places less emphasis on automatic parallelisation. λAS focuses on linear polynomials over semiring operators; the importance of linear polynomials in the context of parallel reductions has already been discussed. Xu et al. (2004) developed an automatic parallelisation system for list reductions; their idea is to trace algebraic operators and the linearity condition using a type system. Matsuzaki et al. (2006) and Sato & Iwasaki (2011) developed similar systems for the automatic parallelisation of tree reductions and reduction loops, respectively. λAS is strongly influenced by these works and provides a primitive construct, the δ abstraction, that enables us to study those parallelisation strategies uniformly. The basic idea of λAS is that their essence, namely the use of algebraic properties for modelling complex reductions, is independent of the control structures that express iterations and recursions.
A δ abstraction can be read as an annotation expressing speculative evaluation. From this viewpoint, λAS is similar to the evaluation strategy approach to parallel computation (Marlow et al., 2010), in which parallelism is specified and controlled by evaluation strategies. There is a crucial difference, however: in the evaluation strategy approach, programmers can control evaluation strategies only where the language does not specify the order of evaluation, whereas a δ abstraction in λAS requires subterm simplification that the standard evaluation strategy does not allow. Castro et al. (2016) proposed a type-based approach for introducing parallelism to purely functional programs. Their type system certifies not only the correctness but also the cost of an obtained parallel program. Their approach, using a type system to support the parallelisation of functional programs, is somewhat similar to the current proposal, but there are two essential differences. First, they focused on structured functional programs, especially those specified by algorithmic skeletons (Cole, 1989), which are reusable parallel programming patterns such as maps, reductions and prefix sum patterns. The focus on structured programs enables their method to analyse programs in detail; in contrast, the current proposal seeks to provide a foundation that can deal with complex unstructured programs. Second, while their method mainly deals with programs that apparently contain independent subexpressions, λAS can be used to parallelise programs whose divide-and-conquer implementations require breaking data dependencies using algebraic properties. Nishimura & Ohori (1999) proposed a higher-order functional programming language that has a special construct, called a parallel map, for modelling parallel reductions. Similar to λAS, the parallel map is based on substitutions for indeterminates. λAS refines their proposal in the following aspects.
First, their language does not explicitly account for algebraic simplification; therefore, it is unclear when the parallel map implements an efficient parallel reduction. In contrast, λAS deals with simplifications explicitly and provides a type system that guarantees successful simplification. Second, the parallel map is based on communications guided by the pointer structure of recursive data. Consequently, it uses a 'pointer jumping' strategy, which is less efficient than the standard divide-and-conquer approach. In contrast, λAS does not rely on pointer-based structures and can express the divide-and-conquer strategy.
The operational semantics of λAS interleaves the usual evaluation of lambda calculi with simplifications based on algebraic properties. These simplifications can be regarded as a kind of semantic evaluation, as they are based on the mathematical properties of the operators. Accelerating the evaluation of lambda calculi through semantic evaluation is not a new idea. Terui (2012) showed that semantic evaluation enables efficient sequential evaluation of lambda expressions, thereby leading to a precise bound on computational costs. Kobayashi (2012) used type-based semantic evaluation to perform computations on compressed data without decompression.
While λAS is based on call-by-value evaluation, the δ abstraction introduces a different evaluation strategy. The call-by-push-value calculus (Levy, 2003) enables a close analysis of the effects of evaluation strategies by carefully distinguishing values from computations. Unfortunately, the call-by-push-value calculus seems insufficient for expressing λAS. In λAS, a δ abstraction generates a function, which is a computation in the call-by-push-value calculus, by capturing an indeterminate in a polynomial, which is a value. In the call-by-push-value calculus, the thunk construct can obtain a computation from a value; however, it cannot introduce a new binder that captures an indeterminate. Nevertheless, a similar calculus may be useful for providing a better understanding of λAS.

Conclusion and future work
This paper has developed λAS, a simply typed lambda calculus with algebraic simplifications. The key characteristic of λAS is the δ abstraction, whose function body is simplified using algebraic properties before its arguments arrive. The operational semantics and type system of λAS were formalised. The type system guarantees that simplification results in linear polynomials and thereby rules out the major possibilities of unsuccessful parallelisation. The usefulness of λAS for modelling parallel reductions was demonstrated on several non-trivial examples. This is a first step towards a foundation for parallel reductions based on lambda calculi; there are many directions for further investigation.

Inferring evaluation costs
As discussed in Section 4.6, the type system of λAS does not guarantee that parallel implementations are faster than sequential ones. A precise cost inference for λAS is more challenging than for usual lambda calculi, because the cost depends on the number of indeterminates used during simplification, and moreover, a δ abstraction may generate more than one, possibly unboundedly many, indeterminates. It is natural to seek a practical subset of λAS in which the number of necessary indeterminates is known; indeed, every example discussed in Section 2 requires only a constant number of indeterminates. Such a subset might be obtained by considering structural recursions, as in the study of Castro et al. (2016), and by restricting the duplication of δ-abstracted functions. If such a subset is found, it might also be worthwhile to consider non-linear polynomials, because exponential blow-ups cannot occur in such a setting.

Strategy for introducing parallelism
A good strategy for introducing δ abstractions is desirable. This issue is closely related to cost inference. If evaluation costs can be precisely inferred, even the following naive strategy may be useful: replace a λ abstraction with a δ abstraction whenever doing so improves the estimated parallel evaluation cost. This strategy is, however, not sufficient for recursive functions that process large data, such as foldr and foldl. As discussed in Section 2, to obtain efficient parallel reductions for large data processing, we should combine δ abstractions with the divide-and-conquer approach.

Compilation to existing calculus
Although λAS is a theoretical model for studying reduction parallelisation, it is desirable to formulate a compilation of λAS into an existing calculus (or an abstract machine) that supports parallel evaluation. Such a compilation would lead not only to an understanding of λAS from a different perspective but also to a better inference of parallel evaluation costs. However, providing such a compilation is non-trivial.
A major difficulty is the implementation of indeterminates. As discussed in Section 4.6, the evaluation of λAS may create unboundedly many free variables (indeterminates), which cannot be captured by usual variable environments. This situation is similar to that of lazy evaluation, which uses a heap for managing unboundedly many thunks (Launchbury, 1993). However, a naive compilation using heaps is unsatisfactory, because it threads computations and thereby prevents exploiting parallelism.

Parallelisation of practical programs
The original motivation for developing λAS is its application to the reduction parallelisation of programs written in practical programming languages. Although λAS is extensible, it remains unclear whether it can incorporate practical, complex programming constructs and be applied to reasoning about practical, complex programs.

Conflict of Interest
The author, Akimasa Morihata, is employed at the University of Tokyo. His recent collaborators are Kento Emoto (Kyushu Institute of Technology, Japan), Zhenjiang Hu (Peking University, China), Hideya Iwasaki (The University of Electro-Communications, Japan), Kiminori Matsuzaki (Kochi University of Technology, Japan), Shigeyuki Sato (The University of Tokyo, Japan) and Katsuhiro Ueno (Tohoku University, Japan).