Reasoning about multi-stage programs

J. Inoue and W. Taha

Abstract. We settle three basic questions that naturally arise when verifying code generators written in multi-stage functional programming languages. First, does adding staging to a language compromise any equalities that hold in the base language? Unfortunately it does, and more care is needed to reason about terms with free variables. Second, staging annotations, as the name "annotations" suggests, are often thought to be orthogonal to the behavior of a program, but when is this formally guaranteed to be true? We give termination conditions that characterize when this guarantee holds. Finally, do multi-stage languages satisfy useful, standard extensional properties, for example, that functions agreeing on all arguments are equivalent? We provide a sound and complete notion of applicative bisimulation, which establishes such properties or, in principle, any valid program equivalence. These results yield important insights into staging and allow us to prove the correctness of quite complicated multi-stage programs.

This gap in the literature is a significant shortcoming, as ensuring the correctness of code generators can be challenging. Fixing errors in the generator often entails figuring out which pieces of code came from where in the generator and why. This task can be time-consuming even with tool support, because programmers must understand the unfamiliar body of code produced by the generator. This problem is exacerbated when the generated code is heavily optimized and can change drastically as the generator changes, or if many variants of the code must be generated. Programmers would be better served by being able to minimize inspection of the generated code, for then they would only have to deal with familiar code. Moreover, a generator may produce problematic code only for certain inputs while working fine on other inputs. Verifying the generator, as opposed to individual generated instances, allows us to verify the entire family of programs that the generator can produce, giving greater payoff for the verification effort.
To address this shortcoming, we advocate in this paper an approach to verifying code generators that minimizes the need to contemplate the generated code. The idea is to find conditions under which the constructs related to code generation are semantics-preserving and can be safely ignored. The power function gives a good, concise example for demonstrating this approach, presented here in MetaOCaml syntax.

    let rec power n x = if n = 1 then x else x * power (n-1) x
    let rec genpow n x = if n = 1 then x else <~x * ~(genpow (n-1) x)>
    let stpow n = ! <fun z -> ~(genpow n <z>)>

This code defines a function named power which maps x and n to x^n. The power function subsumes all functions of the form fun x -> x*x*...*x, but every time it is called, it wastes time on recursive calls and conditional branches. Staging annotations in genpow eliminate this overhead by resolving the branches and unrolling the recursion. Brackets <e> delay an expression e. An escape ~e must occur within brackets and causes e to be evaluated without delay, locally undoing the effect of surrounding brackets. The e should return a value of the form <e'>, and e' replaces ~e. Run (! e) compiles and runs the code generated by e. These annotations in MetaOCaml are hygienic, i.e., preserve static scoping (Dybvig, 1992), but are otherwise like LISP's quasiquote, unquote, and eval (Muller, 1992). The genpow function uses these constructs to generate, for any concrete n, a compiled function that performs only multiplication.
For example, the evaluation of stpow 2 proceeds as follows, where →+ stands for one or more steps of evaluation (formally defined in Section 2):

    stpow 2 →+ ! <fun z -> ~(genpow 2 <z>)>    (1)
            →+ ! <fun z -> ~<z * z>>           (2)
            →+ ! <fun z -> z * z>              (3)
            →+ fun z -> z * z                  (4)
Step (1) is just unfolding the definition of stpow.
Step (2) evaluates the escaped part (genpow 2 <z>) of the code being generated. Note that this step evaluates an open term; escape is forcing evaluation to occur under the binder fun z, which is a distinguishing feature of MSP.
Step (3) splices the generated code into the surrounding context to create a bigger code value. Finally, at step (4), ! compiles and executes the generated code, yielding a closure. This closure, when called, performs nothing but multiplication.
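The staged computation traced above can be mimicked outside MetaOCaml. The following Python sketch is our own analogy, not the paper's code: code values become strings, escapes become string splicing, and run becomes eval. It shows how genpow unrolls the recursion at generation time.

```python
# A Python analogy of power/genpow/stpow (ours, not the paper's MetaOCaml):
# brackets are strings, escape/splicing is string interpolation, run is eval.

def power(n, x):
    # unstaged: recursion and branching happen on every call
    return x if n == 1 else x * power(n - 1, x)

def genpow(n, x):
    # x is a code fragment (a string); mirrors <~x * ~(genpow (n-1) x)>
    return x if n == 1 else f"({x} * {genpow(n - 1, x)})"

def stpow(n):
    # mirrors ! <fun z -> ~(genpow n <z>)>
    return eval(f"lambda z: {genpow(n, 'z')}")

square = stpow(2)   # generated code: lambda z: (z * z)
assert square(5) == 25 == power(2, 5)
```

As in the MetaOCaml version, the function returned by stpow performs only multiplications; all recursion and branching were resolved during generation.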
This example is typical of MSP usage, where a staged program stpow is meant as a drop-in replacement for the unstaged program power. Given stpow, we can reconstruct the unstaged program power by erasing staging annotations; we say that power is the erasure of stpow. In light of the similarity of these programs, if we are to verify stpow, we naturally expect stpow ≈ power to hold for some suitable program equivalence (≈), and hope to get away with proving that power satisfies whatever specifications it has, in lieu of stpow. Then, power can be analyzed straightforwardly by conventional reasoning techniques designed for single-stage programs. But three key concerns must be addressed before we can apply this strategy with confidence: Conservativity. Do all reasoning principles valid in a single-stage language carry over to its multi-stage extension?
Conditions for Sound Erasure. In the power example, staging seems to preserve semantics, but clearly this is not always the case: if Ω is non-terminating, then ⟨Ω⟩ ≉ Ω for any sensible (≈). When do we know that erasing annotations preserves semantics?
Extensional Reasoning. How, in general, do we prove equivalences of the form e ≈ t? It is known that hygienic, purely functional MSP satisfies intensional equalities like β (Taha, 1999), but those equalities are too weak to prove such properties as extensionality (i.e., functions agreeing on all inputs are equivalent). Extensional facts are indispensable for reasoning about functions, like stpow and power.
This paper settles these questions for the untyped, purely functional case with hygiene. We work without types to avoid committing to the particulars of any specific type system, since there are multiple useful type systems for MSP (Taha & Nielsen, 2003; Tsukada & Igarashi, 2010; Westbrook et al., 2010; Kameyama et al., 2015). It also ensures that our results apply to dynamically typed languages (Dybvig, 1992), where hygienic code generation is just as useful as in statically typed languages.
Hygiene, that is, the absence of inadvertent variable capture, is a widely accepted safety feature that underlies many of the nice theoretical properties of MSP, which help in reasoning about programs and which we exploit in this study. This is an important point of difference from Choi et al. (2011), who advocate an alternative approach that eliminates code generation, but in a semantics that permits variable capture and delegates capture avoidance to an explicit "gensym" construct. The two approaches have different trade-offs and work in different settings, so our development and theirs fill complementary roles.
We believe imperative hygienic MSP is not yet ready for an investigation like this. Types are essential for having a sane operational semantics without scope extrusion (Kameyama et al., 2011), but there is no decisive solution to this problem, and the jury is still out on many of the trade-offs. The foundations for imperative hygienic MSP have not matured to the level of the functional theory that we build upon here.
Tagless final encodings (Carette et al., 2009), and the lightweight modular staging framework (Rompf & Odersky, 2012) inspired by that technique, give a different approach to metaprogramming than MetaOCaml-style MSP. They offer data types that not only represent code, like bracketed expressions do in MetaOCaml, but can also be interpreted by any semantics of the user's choosing. The semantics may evaluate the code, print the code, or perform a post-generation-pass optimization and emit some intermediate representation. These frameworks are not limited to staging (separating a program into multiple execution phases, or stages) but rather support general-purpose metaprogramming (writing programs that manipulate other programs in arbitrary ways). Reasoning in those frameworks depends on the semantics given to the object code, so it is beyond the scope of this paper. However, our approach may still be relevant when the machinery is used specifically for staging.

Contributions
We extend previous work on the call-by-name (CBN) multi-stage λ calculus, λ U (Taha, 1999), to cover call-by-value (CBV) as well (Section 2). In this calculus, we show the following results.

Unsoundness of Reasoning Under Substitutions.
Unfortunately, the answer to the conservativity question is "no". Because λU can express open-term manipulation (see genpow above), equivalences proved under closing substitutions are not always valid without substitution, for such a proof implicitly assumes that only closed terms are manipulated at runtime. We illustrate how this pathology occurs using the surprising fact (λ_.0) x ≉ 0, and explain what can be done about it (Section 3). The rest of the paper will show that λU nonetheless conserves a wealth of useful reasoning principles. Many familiar proof rules and techniques carry over from the plain λ calculus, so that a lot can be achieved, despite the fact that we can no longer restrict our attention exclusively to closed instances of terms.
Conditions for Sound Erasure. We show that reductions of a staged term are simulated by equational rewrites of the term's erasure. This gives simple termination conditions that guarantee erasure to be semantics-preserving (Section 4). Considering CBV in isolation turns out to be unsatisfactory, and borrowing CBN facts is essential in establishing the termination conditions for CBV. Intuitively, this happens because annotations change the evaluation strategy, and the CBN equational theory subsumes reductions in all other strategies whereas the CBV theory does not.
Soundness of Extensional Properties. We give a sound and complete notion of applicative bisimulation (Abramsky, 1990; Gordon, 1999) for λU. Bisimulation gives a general extensional proof principle that, in particular, proves extensionality of λ abstractions. It also justifies reasoning under substitutions in some cases, limiting the impact of the non-conservativity result (Section 5).
To demonstrate the wide applicability of our methods, we present substantial case studies proving the correctness of non-trivial generators (Section 6). In Section 6.1, we verify the longest common subsequence (LCS) algorithm, which is staged into a sophisticated code generator that couples let-insertion with continuation-passing style and monadic memoization using the techniques of Swadi et al. (2006). These techniques make an exact description of the generated code hard to pin down, but our result on erasure makes such details irrelevant. We also verify a generator for fold (Section 6.2), which demonstrates that higher-order generators are also amenable to our verification methodology.
Throughout the paper, we emphasize the general insights about MSP that can be gleaned from our results. In particular, we find that CBN is better behaved than CBV, as metaprogrammers who have experience with MSP in both settings may have already come to realize. The shortcomings of CBV stem largely from premature evaluation of subexpressions that may diverge, and a large part of our effort consists in building tools to reason in the face of that obstacle. Though we do stress the applicability to verification, we strive to establish a deep, general understanding of staging, and let the tools for verification fall out as natural byproducts. We demonstrate those tools along the way, using the power function as a running example.
This paper is a summary and an extension of the first author's doctoral thesis (Inoue, 2012), which was previously published at a conference (Inoue & Taha, 2012). This paper incorporates materials from the thesis that were relegated to a technical report in the conference version due to space limitations, as well as some new results:
• A detailed discussion of why certain generalizations to the equational theory are unsound (Section 2.3). Together with the issue of reasoning under substitutions (Section 3), this discussion gives a thorough understanding of where the boundary lies between equalities that hold in a multi-stage language and those that don't.
• A significantly improved, nuanced definition of careful equalities (Section 4.5), used for proofs in CBV. In the conference paper, this technique was not developed enough to be a serious contender to the normalization technique presented in Section 4.3, but we have succeeded in reformulating it to have a clear advantage in analyzing higher-order generators. This material is new.
• Proofs of soundness and completeness of applicative bisimulation (Appendix A.2).
• The verification example of a higher-order generator (Section 6.2). This material is new.
This paper supersedes the conference version. It gives more proof details and explanations than the conference version, but at a level that keeps the flow and should be easy to follow. The thesis writes out all proofs in meticulous detail in an appendix, so readers interested in working out, checking, or mechanizing the proofs may find the thesis to be a valuable complement. Reading the thesis is not necessary to understand this paper, however.

The λ U calculus: Syntax, semantics, and equational theory
This section introduces λ U , a simple but expressive calculus that models all possible uses of brackets, escape, and run in MetaOCaml's purely functional core, sans types. The syntax and operational semantics of λ U for both CBN and CBV are minor extensions of previous work (Taha, 1999) to allow arbitrary constants. The CBN equational theory is more or less as in Taha (1999), but the CBV equational theory is new.

Notation.
A set S may be marked as CBV (S_v) or CBN (S_n) if its definition varies by evaluation strategy. The subscript is dropped in assertions and definitions if they apply to both evaluation strategies or if clear from context. Syntactic equality (α equivalence) is written (≡). The set of free variables in e is written FV(e).

Figure 1 shows the syntax of λU. The set of terms is that of the plain λ calculus extended with constants and the three staging primitives brackets, escape, and run. A context C is an incomplete term containing exactly one hole • in place of a subterm. The result of plugging, or replacing, the hole by e is written C[e], where binders in the context C capture free variables in e. The exact level of a term is its nesting depth of escapes, where a pair of brackets cancels one level of escape, provided the brackets enclose the escape (and not the other way around). A program is a closed term with exact level 0. Levels are used to encode the following rules for nesting brackets and escapes: (a) a term is delayed iff more brackets surround it than do escapes, and (b) in a program every escape must occur in a delayed region. For example, in the following terms, e1 and e2 are delayed while e3, e4, and e5 are not. The term containing e6 is not a valid program (though it can be a subterm of a valid program), and it makes no sense to ask whether e6 should be delayed.

    ⟨e1⟩    ⟨⟨˜e2⟩⟩    ⟨˜e3⟩    ⟨⟨˜˜e4⟩⟩    ⟨˜⟨˜e5⟩⟩    ⟨˜˜e6⟩

Syntax and operational semantics
A term like ˜˜e6 is not self-contained in the sense that it cannot appear just anywhere in a program, but must appear inside at least two pairs of brackets. Otherwise, the inner ˜ would fall within a non-delayed part of the program term and have no delay to cancel. A term's exact level is the minimum number of brackets it must appear within in a valid program. We say that a context C is a program context for e iff C[e] is a program. Because lv e ≤ lv t implies that any program context for t is a program context for e as well, upper bounds for a term's level are usually more useful than the exact level. Thus, we often say "e has level ℓ" or "e is a level-ℓ term", written e ∈ E^ℓ, to mean lv e ≤ ℓ. We say "e has exact level ℓ", explicitly using the keyword "exact", when we mean lv e = ℓ.
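The level computation just described can be made concrete by structural recursion. The following Python sketch uses our own tuple-encoded AST (not the paper's formalism): escapes raise the level, brackets lower it but never below zero, and every other construct takes the maximum over its subterms.

```python
# A sketch of the exact-level function lv on a tiny tuple-encoded AST
# (the encoding is ours): escapes raise the level by one, brackets lower
# it (never below zero), other nodes take the max over their subterms.

def lv(e):
    tag = e[0]
    if tag in ("var", "const"):
        return 0
    if tag == "lam":
        return lv(e[2])
    if tag == "app":
        return max(lv(e[1]), lv(e[2]))
    if tag == "bracket":
        return max(lv(e[1]) - 1, 0)
    if tag == "escape":
        return lv(e[1]) + 1
    if tag == "run":
        return lv(e[1])
    raise ValueError(f"unknown tag: {tag}")

x = ("var", "x")
assert lv(("bracket", ("escape", x))) == 0  # <~x>: bracket cancels escape
assert lv(("escape", ("escape", x))) == 2   # ~~x needs two enclosing brackets
assert lv(("escape", ("bracket", x))) == 1  # ~<x>: inner bracket cancels nothing
```

The last assertion illustrates the "provided the brackets enclose the escape" proviso: a bracket inside an escape does not cancel it.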
A level-0 value (i.e., a value in a non-delayed region) is a constant, an abstraction, or a code fragment with no undelayed region. A level-0 value of the form ⟨e⟩ is called a code value. At level ℓ > 0 (i.e., inside ℓ pairs of brackets), a value is any lower level term, or in other words, a term that will have no undelayed region when plugged into a context that supplies ℓ pairs of brackets. Staging annotations use the same nesting rules as LISP's quasiquote and unquote (Dybvig, 1992), but we stress that they preserve scoping: e.g., ⟨λx.˜(λx.⟨x⟩)⟩ ≡ ⟨λx.˜(λy.⟨y⟩)⟩ ≡ ⟨λy.˜(λx.⟨x⟩)⟩.
Definition 1 (Erasure). Define the erasure ‖e‖ of a term e by the following equations:

    ‖x‖ = x    ‖c‖ = c    ‖λx.e‖ = λx.‖e‖    ‖e1 e2‖ = ‖e1‖ ‖e2‖
    ‖⟨e⟩‖ = ‖e‖    ‖˜e‖ = ‖e‖    ‖! e‖ = ‖e‖

We say that a term e is unstaged iff e ≡ ‖e‖, and staged otherwise. For example, the power function given in the introduction is unstaged, as it is the erasure of genpow, which is equal to the erasure of stpow modulo η reduction.
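The erasure equations can be transcribed directly. This Python sketch reuses our tuple-encoded AST from above (again, the encoding is ours, not the paper's): the three staging constructs vanish, and all other constructs are traversed homomorphically.

```python
# A sketch of erasure on a tuple-encoded AST (encoding ours): staging
# constructs (bracket, escape, run) vanish; all else is traversed as-is.

def erase(e):
    tag = e[0]
    if tag in ("var", "const"):
        return e
    if tag == "lam":
        return ("lam", e[1], erase(e[2]))
    if tag == "app":
        return ("app", erase(e[1]), erase(e[2]))
    if tag in ("bracket", "escape", "run"):
        return erase(e[1])
    raise ValueError(f"unknown tag: {tag}")

# <~x * ~y> erases to x * y (curried application of *)
staged = ("bracket",
          ("app",
           ("app", ("const", "*"), ("escape", ("var", "x"))),
           ("escape", ("var", "y"))))
unstaged = ("app", ("app", ("const", "*"), ("var", "x")), ("var", "y"))
assert erase(staged) == unstaged
assert erase(unstaged) == unstaged  # unstaged terms are fixed points of erasure
```

The second assertion is exactly the "e is unstaged iff e ≡ ‖e‖" criterion of Definition 1.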
The small-step operational semantics is given in Figure 2, where square brackets denote guards on grammatical production rules. In this semantics, a small-step judgment e →ℓ t is marked with a level ℓ, which intuitively denotes the number of brackets this step is happening in. A term takes a small-step at level ℓ iff it decomposes as E^{ℓ,m}[r], where E^{ℓ,m} is an ℓ,m-evaluation context and r is a level-m redex. If the redex r contracts to s, then E^{ℓ,m}[r] →ℓ E^{ℓ,m}[s]. The SS-Ctx rule explicitly requires s ∈ E^m, but this is a purely informative constraint that is always met when the other constraints are satisfied. In general, e →ℓ t implies ℓ = lv e ≥ lv t (whose proof we omit). Redex contractions are: β reduction at level 0, δ reduction at level 0, run-bracket elimination (SS-R) at level 0, and escape-bracket elimination at level 1 (SS-E). All rules are common to both evaluation strategies, except that CBN's β rule is SS-β whereas CBV's is SS-βv. The δ reductions are given by a partial map δ from pairs of constants to closed, unstaged level-0 values, which is undefined for ill-formed pairs like (not, 5). We assume constant applications do not return staged terms.
An ℓ,m-evaluation context E^{ℓ,m} yields a level-ℓ term when plugged with a level-m term. The hole of an evaluation context points to the location of the unique redex that must be contracted next. At level ℓ > 0, both evaluation strategies simply walk over the syntax tree of the term to look for escapes, including ones that occur inside the arguments of applications. At level 0, the definition is mostly standard. CBV evaluation contexts can place the hole inside the argument of a level-0 application, whereas CBN evaluation contexts can do so only if the operator is a constant. This difference accounts for the fact that CBV application is always strict at level 0, while CBN application is lazy if the operator is a λ but strict if the operator is a constant.
We use the metavariables a, b ∈ Arg, ranging over the set of substitutable arguments (i.e., level-0 terms for CBN and level-0 values for CBV), to treat both strategies uniformly. For example, the rules SS-β and SS-βv can be unified as (λx.e) a →0 [a/x]e. This semantics is deterministic, and for any ℓ, level-ℓ values cannot take a small-step at level ℓ.

Notation.
We write λUn ⊢ e →ℓ t for a CBN small-step judgment and λUv ⊢ e →ℓ t for CBV. We use similar notation for (⇓), (⇑), and (≈), defined below. We may omit the λUn or λUv if the evaluation strategy either does not matter or is clear from context. For any relation R, let R+ be its transitive closure and R* its reflexive-transitive closure.
Definition 2 (Termination and divergence). An e ∈ E^ℓ terminates to v ∈ V^ℓ at level ℓ iff e (→ℓ)* v, written e ⇓ℓ v. We write e ⇓ℓ to mean ∃v. e ⇓ℓ v. If no such v exists, then e diverges (e ⇑ℓ). Note that divergence includes stuckness.
Example 4. Consider the program p ≝ ⟨λz.z (˜[(λx.x) ⟨z⟩])⟩, where we use square brackets [ ] in lieu of parentheses to improve readability. Let e be the subterm inside the square brackets. In both CBN and CBV, p decomposes as E[e], where E ≡ ⟨λz.z (˜•)⟩ ∈ ECtx^{0,0}, and e is a level-0 redex. Note that the hole of E is under a binder and the redex e is open, though p is closed. The hole is also in argument position in the application z (˜•), even for CBN. This application is delayed by brackets, so the CBN/CBV distinction is irrelevant in the code generation phase, i.e., until the delay is canceled by !. Hence, p →0 ⟨λz.z (˜⟨z⟩)⟩ →0 ⟨λz.z ⟨z⟩⟩.
Example 5. As an example of evaluation with nested staging constructs, consider ⟨! ⟨˜˜e⟩⟩, where e is some level-0 term that satisfies e →0 ⟨x⟩. Evaluation strategy does not make a difference in this example. We have ˜˜e →2 ˜˜⟨x⟩ because ˜˜• ∈ ECtx^{2,0}. Moreover, ˜⟨x⟩ →1 x, so ˜˜⟨x⟩ →2 ˜x. However, ˜⟨x⟩ is not a level-2 redex (although it is a level-1 E-redex): the program context ⟨! ⟨•⟩⟩ is a 0,2-evaluation context and not a 0,1-evaluation context, so this ˜⟨x⟩ is reduced as part of a level-2 step rather than on its own. Thus, ⟨! ⟨˜˜e⟩⟩ (→0)+ ⟨! ⟨˜x⟩⟩ ∈ V^0, noting that the contents of the outermost brackets, namely ! ⟨˜x⟩, form a level-0 term. Intuitively, the remaining ˜ (as well as !) is delayed by the outermost brackets in ⟨! ⟨˜x⟩⟩.
As usual, this "untyped" formalism can be seen as dynamically typed. In this view, ˜ and ! take code-type arguments, where code is a distinct type from functions and base types. Thus, ⟨λx.x⟩ 1, ˜0, and ! 5 are all stuck. Stuckness on variables like x 5 does not arise in programs for conventional languages because programs are closed, but in λU, evaluation contexts can pick redexes under binders, so this type of stuckness does become a concern. We will revisit this issue in Section 3. The contraction of open-term level-0 redexes is central to the expressive power of λU. It is with this feature that we can evaluate terms like genpow 3 <x>, optimizing away the body of the power function.
The operational semantics induces the usual notion of observational equivalence, which relates terms that are interchangeable under all program contexts. In other words, two expressions are observationally equivalent iff we can silently replace one by the other in any given program without affecting its input/output behavior. This is the sense in which we would like to prove that a staged program like stpow is equivalent to its erasure power.
Definition 6 (Observational equivalence). Define e ≈ t iff for every C such that C[e], C[t] ∈ Prog we have C[e] ⇓0 ⇐⇒ C[t] ⇓0 and, whenever one of them terminates to a constant, the other also terminates to the same constant.
Remark. This definition of observational equivalence differs from that of the original formulation by Taha (1999) in two respects:
• Taha's definition was stratified, with (≈ℓ) ⊆ E^ℓ × E^ℓ defined for each level ℓ. Because E^ℓ ⊆ E^m whenever ℓ ≤ m, the (≈ℓ) at higher levels subsume those at lower levels. The stratification is therefore not terribly useful, and we have dropped it to simplify the notation.
• Open-term observation is dropped because implementations like MetaOCaml typically reject source files with unbound variables, and closed-term observation more accurately models that design.
These changes do not constitute a shift in semantics, however. The old and new definitions give the same equivalence, in the sense that (≈) coincides with the stratified (≈ℓ). See Appendix A.1 for a proof.

Equational theory
The equational theory of λU is a proof system that, as we will soon show, derives a subset of (≈). The CBN axioms are λUn ≝ {β, EU, RU, δ}, while the CBV axioms are λUv ≝ {βv, EU, RU, δ}. These axioms are defined below. If e = t can be proved from a set of axioms Φ, then we write Φ ⊢ e = t, though we often omit the Φ in definitions and assertions that apply uniformly to both CBV and CBN, or if Φ is clear from context. Reduction is a term rewrite induced by the axioms: Φ ⊢ e −→ t iff e = t is derivable from the axioms by compatible extension alone.

    Name    Axiom                       Side condition
    β       (λx.e0) t0 = [t0/x]e0
    βv      (λx.e0) v0 = [v0/x]e0
    EU      ˜⟨e⟩ = e
    RU      ! ⟨e0⟩ = e0
    δ       c1 c2 = δ(c1, c2)           δ(c1, c2) defined

Here e0, t0 range over level-0 terms and v0 over level-0 values. Note that (−→) is a superset of (→), as the axioms subsume SS-β, SS-βv, SS-E, SS-R, and SS-δ, while the inference rule of compatible extension subsumes SS-Ctx.
The following example illustrates the difference between reduction and small-steps.
Just like the plain λ calculus, λ U satisfies the Church-Rosser property, so every term has at most one normal form (irreducible reduct). Hence, terms are not provably equal when they have distinct normal forms. Church-Rosser also ensures that reduction and provable equality are more or less interchangeable, and when we investigate the properties of provable equality, we usually do not lose generality by restricting our attention to the simpler notion of reduction.
Next, we establish that provable equality implies observational equivalence.
The containment (=) ⊂ (≈) is proper because (≈) is not computably enumerable (since λU is Turing-complete) whereas (=) clearly is. There are several useful equivalences in (≈) \ (=), which we will prove by applicative bisimulation. Provable equality is nonetheless strong enough to discover the value of any term that has one, so the assertion "e terminates (at level ℓ)" is interchangeable with "e reduces to a (level-ℓ) value", which in turn is equivalent to "e is provably equal to a (level-ℓ) value".
Theorem 10 is equivalent to the property known as "Plotkin-style correspondence" in the literature, which was shown for the plain λ calculus by Plotkin (1975). It can also be considered a form of the "standardization lemma", although that term usually refers to an equivalence between unrestricted reductions and leftmost, outermost reductions rather than between reductions and evaluations. The proofs of Theorems 8 to 10 can be done with standard, off-the-shelf proof techniques and are therefore omitted. The thesis (Inoue, 2012) contains a proof using Takahashi's technique (1995), which is basically the well-known Tait-Martin-Löf confluence proof using parallel reduction, but extended to also cover standardization.

Generalized axioms are unsound
This paper's equational theory is not identical to that of Taha (1999), but generalizes rule EU from ˜⟨e0⟩ = e0, where e0 must have level 0, to ˜⟨e⟩ = e for arbitrary e. In this subsection, we discuss the utility of this generalization and explain why other axioms cannot be generalized in the same manner.
The main use of the new, generalized E U is to show that substitution preserves (≈). Thus, an equivalence proved on open terms holds for any closed instance. This fact plays an important role in the completeness proof of applicative bisimulation (see Appendix A.2). It is also somewhat surprising, considering that the converse fails in CBV (see Section 3).
Proof. Take ℓ = max(lv e, lv t). Then, from e ≈ t, we get

    ⟨· · · ⟨e⟩ · · ·⟩ ≈ ⟨· · · ⟨t⟩ · · ·⟩,

where e and t are each enclosed in ℓ pairs of brackets. Both sides are level 0, so we can apply the β or βv rule, depending on the evaluation strategy, and

    ⟨· · · ⟨[a/x]e⟩ · · ·⟩ ≈ ⟨· · · ⟨[a/x]t⟩ · · ·⟩.

Escaping both sides ℓ times gives

    ˜· · ·˜⟨· · · ⟨[a/x]e⟩ · · ·⟩ ≈ ˜· · ·˜⟨· · · ⟨[a/x]t⟩ · · ·⟩.

Then, applying the EU rule ℓ times gives [a/x]e ≈ [a/x]t. The old EU rule ˜⟨e0⟩ = e0 would apply only once here, because the level of the ⟨· · · [a/x]e · · ·⟩ part increases as the surrounding escapes accumulate, so the generalization is strictly necessary.
At this point, it is natural to wonder why the other rules, β/βv and RU, are not generalized to arbitrary levels, and why EU is special. The reason is that generalizations of β/βv and RU involve demotion, i.e., moving a term from one level to another. MSP type system researchers have long observed that unrestricted demotion is a type-unsafe operation (Taha & Nielsen, 2003; Westbrook et al., 2010). We show here that it is also unsound as an equational rule. Table 1 (captioned "Generalized equational axioms are unsound"; the Ω there is some divergent level-0 term) shows generalized rules along with counterexamples that show their unsoundness: the left column names the rule that was generalized, the middle column shows the generalization, and the right column refutes it. Dropping level constraints from RU gives Equation (5). In CBN β, relaxing the argument's level gives Equation (6). In CBV βv, simply removing the argument's level constraint produces Equation (7). More sensible attempts are Equations (8) and (9), which keep the constraints on head term constructors. Generalizing the function in β and βv gives Equations (10) and (11), respectively. Equations (5)-(9) fail because they involve demotion, which moves a term from one level to another. For example, the generalized Equation (5) puts e inside more brackets on the left-hand side than on the right-hand side. The counterexample exploits this mismatch by choosing an e that contains a divergent term enclosed in just enough escapes so that the divergence is forced on one side but not the other. More concretely, on the left-hand side ! ⟨˜Ω⟩ ∈ E^0, so ⟨! ⟨˜Ω⟩⟩ ∈ V^0. However, on the right-hand side, the Ω is enclosed in fewer brackets and ⟨˜Ω⟩ ∉ V^0; in fact ⟨˜•⟩ ∈ ECtx^{0,0}, so assuming Ω →0 Ω1 →0 Ω2 →0 · · · ad infinitum, we have ⟨˜Ω⟩ →0 ⟨˜Ω1⟩ →0 ⟨˜Ω2⟩ →0 · · · as well. We can formalize this insight as follows.
Intuitively, Δ C is the limiting value of lv C[e] − lv e as lv e → ∞. This difference converges to a constant independent of e because when e is sufficiently high-level, the deepest nesting of escapes in C[e] occurs within e. Then, lv C[e] − lv e depends only on the number of brackets and escapes surrounding the hole of C. The function L in Proposition 13 gives a lower bound on lv e needed to reach this limiting behavior.
Theorem 14. Let C and C′ be contexts with ΔC ≠ ΔC′. Then C[e] ≈ C′[e] fails for some e. That is, a rewrite rule from which we can derive ∀e. C[e] −→ C′[e] for such C and C′ is always unsound.
The proof of this theorem relies on the fact that if e has enough escapes, the escapes dominate all the staging annotations in C, and the term they enclose is given top priority during program execution. In more technical terms, lv C[e] grows unboundedly with lv e because of Proposition 13, and beyond a certain threshold, C ∈ ECtx^{ℓ+ΔC,ℓ}. Hence if, say, ΔC > ΔC′, then e is evaluated first under C but not under C′. Notice that this proof fails, as expected, if the e in C[e] −→ C′[e] is restricted to level-0 terms.
Proof. Induction on C.
Proof. Easily seen from the fact that E^{ℓ,m}[t] takes a small step at level ℓ iff t takes a small step at level m.
Proof of Theorem 14. Take ℓ ≝ max(L(C), L(C′), size(C) + 1, size(C′) + 1), where L is a function that witnesses Proposition 13, and let e ≡ ˜˜· · ·˜Ω with ℓ escapes, where Ω ∈ E^0 and Ω ⇑0. Then, lv e = ℓ, e ⇑ℓ, lv C[e] = ℓ + ΔC, and lv C′[e] = ℓ + ΔC′. Without loss of generality, ΔC > ΔC′. By Lemma 15, C ∈ ECtx^{ℓ+ΔC,ℓ}, where the second superscript is known by Lemma 16. Then, taking C⟨···⟩ ≝ ⟨· · · • · · ·⟩ with ℓ + ΔC pairs of brackets, the program C⟨···⟩[C[e]] diverges (since C ∈ ECtx^{ℓ+ΔC,ℓ} and e ⇑ℓ), whereas C⟨···⟩[C′[e]] does not (since lv C′[e] < ℓ + ΔC); hence C[e] ≉ C′[e].

Theorem 14 provides a quick sanity check for all equational rewrites, which we may call the level function test: a rewrite rule must always rewrite between contexts C and C′ with ΔC = ΔC′. In particular, Equations (5) through (9) above fail this test: they rewrite between contexts with different Δ values. Note that a sound rule can rewrite between contexts C and C′ such that lv C[e] − lv e and lv C′[e] − lv e disagree for some e, as long as those e are all low-level. For example, EU states ˜⟨e⟩ = e, but if e ∈ E^0, then lv ˜⟨e⟩ − lv e = 1 ≠ lv e − lv e. However, the differences of exact levels agree whenever lv e ≥ 1, which is why Theorem 14 does not apply to EU. Restricting the level of expressions that can plug level-mismatching holes may also ensure soundness; the non-generalized RU does this.
The Equations (10) and (11) in Table 1 happen to pass the level function test. These rules have, in a sense, a dual problem: the substitutions in Equations (10) and (11) inject extra brackets into locations that were previously stuck on a variable, whereas Theorem 14 injects extra escapes.
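For contexts built only from brackets and escapes around the hole, ΔC can be read off directly as the number of escapes minus the number of brackets. The following Python sketch (our own encoding, reusing a minimal level function) checks this against the limiting behavior described above, and shows how the limit fails for a low-level e, where clipping at level 0 interferes.

```python
# Sketch (our encoding): for a straight-line context of brackets/escapes,
# delta(C) = (#escapes) - (#brackets), and for sufficiently high-level e,
# lv(C[e]) = lv(e) + delta(C).

def lv(e):
    tag = e[0]
    if tag == "var":
        return 0
    if tag == "bracket":
        return max(lv(e[1]) - 1, 0)
    if tag == "escape":
        return lv(e[1]) + 1
    raise ValueError(tag)

def plug(path, e):
    # path lists the wrappers around the hole, outermost first
    for tag in reversed(path):
        e = (tag, e)
    return e

def delta(path):
    return path.count("escape") - path.count("bracket")

C = ["bracket", "bracket", "escape"]           # C = <<~[.]>>
e = ("escape", ("escape", ("var", "x")))       # lv e = 2: high enough
assert delta(C) == -1
assert lv(plug(C, e)) == lv(e) + delta(C)      # limiting behavior reached

low = ("var", "x")                             # lv = 0: limit not reached
assert lv(plug(C, low)) != lv(low) + delta(C)  # clipping at 0 interferes
```

The last two assertions mirror the statement that ΔC is the limiting value of lv C[e] − lv e, attained only once lv e is large enough.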

Closing substitutions compromise validity
While λU is amenable to equational reasoning using β equality, reminiscent of equational reasoning in the plain λ calculus, there is a striking difference in the way free variables behave in the two settings. This difference is more pronounced in the CBV setting. Traditionally, CBV calculi admit the equational rule

    (βx)    (λx.e) y = [y/x]e    (y a variable).

Plotkin's seminal λV calculus (1975), for example, does so implicitly by taking variables to be values, so that its βv rule applies to variable arguments. In λUv, however, (λ_.0) x ≉ 0. The term on the left is stuck, because x ∉ V^0 and x ⇑0. Intuitively, the value of x is demanded before anything is substituted for it, so an implementation would raise an error saying "unbound variable: x". If we apply a substitution σ that replaces x by a value, then σ((λ_.0) x) ≈ σ(0), so the standard technique of reasoning under closing substitutions is unsound. Note that the βx redex itself need not contain staging annotations; thus, adding staging to a language can compromise some existing equivalences, i.e., staging is a non-conservative language extension. The problem here is that λUv can evaluate open terms. The reader may recall that λV reduces open terms just fine while admitting βx, but the crucial difference is that λU evaluates (small-steps) open terms under program contexts whereas λV never does. Small-steps are the specification for implementations, so if they can rewrite an open subterm of a program, implementations must be able to perform that rewrite as well. By contrast, reduction is just a semantics-preserving rewrite, so implementations may or may not be able to perform it.
Implementations of λU_v, including MetaOCaml, have no runtime values or data structures representing the variable x; they implement x ∉ V⁰. They never perform the step (λ_.0) x −→⁰ 0, for if they were forced to evaluate (λ_.0) x, then they would try to evaluate the x, as required for CBV, and throw an exception. Some program contexts in λU do force the evaluation of open terms, e.g., the evaluation context E given above. We must then define a small-step semantics in which (λ_.0) x does not step to 0, or else we would not model actual implementations. Moreover, this behavior is conceptually the more natural choice. Variables are placeholders for as-yet-unavailable values, and it makes no sense for the placeholder itself to be offered up as the value. If a reified variable is needed, that is the role of ⟨x⟩, not x. Therefore, we must reject β_x, for it is unsound for (≈) in a small-step semantics with x ⇑⁰. In other words, the lack of β_x is an inevitable consequence of the way natural, practical implementations behave.
Even in λ_V, setting x ∈ V is technically a mistake, because λ_V implementations typically do not have runtime representations for variables either. But in λ_V, whether a given evaluator implements x ∈ V or x ∉ V is unobservable. Small steps on a λ_V program (which is closed by definition) never contract open redexes, because evaluation contexts cannot contain binders. Submitting programs to an evaluator will never tell whether it implements x ∈ V or x ∉ V. Therefore, in λ_V, there is never any harm in pretending x ∈ V. A small-step semantics with x ∈ V gives the same (≈) as one with x ∉ V, and β_x is sound for this (≈).
Intuitively, the reason we can pretend x is a value in λ_V is that by the time execution reaches a subterm with x free, the x will always have a value. Execution only deals with closed instances of terms in the program, so reasoning also only needs to examine closed instances. By contrast, in λU_v, whether a free variable will have a value during execution depends on the context. To be interchangeable under all contexts, terms must behave identically whether all, some, or none of the free variables have values. Thus, a priori, comparing terms in λU should require comparing under all substitutions, including partial ones. But comparing terms under all substitutions involves comparing under the empty substitution. If we understand "comparing under substitutions" as establishing (≈) under substitutions, then to show e ≈ t we would have to show ∅e ≈ ∅t, i.e., e ≈ t itself: a catch-22.
In an effort to avoid this circularity, one could consider comparing terms by a more lax criterion under the empty substitution than under other substitutions. For example, one might test (≈) under closing substitutions but equi-termination under the empty substitution. That is, to establish e ≈ t, we check e ⇓ ⟺ t ⇓ and ∀closing σ. σe ≈ σt. However, these comparisons fail to distinguish between x and y. Free variables are unlike values because they can be divergent, but they are also unlike closed, divergent terms because they are distinguishable, and any attempt to characterize the equivalence between open terms must respect this distinction. Short of considering all the ways in which free variables can be independently substituted for, including not being substituted, there seems to be no clean way to encode this distinction. The applicative bisimulation to be introduced in Section 5 works along these lines, considering all substitutions by default, but it allows us, in some cases, to restrict our attention to those substitutions that substitute away some of the variables.
Thus, the issue with β_x shown above is just the tip of the iceberg. The general, more important, challenge in λU is that reasoning under all closing substitutions is insufficient, i.e., (∀closing σ. σe ≈ σt) does not imply e ≈ t. We stress that the real challenge is this more general problem with substitutions, and not the special case of β_x, because unfortunately β_x is not only an illustrative example but also a tempting straw man. Seeing β_x alone, one may think that its unsoundness is some idiosyncrasy that can be fixed by modifying the calculus. For example, type systems can easily recover β_x by banishing all stuck terms, including β_x redexes. Alternatively, one could modify the implementation (unnaturally, in our opinion) to treat variables as values and define x ∈ V⁰, thereby subsuming β_x in β_v. But this little victory over β_x does not matter much, for the general question of when exactly we can reason under closing substitutions remains. It is unclear whether any type system justifies reasoning under closing substitutions in general, or how we might be able to prove that.
Surveying which refinements of λU (including, but not limited to, the addition of type systems) let us reason under substitutions, and why, is an important topic for future study, but it is beyond the scope of this paper. Here we focus instead on showing that we can achieve a lot without committing to anything more complicated than λU. In particular, we will show that the lack of β_x is not a large drawback after all, as a refined form of β_x can be proved thanks to applicative bisimulation (Section 5). The refined rule, Cβ_x, rewrites between C[(λy.e₀) x] and C[[x/y]e₀] under a binder for x, with the side conditions that C[(λy.e₀) x], C[[x/y]e₀] ∈ E⁰ and that C does not shadow the binding of x. Intuitively, given just the term (λy.e₀) x, we cannot tell whether x is well-leveled, i.e., bound at a lower level than its use, so that a value is substituted for x before evaluation can reach it. The Cβ_x rule remedies this problem by demanding a well-leveled binder. As a special case, β_x is sound for any subterm in the erasure of a closed term, that is, the erasure of any self-contained generator.

The erasure theorem
In this section, we present the Erasure Theorem for λU and derive simple termination conditions that guarantee e ≈ ‖e‖, where ‖e‖ denotes the erasure of e. The theorem statement differs for CBN and CBV, and the latter has quite a few details to be discussed. We present the simpler CBN case first.

CBN version
The intuition behind the theorem is that all staging annotations do is describe and enforce an evaluation strategy. They may force CBV, CBN, or some other strategy that the programmer wants, but CBN reduction can simulate any strategy, because it allows the redex to be chosen from anywhere. Thus, erasure commutes with CBN reductions (Figure 3(a)). The same holds for provable equalities.
Proof. The first part is by induction on the derivation of the reduction judgment. The second part follows immediately.
Additionally, in CBN, erasure cannot make a term less terminating (equivalently, staging cannot make a term more terminating), unless the annotations affect the term's external interface, that is, unless the staged term's return value carries staging annotations.
How does the Erasure Theorem help prove equivalences of the form e ≈ ‖e‖? The theorem gives a simulation of reductions from e by reductions from ‖e‖. If e reduces to an unstaged term t, then simulating that reduction from ‖e‖ gets us to ‖t‖, which is just t; thus, e −→* t ←−* ‖e‖ and e = ‖e‖. Amazingly, this witness t can be any reduct of e, as long as it is unstaged! In fact, by Church–Rosser, any unstaged t with e = t will do. So staging is correct (i.e., semantics-preserving, or e ≈ ‖e‖) if we can find this t. As we will see shortly, this search boils down to a termination check on the generator.

Example: Erasing CBN staged power
Let us show how the Erasure Theorem applies to stpow. First, some technicalities: we assume that the Const set of λU is equipped with integers, arithmetic operators, and booleans, with their usual semantics captured by δ reductions. MetaOCaml's constructs are interpreted in λU in the obvious manner, e.g., let x = e in t stands for (λx.t) e, and let rec f x = e stands for let f = Θ(λf.λx.e), where Θ is some fixed-point combinator. For conciseness, we treat the top-level bindings genpow and stpow like macros, so ‖stpow‖ is the erasure of the recursive function to which stpow is bound, with genpow inlined, and not the erasure of a variable named stpow.
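To make the running example concrete, the following is a plain-OCaml sketch of the staged power function. MetaOCaml's code values cannot run in ordinary OCaml, so here a "code value" is modeled as a closure over the future argument x; this encoding, and the base case n = 1, are our assumptions, not the paper's Figure.

```ocaml
(* Unstaged specification. *)
let rec power n x = if n = 1 then x else x * power (n - 1) x

(* Closure-based analogue of genpow: brackets/escapes are modeled by
   building a function of the eventual argument instead of code. *)
let rec genpow n (cx : int -> int) : int -> int =
  if n = 1 then cx
  else fun x -> cx x * genpow (n - 1) cx x

(* stpow runs the generator before the second argument arrives,
   mirroring ! <fun x -> ~(genpow n <x>)> in the text. *)
let stpow n : int -> int = genpow n (fun x -> x)

let () =
  assert (stpow 3 2 = power 3 2);
  assert (stpow 5 2 = 32)
```

For positive first arguments the two functions agree, which is exactly the equivalence the Erasure Theorem is used to establish below.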
As a caveat, we might wish to prove stpow ≈ power, but unfortunately this goal is unprovable. The whole point of stpow is that it processes the first argument without waiting for the second, so it may immediately diverge when partially applied to one argument, whereas power does not diverge until it is fully applied. For example, stpow 0 ⇑⁰ but power 0 ⇓⁰. We sidestep this issue for now by concentrating on positive arguments and discuss divergent cases in Section 5.2.
To prove that k > 0 implies stpow k = power k for CBN, we only need to check that the code generator genpow k terminates to some ⟨e⟩; then the ! in stpow will take out the brackets, and we have the witness for applying Lemma 21. To say that something terminates to ⟨e⟩ roughly means that it is a two-stage program, which is true for almost all uses of MSP that we are aware of. This use of the Erasure Theorem is augmented by the observation ‖stpow‖ = power; these functions are not quite syntactically equal, the former containing an additional η redex. This proof illustrates our answer to the erasure question in the introduction, for the CBN case: erasure is semantics-preserving if the generator terminates to ⟨e⟩. What is particularly pleasing about this proof is that it says so little about what e looks like, or what e computes. The only information we track about the generated code is the absence of left-over annotations. Effectively, the concern of reasoning about the annotations is decoupled from the concern of reasoning about what the generated code computes. This simplicity is a major advantage for reasoning about complex generators like LCS (Section 6.1).

CBV version: Proof by normalization
CBV satisfies a property similar to Theorem 18, but the situation is more subtle. Staging modifies the evaluation strategy in CBV as well, but not all resulting strategies can be simulated in the erasure by CBV reductions, for β_v reduces only a subset of β redexes. For example, if Ω ∈ E⁰ is divergent, then (λ_.0) ⟨Ω⟩ −→ 0 in CBV, but the erasure (λ_.0) Ω does not CBV-reduce to 0, since Ω is not a value. However, it is the case that λU_n ⊢ (λ_.0) Ω −→ 0 in CBN. In general, erasing CBV reductions gives CBN reductions (Figure 3(b)). This theorem has ramifications similar to those of the CBN Erasure Theorem, but with the caveat that they conclude in CBN despite having premises in CBV. In particular, if e is CBV-equal to an erased term, then e = ‖e‖ in CBN.
While these results nicely illustrate how staging is a change of evaluation strategy, without further refinement they are not terribly helpful for verification. We still need a way to prove that the program e is equal to ‖e‖ in CBV. We have two techniques to offer for this purpose: one is to insist that the witness t terminates to a CBN-normal form, such as a constant, and the other is to exercise some caution in applying β_v equalities. The former is conceptually simpler, but the latter is sometimes more helpful for verifying higher order functions. We discuss proof by normalization in this section and leave the other idea for Section 4.5.
The idea of proof by normalization is, given e, to show that e and ‖e‖ CBV-reduce to constants. Then, by chasing the diagram below, we can show e = ‖e‖ in CBV. Say we have found some c, d that satisfy the two horizontal CBV equalities. From the top equality and the corollary above, the two constants must coincide, and e = ‖e‖ in CBV follows. Thus, we can prove e = ‖e‖ in CBV by showing that each side terminates to some constant in CBV. Though we borrowed CBN facts to derive this lemma, the lemma itself leaves no trace of CBN reasoning. Note that we could straightforwardly generalize the lemma by requiring CBV-termination to CBN-normal forms instead of constants, but the generalized statement mixes CBN and CBV reasoning. Because many functions in practice return ground terms when fully applied, we believe the special case above strikes a good balance between generality and simplicity.
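The diagram chase can be spelled out as follows. This is our reconstruction of the elided display, writing \(\|e\|\) for the erasure and assuming both sides CBV-normalize to constants \(c\) and \(d\); the step labels paraphrase the erasure results stated above.

```latex
% Suppose  \lambda^U_v \vdash e = c  and  \lambda^U_v \vdash \|e\| = d,
% with c, d \in Const.
\begin{aligned}
\lambda^U_v \vdash e = c
  &\;\Longrightarrow\; \lambda^U_n \vdash \|e\| = \|c\| \equiv c
  && \text{(CBV equalities erase to CBN equalities)}\\
\lambda^U_v \vdash \|e\| = d
  &\;\Longrightarrow\; \lambda^U_n \vdash \|e\| = d
  && \text{(CBV reduction is a special case of CBN)}\\
&\;\Longrightarrow\; c \equiv d
  && \text{(Church--Rosser for } \lambda^U_n\text{; both are normal forms)}\\
&\;\Longrightarrow\; \lambda^U_v \vdash e = c \equiv d = \|e\|
  && \text{(transitivity, entirely within CBV)}
\end{aligned}
```

The final line is the desired CBV equality, with no residual CBN reasoning in its statement.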

Example: Erasing CBV staged power by normalization
Let us show how the CBV Erasure Theorem applies to stpow. The proof is similar to the CBN case, but we need to fully apply both stpow and its erasure to confirm that they both reach some constant. The beauty of Lemma 26 is that we do not have to know what those constants are. Just as in CBN, the erasure ‖stpow‖ is equivalent to power, but note that this part of the proof uses Cβ_x.
Proof. Contract the η expansion by Cβ_x.
Proposition 28 (Erasing CBV power). Suppose k ∈ Z⁺ and m ∈ Z. Then, we have λU_v ⊢ stpow k m ≈ power k m.
Proof. We stress that this proof works entirely with CBV equalities; we have no need to deal with CBN once Lemma 26 is established. Induction on k gives ∃e. genpow k ⟨x⟩ = ⟨e⟩ with [m/x]e ⇓⁰ m′ for some m′ ∈ Z. We can do so without explicitly figuring out what e looks like. The case k = 1 is easy; for k > 1, the returned code is ⟨x * e′⟩, where [m/x]e′ terminates to an integer by the inductive hypothesis, hence so does [m/x](x * e′). Clearly, power k m also terminates to a constant. By Lemma 27, ‖stpow‖ k m then yields a constant as well, so by Lemma 26, stpow k m = ‖stpow‖ k m ≈ power k m.
This proof illustrates one possible answer to the erasure question in the introduction for CBV: erasure is semantics-preserving if the staged and unstaged terms terminate to constants in CBV. Showing the latter requires propagating type information and a termination assertion for the generated code. Type information would come for free in a typed system, but it can be easily emulated in an untyped setting. Hence, we see that the correctness of staging generally reduces to termination not just in CBN but also in CBV; in fact, the correctness proof is only a slight modification of the termination proof.

CBV version: Careful erasure
In the last two sections, we let erasure map CBV equalities to the superset of CBN equalities and performed extra work to show that the particular CBN equalities we derived hold in CBV as well. An alternative approach is to find a subset of CBV equalities that erase to CBV equalities, which is roughly how Yang (2000) handled CBV erasure. This subsection develops this technique in λU. The equalities turn out to be more convenient when presented as pairs of equalities than as restrictions of CBV equalities. The result is a trickier, though more versatile, proof method than proof by normalization.
As discussed in Section 4.3, the problem with erasing CBV reductions is that the argument in a β_v redex might no longer terminate when erased. To eliminate this case, we might restrict β_v to a "careful" variant β_v⇓, whose side condition requires the erasure of the argument to terminate. If we define a new set of axioms λU_v⇓ def= {β_v⇓, E_U, R_U, δ}, then reductions (hence equalities) under this axiom set erase to CBV reductions. However, β_v⇓ is much too restrictive. It prohibits contracting redexes of the form (λy.e₀) x (note x ⇑⁰), which are ubiquitous; a function as simple as stpow already contains one.
Going back to a concrete example is instructive here. As it turned out, β_v-reducing the genpow n ⟨x⟩ appearing in stpow was safe (as evidenced by Proposition 28), despite the ⟨x⟩, which has a divergent erasure. Intuitively, the reason is that stpow is expected to be used like stpow k m, which expands to ! <fun x -> ~(genpow k <x>)> m. The m is waiting to be substituted for x, and indeed it would be substituted right away if it weren't for the staging annotations. Therefore, it is reasonable to exploit this substitution in checking the side condition for β_v⇓, because that condition is a check on the behavior of the erasure. Thus, genpow k ⟨x⟩ should be reduced not by β_v⇓ but by the refined rule β_v⇓/σ, which checks the side condition after applying a speculative substitution σ. This refinement lets us reduce redexes with open-term arguments as long as the σ covers the relevant variables. An axiom set with β_v⇓/σ in place of β_v can be formulated so that equalities erase as in formula (14). The resulting system is strong enough to equate genpow k ⟨x⟩ to an erased term, using σ = [m/x], but it still falls short of equating stpow k to an erased term. The β_v⇓/σ rule requires the reduction of the staged term to be performed in lockstep with the reduction of the erasure; however, the reduction of expression (13) substitutes m for x at the end, whereas the reduction of the erasure (fun x -> genpow k x) m substitutes first. In general, the whole point of staging is to reorder the reductions, so we must allow escaping the lockstep at a few strategic places in order to align the rest of the reductions of the staged term and the erasure. To this end, we define careful equalities by formula (14) and take β_v⇓/σ to be a theorem, instead of the other way around.
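The displayed forms of the careful rules are lost above. From the surrounding discussion, they plausibly take the following shape; this is a hedged reconstruction (writing \(\|\cdot\|\) for erasure), and the original side conditions may be stated differently.

```latex
% Careful beta_v: contract only when the erased argument terminates.
\beta_{v\Downarrow}:\qquad
  (\lambda x.e)\, v \;\longrightarrow\; [v/x]e
  \qquad \text{if } \|v\| \Downarrow^0
% Refined with a speculative substitution \sigma, which supplies the
% values that the surrounding program will eventually substitute:
\beta_{v\Downarrow}/\sigma:\qquad
  (\lambda x.e)\, v \;\longrightarrow\; [v/x]e
  \qquad \text{if } \|\sigma v\| \Downarrow^0
```

Under σ = [m/x], the argument ⟨x⟩ in genpow k ⟨x⟩ passes the second side condition even though ‖⟨x⟩‖ = x alone diverges, which is exactly the flexibility the text asks for.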
Definition 29. For any σ : Var ⇀fin V⁰, we define when an expression e reduces carefully modulo σ to t. The σ is called the speculative substitution accompanying the careful reduction. Careful equalities are defined analogously, using (=) in place of (−→).
The rules E_U, R_U, and β_v⇓/σ are admissible in this system. Compatible extension (e = t =⇒ C[e] = C[t]) is not always admissible, for the context C can capture variables in the speculative substitution. This rule must be constrained to avoid variable capture, as shown in CR-Compat below.
Notation. Let BV(C) stand for the set of variables that C captures, or binds. Given a substitution σ : Var ⇀fin E, let FV(σ) def= dom σ ∪ ⋃_{x ∈ dom σ} FV(σx). Remark. FV(σ) is just the "support" of σ in the terminology of nominal logic (Gabbay & Pitts, 2001). Intuitively, it is the set of variables whose names are significant, i.e., renaming them alters the substitution.
Proposition 30. For any σ : Var ⇀fin V⁰, the following rules are admissible. The same holds if we replace all occurrences of (−→) by (=) and add reflexivity, symmetry, and transitivity.
[BV(C) ∩ FV(σ) = ∅]
Proof. Reflexivity, symmetry, transitivity, E_U, and R_U are obviously admissible, so we will focus on the other rules. They are special cases of rules for deriving non-careful CBV reductions, so only the σe = σt part needs to be shown.
[CR-Compat] By the side condition, all variables bound in C are fresh for σ, so σ commutes with plugging the hole of C, and likewise for C[t]. The premise gives λU_v ⊢ σe = σt, so it follows by compatible extension that λU_v ⊢ σC[e] = σC[t].

With these rules, careful reductions can be performed almost like ordinary CBV reductions. The correctness lemma that applies to the result is pleasantly similar to the one for CBN (cf. Lemma 21), with essentially the same proof.

Example: Erasing CBV staged power by careful erasure
We now demonstrate the correctness of erasing stpow in CBV using Lemma 31.
The key issue in such proofs is how to introduce the speculative substitution, which is usually where we need to temporarily escape the lockstep of the reductions of the staged term and the erasure. In the case of ! <fun x -> ~(genpow k <x>)> m, the speculative substitution is [m/x], speculating the β_v substitution of m into the fun x.
Alternative Proof of Proposition 28. Let us recall the proposition's statement: we have ‖stpow‖ ≈ power (Lemma 27), so it suffices to show ∀k ∈ Z⁺. ∀m ∈ Z. λU_v ⊢ stpow k m ≈ ‖stpow‖ k m. By induction on k, we can show the existence of an e such that Equation (15) holds. This equality can be proved entirely with the deduction rules in Proposition 30. The required reductions are: δ reductions to simplify if-then-else and integer arithmetic, β reductions to simplify the fixed-point operator and pass around the counter k, and β substitutions of ⟨x⟩ into the body of genpow. Only the last of these needs the speculative substitution.

Now we need to justify the speculative substitution. Directly applying Lemma 31 to Equation (15) would give an erased equality under the substitution [m/x], but we want an equality without substitutions. To solve this problem, we will extend Equation (15) to Equation (16), where ∅ is the empty substitution. Then, applying Lemma 31 will leave the desired equality with only the empty substitution (or, equivalently, no substitution) attached.
To establish Equation (16), we make a hop outside of the lockstep reduction rules of Proposition 30. That is, we reason explicitly in λU_v, with substitutions applied to terms instead of being kept under λU_v⇓/σ. Computing how each side of the equation reduces and combining the two computations establishes Equation (16).

Overall, the analyses involved in proof by normalization and proof by careful equalities are quite similar. In both approaches, we track the reduction of the generated code while reducing the generator, which requires tracking the substitution that will be applied when the generated code runs. The normalization approach exploits the substitutions when analyzing the termination of code returned by the generator, whereas careful equalities exploit them when analyzing the termination of code that is passed into the generator.
Both approaches have advantages and disadvantages. The normalization approach can be carried out entirely in the well-behaved reduction system of λU_v, and it does not require us to explicitly relate the staged term's execution with that of the erasure at all. These properties make the approach easier to use, but in exchange, we must supply enough context to force the return value to be a constant. The careful-reduction approach does not offer a crisp deductive system, providing only a set of admissible rules reminiscent of λU_v's reduction system (Proposition 30). Those rules suffice for a large part of the reasoning, but some details must be filled in by devising ad hoc arguments to justify the speculative substitution. In return, careful equalities do not require getting down to ground values, which, as we shall see in Section 6.2, is a notable advantage for verifying higher order generators.

Extensional reasoning by applicative bisimulation
This section presents applicative bisimulation, a well-established tool for analyzing higher order functional programs (Abramsky, 1990; Gordon, 1999). Bisimulation is sound and complete for (≈), in particular justifying Cβ_x (Section 3) and extensionality, allowing us to handle the divergence issues we glossed over in Section 4.2.

Proof by bisimulation
Here, we present the definition and usage of applicative bisimulation in λ U and leave the proofs of soundness and completeness to Appendix A.2. Due to technical complications, the indexed applicative bisimilarity defined in that appendix, which coincides with observational equivalence, is notationally dense and unwieldy. Therefore, in this section, we work with a reasoning principle packaged up more conveniently, which hides the indexing. The packaged principle is also enhanced (Pous & Sangiorgi, 2011) up to observational equivalence, i.e., bisimulations can contain pairs of terms that transition to terms that are in the bisimulation only modulo observational equivalence. We say "applicative bisimulation" to denote this unindexed, enhanced relation.
For a pair of terms to be applicatively bisimilar, they must both terminate or both diverge. If they terminate, their values must be bisimilar again under experiments that examine their behavior. In an experiment, functions are called, code values are run, and constants are left untouched. Effectively, this is a bisimulation under the labeled transition system consisting of evaluation (⇓) and experiments. If e R t implies either that e ≈ t or that e and t are bisimilar, then R ⊆ (≈).

Definition 32 (Relation under experiment). Given a relation R ⊆ E × E, we define its extension under experiments at each level ℓ, written Rℓ†.
Definition 33. The substitution closure of a binary relation R ⊆ E × E, written R•, is defined as R• def= {(σe, σt) : e R t ∧ σ : Var ⇀fin Arg}. A binary relation is substitution-closed iff it equals its own substitution closure.

Definition 34 (Applicative bisimulation). A substitution-closed binary relation R ⊆ E × E is an applicative bisimulation iff every (e, t) ∈ R satisfies the following: letting ℓ = max(lv e, lv t), we have e ⇓ℓ ⟺ t ⇓ℓ, and if e ⇓ℓ u ∧ t ⇓ℓ v, then u Rℓ† v.

Theorem 35. For a substitution-closed binary relation R, we have R ⊆ (≈) iff R is contained in an applicative bisimulation.
In particular, (≈) is itself an applicative bisimulation: the largest one under set inclusion, called applicative bisimilarity. Thus, the observably equivalent pairs of terms are precisely the pairs that are applicatively bisimilar. This is our answer to the extensional reasoning question in the introduction: bisimulation can in principle derive all valid equivalences, including all extensional facts. Unlike in single-stage languages (Abramsky, 1990; Howe, 1996; Gordon, 1999), σ ranges over non-closing substitutions, which may not substitute for all variables or may substitute open terms. Closing substitutions are unsafe because λU has open-term evaluation. But for CBV, bisimulation gives a condition under which substitution is safe, namely when the binder is at level 0 (in the definition of (λx.e) R⁰† (λx.t)). In CBN, this is not an advantage, as ∀a. [a/x]e R [a/x]t entails [x/x]e R [x/x]t, but bisimulation still gives a more approachable alternative to (≈).
The importance of the substitution in the definition of (λx.e) R⁰† (λx.t) for CBV is best illustrated by the proof of extensionality, from which we obtain the Cβ_x rule introduced in Section 3.
To see that R is a bisimulation, fix σ, and note that σ(λx.e) and σ(λx.t) terminate to themselves at level 0. By the variable convention (Barendregt, 1984), the bound x may be assumed fresh for σ, so the required experiments follow from the premise.

Proof. Apply both sides to an arbitrary a and use Proposition 36 with β/β_v.
Our proof of Proposition 36 would have failed in CBV if we had defined (λx.e) R⁰† (λx.t) ⟺ e R t, without the substitution. For when e ≡ (λ_.0) x and t ≡ 0, the premise ∀a. [a/x]e ≈ [a/x]t is satisfied but e ≉ t, so λx.e and λx.t are not bisimilar with this weaker definition. The binding in λx.e ∈ E⁰ is guaranteed to be well-leveled, and exploiting it by inserting [a/x] in the comparison is strictly necessary to get a complete (as in "sound and complete") notion of bisimulation.
Function extensionality is a common addition to the equational theory of the plain λ calculus, usually called the ω rule (Plotkin, 1974; Intrigila & Statman, 2009). But unlike with ω in the plain λ calculus, λU functions must agree on open-term arguments and not just on closed-term arguments. This is no surprise, given that λU functions do receive open arguments during program execution; however, we know of no specific functions that fail to be equivalent because of open arguments. Whether extensionality can be strengthened to require equivalence only under closed arguments is an interesting open question.
Another important fact which can be proved with applicative bisimulation is that two divergent terms are equivalent. An exception has to be made for the case where one term gets stuck on a free variable while the other diverges for a different reason, but a difference of this kind can be detected by comparing the terms under substitutions. This result will let us show that stpow is interchangeable with its erasure not just in terminating cases but also in non-terminating cases.
Lemma 38. For fixed e and t, if for every σ : Var ⇀fin Arg we have σe ≈⇑ σt, then e ≈ t.
Proof. Notice that {(e, t)}• is an applicative bisimulation.
Remark. The only difference between Definition 34 and applicative bisimulation in the plain λ calculus is that Definition 34 avoids applying closing substitutions. Given that completeness can be proved for this bisimulation, it seems plausible that the problem with reasoning under substitutions is the only thing that makes conservativity fail. Hence, it seems that, for closed unstaged terms, λU's (≈) could actually coincide with that of the plain λ calculus. Such a result would make a perfect complement to the Erasure Theorem, for it would let us completely forget about staging when reasoning about erased programs.
We do not have a proof of this conjecture, however. Conservativity results for observational equivalences are often proved by semantic arguments that exploit denotational models (Mitchell, 1993;Riecke & Subrahmanyam, 1994;McCusker, 2003), but giving such a model for hygienic MSP is notoriously difficult (Benaissa et al., 1999). Although Riecke & Subrahmanyam (1994) do also discuss a more syntactic approach, that proof also occasionally uses semantic arguments. Investigating whether such techniques can be made to work for λ U deserves consideration in a separate paper.

Example: Tying loose ends on staged power
In Section 4.2, we sidestepped issues arising from the fact that stpow 0 ⇑⁰ whereas power 0 ⇓⁰. If we are allowed to modify the code, this problem is usually easy to avoid, for example, by making power and genpow return dummy values for non-positive arguments. If not, we can still persevere by finessing the statement of correctness. The problem is partial application, so we can force stpow to be fully applied before it executes by stating power ≈ λn.λx.stpow n x.
Proof. We just need to show ∀e, t ∈ E⁰. power e t ≈⇑ stpow e t, because then ∀e, t ∈ E⁰. ∀σ : Var ⇀fin Arg. σ(power e t) ≈⇑ σ(stpow e t), whence power ≈ λn.λx.stpow n x by Lemma 38 and extensionality. So fix arbitrary, potentially open, e, t ∈ E⁰, and split cases on the behavior of e. As is evident from the following argument, the possibility that e and t contain free variables is not a problem here.
[If e ⇑⁰ or e ⇓⁰ u ∉ Z⁺] Both power e t and stpow e t diverge.

Proof. By the same argument as in CBN, we just need to show power u v ≈⇑ stpow u v for arbitrary u, v ∈ V⁰.
[If u ∉ Z⁺] Both power u v and stpow u v get stuck at if n = 0.
[If u ∈ Z⁺] If u ≡ 1, then power 1 v = v = stpow 1 v. If u > 1, we show that the generated code is strict in a subexpression that requires v ∈ Z. Observe that genpow u ⟨x⟩ ⇓⁰ ⟨e⟩, where e has the form x * t. For [v/x]e ⇓⁰ to hold, it is necessary that v ∈ Z. It is also clear that power u v ⇓⁰ requires v ∈ Z. So either v ∉ Z, in which case power u v ⇑⁰ and stpow u v ⇑⁰ and we are done, or v ∈ Z, in which case Proposition 28 applies.
Remark. Real code should not use λn.λx.stpow n x, as it regenerates and recompiles the code upon every invocation. Application programs should always use stpow, and one must check (outside the scope of verifying the function itself) that stpow is always eventually fully applied, so that the η expansion is benign.

Case studies
In this section, we verify two concrete generators that are more illustrative of the techniques used in realistic applications than power, to demonstrate that this article's approach can cover more complex generators. Each example illustrates specific complicating factors that can arise in practical generators:
• The LCS (Section 6.1) couples monadic memoization with continuation-passing style and let-insertion (Swadi et al., 2006). This technique is essential for generating code of acceptable quality, but it complicates the generated code. Nonetheless, the proof strategy remains roughly the same.
• The staged fold function (Section 6.2) is an example of a higher order generator: one that takes another generator as input. Despite the fact that the normalization approach (Lemma 26) demands ground terms, we demonstrate that it can also handle higher order code. We also show that, in this context, careful equalities can give a more natural characterization of correctness than the normalization approach.
These examples illustrate that our techniques apply to a wide range of generators, and that we need not hold back on sophisticated programming techniques in order to make the program amenable to analysis.

Longest common subsequence
In this section, we work out the correctness proof of LCS. Although this example, much like power, is chosen for ease of explanation rather than practical utility, its structure is representative of programs exploiting the monadic memoization technique, which is useful for staging a wide range of memoized functions (Swadi et al., 2006). The technique has an effect similar to performing common subexpression elimination on the generated code, although it differs in that the common subexpressions are detected and shared as generation progresses. Compilers, by contrast, usually perform this optimization in a post-generation pass that inspects the syntactic structure of generated code. This difference notwithstanding, the technique has proved indispensable for compiler-like applications of MSP, such as domain-specific language implementation (Brady & Hammond, 2006; Taha, 2008) and circuit generation (Kiselyov & Taha, 2005).

The code
The code for LCS is displayed in Figure 4. We have several versions of the same function, which maps integers i, j and 0-based arrays P, Q to the length of the LCS of P and Q. For simplicity, we compute the length of the LCS instead of the actual sequence, but modifying the code to return the sequence is straightforward. The function naive lcs is a naïve exponential-time implementation serving as the specification, while lcs is the textbook polynomial-time version with memoization. The function stlcs is the staged version of lcs that we wish to verify, which specializes lcs to the lengths i and j of the input sequences. All recursive calls in lcs and stlcs go through the memoizing combinators mem and memgen, respectively. Memo tables are represented by functions mapping a key k and functions f, g to f v if a value v is associated with k, or to g () if k is not in the table. The value empty is the empty table, ext extends a table with a given key-value pair, and lookup looks up the table. This interface is chosen to make the correspondence with λ U clear. In MetaOCaml, lookup can return an option type, but since we left out higher order constructors from λ U , we Church-encoded the option type here: Const covers first-order types like int option, but not higher order types like (int -> int) option or (int code) option.
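To make the interface concrete, here is an unstaged OCaml sketch of the memo-table operations and the specification naive_lcs. The function-encoded tables follow the description above; the exact definitions in Figure 4 may differ in detail.

```ocaml
(* A memo table maps a key k and functions f, g to f v when k is
   bound to v, and to g () otherwise: a Church-encoded option. *)
let empty = fun _k _f g -> g ()
let ext tab k v = fun k' f g -> if k' = k then f v else tab k' f g
let lookup tab k f g = tab k f g

(* Specification: naive exponential-time LCS length over 0-based
   arrays, comparing the prefixes p.(0..i) and q.(0..j). *)
let rec naive_lcs i j (p : int array) (q : int array) =
  if i < 0 || j < 0 then 0
  else if p.(i) = q.(j) then naive_lcs (i-1) (j-1) p q + 1
  else max (naive_lcs (i-1) j p q) (naive_lcs i (j-1) p q)
```

For instance, lookup (ext empty (1,2) 5) (1,2) (fun v -> v) (fun () -> -1) evaluates to 5, and naive_lcs 2 2 [|1;2;3|] [|2;3;4|] evaluates to 2.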
We use a state-continuation monad to hide memo-table passing and to put the code into CPS. Computation in this monad takes a state (the memo table) and a continuation, and calls the continuation with an updated state and return value. All of the effectful computation happens inside the memoization combinators: Both mem and memgen look up the memo table, call the memoized function if no suitable value is cached, and update the memo table.
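An unstaged rendition of this monad and of mem may help fix ideas. The names ret, bind, and run below are ours, and the sketch models the unstaged lcs, not stlcs.

```ocaml
(* Memo tables, Church-encoded as above. *)
let empty = fun _k _f g -> g ()
let ext tab k v = fun k' f g -> if k' = k then f v else tab k' f g
let lookup tab k f g = tab k f g

(* State-continuation monad: a computation receives the memo table s
   and a continuation k, and calls k with an updated table and a result. *)
let ret v = fun s k -> k s v
let bind m f = fun s k -> m s (fun s' v -> f v s' k)

(* Memoizing combinator: all effects live here. Look up the table,
   run the memoized body only on a miss, and cache its result. *)
let mem key body = fun tab k ->
  lookup tab key
    (fun v -> k tab v)                                      (* hit *)
    (fun () -> body () tab (fun tab' v -> k (ext tab' key v) v))

(* Memoized LCS in the monad; recursive calls go through mem. *)
let rec lcs i j p q =
  if i < 0 || j < 0 then ret 0
  else
    mem (i, j) (fun () ->
      if p.(i) = q.(j)
      then bind (lcs (i-1) (j-1) p q) (fun a -> ret (a + 1))
      else bind (lcs (i-1) j p q) (fun a ->
           bind (lcs i (j-1) p q) (fun b -> ret (max a b))))

(* Run a computation with the empty table and identity continuation. *)
let run m = m empty (fun _ v -> v)
```

For example, run (lcs 2 2 [|1;2;3|] [|2;3;4|]) evaluates to 2, in agreement with the naive specification.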

Purpose of CPS
The purpose of continuation-passing style translation in LCS is to implement the binding-time improvements first discovered by Bondorf (1992). This section briefly reviews the significance of this improvement. See Swadi et al. (2006) for a more thorough treatment.
The crux of the monadic memoization technique is the part of memgen shown in listing (17), which is executed precisely when the key (i,j) is not in the table tab. The variable r holds a new code value returned from genlcs to be associated with the key (i,j), and k is the continuation of the call to memgen genlcs, which generates further code using the code associated with (i,j). Instead of directly registering r with tab as the value corresponding to (i,j), the code binds r to a new level-1 variable z, then registers <z> instead. Without this trick, the code inserted into the memo table snowballs exponentially. Suppose we modified memgen so that the code in listing (17) is replaced by

k (ext tab (i,j) r) r

Let <e ij > be the code generated for a given i, j pair by genlcs using this modified memgen. Then, e ij would be as follows, containing e (i−1)(j−1) , e (i−1)j , and e i(j−1) as subterms:

if p.(i) = q.(j) then e (i−1)(j−1) + 1 else max e (i−1)j e i(j−1)     (18)

But e (i−1)(j−1) in turn contains e (i−2)(j−2) , e (i−2)(j−1) , and e (i−1)(j−2) , and likewise for e (i−1)j and e i(j−1) . The code size is hence exponential in i and j.
Listing (17) solves this problem by generating a binding let z ij = e ij for each i, j, passing on <z ij > to be used in place of <e ij >. Overall, the generated code is a cascade of such let bindings over the final expression: for each i, j pair, the variable z ij is bound to the right-hand side of Equation (18), but with the subterms e (i−1)(j−1) , e (i−1)j , and e i(j−1) replaced by variable references. Thus, the right-hand side of each let has the same size regardless of i and j. Because a let is generated only when memo-table lookup fails, each e ij is bound at most once for every i, j pair, ensuring that the code size is polynomial.
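The polynomial bound can be observed even without staging in a toy model that generates code as strings instead of code values. This model is entirely our own illustration: the memo table maps each (i, j) to the name z_i_j of its let-bound variable.

```ocaml
(* Toy model of let-insertion: code is a string, and the memo table
   maps each (i, j) to the variable name its code is bound to. *)
let gen imax jmax =
  let tbl = Hashtbl.create 16 in      (* (i, j) -> variable name *)
  let lets = ref [] in                (* generated bindings, reversed *)
  let rec go i j =
    if i < 0 || j < 0 then "0"
    else
      match Hashtbl.find_opt tbl (i, j) with
      | Some z -> z                   (* hit: reuse the bound variable *)
      | None ->
        let a = go (i-1) (j-1) in
        let b = go (i-1) j in
        let c = go i (j-1) in
        let z = Printf.sprintf "z_%d_%d" i j in
        let body =
          Printf.sprintf "if p.(%d) = q.(%d) then %s + 1 else max %s %s"
            i j a b c in
        Hashtbl.add tbl (i, j) z;
        lets := (z, body) :: !lets;
        z
  in
  let top = go imax jmax in
  (List.rev !lets, top)
```

Here gen 3 3 emits exactly one binding per (i, j) pair with 0 ≤ i, j ≤ 3, i.e., 16 lets, each with a constant-size right-hand side, whereas inlining the bodies instead would blow up exponentially.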
Thus, monadic memoization is essential for generating code of acceptable quality. But the generated code takes much more work to describe formally, as one must predict and spell out the order of let bindings appearing in the generated code. Fortunately, like with the power example, erasure makes such details largely irrelevant, for it lets us get away with rather coarse characterizations of the generated code.

Correctness proof
We now sketch the main parts of the correctness proof for LCS. We focus on the harder CBV case and leave CBN as an exercise. We adopt the proof-by-normalization approach, although careful reductions could also work. Let us assume Const has unit, booleans, integers, tuples of integers, and arrays thereof with 0-based indices. The symbol A stands for the set of all arrays (a subset of Const), σ ranges over finite substitutions Var ⇀ V 0 , and e ⇓ 0 Z means ∃n ∈ Z. e ⇓ 0 n.
Despite all the complications introduced by monadic memoization, our strategy remains the same as for power: check termination and apply the Erasure Theorem. Just like with power, to show the termination of stlcs we track the invariant that the generated code terminates under suitable substitutions. The difference is that the set of variables that the "suitable substitution" has to cover grows as more let bindings are generated.
The invariant is captured in two parts, one for the memo table and one for the continuation. For the memo table, every key should be mapped to some <z>, where z should have an integer value under the substitution that will be in effect when the generated code is run.

Definition 41.
A good memo table is a T ∈ E 0 such that for every i, j ∈ Z and every f, The set of all good memo tables is written G. A good memo table T is covered by σ iff σ is a substitution such that for all of the z's we have z ∈ dom σ and σz ∈ Z. The set of all good memo tables covered by σ is written G σ .
The continuation should then preserve termination under substitution, i.e., it should map terminating code <e> to terminating code <t>. As a caveat, the continuation will be invoked under lets, like the call to k in listing (17), so the termination of e must be assessed under substitutions that cover more variables than were visible when the continuation was created. In the following definition, σ′ ⊇ σ means σ′ is an extension of σ (i.e., dom σ′ ⊇ dom σ and σ′ | dom σ = σ).

Definition 42.
Let the set K σ of all good continuations under σ consist of all k ∈ V 0 such that for any e, σ′ ⊇ σ, and T ∈ G σ′ with σ′ e ⇓ 0 Z, we have ∃t. k T <e> = <t> and σ′ t ⇓ 0 Z.
With these invariants, the following lemma (Lemma 43) can be proved: given a memo table and a continuation that respect these invariants, genlcs returns terminating code.
Theorem 44. λ U v naive lcs ≈ λx.λy.λp.λq.stlcs x y p q

Proof. By extensionality and Lemma 38, it suffices to prove naive lcs i j P Q ≈ ⇑ stlcs i j P Q for every i, j, P , Q ∈ V 0 . Here, we focus on the case where both sides converge, leaving the less interesting cases to the thesis (Inoue, 2012). Under these assumptions, stlcs i j P Q reduces to

! <fun p q -> ~(genlcs i j <p> <q> empty (fun s r -> r))> P Q     (19)

Let σ = [P , Q/p, q]. Then, it is easily seen that empty ∈ G σ and (fun s r -> r) ∈ K σ , so by Lemma 43, there exists a term e with σ e ⇓ 0 Z such that
We omit the proof that the erasure of stlcs i j P Q terminates to an integer, since that proof involves no staging. Therefore, by the Erasure Theorem (specifically Lemma 26), stlcs i j P Q equals its erasure, which is ≡ lcs i j P Q. One can show that lcs i j P Q = naive lcs i j P Q, so it follows that naive lcs i j P Q = stlcs i j P Q. It is worth noting how the argument let us ignore many details about the generated code. We did track termination and type information, but we never specified what the generated code looks like or what values it should compute. In fact, we were blissfully ignorant of the fact that stlcs computes (the length of) the LCS. Erasure thus decouples the reasoning about staging from the reasoning about return values, just as we saw earlier in the power example.
In the thesis (Inoue, 2012), it is shown that the proof of naive lcs ≈ lcs is also quite routine. The lack of surprise in this part of the proof is itself somewhat noteworthy, because it demonstrates that despite the challenges of open-term evaluation (Sections 3 and 5), the impact on correctness proofs is very limited, especially when reasoning about closed, unstaged terms.

Higher order generators
In this section, we verify a higher order generator, one that takes another code generator as a parameter. The key issue in this scenario is how to specify the behavior of the generators that are passed in as parameters. To this end, we find that proof by careful equalities (Section 4.5) has concrete advantages over proof by normalization.
This section's material is not covered in the thesis (Inoue, 2012). Proof details for this section are given in Appendix A.3.

The code
The code we will analyze is the inlining fold function, which captures a very common pattern in which guaranteed inlining can make a significant difference:

let rec fold f y xs = match xs with
  | [] -> y
  | x::xs -> fold f (f y x) xs

let stfold f = ! <let rec loop y xs = match xs with
                  | [] -> y
                  | x::xs -> loop ~(f <y> <x>) xs
                  in loop>

In this listing, fold is the usual left-fold function that reduces a list by a binary operator f, in a left-associative manner with seed value y. The stfold function is a staged variant that inlines the binary operator, assuming it is given as a generator that maps two code values to code. The unstaged and staged functions can be invoked like:

fold (+) 0 [1;2;3] (* returns 6 *);;
stfold (fun x y -> <~x + ~y>) 0 [1;2;3] (* returns 6 *);;

which both sum together the given list. However, fold repeatedly invokes the binary operator for every element, whereas stfold generates a new loop that inlines the operator, thereby avoiding repeated function calls. In the rest of this section, the symbol f is reserved for the staged binary operator passed in as the first argument to stfold.
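The specification side of this example can be exercised in plain OCaml. Below, sum_loop is a hand-written rendition of the loop that stfold would generate for the (+) operator; it is an illustration of ours, not actual MetaOCaml output.

```ocaml
(* The usual left fold: invokes f once per list element. *)
let rec fold f y xs = match xs with
  | [] -> y
  | x :: xs -> fold f (f y x) xs

(* Hand-inlined counterpart of the code stfold generates for (+):
   the binary operator is fused into the loop body. *)
let sum_loop =
  let rec loop y xs = match xs with
    | [] -> y
    | x :: xs -> loop (y + x) xs
  in loop
```

Both fold (+) 0 [1;2;3] and sum_loop 0 [1;2;3] evaluate to 6, but the latter makes no call to a separately passed operator.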

Correctness proof for CBN
As always, the proof is simpler in CBN than in CBV, so we will present CBN first. We assume Const contains integers and lists thereof. Pattern matches on lists are modeled similarly to if-then-else: we include a constant match, with reduction rules selecting the first branch on [] and the second on x::xs. Then, the expression match xs with [] -> e 1 | x::xs -> e 2 is modeled by the λ U term match xs (λ_. e 1 ) (λx xs. e 2 ). Ill-formed combinations like δ(match, 1) are undefined.
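In OCaml itself, the constant match can be mimicked by an ordinary function (a sketch of the encoding; in λ U the branch is selected by the δ rules rather than by a host-language match):

```ocaml
(* match_ xs knil kcons plays the role of the constant `match`:
   the list argument selects a branch, mirroring the two delta rules. *)
let match_ xs knil kcons =
  match xs with
  | [] -> knil ()
  | x :: rest -> kcons x rest

(* match xs with [] -> e1 | x::xs -> e2   is modeled as
   match_ xs (fun _ -> e1) (fun x xs -> e2), for example: *)
let rec length xs = match_ xs (fun _ -> 0) (fun _ rest -> 1 + length rest)
```

Thunking the branches matters in CBV, since only the selected branch should be evaluated.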
We need to be careful about what exactly correctness means for higher order generators like stfold. It is not the case that ∀f, a, l ∈ Arg. stfold f a l ≈ fold f a l, because stfold expects f to be a generator whereas fold expects f to be an unstaged function. Intuitively, f is a part of the code that stfold generates, so it must be erased together with stfold. The correct statement to aim for is thus ∀f, a, l ∈ Arg. stfold f a l ≈ fold ⌊f⌋ a l, where ⌊f⌋ denotes the erasure of f, and f itself must be a function mapping two code values to a code value.
Proposition 45 (CBN correctness of stfold). In CBN, for any f ∈ E 0 such that, for x, y ∉ FV(f), ∃e. λ U n f <y> <x> = <e>, we have λ U n stfold f = fold ⌊f⌋, where ⌊f⌋ is the erasure of f.

Proof.
Using the assumption f <y> <x> = <e>, we see directly that stfold f reduces to an unstaged form, so the Erasure Theorem, specifically Lemma 21, applies. See Appendix A.3.1 for details.
Example 46. Let f def ≡ λx y.<~x + ~y>. Then, f <y> <x> = <y + x>, which is of the form <e>, so stfold f = fold ⌊f⌋ (the erasure of f); hence λ U n stfold f 0 = fold ⌊f⌋ 0 = sum, where sum is a function that sums up all elements of a list.

Correctness proof for CBV using normalization
For CBV, the generated code must additionally terminate to constants under the relevant substitutions. In a proof by normalization, this additional condition can be ensured by requiring f <y> <x> to produce code that terminates to a constant whenever y and x are substituted by constants drawn from the right domain. For a set S ⊆ V 0 , let e ⇓ 0 S mean ∃v ∈ S. e ⇓ 0 v.
Proposition 47 (CBV correctness of stfold by normalization). In CBV, let f ∈ V 0 and D ⊆ Const, and let D * def = {[c 1 , c 2 , . . . , c n ] | c 1 , . . . , c n ∈ D, n ∈ N}. Assume that, for x, y ∉ FV(f), there exists e such that λ U v f <y> <x> = <e> and condition (20) holds for this e. Then, for any d ∈ D and l ∈ D * , we have λ U v stfold f d l = fold ⌊f⌋ d l, where ⌊f⌋ is the erasure of f.
Proof. By the Erasure Theorem (specifically, Lemma 26) and some easy arguments; see Appendix A.3 for details.

Example 48. Let f def ≡ λx y.<~x + ~y>. Then, λ U v f <y> <x> = <y + x>, which is of the form <e>. Taking D def = Z, for every n, m ∈ Z we have ⌊f⌋ n m = n + m ⇓ 0 Z and [n, m/y, x](y + x) ≡ n + m ⇓ 0 Z, so by Proposition 47, stfold f = fold ⌊f⌋. Hence, stfold f 0 l = fold ⌊f⌋ 0 l = sum l in CBV, where l is a list of integers and sum is a function that sums up all elements of a list.
Assumption (20) is essentially a type constraint saying that the binary operator f is a first-order function mapping constants to constants. This constraint implicitly gives a set of contexts that force return values to be of ground type (i.e., constants), which is needed to invoke Lemma 26.

Correctness proof for CBV using careful equalities
Assumption (20) in Proposition 47 can be somewhat limiting. For example, it does not cover the case where the binary operator f is itself higher order; f might be (a staged version of) function composition.
We can avoid this kind of restriction by specifying the behavior of f by careful equalities instead. Careful equalities do not dictate the shape of return values, so the statement of correctness becomes cleaner and more general. However, the proof system of Proposition 30 falls short here, because f is invoked only once with a fixed set of arguments, whereas the code it produces is invoked multiple times; their reductions simply do not align with each other. The proof must therefore consider the staged and unstaged counterparts separately. Proposition 30 can be used instead for checking the properties of f.
Proof. By some arguments using extensionality (Proposition 36) and equivalence of divergent terms (Lemma 38), it suffices to show ∀v 0 , l ∈ V 0 . stfold f v 0 l ≈ ⇑ fold ⌊f⌋ v 0 l. For simplicity, assume l def ≡ [u 1 , . . . , u n ] and let us focus on the case stfold f v 0 l ⇓ 0 . By induction on the length of the input list l, we get a sequence v 1 , . . . , v n ∈ V 0 relating the values computed on the staged and unstaged sides.

Example 50. Let f def ≡ λx y.<fun z -> ~x (~y z)>, a staged (reverse) function composition. Then, f <y> <x> = <fun z -> y (x z)>, which is of the form <e>, so by Proposition 49 we have stfold f (λz.z) = fold ⌊f⌋ (λz.z).

Remark. Technically, this example applies only when composing lists of built-in functions that can be modeled as constants, such as unary (-) or the partial applications of (+) and ( * ), because lists are modeled as constants. (See Remark 1 about how partially applied operators are viewed as constants.) This restriction can be lifted either by Church-encoding lists or by adding constructors to λ U , perhaps in the style of Arbiser et al. (2006) if the addition of types is undesirable.
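Unstaged, the composition example behaves as follows (compose plays the role of the erasure of the staged operator; the names compose, fs, and g are ours):

```ocaml
(* The usual left fold. *)
let rec fold f y xs = match xs with
  | [] -> y
  | x :: xs -> fold f (f y x) xs

(* Erasure of the staged operator: the accumulator acc is applied
   after the list element elem, i.e., reverse composition. *)
let compose acc elem = fun z -> acc (elem z)

(* Folding compose over a list of functions, seeded with the
   identity, composes them all. *)
let fs = [ (fun n -> n + 1); (fun n -> n * 2) ]
let g = fold compose (fun z -> z) fs
```

Here g is fun z -> (z * 2) + 1, so for example g 3 evaluates to 7.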

Comparison of proofs in CBV
As we have just seen, careful equalities can be better suited for higher order generators than proof by normalization, because the former gives a more natural vocabulary for specifying the behavior of the generator coming in as input (f in the case of stfold). Should we then abandon proof by normalization and always work with careful equalities? Not necessarily: as one can check in Appendices A.3.2 and A.3.3, once the restriction to first order is in place, the details of the proof can be much simpler in the normalization approach. Only when the restriction to first order is unacceptable do we need careful equalities.
Note that this limitation to first-order is not as grave as it may seem. If f can be given a fixed type, the returned code can be applied to more arguments to force ground-type return values. In Example 50, if the input list is known to contain only functions of type int -> int, a correctness proof by normalization simply needs to prove stfold f m l k = fold (•) m l k, with an extra argument k ∈ Z, so that the return type is int. Moreover, higher orderness is itself an abstraction that ought to be eliminated with staging, so practical generators tend to produce first-order code. As a case in point, the code generation in Example 50 is barely beneficial in practice, since performance gains from inlining (•) are dwarfed by the costs of repeatedly allocating the results of the inlined (•).
An important class of generators that genuinely require higher order correctness statements is staged interpreters for higher order languages (Brady & Hammond, 2006; Taha, 2008; Carette et al., 2009). When a programming language interpreter is written in MetaOCaml and staged, the result is a translator, i.e., a compiler, from the object language to OCaml. If the object language has higher order features, the generated code may have to be higher order as well (though that can limit the gains from staging). Hence, it is desirable to have an alternative correctness proof that allows the generated code to be higher order. In the absence of such a functional requirement, proof by normalization can be a sensible choice.

Related works

Taha (1999) first discovered λ U , which showed that functional hygienic MSP admits intensional equalities like β, even under brackets. The key was to drop intensional analysis, that is, pattern-matching on the syntactic structure of code values. By contrast, earlier systems that allowed intensional analysis were forced to have trivial equational theories (Muller, 1992). However, Taha showed only the existence of the theory and did not explore how to use it for verification or investigate extensional equivalences. Moreover, though Taha laid down the operational semantics of both CBV and CBN, he gave an equational theory for only CBN and left the trickier CBV unaddressed.

Yang (2000) pioneered the use of an "annotation erasure theorem", which stated e ⇓ 0 <t> =⇒ t ≈ ⌊e⌋, where ⌊e⌋ is the erasure of e. But there was a catch: the conclusion t ≈ ⌊e⌋ was asserted in the unstaged base language, instead of the staged language. Translated to our setting, the conclusion of the theorem was λ t ≈ ⌊e⌋ and not λ U t ≈ ⌊e⌋. In practical terms, this meant that the context of deployment of the staged code could contain no further staging: code generation must be done offline, and application programs using the generated t must be written in a single-stage language, or else no guarantee was made. This interferes with combining analyses of multiple generators and precludes dynamic code generation by run (!). Yang also worked with operational semantics and did not explore in depth how equational reasoning interacts with erasure.

This paper can be seen as a confluence of these two lines of research: we complete λ U by giving a CBV theory with a comprehensive study of its peculiarities, and we adapt erasure to produce an equality in the staged language λ U .
Berger & Tratt (2015) devised a Hoare-style program logic for the typed language Mini-ML e . They develop a promising foundation and prove strong properties about it, such as relative completeness, but the concrete verification tasks they consider concern relatively simple programs. Mini-ML e also prohibits manipulating open terms, so it does not capture the difficulty of reasoning about free variables, which is one of the main challenges we address. Insights gained from λ U should help extend such logics to more expressive languages, and our proof techniques should provide a good toolbox to build on top of them.
An interesting line of work that mitigates the expressivity problems in Mini-ML e yet successfully avoids issues with open terms is contextual modal type theory (Nanevski et al., 2008). Its application to MSP offers a staging construct which, through typing, restricts code values to closed terms of a form that roughly translates to λx 1 . . . . λx n . <[~x 1 , . . . , ~x n /x 1 , . . . , x n ] e 0 > in λ U notation. That is, code values must be closed, and any references to level-1 (or higher) free variables must be expressed via references to escaped level-0 variables. The resulting closure is applied to terms like <x>, replacing the escaped level-0 variables by level-1 variables. (Strictly speaking, both the abstraction and the application to <x> use custom constructs, so the open term <x> is never explicitly constructed as a first-class value.) We expect the Erasure Theorem to still apply to this setting, augmented with cleaner characterizations of observational equivalence than those developed in this paper.
For MSP with variable capture, Choi et al. (2011) proposed an alternative approach with different trade-offs than ours. They provide an "unstaging" translation of staging annotations into environment-passing code. Their translation is semantics-preserving with no proof obligations, but it leaves an unstaged program that is complicated by environment-passing, whereas our erasure approach leaves a simpler unstaged program at the expense of additional proof obligations. Their approach also has the advantage that the target language of the translation has no staging, so reasoning principles need not be ported to a staged setting, provided that, as with Yang's results, the context of deployment contains no further staging. It will be interesting to see how these approaches compare in practice or whether they can be usefully combined, but, for the moment, they seem to fill different niches.
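To give a flavor of the environment-passing target, here is a drastically simplified sketch of our own, not Choi et al.'s actual translation: one int-valued variable stands in for a full environment, a code value becomes a function from the environment to a value, and run becomes application.

```ocaml
(* A code value of type 'a becomes a function env -> 'a. For this toy,
   the environment is a single int binding. *)
type 'a code = int -> 'a

let lift (n : int) : int code = fun _env -> n            (* <n>        *)
let var : int code = fun env -> env                      (* <x>        *)
let add (a : int code) (b : int code) : int code =
  fun env -> a env + b env                               (* <~a + ~b>  *)
let lam (body : int code) : (int -> int) code =
  fun _env -> fun x -> body x   (* binding feeds the body's env *)
let run (c : 'a code) : 'a = c 0                         (* !          *)

(* "Generate" fun x -> x + 1 and run it. *)
let inc = run (lam (add var (lift 1)))
```

Here inc 41 evaluates to 42. The point of the comparison stands out even in this toy: the "unstaged" program threads environments everywhere, whereas erasure would leave plain OCaml behind.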
There is a wealth of publications on representing free variables and binding structures, often with the goal of supporting syntactic transformations and/or mechanized reasoning (Gabbay & Pitts, 2001;Aydemir et al., 2008;Licata et al., 2008;Pouillard & Pottier, 2010). The nominal (Gabbay & Pitts, 2001) and definitional variation (Licata et al., 2008) approaches in particular provide deep insights into the mathematical properties of binding and scope. While the present paper gives only an operational intuition as to the cause of pathologies relating to open-term manipulation (see Section 3), these more developed theories of binding may be able to provide more formal, mathematical explanations.
Applicative bisimulation has been studied extensively as a characterization of observational equivalence in the plain λ calculus and its variants (Abramsky, 1990; Howe, 1996; Gordon, 1999), which made it a natural starting point in our investigation. However, more advanced flavors of bisimulation exist, offering greater flexibility and lighter proof obligations. Small bisimulations (Koutavas & Wand, 2006) and environmental bisimulations (Sangiorgi et al., 2011) do not directly relate terms but rather more abstract states that can track contextual information, allowing the handling of effects. It will be interesting to see how they can be adapted to the multi-stage setting. Up-to techniques are often indispensable in simplifying the proof obligations for establishing bisimilarity between concrete terms (Pous & Sangiorgi, 2011). Our Definition 32 builds in reasoning up-to observational equivalence, which can be seen as a reformulation of bisimulation-up-to-bisimilarity (Milner, 1989) with the understanding that bisimilarity coincides with observational equivalence. This enhancement simplified the proof of extensionality (Proposition 36).
There are also other characterizations of observational equivalence. The CIU Theorem (Mason & Talcott, 1991) states that terms are observationally equivalent iff all of their closed instances equiterminate under arbitrary evaluation contexts. The Context Lemma (Milner, 1977;Jim & Meyer, 1996) states, in a typed setting, that closed terms of some type τ are equivalent exactly when they cannot be distinguished by the elimination forms for the type τ. Both approaches reduce the set of contexts that must be considered and are arguably simpler than bisimulation. As a result, Mason and Talcott observe that the proofs tend to be simpler (Mason & Talcott, 1991). We have not investigated how these techniques can be adapted to λ U .

Conclusion and future work
We have addressed three basic concerns for verifying staged programs. First, we showed that staging is a non-conservative extension because reasoning under substitutions is unsound in a multi-stage language, even if we are dealing with unstaged terms. Despite this drawback, untyped functional MSP has a rich set of useful properties. Second, we proved that simple termination conditions guarantee that erasure preserves semantics, which reduces the problem of proving the irrelevance of annotations on a program's semantics to the better studied problem of proving termination. Finally, we showed a sound and complete notion of applicative bisimulation in this setting, which allows us to reason under substitution in some cases. In particular, the shocking lack of β x in λ U v is of limited practical relevance as we have Cβ x instead, which covers β x completely when we are dealing with closed, erased terms.
These results yield important insights into the semantics of hygienic MSP. The Erasure Theorem gives intuitions about what staging annotations can and cannot do, with which we may educate the novice multi-stage programmer. Applicative bisimulation adapts in a natural manner, and the familiar notion of function extensionality carries over. The key difference from single-stage languages is the behavior of free variables, which greatly affects the formulation of bisimulation. However, the notion of bisimulation that we formulated in light of this difference is sound and complete, suggesting that free variables' behavior is the only essential difference between λ U and λ. This broad set of insights has brought us to a level where the correctness proof of a sophisticated generator like LCS is easily within reach, as are similar proofs for higher order generators.
This work may be extended in several interesting directions. We have specifically identified some open questions about λ U : Which type systems, if any, allow reasoning under substitutions? Is λ U conservative over the plain λ calculus for closed terms? Can the extensionality principle be strengthened to require equivalence for only closed-term arguments? What is a sensible notion of partial erasure, and is it useful for stating and proving correctness of higher order generators? Answering these questions will strengthen our understanding of staging even further.
In this paper, we pointed out that λ U is not conservative in the sense that not all observationally equivalent terms in the standard CBN and CBV λ calculi (Plotkin, 1975) remain equivalent in λ U . However, a major theme in this paper is that λ U nonetheless conserves useful reasoning principles. The β/β v equality with confluence, β x , extensionality, and sound and complete applicative bisimulation all carry over, albeit with some changes, from the standard λ calculi. For β/β v , we have also established that the changes-namely, level restrictions-cannot be reduced any further. For applicative bisimulation, its completeness suggests that the modifications from the plain λ calculus are minimal. It will be very interesting to explore if other reasoning principles, like recursion induction (McCarthy, 1963), carry over, and how much modification is strictly necessary.
It will also be interesting to investigate which of the more advanced alternatives to applicative bisimulation (Milner, 1977; Mason & Talcott, 1991; Koutavas & Wand, 2006; Sangiorgi et al., 2011) can be adapted to the multi-stage setting. Many of them have had success in handling effects, so they may make imperative hygienic MSP languages (Westbrook et al., 2010; Kameyama et al., 2011; Rompf & Odersky, 2012) amenable to analysis. However, as with applicative bisimulation, it seems common practice in these techniques to assume that only closed instances of terms matter. Koutavas and Wand, for instance, start by ruling out open terms in the states being compared. Thus, these techniques will probably need similar treatment to applicative bisimulation in order to track substitutions. Perhaps environmental bisimulation can capture them with little modification, using its existing machinery for tracking contextual information.
As a caveat, the Erasure Theorem does not apply as-is to imperative languages, since modifying the evaluation strategy can commute the order of effects. Two mechanisms will be key in studying erasure for imperative languages: one for tracking which effects are commuted with which, and another for tracking the mutual (in)dependence of effects; separation logic (Reynolds, 2002) may serve for the latter. In any case, investigation of imperative hygienic MSP may have to wait until the foundation matures, as noted in the introduction. Adapting erasure and other techniques to lightweight modular staging (Rompf & Odersky, 2012) will also need further development. The additional challenge there is to cope with the flexibility in the semantics that can be attached to the object code. It may require the host-language semantics to be able to mix different semantics, so that erasure makes sense.
Devising a mechanized program logic would also be an excellent goal. Berger and Tratt's program logic (2015) may be a good starting point, although whether to go with Hoare logic or to recast it in equational style is an interesting design question. A mechanized program logic may let us automate the particularly MSP-specific proof step of showing that erasure preserves semantics. The Erasure Theorem reduces this problem to essentially termination checks, and we can probably capitalize on recent advances in automated termination analysis, for example, those of Heizmann et al. (2010).
Finally, this work focused on functional (input-output) correctness of staged code, but quantifying performance benefits is also an important concern for a staged program. It will be interesting to see how we can quantify the performance of a staged program through formalisms like improvement theory (Sands, 1998).

Appendix A. Proofs
This appendix includes formalizations of claims and proofs of theorems that were omitted from the main text. The proofs are kept brief, just enough to get the idea across. For complete details, see the thesis (Inoue, 2012).

A.1 Equivalence of open-and closed-term observations
In this section, we prove that Definition 6 (observational equivalence) is equivalent to the stratified definition with non-closing contexts used by Taha (1999). We recall Taha's definition first. We will ignore constants throughout this section, but adding them is straightforward.
The idea behind the proof is that observation of open terms can be recast as observation of closed terms. In CBV, the machinery to do this recasting is λ U 's ability to force evaluations of open terms within programs (which are closed by definition). In CBN, the machinery does not rely on staging and requires a lemma that also holds in the plain λ calculus.
Proof. By symmetry, proving σe ⇓ =⇒ σ e ⇓ will suffice. The proof proceeds by induction on the number of steps that σe takes to terminate, using a technical lemma to classify the shape of σe. See the thesis for details.
Remark. Note that it need not be the case that dom σ ⊇ FV(e).

Proposition 53. (≈) = (≈ ′).

Proof. Suppose (e, t) ∉ (≈ ′). Then, for some ℓ, we have e, t ∈ E ℓ and a (possibly non-closing) context C distinguishing e from t; without loss of generality, C[e] ⇑ and C[t] ⇓. In CBV, let e 1 ; e 2 denote sequencing, which checks that e 1 terminates, discarding the return value, and then evaluates e 2 . Sequencing is just syntactic sugar for (λ_.e 2 ) e 1 in CBV. Then, the context C′ def ≡ <λx 1 . . . x n .~(C; <0>)> satisfies C′[e], C′[t] ∈ Prog, yet C′[e] ⇑ 0 and C′[t] ⇓ 0 , so λ U v e ̸≈ t. In CBN, let Ω be a closed, divergent level-0 term. Then, C′ def ≡ (λx 1 . . . x n .C) Ω · · · Ω is a program context for e and t, where there are as many copies of Ω as there are variables x i . Now, C′[e] ⇑ 0 ⇐⇒ C[e] ⇑ 0 and C′[t] ⇓ 0 ⇐⇒ C[t] ⇓ 0 by Lemma 52 and SS-β, so λ U n e ̸≈ t.

A.2 Soundness and completeness of applicative bisimulation
This section explains the proof of Theorem 35. We basically just adapt Howe's method (1996), but the details are complicated by the inconsistent handling of substitutions in λ U 's bisimulation. Definition 34 says that terms must be compared under all substitutions, yet Definition 32, for (λx.e) R 0 † (λx.t), says that only the substitutions that eliminate x should matter. When we try to prove Theorem 35 by coinduction, we find that Definition 32 refers not to the bisimulation whose definition it is a part of, but to a different bisimulation that holds only under substitutions that eliminate x, undermining the coinduction. To solve this problem, we recast bisimulation to a family of relations indexed by a set of variables to be eliminated, so that the analogue of Definition 32 can refer to a different member of the family. Theorem 35 is then proved by mutual coinduction.

A.2.1 Overview
We first review Howe's method for single-stage calculi and motivate the change that we made in adapting it to λ U , focusing on CBV. Howe shows that bisimilarity (∼), the union of all bisimulations, is a non-trivial congruence, whence (∼) = (≈). The hardest part is showing that (∼) respects contexts, i.e., e ∼ t =⇒ C[e] ∼ C[t]. For this step, Howe defines an auxiliary relation e ∼̂ t, the precongruence candidate, which holds iff e can be transformed into t by one bottom-up pass of replacing successively larger subterms e′ of e by some t′ such that e′ ∼ t′. Formally, this relation is defined along the following lines:

e ∼̂ s    λx.s ∼ t          e 1 ∼̂ s 1    e 2 ∼̂ s 2    s 1 s 2 ∼ t
─────────────────          ───────────────────────────────
    λx.e ∼̂ t                          e 1 e 2 ∼̂ t              etc.
If we try to apply this idea to λ U directly, we get stuck in the proof that (∼̂) is a bisimulation. Concretely, we cannot seem to prove

∀e, t ∈ E 0 . λx.e ∼̂ 0† λx.t =⇒ λx.e ∼̂ λx.t    (A1)

Implication (A1) arises as a subgoal in an inductive proof that e ∼̂ t and e ⇓ ℓ u imply t ⇓ ℓ v and u ∼̂ ℓ† v, where ℓ = max(lv e, lv t). The induction is on the number of steps e takes to terminate. Recall that we are focusing on CBV and consider the case
• ℓ = 0,
• e ≡ (λy.e 1 ) e 2 ∧ t ≡ (λy.t 1 ) t 2 , and
• e 2 ⇓ 0 λx.e ∧ t 2 ⇓ 0 λx.t.