Confusion of Tagged Perturbations in Forward Automatic Differentiation of Higher-Order Functions

Forward Automatic Differentiation (AD) is a technique for augmenting programs to both perform their original calculation and also compute its directional derivative. The essence of Forward AD is to attach a derivative value to each number, and propagate these through the computation. When derivatives are nested, the distinct derivative calculations, and their associated attached values, must be distinguished. In dynamic languages this is typically accomplished by creating a unique tag for each application of the derivative operator, tagging the attached values, and overloading the arithmetic operators. We exhibit a subtle bug, present in fielded implementations, in which perturbations are confused despite the tagging machinery.


Introduction
The classical univariate derivative of a function f : R → R is a function f′ : R → R (Leibniz, 1684; Newton, 1704). Multivariate or vector calculus extends the notion of derivative to functions whose domains and/or ranges are aggregates, that is vectors, introducing notions like gradients, Jacobians, and Hessians. Differential geometry further extends the notion of derivative to functions whose domains and/or ranges are, or can contain, functions.
Automatic differentiation (AD) is a collection of methods for computing the derivative of a function at a point when the function is expressed as a computer program (Griewank & Walther, 2008). These techniques, once pursued mainly by a small quiet academic community, have recently moved to the forefront of deep learning, where more expressive languages can spawn new industries, efficiency improvements can save billions of dollars, and errors can have far-reaching consequences.
From its earliest days, AD has supported functions whose domains and/or ranges are aggregates. There is currently interest from application programmers (machine learning in particular) in applying AD to higher-order functions. Here, we consider extending AD to support functions whose domains and/or ranges are functions. This is natural: we wish AD to be completely general and apply in an unrestricted fashion to correctly compute the derivative of all programs that compute differentiable mathematical functions. This includes applying to functions whose domain and/or ranges include the entire space of data types supported by programming languages, including not only aggregates but also functions. In doing so, we uncover a subtle bug. Although for expository purposes we present the bug in the context of forward AD (Wengert, 1964), the underlying issue can also manifest itself with other AD modes, including reverse AD (Speelpenning, 1980) of higher-order functions. The bug is insidious: it can lead to production of incorrect results without warning. We present and discuss the relative merits of two fixes, and exhibit code implementing them.
Our solutions are not ideal. While we believe that the solutions will always produce the correct result, they can foil both the space and time complexity guarantees of forward AD described in the next section.
Let 𝒟 denote the true mathematical derivative operator. 𝒟 is classically defined for first-order functions R → R in terms of limits, and thus this classical definition does not lend itself to direct implementation.
We seek to materialize 𝒟 as a program construct D. We can view this classical limit definition as a specification of D and proceed to develop an implementation of D. Below, we use = to denote mathematical equality, ≜ to denote definition of program constructs, and ⟹ to denote evaluation.
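For reference, the classical limit specification of 𝒟 for f : R → R can be written as follows (a standard reconstruction; the label (1) matches later references to this specification):

```latex
% Classical limit specification of the derivative operator for f : R -> R.
\mathcal{D}\, f\, x \;=\; \lim_{h \to 0} \frac{f(x + h) - f(x)}{h} \tag{1}
```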
One can extend 𝒟 to functions R → α. We first focus on this extension in Sections 2–8, and consider the further extension to functions α₁ → α₂ in Section 9. Since by (2) any type α must be of the form α₂ → ··· → αₙ → R, functions R → α can be viewed as multivariate functions R → α₂ → ··· → αₙ → R whose first argument domain is R and whose range is R. We take 𝒟 f, where f : R → α₂ → ··· → αₙ → R, to be the partial derivative with respect to the first argument:

𝒟 f = ∂f(x₁, x₂, …, xₙ) / ∂x₁    (3)

We will see below that past work has implemented a D that appears to coincide with the specification 𝒟 in (1) for functions R → R, but this past implementation fails to coincide with the specification 𝒟 in (3) for functions R → α. We then proceed to demonstrate two new implementations of D that do appear to coincide.
2 Forward AD as differential algebra Forward AD can be formulated as differential algebra (Karczmarczuk, 2001). Its essence is as follows.
The purely arithmetic theory of complex numbers as pairs of real numbers was introduced by Hamilton (1837). These form an algebra over two-term polynomials a + bi where i² = −1. Arithmetic proceeds by simple rules, derived algebraically.
Complex numbers can be implemented in a computer as ordered pairs (a, b), sometimes called Argand pairs. Since arithmetic over complex numbers is defined in terms of arithmetic over the reals, the above rules imply that computation over complex numbers is closed. Clifford (1873) introduced dual numbers of the form a + bε. In a dual number, the coefficient of ε is called a perturbation or a tangent. These can similarly be viewed as an algebra over two-term polynomials where ε² = 0 but ε ≠ 0. Arithmetic over dual numbers is again defined by simple rules derived algebraically.
(a + bε) + (c + dε) = (a + c) + (b + d)ε    (5a)
(a + bε)(c + dε) = ac + (ad + bc)ε + bd ε² = ac + (ad + bc)ε    (5b)

Again, dual numbers can be implemented in a computer as ordered pairs (a, b). Again, since arithmetic over dual numbers is defined in terms of arithmetic over the reals, the above rules imply that computation over dual numbers is closed. The essence of forward AD is viewing dual numbers as truncated two-term power series. Since, following Taylor (1715), f(x₀ + x₁ε + O(ε²)) = f(x₀) + x₁ f′(x₀) ε + O(ε²), applying f to a dual number a + 1ε will yield a dual number f(a) + f′(a) ε. This leads to the following method for computing derivatives of functions f : R → R expressed as computer programs.
• Arrange for the programming language to support dual numbers and arithmetic thereupon.
• To compute f′ at a point a,
  1. form a + 1ε,
  2. apply f to a + 1ε to obtain a result f(a) + f′(a) ε, and
  3. extract the tangent, f′(a), from the result.
Step 2 constitutes a nonstandard interpretation of the arithmetic basis functions with (5a) and (5b). This can be implemented in various ways, for example, overloading or source-code transformation. Further, dual numbers can be represented in various ways, for example, as unboxed flattened values or as boxed values referenced through pointers. These different implementation strategies do not concern us here. While different implementation strategies have different costs, what we discuss applies to all strategies.
It is convenient to encapsulate steps 1–3 as a higher-order function D : f ↦ f′. Indeed, that seems to be one of the original motivations for the development of the lambda calculus (Church, 1941, ¶ 4). We can do this with the following code that implements D.
Here, x + 1ε denotes step 1 above, that is, constructing a dual number, and tg (a + bε) denotes step 3 above, that is, extracting the tangent of a dual number. Equation (6a) handles the case where the output of f is independent of the input x.
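As a concrete illustration of steps 1–3, here is a minimal Python sketch (the paper's own code is in Scheme; the names `Dual`, `tg`, and `D` are ours, not the paper's). It supports only `add`, `mul`, and `sin`, with the arithmetic following rules (5a) and (5b):

```python
# Minimal untagged forward AD over dual numbers, a sketch of steps 1-3.
import math

class Dual:
    """A dual number a + b*eps with eps^2 = 0."""
    def __init__(self, primal, tangent):
        self.primal, self.tangent = primal, tangent

def lift(x):
    """View a plain real as a dual number with zero tangent."""
    return x if isinstance(x, Dual) else Dual(x, 0.0)

def add(x, y):                                   # rule (5a)
    x, y = lift(x), lift(y)
    return Dual(x.primal + y.primal, x.tangent + y.tangent)

def mul(x, y):                                   # rule (5b)
    x, y = lift(x), lift(y)
    return Dual(x.primal * y.primal,
                x.primal * y.tangent + x.tangent * y.primal)

def sin(x):                                      # chain rule for a basis function
    x = lift(x)
    return Dual(math.sin(x.primal), math.cos(x.primal) * x.tangent)

def tg(x):
    """Step 3: extract the tangent f'(a) from f(a) + f'(a)*eps."""
    return x.tangent if isinstance(x, Dual) else 0.0

def D(f):
    """Steps 1-3 packaged as a higher-order function D : f |-> f'."""
    return lambda a: tg(f(Dual(a, 1.0)))         # step 1: a + 1*eps; step 2: run f

# d/dx (x * sin(x)) at x = 2 is sin(2) + 2*cos(2).
df = D(lambda x: mul(x, sin(x)))
```

Note that this untagged version is exactly the setting in which the nesting problems of the following sections arise.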
Forward AD provides certain complexity guarantees. Steps 1 and 3 take unit time.
Step 2 introduces no more than a constant factor increase in both the space and time complexity of executing f under a nonstandard interpretation. Thus computing f x and D f x have the same space and time complexity.

Tagging dual numbers to avoid perturbation confusion
Siskind & Pearlmutter (2008) discuss a problem with the above. It is natural to nest application of D. Doing so would allow taking higher-order derivatives and, more generally, derivatives of functions that take derivatives of other functions.
This can lead to perturbation confusion (Siskind & Pearlmutter, 2005, Section 2, Eqs. (4)–(11)), yielding an incorrect result. The essence of perturbation confusion is that each invocation of D must perform its computation over a distinct differential algebra. While it is possible to reject programs that would exhibit perturbation confusion using static typing (Buckwalter, 2007; Kmett, 2010), and static typing can be used to yield the desired correct result in some cases with some user annotation (Shan, 2008), no static method is known that can yield the desired correct result in all cases without any annotation. It is possible, however, to get the correct result in all cases (except, as we shall see, when taking derivatives of functions whose ranges are functions) without user annotation, by redefining tg and D to tag dual numbers with distinct ε's, so as to obtain the distinct differential algebras (or equivalently, distinct generators in a differential algebra) introduced by different invocations of D (Lavendhomme, 1996). We will indicate different tags by different subscripts on ε, and use ε to denote a variable that is bound to an ε.
These redefine (6a)–(6c). Here, the tags are generated dynamically. Many systems employ this approach.¹ Many of these systems are implemented in "mostly functional languages," like SCHEME, ML, F#, PYTHON, LUA, and JULIA, and are intended to be used with pure subsets of these languages. Prior to this change, that is with only a single ε, the values a and b in a dual number a + bε would be real numbers. With this change, that is with multiple ε's, the values a and b in a dual number a + bε₁ can be dual numbers over ε₂ where ε₂ ≠ ε₁. Such a tree of dual numbers will contain real numbers at its leaves and will contain a given ε only once along each path from the root to the leaves. Equation (8c) provides the ability to extract the tangent of an ε that might not be at the root of the tree.
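A minimal Python sketch of the tagged scheme (our own rendering; `Dual`, `tg`, `D`, and the tag counter are assumed names, not the paper's). Each invocation of `D` draws a fresh tag, and `tg` searches below the root of the dual-number tree as in (8c). The nested example computes d/dx [x · (d/dy x·y at y = 2)] = d/dx x² = 2x, which an untagged implementation gets wrong:

```python
# Tagged forward AD: one fresh tag per invocation of D.
import itertools

_fresh = itertools.count()    # source of distinct tags

class Dual:
    """a + b*eps_tag; primal and tangent may be Duals over other tags."""
    def __init__(self, tag, primal, tangent):
        self.tag, self.primal, self.tangent = tag, primal, tangent

def add(x, y):
    if isinstance(x, Dual):
        if isinstance(y, Dual) and y.tag == x.tag:
            return Dual(x.tag, add(x.primal, y.primal), add(x.tangent, y.tangent))
        return Dual(x.tag, add(x.primal, y), x.tangent)  # y constant w.r.t. x.tag
    if isinstance(y, Dual):
        return Dual(y.tag, add(x, y.primal), y.tangent)
    return x + y

def mul(x, y):
    if isinstance(x, Dual):
        if isinstance(y, Dual) and y.tag == x.tag:
            return Dual(x.tag, mul(x.primal, y.primal),
                        add(mul(x.primal, y.tangent), mul(x.tangent, y.primal)))
        return Dual(x.tag, mul(x.primal, y), mul(x.tangent, y))
    if isinstance(y, Dual):
        return Dual(y.tag, mul(x, y.primal), mul(x, y.tangent))
    return x * y

def tg(tag, x):
    """(8a)-(8c): coefficient of eps_tag in x, searching below the root."""
    if isinstance(x, Dual):
        if x.tag == tag:
            return x.tangent
        return Dual(x.tag, tg(tag, x.primal), tg(tag, x.tangent))
    return 0.0

def D(f):
    def df(x):
        tag = next(_fresh)              # (8d): a fresh tag per invocation D f x
        return tg(tag, f(Dual(tag, x, 1.0)))
    return df

# d/dx [ x * (d/dy x*y at y = 2) ] = d/dx x^2 = 2x; at x = 1 the answer is 2.
result = D(lambda x: mul(x, D(lambda y: mul(x, y))(2.0)))(1.0)
```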

Extending to functions whose range is a function
If one applies D to a function f whose range is a function, f(x + 1ε) in (8d) will yield a function. In this higher-order case, when f returns a function g, an invocation D f x yields a function ḡ which performs a derivative calculation when invoked. It will not be possible to extract the tangent of this with tg as implemented by (8a)–(8c). The definition of tg can be augmented to handle this case by post-composition:²

tg ε ḡ ≜ (tg ε) ∘ ḡ    ḡ is a function    (8e)

However, this extension (alone) is flawed, as we proceed to demonstrate.

A bug
Consider the following commonly occurring mathematical situation. We define an offset operator s. The derivative of s at zero should be the same as the derivative operator, that is, 𝒟 s 0 = 𝒟. Defining D̂ ≜ D s 0, we would hope that D̂ = D. However, we exhibit an example where it is not. We can compute D̂ (D̂ h) y for h : R → R with simple reduction steps: {allocate a fresh tag ε₀; this is problematic; see discussion below} … This goes wrong, yielding 0 instead of h″(y).
The process of allocating a fresh tag in step (12d) was problematic. The proper way to handle such fresh tag allocation might be to use nominal logic (Pitts, 2003), perhaps in a dependent-type-theoretic variant (Cheney, 2012). Below, we offer alternate mechanisms that are suitable for use in programming-language implementations that lack type systems supporting first-class names and binding. This is not an artificial example. It is quite natural to construct an x-axis differential operator and apply it to a two-dimensional function twice, along the x and then the y axis direction, by applying the operator, flipping the axes, and applying the operator again, thus creating precisely this sort of cascaded use of a defined differential operator.
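The bug can be reproduced with a minimal Python sketch (our own rendering, not the paper's Scheme; `s` plays the role of the offset operator of (9), and `tg` post-composes on functions per (8e)). The tag for D̂ is allocated once, at the invocation D s 0, so the two logically distinct derivative calculations in D̂ (D̂ h) collide:

```python
# Reproducing the bug: the tag is allocated per invocation D f x, not per
# derivative calculation.
import itertools

_fresh = itertools.count()

class Dual:
    def __init__(self, tag, primal, tangent):
        self.tag, self.primal, self.tangent = tag, primal, tangent

def add(x, y):
    if isinstance(x, Dual):
        if isinstance(y, Dual) and y.tag == x.tag:
            return Dual(x.tag, add(x.primal, y.primal), add(x.tangent, y.tangent))
        return Dual(x.tag, add(x.primal, y), x.tangent)
    if isinstance(y, Dual):
        return Dual(y.tag, add(x, y.primal), y.tangent)
    return x + y

def mul(x, y):
    if isinstance(x, Dual):
        if isinstance(y, Dual) and y.tag == x.tag:
            return Dual(x.tag, mul(x.primal, y.primal),
                        add(mul(x.primal, y.tangent), mul(x.tangent, y.primal)))
        return Dual(x.tag, mul(x.primal, y), mul(x.tangent, y))
    if isinstance(y, Dual):
        return Dual(y.tag, mul(x, y.primal), mul(x, y.tangent))
    return x * y

def tg(tag, x):
    if isinstance(x, Dual):
        if x.tag == tag:
            return x.tangent
        return Dual(x.tag, tg(tag, x.primal), tg(tag, x.tangent))
    if callable(x):
        return lambda y: tg(tag, x(y))      # (8e): post-composition
    return 0.0

def D(f):
    def df(x):
        tag = next(_fresh)                  # (8d): fresh tag per invocation D f x
        return tg(tag, f(Dual(tag, x, 1.0)))
    return df

s = lambda u: lambda f: lambda x: f(add(x, u))   # the offset operator, cf. (9)
h = lambda x: mul(x, x)                          # h''(y) = 2 everywhere

Dhat = D(s)(0.0)                  # the single tag for Dhat is allocated HERE
wrong = Dhat(Dhat(h))(3.0)        # yields 0.0, but h''(3) = 2
right = D(s)(0.0)(Dhat(h))(3.0)   # two copies of Dhat, two tags: 2.0, correct
```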

The root cause of the bug
This incorrect result was due to the tag ε₀ being generated exactly once, in (12b), when D̂ was calculated from D s 0 as (12a)–(12f) using the definition (11). The invocation D s 0 is the point at which a fresh tag is introduced; early instantiation can result in reuse of the same tag in logically distinct derivative calculations. Here, the first derivative and the second derivative become confused at (12l). We have two nested applications of tg for ε₀, but for correctness these should be distinctly tagged: ε₀ versus ε₁. This can be accomplished by making two copies of D̂ by evaluating D s 0 twice. Performing an analogous computation with two copies of D̂ yields the correct result.
{apply h to a dual number}
⟹ tg ε₀ (tg ε₁ (h(y + 1ε₀) + h′(y + 1ε₀) ε₁))    (14r)
⟹ …
Here, (14r) corrects the mistake in (12l). However, this is tantamount to requiring the user to manually write two copies of D̂ by evaluating D s 0 twice. This should not be necessary since, if D correctly implemented 𝒟, D̂₀ and D̂₁ would be equivalent.
The essence of the bug is that the implementation of D in (8d) generates a distinct ε for each invocation D f x, but a distinct ε is needed for each derivative calculation. In the first-order case, when f : R → R, these are equivalent: each invocation D f x leads to a single derivative calculation. But in the higher-order case, when f returns a function g, an invocation D f x yields ḡ, which performs a derivative calculation when invoked. Since ḡ can be invoked multiple times, each such invocation will perform a distinct derivative calculation and needs a distinct ε. The implementation in the Appendix illustrates the bug when setting *eta-expansion?* and *tag-substitution?* to #f to use the definitions in (8d) and (8e).

A first solution: Eta expansion
One solution would be to eta expand the definition of D. Such eta expansion would need to be conditional on the return type of f.
With such eta expansion conditioned on the return type of f, (8e) is not needed, because the appropriate variant of D should only be invoked in a context that contains all arguments necessary to subsequently allow the call to tg in that invocation of D to yield a non-function-containing value. This seemingly infinite set of Dᵢ and associated definitions can be formulated as a single D with polymorphic recursion.
We can see that this resolves the bug in (12a)–(12o) and accomplishes the desiderata in (14a)–(14l) without making two copies of D̂.
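A minimal Python sketch of the eta-expansion fix (our own rendering, with a run-time `callable` probe standing in for the "is [not] a function" side conditions of (18a) and (18b)). Because the tag is not drawn until a full argument list is available, the two derivative calculations in D̂ (D̂ h) get distinct tags:

```python
# Eta-expanded D: delay tag allocation until the result is a non-function.
import itertools

_fresh = itertools.count()

class Dual:
    def __init__(self, tag, primal, tangent):
        self.tag, self.primal, self.tangent = tag, primal, tangent

def add(x, y):
    if isinstance(x, Dual):
        if isinstance(y, Dual) and y.tag == x.tag:
            return Dual(x.tag, add(x.primal, y.primal), add(x.tangent, y.tangent))
        return Dual(x.tag, add(x.primal, y), x.tangent)
    if isinstance(y, Dual):
        return Dual(y.tag, add(x, y.primal), y.tangent)
    return x + y

def mul(x, y):
    if isinstance(x, Dual):
        if isinstance(y, Dual) and y.tag == x.tag:
            return Dual(x.tag, mul(x.primal, y.primal),
                        add(mul(x.primal, y.tangent), mul(x.tangent, y.primal)))
        return Dual(x.tag, mul(x.primal, y), mul(x.tangent, y))
    if isinstance(y, Dual):
        return Dual(y.tag, mul(x, y.primal), mul(x, y.tangent))
    return x * y

def tg(tag, x):
    # (8a)-(8c) only: with eta expansion, tg never sees a function, so (8e)
    # is not needed.
    if isinstance(x, Dual):
        if x.tag == tag:
            return x.tangent
        return Dual(x.tag, tg(tag, x.primal), tg(tag, x.tangent))
    return 0.0

def D(f):
    def df(x):
        probe = next(_fresh)                            # run-time side condition
        if callable(f(Dual(probe, x, 1.0))):
            return lambda y: D(lambda x2: f(x2)(y))(x)  # eta expand; tag delayed
        tag = next(_fresh)                              # fresh tag per derivative
        return tg(tag, f(Dual(tag, x, 1.0)))
    return df

s = lambda u: lambda f: lambda x: f(add(x, u))   # the offset operator, cf. (9)
h = lambda x: mul(x, x)

Dhat = D(s)(0.0)   # no tag is committed here; each full application gets its own
```

With this definition, `Dhat(Dhat(h))(3.0)` yields h″(3) = 2 with a single D̂, as desired.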

Issues with eta expansion
This solution presents several problems.
• First, this manuscript only considers a space of types that includes scalar reals and functions but not aggregates (exclusive of dual numbers). Complications arise when extending the space of types to include aggregates. The Appendix illustrates that the above mechanism works with functions that return Church-encoded aggregates. With a function that returned native aggregates, one would need to emulate the behavior that occurs with Church-encoded aggregates by delaying the derivative calculation, with the associated tag allocation and tg applied to the native returned aggregate, until an accessor is applied to that aggregate. Consider D t 0 where t : R → (R × ((R → R) → R)) as above. One could not perform the derivative calculation when computing the value p returned by D t 0. One would have to delay until applying an accessor to p. If one accessed the first element of p, one would perform the derivative calculation, with the associated tag allocation, at the time of access. But if one accessed the second element of p, one would have to further delay the derivative calculation, with the associated tag allocation, until that second element was invoked. This could require different amounts of delay that might be incompatible with some static type systems.
• Second, with a type system or other static analysis mechanism that is unable to handle the unbounded polymorphism of (17a), (17b), (17c), … or to infer the "is [not] a function" side conditions of (18a) and (18b), achieving completeness might require run-time evaluation of the side conditions. This could involve calling f twice, once to determine its return type and once to do the eta-expanded derivative calculation, and lead to an exponential increase in asymptotic time complexity.
• Third, the solution can break sharing in curried functions, even with a type system or other static analysis mechanism that is able to eliminate the run-time evaluation of the "is [not] a function" side conditions. Consider a curried h that binds a temporary t = f 8 when partially applied. The programmer would expect h 8 to call f once in the calculation of the temporary t = f 8, and indeed this is what would occur in practice. Now consider D h 8.
The strategy discussed above would (in the absence of memoization or similar heroic measures) end up calculating f 8 twice, as the delayed tag allocation would end up splitting into two independent tag allocations with each independently redoing the calculation. This violates the constant-factor-overhead complexity guarantee of forward AD, imposing, in the worst case, exponential overhead.
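The sharing breakage can be observed directly by instrumenting a curried function with a call counter, under the same eta-expanding `D` sketched above (our own Python rendering; the run-time `callable` probe here also exhibits the double-call issue of the previous bullet):

```python
# Counting calls to f under an eta-expanding D with a run-time probe.
import itertools

_fresh = itertools.count()

class Dual:
    def __init__(self, tag, primal, tangent):
        self.tag, self.primal, self.tangent = tag, primal, tangent

def add(x, y):
    if isinstance(x, Dual):
        if isinstance(y, Dual) and y.tag == x.tag:
            return Dual(x.tag, add(x.primal, y.primal), add(x.tangent, y.tangent))
        return Dual(x.tag, add(x.primal, y), x.tangent)
    if isinstance(y, Dual):
        return Dual(y.tag, add(x, y.primal), y.tangent)
    return x + y

def mul(x, y):
    if isinstance(x, Dual):
        if isinstance(y, Dual) and y.tag == x.tag:
            return Dual(x.tag, mul(x.primal, y.primal),
                        add(mul(x.primal, y.tangent), mul(x.tangent, y.primal)))
        return Dual(x.tag, mul(x.primal, y), mul(x.tangent, y))
    if isinstance(y, Dual):
        return Dual(y.tag, mul(x, y.primal), mul(x, y.tangent))
    return x * y

def tg(tag, x):
    if isinstance(x, Dual):
        if x.tag == tag:
            return x.tangent
        return Dual(x.tag, tg(tag, x.primal), tg(tag, x.tangent))
    return 0.0

def D(f):
    def df(x):
        probe = next(_fresh)
        if callable(f(Dual(probe, x, 1.0))):            # probe call
            return lambda y: D(lambda x2: f(x2)(y))(x)  # eta expand
        tag = next(_fresh)
        return tg(tag, f(Dual(tag, x, 1.0)))            # real derivative pass
    return df

calls = {"f": 0}
def f(x):
    calls["f"] += 1
    return lambda y: mul(x, y)

def h(x):
    t = f(x)               # programmer expects t to be shared across uses
    return lambda y: t(y)

deriv = D(h)(8.0)(5.0)     # d/dx (x * 5) at x = 8 is 5
```

Plain evaluation `h(8.0)(5.0)` calls `f` once; here `calls["f"]` ends up at 3: one outer probe, one inner probe, and one real derivative pass.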

A second solution: Tag substitution
Another solution would be to wrap ḡ with tag substitution to guard against tag collision, replacing (8e) with (23), defined in terms of a tag-substitution operation (24a)–(24d). Here [ε₁/ε₂] x substitutes ε₁ for ε₂ in x. In a language with opaque closures, tag substitution must operate on functions by appropriate pre- and post-composition. The intent of (24d) is to substitute ε₁ for ε₂ in values closed over in ḡ. An ε₂ in the output of ḡ can result from closed-over values and/or input values. We want to substitute for instances of ε₂ in the output that result from the former but not the latter. This is accomplished by substituting a fresh tag for instances of ε₂ in the input and substituting them back at the output, so as to preserve the extensional behavior of ḡ. Equation (23) operates in a similar fashion. The intent of (23) is to extract the coefficient of instances of ε₁ in the output of ḡ that result from closed-over values, not input values. This is accomplished by substituting a fresh tag for instances of ε₁ in the input and substituting them back at the output, so as to preserve the extensional behavior of ḡ.
We can see that this also resolves the bug in (12a)–(12o) and accomplishes the desiderata in (14a)–(14l) without making two copies of D̂.
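A minimal Python sketch of the tag-substitution fix (our own rendering; `subst(new, old, x)` plays the role of [ε_new/ε_old] x, and the `shield` tags implement the fresh-tag shuffle of (23) and (24d)). `D` still allocates only one tag per invocation, as in (8d), yet the nested example now comes out right:

```python
# Tag substitution: guard wrapped functions against tag collision.
import itertools

_fresh = itertools.count()

class Dual:
    def __init__(self, tag, primal, tangent):
        self.tag, self.primal, self.tangent = tag, primal, tangent

def add(x, y):
    if isinstance(x, Dual):
        if isinstance(y, Dual) and y.tag == x.tag:
            return Dual(x.tag, add(x.primal, y.primal), add(x.tangent, y.tangent))
        return Dual(x.tag, add(x.primal, y), x.tangent)
    if isinstance(y, Dual):
        return Dual(y.tag, add(x, y.primal), y.tangent)
    return x + y

def mul(x, y):
    if isinstance(x, Dual):
        if isinstance(y, Dual) and y.tag == x.tag:
            return Dual(x.tag, mul(x.primal, y.primal),
                        add(mul(x.primal, y.tangent), mul(x.tangent, y.primal)))
        return Dual(x.tag, mul(x.primal, y), mul(x.tangent, y))
    if isinstance(y, Dual):
        return Dual(y.tag, mul(x, y.primal), mul(x, y.tangent))
    return x * y

def subst(new, old, x):
    """Rename tag `old` to `new` in x, cf. (24a)-(24d). On functions, rename
    only closed-over tags: shield occurrences of `old` in the input behind a
    fresh tag, and undo the shielding on the output."""
    if isinstance(x, Dual):
        t = new if x.tag == old else x.tag
        return Dual(t, subst(new, old, x.primal), subst(new, old, x.tangent))
    if callable(x):
        def wrapped(y):
            shield = next(_fresh)
            out = x(subst(shield, old, y))       # input's `old` -> shield
            return subst(old, shield, subst(new, old, out))
        return wrapped
    return x

def tg(tag, x):
    if isinstance(x, Dual):
        if x.tag == tag:
            return x.tangent
        return Dual(x.tag, tg(tag, x.primal), tg(tag, x.tangent))
    if callable(x):                              # cf. (23): guard with substitution
        def wrapped(y):
            shield = next(_fresh)
            return subst(tag, shield, tg(tag, x(subst(shield, tag, y))))
        return wrapped
    return 0.0

def D(f):
    def df(x):
        tag = next(_fresh)                       # still one tag per D f x, as (8d)
        return tg(tag, f(Dual(tag, x, 1.0)))
    return df

s = lambda u: lambda f: lambda x: f(add(x, u))   # the offset operator, cf. (9)
h = lambda x: mul(x, x)

Dhat = D(s)(0.0)
first = Dhat(h)(3.0)           # h'(3) = 6
second = Dhat(Dhat(h))(3.0)    # h''(3) = 2, correct despite the shared tag
```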

Issues with tag substitution
This solution presents several problems when implemented as user code in a pure language. In the presence of aggregates, unless care is taken, the computational burden of tag substitution can violate the complexity guarantees of forward AD. The call to tg in step 3 might take longer than unit time, as tag substitution must potentially traverse an aggregate of arbitrary size. When that aggregate shares substructure, a careless implementation might traverse such shared substructure multiple times, leading to potential exponential growth in time complexity. Moreover, a careless implementation might copy shared substructure multiple times, leading to potential exponential growth in space complexity. Laziness, memoization, and hash-consing might solve this, but it can be tricky to employ them in a fashion that preserves the requisite time and space complexity guarantees of forward AD, particularly in a pure or multithreaded context.
We are unsure, however, that laziness, memoization, and hash-consing completely eliminate the problem. First, some languages like PYTHON and SCHEME lack the requisite pervasive default laziness, and failure to explicitly code the correct portions of user code as lazy in an eager language can break the complexity guarantees in subtle ways. But there are subtle issues even in languages like HASKELL with the requisite pervasive default laziness, and even when laziness is correctly introduced manually in eager languages. One is that memoization and hash-consing implicitly involve a notion of equality, and it is not clear what notion of equality to use, especially with "gensym" and potential alpha equivalence. One might need eq?, pointer or intensional equivalence, rather than equal?, structural or extensional equivalence, with all of the impurity that this introduces. Further, memoization and hash-consing might themselves be a source of a new kind of perturbation confusion if tags can persist: one would then need to perform tag substitution inside the memoization tables or the hash-cons cache. Beyond this, memoization and hash-consing could break the space complexity guarantees unless the cache were flushed. It is not clear when or where to flush the cache, or even whether there is a consistent place to do so; there might be inconsistent competing concerns. Finally, many systems don't provide the requisite hooks to do all of this; one would need weak pointers and finalization. All of this deserves further investigation.
The above difficulties only arise when implementing tag substitution as user code in a pure language. The opacity of closures necessitates implementing tag substitution on functions via pre- and post-composition (24d). The complexity guarantees of forward AD could be maintained if the substitution mechanism [ε₁/ε₂] x were implemented so that it (a) did not traverse shared substructure multiple times, (b) copied shared substructure during renaming in a fashion that preserved structure sharing, and (c) could apply to closures, by accessing, copying, renaming, and reclosing around the environments inside closures, without resorting to pre- and post-composition.
This could be accomplished either by including the [ε₁/ε₂] x mechanism as a primitive in the implementation, or by providing other lower-level primitives out of which it could be fashioned. One such mechanism is map-closure, the ability to reflectively access and modify closure environments.

Differential geometry and the push forward operator
The definition (3) only extends 𝒟, and the mechanisms of Sections 7 and 8 only extend D, to higher-order functions R → α whose ranges are functions. Differential geometry provides the framework for extending D to functions α₁ → α₂ whose domains too are functions. Differential geometry concerns itself with differentiable mappings between manifolds, where intuitively a manifold is a surface along which points can move smoothly, like the surface of a sphere or the space of n × n rotation matrices. Given a point x, called a primal (value), on a manifold α, we can consider infinitesimal perturbations of x. The space of such perturbations is a vector space called a tangent space, denoted by T_x α. This is a dependent type, dependent on the primal x. A particular perturbation, an element x′ of the tangent space, is called a tangent (value). A pair (x, x′) of a primal and a tangent value is called a bundle (value); bundles are members of a bundle space Tα = Σ_{x:α} {x} × T_x α. Bundles generalize the notion of dual numbers. So if x has type α, for some α, the tangent x′ has type T_x α, and they can be bundled together as (x + x′), which has type Tα.
The machinery of differential geometry defines T_x α for various manifolds and spaces α. For function spaces α → β, where f is of type α → β, T_f (α → β) = (a : α) → T_{f(a)} β and T(α → β) = α → Tβ. The function bundle : (x : α) ↦ (x′ : T_x α) ↦ (x, x′) : Tα constructs a bundle from a primal and a tangent, and the function tangent : (x, x′) : Tα ↦ x′ : T_x α extracts a tangent from a bundle. Differential geometry provides a push forward operator that generalizes the notion of a univariate derivative from functions f of type R → R to functions f of type α → β.
This augments the original mapping (a : α) → β to also linearly map a tangent T_a α of the input a to a tangent T_{f(a)} β of the output f(a).
Here we sketch how to materialize differential geometry as program constructs to generalize D to functions α₁ → α₂ whose domains (and ranges) are functions. A full treatment is left for future work. We first note that

D f x = tangent ((pf f) (bundle x 1))    (27)

This only applies when x : R because of the constant 1. We can generalize this to a directional derivative

J⃗ f x x′ ≜ tangent ((pf f) (bundle x x′))    (28)

This further generalizes to x of any type. With this, D becomes a special case of J⃗:

D f x = J⃗ f x 1    (29)

To materialize J⃗ in (28), we need to materialize tangent, pf, and bundle. The definition of tg in (8a)–(8c) and (8e) materializes tangent with the first solution, eta expansion (Section 7), while that in (8a)–(8c) and (23) does so with the second solution, tag substitution (Section 8). The nonstandard interpretation of the arithmetic basis functions sketched in (5a) and (5b) materializes pf by lifting a computation on real numbers to a computation on dual numbers. All that remains is to materialize bundle. So far, we have simply been writing this as step 2, a map from a to a + 1ε, or a map from x to x + 1ε in (8d). This only works for numbers, not functions. Within the framework of the first solution, eta expansion (Section 7), we can extend this to functions:

bun ε x x′ ≜ x + x′ε    x and x′ are not functions    (30a)
bun ε f f′ y ≜ bun ε (f y) (f′ y)    f and f′ are functions    (30b)
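A Python sketch of (30a)–(30b) and the resulting directional-derivative operator (our own rendering; `bun` and `pushforward` are assumed names, and `tg` here handles only non-function results). The second example takes a directional derivative of a function whose domain is a function space:

```python
# Bundling primals with tangents, recursing through function types.
import itertools

_fresh = itertools.count()

class Dual:
    def __init__(self, tag, primal, tangent):
        self.tag, self.primal, self.tangent = tag, primal, tangent

def add(x, y):
    if isinstance(x, Dual):
        if isinstance(y, Dual) and y.tag == x.tag:
            return Dual(x.tag, add(x.primal, y.primal), add(x.tangent, y.tangent))
        return Dual(x.tag, add(x.primal, y), x.tangent)
    if isinstance(y, Dual):
        return Dual(y.tag, add(x, y.primal), y.tangent)
    return x + y

def mul(x, y):
    if isinstance(x, Dual):
        if isinstance(y, Dual) and y.tag == x.tag:
            return Dual(x.tag, mul(x.primal, y.primal),
                        add(mul(x.primal, y.tangent), mul(x.tangent, y.primal)))
        return Dual(x.tag, mul(x.primal, y), mul(x.tangent, y))
    if isinstance(y, Dual):
        return Dual(y.tag, mul(x, y.primal), mul(x, y.tangent))
    return x * y

def tg(tag, x):
    if isinstance(x, Dual):
        if x.tag == tag:
            return x.tangent
        return Dual(x.tag, tg(tag, x.primal), tg(tag, x.tangent))
    return 0.0

def bun(tag, x, dx):
    """bundle: (30a) for non-functions, (30b) recursing through functions."""
    if callable(x):
        return lambda y: bun(tag, x(y), dx(y))   # (30b)
    return Dual(tag, x, dx)                      # (30a)

def pushforward(f, x, dx):
    """tangent ((pf f) (bundle x dx)), cf. (28)."""
    tag = next(_fresh)
    return tg(tag, f(bun(tag, x, dx)))

# Directional derivative of x^2 at x = 3 along dx = 2: 2*3*2 = 12.
d1 = pushforward(lambda x: mul(x, x), 3.0, 2.0)

# F(q) = q(2)^2 on the function space R -> R, at g(t) = t^2 along dg(t) = t:
# dF = 2 * g(2) * dg(2) = 2*4*2 = 16.
g, dg = (lambda t: mul(t, t)), (lambda t: t)
d2 = pushforward(lambda q: mul(q(2.0), q(2.0)), g, dg)
```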

Conclusion
Classical AD systems, such as ADIFOR (Bischof et al., 1992), TAPENADE (Hascoët & Pascual, 2004), and FADBAD++ (Bendtsen & Stauning, 1996), were implemented for first-order languages like FORTRAN, C, and C++. This made it difficult to formulate situations like (7) where the kind of perturbation confusion reported by Siskind & Pearlmutter (2005) can arise. Thus classical AD systems did not implement the tagging mechanisms described above. Moreover, such classical AD systems do not expose a derivative-taking operator as a higher-order function, let alone one that can take derivatives of higher-order functions. In these systems, it is difficult to formulate the bug in Section 5. Note that the difficulty arises from the nature of the language whose code is differentiated, and not from the fact that many classical systems like ADIFOR and TAPENADE expose AD to the user via a source-code transformation implemented by a preprocessor rather than via a higher-order function. Conceptually, both a higher-order function and a preprocessor applying a transformation to source code map functions to functions. Thus, while one might write code against a system that exposes AD to the user with an interface as a higher-order function D, one would accomplish essentially the same thing in a system that exposes AD to the user with a preprocessor that implements a source-code transformation by having the preprocessor compute the let binding f′ ≜ D f. The issue presented in this manuscript would arise even in a framework that exposes AD to the user with a preprocessor that implements a source-code transformation, if one were to write

let s′ ≜ D s in let D̂ ≜ s′ 0 in D̂ (D̂ h) y    (37)

and have the preprocessor compute the let binding s′ ≜ D s.
The difficulty in formulating the issue presented in this manuscript follows from the fact that classical languages like FORTRAN, C, and C++ lack the capacity for higher-order functions (closures) needed to perform the let binding D̂ ≜ s′ 0, not from any aspect of the difference between exposing AD via a higher-order function versus via a preprocessor that implements a source-code transformation. Indeed, the issue described here would manifest in a system that exposed AD via a preprocessor implementing a source-code transformation in a language such as PYTHON that supports the requisite closures and higher-order functions (e.g., MYIA, Breuleux & van Merriënboer, 2017, and TANGENT, van Merriënboer et al., 2018).
Recent AD systems, such as MYIA, TANGENT, and those in footnote 1, as well as the HASKELL AD package available on Cabal (Kmett, 2010), the "Beautiful Differentiation" system (Elliott, 2009), and the "Compiling to Categories" system (Elliott, 2017), have been implemented for higher-order languages like SCHEME, ML, HASKELL, F#, PYTHON, LUA, and JULIA. One by one, many of these systems have come to discover the kind of perturbation confusion reported by Siskind & Pearlmutter (2005) and have come to implement the tagging mechanisms described in Section 3. Moreover, all these recent systems expose a derivative-taking operator as a higher-order function. However, except for SCMUTILS, none supported taking derivatives of higher-order functions.
Prior to its 30 August 2011 release, SCMUTILS, the only forward AD system that supported taking derivatives of higher-order functions, employed the mechanism of (8a)–(8e) and exhibited the bug in Section 5. An attempt was made to fix this bug in the 30 August 2011 release of SCMUTILS, using the second solution, tag substitution, discussed in Section 8, in response to an early version of this manuscript. SCMUTILS was patched to include code that is similar to, but not identical to, (23) and (24a)–(24d). Crucially, it allocates a fresh tag in its implementation of (23) but not in its implementation of (24d). This, however, is incorrect, as illustrated by the following variant of the bug in Section 5. Variants of (10a)–(10c) show that D v 0 (D v 0 i) h y = h″(y). The 27 August 2016 release, the current release at the time of writing, however, yields D v 0 (D v 0 i) h y ⟹ 0. Both solutions presented here yield the correct result.
In 2019, the authors reached out to Gerald Jay Sussman, one of the authors of SCMUTILS, to help fix SCMUTILS. He asked whether we could produce an example that illustrated the necessity of performing substitution on functions (24d), and why an alternate

[ε₁/ε₂] ḡ ≜ ḡ    ḡ is a function    (41)

that did not perform substitution on functions wouldn't suffice. A variant of (9) and (11) that wraps and unwraps arguments and results in Church-encoded boxes illustrates the necessity of (24d). The same analysis as (10a)–(10c) shows that

UNWRAP (D (WRAPTWORESULT s) 0 (D (WRAPTWORESULT s) 0 (WRAP h))) = h″    (42h)

While

UNWRAP (D (WRAPTWORESULT s) 0 (D (WRAPTWORESULT s) 0 (WRAP h))) = h″    (42i)

with both (24d) and (41), with (24d)

UNWRAP (WRAPPEDD̂ (WRAPPEDD̂ (WRAP h))) = h″    (42j)

but with (41)

UNWRAP (WRAPPEDD̂ (WRAPPEDD̂ (WRAP h))) ≠ h″    (42k)

The authors of SCMUTILS are in the process of fixing it again in response to this updated manuscript. The tenacity of this bug illustrates its subtlety and cries out for a proof of correctness. Practically all systems that expose a derivative-taking operator as a higher-order function generalize that operator to take gradients and Jacobians of functions whose domains and/or ranges are aggregates, and most have come to implement tagging. The current forefront of deep learning research often involves nested application of AD and application of AD to higher-order functions (Maclaurin et al., 2015b; Andrychowicz et al., 2016; Raissi, 2018; Chen et al., 2018; Salman et al., 2018). This work often combines building custom frameworks to support the particular derivatives of interest with performing transformations (closure conversion or even full AD transforms) manually. Under the pressure of machine learning programmers' desire for nesting and for derivatives of higher-order functions, it is reasonable to speculate that many, if not most, of the above systems will attempt to support these usage patterns.
We hope that the awareness provided by this manuscript will help such efforts avoid this particular subtle bug.
Without formal proofs, we cannot really be sure whether the first solution, eta expansion ((8a)–(8c), (18a), (18b)), or the second solution, tag substitution ((8a)–(8d), (23)), correctly implements the specification in (3). We cannot even be sure that (8a)–(8d) correctly implement the specification in (1). These questions are tricky due to subtleties like nondifferentiability, nontermination, and the difference between function intensions and extensions pointed out by Siskind & Pearlmutter (2008, footnote 1). Ehrhard & Regnier (2003), Manzyuk (2012a, 2012b), Kelly et al. (2016), and Plotkin (2018) present promising work in this direction. Given these sorts of subtle bugs, and the growing interest in, and economic and societal importance of, complicated software systems driven by nested automatically calculated derivatives, it is our hope that formal methods can bridge the gap between the Calculus and the Lambda Calculus, allowing derivatives of interest of arbitrary programs to be not just automatically and efficiently calculated, but also formally verified as correct.