Floating-point arithmetic

Floating-point numbers have an intuitive meaning when it comes to physics-based numerical computations, and they have thus become the most common way of approximating real numbers in computers. The IEEE-754 Standard has played a large part in making floating-point arithmetic ubiquitous today, by specifying its semantics in a strict yet useful way as early as 1985. In particular, floating-point operations should be performed as if their results were first computed with an infinite precision and then rounded to the target format. A consequence is that floating-point arithmetic satisfies the ‘standard model’ that is often used for analysing the accuracy of floating-point algorithms. But that is only scraping the surface, and floating-point arithmetic offers much more. In this survey we recall the history of floating-point arithmetic as well as its specification mandated by the IEEE-754 Standard. We also recall what properties it entails and what every programmer should know when designing a floating-point algorithm. We provide various basic blocks that can be implemented with floating-point arithmetic. In particular, one can actually compute the rounding error caused by some floating-point operations, which paves the way to designing more accurate algorithms. More generally, properties of floating-point arithmetic make it possible to extend the accuracy of computations beyond working precision.


Introduction
Floating-point arithmetic has a long history. In one respect, as advocated by Knuth (1998), it can be traced back to the Babylonian radix-60 number system, invented around 2000 BC, with which (due to the lack of a digit for zero) only the floating-point significands of numbers were represented, that is, the numbers x, x/60 and x · 60^5 had the same representation (Ifrah 1999). The invention of exponents took place in several stages. A kind of exponential notation for representing huge numbers was described by Archimedes in his treatise The Sand-Reckoner (Hirshfeld 2009). It seems that the first person to consider zero or negative exponents was Nicolas Chuquet (c. 1450-1488) (Flegg, Hay and Moss 1985). Some of the first computers, designed from the 1940s on, provided a floating-point arithmetic. Notably enough, Konrad Zuse's Z3 computer, completed by the end of 1941, had a binary floating-point arithmetic far ahead of its time, with special representations for undefined results and infinities (Ceruzzi 1981). Many very different solutions (various radices: 2, 8, 16, 10, even 3; various precisions; various ways of rounding the inexact operations and handling the exceptional cases such as forbidden operations, overflows and underflows, etc.) were introduced in the 1950s, 1960s and 1970s, resulting in a total chaos well described by Kahan (1981) in his paper Why do we need a floating-point arithmetic standard? Sometimes, writing portable software required magic tricks only known by old programming wizards, such as inserting, at well-chosen places, multiplications by 1, or adding 0.5 twice instead of adding 1. One could check that a variable x is nonzero and despite this obtain a "zero divide" error when attempting to divide some number by x.
The IEEE-754 Standard for Floating-Point Arithmetic put an end to that mess. It was adopted in 1985. It greatly facilitated the design and portability of numerical software. The reader interested in the history of the birth of the standard can read the interview with William Kahan by Severance (1998). A significant revision of IEEE-754 was published in 2008, and a minor revision of the 2008 version was released in 2019 (IEEE 2019). Since 2012, the IEEE Standards have a 10-year validity period only;¹ a new revision is therefore expected to be published around 2029. It will likely be a thorough revision, since the domain has been evolving drastically in recent years. Now, most FPUs are compliant with IEEE-754. The first generations of GPUs were not compliant, but the situation has improved substantially and one can perform IEEE-754 arithmetic on most of them. By contrast, machine learning cores are nowhere near, as the frequent lack of a clear documentation on their arithmetic makes it very difficult to prove anything on the behavior of an algorithm. To mitigate this difficulty and guess what the arithmetic operators exactly do, one has to craft carefully designed numerical tests (Fasi et al. 2021). This is reminiscent of the first days of the IEEE-754 era, when one had to run software such as PARANOIA (Karpinsky 1985) to check for compliance with the standard.
It is commonplace that computers are much faster today than 35 years ago. However, the pace of this impressive performance increase has not been the same for all parts of our machines. Between 1986 (i.e., just after the release of the first version of the IEEE Standard for Floating-Point Arithmetic) and 2000, the improvement factor per year has been around 1.52 in processor performance, and 1.07 in memory latency (Hennessy and Patterson 2012, Chapter 2). As a consequence, the ratio

(time to read/write in memory) / (time to perform +, ×, ÷, √)

has increased by a factor of around 140 between 1986 and 2000. It has continued to increase after 2000, but at a somewhat slower pace. Figure 1.1 presents typical current latencies of arithmetic operations and accesses to cache or main memory. Due to this drastic evolution, the current situation of numerical computing is quite different from the situation at the time of the birth of IEEE-754. The challenge is no longer (or, at least, not only) to design fast arithmetic operators, but to be able to feed them with data at a very high rate, so that they do not become idle too often. All this has led to the implementation of many new architectural concepts: multiple levels of cache, pipelining, vector instructions, branch prediction, and so on. The impact of these changes on floating-point computing is very important, as heavy pipelining makes division and square root significantly slower than addition or multiplication, and may make branching very costly, unless some regularity in the branching patterns allows for a very efficient branch prediction. It also makes checking the IEEE-754 flags (overflow, inexact, etc.) without losing too much performance an almost hopeless task. Numbers represented in very small formats (e.g., 16-bit numbers) are processed faster than the "usual" binary32 or binary64 numbers (in the delay required to transfer one 128-bit number to/from memory, one can transfer eight 16-bit numbers), so that one is tempted to use these small formats whenever possible, and hence to do mixed-precision arithmetic. An example is digital neural network training, which requires a huge amount of calculations, performed on numbers of very small (16-bit, 8-bit or even smaller) widths. This, however, requires much care. Whereas catastrophic events such as overflow, underflow, and total loss of accuracy are relatively rare when computing in binary64/double-precision arithmetic, they can occur very quickly when using small formats. And, with many different arithmetics emerging, there is the fear of going back to the pre-1985 chaos... Another significant change in the last 40 years lies in the applications of floating-point computing. Whereas most users of 1985 were safely running numerical simulation programs on the ground, embedded computing is now ubiquitous, with numerical calculations being performed on board trains, aircraft, and partly-automated cars, with potential risks to human lives if something goes wrong. Rigorous validation of critical numerical software is now essential.

Figure 1.1. Typical current latencies (in number of cycles) of various FP operations and cache/memory accesses (beware, the scale is logarithmic). These figures vary from one processor to another, but the orders of magnitude remain similar. With a 2 GHz processor, one cycle is 0.5 nanoseconds.

As pointed out by Demmel
and Riedy (2021), at a high level, one significant change is the burgeoning demand for reliability. This is especially true of the basic arithmetic software, since all of numerical computing is built upon it. The use of formal proof techniques has become more and more important since the years that followed the Pentium FDIV bug (Moore, Lynch and Kaufmann 1998, Harrison 1999, Cornea-Hasegan, Golliver and Markstein 1999). This is especially true given that the proofs of some arithmetic algorithms are rather long and tedious, with the consequence that they are seldom read and, therefore, that an error may remain unnoticed. See the book by Boldo and Melquiond (2017) for a thorough presentation, and the work by Muller and Rideau (2022) for recent examples in double-word arithmetic. Throughout the history of floating-point arithmetic, useful material describing what needs to be known by an applied mathematician, a computer scientist or an engineer has been published. Still of much interest are the article by Goldberg (1991) What Every Computer Scientist Should Know About Floating-Point Arithmetic, the books by Overton (2001) and Higham (2002), and the notes of Kahan (1997, 2004a). Some of us contributed, more recently, to a Handbook of Floating-Point Arithmetic (Muller et al. 2018). Much useful information on floating-point arithmetic (especially for C programmers) and IEEE-754 can be found in Beebe (2017, Chapter 4). Finally, a very useful overview has been given recently by Higham (2021a).
Given a radix β ≥ 2, a precision p ≥ 1, and extremal exponents e_min and e_max, the set F_{β,p,e_min,e_max} of floating-point numbers is the set of reals x for which there exists at least one representation (M, e) such that

x = M · β^(e−p+1),   (2.1)

where M and e are integers satisfying

|M| ≤ β^p − 1   (2.2)

and

e_min ≤ e ≤ e_max.   (2.3)

In the following, the set F_{β,p,e_min,e_max} will be denoted F whenever the values of β, p, e_min, and e_max are unambiguous. We also consider the extended sets F ∪ {−∞, +∞} and R ∪ {−∞, +∞}. For a given nonzero floating-point number x, several possible values of M and e may satisfy (2.1), (2.2), and (2.3). The normalized representation of x is the one for which |M| is maximal (or, equivalently, e is minimal). The number zero is somewhat particular and in practice a special representation is reserved for it. The integral significand of the normalized representation of x is called the integral significand of x, noted M_x, and the exponent of the normalized representation of x is called the exponent of x, noted e_x. We say that a nonzero number x ∈ F is a normal number if |x| ≥ β^(e_min), and a subnormal number otherwise (according to IEEE 754-2019, zero is neither normal nor subnormal). Without difficulty, one finds that if x ∈ F is normal then |M_x| ≥ β^(p−1). The number β^(e_min) is sometimes called the underflow threshold.
In all cases of interest for a numerical analyst,² the radix β is equal to 2, but radix 10 is provided anyway on most systems (mostly for accounting applications). Other radices (such as 4 or 8) have been used in the early days of computer arithmetic, but studies performed in the 1970s have shown that they are of little interest (Brent 1973, Kuki and Cody 1973). As a consequence, from now on, unless stated otherwise, we assume a binary floating-point arithmetic, that is, β = 2. One of the advantages of radix 2 is that the leftmost digit of the significand of a floating-point number is necessarily a 1 if it is normal, and a 0 if it is subnormal. Since normality can easily be encoded in the exponent field of the representation, the leftmost bit of the significand does not need to be stored. This is the "hidden bit" or "implicit bit" convention of IEEE-754. This saving of one bit is especially precious in small formats.
The IEEE-754 Standard for Floating-Point arithmetic (IEEE 2019) specifies several binary and decimal formats. Among them, of special importance, are the interchange formats, whose encodings are fully specified as bit strings, which allows for lossless data interchange between platforms. The parameters of the binary interchange formats are given in Table 2.1, along with those of the bfloat16 format (Intel 2018).
The binary32, binary64, and binary128 formats are named basic formats in the standard. IEEE-754 also recommends, for the widest available basic format, the support of an extended format with a wider precision and a wider range (IEEE 2019, Table 3.7). The underlying idea is that the availability of the extended format greatly facilitates the implementation of functions for the corresponding basic format that are both very accurate and free of spurious underflows or overflows (see Section 2.2). In practice, the only extended format that our readers will encounter is the Intel double-extended format (p = 64, e_max = 16383, e_min = −16382), used in the x86 instruction set. Let us say a few words about the very small floating-point formats. Among the binary formats specified by IEEE-754 since its 2008 version, only binary32, binary64, and binary128 are called basic formats, which means that for the designers of the standard, only these ones were supposed to be used for doing arithmetic. The binary16 format was meant to be used for storage only. However, it became evident in the years thereafter that 16-bit floating-point arithmetic was useful for the huge amount of computation required when training neural networks. Furthermore, for these applications the small exponent range of binary16 was a clear penalty. For example, experiments performed at Google (Wang and Kanwar 2019) suggest that such training computations are more sensitive to the exponent range than to the precision. This has motivated the introduction of the bfloat16 (or BF16) format (Intel 2018, Henry, Tang and Heinecke 2019, Osorio et al. 2022) and of the DLFloat (or DLFLT-16) format (Agrawal, Mueller, Fleischer, Sun, Wang, Choi and Gopalakrishnan 2019, Lichtenau, Buyuktosunoglu, Bertran, Figuli, Jacobi, Papandreou, Pozidis, Saporito, Sica and Tzortzatos 2022). Smaller, 8-bit or 9-bit floating-point formats (with 2-bit or 3-bit significands) (Chung et al. 2018), or even 4-bit formats (Sun et al. 2020), have been suggested for AI applications. In particular, 8-bit floating-point formats recently received specific attention from several major companies (Noune, Jones, Justus, Masters and Luschi 2022, Micikevicius et al. 2022). At the time of writing these lines, a working group is being set up to prepare a standard on arithmetic formats for machine learning.³ For experimenting with very low precisions even if they are not yet available in hardware, several simulation strategies and tools have been designed (Lefèvre 2013, Rump 2017, Higham and Pranesh 2019, Fasi and Mikaitis 2020).
To help the reader get an intuition about these numerous formats, Table 2.2 presents the dynamic range of the most supported floating-point formats. There, α denotes the smallest positive FP number (i.e., α = 2^(e_min−p+1)), 2^(e_min) is the smallest positive normal FP number, and Ω is the largest finite FP number. Beware: in bfloat16 arithmetic, subnormal numbers are not necessarily supported.

Rounding and standard model
In general, the sum, product, or quotient of two elements of F is not an element of F. It must be rounded. With some of the early computers, this just meant that the returned value was "not too far" from the exact result. One of the most fruitful ideas popularized by IEEE-754 is the notion of correct rounding, although the term was already mentioned by Wilkinson (1960). One chooses a rounding function ∘ from R to F (see below), and each time an arithmetic operation x op y is called, what is returned is ∘(x op y). If the computer implementation of an operation satisfies this requirement, the operation is said to be correctly rounded. IEEE-754 requires correct rounding of the four arithmetic operations, the square root, and the fused multiply-add (FMA) instruction, defined for a, b, c ∈ F as

FMA(a, b, c) = ∘(ab + c).   (2.4)

That instruction first appeared in 1990 in the IBM POWER instruction set (Cocke and Markstein 1990). It was incorporated in the 2008 version of IEEE-754. Beyond making the evaluation of polynomials and the computation of inner products faster and, in general, more accurate, it greatly facilitates the software implementation of correctly rounded division and square root (Markstein 1990, Cornea-Hasegan et al. 1999), in part thanks to Properties 2.13 and 2.14. In some cases it also makes possible correctly-rounded multiplication by real constants (Brisebarre and Muller 2008). The IEEE-754 standard also recommends (but does not require⁴) correct rounding of a small set of algebraic and transcendental functions. In some instruction sets, e.g., Intel AVX512F (Anderson, Zhang and Cornea 2018), the rounding function can be encoded in the opcode of the floating-point instruction.
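As an illustration of the single rounding in (2.4), the following small C program (an example we add here, assuming binary64 doubles and a correctly rounded fma from <math.h>) compares a*b + c evaluated with two roundings against fma(a, b, c); it should be compiled with FP contraction disabled (e.g., -ffp-contract=off) so that the first expression really uses two roundings:

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double a = 1.0 + 0x1p-30;   /* 1 + 2^-30, exactly representable */
        double b = 1.0 - 0x1p-30;   /* 1 - 2^-30, exactly representable */
        double c = -1.0;
        /* a*b = 1 - 2^-60 exactly; RN(a*b) = 1, so the two-rounding result is 0 */
        double two_roundings = a * b + c;
        /* the FMA rounds a*b + c = -2^-60 only once, and returns it exactly */
        double one_rounding = fma(a, b, c);
        printf("%g %g\n", two_roundings, one_rounding);   /* 0 -8.67362e-19 */
        return 0;
    }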
Let us now give a more formal definition of a rounding function, inspired from Kulisch (1971).

Definition 2.3 (rounding function). A rounding function is a function ∘ from R to F such that:
• for all x ∈ F, ∘(x) = x;
• it is monotone: for all real numbers x and y, x ≤ y implies ∘(x) ≤ ∘(y).

A function ∘ that satisfies Definition 2.3 is such that for any real number x, if x ∉ F then ∘(x) is one of the two elements of F that surround x (i.e., it is what we call a faithful rounding, see below). In practice, we seldom use "general" rounding functions such as the ones specified by Definition 2.3, because they lack useful properties such as simple relations between ∘(x) and ∘(−x), or between ∘(x) and ∘(2^k x) for some integer k, etc.
The classical rounding functions that are of practical interest in floating-point arithmetic are:
• Directed rounding functions:
  round towards −∞: RD(x) is the largest element of F less than or equal to x;
  round towards +∞: RU(x) is the smallest element of F larger than or equal to x;
  round towards zero: RZ(x) = RD(x) if x ≥ 0, and RU(x) otherwise;
• Round-to-nearest functions: RN(x) is the element of F nearest to x, with the following possible tie-breaking choices if x is halfway between two consecutive FP numbers (in the following, such halfway numbers are called midpoints):
  round to nearest, ties-to-even: RN_e(x) is the one of these two FP numbers whose integral significand is even;
  round to nearest, ties-to-away: RN_a(x) is the one with the largest magnitude;
  round to nearest, ties-to-zero: RN_0(x) is the one with the smallest magnitude.
They are called Rounding direction attributes in the IEEE-754 Standard. See also Figure 2.3 for an illustration of the rounding functions.
To that list, we should add a more complex rounding "function", which does not follow the requirements of Definition 2.3 but is slowly gaining importance: stochastic rounding. If x ∈ F, then SR(x) = x; otherwise SR maps the real number x to RD(x) with probability P(x) and to RU(x) with probability 1 − P(x). Two choices for P(x) have been considered in the literature. The first one is a constant function (in general equal to 1/2), and the second one is

P(x) = (RU(x) − x) / (RU(x) − RD(x)),

so that the closer x is to RD(x), the more likely RD(x) is to be chosen. Stochastic rounding can be traced back to the first years of electronic computing (Barnes, Cooke-Yarborough and Thomas 1951, Forsythe 1959). It is not deterministic (which makes calculations difficult to reproduce) and it is significantly more complex to implement than the other rounding functions, but (especially when we consider the second choice for P) it has many interesting properties. More precisely, it prevents some correlation of errors, avoids the phenomenon of stagnation,⁵ and gives much better error bounds on common algorithms such as sums or inner products. This active topic is covered by a recent survey by Croci et al. (2022). Stochastic rounding has also been suggested to obtain probabilistic estimates of the round-off error of a calculation (La Porte and Vignes 1974, Parker, Pierce and Eggert 2000).
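For concreteness, here is a minimal C sketch (our own illustration, not code from the survey) of stochastic rounding with the second choice of P, rounding a double value, which plays the role of the exact real number, to binary32; rand() is used only for simplicity:

    #include <math.h>
    #include <stdlib.h>

    /* Round x (standing for the exact real value) to binary32 stochastically:
       RD(x) is chosen with probability (RU(x) - x) / (RU(x) - RD(x)). */
    static float sr_round(double x) {
        float lo = (float)x;                       /* round to nearest, as a start */
        if ((double)lo > x)
            lo = nextafterf(lo, -HUGE_VALF);       /* now lo = RD(x) */
        if ((double)lo == x)
            return lo;                             /* x is representable: SR(x) = x */
        float hi = nextafterf(lo, HUGE_VALF);      /* hi = RU(x) */
        double p_up = (x - (double)lo) / ((double)hi - (double)lo);
        return ((double)rand() / RAND_MAX < p_up) ? hi : lo;
    }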
In the IEEE-754 Standard, for the round-to-nearest ties-to-even or ties-to-away rounding functions, the above-presented rules are slightly modified for huge arguments. If |x| is larger than or equal to 2^(e_max+1) − 2^(e_max−p), then ∞ (with the same sign as x) is returned. Equivalently, this can be viewed as if we first applied the above-presented rules with an exponent range unbounded above, before replacing all results of magnitude larger than Ω by ±∞. The designers of the standard rightfully estimated that if the exact result of a calculation is huge (i.e., of magnitude much larger than Ω), returning an infinity makes more sense than returning ±Ω.
The default rounding function in the IEEE-754 Standard is RN_e(x). The "ties-to-even" tie-breaking rule may look strange, but it has the double advantage of being unbiased (which may matter in long summations) and straightforwardly implementable.⁶ The directed roundings RD, RU, and RZ are useful for getting certain upper or lower bounds on a result, and more generally for implementing interval arithmetic (see Section 4.1). RN_a is used in accounting; the IEEE Standard requires its availability in radix-10 arithmetic only. RN_0 is used with the "augmented operations" (see Section 4.2.3) and is in particular needed in some reproducible summation algorithms (Riedy and Demmel 2018). In the following, RN means any of RN_e, RN_a or RN_0, unless stated otherwise, and RUD is any of RN, RZ, RU, or RD. Although it is not a rounding in the sense expressed by Definition 2.3, an important notion is that of faithful rounding. We will say that t ∈ F is a faithful rounding of x ∈ [−Ω, +Ω] if t is either RD(x) or RU(x). Note that any of the previously-defined roundings is a faithful rounding by definition. Now, let us define the following parameters:

Definition 2.4 (unit roundoff, ulp, and ufp).
• The unit roundoff is the number u = 2^(−p).

Let x ∈ R, |x| ≤ Ω. Then:
• the unit in the last place of x ≠ 0 is the number ulp(x) = 2^(max{e_min, ⌊log₂|x|⌋} − p + 1);
• the unit in the first place of x, for x ≠ 0, is the number ufp(x) = 2^(⌊log₂|x|⌋).

By a simple extension by continuity we can also define ulp(0) = 2^(e_min−p+1) and ufp(0) = 0. In the normal domain, the ulp and ufp functions can be interchanged (with the adequate scaling factor) in all formulas, since ufp(x) = 2^(p−1) ulp(x). But in the subnormal domain, they behave very differently. Indeed, all nonzero reals in the subnormal domain have the same ulp, equal to α, while their ufp can take p − 1 distinct values.

⁵ An example of stagnation is the computation of S_n = Σ_{k=1}^{n} 1/k in rounded-to-nearest arithmetic. We all know that the exact value of S_n goes to +∞ as n → +∞; however, the computed value of S_n remains constant when n is larger than some threshold n_0 that depends on the precision p.
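The ulp function as defined above is easy to emulate; here is a small C sketch we add for illustration, for the binary64 format (p = 53, e_min = −1022), relying on the standard logb and ldexp functions:

    #include <math.h>

    /* ulp(x) = 2^(max{e_min, floor(log2|x|)} - p + 1) for binary64.
       With this convention, ulp(0) = 2^(-1074), as in the text
       (logb(0) returns -infinity, which fmax clamps to -1022). */
    static double ulp64(double x) {
        double e = fmax(logb(x), -1022.0);  /* logb gives floor(log2|x|) for finite x */
        return ldexp(1.0, (int)e - 52);
    }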
The rounding unit was already appearing in a paper by Wilkinson (1960), and that notion is now widespread in the rounding error analysis of numerical algorithms. The acronym "ulp" for unit in the last place was coined by Kahan. Several slightly different definitions of the ulp function appear in the literature (Goldberg 1991, Harrison 1999, Kahan 2004a, Overton 2001); they differ near powers of 2. The ufp function seems to have appeared for the first time in an article by Rump, Ogita and Oishi (2008).
Let x ∈ R, |x| ≤ Ω. Then RN(x) is a multiple of ulp(x). The number ulp(x) is the distance between the two consecutive FP numbers a and b that satisfy a ≤ |x| < b. It follows that |x − RN(x)| ≤ (1/2) ulp(x). Therefore:
• if RN(x) is a subnormal number (i.e., |x| < 2^(e_min)), then the absolute error satisfies |x − RN(x)| ≤ 2^(e_min−p);
• if x is in the normal domain (i.e., 2^(e_min) ≤ |x| ≤ Ω), then the relative error satisfies |x − RN(x)| ≤ u · |x|.

From this, we deduce that, provided the arithmetic operations are correctly rounded, if x ∈ F, y ∈ F, op ∈ {+, −, ×, /}, and 2^(e_min) ≤ |x op y| ≤ Ω, then there exist real numbers ε₁ and ε₂ such that

RN(x op y) = (x op y)(1 + ε₁), with |ε₁| ≤ u,   (2.5a)
RN(x op y) = (x op y)/(1 + ε₂), with |ε₂| ≤ u.   (2.5b)

Identity (2.5a), already used by Wilkinson (1960) for analyzing some problems of linear algebra, is sometimes called the standard model of floating-point arithmetic.
Most rounding error analyses are based on it (Higham 2002). Several modifications of this model are possible.
• First, if the result of an operation is subnormal, one cannot in general guarantee a relative error bound such as (2.5a) or (2.5b), but the absolute error is bounded by α/2 = 2^(e_min−p).
• Second, since all FP numbers are multiples of α, the exact sum (or difference) of two elements of F is a multiple of α too. It follows from Property 2.2 that if such a sum has absolute value less than or equal to 2 · 2^(e_min) = 2^(e_min+1), it is an element of F. In particular, when the result of a floating-point addition or subtraction is in the subnormal range, that addition or subtraction is exact. That property is sometimes called Hauser's theorem (Hauser 1996).
• Finally, the bound u on |ε₁| in (2.5a) can be replaced by the slightly smaller u/(1+u), which is attained for example at the midpoint 1+u; see Knuth (1998, p. 232). This may seem a very small change, but it suffices to get simpler values for the error bounds on the evaluation of arithmetic expressions such as sums, dot products, and Euclidean norms (Rump 2019).
Finally, for some specific arithmetic operations, the relative error bounds u or u/(1 + u) can be slightly improved. More precisely, Jeannerod and Rump (2018) prove the optimality of the bound u/(1 + u) for addition, subtraction, and multiplication. They also show that in binary FP arithmetic the optimal bound is u − u² for division and 1 − 1/√(1 + 2u) for square root. The standard model can also be used with directed (and faithful) roundings. Indeed, if RN is replaced by RU, RD, or RZ, then (2.5a), (2.5b), (2.6a) and (2.6b) still apply, with the bounds on |ε₁| and |ε₂| now equal to 2u and the half-ulp bounds replaced by one ulp. Table 2.3 summarizes the notation introduced in this section, and Figure 2.3 illustrates parts of it in the case of the toy system β = 2 and p = 3.
One should not underestimate the importance of the standard model. Indeed, the fact that the individual arithmetic operations have a bounded relative error is at the heart of all of numerical error analysis since the pioneering work of Wilkinson (1960). Performing critical calculations using a computer number system that does not guarantee that property is very unwise.⁷

Underflow and Overflow
We will say that an operation underflows if its result is of absolute value less than 2^(e_min) and inexact. This is the condition for raising the underflow flag in the default exception handling of the IEEE-754 Standard.⁸ At first glance, that choice may look strange, but it makes sense. Indeed, since the underflow flag is a warning that the computed result may be less accurate than expected (because we are outside of the domain of validity of the standard model), there is no point in giving that warning when the result is exact. An operation overflows if the rounded result we would obtain if the exponent range were unbounded has a magnitude strictly larger than Ω. For instance, with the round-to-nearest ties-to-even or ties-to-away rounding functions, an operation overflows when the exact result has absolute value larger than or equal to 2^(e_max+1) − 2^(e_max−p). Still with the RN functions, when an operation overflows, the returned result is ±∞ with the correct sign. With the other rounding functions, it will be ±Ω or ±∞, according to the definition of these rounding functions given in Section 2.3.
As the IEEE-754 Standard specifies two infinities, it also specifies two zeros, +0 and −0. The idea of having signed zeros may sound strange, but it greatly facilitates the handling of branch cuts when implementing complex functions (Kahan 1987).
One of the major difficulties when designing function software that is supposed to work in all cases is to avoid spurious underflows or overflows. These are underflows or overflows that occur in an intermediate operation, resulting in inaccurate or infinite results, whereas the exact value of the function is in the normal domain. A simple example is the "naive" calculation of hypot(x, y) = √(x² + y²). In binary64 arithmetic, with x = 1.5 × 2^511 and y = 2^512 we obtain an infinite result because the computation of y² overflows, whereas the exact result is 5 · 2^510, which is much smaller than Ω. Similarly, with x = y = (45/64) × 2^(−537), the computed result is 0, whereas the exact result is around 1.9887 × 2^(−538), i.e., much larger than 2^(e_min). To overcome this difficulty, two strategies have been suggested over the years:
1 Start with the simple, straightforward, and fast algorithm. Check with the IEEE-754 flags if an exception has occurred (Hull, Fairgrieve and Tang 1994). If this is the case (which, hopefully, should not happen frequently), redo the calculation with an alternative (and in general, significantly slower) algorithm.
2 Scale the operands, i.e., multiply them by a well-chosen value (Beebe 2017, §8.2). In the case of the function hypot(x, y) a typical scaling factor is 1/max(|x|, |y|), or, even better, 2^(−k) where k = ⌊log₂(max(|x|, |y|))⌋; a sketch of this second strategy is given below.
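The following C sketch (our own illustration of the second strategy, not code from the survey) scales by the power of two 2^(−k) using the standard logb and scalbn functions; the scaling of the dominant operand is exact, so no accuracy is lost there. In practice, the C library function hypot already takes such precautions.

    #include <math.h>

    /* Spurious-overflow-free hypot sketch: scale by 2^(-k), k = floor(log2(max(|x|,|y|))). */
    static double hypot_scaled(double x, double y) {
        double m = fmax(fabs(x), fabs(y));
        if (m == 0.0 || isinf(m))
            return m;                       /* 0 and infinite inputs handled apart */
        int k = (int)logb(m);
        double xs = scalbn(x, -k);          /* scaling by a power of 2; exact for the larger operand */
        double ys = scalbn(y, -k);
        return scalbn(sqrt(xs * xs + ys * ys), k);
    }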

Not a Number (NaN)
With some early floating-point systems the occurrence of a "forbidden" or "undefined" operation such as √ −7 or ∞/∞ (with any signs) would halt the computation. In general, this is a poor decision, as the result of a complex calculation is rarely just one number. Imagine a long computing process that would have generated thousands of meaningful outputs and that is stopped because an error occurred during the calculation of just one, possibly non-important, parameter.
The IEEE-754 Standard introduced a special floating-point value, NaN, for representing the result of such operations. NaN stands for "Not a Number" or "Not any Number" (Kahan 1997). More precisely, the standard distinguishes two different kinds of NaNs: signaling NaNs (sNaN) and quiet NaNs (qNaN). Signaling NaNs do not appear as the result of an arithmetic operation (the standard suggests that they can be used to represent uninitialized inputs), and the reader will seldom encounter them in practice.
On the contrary, under the default exception handling of the standard, a quiet NaN is delivered when performing one of the following arithmetic operations: the square root of a negative number, (±0) × (±∞), 0/0, ±∞/±∞ (with any signs), ∞ − ∞ (when both infinities have the same sign), Remainder(anything, 0), and Remainder(±∞, anything). Moreover, when at least one of the inputs of an arithmetic operation is a quiet NaN, a quiet NaN is returned.
There is an exception, though. If a multivariate function always delivers the same value once some of its inputs are set, then that same value is returned even if the remaining inputs are a quiet NaN. For instance, the standard suggests that an implementation of hypot(x, y) = √(x² + y²) should satisfy hypot(±∞, qNaN) = +∞.

Beyond the standard model
The standard model (2.5a) is simple and powerful, and, interestingly, floating-point arithmetic is not the only kind of arithmetic that satisfies it. Logarithmic number systems (Swartzlander and Alexopoulos 1975) also guarantee a bounded relative error for their rounding functions, so an error analysis that only uses the standard model also applies to computations using these systems.
However, some useful algorithms and properties of floating-point arithmetic require more than just the standard model to be analyzed and proved. The simplest (and maybe the most useful) example is the subtraction of two floating-point numbers that are close enough to each other:

Theorem 2.5 (Sterbenz (1974)). Let x, y ∈ F. If y/2 ≤ x ≤ 2y, then x − y ∈ F.

This implies that the subtraction will be computed exactly in FP arithmetic, whatever the chosen rounding function.
Clearly, the standard model does not suffice to deduce Theorem 2.5, as logarithmic number systems do not satisfy that property. We will not give many proofs in this survey, but, following Rump et al. (2008), let us quickly prove Theorem 2.5 to give an idea of the kind of reasoning that frequently appears in FP arithmetic. Since x and y play a symmetrical role, we can assume x ≥ y. The two FP numbers x and y are multiples of ulp(y), therefore x − y is a (nonnegative) multiple of ulp(y). From the hypothesis x ≤ 2y, we deduce x − y ≤ y. Since y ∈ F, y < 2^p ulp(y). Let e = log₂(ulp(y)) ≥ e_min − p + 1. Since x − y is a multiple of 2^e less than 2^(e+p), it is an FP number by Property 2.2.
Two important things must be said concerning Theorem 2.5:
• It remains true in radix-β FP arithmetic. Beware: the "2" that appears in the expressions "y/2" and "2y" in Theorem 2.5 must remain a "2", as the theorem no longer holds if it is replaced by β.
• There is no contradiction between Theorem 2.5 and the traditional (and rightful!) rule that says that subtracting two numbers very close to each other is dangerous. Indeed, x − y is computed exactly, but if x approximates a real number x* and y approximates a real number y*, even if these approximations have a very small relative error, x − y may be a very loose approximation to x* − y*, as the relative errors between x and x* and between y and y* may be magnified.
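The following tiny C program (an illustration we add, with made-up reference values) shows both facts at once: the subtraction of the two rounded inputs is exact by Theorem 2.5, and yet it tells us essentially nothing about the difference of the values they approximate.

    #include <stdio.h>

    int main(void) {
        double xs = 1.00000001234;   /* reference values being approximated */
        double ys = 1.00000000987;
        float x = (float)xs;         /* rounded to binary32: here x = y = 1.0f */
        float y = (float)ys;
        /* y/2 <= x <= 2y, so x - y is exact... but it is 0, while xs - ys is not */
        printf("x - y   = %g\n", (double)(x - y));    /* 0 */
        printf("xs - ys = %g\n", xs - ys);            /* about 2.47e-09 */
        return 0;
    }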
Another useful example that cannot be analyzed just by using the standard model is the Fast2Sum algorithm (Møller 1965, Dekker 1971), that is, Algorithm 1 below. If the floating-point exponents e_a and e_b of a and b are such that e_a ≥ e_b, and if the first operation does not overflow, then t is the error of the floating-point addition s = RN(a + b), i.e., a + b = s + t, as shown by Knuth (1998) and later formally proved by Daumas, Rideau and Théry (2001); see also the work by Lange and Oishi (2020) for further extensions and applications. If we make the stronger assumption |a| ≥ |b|, the proof becomes a straightforward consequence of Sterbenz's theorem (Theorem 2.5). We will see some other similar algorithms in Section 4.2.
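Algorithm 1 (Fast2Sum) consists of three rounded operations; here is a minimal C rendering (a sketch under the assumptions just stated: round-to-nearest, |a| ≥ |b| or e_a ≥ e_b, no overflow, and no unsafe compiler optimizations such as -ffast-math):

    /* Fast2Sum: returns s = RN(a+b) in *s and the rounding error in *t,
       so that a + b = *s + *t exactly under the stated assumptions. */
    static void fast2sum(double a, double b, double *s, double *t) {
        double z;
        *s = a + b;
        z  = *s - a;
        *t = b - z;
    }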

Errors in ulps vs relative errors
The errors of numerical programs are in general expressed either in terms of relative error, or in ulps. The two ways of expressing errors do not convey the same amount of information. Let us now examine that difference. First, note that in general when we talk about the error in ulps of a program, we implicitly mean "error in ulps of the exact result", as expressing an error in ulps of the computed result might lead to dubious conclusions. Assume for instance that the exact result is the real number x = 1 + u and consider two (quite poor) computed floating-point results: y₁ = 2 − 2u and y₂ = 2 + 4u. Since x < y₁ < y₂, it would make no sense to consider that y₂ is a better approximation to x than y₁. And yet y₁ is within (2^(p−1) − 3/2) ulp(y₁) of x, and y₂ is within (2^(p−2) + 3/4) ulp(y₂) of x.
Consider a program written to approximate a function f : R → R. For the sake of simplicity, assume that the rounding function is RN. A "perfect" program, when called with input x ∈ F, will return RN(f(x)). The error in ulps of this program⁹ will be bounded by 0.5 ulp, and (assuming that the output is in the normal range) the relative error will be bounded by u. The converse is almost true for the error in ulps. Indeed, if a program has an error bounded by 0.5 ulp then, when called with an input x ∈ F, it always returns one of the FP numbers nearest f(x). But the result is not necessarily equal to RN(f(x)), in particular if f(x) is a midpoint and the error is exactly 0.5 ulp.
On the other hand, a program may have a relative error bounded by u and deliver results that are quite far from being correctly rounded. For example, if the exact result is 2 − 2u + 5u² and the computed result is 2, then the relative error is about u − (3/2)u² (and thus less than u) and, at the same time, correct rounding is not achieved, since the exact result is much closer to the FP number 2 − 2u than to the computed result 2.
The reason for this difference is that the bound (2.5a) is tight only when the exact result is just above a power of 2. The local maximum relative error wobbles between u/2 and u. Figure 2.4 illustrates this. With a radix-β floating-point arithmetic, the situation would be worse, as the "wobbling factor" would be β instead of 2.¹⁰ From that point of view, an error expressed in ulps conveys more information than a relative error. This explains why the errors of "atomic" calculations (e.g., the elementary functions: sine, cosine, logarithm, etc.) are in general expressed in ulps; see for instance a very useful analysis of the quality of current elementary function software by Innocente and Zimmermann (2022). On the contrary, manipulating ulps in large algorithms is almost infeasible, whereas relative errors are rather easily manipulated. For instance, one easily deduces a relative error bound on the computed product of x and y from the relative error bounds on x and y and from the relative error bound u of the multiplication. As a consequence, relative errors are favored for larger computations.
Finally, in the normal domain, one can always deduce an error bound in ulps from a relative error bound, and conversely, using Property 2.6 below. Note that we may lose information during these conversions. This shows when using Property 2.6 twice, to convert an error in ulps to a relative error and then back to an error in ulps: we end up with twice the initial bound.
Property 2.6 (Link between error in ulps and relative errors). Let x ∈ R and y ∈ F be in the normal domain. If |x − y| ≤ k ulp(x) then |x − y| ≤ 2ku |x|; conversely, if |x − y| ≤ ε |x| then |x − y| ≤ (ε/u) ulp(x).

Figure 2.4. The relative error due to rounding to nearest, i.e., |x − RN(x)|/x, for x between 0 and 1, assuming a FP system of radix 2 and precision 6. The local maximum value of the relative error wobbles between u/2 and u.
It is frequent to read that "correct rounding" is equivalent to "error less than half an ulp", and that "faithful rounding" is equivalent to "error less than one ulp". This is almost, but not entirely, true. First, ulp is not a constant but a function, and it must be clearly stated whether we consider the ulp of the exact result or the ulp of the computed result (as said above, the latter choice is a dubious one). Second, the ulp function is discontinuous at the powers of 2. Properties 2.7 and 2.8 clarify the slight differences.

Various properties
When analyzing numerical programs, we frequently have to iteratively compute bounds on the possible values of variables.¹² For example, assume that we have previously shown that a ≤ A and b ≤ B, and that we want to bound the possible values of a variable s after execution of the program line

s = a + b

Since a + b ≤ A + B, if A + B is in the normal domain, the standard model tells us that s ≤ (A + B) · (1 + u). However, it is often the case that A + B ∈ F. In such a case, we have s = RUD(a + b) ≤ RUD(A + B) = A + B. The corresponding (straightforward!) property is:

Property 2.9. If x̂ ∈ F then x ≤ x̂ ⇒ RUD(x) ≤ x̂, and x ≥ x̂ ⇒ RUD(x) ≥ x̂.

The following property is frequently used in proofs for bounding rounding errors.

¹¹ We have not defined the ulp function in the case of nonbinary radices, but in the case considered here it is the distance between the two FP numbers that surround x.
¹² Let us give an example with the function hypot(x, y) = √(x² + y²), assuming that x and y are normal FP numbers and that the rounding function is RN. Without loss of generality, we can assume 1 ≤ x < 2.
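Guaranteed bounds of this kind, such as the one given by Property 2.9, can also be obtained at run time by temporarily switching the rounding direction, which is the basic ingredient of interval arithmetic mentioned earlier. A minimal C sketch (our own illustration; it assumes the compiler honours the dynamic rounding mode, e.g., with the FENV_ACCESS pragma and, for GCC, -frounding-math):

    #include <fenv.h>
    #pragma STDC FENV_ACCESS ON

    /* Returns an FP number guaranteed to be >= a + b, by computing RU(a + b). */
    static double upper_bound_of_sum(double a, double b) {
        int old = fegetround();
        fesetround(FE_UPWARD);
        double s = a + b;          /* s = RU(a+b) >= a + b */
        fesetround(old);
        return s;
    }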

Exact correcting terms
Although, as explained before, the sum, product, or quotient of two floating-point numbers is not, in general, a floating-point number and hence must be rounded, Dekker (1971) and Pichat (1976) remarked early that very frequently, a correcting term can be represented by a floating-point number. We will see later on that such correcting terms can be computed quite easily in floating-point arithmetic (with the same precision), which makes it possible to use them later on in a calculation, to at least partly compensate for the rounding error. A systematic study of these correcting terms was done by Bohlender et al. (1991), with the assumption that no underflow occurs. Later on, this assumption was removed by Boldo and Daumas (2003). We give the main results of that paper below, adapted to the notation of this article.
Property 2.11. Let a, b ∈ F and s = RN(a + b). If the FP addition does not overflow, then s − (a + b) ∈ F.

Note that Property 2.11 does not necessarily hold with rounding functions different from RN. For example, if a = 1 and b = u³ then s := RU(a + b) = 1 + 2u, and s − (a + b) = 2u − u³ is not a floating-point number. (For RD, the same can be observed by simply negating these values of a and b.) Now, for multiplication such a restriction on the rounding function is not necessary, and we can thus state the next property using the "generic" rounding function RUD ∈ {RN, RD, RU, RZ}.

Property 2.12. Let a, b ∈ F and c = RUD(a × b). If the FP multiplication does not overflow, then ab − c ∈ F if and only if there exist integers M_a, M_b, e_a, and e_b, with |M_a|, |M_b| ≤ 2^p − 1 and e_a, e_b ≥ e_min, such that a = M_a · 2^(e_a−p+1), b = M_b · 2^(e_b−p+1), and e_a + e_b ≥ e_min + p − 1.
In particular (and this will be the most useful application of Property 2.12), if the exponents e_a and e_b of a and b satisfy e_a + e_b ≥ e_min + p − 1, then ab − c ∈ F, but Property 2.12 is more general.¹³ Let us give two examples. Assume binary32 arithmetic (p = 24, e_min = −126) and round-to-nearest, ties-to-even. Note that e_min + p − 1 = −103.
For division, an exact relationship is obtained by considering not the absolute error itself (whose bit string can be infinitely long), but the associated residual. This fact was already noted by Pichat (1976, p. 43) and a general statement is as follows.
Property 2.13. Let a, b ∈ F, b ≠ 0, and c = RUD(a/b). If the FP division does not overflow, then a − bc ∈ F if and only if there exist integers M_b, M_c, e_b, and e_c, with |M_b|, |M_c| ≤ 2^p − 1 and e_b, e_c ≥ e_min, such that b = M_b · 2^(e_b−p+1), c = M_c · 2^(e_c−p+1), and e_b + e_c ≥ e_min + p − 1, and such that a second condition holds. Again, a − bc ∈ F holds when the second condition is satisfied and the first is replaced by e_b + e_c ≥ e_min + p − 1, where e_b and e_c are the floating-point exponents of b and c, but Property 2.13 is more general. If the FP division does not underflow (i.e., |c| ≥ 2^(e_min)), then the second condition is satisfied.
Let us give an example in binary32, round-to-nearest, ties-to-even arithmetic. Consider a = 8388609 · 2^(−128) and b = 8388611 · 2^(−100−23). The number c = RN(a/b) is equal to 4194303 · 2^(−27) = 4194303 · 2^(−4−23). The largest e_b and e_c such that one can write b = M_b · 2^(e_b−p+1) and c = M_c · 2^(e_c−p+1) are −100 and −4 (because 8388611 and 4194303 are odd integers). Therefore e_b + e_c ≤ −104 < e_min + p − 1 and the condition of Property 2.13 is not satisfied. One easily observes that a − bc = 3 · 2^(−150) is not a floating-point number.

¹³ This can be useful when the significand of a (or b) has some trailing zeros, which occurs in particular when performing splitting operations (see Section 4.3). In such a case, we know that a has a representation M · 2^(e−p+1) with |M| ≤ 2^p − 1 and e > e_a, so that the conditions of Property 2.12 may be satisfied even if e_a + e_b < e_min + p − 1.

Now, with the same value of b, if a = 8388609 · 2^(−127), then a/b is twice as large as previously, so that a − bc = 3 · 2^(−149) = 3 · 2^(e_min−p+1) ∈ F. Property 2.13 is one of the main motivations for implementing an FMA instruction, as defined in (2.4). Indeed, when a − bc ∈ F, it is computed exactly with just one FMA. This makes it possible, with some care, to deduce RN(a/b) from a faithful rounding of a/b, that is, from a value equal to either RD(a/b) or RU(a/b). In turn, this makes it possible to efficiently implement floating-point division in software (Cornea-Hasegan et al. 1999).
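With a correctly rounded fma, both correcting terms discussed in this section are one instruction away; the following C sketch (our own illustration) computes the error of a product and the residual of a division, exact whenever Properties 2.12 and 2.13 guarantee that they belong to F:

    #include <math.h>

    /* Error of a rounded product: c = RN(a*b), e = a*b - c. */
    static void prod_and_err(double a, double b, double *c, double *e) {
        *c = a * b;
        *e = fma(a, b, -*c);      /* exact when a*b - c is representable */
    }

    /* Residual of a rounded division: given c, a rounded quotient of a/b, returns a - b*c. */
    static double div_residual(double a, double b, double c) {
        return fma(-b, c, a);     /* exact when a - b*c is representable */
    }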
Finally, a similar property holds for square root as well, but under the restriction that rounding is to nearest.
Note that Boldo and Daumas (2003) give an additional condition; it is not needed here since, as we have assumed e_min < 0, it is a consequence of the assumption 2e_c ≥ e_min + p − 1. Again, if the floating-point exponent e_c of c satisfies 2e_c ≥ e_min + p − 1, then a − c² ∈ F, but Property 2.14 is slightly more general.
Note also that, unlike multiplication and division (but similarly to addition), this property of square root does not always hold if rounding is not to nearest.

Playing with the exceptions
The default exception handling of IEEE-754 arithmetic was designed so that it should frequently allow one to obtain a usable result, even when an exception occurs. For instance, for a huge value of x, the calculation of 3 + 1/x⁵ with round-to-nearest arithmetic still delivers the very accurate result 3, even if x⁵ is infinite due to overflow. However, one should be cautious. Consider the following example by Lynch and Swartzlander (1992), in which the evaluation of a function f involves both x² and x³. If x is large enough so that x³ overflows, but not so large as to make x² overflow, the computed value will be 0. A typical example in binary64 arithmetic is x = 2^400 ≈ 2.5822 × 10^120: the computed value of f(x) is 0 whereas the exact value is around 1.6069 × 10^60.
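The exact formula used by Lynch and Swartzlander is not reproduced above; a function with the same behaviour (our own reconstruction, chosen only so that it matches the figures quoted in the text) is f(x) = x²/√(x³), whose exact value is √x. A small C program showing the failure:

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double x = 0x1p400;                    /* 2^400 */
        double f = (x * x) / sqrt(x * x * x);  /* x^3 overflows to +inf */
        printf("computed: %g\n", f);           /* prints 0 */
        printf("exact:    %g\n", sqrt(x));     /* about 1.6069e60 */
        return 0;
    }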
For the arithmetic operations, what should be returned in exceptional cases is in general rather intuitive. There is no debate on the fact that (+∞) · (−∞) should be −∞. When it comes to more complex functions, things become less clear, and from time to time somebody reopens the debate on what should be returned when computing (±1)^(+∞) or 0^0; see an interesting discussion by Kahan (1997), and the choices suggested by the standard (IEEE 2019, §9.1). Take as an example the function cospi(x) = cos(π · x) for x = ±∞. The mathematical function cospi(x) has no limit as x → ±∞, hence it is natural to suggest that cospi(±∞) should be a NaN (this is the choice suggested by IEEE-754). However, since any large enough floating-point number is an even integer, one may arguably claim that the choice cospi(±∞) = 1 would maintain a consistent behavior of a numerical program when the input variable becomes infinite because of an overflow. These debates are interesting, and it is of course important to maintain the consistency and correctness of our calculations as much as possible, even in extreme cases. Nevertheless, despite the cleverness of the arithmetic of infinities, NaNs, and zeros in IEEE-754, it does not automatically solve all problems, so a case-by-case analysis of any critical program that is supposed to work even when variables become infinite is necessary.

When more than one floating-point format is available
There is seldom only one floating-point format available on a given platform. There is even much more diversity now than 20 years ago. Previously, we generally had only binary32/single precision, binary64/double precision, and the 80-bit Intel "double-extended" format. Now, we still have these formats, plus binary128/quad precision, at least one 16-bit "half precision" format (binary16 or bfloat16), and sometimes even an 8-bit format. This diversity gives us more control for finding compromises between speed and accuracy, and for doing mixed-precision computations (Higham and Mary 2022), but it also brings the following difficulties.

Underflows/overflows due to change of formats
As Table 2.2 shows, the various floating-point formats in current use have considerably different ranges. Hence, to avoid underflows and overflows, conversion to a narrower format requires much care; that is, scaling data before conversion is often necessary. This is especially true when the target format is binary16.¹⁵ Scaling strategies in LU factorization algorithms are presented by Higham and Mary (2022, §7.4).

Double roundings
Double rounding occurs in rounded-to-nearest arithmetic when a result is first computed with a larger precision than the target precision. This was very frequent with the x86 instructions, as all arithmetic operations would first be performed using "double-extended" (precision-64) arithmetic operators, and then converted to the target binary64 format. In fact, to prevent double rounding, a control register allows setting the floating-point precision so that the results are not first rounded to double-extended precision, but setting it is a very costly operation, since it requires flushing the processor pipeline (Boldo and Melquiond 2008).
If the intermediate result is a midpoint of the target format, one cannot guarantee that the final result is a floating-point number of the target format nearest the exact result. Let us give an example. Assume that RN is round to nearest, ties-to-even,¹⁶ and suppose that the target precision is 9 and that the intermediate precision is 14. Consider, for instance, a real number such as x = 1.001011010111111 (written in binary). The floating-point number of the target precision nearest x is x_correct = 1.00101101. When rounding x to the intermediate precision, we obtain x_ext = 1.0010110110000, and then, when rounding x_ext to the target precision, we obtain x_final = 1.00101110, which is different from x_correct. Indeed, we have rounded up, whereas x should have been rounded down. Figueroa (1995) and Roux (2014) showed that, when the intermediate precision is large enough, double rounding is innocuous for the basic arithmetic operations.
A solution to avoid double rounding would be to have the operations of the "larger" format performed with the "round-to-odd" rounding function RO, defined as follows. If x ∈ F then RO(x) = x; otherwise RO(x) is the one of the two consecutive FP numbers that surround x whose integral significand is an odd integer. Unfortunately, RO is not among the rounding functions specified by IEEE-754. A variant of round-to-odd (for which RO(x) is always odd, which implies that when x ∈ F is even, RO(x) ≠ x) was first considered by von Neumann when designing the arithmetic unit of the EDVAC. A recent use of round-to-odd to avoid double roundings is in the decimal-to-floating-point conversion of the GCC compiler.¹⁷ This rounding and its properties were also formally studied by Boldo and Melquiond (2008).
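Round-to-odd is easy to emulate with bit manipulation. The following C sketch (our own illustration, restricted for simplicity to positive inputs whose result stays in the binary32 normal range) rounds a binary64 number to binary32 with RO: truncate, then force the last kept bit to 1 whenever some discarded bit was nonzero.

    #include <stdint.h>
    #include <string.h>

    /* Round a positive normal double to float with round-to-odd. */
    static float round_to_odd(double x) {
        uint64_t xb, mask = (UINT64_C(1) << 29) - 1;  /* 52 - 23 = 29 extra bits */
        memcpy(&xb, &x, sizeof xb);
        uint64_t discarded = xb & mask;
        uint64_t tb = xb & ~mask;
        double truncated;
        memcpy(&truncated, &tb, sizeof truncated);
        float r = (float)truncated;        /* exact: fits in 24 significand bits */
        if (discarded != 0) {
            uint32_t rb;
            memcpy(&rb, &r, sizeof rb);
            rb |= 1;                       /* pick the surrounding float with odd significand */
            memcpy(&r, &rb, sizeof rb);
        }
        return r;
    }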

Portability/reproducibility issues
When an arithmetic operation z = x op y is performed, even if x, y, and z do not have the same floating-point format, the IEEE-754 standard is clear about what must be returned: the exact result rounded (with the chosen rounding function) to the format of z. The problem is that, frequently, there is no such thing as "the format of z". This occurs whenever z is not an explicit variable but an implicit one, such as the result of (a + b) when evaluating the expression (a + b) · (c + d).
Several choices may make sense: the widest format available in hardware (this would preserve accuracy and minimize the risk of spurious underflow or overflow, but might sometimes considerably slow down the calculation), the maximum of the widths of the operands, the format chosen for the variable that stores the final result of the expression evaluation, a "preferred format" chosen by the programmer through some mechanism, etc. Different choices made by compiler designers may result in different behaviors of the same program, which results in portability and reproducibility issues. Even with the same choice we may encounter reproducibility problems, as "the widest format available in hardware" may not mean the same thing on different platforms. In its appendix (Section 10), the IEEE 754-2019 Standard recommends (yet does not require) "preferred width" mechanisms to, at least partly, alleviate this problem.

Design of function libraries: need to consider many programs
A large number of floating-point formats is a challenge for the designers of libraries for the elementary (cos, sin, exp, arctan, etc.) or special functions. If one wishes to implement a library of around 30 functions (which is not that many; physicists and statisticians would like to have more, see Beebe (2017) for examples of functions that are frequently needed in numerical computing), the library designer needs to provide real and possibly complex functions for each of the FP formats. This is already a large number, but in practice many variants are needed for each pair function/format: one optimized for accuracy, one for latency, one for throughput. Beyond that, it will be interesting, for each of these combinations, to have a "generic" version that works reasonably well on every platform, and specialized versions that are optimized for each widely-distributed architecture. One may easily end up with thousands of programs to maintain, improve, keep mutually consistent, etc. The medium-term solution is presumably to (at least partly) automate the generation of function libraries (Brunie et al. 2015), and the long-term solution might well be to generate the function approximations at compilation time.

Problems related to the decimal input or output of numerical values
Numerical constants in a program, input values entered on a keyboard or read from a file, and output values displayed or written into a file are often represented in decimal. Radix conversions are therefore a very frequent operation. It is well known that some numbers that have a finite decimal representation do not have a finite binary representation (a typical example is 1/10). Although the numbers with a finite binary representation do have a finite decimal representation, that decimal representation may be too long to be convenient for most purposes. For instance, the smallest positive normal number in the binary64 format, i.e., 2^(−1022), once converted into decimal, is
2.225073858507201383090232717332404064219215980462331830553327416887204434813918195854283159012511020564067339731035811005152434161553460108856012385377718821130777993532002330479610147442583636071921565046942503734208375250806650616658158948720491179968591639648500635908770118304874799780887753749949451580451605050915399856582470818645113537935804992115981085766051992433352114352390148795699609591288891602992641511063466313393663477586513029371762047325631781485664350872122828637642044846811407613911477062801689853244110024161447421618567166150540154285084716752901903161322778896729707373123334086988983175067838846926092773977972858659654941091369095406136467568702398678315290680984617210924625396728515625 × 10^(−308).
As a consequence, inexact conversions between decimal and binary are very frequent. While this is harmless in general, the reader should be aware of a few issues, listed below.

Constants expressed in decimal in programs
When a constant such as 0.1 appears in a program, it is converted to a floating-point number. The format of that floating-point number is not obvious and depends on the specification of the programming language. For instance, in C it will be double (i.e., binary64) unless an f, F, l, or L suffix is appended to the constant, whereas in Fortran it will be binary32 by default, although compiler options¹⁸ may allow one to change this. Numerical programmers should always explicitly declare the type of their constants. An even better, yet less readable, solution is to express the constants in the hexadecimal representation of FP numbers specified by IEEE-754 (IEEE 2019, §5.12.3), which totally avoids conversion errors.
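In C, such hexadecimal constants look as follows (a small illustration we add; the value 0x1.999999999999ap-4 is the binary64 number nearest to 1/10):

    double a = 0.1;                      /* decimal constant, silently converted to binary64 */
    double b = 0x1.999999999999ap-4;     /* hexadecimal constant: exactly the same binary64 value */
    float  c = 0.1f;                     /* a different, binary32, approximation of 1/10 */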

Roundtrip conversions
An intermediate result, computed in binary floating-point arithmetic, may be stored in decimal in a file, or displayed in decimal on a screen, and later on re-read to carry on a calculation. The consequence is a roundtrip radix conversion: first from binary to decimal and then from decimal to binary. Ideally, one would like that roundtrip conversion to be errorless, i.e., the initial binary FP number should be recovered.
Assuming that the binary format has precision p₂, that the decimal input/output format has precision p₁₀, and that both conversions are to a nearest value, the condition that ensures that, at least in the absence of underflow and overflow, all roundtrip conversions are the identity (Matula 1968, Goldberg 1967) is

10^(p₁₀−1) > 2^(p₂).   (2.7)
Table 2.4 gives the smallest value of p₁₀, say p₁₀*, that satisfies (2.7) for the most common binary precisions. The IEEE-754 Standard (IEEE 2019, §5.12.2) requires that correctly rounded conversions to and from decimal formats of precision up to at least p₁₀* + 3 be provided. For a given binary number x, a good binary-to-decimal conversion choice is to use the smallest value of p₁₀ that allows for an errorless roundtrip conversion (Steele Jr. and White 2004).
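For binary64, p₁₀* = 17 (since 10^16 > 2^53), so printing with 17 significant decimal digits and reading the string back recovers the original number exactly. A quick C check (our own illustration):

    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        double x = 1.0 / 3.0;
        char buf[64];
        snprintf(buf, sizeof buf, "%.17g", x);  /* 17 significant decimal digits */
        double y = strtod(buf, NULL);
        printf("%s roundtrips exactly: %d\n", buf, x == y);   /* ... 1 */
        return 0;
    }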

Exotic yet standard operators
Beyond the usual arithmetic operations, the IEEE-754 Standard specifies some "operations" that can be extremely useful in practice. Let us now present the most important of them.

Table 2.4. Minimal decimal precision p₁₀* that allows for an error-free roundtrip conversion, depending on the precision p₂ of the binary floating-point format.

scaleB and logB operations
If they are implemented efficiently, the following functions are extremely useful when one needs to scale a calculation to avoid spurious underflow or overflow. Typical examples are the calculation of Euclidean norms, complex square roots, etc. They are specified by IEEE (2019, §5.3.3).
• scaleB(x, k) returns x · 2^k (in a binary format, which is the case considered in this paper), where x is an FP number and k is an integer;
• logB(x) returns ⌊log_2 |x|⌋, where x is an FP number.
In the C programming language, they are called scalbn and logb in binary64 arithmetic, and scalbnf and logbf in binary32 arithmetic.
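For instance, the following C sketch scales two large operands by a common power of two before squaring them, so that the intermediate squares in a hypotenuse-like computation do not overflow; scalbn multiplies by a power of two exactly:

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    double x = 3e200, y = 4e200;                   /* x*x and y*y would overflow */
    int e = (int)logb(fmax(fabs(x), fabs(y)));     /* exponent of the largest input */
    double xs = scalbn(x, -e), ys = scalbn(y, -e); /* exact scalings by 2^-e */
    double r  = scalbn(sqrt(xs * xs + ys * ys), e);/* rescale the result */
    printf("%g\n", r);                             /* about 5e200 */
    return 0;
}
```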

min, max, and copysign operations
In pipelined calculations, branches can considerably hinder performance. The following instructions (if implemented efficiently) make it possible in some cases to avoid them. They are specified by IEEE (2019, §9.6) and IEEE (2019, §5.5.1).
• minimum(x, y) and minimumNumber(x, y) are equal to x if x ≤ y and to y otherwise. If one of the operands is −0 and the other one is +0, then −0 is returned.
• maximum(x, y) and maximumNumber(x, y) are equal to y if x ≤ y and to x otherwise. If one of the operands is −0 and the other one is +0, then +0 is returned.
The difference between the two families is that maximum(x, NaN) = maximum(NaN, x) = NaN, whereas maximumNumber(x, NaN) = maximumNumber(NaN, x) = x, and similarly for minimum and minimumNumber. The availability of these instructions is only recommended (i.e., not required) by IEEE 754.
• copysign(x, y) returns sign(y) × |x|.
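In C, the closest standard counterparts are fmin, fmax, and copysign; fmin and fmax behave like the "Number" variants with respect to NaN (a quiet NaN operand is treated as missing data), although the C standard does not impose the signed-zero behavior described above. A typical branchless use:

```c
#include <math.h>

/* Branchless clamp of x to [lo, hi], assuming lo <= hi. */
double clamp(double x, double lo, double hi) {
    return fmin(fmax(x, lo), hi);
}

/* Returns a value with the magnitude of x and the sign of y. */
double with_sign_of(double x, double y) {
    return copysign(x, y);
}
```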

Execution environment
While the IEEE-754 standard precisely specifies what floating-point numbers are and how arithmetic operators handle them, this should be taken with a grain of salt, as the execution environment might not comply with it. Indeed, for performance reasons, some corners might have been cut. This might happen at the hardware level (Section 3.1) or in programming languages and compilers (Section 3.2). Moreover, when it comes to mathematical libraries, the standard is much more permissive, and various trade-offs between accuracy and performance exist (Section 3.3). Finally, considerations related to parallelism can also lead to non-reproducible computations (Section 3.4). Monniaux (2008) explored some of these issues in more detail.

Hardware
The IEEE-754 standard does not mandate that a compliant environment has to be implemented in hardware. A pure software library would be just as compliant (and indeed, there are such libraries, such as Hauser's excellent Berkeley SoftFloat), but this would have a major impact on performance. Let us explore what a hardware implementation entails, in order to understand which features are likely to be provided in hardware.

Subnormal numbers
Consider the rounded product of two binary floating-point numbers m_1 · 2^(e_1 − p + 1) and m_2 · 2^(e_2 − p + 1), that is, RN((m_1 × m_2) · 2^(e_1 + e_2 − 2p + 2)). At first glance, this is a simple operation; one computes the integer product of m_1 and m_2 and then truncates it to fit the target format, possibly adjusting it by one unit when rounding toward infinity. Let us assume that both inputs are positive normal numbers, i.e., 2^(p−1) ≤ m_1 < 2^p and 2^(p−1) ≤ m_2 < 2^p. Thus, the integer product has either 2p − 1 or 2p bits. This means that, if the result is also a normal number, the truncation position is almost known statically. One can systematically drop the p − 1 least significant bits, and possibly one more if the most significant bit of the product is non-zero. Thus, in the normal range, the hardware design for the floating-point product is straightforward (Muller et al. 2018, §8.4.1).
But if the result is in the subnormal range, the truncation position is no longer known beforehand, so the circuit has to be able to perform a large shift to the right. This increases the area of the floating-point unit, and more importantly, it might increase the pipeline depth by a few cycles. Subnormal numbers are also an issue when they appear as inputs, as some circuits are designed with normal inputs in mind. In that case, the inputs need to be renormalized beforehand, which again incurs the use of shifters, and hence an additional delay.
Thus, processor designers might consider that supporting subnormal numbers in hardware is not worth the cost. Several options then appear, the first of which being to just forsake both gradual underflow and compliance with the IEEE-754 standard.
In that case, subnormal inputs will be silently interpreted as floating-point zeros, while subnormal outputs will be flushed to zero. Note that the behavior might vary depending on the operator. For example, it is possible to design a floating-point adder in such a way that subnormal inputs and outputs are handled at no extra cost, contrarily to the multiplier (Muller et al. 2018, §7.3.3).
Since the IEEE-754 standard does not mandate any hardware support, another approach is to emulate subnormal numbers using either some microcode or a software library. As a consequence, performance will be optimal in general (i.e., in the normal range), but might plummet as soon as a computation encounters a subnormal number (Lawlor et al. 2005). To avoid this pitfall, the hardware might give programmers the choice, either globally or on a per-instruction basis, to disable subnormal handling. In that case, the floating-point unit does not have to request any kind of software emulation to complete the operation. In other words, the floating-point environment is no longer compliant, but the decision to drop compliance is put in the user's hands.

Rounding directions
Subnormal numbers are not the only part of the standard that is an impediment for hardware. Rounding directions are also an issue. This time, there is no difficulty at the level of the floating-point unit: if it supports rounding to nearest, supporting the other directions does not incur any extra cost. The issue lies instead in the way the information about the rounding direction reaches the floating-point unit.
A first approach would be to encode this direction in the opcode of the floating-point instruction, but space is usually at a premium there. So, hardware designers might decide that wasting several bits for it is not worth supporting the few programs that might need to round in a direction other than to nearest (mainly users of interval arithmetic). These programs might thus have to rely on software emulation.
Another approach is to dedicate a control register to dynamically inform the floating-point unit about the current rounding direction. This time, all the computations are performed in hardware. Unfortunately, on some architectures, changing the value of control registers might have a drastic impact. For example, it might stall the processor pipeline, so that all the floating-point operations in flight can complete before the rounding direction is effectively changed. Thus, routinely switching from rounding to nearest to another direction and back might considerably degrade performance.

FMA, division, and square root
Another impediment for hardware design is the FMA operation (defined in (2.4)). Since there is no rounding after the multiplication, the subsequent addition needs to be much wider than usual (e.g., 106 bits for binary64), and so does the shifter used to normalize the result. But that might not be the main issue. Indeed, the FMA operation might actually be the only opcode in the whole instruction set that needs to read three registers at once. As a consequence, supporting it in hardware might require a full redesign of the processor architecture just for this one instruction.
Division and square root are two other operations that come with their share of difficulties. Indeed, rather than having some costly dedicated hardware for them, it might be cheaper to use an iterative algorithm that computes a few bits of the result every cycle. Another approach is to implement a variant of the Newton-Raphson iteration using an FMA, which roughly doubles the number of bits of the result at each iteration, and performs a final correction essentially based on Property 2.13 (Cornea-Hasegan et al. 1999). In both cases, division and square root no longer fit inside the traditional pipeline; they might instead be handled by a microcode loop, thus impacting other operations around them.
Useful and detailed tables of latencies and throughputs of instructions for the most common processors are regularly updated by Agner Fog. In terms of latency as well as of throughput, the cost of floating-point division is between 3 and 10 times the cost of floating-point addition or multiplication, and the cost of square root is similar or worse. This must be taken into account, for instance, when hesitating between approximating some function by a polynomial or by a rational function. There is a chicken-and-egg issue here: since division is slower, programmers of numerical applications tend to avoid using it, and since it is less used, computer manufacturers do not make the necessary efforts to significantly accelerate it.

Decimal arithmetic
While binary arithmetic is most often used when it comes to numerical computations, the revised IEEE-754 standard also describes what a decimal floating-point arithmetic shall comply with. Most of the previous considerations apply equally to binary and decimal arithmetic. Subnormal numbers, however, are much less of an issue as inputs. Indeed, a given decimal number does not have a unique representation but a cohort of them; any number of the most significant digits can be zero, as long as no information is lost (IEEE 2019, §3.5.1). In other words, any decimal number has, in essence, a subnormal representation.
Another peculiarity of decimal arithmetic is that the IEEE-754 standard did not succeed in mandating a single encoding of the interchange formats for decimal floating-point numbers (IEEE 2019, §3.5.2). Indeed, two encodings of the significand are proposed. When using the binary encoding of decimal floating-point numbers, the significand is mostly stored as a binary integer, which makes it simpler for software implementations to read and write floating-point numbers. When using the decimal encoding, the significand is mostly stored by groups of three decimal digits (hence ten bits), which makes it simpler for hardware implementations to perform shifts on the representation. Both encodings fortunately represent the same set of floating-point values, so they have no impact on the computed results. But they might introduce some portability issues when accessing or transmitting decimal numbers.

Systems, languages, and compilers
Hardware is not the only piece one has to take into consideration when devising floating-point programs. Programming languages and compilers might also impede portability and reproducibility, even when the code seems to have a deterministic semantics according to the IEEE-754 standard.

Java
Let us illustrate the issue with the Java programming language. The original language aimed at reproducibility of floating-point computations and, to do so, mostly stuck to the semantics of the IEEE-754 standard. Unfortunately, few processors at the time complied with the standard, which crippled the implementation of the Java Virtual Machine. In particular, the legacy floating-point unit of the x86 architecture used 80-bit registers but made it possible to emulate binary32 and binary64 operations by setting a control register. This support, however, was only partial: the significand was rounded to the specified precision, but the exponent range was left unchanged. As a consequence, the floating-point result would always comply with the IEEE-754 standard, except in the subnormal range where it could, in rare cases, be off by one unit in the last place due to double rounding.
The inability to have both hardware support and reproducible results led the developers to weaken the language in the late 1990s. Starting from Java SE 1.2, reproducibility of floating-point computations would no longer be guaranteed by the language, unless the Java methods were explicitly annotated with the strictfp keyword. In that case, the Java Virtual Machine would fall back to software emulation on hardware that was not compliant with the IEEE-754 standard. Nowadays, non-compliant hardware has been sufficiently phased out to restore the original Java semantics and make strictfp unnecessary (Darcy 2017).

C and Fortran
Contrarily to Java, languages such as C and Fortran predate the IEEE-754 standard, and reproducibility was hardly a concern at the time. Instead, in the case of Fortran, the semantics of floating-point operations was motivated by the usage of the language, e.g., numerical simulation. So, floating-point numbers were just an approximation of real numbers. This led the Fortran standard to state that "the processor may evaluate any mathematically equivalent expression" (ISO Fortran 2008, §7.1.5.2.4). Here, mathematically equivalent expressions should be understood as having the same values when rounding, underflow, and overflow are ignored. For example, if the original program evaluates the expression a − b + c, the compiler can rewrite it into a − (b − c), regardless of any consideration for catastrophic cancellation.
The authors of the Fortran standard did, however, provide an escape hatch, namely parentheses. Indeed, the transformation is only allowed if "the integrity of the parentheses is not violated". In other words, while a Fortran compiler can rewrite a − b + c into a − (b − c), it cannot perform the converse transformation, as the input expression now contains parentheses. This also means that parentheses are not just used to disambiguate the parsing of arithmetic expressions; they are semantically relevant. For example, a compiler can factorize a × b + a × c into a × (b + c), but it cannot factorize (a × b) + a × c. It can, however, use an FMA for evaluating both expressions.
Regarding floating-point computations, the C language is stricter than Fortran. Thus, compliant compilers cannot perform this kind of algebraic rewriting, even in the absence of parentheses. Yet, compilers still have some leeway. Indeed, except for assignments and casts, the C11 standard states that "the values yielded by operators with floating operands [...] are evaluated to a format whose range and precision may be greater than required by the type" (ISO C 11, §5.2.4.2.2). The main consequence is that, even if the type of an expression is float, it might be evaluated as a binary64 number rather than a binary32 number. This excess precision can in turn cause double rounding issues and thus lower the accuracy in some rare cases.
A less obvious consequence is that it allows C compilers to make use of FMA operations. Consider the expression a × b + c. Irrespective of the type of a × b, this product may be virtually performed with infinite precision and thus fused with the subsequent addition into a single FMA operation. To prevent this optimization, the user has to explicitly cast the product a × b to its target type, or to assign its result to a variable, before proceeding with the sum.
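The following C sketch illustrates the three situations: contraction left to the compiler, contraction disabled with the standard FP_CONTRACT pragma (support for which varies between compilers), and an explicit FMA requested through the fma function:

```c
#include <math.h>

double maybe_fused(double a, double b, double c) {
    return a * b + c;            /* the compiler may or may not use an FMA */
}

#pragma STDC FP_CONTRACT OFF     /* forbid contraction from here on */
double never_fused(double a, double b, double c) {
    double p = a * b;            /* the assignment also forces the product to be
                                    rounded to double before the addition */
    return p + c;
}

double always_fused(double a, double b, double c) {
    return fma(a, b, c);         /* a single rounding, by definition */
}
```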
Note that the ability of C and Fortran compilers to fuse multiplications and additions can lead to some unintuitive behaviors. Indeed, given the expression x × x − x × x, the language standards allow the compilers to produce code that actually computes FMA(x, x, −RN(x^2)), whose result is often non-zero.

Mathematical libraries
Previous sections dealt with basic arithmetic operators. Let us now move to mathematical functions. While processors might provide some primitives and compilers might know about them, floating-point approximations of mathematical functions are, for the most part, handled in software libraries. These libraries provide various trade-offs. The reader interested in mathematical function algorithms can consult the books by Beebe (2017) and Muller (2016). Tests of recent mathematical libraries can be found in Innocente and Zimmermann (2022).
We do not address here the problem of generating random FP numbers, but it is an important topic, and the naive solutions can be deceptive (Goualard 2022).

The table-maker dilemma
First of all, contrarily to the situation with basic arithmetic operators, correct rounding is no longer a given. Indeed, correct rounding in rounding to nearest (respectively, in directed rounding) requires the ability to decide on which side of the midpoint between two consecutive floating-point numbers (respectively, on which side of a floating-point number) an infinitely-precise mathematical value lies. This might require a library to perform its internal computations with a very high precision. For example, consider sin(x) = sin(x − 2kπ) for some integer k. By choosing a large enough dyadic number x, one can make x − 2kπ arbitrarily close to any real number in [−π; π], which in turn makes sin(x) arbitrarily close to a floating-point midpoint. Fortunately, as a finite floating-point number, x has both a bounded precision and a bounded range, so there is an actual limit to how close sin(x) can be to a midpoint. This limit gives an indication of the accuracy needed when approximating sin(x) in order to correctly round it. This issue is not specific to the periodic nature of the trigonometric functions. It also impacts other mathematical functions. For example, there are a few inputs x in the binary64 range such that ln(x) is at a relative distance ∼2^−117 from a floating-point number or a midpoint, which means that ln(x) has to be approximated with an accuracy of about 117 bits in order to compute its correctly rounded binary64 value. An example by Lefèvre and Muller (2001) is the binary64 number x = 6239214722492854 × 2^626, whose logarithm, with the significand written in binary, is equal to 1.1101011001000111100111101011101001111100100101110001 (these are 53 bits), followed by 65 consecutive zeros and then 111001···, all multiplied by 2^8.
In the case of the natural logarithm in binary64, the hard-to-round results are known, which means that one can bound the time and space complexities of an algorithm that computes its correctly rounded approximation. But for some other functions, e.g., sin in binary64, this is not yet the case. For such a function, the algorithm has to compute increasingly accurate approximations of the infinitely precise result until it can decide what the correctly rounded result is. Thus, while the time and space complexities are necessarily bounded for a given floating-point format, our inability to put upper bounds on those means that correct rounding of these functions is not yet advisable for critical real-time systems. Hopefully, this inability is temporary: progress is being made on this topic (Brisebarre, Hanrot and Robert 2017).
Despite performance issues, correct rounding has some important properties. Indeed, its uniqueness guarantees portability and reproducibility. Moreover, its accuracy guarantees that the monotonicity properties of the mathematical functions are also satisfied by their approximations. Thus, it would be unfortunate to drop correct rounding entirely. For example, while sin is not yet fully understood on the whole binary64 range, we know its hard-to-round cases for some useful input intervals, e.g., [−2π; 2π]. So, a trade-off can be reached: even if a mathematical library does not provide correct rounding everywhere, it can still document the functions and the input intervals for which it is guaranteed. This is the approach followed by the IEEE-754 standard. Let us mention the important current effort around the CORE-MATH library of very efficient correctly-rounded functions (Sibidanov, Zimmermann and Glondu 2022).

Rounding directions
Considering correct rounding is only meaningful when a rounding direction is decided. Most correctly rounded libraries at least support the default rounding mode, i.e., rounding to nearest, tie breaking to even. The situation regarding directed rounding is more diverse. First, a library might only support rounding to nearest. Second, a library might provide alternate implementations of all the mathematical functions, one for each rounding mode. Third, a library might provide only one implementation of each function, but written in such a way that the behavior of the function respects the dynamic rounding mode of the processor, the way the hardware arithmetic operators do. Conversely, it should be noted that libraries that fall into the first two categories are usually not resilient to the dynamic rounding mode and will work correctly only when it is set to rounding to nearest.

Trade-off between accuracy and performance
Let us now assume that a given mathematical library has forsaken the optimality of correct rounding on some input ranges. This opens the way to a wide range of potential implementations, with a trade-off between speed and guaranteed accuracy. For example, a mathematical library could promise that the computed results are within (0.5 + ε) ulp of the infinitely precise results when rounding to nearest. For ε small enough, e.g., ε = 10^−4, this would allow for correct rounding in most cases. By decreasing ε even further, at the expense of a larger computation time, one could even make the probability of an incorrectly rounded result negligible. On the other side of the spectrum, some applications might not even need an accuracy close to the precision of the format. They could live with only faithfully rounded results, or even much worse.
Up to now, we have considered a single evaluation of sin(x) in isolation. But some applications might need to perform numerous evaluations in parallel. Some mathematical libraries hence provide implementations that, through careful use of pipelined and/or vector floating-point units, can perform such parallel computations. This severely constrains the kind of algorithms that can be implemented, and might thus limit the accuracy of such implementations (Shibata and Petrogalli 2020). Consider the use of a vector floating-point unit of width w to evaluate sin(x_1), ..., sin(x_w) in parallel. The evaluations need to have the same control flow as often as possible, which means that conditional branching has to be avoided. Instead, the algorithm will use predicated instructions to control whether a given floating-point operation is performed on a given lane. As a consequence, the latency for producing sin(x_1), ..., sin(x_w) is at least the latency of the slowest lane. When the predication varies widely between the lanes, it could even end up being as slow as if the evaluations had been performed sequentially. Thus, algorithms have to be simple and short to benefit from the use of a vector unit, which limits the accuracy. Mathematical libraries for graphical processing units suffer from similar shortcomings.

Reproducible computations
Except for Not-a-Number floating-point data, the principles of the IEEE-754 standard, especially correct rounding, guarantee the reproducibility of floating-point computations across several program executions. These principles could even ensure the portability of floating-point results across various environments. Unfortunately, for larger systems, some other factors have to be taken into account, most prominently parallelism.
Let us consider the example of a large sum of floating-point numbers, e.g., an inner product as commonly encountered in matrix computations. If the order of floating-point additions is set once and for all, then reproducibility and portability are guaranteed. But setting this order statically might hinder performance. Indeed, assuming that several computation cores are available, one might want to split this large sum into smaller blocks of numbers and dispatch them to the various cores, so that all the blocks can be summed in parallel. The final sum can then be performed by adding the partial sums. While this approach is sensible in an infinitely precise setting, the lack of associativity of floating-point addition makes the results non-reproducible when the size of the blocks is set dynamically, depending on the availability, speed, and bandwidth of the cores.
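A two-line C example of this lack of associativity: with binary64 numbers, the two summation orders below give different results.

```c
#include <stdio.h>

int main(void) {
    double a = 1e16, b = -1e16, c = 1.0;
    printf("%g %g\n", (a + b) + c, a + (b + c));   /* prints "1 0" */
    return 0;
}
```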
Even if only one core is available, computations might still be performed per block, for performance reasons. For instance, matrix multiplication might be subdivided into smaller blocks (hence partial inner products) in order to best benefit from cache locality. Such an algorithm would thus have non-portable results, as they would depend on the size of the cache. Another cause for divergence is the availability of a floating-point vector unit. This is quite similar to the previous instances, except that the number of blocks then depends on the width of the vector unit, and the blocks are interleaved instead of contiguous. Even if there is no vector unit, performance might still benefit from interleaving floating-point computations depending on the length of the instruction pipeline.
While some of these choices can be decided statically at compilation time rather than dynamically at execution time, they still impact the portability of something as simple as matrix multiplication, when performance matters. An illustration of these considerations lies in the ATLAS implementation of the BLAS routines. As explained by Whaley, Petitet and Dongarra (2001), its routines take into account the size of the L1 cache, whether the FMA instruction is available, the pipeline depth, the number of registers (for loop unrolling), and so on.
To improve reproducibility, various solutions have been proposed. Most of them come down to performing only exact operations, so as to completely sidestep the issue of floating-point addition not being associative once rounded. For the binary64 format, Kulisch (2013) proposed to use a long accumulator that would cover the whole range of floating-point numbers or, in the case of the scalar product, the whole range of products of floating-point numbers. This 4288-bit register requires some expensive hardware (Uguen and de Dinechin 2017), but offers a correctly rounded result by not losing any bit of information during the computation. Software implementations of the long accumulator usually do not fare well, due to the indirect memory accesses they entail. But, if the dynamic range of the inputs is not too large, a careful implementation of the summation can get close to the optimal performance of the non-associative floating-point addition, as shown by Collange et al. (2015).
If correct rounding is not needed, one does not need the whole accumulator: only the most significant bits of the sum need to be kept. For the performed additions to be exact, all the inputs can be split according to a given boundary (Section 4.3), ensuring that their significands are aligned. In the case of a distributed computation, it might be difficult to find a suitable boundary in a reproducible way. Demmel and Nguyen (2015) have proposed that every core should compute its partial sum as a triple of floating-point numbers aligned on boundaries of the form 2^(k·w), where w is 40 for the binary64 format (see Section 5.3). The final reduction can then keep only the relevant most-significant parts of the triples and sum them, in order to obtain the same triple that would have been obtained if there had been only one core.

Taming rounding errors
Most prominently, floating-point arithmetic is used as a cheap way of performing computations on real numbers. As a consequence, the main issue is the rounding error that taints the result of every operation. This begs the question: how does one make sure that the computed result is close to the real one? Interval arithmetic answers by enclosing the real result between two floating-point bounds instead of approximating it with only one floating-point number, as shown in Section 4.1.
In some cases, one can recover the rounding error as a floating-point number, thus making it possible to propagate it along computations. That is the purpose of the error-free transformations presented in Section 4.2. Splitting floating-point numbers is another way of controlling rounding errors, by making sure only exact operations will be performed, as shown in Section 4.3.
When we are able to bound the error of a calculation (for instance using the standard model), we can use floating-point filters to avoid long-precision calculations in many situations. This is described in Section 4.4.
In the same way floating-point numbers are used to approximate real numbers, pairs of floating-point numbers are used to approximate complex numbers. The rounding error can then be considered from two different perspectives: normwise or componentwise. Section 4.5 provides accurate algorithms for both cases. Section 4.6 then considers rounding errors in higher dimensions.
Finally, Section 4.7 presents a few tools that make it possible to accurately approximate functions using floating-point polynomials.

Interval arithmetic
Given an expression e over the real numbers, it can be mechanically translated into an expression ẽ over the floating-point numbers. Replacing exact operations by their rounded counterparts introduces rounding errors in general, but evaluating ẽ hopefully gives a result sufficiently close to e. Guaranteeing this property requires an error analysis of the algorithm.
Rather than approximating e by ẽ, a slightly different approach consists in enclosing e by an interval [e]. This can be done mechanically by replacing every exact operation by its counterpart given by interval arithmetic (Moore 1979, Neumaier 1990, Moore, Kearfott and Cloud 2009, Rump 2010, IEEE 2015). Indeed, interval operators are defined in a way that guarantees the containment property. For instance, the difference of two intervals [v] and [w] satisfies the following property: for every v in [v] and every w in [w], the real number v − w belongs to the interval [v] − [w]. Thus, by induction on the expression e, we get an enclosure [e] of e. Intervals are usually implemented as pairs of floating-point numbers. These can represent either the lower and upper bounds of the interval, or its center and radius. The monotonicity properties of subtraction and rounding, as well as the correct handling of infinities in case of overflow, guarantee that the resulting interval indeed satisfies the containment property above. Note that even seemingly straightforward operations such as computing the midpoint of an interval require some care (Goualard 2014).
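A minimal sketch in C of such an interval subtraction, using the dynamic rounding direction to obtain outward-rounded bounds; the type and function names here are ours, not those of an existing library, and in practice the compiler must be prevented from reordering operations around fesetround (e.g., by honoring FENV_ACCESS):

```c
#include <fenv.h>
#pragma STDC FENV_ACCESS ON

typedef struct { double lo, hi; } interval;

/* Containment: for all v in [v.lo, v.hi] and w in [w.lo, w.hi],
   the real number v - w belongs to the returned interval. */
interval interval_sub(interval v, interval w) {
    interval r;
    fesetround(FE_DOWNWARD);
    r.lo = v.lo - w.hi;          /* lower bound, rounded toward -infinity */
    fesetround(FE_UPWARD);
    r.hi = v.hi - w.lo;          /* upper bound, rounded toward +infinity */
    fesetround(FE_TONEAREST);    /* restore the default direction */
    return r;
}
```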
The containment property guarantees that the mathematical result e lies in the computed interval [e]. Thus, as long as the resulting interval is narrow, this completely alleviates the need for an error analysis, at the expense of a computation that is roughly twice as slow. Another way to see it is that the error analysis that would be performed on ẽ is performed at runtime by the code that computes [e].
Even disregarding the performance issue, interval arithmetic is not a panacea against rounding errors. Indeed, there is no guarantee that the interval [e] is narrow enough for a given application. For example, if one needs to decide whether e ≥ 0, one could compute its enclosing interval [e] and check whether it contains only nonnegative values (or only nonpositive values). But what if it straddles zero? In the worst case, one might even end up with [e] = [−∞; +∞], which trivially satisfies the containment property but is utterly pointless.
The main reason for this issue is the dependency problem, also known as the wrapping effect in the context of dynamical systems (Lohner 2001). Indeed, naive interval arithmetic does not remember how sub-expressions are correlated; it has to assume that they are completely independent from each other. As soon as two sub-expressions depend on the same input variables, this assumption no longer holds, which implies that the resulting interval may be overly pessimistic. The archetypal example is the subtraction x − x: unless the interval [x] that encloses x is a singleton, the resulting interval cannot be the optimal interval [0; 0].
The example x − x might seem artificial, but it is representative of various algorithms. For example, consider the Newton-Raphson iteration x_(n+1) = x_n − f(x_n)/f′(x_n), which is used to find a root ℓ of a function f. If we denote by δ = x_n − ℓ the distance between x_n and the target root ℓ, we hope that the quotient f(x_n)/f′(x_n) is close to δ, so that x_(n+1) is close to the root ℓ. In other words, the code is expected to compute a difference akin to (ℓ + δ) − δ. This is bound to fail with interval arithmetic. While the floating-point iteration usually converges toward ℓ, the following interval iteration would compute larger and larger intervals:

[x]_(n+1) = [x]_n − f([x]_n)/f′([x]_n).
Thus, one should be careful not to blindly replace floating-point operations by interval operations. An analysis of the original algorithm might be needed to reduce correlations and hence to ensure useful results. For instance, there is an interval version of the Newton-Raphson iteration given by Moore (1979, Eq. (5.16)), but it is slightly different from a naive translation. In particular, it involves the midpoint mid([x]_n) of the input interval [x]_n in some select places:

[x]_(n+1) = [x]_n ∩ ( mid([x]_n) − f(mid([x]_n))/f′([x]_n) ).
If the initial interval [x]_0 is large enough to contain the root ℓ, any subsequent interval [x]_n is guaranteed to contain ℓ too, thanks to a combination of the containment property and of the mean-value theorem applied to f around mid([x]_n): there exists ξ between ℓ and mid([x]_n), hence inside [x]_n, such that f(mid([x]_n)) = f′(ξ) · (mid([x]_n) − ℓ), from which ℓ = mid([x]_n) − f(mid([x]_n))/f′(ξ), a value that belongs to the right-hand side of the interval iteration above.

Error-free transformations
An immediate consequence of the properties of Section 2.8 about correcting terms is that the error is, more often than not, also an FP number that can be computed with FP operations, with no need of multiple-precision software. The sequence of operations that returns both the result of a floating-point operation and the error of that operation is called an error-free transformation (EFT). The first ideas seem to go back to Gill (1951), in the context of fixed-point arithmetic.
Algorithms 1 (page 17) and 2 (below) are explicitly mentioned by Møller (1965). Algorithm 1 also appears in the summation algorithm of Kahan (1965), which is Algorithm 20 below. Several EFTs are given and used in the seminal article by Dekker (1971). The term error-free transformation was introduced by Ogita, Rump and Oishi (2005). We will now review the various error-free transformations and their specifics.
Note that some of these EFTs are given below with the rounding function RN. With other rounding functions, these transformations will not always be error-free. In particular, when discussing Property 2.11 we gave the example of an addition with a directed rounding function whose error is not a floating-point number. However, under conditions discussed by Boldo, Graillat and Muller (2017), the various EFTs of addition and subtraction nevertheless return a value close to the exact error, so their results can still be useful for designing compensated algorithms (see Section 5.3). In all cases, the availability of subnormal numbers is needed for the EFTs to return a correct result.

EFT for addition
When rounding is to nearest, the error of an FP addition is an FP number (this is Property 2.11), and algorithms for computing that error have been available since at least the 1960s (Møller 1965).
A first such algorithm is Fast2Sum, that is, Algorithm 1 described in Section 2.4. The two input FP numbers a and b are transformed into two FP numbers s and t such that, exactly, s + t = a + b with s = RN(a + b). But it requires that their floating-point exponents satisfy e_a ≥ e_b (which is implied by |a| ≥ |b|, a condition easier to check in practice). This condition is important, since t can be very far from the error of the FP addition of a and b when it is not satisfied. A simple example in binary64 arithmetic is a = 1 and b = 2^55, for which the algorithm returns t = 0 instead of 1.
In the general case, when we do not know beforehand whether |a| ≥ |b|, we can rely on the 2Sum algorithm (Møller 1965, Knuth 1998), that is, Algorithm 2 below. As before, the inputs are FP numbers a and b and the outputs are FP numbers s and t such that s + t = a + b exactly and s = RN(a + b), but unlike in Algorithm 1, no assumption has to be made on a and b.
Of course, the floating-point addition of a and b can overflow (Section 2.2). However, if it does not, then 2Sum almost never overflows, while Fast2Sum is fully immune to spurious overflow: if the first addition does not overflow, then none of the other operations overflow either. Furthermore, 2Sum is "optimal" in the following sense: if an algorithm made up only of floating-point additions and subtractions always returns the same results as 2Sum (i.e., the result and the error of the FP addition of a and b), then it uses at least the same number of operations as 2Sum (Kornerup et al. 2012). Priest (1991) has given an algorithm for addition that does almost the same thing as 2Sum, even when the rounding function ◦ is not RN (in fact, it only needs to be a faithful rounding). It returns a pair (s, t) such that s + t = a + b, with the guarantee that the floating-point exponent of t is at most the floating-point exponent of s minus p; however, s is not guaranteed to be ◦(a + b).
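In C, and assuming round-to-nearest binary64 arithmetic with no excess precision, the two EFTs can be sketched as follows (the struct name is ours):

```c
typedef struct { double s, t; } sum_err;   /* s = RN(a+b), t = error term */

/* Fast2Sum: 3 operations, requires |a| >= |b| (or the exponent condition). */
sum_err fast2sum(double a, double b) {
    sum_err r;
    r.s = a + b;
    double z = r.s - a;
    r.t = b - z;
    return r;
}

/* 2Sum: 6 operations, no condition on a and b. */
sum_err twosum(double a, double b) {
    sum_err r;
    r.s = a + b;
    double bb = r.s - a;
    double aa = r.s - bb;
    double db = b - bb;
    double da = a - aa;
    r.t = da + db;
    return r;
}
```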

EFTs for multiplication, division, and square root
If two FP numbers a and b satisfy the condition of Property 2.12, then the error of their floating-point product can be obtained very simply using the 2Prod algorithm shown below. It is also called Fast2Mult (Kahan 1997, Nievergelt 2003). Note that it requires a fused multiply-add instruction (FMA) to compute ◦(ab − p), where p = RN(ab) and ◦ is an arbitrary rounding function.
Algorithm 3 provides the exact rounding error, or a rounding of it in case of underflow when computing p. Note the use of the "generic" rounding function ◦ instead of RN: indeed, by Property 2.12, ab − p is an FP number, so it is computed exactly, whatever the rounding function.
When no FMA is available, one can use an older algorithm by Veltkamp (1968, 1969) and Dekker (1971), which has been formally proved by Boldo (2006). It splits the FP numbers a and b in two (see Section 4.3), multiplies the halves, and then subtracts all the parts in a way that ensures exact results, thanks to properties similar to those of Section 2.4. It takes 17 floating-point operations, which is quite costly compared to Algorithm 3.
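With an FMA, the C version of 2Prod is just two operations; barring underflow, the second one is exact:

```c
#include <math.h>

typedef struct { double p, e; } prod_err;   /* p = RN(ab), e = ab - p */

prod_err twoprod(double a, double b) {
    prod_err r;
    r.p = a * b;
    r.e = fma(a, b, -r.p);   /* exact residual of the product (Property 2.12) */
    return r;
}
```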
For division and square root, FMA-based algorithms similar to Algorithm 3 follow Properties 2.13 and 2.14, with the restriction that rounding to nearest is required to guarantee an exact residual in the case of square root.
The augmented operations

The 2019 revision of the IEEE-754 standard also specifies augmented operations (augmentedAddition, augmentedSubtraction, and augmentedMultiplication), which return both the rounded result of the operation and the associated error term; history and motivation are given by Riedy and Demmel (2018). Special cases (operands or results equal to ±0, ±∞, or NaNs) are described by the standard (IEEE 2019). Note the use of the rounding function RN_0 (i.e., round to nearest, ties to zero), which differs from the default RN (round to nearest, ties to even). With this rounding function, these operations would significantly help to implement bitwise-reproducible summation and dot product, using an algorithm given by Demmel, Ahrens and Nguyen (2016).
As we are writing these lines, these augmented operations are not implemented in hardware on any commercial platform. If, one day, an efficient implementation is provided on some processor, they will be good candidates for replacing the above-presented 2Sum, Fast2Sum, and 2Prod algorithms. However, due to the special rounding function, one will need to check whether an algorithm that has been proved to work with the usual error-free transformations still works with these operations. Boldo, Lauter and Muller (2021) show how to emulate these operations using conventional ties-to-even arithmetic. This implies a significant performance penalty, but it makes it possible to start designing and testing programs that use the augmented operations now. These programs will be ready for use with full efficiency when the augmented operations become available in hardware.

EFT for the FMA
Assuming a round-to-nearest rounding function, the error of an FMA operation cannot, in general, be expressed as a single FP number. It can, however, be expressed as the unevaluated sum of two FP numbers (i.e., a double-word, see Section 5.1), which can be obtained using Algorithm 4 below. This EFT was introduced by Boldo and Muller (2005) and has since been formally proved in Coq (Boldo and Muller 2011). Given three FP numbers a, x, y and assuming that no underflow or overflow occurs, this algorithm makes it possible to convert the exact expression ax + y into the unevaluated sum of three floating-point numbers r + t_1 + t_2, where r, t_1, t_2 are such that r = RN(ax + y), |t_1 + t_2| ≤ (1/2) ulp(r), and |t_2| ≤ (1/2) ulp(t_1).

Algorithm 4 ErrFma(a, x, y)
1: r ← RN(ax + y)
2: (u_1, u_2) ← 2Prod(a, x)
3: (α_1, α_2) ← 2Sum(y, u_2)
4: (β_1, β_2) ← 2Sum(u_1, α_1)
5: γ ← RN(RN(β_1 − r) + β_2)
6: (t_1, t_2) ← Fast2Sum(γ, α_2)
7: return (r, t_1, t_2)

Since the 2Prod algorithm performs one multiplication and one FMA, and since the 2Sum (resp. Fast2Sum) algorithm performs six (resp. three) additions, we deduce that the ErrFma algorithm performs one multiplication, two FMAs, and 17 additions, that is, 20 operations in total. In general, it is not known whether this cost can be reduced (which is in contrast with the optimality of 2Sum mentioned in Section 4.2.1). However, if we know in advance that a, x, y are such that the exact error ax + y − RN(ax + y) fits into a single FP number instead of two (that is, t_2 = 0), then the call to Fast2Sum in line 6 can be replaced by a single addition, t_1 ← RN(γ + α_2), which saves two operations. This simplified version of ErrFma can be used, for example, when designing FP algorithms for correctly-rounded square-root reciprocals (Borges, Jeannerod and Muller 2022). Also, 3 floating-point operations (corresponding to the Fast2Sum of line 6) can be saved by directly returning (r, γ, α_2), at the price of an "un-normalized" output (Boldo and Muller 2005). This is much in the spirit of Lange and Rump's pair arithmetic (see Section 5.2).
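A C transcription of this EFT, reusing the twosum, fast2sum, and twoprod helpers sketched above; it assumes round-to-nearest arithmetic and the absence of underflow and overflow:

```c
#include <math.h>

typedef struct { double r, t1, t2; } fma_err;

/* Error-free transformation of ax + y, following the operation sequence
   of Algorithm 4: r + t1 + t2 = ax + y exactly, with r = RN(ax + y). */
fma_err errfma(double a, double x, double y) {
    fma_err out;
    out.r = fma(a, x, y);               /* r = RN(ax + y) */
    prod_err u  = twoprod(a, x);        /* u.p + u.e = ax exactly */
    sum_err  al = twosum(y, u.e);
    sum_err  be = twosum(u.p, al.s);
    double gamma = (be.s - out.r) + be.t;
    sum_err  t  = fast2sum(gamma, al.t);
    out.t1 = t.s;
    out.t2 = t.t;
    return out;
}
```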
In some other applications, the exact error of the FMA is in fact not needed and can be replaced by any good enough approximation. In such cases, we can thus relax the specification of ErrFma in order to produce a single FP number t ≈ t_1 + t_2 instead of the pair (t_1, t_2) and, hopefully, obtain a cheaper algorithm. For example, if we choose to compute the correctly-rounded value t = RN(t_1 + t_2) then, using t_1 + t_2 = γ + α_2 and t_1 = RN(γ + α_2), it again suffices to replace line 6 in Algorithm 4 by t ← RN(γ + α_2). This gives the best possible approximation to the error of the FMA in 18 FP operations instead of 20, and this for every input (a, x, y) such that no underflow or overflow occurs. This variant of ErrFma is of course not an EFT anymore, since the exact sum r + t can now differ from ax + y, but the pair (r, t) can be viewed as a double-word approximation to ax + y: here, the relative error of r + t with respect to ax + y is at most u^2 = 2^(−2p), as if we had rounded ax + y to nearest in precision 2p. If we accept to relax the accuracy specification a little further and can tolerate a relative error bound of the form O(u^2) for some modest hidden constant, then even faster solutions are possible. In particular, Boldo and Muller (2011) proposed an algorithm (Algorithm 5) that uses only 12 FP operations to produce a pair (r, t) such that r = RN(ax + y) and r + t approximates ax + y with a relative error in O(u^2). Algorithms 4 and 5 have been used by Kouya (2019) to implement some BLAS1 functions in double-word arithmetic. The experiments reported there for u = 2^−53 are clearly in favor of the second algorithm, which in this specific context is significantly faster than the first one while preserving the same level of accuracy. Comparisons with Lange and Rump's pair arithmetic (see Section 5.2) still need to be done. The EFTs can also be used to emulate stochastic roundings when no hardware implementation is available (Févotte and Lathuilière 2016, Croci et al. 2022).

Splitting algorithms
We sometimes need to split a precision-p floating-point number x, i.e., to decompose it into two FP numbers x_h and x_ℓ of smaller significand size, such that x = x_h + x_ℓ and |x_ℓ| ≤ (1/2) ulp(x_h) (or, in some cases, |x_ℓ| < ulp(x_h)). Among the many applications of this, let us mention the following ones.
• By requiring x_h to have a one-bit significand, we can compute ufp(x).
• Very similarly, but with looser constraints, we may require |x_h| to be a power of 2 close to |x|. This can be useful when scaling some calculations, e.g., to avoid spurious underflows or overflows when computing Euclidean norms or performing some operations in complex arithmetic.
• By requiring |x_ℓ| to be less than one and x_h to be an integer, we compute x mod 1, which for instance is useful when implementing the function cospi(x) = cos(πx).
• By requiring x_h and x_ℓ to have ⌊p/2⌋-bit significands (which is always possible in radix-2 FP arithmetic, even when p is odd), and doing the same for another FP number y decomposed into y_h and y_ℓ, we ensure that x_h·y_h, x_h·y_ℓ, x_ℓ·y_h, and x_ℓ·y_ℓ are FP numbers (and hence are computed exactly with FP multiplications). This is the key to the algorithm of Dekker (1971), used in place of 2Prod when no FMA instruction is available (see Section 4.2.2).
• It can be used to accurately compute the sum of many FP numbers, by splitting them into parts that can be accumulated without error. An example of such a summation algorithm is given and analyzed in detail by Rump et al. (2008). See also Sections 3.4 and 5.3.
As done by Jeannerod, Muller and Zimmermann (2018), we distinguish between absolute splittings, where we require x_h to be a multiple of some constant and |x_ℓ| to be less than that constant, and relative splittings, where the required bound on x_ℓ is somehow proportional to |x|. The following algorithm performs an absolute splitting of x, similarly to the algorithm ExtractScalar of Rump et al. (2008).

Algorithm 6 NearestInteger(x)   (C is an integer constant, see Theorem 4.1)
1: s ← RN(C + x)
2: x_h ← RN(s − C)
3: x_ℓ ← RN(x − x_h)
4: return (x_h, x_ℓ)

Note that Algorithm 6 is similar to Fast2Sum (Algorithm 1); only the input assumptions somehow differ. This algorithm is, for instance, used for reducing the initial argument in the Julia implementation of the function cospi.

Theorem 4.1 (Jeannerod et al. 2018). Assume that C is an integer satisfying suitable conditions (made precise in that paper; essentially, the floating-point numbers in the neighborhood of C + x must be exactly the integers); then the floating-point number x_h returned by Algorithm 6 is an integer such that |x − x_h| ≤ 1/2 (that is, x_h is equal to x rounded to a nearest integer). Furthermore, x = x_h + x_ℓ.
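A common binary64 instance of this splitting uses the "magic constant" C = 2^52 + 2^51: adding and subtracting it rounds x to a nearest integer (ties resolved by the round-to-nearest-even addition), provided |x| ≤ 2^51 and no excess precision or value-changing optimization interferes. A C sketch:

```c
#include <stdio.h>

int main(void) {
    const double C = 6755399441055744.0;   /* 2^52 + 2^51 */
    double x = 2.7;
    double s  = x + C;
    double xh = s - C;     /* x rounded to a nearest integer */
    double xl = x - xh;    /* exact: x = xh + xl and |xl| <= 1/2 */
    printf("%g = %g + %g\n", x, xh, xl);   /* 2.7 = 3 + -0.3 */
    return 0;
}
```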
Some absolute splitting techniques have also been introduced by Rump et al. (2008), for scalars and vectors, in the context of the accurate summation of many FP numbers. Extensions to the matrix case have then been proposed by Ozaki, Ogita, Oishi and Rump (2012), which make it possible to compute EFTs of matrix products.
The next algorithm performs a relative splitting of x and can be seen as an FMA-based variant of Veltkamp's splitting used by Dekker (1971).

Theorem 4.2 (Jeannerod et al. 2018). Let x ∈ F and s ∈ Z be such that 1 ≤ s < p. Then, barring underflow and overflow, Algorithm 7 computes x_h, x_ℓ ∈ F such that x = x_h + x_ℓ and the significands of x_h and x_ℓ have at most p − s and s bits, respectively.
Computing ufp(x) exactly can be done thanks to an algorithm due to Rump (2009), recalled below as Algorithm 8. Rump's algorithm uses 4 FP operations, without any FMA. Furthermore, if in this algorithm we replace x by x · 2^(1−p) and ignore the possibility of underflow and overflow, then one can obtain ulp(x) in 4 FP operations as well.

Algorithm 8 ufp(x)

If an FMA is available and used, then one can obtain ulp(x) in just 3 FP operations, by adapting a scheme introduced by Rump, Zimmermann, Boldo and Melquiond (2009) for computing the predecessor and successor of a given FP number (Jeannerod et al. 2018, Algorithm 8). Note also that this paper by Rump et al. (2009) provides a good illustration of the care needed to handle subnormal inputs properly.
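When exponent-manipulation functions are acceptable (and the input is a finite, nonzero, normal number), ufp and ulp can also be obtained directly, without the pure floating-point sequences discussed above; a C sketch for binary64 (p = 53):

```c
#include <math.h>

double ufp53(double x) {                  /* 2^floor(log2|x|) */
    return ldexp(1.0, ilogb(x));
}

double ulp53(double x) {                  /* ufp(x) * 2^(1-p), with p = 53 */
    return ldexp(1.0, ilogb(x) - 52);
}
```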
Finally let us mention that alternative splitting algorithms have been proposed by Graillat, Lefèvre and Muller (2020) when rounding is assumed to be either towards −∞ or towards +∞ (instead of to nearest). Such algorithms can then be used to implement Dekker's EFT for the product of two FP numbers when directed roundings are used.

Floating-point filters
In some cases, despite the use of numerical computations, one might not be looking for a numerical result but for a Boolean result. A typical example is in Computational Geometry, where decisions such as "are these two points on the same side of this plane?" or "is this point inside this sphere?" must be made. These decisions usually reduce to knowing the sign of a determinant.
If the rounding error of the floating-point evaluation of the determinant can be (hopefully tightly) bounded, then one will in most cases be able to make the decision with a simple floating-point calculation. The use of exact or highly accurate arithmetic can be reserved for the (hopefully few) cases where the absolute value of the computed determinant is less than the error bound. This is the idea behind the use of floating-point filters, suggested by Fortune and Van Wyk (1993). See also the works by Shewchuk (1997), Pion (1999), Demmel and Hida (2004), and Bartels, Fisikopoulos and Weiser (2022).
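A minimal C sketch of such a filter for the sign of a 2×2 determinant ad − bc; the error-bound constant is deliberately generous rather than sharp (sharp semi-static bounds are derived, e.g., by Shewchuk (1997)), underflow is ignored, and sign_det2x2_exact stands for a hypothetical exact fallback:

```c
#include <math.h>
#include <float.h>

int sign_det2x2_exact(double a, double b, double c, double d);  /* exact fallback,
                                                                   assumed provided */

/* Returns the sign of ad - bc: +1, -1 or 0. */
int sign_det2x2(double a, double b, double c, double d) {
    double p = a * d, q = b * c;
    double det = p - q;
    /* Generous bound on the rounding error of det (barring underflow). */
    double bound = 4.0 * DBL_EPSILON * (fabs(p) + fabs(q));
    if (det >  bound) return +1;
    if (det < -bound) return -1;
    if (bound == 0.0) return 0;             /* all operations were exact */
    return sign_det2x2_exact(a, b, c, d);   /* rare, expensive path */
}
```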

Error bounds in complex arithmetic
Once floating-point arithmetic is available for approximating and manipulating real numbers, it can naturally be used to create ways to approximate and manipulate complex numbers, and this has been common practice at least since the 1960s (Smith 1962, Wilkinson 1965, Friedland 1967).
Typically, a complex number z ∈ C will be approximated by a complex floating-point number of the form ẑ = a + ib, with i^2 = −1 and a, b two FP numbers from the same set F of binary, precision-p FP numbers. If a and b are obtained by rounding the real and imaginary parts of z to nearest in F, then in the absence of underflow and overflow, each of the two associated relative errors is at most u = 2^−p and, therefore, ẑ = z(1 + ε) with |ε| ≤ u. This is just the complex analogue of what we have seen in Section 2.1, the relative error ε being now a complex number whose absolute value is bounded by the unit roundoff independently of z. Similarly to the real case, u can be replaced by u/(1+u) when rounding is to nearest, and by 2u in the case of directed or faithful roundings. (In the case where |z| = 1, which occurs in particular when z is a root of unity, the bounds u and 2u can be divided by √2, as shown by Brisebarre et al. (2020).) Also, the possibility of underflow for a or b can be taken into account by adding a suitable complex term to the product z(1 + ε) (Demmel 1984).

Given two complex FP numbers x = a + ib and y = c + id, we can now consider how to operate on them. In principle, this should be straightforward, at least for basic arithmetic, which is defined by simple real functions of a, b, c, d:

|x| = √(a^2 + b^2),
x ± y = (a ± c) + i (b ± d),
x y = (ac − bd) + i (ad + bc),
x / y = (ac + bd)/(c^2 + d^2) + i (bc − ad)/(c^2 + d^2),
√x = √((|x| + a)/2) + i · sign(b) · √((|x| − a)/2),

where for the square root we choose the root with nonnegative real part and assume sign(0) = +1. (Choosing a negative real part would be possible as well, provided the sign of the imaginary part is adjusted accordingly; see the work of Kahan (1987, p. 201) for a sign function supporting signed zeros.)

In practice, providing suitable FP support for this set of basic operations is nontrivial, as various issues arise, pertaining notably to specification, robustness, and accuracy. These issues are well known and detailed discussions can be found, for example, in the writings of Kahan (1987), Kahan and Thomas (1991), and Beebe (2017, Chap. 15, 17), so here we only make a few comments.
Specification. Given some complex FP inputs whose real and imaginary parts can be finite FP numbers as well as infinities or NaNs, what complex FP output should we return for |x|, x + y, etc.? For the complex absolute value as well as for complex addition and subtraction, we can simply rely on the IEEE-754 specifications of the hypotenuse function and of addition/subtraction. Note, however, that since support for the hypotenuse function is not required but only recommended by the standard (IEEE 2019, §9), most existing implementations of |x| might follow a weaker specification, where a correctly-rounded result is not guaranteed. For complex multiplication, division, and square root, there is no such complete specification to rest on, at least because strict requirements for the behavior of an operation like ac − bd have so far been outside the scope of IEEE-754 arithmetic. Several implementation choices are thus possible, leading sometimes to inconsistencies for numerical values as well as exceptional behaviors (Demmel et al. 2022, §2.2).

Robustness. Another difficulty comes from the possibility of undue intermediate underflows and overflows when evaluating expressions such as ac − bd, ac + bd, c^2 + d^2, or a^2 + b^2, even when the exact result is well within range. Various schemes have been proposed to avoid or mitigate these spurious exceptions (Smith 1962, Friedland 1967, Stewart 1985, Kahan 1987, Li et al. 2000, Priest 2004, Baudin and Smith 2012), and some of them appear, for example, in the working draft of the C standard (ISO/IEC 2022, Annex G). Also, a scaling technique for multiplication is mentioned by Sterbenz (1974, §13). Note finally that the availability of a correctly-rounded operator (a, b, c, d) ↦ ab + cd would immediately lead to a robust multiplication; for division, however, inputs such that ac + bd ≈ c^2 + d^2 ≫ Ω (the largest finite floating-point number) would remain problematic.

Accuracy. A third difficulty is due to the fact that damaging cancellation can be caused by straightforward FP evaluations of expressions such as ac − bd or |x| − |a|, which appear in the above definitions of xy, x/y, and √x. For the other operations, namely |x| and x ± y, this is clearly not an issue, and high accuracy is ensured even for naive implementations of the hypotenuse (Hull et al. 1994, Ziv 1999), although even better accuracy can be obtained with some care, as shown by Borges (2021).
In the next three subsections, we will examine this accuracy issue more carefully, thus focusing on multiplication, division, and square root, and see when and how the possibility of heavy cancellation should be dealt with in order to ensure some high level of accuracy. A number of error bounds will be reviewed, which have been given in the literature under the assumption that underflows and overflows do not occur. These bounds, however, remain compatible with the scaling techniques mentioned in the previous paragraph provided the scaling factors are integral powers of two.
The last subsection will be devoted to extensions to the complex case of some of the error-free transformations seen in Section 4.2.

Ensuring normwise accuracy: multiplication and division
For complex multiplication and division, it was noticed early on that the definitions recalled above can be used directly in FP arithmetic to obtain computed values ẑ such that

ẑ = z(1 + ε),   |ε| ≤ c·u + O(u^2),        (4.2)

for some modest constant c ∈ R_{>0} (Champagne 1964, Wilkinson 1965). All we need here is that each FP operation is performed with relative error at most u (or even a small multiple of it) and that underflows and overflows do not occur.
In other words, despite the possibility of damaging cancellation when computing the real or imaginary part of the result (resulting in poor component-wise accuracy), the traditional expressions for complex multiplication and division are enough to ensure high relative accuracy in the normwise sense, and this for any of the usual rounding functions. In the specific case of rounding to nearest, several detailed error analyses have been done in order to determine the possible values of in (4.2) depending on the context (e.g., availability of an FMA or not). We briefly review some of these results in what follows.
Multiplication and squaring. For multiplication, Brent, Percival and Zimmermann (2007) showed that one can take c = √5 when ac − bd and ad + bc are obtained after 4 FP multiplications and 2 FP additions. This constant can be decreased to c = 2 when an FMA is used to obtain the real and imaginary parts as RN(ac − RN(bd)) and RN(ad + RN(bc)), or in any of the three other ways (Jeannerod et al. 2017a). In both cases the term O(u^2) in (4.2) can be removed, and the constant is best possible in the sense that there exist FP inputs a, b, c, d yielding a relative error ε such that |ε| ∼ c·u as u → 0.
It was also noted by Jeannerod et al. (2017a) that in the special case of complex squaring, where x^2 = a^2 − b^2 + i·2ab, one can take c = 2 even in the absence of an FMA, that is, when the real part is obtained as RN(RN(a^2) − RN(b^2)).

Division and inversion.
For division, the definition recalled above amounts to first computing the complex product x·ȳ with ȳ = c − id, and then dividing the real and imaginary parts of this product by δ = |y|^2 = c^2 + d^2. Evaluating this sum of squares as RN(RN(c^2) + RN(d^2)) or RN(c^2 + RN(d^2)) or RN(RN(c^2) + d^2) yields an FP number of the form δ(1 + ε_δ) with |ε_δ| ≤ 2u + O(u^2), and each FP division then yields a relative error of at most u. Thus, as noted by Olver (1983) and Baudin (2011), the computed quotient satisfies (4.2) with

c = c_mul + 3,        (4.3)

where c_mul denotes the constant associated with the algorithm used to evaluate x·ȳ. Consequently, we can always take c = √5 + 3 and, if an FMA is available, c = 5. An interesting special case is complex inversion, defined by 1/y = c/(c^2 + d^2) − i·d/(c^2 + d^2). Since no complex multiplication is involved, we can take c_mul = 0, and we deduce immediately that c_inv = 3. A more detailed error analysis reveals that for p ≥ 10 and without an FMA, we can in fact take c_inv = 2.70712... and that this leading constant is reasonably sharp. For example, for p = 53 (the binary64 IEEE format), evaluating ẑ = (c + id)^−1 in this way with c = 4503599709991314 and d = 6369051770002436 · 2^26 produces ẑ such that |ε| = |ẑ − z|/|z| = 2.70679...u.
Complex inversion can of course also be used to implement complex division, as x/y = x · y^−1. This approach has the same operation count as the initial one, based on x/y = (ac + bd)/δ + i(bc − ad)/δ, but less instruction-level parallelism, since the sum of squares and the complex product cannot be evaluated simultaneously anymore. It has, however, the following nice accuracy property: the computed quotient satisfies (4.2) with c = c_inv + c_mul, which for c_inv < 2.708 and c_mul ≤ √5 < 2.237 implies c < 4.945. Consequently, for complex division we can always take c = 5 in (4.2), provided we use x/y directly if an FMA is available, and x · y^−1 otherwise.
If besides the FMA we can also use tests, or call functions such as minimumMagnitude and maximumMagnitude (see Section 2.12.2), then the sum of squares can be evaluated as δ̂ = RN(M² + RN(m²)), with M = max(|c|, |d|) and m = min(|c|, |d|). (4.4) In this case it is known that the associated relative error can be bounded by 1.5u, instead of the 2u obtained without sorting |c| and |d|. This directly allows us to replace Equation (4.3) by κ = κ_mul + 2.5 and, therefore, to obtain a complex division algorithm for which κ = 4.5 (Jeannerod, Louvet and Muller 2013b).

Ensuring normwise accuracy: square root
In contrast to multiplication and division, the formula recalled above for the square root does not ensure a normwise relative error in O(u) when evaluated in FP arithmetic. This fact, which is due to the possibility of heavy cancellation in one of the two expressions |z| ± a, was recognized early on, and a now classical workaround is the following rewriting, already proposed for example by Strachey (1959). Given z = a + ib with a, b ∈ F, the real and imaginary parts of √z = x + iy can be expressed as follows:
• if a ≥ 0, then x = √((|z| + a)/2) and y = b/(2x);
• if a < 0, then y = sign(b)·√((|z| − a)/2) and x = b/(2y).
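A minimal C sketch of this rewriting is given below; the function name csqrt_strachey is ours, |z| is evaluated here with hypot (one possible choice among those discussed above), and z = 0 as well as underflow/overflow are not handled.

    #include <math.h>

    /* Strachey's rewriting for sqrt(a + ib) = x + iy, avoiding the
       cancellation in |z| - |a| by always using the non-cancelling branch. */
    static void csqrt_strachey(double a, double b, double *x, double *y) {
        double r = hypot(a, b);                     /* |z| = sqrt(a^2 + b^2) */
        if (a >= 0.0) {
            *x = sqrt((r + a) / 2.0);               /* x = sqrt((|z| + a)/2) */
            *y = b / (2.0 * *x);                    /* y = b / (2x) */
        } else {
            *y = copysign(sqrt((r - a) / 2.0), b);  /* y = sign(b) sqrt((|z| - a)/2) */
            *x = b / (2.0 * *y);                    /* x = b / (2y) */
        }
    }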
By using the standard model, Hull et al. (1994) showed that, in the absence of underflow and overflow, Strachey's formula produces a complex FP number x̂ + iŷ whose normwise relative error is bounded by a small multiple of u. This bound holds independently of whether an FMA is used for evaluating the sum of squares in |z| = √(a² + b²), and for directed and faithful roundings it holds with u replaced by 2u.

Ensuring component-wise accuracy
We have just seen that for the five basic operations on complex FP numbers, the usual definitions (and, in the case of the square root, Strachey's rearrangement) suffice to ensure that the associated normwise relative errors |ẑ − z|/|z| can be bounded by small multiples of u. A nice consequence is that error analyses based on the standard model of FP arithmetic can often be lifted to the complex case directly, by a suitable increase of the leading constants in the error bounds.
What if we now want high relative accuracy in the component-wise sense? In this case, both the real and imaginary parts of the computed value must have relative errors in O(u). For complex addition and subtraction, obtained as RN(a ± c) + i·RN(b ± d), this is obviously true. For complex square root, Strachey's rewriting, introduced in Section 4.5.2 in order to ensure normwise accuracy, turns out to ensure component-wise accuracy as well. Indeed, if a ≥ 0 (and with or without an FMA), it is easily seen that repeated applications of the standard model lead to relative error bounds of order u on both x̂ and ŷ. For a < 0, these two bounds can be swapped and, again, when rounding is to nearest they are reasonably tight and the O(u²) terms are not even needed (Hull et al. 1994). We are thus left with complex multiplication and division, for which the only issue is the accurate FP evaluation of expressions of the form ac − bd. Specifically, if abcd ≥ 0 then sign(ac) = sign(bd) and sign(ad) = sign(bc), which implies that only the real part of the complex product ac − bd + i·(ad + bc) can suffer from heavy cancellation and that its imaginary part will be obtained with high relative accuracy. Similarly, if abcd < 0 then only ad + bc can be inaccurate. These remarks apply to division as well, where inaccuracy can only come from the FP evaluation of one of the numerators in (ac + bd)/(c² + d²) + i·(bc − ad)/(c² + d²).
In practice, a naive FP evaluation of ac − bd can indeed yield totally wrong results, with or without an FMA. This fact is well known and we simply recall here a binary64 example by Jeannerod, Monat and Thévenoux (2017b). For p = 53, if a = 1 + 2^{−51}, b = 1 + 3·2^{−52}, c = 1 − 2^{−53}, and d = 1 − 3·2^{−53}, then the exact result is ac − bd = 7·2^{−105}, while the naive evaluations return values with no correct digit at all. An efficient way to avoid such damaging cancellations is to call the 2Prod algorithm from Section 4.2.2. For example, it implements the EFT bd = ŵ + e, where ŵ is the rounded product RN(bd) and e is the corresponding error. Combining this error with the error of the product RN(ac) yields a correction term that can then be used to improve the accuracy of the naive evaluation RN(RN(ac) − RN(bd)). This scheme is an example of what is sometimes called a compensated algorithm (see Section 5.3). It was presented and analyzed by Cornea, Harrison and Tang (2002, p. 273), who used it to evaluate both the real and imaginary parts of (a + ib)(c + id) very accurately. For rounding to nearest, this leads to Algorithm 9 (cmulCHT) below. Detailed rounding error analyses of this approach show that both |x̂/x − 1| and |ŷ/y − 1| are bounded by 2u + O(u²), that the constant 2 is best possible, and that the possibility of removing the O(u²) term depends on the tie-breaking rule. If RN rounds ties to even, then 2u + O(u²) can be replaced by 2u. But if RN rounds ties to away, the relative error in x̂ or ŷ can be larger than 2u + u² − 4u³ (Muller 2015, Jeannerod 2016).
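The following C sketch illustrates the compensated scheme just described, in the spirit of Algorithm 9 (cmulCHT); the function names are ours, the exact operation ordering of the published algorithm may differ slightly, and binary64 with rounding to nearest and no overflow is assumed.

    #include <math.h>

    /* Compensated evaluation of a*c - b*d: recover the rounding errors of both
       products with the FMA-based 2Prod, then fold them into the naive result. */
    static double acc_diff(double a, double c, double b, double d) {
        double p1 = a * c, e1 = fma(a, c, -p1);   /* a*c = p1 + e1 exactly */
        double p2 = b * d, e2 = fma(b, d, -p2);   /* b*d = p2 + e2 exactly */
        return (p1 - p2) + (e1 - e2);             /* corrected value of ac - bd */
    }

    /* Componentwise-accurate complex product (a+ib)(c+id). */
    static void cmul_compensated(double a, double b, double c, double d,
                                 double *re, double *im) {
        *re = acc_diff(a, c, b, d);    /* ac - bd */
        *im = acc_diff(a, d, -b, c);   /* ad - (-b)c = ad + bc */
    }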
An alternative to Algorithm 9 is Algorithm 10 below, which, following a technique by Kahan (1998), recovers the error of only one of the two products in ac − bd, and similarly for ad + bc. In the version displayed here the computed errors are those of one product in each of the two expressions, but three other ways are possible. In each case, the resulting relative errors are bounded by 2u and, again, the constant 2 cannot be reduced further (Jeannerod, Louvet and Muller 2013a).
In practice, these accurate algorithms have reasonable running-time overheads compared with the naive ones, for which the computed real or imaginary part can have no correct digit at all (Jeannerod et al. 2017b). Although Algorithm 10 is simpler and cheaper than Algorithm 9, the latter has the advantage of symmetry and ensures that the complex multiplication it implements is commutative. Note also that both algorithms preserve the fact that (a + ib)(a − ib) = a² + b² is a real number; this is well known to be false for the naive FMA-based scheme, where the imaginary part is computed as RN(RN(−ab) + ba) or RN(−ab + RN(ba)).
Application to division. Algorithms 9 and 10 can be used directly to ensure high component-wise accuracy for complex division. It suffices to evaluate the complex product (a + ib)(c − id) with either of these algorithms, and then to divide the obtained real and imaginary parts by δ = c² + d². Each component of the computed quotient then has relative error at most 5u + O(u²) and, similarly to what has been said for the normwise case, evaluating c² + d² as in (4.4) leads further to the slightly smaller bound 4.5u + O(u²) (Jeannerod et al. 2013b). As for complex inversion, the traditional formula 1/(c + id) = c/δ − i·d/δ already ensures high component-wise accuracy.
The special case of squaring. To ensure component-wise accuracy when evaluating (a + ib)² = a² − b² + i·2ab, the preceding algorithms need not be used and it suffices to evaluate a² − b² as (a + b)(a − b). This rewriting was already suggested by Kahan (1987) and, for rounding to nearest, it is easily checked that the computed value of a² − b² then has relative error at most 3u. The analysis done by Jeannerod (2020) shows further that the best possible constant is indeed 3 when RN rounds ties to away, but that it is 9/4 when RN rounds ties to even. It was also noted there that this computed value can be strictly larger than RN(a²), while this cannot occur when using any of the naive schemes RN(RN(a²) − RN(b²)), RN(a² − RN(b²)), and RN(RN(a²) − b²), thanks to the monotonicity of the rounding function.

Complex EFTs
Extensions of error-free transformations from the real case to the complex case have been proposed by Graillat and Ménissier-Morain (2007, 2008, 2012 for addition, subtraction, and multiplication, mostly in order to improve the normwise accuracy of polynomial evaluation in complex FP arithmetic. For addition, where + = ( + ) + ( + ) is evaluated as the complex FP number = RN( + ) + RN( + ), a complex EFT is obtained directly by applying the EFT from Section 4.2.1 twice, to + and to + . Two calls to the 2Sum algorithm suffice to produce the FP numbers = + − RN( + ) and = + − RN( + ) and, therefore, the complex FP number such that We can proceed similarly for subtraction, and the remarks done in the real case still apply: rounding must be to nearest, 2Sum can be replaced by Fast2Sum up to sorting the inputs, and the (im)possibility of spurious overflow is well understood. For multiplication, the most common EFT is obtained by considering the evaluation of = ( + )( + ) without FMA, that is, as = + where = RN(RN( ) − RN( )) and = RN(RN( ) + RN( )), and by using not only 2Sum but also the FMA-based 2Prod algorithm from Section 4.2.2 to com-pute the associated rounding errors. Specifically, for the real part we have where the errors 1 , − 2 ∈ F are computed by 2Prod, and the error 3 ∈ F is computed by 2Sum. Proceeding similarly for the imaginary part yields three FP errors such that + = + 1 + 2 + 3 and, therefore, three complex FP numbers such that Depending on whether Fast2Sum or 2Sum is used, the costs of decomposing + as in (4.5) and as in (4.6) range from, respectively, 6 to 12 and 14 to 20 FP operations. In practice, these EFTs have been used to improve the evaluation of elementary symmetric functions on complex FP inputs (Jiang et al. 2016), as well as the computation of polynomial roots via the Ehrlich-Aberth method (Cameron and Graillat 2022).
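For the additive case, the construction is simple enough to be written down directly; the following C sketch (our own naming, binary64, rounding to nearest, no overflow) applies the classical 2Sum to the real and imaginary parts separately, so that the exact complex sum is recovered as ŝ + e.

    /* 2Sum: s = RN(x + y) and e = x + y - s exactly (Knuth/Moller). */
    static void two_sum(double x, double y, double *s, double *e) {
        *s = x + y;
        double t = *s - x;
        *e = (x - (*s - t)) + (y - t);
    }

    /* Complex EFT for addition: (a+ib) + (c+id) = (sr + i*si) + (er + i*ei). */
    static void ctwo_sum(double a, double b, double c, double d,
                         double *sr, double *si, double *er, double *ei) {
        two_sum(a, c, sr, er);   /* real parts */
        two_sum(b, d, si, ei);   /* imaginary parts */
    }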
Note finally that in the absence of an FMA it is still possible to implement the EFT in (4.6) by using the Veltkamp-Dekker algorithm (mentioned in Section 4.2.2) in order to recover the rounding errors of the four FP products, but at the price of at least twice as many FP operations as with 2Prod. If, on the contrary, an FMA is available and used to evaluate the product as ẑ′ = v̂′ + i·ŵ′ with, say, v̂′ = RN(RN(ac) − bd) and ŵ′ = RN(RN(ad) + bc), then one can obtain an EFT of the same form as in (4.6), with g_1 produced as before by two calls to 2Prod, and with g′_2 and g′_3 two other complex FP numbers now produced by two calls to the EFT of the FMA (algorithm ErrFMA from Section 4.2.4). Again, a significant cost overhead is to be expected here, due to the complexity of ErrFMA. Thus, although such variants might be optimized further, it seems that a reasonable choice for performing an EFT for complex multiplication remains the one based on (4.6), with the use of the FMA restricted to 2Prod.

Error bounds for higher-dimensional problems
In the previous section, we focused on complex FP arithmetic, whose key building block is the two-dimensional inner product ac + bd. We now consider similar problems in higher dimensions, such as the inner product s = Σ_{i=1}^{n} x_i y_i with x and y two vectors of FP numbers. As before, our focus will be on accuracy issues and the derivation of error bounds, assuming (unless otherwise stated) that only one precision is used and that underflows and overflows do not occur.

Traditional worst-case bounds
Following Wilkinson (1960, 1963), a priori worst-case error bounds can often be produced by repeated applications of the standard models of FP arithmetic (2.5): (ratios of) expressions of the form Π_i (1 + δ_i) are obtained, from which bounds on |θ_h|, where 1 + θ_h denotes a generic product of h factors of the form 1 + δ_i, imply bounds on the backward error of the computed result. Here, each δ_i is the relative error of a single FP arithmetic operation, and h depends on the algorithm used to produce that result. For example, when evaluating the sum x_1 + x_2 + x_3 from left to right, the computed result ŝ can be written ŝ = x′_1 + x′_2 + x′_3 with x′_1 := x_1(1 + θ_2), x′_2 := x_2(1 + θ_2), and x′_3 := x_3(1 + θ_1). In other words, ŝ is the exact sum of three numbers obtained by slight (relative) perturbations (θ_2, θ_2, θ_1) of (x_1, x_2, x_3), and its (relative) backward error is at most max(|θ_1|, |θ_2|).
For rounding to nearest, where |δ_i| ≤ u/(1 + u) ≤ u for all i, we have |θ_h| ≤ (1 + u)^h − 1; for directed roundings and faithful rounding the same holds with u replaced by 2u. Assuming h is fixed and u → 0, we see that these bounds have the form cu + O(u²) for some "constant" c that is either h or 2h, depending on the rounding function.
The commonest way to bound |θ_h|, however, is by replacing (1 + u)^h − 1 by the following slightly larger quantity: γ_h := hu/(1 − hu), defined for hu < 1. As amply demonstrated by Higham (2002), this γ_h notation turns out to be the right tool for performing a great deal of a priori worst-case rounding error analysis. Indeed, it makes it possible to derive error bounds that are concise yet rigorous and free of the risk of having possibly large constants hiding in O(u²) terms. Furthermore, such bounds can be composed very conveniently by means of simple manipulation rules such as γ_h + γ_k + γ_h γ_k ≤ γ_{h+k} if h + k < u^{−1} (Higham 2002, §3.4). Note that when deriving a backward error bound of the form γ_h, the precise value of h can depend (sometimes significantly) on the FP algorithm used. For example, when evaluating a degree-n polynomial Σ_{i=0}^{n} a_i x^i by Horner's rule, then, for x, a_i ∈ F and using multiplications and additions with rounding to nearest, the resulting backward error is at most γ_{2n}. But if an FMA is used at each of the n iterations, this bound drops to γ_n and can thus be roughly halved.
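The two Horner variants just mentioned are contrasted in the following C sketch (our own naming; c holds the n+1 coefficients of a degree-n polynomial, binary64 and rounding to nearest are assumed).

    #include <math.h>

    /* Horner's rule without FMA: two roundings per iteration. */
    static double horner(const double *c, int n, double x) {
        double r = c[n];
        for (int i = n - 1; i >= 0; i--)
            r = r * x + c[i];
        return r;
    }

    /* Horner's rule with FMA: a single rounding per iteration, which is
       what roughly halves the backward error bound discussed above. */
    static double horner_fma(const double *c, int n, double x) {
        double r = c[n];
        for (int i = n - 1; i >= 0; i--)
            r = fma(r, x, c[i]);
        return r;
    }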
An interesting intermediate situation is blocked summation, which underlies the design and implementation of various efficient and accurate algorithms for summation and inner products (Castaldo, Whaley and Chronopoulos 2009, Blanchard et al. 2020). Given a block size b (which for simplicity is assumed here to divide n), one rewrites the exact sum as Σ_{j=1}^{n/b} Σ_{i=1}^{b} x_{(j−1)b+i} and evaluates it as follows (see the sketch below). First, compute each of the n/b partial sums of length b, using a total of (n/b)·(b − 1) FP additions; then compute the sum of the obtained partial sums using n/b − 1 FP additions. In this case, the term n − 1 in (4.9) can always be replaced by h with h = b + n/b − 2, (4.10) whose minimal value ≈ 2√n is attained when b is nearest to √n. Such variations in the value of h are in general reflected directly by the error bounds of higher-level algorithms relying on FP summation, such as those for inner products, matrix-vector multiplication, matrix multiplication and factorization, etc. In particular, when evaluating inner products s = Σ_{i=1}^{n} x_i y_i in the usual way by adding the individual products x_i y_i together, the resulting constant is γ_{h+1} ≤ γ_n, with h ≤ n − 1 the height of the underlying summation tree. Finally, any of these traditional error bounds can be extended to the case of complex FP arithmetic by simply taking into account the small constants associated with complex FP ×, /, √ (which have been presented in Section 4.5).
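Here is a minimal C sketch of blocked summation as described above (our own naming; b is assumed to divide n, and no special handling of the remainder, of overflow, or of NaNs is attempted).

    /* Blocked summation: n/b partial sums of length b, then one final sum.
       Each input element goes through at most b + n/b - 2 rounded additions,
       which is the constant h in (4.10). */
    static double blocked_sum(const double *x, int n, int b) {
        double total = 0.0;
        for (int i = 0; i < n; i += b) {
            double partial = 0.0;
            for (int j = 0; j < b; j++)
                partial += x[i + j];   /* b - 1 roundings per block (first add is exact) */
            total += partial;          /* n/b - 1 more roundings overall */
        }
        return total;
    }

Choosing b close to √n minimizes b + n/b − 2, in agreement with the discussion above.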

Refining some worst-case bounds
During the last decade or so, several of the traditional bounds mentioned in the previous section have been sharpened and simplified. The initial improvement is due to Rump (2012), who showed, for recursive summation and rounding to nearest, that the constant γ_{n−1} in (4.9) can be replaced by (n − 1)u, and this for all n. In other words, both the O(u²) term in γ_{n−1} = (n − 1)u + O(u²) and the assumption (n − 1)u < 1 can be removed. The proof of this result does not use just the standard model (2.5a) but combines it with the following property of rounding to nearest: |RN(a + b) − (a + b)| ≤ min(|a|, |b|) for all a, b ∈ F. (4.11) Furthermore, the proof can be adapted to any summation ordering, so that (4.9) can eventually be replaced by |ŝ − s| ≤ (n − 1)u Σ_{i=1}^{n} |x_i|. (4.12) Then, still assuming rounding to nearest, several similar refinements have been proposed by various authors, leading to O(u²)-free error bounds for several basic FP algorithms, including the evaluation of inner products, 2-norms, powers and continued products, the evaluation of univariate polynomials by means of Horner's rule, LU and Cholesky factorizations, and triangular system solving by substitution (Rump 2019, §III and the references therein). In all cases, the classical constants γ_h have been replaced by their linear term hu, but sometimes at the price of some restriction on the largest possible values of h.
More recently, Lange and Rump (2017, 2019) introduced a general arithmetic framework for FP summation, which in particular makes it possible to go beyond (4.12) in at least three ways: by considering roundings other than to nearest, by incorporating the height of the summation tree into the error bounds, and by addressing the issue of the attainability of these bounds. Specifically, they gave the following analog of (4.12) for more general roundings, such as faithful rounding (and thus also directed roundings), for which Property (4.11) no longer holds.
Lange and Rump then introduced some O(u²)-free error bounds that not only cover general roundings, but also incorporate the height h of the summation tree, as follows.

Theorem 4.4 (Lange and Rump 2019)
Let s = Σ_{i=1}^{n} x_i with x_i ∈ F. Then, for any summation ordering whose underlying binary tree has height at most h, the computed sum ŝ satisfies |ŝ − s| ≤ h v Σ_{i=1}^{n} |x_i|, where v = u for rounding to nearest, and v = 2u for faithful rounding.
Note that here the height h ≤ n − 1 is required to be at most of order 1/√u, while in the previous theorem n was required to be at most about 1/u. Importantly, this error bound holds for inner products as well, with h ≤ n now being the height of the associated binary evaluation tree (Lange and Rump 2019, Corollary 4).
Finally, for rounding to nearest, the bound (4.12) can be replaced by the following attainable one, which holds whatever the ordering, provided n is not too large.

Theorem 4.5 (Lange and Rump 2019)
Let s = Σ_{i=1}^{n} x_i with x_i ∈ F. Then, for rounding to nearest and any summation ordering, the computed sum ŝ satisfies |ŝ − s| ≤ ((n − 1)u / (1 + (n − 1)u)) Σ_{i=1}^{n} |x_i|. This bound is attained in particular for round to nearest with ties to even, when applying recursive summation to (x_1, x_2, ..., x_n) = (1, u, ..., u). Note that the same bound was proposed by Mascarenhas (2016) with a different proof technique, assuming recursive summation and n ≤ u^{−1}/20. Interestingly, no such attainable bound seems to be available for FP inner products. Despite this, and as shown by Lange and Rump (2019, §6), this optimal bound for FP summation in round to nearest is already enough to deduce rather directly several O(u²)-free bounds for higher-level problems such as, for example, sums of products, matrix-vector products for Vandermonde matrices, and blocked summation. In the latter case, this makes it possible to replace the traditional constant γ_{b+n/b−2} seen in (4.10) by the linear term (b + n/b − 2)u provided max(b, n/b) − 1 ≤ u^{−1}/2.
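The worst-case example (1, u, ..., u) is easy to reproduce; the small C program below (our own, for binary64 with the default round-to-nearest, ties-to-even) shows that every addition 1 + u is a tie that rounds back to 1, so the final error is exactly (n − 1)u, essentially attaining the bounds discussed above.

    #include <stdio.h>

    int main(void) {
        const double u = 0x1p-53;   /* unit roundoff of binary64 */
        const int n = 1000;
        double s = 1.0;
        for (int i = 1; i < n; i++)
            s += u;                 /* each 1 + u is a tie, rounded to 1 */
        double err = (n - 1) * u;   /* exact sum minus computed sum = (n-1)u */
        printf("computed sum = %g, error = %g = (n-1)u\n", s, err);
        return 0;
    }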

Computable error bounds
All the error bounds seen so far are real numbers involving exact expressions, such as Σ_{i=1}^{n} |x_i|, that cannot in general be represented exactly as FP numbers. If an FP error bound must be produced along with the approximate result, then a first way is via repeated applications of the second standard model (2.5b) instead of the first one. For example, for FP addition and rounding to nearest, this second model implies |a + b| ≤ (1 + u)|RN(a + b)| for a, b ∈ F. (4.13) Applying this inequality n − 1 times to the sum of the |x_i| gives Σ_{i=1}^{n} |x_i| ≤ (1 + u)^{n−1} t̂, where t̂ denotes the computed value of Σ_{i=1}^{n} |x_i| obtained using recursive summation (or another ordering). Hence, recalling (4.12), the error of the computed sum ŝ of the x_i is bounded as |ŝ − s| ≤ (n − 1)u (1 + u)^{n−1} t̂. Here the right-hand side is in general not in F, but it can be bounded by a computable FP number as follows. Let α := (n − 1)u and β := RN(α/(1 − nu)), and assume n ≤ u^{−1}. It is easily seen that α and 1 − nu are FP numbers, and that (1 + u)^{n−1} α ≤ β. Consequently, |ŝ − s| ≤ β t̂, which in turn is easily bounded by a computable FP number. To summarize, a rigorous error bound for FP summation can be obtained easily in round-to-nearest FP arithmetic, by first evaluating the two sums ŝ and t̂ independently and then performing three extra FP operations. Similar examples of computable FP bounds can be found in various contexts and we refer to the works by Ogita et al. (2005), Langlois and Louvet (2007), Jiang et al. (2016), and Cameron and Graillat (2022).
Another way is to exploit ufp-based bounds instead of the second standard model (2.5b). For example, for FP addition, |RN(a + b) − (a + b)| ≤ u·ufp(RN(a + b)). This bound can be up to about twice smaller than the bound u|RN(a + b)| implied by (2.5b). Furthermore, it is a computable bound, since the ufp of an FP number can be obtained using no more than 4 FP operations in round to nearest (Rump 2009, Algorithm 3.5); see also Section 4.3.
For the summation of n FP numbers with rounding to nearest, this ufp-based bound can be exploited to obtain a computable counterpart (4.14) of (4.12), in which the exact sum Σ_{i=1}^{n} |x_i| is replaced by a ufp-based quantity. Barring underflow and overflow, and for n ≤ u^{−1}, this bound is an FP number, being the exact product of an FP number and an integral power of 2; as with the first approach, it can be computed by evaluating both ŝ and t̂ and performing a few extra FP operations, using only rounding to nearest. Rump (2015) notes that the bound (4.14) is attained for recursive summation, round to nearest with ties to even, and (x_1, x_2, ..., x_n) = (1, u, ..., u). He also shows that, in order to achieve (4.14), it is necessary to use the same ordering for both ŝ and t̂, and that different orderings lead to a more pessimistic bound; as before, a few FP operations suffice to turn this bound into a computable one. From this, one can for example deduce computable error bounds for matrix multiplication, which can then be used to implement efficient routines for interval matrix multiplication (Ozaki, Ogita and Mukunoki 2021).

Probabilistic error bounds
For very high-dimensional problems and very low precision (a situation becoming more and more common with the increase of computing speed and the availability of small FP formats), products like nu can easily become larger than 1, thus making backward error bounds of the form γ_n = nu/(1 − nu) or nu, such as the ones seen in the three previous subsections, totally uninformative.
These bounds, however, are worst-case bounds and are attained or nearly attained only in extremely rare situations. In practice, for fixed u and increasing n, it has long been observed that the error growth tends to be much slower, typically as √h(n)·u rather than as h(n)·u (Wilkinson 1961, p. 318). Recently, Higham and Mary (2019, 2020) initiated a path aimed at formalizing this rule of thumb, and obtained rigorous probabilistic analogs of (4.8), that is, bounds that are proved to hold with probability at least, say, 1 − 2 exp(−λ²/2). Here, λ is a small constant that can be chosen so that the bound holds with very high probability, and the new term γ̃_h = λ√h·u + O(u²) indeed grows like √h·u. For stochastic rounding, such probabilistic bounds hold unconditionally, while for deterministic roundings (such as those of IEEE 754 arithmetic), they follow from assumptions made about rounding errors and, sometimes, about the input data as well. For example, it can be assumed that all the individual rounding errors occurring during a computation form a sequence of mean-independent random variables of mean zero. In practice, this framework yields a priori error bounds whose growth rates nicely match the observed ones.
Since 2019 the literature on this topic has been growing fast, with the publication of probabilistic error analyses for most of the fundamental building blocks, such as sums, inner products and matrix-vector products, matrix multiplication, and LU-based system solving. We refer in particular to the works of Ipsen and Zhou (2020) and Connolly, Higham and Mary (2021), and to the survey by Croci et al. (2022). Such analyses have also just been proposed for Horner polynomial evaluation (El Arar et al. 2022), for Householder QR factorization, and for several mixed-precision algorithms (Hallman and Ipsen 2022, Connolly, Higham and Pranesh 2022).
Finally, as noted by Higham (2021b), these new O(√h·u) bounds can be combined with blocking techniques (which also yield smaller constants, as in (4.10)) and with the features of modern architectures (whose block FMA units allow for accumulation in extended precision), in order to explain why huge computations using mostly small FP formats can succeed in practice.

Tools for polynomial approximation of functions
In general, at the hardware level, the only arithmetic operations that are available are addition, multiplication, division, and square root. Since division and square root are significantly slower than the other operations (see Figure 1.1), if we avoid using them, the only functions of one variable we can compute are piecewise polynomials. Hence, mathematical functions are very often approximated by polynomials (Cody and Waite 1980, Muller 2016, Beebe 2017). It is therefore important to be able to tightly control the accuracy of polynomial approximations, as well as the accuracy with which we evaluate polynomials in floating-point arithmetic.

Sollya
The Sollya package by Chevillard, Joldeş and Lauter (2010) provides useful tools for designing mathematical function libraries in floating-point arithmetic. Among the many features of Sollya, two are of particular interest:
• a certified infinity norm (function supnorm), which allows one to compute safe approximation error bounds, using an algorithm presented by Chevillard et al. (2011);
• a special floating-point minimax procedure (function fpminimax), which computes best or nearly best polynomial approximations to functions under some constraints (such as the coefficients of the polynomial being floating-point numbers of a given format, or double-word numbers; see Section 5.1), using techniques presented by Brisebarre and Chevillard (2007).
The rounding errors committed when evaluating such a polynomial in floating-point arithmetic can then be bounded with the Gappa tool. First, we need to provide the tool with the values of the polynomial coefficients:

    c5 = 4803839602528529b-59;
    c7 = -7320136537069203b-65;
    c9 = 3253360761268093b-70;

Then, we can describe the various floating-point computations. For example, float<ieee_64,ne>(x * x) is the result of first computing the exact product of x by itself and then rounding to binary64 with rounding to nearest (tie-breaking to even). Since this rounding operator will be used more than once, let us give it the shorter name rnd:

    @rnd = float<ieee_64,ne>;
    x2 = rnd(x * x);

We can describe the whole polynomial computation in the same way. But since rnd has to be applied around every arithmetic operation, we can instead put it on the left of the equal sign to achieve the same effect in a more readable way:

    y rnd= c5 + x2 * (c7 + x2 * c9);

Since we are interested in bounding the round-off error, we need a few more definitions. In particular, we need to express the following infinitely precise computations:

    Mx2 = x * x;
    My = c5 + Mx2 * (c7 + Mx2 * c9);

Now, all that is left is to ask the tool to verify a bound on the round-off error. For example, the following formula means that, for any x less than 2^−7 in absolute value, the relative error (denoted by the operator -/) between y and My is bounded by 2^−53.
    { |x| <= 1b-7 -> |y -/ My| <= 1b-53 }

The tool validates that this formula actually holds. It can also produce a formal proof that can be mechanically checked by the Coq proof assistant. For analysis or debugging purposes, one can also leave holes in the formulas. For example, if one wonders what the absolute error is when x is much smaller, the following formula can be sent to Gappa.
    { |x| <= 1b-50 -> y - My in ? }  #@ -Eprecision=200

The tool then answers with an enclosure of y − My; in other words, the round-off error is always nonnegative but no larger than 2^−112 ≃ 1.6 · 10^−34. Note that the option -Eprecision=200 has been passed to the tool in order to increase the internal precision of Gappa's interval arithmetic. Indeed, with its default precision, Gappa would only be able to prove an upper bound of 2^−66, which is correct but very pessimistic.

Extending the precision
As seen in Section 4.2, error-free transformations (EFTs) make it possible to compute the rounding error of some operations. On top of them, one can then build double-word operators that retain more accuracy using only processor floating-point operations, as in Section 5.1. Pair arithmetic is a variation on this approach that avoids some renormalization steps, as shown in Section 5.2.
A double-word or pair arithmetic provides a way of extending the precision by systematically replacing floating-point operations with costlier basic blocks. A more parsimonious use of EFTs leads to the so-called compensated algorithms, as shown in Section 5.3.
These approaches are meant to emulate a floating-point precision that is double the working precision, but this might not be sufficient. Section 5.4 presents some tools and libraries that provide a higher precision.

Double-word arithmetic
Since Algorithms 1 (Fast2Sum), 2 (2Sum), and 3 (2Prod) produce their exact results in the form of unevaluated sums of two floating-point numbers, the first of which is the correctly rounded result, it is natural to try to perform calculations on such unevaluated sums. If the underlying floating-point format is binary64, this makes it easy to manipulate numbers with a precision larger than 100 bits. This can be very useful in critical parts of programs, since the binary128 format (often called quad precision), which has been specified by IEEE 754 since the 2008 revision, is seldom implemented in hardware. To our knowledge, in recent years, the only widely distributed computers with hardwired binary128 arithmetic are the IBM z systems.
In the 60's, Gregory and Raney (1964) and Ikebe (1965) suggested representing large-precision numbers as the sum of an FP number and scaled integers. In his seminal paper, Dekker (1971) introduced the first double-word algorithms (called double-length in his paper) for addition, multiplication, division, and square root. The primary interest of double-word arithmetic is to obtain more accurate results. However, as mentioned by Riedy and Demmel (2018), it can also help to improve the reproducibility of parallel codes (He and Ding 2000), or to accelerate some linear algebra algorithms (Yamazaki, Tomov and Dongarra 2015). Libraries that provide double-word arithmetic have been written by Hida, Li and Bailey (2001) and Briggs. A more recent library is CAMPARY (Joldeş et al. 2016).
The definition of a double-word number may vary slightly. We use the same definition as Joldeş, Muller and Popescu (2017): Definition 5.1. A double-word (DW) number x is the unevaluated sum xh + xl of two floating-point numbers xh and xl such that xh = RN(x), i.e., xh = RN(xh + xl).
The following algorithms are presented by Joldeş et al. (2017) for the sum or product of a double-word number and an FP number, and for the sum, product or quotient of two DW numbers; and by Lefèvre, Louvet, Muller, Picot and Rideau (2021) for the square root. These algorithms are variants of Dekker's double-length algorithms. The error bounds have been proved by Joldeş et al. (2017), Muller and Rideau (2022), and Lefèvre et al. (2021). There is a formal proof in Coq for all these error bounds.
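To give a feel for the structure of this family of algorithms, here is a C sketch of the addition of an FP number to a double-word number, in the style of the DWPlusFP algorithm of Joldeş et al. (2017); the C names are ours, the EFT helpers mirror Algorithms 1 and 2, and binary64 with rounding to nearest and no overflow is assumed.

    /* 2Sum (Algorithm 2 style): xh + y = s + e exactly. */
    static void two_sum(double a, double b, double *s, double *e) {
        *s = a + b;
        double t = *s - a;
        *e = (a - (*s - t)) + (b - t);
    }

    /* Fast2Sum (Algorithm 1 style): requires |a| >= |b| (or a = 0). */
    static void fast_two_sum(double a, double b, double *s, double *e) {
        *s = a + b;
        *e = b - (*s - a);
    }

    /* Double-word + FP: (xh, xl) + y, result renormalized as a DW number. */
    static void dw_plus_fp(double xh, double xl, double y,
                           double *zh, double *zl) {
        double sh, sl;
        two_sum(xh, y, &sh, &sl);      /* xh + y = sh + sl exactly */
        double v = xl + sl;            /* accumulate the low-order parts */
        fast_two_sum(sh, v, zh, zl);   /* final renormalization */
    }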
Algorithm 14, suggested by Li et al. (2000), computes the product (xh, xl) × y. Its relative error is less than (3/2)u² + 4u³. Beware that, since the algorithm calls 2Prod(xh, y), the variables xh and y must be in the domain of validity of 2Prod, i.e., they must satisfy the condition of Property 2.12.
Algorithm 17 computes the square root of the DW number (xh, xl). Its relative error is less than (25/8)u².
Algorithm 17 SQRTDWtoDW(xh, xl) starts by returning (xh, xl) unchanged when xh = 0. Table 5.1 refers to some algorithms (e.g., DWTimesFP2) that are not given in this article. These algorithms are presented (with the same names) in Joldeş et al. (2017). They are slight variants of Algorithms 14, 15, and 16. These variants allow us to choose different compromises between speed and accuracy. For instance, Algorithm DWTimesDW3 is obtained by adding the term xl·yl (which is neglected in Algorithm 15): this results in a more accurate yet slower calculation. Table 5.1 is a summary, borrowed from Muller and Rideau (2022), of the results presented by Joldeş et al. (2017), Muller and Rideau (2022), and Lefèvre et al. (2021), all formally proved. Unless stated otherwise, the largest errors observed in experiments are for RN being round-to-nearest, ties-to-even.
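Since the full listing of Algorithm 17 is not reproduced here, the following heavily hedged C sketch conveys its general shape (one FP square root, an FMA-based residual, one division, one renormalization); the exact sequence of operations in the published algorithm may differ, and the names are ours.

    #include <math.h>

    /* Fast2Sum-style renormalization; requires |a| >= |b| (or a = 0). */
    static void fast_two_sum(double a, double b, double *s, double *e) {
        *s = a + b;
        *e = b - (*s - a);
    }

    /* Sketch of a double-word square root in the spirit of SQRTDWtoDW. */
    static void sqrt_dw(double xh, double xl, double *zh, double *zl) {
        if (xh == 0.0) { *zh = xh; *zl = xl; return; }   /* special case */
        double sh  = sqrt(xh);                 /* first approximation */
        double rho = fma(-sh, sh, xh) + xl;    /* residual (xh - sh^2) + xl */
        double sl  = rho / (2.0 * sh);         /* Newton-like correction */
        fast_two_sum(sh, sl, zh, zl);          /* renormalize into a DW number */
    }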

Lange and Rump's Pair arithmetic
The double-word algorithms presented in Section 5.1 are very accurate, but they suffer from a drawback: they all end with an instruction of the form "(zh, zl) ← Fast2Sum(s, t)". That instruction costs 3 consecutive floating-point operations (a significant penalty, especially for the simplest algorithms), and one may wonder why we perform it, knowing that all the information on the result is already in the variables s and t, since zh + zl = s + t. As a matter of fact, that last Fast2Sum is a renormalization. The pair (s, t) is not necessarily (in fact, it seldom is) a double-word number, since the FP number s may not be equal to RN(s + t). If we perform one of the algorithms of Section 5.1 with double-word inputs but without the last Fast2Sum, the output (s, t) will satisfy the error bounds given in Table 5.1, but that output will not be a valid input for a subsequent double-word algorithm. In other words, if the algorithms are fed with such outputs, the resulting error may be larger than the bounds of Table 5.1.
However, a natural question is: if we decide anyway to perform such algorithms without a final renormalization, what do we lose? We will almost certainly end up with larger final errors, but maybe we will still be able to say something about the final result of such a sequence of pair operations. Lange and Rump (2020) undertook this task and gave conditions for the final result of a calculation to be a faithful rounding of the exact result. More precisely, they define the following pair operations:
• CPairSum(xh, xl, yh, yl) is SloppyDWPlusDW(xh, xl, yh, yl) (i.e., Algorithm 12) where the pair obtained just before the last Fast2Sum is returned (i.e., we drop the last Fast2Sum; see the sketch below);
• CPairProd(xh, xl, yh, yl), adapted to our notation, is quite similar to Algorithm 15 without the last Fast2Sum;
• CPairDiv(xh, xl, yh, yl), adapted to our notation, is defined analogously from the double-word division (Algorithm 19);
• CPairSQRT(xh, xl) is SQRTDWtoDW(xh, xl) (i.e., Algorithm 17) where the pair obtained just before the last Fast2Sum is returned.
Lange and Rump (2020, Thms 4.2 and 5.4) give general sufficient conditions, for a computation represented by an evaluation tree whose nodes are pair operations, to return a faithfully rounded result. We will not present these conditions here, as this would require too much introductory material, but we give some of their corollaries below. These corollaries come with the implicit assumption that there are no underflows or overflows.
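As an illustration, here is a C sketch of a pair addition in the spirit of CPairSum (our own naming; the precise low-order ordering of the published SloppyDWPlusDW may differ): the result is a plain pair, not necessarily a double-word number, because the final Fast2Sum is omitted.

    /* 2Sum: s = RN(a + b) and e = a + b - s exactly. */
    static void two_sum(double a, double b, double *s, double *e) {
        *s = a + b;
        double t = *s - a;
        *e = (a - (*s - t)) + (b - t);
    }

    /* Pair addition without the final renormalization. */
    static void cpair_sum(double xh, double xl, double yh, double yl,
                          double *sh, double *sl) {
        double e;
        two_sum(xh, yh, sh, &e);   /* xh + yh = sh + e exactly */
        *sl = (xl + yl) + e;       /* low parts accumulated; no Fast2Sum at the end */
    }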

Corollary 5.2 (product of FP numbers (Lange and Rump 2020)).
Let (x_1, ..., x_n) ∈ F^n. If n ≤ 1/√(2u) − 1, then the product (πh, πl) of all the x_i computed with the pair arithmetic, in any order, is such that RN(πh + πl) is a faithful rounding of the exact product.

Corollary 5.3 (sum of FP numbers (Lange and Rump 2020)).
Let (x_1, ..., x_n) ∈ F^n. Define C = (Σ_{i=1}^{n} |x_i|) / |Σ_{i=1}^{n} x_i|. If n ≤ 1/√(2Cu) − 1, then the sum (sh, sl) of all the x_i computed with the pair arithmetic, in any order, is such that RN(sh + sl) is a faithful rounding of the exact sum. Similar corollaries are given by Lange and Rump (2020) for the computation of dot products and for the evaluation of polynomials using Horner's scheme.

Compensated algorithms
Error-free transformations give access to the rounding errors of individual operations. We can use them later on in the calculation to (at least partly) compensate for these errors. This is the principle behind compensated algorithms. Let us illustrate these algorithms in the case of the computation of the sum of n floating-point numbers. We do not claim exhaustiveness here, as there is a large body of literature on summation algorithms; see for instance the works by Rump et al. (2008), Demmel and Nguyen (2015), Higham (1993, 2002), Blanchard et al. (2020), and Lange (2022). We give a few examples below.
The following compensated summation algorithm (Algorithm 20) was independently found by Kahan (1965) and Babuška (1969). Lines 5 and 6 of Algorithm 20 are nothing but Fast2Sum(s, y) (Algorithm 1). Note, however, that the conditions that guarantee the error-free property of Fast2Sum may not be satisfied: the rounding function is not necessarily RN, and |y| is not necessarily less than or equal to |s|, so the correcting term used to update the next summand may be only an approximation of the rounding error of the addition. And yet the algorithm already enjoys an error bound better than (4.9): if each floating-point operation has relative error at most v, then, as shown by Hallman and Ipsen (2022), the final value of s has an error bounded by about 2v Σ_{i=1}^{n} |x_i|, up to higher-order terms; here, v = u when all computations are done using round-to-nearest, while v = 2u when resorting to a generic faithful rounding function. Later on, Pichat (1972) and Neumaier (1974) independently suggested a more accurate variant of Algorithm 20 where, instead of immediately adding the previously computed error term at each step, one accumulates these error terms and adds their computed sum to s at the end of the calculation. They also added tests so as to use algorithm Fast2Sum within its domain of validity. Priest (1992) also gave a doubly compensated summation algorithm, which guarantees high relative accuracy provided the summands have first been sorted into decreasing order of magnitude.
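Since the listing of Algorithm 20 is not reproduced here, the following C sketch of classical compensated (Kahan) summation conveys the idea; the function name is ours and binary64 is assumed.

    /* Compensated (Kahan) summation: c approximates the rounding error of the
       previous addition and is folded into the next summand. */
    static double kahan_sum(const double *x, int n) {
        double s = 0.0, c = 0.0;
        for (int i = 0; i < n; i++) {
            double y = x[i] - c;    /* apply the previous correction */
            double t = s + y;       /* new (rounded) partial sum */
            c = (t - s) - y;        /* approximate error of that addition */
            s = t;
        }
        return s;
    }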
To avoid tests, Ogita et al. (2005) replaced the use of Fast2Sum and a test in the Pichat-Neumaier algorithm by a 2Sum (Algorithm 2), and considered a generalization where the sum of the error terms is itself computed with the same algorithm (which was already suggested by Pichat). Calling Algorithm 21 repeatedly then gives the K-fold algorithm (Algorithm 22 below).
Algorithm 21 VecSum(x_1, ..., x_n)
  for i = 2 to n do
    (x_i, x_{i−1}) ← 2Sum(x_i, x_{i−1})
  end for
  return (x_1, ..., x_n)

Algorithm 22 K-fold(x_1, ..., x_n)
  for k = 1 to K − 1 do
    (x_1, ..., x_n) ← VecSum(x_1, ..., x_n)
  end for
  s ← x_1
  for i = 2 to n do
    s ← RN(s + x_i)
  end for
  return s

Rump et al. (2008) give an error bound on the final value of the variable s, expressed in terms of γ_k := ku/(1 − ku), showing that the result is essentially as accurate as if the sum had been computed in K-fold precision. Another, very different way of using EFTs to perform accurate summation is the following. Malcolm (1971), starting from ideas expressed by Wolfe (1964), splits the range of floating-point exponents into bins of a fixed width w < p (which depends on the number of terms to be added). Then, using a splitting algorithm similar to Algorithm 6, each input number is decomposed as the sum of a small number of lower-precision FP numbers, called slices by Demmel et al. (2016), such that the weights of the significand bits of a slice all belong to the same bin. If w is well chosen, then all the elements of a bin can be accumulated without error. There remains to add the nonzero terms accumulated in the bins; Malcolm suggests doing that most significant bin first. The reproducible summation algorithm of Demmel et al. (2016) builds on this method. Another algorithm based on judicious splittings of the operands is the AccSum algorithm by Rump et al. (2008), which guarantees faithful rounding of a sum.
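For reference, here is a C sketch of Algorithms 21 and 22 above (our own naming; the input vector is modified in place, binary64 and rounding to nearest are assumed).

    /* 2Sum: s = RN(a + b) and e = a + b - s exactly. */
    static void two_sum(double a, double b, double *s, double *e) {
        *s = a + b;
        double t = *s - a;
        *e = (a - (*s - t)) + (b - t);
    }

    /* VecSum (Algorithm 21): after the loop, x[n-1] holds the rounded sum and
       x[0..n-2] hold the rounding errors; the total is preserved exactly. */
    static void vec_sum(double *x, int n) {
        for (int i = 1; i < n; i++)
            two_sum(x[i], x[i - 1], &x[i], &x[i - 1]);
    }

    /* K-fold summation (Algorithm 22): K - 1 VecSum passes, then a plain sum. */
    static double sum_kfold(double *x, int n, int K) {
        for (int k = 0; k < K - 1; k++)
            vec_sum(x, n);
        double s = x[0];
        for (int i = 1; i < n; i++)
            s += x[i];
        return s;
    }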

Extended precision software
As seen in Section 5.1, we may rely on the available processor FP operations to build extended-precision arithmetic, for instance using double-word algorithms, which roughly double the available precision. Triple-word (Fabiano, Muller and Picot 2019) and quad-word (Hida et al. 2001) algorithms are available as well.
When additional precision is needed, even more than four FP numbers may be used, and a number may then be represented by the unevaluated sum of an arbitrary number of floating-point numbers, with some conditions to avoid too much overlap between them. Such sums are called floating-point expansions (Priest 1991, Shewchuk 1997, Daumas 1999, Popescu 2017). Software libraries based on these ideas exist: double-word and quad-word arithmetics are available in QD, and CAMPARY provides the double-word algorithms formally proved by Muller and Rideau (2022) as well as arbitrary-precision computations based on floating-point expansions.
When the desired precision becomes very large, methods based on floating-point expansions are no longer efficient, and one needs to switch to another solution, which consists in representing a number by an arbitrary-precision significand with a large enough exponent. The pioneering work was done by Brent (1978) with his MP multiple-precision package. GMP is known for its very efficient arbitrary-precision integers, but it also provides rational numbers and floating-point numbers. GNU MPFR is a C library based on GMP that provides multiple-precision floating-point arithmetic with correct rounding (Fousse et al. 2007). Pari/GP provides fast multiple-precision arithmetic, either based on GMP or on its own native kernel (usually slower); it is mainly targeted at number theorists. MPFUN2020 is a Fortran package for multiple-precision arithmetic with special care taken to be thread-safe; it comes in two versions, one of them based on MPFR.
Another way to ensure correctness is to provide an enclosure of the exact result. This may be done with interval arithmetic, as explained in Section 4.1.
MPFI is a C library for arbitrary-precision interval arithmetic, where intervals are represented by their endpoints; it is based on MPFR for the implementation of the endpoints (Revol and Rouillier 2005). Arb is a C library for arbitrary-precision ball arithmetic (a form of interval arithmetic that uses a midpoint-radius representation of real and complex numbers); it includes an impressive library of special functions (Johansson 2013).
Extended precision may also be hidden from the user. For instance, computer algebra systems such as Maple, Sage, or Mathemagix provide extended-precision floating-point numbers (based on GMP for Sage, and on MPFR for Mathemagix).

Conclusion
This conclusion focuses on the evolution of floating-point arithmetic and on its links with other domains. Section 6.1 is about useful operators that could spread and be standardized in the future. Section 6.2 describes some recent alternative arithmetics. Section 6.3 presents some fruitful links of floating-point arithmetic with formal proof and computer algebra.

Implemented but not (yet?) standard operators and rounding functions
We have presented many arithmetic operators and rounding functions: the basic ones in Section 2 (including some more exotic ones in Section 2.12), and a few others added by the last revision of the standard in Section 4.2.3. Nevertheless, some other features might be useful, be they rounding functions or operators.
An old rounding function is rounding to odd, which was first considered by von Neumann when designing the arithmetic unit of the EDVAC, and later used by Goldberg for converting FP numbers to decimal representation. It was used by GCC for the reverse conversion, and it was formally studied by Boldo and Melquiond (2008). Its main property, related to double rounding, is explained in Section 2.10.2.
As far as operators are concerned, we may desire that the augmented operations presented in Section 4.2.3 should also be available in rounding to nearest, ties to even. All EFTs (see Section 4.2) would greatly benefit from an efficient hardware implementation of such operations.
Another possibility is to have more complex FP operators. Just as the FMA instruction can replace and generalize floating-point addition and multiplication, the FDMDA (fused dual multiply dual add) instruction, which computes ab + cd + e, could replace and generalize the FMA. The question of the specification of such an operator is then crucial, in particular whether it is correctly rounded (meaning: as if there were a single rounding at the end of the four operations) or not. Due to its usefulness, it may become available in future generations of processors.
Beyond solving the difficult problem of the correctly rounded sum of three numbers (Kornerup et al. 2012) and making most EFT computations trivial, it would greatly simplify double-word arithmetic algorithms and some complex arithmetic operations, and make them more accurate; it would probably benefit many other algorithms as well.

Alternative arithmetics
It is natural and sound to consider that floating-point arithmetic, as defined in the 70's and 80's, may now be questioned. As we have seen in the introduction, the processor technology and the target applications have drastically changed since the birth of IEEE-754. For some applications, other arithmetics than conventional floating-point may be better suited, for instance when the range of the numbers we need to represent is known beforehand or very large. The emergence of digital neural networks for AI applications requires arithmetics with tiny precisions (to achieve high bandwidth) yet rather large range (Wang and Kanwar 2019, Bertaccini et al. 2022).
In order to obtain a larger range than conventional floating-point while maintaining good precision for numbers of the usual orders of magnitude, Gustafson (2015) introduced the unums and, later on, their successors, the posits (Posit Working Group 2022). Posits are still numbers of the form m · 2^e, where m and e are integers, but with a cleverly designed encoding that allows for variable sizes of m and e (with a fixed total word size). Experiments by Cococcioni et al. (2022) suggest that Posit8 could be an interesting solution for implementing digital neural networks. For general, higher-precision numerical computing, the price to pay, however, is huge: with posits, the standard model (see Section 2.1) no longer applies, i.e., the individual arithmetic operations no longer have a bounded relative error (de Dinechin et al. 2019).

Floating-point arithmetic and beyond
As soon as a numerical algorithm contains more than a few operations, obtaining very tight error bounds, and making certain that underflows and overflows will not occur and that NaNs will not propagate throughout the calculation, is a tremendous task. The proofs we publish are seldom fully satisfactory. Either we just give a rough idea of what makes the algorithm work, coyly hiding all the dirty details, or we try to take into account all possible corner cases, ending up, as we said in the introduction, with long and tedious proofs, with the consequence that a flaw may remain unnoticed for years. For example, Kahan (2004b) proposed an intricate algorithm for computing a correct discriminant. The formal verification of Kahan's algorithm revealed a gap in the pen-and-paper proof: a test could behave differently from its mathematical counterpart, creating unstudied cases. Fortunately, the algorithm was still correct in these special cases, and it was formally proved by Boldo (2009).
And yet, floating-point arithmetic is needed in many critical applications and, for a given application, we would like to analyze not just one solution but many of them, in order to choose the best suited. For small formats, up to 32 bits, exhaustive testing of the basic operators is achievable. But when the precision increases, it becomes intractable, and formal proof then becomes a precious tool that can provide a strong validation of the pen-and-paper proof of an algorithm, as seen in several parts of this article.
Another solution is to try to build algorithms and proofs that are correct by construction. From that point of view, computer algebra could considerably help the design and analysis of floating-point programs, in several respects. First, the design of function software could be made easier, and the probability of bugs in such software could be made much lower, using tools that automatically find symmetries and build local as well as asymptotic approximations, for instance from the differential equation that defines the function. Second, part of the analysis that uses the standard model could be assisted by computer. This would allow one to explore many variants of a numerical algorithm; it would also allow the analysis of numerical programs significantly larger than the ones we are able to deal with by pen-and-paper calculations. The pioneering work of Mezzarobba (2010, 2020) addresses these two aspects.