Do you know about std::hypot(a, b)
and std::hypot(a, b, c)
? (The 3-argument
overload exists since C++17, sometimes referred to as hypot3.) Why do the C
and C++ standard libraries provide a function that is as simple as sqrt(a*a +
b*b)
? It doesn’t save enough characters to warrant its existence, right? Have
you ever considered what happens if the input values are “very” small or large?
Or if an input is an IEEE754 special value such as infinity or NaN? Have you
ever considered how precise the calculation is, especially if an exact answer
is obvious if one of the inputs is 0?
Pythagoras — it’s complicated
The standard function is specified to avoid overflow and underflow in intermediate calculations. Consider the following example:
float a = 0x1p70f; // = 2⁷⁰ (~1.2e21)
float b = 0;
float r = std::sqrt(a * a + b * b); // = inf
a * a
thus is 0x1p140f
(~1.4e42) which overflows the 8-bit exponent
(subnormal, -126 – 127, inf/nan) for single precision. However, mathematically
it’s obvious that r = a
is the right answer, not r = inf
. The naïve
sqrt(a*a + b*b)
thus doesn’t work for inputs greater than 0x1.fffffep63f
(~1.8e19) (if the other input is 0, otherwise the cut-off is even lower) as
well as for inputs smaller than (let’s ignore subnormals for now) 0x1p-63f
(~1e-19).
Regarding precision, sqrt(a*a + b*b)
isn’t really bad since std::sqrt
is
required by IEEE754 to have .5 ULP precision and can thus potentially reduce
the error of a*a + b*b
(where each operation has .5 ULP precision). I’d
expect a total error within 1 ULP (if anyone can give me a good pointer to a
formalism of fp error analysis, please do). The precision issue becomes
relevant when you want to support the full range of possible input values
without over-/underflow.
How to avoid over-/underflow
The idea to support the full range of input values is a simple transformation \(\sqrt{a^2 + b^2} = |b|\cdot\sqrt{(\frac{a}{b})^2+1}\) where \(|b|\ge|a|\). Consequently \(\frac{a}{b} \le 1\) and thus \((\frac{a}{b})^2\) won’t overflow. If \((\frac{a}{b})^2\) underflows, we didn’t have to care about its value anyway because the addition with 1 would have discarded all bits anyway.
1. Optimization idea
Instead of factoring out \(|b|\), we can use a value that is of the same
magnitude but might reduce the errors introduced through intermediate
calculations. The most obvious candidate is to round \(|b|\) down to the next
power-of-2 value, which I’ll call \(b'\). This is a trivial bitwise and
operation, setting all mantissa bits to 0. The resulting floating point value
will only scale the exponent bits when used in a multiplication or division.
Our hypot
implementation thus calculates
\(b'\cdot\sqrt{(\frac{a}{b'})^2+(\frac{b}{b'})^2}\). The final multiplication
(\(b'\cdot\sqrt{\cdot}\)) and both fractions (\(\frac{a}{b'}\) and
\(\frac{b}{b'}\)) are without loss of precision. Thus, the resulting error is
equivalent to the error of \(\sqrt{a^2+b^2}\).
2. Division is slow
Division and square root instructions run on the divider port (pipeline) of the
CPU and will therefore dominate the execution time of the hypot
implementation. We cannot avoid the square root, but we can avoid the division.
The first idea is to turn the two divisions into one division and two
multiplications: \(b'\cdot\sqrt{(\frac{a}{b'})^2+(\frac{b}{b'})^2} =
b'\cdot\sqrt{(a\cdot\hat{b})^2+(b\cdot\hat{b})^2}\), with \(\hat{b} =
\frac{1}{b'}\).
The \(\frac{1}{b'}\) division is a trivial operation, though, because it just needs to flip the sign of the exponent. After all, \(b'\) was constructed to be a power-of-2 value: \(b'=2^n \Rightarrow \hat{b} = 2^{-n}\). If you recall how the exponent is stored in an IEEE754 floating point value, you might realize that flipping all exponent bits almost produces the correct result. It’s just off by one. And if the exponent is off by one, the resulting floating point value is off by a factor of 2.
\(\Rightarrow \hat{b} = 2b'\)^
\(∞\) (or in code: b_hat = ((b & infinity) ^
infinity) * .5
; the xor operator is not defined for float
, but you get the
idea). Note that using an addition instead of multiply by two is faster for
two reasons (An optimizing compiler will turn 2*x
into x+x
by itself, so
feel free to forget about this optimization.):
- it doesn’t require loading a constant value (
0x4000'0000
in this case); - addition instructions typically have a lower latency than multiplication instructions.
\(\frac{b}{b'} = b\cdot\hat{b}\) is another trivial bit operation if we look
closely. \(b'\) was constructed to store the exponent of \(b\). Thus
\(\frac{b}{b'}\) is \(b\) with its exponent set to \(2^0 = 1\). Using a
bitwise-and and bitwise-or operation, we can easily overwrite the exponent of
\(b\) to avoid the multiplication \(b\cdot\hat{b} =\)(b & (min - denorm_min))
| 1
. The CPU can therefore schedule \((b\cdot\hat{b})^2\) earlier and thus
also execute the following FMA and square root earlier, reducing the total
latency of the hypot
function.
Subnormal, NaN, Zero, and Infinity
There are corner cases. A proper implementation must care for the Annex F requirements (C Standard, but imported into C++). As is often the case with floating point, the .01% corner cases lead to significant effort in the implementation, and if done carelessly to unfortunate slowdown. I’ll let you figure it out from the implementation below.
I was talking about std::experimental::simd<float>
You probably didn’t notice other than by the title, but everything I said I
didn’t write for plain float
but for std::experimental::simd<T,
Abi>
(last
draft). Here’s my implementation, targeting
libstdc++:
template <typename T, typename Abi>
simd<T, Abi> hypot(const simd<T, Abi>& x, const simd<T, Abi>& y)
{
if constexpr (simd<T, Abi>::size() == 1) {
return std::hypot(T(x[0]), T(y[0]));
} else if constexpr (is_fixed_size_abi_v<Abi>) {
return fixed_size_apply<simd<T, Abi>>(
[](auto a, auto b) { return hypot(a, b); },
x, y);
} else {
using namespace __proposed::float_bitwise_operators;
using Limits = std::numeric_limits<T>;
using V = simd<T, Abi>;
V absx = abs(x);
V absy = abs(y);
V hi = max(absx, absy);
V lo = min(absy, absx);
constexpr V inf(Limits::infinity());
// if hi is subnormal, avoid scaling by inf & final mul by 0 (which
// yields NaN) by using min()
V scale = 1 / Limits::min();
// invert exponent w/o error and w/o using the slow divider unit:
where(hi > Limits::min(), scale) = ((hi & inf) ^ inf) * T(.5);
// adjust final exponent for subnormal inputs
V hi_exp = Limits::min();
where(hi > Limits::min(), hi_exp) = hi & inf;
constexpr V mant_mask = Limits::min() - Limits::denorm_min();
V h1 = (hi & mant_mask) | V(1);
where(hi < Limits::min(), h1) -= V(1);
V l1 = lo * scale;
V r = hi_exp * sqrt(h1 * h1 + l1 * l1);
#ifdef STDC_IEC_559__
// fixup for Annex F requirements
V fixup = hi;
where(isunordered(x, y), fixup) = Limits::quiet_NaN();
where(isinf(absx) || isinf(absy), fixup) = inf;
where(!(lo == 0 || isunordered(x, y) || isinf(absx) ||
isinf(absy)), fixup) = r;
r = fixup;
#endif
return r;
}
}
Does it fly?
First of all, I made sure it is conforming and within 1 ULP of a precise implementation.
I tried to measure both latency and throughput. Measurements details:
- Intel Xeon W-2123 @ 3.60GHz
- GCC 9
-O2 -march=native
(i.e. using AVX512VL instructions forxmm
andymm
vectors)- Intel pstate set to
no_turbo
andscaling_governor
set toperformance
- Ubuntu 18.10
Throughput
T hypot(T, T) |
TSC cycles/call | Speedup per value relative to T |
---|---|---|
float |
17.1208 | |
simd<float, scalar> |
17.218 | 0.994355 |
simd<float, __sse> |
15.5264 | 4.41074 |
simd<float, __avx> |
15.9294 | 8.59831 |
simd<float, __avx512> |
18.5829 | 14.7411 |
double |
30.0783 | |
simd<double, scalar> |
30.2174 | 0.995398 |
simd<double, __sse> |
15.5236 | 3.87517 |
simd<double, __avx> |
16.9682 | 7.09053 |
simd<double, __avx512> |
26.4416 | 9.1003 |
Latency
T hypot(T, T) |
TSC cycles/call | Speedup per value relative to T |
---|---|---|
float |
37.6397 | |
simd<float, scalar> |
38.4039 | 0.980102 |
simd<float, __sse> |
41.2072 | 3.6537 |
simd<float, __avx> |
41.4811 | 7.25916 |
simd<float, __avx512> |
53.6256 | 11.2304 |
double |
51.806 | |
simd<double, scalar> |
51.8265 | 0.999604 |
simd<double, __sse> |
42.0938 | 2.46145 |
simd<double, __avx> |
42.6324 | 4.86072 |
simd<double, __avx512> |
57.4461 | 7.21455 |
I can also disable the Annex F requirements via -ffast-math
. Let’s see how
that changes the results:
Throughput (-ffast-math)
T hypot(T, T) |
TSC cycles/call | Speedup per value relative to T |
---|---|---|
float |
12.0874 | |
simd<float, scalar> |
11.572 | 1.04453 |
simd<float, __sse> |
11.0433 | 4.37815 |
simd<float, __avx> |
11.6043 | 8.33306 |
simd<float, __avx512> |
13.6978 | 14.1189 |
double |
27.1595 | |
simd<double, scalar> |
26.0718 | 1.04172 |
simd<double, __sse> |
11.6566 | 4.65993 |
simd<double, __avx> |
13.2002 | 8.23001 |
simd<double, __avx512> |
25.5888 | 8.49107 |
Latency (-ffast-math)
T hypot(T, T) |
TSC cycles/call | Speedup per value relative to T |
---|---|---|
float |
36.8928 | |
simd<float, scalar> |
37.1042 | 0.994301 |
simd<float, __sse> |
41.357 | 3.56822 |
simd<float, __avx> |
41.7432 | 7.07042 |
simd<float, __avx512> |
52.2087 | 11.3062 |
double |
51.3563 | |
simd<double, scalar> |
50.9219 | 1.00853 |
simd<double, __sse> |
42.4579 | 2.41916 |
simd<double, __avx> |
42.326 | 4.85341 |
simd<double, __avx512> |
56.3338 | 7.29314 |
Discuss on Reddit