Optimizing std::hypot for simd arguments

18 Jan 2019 — Written by Matthias Kretz
Tagged as: benchmarking, C++, math, optimization, SIMD, and std::simd

Do you know about std::hypot(a, b) and std::hypot(a, b, c)? (The 3-argument overload has existed since C++17 and is sometimes referred to as hypot3.) Why do the C and C++ standard libraries provide a function that is as simple as sqrt(a*a + b*b)? It doesn’t save enough characters to warrant its existence, right? Have you ever considered what happens if the input values are “very” small or large? Or if an input is an IEEE754 special value such as infinity or NaN? Have you ever considered how precise the calculation is, especially when an exact answer is obvious because one of the inputs is 0?

Pythagoras — it’s complicated

The standard function is specified to avoid overflow and underflow in intermediate calculations. Consider the following example:

float a = 0x1p70f;  // = 2⁷⁰ (~1.2e21)
float b = 0;
float r = std::sqrt(a * a + b * b); // = inf

Here a * a would be 0x1p140f (~1.4e42), which overflows the 8-bit exponent of single precision (it has to cover subnormals, the exponents −126 to 127, and inf/NaN); the multiplication therefore produces inf. However, mathematically it’s obvious that r = a is the right answer, not r = inf. The naïve sqrt(a*a + b*b) thus doesn’t work for inputs greater than 0x1.fffffep63f (~1.8e19) (if the other input is 0; otherwise the cut-off is even lower), nor for inputs smaller than (let’s ignore subnormals for now) 0x1p-63f (~1e-19).
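For comparison, std::hypot returns the mathematically expected result for the same inputs. A minimal self-contained check (the printed values assume IEEE754 single precision):

#include <cmath>
#include <cstdio>

int main() {
  float a = 0x1p70f;
  float b = 0;
  std::printf("%g\n", std::sqrt(a * a + b * b)); // inf: a*a already overflowed
  std::printf("%g\n", std::hypot(a, b));         // 1.18059e+21, i.e. exactly a
}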

Regarding precision, sqrt(a*a + b*b) isn’t really bad, since std::sqrt is required by IEEE754 to have .5 ULP precision and can thus potentially reduce the error of a*a + b*b (where each operation has .5 ULP precision): because \(\sqrt{x(1+\varepsilon)} \approx \sqrt{x}\,(1+\frac{\varepsilon}{2})\), the square root roughly halves the relative error of its argument. I’d expect a total error within 1 ULP (if anyone can give me a good pointer to a formalism of floating-point error analysis, please do). The precision issue becomes relevant once you want to support the full range of possible input values without over-/underflow.

How to avoid over-/underflow

The idea to support the full range of input values is a simple transformation \(\sqrt{a^2 + b^2} = |b|\cdot\sqrt{(\frac{a}{b})^2+1}\) where \(|b|\ge|a|\). Consequently \(|\frac{a}{b}| \le 1\) and thus \((\frac{a}{b})^2\) won’t overflow. If \((\frac{a}{b})^2\) underflows, its value didn’t matter in the first place, because the addition of 1 would have discarded all of its bits anyway.
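In code, this textbook transformation looks roughly like the following scalar sketch (it shows only the scaling idea and ignores the special cases discussed below; the function name is mine):

#include <algorithm>
#include <cmath>

float hypot_scaled(float a, float b) {
  float hi = std::max(std::fabs(a), std::fabs(b));  // |b| in the formula above
  float lo = std::min(std::fabs(a), std::fabs(b));  // |a| in the formula above
  if (hi == 0) return 0;                            // avoid 0/0 for a == b == 0
  float r = lo / hi;                                // <= 1, cannot overflow
  return hi * std::sqrt(r * r + 1);
}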

1. Optimization idea

Instead of factoring out \(|b|\), we can use a value that is of the same magnitude but might reduce the errors introduced through intermediate calculations. The most obvious candidate is to round \(|b|\) down to the next power-of-2 value, which I’ll call \(b'\). This is a trivial bitwise and operation, setting all mantissa bits to 0. Multiplying or dividing by the resulting floating point value only shifts the exponent and is therefore exact. Our hypot implementation thus calculates \(b'\cdot\sqrt{(\frac{a}{b'})^2+(\frac{b}{b'})^2}\). The final multiplication (\(b'\cdot\sqrt{\cdot}\)) and both fractions (\(\frac{a}{b'}\) and \(\frac{b}{b'}\)) are free of rounding errors. Thus, the resulting error is equivalent to the error of \(\sqrt{a^2+b^2}\).
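For scalar float, rounding down to the next power of 2 is a single bitwise and with the exponent mask, which happens to be the bit pattern of +infinity. A sketch using C++20 std::bit_cast for the type pun, assuming a normal, finite input (the simd implementation below spells this with the proposal’s float bitwise operators instead):

#include <bit>
#include <cstdint>

// b' : |b| rounded down to 2^n, by clearing the sign and all mantissa bits
float round_down_to_pow2(float b) {
  const std::uint32_t exp_mask = 0x7f80'0000u;  // the bit pattern of +infinity
  return std::bit_cast<float>(std::bit_cast<std::uint32_t>(b) & exp_mask);
}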

2. Division is slow

Division and square root instructions run on the divider port (pipeline) of the CPU and will therefore dominate the execution time of the hypot implementation. We cannot avoid the square root, but we can avoid the division. The first idea is to turn the two divisions into one division and two multiplications: \(b'\cdot\sqrt{(\frac{a}{b'})^2+(\frac{b}{b'})^2} = b'\cdot\sqrt{(a\cdot\hat{b})^2+(b\cdot\hat{b})^2}\), with \(\hat{b} = \frac{1}{b'}\).

The \(\frac{1}{b'}\) division is a trivial operation, though, because it just needs to flip the sign of the exponent. After all, \(b'\) was constructed to be a power-of-2 value: \(b'=2^n \Rightarrow \hat{b} = 2^{-n}\). If you recall how the exponent is stored in an IEEE754 floating point value, you might realize that flipping all exponent bits almost produces the correct result. It’s just off by one. And if the exponent is off by one, the resulting floating point value is off by a factor of 2.

\(\Rightarrow \hat{b} = \frac{1}{2}\,(b' \oplus \infty)\), where \(\oplus\) denotes a bitwise xor of the object representations (or in code: b_hat = ((b & infinity) ^ infinity) * .5; the xor operator is not defined for float, but you get the idea). Note that using an addition instead of a multiplication by two is faster for two reasons (an optimizing compiler will turn 2*x into x+x by itself, so feel free to forget about this optimization):

  1. it doesn’t require loading a constant value (0x4000'0000 in this case);
  2. addition instructions typically have a lower latency than multiplication instructions.
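Put together, the reciprocal of the power-of-2 value can be computed without touching the divider unit. A scalar sketch (again with std::bit_cast, and the function name is mine; it requires b_prime to be a normal power of 2):

#include <bit>
#include <cstdint>

// b_hat = 1 / b' for b' == 2^n: flipping all exponent bits yields 2 * 2^-n,
// which the multiplication by .5 corrects exactly
float reciprocal_pow2(float b_prime) {
  const std::uint32_t inf_bits = 0x7f80'0000u;  // all exponent bits set
  return std::bit_cast<float>(std::bit_cast<std::uint32_t>(b_prime) ^ inf_bits)
         * 0.5f;
}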

\(\frac{b}{b'} = b\cdot\hat{b}\) is another trivial bit operation if we look closely. \(b'\) was constructed to carry the exponent of \(b\). Thus \(\frac{b}{b'}\) is \(b\) with its exponent replaced such that the value lies in \([1, 2)\). Using a bitwise-and and bitwise-or operation, we can easily overwrite the exponent of \(b\) and avoid the multiplication: \(b\cdot\hat{b}\) = (b & (min - denorm_min)) | 1, where min and denorm_min are the std::numeric_limits values, whose difference happens to have the bit pattern of the mantissa mask, and the | 1 ors in the exponent bits of 1.0. The CPU can therefore schedule \((b\cdot\hat{b})^2\) earlier and thus also execute the following FMA and square root earlier, reducing the total latency of the hypot function.
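As a scalar sketch (hypothetical name; the simd implementation below spells this as (hi & mant_mask) | V(1)):

#include <bit>
#include <cstdint>

// b / b' = b * b_hat: keep the mantissa of b and overwrite its exponent with
// that of 1.0f, so the result lies in [1, 2)
float divide_by_own_pow2(float b) {
  const std::uint32_t mant_mask = 0x007f'ffffu;  // bits of min - denorm_min
  const std::uint32_t one_bits  = 0x3f80'0000u;  // bits of 1.0f
  const std::uint32_t b_bits    = std::bit_cast<std::uint32_t>(b);
  return std::bit_cast<float>((b_bits & mant_mask) | one_bits);
}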

Subnormal, NaN, Zero, and Infinity

There are corner cases. A proper implementation must care for the Annex F requirements (C standard, but imported into C++). As is often the case with floating point, the .01% corner cases lead to significant effort in the implementation and, if done carelessly, to unfortunate slowdowns. I’ll let you figure out how from the implementation below.
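For reference, the observable behavior Annex F requires of hypot (quoting C11 F.10.4.3 from memory, so verify against the standard) boils down to the following:

#include <cassert>
#include <cmath>
#include <limits>

int main() {
  const float inf = std::numeric_limits<float>::infinity();
  const float nan = std::numeric_limits<float>::quiet_NaN();
  assert(std::hypot(-3.f, 0.f) == 3.f);      // hypot(x, ±0) equals fabs(x)
  assert(std::hypot(inf, nan) == inf);       // an infinite input wins ...
  assert(std::hypot(nan, inf) == inf);       // ... even over NaN
  assert(std::isnan(std::hypot(nan, 2.f)));  // otherwise NaN propagates
}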

I was talking about std::experimental::simd<float>

You probably didn’t notice, other than from the title, but everything I wrote above applies not to plain float but to std::experimental::simd<T, Abi> (last draft). Here’s my implementation, targeting libstdc++:

template <typename T, typename Abi>
simd<T, Abi> hypot(const simd<T, Abi>& x, const simd<T, Abi>& y)
{
  if constexpr (simd<T, Abi>::size() == 1) {
    return std::hypot(T(x[0]), T(y[0]));  // defer to the scalar std::hypot
  } else if constexpr (is_fixed_size_abi_v<Abi>) {
    // split the fixed_size simd into native chunks and recurse
    return fixed_size_apply<simd<T, Abi>>(
      [](auto a, auto b) { return hypot(a, b); },
      x, y);
  } else {
    using namespace __proposed::float_bitwise_operators;
    using Limits = std::numeric_limits<T>;
    using V      = simd<T, Abi>;
    V absx       = abs(x);
    V absy       = abs(y);
    V hi         = max(absx, absy);
    V lo         = min(absy, absx);

    constexpr V inf(Limits::infinity());
    // if hi is subnormal, avoid scaling by inf & final mul by 0 (which
    // yields NaN) by using min()
    V scale = 1 / Limits::min();
    // invert exponent w/o error and w/o using the slow divider unit:
    where(hi > Limits::min(), scale) = ((hi & inf) ^ inf) * T(.5);
    // adjust final exponent for subnormal inputs
    V hi_exp                          = Limits::min();
    where(hi > Limits::min(), hi_exp) = hi & inf;
    constexpr V mant_mask = Limits::min() - Limits::denorm_min();
    V h1 = (hi & mant_mask) | V(1);
    where(hi < Limits::min(), h1) -= V(1);
    V l1 = lo * scale;
    V r  = hi_exp * sqrt(h1 * h1 + l1 * l1);

#ifdef __STDC_IEC_559__
    // fixup for Annex F requirements
    V fixup                                  = hi;
    where(isunordered(x, y), fixup)          = Limits::quiet_NaN();
    where(isinf(absx) || isinf(absy), fixup) = inf;
    where(!(lo == 0 || isunordered(x, y) || isinf(absx) ||
            isinf(absy)), fixup)             = r;
    r = fixup;
#endif
    return r;
  }
}
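Usage then looks like any other simd math function (a hypothetical driver; the generator constructor and element access are as specified in the Parallelism TS 2 draft, and the 3-4-5 inputs are mine):

#include <experimental/simd>
#include <cstdio>
namespace stdx = std::experimental;

int main() {
  using V = stdx::native_simd<float>;
  V x([](int i) { return 3.f * (i + 1); });  // 3, 6, 9, ...
  V y([](int i) { return 4.f * (i + 1); });  // 4, 8, 12, ...
  V r = hypot(x, y);                         // 5, 10, 15, ... (3-4-5 triangles)
  for (std::size_t i = 0; i < V::size(); ++i)
    std::printf("%g ", r[i]);
  std::printf("\n");
}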

Does it fly?

First of all, I made sure it is conforming and stays within 1 ULP of a precise implementation.

I tried to measure both latency and throughput (a sketch of how the two differ follows the list below). Measurement details:

  • Intel Xeon W-2123 @ 3.60GHz
  • GCC 9
  • -O2 -march=native (i.e. using AVX512VL instructions for xmm and ymm vectors)
  • Intel pstate set to no_turbo and scaling_governor set to performance
  • Ubuntu 18.10
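
The distinction between the two measurements can be sketched like this (a hypothetical micro-benchmark, not the actual benchmark code; __rdtsc comes from <x86intrin.h>, and a real benchmark additionally needs warm-up, serialization, and careful handling of loop overhead):

#include <x86intrin.h>
#include <cmath>
#include <cstdio>

int main() {
  constexpr int N = 1'000'000;
  float a = 1.1f, b = 2.2f, sink = 0;

  // Throughput: independent calls, the CPU may overlap their execution.
  unsigned long long start = __rdtsc();
  for (int i = 0; i < N; ++i)
    sink += std::hypot(a + i, b);
  std::printf("throughput: %g TSC cycles/call\n",
              double(__rdtsc() - start) / N);

  // Latency: each call depends on the previous result, no overlap possible.
  float x = a;
  start = __rdtsc();
  for (int i = 0; i < N; ++i)
    x = std::hypot(x, b);
  std::printf("latency: %g TSC cycles/call\n",
              double(__rdtsc() - start) / N);

  std::printf("%g %g\n", sink, x);  // keep the results observable
}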

Throughput

T hypot(T, T)           TSC cycles/call   Speedup per value relative to T
float                   17.1208
simd<float, scalar>     17.218            0.994355
simd<float, __sse>      15.5264           4.41074
simd<float, __avx>      15.9294           8.59831
simd<float, __avx512>   18.5829           14.7411
double                  30.0783
simd<double, scalar>    30.2174           0.995398
simd<double, __sse>     15.5236           3.87517
simd<double, __avx>     16.9682           7.09053
simd<double, __avx512>  26.4416           9.1003

Latency

T hypot(T, T)           TSC cycles/call   Speedup per value relative to T
float                   37.6397
simd<float, scalar>     38.4039           0.980102
simd<float, __sse>      41.2072           3.6537
simd<float, __avx>      41.4811           7.25916
simd<float, __avx512>   53.6256           11.2304
double                  51.806
simd<double, scalar>    51.8265           0.999604
simd<double, __sse>     42.0938           2.46145
simd<double, __avx>     42.6324           4.86072
simd<double, __avx512>  57.4461           7.21455

I can also disable the Annex F requirements via -ffast-math. Let’s see how that changes the results:

Throughput (-ffast-math)

T hypot(T, T)           TSC cycles/call   Speedup per value relative to T
float                   12.0874
simd<float, scalar>     11.572            1.04453
simd<float, __sse>      11.0433           4.37815
simd<float, __avx>      11.6043           8.33306
simd<float, __avx512>   13.6978           14.1189
double                  27.1595
simd<double, scalar>    26.0718           1.04172
simd<double, __sse>     11.6566           4.65993
simd<double, __avx>     13.2002           8.23001
simd<double, __avx512>  25.5888           8.49107

Latency (-ffast-math)

T hypot(T, T)           TSC cycles/call   Speedup per value relative to T
float                   36.8928
simd<float, scalar>     37.1042           0.994301
simd<float, __sse>      41.357            3.56822
simd<float, __avx>      41.7432           7.07042
simd<float, __avx512>   52.2087           11.3062
double                  51.3563
simd<double, scalar>    50.9219           1.00853
simd<double, __sse>     42.4579           2.41916
simd<double, __avx>     42.326            4.85341
simd<double, __avx512>  56.3338           7.29314
