One of the major benefits of type-based vectorization is data-structure vectorization. I’ll introduce and hopefully motivate the pattern in this post.
Example and Benchmark
Let me illustrate data-structure vectorization with an example. Consider many points in 3-dimensional Euclidean space. We are given the task to determine the mean distance from the origin (0, 0, 0). A generic definition of a Point is given by
template <typename T>
struct Point
{
T x, y, z;
};
Now we have at least three sensible choices, with respect to vectorization, for storing many points in memory:
// AoS (array of structure)
using Points_AoS = std::vector<Point<float>>;
// SoA (structure of array)
using Points_SoA = Point<std::vector<float>>;
// AoVS (array of vectorized structure)
using Points_AoVS = std::vector<Point<stdx::native_simd<float>>>;
A visualization, which I created for my PhD thesis, might help. It shows how a 4-element structure (a, b, c, d), given 4-element SIMD vectors, is located in memory:
- AoS inhibits SIMD loads & stores of a given element from consecutive indexes.
- SoA enables SIMD loads & stores but places the structure’s elements far apart in memory, breaking locality. With the naïve definition above (Point<std::vector<float>>), the structure’s elements even live in separately allocated memory, increasing memory-allocation overhead.
- AoVS covers the middle ground. It allows SIMD loads & stores while maximizing locality of the structure’s elements.
Loop-based vectorization typically uses SoA data layout, because AoS does not perform well. AoVS is often impractical for loop-based vectorization because there is no natural facility for chunking the memory as needed.
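To use the AoVS layout, the data has to be chunked up front. Here is a minimal sketch of converting AoS data into AoVS, assuming the type aliases above, a point count that is a multiple of the SIMD width, and a helper name (to_aovs) that I made up:

```cpp
#include <experimental/simd>
#include <vector>
#include <cstddef>

namespace stdx = std::experimental;

template <typename T>
struct Point { T x, y, z; };

using Points_AoS  = std::vector<Point<float>>;
using Points_AoVS = std::vector<Point<stdx::native_simd<float>>>;

// Chunk AoS data into AoVS; assumes aos.size() is a multiple of the SIMD width.
Points_AoVS to_aovs(const Points_AoS& aos) {
  using V = stdx::native_simd<float>;
  Points_AoVS out(aos.size() / V::size());
  for (std::size_t i = 0; i < out.size(); ++i) {
    const Point<float>* chunk = aos.data() + i * V::size();
    // The simd generator constructor gathers one member across the chunk.
    out[i].x = V([&](auto j) { return chunk[j].x; });
    out[i].y = V([&](auto j) { return chunk[j].y; });
    out[i].z = V([&](auto j) { return chunk[j].z; });
  }
  return out;
}
```

The gathers make this conversion relatively expensive, which is why AoVS pays off mainly when the data is produced in that layout to begin with, or reused many times after conversion.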
Given these three data structures, we can calculate the mean length by computing each point’s distance to (0, 0, 0) as the square root of the sum of squared coordinates. After accumulating all lengths, dividing by the number of points yields the mean length:
// AoS ////////////////////////////////////////////////////////////////////////
float mean_length(const Points_AoS& points) {
float acc = std::transform_reduce(
std::execution::unseq, points.begin(), points.end(), 0.f,
std::plus<>(), [](auto p) {
return std::sqrt(p.x * p.x + p.y * p.y + p.z * p.z);
});
return acc / points.size();
}
// SoA ////////////////////////////////////////////////////////////////////////
float mean_length(const Points_SoA& points) {
assert(points.x.size() == points.y.size()
&& points.x.size() == points.z.size());
float acc = 0;
for (std::size_t i = 0; i < points.x.size(); ++i) {
acc += std::sqrt(points.x[i] * points.x[i] + points.y[i] * points.y[i]
+ points.z[i] * points.z[i]);
}
return acc / points.x.size();
}
// AoVS ///////////////////////////////////////////////////////////////////////
float mean_length(const Points_AoVS& points) {
stdx::native_simd<float> acc = 0;
for (const auto& p : points) {
acc += sqrt(p.x * p.x + p.y * p.y + p.z * p.z);
}
return reduce(acc) / (points.size() * acc.size());
}
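One caveat the AoVS version glosses over: it assumes the point count is a multiple of the SIMD width. If the last chunk is only partially filled, padding the unused lanes with zeros keeps the sum correct (a zero point contributes length 0), but the divisor must be the true point count. A sketch, with that count passed as an extra parameter (a signature I made up for illustration):

```cpp
#include <experimental/simd>
#include <vector>
#include <cstddef>

namespace stdx = std::experimental;

template <typename T>
struct Point { T x, y, z; };

using Points_AoVS = std::vector<Point<stdx::native_simd<float>>>;

// Like mean_length above, but tolerates a zero-padded tail chunk:
// n_points is the true number of points, which may be smaller than
// points.size() times the SIMD width.
float mean_length_padded(const Points_AoVS& points, std::size_t n_points) {
  stdx::native_simd<float> acc = 0;
  for (const auto& p : points) {
    acc += sqrt(p.x * p.x + p.y * p.y + p.z * p.z);  // padded lanes add 0
  }
  return reduce(acc) / n_points;
}
```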
The AoVS implementation is straightforward to read, trivially yields SIMD execution, and performs best. Here are the results on my AMD Ryzen 7 (AVX for native_simd<float>) with GCC 11.2, compiled with -O3 -march=native:
TYPE                    LAYOUT  fast-math  mean_length
                                           [cycles/call] (smaller is better)
-----------------------------------------------------------------------------
float SoA no 9800 ██████████████████████████████
float AoS no 9695 █████████████████████████████▋
float AoS yes 4404 █████████████▌
float SoA yes 1541 ████▊
float, simd_abi::[AVX] AoVS yes 1238 ███▊
float, simd_abi::[AVX] AoVS no 1230 ███▊
-----------------------------------------------------------------------------
As you can see, -ffast-math (and also -O3; for stdₓ::simd, -O2 typically suffices) is necessary for good SoA performance. Independent of -ffast-math and -O2 vs. -O3, AoVS with stdₓ::native_simd<float> wins. SoA performance with auto-vectorization is close, but AoS performance is always bad.
Patterns
The Point template as defined above is a typical pattern I like to use. You can even give it member functions which depend on T. For example, we can add a distance_to_origin() function to Point:
template <typename T>
struct Point
{
T x, y, z;
T distance_to_origin() const {
using std::sqrt;
return sqrt(x * x + y * y + z * z);
}
};
Then, if you’re careful about if statements, you can instantiate the template with either an arithmetic type or a stdₓ::simd type. You’ve effectively declared a template that is vectorizable in terms of data layout and algorithms.
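What “careful about if statements” means in practice: a scalar branch has no direct simd equivalent, so branchy code must be expressed as a masked assignment. A minimal sketch using stdₓ::where from the Parallelism TS, with a made-up clamp function as the example:

```cpp
#include <experimental/simd>
#include <type_traits>

namespace stdx = std::experimental;

// Works for both arithmetic T (e.g. float) and stdx::simd types.
template <typename T>
T clamp_negative_to_zero(T v) {
  if constexpr (std::is_arithmetic_v<T>) {
    return v < 0 ? T(0) : v;          // scalar: an ordinary branch
  } else {
    stdx::where(v < T(0), v) = T(0);  // simd: masked assignment, no branch
    return v;
  }
}
```

Writing the body in the blended form from the start, where possible, avoids the if constexpr split entirely.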
Objects of Point<simd> are the working sets the program subsequently would use whenever possible. Thus, you may see these vectorized user-defined types cross function boundaries. They become an integral part of your internal APIs.
ABI concerns
If you require a stable ABI at a library boundary, then you need to understand how stdₓ::simd affects your ABI. The use of stdx::native_simd explicitly asks for an ABI dependency on compiler flags. The stdx::simd_abi::fixed_size<N> ABI tag was specifically implemented in libstdc++ such that it trades performance for ABI stability. Consequently, the use of stdx::fixed_size_simd<T, N> on ABI boundaries can be considered safe and stable.
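A sketch of what this looks like at such a boundary, with a hypothetical exported function; the point is that the fixed-size type’s layout and mangling do not change with -march flags on either side:

```cpp
#include <experimental/simd>

namespace stdx = std::experimental;

// ABI-stable vector type for the library interface: the width is pinned
// to 8 lanes regardless of the compiler flags used on either side.
using simd8f = stdx::fixed_size_simd<float, 8>;

// Hypothetical exported function; its mangled name and parameter layout
// do not depend on -march / -mavx... flags.
simd8f scale(simd8f v, float factor) {
  return v * factor;  // scalar operand broadcasts to all 8 lanes
}
```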