There’s a lot to write about std::simd in C++—too much for a single post.
I’d like to show how std::simd helps solve actual problems.
Like every tool, it’s not a one-size-fits-all solution.
Pick the tool that suits the problem.
And if you already have a good tool for your problem or the auto-vectorizer understands your code just fine 🤷 great, no one is asking you to switch.
However, maybe I’ll be able to broaden your perspective on its capabilities and the specific challenges it addresses.
A crucial distinction:
std::simd is not the same as std::experimental::simd. Don’t judge C++26’s data-parallel types by benchmarking the experimental predecessor. There’s a story behind that, but that’s not for today.
Some Examples
In this first post—of hopefully many more to come—I’ll let the preliminary implementation in GCC 16 speak for itself:
Every code example below assumes this setup:
#include <simd>
#include <cstdint>      // std::int8_t
#include <type_traits>  // std::is_same_v
namespace simd = std::simd;
std::simd provides C++ types that map directly to hardware SIMD registers.
Where a float operates on one value, a simd::vec<float> operates on \(N\) values in parallel—the width \(N\) determined
by the target architecture.
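As a rough mental model of that width, the lane count of a default-width vector is the register size divided by the element size. The sketch below is plain C++ arithmetic, not std::simd code. default_lanes is a hypothetical helper, and it assumes the default width fills exactly one register, which is what the ZMM outputs in this post suggest for this implementation (16 int lanes, 64 int8_t lanes).

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not part of std::simd): number of lanes that fit
// into one register of register_bytes bytes, assuming the default width
// fills exactly one register.
template <typename T>
constexpr std::size_t default_lanes(std::size_t register_bytes)
{
    return register_bytes / sizeof(T);
}

static_assert(default_lanes<float>(64) == 16);       // 512-bit ZMM
static_assert(default_lanes<std::int8_t>(64) == 64); // 512-bit ZMM
static_assert(default_lanes<float>(32) == 8);        // 256-bit YMM
static_assert(default_lanes<float>(16) == 4);        // 128-bit XMM
```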
Integer Division by 2
simd::vec<int> half(simd::vec<int> x)
{
    return x / 2;
}
int half(int x)
{
    return x / 2;
}
See Compiler Explorer:
"half(std::simd::basic_vec<int, std::simd::_Abi<16, 1, 1ull>>)":
vmovdqa32 zmm1, zmm0
vpsrld zmm0, zmm0, 31
vpaddd zmm0, zmm0, zmm1
vpsrad zmm0, zmm0, 1
ret
"half(int)":
mov eax, edi
shr eax, 31
add eax, edi
sar eax
ret
We can see:
- simd::vec<int> compiles to ZMM registers (-march=znver5), i.e. it scales to the target width.
- The division by 2 is optimized to shift instructions, using the exact same pattern as for int.
- Change it to vec<unsigned> and it’ll use a single vpsrld zmm0, zmm0, 1 instruction.
(If you do change it to unsigned, notice that it refuses to implicitly convert 2 (an int) to unsigned.
This is a shortcoming of the C++ language; handling it differently would have been possible, but that was ultimately rejected by WG21.
Use 2u or std::cw<2> for the divisor.)
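To see why the signed case needs three instructions while the unsigned case needs one, here is a scalar model of the shift pattern in plain C++. The function names are hypothetical and this is not how std::simd is implemented. It only mirrors the shr / add / sar sequence from the assembly above, and it assumes that >> on a negative signed value is an arithmetic shift (guaranteed since C++20, and GCC's behavior in earlier modes).

```cpp
#include <cstdint>
#include <limits>

// Scalar model of the shr / add / sar sequence emitted for x / 2.
constexpr std::int32_t half_by_shifts(std::int32_t x)
{
    // shr 31: yields 1 for negative x, 0 otherwise
    const std::int32_t sign =
        static_cast<std::int32_t>(static_cast<std::uint32_t>(x) >> 31);
    // add, then sar 1: adding the sign bit first makes the arithmetic
    // shift round toward zero, as integer division does
    return (x + sign) >> 1;
}

// Unsigned division by 2 needs no correction: one logical shift.
constexpr std::uint32_t half_unsigned(std::uint32_t x)
{
    return x >> 1;
}

static_assert(half_by_shifts(7) == 7 / 2);
static_assert(half_by_shifts(-7) == -7 / 2); // rounds toward zero
static_assert(half_by_shifts(-1) == -1 / 2); // a bare sar would give -1
static_assert(half_by_shifts(std::numeric_limits<std::int32_t>::min())
              == std::numeric_limits<std::int32_t>::min() / 2);
static_assert(half_unsigned(7u) == 7u / 2u);
```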
Constant-Folding and Strength Reduction
simd::vec<float> six(simd::vec<float> x)
{
    x = 3.f;
    return x * 2.f;
}
simd::vec<float> double_val(simd::vec<float> x)
{
    return x * 2.f;
}
See Compiler Explorer:
"six(std::simd::basic_vec<float, std::simd::_Abi<16, 1, 1ull>>)":
vbroadcastss zmm0, DWORD PTR .LC1[rip]
ret
"double_val(std::simd::basic_vec<float, std::simd::_Abi<16, 1, 1ull>>)":
vaddps zmm0, zmm0, zmm0
ret
.LC1:
.long 1086324736
We can see:
- GCC constant-folds six into a “load constant 6.f” instruction.
- GCC minimizes .rodata by using a broadcast instruction rather than a full vector load.
- GCC simplifies the multiply by 2 into an addition (x + x).
- This demonstrates that the optimizer understands the semantics of std::simd operations, allowing algebraic simplifications.
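Two details of that output can be checked with scalar C++. The sketch below is illustrative only: decode_binary32, times_two, and plus_self are hypothetical helpers, and the decoder handles normal IEEE 754 binary32 values only. It shows that the .long 1086324736 constant is the bit pattern of 6.0f, and that doubling by multiplication and by addition agree, which is why the compiler may swap one for the other without any relaxed floating-point flags.

```cpp
#include <cstdint>

// Hypothetical helper: decodes the bit pattern of a *normal* IEEE 754
// binary32 value by hand (no zeros, subnormals, infinities, or NaNs).
constexpr float decode_binary32(std::uint32_t bits)
{
    const int exponent = static_cast<int>((bits >> 23) & 0xFFu) - 127;
    const float significand =
        1.0f + static_cast<float>(bits & 0x7FFFFFu) / 8388608.0f; // / 2^23
    float scale = 1.0f;
    for (int i = 0; i < exponent; ++i) scale *= 2.0f;
    for (int i = 0; i > exponent; --i) scale /= 2.0f;
    const float magnitude = significand * scale;
    return (bits >> 31) != 0 ? -magnitude : magnitude;
}

// The .LC1 constant from the assembly output is 6.0f.
static_assert(decode_binary32(1086324736u) == 6.0f);

// Multiplying by 2 and adding a value to itself are both an exact
// doubling, so the results are identical.
constexpr float times_two(float x) { return x * 2.0f; }
constexpr float plus_self(float x) { return x + x; }

static_assert(times_two(3.0f) == plus_self(3.0f));
static_assert(times_two(0.1f) == plus_self(0.1f));
static_assert(times_two(-1.5e30f) == plus_self(-1.5e30f));
```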
Alignment: Automatic Where Possible, Explicit Where Needed
auto a8 = alignof(simd::vec<float, 8>);
auto a16 = alignof(simd::vec<float, 16>);
auto a32 = alignof(simd::vec<float, 32>);
auto b8 = simd::alignment_v<simd::vec<float, 8>>;
auto b16 = simd::alignment_v<simd::vec<float, 16>>;
auto b32 = simd::alignment_v<simd::vec<float, 32>>;
alignas(simd::alignment_v<simd::vec<float>>) float data[1024];
auto load()
{
    return simd::unchecked_load(data);
}
float data_u[1024];
auto loadu()
{
    return simd::unchecked_load(data_u);
}
See Compiler Explorer:
"load()":
vmovaps zmm0, ZMMWORD PTR "data"[rip]
ret
"loadu()":
vmovups zmm0, ZMMWORD PTR "data_u"[rip]
ret
"data_u":
.zero 4096
"data":
.zero 4096
"b32":
.quad 64
"b16":
.quad 64
"b8":
.quad 32
"a32":
.quad 64
"a16":
.quad 64
"a8":
.quad 32
We can see:
- The simd::vec types (for obvious reasons) communicate their alignment requirements as matching the register size.
- Note that simd::vec can span multiple registers when the requested width exceeds a single register’s capacity. Here, vec<float, 32> uses two ZMM registers and thus has no higher alignment requirement than a single one.
- The load call does not require specifying alignment; the compiler works it out itself whenever the pointer internally carries the alignment information (as for data, which is declared with alignas and is loaded with vmovaps). The simd::flag_aligned argument is an optimization that users should use sparingly.
Explicitly typed pointers for alignment would be nice to have (there is some support in mdspan; but there’s so much more to do here: alignment, aliasing, non-temporal access, …), as well as simpler out-of-the-box over-aligned allocators so that std::vector becomes easier to use.
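Until something like that exists out of the box, an over-aligned allocator can be written by hand. The following is a minimal sketch based on C++17 aligned operator new. aligned_allocator and is_aligned are hypothetical names, not part of std::simd or the standard library, and the sketch skips overflow checks on the allocation size. Whether a load from such a vector then gets the aligned instruction without simd::flag_aligned depends on what the compiler can see, so treat this as a way to guarantee the alignment, not as a guarantee about the generated code.

```cpp
#include <cstddef>
#include <cstdint>
#include <new>
#include <vector>

// Hypothetical helper: allocator that hands out storage aligned to
// Alignment bytes (for example 64 for ZMM-sized loads).
template <typename T, std::size_t Alignment>
struct aligned_allocator
{
    static_assert((Alignment & (Alignment - 1)) == 0,
                  "Alignment must be a power of two");

    using value_type = T;
    static constexpr std::size_t alignment = Alignment;

    // Needed because the non-type template parameter prevents
    // std::allocator_traits from rebinding automatically.
    template <typename U>
    struct rebind { using other = aligned_allocator<U, Alignment>; };

    aligned_allocator() = default;
    template <typename U>
    constexpr aligned_allocator(const aligned_allocator<U, Alignment>&) noexcept {}

    T* allocate(std::size_t n)
    {
        return static_cast<T*>(
            ::operator new(n * sizeof(T), std::align_val_t{Alignment}));
    }
    void deallocate(T* p, std::size_t) noexcept
    {
        ::operator delete(p, std::align_val_t{Alignment});
    }
};

template <typename T, typename U, std::size_t A>
constexpr bool operator==(const aligned_allocator<T, A>&,
                          const aligned_allocator<U, A>&) noexcept { return true; }
template <typename T, typename U, std::size_t A>
constexpr bool operator!=(const aligned_allocator<T, A>&,
                          const aligned_allocator<U, A>&) noexcept { return false; }

// Hypothetical helper: true if p is a multiple of alignment.
inline bool is_aligned(const void* p, std::size_t alignment)
{
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Usage: a std::vector whose data() is 64-byte aligned.
inline bool vector_storage_is_aligned()
{
    std::vector<float, aligned_allocator<float, 64>> v(1024);
    return is_aligned(v.data(), 64);
}
```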
std::simd Doesn’t Integer-Promote / Guards Against Accidental Increase in Register Usage
auto f(simd::vec<std::int8_t> x, simd::vec<std::int8_t> y)
{
    std::int8_t two = {2};
    auto r = (x + y) / two;
    static_assert(std::is_same_v<decltype(r), simd::vec<std::int8_t>>);
    return r;
}
See Compiler Explorer:
"f(std::simd::basic_vec<signed char, std::simd::_Abi<64, 1, 1ull>>, std::simd::basic_vec<signed char, std::simd::_Abi<64, 1, 1ull>>)":
mov eax, -2139062144
vpaddb zmm1, zmm0, zmm1
vpbroadcastd zmm0, eax
vgf2p8affineqb zmm0, zmm1, zmm0, 0
vgf2p8affineqb zmm0, zmm0, ZMMWORD PTR .LC1[rip], 0
vpaddb zmm0, zmm0, zmm1
vgf2p8affineqb zmm0, zmm0, ZMMWORD PTR .LC2[rip], 0
ret
.LC1:
...
We can see:
- simd::vec<int8_t> does not promote to int. This is probably the most important deviation from the design principle “a simd::vec<T> behaves like a T”. The reason should be obvious: x in the example uses one register; promoted to int it would suddenly require four registers. This touches upon another design principle: “don’t silently introduce performance gotchas, require explicit opt-in”.
- The code required the divisor to be of type int8_t. (x + y) / 2 (where 2 is of type int) would have implied a conversion from int8_t to int and thus a silent change from one to four registers. std::simd::basic_vec requires that both operands have a common type, and since the conversion from int to int8_t is not value-preserving and vec<int8_t> is not convertible to int, there is no viable operator/.
- The compiler decided to turn the division by 2 into a bizarre sequence of instructions. That’s because x86 lacks native 8-bit integer vector shifts. Instead, the compiler emulates the shifts with vgf2p8affineqb, a GFNI instruction that applies an 8×8 bit-matrix (affine) transformation over GF(2) to every byte.
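Both observations can be modelled in scalar C++. The sketch below first shows the integer promotion that std::simd deliberately avoids, then models a byte shift as an 8×8 bit matrix over GF(2). bit_matrix, parity, apply, and half_int8 are hypothetical names. The row layout is chosen for readability and does not reproduce the operand encoding of vgf2p8affineqb, and the shift / add / shift steps follow the general idea of the emitted code, not its exact instruction sequence.

```cpp
#include <cstdint>
#include <type_traits>

// 1. Scalar C++ promotes int8_t operands to int before doing arithmetic.
static_assert(std::is_same_v<decltype(std::int8_t{} + std::int8_t{}), int>);
// 64 lanes of int take four times the bytes of 64 lanes of int8_t,
// i.e. four 64-byte registers instead of one (assumes 4-byte int).
static_assert(64 * sizeof(int) == 4 * 64 * sizeof(std::int8_t));

// 2. A byte shift as a bit-matrix product over GF(2).
// row[i] selects the input bits that are XORed into output bit i.
struct bit_matrix { std::uint8_t row[8]; };

constexpr bool parity(std::uint8_t v)
{
    v = static_cast<std::uint8_t>(v ^ (v >> 4));
    v = static_cast<std::uint8_t>(v ^ (v >> 2));
    v = static_cast<std::uint8_t>(v ^ (v >> 1));
    return (v & 1u) != 0;
}

constexpr std::uint8_t apply(const bit_matrix& m, std::uint8_t x)
{
    std::uint8_t out = 0;
    for (int i = 0; i < 8; ++i)
        if (parity(static_cast<std::uint8_t>(m.row[i] & x)))
            out = static_cast<std::uint8_t>(out | (1u << i));
    return out;
}

// Logical shift right by 7: output bit 0 = input bit 7.
constexpr bit_matrix lsr7{{0x80, 0, 0, 0, 0, 0, 0, 0}};
// Arithmetic shift right by 1: output bit i = input bit i+1, sign bit kept.
constexpr bit_matrix asr1{{0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x80}};

// x / 2 for int8_t using only byte-wide adds and bit-matrix "shifts".
constexpr std::int8_t half_int8(std::int8_t x)
{
    const std::uint8_t u = static_cast<std::uint8_t>(x);
    const std::uint8_t sign = apply(lsr7, u);                        // 1 if negative
    const std::uint8_t biased = static_cast<std::uint8_t>(u + sign); // wraps like a byte add
    return static_cast<std::int8_t>(apply(asr1, biased));
}

constexpr bool matches_division_for_all_int8()
{
    for (int x = -128; x <= 127; ++x)
        if (half_int8(static_cast<std::int8_t>(x)) != x / 2)
            return false;
    return true;
}
static_assert(matches_division_for_all_int8());
```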
Binary Compatibility Safeguards
auto f(simd::vec<float, 8> x)
{
    return x + x;
}
See Compiler Explorer:
with -march=x86-64-v2:
"f(std::simd::basic_vec<float, std::simd::_Abi<8, 2, 0ull>>)":
movaps xmm0, XMMWORD PTR [rsp+8]
mov rax, rdi
addps xmm0, xmm0
movaps XMMWORD PTR [rdi], xmm0
movaps xmm0, XMMWORD PTR [rsp+24]
addps xmm0, xmm0
movaps XMMWORD PTR [rdi+16], xmm0
ret
and with -march=x86-64-v3:
"f(std::simd::basic_vec<float, std::simd::_Abi<8, 1, 0ull>>)":
vaddps ymm0, ymm0, ymm0
ret
Note the different ABI tag type for basic_vec (std::simd::_Abi<8, 1, 0ull> vs. std::simd::_Abi<8, 2, 0ull>).
The libstdc++ implementation uses an ABI tag that encodes
- the number of elements (basic_vec::size());
- the number of registers;
- additional bits of differences (vec-mask vs. bit-mask and interleaved vs. contiguous complex at this point).
This is a safeguard against linking code that is not binary compatible.
While this safeguard is not complete (composition can hide it), it is better than a simple <T, N> which happily
compiles and links and then does weird stuff at runtime.
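The mechanism can be sketched with stand-in types. Everything below is hypothetical (abi_tag_model, vec_model, registers_needed, vec_for_target, registers_used) and is not libstdc++'s actual _Abi or basic_vec. It only illustrates how putting the register count into the type makes the 128-bit and 256-bit builds of "8 floats" distinct types, so a function taking one gets a different mangled name than a function taking the other.

```cpp
#include <cstddef>
#include <type_traits>

// Hypothetical stand-ins for the ABI tag and the vector type.
template <std::size_t Elements, std::size_t Registers>
struct abi_tag_model {};

template <typename T, typename Abi>
struct vec_model {};

// Registers needed to hold `elements` values of `element_bytes` each.
constexpr std::size_t registers_needed(std::size_t elements,
                                       std::size_t element_bytes,
                                       std::size_t register_bytes)
{
    return (elements * element_bytes + register_bytes - 1) / register_bytes;
}

// The "same" <T, N> resolves to a different type per target register size.
template <typename T, std::size_t N, std::size_t RegisterBytes>
using vec_for_target =
    vec_model<T, abi_tag_model<N, registers_needed(N, sizeof(T), RegisterBytes)>>;

using vec8f_sse = vec_for_target<float, 8, 16>; // two XMM registers
using vec8f_avx = vec_for_target<float, 8, 32>; // one YMM register

static_assert(std::is_same_v<vec8f_sse, vec_model<float, abi_tag_model<8, 2>>>);
static_assert(std::is_same_v<vec8f_avx, vec_model<float, abi_tag_model<8, 1>>>);
static_assert(!std::is_same_v<vec8f_sse, vec8f_avx>);

// Distinct types mean distinct overloads, and thus distinct symbols.
constexpr int registers_used(vec8f_sse) { return 2; }
constexpr int registers_used(vec8f_avx) { return 1; }

static_assert(registers_used(vec8f_sse{}) == 2);
static_assert(registers_used(vec8f_avx{}) == 1);
```

A struct that merely contains such a vector keeps its own name in both builds, which is the "composition can hide it" caveat.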
The question then arises how to deploy binaries that support all kinds of ISA extensions. It is, however, a question that the C++ standard simply cannot answer. There do exist patterns and tooling support to make this work. That’s material for a different post.
Conclusion
There are so many more examples that could be considered. Send me requests or your favorite ones — especially if you find them surprising or they optimize badly. That helps us improve the implementation.
While this post does not explain why std::simd is what it is, I hope these examples give you a taste of what’s possible.
Time permitting, there will be more posts on the background and vision of std::simd.
Stay tuned.
Postscript: Background & Context
Over the past decade, std::simd has evolved through extensive WG21 committee work, peer-reviewed publications, and my doctoral research. The standardization process took time precisely because I insisted on thoroughness: exploring the design space and ensuring every design choice was backed by evidence.
As we reach C++26, the response has been broad. I’ve received numerous success stories from researchers applying these tools, alongside renewed scrutiny regarding specific use cases. Much of this discussion centers on a natural mismatch in expectations: while all SIMD domains were considered during standardization, data-parallel scientific applications drove the primary design decisions. When API choices conflicted between domains, the needs of HPC codes led the way.
Rather than debating these points abstractly, I believe the best way to evaluate any tool is to look at what it actually does.