There’s a lot to write about std::simd in C++—too much for a single post.
I’d like to show how std::simd helps solve actual problems.
Like every tool, it’s not a one-size-fits-all solution.
Pick the tool that suits the problem.
And if you already have a good tool for your problem or the auto-vectorizer understands your code just fine 🤷 great, no one is asking you to switch.
However, maybe I’ll be able to broaden your perspective on its capabilities and the specific challenges it addresses.
Context & Expectations
A crucial distinction first: std::simd is not the same as std::experimental::simd (an assumption I’ve seen too
often now), so don’t judge C++26’s data-parallel types by benchmarking std::experimental::simd.
There’s a story behind that, but that’s not for today.
Recent articles have claimed that std::simd is flawed or unnecessary.
I did not engage with these posts directly — and I don’t intend to.
Instead, I would like to point out that every technical concern raised during standardization has been addressed
extensively over the past decade. We explored the design space thoroughly—as is customary in WG21—and published
the findings in committee papers.
Peer-reviewed publications document its use in scientific computing, as does my doctoral thesis, which
covers the motivation, performance, and applicability in depth.
Standardization took so long precisely because I insisted on thoroughness: no claim went unexamined,
no design choice was made without evidence.
There also appears to be a fundamental mismatch in expectations. While all SIMD use cases were considered during standardization—and none were dismissed—data-parallel scientific applications drove the primary design decisions. When API choices conflicted between domains, the needs of HPC codes led the way.
I’ve also received positive feedback from the broader community—numerous “thank you” messages and success stories
from researchers and developers who applied these tools.
Yet now that std::simd ships in C++26, we’re also seeing renewed scrutiny—sometimes accompanied by personal attacks
rather than technical dialogue.
Regardless, the best way to evaluate any tool is to look at what it actually does. So let’s do that.
Some Examples
In this first post (of hopefully many more to come, where I’ll dive into background and vision), I’ll let my preliminary implementation in GCC 16 speak for itself.
Every code example below assumes the following setup:
#include <simd>
namespace simd = std::simd;
Integer Division by 2
simd::vec<int> half(simd::vec<int> x)
{
  return x / 2;
}

int half(int x)
{
  return x / 2;
}
This compiles to (cf. Compiler Explorer):
"half(std::simd::basic_vec<int, std::simd::_Abi<16, 1, 1ull>>)":
vmovdqa32 zmm1, zmm0
vpsrld zmm0, zmm0, 31
vpaddd zmm0, zmm0, zmm1
vpsrad zmm0, zmm0, 1
ret
"half(int)":
mov eax, edi
shr eax, 31
add eax, edi
sar eax
ret
We can see:
- simd::vec<int> compiles to ZMM registers (-march=znver5), i.e. it scales to the target width.
- The division by 2 is optimized to shift instructions, using the exact same pattern as for int.
- Change it to vec<unsigned> and it’ll use a single vpsrld zmm0, zmm0, 1 instruction.
(If you do change it to unsigned, notice that it refuses to implicitly convert 2 (an int) to unsigned.
This stems from a shortcoming of the C++ language; handling it differently was proposed but ultimately rejected by WG21.
Use 2u or std::cw<2> for the divisor.)
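To see why the signed case needs three instructions where the unsigned case needs one, here is a scalar sketch of the pattern visible in the assembly above. The function names are mine and purely illustrative. The code assumes a 32-bit int and an arithmetic right shift for negative values (guaranteed since C++20, and what mainstream compilers did before that).

```cpp
#include <cassert>
#include <climits>

static_assert(sizeof(int) * CHAR_BIT == 32, "sketch assumes a 32-bit int");

// Hypothetical helper: what shr/add/sar (or vpsrld/vpaddd/vpsrad) compute per element.
constexpr int half_via_shifts(int x)
{
  // Logical shift isolates the sign bit: 1 for negative x, 0 otherwise.
  const int bias = static_cast<int>(static_cast<unsigned>(x) >> 31);
  // The bias turns "round toward negative infinity" (arithmetic shift)
  // into "round toward zero" (what x / 2 requires).
  return (x + bias) >> 1;
}

// Unsigned division by 2 needs no bias: a single logical shift is exact.
constexpr unsigned half_via_shift(unsigned x)
{
  return x >> 1;
}

// Spot-check both helpers against the built-in division.
inline void check_half()
{
  for (int x = -1000; x <= 1000; ++x)
    assert(half_via_shifts(x) == x / 2);
  assert(half_via_shift(7u) == 7u / 2u);
}
```

Without the bias, -7 >> 1 would yield -4 rather than the -3 that -7 / 2 requires, which is why the signed version cannot be a single shift.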
Constant-Folding and Optimization
simd::vec<float> half(simd::vec<float> x)
{
  x = 3.f; // <- remove this
  return x * 2.f;
}
This compiles to (cf. Compiler Explorer):
"half(std::simd::basic_vec<float, std::simd::_Abi<16, 1, 1ull>>)":
vbroadcastss zmm0, DWORD PTR .LC1[rip]
ret
.LC1:
.long 1086324736
We can see:
- The compiler constant-folds the whole thing into a “load constant 6.f” instruction.
- The compiler minimizes .rodata by using a broadcast instruction rather than a full vector load.
- Remove x = 3.f; and the result is a single vaddps instruction: the multiply by 2 was simplified into an x + x.
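The .long 1086324736 in the listing is the IEEE 754 bit pattern of 6.0f, and the x + x rewrite is legal because doubling a float gives the same result either way. A small scalar sketch makes both checkable. The helper names are mine, and the code assumes float is a 32-bit IEEE 754 type, which holds on x86.

```cpp
#include <cstdint>
#include <cstring>
#include <limits>

static_assert(sizeof(float) == sizeof(std::uint32_t), "sketch assumes a 32-bit float");
static_assert(std::numeric_limits<float>::is_iec559, "sketch assumes IEEE 754 floats");

// Hypothetical helper: reinterpret the bits of a float as an unsigned integer.
inline std::uint32_t bits_of(float f)
{
  std::uint32_t u;
  std::memcpy(&u, &f, sizeof u);
  return u;
}

// Hypothetical helper: true if x * 2.f and x + x produce identical bits.
inline bool doubling_matches(float x)
{
  return bits_of(x * 2.f) == bits_of(x + x);
}
```

Calling bits_of(3.f * 2.f) gives the same integer that the compiler placed in .LC1.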
Alignment Isn’t Trivial, But It’s Also As Simple As Possible
auto a8 = alignof(simd::vec<float, 8>);
auto a16 = alignof(simd::vec<float, 16>);
auto a32 = alignof(simd::vec<float, 32>);

auto b8 = simd::alignment_v<simd::vec<float, 8>>;
auto b16 = simd::alignment_v<simd::vec<float, 16>>;
auto b32 = simd::alignment_v<simd::vec<float, 32>>;

alignas(simd::alignment_v<simd::vec<float>>) float data[1024];

auto load()
{
  return simd::unchecked_load(data);
}

float data_u[1024];

auto loadu()
{
  return simd::unchecked_load(data_u);
}
This compiles to (cf. Compiler Explorer):
"load()":
vmovaps zmm0, ZMMWORD PTR "data"[rip]
ret
"loadu()":
vmovups zmm0, ZMMWORD PTR "data_u"[rip]
ret
"data_u":
.zero 4096
"data":
.zero 4096
"b32":
.quad 64
"b16":
.quad 64
"b8":
.quad 32
"a32":
.quad 64
"a16":
.quad 64
"a8":
.quad 32
We can see:
- The simd::vec types (for obvious reasons) communicate their alignment requirements as matching the size of the underlying register.
- The vec<float, 32> type in this case is made up of two ZMM registers and thus has no higher alignment requirement.
- The load call does not require specifying alignment and will figure this out itself, if the pointer internally carries the alignment information.
The simd::flag_aligned argument is an optimization that users should use sparingly.
Explicitly typed pointers for alignment would be nice to have (there is some support in mdspan, but there’s so much more to do here: alignment, aliasing, non-temporal access, …). So would simpler out-of-the-box over-aligned allocators, so that std::vector becomes easier to use.
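As one illustration of what an over-aligned allocator for std::vector can look like today, here is a minimal sketch built on C++17 aligned operator new. aligned_allocator and is_aligned_to are hypothetical names of mine, not part of the standard library or of std::simd, and the 64 used below is simply the ZMM alignment from the listing above.

```cpp
#include <cstddef>
#include <cstdint>
#include <new>
#include <vector>

// Hypothetical allocator: hands out storage aligned to at least Alignment bytes.
template <typename T, std::size_t Alignment>
struct aligned_allocator
{
  static_assert((Alignment & (Alignment - 1)) == 0, "Alignment must be a power of two");

  using value_type = T;

  // Never go below the natural alignment of T.
  static constexpr std::size_t alignment = Alignment > alignof(T) ? Alignment : alignof(T);

  aligned_allocator() noexcept = default;

  template <typename U>
  aligned_allocator(const aligned_allocator<U, Alignment>&) noexcept {}

  // Needed explicitly because of the non-type template parameter.
  template <typename U>
  struct rebind { using other = aligned_allocator<U, Alignment>; };

  T* allocate(std::size_t n)
  {
    return static_cast<T*>(::operator new(n * sizeof(T), std::align_val_t(alignment)));
  }

  void deallocate(T* p, std::size_t) noexcept
  {
    ::operator delete(p, std::align_val_t(alignment));
  }
};

// The allocator is stateless, so all instances compare equal.
template <typename T, typename U, std::size_t A>
bool operator==(const aligned_allocator<T, A>&, const aligned_allocator<U, A>&) noexcept
{ return true; }

template <typename T, typename U, std::size_t A>
bool operator!=(const aligned_allocator<T, A>&, const aligned_allocator<U, A>&) noexcept
{ return false; }

// Hypothetical helper: check a pointer against a power-of-two alignment.
inline bool is_aligned_to(const void* p, std::size_t alignment)
{
  return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

With this, std::vector<float, aligned_allocator<float, 64>> hands out a data() pointer that is 64-byte aligned. The sketch only covers the allocation side: the vector's pointer type still carries no alignment information, which is the gap described above.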
std::simd Doesn’t Integer-Promote / Guards Against Accidental Increase in Register Usage
#include <cstdint>     // std::int8_t
#include <type_traits> // std::is_same_v

auto f(simd::vec<std::int8_t> x, simd::vec<std::int8_t> y)
{
  std::int8_t two = {2};
  auto r = (x + y) / two;
  static_assert(std::is_same_v<decltype(r), simd::vec<std::int8_t>>);
  return r;
}
This compiles to (cf. Compiler Explorer):
"f(std::simd::basic_vec<signed char, std::simd::_Abi<64, 1, 1ull>>, std::simd::basic_vec<signed char, std::simd::_Abi<64, 1, 1ull>>)":
mov eax, -2139062144
vpaddb zmm1, zmm0, zmm1
vpbroadcastd zmm0, eax
vgf2p8affineqb zmm0, zmm1, zmm0, 0
vgf2p8affineqb zmm0, zmm0, ZMMWORD PTR .LC1[rip], 0
vpaddb zmm0, zmm0, zmm1
vgf2p8affineqb zmm0, zmm0, ZMMWORD PTR .LC2[rip], 0
ret
.LC1:
...
We can see:
- simd::vec<int8_t> does not promote to int. This is probably the most important deviation from the design principle “a simd::vec<T> behaves like a T”. The reason should be obvious: x in the example uses one register; promoted to int it would suddenly require four registers. This touches upon another design principle: “don’t silently introduce performance gotchas, require explicit opt-in”.
- The code required the divisor to be of type int8_t. (x + y) / 2 (where 2 is of type int) would have implied a conversion from int8_t to int and thus a silent change from one to four registers. std::simd::basic_vec requires that both operands have a common type, and since the conversion from int to int8_t is not value-preserving and vec<int8_t> is not convertible to int, there is no viable operator/.
- The compiler decided to turn the division by 2 into a bizarre sequence of instructions. That’s because x86 doesn’t have 8-bit integer vector shifts.
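For contrast, this is what the scalar language does, and where the “one to four registers” figure comes from. It is a plain C++17 sketch that uses no std::simd at all. registers_needed is a hypothetical helper of mine, and the code assumes a 32-bit int.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>

static_assert(sizeof(int) == 4, "sketch assumes a 32-bit int");

// Scalar C++: both operands are promoted to int before the arithmetic happens.
static_assert(std::is_same_v<decltype(std::int8_t{} + std::int8_t{}), int>);
static_assert(std::is_same_v<decltype(std::int8_t{} / 2), int>);

// Hypothetical helper: number of registers of register_bytes bytes needed
// to hold `lanes` elements of type T.
template <typename T>
constexpr std::size_t registers_needed(std::size_t lanes, std::size_t register_bytes)
{
  return (lanes * sizeof(T) + register_bytes - 1) / register_bytes;
}

// 64 elements of int8_t fill exactly one 64-byte ZMM register ...
static_assert(registers_needed<std::int8_t>(64, 64) == 1);
// ... but need four registers once every element is widened to int.
static_assert(registers_needed<int>(64, 64) == 4);
```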
Binary Compatibility Safeguards
auto f(simd::vec<float, 8> x)
{
  return x + x;
}
This compiles to (cf. Compiler Explorer):
with -march=x86-64-v2:
"f(std::simd::basic_vec<float, std::simd::_Abi<8, 2, 0ull>>)":
movaps xmm0, XMMWORD PTR [rsp+8]
mov rax, rdi
addps xmm0, xmm0
movaps XMMWORD PTR [rdi], xmm0
movaps xmm0, XMMWORD PTR [rsp+24]
addps xmm0, xmm0
movaps XMMWORD PTR [rdi+16], xmm0
ret
and with -march=x86-64-v3:
"f(std::simd::basic_vec<float, std::simd::_Abi<8, 1, 0ull>>)":
vaddps ymm0, ymm0, ymm0
ret
Note the different ABI tag type for basic_vec (std::simd::_Abi<8, 1, 0ull> vs. std::simd::_Abi<8, 2, 0ull>).
The libstdc++ implementation uses an ABI tag that encodes
- the number of elements (basic_vec::size());
- the number of registers;
- additional bits of differences (vec-mask vs. bit-mask and interleaved vs. contiguous complex at this point).
This is a safeguard against linking code that is not binary compatible.
While this safeguard is not complete (composition can hide it), it is better than a simple <T, N>, which happily
compiles and links and then does weird stuff at runtime.
The question then arises how to deploy binaries that support all kinds of ISA extensions. It is, however, a question that the C++ standard simply cannot answer. There do exist patterns and tooling support to make this work. That’s material for a different post.
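The effect of putting the register count into the type can be modeled with a toy example. toy_abi, toy_vec, and plain_vec are stand-ins I made up for illustration; they are not libstdc++’s definitions, and the real tag carries more information than this.

```cpp
#include <type_traits>

// Toy stand-in for an ABI tag: element count plus register count.
template <int Size, int Registers>
struct toy_abi {};

// Toy stand-in for a vector type parameterized on such a tag.
template <typename T, typename Abi>
struct toy_vec {};

// Same element type and element count, as seen by two different targets:
using vec8_two_xmm = toy_vec<float, toy_abi<8, 2>>; // e.g. -march=x86-64-v2
using vec8_one_ymm = toy_vec<float, toy_abi<8, 1>>; // e.g. -march=x86-64-v3

// The two are distinct types, so functions taking them get distinct mangled
// names, and a mismatch surfaces as a link error instead of a runtime surprise.
static_assert(!std::is_same_v<vec8_two_xmm, vec8_one_ymm>);

// A plain <T, N> type names the same type on both targets, so mismatched
// object files would link without complaint.
template <typename T, int N>
struct plain_vec {};

using plain8 = plain_vec<float, 8>;
```

The incompleteness mentioned above shows up as soon as such a vector becomes a member of a user-defined struct: the struct’s own name does not change with the target, so the tag is hidden again.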
Conclusion and Outlook
There are so many more examples that could be considered. Send me requests or your favorite ones — especially if you find them surprising or they optimize badly. That helps us improve the implementation.
While this post does not explain why std::simd is what it is, I hope these examples give you a taste of what’s possible.
Time permitting, there will be more posts on the background and vision of std::simd.
Again, send me requests and questions and I’ll look into covering them.
The journey from Vc (first free software release in 2009) to std::simd (C++26) took 17 years.
The current quality of implementation (QoI) wouldn’t have been possible without everything that happened in between.
Stay tuned — and don’t judge std::simd by its experimental predecessor. The real story is just beginning.