Vir's blog — go fast with readable code!

there's too much unused parallelism on a single core

Introduction to std::simd in C++26 (Part 1)

21 May 2026 — Written by Matthias Kretz
Tagged as: C++, C++26, SIMD, and std::simd

There’s a lot to write about std::simd in C++—too much for a single post. I’d like to show how std::simd helps solve actual problems. Like every tool, it’s not a one-size-fits-all solution. Pick the tool that suits the problem. And if you already have a good tool for your problem or the auto-vectorizer understands your code just fine 🤷 great, no one is asking you to switch. However, maybe I’ll be able to broaden your perspective on its capabilities and the specific challenges it addresses.

Context & Expectations

A crucial distinction first: std::simd is not the same as std::experimental::simd (an assumption I’ve seen too often now), so don’t judge C++26’s data-parallel types by benchmarking std::experimental::simd. There’s a story behind that, but that’s not for today.

Recent articles have claimed that std::simd is flawed or unnecessary. I did not engage with these posts directly, and I don’t intend to. Instead, I would like to point out that every technical concern raised during standardization has been addressed extensively over the past decade. We explored the design space thoroughly (as is customary in WG21) and published the findings in committee papers. Peer-reviewed publications document its use in scientific computing, as does my doctoral thesis, which covers the motivation, performance, and applicability in depth. Standardization took so long precisely because I insisted on thoroughness: no claim went unexamined, no design choice was made without evidence.

There also appears to be a fundamental mismatch in expectations. While all SIMD use cases were considered during standardization—and none were dismissed—data-parallel scientific applications drove the primary design decisions. When API choices conflicted between domains, the needs of HPC codes led the way.

I’ve also received positive feedback from the broader community—numerous “thank you” messages and success stories from researchers and developers who successfully applied these tools. Yet now that std::simd ships in C++26, we’re also seeing renewed scrutiny—sometimes accompanied by personal attacks rather than technical dialogue.

Regardless, the best way to evaluate any tool is to look at what it actually does. So let’s do that.

Some Examples

In this first post (of hopefully many more to come, where I’ll dive into background and vision) I’ll let my preliminary implementation in GCC 16 speak for itself.

All code examples below assume the following setup:

#include <simd>
namespace simd = std::simd;

Integer Division by 2

simd::vec<int> half(simd::vec<int> x)
{
  return x / 2;
}

int half(int x)
{
  return x / 2;
}

This compiles to (cf. Compiler Explorer):

"half(std::simd::basic_vec<int, std::simd::_Abi<16, 1, 1ull>>)":
        vmovdqa32       zmm1, zmm0
        vpsrld  zmm0, zmm0, 31
        vpaddd  zmm0, zmm0, zmm1
        vpsrad  zmm0, zmm0, 1
        ret
"half(int)":
        mov     eax, edi
        shr     eax, 31
        add     eax, edi
        sar     eax
        ret

We can see:

  1. simd::vec<int> compiles to ZMM registers (-march=znver5), i.e. it scales to the target width.
  2. The division by 2 is optimized to shift instructions, using the exact same pattern as for int.
  3. Change it to vec<unsigned> and it’ll use a single vpsrld zmm0, zmm0, 1 instruction.

(If you do change it to unsigned, notice that it refuses to implicitly convert 2 (an int) to unsigned. This stems from a shortcoming of the C++ language: it could have been handled differently, but that approach was ultimately rejected by WG21. Use 2u or std::cw<2> for the divisor.)
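To make the shift pattern from the assembly above easier to follow, here is a scalar sketch of the same three steps (shift out the sign bit, add it as a bias, arithmetic shift right by one). It is plain C++ rather than std::simd, and the function names half_by_shifts and half_unsigned are made up for this illustration.

```cpp
#include <cstdint>

// Scalar model of the emitted sequence (shr 31; add; sar 1).
// Assumes arithmetic right shift for signed values, which C++20 guarantees
// and mainstream compilers provided before that.
constexpr std::int32_t half_by_shifts(std::int32_t x)
{
  // A logical shift isolates the sign bit: 1 for negative x, 0 otherwise.
  const std::int32_t bias =
    static_cast<std::int32_t>(static_cast<std::uint32_t>(x) >> 31);
  // Adding the bias makes the arithmetic shift round toward zero, like x / 2.
  return (x + bias) >> 1;
}

// Unsigned division needs no bias: a single logical shift suffices.
constexpr std::uint32_t half_unsigned(std::uint32_t x)
{
  return x >> 1;
}

static_assert(half_by_shifts(7) == 7 / 2);
static_assert(half_by_shifts(-7) == -7 / 2);
static_assert(half_by_shifts(INT32_MIN) == INT32_MIN / 2);
static_assert(half_unsigned(7u) == 7u / 2u);
```

Without the bias, -7 >> 1 would yield -4 (rounding toward negative infinity), whereas C++ division yields -3. That is why the signed version needs the extra two instructions.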

Constant-Folding and Optimization

simd::vec<float> half(simd::vec<float> x)
{
  x = 3.f; // <- remove this
  return x * 2.f;
}

This compiles to (cf. Compiler Explorer):

"half(std::simd::basic_vec<float, std::simd::_Abi<16, 1, 1ull>>)":
        vbroadcastss    zmm0, DWORD PTR .LC1[rip]
        ret
.LC1:
        .long   1086324736

We can see:

  1. The compiler constant-folds the whole thing into a “load constant 6.f” instruction.
  2. The compiler minimizes .rodata by using a broadcast instruction rather than a full vector load.
  3. Remove x = 3.f; and the result is a single vaddps instruction: the multiply by 2 was simplified into an x + x.
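If you want to check that the .long value in the listing really is 6.f, a small scalar sketch like the following does the job. The helper names float_from_bits and check_folded_constant are hypothetical, and the code assumes float is IEEE 754 binary32 (true on x86).

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Reinterpret a 32-bit pattern as a float.
// Assumes float is IEEE 754 binary32, as on x86.
inline float float_from_bits(std::uint32_t bits)
{
  static_assert(sizeof(float) == sizeof(std::uint32_t));
  float value;
  std::memcpy(&value, &bits, sizeof value);
  return value;
}

inline void check_folded_constant()
{
  // 1086324736 == 0x40C00000: sign 0, biased exponent 129 (2^2), mantissa 1.5.
  assert(float_from_bits(1086324736u) == 3.f * 2.f);
}
```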

Alignment Isn’t Trivial, But It’s Also As Simple As Possible

auto a8 = alignof(simd::vec<float, 8>);
auto a16 = alignof(simd::vec<float, 16>);
auto a32 = alignof(simd::vec<float, 32>);

auto b8 = simd::alignment_v<simd::vec<float, 8>>;
auto b16 = simd::alignment_v<simd::vec<float, 16>>;
auto b32 = simd::alignment_v<simd::vec<float, 32>>;

alignas(simd::alignment_v<simd::vec<float>>) float data[1024];

auto load()
{
  return simd::unchecked_load(data);
}

float data_u[1024];

auto loadu()
{
  return simd::unchecked_load(data_u);
}

This compiles to (cf. Compiler Explorer):

"load()":
        vmovaps zmm0, ZMMWORD PTR "data"[rip]
        ret
"loadu()":
        vmovups zmm0, ZMMWORD PTR "data_u"[rip]
        ret
"data_u":
        .zero   4096
"data":
        .zero   4096
"b32":
        .quad   64
"b16":
        .quad   64
"b8":
        .quad   32
"a32":
        .quad   64
"a16":
        .quad   64
"a8":
        .quad   32

We can see:

  1. The simd::vec types (for obvious reasons) report an alignment requirement that matches the size of the register they map to.
  2. The vec<float, 32> type in this case is made up of two ZMM registers and thus has no higher alignment requirement.
  3. The load call does not require specifying alignment: the aligned instruction is used whenever the compiler can see the alignment of the pointed-to object, as with the alignas array here. The simd::flag_aligned argument is an optimization that users should use sparingly.

Explicitly typed pointers for alignment would be nice to have (there is some support in mdspan; but there’s so much more to do here: alignment, aliasing, non-temporal access, …), as well as simpler out-of-the-box over-aligned allocators so that std::vector becomes easier to use.
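Until such allocators exist out of the box, one possible workaround is a small hand-written allocator built on the aligned operator new from C++17. The following is only a minimal sketch: aligned_allocator, is_aligned, and check_aligned_vector are hypothetical names, and the value 64 is an assumption that matches the ZMM case shown above. In real code the alignment would come from simd::alignment_v.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <vector>

// Minimal over-aligned allocator sketch (no overflow checks on n * sizeof(T)).
template <typename T, std::size_t Alignment>
struct aligned_allocator
{
  static_assert((Alignment & (Alignment - 1)) == 0, "power of two required");
  using value_type = T;

  // Needed because Alignment is a non-type template parameter.
  template <typename U>
  struct rebind { using other = aligned_allocator<U, Alignment>; };

  aligned_allocator() = default;
  template <typename U>
  aligned_allocator(const aligned_allocator<U, Alignment>&) noexcept {}

  T* allocate(std::size_t n)
  {
    static_assert(Alignment >= alignof(T));
    return static_cast<T*>(
      ::operator new(n * sizeof(T), std::align_val_t(Alignment)));
  }

  void deallocate(T* p, std::size_t) noexcept
  {
    ::operator delete(p, std::align_val_t(Alignment));
  }
};

template <typename T, typename U, std::size_t A>
bool operator==(const aligned_allocator<T, A>&,
                const aligned_allocator<U, A>&) noexcept
{ return true; }

template <typename T, typename U, std::size_t A>
bool operator!=(const aligned_allocator<T, A>&,
                const aligned_allocator<U, A>&) noexcept
{ return false; }

// True if p is a multiple of alignment.
inline bool is_aligned(const void* p, std::size_t alignment)
{
  return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

inline void check_aligned_vector()
{
  std::vector<float, aligned_allocator<float, 64>> v(1024);
  assert(is_aligned(v.data(), 64));
}
```

Note that the alignment becomes part of the vector type, so such a vector does not convert to a plain std::vector<float>. That is one reason a standard solution would be nicer.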

std::simd Doesn’t Integer-Promote / Guards Against Accidental Increase in Register Usage

auto f(simd::vec<std::int8_t> x, simd::vec<std::int8_t> y)
{
  std::int8_t two = {2};
  auto r = (x + y) / two;
  static_assert(std::is_same_v<decltype(r), simd::vec<std::int8_t>>);
  return r;
}

This compiles to (cf. Compiler Explorer):

"f(std::simd::basic_vec<signed char, std::simd::_Abi<64, 1, 1ull>>, std::simd::basic_vec<signed char, std::simd::_Abi<64, 1, 1ull>>)":
        mov     eax, -2139062144
        vpaddb  zmm1, zmm0, zmm1
        vpbroadcastd    zmm0, eax
        vgf2p8affineqb   zmm0, zmm1, zmm0, 0
        vgf2p8affineqb   zmm0, zmm0, ZMMWORD PTR .LC1[rip], 0
        vpaddb  zmm0, zmm0, zmm1
        vgf2p8affineqb   zmm0, zmm0, ZMMWORD PTR .LC2[rip], 0
        ret
.LC1:
        ...

We can see:

  1. simd::vec<int8_t> does not promote to int. This is probably the most important deviation from the design principle “a simd::vec<T> behaves like a T”. The reason should be obvious: x in the example uses one register; promoted to int it would suddenly require four registers. This touches upon another design principle: “don’t silently introduce performance gotchas, require explicit opt-in”.
  2. The code required the divisor to be of type int8_t. (x + y) / 2 (where 2 is of type int) would have implied a conversion from int8_t to int and thus a silent change from one to four registers. std::simd::basic_vec requires that both operands have a common type, and since the conversion from int to int8_t is not value-preserving and vec<int8_t> is not convertible to int, there is no viable operator/.
  3. The compiler decided to turn the division by 2 into a bizarre sequence of instructions. That’s because x86 doesn’t have 8-bit integer vector shifts.
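For comparison, this is what the scalar language does, together with the register arithmetic behind the first point. It is a plain C++ sketch, not std::simd code. bytes_for and zmm_bytes are names invented for the illustration, and the 64-byte register size assumes the AVX-512 target used above.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Scalar C++: int8_t operands are integer-promoted, so the sum is an int.
static_assert(
  std::is_same_v<decltype(std::int8_t{} + std::int8_t{}), int>);

// Bytes needed to hold `lanes` elements of type T.
template <typename T>
constexpr std::size_t bytes_for(std::size_t lanes)
{
  return lanes * sizeof(T);
}

constexpr std::size_t zmm_bytes = 64; // one 512-bit register (assumption)

// 64 lanes of int8_t fill exactly one register ...
static_assert(bytes_for<std::int8_t>(64) == 1 * zmm_bytes);
// ... but the same 64 lanes promoted to 32-bit integers need four.
static_assert(bytes_for<std::int32_t>(64) == 4 * zmm_bytes);
```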

Binary Compatibility Safeguards

auto f(simd::vec<float, 8> x)
{
  return x + x;
}

With -march=x86-64-v2, this compiles to (cf. Compiler Explorer):

"f(std::simd::basic_vec<float, std::simd::_Abi<8, 2, 0ull>>)":
        movaps  xmm0, XMMWORD PTR [rsp+8]
        mov     rax, rdi
        addps   xmm0, xmm0
        movaps  XMMWORD PTR [rdi], xmm0
        movaps  xmm0, XMMWORD PTR [rsp+24]
        addps   xmm0, xmm0
        movaps  XMMWORD PTR [rdi+16], xmm0
        ret

and with -march=x86-64-v3:

"f(std::simd::basic_vec<float, std::simd::_Abi<8, 1, 0ull>>)":
        vaddps  ymm0, ymm0, ymm0
        ret

Note the different ABI tag type for basic_vec (std::simd::_Abi<8, 1, 0ull> vs. std::simd::_Abi<8, 2, 0ull>). The libstdc++ implementation uses an ABI tag that encodes

  • the number of elements (basic_vec::size());
  • the number of registers;
  • additional bits for further differences (at this point: vec-mask vs. bit-mask and interleaved vs. contiguous complex).

This is a safeguard against linking code that is not binary compatible. While this safeguard is not complete (composition can hide it), it is better than a simple <T, N>, which happily compiles and links and then does weird stuff at runtime.
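The idea can be sketched with a toy model. This is not how libstdc++ implements it: abi_tag_sketch, basic_vec_sketch, registers_for, and vec_sketch are all hypothetical names, and the register sizes (16 bytes for x86-64-v2, 32 bytes for x86-64-v3) are assumptions based on the two listings above.

```cpp
#include <type_traits>

// Toy ABI tag: element count and number of registers.
template <int Size, int Registers>
struct abi_tag_sketch {};

// Toy vector type: the tag is part of the type, hence of the mangled name.
template <typename T, typename Abi>
struct basic_vec_sketch {};

// Registers needed for `elements` values, rounded up.
constexpr int registers_for(int elements, int element_bytes, int register_bytes)
{
  return (elements * element_bytes + register_bytes - 1) / register_bytes;
}

// The tag is derived from the target's register width.
template <typename T, int N, int RegisterBytes>
using vec_sketch = basic_vec_sketch<
  T, abi_tag_sketch<N, registers_for(N, int(sizeof(T)), RegisterBytes)>>;

using v2_vec = vec_sketch<float, 8, 16>; // two 16-byte registers
using v3_vec = vec_sketch<float, 8, 32>; // one 32-byte register

static_assert(
  std::is_same_v<v2_vec, basic_vec_sketch<float, abi_tag_sketch<8, 2>>>);
static_assert(
  std::is_same_v<v3_vec, basic_vec_sketch<float, abi_tag_sketch<8, 1>>>);
static_assert(!std::is_same_v<v2_vec, v3_vec>);
```

Because parameter types are part of a function's mangled name, a function taking v2_vec and one taking v3_vec are different symbols. Mixing object files built for the two targets then fails at link time instead of misbehaving at runtime.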

The question then arises how to deploy binaries that support all kinds of ISA extensions. It is, however, a question that the C++ standard simply cannot answer. There do exist patterns and tooling support to make this work. That’s material for a different post.

Conclusion and Outlook

There are so many more examples that could be considered. Send me requests or your favorite ones — especially if you find them surprising or they optimize badly. That helps us improve the implementation.

While this post does not explain why std::simd is what it is, I hope these examples give you a taste of what’s possible. Time permitting, there will be more posts on the background and vision of std::simd. Again, send me requests and questions and I’ll look into covering them.

The journey from Vc (first free software release in 2009) to std::simd (C++26) took 17 years. The current quality of implementation (QoI) wouldn’t have been possible without everything that happened in between.

Stay tuned — and don’t judge std::simd by its experimental predecessor. The real story is just beginning.