What You'll Learn
- CPU registers and basic asm syntax
- Compiler intrinsics (clz, popcount, bswap)
- SIMD concepts and auto-vectorization
- When low-level code is worth the tradeoff
Inline Assembly & Low-Level CPU Instructions
Inline assembly lets you embed CPU instructions directly in C++ code — useful for cryptography, signal processing, and performance-critical inner loops. More often, compiler intrinsics give you the same power with better portability. This lesson covers both, plus an introduction to SIMD vectorization.
Assembly Basics — Registers & Instructions
Assembly language operates on CPU registers — tiny, fast storage locations directly on the processor. x86-64 has 16 general-purpose 64-bit registers. GCC's inline asm uses the asm("instruction" : outputs : inputs : clobbers) syntax to embed instructions.
Common Mistake: Writing inline assembly when the compiler already generates optimal code. Modern compilers (GCC, Clang) with -O3 often beat hand-written assembly. Profile first.
Assembly Basics
Understand registers, flags, and inline asm syntax
#include <iostream>
using namespace std;
// Inline assembly lets you embed CPU instructions directly
// GCC/Clang use the __asm__ or asm keyword with AT&T syntax
// MSVC uses __asm with Intel syntax
// NOTE: inline assembly is architecture-specific (x86, ARM, etc.)
// This lesson shows x86-64 examples
int main() {
    // === Understanding Registers ===
    // x86-64 has 16 general-purpose 64-bit registers:
    //   rax, rbx, rcx, rdx — data registers
    //   rsi, rdi — source/destination index
    //   rbp, rsp — base/stack pointer
    //   r8–r15   — additional registers

    // Extended asm syntax: asm("instruction" : outputs : inputs : clobbers)
    // "+r" means value is both read and written, held in a register
    long value = 5;
    asm("addq $10, %0" : "+r"(value));
    cout << "5 + 10 = " << value << endl;  // 15

    return 0;
}
Compiler Intrinsics — Portable Low-Level Code
Intrinsics are functions that map directly to single CPU instructions — they look like normal C++ but compile to specific hardware operations. __builtin_popcount counts set bits, __builtin_clz counts leading zeros, and __builtin_bswap32 swaps byte order. They're portable across architectures because the compiler adapts to the target CPU.
Pro Tip: In C++20, use <bit> header for standard intrinsics: std::popcount, std::countl_zero, std::bit_ceil — fully portable and type-safe.
Compiler Intrinsics
Bit manipulation, byte swapping, and popcount
#include <iostream>
#include <cstdint>
#include <vector>
#include <chrono>
using namespace std;
// Compiler intrinsics: portable alternatives to inline assembly
// They map directly to CPU instructions but look like function calls
// Bit manipulation intrinsics
int countLeadingZeros(uint32_t x) {
if (x == 0) return 32;
// __builtin_clz(x) on GCC/Clang
// _lzcnt_u32(x) on MSVC
int count = 0;
while (!(x & 0x80000000)) { count++; x <<= 1; }
return count;
}
int countTrailingZeros(uint32_t x) {
    if (x == 0) return 32;
    // __builtin_ctz(x) on GCC/Clang
    // _tzcnt_u32(x) on MSVC
    int count = 0;
    while (!(x & 1)) { count++; x >>= 1; }
    return count;
}

int main() {
    cout << "clz(0x00FF0000) = " << countLeadingZeros(0x00FF0000) << endl;   // 8
    cout << "ctz(0x00FF0000) = " << countTrailingZeros(0x00FF0000) << endl;  // 16
    return 0;
}
SIMD — Processing Multiple Values at Once
SIMD (Single Instruction, Multiple Data) processes 4, 8, or 16 values simultaneously using wide registers. SSE uses 128-bit registers (4 floats), AVX uses 256-bit (8 floats). The easiest way to use SIMD is letting the compiler auto-vectorize with -O3 -march=native.
SIMD Concepts
Compare scalar vs unrolled vector addition
#include <iostream>
#include <vector>
#include <chrono>
#include <numeric>
using namespace std;
// SIMD — Single Instruction, Multiple Data
// Process 4, 8, or 16 values simultaneously
// SSE: 128-bit (4 floats), AVX: 256-bit (8 floats), AVX-512: 512-bit
// Scalar addition — one element at a time
void addScalar(const float* a, const float* b, float* out, int n) {
for (int i = 0; i < n; i++)
out[i] = a[i] + b[i];
}
// Simulated SIMD — process 4 elements at a time (loop unrolling)
// Real SIMD uses intrinsics like _mm_add_ps (SSE) or _mm256_add_ps (AVX)
void addUnrolled(const float* a, const float* b, float* out, int n) {
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        out[i]     = a[i]     + b[i];
        out[i + 1] = a[i + 1] + b[i + 1];
        out[i + 2] = a[i + 2] + b[i + 2];
        out[i + 3] = a[i + 3] + b[i + 3];
    }
    for (; i < n; i++)  // handle the leftover elements
        out[i] = a[i] + b[i];
}

int main() {
    vector<float> a(8, 1.0f), b(8, 2.0f), out(8);
    addScalar(a.data(), b.data(), out.data(), 8);
    addUnrolled(a.data(), b.data(), out.data(), 8);
    cout << "out[0] = " << out[0] << endl;  // 3
    return 0;
}
Quick Reference
| Approach | Portability | Use When |
|---|---|---|
| Inline asm | Architecture-specific | Crypto, OS kernels |
| Intrinsics | Compiler-portable | Bit ops, SIMD |
| C++20 <bit> | Fully portable | Bit manipulation |
| Auto-vectorization | Automatic | Hot loops (compile with -O3 -march=native) |
Lesson Complete!
You now understand inline assembly, compiler intrinsics, and SIMD — the tools for maximum C++ performance.