Skip to main content
    Courses/C++/Inline Assembly

    Advanced

    Inline Assembly & Low-Level Intrinsics

    By the end of this lesson you'll be able to read a basic GCC/Clang inline-assembly statement, explain its output, input, and clobber operands, and — far more importantly — know why compiler intrinsics and the optimizer are almost always the better tool, plus the portability and safety traps that make hand-written asm a last resort.

    What You'll Learn

    • Explain what inline assembly is and the rare cases that justify it
    • Read GCC/Clang extended asm: asm volatile("..." : out : in : clobbers)
    • Identify output, input, and clobber operands and what %0/%1 mean
    • Use portable compiler intrinsics (popcount, ctz, SIMD) instead
    • Understand SIMD and why -O3 auto-vectorization usually wins
    • Avoid the big traps: clobbering, non-portability, breaking the optimizer, UB

    💡 Real-World Analogy

    Think of the compiler as a master translator who turns your C++ into flawless machine code and re-checks the whole document every time anything changes. Writing inline assembly is like grabbing the pen and scribbling a sentence in the target language yourself. Occasionally you know a word the translator's dictionary is missing — but the moment you write it, the translator stops re-checking that sentence. If you misspell a register or forget to declare what you changed, nobody catches it, and the error can hide until a different optimization level exposes it. Compiler intrinsics are the polite middle ground: you suggest the exact word, but the translator still owns the page — it stays portable and keeps getting proofread.

    1. What Inline Assembly Is (and Isn't)

    Inline assembly means writing raw CPU instructions directly inside a C++ function. On GCC and Clang you use the asm keyword; on MSVC the syntax differs again — already a portability headache. It exists for the handful of cases the language genuinely cannot express: a privileged instruction in an operating-system kernel, a brand-new CPU feature with no wrapper yet, or constant-time cryptographic code. For everything else, the compiler writes better assembly than you will. Study the worked example below — the asm is in comments, and the runnable C++ computes the identical answer.

    Worked example: the same addition, three levels deep

    Read the commented asm, then run the C++ that produces the same 42.

    Try it Yourself »
    C++
    #include <iostream>
    using namespace std;
    
    // Inline assembly = raw CPU instructions written INSIDE a C++ function.
    // GCC and Clang spell it:  asm volatile("..." : outputs : inputs : clobbers);
    //
    // Here is the SAME idea (add two numbers) written three ways, from
    // highest level to lowest. Only the C++ versions run in this editor.
    
    int main() {
        int a = 25, b = 17;
    
        // 1) Plain C++ — the optimizer turns this into one 'add' instruction.
        int high = a + b;                       // 42
    
    
    ...

    2. Extended Asm Syntax: Outputs, Inputs & Clobbers

    A GCC/Clang extended asm statement has four colon-separated parts: the instruction template, the output operands (what the asm writes), the input operands (what it reads), and the clobber list (registers and flags it trashes). Placeholders %0, %1, %2 in the template refer to those operands in order. The keyword asm volatile tells the optimizer "keep this exactly where it is — do not move or delete it". Read the anatomy below carefully; the operand model is the whole game.

    Worked example: anatomy of an extended asm statement

    Study the four colon sections and the constraint letters, then run the C++ mirror.

    Try it Yourself »
    C++
    #include <iostream>
    using namespace std;
    
    // Anatomy of a GCC/Clang EXTENDED asm statement (study, don't run):
    //
    //   asm volatile ( "template"
    //                  : output operands     // things asm WRITES
    //                  : input operands       // things asm READS
    //                  : clobbers );          // registers/flags asm TRASHES
    //
    // %0, %1, %2 ... in the template refer to operands in order.
    //
    //   int x = 5, y;
    //   asm volatile ("movl %1, %0"  // copy operand 1 into operand 0
    /
    ...

    🔎 Deep Dive: reading the colons

    The shape is always asm volatile("template" : outputs : inputs : clobbers). Each operand is a constraint string plus a C++ variable in parentheses. The constraint tells the compiler where the value can live and whether the asm reads it, writes it, or both.

    int in = 10, out;
    asm volatile ("incl %0"   // increment operand 0
                  : "=r"(out)  // OUTPUT  %0: "=" write-only, "r" any register
                  : "0"(in)    // INPUT   %1: "0" reuse %0's register, seeded with in
                  : "cc");     // CLOBBER   : "cc" = we changed the flags register
    // out == 11
    
    // Constraints you meet first:
    //   "r" any register   "m" memory   "i" immediate constant
    //   "=" written only   "+" read AND written   "0".."9" tie to that operand
    // Clobbers:  "memory" (we touched RAM)   "cc" (we touched the flags)

    Get the clobber list right and the optimizer keeps working around your block safely. Get it wrong and it assumes a register or memory is untouched, reuses it, and your program corrupts data — usually only at higher optimization levels.

    3. The Better Tool: Compiler Intrinsics

    A compiler intrinsic looks like a normal function call but compiles down to a single CPU instruction. Examples: __builtin_popcount(x) counts set bits, __builtin_ctz(x) counts trailing zeros, and C++20's <bit> header gives portable std::popcount and std::countr_zero. The crucial difference from inline asm: the optimizer understands an intrinsic, so it can fold, reorder, and schedule it, and it stays portable across compilers. The example below uses a plain-C++ popcount so it runs here, with the real intrinsic shown in comments.

    Worked example: popcount the portable way

    See the intrinsic in comments and run the equivalent plain C++.

    Try it Yourself »
    C++
    #include <iostream>
    #include <cstdint>
    using namespace std;
    
    // Compiler intrinsics: function-shaped wrappers that compile to a
    // single CPU instruction. They are the GOOD alternative to inline asm:
    // portable, and the optimizer still understands them.
    //
    // GCC/Clang spell these __builtin_*; C++20 added portable versions in
    // the <bit> header (std::popcount, std::countl_zero, ...).
    
    // popcount = how many bits are set to 1. One CPU instruction on modern
    // chips; here is a plain-C++ version 
    ...

    Your turn. The program below is almost complete — fill in the two blanks marked ___ using the hints in the comments, then run it and check the expected output.

    🎯 Your turn: count the set bits

    Fill in the ___ blanks, then check your output against the expected line.

    Try it Yourself »
    C++
    #include <iostream>
    #include <cstdint>
    using namespace std;
    
    // A real program would call std::popcount; this loop is its portable twin.
    int popcount(uint32_t x) {
        int count = 0;
        while (x) { count += x & 1; x >>= 1; }
        return count;
    }
    
    int main() {
        // 🎯 YOUR TURN — replace each ___ then press "Try it Yourself".
    
        // 1) Store the 8-bit pattern 1010 1010 (alternating bits)
        uint32_t pattern = ___;    // 👉 use 0b10101010  (binary literal)
    
        // 2) Count how many bits are se
    ...

    4. SIMD & Letting the Optimizer Win

    SIMD (Single Instruction, Multiple Data) processes several values with one instruction — SSE does 4 floats at a time, AVX does 8. You can hand-write it with intrinsics like _mm256_add_ps from <immintrin.h>, but the easiest and most portable path is to write a plain loop and compile with -O3 -march=native: modern compilers auto-vectorize it into exactly those wide instructions. The runnable loop below is one the optimizer will happily turn into AVX for you.

    Worked example: a loop the compiler will vectorize

    Read the AVX intrinsics in comments, then run the plain loop that compiles to them under -O3.

    Try it Yourself »
    C++
    #include <iostream>
    #include <vector>
    using namespace std;
    
    // SIMD = Single Instruction, Multiple Data: one instruction works on
    // several values at once. SSE handles 4 floats per step, AVX handles 8.
    //
    // Real SIMD looks like this (study, don't run — needs the right CPU):
    //   #include <immintrin.h>                  // AVX intrinsics
    //   __m256 va = _mm256_loadu_ps(a + i);     // load 8 floats
    //   __m256 vb = _mm256_loadu_ps(b + i);     // load 8 floats
    //   __m256 vc = _mm256_add_ps(va, v
    ...

    Pro Tips

    • 💡 Profile before you reach for asm: measure with a profiler and read the compiler's output (-S or Compiler Explorer) first. Most "slow" code is fixed by a better algorithm, not assembly.
    • 💡 Prefer the C++20 <bit> header: std::popcount, std::countl_zero, std::bit_ceil are portable, type-safe, and need no compiler-specific builtins.
    • 💡 Let -O3 -march=native vectorize: a clean loop the auto-vectorizer can see usually beats hand-written intrinsics and is far easier to maintain.
    • 💡 If you must write asm, always use volatile for side-effecting blocks and list every clobbered register, plus "memory" and "cc" when relevant.

    Common Errors (and the fix)

    • Clobbering registers you didn't declare: your asm writes rcx but you left it out of the clobber list. The compiler keeps using its old value and data corrupts — often only at -O2. Fix: list every changed register, plus "memory" and "cc" when you touch RAM or flags.
    • Non-portable code: asm("addl %1, %0" ...) assembles on x86 but fails or means something else on ARM or with MSVC. Fix: gate asm behind architecture checks, or — much better — replace it with an intrinsic or plain C++ that every compiler supports.
    • Breaking the optimizer: the compiler can't see inside an asm block, so it can't fold constants or reorder around it, and your "fast" hack ends up slower than the C++ it replaced. Fix: prefer intrinsics the optimizer understands; reserve asm for what truly can't be expressed otherwise.
    • Undefined behaviour from a missing volatile or wrong constraint: without volatile the optimizer may delete a side-effecting block; a "=r" on a value you actually read is a lie to the compiler. Both are UB and may "work" until you change a flag. Fix: mark side-effecting asm volatile and use "+r" for read-and-write operands.
    • "impossible constraint in 'asm'" / "operand number out of range": the template references a %2 you never supplied, or a constraint can't be satisfied. Fix: count your operands — %0 is the first output — and make every placeholder match a declared operand.

    📋 Quick Reference

    ConceptFormMeans
    Extended asmasm volatile(t : out : in : clob)four colon sections
    Output operand"=r"(x)asm writes x (a register)
    Input operand"r"(x)asm reads x
    Read+write"+r"(x)asm reads and writes x
    Clobbers: "memory", "cc"touched RAM / flags
    Portable bit opstd::popcount(x)C++20, no asm needed
    Auto-vectorize-O3 -march=nativecompiler emits SIMD

    Frequently Asked Questions

    Q: Why won't the inline assembly examples run in this editor?

    Inline assembly is raw CPU instructions tied to one architecture (x86-64, ARM, and so on). The online compiler may run on a different CPU, and many sandboxes block asm entirely for safety. That is exactly why the asm examples here are shown as commented worked examples you study, while the runnable boxes use portable intrinsics or plain C++ that produce the same result everywhere.

    Q: When is inline assembly actually justified?

    Almost never in normal application code. It is reserved for things the language cannot express: a specific privileged instruction in an OS kernel, a CPU feature with no intrinsic, constant-time cryptography, or a hot loop you have profiled and proven the compiler cannot match. If you cannot point to a profiler result and a missing intrinsic, you do not need it.

    Q: What is the difference between inline asm and a compiler intrinsic?

    An intrinsic looks like a normal function call — __builtin_popcount(x) or _mm256_add_ps(a, b) — but the compiler turns it into the matching CPU instruction. The optimizer still sees through it, can reorder it, and keeps it portable across compilers. Inline asm is an opaque block the optimizer cannot understand, so intrinsics give you the same instruction with far fewer footguns.

    Q: What does asm volatile do, and why the volatile?

    volatile tells the compiler 'do not delete or move this asm block even if its outputs look unused'. Without it the optimizer may assume the block is pure and remove it, which silently breaks side-effecting instructions. Use volatile whenever the asm reads or writes hardware, memory, or flags that the compiler cannot see.

    Q: What is a clobber list and what happens if I get it wrong?

    The clobber list is the third colon section of an extended asm statement: it names every register, plus "memory" or "cc", that your instructions modify but did not declare as an output. If you forget one, the compiler still believes its old value is valid, reuses that register, and you get corruption that often only shows up under -O2. Clobbers are how you keep your promise to the optimizer.

    Mini-Challenge: Alignment Checker

    No blanks this time — just a brief and a blank canvas (with an outline to keep you on track). Use the provided trailingZeros helper — the portable twin of the __builtin_ctz intrinsic — to decide whether an address is 16-byte aligned. Build it, run it, and check your output against the example in the comments.

    🎯 Mini-Challenge: is this address 16-byte aligned?

    Use trailingZeros(address) and report whether the count is at least 4.

    Try it Yourself »
    C++
    #include <iostream>
    #include <cstdint>
    using namespace std;
    
    // A portable stand-in for the intrinsic std::countr_zero / __builtin_ctz.
    int trailingZeros(uint32_t x) {
        if (x == 0) return 32;
        int count = 0;
        while (!(x & 1)) { count++; x >>= 1; }   // count low 0-bits
        return count;
    }
    
    int main() {
        // 🎯 MINI-CHALLENGE: alignment checker
        // Many fast routines need data aligned to 16 bytes. A value is
        // 16-byte aligned when its lowest 4 bits are zero — i.e. it has
        //
    ...

    🎉 Lesson Complete

    • ✅ Inline assembly embeds raw CPU instructions; it's a last resort, not a default
    • ✅ Extended asm has four parts: template : outputs : inputs : clobbers, with %0/%1 by position
    • asm volatile pins the block in place; the clobber list keeps your promise to the optimizer
    • Compiler intrinsics (__builtin_popcount, C++20 <bit>) give the same instruction, portably
    • ✅ For SIMD, a clean loop plus -O3 -march=native auto-vectorizes and usually wins
    • ✅ Top traps: clobbering, non-portability, breaking the optimizer, and undefined behaviour
    • Next lesson: C++ Networking — talk to other machines over sockets

    Sign up for free to track which lessons you've completed and get learning reminders.

    Previous

    Cookie & Privacy Settings

    We use cookies to improve your experience, analyze traffic, and show personalized ads. You can manage your preferences below.

    By clicking "Accept All", you consent to our use of cookies for analytics and personalized advertising. You can customize your preferences or reject non-essential cookies.

    Privacy PolicyTerms of Service