What You'll Learn
- Cache lines and memory access patterns
- Branch prediction and branchless code
- AoS vs SoA data layout
- Hot/cold data splitting
Writing High-Performance C++ Code
Performance in C++ is dominated by three hardware realities: cache locality (are you reading sequential memory?), branch prediction (are your conditionals predictable?), and data layout (is your struct organized for how you access it?). Master these and your code runs 2-10× faster without algorithmic changes.
Cache Lines — Sequential Access Wins
CPUs don't load individual bytes — they load 64-byte cache lines. Sequential access uses every byte of each line. Strided access wastes most of it. The difference between L1 cache (1ns) and RAM (100ns) is 100×.
Pro Tip: Use perf stat -e cache-misses,cache-references to measure your cache miss rate. Below 5% is good; above 20% means your data layout needs work.
Cache Lines
Compare stride-1 vs stride-16 memory access
#include <iostream>
#include <vector>
#include <chrono>
using namespace std;
class Timer {
    string label;
    chrono::high_resolution_clock::time_point start;
public:
    Timer(const string& l) : label(l), start(chrono::high_resolution_clock::now()) {}
    ~Timer() {
        auto us = chrono::duration_cast<chrono::microseconds>(
            chrono::high_resolution_clock::now() - start).count();
        cout << label << ": " << us << " µs" << endl;
    }
};
int main() {
    const int N = 1 << 24;  // 16M ints (~64 MB); the exact size is illustrative
    vector<int> data(N, 1);
    long long sum = 0;
    {
        Timer t("stride-1 (sequential)");
        for (int i = 0; i < N; ++i) sum += data[i];
    }
    {
        Timer t("stride-16 (uses 4 of 64 bytes per line)");
        for (int s = 0; s < 16; ++s)
            for (int i = s; i < N; i += 16) sum += data[i];
    }
    cout << "checksum: " << sum << endl;  // keeps the loops from being optimized away
    return 0;
}
Branch Prediction — Predictable Code is Fast Code
The CPU predicts which way an if branch will go before it evaluates the condition. When the prediction is right (~97% on sorted data), execution is nearly free. When it's wrong (~50% on random data), the CPU flushes 15-20 cycles of work. Branchless code avoids the penalty entirely.
Common Mistake: Optimizing for branch prediction before profiling. Not all branches are hot. Use perf stat -e branch-misses to find the ones that actually matter.
Branch Prediction
See how data sorting affects conditional performance
#include <iostream>
#include <vector>
#include <algorithm>
#include <chrono>
#include <random>
using namespace std;
class Timer {
    string label;
    chrono::high_resolution_clock::time_point start;
public:
    Timer(const string& l) : label(l), start(chrono::high_resolution_clock::now()) {}
    ~Timer() {
        auto us = chrono::duration_cast<chrono::microseconds>(
            chrono::high_resolution_clock::now() - start).count();
        cout << label << ": " << us << " µs" << endl;
    }
};
int main() {
    const int N = 1 << 20;  // element count is illustrative
    vector<int> data(N);
    mt19937 rng(42);
    uniform_int_distribution<int> dist(0, 255);
    for (auto& x : data) x = dist(rng);
    long long sum = 0;
    {
        Timer t("unsorted (unpredictable branch)");
        for (int x : data) if (x >= 128) sum += x;
    }
    sort(data.begin(), data.end());
    {
        Timer t("sorted (predictable branch)");
        for (int x : data) if (x >= 128) sum += x;
    }
    cout << "checksum: " << sum << endl;
    return 0;
}
Data Layout — AoS vs SoA
Array of Structures (AoS) groups all fields of one object together. Structure of Arrays (SoA) groups the same field across all objects. For batch processing (update all positions), SoA is faster because each cache line contains only the data you need.
AoS vs SoA
Compare data layouts for particle simulation
#include <iostream>
#include <vector>
#include <chrono>
using namespace std;
class Timer {
    string label;
    chrono::high_resolution_clock::time_point start;
public:
    Timer(const string& l) : label(l), start(chrono::high_resolution_clock::now()) {}
    ~Timer() {
        auto us = chrono::duration_cast<chrono::microseconds>(
            chrono::high_resolution_clock::now() - start).count();
        cout << label << ": " << us << " µs" << endl;
    }
};
// Array of Structures (AoS) — complete particles stored contiguously
struct ParticleAoS { float x, y, z, vx, vy, vz; };
// Structure of Arrays (SoA) — each field in its own contiguous array
struct ParticlesSoA { vector<float> x, y, z, vx, vy, vz; };
int main() {
    const int N = 1 << 20;  // particle count is illustrative
    vector<ParticleAoS> aos(N);
    ParticlesSoA soa{vector<float>(N), vector<float>(N), vector<float>(N),
                     vector<float>(N), vector<float>(N), vector<float>(N)};
    {
        Timer t("AoS x-update");
        for (auto& p : aos) p.x += 1.0f;  // drags all 24 bytes of each particle into cache
    }
    {
        Timer t("SoA x-update");
        for (int i = 0; i < N; ++i) soa.x[i] += 1.0f;  // cache lines hold only x values
    }
    cout << aos[0].x + soa.x[0] << endl;  // keeps the updates from being optimized away
    return 0;
}
Quick Reference
| Technique | Speedup |
|---|---|
| Sequential access | 2-10× vs random |
| Sorted branch data | 2-5× vs unsorted |
| Branchless code | 1.5-3× for hot loops |
| SoA layout | 1.5-4× for batch ops |
| Hot/cold splitting | 1.5-2× cache efficiency |
Lesson Complete!
You now understand cache lines, branch prediction, and data layout — the three pillars of high-performance C++.