Courses/C++/High-Performance Code

Lesson 18 • Advanced

Writing High-Performance C++ Code

By the end of this lesson you'll be able to make C++ run several times faster the right way: measure first, lay your data out so the CPU cache loves it, stop copying things you don't need to, and turn on the compiler optimisations that do the heavy lifting for you.

What You'll Learn

Measure before optimising with a simple chrono timer
Choose contiguous std::vector over node-based containers for cache locality
Avoid needless copies with const&, move, and pass-by-value rules
Use reserve() to remove repeated vector reallocations
Understand inlining, branch prediction, and -O2 / -O3
Lay data out as SoA vs AoS for batch processing

Before you start: you should be comfortable with std::vector, functions, and references (const T&). This lesson is about making correct code fast, so you'll get the most from it once you can already write programs that work.

💡 Real-World Analogy

Think of the CPU as a chef and memory as a giant warehouse. The chef keeps a tiny counter (the cache) right next to the stove. When ingredients sit together on one shelf (a contiguous std::vector), an assistant grabs a whole tray at once and the chef never waits. When ingredients are scattered all over the warehouse (a node-based std::list), the assistant runs back and forth for each one — that round trip to RAM is roughly 100× slower than reaching the counter. Most "slow" C++ isn't slow because of clever maths; it's slow because the chef is standing around waiting for memory. Performance work is mostly about keeping the right things on the counter, and about not copying trays you only need to read.

1. Measure First — Never Guess

The golden rule of optimisation is measure, don't guess. Programmers are notoriously bad at predicting where time goes — most of a program's runtime hides in a tiny slice of the code, and the rest is irrelevant. So before you change anything, put a clock around the work and get a real number. A std::chrono timer is all you need to compare two versions of the same task.

Worked example: time a loop with std::chrono

Read every comment, run it, and note the microseconds it reports.

Try it Yourself »

C++

#include <iostream>
#include <vector>
#include <chrono>
using namespace std;

int main() {
    // RULE #1 of performance: MEASURE, do not guess.
    // A tiny timer lets you compare two versions of the SAME work.
    const int N = 1000000;
    vector<int> data(N, 1);   // a million 1s

    // Record the clock, do the work, record it again.
    auto start = chrono::high_resolution_clock::now();

    long long sum = 0;
    for (int i = 0; i < N; i++) sum += data[i];  // the work we time

    auto 
...

Once you can measure, you can optimise honestly: change one thing, re-run, and keep the change only if the number actually drops. Everything below follows the same pattern — a slow "before" and a fast "after", with a timer proving the difference.

2. Cache Locality — Keep Data Contiguous

The CPU never fetches one byte at a time. It pulls a cache line — usually 64 bytes — into a small, blazing-fast cache. A std::vector stores its elements in one contiguous block, so walking it reads sequential bytes the CPU can prefetch; this is called data-oriented design — laying memory out for how you'll read it. A node-based container like std::list scatters each element somewhere different, so every step risks a cache miss — a trip to RAM that's about 100× slower. The lesson: prefer contiguous containers unless you genuinely need constant-time insertion in the middle.

Worked example: vector vs list traversal

Same sum, very different speed — contiguous memory wins.

Try it Yourself »

C++

#include <iostream>
#include <vector>
#include <list>
#include <chrono>
using namespace std;

// Time a block of work and print how long it took, in microseconds.
template <typename Func>
long long timeIt(Func work) {
    auto start = chrono::high_resolution_clock::now();
    work();
    auto end = chrono::high_resolution_clock::now();
    return chrono::duration_cast<chrono::microseconds>(end - start).count();
}

int main() {
    const int N = 1000000;

    // vector = ONE contiguous block -> t
...

3. Avoid Unnecessary Copies

A huge amount of accidental slowness comes from copying data you only meant to read. Passing a big std::vector or std::string by value duplicates the whole thing on every call. The fix is to pass large, read-only arguments by const reference (const T&) — the function gets a read-only alias to the original and copies nothing. Pass small types (int, double, a pointer) by value; pass big ones you won't modify by const&. This is the single highest-value habit in everyday C++ performance.

Worked example: by value (slow) vs const& (fast)

The before/after optimisation — same result, far less work.

Try it Yourself »

C++

#include <iostream>
#include <vector>
#include <string>
#include <chrono>
using namespace std;

template <typename Func>
long long timeIt(Func work) {
    auto s = chrono::high_resolution_clock::now();
    work();
    auto e = chrono::high_resolution_clock::now();
    return chrono::duration_cast<chrono::microseconds>(e - s).count();
}

// ❌ BEFORE: takes the vector BY VALUE -> copies all N strings on EVERY call.
long long countLongBefore(vector<string> words) {  // <-- copy!
    long long n = 0
...

Your turn. The function below copies the whole vector — and every string inside it — on every call. Change the parameter to a const reference and the loop item to a const string& so nothing is copied. Fill in the two blanks marked ___:

🎯 Your turn: pass by const reference

Replace the ___ blanks so the function copies nothing, then check the output.

Try it Yourself »

C++

#include <iostream>
#include <vector>
#include <string>
using namespace std;

// 🎯 YOUR TURN — this function copies the whole vector on every call.
//    Make it take the data BY CONST REFERENCE so nothing is copied.

// 1) Change the parameter type from a copy to a const reference
long long totalLength(___ names) {   // 👉 const vector<string>& names
    long long total = 0;

    // 2) Loop WITHOUT copying each string: use a const reference item
    for (___ name : names)           // 👉 const
...

4. reserve() — Stop Reallocating

When you push_back into a growing std::vector, it occasionally runs out of room, allocates a bigger block, and copies every element it already holds into the new block. Do that as the vector grows from 1 to a million and you've copied a lot of data many times over. If you know roughly how many items you'll add, call vec.reserve(n) once up front: the vector grabs the whole block immediately and never re-copies. It's one line and it's basically free speed.

Worked example: reserve() removes re-copies

Build the same vector with and without reserve() and compare.

Try it Yourself »

C++

#include <iostream>
#include <vector>
#include <chrono>
using namespace std;

template <typename Func>
long long timeIt(Func work) {
    auto s = chrono::high_resolution_clock::now();
    work();
    auto e = chrono::high_resolution_clock::now();
    return chrono::duration_cast<chrono::microseconds>(e - s).count();
}

int main() {
    const int N = 2000000;

    // ❌ No reserve: the vector grows, reallocates, and COPIES every
    //    element it already holds, again and again as it expands.
  
...

Now you try. The loop below knows it will add N items, so it should reserve that space first. Fill in the blank with the right call:

🎯 Your turn: reserve before the loop

Add the reserve() call so the build loop allocates only once.

Try it Yourself »

C++

#include <iostream>
#include <vector>
using namespace std;

int main() {
    const int N = 1000;
    vector<int> squares;

    // 🎯 YOUR TURN — you already KNOW you will add N items.
    // 1) Reserve room for all of them up front (one allocation, no re-copies)
    squares.___;                 // 👉 reserve(N)

    // 2) Fill the vector — this part already works
    for (int i = 1; i <= N; i++)
        squares.push_back(i * i);

    cout << "Count: " << squares.size() << endl;       // Count: 1
...

5. Inlining, Move & Compiler Flags

Some speed isn't yours to win by hand — it belongs to the compiler. Inlining pastes a small function's body straight into the caller, removing the call overhead; at -O2 the compiler does this for hot functions automatically, so you rarely write inline for speed. Move semantics (std::move) hand ownership of a big buffer across instead of copying it. And the biggest lever of all is the optimisation flag: compile a release build with -O2 and the same source can run several times faster than an unoptimised -O0 build.

Worked example: inline, std::move, and -O2

See inlining, a move that empties the source, and the build flag.

Try it Yourself »

C++

#include <iostream>
#include <vector>
#include <string>
using namespace std;

// A tiny "hot" function. At -O2 the compiler INLINES it for you:
// it pastes the body into the caller, removing the call overhead.
// (You almost never need to write 'inline' yourself for speed.)
inline int square(int x) { return x * x; }

vector<int> makeData(int n) {
    vector<int> v;
    v.reserve(n);
    for (int i = 0; i < n; i++) v.push_back(square(i));
    return v;   // returned by MOVE (no copy) thanks to t
...

6. Branch Prediction (Briefly)

Modern CPUs guess which way an if will go before they know the answer, so they can keep working. Guess right (predictable data) and the branch is nearly free; guess wrong (random data) and the CPU throws away ~15–20 cycles of work. You usually don't micro-manage this, but two practical takeaways hold: processing sorted data makes branches predictable, and a branchless expression like sum += (x >= 128) * x; avoids the penalty entirely in a hot loop. As always — measure before reaching for either.

🔎 Deep Dive: SoA vs AoS

Array of Structures (AoS) groups all the fields of one object together: vector<Particle> where each Particle holds x, y, z, mass, type. It reads naturally, but if a loop only needs x, every cache line also drags in y, z, mass and type — mostly wasted.

Structure of Arrays (SoA) flips it: one array per field — vector<float> x, y, z;. Now a loop over all x values reads perfectly sequential memory, so the whole cache line is useful. SoA shines for batch processing (update every particle's position), which is why game engines and simulations lean on it.

// AoS — good when you touch ALL fields of one object
struct Particle { float x, y, z, mass; };
vector<Particle> aos(N);

// SoA — good when you touch ONE field across all objects
struct Particles { vector<float> x, y, z, mass; };
Particles soa;  // looping over soa.x is pure sequential access

Rule of thumb: AoS when you use most fields of one object at a time; SoA when you sweep one field across everything.

Pro Tips

💡 Profile, don't guess: tools like perf or a chrono timer show the real hot spot; it's rarely where you expected.
💡 Default to std::vector: contiguous memory makes it the fastest container for most workloads — reach for list/map only with a measured reason.
💡 Pass big read-only args by const&: the cheapest performance habit there is — no copy, no surprises.
💡 Always build releases with -O2: an unoptimised binary can be several times slower for free reasons.

Common Errors (and the fix)

Premature optimisation: rewriting code to be "fast" before measuring usually just makes it harder to read with no real gain. Get it correct, profile it, then optimise the one hot spot the numbers point to.
Accidental copies: void f(vector<int> v) copies the whole vector on every call. Take it by const reference — void f(const vector<int>& v) — when you only read it.
Per-item copies in a loop: for (string s : words) copies every string. Use for (const string& s : words) to read them in place.
Forgetting reserve(): building a big vector with push_back and no reserve reallocates and re-copies repeatedly. Call v.reserve(n); when you know the size.
Benchmarking a debug build: timing an -O0 binary tells you almost nothing about release speed. Always measure with -O2 (or -O3) on.

📋 Quick Reference

Goal	Do this	Why
Find the slow part	chrono / perf	Measure, don't guess
Cache locality	std::vector	Contiguous = fast
Read big arg	const T& arg	No copy
Loop a container	for (const T& x : c)	No per-item copy
Build a vector	v.reserve(n);	One allocation
Transfer ownership	std::move(x)	Steal, don't copy
Release build	g++ -O2 file.cpp	Compiler optimises
Batch one field	SoA layout	Sequential reads

Frequently Asked Questions

Q: Should I optimise my code from the start?

No. Write clear, correct code first, then measure where it is actually slow with a profiler or a timer, and optimise only that part. Most of a program's time is spent in a tiny fraction of the code, so guessing wastes effort and makes the rest harder to read. This is the rule 'measure, do not guess'.

Q: Why is std::vector usually faster than std::list?

A std::vector stores its elements in one contiguous block of memory, so walking through it reads sequential bytes that the CPU can prefetch and cache. A std::list scatters each node anywhere in memory, so every step is a cache miss that can cost 100x more than a cache hit. Unless you do a lot of inserting and removing in the middle, prefer vector.

Q: What is the difference between passing by value and by const reference?

Passing by value copies the whole argument into the function — cheap for an int, expensive for a big std::vector or std::string. Passing by const reference (const T&) hands the function a read-only alias to the original, so nothing is copied. Pass small types by value and large read-only types by const reference.

Q: What does reserve() actually do?

vec.reserve(n) tells the vector to allocate room for n elements up front. Without it, a growing vector reallocates and copies every element several times as it expands. If you know roughly how many items you will add, reserving once removes all of those repeated copies.

Q: Do I need to write 'inline' to make functions fast?

Almost never. Modern compilers decide what to inline on their own, and at -O2 they inline small hot functions for you. The inline keyword today is mainly about allowing a definition in a header without breaking the one-definition rule. Focus on -O2, good data layout, and fewer copies instead.

Q: What is the difference between -O2 and -O3?

-O2 turns on the standard, well-tested optimisations and is the safe default for release builds. -O3 adds more aggressive ones such as extra vectorisation and loop unrolling, which sometimes help and sometimes make code bigger and slower. Build with -O2, then measure -O3 to see if it actually wins for your program.

Mini-Challenge: Prove Your Optimisation

No blanks this time — just a brief and an outline. Build a word counter that takes its data by const&, time it with std::chrono, and print the count and the microseconds. The point isn't only to make it fast — it's to prove it with a number, exactly like a real performance engineer.

🎯 Mini-Challenge: optimise and measure

Write the const& counter, time it, and report the result.

Try it Yourself »

C++

#include <iostream>
#include <vector>
#include <string>
#include <chrono>
using namespace std;

int main() {
    // 🎯 MINI-CHALLENGE: optimise a word counter and prove it is faster
    //
    // 1. Make a vector<string> with ~100000 copies of some word.
    // 2. Write a function that counts words longer than 4 letters,
    //    taking the vector BY CONST REFERENCE (const vector<string>&)
    //    and looping with a const string& item (no per-item copies).
    // 3. Time it with chrono::high_
...

🎉 Lesson Complete

✅ Measure first with a std::chrono timer — never guess where the time goes
✅ Prefer contiguous std::vector over node-based containers for cache locality
✅ Pass big read-only arguments by const&; loop with const T& to avoid copies
✅ Call reserve(n) to remove repeated reallocations; use std::move to transfer instead of copy
✅ Let -O2 handle inlining and the rest; only chase branch prediction once you've measured
✅ Lay data out as SoA for batch field sweeps, AoS for whole-object access
✅ Next lesson: Data Structures — choose the right container for each job

Writing High-Performance C++ Code

What You'll Learn

💡 Real-World Analogy

1. Measure First — Never Guess

Worked example: time a loop with std::chrono

2. Cache Locality — Keep Data Contiguous

Worked example: vector vs list traversal

3. Avoid Unnecessary Copies

Worked example: by value (slow) vs const& (fast)

🎯 Your turn: pass by const reference

4. reserve() — Stop Reallocating

Worked example: reserve() removes re-copies

🎯 Your turn: reserve before the loop

5. Inlining, Move & Compiler Flags

Worked example: inline, std::move, and -O2

6. Branch Prediction (Briefly)

🔎 Deep Dive: SoA vs AoS

Pro Tips

Common Errors (and the fix)

📋 Quick Reference

Frequently Asked Questions

Mini-Challenge: Prove Your Optimisation

🎯 Mini-Challenge: optimise and measure

🎉 Lesson Complete

Cookie & Privacy Settings