Lesson 18 • Advanced
Profiling & Benchmarking
By the end of this lesson you'll be able to time C++ code accurately with std::chrono, avoid the traps that make micro-benchmarks lie, reach for the right profiler to find real hotspots, and reason about whether an optimisation is even worth doing — so you optimise what matters instead of guessing.
What You'll Learn
- Live by the golden rule: measure, don't guess
- Time code with std::chrono::high_resolution_clock
- Dodge micro-benchmark pitfalls (code optimised away, cold cache)
- Pick the right tool: perf, Valgrind/Callgrind, gprof, Google Benchmark
- Read a profile and find the true hotspot
- Apply Amdahl's law and weigh big-O against constant factors
std::vector, and how the compiler optimises (see Advanced Debugging). All the timing examples here are runnable — but remember real benchmarks should be built with -O2 or -O3, never a debug build.💡 Real-World Analogy
Optimising without profiling is like a doctor prescribing surgery before running any tests. You feel the problem is the heart, so you operate on the heart — when the real issue was a vitamin deficiency. A profiler is the diagnostic scan: it shows you exactly which part of the body (your code) is consuming the time. Only after the scan do you pick the treatment. Programmers who skip the scan spend days "fixing" a function that runs for 0.1% of the total time, while the real culprit sits untouched. Measure first, then operate.
1. Measure, Don't Guess — Timing with std::chrono
The single most expensive habit in performance work is guessing. Your intuition about which line is slow is almost always wrong, because the CPU's caches, branch predictor, and the optimiser reshape your code in ways you can't see. The cure is to measure. The simplest tool is std::chrono: take a timestamp before the work, another after, and subtract. high_resolution_clock is the finest-grained clock the standard library offers — perfect for small benchmarks.
The pattern is always the same: auto start = Clock::now(); → do the work → auto end = Clock::now(); → duration_cast the gap into the unit you want (microseconds here). Read the worked example, run it, and notice it checks both answers match before comparing speed — a fast wrong answer is worthless.
Worked example: time two approaches with chrono
Read every comment, run it, and watch how the same work is timed two ways.
#include <iostream>
#include <chrono>
#include <vector>
#include <numeric> // accumulate, iota
using namespace std;
// chrono is the standard timing toolkit. high_resolution_clock is the
// finest-grained clock available, good for micro-benchmarks.
using Clock = chrono::high_resolution_clock;
// APPROACH A: sum with a hand-written loop.
long long sumLoop(const vector<int>& v) {
long long total = 0;
for (int x : v) total += x; // visit every element once -> O(n)
return total;
}
...Your turn. The program below times how long it takes to build a big vector — fill in the two timestamp blanks and the conversion, then run it.
🎯 Your turn: time a loop with chrono
Fill in the ___ blanks, then check your output against the expected lines.
#include <iostream>
#include <chrono>
#include <vector>
using namespace std;
using Clock = chrono::high_resolution_clock;
int main() {
// 🎯 YOUR TURN — time how long it takes to build a big vector.
// Replace each ___ then press "Try it Yourself".
const int N = 500000;
vector<long long> v;
// 1) Take a timestamp BEFORE the work
auto start = ___; // 👉 Clock::now()
long long sum = 0;
for (int i = 1; i <= N; i++) { v.push_back(i); sum += i; }
// 2)
...2. Micro-Benchmark Pitfalls
A micro-benchmark times a tiny isolated snippet — and tiny snippets lie more than any other kind of measurement. The most infamous trap: if you compute a result and never use it, an optimising compiler is allowed to delete the entire calculation. Your benchmark then reports near-zero time and you celebrate a speed-up that never happened. The fix is to make the result observable — print it, or store it in a volatile so the compiler must actually do the work.
Worked example: when the optimiser deletes your benchmark
See an unused result get elided, then forced to survive with volatile.
#include <iostream>
#include <chrono>
using namespace std;
using Clock = chrono::high_resolution_clock;
int main() {
const int N = 100000000;
// PITFALL: this loop computes a sum that is NEVER USED.
// With -O2 the optimiser deletes the whole loop -> ~0 us. The
// "blazing fast" result is an illusion: nothing actually ran.
{
auto start = Clock::now();
long long sum = 0;
for (int i = 0; i < N; i++) sum += i; // result discarded
auto end = C
...The other three micro-benchmark traps
- Cold cache: the first run pays to pull data into the CPU cache; later runs are warm and far faster. Add a warm-up pass you don't time, then measure.
- One sample: a single run is noise. Run many times and report the median or minimum, not one figure.
- Debug build: timing a
-O0build tells you nothing about release performance. Always benchmark with-O2/-O3.
3. Profilers — Find the Real Hotspot
chrono answers "how long does this take?". A profiler answers the more important question: "across my whole program, where is the time going?". It runs your program and reports a breakdown by function. A hotspot is a function with a large share of the total — that's where optimisation pays off. There are two flavours: sampling profilers (like perf) interrupt the program thousands of times a second and record where it is, giving low overhead and realistic numbers; instrumenting profilers (like Callgrind) count every call exactly, giving precise call graphs but running much slower.
🔧 The toolbox at a glance
perf (Linux): the go-to sampling profiler. perf record ./app then perf report shows a sorted list of hotspots with almost no slowdown. First choice for real workloads.
Valgrind / Callgrind: valgrind --tool=callgrind ./app gives an exact call graph and per-line instruction counts. Visualise it with kcachegrind. Slow (10–50×) but deterministic — great for understanding why a function is hot.
gprof: the classic. Compile with g++ -pg, run, then gprof a.out gmon.out. Simple function-level timings; dated, but everywhere.
Google Benchmark: a micro-benchmark library, not a profiler. It runs each case enough times for stable numbers, handles warm-up, and offers benchmark::DoNotOptimize(x) to defeat exactly the "optimised away" trap from Section 2. Reach for it when you're comparing two implementations carefully.
Whatever the tool, reading a profile is the same skill: look at the top of the sorted list, ignore the long tail. A function eating 60% of the run time is your target; a hundred functions at 0.2% each are not worth touching. This is where your effort multiplies.
4. Is It Even Worth It? Amdahl's Law & Big-O
Before you optimise, ask what the payoff can be. Amdahl's law says your overall speed-up is capped by the part you don't improve. If a function is 90% of the run time, making it infinitely fast still only makes the whole program 10× faster — and optimising a 10% slice can never beat 1.11× no matter what. This is the maths behind "optimise the biggest slice the profiler shows you."
Worked example: Amdahl's law ceiling
See how the un-optimised fraction caps your total speed-up.
#include <iostream>
#include <iomanip>
using namespace std;
// Amdahl's law: if a fraction p of the program can be sped up by factor s,
// the WHOLE program speeds up by only: 1 / ((1 - p) + p / s)
// The (1 - p) part you did NOT optimise puts a hard ceiling on your gains.
double amdahl(double p, double s) {
return 1.0 / ((1.0 - p) + p / s);
}
int main() {
cout << fixed << setprecision(2);
// Suppose a function is 90% of total run time (p = 0.90).
// Even if you make THAT fun
...The other lens is big-O vs constant factors. Big-O tells you how cost grows as the input grows — O(n) doubles when the input doubles, O(n²) quadruples. It is the most reliable performance lever you have, because a better big-O wins by ever-larger margins as data scales. But big-O hides the constant factor: for small inputs a "worse" algorithm with a tiny constant can win. Below, replace the blank so you can watch an O(1) hash lookup demolish an O(n) linear scan as soon as the data is non-trivial.
🎯 Your turn: O(n) scan vs O(1) hash lookup
Fill in the set lookup, then compare the two timings.
#include <iostream>
#include <chrono>
#include <vector>
#include <unordered_set>
using namespace std;
using Clock = chrono::high_resolution_clock;
int main() {
// 🎯 YOUR TURN — big-O vs constant factors.
// Looking a value up in a vector is O(n) (scan every element).
// Looking it up in an unordered_set is O(1) average (hash jump).
const int N = 20000;
vector<int> vec;
unordered_set<int> set;
for (int i = 0; i < N; i++) { vec.push_back(i); set.insert(i); }
int
...Pro Tips
- 💡 Profile a release build: measure
-O2/-O3code. Debug timings are meaningless for production. - 💡 Fix the algorithm before the line: a better big-O beats any amount of micro-tuning. Change the data structure first.
- 💡 Report the median, not the mean: one slow outlier (a context switch) skews the average; the median ignores it.
- 💡 Defeat the optimiser deliberately: use
volatileorbenchmark::DoNotOptimizeso the work you're timing actually runs.
Common Errors (and the fix)
- Optimising without measuring: you rewrite a "slow-looking" loop and the program is no faster — because that loop was never the hotspot. Fix: run a profiler (
perf record) first and optimise only the top entries. - Benchmark optimised away (reports 0 us): the result is unused so the compiler deleted the work. Fix: make it observable — print it, store it in a
volatile, or usebenchmark::DoNotOptimize(result). - Cold-cache skew: the first timed run is far slower than the rest. Fix: run an untimed warm-up pass, then time several runs and take the median.
- Timing a debug build: a
-O0build is 10–100× slower and reorders nothing. Fix: always benchmark and profile a-O2/-O3build. - Chasing the wrong slice: you speed up a function that is 5% of run time by 50× and the program barely moves (Amdahl's law). Fix: spend effort on the biggest slice the profile shows.
📋 Quick Reference — Tools
| Tool | Command | Best for |
|---|---|---|
| std::chrono | high_resolution_clock::now() | Timing one snippet |
| perf | perf record ./app; perf report | Low-overhead hotspots |
| callgrind | valgrind --tool=callgrind ./app | Exact call graphs |
| gprof | g++ -pg; ./a.out; gprof a.out | Function-level time |
| Google Benchmark | BENCHMARK(fn); DoNotOptimize(x) | Careful micro-benchmarks |
| kcachegrind | kcachegrind callgrind.out | Visualising a profile |
Frequently Asked Questions
Q: Why is everyone so insistent on 'measure, don't guess'?
Because human intuition about performance is almost always wrong. The line you think is slow is rarely the one the CPU spends time on — branch prediction, caches, and the optimiser change everything. A profiler tells you the truth in seconds; a guess can send you optimising code that runs 0.1% of the time.
Q: What is the difference between profiling and benchmarking?
Benchmarking measures how long a specific piece of code takes (the 'how fast?' number). Profiling measures where a whole program spends its time across all its functions (the 'which part is slow?' map). You benchmark a candidate fix; you profile to find what to fix in the first place.
Q: My timing numbers jump around between runs. Is something broken?
No — that is normal. The OS scheduler, other processes, CPU frequency scaling, and a cold cache all add noise. Run each benchmark several times, warm up the cache first, and report the median (or minimum) rather than a single run. For serious work use a library like Google Benchmark that handles this for you.
Q: Should I always pick the lower big-O algorithm?
Usually, but not blindly. Big-O describes how cost grows as input grows; it hides the constant factor. For small inputs an O(n^2) routine with a tiny constant can beat an O(n log n) one with a big constant. Measure at your real input size — big-O tells you what wins eventually, the profiler tells you what wins today.
Q: Why did my benchmark report 0 microseconds?
The optimiser almost certainly deleted your code. If a computed result is never used, the compiler is allowed to remove the whole calculation. Force the result to be observed — print it, accumulate it into a volatile, or feed it to a do-not-optimise sink — so the work cannot be elided.
Mini-Challenge: Benchmark Two String Builders
No blanks this time — just a brief and a blank canvas (with an outline to keep you on track). Time two ways of building a big string, then prove to yourself which one wins and by how much. Remember to print a result so the compiler can't delete your benchmark.
🎯 Mini-Challenge: time string concat vs append
Time both approaches with chrono and compare the microseconds.
#include <iostream>
#include <chrono>
#include <string>
using namespace std;
using Clock = chrono::high_resolution_clock;
int main() {
// 🎯 MINI-CHALLENGE: benchmark string building
// Compare two ways to build one big string of 100000 "x" characters.
//
// 1. APPROACH A — naive concatenation in a loop:
// string a;
// for (...) a = a + "x"; // each += may COPY the whole string
// Time it with Clock::now() before and after.
//
// 2. APPROAC
...🎉 Lesson Complete
- ✅ The golden rule: measure, don't guess — your intuition about speed is usually wrong
- ✅ Time code with
Clock::now()before and after, thenduration_castthe gap - ✅ Micro-benchmarks lie: code can be optimised away, a cold cache skews the first run, one sample is noise
- ✅ Profilers find hotspots:
perf(sampling), Callgrind (exact), gprof (classic), Google Benchmark (micro) - ✅ Read a profile by the top of the list; ignore the long tail
- ✅ Amdahl's law caps your gains; a better big-O beats constant-factor tuning at scale
- ✅ Next lesson: Undefined Behavior — the bugs the optimiser is allowed to make worse
Sign up for free to track which lessons you've completed and get learning reminders.