What You'll Learn
- Measure code with high-resolution timers
- Recognize cache-friendly vs cache-hostile access patterns
- Find hotspots and fix them with the right data structure
- Use perf, gprof, VTune, callgrind
Profiling & Optimizing C++ Applications
The golden rule of optimization: measure first, optimize second. Guessing where bottlenecks are is almost always wrong. This lesson teaches you how to measure precisely, identify real hotspots, and apply targeted optimizations that actually matter.
Measuring with High-Resolution Timers
std::chrono::high_resolution_clock offers the finest tick the implementation provides — typically nanoseconds; on most standard libraries it is an alias for steady_clock or system_clock. Wrap timing in an RAII class so you never forget to stop the timer: the destructor prints the elapsed time automatically.
Pro Tip: Always benchmark with optimizations enabled (-O2 or -O3). Debug builds (-O0) are 10-100× slower and produce misleading results.
RAII Timer
Compare loop vs accumulate and sort vs stable_sort
#include <iostream>
#include <chrono>
#include <string>
#include <vector>
#include <algorithm>
#include <numeric>
#include <random>
using namespace std;

// Simple RAII timer — measures scope lifetime
class Timer {
    string label;
    chrono::high_resolution_clock::time_point start;
public:
    Timer(const string& l) : label(l), start(chrono::high_resolution_clock::now()) {}
    ~Timer() {
        auto end = chrono::high_resolution_clock::now();
        auto us = chrono::duration_cast<chrono::microseconds>(end - start).count();
        cout << label << ": " << us << " µs" << endl;
    }
};

int main() {
    vector<long long> v(1'000'000);
    iota(v.begin(), v.end(), 0);
    long long sum = 0;
    { Timer t("manual loop"); for (long long x : v) sum += x; }
    { Timer t("accumulate"); sum += accumulate(v.begin(), v.end(), 0LL); }
    shuffle(v.begin(), v.end(), mt19937{42});
    auto w = v;
    { Timer t("sort");        sort(v.begin(), v.end()); }
    { Timer t("stable_sort"); stable_sort(w.begin(), w.end()); }
    cout << "checksum: " << sum << endl;  // use the sums so the loops aren't optimized away
}

Cache-Friendly Access Patterns
Modern CPUs load memory in 64-byte cache lines. Sequential access (row-major) hits the cache; jumping across rows (column-major) causes cache misses. A cache miss can cost 100+ CPU cycles, which makes memory layout one of the biggest performance factors in data-heavy code.
Common Mistake: Using vector<vector<int>> for matrices. Each inner vector is a separate heap allocation — terrible cache locality. Use a flat vector<int> with manual indexing for performance-critical matrices.
Cache Performance
Compare row-major, column-major, and flat array traversal
#include <iostream>
#include <string>
#include <vector>
#include <chrono>
using namespace std;

class Timer {
    string label;
    chrono::high_resolution_clock::time_point start;
public:
    Timer(const string& l) : label(l), start(chrono::high_resolution_clock::now()) {}
    ~Timer() {
        auto us = chrono::duration_cast<chrono::microseconds>(
            chrono::high_resolution_clock::now() - start).count();
        cout << label << ": " << us << " µs" << endl;
    }
};

int main() {
    const int ROWS = 2000, COLS = 2000;
    vector<vector<int>> m(ROWS, vector<int>(COLS, 1));
    long long sum = 0;
    { Timer t("row-major (cache-friendly)");
      for (int r = 0; r < ROWS; ++r) for (int c = 0; c < COLS; ++c) sum += m[r][c]; }
    { Timer t("column-major (cache-hostile)");
      for (int c = 0; c < COLS; ++c) for (int r = 0; r < ROWS; ++r) sum += m[r][c]; }
    vector<int> flat(ROWS * COLS, 1);
    { Timer t("flat array, row-major");
      for (int r = 0; r < ROWS; ++r) for (int c = 0; c < COLS; ++c) sum += flat[r * COLS + c]; }
    cout << "checksum: " << sum << endl;
}

Finding Hotspots — Data Structure Choice
Often the biggest optimization is choosing the right data structure. A linear scan through a vector costs O(n) per lookup; an unordered_map lookup is O(1) on average. Profile first to find where time is spent, then fix the algorithm — not the micro-optimizations.
Hotspot Analysis
See how data structure choice dominates performance
#include <iostream>
#include <vector>
#include <string>
#include <unordered_map>
#include <chrono>
#include <algorithm>
using namespace std;

class Timer {
    string label;
    chrono::high_resolution_clock::time_point start;
public:
    Timer(const string& l) : label(l), start(chrono::high_resolution_clock::now()) {}
    ~Timer() {
        auto us = chrono::duration_cast<chrono::microseconds>(
            chrono::high_resolution_clock::now() - start).count();
        cout << label << ": " << us << " µs" << endl;
    }
};

int main() {
    const int N = 100'000;
    vector<pair<string, int>> vec;
    unordered_map<string, int> map;
    for (int i = 0; i < N; ++i) {
        vec.emplace_back("user" + to_string(i), i);
        map["user" + to_string(i)] = i;
    }
    int hits = 0;
    { Timer t("vector linear scan, 1000 lookups");
      for (int i = 0; i < 1000; ++i) {
          string key = "user" + to_string((i * 97) % N);
          hits += any_of(vec.begin(), vec.end(), [&](const auto& p) { return p.first == key; });
      } }
    { Timer t("unordered_map, 1000 lookups");
      for (int i = 0; i < 1000; ++i) hits += map.count("user" + to_string((i * 97) % N)); }
    cout << "hits: " << hits << endl;
}

Quick Reference
| Tool | Command | Best For |
|---|---|---|
| gprof | g++ -pg; ./a.out; gprof a.out | Function-level time |
| perf | perf record/report | CPU sampling |
| callgrind | valgrind --tool=callgrind | Call graphs |
| chrono | high_resolution_clock | Micro-benchmarks |
Lesson Complete!
You can now measure performance accurately, identify real bottlenecks, and apply data-driven optimizations.