Lesson 46 • Expert
Performance Profiling
Stop guessing where your Java program is slow. Learn to measure it — with JMH benchmarks, JFR and async-profiler recordings, and flame graphs — then fix the one hotspot that actually matters.
What You'll Learn in This Lesson
- ✓Write correct JMH microbenchmarks and avoid their pitfalls
- ✓Profile CPU and allocations with JFR and async-profiler
- ✓Read a flame graph to find the real hotspot
- ✓Capture and reason about GC behaviour, and tune it sensibly
- ✓Follow the measure → profile → fix → verify loop
- ✓Spot Java traps: autoboxing, string concat in loops, premature optimisation
Before You Start
This lesson builds on Memory Management & GC and JVM Internals — you'll want to know what a heap and a garbage collector are. Familiarity with Thread Pools and Java collections helps too. The previous lesson covered concurrency; this one covers finding the bottlenecks in it.
A Real-World Analogy: The Detective, Not the Psychic
💡 Analogy: A good detective does not guess who committed the crime and arrest the first person they suspect. They gather evidence, follow the trail, and let the facts name the culprit. A profiler is your forensic kit: it collects the evidence (where time and memory actually go) so you stop arresting innocent code.
The golden rule of performance work is three words: measure, don't guess. Developers are famously bad at predicting where a program spends its time — the slow part is almost never where intuition points. The cost of guessing is real: you rewrite a clever method that was never the problem, add complexity and bugs, and the program is no faster.
There's a reason this works: by the 90/10 rule of thumb, roughly 90% of execution time is spent in about 10% of the code. Your entire job is to find that 10%, the hotspot, and leave the other 90% of the code clean and readable. A flame graph shows you the hotspot at a glance.
1️⃣ The Optimisation Loop (and the Toolbox)
Every performance fix follows the same disciplined loop. Skip a step — especially jumping straight to "fix" — and you waste effort optimising code that was never slow.
🧠 The loop:
Measure (is it even slow, and by how much?) → Profile (record CPU + allocations) → Identify (read the flame graph for the widest bar) → Fix (change only the hotspot) → Verify (re-measure to confirm the win is real).
Each tool below fits a different step. You don't need all of them at once — pick by what you're asking:
| Tool | What it is | Overhead | Best for |
|---|---|---|---|
| JFR | Built-in flight recorder | <1% | Always-on production profiling |
| async-profiler | Sampling profiler | ~2% | CPU + allocation flame graphs |
| JMC / VisualVM | GUI viewers | 5–15% | Exploring a recording, dev debugging |
| JMH | Microbenchmark harness | N/A | Comparing two implementations |
| jcmd / jfr | CLI diagnostics | Minimal | Thread/heap dumps, reading .jfr |
2️⃣ Microbenchmarking with JMH (and Its Pitfalls)
A microbenchmark measures one tiny piece of code in isolation — like "is + or StringBuilder faster?". You cannot do this reliably with a hand-written timing loop, because the JVM cheats you in two ways.
Warmup: for the first thousands of calls your code runs in the slow interpreter; only once the JIT compiler (Just-In-Time — it compiles hot bytecode to native code on the fly) has warmed up do you see real speed. Time too early and every result is wrong.
Dead-code elimination: if the JIT can prove a result is never used, it deletes the entire computation. Your benchmark then times nothing and reports an impossibly fast number.
JMH (Java Microbenchmark Harness, from the JDK team) handles both: it warms up first, runs the timed iterations after, forks fresh JVMs for isolation, and forces you to return the result or feed it to a Blackhole so the optimiser can't delete it.
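To make the warmup problem concrete, here is a minimal sketch of the kind of hand-written timing loop that JMH replaces. It times the same batch of calls several times in a row. On a typical HotSpot JVM the early rounds usually come out slower than the later ones, although the exact numbers depend on the machine and JVM. The class name, the workload and the round sizes are all assumptions made for this illustration.

```java
final class WarmupSketch {

    // Results are accumulated here so the work has an observable effect.
    static long sink;

    // Hypothetical workload: sum of squares below n. Any pure function would do.
    static long work(int n) {
        long total = 0;
        for (int i = 0; i < n; i++) {
            total += (long) i * i;
        }
        return total;
    }

    // Times `rounds` identical batches and returns each batch's duration in ns.
    static long[] timeRounds(int rounds, int callsPerRound, int n) {
        long[] nanos = new long[rounds];
        for (int r = 0; r < rounds; r++) {
            long start = System.nanoTime();
            for (int c = 0; c < callsPerRound; c++) {
                sink += work(n);
            }
            nanos[r] = System.nanoTime() - start;
        }
        return nanos;
    }

    public static void main(String[] args) {
        long[] nanos = timeRounds(8, 2_000, 1_000);
        for (int r = 0; r < nanos.length; r++) {
            System.out.printf("round %d: %d us%n", r, nanos[r] / 1_000);
        }
        // Printing the checksum keeps the computed values in use.
        System.out.println("checksum: " + sink);
    }
}
```

If the rounds disagree with each other, no single one of them is "the" speed of the method. That is the gap JMH's separate warmup and measurement phases are designed to close.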
Return the result of a @Benchmark (or pass it to a Blackhole). A benchmark whose result is thrown away is a benchmark of an empty method.
// build.gradle adds the two JMH dependencies:
// org.openjdk.jmh:jmh-core:1.37
// org.openjdk.jmh:jmh-generator-annprocess:1.37 (annotation processor)
// Then run: ./gradlew jmh (with the jmh-gradle-plugin applied), or for a
// Maven archetype build: java -jar target/benchmarks.jar
package com.myapp.bench;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;
import java.util.concurrent.TimeUnit;

@BenchmarkMode(Mode.AverageTime)          // report average ns per call
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5, time = 1)         // let the JIT compile hot code first
@Measurement(iterations = 10, time = 1)   // THEN take the timed measurements
@Fork(2)                                  // fresh JVMs so one run can't bias the next
@State(Scope.Thread)                      // each thread gets its own copy of fields
public class StringConcatBenchmark {

    @Param({"10", "100", "1000"})         // run every benchmark at each size
    public int n;

    // If the result were thrown away, the JIT could prove the loop has no
    // effect, delete it ("dead-code elimination"), and you'd be timing
    // nothing. RETURNING the result makes JMH consume it, so the work stays.
    @Benchmark
    public String plusInLoop() {
        String s = "";
        for (int i = 0; i < n; i++) {
            s += "x";                     // each '+' copies into a NEW String: O(n^2) overall
        }
        return s;                         // returned -> JIT can't delete the loop
    }

    @Benchmark
    public String stringBuilder() {
        StringBuilder sb = new StringBuilder(n);  // pre-size to avoid resizes
        for (int i = 0; i < n; i++) {
            sb.append('x');               // appends into one buffer: O(n) overall
        }
        return sb.toString();
    }

    // When a method has no single return value, hand it to a Blackhole so
    // the JIT still believes the result is "used".
    @Benchmark
    public void consumeWithBlackhole(Blackhole bh) {
        StringBuilder sb = new StringBuilder(n);
        for (int i = 0; i < n; i++) sb.append('x');
        bh.consume(sb);                   // pretend we used it
    }
}
Benchmark                            (n)  Mode  Cnt       Score      Error  Units
StringConcatBenchmark.plusInLoop      10  avgt   20      88.413 ±    3.121  ns/op
StringConcatBenchmark.plusInLoop     100  avgt   20    6234.211 ±  214.842  ns/op
StringConcatBenchmark.plusInLoop    1000  avgt   20  583122.917 ± 9851.000  ns/op
StringConcatBenchmark.stringBuilder   10  avgt   20      32.117 ±    0.812  ns/op
StringConcatBenchmark.stringBuilder  100  avgt   20     281.044 ±    7.211  ns/op
StringConcatBenchmark.stringBuilder 1000  avgt   20    2715.398 ±   78.114  ns/op
# Smaller Score = faster. Error is the ± confidence interval.
# At n=1000, StringBuilder is ~215x faster than '+' in a loop.
JMH gets right what a hand-written nanoTime() loop gets wrong. Scaffold a project with the official archetype or the jmh-gradle-plugin, then run java -jar target/benchmarks.jar (Maven archetype) or ./gradlew jmh (Gradle plugin). The Score numbers vary by machine; the ordering does not.
🎯 Your Turn #1 — Finish the JMH Benchmark
Fill in three blanks: set warmup to 3 iterations, measurement to 5, and return the sum so the JIT can't delete the loop as dead code. The expected output shape is in the comment.
import org.openjdk.jmh.annotations.*;
import java.util.concurrent.TimeUnit;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
public class SumBenchmark {
    // 🎯 YOUR TURN: fill in the blanks marked with ___

    @Param({"1000", "10000"})
    public int n;

    // 1) Warm up the JIT for 3 iterations of 1 second each.
    @Warmup(iterations = ___, time = 1)       // 👉 replace ___ with 3
    // 2) Then take 5 timed measurement iterations.
    @Measurement(iterations = ___, time = 1)  // 👉 replace ___ with 5
    @Benchmark
    public long sumLoop() {
        long total = 0;                       // primitive accumulator
        for (int i = 0; i < n; i++) total += i;
        // 3) RETURN the result so the JIT can't delete the loop as dead code.
        ___;                                  // 👉 replace ___ with return total
    }

    // ✅ Expected: JMH prints a table with rows for n=1000 and n=10000,
    //    in "avgt" mode with "ns/op" units. There is no @Fork here, so JMH
    //    uses its default of 5 forks and Cnt is 5 forks x 5 iterations = 25, e.g.
    //    SumBenchmark.sumLoop  1000  avgt  25  312.4 ± 4.1  ns/op
}
3️⃣ CPU & Allocation Profiling: JFR and async-profiler
A benchmark compares two known options. A profiler answers the bigger question: in a whole running app, where is the time and memory going? You attach it, let real traffic flow, and it samples what's happening.
Java Flight Recorder (JFR) is built into the JDK and, with its default settings, typically costs around 1% overhead or less, so you can leave it on in production. The more detailed settings=profile template costs somewhat more. It records a .jfr file of CPU samples, allocations, GC events and more, which you read with the jfr CLI or open in JDK Mission Control (JMC).
async-profiler is a separate, low-overhead sampling profiler that produces interactive flame graphs for both CPU (where time goes) and allocations (what creates garbage). Two recordings, two different questions — always check allocation too, because excessive object creation is one of the most common hidden slowdowns.
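A recording can also be driven from inside the application through the jdk.jfr API that ships with the JDK. The sketch below is one possible way to do that, not the only one: it starts a recording, commits a custom event per unit of work, dumps the recording to a temporary .jfr file, and counts the events per type, which is roughly what a summary of the file shows. The class JfrApiSketch, the event name "sketch.Work" and the busyWork workload are hypothetical names invented for this example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;
import jdk.jfr.Event;
import jdk.jfr.Label;
import jdk.jfr.Name;
import jdk.jfr.Recording;
import jdk.jfr.consumer.RecordedEvent;
import jdk.jfr.consumer.RecordingFile;

final class JfrApiSketch {

    // Hypothetical custom event; the name "sketch.Work" is made up for this example.
    @Name("sketch.Work")
    @Label("Unit of work")
    static class WorkEvent extends Event {
        @Label("Characters built")
        int chars;
    }

    // Stand-in workload that allocates a little on every call.
    static int busyWork(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) sb.append(i % 10);
        return sb.length();
    }

    // Records `units` WorkEvents (plus any GC events that happen meanwhile),
    // dumps the recording to a temporary .jfr file, and returns how many
    // events of each type the file contains.
    static Map<String, Long> recordAndCount(int units) {
        try (Recording recording = new Recording()) {
            recording.enable(WorkEvent.class).withoutThreshold();
            recording.enable("jdk.GarbageCollection");
            recording.start();
            for (int u = 0; u < units; u++) {
                WorkEvent event = new WorkEvent();
                event.begin();
                event.chars = busyWork(50_000);
                event.commit();
            }
            recording.stop();

            Path file = Files.createTempDirectory("jfr-sketch").resolve("sketch.jfr");
            recording.dump(file);

            Map<String, Long> counts = new TreeMap<>();
            for (RecordedEvent e : RecordingFile.readAllEvents(file)) {
                counts.merge(e.getEventType().getName(), 1L, Long::sum);
            }
            return counts;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        recordAndCount(20).forEach((type, count) -> System.out.println(type + ": " + count));
    }
}
```

Custom events like this are useful for marking application-level units (a request, a batch) so they line up with the JVM's own events in the same recording.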
# === Java Flight Recorder (JFR) — built into the JDK, <1% overhead ========
# Start a 60-second recording when you launch the app:
java -XX:StartFlightRecording=duration=60s,filename=app.jfr,settings=profile \
-jar app.jar
# Or attach to an already-running JVM by its process id:
jps # list Java pids
jcmd 12345 JFR.start duration=60s filename=/tmp/app.jfr settings=profile
jcmd 12345 JFR.dump filename=/tmp/snap.jfr # snapshot mid-flight
jcmd 12345 JFR.stop # stop the recording
# Read events straight from the CLI without a GUI:
jfr summary app.jfr # event counts and sizes per event type
jfr print --events jdk.GarbageCollection app.jfr
jfr print --events jdk.ObjectAllocationSample app.jfr | head -40
jmc app.jfr # open in JDK Mission Control (flame graphs)
# === async-profiler — sampling CPU + allocation flame graphs (~2%) =======
# https://github.com/async-profiler/async-profiler
./asprof -d 30 -f cpu.html 12345 # 30s CPU flame graph -> HTML
./asprof -d 30 -e alloc -f alloc.html 12345 # WHERE allocations happen
# === Always-on safety net for production =================================
java -XX:+HeapDumpOnOutOfMemoryError \
-XX:HeapDumpPath=/var/log/myapp/heap-%p.hprof \
-Xlog:gc*:file=/var/log/myapp/gc.log:time,uptime:filecount=5,filesize=20m \
-jar app.jar
# You get a heap dump for free if it OOMs, and a GC log to diagnose pauses.
$ jfr summary app.jfr
Version: 2.1
Events: 41,233
Event Type Count Size (bytes)
=====================================================
jdk.ObjectAllocationSample 18,402 612,331
jdk.ExecutionSample 12,810 421,004
jdk.GarbageCollection 127 18,902
jdk.GCHeapSummary 254 40,124
# 127 GC events in 60s and allocation sampling dominating the recording
# is your first clue: this app is allocating too much, not CPU-bound.
jps, jcmd, jfr and JFR itself all ship with the JDK, so there is nothing to install. JDK Mission Control (jmc) has been a separate download since JDK 11, and async-profiler is a separate download too. Run these in your own terminal against a real JVM.
4️⃣ Reading a Flame Graph
A flame graph turns thousands of sampled call stacks into one picture. Learn to read it and you can find a hotspot in seconds.
- Y axis = stack depth. Callers at the bottom, the methods they call stacked on top.
- X axis is NOT time. Frame width = share of samples — how much CPU (or allocation) that method accounts for.
- Wide, flat bars = your hotspots. Optimise the widest frame first.
- Tall, thin towers are deep call chains that individually cost little — usually not worth touching.
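The "width = share of samples" rule can be sketched in a few lines. async-profiler can emit its samples as collapsed stacks, one line per distinct stack in the form frame;frame;frame count. The sketch below sums, for every frame name, the samples of all stacks that contain it and divides by the total, which is that method's combined width across the graph. The class name and the sample data are made up for illustration; a real flame graph also keeps the same method apart when it is reached through different callers.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class FlameWidthSketch {

    // Returns, per frame name, the fraction (0.0 to 1.0) of all samples whose
    // stack contains that frame.
    static Map<String, Double> widths(List<String> collapsedLines) {
        Map<String, Long> inclusive = new LinkedHashMap<>();
        long total = 0;
        for (String line : collapsedLines) {
            int split = line.lastIndexOf(' ');            // count follows the last space
            long samples = Long.parseLong(line.substring(split + 1));
            total += samples;
            Set<String> seen = new HashSet<>();           // count a recursive frame once per stack
            for (String frame : line.substring(0, split).split(";")) {
                if (seen.add(frame)) {
                    inclusive.merge(frame, samples, Long::sum);
                }
            }
        }
        Map<String, Double> result = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : inclusive.entrySet()) {
            result.put(e.getKey(), (double) e.getValue() / total);
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical samples: 100 in total, 70 of them somewhere under parseJson.
        List<String> samples = List.of(
            "main;handle;parseJson 60",
            "main;handle;parseJson;readToken 10",
            "main;handle;render 25",
            "main;gc 5");
        widths(samples).forEach((frame, share) ->
            System.out.printf("%-10s %5.1f%%%n", frame, share * 100));
    }
}
```

Note that main is always 100% wide because every stack passes through it. The interesting frames are the wide ones nearest the top of the stacks, where the time is actually spent.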
💡 The maths of effort: a method that is 2% of the width can never give you more than a 2% speed-up, no matter how cleverly you rewrite it. A 60%-wide method is where the real win is. Width tells you where to spend your time — that's the whole point.
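The arithmetic behind that rule is Amdahl's law: if a frame accounts for a fraction p of the run time and you make it s times faster, the whole program speeds up by 1 / ((1 - p) + p / s). A small sketch, with a class name chosen only for this example:

```java
final class SpeedupBoundSketch {

    // Amdahl's law: overall speed-up when the part taking `fraction` of the
    // run time becomes `localSpeedup` times faster.
    static double overallSpeedup(double fraction, double localSpeedup) {
        return 1.0 / ((1.0 - fraction) + fraction / localSpeedup);
    }

    // Upper bound: the hotspot costs nothing at all after the fix.
    static double maxSpeedup(double fraction) {
        return 1.0 / (1.0 - fraction);
    }

    public static void main(String[] args) {
        // A 2%-wide frame, removed entirely: about 1.02x, i.e. 2% less run time.
        System.out.printf("2%% frame, deleted:       %.3fx%n", maxSpeedup(0.02));
        // A 60%-wide frame made twice as fast: about 1.43x.
        System.out.printf("60%% frame, 2x faster:    %.3fx%n", overallSpeedup(0.60, 2.0));
        // A 60%-wide frame removed entirely: 2.5x.
        System.out.printf("60%% frame, deleted:      %.3fx%n", maxSpeedup(0.60));
    }
}
```

Even a modest improvement to the wide frame beats a perfect rewrite of the narrow one, which is why the widest frame goes first.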
🎯 Your Turn #2 — Record & Read a Flame Graph
Fill in the blanks to record a 20-second CPU flame graph, then answer the self-check question about which method to optimise first.
# 🎯 YOUR TURN — record a 20s CPU flame graph of a running app,
# then answer the question in the comment from what you see.
# 1) Find the Java process id.
___ # 👉 replace ___ with jps
# 2) Record a 20-second CPU flame graph to HTML for pid 4567.
./asprof -d ___ -f cpu.html 4567 # 👉 replace ___ with 20
# 3) Open cpu.html in a browser. Reading it:
# - The X axis is NOT time order — width = share of samples (CPU time).
# - Look for the WIDEST bar; that frame's method is your hotspot.
# - A tall-but-narrow stack is a deep call chain that costs little, usually fine.
#
# ✅ Self-check: if "parseJson" is 60% of the width and everything else is
# a thin sliver, which method do you optimise first? -> parseJson
# (Optimising a 2%-wide method can never give more than a 2% speed-up.)
5️⃣ Common Java Performance Traps
Profilers keep surfacing the same handful of culprits in Java code. Knowing them lets you read a flame graph and immediately recognise the pattern.
- Autoboxing in hot loops: mixing primitives with wrapper types (Integer, Long) can silently allocate an object per operation. In a loop that means millions of objects, which crushes the GC.
- String concatenation with + in a loop: each + builds a brand-new String, turning O(n) work into O(n²). Use a StringBuilder.
- Recompiling regexes / reallocating in a loop: Pattern.compile(...) or new-ing throwaway objects inside the hot path repeats expensive work every iteration.
- Premature optimisation: contorting code for speed before profiling proves it matters. This is the opposite trap.
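The regex item in the list above can be sketched like this: the slow variant compiles the same pattern for every element, the fast variant compiles it once into a constant and reuses it. The class name, method names and the \d+ pattern are illustrative choices, not taken from any particular codebase.

```java
import java.util.List;
import java.util.regex.Pattern;

final class RegexHoistSketch {

    // Compiled once when the class loads, then reused by every call.
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // Trap: Pattern.compile runs again for every single element.
    static int countNumericSlow(List<String> inputs) {
        int count = 0;
        for (String s : inputs) {
            if (Pattern.compile("\\d+").matcher(s).matches()) count++;
        }
        return count;
    }

    // Fix: only the cheap matcher is created inside the loop.
    static int countNumericFast(List<String> inputs) {
        int count = 0;
        for (String s : inputs) {
            if (DIGITS.matcher(s).matches()) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> inputs = List.of("123", "abc", "42", "4x2");
        // Both variants agree: "123" and "42" are the two all-digit strings.
        System.out.println(countNumericSlow(inputs) + " " + countNumericFast(inputs));
    }
}
```

The same hoisting applies to String.matches and String.split with a regex argument called in a hot loop, since they may compile a pattern on each call.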
The runnable example below shows the autoboxing trap with a deterministic fix: a boxed Long accumulator versus a primitive long. Same answer, very different speed.
import java.util.ArrayList;
import java.util.List;

public class Main {
    // Scenario: the flame graph showed 70% of CPU in Long.valueOf (boxing).
    // The culprit: a boxed Long accumulator forces autoboxing on EVERY loop iteration.

    // SLOW: sum into a boxed Long. Each "+=" unboxes, adds, then re-boxes the
    // result into a new Long object: millions of throwaway allocations.
    static long sumBoxed(List<Integer> data) {
        Long total = 0L;                  // boxed accumulator (bad)
        for (Integer v : data) {
            total += v;                   // unbox v, unbox total, box result
        }
        return total;
    }

    // FAST: keep the accumulator a PRIMITIVE long. No allocations in the
    // loop, and the JIT can keep the sum in a CPU register.
    static long sumPrimitive(List<Integer> data) {
        long total = 0L;                  // primitive accumulator (good)
        for (Integer v : data) {
            total += v;                   // one unbox; no boxing of the sum
        }
        return total;
    }

    public static void main(String[] args) {
        List<Integer> data = new ArrayList<>();
        for (int i = 0; i < 5_000_000; i++) data.add(i);

        // Warm the JIT so we measure steady-state, not interpreter, speed.
        for (int i = 0; i < 5; i++) { sumBoxed(data); sumPrimitive(data); }

        long t0 = System.nanoTime();
        long a = sumBoxed(data);
        long boxedMs = (System.nanoTime() - t0) / 1_000_000;

        long t1 = System.nanoTime();
        long b = sumPrimitive(data);
        long primMs = (System.nanoTime() - t1) / 1_000_000;

        System.out.println("Boxed Long sum: " + a + " in " + boxedMs + " ms");
        System.out.println("Primitive long sum: " + b + " in " + primMs + " ms");
        System.out.println("Speed-up: "
            + String.format("%.1fx", (double) boxedMs / Math.max(primMs, 1)));
        System.out.println();
        System.out.println("(For publication numbers use JMH, not System.nanoTime.)");
    }
}
Boxed Long sum: 12499997500000 in 41 ms
Primitive long sum: 12499997500000 in 7 ms
Speed-up: 5.9x

(For publication numbers use JMH, not System.nanoTime.)
6️⃣ GC Tuning Basics
The garbage collector (GC) reclaims objects you no longer reference. Most of the time it just works — and the single biggest GC win is allocating less in the first place. Only reach for tuning when a recording shows real GC pain: long pauses, or GC eating a big slice of CPU.
When you do tune, start by matching the collector to your goal, then size the heap, then change one flag at a time and re-measure with a GC log.
| Collector | Flag | Optimised for |
|---|---|---|
| G1 (default) | -XX:+UseG1GC | Balanced pause vs throughput |
| ZGC | -XX:+UseZGC | Very low pauses, large heaps |
| Shenandoah | -XX:+UseShenandoahGC | Low pauses, concurrent |
| Parallel | -XX:+UseParallelGC | Raw throughput, batch jobs |
🧠 Rule of thumb:
Turn on a GC log first — -Xlog:gc*:file=gc.log:time,uptime — and read it before changing anything. Copying a wall of -XX: flags from a random blog post usually makes things worse, not better.
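As a lightweight complement to the GC log, the standard java.lang.management API exposes per-collector counters that you can read before and after a piece of work. The sketch below assumes nothing beyond the JDK; the class name and the churn workload are invented for the example, and the counters are cumulative for the whole JVM, so they are only a rough signal compared with a real GC log.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

final class GcStatsSketch {

    // Each array is published here so the allocation cannot be optimised away.
    static byte[] keep;

    // Total collections across all collectors (a collector reports -1 if undefined).
    static long totalGcCount() {
        long count = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            count += Math.max(0, gc.getCollectionCount());
        }
        return count;
    }

    // Approximate accumulated collection time in milliseconds.
    static long totalGcMillis() {
        long millis = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            millis += Math.max(0, gc.getCollectionTime());
        }
        return millis;
    }

    // Hypothetical allocation-heavy workload: `arrays` short-lived 1 KiB arrays.
    static long churn(int arrays) {
        long bytes = 0;
        for (int i = 0; i < arrays; i++) {
            keep = new byte[1024];
            bytes += keep.length;
        }
        return bytes;
    }

    public static void main(String[] args) {
        long countBefore = totalGcCount();
        long millisBefore = totalGcMillis();
        long bytes = churn(500_000);
        System.out.println("allocated about " + bytes / (1024 * 1024) + " MiB");
        System.out.println("collections during work: " + (totalGcCount() - countBefore));
        System.out.println("GC time during work: " + (totalGcMillis() - millisBefore) + " ms");
    }
}
```

Run it once per candidate collector flag from the table and compare the deltas. That follows the same change-one-thing-and-re-measure discipline as the rest of the lesson.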
Mini-Challenge — Prove the String Trap
Scaffolding removed. Read the comment outline, then write it yourself: build a big string two ways (+= in a loop vs a StringBuilder), time both, and print the slow/fast ratio. The expected shape is in the comment so you can self-check.
import java.util.List;

public class Main {
    // 🎯 MINI-CHALLENGE: prove string concatenation in a loop is the bottleneck
    // 1. Write buildSlow(int n): start "String s = \"\";" and do s += i + ","
    //    inside a for-loop. Return s.
    // 2. Write buildFast(int n): use a StringBuilder, append i + "," each turn,
    //    and return sb.toString().
    // 3. In main: warm up both with a few calls, then time each on n = 50_000
    //    using System.nanoTime(), and print the millisecond cost of each plus
    //    the slow/fast ratio.
    //
    // ✅ Expected (numbers vary by machine, the RATIO is the point):
    //    Slow (+=): ~900 ms
    //    Fast (builder): ~2 ms
    //    StringBuilder is hundreds of times faster, with the same output string.
    //
    // 💡 Once it works, the real lesson: the profiler/flame graph would have
    //    pointed straight at String.<init> / Arrays.copyOf before you guessed.

    static String buildSlow(int n) {
        // your code here
        return "";
    }

    static String buildFast(int n) {
        // your code here
        return "";
    }

    public static void main(String[] args) {
        // your code here
    }
}
Common Errors (and the fix)
- ❌ Micro-benchmarking without JMH: a System.nanoTime() loop measures the interpreter before warmup, then gets its loop deleted by dead-code elimination after. Fix: use JMH. It warms up, forks JVMs, and forces you to return the result so the optimiser can't cheat.
- ❌ Optimising the wrong thing: "I'm sure this method is slow" is a guess, not evidence; the real hotspot is almost always elsewhere. Fix: profile first, find the widest frame in the flame graph, and only optimise that.
- ❌ Ignoring GC / allocation: a CPU profile looks fine but the app stutters because it allocates millions of short-lived objects and the GC runs constantly. Fix: take an allocation profile (async-profiler -e alloc or JFR allocation events) and cut the biggest allocators.
- ❌ Autoboxing in a hot loop: Long total = 0L; total += v; re-boxes a new Long every iteration. Fix: keep accumulators primitive (long total = 0L;) and prefer IntStream/LongStream over Stream<Integer>.
- ❌ Heap dump fills the disk: jcmd PID GC.heap_dump writes a file that can be as big as your heap (possibly many GB) and can crash the box if disk is short. Fix: check free space first, and prefer -XX:+HeapDumpOnOutOfMemoryError with a dedicated -XX:HeapDumpPath.
- ❌ Premature optimisation: rewriting clean code into something clever and buggy before profiling proves it's hot. Fix: make it work, make it right, then, guided by a profiler, make the measured hotspot fast.
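The autoboxing fix in the list above mentions preferring IntStream/LongStream over Stream<Integer>. A minimal sketch of the difference, with class and method names chosen for illustration: both methods compute the same sum, but the first sends every element through the pipeline as a boxed object while the second stays on primitives throughout.

```java
import java.util.stream.IntStream;

final class PrimitiveStreamSketch {

    // Boxed pipeline: boxed() turns each int into an Integer, map produces a
    // Long per element, and reduce combines boxed Longs.
    static long sumBoxed(int n) {
        return IntStream.range(0, n)
                .boxed()
                .map(Integer::longValue)
                .reduce(0L, Long::sum);
    }

    // Primitive pipeline: values stay int/long from start to finish.
    static long sumPrimitive(int n) {
        return IntStream.range(0, n).asLongStream().sum();
    }

    public static void main(String[] args) {
        int n = 1_000;
        // Both compute 0 + 1 + ... + 999 = 499500.
        System.out.println(sumBoxed(n) + " " + sumPrimitive(n));
    }
}
```

Whether the boxed version is measurably slower in a given program is a question for a JMH benchmark or an allocation profile, not for inspection alone.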
📋 Quick Reference
| Goal | Command / API | Notes |
|---|---|---|
| Start a recording | -XX:StartFlightRecording=... | Writes a .jfr file |
| Attach to a live JVM | jcmd PID JFR.start ... | Find PID with jps |
| Read a recording | jfr summary app.jfr | Or open in JMC |
| CPU flame graph | asprof -d 30 -f cpu.html PID | async-profiler |
| Allocation flame graph | asprof -e alloc -f a.html PID | Finds garbage sources |
| Benchmark two options | @Benchmark (JMH) | Return the result! |
| Heap dump | jcmd PID GC.heap_dump f.hprof | As big as the heap |
| Thread dump | jcmd PID Thread.print | Thread states |
| GC log | -Xlog:gc*:file=gc.log | Read before tuning |
| Dump on OOM | -XX:+HeapDumpOnOutOfMemoryError | Set HeapDumpPath |
Frequently Asked Questions
Why can't I just time my code with System.nanoTime() in a loop?
Because the JVM is not a static machine. Before the JIT compiler warms up, your code runs in the slow interpreter, so the first runs are misleadingly slow. After warmup, the JIT can prove a result is never used and delete the whole loop ('dead-code elimination'), making it misleadingly fast (or instant). On-stack replacement, GC pauses, and CPU frequency scaling add more noise. JMH solves all of this: it warms up first, forks fresh JVMs, uses Blackhole/return values to defeat dead-code elimination, and reports a confidence interval. Hand-rolled nanoTime timing of micro-operations is almost always wrong.
What is the difference between CPU profiling and allocation profiling?
CPU profiling answers 'where is my time going?' — it samples the call stack periodically and shows which methods are on-CPU most often (the wide bars in a flame graph). Allocation profiling answers 'what is creating garbage?' — it samples object allocations and shows which call sites allocate the most bytes. They diagnose different problems: a CPU-bound app needs a faster algorithm; an allocation-heavy app needs fewer objects (it shows up as frequent GC and high GC CPU). async-profiler and JFR can do both; always check allocation too, because excessive allocation is one of the most common hidden Java performance problems.
How do I read a flame graph?
A flame graph is a picture of sampled call stacks. The Y axis is stack depth (callers at the bottom, callees stacked on top). The X axis is NOT time order — the width of a frame is the share of samples that landed in it, i.e. how much CPU (or allocation) it accounts for. So you scan for the WIDEST frames and optimise those first; a 60%-wide method is where the win is. Tall, thin towers are deep call chains that individually cost little. Optimising a 2%-wide frame can never make the program more than 2% faster, so width tells you where to spend effort.
Should I tune the garbage collector to make my app faster?
Usually not first. The biggest GC win is allocating less — if your allocation profile is flat, GC mostly takes care of itself. Reach for GC tuning when a recording shows real problems: long pauses, or GC eating a large share of CPU. Start by picking the right collector for the goal (G1 is the balanced default; ZGC or Shenandoah for very low pause times; Parallel for raw throughput batch jobs) and sizing the heap with -Xmx. Tune one flag at a time and measure with a GC log (-Xlog:gc*). Random flag-copying from a blog usually makes things worse.
What does 'premature optimization is the root of all evil' actually mean?
Donald Knuth's full quote is 'we should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.' It does NOT mean ignore performance — it means don't contort code for speed before you have profiled and know it matters. Most code is not on a hot path, and complicating it for an imaginary speed-up costs readability and introduces bugs for no real gain. Make it work, make it right, then — guided by a profiler — make the measured hotspot fast.
Why is autoboxing in a hot loop so expensive?
Autoboxing silently wraps a primitive (int, long, double) into its object form (Integer, Long, Double) whenever you mix primitives with generics or wrapper-typed variables. Each box is a heap allocation, and in a loop that runs millions of times that is millions of throwaway objects — which hammers the allocator and the garbage collector. A boxed accumulator like 'Long total = 0L; total += v;' re-boxes on every iteration. Keep accumulators and hot-loop variables as primitives, prefer IntStream/LongStream over Stream<Integer>, and use primitive-specialised collections when the loop is genuinely hot.
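One way to see the allocation cost directly, without a profiler, is to read the per-thread allocation counter before and after each variant. The sketch below assumes a HotSpot-based JVM, because it casts to com.sun.management.ThreadMXBean, which is a JDK-specific extension rather than part of the standard API. The class and method names are made up for the example, and the JIT may remove some boxing on its own, so treat the numbers as indicative only.

```java
import java.lang.management.ManagementFactory;

final class AllocationSketch {

    // Results are stored here so the summing work stays observable.
    static long sink;

    // Bytes allocated so far by the current thread (HotSpot-specific API).
    static long allocatedBytes() {
        com.sun.management.ThreadMXBean bean =
                (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();
        return bean.getThreadAllocatedBytes(Thread.currentThread().getId());
    }

    // Boxed accumulator: each += may create a new Long.
    static long sumBoxed(int[] data) {
        Long total = 0L;
        for (int v : data) total += v;
        return total;
    }

    // Primitive accumulator: nothing to allocate in the loop.
    static long sumPrimitive(int[] data) {
        long total = 0L;
        for (int v : data) total += v;
        return total;
    }

    // Runs the task and returns how many bytes this thread allocated meanwhile.
    static long bytesAllocatedBy(Runnable task) {
        long before = allocatedBytes();
        task.run();
        return allocatedBytes() - before;
    }

    public static void main(String[] args) {
        int[] data = new int[1_000_000];
        for (int i = 0; i < data.length; i++) data[i] = i;

        long boxed = bytesAllocatedBy(() -> sink = sumBoxed(data));
        long primitive = bytesAllocatedBy(() -> sink = sumPrimitive(data));
        System.out.println("boxed accumulator allocated:     " + boxed + " bytes");
        System.out.println("primitive accumulator allocated: " + primitive + " bytes");
    }
}
```

The input is an int[] on purpose, so the only boxing left in the program is the accumulator itself.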
🎉 Lesson Complete!
Excellent work! You can now measure instead of guess: write correct JMH benchmarks, record CPU and allocation profiles with JFR and async-profiler, read a flame graph to find the real hotspot, tune the GC only when the evidence calls for it, and recognise the classic Java traps — autoboxing, string concatenation in loops, and premature optimisation.
Next up: Microservices — building distributed systems with Spring Boot, where these profiling skills scale up to whole fleets of services.
Sign up for free to track which lessons you've completed and get learning reminders.