    Lesson 46 • Expert

    Performance Profiling

    Stop guessing where your Java program is slow. Learn to measure it — with JMH benchmarks, JFR and async-profiler recordings, and flame graphs — then fix the one hotspot that actually matters.

    What You'll Learn in This Lesson

    • Write correct JMH microbenchmarks and avoid their pitfalls
    • Profile CPU and allocations with JFR and async-profiler
    • Read a flame graph to find the real hotspot
    • Capture and reason about GC behaviour, and tune it sensibly
    • Follow the measure → profile → fix → verify loop
    • Spot Java traps: autoboxing, string concat in loops, premature optimisation

    Before You Start

    This lesson builds on Memory Management & GC and JVM Internals — you'll want to know what a heap and a garbage collector are. Familiarity with Thread Pools and Java collections helps too. The previous lesson covered concurrency; this one covers finding the bottlenecks in it.

    A Real-World Analogy: The Detective, Not the Psychic

    💡 Analogy: A good detective does not guess who committed the crime and arrest the first person they suspect. They gather evidence, follow the trail, and let the facts name the culprit. A profiler is your forensic kit: it collects the evidence (where time and memory actually go) so you stop arresting innocent code.

    The golden rule of performance work is three words: measure, don't guess. Developers are famously bad at predicting where a program spends its time — the slow part is almost never where intuition points. The cost of guessing is real: you rewrite a clever method that was never the problem, add complexity and bugs, and the program is no faster.

    There's a reason this works: the 90/10 rule of thumb says roughly 90% of execution time is spent in about 10% of the code. Your job is to find that 10% (the hotspot) and leave the rest of the code clean and readable. A flame graph shows you the hotspot at a glance.

    1️⃣ The Optimisation Loop (and the Toolbox)

    Every performance fix follows the same disciplined loop. Skip a step — especially jumping straight to "fix" — and you waste effort optimising code that was never slow.

    🧠 The loop:

    Measure (is it even slow, and by how much?) → Profile (record CPU + allocations) → Identify (read the flame graph for the widest bar) → Fix (change only the hotspot) → Verify (re-measure to confirm the win is real).

    Each tool below fits a different step. You don't need all of them at once — pick by what you're asking:

    Tool             What it is                 Overhead   Best for
    JFR              Built-in flight recorder   <1%        Always-on production profiling
    async-profiler   Sampling profiler          ~2%        CPU + allocation flame graphs
    JMC / VisualVM   GUI viewers                5–15%      Exploring a recording, dev debugging
    JMH              Microbenchmark harness     N/A        Comparing two implementations
    jcmd / jfr       CLI diagnostics            Minimal    Thread/heap dumps, reading .jfr

    2️⃣ Microbenchmarking with JMH (and Its Pitfalls)

    A microbenchmark measures one tiny piece of code in isolation — like "is + or StringBuilder faster?". You cannot do this reliably with a hand-written timing loop, because the JVM cheats you in two ways.

    Warmup: for the first thousands of calls your code runs in the slow interpreter; only once the JIT compiler (Just-In-Time — it compiles hot bytecode to native code on the fly) has warmed up do you see real speed. Time too early and every result is wrong.

    Dead-code elimination: if the JIT can prove a result is never used, it deletes the entire computation. Your benchmark then times nothing and reports an impossibly fast number.

    JMH (Java Microbenchmark Harness, from the OpenJDK team) handles both: it runs warmup iterations first and the timed iterations after, forks fresh JVMs for isolation, and keeps a result alive when you return it from the benchmark method or feed it to a Blackhole, so the optimiser can't delete the work.
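To make the warmup effect concrete, here is a minimal sketch (deliberately not a benchmark) that times the same work in ten rounds with System.nanoTime(). The class name WarmupDemo, the work method and the loop sizes are illustrative choices. On a typical HotSpot JVM the first rounds tend to report higher times than the later ones, but the exact numbers change from run to run.

```java
public class WarmupDemo {

    // A small deterministic workload. The result is returned so the
    // caller can keep it alive and the loop is not dead code.
    static long work(int n) {
        long acc = 0;
        for (int i = 0; i < n; i++) {
            acc += (i * 31L) ^ (i >>> 3);
        }
        return acc;
    }

    public static void main(String[] args) {
        long sink = 0;  // accumulates results so the JIT cannot discard the calls
        for (int round = 1; round <= 10; round++) {
            long t0 = System.nanoTime();
            for (int i = 0; i < 2_000; i++) {
                sink += work(1_000);
            }
            long micros = (System.nanoTime() - t0) / 1_000;
            System.out.println("round " + round + ": " + micros + " us");
        }
        // Printing the checksum makes the result observable.
        System.out.println("checksum " + sink);
    }
}
```

If you had timed only round 1, you would have reported interpreter speed. JMH's warmup iterations exist to discard exactly those early rounds.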

    A correct JMH benchmark — StringBuilder vs '+' in a loop
    // Add the two JMH dependencies to your build (Gradle or Maven):
    //   org.openjdk.jmh:jmh-core:1.37
    //   org.openjdk.jmh:jmh-generator-annprocess:1.37   (annotation processor)
    // Then run:  ./gradlew jmh                      (Gradle, with the jmh-gradle-plugin)
    //       or:  java -jar target/benchmarks.jar    (Maven archetype build)
    
    package com.myapp.bench;
    
    import org.openjdk.jmh.annotations.*;
    import org.openjdk.jmh.infra.Blackhole;
    import java.util.concurrent.TimeUnit;
    
    @BenchmarkMode(Mode.AverageTime)             // report average ns per call
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    @Warmup(iterations = 5, time = 1)            // let the JIT compile hot code first
    @Measurement(iterations = 10, time = 1)      // THEN take the timed measurements
    @Fork(2)                                     // fresh JVMs so one run can't bias the next
    @State(Scope.Thread)                         // each thread gets its own copy of fields
    public class StringConcatBenchmark {
    
        @Param({"10", "100", "1000"})            // run every benchmark at each size
        public int n;
    
        // If a benchmark computes a value and then throws it away, the JIT can
        // prove the work is unused and delete it ("dead-code elimination"), so
        // you'd be timing nothing. RETURNING the result tells JMH to keep the work.
        @Benchmark
        public String plusInLoop() {
            String s = "";
            for (int i = 0; i < n; i++) {
                s += "x";                        // each '+' makes a NEW String — O(n^2)
            }
            return s;                            // returned -> JIT can't delete the loop
        }
    
        @Benchmark
        public String stringBuilder() {
            StringBuilder sb = new StringBuilder(n);   // pre-size to avoid resizes
            for (int i = 0; i < n; i++) {
                sb.append('x');                  // appends into one buffer — O(n)
            }
            return sb.toString();
        }
    
        // When a method has no single return value, hand it to a Blackhole so
        // the JIT still believes the result is "used".
        @Benchmark
        public void consumeWithBlackhole(Blackhole bh) {
            StringBuilder sb = new StringBuilder(n);
            for (int i = 0; i < n; i++) sb.append('x');
            bh.consume(sb);                      // pretend we used it
        }
    }
    Output
    Benchmark                              (n)  Mode  Cnt       Score      Error  Units
    StringConcatBenchmark.plusInLoop        10  avgt   20      88.413 ±    3.121  ns/op
    StringConcatBenchmark.plusInLoop       100  avgt   20    6234.211 ±  214.842  ns/op
    StringConcatBenchmark.plusInLoop      1000  avgt   20  583122.917 ± 9851.000  ns/op
    StringConcatBenchmark.stringBuilder     10  avgt   20      32.117 ±    0.812  ns/op
    StringConcatBenchmark.stringBuilder    100  avgt   20     281.044 ±    7.211  ns/op
    StringConcatBenchmark.stringBuilder   1000  avgt   20    2715.398 ±   78.114  ns/op
    
    # Smaller Score = faster. Error is the ± confidence interval.
    # At n=1000, StringBuilder is ~215x faster than '+' in a loop.
    JMH is the standard, reliable way to micro-benchmark on the JVM: it handles JIT warmup, dead-code elimination, and per-fork isolation that a hand-rolled nanoTime() loop gets wrong. Scaffold a project with the official Maven archetype (then run java -jar target/benchmarks.jar) or with the jmh-gradle-plugin (then run ./gradlew jmh). The Score numbers vary by machine; the ordering should not.
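The Score column also shows how each method scales. As a worked check that uses only the numbers printed above, the sketch below divides each score by the score at a tenth of the input size. The class and method names are illustrative. A linear algorithm should grow by roughly 10x per step, and a quadratic one by up to roughly 100x.

```java
public class ScalingCheck {

    // Growth factor when n is multiplied by 10:
    // score at 10n divided by score at n.
    static double growth(double scoreAtN, double scoreAt10N) {
        return scoreAt10N / scoreAtN;
    }

    public static void main(String[] args) {
        // Scores in ns/op, copied from the JMH output above (n = 10, 100, 1000).
        double[] plus    = {88.413, 6234.211, 583122.917};
        double[] builder = {32.117, 281.044, 2715.398};

        System.out.printf("plusInLoop     10->100: %.1fx   100->1000: %.1fx%n",
                growth(plus[0], plus[1]), growth(plus[1], plus[2]));
        System.out.printf("stringBuilder  10->100: %.1fx   100->1000: %.1fx%n",
                growth(builder[0], builder[1]), growth(builder[1], builder[2]));
        System.out.printf("speed-up at n=1000: %.0fx%n", plus[2] / builder[2]);
    }
}
```

With these scores, plusInLoop grows by about 70x and then about 94x per tenfold increase in n, while stringBuilder grows by about 9x and then about 10x. That matches the O(n^2) and O(n) comments in the benchmark.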

    🎯 Your Turn #1 — Finish the JMH Benchmark

    Fill in three blanks: set warmup to 3 iterations, measurement to 5, and return the sum so the JIT can't delete the loop as dead code. The expected output shape is in the comment.

    Fill in the blanks
    import org.openjdk.jmh.annotations.*;
    import java.util.concurrent.TimeUnit;
    
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    @Fork(1)                                     // one fork, so Cnt matches the 5 measurement iterations
    @State(Scope.Thread)
    public class SumBenchmark {
    
        // 🎯 YOUR TURN — fill in the blanks marked with ___
    
        @Param({"1000", "10000"})
        public int n;
    
        // 1) Warm up the JIT for 3 iterations of 1 second each.
        @Warmup(iterations = ___, time = 1)          // 👉 replace ___ with 3
    
        // 2) Then take 5 timed measurement iterations.
        @Measurement(iterations = ___, time = 1)     // 👉 replace ___ with 5
        @Benchmark
        public long sumLoop() {
            long total = 0;                          // primitive accumulator
            for (int i = 0; i < n; i++) total += i;
            // 3) RETURN the result so the JIT can't delete the loop as dead code.
            ___;                                     // 👉 replace ___ with  return total
        }
    
        // ✅ Expected: JMH prints a table with rows for n=1000 and n=10000,
        //    in "avgt" mode with "ns/op" units, e.g.
        //    SumBenchmark.sumLoop  1000  avgt   5   312.4 ± 4.1  ns/op
    }

    3️⃣ CPU & Allocation Profiling: JFR and async-profiler

    A benchmark compares two known options. A profiler answers the bigger question: in a whole running app, where is the time and memory going? You attach it, let real traffic flow, and it samples what's happening.

    Java Flight Recorder (JFR) is built into the JDK and typically costs around 1% overhead with its default settings (the more detailed profile settings cost a little more), so you can leave it on in production. It records a .jfr file of CPU samples, allocations, GC events and more, which you read with the jfr CLI or open in JDK Mission Control (JMC).

    async-profiler is a separate, low-overhead sampling profiler that produces interactive flame graphs for both CPU (where time goes) and allocations (what creates garbage). Two recordings, two different questions — always check allocation too, because excessive object creation is one of the most common hidden slowdowns.

    Profiling commands you actually run (JFR, jcmd, async-profiler)
    # === Java Flight Recorder (JFR) — built into the JDK, <1% overhead ========
    # Start a 60-second recording when you launch the app:
    java -XX:StartFlightRecording=duration=60s,filename=app.jfr,settings=profile \
         -jar app.jar
    
    # Or attach to an already-running JVM by its process id:
    jps                                       # list Java pids
    jcmd 12345 JFR.start duration=60s filename=/tmp/app.jfr settings=profile
    jcmd 12345 JFR.dump  filename=/tmp/snap.jfr    # snapshot mid-flight
    jcmd 12345 JFR.stop                            # stop the recording
    
    # Read events straight from the CLI without a GUI:
    jfr summary app.jfr                       # GC count, CPU load, thread totals
    jfr print --events jdk.GarbageCollection app.jfr
    jfr print --events jdk.ObjectAllocationSample app.jfr | head -40
    jmc app.jfr                               # open in JDK Mission Control (separate download; flame graphs)
    
    # === async-profiler — sampling CPU + allocation flame graphs (~2%) =======
    # https://github.com/async-profiler/async-profiler
    ./asprof -d 30 -f cpu.html 12345              # 30s CPU flame graph -> HTML
    ./asprof -d 30 -e alloc -f alloc.html 12345   # WHERE allocations happen
    
    # === Always-on safety net for production =================================
    java -XX:+HeapDumpOnOutOfMemoryError \
         -XX:HeapDumpPath=/var/log/myapp/heap-%p.hprof \
         -Xlog:gc*:file=/var/log/myapp/gc.log:time,uptime:filecount=5,filesize=20m \
         -jar app.jar
    # You get a heap dump for free if it OOMs, and a GC log to diagnose pauses.
    Output
    $ jfr summary app.jfr
     Version: 2.1
     Events: 41,233
    
     Event Type                          Count   Size (bytes)
    =====================================================
     jdk.ObjectAllocationSample          18,402        612,331
     jdk.ExecutionSample                 12,810        421,004
     jdk.GarbageCollection                  127         18,902
     jdk.GCHeapSummary                      254         40,124
    
    # 127 GC events in 60s and allocation sampling dominating the recording
    # is your first clue: this app is allocating too much, not CPU-bound.
    jps, jcmd, jfr and JFR itself all ship with the JDK, so there is nothing to install. JDK Mission Control (jmc) has been a separate download since JDK 11, and async-profiler is a separate download too. Run these in your own terminal against a real JVM.
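The same record-then-read steps are also available from code through the JDK's jdk.jfr API. The following is a minimal sketch, assuming JDK 11 or later on a HotSpot JVM with JFR available. The class name JfrEventCounts, the temporary file, and the choice to enable only jdk.GarbageCollection are illustrative.

```java
import jdk.jfr.Recording;
import jdk.jfr.consumer.RecordedEvent;
import jdk.jfr.consumer.RecordingFile;

import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;

public class JfrEventCounts {

    // Run the workload while a recording is active, then dump it to a temp .jfr file.
    static Path record(Runnable workload) {
        try (Recording rec = new Recording()) {
            Path out = Files.createTempFile("demo", ".jfr");
            rec.enable("jdk.GarbageCollection");   // record GC events only
            rec.start();
            workload.run();
            rec.stop();
            rec.dump(out);
            return out;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Count events per event type, a tiny stand-in for "jfr summary".
    static Map<String, Long> countByType(Path jfrFile) {
        Map<String, Long> counts = new TreeMap<>();
        try (RecordingFile file = new RecordingFile(jfrFile)) {
            while (file.hasMoreEvents()) {
                RecordedEvent event = file.readEvent();
                counts.merge(event.getEventType().getName(), 1L, Long::sum);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Request a few collections so there is likely something to count.
        Path file = record(() -> {
            for (int i = 0; i < 5; i++) System.gc();
        });
        countByType(file).forEach((type, n) -> System.out.println(type + "  " + n));
    }
}
```

The event counts depend on the collector and JVM flags, so treat the printed numbers as a smoke test. For real analysis, the jfr CLI or JMC remain the more convenient tools.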

    4️⃣ Reading a Flame Graph

    A flame graph turns thousands of sampled call stacks into one picture. Learn to read it and you can find a hotspot in seconds.

    • Y axis = stack depth. Callers at the bottom, the methods they call stacked on top.
    • X axis is NOT time. Frame width = share of samples — how much CPU (or allocation) that method accounts for.
    • Wide, flat bars = your hotspots. Optimise the widest frame first.
    • Tall, thin towers are deep call chains that individually cost little — usually not worth touching.
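To see where frame widths come from, the sketch below folds a handful of made-up sampled stacks into a per-method share of samples. Each stack is written caller-first with semicolons between frames, in the style of the collapsed-stack text that flame graph tools consume. The stacks, the method names and the class FrameWidths are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class FrameWidths {

    // For each method, the percentage of samples whose stack contains it.
    // That percentage is the method's total width in the flame graph.
    static Map<String, Double> widths(List<String> sampledStacks) {
        Map<String, Integer> hits = new TreeMap<>();
        for (String stack : sampledStacks) {
            // A set, so a recursive method is counted once per sample.
            for (String frame : new LinkedHashSet<>(Arrays.asList(stack.split(";")))) {
                hits.merge(frame, 1, Integer::sum);
            }
        }
        Map<String, Double> share = new TreeMap<>();
        hits.forEach((frame, n) -> share.put(frame, 100.0 * n / sampledStacks.size()));
        return share;
    }

    public static void main(String[] args) {
        // Ten hypothetical samples taken by a profiler.
        List<String> samples = new ArrayList<>();
        for (int i = 0; i < 6; i++) samples.add("main;handle;parseJson");
        for (int i = 0; i < 3; i++) samples.add("main;handle;render");
        samples.add("main;log");

        widths(samples).forEach((frame, pct) -> System.out.println(frame + " " + pct + "%"));
        // prints (alphabetical order):
        // handle 90.0%
        // log 10.0%
        // main 100.0%
        // parseJson 60.0%
        // render 30.0%
    }
}
```

Note that main and handle are wide only because they sit underneath everything else. The widest frame near the top of its stack, parseJson here, is where the samples actually landed.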

    💡 The maths of effort: a method that is 2% of the width can never give you more than a 2% speed-up, no matter how cleverly you rewrite it. A 60%-wide method is where the real win is. Width tells you where to spend your time — that's the whole point.
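That limit is Amdahl's law. As a small worked sketch (the class name and the sample fractions are illustrative): if a method accounts for fraction p of the run time and you make it s times faster, the whole program speeds up by 1 / ((1 - p) + p / s).

```java
public class Amdahl {

    // Overall speed-up when a part taking fraction p of the run time
    // is made s times faster and the rest is left unchanged.
    static double overallSpeedup(double p, double s) {
        return 1.0 / ((1.0 - p) + p / s);
    }

    public static void main(String[] args) {
        // A 2%-wide frame made 1000x faster: the program barely changes.
        System.out.println("p=0.02, s=1000 -> " + overallSpeedup(0.02, 1000));
        // A 60%-wide frame made only 2x faster: a much bigger win.
        System.out.println("p=0.60, s=2    -> " + overallSpeedup(0.60, 2));
        // The same 60%-wide frame made 10x faster.
        System.out.println("p=0.60, s=10   -> " + overallSpeedup(0.60, 10));
    }
}
```

The three results are roughly 1.02x, 1.43x and 2.17x. Even a heroic rewrite of the narrow frame loses to a modest improvement of the wide one.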

    🎯 Your Turn #2 — Record & Read a Flame Graph

    Fill in the blanks to record a 20-second CPU flame graph, then answer the self-check question about which method to optimise first.

    Fill in the blanks
    # 🎯 YOUR TURN — record a 20s CPU flame graph of a running app,
    # then answer the question in the comment from what you see.
    
    # 1) Find the Java process id.
    ___                                          # 👉 replace ___ with  jps
    
    # 2) Record a 20-second CPU flame graph to HTML for pid 4567.
    ./asprof -d ___ -f cpu.html 4567             # 👉 replace ___ with  20
    
    # 3) Open cpu.html in a browser. Reading it:
    #    - The X axis is NOT time order — width = share of samples (CPU time).
    #    - Look for the WIDEST bar; that frame's method is your hotspot.
    #    - A tall-but-narrow stack is just a deep call chain, usually fine.
    #
    # ✅ Self-check: if "parseJson" is 60% of the width and everything else is
    #    a thin sliver, which method do you optimise first?  ->  parseJson
    #    (Optimising a 2%-wide method can never give more than a 2% speed-up.)

    5️⃣ Common Java Performance Traps

    Profilers keep surfacing the same handful of culprits in Java code. Knowing them lets you read a flame graph and immediately recognise the pattern.

    • Autoboxing in hot loops: mixing primitives with wrapper types (Integer, Long) silently allocates an object per operation — millions of them in a loop, which crushes the GC.
    • String concatenation with + in a loop: each + builds a brand-new String, turning O(n) work into O(n²). Use a StringBuilder.
    • Recompiling regexes / reallocating in a loop: Pattern.compile(...) or new-ing throwaway objects inside the hot path repeats expensive work every iteration.
    • Premature optimisation: contorting code for speed before profiling proves it matters — the opposite trap.

    The runnable example below shows the autoboxing trap with a deterministic fix: a boxed Long accumulator versus a primitive long. Same answer, very different speed.

    Trap: a boxed Long accumulator in a hot loop
    import java.util.ArrayList;
    import java.util.List;
    
    public class Main {
    
        // Suppose the flame graph showed 70% of CPU in Long.valueOf (boxing).
        // The culprit: a boxed Long accumulator that re-boxes on EVERY loop pass.
    
        // SLOW: sum into a boxed Long. Each "+=" unboxes, adds, re-boxes a
        // brand-new Long object — millions of throwaway allocations.
        static long sumBoxed(List<Integer> data) {
            Long total = 0L;                         // boxed accumulator (bad)
            for (Integer v : data) {
                total += v;                          // unbox v, unbox total, box result
            }
            return total;
        }
    
        // FAST: keep the accumulator a PRIMITIVE long. Zero allocations in the
        // loop — the JIT keeps it in a CPU register.
        static long sumPrimitive(List<Integer> data) {
            long total = 0L;                         // primitive accumulator (good)
            for (Integer v : data) {
                total += v;                          // one unbox; no boxing of the sum
            }
            return total;
        }
    
        public static void main(String[] args) {
            List<Integer> data = new ArrayList<>();
            for (int i = 0; i < 5_000_000; i++) data.add(i);
    
            // Warm the JIT so we measure steady-state, not interpreter, speed.
            for (int i = 0; i < 5; i++) { sumBoxed(data); sumPrimitive(data); }
    
            long t0 = System.nanoTime();
            long a = sumBoxed(data);
            long boxedMs = (System.nanoTime() - t0) / 1_000_000;
    
            long t1 = System.nanoTime();
            long b = sumPrimitive(data);
            long primMs = (System.nanoTime() - t1) / 1_000_000;
    
            System.out.println("Boxed Long sum:     " + a + "  in " + boxedMs + " ms");
            System.out.println("Primitive long sum: " + b + "  in " + primMs + " ms");
            System.out.println("Speed-up:           "
                    + String.format("%.1fx", (double) boxedMs / Math.max(primMs, 1)));
            System.out.println();
            System.out.println("(For publication numbers use JMH, not System.nanoTime.)");
        }
    }
    Output
    Boxed Long sum:     12499997500000  in 41 ms
    Primitive long sum: 12499997500000  in 7 ms
    Speed-up:           5.9x
    
    (For publication numbers use JMH, not System.nanoTime.)
    The exact millisecond numbers depend on your machine and JIT state, but the slow/fast ratio is reproducible. Paste it into onecompiler.com/java or a local JDK. For publication-quality timing, use JMH instead of System.nanoTime().
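The regex trap from the list above follows the same pattern. Here is a minimal sketch, with a hypothetical date-matching task and illustrative names: String.matches compiles its regex on every call, while a Pattern stored in a static final field is compiled once and reused.

```java
import java.util.List;
import java.util.regex.Pattern;

public class RegexHoist {

    // SLOW: String.matches(...) compiles the regex again on every iteration.
    static long countSlow(List<String> lines) {
        long hits = 0;
        for (String line : lines) {
            if (line.matches("\\d{4}-\\d{2}-\\d{2}")) hits++;
        }
        return hits;
    }

    // FAST: compile once, outside the hot path, and reuse the Pattern.
    private static final Pattern ISO_DATE = Pattern.compile("\\d{4}-\\d{2}-\\d{2}");

    static long countFast(List<String> lines) {
        long hits = 0;
        for (String line : lines) {
            if (ISO_DATE.matcher(line).matches()) hits++;
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("2024-01-31", "not a date", "1999-12-01", "2024-1-3");
        System.out.println("slow: " + countSlow(lines));   // prints "slow: 2"
        System.out.println("fast: " + countFast(lines));   // prints "fast: 2"
    }
}
```

Both methods return the same count. Whether the difference matters in your program is something only a profile can tell you.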

    6️⃣ GC Tuning Basics

    The garbage collector (GC) reclaims objects you no longer reference. Most of the time it just works — and the single biggest GC win is allocating less in the first place. Only reach for tuning when a recording shows real GC pain: long pauses, or GC eating a big slice of CPU.

    When you do tune, start by matching the collector to your goal, then size the heap, then change one flag at a time and re-measure with a GC log.

    Collector      Flag                   Optimised for
    G1 (default)   -XX:+UseG1GC           Balanced pause vs throughput
    ZGC            -XX:+UseZGC            Very low pauses, large heaps
    Shenandoah     -XX:+UseShenandoahGC   Low pauses, concurrent
    Parallel       -XX:+UseParallelGC     Raw throughput, batch jobs

    🧠 Rule of thumb:

    Turn on a GC log first — -Xlog:gc*:file=gc.log:time,uptime — and read it before changing anything. Copying a wall of -XX: flags from a random blog post usually makes things worse, not better.
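For a quick first look before you open a GC log, the standard java.lang.management beans report which collectors are active and how much time they have accumulated. The sketch below is a rough indicator only: the class name GcShare and the garbage-producing loop are illustrative, and for concurrent collectors the reported time is not the same thing as pause time.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcShare {

    // Fraction of uptime attributed to GC, e.g. 0.05 means 5%.
    static double gcTimeShare(long gcMillis, long uptimeMillis) {
        return uptimeMillis <= 0 ? 0.0 : (double) gcMillis / uptimeMillis;
    }

    public static void main(String[] args) {
        // Produce some short-lived garbage so the collectors may have work to report.
        long checksum = 0;
        for (int i = 0; i < 2_000_000; i++) {
            byte[] junk = new byte[256];
            junk[0] = (byte) i;
            checksum += junk[0];
        }

        long totalGcMillis = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();   // -1 if the collector does not report it
            long millis = gc.getCollectionTime();   // -1 if the collector does not report it
            System.out.println(gc.getName() + ": " + count + " collections, " + millis + " ms");
            if (millis > 0) totalGcMillis += millis;
        }

        long uptime = ManagementFactory.getRuntimeMXBean().getUptime();
        System.out.printf("GC share of uptime: %.2f%% (checksum %d)%n",
                100 * gcTimeShare(totalGcMillis, uptime), checksum);
    }
}
```

The bean names also tell you which collector the JVM actually picked, which is worth confirming before you change any flag.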

    Mini-Challenge — Prove the String Trap

    Scaffolding removed. Read the comment outline, then write it yourself: build a big string two ways (+= in a loop vs a StringBuilder), time both, and print the slow/fast ratio. The expected shape is in the comment so you can self-check.

    Your challenge — only a comment outline is given
    import java.util.List;
    
    public class Main {
    
        // 🎯 MINI-CHALLENGE: prove string concatenation in a loop is the bottleneck
        // 1. Write buildSlow(int n): start "String s = \"\";" and do  s += i + ","
        //    inside a for-loop. Return s.
        // 2. Write buildFast(int n): use a StringBuilder, append i + "," each turn,
        //    and return sb.toString().
        // 3. In main: warm up both with a few calls, then time each on n = 50_000
        //    using System.nanoTime(), and print the millisecond cost of each plus
        //    the slow/fast ratio.
        //
        // ✅ Expected (numbers vary by machine, the RATIO is the point):
        //    Slow (+=):        ~900 ms
        //    Fast (builder):   ~2 ms
        //    StringBuilder is hundreds of times faster — same output string.
        //
        // 💡 Once it works, the real lesson: the profiler/flame graph would have
        //    pointed straight at String.<init> / Arrays.copyOf before you guessed.
    
        static String buildSlow(int n) {
            // your code here
            return "";
        }
    
        static String buildFast(int n) {
            // your code here
            return "";
        }
    
        public static void main(String[] args) {
            // your code here
        }
    }

    Common Errors (and the fix)

    • Micro-benchmarking without JMH: a System.nanoTime() loop can end up timing the interpreter before warmup, and after warmup the JIT may delete the loop as dead code. Fix: use JMH, which warms up, forks JVMs, and keeps results alive when you return them or pass them to a Blackhole.
    • Optimising the wrong thing: "I'm sure this method is slow" is a guess, not evidence; the real hotspot is almost always elsewhere. Fix: profile first, find the widest frame in the flame graph, and only optimise that.
    • Ignoring GC / allocation: a CPU profile looks fine but the app stutters because it allocates millions of short-lived objects and the GC runs constantly. Fix: take an allocation profile (async-profiler -e alloc or JFR allocation events) and cut the biggest allocators.
    • Autoboxing in a hot loop: Long total = 0L; total += v; re-boxes a new Long every iteration. Fix: keep accumulators primitive (long total = 0L;) and prefer IntStream/LongStream over Stream<Integer>.
    • Heap dump fills the disk: jcmd PID GC.heap_dump writes a file as big as your heap (could be many GB) and can crash the box if disk is short. Fix: check free space first, and prefer -XX:+HeapDumpOnOutOfMemoryError with a dedicated -XX:HeapDumpPath.
    • Premature optimisation: rewriting clean code into something clever and buggy before profiling proves it's hot. Fix: make it work, make it right, then — guided by a profiler — make the measured hotspot fast.
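The disk-space check from the heap dump item can be sketched in a few lines. This is an assumption-laden example: the class name HeapDumpGuard, the one-sixth safety margin and the use of the temp directory as the dump location are all illustrative, and it inspects the current JVM as a stand-in for the process you would actually dump.

```java
import java.io.File;

public class HeapDumpGuard {

    // True if a dump of heapBytes fits while leaving about a sixth of the
    // free space untouched as a margin (the margin is an arbitrary choice).
    static boolean enoughSpace(long heapBytes, long usableBytes) {
        return heapBytes <= usableBytes - usableBytes / 6;
    }

    public static void main(String[] args) {
        // Hypothetical dump location: the system temp directory.
        File dumpDir = new File(System.getProperty("java.io.tmpdir"));

        // Worst case: the dump can approach the maximum heap size (-Xmx).
        long maxHeap = Runtime.getRuntime().maxMemory();
        long usable = dumpDir.getUsableSpace();

        System.out.println("max heap:   " + maxHeap / (1024 * 1024) + " MB");
        System.out.println("free space: " + usable / (1024 * 1024) + " MB");
        System.out.println(enoughSpace(maxHeap, usable)
                ? "OK to dump here"
                : "Not enough room, pick another disk");
    }
}
```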

    📋 Quick Reference

    Goal                     Command / API                     Notes
    Start a recording        -XX:StartFlightRecording=...      Writes a .jfr file
    Attach to a live JVM     jcmd PID JFR.start ...            Find PID with jps
    Read a recording         jfr summary app.jfr               Or open in JMC
    CPU flame graph          asprof -d 30 -f cpu.html PID      async-profiler
    Allocation flame graph   asprof -e alloc -f a.html PID     Finds garbage sources
    Benchmark two options    @Benchmark (JMH)                  Return the result!
    Heap dump                jcmd PID GC.heap_dump f.hprof     As big as the heap
    Thread dump              jcmd PID Thread.print             Thread states
    GC log                   -Xlog:gc*:file=gc.log             Read before tuning
    Dump on OOM              -XX:+HeapDumpOnOutOfMemoryError   Set HeapDumpPath

    Frequently Asked Questions

    Why can't I just time my code with System.nanoTime() in a loop?

    Because the JVM is not a static machine. Before the JIT compiler warms up, your code runs in the slow interpreter, so the first runs are misleadingly slow. After warmup, the JIT can prove a result is never used and delete the whole loop ('dead-code elimination'), making it misleadingly fast (or instant). On-stack replacement, GC pauses, and CPU frequency scaling add more noise. JMH solves all of this: it warms up first, forks fresh JVMs, uses Blackhole/return values to defeat dead-code elimination, and reports a confidence interval. Hand-rolled nanoTime timing of micro-operations is almost always wrong.

    What is the difference between CPU profiling and allocation profiling?

    CPU profiling answers 'where is my time going?' — it samples the call stack periodically and shows which methods are on-CPU most often (the wide bars in a flame graph). Allocation profiling answers 'what is creating garbage?' — it samples object allocations and shows which call sites allocate the most bytes. They diagnose different problems: a CPU-bound app needs a faster algorithm; an allocation-heavy app needs fewer objects (it shows up as frequent GC and high GC CPU). async-profiler and JFR can do both; always check allocation too, because excessive allocation is one of the most common hidden Java performance problems.
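To get a feel for what an allocation profile measures, you can read the per-thread allocation counter that HotSpot exposes. This is a minimal sketch, assuming a HotSpot-based JDK where the platform ThreadMXBean can be cast to com.sun.management.ThreadMXBean. The class name AllocationMeter is illustrative, and the byte counts depend on the JIT, which may optimise some boxing away.

```java
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

public class AllocationMeter {

    // Boxed accumulator: each "+=" may allocate a new Long.
    static long sumBoxed(List<Integer> data) {
        Long total = 0L;
        for (Integer v : data) total += v;
        return total;
    }

    // Primitive accumulator: no boxing of the running sum.
    static long sumPrimitive(List<Integer> data) {
        long total = 0L;
        for (Integer v : data) total += v;
        return total;
    }

    // Bytes allocated so far by the current thread (HotSpot-specific; -1 if unsupported).
    static long allocatedBytes() {
        com.sun.management.ThreadMXBean bean =
                (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();
        return bean.getThreadAllocatedBytes(Thread.currentThread().getId());
    }

    public static void main(String[] args) {
        List<Integer> data = new ArrayList<>();
        for (int i = 0; i < 1_000_000; i++) data.add(i);

        long before = allocatedBytes();
        long a = sumBoxed(data);
        long middle = allocatedBytes();
        long b = sumPrimitive(data);
        long after = allocatedBytes();

        System.out.println("boxed:     sum=" + a + ", allocated ~" + (middle - before) + " bytes");
        System.out.println("primitive: sum=" + b + ", allocated ~" + (after - middle) + " bytes");
    }
}
```

A CPU profile of these two methods would look similar in shape. The allocation counter is what separates them, which is the reason to record both kinds of profile.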

    How do I read a flame graph?

    A flame graph is a picture of sampled call stacks. The Y axis is stack depth (callers at the bottom, callees stacked on top). The X axis is NOT time order — the width of a frame is the share of samples that landed in it, i.e. how much CPU (or allocation) it accounts for. So you scan for the WIDEST frames and optimise those first; a 60%-wide method is where the win is. Tall, thin towers are deep call chains that individually cost little. Optimising a 2%-wide frame can never make the program more than 2% faster, so width tells you where to spend effort.

    Should I tune the garbage collector to make my app faster?

    Usually not first. The biggest GC win is allocating less — if your allocation profile is flat, GC mostly takes care of itself. Reach for GC tuning when a recording shows real problems: long pauses, or GC eating a large share of CPU. Start by picking the right collector for the goal (G1 is the balanced default; ZGC or Shenandoah for very low pause times; Parallel for raw throughput batch jobs) and sizing the heap with -Xmx. Tune one flag at a time and measure with a GC log (-Xlog:gc*). Random flag-copying from a blog usually makes things worse.

    What does 'premature optimization is the root of all evil' actually mean?

    Donald Knuth's full quote is 'we should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.' It does NOT mean ignore performance — it means don't contort code for speed before you have profiled and know it matters. Most code is not on a hot path, and complicating it for an imaginary speed-up costs readability and introduces bugs for no real gain. Make it work, make it right, then — guided by a profiler — make the measured hotspot fast.

    Why is autoboxing in a hot loop so expensive?

    Autoboxing silently wraps a primitive (int, long, double) into its object form (Integer, Long, Double) whenever you mix primitives with generics or wrapper-typed variables. Each box is a heap allocation, and in a loop that runs millions of times that is millions of throwaway objects — which hammers the allocator and the garbage collector. A boxed accumulator like 'Long total = 0L; total += v;' re-boxes on every iteration. Keep accumulators and hot-loop variables as primitives, prefer IntStream/LongStream over Stream<Integer>, and use primitive-specialised collections when the loop is genuinely hot.
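The stream advice can be illustrated with a short sketch. The class and method names are illustrative. Both methods compute the same sum, but the first pushes every element and every partial result through Long objects, while the second works on raw long values.

```java
import java.util.stream.LongStream;
import java.util.stream.Stream;

public class PrimitiveStreams {

    // Boxed: Stream<Long> carries each element as a Long object,
    // and reduce(...) boxes each partial sum.
    static long sumBoxedStream(long n) {
        return Stream.iterate(0L, i -> i + 1).limit(n).reduce(0L, Long::sum);
    }

    // Primitive: LongStream operates on long values directly.
    static long sumPrimitiveStream(long n) {
        return LongStream.range(0, n).sum();
    }

    public static void main(String[] args) {
        long n = 5_000_000;
        System.out.println("boxed stream:     " + sumBoxedStream(n));      // prints 12499997500000
        System.out.println("primitive stream: " + sumPrimitiveStream(n));  // prints 12499997500000
    }
}
```

To compare their speed, wrap each method in a JMH @Benchmark instead of timing main by hand.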

    🎉 Lesson Complete!

    Excellent work! You can now measure instead of guess: write correct JMH benchmarks, record CPU and allocation profiles with JFR and async-profiler, read a flame graph to find the real hotspot, tune the GC only when the evidence calls for it, and recognise the classic Java traps — autoboxing, string concatenation in loops, and premature optimisation.

    Next up: Microservices — building distributed systems with Spring Boot, where these profiling skills scale up to whole fleets of services.
