Lesson 41 • Advanced
API Monitoring & Observability 📊
By the end of this lesson you'll instrument a PHP API with structured logs, a health-check endpoint, latency percentiles, and correlation IDs — so when something breaks at 3 AM you can see exactly what, where, and why.
What You'll Learn in This Lesson
- Write structured (JSON) logs with consistent fields, the idea behind Monolog
- Explain the three pillars of observability: logs, metrics, and traces
- Build a health-check endpoint that returns 200 or 503
- Track request and error rates and report exceptions to tools like Sentry
- Measure latency with percentiles (p50, p95, p99), not just averages
- Thread a correlation ID through logs and alert without causing alert fatigue
php file.php. The Output panel under each example shows exactly what to expect.
1️⃣ The Three Pillars: Logs, Metrics, Traces
Observability is your ability to understand what's happening inside a running system just from the data it emits. It rests on three pillars. Logs are timestamped records of individual events ("a request was handled", "an error was thrown"). Metrics are numbers aggregated over time (requests per second, error rate, p95 latency) that you chart and alert on. Traces follow one request as it hops between services so you can see exactly where the time went. Logs tell you what happened, metrics tell you how often and how bad, and traces tell you where. This lesson builds logs and metrics directly and introduces correlation IDs as the first step toward traces.
2️⃣ Structured Logging with Monolog
Structured logging means every log entry is a machine-readable object with consistent fields — level, message, timestamp, plus context — rather than a free-text sentence. The payoff is querying: "show me every level=error for user=u_004 in the last hour" becomes a one-line filter. In real PHP you'd reach for Monolog, the standard logging library (it implements PSR-3, the shared logging interface), and call $log->error('msg', [...]). The example below hand-rolls the same idea so you can see precisely what one structured log line is.
<?php
// Structured logging: every log line is a JSON object with consistent fields,
// not a free-text sentence. Machines can then filter "all level=error" instantly.
// Monolog is the standard PHP library for this; here we hand-roll the same idea
// so you can see exactly what a structured log line is.
function logEvent(string $level, string $message, array $context = []): void {
    $entry = [
        'time' => '2026-06-16T09:30:00Z', // ISO-8601 timestamp (fixed here so output is stable)
        'level' => $level, // info | warning | error (your filter key)
        'message' => $message, // a short, stable description
    ] + $context; // merge in extra fields (user, path, ms...)
    // JSON_UNESCAPED_SLASHES keeps "/api/users" readable instead of "\/api\/users".
    echo json_encode($entry, JSON_UNESCAPED_SLASHES) . "\n";
}

// A normal request: one structured line carrying its own context.
logEvent('info', 'request.handled', [
    'method' => 'GET',
    'path' => '/api/users',
    'status' => 200,
    'ms' => 42, // how long the request took, in milliseconds
    'user' => 'u_001',
]);

// A failure: same shape, level='error'. You can now query for level=error alone.
logEvent('error', 'request.failed', [
    'method' => 'GET',
    'path' => '/api/users',
    'status' => 500,
    'ms' => 2300,
    'user' => 'u_004',
]);
?>
Output:
{"time":"2026-06-16T09:30:00Z","level":"info","message":"request.handled","method":"GET","path":"/api/users","status":200,"ms":42,"user":"u_001"}
{"time":"2026-06-16T09:30:00Z","level":"error","message":"request.failed","method":"GET","path":"/api/users","status":500,"ms":2300,"user":"u_004"}
Each line is valid JSON with the same keys in the same shape. That consistency is the whole point: a log platform (Datadog, Elasticsearch, Loki) can index every field, so you filter and chart logs instead of grepping plain text. Notice the level field: that single key is what separates routine info noise from the error you actually need to see.
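To make the "one-line filter" idea concrete, here is a minimal sketch of what a log tool does with structured lines: decode each one and keep those whose fields match. The filterLogs helper is hypothetical (it is not part of Monolog or any log platform), and real tools query an index rather than looping over lines.

```php
<?php
// Hypothetical helper for illustration: keep only the log lines whose
// fields match EVERY key/value pair in $filters.
function filterLogs(array $lines, array $filters): array {
    $matches = [];
    foreach ($lines as $line) {
        $entry = json_decode($line, true); // JSON object -> associative array
        if (!is_array($entry)) {
            continue; // skip anything that is not a valid JSON object
        }
        foreach ($filters as $field => $wanted) {
            if (($entry[$field] ?? null) !== $wanted) {
                continue 2; // one mismatch rules this line out
            }
        }
        $matches[] = $entry;
    }
    return $matches;
}

// Two lines in the same shape as the worked example above.
$lines = [
    '{"time":"2026-06-16T09:30:00Z","level":"info","message":"request.handled","path":"/api/users","status":200,"ms":42,"user":"u_001"}',
    '{"time":"2026-06-16T09:30:00Z","level":"error","message":"request.failed","path":"/api/users","status":500,"ms":2300,"user":"u_004"}',
];

// "Show me every level=error for user=u_004".
foreach (filterLogs($lines, ['level' => 'error', 'user' => 'u_004']) as $entry) {
    echo "{$entry['message']} status={$entry['status']} ms={$entry['ms']}\n";
}
// prints: request.failed status=500 ms=2300
```

The same query against free-text logs would need a fragile regular expression per field, which is the practical argument for a consistent shape.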
3️⃣ Health-Check Endpoints
A health-check endpoint is a single URL — conventionally /health — that an uptime monitor or load balancer pings every few seconds. It checks your real dependencies (database, cache, queue) and returns HTTP 200 when everything is up or 503 (Service Unavailable) when something is down. The status code is what machines read: a load balancer that sees 503 stops sending traffic to that server, and an uptime monitor that sees 503 fires an alert. Keep the check fast and cheap — it runs constantly.
<?php
// A health-check endpoint is a single URL (usually /health) that an uptime
// monitor pings every few seconds. It returns 200 when healthy and 503 when not,
// so the monitor (and your load balancer) can route traffic away from a sick box.
function checkDatabase(): bool {
    // In real code: run a tiny query like "SELECT 1". Here we fake a healthy DB.
    return true;
}

function checkCache(): bool {
    // In real code: PING your Redis/Memcached. Here we fake the cache being DOWN.
    return false;
}

// Run each dependency check and collect a pass/fail map.
$checks = [
    'database' => checkDatabase(),
    'cache' => checkCache(),
];

// The whole service is "healthy" only if EVERY dependency passed.
$healthy = !in_array(false, $checks, true);

// 200 = OK, 503 = Service Unavailable. The status CODE is what monitors read.
$httpStatus = $healthy ? 200 : 503;
$body = [
    'status' => $healthy ? 'healthy' : 'unhealthy',
    'checks' => array_map(fn($ok) => $ok ? 'up' : 'down', $checks),
];

// In a real endpoint: http_response_code($httpStatus); header('Content-Type: application/json');
echo "HTTP {$httpStatus}\n";
echo json_encode($body, JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES) . "\n";
?>
Output:
HTTP 503
{
    "status": "unhealthy",
    "checks": {
        "database": "up",
        "cache": "down"
    }
}
Here the cache is down, so the overall status flips to unhealthy and the endpoint returns 503. In a real handler you'd call http_response_code($httpStatus) and send the JSON body. The per-dependency map (database: up, cache: down) tells the on-call engineer which thing broke without reading any logs.
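Real dependency checks tend to throw rather than politely return false, so one possible refinement is to wrap each check and treat an exception as "down". The sketch below assumes checks are passed in as callables; runChecks and healthStatus are hypothetical names, and the closures stand in for real database and cache calls.

```php
<?php
// Hypothetical runner: each check is a callable that returns true when healthy.
// A check that throws is recorded as "down" instead of crashing /health itself.
function runChecks(array $checks): array {
    $results = [];
    foreach ($checks as $name => $check) {
        try {
            $results[$name] = $check() === true;
        } catch (Throwable $e) {
            $results[$name] = false;
        }
    }
    return $results;
}

// Same rule as the worked example: any failed dependency means 503.
function healthStatus(array $results): int {
    return in_array(false, $results, true) ? 503 : 200;
}

$results = runChecks([
    'database' => fn() => true, // stand-in for a real "SELECT 1"
    'cache' => function () {
        // stand-in for a Redis PING that cannot connect
        throw new RuntimeException('connection refused');
    },
]);

echo "HTTP " . healthStatus($results) . "\n";
echo json_encode($results) . "\n";
// prints:
// HTTP 503
// {"database":true,"cache":false}
```

Catching Throwable keeps the endpoint answering with a clean 503 even when a client library fails in an unexpected way.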
4️⃣ Error Tracking & Uptime Monitoring
Logs are great for searching, but you don't want to read them to find out something broke. Error tracking tools like Sentry capture every uncaught exception, group identical errors together, count how often each fires, and message you with the stack trace and the request that caused it. Uptime monitoring (UptimeRobot, Pingdom, Better Stack) hits your health-check endpoint from outside your network on a schedule and alerts you the moment it stops returning 200 — catching the case where your whole server is down and can't even log. The pattern: logs to search, error tracking to be told, uptime to confirm you're reachable at all.
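The "group identical errors and count them" behaviour can be illustrated with a few lines of plain PHP. This is a hand-rolled sketch, not how Sentry works internally: the fingerprint rule (exception class plus throw location) and the function names fingerprint, capture, and chargeCard are assumptions made for the example.

```php
<?php
// Assumed grouping rule for this sketch: same exception class thrown from
// the same file and line counts as "the same error".
function fingerprint(Throwable $e): string {
    return get_class($e) . '@' . basename($e->getFile()) . ':' . $e->getLine();
}

// Record one exception: create its group on first sight, then bump the count.
function capture(array &$groups, Throwable $e): void {
    $key = fingerprint($e);
    if (!isset($groups[$key])) {
        $groups[$key] = ['count' => 0, 'message' => $e->getMessage()];
    }
    $groups[$key]['count']++;
}

// Hypothetical failing operation: always throws from the same line.
function chargeCard(string $user): void {
    throw new RuntimeException("charge failed for {$user}");
}

$groups = [];

// Three users hit the same bug: one group, count 3.
foreach (['u_001', 'u_004', 'u_009'] as $user) {
    try {
        chargeCard($user);
    } catch (Throwable $e) {
        capture($groups, $e);
    }
}

// A different error from a different place: a second group, count 1.
try {
    intdiv(1, 0);
} catch (Throwable $e) {
    capture($groups, $e);
}

foreach ($groups as $key => $group) {
    echo "{$group['count']}x {$key} (first seen: {$group['message']})\n";
}
```

Grouping is what turns three thousand identical stack traces into one line with a counter, which is why an error tracker is easier to act on than raw logs.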
5️⃣ Latency Metrics & Percentiles
Latency is how long a request takes. Tracking only the average hides your worst experiences: if 99 requests take 40ms and one takes 4000ms, the average is a comfortable ~80ms — yet a real user waited four seconds. That's why teams track percentiles. p50 (the median) is a typical request; p95 means 95% of requests were at least this fast; p99 is your slow tail. Alerts go on p95/p99 because that's where users actually feel pain. Always pair percentiles with throughput (requests per second) and error rate.
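The arithmetic in that example is easy to check by hand or in code. The short sketch below rebuilds the same data set (99 requests at 40ms, one at 4000ms) and compares the mean with the median and the maximum; the variable names are illustrative only.

```php
<?php
// 99 requests at 40 ms plus one at 4000 ms, as described above.
$latencies = array_merge(array_fill(0, 99, 40), [4000]);
sort($latencies);

$count = count($latencies);
$mean = array_sum($latencies) / $count; // (99 * 40 + 4000) / 100
$median = $latencies[intdiv($count + 1, 2) - 1]; // nearest-rank p50: position ceil(n / 2)
$max = max($latencies);

printf("mean:   %.1f ms\n", $mean);
printf("median: %d ms\n", $median);
printf("max:    %d ms\n", $max);
// prints:
// mean:   79.6 ms
// median: 40 ms
// max:    4000 ms
```

The mean lands at a value that no request actually experienced, while the median and the maximum describe the typical user and the unluckiest one.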
Now you try. The script below emits a structured error log — fill in each ___ using the 👉 hint, then run it and compare against the Output panel.
<?php
// 🎯 YOUR TURN: emit one structured ERROR log line, then run it.
// Same shape as the worked example: a level, a message, and a context array.
function logEvent(string $level, string $message, array $context = []): void {
    $entry = ['level' => $level, 'message' => $message] + $context;
    echo json_encode($entry, JSON_UNESCAPED_SLASHES) . "\n";
}

// 1) The first argument is the severity level.
logEvent(___, "payment.failed", [ // 👉 replace ___ with "error" (in quotes)
    "user" => "u_009",
    // 2) Add a numeric "status" field set to 500 so you can filter on it later.
    ___ => 500, // 👉 replace ___ with "status" (in quotes)
]);

// ✅ Expected output:
// {"level":"error","message":"payment.failed","user":"u_009","status":500}
?>
Output:
{"level":"error","message":"payment.failed","user":"u_009","status":500}
Hint: the two ___ blanks are the level "error" and the field name "status" (both in double quotes). Your output should be one JSON line.
One more. This one decides the HTTP status a health check should return. Fill in the operator and the failure code the hints point to.
<?php
// 🎯 YOUR TURN — decide the HTTP status for a health check.
// A service is healthy ONLY when every dependency check passed.
$checks = ['database' => true, 'queue' => false]; // the queue is down
// 1) "healthy" is true only if there is NO false value in $checks.
$healthy = ___ in_array(false, $checks, true); // 👉 replace ___ with the NOT operator: !
// 2) Healthy returns 200, unhealthy returns 503 (Service Unavailable).
$httpStatus = $healthy ? 200 : ___; // 👉 replace ___ with 503
echo "HTTP {$httpStatus} (" . ($healthy ? "healthy" : "unhealthy") . ")\n";
// ✅ Expected output:
// HTTP 503 (unhealthy)
?>
Output:
HTTP 503 (unhealthy)
Hint: replace the first ___ with the NOT operator ! and the second with 503. The queue is down, so the service is unhealthy.
6️⃣ Correlation IDs & Alerting
A correlation ID (also called a request ID or trace ID) is a unique string you generate the instant a request arrives — for example with bin2hex(random_bytes(8)). You attach it to every log line that request produces and forward it in an X-Request-Id header to any service it calls. When something breaks, you search for that one ID and instantly see the full story of that single request across every service — instead of untangling thousands of interleaved lines. This is the seed of distributed tracing.
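A minimal sketch of that flow, under a few assumptions: incoming headers are passed in as a plain array (in a real request you would read $_SERVER['HTTP_X_REQUEST_ID'] and validate it), and correlationId and logLine are hypothetical helpers rather than part of any library.

```php
<?php
// Reuse the caller's X-Request-Id if one was sent; otherwise mint a new ID.
function correlationId(array $headers): string {
    $incoming = $headers['X-Request-Id'] ?? '';
    return $incoming !== '' ? $incoming : bin2hex(random_bytes(8)); // 16 hex characters
}

// Hypothetical logger for this sketch: stamps the ID onto every line it builds.
function logLine(string $requestId, string $level, string $message, array $context = []): string {
    $entry = ['request_id' => $requestId, 'level' => $level, 'message' => $message] + $context;
    return json_encode($entry, JSON_UNESCAPED_SLASHES);
}

// Pretend an upstream proxy already assigned an ID to this request.
$requestId = correlationId(['X-Request-Id' => 'a1b2c3d4e5f60718']);

// Every line for this request carries the same request_id.
echo logLine($requestId, 'info', 'request.received', ['path' => '/api/orders']) . "\n";
echo logLine($requestId, 'error', 'payment.failed', ['status' => 500]) . "\n";

// Header to attach to any downstream call so its logs share the ID too.
$outgoingHeaders = ['X-Request-Id: ' . $requestId];
echo $outgoingHeaders[0] . "\n";
// prints:
// {"request_id":"a1b2c3d4e5f60718","level":"info","message":"request.received","path":"/api/orders"}
// {"request_id":"a1b2c3d4e5f60718","level":"error","message":"payment.failed","status":500}
// X-Request-Id: a1b2c3d4e5f60718
```

Reusing an incoming ID instead of always generating a new one is what lets the trail continue across service boundaries.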
Alerting turns metrics into action. The goal is to be paged for things users feel — error rate climbing, p99 latency spiking, the health check failing — and nothing else. The enemy is alert fatigue: so many noisy alerts that the team mutes them all, including the real one. Guard against it by alerting on symptoms not internal blips, requiring a duration ("error rate over 5% for 5 minutes") so a one-second spike doesn't page anyone, and routing low-priority signals to a dashboard instead of a phone.
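The duration rule ("error rate over 5% for 5 minutes") can be sketched as a streak counter over per-minute samples. This is an illustration only: shouldPage is a hypothetical function, the sample data is made up, and real alerting systems evaluate such rules for you.

```php
<?php
// Page only when the error rate stays above the threshold for
// $minutes consecutive one-minute samples.
function shouldPage(array $errorRates, float $threshold, int $minutes): bool {
    $streak = 0;
    foreach ($errorRates as $rate) {
        $streak = $rate > $threshold ? $streak + 1 : 0; // any good minute resets the streak
        if ($streak >= $minutes) {
            return true;
        }
    }
    return false;
}

// Illustrative per-minute error rates (fractions of failed requests).
$briefSpike = [0.01, 0.02, 0.40, 0.01, 0.01, 0.02, 0.01]; // one bad minute
$sustained = [0.01, 0.06, 0.07, 0.09, 0.08, 0.06, 0.02]; // five bad minutes in a row

echo "brief spike: " . (shouldPage($briefSpike, 0.05, 5) ? "page" : "no page") . "\n";
echo "sustained:   " . (shouldPage($sustained, 0.05, 5) ? "page" : "no page") . "\n";
// prints:
// brief spike: no page
// sustained:   page
```

The 40% spike is far worse than anything in the second series, yet it pages nobody because it lasted one minute; that trade-off is the point of a duration threshold.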
Common Errors (and the fix)
- Logs are free-text sentences you can't query: writing error_log("user 5 failed at checkout") gives you nothing to filter on. Emit structured logs (a JSON object with level, message, and a context array) so a log tool can index every field.
- You logged a password, token, or card number: secrets in logs are a breach waiting to happen, and logs are copied everywhere. Never log credentials. Redact or omit sensitive fields before logging, e.g. store "card" => "****1234", never the full number.
- Your service goes down and nobody notices: with no health-check endpoint, your load balancer and uptime monitor have nothing to test. Add a /health route that returns 200 when healthy and 503 when a dependency is down.
- The team ignores alerts (alert fatigue): if everything pages you, nothing does. Alert only on user-facing symptoms (error rate, p99 latency, health-check failure), add a duration threshold so brief spikes don't fire, and send low-priority noise to a dashboard instead of a phone.
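For the redaction fix in the list above, one possible approach is to scrub the context array just before it is logged. The redact function and its default key list are assumptions for this sketch; a real project would maintain its own list of sensitive field names.

```php
<?php
// Hypothetical scrubber: walk the context array and mask sensitive fields.
// The key list is an assumption for illustration; extend it for your own data.
function redact(array $context, array $sensitiveKeys = ['password', 'token', 'card']): array {
    foreach ($context as $key => $value) {
        $lower = strtolower((string) $key);
        if (is_array($value)) {
            $context[$key] = redact($value, $sensitiveKeys); // recurse into nested context
        } elseif (in_array($lower, $sensitiveKeys, true)) {
            // Keep the last four digits of a card for support; hide everything else entirely.
            $context[$key] = $lower === 'card'
                ? '****' . substr((string) $value, -4)
                : '[REDACTED]';
        }
    }
    return $context;
}

$context = [
    'user' => 'u_009',
    'card' => '4111111111111234',
    'auth' => ['token' => 's3cr3t-value'],
];

echo json_encode(redact($context)) . "\n";
// prints: {"user":"u_009","card":"****1234","auth":{"token":"[REDACTED]"}}
```

Running the scrubber in one central place (for example inside the logging function) is safer than relying on every call site to remember it.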
Pro Tips
- 💡 Use Monolog, not error_log. Its PSR-3 levels and handlers let one $log->error() call fan out to a file, Slack, and Sentry at once.
- 💡 Alert on percentiles, chart the average. p95/p99 catch the slow tail your users feel; the mean is fine for a quick eyeball but hides outliers.
- 💡 Attach the correlation ID in a Monolog processor. Set it once per request and every log line carries it automatically — no manual passing.
📋 Quick Reference — Monitoring & Observability
| Term | What It Is | When You Use It |
|---|---|---|
| Logs | Timestamped event records | Search "what happened" |
| Metrics | Numbers over time (rate, latency) | Chart & alert |
| Traces | One request across services | Find where time went |
| /health | Endpoint returning 200 / 503 | Uptime & load balancing |
| p95 / p99 | Latency percentiles (slow tail) | Latency alerts |
| Correlation ID | Unique per-request trace key | Tie logs together |
| Sentry | Error tracking + grouping | Be told about exceptions |
Frequently Asked Questions
Q: What are the three pillars of observability?
Logs, metrics, and traces. Logs are timestamped records of discrete events (a request was handled, an error was thrown). Metrics are numbers aggregated over time (requests per second, error rate, p95 latency) that you chart and alert on. Traces follow a single request as it hops between services, showing where the time went. Logs tell you what happened, metrics tell you how often and how bad, and traces tell you where in the system. Together they let you answer almost any 'why is it slow/broken?' question.
Q: Why should I use Monolog instead of just calling error_log()?
error_log() writes a plain string to one place. Monolog is the de-facto standard PHP logging library and implements PSR-3, the shared logging interface, so any framework can use it. It gives you severity levels (debug through emergency), structured context arrays, and 'handlers' that fan a log out to files, syslog, Slack, Sentry, or Elasticsearch at once. It also supports 'processors' that automatically attach things like a correlation ID to every line. You write $log->error('payment failed', ['user' => $id]) once and route it anywhere.
Q: Why is average latency misleading, and what should I track instead?
An average hides your worst experiences. If 99 requests take 40ms and one takes 4000ms, the average is only ~80ms — yet one real user waited four seconds. Track percentiles instead: p50 (the median, a typical request), p95 (95% of requests were at least this fast), and p99 (your slow tail). Teams set alerts on p95/p99 because that is where real users feel the pain. Always pair percentiles with throughput (requests/sec) and error rate.
Q: What is a correlation ID and why does every request need one?
A correlation ID (also called a request ID or trace ID) is a unique string generated when a request first arrives. You attach it to every log line that request produces and pass it in a header (commonly X-Request-Id) to any downstream service it calls. When something breaks, you grep for that one ID and instantly see the full story of that single request across every service and log file, instead of guessing which of thousands of interleaved lines belong together.
Q: How do I avoid alert fatigue?
Alert fatigue is when so many alerts fire — most of them noise — that the team starts ignoring all of them, including the real one. Fix it by alerting on symptoms users feel (error rate, p99 latency, the health check failing) rather than on every internal blip; by using thresholds with a duration ('error rate over 5% for 5 minutes') so a one-second spike does not page anyone; by routing low-priority issues to a dashboard or chat channel instead of paging a human at 3 AM; and by deleting or tuning any alert that has fired without ever being actionable.
Mini-Challenge: Latency Percentiles
No code is filled in this time — just a brief and an outline. Compute the p95 latency yourself, run it on onecompiler.com/php or your own machine, then check your result against the expected output in the comments. This is the same write-run-check loop you'll use to build any real metric.
<?php
// 🎯 MINI-CHALLENGE: latency percentiles
// Percentiles describe the SHAPE of your latency, not just the average.
// p95 = "95% of requests were at least this fast"; it catches slow outliers
// that a mean would hide. No code is filled in — work from the steps below.
//
// 1. Start with: $latencies = [40, 42, 45, 50, 38, 41, 600, 47, 44, 43];
// 2. sort() the array so the values run smallest to largest.
// 3. p95 index = (int) ceil(0.95 * count($latencies)) - 1
// (the position 95% of the way along the sorted list).
// 4. echo "p95: " . $latencies[$index] . "ms\n";
//
// Tip: that one 600ms outlier is exactly what p95 is built to expose.
//
// ✅ Expected output:
// p95: 600ms
// your code here
?>
🎉 Lesson Complete!
- ✅ Observability stands on three pillars: logs (what), metrics (how often/bad), and traces (where)
- ✅ Structured logs (JSON with level + context) are queryable; Monolog is the standard PHP library
- ✅ A health-check endpoint returns 200 when healthy and 503 when a dependency is down
- ✅ Error tracking (Sentry) tells you about exceptions; uptime monitoring confirms you're reachable
- ✅ Track percentiles (p50/p95/p99), not just averages — they reveal the slow tail users feel
- ✅ A correlation ID ties one request's logs together; tame alert fatigue by alerting only on user-facing symptoms
- ✅ Next lesson: RBAC & ACL — control who is allowed to do what with role-based access control