Lesson 33 • Advanced
Logging, Debugging & Error Handling at Scale
When your codebase is tiny, print() and random try/except blocks "work". Once you have APIs, background workers, cron jobs, queues, and multiple servers, that approach falls apart. Learn professional-grade debugging + logging + error-handling systems used by real engineering teams.
What You'll Master
This lesson takes you from basic debugging → production-grade logging systems used by companies handling thousands of users, multiple environments, and distributed services.
✔ Why logging beats print() in real systems
✔ Structured logging and log levels
✔ Multiple handlers and log routing
✔ Professional error handling strategies
✔ Centralized log aggregation
✔ Correlation IDs and distributed tracing
✔ Async logging techniques
✔ Monitoring, alerting & observability
✔ Production debugging workflows
Part 1: Logging Fundamentals & Error Handling
🔥 1. Why Logging Beats print() in Real Systems
When you have thousands of users, multiple environments (dev/staging/prod), background jobs, schedulers, and queues — print() falls apart.
| Feature | print() | logging |
|---|---|---|
| Output destination | stdout only | Console, files, HTTP, email |
| Severity levels | None | DEBUG → CRITICAL |
| Filtering | Not possible | Per module, per environment |
| Context | Manual string formatting | Attach user ID, request ID, etc. |
| Production use | Pollutes output | Professional & configurable |
Rule: print() is only for quick experiments. Real services use logging everywhere.
⚙️ 2. Basic Logging Setup (Per-Module Loggers)
You should never use the root logger directly in big projects. Instead, use one logger per module:
Per-Module Logger
Create a dedicated logger for each module
import logging
logger = logging.getLogger(__name__)
def process_order(order_id: str) -> None:
    logger.info("Processing order", extra={"order_id": order_id})
    # ...

Then in your entry point (e.g. main.py or app.py):
Entry Point Logging Config
Configure logging in your main entry point
import logging
LOG_FORMAT = "[%(asctime)s] [%(levelname)s] [%(name)s] %(message)s"
logging.basicConfig(
    level=logging.INFO,
    format=LOG_FORMAT,
)

This gives you the timestamp, log level, logger name (payments.service, users.api), and message. You can then tune the level per environment: DEBUG in dev, INFO in staging, WARNING in prod.
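A common way to wire this up is to read the environment name from a variable and map it to a level. This is a minimal sketch; the APP_ENV variable name and the mapping are illustrative, so use whatever your deployment actually defines:

```python
import logging
import os

# Map environment name -> log level (APP_ENV is a hypothetical variable name)
LEVELS = {"dev": logging.DEBUG, "staging": logging.INFO, "prod": logging.WARNING}

def level_for(env: str) -> int:
    # Fall back to INFO for unknown environments
    return LEVELS.get(env, logging.INFO)

logging.basicConfig(
    level=level_for(os.environ.get("APP_ENV", "dev")),
    format="[%(asctime)s] [%(levelname)s] [%(name)s] %(message)s",
)
```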
🧱 3. Logging Levels: A Contract for Your Team
Think of log levels like a hospital:
- DEBUG = Routine vitals check (internal details)
- INFO = Patient admitted/discharged (normal events)
- WARNING = Minor symptoms, monitor closely (recoverable issues)
- ERROR = Needs immediate attention but stable (operation failures)
- CRITICAL = Emergency! Life-threatening! (system down)
Use logging levels consistently across your team:
| Level | When to Use | Example |
|---|---|---|
| DEBUG | Internal details, variables, timings | "User payload: {...}" |
| INFO | High-level "this happened" | "User registered", "Order created" |
| WARNING | Unusual but recoverable | "Slow query: 832ms", "Retry attempt 2" |
| ERROR | Operation failed, app keeps running | "Payment failed", "Email send failed" |
| CRITICAL | System is not usable | "DB unreachable", "Config missing" |
Logging Levels Example
Use consistent log levels across your codebase
logger.debug("User payload: %r", payload)
logger.info("User registered", extra={"user_id": user.id})
logger.warning("Slow query", extra={"duration_ms": 832})
logger.error("Payment failed", extra={"user_id": user.id})
logger.critical("Database unreachable")

When everyone uses the same level rules, you can alert only on ERROR/CRITICAL, filter out noisy DEBUG in production, and quickly see a timeline of events with INFO logs.
🧩 4. Structured Logging (So Logs Are Actually Searchable)
For small scripts, string messages are fine. At scale, you need structured logs so tools like Datadog/Loki/ELK/CloudWatch can filter and aggregate.
Instead of:
Unstructured Log (Bad)
Avoid embedding data in log messages
logger.info(f"User {user.id} logged in from {ip}")

Use:
Structured Logging (Good)
Use structured logs for searchability
logger.info(
    "User logged in",
    extra={
        "user_id": user.id,
        "ip": request.client.host,
        "source": "web",
    },
)

# Or use a dict-style message:
logger.info({
    "event": "user_login",
    "user_id": user.id,
    "ip": request.client.host,
    "source": "web",
})

Now your logging backend can filter by user_id, count logins from each IP, and group by event.
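One caveat worth knowing: fields passed via extra only show up in the output if your formatter references them (or you emit JSON). A minimal sketch, assuming a hypothetical user_id field:

```python
import logging

# The format string must name the extra field for it to appear in output;
# every log call on this handler must then supply user_id.
formatter = logging.Formatter("[%(levelname)s] %(message)s user_id=%(user_id)s")

handler = logging.StreamHandler()
handler.setFormatter(formatter)

logger = logging.getLogger("extra_demo")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("User logged in", extra={"user_id": 42})
# logs: [INFO] User logged in user_id=42
```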
🧲 5. Handlers: Sending Logs to Multiple Destinations
A handler decides where logs go. Common ones:
| Handler | Destination | Best For |
|---|---|---|
| StreamHandler | Console | Development, debugging |
| FileHandler | Regular log file | Simple file logging |
| RotatingFileHandler | Rotates at size limit | Production (prevents huge files) |
| TimedRotatingFileHandler | Rotates daily/hourly | Daily log archives |
| SMTPHandler | Email | Critical error alerts |
| HTTPHandler | Log service | Centralized monitoring |
Example: console + file with rotation
Multiple Handlers
Send logs to console and file with rotation
import logging
from logging.handlers import RotatingFileHandler
logger = logging.getLogger()
logger.setLevel(logging.INFO)
console = logging.StreamHandler()
console.setLevel(logging.INFO)
file_handler = RotatingFileHandler(
    "app.log",
    maxBytes=5_000_000,  # rotate after ~5 MB
    backupCount=5,       # keep 5 old files
)

formatter = logging.Formatter(
    "[%(asctime)s] [%(levelname)s] [%(name)s] %(message)s"
)
console.setFormatter(formatter)
file_handler.setFormatter(formatter)
logger.addHandler(console)
logger.addHandler(file_handler)

Now dev uses the console, ops can inspect app.log, and files don't grow forever. In a SaaS/microservice setup, you'd usually log to stdout and let Docker/Kubernetes ship logs to a central system.
🧠 6. Error Handling Strategy (Don't Just "try/except Everything")
except Exception is like a doctor saying "take this pill" for every symptom — headache, broken leg, fever, anything! Good error handling is like diagnosing the specific problem and prescribing the right treatment.

Bad pattern:
Bad Error Handling
Avoid catching everything blindly
try:
    do_thing()
except Exception:
    logger.error("Something went wrong")

Problems with this approach:
- Swallows context — you don't know what went wrong
- Hides the original traceback — debugging becomes impossible
- The caller doesn't know it failed — silent failures are the worst!
Good pattern: Catch specific exceptions, log with traceback, decide whether to handle or re-raise
Good Error Handling
Use domain exceptions with proper logging
import logging
logger = logging.getLogger(__name__)
class PaymentError(Exception):
    pass

def charge_card(user_id: str, amount_pennies: int) -> None:
    try:
        external_gateway.charge(user_id, amount_pennies)
    except GatewayTimeoutError as e:
        logger.error(
            "Payment timeout",
            exc_info=True,
            extra={"user_id": user_id, "amount": amount_pennies},
        )
        raise PaymentError("Temporary payment failure") from e

Key ideas: Use domain-level exceptions (PaymentError), wrap low-level exceptions, pass exc_info=True to capture the traceback, and use raise ... from e to keep the error chain.
🧨 7. Logging Tracebacks Correctly
Two main options:
1. logger.exception() inside except
logger.exception()
Logs message with full traceback
try:
    risky_operation()
except Exception:
    logger.exception("Unexpected exception during risky_operation")

This logs the message, full traceback, and error type.
2. exc_info=True
exc_info=True
Alternative way to include traceback
logger.error("Something failed", exc_info=True)

Both approaches are valid. logger.exception() is just a shortcut for logger.error(..., exc_info=True) inside an except block.
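To see this in action, here is a small sketch that captures log output in a buffer (the StringIO buffer and logger name are purely for demonstration) and confirms the traceback is included:

```python
import io
import logging

# Capture output in a buffer so we can inspect what was logged
buf = io.StringIO()
logger = logging.getLogger("tb_demo")
logger.addHandler(logging.StreamHandler(buf))
logger.propagate = False

try:
    1 / 0
except ZeroDivisionError:
    logger.exception("Division failed")

output = buf.getvalue()
print("Traceback" in output)          # True
print("ZeroDivisionError" in output)  # True
```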
🪤 8. Global Exception Hooks (Last-Resort Safety Net)
For CLI/worker processes, you can register a global handler:
Global Exception Hook
Catch all uncaught exceptions
import logging
import sys
logger = logging.getLogger(__name__)
def handle_uncaught(exc_type, exc, tb):
    logger.critical("Uncaught exception", exc_info=(exc_type, exc, tb))

sys.excepthook = handle_uncaught

Now any uncaught error in the main thread is logged at CRITICAL. You still let the process crash (which is often correct in prod). In frameworks: Django has middleware for logging unhandled errors, FastAPI lets you add global exception handlers, Celery workers log errors from tasks.
🧪 9. Debugging Strategy: Logs + Debugger + Assertions
At scale, you don't "guess":
✅ Use logging for:
- long-term tracing in production
- measuring how often things break
- understanding user behaviour
✅ Use debugger for:
- stepping through tricky paths in dev
- inspecting state live
Example with pdb / breakpoint():
Using breakpoint()
Debug interactively with pdb
def compute_total(order):
    breakpoint()  # or: import pdb; pdb.set_trace()
    # inspect order, step through logic
    total = sum(item.price for item in order.items)
    return total

✅ Use assertions in dev: assert total >= 0, "Total should never be negative"
Note that running Python with the -O flag strips all assert statements, so treat assertions as development-time sanity checks and never rely on them for production validation.
🎯 10. Principles for "At Scale" Error Handling
1. Don't hide errors silently
Always log unexpected exceptions. Prefer fail loud and fast to silent corruption.
2. Centralize error handling at boundaries
HTTP layer (FastAPI/Django view), Message queue consumer, CLI/worker entry point
3. Use custom exception hierarchy
DomainError → ValidationError, PaymentError, ExternalServiceError. This makes handling & logging more deliberate.
4. Keep logs human-readable AND machine-parseable
Combined message + structured context
5. Tune log levels
Too much DEBUG in prod = noisy. Too few logs in prod = blind.
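The hierarchy from principle 3 can be sketched as plain exception classes; the handle() boundary function and its status codes are illustrative, not a prescribed API:

```python
class DomainError(Exception):
    """Base class for every error the application knows how to talk about."""

class ValidationError(DomainError):
    pass

class PaymentError(DomainError):
    pass

class ExternalServiceError(DomainError):
    pass

# A boundary (HTTP view, queue consumer) can now catch one base class:
def handle(exc: Exception) -> str:
    if isinstance(exc, ValidationError):
        return "400"  # client mistake, no alert needed
    if isinstance(exc, DomainError):
        return "502"  # known failure mode, log at ERROR
    return "500"      # unknown: log at CRITICAL and escalate
```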
Part 2: Production Engineering — Centralized Logging, Correlation IDs & Distributed Tracing
Part 1 covered local logging and error handling. Now we move into real production engineering — the systems used by FastAPI/Django backends, microservices, SaaS platforms, and distributed workers.
🔥 1. Centralised Log Aggregation (The Real World Standard)
As soon as you have multiple servers, background workers, containers, or functions-as-a-service — you cannot inspect logs locally anymore. Real systems send logs to a central place.
✔ ELK Stack (Elasticsearch + Logstash + Kibana) — Most customizable, open-source, works at huge scale
✔ Loki + Grafana — Insanely fast, cheap, streams logs from Docker/Kubernetes
✔ AWS CloudWatch / GCP Logging / Azure Monitor — Great if you already host on those platforms
✔ Datadog / Sentry / NewRelic — Expensive, but world-class dashboards + alerts
⚙️ 2. Logging to STDOUT in Containers (Best Practice)
In Docker/Kubernetes, you never write log files inside the container. You output logs to STDOUT:
Container Logging
Log to stdout for Docker/Kubernetes
import logging

logging.basicConfig(
    level=logging.INFO,
    format="[%(asctime)s] [%(levelname)s] [%(name)s] %(message)s",
    handlers=[logging.StreamHandler()],
)

Kubernetes will automatically: ✔ capture your stdout, ✔ send it to your log system, ✔ attach metadata (pod, namespace, service). This makes logs fully searchable across the entire cluster.
🧠 3. The Importance of Request IDs / Correlation IDs
Imagine debugging: User logs in → Their request hits API A → API A calls API B → API B queries the database → A background worker processes a message → Something fails.
Without correlation IDs, logs look like a mess. Solution: Generate a unique ID per user request.
Correlation IDs
Generate unique IDs for request tracing
import uuid

request_id = str(uuid.uuid4())
logger.info("Received request", extra={"request_id": request_id})

Pass the ID through HTTP headers, background jobs, microservice calls, and log contexts. Now you can filter your log system for request_id = "1f0cd133-f9ab-4bd9-a6cd-92d3a002a415" and see EVERYTHING that happened.
This alone can save hours per week in debugging.
🧩 4. Logging Context Automatically (ContextVars)
Python provides a way to attach values (like request IDs) to all logs inside async or threaded code.
ContextVars for Logging
Automatically attach context to all logs
import logging
from contextvars import ContextVar

request_id_var = ContextVar("request_id", default="-")

def set_request_id(rid: str):
    request_id_var.set(rid)

# Using a custom log filter:
class RequestIDFilter(logging.Filter):
    def filter(self, record):
        record.request_id = request_id_var.get()
        return True

# Attach filter to logger:
logger.addFilter(RequestIDFilter())

Now every log line automatically includes request_id, user_id (if added), job_id, trace_id. This is essential in async web apps like FastAPI.
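Putting the pieces together, a runnable end-to-end sketch (the buffer and logger name are just for demonstration):

```python
import io
import logging
from contextvars import ContextVar

request_id_var = ContextVar("request_id", default="-")

class RequestIDFilter(logging.Filter):
    def filter(self, record):
        # Inject the current request ID into every record
        record.request_id = request_id_var.get()
        return True

buf = io.StringIO()
handler = logging.StreamHandler(buf)
handler.setFormatter(logging.Formatter("[%(request_id)s] %(message)s"))

logger = logging.getLogger("ctx_demo")
logger.addHandler(handler)
logger.addFilter(RequestIDFilter())
logger.setLevel(logging.INFO)
logger.propagate = False

request_id_var.set("req-123")
logger.info("Processing order")

print(buf.getvalue().strip())  # [req-123] Processing order
```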
🚀 5. Distributed Tracing (How Big Systems Debug)
Used by Uber, Netflix, Stripe, Google-scale systems. Distributed tracing tracks: "This request flowed through: API → Worker → DB → Cache → Queue → Another Worker"
Tools: OpenTelemetry (industry standard), Jaeger, Zipkin, Datadog APM
OpenTelemetry Tracing
Add distributed tracing to your app
from opentelemetry import trace
tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("process_order") as span:
    span.set_attribute("order_id", order_id)
    span.set_attribute("user_id", user_id)

Now logs + traces appear linked. This gives timings for each step, slow points, bottlenecks, failed spans, and dependency maps. This is how modern SaaS teams debug performance problems.
🧵 6. Logging in Async Python (Trickier Than It Looks)
Async apps (FastAPI, aiohttp) handle hundreds/thousands of concurrent tasks. Debugging them needs extra care.
Problem 1: Interleaved logs — Two tasks printing logs at the same time → scrambled output
Solution: structured logging (JSON logs)
Instead of plain text logs:

{"timestamp": "...", "level": "INFO", "task_id": 128, "msg": "User logged in"}

Tools: python-json-logger, structlog, loguru. Structured logs remove all ambiguity.
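If you want JSON logs without a third-party dependency, a minimal stdlib-only formatter can be sketched like this (python-json-logger and structlog give you far more out of the box):

```python
import io
import json
import logging

class JsonFormatter(logging.Formatter):
    def format(self, record):
        # Emit each record as one JSON object per line
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        })

buf = io.StringIO()
handler = logging.StreamHandler(buf)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("json_demo")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

logger.info("User logged in")
print(buf.getvalue())  # each line is now a parseable JSON object
```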
🧪 7. Debugging Async Issues
Common async issues: tasks never awaited, task cancelled too early, infinite loops in async code, race conditions modifying shared state, slow network calls blocking event loop.
Use asyncio.all_tasks() to debug:
Debug Async Tasks
Inspect all running asyncio tasks
import asyncio
# Must be called from inside a running event loop:
for t in asyncio.all_tasks():
    print(t)

You can inspect pending tasks, long-running tasks, and tasks stuck on await.
Async Timeouts & Shield
Handle timeouts and prevent cancellation
# Use asyncio timeout wrappers (fetch_data() is a placeholder for any coroutine)
await asyncio.wait_for(fetch_data(), timeout=3)

# Use asyncio.shield to prevent unwanted cancellation
result = await asyncio.shield(long_task())

This is how real async backends prevent accidental cancellations.
⚡ 8. Advanced Exception Routing (Tiered Error Strategy)
In large systems, errors must be: 1. Logged, 2. Categorized, 3. Reported to monitoring tools, 4. Potentially retried, 5. Potentially suppressed, 6. Potentially escalated
✔ Business error (expected)
Logged at WARNING, Message shown to user, No alerts
✔ External service error
Logged at ERROR, Trigger retry/backoff, Alert if repeated failures
✔ System error
Logged at CRITICAL, Immediate alert, Fallback mode or shutdown
This prevents alert fatigue and improves reliability.
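One way to express this tiered routing in code. The error classes and the (level, alert, retry) policy tuples are illustrative, not a fixed scheme:

```python
import logging

# Hypothetical error classes for the three tiers above
class BusinessError(Exception): pass
class ExternalServiceError(Exception): pass
class InfrastructureError(Exception): pass

# tier -> (log level, should_alert, should_retry)
ROUTING = {
    BusinessError:        (logging.WARNING,  False, False),
    ExternalServiceError: (logging.ERROR,    True,  True),
    InfrastructureError:  (logging.CRITICAL, True,  False),
}

def route(exc: Exception):
    """Return the handling policy for an exception; unknown errors escalate."""
    for cls, policy in ROUTING.items():
        if isinstance(exc, cls):
            return policy
    return (logging.CRITICAL, True, False)
```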
🔔 9. Error Tracking Tools (Sentry, Rollbar, Bugsnag)
Instead of forcing developers to read logs manually, use automatic error monitoring.
Sentry Integration
Capture errors with automatic tracking
import sentry_sdk

try:
    risky()
except Exception as e:
    logger.exception("Error during risky operation")
    sentry_sdk.capture_exception(e)
    raise

Sentry provides: stack trace, breadcrumbs (log history), user ID + metadata, environment (dev/staging/prod), frequency of error, release version. You can review top crashing endpoints, sudden spikes, and regressions after deploys.
This is mandatory in modern SaaS.
🧊 10. Debug Builds vs Production Builds
Your debugging configuration must change by environment:
Development
- logging.DEBUG
- colourful logs
- stack traces everywhere
- profiler enabled
- hot reload
Staging
- logging.INFO
- Sentry enabled
- performance profiling on
- request IDs enabled
Production
- logging.WARNING or ERROR
- no sensitive data
- structured JSON logs
- aggressive error tracking
- correlation IDs
- distributed tracing
This separation keeps production clean and safe.
Part 3: Monitoring, Alerting & Observability — Production Architecture
The final deep-dive covers how large-scale systems (millions of users) keep themselves healthy, monitor failures, detect outages, and maintain reliability.
🔥 1. Observability: The "Holy Trinity"
Modern systems don't rely on just logging. True observability comes from three pillars:
✔ Logs
Text records of what happened (requests, errors, info messages)
✔ Metrics
Numerical time-series measurements (CPU, latency, throughput, queue length)
✔ Traces
Request flow across services (API → DB → Worker → Cache)
Together, these allow you to answer: What broke? Why? When? Where? Who was affected? How long?
This is how Google, Netflix, Uber and Stripe prevent downtime.
⚙️ 2. Metrics You MUST Track in Any Real System
A professional backend MUST export metrics like:
Performance Metrics
- Request latency (p50/p95/p99)
- Event loop lag
- Worker queue size
- CPU usage per service
- Memory usage per process
Reliability Metrics
- Error rate (5xx/exceptions)
- Retry rate
- Timeout rate
- Dead-letter queue count
Throughput Metrics
- Requests per second
- Tasks executed per minute
- DB queries per request
- Cache hit/miss ratio
Example using Prometheus:
Prometheus Metrics
Track requests and latency
from prometheus_client import Counter, Histogram
REQUESTS = Counter("api_requests_total", "Total API Requests")
LATENCY = Histogram("request_latency_seconds", "Latency")
def handler():
    REQUESTS.inc()
    with LATENCY.time():
        process()

These metrics show EXACTLY where your system slows or breaks.
🧠 3. Alerting Like a Real SaaS Company
Your system should notify you when:
- ❌ Error rates spike (e.g., more than 2% of requests fail)
- ❌ Latency spikes (p99 over 500ms)
- ❌ CPU stays critically high (over 80% sustained)
- ❌ Memory leaks begin (memory steadily grows without dropping)
- ❌ Queue backups (worker queue length keeps increasing)
Alert channels: Email, SMS, Discord webhook, Slack, PagerDuty (industry standard)
Alerts MUST be actionable and never noisy. (Otherwise you get alert fatigue and ignore them.)
🚨 4. Creating Alert Rules (Industry Examples)
Latency Alert: IF p99_latency > 400ms FOR 5 minutes → alert
Error Rate Alert: IF 5xx_errors > 2% OF total_requests FOR 2 minutes → alert
Worker Queue Alert: IF queue_length > 1000 FOR 10 minutes → alert
Memory Leak Alert: IF memory_usage increases for 30 minutes straight → alert
This protects your system from silent failures.
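The error-rate rule above can be sketched as a small evaluator over per-minute samples; the thresholds and window handling are simplified for illustration:

```python
def should_alert(error_counts, total_counts, threshold=0.02):
    """Fire only if the failure ratio exceeds threshold for EVERY sample
    in the window (i.e. sustained, not one noisy minute)."""
    if not total_counts or 0 in total_counts:
        return False
    return all(e / t > threshold for e, t in zip(error_counts, total_counts))

# 3% errors for two consecutive minutes -> alert
print(should_alert([3, 3], [100, 100]))  # True
# One noisy minute followed by recovery -> no alert
print(should_alert([3, 1], [100, 100]))  # False
```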
🔍 5. Log Schema for Enterprise-Level Systems
Logs must be consistent, otherwise your log system becomes useless. Here's a clean, universal JSON format:
{
  "timestamp": "2025-01-01T12:00:00Z",
  "level": "INFO",
  "service": "payment_api",
  "event": "charge_created",
  "message": "Payment successful",
  "request_id": "xyz-123",
  "user_id": 314159,
  "duration_ms": 142,
  "status_code": 200
}

Why JSON logs? Machines can process it, log systems index it, developers can query it, and it's easy to parse in dashboards.
Never do: Free-text logs, logs with no structure, logs with inconsistent keys.
Stronger logs = faster debugging.
🧩 6. Building Dashboards (Real Engineering Style)
Good dashboards answer questions instantly.
API Dashboard
- Incoming requests
- Error rate
- Latency p50/p95/p99
- DB queries per request
- Cache hit %
Worker Dashboard
- Queue depth
- Processing rate
- Task wait time
- Failures vs retries
System Dashboard
- CPU per service
- Memory usage
- Disk I/O
- Network outbound/inbound
Business Dashboard
- Sign-ups
- Payments
- Failed payments
- Daily active users
These dashboards turn your logs + metrics into a map of system health.
🧵 7. Detecting Problems Automatically (Heuristics)
Some failures don't generate hard errors. Symptoms you must watch for:
🚨 Slow Increase → Memory Leak (Memory grows hour after hour)
🚨 Sawtooth Pattern → GC Thrashing (Frequent garbage collections)
🚨 CPU Stuck at 100% → Hot Loop (Infinite work created accidentally)
🚨 Queue Length Increasing → Bottleneck (Workers can't keep up)
🚨 Error Spikes at Same Time Daily → Scheduled Task Problem
Your monitoring system should automatically surface these patterns.
🔧 8. Production Failure Scenarios & How to Diagnose
Scenario 1: CPU Spikes to 100%
Check: pprof/py-spy flame graph, number of tasks, bad while loops, accidental synchronous code inside async
Scenario 2: Latency Spikes
Check: DB query count, slow endpoints, overloaded workers, missing indexes in SQL
Scenario 3: Queue Backup
Check: processing time per job, number of workers, retry storms, deadlocks
Scenario 4: Silent Failure with No Errors
Check: logs suppressed accidentally, exceptions swallowed, callbacks failed silently, network retries hiding real issues
Scenario 5: Memory Leak
Check: tracemalloc snapshots, reference cycles, global caches growing, large objects kept alive by closures
These are real issues companies debug DAILY.
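For scenario 5, the tracemalloc workflow mentioned above looks roughly like this (the list comprehension stands in for whatever code you suspect of leaking):

```python
import tracemalloc

tracemalloc.start()
snapshot1 = tracemalloc.take_snapshot()

# ... suspect code runs; here we simulate a leak by holding onto memory
leak = [bytearray(1024) for _ in range(1000)]

snapshot2 = tracemalloc.take_snapshot()

# Diff the snapshots to see which source lines allocated the most new memory
for stat in snapshot2.compare_to(snapshot1, "lineno")[:3]:
    print(stat)

tracemalloc.stop()
```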
📦 9. Production Architecture: Logging + Monitoring Stack
Small app (learning platform)
Logging: basic JSON logs, Monitoring: CloudWatch or simple Grafana, Error tracking: Sentry
Medium SaaS
Logging: Loki/Elasticsearch, Metrics: Prometheus, Traces: Jaeger, Error monitoring: Sentry
Large platform
Distributed tracing everywhere, Multi-layer dashboards, Real-time anomaly detection, Canary deployments with rollback triggers, Full observability pipeline
This is the path your apps will grow into.
🎓 10. The Engineering Mindset for Production Stability
Professional devs think differently:
❌ Not: "How do I fix this bug?"
✔ Yes: "How do I prevent this class of bug forever?"
❌ Not: "Why did the app crash?"
✔ Yes: "Why wasn't this detected earlier?"
❌ Not: "Why is it slow?"
✔ Yes: "What instrumentation do we add so slowdowns are always visible?"
You build systems that protect themselves.
🎉 Final Summary — Master Level
You now understand how real companies handle logging & debugging:
✔ Structured logs
✔ Centralised logging systems
✔ Correlation IDs
✔ Async logging techniques
✔ Distributed tracing
✔ Monitoring metrics
✔ Alert rules & thresholds
✔ Dashboards for system health
✔ Diagnosing CPU, memory, queue & latency issues
✔ Production-ready error strategies
✔ Observability stack for any size of project
This is enterprise-level engineering knowledge, the type senior developers use every day.
📋 Quick Reference — Logging & Debugging
| Syntax | What it does |
|---|---|
| logging.basicConfig(level=logging.INFO) | Set up basic logging config |
| logger = logging.getLogger(__name__) | Create a named logger |
| logger.exception("msg") | Log error with full traceback |
| pdb.set_trace() | Drop into interactive debugger |
| breakpoint() | Python 3.7+ built-in debugger shortcut |
🎉 Great work! You've completed this lesson.
You can now add structured logging, use the debugger effectively, and handle errors at enterprise scale.
Up next: Testing with pytest — write fixtures, parametrised tests, and mocks like a professional.