
    Lesson 33 • Advanced

    Logging, Debugging & Error Handling at Scale

    When your codebase is tiny, print() and random try/except blocks "work". Once you have APIs, background workers, cron jobs, queues, and multiple servers, that approach falls apart. Learn professional-grade debugging + logging + error-handling systems used by real engineering teams.

    What You'll Master

    This lesson takes you from basic debugging → production-grade logging systems used by companies handling thousands of users, multiple environments, and distributed services.

    ✔ Why logging beats print() in real systems

    ✔ Structured logging and log levels

    ✔ Multiple handlers and log routing

    ✔ Professional error handling strategies

    ✔ Centralized log aggregation

    ✔ Correlation IDs and distributed tracing

    ✔ Async logging techniques

    ✔ Monitoring, alerting & observability

    ✔ Production debugging workflows

    Part 1: Logging Fundamentals & Error Handling

    🔥 1. Why Logging Beats print() in Real Systems

    When you have thousands of users, multiple environments (dev/staging/prod), background jobs, schedulers, and queues — print() falls apart.

    Feature              print()                   logging
    Output destinations  stdout only               Console, files, HTTP, email
    Severity levels      None                      DEBUG → CRITICAL
    Filtering            Not possible              Per module, per environment
    Context              Manual string formatting  Attach user ID, request ID, etc.
    Production use       Pollutes output           Professional & configurable

    Rule: print() is only for quick experiments. Real services use logging everywhere.

    ⚙️ 2. Basic Logging Setup (Per-Module Loggers)

    You should never use the root logger directly in big projects. Instead, use one logger per module:

    Per-Module Logger

    Create a dedicated logger for each module

    Python
    import logging
    
    logger = logging.getLogger(__name__)
    
    def process_order(order_id: str) -> None:
        logger.info("Processing order", extra={"order_id": order_id})
        # ...

    Then in your entry point (e.g. main.py or app.py):

    Entry Point Logging Config

    Configure logging in your main entry point

    Python
    import logging
    
    LOG_FORMAT = "[%(asctime)s] [%(levelname)s] [%(name)s] %(message)s"
    
    logging.basicConfig(
        level=logging.INFO,
        format=LOG_FORMAT,
    )

    This gives you a timestamp, log level, logger name (payments.service, users.api), and message. Typical levels per environment: DEBUG in dev, INFO in staging, WARNING in prod.

    🧱 3. Logging Levels: A Contract for Your Team

    Use logging levels consistently across your team:

    Level     When to Use                           Example
    DEBUG     Internal details, variables, timings  "User payload: {...}"
    INFO      High-level "this happened"            "User registered", "Order created"
    WARNING   Unusual but recoverable               "Slow query: 832ms", "Retry attempt 2"
    ERROR     Operation failed, app keeps running   "Payment failed", "Email send failed"
    CRITICAL  System is not usable                  "DB unreachable", "Config missing"

    Logging Levels Example

    Use consistent log levels across your codebase

    Python
    logger.debug("User payload: %r", payload)
    logger.info("User registered", extra={"user_id": user.id})
    logger.warning("Slow query", extra={"duration_ms": 832})
    logger.error("Payment failed", extra={"user_id": user.id})
    logger.critical("Database unreachable")

    When everyone uses the same level rules, you can alert only on ERROR/CRITICAL, filter out noisy DEBUG in production, and quickly reconstruct a timeline of events from INFO logs.

    🧩 4. Structured Logging (So Logs Are Actually Searchable)

    For small scripts, string messages are fine. At scale, you need structured logs so tools like Datadog/Loki/ELK/CloudWatch can filter and aggregate.

    Instead of:

    Unstructured Log (Bad)

    Avoid embedding data in log messages

    Python
    logger.info(f"User {user.id} logged in from {ip}")

    Use:

    Structured Logging (Good)

    Use structured logs for searchability

    Python
    logger.info(
        "User logged in",
        extra={
            "user_id": user.id,
            "ip": request.client.host,
            "source": "web",
        }
    )
    
    # Or use a dict-style message:
    logger.info({
        "event": "user_login",
        "user_id": user.id,
        "ip": request.client.host,
        "source": "web",
    })

    Now your logging backend can filter by user_id, count logins from each IP, and group by event.

    🧲 5. Handlers: Sending Logs to Multiple Destinations

    A handler decides where logs go. Common ones:

    Handler                   Destination            Best For
    StreamHandler             Console                Development, debugging
    FileHandler               Regular log file       Simple file logging
    RotatingFileHandler       Rotates at size limit  Production (prevents huge files)
    TimedRotatingFileHandler  Rotates daily/hourly   Daily log archives
    SMTPHandler               Email                  Critical error alerts
    HTTPHandler               Log service            Centralized monitoring

    Example: console + file with rotation

    Multiple Handlers

    Send logs to console and file with rotation

    Python
    import logging
    from logging.handlers import RotatingFileHandler
    
    logger = logging.getLogger()
    logger.setLevel(logging.INFO)
    
    console = logging.StreamHandler()
    console.setLevel(logging.INFO)
    
    file_handler = RotatingFileHandler(
        "app.log",
        maxBytes=5_000_000,  # 5 MB
        backupCount=5,      # keep 5 old files
    )
    
    formatter = logging.Formatter(
        "[%(asctime)s] [%(levelname)s] [%(name)s] %(message)s"
    )
    
    console.setFormatter(formatter)
    file_handler.setFormatter(formatter)
    
    logger.addHandler(console)
    logger.addHandler(file_handler)

    Now dev uses console, ops can inspect app.log, and files don't grow forever. In a SaaS/microservice setup, you'd usually log to stdout and let Docker/Kubernetes send logs to a central system.

    🧠 6. Error Handling Strategy (Don't Just "try/except Everything")

    Bad pattern:

    Bad Error Handling

    Avoid catching everything blindly

    Python
    try:
        do_thing()
    except Exception:
        logger.error("Something went wrong")

    Problems with this approach:

    • Swallows context — you don't know what went wrong
    • Hides the original traceback — debugging becomes impossible
    • The caller doesn't know it failed — silent failures are the worst!

    Good pattern: Catch specific exceptions, log with traceback, decide whether to handle or re-raise

    Good Error Handling

    Use domain exceptions with proper logging

    Python
    import logging
    
    logger = logging.getLogger(__name__)
    
    class PaymentError(Exception):
        pass
    
    def charge_card(user_id: str, amount_pennies: int) -> None:
        try:
            external_gateway.charge(user_id, amount_pennies)
        except GatewayTimeoutError as e:
            logger.error(
                "Payment timeout",
                exc_info=True,
                extra={"user_id": user_id, "amount": amount_pennies}
            )
            raise PaymentError("Temporary payment failure") from e

    Key ideas: Use domain-level exceptions (PaymentError), wrap low-level exceptions, pass exc_info=True to capture traceback, use raise ... from e to keep the error chain.

    🧨 7. Logging Tracebacks Correctly

    Two main options:

    1. logger.exception() inside except

    logger.exception()

    Logs message with full traceback

    Python
    try:
        risky_operation()
    except Exception:
        logger.exception("Unexpected exception during risky_operation")

    This logs message, full traceback, and error type.

    2. exc_info=True

    exc_info=True

    Alternative way to include traceback

    Python
    logger.error("Something failed", exc_info=True)

    Both approaches are valid. logger.exception() is just a shortcut for logger.error(..., exc_info=True) inside an except block.

    🪤 8. Global Exception Hooks (Last-Resort Safety Net)

    For CLI/worker processes, you can register a global handler:

    Global Exception Hook

    Catch all uncaught exceptions

    Python
    import logging
    import sys
    
    logger = logging.getLogger(__name__)
    
    def handle_uncaught(exc_type, exc, tb):
        logger.critical("Uncaught exception", exc_info=(exc_type, exc, tb))
    
    sys.excepthook = handle_uncaught

    Now any uncaught error in the main thread is logged at CRITICAL. You still let the process crash (which is often correct in prod). In frameworks: Django has middleware for logging unhandled errors, FastAPI lets you add global exception handlers, Celery workers log errors from tasks.

    🧪 9. Debugging Strategy: Logs + Debugger + Assertions

    At scale, you don't "guess":

    ✅ Use logging for:

    • long-term tracing in production
    • measuring how often things break
    • understanding user behaviour

    ✅ Use debugger for:

    • stepping through tricky paths in dev
    • inspecting state live

    Example with pdb / breakpoint():

    Using breakpoint()

    Debug interactively with pdb

    Python
    def compute_total(order):
        breakpoint()  # or import pdb; pdb.set_trace()
        # inspect order, step through logic
        total = sum(item.price for item in order.items)
        return total

    ✅ Use assertions in dev: assert total >= 0, "Total should never be negative"

    In production, assertions may be stripped entirely (Python's -O flag removes them), so use them for developer-time invariants rather than critical runtime checks.

    🎯 10. Principles for "At Scale" Error Handling

    1. Don't hide errors silently

    Always log unexpected exceptions. Prefer fail loud and fast to silent corruption.

    2. Centralize error handling at boundaries

    HTTP layer (FastAPI/Django view), Message queue consumer, CLI/worker entry point

    3. Use custom exception hierarchy

    DomainError → ValidationError, PaymentError, ExternalServiceError. This makes handling & logging more deliberate.
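    A minimal sketch of such a hierarchy (the class names follow the examples above; the routing function is illustrative, not from a specific framework):

```python
class DomainError(Exception):
    """Base class for all errors raised by our own business logic."""

class ValidationError(DomainError):
    """Input failed validation; message is safe to show the user."""

class PaymentError(DomainError):
    """A payment could not be completed."""

class ExternalServiceError(DomainError):
    """An upstream dependency (gateway, third-party API) failed."""

def classify(err: Exception) -> str:
    # Boundaries can catch the whole family in one clause
    # and still branch on the specific subtype.
    if isinstance(err, ValidationError):
        return "warn"
    if isinstance(err, DomainError):
        return "error"
    return "critical"
```

    With one base class, an HTTP layer can catch DomainError broadly while anything outside the hierarchy is treated as an unexpected system failure.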

    4. Keep logs human-readable AND machine-parseable

    Combined message + structured context

    5. Tune log levels

    Too much DEBUG in prod = noisy. Too few logs in prod = blind.

    Part 2: Production Engineering — Centralized Logging, Correlation IDs & Distributed Tracing

    Part 1 covered local logging and error handling. Now we move into real production engineering — the systems used by FastAPI/Django backends, microservices, SaaS platforms, and distributed workers.

    🔥 1. Centralised Log Aggregation (The Real World Standard)

    As soon as you have multiple servers, background workers, containers, or functions-as-a-service — you cannot inspect logs locally anymore. Real systems send logs to a central place.

    ✔ ELK Stack (Elasticsearch + Logstash + Kibana) — Most customizable, open-source, works at huge scale

    ✔ Loki + Grafana — Insanely fast, cheap, streams logs from Docker/Kubernetes

    ✔ AWS CloudWatch / GCP Logging / Azure Monitor — Great if you already host on those platforms

    ✔ Datadog / Sentry / NewRelic — Expensive, but world-class dashboards + alerts

    ⚙️ 2. Logging to STDOUT in Containers (Best Practice)

    In Docker/Kubernetes, you never write log files inside the container. You output logs to STDOUT:

    Container Logging

    Log to stdout for Docker/Kubernetes

    Python
    logging.basicConfig(
        level=logging.INFO,
        format="[%(asctime)s] [%(levelname)s] [%(name)s] %(message)s",
        handlers=[logging.StreamHandler()]
    )

    Kubernetes will automatically: ✔ capture your stdout, ✔ send it to your log system, ✔ attach metadata (pod, namespace, service). This makes logs fully searchable across the entire cluster.

    🧠 3. The Importance of Request IDs / Correlation IDs

    Imagine debugging: User logs in → Their request hits API A → API A calls API B → API B queries the database → A background worker processes a message → Something fails.

    Without correlation IDs, logs look like a mess. Solution: Generate a unique ID per user request.

    Correlation IDs

    Generate unique IDs for request tracing

    Python
    import uuid
    
    request_id = str(uuid.uuid4())
    logger.info("Received request", extra={"request_id": request_id})

    Pass the ID through HTTP headers, background jobs, microservice calls, and log contexts. Now you can filter your log system for request_id = "1f0cd133-f9ab-4bd9-a6cd-92d3a002a415" and see EVERYTHING that happened.

    This alone can save hours per week in debugging.
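    The propagation rule can be sketched with the stdlib alone. The X-Request-ID header name is a common convention rather than a standard, and the helper names here are hypothetical:

```python
import uuid

REQUEST_ID_HEADER = "X-Request-ID"  # pick one name and use it everywhere

def resolve_request_id(headers: dict) -> str:
    """Reuse the caller's ID if present, otherwise start a new trace."""
    return headers.get(REQUEST_ID_HEADER) or str(uuid.uuid4())

def outgoing_headers(request_id: str) -> dict:
    """Every downstream HTTP call or job payload carries the same ID."""
    return {REQUEST_ID_HEADER: request_id}
```

    The key design choice is reusing an incoming ID instead of always generating one, so the trace stays unbroken across service hops.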

    🧩 4. Logging Context Automatically (ContextVars)

    Python provides a way to attach values (like request IDs) to all logs inside async or threaded code.

    ContextVars for Logging

    Automatically attach context to all logs

    Python
    from contextvars import ContextVar
    
    request_id_var = ContextVar("request_id", default="-")
    
    def set_request_id(rid: str):
        request_id_var.set(rid)
    
    # Using a custom log filter:
    class RequestIDFilter(logging.Filter):
        def filter(self, record):
            record.request_id = request_id_var.get()
            return True
    
    # Attach filter to logger:
    logger.addFilter(RequestIDFilter())

    Now every log record automatically carries request_id (add %(request_id)s to your format string to display it), and the same pattern extends to user_id, job_id, or trace_id. This is essential in async web apps like FastAPI.

    🚀 5. Distributed Tracing (How Big Systems Debug)

    Used by Uber, Netflix, Stripe, Google-scale systems. Distributed tracing tracks: "This request flowed through: API → Worker → DB → Cache → Queue → Another Worker"

    Tools: OpenTelemetry (industry standard), Jaeger, Zipkin, Datadog APM

    OpenTelemetry Tracing

    Add distributed tracing to your app

    Python
    from opentelemetry import trace
    
    tracer = trace.get_tracer(__name__)
    
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order_id", oid)
        span.set_attribute("user_id", user_id)

    Now logs + traces appear linked. This gives timings for each step, slow points, bottlenecks, failed spans, and dependency maps. This is how modern SaaS teams debug performance problems.

    🧵 6. Logging in Async Python (Trickier Than It Looks)

    Async apps (FastAPI, aiohttp) handle hundreds/thousands of concurrent tasks. Debugging them needs extra care.

    Problem 1: Interleaved logs — Two tasks printing logs at the same time → scrambled output

    Solution: structured logging (JSON logs)

    Instead of plain text logs:

    {"timestamp": "...", "level": "INFO", "task_id": 128, "msg": "User logged in"}

    Tools: python-json-logger, structlog, loguru. Structured logs remove all ambiguity.
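    If you want to avoid a third-party dependency, a JSON formatter takes only a few lines of stdlib code. This is a minimal sketch; libraries like structlog add far more (context binding, processors):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        }
        # Include the traceback text when the record carries one.
        if record.exc_info:
            payload["exc_info"] = self.formatException(record.exc_info)
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
```

    One JSON object per line is the format log shippers expect, so interleaved async output stays parseable.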

    🧪 7. Debugging Async Issues

    Common async issues: tasks never awaited, task cancelled too early, infinite loops in async code, race conditions modifying shared state, slow network calls blocking event loop.

    Use asyncio.all_tasks() to debug:

    Debug Async Tasks

    Inspect all running asyncio tasks

    Python
    import asyncio
    
    for t in asyncio.all_tasks():
        print(t)

    You can inspect pending tasks, long-running tasks, and tasks stuck on await.

    Async Timeouts & Shield

    Handle timeouts and prevent cancellation

    Python
    # Bound an awaited call with a timeout (fetch_data is a placeholder coroutine)
    await asyncio.wait_for(fetch_data(), timeout=3)
    
    # Use asyncio.shield to prevent unwanted cancellation
    result = await asyncio.shield(long_task())

    This is how real async backends prevent accidental cancellations.

    ⚡ 8. Advanced Exception Routing (Tiered Error Strategy)

    In large systems, errors must be: 1. Logged, 2. Categorized, 3. Reported to monitoring tools, 4. Potentially retried, 5. Potentially suppressed, 6. Potentially escalated

    ✔ Business error (expected)

    Logged at WARNING, Message shown to user, No alerts

    ✔ External service error

    Logged at ERROR, Trigger retry/backoff, Alert if repeated failures

    ✔ System error

    Logged at CRITICAL, Immediate alert, Fallback mode or shutdown

    This prevents alert fatigue and improves reliability.
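    One way to encode the three tiers in code, as a sketch (the exception names mirror the hierarchy from Part 1; the level mapping is an assumption about your alerting policy):

```python
import logging

logger = logging.getLogger(__name__)

class BusinessError(Exception):
    """Expected domain failure, e.g. insufficient balance."""

class ExternalServiceError(Exception):
    """Upstream dependency failed; candidate for retry/backoff."""

def route_level(err: Exception) -> int:
    """Map an exception to the log level the tiers above describe."""
    if isinstance(err, BusinessError):
        return logging.WARNING   # expected; shown to user, no alert
    if isinstance(err, ExternalServiceError):
        return logging.ERROR     # retried; alert only on repeated failures
    return logging.CRITICAL      # unknown system error; page someone

def report(err: Exception) -> None:
    logger.log(route_level(err), "Handled failure: %s", err, exc_info=err)
```

    Because alert rules key off log levels, routing exceptions this way is what keeps CRITICAL pages rare and meaningful.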

    🔔 9. Error Tracking Tools (Sentry, Rollbar, Bugsnag)

    Instead of forcing developers to read logs manually, use automatic error monitoring.

    Sentry Integration

    Capture errors with automatic tracking

    Python
    import sentry_sdk  # assumes sentry_sdk.init(dsn=...) was called at startup
    
    try:
        risky()
    except Exception as e:
        logger.exception("Error during risky operation")
        sentry_sdk.capture_exception(e)
        raise

    Sentry provides: stack trace, breadcrumbs (log history), user ID + metadata, environment (dev/staging/prod), frequency of error, release version. You can review top crashing endpoints, sudden spikes, and regressions after deploys.

    This is mandatory in modern SaaS.

    🧊 10. Debug Builds vs Production Builds

    Your debugging configuration must change by environment:

    Development

    • logging.DEBUG
    • colourful logs
    • stack traces everywhere
    • profiler enabled
    • hot reload

    Staging

    • logging.INFO
    • Sentry enabled
    • performance profiling on
    • request IDs enabled

    Production

    • logging.WARNING or ERROR
    • no sensitive data
    • structured JSON logs
    • aggressive error tracking
    • correlation IDs
    • distributed tracing

    This separation keeps production clean and safe.
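    A minimal sketch of switching logging configuration by environment. The APP_ENV variable name and the environment names are assumptions; match them to however your deploy pipeline sets variables:

```python
import logging
import os

# Map each deploy environment to the log level described above.
LEVELS = {
    "development": logging.DEBUG,
    "staging": logging.INFO,
    "production": logging.WARNING,
}

env = os.getenv("APP_ENV", "development")

logging.basicConfig(
    level=LEVELS.get(env, logging.INFO),  # fall back to INFO for unknown envs
    format="[%(asctime)s] [%(levelname)s] [%(name)s] %(message)s",
)
```

    Driving the level from the environment means the same image or codebase runs everywhere, with only configuration changing between dev and prod.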

    Part 3: Monitoring, Alerting & Observability — Production Architecture

    The final deep-dive covers how large-scale systems (millions of users) keep themselves healthy, monitor failures, detect outages, and maintain reliability.

    🔥 1. Observability: The "Holy Trinity"

    Modern systems don't rely on just logging. True observability comes from three pillars:

    ✔ Logs

    Text records of what happened (requests, errors, info messages)

    ✔ Metrics

    Numerical time-series measurements (CPU, latency, throughput, queue length)

    ✔ Traces

    Request flow across services (API → DB → Worker → Cache)

    Together, these allow you to answer: What broke? Why? When? Where? Who was affected? How long?

    This is how Google, Netflix, Uber and Stripe prevent downtime.

    ⚙️ 2. Metrics You MUST Track in Any Real System

    A professional backend MUST export metrics like:

    Performance Metrics

    • Request latency (p50/p95/p99)
    • Event loop lag
    • Worker queue size
    • CPU usage per service
    • Memory usage per process

    Reliability Metrics

    • Error rate (5xx/exceptions)
    • Retry rate
    • Timeout rate
    • Dead-letter queue count

    Throughput Metrics

    • Requests per second
    • Tasks executed per minute
    • DB queries per request
    • Cache hit/miss ratio

    Example using Prometheus:

    Prometheus Metrics

    Track requests and latency

    Python
    from prometheus_client import Counter, Histogram
    
    REQUESTS = Counter("api_requests_total", "Total API Requests")
    LATENCY = Histogram("request_latency_seconds", "Latency")
    
    def handler():
        REQUESTS.inc()
        with LATENCY.time():
            process()

    These metrics show EXACTLY where your system slows or breaks.

    🧠 3. Alerting Like a Real SaaS Company

    Your system should notify you when:

    • ❌ Error rates spike (e.g., more than 2% of requests fail)
    • ❌ Latency spikes (p99 over 500ms)
    • ❌ CPU % hits critical (over 80% sustained)
    • ❌ Memory leaks begin (memory steadily grows without dropping)
    • ❌ Queue backups (worker queue length keeps increasing)

    Alert channels: Email, SMS, Discord webhook, Slack, PagerDuty (industry standard)

    Alerts MUST be actionable and never noisy. (Otherwise you get alert fatigue and ignore them.)

    🚨 4. Creating Alert Rules (Industry Examples)

    Latency Alert: IF p99_latency > 400ms FOR 5 minutes → alert

    Error Rate Alert: IF 5xx_errors > 2% OF total_requests FOR 2 minutes → alert

    Worker Queue Alert: IF queue_length > 1000 FOR 10 minutes → alert

    Memory Leak Alert: IF memory_usage increases for 30 minutes straight → alert

    This protects your system from silent failures.

    🔍 5. Log Schema for Enterprise-Level Systems

    Logs must be consistent, otherwise your log system becomes useless. Here's a clean, universal JSON format:

    {
      "timestamp": "2025-01-01T12:00:00Z",
      "level": "INFO",
      "service": "payment_api",
      "event": "charge_created",
      "message": "Payment successful",
      "request_id": "xyz-123",
      "user_id": 314159,
      "duration_ms": 142,
      "status_code": 200
    }

    Why JSON logs? Machines can process it, log systems index it, developers can query it, easy to parse in dashboards.

    Never do: Free-text logs, logs with no structure, logs with inconsistent keys.

    Stronger logs = faster debugging.

    🧩 6. Building Dashboards (Real Engineering Style)

    Good dashboards answer questions instantly.

    API Dashboard

    • Incoming requests
    • Error rate
    • Latency p50/p95/p99
    • DB queries per request
    • Cache hit %

    Worker Dashboard

    • Queue depth
    • Processing rate
    • Task wait time
    • Failures vs retries

    System Dashboard

    • CPU per service
    • Memory usage
    • Disk I/O
    • Network outbound/inbound

    Business Dashboard

    • Sign-ups
    • Payments
    • Failed payments
    • Daily active users

    These dashboards turn your logs + metrics into a map of system health.

    🧵 7. Detecting Problems Automatically (Heuristics)

    Some failures don't generate hard errors. Symptoms you must watch for:

    🚨 Slow Increase → Memory Leak (Memory grows hour after hour)

    🚨 Sawtooth Pattern → GC Thrashing (Frequent garbage collections)

    🚨 CPU Stuck at 100% → Hot Loop (Infinite work created accidentally)

    🚨 Queue Length Increasing → Bottleneck (Workers can't keep up)

    🚨 Error Spikes at Same Time Daily → Scheduled Task Problem

    Your monitoring system should automatically surface these patterns.

    🔧 8. Production Failure Scenarios & How to Diagnose

    Scenario 1: CPU Spikes to 100%

    Check: a py-spy flame graph, number of tasks, bad while loops, accidental synchronous code inside async

    Scenario 2: Latency Spikes

    Check: DB query count, slow endpoints, overloaded workers, missing indexes in SQL

    Scenario 3: Queue Backup

    Check: processing time per job, number of workers, retry storms, deadlocks

    Scenario 4: Silent Failure with No Errors

    Check: logs suppressed accidentally, exceptions swallowed, callbacks failed silently, network retries hiding real issues

    Scenario 5: Memory Leak

    Check: tracemalloc snapshots, reference cycles, global caches growing, large objects kept alive by closures

    These are real issues companies debug DAILY.
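    For the memory-leak scenario, the stdlib tracemalloc module shows exactly which line allocated the growth. A minimal sketch, where the list comprehension stands in for a real leak:

```python
import tracemalloc

tracemalloc.start()

before = tracemalloc.take_snapshot()
leaky = [bytes(1000) for _ in range(1000)]  # simulate ~1 MB of growth
after = tracemalloc.take_snapshot()

# Diff the snapshots: which source lines allocated the most new memory?
stats = after.compare_to(before, "lineno")
for stat in stats[:3]:
    print(stat)
```

    In production you would take snapshots minutes apart on a live process; the lines with the largest positive size_diff point straight at the leak.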

    📦 9. Production Architecture: Logging + Monitoring Stack

    Small app (learning platform)

    Logging: basic JSON logs, Monitoring: CloudWatch or simple Grafana, Error tracking: Sentry

    Medium SaaS

    Logging: Loki/Elasticsearch, Metrics: Prometheus, Traces: Jaeger, Error monitoring: Sentry

    Large platform

    Distributed tracing everywhere, Multi-layer dashboards, Real-time anomaly detection, Canary deployments with rollback triggers, Full observability pipeline

    This is the path your apps will grow into.

    🎓 10. The Engineering Mindset for Production Stability

    Professional devs think differently:

    ❌ Not: "How do I fix this bug?"

    ✔ Yes: "How do I prevent this class of bug forever?"

    ❌ Not: "Why did the app crash?"

    ✔ Yes: "Why wasn't this detected earlier?"

    ❌ Not: "Why is it slow?"

    ✔ Yes: "What instrumentation do we add so slowdowns are always visible?"

    You build systems that protect themselves.

    🎉 Final Summary — Master Level

    You now understand how real companies handle logging & debugging:

    ✔ Structured logs

    ✔ Centralised logging systems

    ✔ Correlation IDs

    ✔ Async logging techniques

    ✔ Distributed tracing

    ✔ Monitoring metrics

    ✔ Alert rules & thresholds

    ✔ Dashboards for system health

    ✔ Diagnosing CPU, memory, queue & latency issues

    ✔ Production-ready error strategies

    ✔ Observability stack for any size of project

    This is enterprise-level engineering knowledge, the type senior developers use every day.

    📋 Quick Reference — Logging & Debugging

    Syntax                                   What it does
    logging.basicConfig(level=logging.INFO)  Set up basic logging config
    logger = logging.getLogger(__name__)     Create a named logger
    logger.exception("msg")                  Log error with full traceback
    pdb.set_trace()                          Drop into interactive debugger
    breakpoint()                             Python 3.7+ built-in debugger shortcut

    🎉 Great work! You've completed this lesson.

    You can now add structured logging, use the debugger effectively, and handle errors at enterprise scale.

    Up next: Testing with pytest — write fixtures, parametrised tests, and mocks like a professional.
