Lesson 22 • Advanced Track
Replication: Sync, Async, Failover & Read Scaling
By the end of this lesson you'll be able to reason about a production database the way an on-call engineer does: choose synchronous vs asynchronous replication for a given durability requirement, route reads to replicas without serving stale data, and survive a server failure with a clean, fenced failover.
What You'll Learn
- ✓How a primary and its replicas divide reads and writes
- ✓Synchronous vs asynchronous: the durability ↔ latency trade-off
- ✓Physical/streaming vs logical replication, and when to pick each
- ✓Scale reads by routing SELECTs to replicas
- ✓Avoid stale reads & the read-your-writes trap from replication lag
- ✓Promote a replica on failure without causing split-brain
COMMIT. Replication is what happens after a commit, so you need that idea solid first. These commands are PostgreSQL administration statements — you normally run them on a real server cluster, not a single-file playground. The in-browser editor lets you read and edit them; every example explains exactly what it returns or does so you can follow along without a cluster.The Picture: One Primary, Many Replicas
Replication means keeping live copies of your database on more than one server. Exactly one node — the primary (a.k.a. leader/master) — accepts writes. The others — replicas (a.k.a. standbys/followers) — continuously copy the primary's changes and serve read-only queries. Picture a store: only the manager restocks shelves (writes), but every checkout counter can serve customers (reads).
Result — who does what in a typical 3-node cluster:
| Node | Role | Handles writes? | Handles reads? | If primary dies |
|---|---|---|---|---|
| db-1 | Primary | Yes | Yes (but keep it free for writes) | — (it's the one that died) |
| db-2 | Replica | No (read-only) | Yes | Can be promoted to primary |
| db-3 | Replica | No (read-only) | Yes | Can be promoted to primary |
Writes flow in one direction: client → primary → replicas. That single rule is what keeps the copies consistent — and it's why the interesting questions are all about timing (how far behind are replicas?) and failure (what if the primary dies?).
1. Primary vs Replica — Knowing Who You Are
Before anything else, a server needs to know its own role, because a replica that thinks it's a primary is a disaster waiting to happen. PostgreSQL answers this with one function, and you'll use it constantly — especially right after a failover.
🏪 Real-world analogy
The primary is the shop's manager (the only one allowed to restock), and replicas are the cashiers (they serve customers but can't change stock). When the manager is out, you promote one cashier to acting manager — but first you make sure the old manager can't quietly keep restocking too.
Am I primary or replica?
The one function every node uses to know its role.
-- Ask any PostgreSQL server one question:
-- "Am I a primary (read+write) or a replica (read-only)?"
SELECT pg_is_in_recovery();
-- FALSE = this server is the PRIMARY — it accepts INSERT/UPDATE/DELETE.
-- TRUE = this server is a REPLICA — it is read-only and is
-- replaying changes streamed from the primary.
-- Every replication setup starts here: exactly ONE primary owns
-- the writes; every other node is a copy that follows it.Result — run on a replica:
| pg_is_in_recovery |
|---|
| t |
t (true) means read-only replica; f (false) means it's the primary and accepts writes. This is the check your failover automation runs to confirm a promotion actually worked.
2. Asynchronous Replication & Replication Lag
Asynchronous replication is the common default: the primary commits a change the instant it has the change safely on its own disk, then streams that change to replicas afterwards. It never waits for them. That makes writes fast, but it introduces replication lag — the gap (usually milliseconds) between the primary having data and a replica having it.
Async is like posting a letter: you drop it in the box and carry on; the recipient reads it whenever it arrives. The number you must watch is the lag, because it's also your worst-case data loss if the primary dies before a replica catches up.
replication_lag creeping upward almost always means the replica can't keep up — slow replica disk (WAL replay is I/O-bound), a saturated network link, or a long-running query on the replica blocking replay. Alert on it.Async replication — measure the lag
See connected replicas and how far behind they are.
-- ASYNCHRONOUS replication (the common default)
-- The primary commits as soon as IT has the change on disk and
-- streams the change to replicas AFTERWARDS. It never waits.
-- → Fast writes, but a replica can be a few ms–seconds behind.
-- 1) On the PRIMARY: see every connected replica and how far behind it is.
SELECT client_addr,
state,
sync_state, -- 'async' here
pg_wal_lsn_diff(sent_lsn, replay_lsn) AS lag_bytes
FROM pg_stat_replicati
...Result — pg_stat_replication on the primary:
| client_addr | state | sync_state | lag_bytes |
|---|---|---|---|
| 10.0.0.12 | streaming | async | 4096 |
| 10.0.0.13 | streaming | async | 8192 |
3. Synchronous Replication — Trading Latency for Durability
Synchronous replication flips the deal: the primary will not tell your application COMMIT succeeded until a replica has also stored the change. That guarantees zero data loss if the primary instantly explodes — but every write now pays a network round-trip, so writes are slower. Sync is a phone call (you wait for "got it"); async is a letter.
This is the central trade-off of the whole lesson: durability vs latency. You don't have to choose once for the whole database — PostgreSQL's synchronous_commit setting can be changed per transaction, so money-movement can be synchronous while analytics logging stays fast and async.
remote_apply is the strictest level: COMMIT waits until the replica has applied the change, so a read on that replica is guaranteed to see it. off is the loosest: fastest writes, but you may lose the last few committed transactions in a crash.
Synchronous commit levels
The durability ↔ latency dial, set per transaction.
-- SYNCHRONOUS replication
-- The primary will NOT report COMMIT to your app until a replica has
-- also stored the change. Zero data loss on failover — slower writes.
-- postgresql.conf on the primary:
-- synchronous_standby_names = 'replica1'
-- synchronous_commit = on
-- 'synchronous_commit' is a DURABILITY vs LATENCY dial:
-- off -> fastest, may lose the last few committed txns
-- local -> fast, safe on PRIMARY's disk only
-- remote_write -> medium,
...Result — synchronous_commit — durability vs latency:
| Level | Write speed | Guarantee on a primary crash |
|---|---|---|
| off | Fastest | May lose last few committed txns |
| local | Fast | Safe on primary's disk only |
| remote_write | Medium | Replica has received it |
| on (default) | Slower | Replica has flushed it to disk |
| remote_apply | Slowest | Replica has applied it (reads see it) |
Your Turn: sync or async?
You're recording a bank transfer that must never be lost on failover, even if it makes the write slower. Fill in the blank with the right synchronous_commit level. The expected answer and why are in the comments.
🎯 Your Turn: durability for money
Pick the level that guarantees the replica applied the row.
-- 🎯 YOUR TURN — choose the right durability level, then run it.
-- Requirement: this INSERT records a customer's bank transfer.
-- Losing it on a failover is UNACCEPTABLE, even if writes get slower.
BEGIN;
SET LOCAL synchronous_commit = ___; -- 👉 the level where the replica
-- has APPLIED the row before COMMIT
INSERT INTO transfers (amount) VALUES (12000.00);
COMMIT;
-- ✅ Expected: 'remote_apply'
-- COMMIT now blocks until a replica confirms the
...4. Streaming/Physical vs Logical Replication
There are two mechanisms for copying the data. Physical (streaming) replication ships the raw write-ahead log (WAL) — a byte-for-byte copy of the entire cluster. It's cheap, fast, and what you use for high availability and read replicas, but it's all-or-nothing and tied to the same major version.
Logical replication instead decodes changes into rows and replicates table-by-table using a publish/subscribe model. That lets you copy only certain tables, replicate across different major versions, or funnel several databases into one warehouse — at the cost of more moving parts (DDL changes and sequences aren't carried over automatically).
Physical vs logical replication
Whole-cluster byte copy vs selective table sync.
-- TWO ways to copy the data, for two different jobs.
-- PHYSICAL / STREAMING replication (block-for-block copy)
-- Ships the raw WAL (write-ahead log) — an exact byte copy of the
-- primary. Whole cluster only, same major version. This is what you
-- use for high availability and read replicas. Cheap and fast.
-- LOGICAL replication (publish / subscribe, table by table)
-- Ships decoded row changes, so you choose WHICH tables to copy.
-- On the PUBLISHER (source):
CREATE PUBLICATION s
...Result — pick the mechanism for the job:
| Aspect | Physical / streaming | Logical (pub/sub) |
|---|---|---|
| Scope | Whole cluster | Chosen tables |
| Cross major version | No | Yes |
| Typical use | HA + read replicas | Selective sync / warehousing |
| DDL & sequences | Always in sync | Not auto-replicated |
5. Read Scaling — Routing Reads to Replicas
Because replicas hold a full copy of the data, you can spread your read traffic across them while the primary handles only writes. Your application keeps two connection pools and routes each query by type: SELECT goes to a replica, anything that writes goes to the primary. Add more replicas, serve more reads — that's horizontal read scaling.
Route reads to replicas
Two pools, and how to dodge stale reads.
-- READ SCALING: send writes to the primary, spread reads over replicas.
-- Your app keeps TWO connection pools and routes each query by type.
-- write_pool -> primary_host (INSERT / UPDATE / DELETE)
-- read_pool -> replica_loadbalancer (SELECT)
-- Pseudocode router:
-- if (query is SELECT) use read_pool; -- scales horizontally
-- else use write_pool; -- one writer, always correct
-- ⚠️ Async replicas LAG. A SELECT from a replica may not yet see a row
--
...Your Turn: route the read & spot the risk
Send a heavy dashboard query to the right pool, then explain the lag risk in a comment. Both blanks have their expected answers in the code.
🎯 Your Turn: read_pool vs write_pool
Pick the pool, then explain when a replica read is risky.
-- 🎯 YOUR TURN — route this report query to a read replica.
-- Goal: count yesterday's orders for a dashboard. Stale-by-seconds is fine.
-- 1) Pick the pool: a heavy read-only SELECT should NOT hit the primary.
USE ___; -- 👉 write_pool or read_pool ?
SELECT COUNT(*) AS orders_yesterday
FROM orders
WHERE order_date = CURRENT_DATE - 1;
-- ✅ Expected: read_pool
-- Reads scale across replicas, sparing the primary for writes.
-- 👉 In one sentence in a -- comment, say WHY this is safe here
...6. Failover & Promotion — Surviving a Dead Primary
When the primary fails, the show must go on: a replica is promoted to become the new primary that accepts writes. You can do this manually (run pg_promote() on a replica) or automatically with a cluster manager like Patroni, which detects the failure and promotes the healthiest replica within seconds — your app just reconnects through a pooler with no code change.
Failover & promotion
Promote a replica, confirm the role, avoid split-brain.
-- FAILOVER: the primary died — promote a replica to take over writes.
-- Manual promotion, run ON the chosen replica:
SELECT pg_promote(wait := true, timeout := 60);
-- Afterwards confirm it is now the primary:
SELECT pg_is_in_recovery(); -- must return FALSE (writes allowed)
-- AUTOMATIC failover (e.g. Patroni + etcd/Consul):
-- 1. The cluster manager misses the primary's heartbeat.
-- 2. It picks the most up-to-date replica as the new leader.
-- 3. It promotes that replica and point
...Deep Dive: Why Not Just Have Multiple Primaries?
It's tempting to let every node accept writes (multi-primary, a.k.a. multi-master) for even more write capacity and no failover step. The catch is conflicts: if two primaries update the same row at the same time, which value wins? You're now signed up for conflict-resolution rules, possible lost updates, and far harder reasoning about consistency.
For the vast majority of apps, a single primary plus read replicas is simpler, correct by construction, and scales reads — which is the bottleneck most systems actually hit first. Reach for multi-primary only with a clear need (e.g. multi-region active-active writes) and eyes open to the trade-offs.
Common Errors (and the fix)
- Reading stale data from a lagged replica: a user saves, reloads, and the change is "missing" because they hit a behind replica. Fix: read-your-writes — route that user to the primary briefly, or make the write synchronous (
remote_apply). - Async data loss on failover: the primary dies before replicas catch up, and the last few async commits are gone. Fix: use synchronous replication for the data that must never be lost; accept the latency cost for that data only.
- Split-brain (two primaries): the old primary returns and keeps writing after a replica was promoted, so data diverges. Fix: fence the old primary before/at promotion and use a quorum-based manager (Patroni + etcd) so only one leader can exist.
- No monitoring of replication lag: you discover the lag only during an outage. Fix: alert on
NOW() - pg_last_xact_replay_timestamp()and onpg_stat_replicationlong before you ever need a failover. - Replication slot fills the primary's disk: a replica goes offline but its slot makes the primary keep every WAL file, eventually filling the disk. Fix: monitor
pg_replication_slotsand set amax_slot_wal_keep_sizecap.
📘 Quick Reference
| Command / setting | Purpose |
|---|---|
| pg_is_in_recovery() | TRUE = replica (read-only), FALSE = primary |
| pg_stat_replication | On primary: each replica's state & lag_bytes |
| pg_last_xact_replay_timestamp() | On replica: measure replication lag |
| synchronous_commit = on | Wait for replica to flush (durable, slower) |
| synchronous_commit = remote_apply | Wait until replica applied (reads see it) |
| synchronous_commit = local / off | Don't wait (fast, async, may lose data) |
| CREATE PUBLICATION … FOR TABLE | Logical replication: publish chosen tables |
| CREATE SUBSCRIPTION … PUBLICATION | Logical replication: subscribe on destination |
| pg_promote() | Promote this replica to primary (failover) |
Frequently Asked Questions
Q: Sync or async — which should I default to?
Default to asynchronous for its speed, then make specific high-value writes synchronous with a per-transaction SET LOCAL synchronous_commit = 'remote_apply'. Making everything synchronous is usually slower than you need.
Q: Can I write to a replica?
No. Replicas are read-only — an attempted write errors with "cannot execute … in a read-only transaction". Writes go to the primary; that one-way flow is what keeps the copies consistent.
Q: How is replication different from a backup?
A backup is a point-in-time snapshot for recovering from data loss or corruption. Replication is a continuously-updated live copy for availability and read scaling. You need both — replication won't save you from a bad DELETE, because it faithfully copies the delete to every replica.
Q: Does adding read replicas speed up writes?
No — there's still a single primary doing all writes, and with synchronous replication more replicas can make writes slightly slower. Replicas scale reads; scale writes with bigger hardware, partitioning/sharding, or (carefully) multi-primary.
Mini-Challenge: Design a Topology
Put it all together — a brief and a blank canvas. Design the replication topology for a read-heavy news site that must survive one server dying. The expected shape of a good answer is in the comments.
🎯 Mini-Challenge
Design primaries, replicas, sync choice, and failover for a read-heavy app.
-- 🎯 MINI-CHALLENGE: design a topology for a read-heavy app
-- Scenario: a news site. ~95% of traffic is reads (article pages),
-- ~5% is writes (editors publishing). It must survive one server dying.
--
-- Sketch your answer as -- comments (no SQL to run). Decide:
-- 1. How many PRIMARIES? (who owns the writes?)
-- 2. How many REPLICAS, and what do they serve?
-- 3. Async or sync replication for the editors' writes — and why?
-- 4. How do readers avoid hitting a lagged replica right af
...🎉 Lesson Complete
- ✅ One primary owns writes; replicas copy it and serve reads
- ✅ Async is fast but lags; sync trades latency for zero data loss
- ✅ Physical/streaming copies the whole cluster; logical copies chosen tables
- ✅ Route
SELECTs to replicas to scale reads — but mind read-your-writes - ✅ On failure, promote a caught-up replica and fence the old primary to avoid split-brain
- ✅ Next: Advanced Views — virtual tables that simplify complex queries
Sign up for free to track which lessons you've completed and get learning reminders.