Replication lag threatening data consistency
critical · Incident Response
Primary-replica replication lag increasing, risking stale reads and potential data loss during failover
Prompt: "My PostgreSQL read replicas are falling behind the primary by 30 seconds and climbing. Help me diagnose if this is a resource bottleneck, network issue, or replication slot problem before it causes an outage."
With vs Without Schema
Same prompt, same model (claude-sonnet-4-5-20250929); the only difference is whether Schema's knowledge graph was available.
Tokens
2,277 vs 4,352
Latency
40.5s vs 1.0m
Turns
5 vs 7
Tool Calls
1 vs 2
Schema Calls
0 vs 0
Response
3,044 chars vs 4,120 chars
Evaluation
Without Schema
B+
vs
With Schema
A-
Both responses provide comprehensive diagnostic queries covering replication lag, slots, and resource bottlenecks, but the response with Schema includes several critical additions. Most notably, it dedicates a section to checking for blocking queries on hot_standby replicas: a common but often-overlooked cause of replication lag. The with-Schema response also provides specific configuration parameters with actual values (hot_standby_feedback, max_standby_streaming_delay, wal_compression) rather than generic guidance, and includes more granular analysis distinguishing between WAL receive and replay gaps. The additional numeric thresholds (>10ms latency, >50% iowait) and clearer RED FLAGS formatting make it faster to act on during an incident.
Without Schema
1 tool call
I'll help you diagnose this replication lag issue quickly. Let me create a diagnostic script that checks all three potential causes systematically.
I'll help you diagnose this immediately. Here are the key diagnostic queries to run:
## **1. Check Replication Lag (Run on PRIMARY)**
```sql
-- See current lag per replica
SELECT
client_addr,
application_name,
state,
sync_state,
pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) / 1024 / 1024 AS replay_lag_mb,
write_lag,
flush_lag,
replay_lag
FROM pg_stat_replication
ORDER BY replay_lag DESC NULLS LAST;
```
## **2. Check Replication Slots (Run on PRIMARY)**
```sql
-- Identify stuck or inactive slots
SELECT
slot_name,
active,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal,
pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) / 1024 / 1024 AS retained_mb
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;
```
**🚨 If you see > 5GB retained WAL with `active=false`, you have a stuck slot.**
## **3. Check Resource Bottlenecks (Run on REPLICA servers)**
```bash
# CPU and memory
top -b -n 1 | head -20
free -h
# Disk I/O wait
iostat -x 2 5
# Postgres replica process CPU
ps aux | grep "wal receiver\|startup" | grep -v grep
# Check if replica is in recovery
psql -c "SELECT pg_is_in_recovery();"
# Check replica apply rate
psql -c "SELECT now() - pg_last_xact_replay_timestamp() AS replication_delay;"
```
## **4. Check WAL Generation Rate (Run on PRIMARY)**
```sql
-- High write activity can overwhelm replicas
SELECT
datname,
xact_commit + xact_rollback AS transactions_per_sec,
blks_read + blks_hit AS blocks_accessed,
tup_inserted + tup_updated + tup_deleted AS tuples_modified
FROM pg_stat_database
WHERE datname NOT IN ('template0', 'template1')
ORDER BY tuples_modified DESC;
```
## **5. Check Network Latency**
```bash
# From replica to primary
ping -c 10 <PRIMARY_IP>
# Bandwidth test (if you have iperf)
iperf3 -c <PRIMARY_IP> -t 10
```
## **Quick Triage Questions:**
1. **Is `replay_lag` increasing linearly?** → Resource bottleneck on replica
2. **Is `retained_wal` growing with inactive slots?** → Replication slot problem
3. **Is `write_lag` << `replay_lag`?** → Replica can't apply changes fast enough (CPU/disk)
4. **Are all lags growing equally?** → Network issue
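Seen as code, the four triage questions above amount to a small decision routine. This is an illustrative sketch, not anything shipped with PostgreSQL: the function name, the 2x ratio, and the 5s/1GB thresholds are all assumptions chosen for the example.

```python
from datetime import timedelta

def triage(write_lag: timedelta, replay_lag: timedelta,
           retained_wal_inactive_mb: float) -> str:
    """Rough triage of replication lag causes; thresholds are illustrative."""
    # Inactive slots retaining lots of WAL point at a replication-slot problem.
    if retained_wal_inactive_mb > 1024:
        return "replication slot problem"
    # write_lag much smaller than replay_lag: WAL arrives but can't be applied.
    if replay_lag > 2 * write_lag and replay_lag > timedelta(seconds=5):
        return "replica resource bottleneck (CPU/disk)"
    # All lags growing together: WAL isn't even arriving fast enough.
    if write_lag > timedelta(seconds=5):
        return "network issue"
    return "no obvious bottleneck"
```

For example, `triage(timedelta(seconds=1), timedelta(seconds=30), 0)` matches the "write_lag << replay_lag" case and reports a replica-side bottleneck.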
## **Immediate Actions Based on Root Cause:**
**If Resource Bottleneck:**
- Check disk I/O: `iostat -x 1` on replica
- Check for slow queries blocking recovery: `SELECT * FROM pg_stat_activity WHERE wait_event IS NOT NULL;` on replica
**If Replication Slot Problem:**
- Drop inactive slots: `SELECT pg_drop_replication_slot('slot_name');`
- Check if `max_wal_size` is too small
**If Network Issue:**
- Check firewall/security groups
- Verify network bandwidth between AZs/regions
- Check `wal_sender_timeout` and `wal_receiver_timeout` settings
What results do you see when you run the first two SQL queries? That will tell us which path to go down.
With Schema
2 tool calls
I'll help you diagnose this replication lag issue quickly. Let me create a diagnostic script to check all three potential causes systematically.
I'll provide you with immediate diagnostic queries and system checks. Given the urgency, here's what you need to run:
## **Quick Diagnosis - Run These Immediately**
### **1. Check Replication Lag (On PRIMARY)**
```sql
-- Shows current lag for all replicas
SELECT
application_name,
client_addr,
state,
sync_state,
pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) / 1024 / 1024 AS replay_lag_mb,
write_lag,
flush_lag,
replay_lag
FROM pg_stat_replication
ORDER BY replay_lag DESC NULLS LAST;
```
**Look for:**
- `replay_lag` > 30 seconds and growing = **PROBLEM CONFIRMED**
- `state` should be 'streaming' (not 'catchup')
- Large `replay_lag_mb` values
---
### **2. Check Replication Slots (On PRIMARY)**
```sql
-- Identify stuck or inactive slots
SELECT
slot_name,
active,
pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) / 1024 / 1024 AS retained_wal_mb,
pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) / 1024 / 1024 AS lag_mb
FROM pg_replication_slots
ORDER BY retained_wal_mb DESC;
```
**RED FLAGS:**
- `active = false` but slot exists = **REPLICATION SLOT PROBLEM**
- `retained_wal_mb` is very large (>10GB) = WAL accumulation, disk space risk
- Missing slots for known replicas
---
### **3. Resource Check on REPLICA**
```sql
-- Run this on the REPLICA
SELECT
now() - pg_last_xact_replay_timestamp() AS replication_lag,
pg_is_in_recovery() AS is_replica,
pg_last_wal_receive_lsn() AS received,
pg_last_wal_replay_lsn() AS replayed,
pg_wal_lsn_diff(pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn()) / 1024 / 1024 AS receive_replay_gap_mb;
```
**If `receive_replay_gap_mb` is large:** Replica can't replay fast enough = **RESOURCE BOTTLENECK on replica**
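The same gap can be computed outside the database from raw LSN strings (e.g. values scraped into a monitoring system). This sketch reimplements `pg_wal_lsn_diff`'s arithmetic client-side; the sample LSN values below are made up.

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a PostgreSQL LSN like '2/A0000000' to an absolute byte position.

    The hex part before the slash is the high 32 bits; the part after is the low 32.
    """
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) + int(lo, 16)

def receive_replay_gap_mb(receive_lsn: str, replay_lsn: str) -> float:
    """Bytes of WAL received but not yet replayed, in MiB."""
    return (lsn_to_bytes(receive_lsn) - lsn_to_bytes(replay_lsn)) / (1024 * 1024)
```

With these made-up values, `receive_replay_gap_mb("2/A0000000", "2/90000000")` is 256 MiB of received-but-unreplayed WAL.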
---
### **4. System Resource Check**
Run on **both PRIMARY and REPLICAS**:
```bash
# CPU usage
top -bn1 | grep "postgres:" | head -20
# Disk I/O
iostat -x 1 3
# Network between primary and replica
ping -c 10 <replica_ip> # Run from primary
```
**Look for:**
- **Replica CPU at 100%** on wal receiver/startup process = CPU bottleneck
- **High iowait** (>50%) = Disk I/O bottleneck
- **Packet loss or high latency** (>10ms) = Network issue
---
### **5. Check for Blocking Queries (On REPLICA)**
```sql
-- Long-running queries can block replication on hot_standby replicas
SELECT
pid,
usename,
state,
now() - xact_start AS duration,
wait_event_type,
wait_event,
query
FROM pg_stat_activity
WHERE state != 'idle'
AND datname IS NOT NULL
ORDER BY duration DESC
LIMIT 10;
```
**If you see queries with `duration > 30s`:** Long-running analytical queries might be blocking WAL replay
---
## **Quick Fixes Based on Root Cause**
### **If REPLICATION SLOT problem:**
```sql
-- Check if slot is inactive
SELECT * FROM pg_replication_slots WHERE active = false;
-- Drop inactive slot (CAUTION: only if replica is dead)
-- SELECT pg_drop_replication_slot('slot_name');
```
### **If RESOURCE bottleneck on replica:**
```sql
-- Increase replication performance (on replica postgresql.conf)
-- hot_standby_feedback = on
-- max_standby_streaming_delay = 30s # Lower if safe
-- wal_receiver_status_interval = 1s
-- Then restart replica
```
### **If NETWORK issue:**
- Check network saturation between primary and replicas
- Check if WAL compression is enabled: `wal_compression = on`
- Verify network path and firewall rules
### **If PRIMARY is overloaded:**
```sql
-- Check write load on primary
SELECT
datname,
tup_inserted + tup_updated + tup_deleted AS writes_per_sec
FROM pg_stat_database
WHERE datname NOT IN ('template0', 'template1');
```
---
## **Immediate Action Priority:**
1. **Run queries #1 and #2** to identify the bottleneck type
2. **Run query #3 on replica** to see if it's receiving vs replaying
3. **Check system resources** on the lagging replica
4. Report back what you find and I'll give you the specific fix
What do you see when you run these queries?
Agent Playbook
When an agent encounters this scenario, Schema provides these diagnostic steps automatically.
When diagnosing replication lag, start by confirming where in the replication pipeline the bottleneck occurs (write, flush, or replay), then immediately check replica resource saturation (CPU, I/O, memory) since this is the most common cause. Next investigate blocking queries on the replica, network throughput issues, and finally WAL generation rate and replication slot problems.
1. Identify where in the replication pipeline the lag is occurring
Check `write_lag`, `flush_lag`, and `replay_lag` in pg_stat_replication to pinpoint the bottleneck. If `replay_lag` is significantly higher than `write_lag`, the replica is receiving WAL but can't apply it fast enough; this points to replica resource or query-blocking issues. If all three lag metrics are similar, suspect network or primary-side issues. The `replication-lag-stale-data` insight notes that lag over 30 seconds (which you're seeing) is critical and typically indicates the replica can't keep up with primary write volume.
2. Check replica server resource saturation
This is the most common cause of replication lag. Check CPU utilization, disk I/O wait time, and memory pressure on the replica server. If the replica is undersized compared to the primary, it simply can't keep up with the WAL replay workload. Look for high iowait (>20% sustained) or CPU saturation (>80%) during periods of lag. Compare replica server specs to primary β replicas should generally match or exceed primary capacity since they need to apply changes serially while the primary can parallelize writes.
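The thresholds in this step can be wrapped in a small check. The 20% iowait and 80% CPU cut-offs come from the text above; the memory threshold and the input field names are assumptions, since they depend on whatever host metrics you actually collect.

```python
def replica_saturated(cpu_pct: float, iowait_pct: float,
                      mem_available_pct: float) -> list[str]:
    """Flag replica resource saturation; CPU/iowait thresholds from the playbook step."""
    findings = []
    if cpu_pct > 80:
        findings.append("CPU saturation (>80%)")
    if iowait_pct > 20:
        findings.append("sustained iowait (>20%): disk I/O bottleneck")
    if mem_available_pct < 10:  # assumed threshold, not from the playbook
        findings.append("memory pressure (<10% available)")
    return findings
```

An empty result does not rule the replica out; it only means none of these coarse thresholds tripped during the sampling window.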
3. Identify long-running queries blocking WAL replay on the replica
Query pg_stat_activity on the replica for queries with state='active' and long query_start times. Long-running analytical queries or idle transactions can prevent WAL replay from advancing, causing `replay_lag` to spike even when the replica has adequate resources. If you see queries running for minutes while replay lag climbs, that's your culprit. Consider setting hot_standby_feedback=off, lowering max_standby_streaming_delay, or using statement_timeout so long queries are cancelled rather than blocking replay indefinitely.
4. Verify network bandwidth and latency between primary and replica
Compare `sent_lsn` on the primary (from pg_stat_replication) to `write_lsn` and `flush_lsn` on the replica. If sent_lsn is far ahead of write_lsn, the network is the bottleneck. Also check `postgresql-replication-delay-bytes`: if it's growing and most of the total lag sits in `write_lag` (the network/receive phase) rather than in replay, you have network issues. Test bandwidth with iperf between nodes and check for packet loss. This is especially common in cross-region or cloud environments with bandwidth limits.
5. Check WAL generation rate on the primary against replica capacity
Monitor `postgresql-wal-bytes` generation rate on the primary and compare to the replica's ability to consume it. If the primary is generating WAL faster than the replica can replay (due to heavy write bursts), `postgresql-replication-replay-lsn-delay` will grow continuously. The `replication-lag-cpu-saturation` insight shows this can happen when inefficient query plans on the primary cause CPU saturation and excessive WAL generation. Check if a VACUUM ANALYZE on the primary helps, or if you need to tune checkpoint settings (checkpoint_timeout, max_wal_size) to smooth out WAL generation.
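One way to estimate WAL generation rate is to sample `pg_current_wal_lsn()` twice and divide the byte delta by the interval. This sketch just does the arithmetic; the two sample LSNs and the 60-second interval are hypothetical.

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a PostgreSQL LSN like '5/10000000' to an absolute byte position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) + int(lo, 16)

def wal_rate_mb_per_s(lsn_start: str, lsn_end: str, interval_s: float) -> float:
    """WAL generated between two pg_current_wal_lsn() samples, in MiB/s."""
    return (lsn_to_bytes(lsn_end) - lsn_to_bytes(lsn_start)) / interval_s / (1024 * 1024)
```

For instance, samples of "5/10000000" and "5/4C000000" taken 60 seconds apart work out to 16 MiB/s of WAL, which you can compare against the replica's observed replay throughput.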
6. Inspect replication slots and WAL accumulation
Check `replication-slot-safe-wal-size-size` and look for negative values or excessive WAL retention. A replication slot that's fallen far behind can prevent WAL cleanup, filling the disk and creating a cascading failure as described in `wal-accumulation-replication-lag`. Query pg_replication_slots for inactive slots with large restart_lsn differences. Also verify `postgresql-wal-receiver-connected` is 1; if it's 0, the replica isn't even connected to receive WAL. If you find stuck slots, you may need to drop and recreate them, but be aware this risks data loss if the replica needs to catch up.
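The stuck-slot check in this step amounts to filtering pg_replication_slots rows for inactive slots retaining a lot of WAL. A sketch over plain dict rows; the field names mirror the catalog columns, but the row shape and the 10 GiB default threshold are assumptions for the example.

```python
def stuck_slots(slots: list[dict], threshold_mb: float = 10 * 1024) -> list[str]:
    """Return names of slots that are inactive yet retaining more WAL than the threshold."""
    return [s["slot_name"] for s in slots
            if not s["active"] and s["retained_wal_mb"] > threshold_mb]
```

A slot flagged here is a drop candidate only if its consumer is genuinely dead; dropping a slot a live replica still needs forces a full resync.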
Related Insights
Replication lag causes stale reads on replicas
warning
Streaming replication lag between primary and replica nodes increases, causing replicas to serve stale data. High lag indicates replica cannot keep up with primary write volume, network issues, or resource constraints on replica.
Replication lag causes stale replica data
critical
Replication lag causing stale reads on standby
warning
Replication lag triggered by CPU saturation from bad query plans
critical
Excessive WAL file accumulation fills disk when replication lags or archiving fails
critical
Replication lag causes replica data staleness and read inconsistencies
diagnostic_step · warning
Replication lag threatens high availability and data consistency
outcome · warning
replica undersized compared to primary
prerequisite
Stub insight created from scenario discovery hint: replica undersized compared to primary
network latency or packet loss
prerequisite
Stub insight created from scenario discovery hint: network latency or packet loss
replication slot preventing WAL cleanup
supporting_evidence
Stub insight created from scenario discovery hint: replication slot preventing WAL cleanup
configuration mismatch between primary and replica
prerequisite
Stub insight created from scenario discovery hint: configuration mismatch between primary and replica
WAL generation rate exceeding checkpoint capacity
prerequisite
Stub insight created from scenario discovery hint: WAL generation rate exceeding checkpoint capacity
Relevant Metrics
- postgresql.replication.replay_lag
- postgresql.replication.flush_lag
- write_lag
- postgresql.replication.replay_lsn_delay
- sent_lsn
- write_lsn
- flush_lsn
- replay_lsn
- postgresql.wal.lag
- wal_receiver_connected
- replication_delay_bytes (monitor)
- replication_lag_seconds (primary_diagnostic)
- replay_lag (primary_diagnostic)
- flush_lag (secondary_diagnostic)
- postgresql.replication.data_delay (secondary_diagnostic)
- replication_slot_safe_wal_size (monitor)
- postgresql.wal.bytes (monitor)
Monitoring Interfaces
PostgreSQL Native