Autovacuum Falling Behind

warningIncident Response

Dead tuple accumulation indicates autovacuum is not keeping up with update/delete workload, risking bloat and performance degradation.

Prompt: “My Postgres tables are showing millions of dead tuples and autovacuum doesn't seem to be keeping up. How do I know if I need to tune autovacuum settings or if there's something blocking it?”

With vs Without Schema

Same prompt, same model (claude-sonnet-4-5-20250929) — the only difference is whether Schema's knowledge graph was available.

Tokens

1,029vs3,185

Latency

20.6smsvs1.2mms

Turns

2vs32

Tool Calls

0vs16

Schema Calls

0vs11

Response

2,874 charsvs2,771 chars

Evaluation

Without Schema

B+

With Schema

A-

The response with Schema provides deeper technical insight by explaining the mathematical reason why default autovacuum settings fail on large tables (a 10M row table requires 2M+ changes before triggering vacuum), while the response without Schema provides this tuning advice without the underlying rationale. The Schema-informed response also includes a more comprehensive list of blocking scenarios (prepared transactions, replication slots, hot_standby_feedback) and cites authoritative sources (Citus, pganalyze, PostgreSQL docs) that allow the operator to verify the guidance. However, the response without Schema offers a slightly more immediately actionable diagnostic workflow with four copy-paste-ready SQL queries. Both responses correctly identify the core issues and provide accurate guidance, but the Schema-informed answer gives the operator better understanding of the root cause and more confidence through external references.

Without Schema

I'll help you diagnose your autovacuum issues. Here's a systematic approach to identify whether autovacuum is blocked or just needs tuning:

## 1. **Check for Long-Running Transactions (Most Common Blocker)**

```sql
-- Find long-running transactions that block autovacuum
SELECT pid, 
       usename, 
       application_name,
       state,
       age(clock_timestamp(), xact_start) AS xact_age,
       age(clock_timestamp(), query_start) AS query_age,
       query
FROM pg_stat_activity
WHERE state != 'idle'
  AND xact_start IS NOT NULL
ORDER BY xact_start;
```

**Why this matters:** Even a single long-running transaction prevents autovacuum from cleaning up rows. The transaction's snapshot keeps those "dead" tuples visible.

## 2. **Check Current Autovacuum Activity**

```sql
-- See what autovacuum is currently doing
SELECT pid, 
       age(clock_timestamp(), query_start) AS duration,
       usename, 
       query,
       wait_event_type,
       wait_event
FROM pg_stat_activity 
WHERE query LIKE '%autovacuum%' 
  AND pid != pg_backend_pid();
```

## 3. **Identify Tables with Bloat**

```sql
-- Check dead tuple stats per table
SELECT schemaname, 
       relname,
       n_live_tup,
       n_dead_tup,
       round(n_dead_tup * 100.0 / NULLIF(n_live_tup + n_dead_tup, 0), 2) AS dead_pct,
       last_autovacuum,
       last_vacuum
FROM pg_stat_user_tables
WHERE n_dead_tup > 1000
ORDER BY n_dead_tup DESC
LIMIT 20;
```

## 4. **Check Autovacuum Configuration**

```sql
-- Current autovacuum settings
SELECT name, setting, unit, context
FROM pg_settings
WHERE name LIKE '%autovacuum%' 
   OR name LIKE '%vacuum%'
ORDER BY name;
```

## Key Diagnostics

**If autovacuum is blocked:**
- Long-running queries in step #1 (especially `xact_age > 1 hour`)
- Idle transactions holding locks
- Replication slots lagging behind

**If autovacuum needs tuning:**
- `last_autovacuum` is NULL or very old
- Dead tuple % keeps climbing
- Autovacuum processes are running but falling behind

## Quick Fixes

**If blocked by transactions:**
```sql
-- Find and kill the blocking transaction (use with caution)
SELECT pg_terminate_backend(pid) 
FROM pg_stat_activity 
WHERE pid = <blocking_pid>;
```

**If autovacuum is too conservative:**
```sql
-- Temporarily more aggressive settings (requires reload)
ALTER SYSTEM SET autovacuum_vacuum_scale_factor = 0.05;  -- default: 0.2
ALTER SYSTEM SET autovacuum_vacuum_threshold = 50;       -- default: 50
ALTER SYSTEM SET autovacuum_max_workers = 5;             -- default: 3
SELECT pg_reload_conf();
```

**Manual vacuum for immediate relief:**
```sql
VACUUM VERBOSE ANALYZE your_table_name;
```

---

Run these queries and share the results - I can help interpret them and recommend specific tuning based on what we find. The most common culprit is usually an abandoned idle transaction or a slowly-consuming application connection.

With Schema

16 tool calls11 Schema

I'll check Schema.ai's expert guidance on diagnosing autovacuum issues with dead tuples.
Based on Schema.ai's expert guidance, here's how to diagnose whether you need to tune autovacuum settings or if something is blocking it:

## First: Check if autovacuum is blocked

Run this query to find long-running transactions that prevent vacuum from cleaning up dead tuples:

```sql
SELECT pid, datname, usename, state, backend_xmin, 
       age(backend_xmin) as xmin_age,
       now() - xact_start AS duration
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC;
```

**What blocks autovacuum:**
- **Long-running transactions** (on any table, not just the one being vacuumed)
- **Stale replication slots** - check `pg_replication_slots` for high `age(xmin)`
- **Abandoned prepared transactions** - check `pg_prepared_xacts`
- **Long queries on standbys** (if `hot_standby_feedback = on`)

If you see "dead tuples not yet removable" in autovacuum logs, this is the smoking gun that something is holding back the xmin horizon.

## Second: Check current dead tuple accumulation

```sql
SELECT schemaname, relname, n_live_tup, n_dead_tup,
       last_vacuum, last_autovacuum,
       n_dead_tup::float / NULLIF(n_live_tup, 0) AS dead_ratio
FROM pg_stat_user_tables
WHERE n_dead_tup > 0
ORDER BY n_dead_tup DESC
LIMIT 20;
```

## Third: Determine if tuning is needed

**For large tables (>1M rows)**, the default settings cause problems:
- Default: `autovacuum_vacuum_threshold = 50` + `autovacuum_vacuum_scale_factor = 0.2` (20%)
- A 10M row table needs **2M+ changes** before autovacuum triggers
- This causes massive dead tuple accumulation

**Fix for large tables:**
```sql
ALTER TABLE your_large_table SET (
  autovacuum_vacuum_scale_factor = 0.05,  -- or even 0.01
  autovacuum_vacuum_threshold = 1000
);
```

**Global tuning options** (if many tables affected):
- `autovacuum_max_workers` - increase concurrent vacuum processes
- `autovacuum_naptime` - reduce time between autovacuum runs
- `autovacuum_vacuum_cost_limit` - increase to make vacuum more aggressive

## Resolution steps

1. **If blocked**: Terminate blocking transactions with `pg_terminate_backend(pid)`
2. **If misconfigured**: Adjust scale factors for large tables
3. **Never disable autovacuum** - it risks transaction ID wraparound

**Sources:**
- [Citus: Debugging autovacuum problems](https://www.citusdata.com/blog/2022/07/28/debugging-postgres-autovacuum-problems-13-tips)
- [pganalyze VACUUM Advisor](https://pganalyze.com/blog/introducing-vacuum-advisor-postgres)
- [PostgreSQL docs: Routine Vacuuming](https://www.postgresql.org/docs/current/routine-vacuuming.html)
- [pghealth.io production guide](https://pghealth.io/blog/postgresql_health_check_production_guide)

Agent Playbook

When an agent encounters this scenario, Schema provides these diagnostic steps automatically.

When autovacuum falls behind, start by quantifying the dead tuple accumulation to understand severity, then verify autovacuum is actually running by checking last vacuum timestamps. The most common blockers are long-running transactions preventing cleanup and worker pool saturation on busy systems; if those aren't the issue, you likely need to tune autovacuum thresholds and cost limits to match your write workload.

1Quantify dead tuple accumulation and identify affected tables

Check `dead-tuple` (n_dead_tup) and calculate dead tuple percentage using `live-tuple` across all tables. Tables with >1000 dead tuples or >10% dead ratio are bloated—query pg_stat_user_tables ordered by n_dead_tup DESC to find the worst offenders. This baseline tells you whether you have a widespread problem or just a few high-churn tables, which determines whether you need global config changes or per-table tuning.

Dead tuple accumulation causes table bloat and performance degradation High dead tuple percentage degrades query performance and bloats tables Table bloat from dead tuples degrades performance and wastes disk space n_dead_tupn_live_tuppostgresql.dead_rowspostgresql.live_rows

2Check if autovacuum is running at all

Look at `last-vacuum` and `postgresql-last-vacuum-age` for your high-dead-tuple tables. If last_autovacuum is NULL or >24 hours old on active tables, autovacuum isn't running or is severely delayed. Check `postgresql-vacuum-count` to see if the counter is incrementing over time—if autovacuum hasn't run in 24+ hours on tables accumulating dead tuples, something is blocking or disabling it entirely.

Autovacuum has not run in 24+ hours on active tables Autovacuum falling behind on tables causes bloat accumulation last_vacuumpostgresql.last_vacuum_agepostgresql.vacuum.countautovacuum_timestamp

3Look for long-running transactions blocking vacuum

Query pg_stat_activity for transactions with xact_start older than 1+ hours. Even idle transactions hold a transaction ID snapshot that prevents autovacuum from reclaiming rows visible to that snapshot. A single connection sitting idle in transaction for hours can block autovacuum across the entire database—this is the most common cause of autovacuum appearing to "not work" despite being configured correctly.

long-running transactions preventing vacuum Blocked autovacuum allows unbounded bloat accumulation

4Check autovacuum worker pool saturation

Count active autovacuum workers with `SELECT count(*) FROM pg_stat_activity WHERE query LIKE 'autovacuum:%'`. If this count frequently equals autovacuum_max_workers (default 3), your workers are saturated and tables are queuing. With dozens of tables needing vacuum but only 3 workers, each table waits its turn while bloat accumulates—increase autovacuum_max_workers to 5-10 on busy systems to allow parallel processing.

Autovacuum worker starvation delays vacuum on high-churn tables autovacuum workers insufficient for workload Autovacuum worker pool saturation delays emergency vacuum completion

5Evaluate autovacuum threshold and cost settings for your workload

For large or high-write tables, the default autovacuum_vacuum_scale_factor of 0.2 means autovacuum won't trigger until 20% of rows change—on a 10M row table, that's 2M changes before vacuum starts. Check `insert-since-vacuum` and `mod-since-analyze` to see how many changes are queued. Lower scale_factor to 0.05-0.1 for high-churn tables, increase autovacuum_vacuum_cost_limit from default 200 to 500-1000, and reduce cost_delay to 10-20ms to make vacuum more aggressive.

Autovacuum lag causes table bloat and performance degradation Autovacuum falls behind on high-write tables due to conservative defaults autovacuum cost limits too conservative Table bloat from suboptimal autovacuum settings slows query performance Autovacuum lag causes bloat and planner estimate degradation n_ins_since_vacuumn_mod_since_analyzetable_bloat

6Check for transaction ID wraparound pressure

Review `postgresql-before-xid-wraparound` to see how many transactions remain before wraparound. If this is below 10 million, PostgreSQL will force aggressive emergency autovacuum that can saturate workers and delay normal maintenance vacuum. Wraparound pressure can make it seem like autovacuum is "falling behind" when it's actually prioritizing anti-wraparound work over bloat cleanup.

approaching transaction ID wraparound Autovacuum worker pool saturation delays emergency vacuum completion before_xid_wraparound