Comprehensive Operational Readiness Review
Proactive Health
Baseline assessment of PostgreSQL instance health, configuration, and operational maturity across all dimensions.
Prompt: “I set up PostgreSQL a while ago and the workload has evolved a lot since then. How do I make sure it's well configured and provisioned? I want a comprehensive health check covering performance, capacity, reliability, and cost.”
Agent Playbook
When an agent encounters this scenario, Schema provides these diagnostic steps automatically.
For a comprehensive operational readiness review, start by understanding how your workload has evolved since initial setup, then ensure your reliability fundamentals (backups, replication, critical risks) are sound before diving into resource constraints, autovacuum effectiveness, and configuration tuning opportunities. This prioritizes preventing outages over optimizing performance.
1. Establish baseline health and understand workload evolution
Start with `overall-instance-health-status` and `workload-characteristics-and-trends` to understand what's actually changed. Check `postgresql-backends` and `postgresql-connections-by-process` to see current connection patterns, and `postgresql-database-size` to understand growth. If your database has grown 10x but your config hasn't changed, you'll find issues in every other area. This baseline tells you what to prioritize in subsequent steps.
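If you have direct psql access, the same baseline can be gathered from the standard statistics views (the hyphenated names above are this playbook's insight identifiers, not SQL objects). A minimal sketch:

```sql
-- Database sizes, largest first: the raw input behind growth/baseline checks
SELECT datname,
       pg_size_pretty(pg_database_size(datname)) AS size
FROM pg_database
WHERE NOT datistemplate
ORDER BY pg_database_size(datname) DESC;

-- Current connections grouped by state and application
SELECT state, application_name, count(*) AS sessions
FROM pg_stat_activity
GROUP BY state, application_name
ORDER BY count(*) DESC;
```

Run these periodically (or compare against an old snapshot) to see how size and connection mix have shifted since initial setup.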
2. Verify backup and disaster recovery readiness
Check `backup-and-recovery-readiness` and `replication-health` before tuning anything else—if you can't recover from failure, performance tuning is meaningless. If you have replicas, verify `replication-lag` is consistently under 10 seconds; sustained lag over 60s means your replica can't keep up. Test your backup restoration process if you haven't in the last quarter; too many teams discover broken backups during an outage.
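A quick way to sanity-check replication lag directly, assuming PostgreSQL 10 or later (where the `*_lsn` names apply):

```sql
-- On the primary: per-replica state and replay lag in bytes
SELECT application_name, state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;

-- On a standby: approximate replay delay as an interval
SELECT now() - pg_last_xact_replay_timestamp() AS replication_delay;
```

Note the standby query overstates lag on an idle primary (no new transactions to replay), so interpret it alongside the byte-based figure.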
3. Assess critical reliability risks that cause outages
Check `reliability-risk-assessment` for silent killers like transaction ID wraparound. Look at `postgresql-before-xid-wraparound`—if this is below 100 million, you're at risk of forced shutdown. Check `long-running-transaction-oldest-timestamp`—transactions open for hours block autovacuum and can cascade into performance emergencies. These risks tend to grow slowly as workload evolves, then bite you suddenly.
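Both risks can be checked with two queries against the system catalogs; a minimal sketch:

```sql
-- Transaction ID age per database; the hard wraparound limit is ~2.1 billion
-- (2^31), and PostgreSQL warns and eventually refuses new transactions
-- well before reaching it
SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;

-- Oldest open transactions (long-runners block autovacuum cleanup)
SELECT pid, usename, state,
       now() - xact_start AS xact_duration,
       left(query, 60) AS query_snippet
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start
LIMIT 5;
```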
4. Identify resource bottlenecks and capacity headroom
Check `resource-bottleneck-identification` and `capacity-headroom-adequacy` to see if you're running out of headroom. Look at `postgresql-percent-usage-connections`—if you're consistently above 80%, you'll start seeing connection errors. High `postgresql-temp-bytes` means `work_mem` is too low for your current workload. Review `postgresql-locks` for contention patterns that didn't exist when the workload was smaller.
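These three signals map directly onto catalog queries; a sketch, assuming psql access:

```sql
-- Connection headroom as a percentage of max_connections
SELECT count(*) * 100.0
         / current_setting('max_connections')::int AS pct_connections_used
FROM pg_stat_activity;

-- Temp-file spill per database (high values suggest work_mem is too low)
SELECT datname, temp_files, pg_size_pretty(temp_bytes) AS temp_spilled
FROM pg_stat_database
ORDER BY temp_bytes DESC;

-- Sessions currently waiting on locks (a proxy for contention)
SELECT count(*) AS waiting_on_locks
FROM pg_locks
WHERE NOT granted;
```

Temp and lock counters are cumulative since the last stats reset, so compare deltas between snapshots rather than absolute values.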
5. Evaluate autovacuum effectiveness and table bloat
Check `autovacuum-effectiveness` because configs that worked at low volume often fall behind as workload grows. Look at `dead-tuple` counts and `postgresql-table-bloat`—bloat consistently above 20% means autovacuum can't keep up with your update/delete rate. This degrades performance slowly over time, so it's easy to miss until queries that used to take milliseconds now take seconds. Tuning autovacuum is often the highest-impact fix for evolved workloads.
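The dead-tuple ratio in `pg_stat_user_tables` is a rough but readily available bloat proxy; a sketch:

```sql
-- Tables with the most dead tuples, plus when autovacuum last ran on them
SELECT schemaname, relname,
       n_live_tup, n_dead_tup,
       round(100.0 * n_dead_tup
             / nullif(n_live_tup + n_dead_tup, 0), 1) AS dead_pct,
       last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
```

A high `dead_pct` combined with a stale `last_autovacuum` on a hot table is the classic signature of autovacuum falling behind an evolved update/delete rate.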
6. Review configuration tuning and performance optimization opportunities
Check `configuration-tuning-needed` and `performance-optimization-opportunities` to align settings with your current workload. Look at `postgresql-buffer-hit`—if this is below 99%, your `shared_buffers` may be undersized for your current working set. Review `checkpoint-timed` frequency—if checkpoints happen too often, you're creating unnecessary I/O; too rarely, and recovery time suffers. Initial configurations rarely scale optimally with workload evolution.
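Both signals are available from the statistics views; a sketch (the checkpoint view changed location in PostgreSQL 17):

```sql
-- Cache hit ratio per database; below ~99% suggests an undersized
-- shared_buffers for the current working set
SELECT datname,
       round(100.0 * blks_hit
             / nullif(blks_hit + blks_read, 0), 2) AS hit_pct
FROM pg_stat_database
WHERE blks_hit + blks_read > 0;

-- Timed vs. requested checkpoints (PostgreSQL 16 and earlier; on 17+
-- see pg_stat_checkpointer instead)
SELECT checkpoints_timed, checkpoints_req
FROM pg_stat_bgwriter;
```

A high share of requested (non-timed) checkpoints usually means `max_wal_size` is too small for the current write rate.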
7. Identify cost optimization opportunities
Finally, check `cost-optimization-opportunities` now that you understand reliability and performance needs. Look for over-provisioned resources—if `postgresql-percent-usage-connections` is consistently below 30%, you may have room to reduce connection pooling overhead or consolidate instances. Review if monitoring and alerting gaps from `monitoring-and-alerting-gaps` mean you're over-provisioned "just in case." Cost optimization comes last because you need reliability and performance baselines first.
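A one-row snapshot of provisioned limits versus observed usage is a useful starting point; sample it over time (e.g. daily at peak) before deciding to downsize or consolidate. A sketch:

```sql
-- Provisioned limits vs. usage right now; a single sample is not enough
-- to justify downsizing—collect this across peak periods first
SELECT current_setting('max_connections')::int             AS max_connections,
       (SELECT count(*) FROM pg_stat_activity)             AS connections_now,
       round(100.0 * (SELECT count(*) FROM pg_stat_activity)
             / current_setting('max_connections')::int, 1) AS pct_used,
       current_setting('shared_buffers')                   AS shared_buffers;
```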
Technologies
PostgreSQL
Related Insights
All of the following are stub insights created from scenario discovery hints; where a type was recorded, it appears in parentheses.
- overall instance health status (diagnostic_step)
- backup and recovery readiness
- replication health
- cost optimization opportunities (diagnostic_step)
- version upgrade considerations
- monitoring and alerting gaps
- operational best practice adherence
- workload characteristics and trends (diagnostic_step)
- resource bottleneck identification
- configuration tuning needed (diagnostic_step)
- autovacuum effectiveness (diagnostic_step)
- capacity headroom adequacy (diagnostic_step)
- reliability risk assessment (diagnostic_step)
- performance optimization opportunities (supporting_evidence)
Relevant Metrics
- hits
- postgresql.backends (primary_diagnostic)
- percent_usage_connections (threshold_indicator)
- n_dead_tup (primary_diagnostic)
- buffer_hit (primary_diagnostic)
- checkpoint_timed (primary_diagnostic)
- postgresql.database.size (primary_diagnostic)
- before_xid_wraparound (threshold_indicator)
- replication_lag_seconds (primary_diagnostic)
- table_bloat (secondary_diagnostic)
- postgresql.locks (secondary_diagnostic)
- postgresql.temp_bytes (secondary_diagnostic)
- long_running_transaction_oldest_timestamp (secondary_diagnostic)
- uptime (monitor)
- connections_by_process (primary_diagnostic)
- transactions_open (monitor)
Monitoring Interfaces
PostgreSQL Native