
Disaster Recovery
This guide covers failure scenarios and recovery procedures for the S3 Orchestrator.
Architecture Context
The orchestrator has two required stateful components and one optional:
- PostgreSQL stores object locations, quota counters, usage stats, multipart state, and the cleanup queue. This is the source of truth for “which object lives on which backend.”
- Storage backends (OCI, R2, S3, MinIO, etc.) hold the actual object data. These are independent and unaware of each other.
- Redis (optional) provides shared usage counters across instances. Not a data dependency — all authoritative data lives in PostgreSQL. See Redis Failure below.
The orchestrator binary itself is stateless. Any instance with access to the database and backends can serve requests.
PostgreSQL Failure
What happens
When PostgreSQL becomes unreachable, the circuit breaker activates automatically:
- Reads enter degraded mode: the orchestrator broadcasts GET requests to all backends (or uses a short-lived cache of recent key-to-backend mappings if circuit_breaker.cache_ttl is configured). Unencrypted objects found on any backend are returned normally. Encrypted objects return 503 Service Unavailable because decryption requires encryption metadata (wrapped DEK, key ID) stored in the database.
- Writes return 503 Service Unavailable because the orchestrator cannot record the object’s location without the database.
- Background workers (rebalancer, replicator, usage flush, cleanup queue) pause until the database recovers.
The circuit breaker transitions back to healthy after successful database probes. No manual intervention is needed for recovery.
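The degraded read path can be sketched roughly as follows. This is an illustrative model, not the orchestrator's code: `StoredObject`, its `encrypted` flag, the `Backends` mapping, and the 404 for a key found nowhere are all assumptions, backends are queried one after another rather than broadcast in parallel, and cache expiry (circuit_breaker.cache_ttl) is not modeled.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class StoredObject:
    body: bytes
    encrypted: bool  # hypothetical flag; how encryption is detected is not specified


# Hypothetical backend interface: name -> fetch(key) returning an object or None.
Backends = dict[str, Callable[[str], Optional[StoredObject]]]


def degraded_get(key: str, backends: Backends,
                 cache: dict[str, str]) -> tuple[int, Optional[bytes]]:
    """Serve a GET without the database. Returns (status_code, body)."""
    # Ask the cached backend first (if any), then every remaining backend.
    order = [cache[key]] if cache.get(key) in backends else []
    order += [name for name in backends if name not in order]
    for name in order:
        obj = backends[name](key)
        if obj is None:
            continue
        if obj.encrypted:
            # The wrapped DEK and key ID live in the database, so decryption is impossible.
            return 503, None
        cache[key] = name  # remember where the key was found
        return 200, obj.body
    return 404, None  # assumption: a key found on no backend maps to 404
```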
Monitoring
Watch for these log messages:
The /health endpoint continues to return 200 OK with "status":"degraded" during database outages, so the service stays in load balancer rotation for reads.
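Because the HTTP status stays 200, an external monitor has to inspect the body to notice a database outage. A minimal sketch, assuming the body is a JSON object with a "status" field; only the value "degraded" is taken from this guide, and treating every other value as healthy is an assumption. The function name `classify_health` is hypothetical.

```python
import json


def classify_health(status_code: int, body: str) -> str:
    """Return 'healthy', 'degraded', or 'down' for a /health response."""
    if status_code != 200:
        return "down"
    try:
        status = json.loads(body).get("status")
    except (json.JSONDecodeError, AttributeError):
        # Body is not JSON, or is JSON but not an object.
        return "down"
    # Assumption: any status other than "degraded" counts as healthy.
    return "degraded" if status == "degraded" else "healthy"
```

A monitor built this way can page on "degraded" while the load balancer keeps routing reads to the instance.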
PostgreSQL High Availability
The orchestrator connects to PostgreSQL via a standard connection string and has no opinion about how the database is made highly available. Any PostgreSQL-compatible endpoint works, including:
- Patroni — open-source HA with automatic failover via etcd/ZooKeeper consensus
- Amazon RDS Multi-AZ / Aurora — managed failover with DNS endpoint
- Google Cloud SQL HA — regional instances with automatic failover
- Neon / Supabase — serverless PostgreSQL with built-in redundancy
When the database fails over to a replica, the circuit breaker briefly opens (writes return 503), then the probe detects the new primary and recovers automatically. The failover window is typically under 30 seconds with Patroni or managed services.
Writes are intentionally rejected during database outages rather than queued locally. This is a deliberate design choice: the database is the source of truth for object locations, and accepting writes without recording metadata would create orphaned objects that diverge from the metadata store. Use HA PostgreSQL to minimize the rejection window.
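Since writes are rejected rather than queued, clients need their own retry policy to ride out a failover. The sketch below is one possible client-side approach, not part of the orchestrator: `send` is a hypothetical callable that performs the upload and returns the HTTP status code, and the delay values are illustrative.

```python
import time
from typing import Callable


def put_with_retry(send: Callable[[], int], max_wait_s: float = 60.0,
                   first_delay_s: float = 1.0,
                   sleep: Callable[[float], None] = time.sleep) -> int:
    """Call send() until it returns something other than 503 or the wait budget is spent."""
    delay, waited = first_delay_s, 0.0
    while True:
        status = send()
        # Give up if the next sleep would exceed the total wait budget.
        if status != 503 or waited + delay > max_wait_s:
            return status
        sleep(delay)
        waited += delay
        delay = min(delay * 2, 10.0)  # double the delay, capped at 10 seconds
```

A 60-second budget comfortably covers the failover window of under 30 seconds mentioned above.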
Replication and the data loss window
Object replication is asynchronous. When a client writes an object, it is stored on a single backend and a 200 response is returned immediately. The background replication worker creates additional copies at the configured replication.worker_interval (default: 5 minutes).
If the primary backend fails before replication completes, unreplicated objects written in the last worker interval are at risk. To minimize this window:
- Set replication.worker_interval to a lower value (e.g., 30s) at the cost of more backend API calls
- Use backends with built-in durability guarantees (e.g., S3 Standard stores objects across 3+ availability zones)
- Monitor s3o_replication_pending: a sustained non-zero value indicates the replicator cannot keep up
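To get a feel for the size of the window, a back-of-the-envelope estimate helps. The function below is a rough upper bound under two assumptions: a steady write rate, and a replicator that clears its entire backlog on every pass. The function name and the example workload are made up for illustration.

```python
def at_risk_estimate(writes_per_second: float, avg_object_bytes: float,
                     worker_interval_s: float) -> tuple[float, float]:
    """Upper bound on (objects, bytes) held as a single copy just before a replication pass."""
    objects = writes_per_second * worker_interval_s
    return objects, objects * avg_object_bytes


# Example workload: 20 writes/s of 1 MiB objects.
default_window = at_risk_estimate(20, 1024 * 1024, 300)  # 5-minute default interval
short_window = at_risk_estimate(20, 1024 * 1024, 30)     # 30s interval
```

For this workload the default interval leaves up to 6,000 objects (about 6.3 GB) on a single backend, and a 30s interval cuts that to 600 objects.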
Synchronous replication (write to N backends before returning 200) is not currently supported.
Restoring PostgreSQL from Backup
Restore the database using your standard PostgreSQL backup procedure (pg_dump/pg_restore, WAL replay, etc.).
Start the orchestrator (or let it reconnect). Database migrations are embedded in the binary and applied automatically on startup via goose. Any missing migrations are applied in order.
Reconcile metadata if objects were written to backends while the database was down (e.g., via direct S3 API access). For each backend, run:
The sync command scans the backend’s S3 bucket and imports objects not already tracked in the database. Existing records are skipped.
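Conceptually, the reconciliation is a set difference between the bucket listing and the tracked records. The following sketch illustrates that idea only; it is not the sync command's implementation, and `plan_import`, the listing shape (key to size in bytes), and the record tuple are hypothetical.

```python
def plan_import(bucket_listing: dict[str, int], tracked_keys: set[str],
                backend: str) -> list[tuple[str, str, int]]:
    """Return (key, backend, size_bytes) records for objects the database does not track."""
    return [
        (key, backend, size)
        for key, size in sorted(bucket_listing.items())
        if key not in tracked_keys  # existing records are skipped
    ]
```

Running such a pass twice yields nothing new the second time, which matches the skip-existing behaviour described above.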
Verify by checking the admin CLI:
Backend Permanently Unavailable
If a storage backend is permanently lost (provider shutdown, account deleted, etc.):
With replication (factor >= 2)
Objects that have replicas on other backends continue to be served transparently. The orchestrator tries the failed backend, gets an error, and falls through to the next copy.
If backend circuit breakers are enabled and the outage lasts longer than replication.unhealthy_threshold (default: 10 minutes), the replication worker automatically creates replacement copies on healthy backends to maintain the configured replication factor. This restores full redundancy without manual intervention while the failed backend remains down.
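The replacement-copy decision can be sketched as a small planning function. This is an illustrative model under stated assumptions, not the replication worker's actual logic: `locations` (object key to the set of backends holding a copy) stands in for the database records, target backends are picked in alphabetical order, and capacity or quota checks are ignored.

```python
def plan_replacements(locations: dict[str, set[str]], healthy: set[str],
                      factor: int) -> dict[str, list[str]]:
    """Map each under-replicated key to the healthy backends that should receive a new copy."""
    plan: dict[str, list[str]] = {}
    for key, backends in locations.items():
        live = backends & healthy
        missing = factor - len(live)
        if missing <= 0 or not live:
            # Fully replicated, or no healthy copy left to read from.
            continue
        candidates = sorted(healthy - backends)
        if candidates:
            plan[key] = candidates[:missing]
    return plan
```

Keys whose only copy sat on the failed backend are skipped, which corresponds to the unrecoverable case described under "Without replication".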
To clean up, remove the backend’s database records:
Then remove the backend from the config file and restart.
Without replication (factor = 1)
Objects stored exclusively on the lost backend are unrecoverable from the orchestrator’s perspective. The database records remain but point to a backend that no longer exists.
To clean up, remove the orphaned database records:
Then remove the backend from the config file and restart.
Planned decommission
If a backend is still reachable but you want to decommission it, use the drain operation to migrate all objects to other backends first (no data loss):
See the Admin Guide for the full drain workflow.
Preventing data loss
Enable replication with replication.factor: 2 or higher so every object exists on at least two backends. This is the primary defense against backend loss.
Cleanup Queue Recovery
The cleanup queue tracks failed delete operations with exponential backoff (1 minute to 24 hours, max 10 attempts). Each item tracks size_bytes, and the corresponding backend’s orphan_bytes counter is incremented on enqueue and decremented on successful cleanup. This ensures the write path never overcommits storage on a backend with pending cleanups. If items get stuck:
Items that exhaust all 10 retry attempts are graduated to the cleanup_dlq table by the worker (single transaction: read the queue row, insert into cleanup_dlq, delete the queue row). orphan_bytes is intentionally NOT decremented during the move because the backend object is still on disk. The write path continues to account for the unreleased space until an operator confirms the object is gone and writes off the row.
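The orphan_bytes accounting across the queue and the DLQ can be modeled in a few lines. This is an in-memory illustration of the rules described above, not the worker's code; the class and method names are hypothetical, and backoff timing and database transactions are left out.

```python
from dataclasses import dataclass

MAX_ATTEMPTS = 10  # retry limit stated above


@dataclass
class Item:
    backend: str
    size_bytes: int
    attempts: int = 0


class CleanupModel:
    def __init__(self) -> None:
        self.orphan_bytes: dict[str, int] = {}
        self.queue: dict[str, Item] = {}
        self.dlq: dict[str, Item] = {}

    def enqueue(self, key: str, backend: str, size_bytes: int) -> None:
        self.queue[key] = Item(backend, size_bytes)
        self.orphan_bytes[backend] = self.orphan_bytes.get(backend, 0) + size_bytes

    def record_attempt(self, key: str, deleted: bool) -> None:
        item = self.queue[key]
        if deleted:
            # Successful cleanup releases the reserved space.
            del self.queue[key]
            self.orphan_bytes[item.backend] -= item.size_bytes
            return
        item.attempts += 1
        if item.attempts >= MAX_ATTEMPTS:
            # Move to the DLQ; orphan_bytes stays put because the object still exists.
            self.dlq[key] = self.queue.pop(key)

    def write_off(self, key: str) -> None:
        # Operator has confirmed the object is gone.
        item = self.dlq.pop(key)
        self.orphan_bytes[item.backend] -= item.size_bytes
```

The invariant is that orphan_bytes for a backend always equals the total size of its entries in the queue plus the DLQ.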
Monitor s3o_cleanup_dlq_depth (gauge) and s3o_cleanup_dlq_enqueued_total{backend} (counter): a non-zero depth means at least one unrecoverable orphan needs operator action; a single backend dominating the counter rate means that backend’s delete path is broken and should be investigated.
After manually resolving a DLQ entry (e.g. confirming via the reconciler that the object is gone, or deleting it out-of-band), decrement orphan_bytes and remove the row:
Multi-Instance Recovery
When running multiple orchestrator instances:
- Advisory locks in PostgreSQL ensure only one instance runs each background worker (rebalancer, replicator, cleanup queue, lifecycle). If an instance crashes, the lock is released and another instance picks up the work.
- No split-brain risk for writes because PostgreSQL transactions serialize object location records. Two instances writing the same key will both succeed, but only one location record wins (the database is the arbiter).
- Startup catch-up: the replication worker runs an immediate reconciliation pass on startup before entering its periodic loop. This handles any objects that fell behind while instances were down.
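The single-worker guarantee can be illustrated with PostgreSQL's `pg_try_advisory_lock`, a session-level lock that the server releases when the holding connection ends. Whether the orchestrator uses this exact function is an assumption. The sketch takes any DB-API cursor that accepts `%s` placeholders (as psycopg does), and the lock ID is a hypothetical constant.

```python
REPLICATOR_LOCK_ID = 1001  # hypothetical; one ID per background worker


def try_become_worker(cursor, lock_id: int) -> bool:
    """Try to take the session-level advisory lock without blocking.

    Returns True if this connection now holds the lock. PostgreSQL releases
    session-level advisory locks when the connection closes, so a crashed
    instance frees the lock for another instance's next attempt.
    """
    cursor.execute("SELECT pg_try_advisory_lock(%s)", (lock_id,))
    return bool(cursor.fetchone()[0])
```

An instance that gets False simply skips the work and tries again on its next tick.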
Redis Failure
When Redis is configured for shared usage counters:
What happens
The circuit breaker opens after consecutive failures (default: 3). Each instance falls back to local in-memory counters — identical behavior to running without Redis. Usage enforcement continues but with the per-instance blind spot restored. The s3o_redis_fallback_active gauge transitions to 1.
A background health probe PINGs Redis every 5 seconds while the circuit is open. This requires no manual intervention.
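A consecutive-failure breaker of this kind can be sketched as below. It models only the open/close behaviour described above; the class and method names are hypothetical, and the recovery work performed before the circuit closes is not shown.

```python
class ConsecutiveFailureBreaker:
    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.failures = 0
        self.is_open = False  # open means "use local in-memory counters"

    def record(self, ok: bool) -> None:
        """Record the outcome of one Redis operation."""
        if ok:
            self.failures = 0  # any success resets the streak
            return
        self.failures += 1
        if self.failures >= self.threshold:
            self.is_open = True

    def probe(self, ping_ok: bool) -> None:
        """Called by the periodic health probe while the circuit is open."""
        if self.is_open and ping_ok:
            self.is_open = False
            self.failures = 0
```

Resetting the streak on success means isolated errors never trip the breaker; only an unbroken run of failures does.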
Recovery sequence
When the health probe detects Redis is reachable again:
- Stale Redis keys for the current period are deleted (PG already absorbed those values via flushes during the outage)
- Each instance INCRBYs its unflushed local deltas to Redis (additive — safe even if instances recover at different times)
- Local counters are zeroed
- Circuit breaker closes, shared operation resumes
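The sequence above, seen from a single instance, can be sketched as follows. This is a simplified model: `FakeRedis` is an in-memory stand-in exposing `delete` and `incrby` with the same call shape as the redis-py client, the key names are hypothetical, and coordination between instances over who deletes the stale keys is not modeled.

```python
class FakeRedis:
    """In-memory stand-in for the two Redis commands the sketch needs."""

    def __init__(self) -> None:
        self.data: dict[str, int] = {}

    def delete(self, *keys: str) -> None:
        for key in keys:
            self.data.pop(key, None)

    def incrby(self, key: str, amount: int) -> int:
        self.data[key] = self.data.get(key, 0) + amount
        return self.data[key]


def recover(redis, stale_keys: list[str], local_deltas: dict[str, int]) -> str:
    """Run the recovery steps and return the new circuit state."""
    if stale_keys:
        redis.delete(*stale_keys)      # PG already absorbed these values
    for key, delta in local_deltas.items():
        if delta:
            redis.incrby(key, delta)   # additive push of unflushed deltas
    for key in local_deltas:
        local_deltas[key] = 0          # zero the local counters
    return "closed"
```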
Monitoring
- s3o_redis_fallback_active: 1 when using local counters, 0 when Redis is healthy
- s3o_redis_operations_total{operation,status}: track Redis operation success/error rates
- s3o_circuit_breaker_state{name="redis"}: circuit breaker state (closed/open)
Impact
Redis is a performance optimization, not a data dependency. All authoritative usage data lives in PostgreSQL. A Redis outage causes:
- Temporary accuracy reduction — same as running without Redis (per-instance counters with flush-gap)
- No data loss — PG flush continues via local counters, one instance per tick
- Automatic recovery — no operator action required
Recovery Checklist
- Identify the failure (database, backend, or orchestrator instance)
- Restore the failed component
- Check logs for circuit breaker state transitions
- Run s3-orchestrator admin status to verify health
- If the database was restored from backup, run sync for each backend to reconcile
- Monitor the cleanup queue for any stuck items