Disaster Recovery

This guide covers failure scenarios and recovery procedures for the S3 Orchestrator.

Architecture Context

The orchestrator has two required stateful components and one optional one:

PostgreSQL stores object locations, quota counters, usage stats, multipart state, and the cleanup queue. This is the source of truth for “which object lives on which backend.”
Storage backends (OCI, R2, S3, MinIO, etc.) hold the actual object data. These are independent and unaware of each other.
Redis (optional) provides shared usage counters across instances. It is not a data dependency: all authoritative data lives in PostgreSQL. See Redis Failure below.

The orchestrator binary itself is stateless. Any instance with access to the database and backends can serve requests.

PostgreSQL Failure

What happens

When PostgreSQL becomes unreachable, the circuit breaker activates automatically:

Reads enter degraded mode: the orchestrator broadcasts GET requests to all backends (or uses a short-lived cache of recent key-to-backend mappings if circuit_breaker.cache_ttl is configured). Unencrypted objects found on any backend are returned normally. Encrypted objects return 503 Service Unavailable because decryption requires encryption metadata (wrapped DEK, key ID) stored in the database.
Writes return 503 Service Unavailable because the orchestrator cannot record the object’s location without the database.
Background workers (rebalancer, replicator, usage flush, cleanup queue) pause until the database recovers.

The circuit breaker transitions back to healthy after successful database probes. No manual intervention is needed for recovery.
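
Because writes fail fast with 503 and the breaker recovers on its own, a client usually only needs to retry with backoff. The following is a minimal client-side sketch, not part of the orchestrator. `send` is a hypothetical callable that performs one write and returns the HTTP status code, and the attempt count and delays are illustrative values rather than recommendations from this guide.

```python
import time
from typing import Callable


def put_with_retry(
    send: Callable[[], int],
    max_attempts: int = 6,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send() until it returns something other than 503.

    send is a hypothetical callable that performs one write and returns
    the HTTP status code. The last status seen is returned to the caller.
    """
    status = 503
    for attempt in range(max_attempts):
        status = send()
        if status != 503:
            return status
        if attempt < max_attempts - 1:
            # Exponential backoff: 1s, 2s, 4s, ... capped at max_delay.
            sleep(min(max_delay, base_delay * 2 ** attempt))
    return status
```

With the defaults above, the client waits a total of 31 seconds across its retries (1 + 2 + 4 + 8 + 16) before giving up.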

Monitoring

Watch for these log messages:

{"level":"WARN","msg":"Circuit breaker opened","consecutive_failures":3}
{"level":"INFO","msg":"Circuit breaker closed","state":"healthy"}
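
If logs are collected as JSON lines, the two transitions can be picked out with a small filter. This is a sketch that relies only on the `msg` values shown above; the function name is made up for illustration, and any other fields in a record are passed through untouched.

```python
import json
from typing import Iterable, Iterator, Tuple

# Message strings taken from the log examples in this guide.
TRANSITIONS = {
    "Circuit breaker opened": "opened",
    "Circuit breaker closed": "closed",
}


def breaker_events(lines: Iterable[str]) -> Iterator[Tuple[str, dict]]:
    """Yield ("opened" | "closed", record) for circuit breaker log lines."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if not isinstance(record, dict):
            continue
        msg = record.get("msg")
        if isinstance(msg, str) and msg in TRANSITIONS:
            yield TRANSITIONS[msg], record
```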

The /health endpoint continues to return 200 OK with "status":"degraded" during database outages, so the service stays in load balancer rotation for reads.
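
Since the status code stays 200 during a database outage, an alerting probe has to inspect the response body. A minimal sketch of that check follows. It assumes the body is a JSON object with a `status` field, as in the example above; the value reported when healthy is not stated in this guide, so anything other than "degraded" on a 200 response is treated as healthy here.

```python
import json


def health_state(status_code: int, body: str) -> str:
    """Classify a /health response as "down", "degraded", or "ok"."""
    if status_code != 200:
        return "down"
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return "down"  # a 200 with an unreadable body is not trustworthy
    if isinstance(payload, dict) and payload.get("status") == "degraded":
        return "degraded"
    # Assumption: any other 200 response means the service is healthy.
    return "ok"
```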

PostgreSQL High Availability

The orchestrator connects to PostgreSQL via a standard connection string and has no opinion about how the database is made highly available. Any PostgreSQL-compatible endpoint works, including:

Patroni — open-source HA with automatic failover via etcd/ZooKeeper consensus
Amazon RDS Multi-AZ / Aurora — managed failover with DNS endpoint
Google Cloud SQL HA — regional instances with automatic failover
Neon / Supabase — serverless PostgreSQL with built-in redundancy

When the database fails over to a replica, the circuit breaker briefly opens (writes return 503), then the probe detects the new primary and recovers automatically. The failover window is typically under 30 seconds with Patroni or managed services.

Writes are intentionally rejected during database outages rather than queued locally. This is a deliberate design choice: the database is the source of truth for object locations, and accepting writes without recording metadata would create orphaned objects that diverge from the metadata store. Use HA PostgreSQL to minimize the rejection window.

Replication and the data loss window

Object replication is asynchronous. When a client writes an object, it is stored on a single backend and a 200 response is returned immediately. The background replication worker creates additional copies at the configured replication.worker_interval (default: 5 minutes).

If the backend that received a write fails before the next replication pass, the objects written to it since the last pass have no other copy and are at risk. To minimize this window:

Set replication.worker_interval to a lower value (e.g., 30s) at the cost of more backend API calls
Use backends with built-in durability guarantees (e.g., S3 Standard stores objects across 3+ availability zones)
Monitor s3o_replication_pending — a sustained non-zero value indicates the replicator cannot keep up

Synchronous replication (write to N backends before returning 200) is not currently supported.
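
To size the exposure, multiply the sustained write rate by the worker interval. The helper below is a back-of-the-envelope sketch with a made-up name; it assumes the replicator keeps up, so each pass copies everything written since the previous one.

```python
def at_risk_bytes(write_rate_bytes_per_s: float, worker_interval_s: float) -> float:
    """Upper bound on single-copy data if a backend is lost just before a replication pass."""
    return write_rate_bytes_per_s * worker_interval_s


MIB = 1024 ** 2
# At a sustained 2 MiB/s, the default 5-minute interval exposes up to 600 MiB,
# while a 30-second interval exposes up to 60 MiB.
default_window = at_risk_bytes(2 * MIB, 300) / MIB
short_window = at_risk_bytes(2 * MIB, 30) / MIB
```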

Restoring PostgreSQL from Backup

Restore the database using your standard PostgreSQL backup procedure (pg_dump/pg_restore, WAL replay, etc.).
Start the orchestrator (or let it reconnect). Database migrations are embedded in the binary and applied automatically on startup via goose. Any missing migrations are applied in order.
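Reconcile metadata if the backends hold objects that the restored database does not track, for example objects written after the backup was taken or written directly through a backend’s S3 API while the database was down. For each backend, run: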
```
s3-orchestrator sync \
  --config config.yaml \
  --backend <backend-name> \
  --bucket <virtual-bucket>
```
The sync command scans the backend’s S3 bucket and imports objects not already tracked in the database. Existing records are skipped.
Verify by checking the admin CLI:
```
s3-orchestrator admin status
```
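
When there are several backends, the sync step can be wrapped in a short script. The sketch below only assembles the `sync` invocation shown above and runs it once per backend. The backend names and bucket in the usage note are placeholders, and the script assumes `s3-orchestrator` is on the PATH.

```python
import subprocess
from typing import Callable, List


def sync_commands(config: str, bucket: str, backends: List[str]) -> List[List[str]]:
    """Build one `s3-orchestrator sync` argv per backend."""
    return [
        ["s3-orchestrator", "sync",
         "--config", config,
         "--backend", name,
         "--bucket", bucket]
        for name in backends
    ]


def run_sync(config: str, bucket: str, backends: List[str],
             run: Callable = subprocess.run) -> None:
    """Run sync for each backend in order, stopping at the first failure."""
    for argv in sync_commands(config, bucket, backends):
        # check=True raises CalledProcessError on a non-zero exit status.
        run(argv, check=True)
```

For example, `run_sync("config.yaml", "my-bucket", ["backend-a", "backend-b"])` would run the command twice, once for each of the placeholder backend names.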

Backend Permanently Unavailable

If a storage backend is permanently lost (provider shutdown, account deleted, etc.):

With replication (factor >= 2)

Objects that have replicas on other backends continue to be served transparently. The orchestrator tries the failed backend, gets an error, and falls through to the next copy.
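
The fall-through behaviour can be pictured with a small model. This is a conceptual sketch, not the orchestrator's code: `fetch` is a hypothetical callable that reads a key from a named backend and raises `OSError` (for example `ConnectionError`) when the backend is unreachable.

```python
from typing import Callable, Iterable, Optional


def read_with_fallthrough(
    key: str,
    locations: Iterable[str],
    fetch: Callable[[str, str], bytes],
) -> Optional[bytes]:
    """Try each backend that holds a copy of key, in order; return the first success."""
    for backend in locations:
        try:
            return fetch(backend, key)
        except OSError:
            continue  # this copy is unreachable; fall through to the next one
    return None  # every known copy was unreachable
```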

If backend circuit breakers are enabled and the outage lasts longer than replication.unhealthy_threshold (default: 10 minutes), the replication worker automatically creates replacement copies on healthy backends to maintain the configured replication factor. This restores full redundancy without manual intervention while the failed backend remains down.

To clean up, remove the backend’s database records:

s3-orchestrator admin remove-backend <backend-name>

Then remove the backend from the config file and restart.

Without replication (factor = 1)

Objects stored exclusively on the lost backend are unrecoverable from the orchestrator’s perspective. The database records remain but point to a backend that no longer exists.

To clean up, remove the orphaned database records:

s3-orchestrator admin remove-backend <backend-name>

Then remove the backend from the config file and restart.

Planned decommission

If a backend is still reachable but you want to decommission it, use the drain operation to migrate all objects to other backends first (no data loss):

# Start the drain — objects are migrated in the background
s3-orchestrator admin drain <backend-name>

# Monitor progress
s3-orchestrator admin drain-status <backend-name>

# Once complete, remove from config and restart

See the Admin Guide for the full drain workflow.

Preventing data loss

Enable replication with replication.factor: 2 or higher so every object exists on at least two backends. This is the primary defense against backend loss.

Cleanup Queue Recovery

The cleanup queue tracks failed delete operations with exponential backoff (1 minute to 24 hours, max 10 attempts). Each item tracks size_bytes, and the corresponding backend’s orphan_bytes counter is incremented on enqueue and decremented on successful cleanup. This ensures the write path never overcommits storage on a backend with pending cleanups. If items get stuck:

# Check the queue
s3-orchestrator admin cleanup-queue

# The background worker processes items automatically every minute.
# If the queue is backed up, check the logs for persistent errors.

Items that exhaust all 10 retry attempts are moved to the cleanup_dlq dead-letter table by the worker in a single transaction (read the queue row, insert into cleanup_dlq, delete the queue row). orphan_bytes is intentionally not decremented during the move because the backend object is still on disk. The write path continues to account for the unreleased space until an operator confirms the object is gone and writes off the row.
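
The orphan_bytes bookkeeping can be summarized as a toy model. The class and method names are invented for illustration, and the free-space formula is an assumption about how the write path uses the counter; only the increment and decrement rules come from this guide.

```python
from dataclasses import dataclass


@dataclass
class BackendQuota:
    limit_bytes: int
    used_bytes: int = 0
    orphan_bytes: int = 0

    def free_for_writes(self) -> int:
        # Assumption: space held by orphans is treated as unavailable.
        return self.limit_bytes - self.used_bytes - self.orphan_bytes

    def on_enqueue(self, size_bytes: int) -> None:
        # A delete failed: the bytes are still on the backend.
        self.orphan_bytes += size_bytes

    def on_cleanup_success(self, size_bytes: int) -> None:
        # A retry finally deleted the object, so the space is released.
        self.orphan_bytes = max(0, self.orphan_bytes - size_bytes)

    def on_move_to_dlq(self, size_bytes: int) -> None:
        # Intentionally a no-op: the object is still on disk.
        pass

    def on_dlq_write_off(self, size_bytes: int) -> None:
        # An operator confirmed the object is gone.
        self.orphan_bytes = max(0, self.orphan_bytes - size_bytes)
```

In this model a 100-byte orphan keeps reducing `free_for_writes()` through every retry and through the move to the DLQ, and is released only by a successful cleanup or an operator write-off.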

Monitor s3o_cleanup_dlq_depth (gauge) and s3o_cleanup_dlq_enqueued_total{backend} (counter): a non-zero depth means at least one unrecoverable orphan needs operator action; a single backend dominating the counter rate means that backend’s delete path is broken and should be investigated.

After manually resolving a DLQ entry (e.g. confirming via the reconciler that the object is gone, or deleting it out-of-band), decrement orphan_bytes and remove the row:

-- View pending DLQ entries
SELECT id, original_id, backend_name, object_key, reason, attempts,
       size_bytes, first_enqueued_at, moved_at, last_error
FROM cleanup_dlq
ORDER BY moved_at;

-- After confirming the object is gone (replace 42 with the entry's id):
BEGIN;
UPDATE backend_quotas
   SET orphan_bytes = GREATEST(0, orphan_bytes - (SELECT size_bytes FROM cleanup_dlq WHERE id = 42))
 WHERE backend_name = (SELECT backend_name FROM cleanup_dlq WHERE id = 42);
DELETE FROM cleanup_dlq WHERE id = 42;
COMMIT;

-- To push a DLQ entry back through automatic retry (e.g. after fixing the backend),
-- move the row in one transaction so it cannot be lost or duplicated:
BEGIN;
INSERT INTO cleanup_queue (backend_name, object_key, reason, size_bytes, next_retry, attempts, last_error)
SELECT backend_name, object_key, reason, size_bytes, NOW(), 0, last_error
  FROM cleanup_dlq WHERE id = 42;
DELETE FROM cleanup_dlq WHERE id = 42;
COMMIT;

Multi-Instance Recovery

When running multiple orchestrator instances:

Advisory locks in PostgreSQL ensure only one instance runs each background worker (rebalancer, replicator, cleanup queue, lifecycle). If an instance crashes, the lock is released and another instance picks up the work.
No split-brain risk for writes because PostgreSQL transactions serialize object location records. Two instances writing the same key will both succeed, but only one location record wins (the database is the arbiter).
Startup catch-up: the replication worker runs an immediate reconciliation pass on startup before entering its periodic loop. This handles any objects that fell behind while instances were down.

Redis Failure

When Redis is configured for shared usage counters:

What happens

The circuit breaker opens after consecutive failures (default: 3). Each instance falls back to local in-memory counters, which is identical to running without Redis. Usage enforcement continues, but each instance again sees only its own usage between flushes to PostgreSQL (the per-instance blind spot). The s3o_redis_fallback_active gauge transitions to 1.

A background health probe PINGs Redis every 5 seconds while the circuit is open. Recovery requires no manual intervention.
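
The open-and-probe cycle can be sketched as a small state machine. This is a simplified illustration, not the orchestrator's implementation: the class is hypothetical, `ping` stands in for the Redis PING, and closing after a single successful probe is an assumption.

```python
from typing import Callable


class CircuitBreaker:
    """Opens after N consecutive failures; a successful probe closes it."""

    def __init__(self, failure_threshold: int = 3) -> None:
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0
        self.open = False

    def record_success(self) -> None:
        # Any success resets the streak, so only consecutive failures count.
        self.consecutive_failures = 0

    def record_failure(self) -> None:
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.open = True  # callers fall back to local counters

    def probe(self, ping: Callable[[], bool]) -> None:
        """Run once per probe interval; has no effect while the circuit is closed."""
        if self.open and ping():
            self.open = False
            self.consecutive_failures = 0
```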

Recovery sequence

When the health probe detects Redis is reachable again:

Stale Redis keys for the current period are deleted (PostgreSQL already absorbed those values via flushes during the outage)
Each instance INCRBYs its unflushed local deltas to Redis (additive — safe even if instances recover at different times)
Local counters are zeroed
Circuit breaker closes, shared operation resumes
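
The sequence above can be modelled with plain dictionaries to show why replaying deltas is order-independent. This is a sketch only: a dict stands in for Redis, the key name in the usage note is a placeholder, and the model deletes stale keys exactly once up front. How the orchestrator coordinates that deletion across instances is not described here.

```python
from typing import Dict, Iterable


def delete_stale_keys(redis: Dict[str, int], period_keys: Iterable[str]) -> None:
    """Step 1: drop the pre-outage values for the current period."""
    for key in period_keys:
        redis.pop(key, None)


def replay_local_deltas(redis: Dict[str, int], local: Dict[str, int]) -> None:
    """Steps 2 and 3: add one instance's unflushed deltas, then zero them."""
    for key, delta in list(local.items()):
        redis[key] = redis.get(key, 0) + delta  # stands in for INCRBY
        local[key] = 0
```

Because each replay only adds to the shared value, two instances holding deltas of 5 and 7 for the same key end at 12 regardless of which one recovers first.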

Monitoring

s3o_redis_fallback_active — 1 when using local counters, 0 when Redis is healthy
s3o_redis_operations_total{operation,status} — track Redis operation success/error rates
s3o_circuit_breaker_state{name="redis"} — circuit breaker state (closed/open)

Impact

Redis is a performance optimization, not a data dependency. All authoritative usage data lives in PostgreSQL. A Redis outage causes:

Temporary accuracy reduction — same as running without Redis (per-instance counters with flush-gap)
No data loss: local counters are still flushed to PostgreSQL, one instance per tick
Automatic recovery — no operator action required

Recovery Checklist

Identify the failure (database, backend, or orchestrator instance)
Restore the failed component
Check logs for circuit breaker state transitions
Run s3-orchestrator admin status to verify health
If the database was restored from backup, run sync for each backend to reconcile metadata
Monitor the cleanup queue for any stuck items