s3-orchestrator

Architecture

              S3 clients (aws cli, rclone, etc.)
                          |
                          v
                    +-----------+
                    | S3 Orch.  |  <-- SigV4 auth, rate limiting, quota routing
                    +-----------+
                     |         |
            +--------+         +------------------+------------------+
            v                  v                  v                  v
       PostgreSQL        OCI Object         Backblaze B2          AWS S3
       (metadata)       Storage (20 GB)       (10 GB)             (5 GB)
                              \                  |                  /
                               '------------ 35 GB total ---------'

Metadata layer

PostgreSQL (or embedded SQLite) stores:

  • object_locations — every object’s exact backend placement + content hash + per-copy size
  • backend_quotas — per-backend quota counters + orphan-bytes tracking
  • multipart_uploads / multipart_parts — multipart upload state
  • cleanup_queue — durable retry queue for failed deletions (DLQ-capable)
  • pending_intents — PUT-before-COMMIT crash-recovery rows

Schema is applied automatically on startup via goose versioned migrations embedded in the binary. All queries are generated by sqlc from annotated .sql files and executed via pgx/v5 connection pools.

See docs/database.md for engine choice, migration mechanics, and full schema.

Storage layer

The storage layer is split into three Go packages so that the same orchestration code drives both database engines:

  • internal/store/core — engine-agnostic types, role interfaces, and orchestration helpers (the multi-step transactional operations like RecordObject, PromotePending, MoveObjectLocation)
  • internal/store/postgres — Postgres adapter
  • internal/store/sqlite — SQLite adapter

Both adapters implement core.TxAdapter. SQLite is the default for single-instance use; PostgreSQL is required for multi-instance deployments.
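
The split between orchestration helpers and engine adapters can be illustrated with a minimal sketch. The real core.TxAdapter signature is not shown in this document, so the Tx and TxAdapter interfaces, their method names, and the in-memory adapter below are all hypothetical stand-ins. The sketch only shows the shape: a helper such as RecordObject runs several steps inside one transaction and never needs to know which engine sits underneath.

```go
package main

import (
	"errors"
	"fmt"
)

// Tx is a hypothetical set of operations available inside one transaction.
type Tx interface {
	InsertLocation(key, backend string) error
	AddQuotaUsage(backend string, delta int64) error
}

// TxAdapter is a hypothetical stand-in for core.TxAdapter.
type TxAdapter interface {
	WithTx(fn func(Tx) error) error
}

// RecordObject is engine-agnostic: the location row and the quota counter
// change together or not at all.
func RecordObject(a TxAdapter, key, backend string, size int64) error {
	return a.WithTx(func(tx Tx) error {
		if err := tx.InsertLocation(key, backend); err != nil {
			return err
		}
		return tx.AddQuotaUsage(backend, size)
	})
}

var errQuota = errors.New("quota exceeded")

// memAdapter is an in-memory adapter used only for this demonstration.
type memAdapter struct {
	locations map[string]string
	used      map[string]int64
	limit     map[string]int64
}

// memTx holds staged copies of the adapter state.
type memTx struct {
	locations map[string]string
	used      map[string]int64
	limit     map[string]int64
}

func clone[K comparable, V any](m map[K]V) map[K]V {
	out := make(map[K]V, len(m))
	for k, v := range m {
		out[k] = v
	}
	return out
}

func (m *memAdapter) WithTx(fn func(Tx) error) error {
	tx := &memTx{locations: clone(m.locations), used: clone(m.used), limit: m.limit}
	if err := fn(tx); err != nil {
		return err // staged changes are dropped, like a rollback
	}
	m.locations, m.used = tx.locations, tx.used // commit
	return nil
}

func (t *memTx) InsertLocation(key, backend string) error {
	t.locations[key] = backend
	return nil
}

func (t *memTx) AddQuotaUsage(backend string, delta int64) error {
	if lim := t.limit[backend]; lim > 0 && t.used[backend]+delta > lim {
		return errQuota
	}
	t.used[backend] += delta
	return nil
}

func main() {
	a := &memAdapter{
		locations: map[string]string{},
		used:      map[string]int64{},
		limit:     map[string]int64{"b2": 10},
	}
	fmt.Println(RecordObject(a, "k1", "b2", 8))
	fmt.Println(RecordObject(a, "k2", "b2", 5))
	fmt.Println(len(a.locations), a.used["b2"])
}
```

In this sketch the second call fails its quota step, so the location row staged for k2 is discarded along with it. A Postgres or SQLite adapter would get the same effect from a real database transaction.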

Backends

Standard S3-compatible services accessed via AWS SDK v2, each with a dedicated tuned HTTP transport (connection pooling, idle timeout for DNS freshness). Streaming operations use a shared buffer pool to reduce GC pressure. Any provider that speaks the S3 API works — OCI Object Storage, Backblaze B2, AWS S3, MinIO, Wasabi, Cloudflare R2, GCS (with caveats), etc.

See docs/backends.md for the provider quick-reference table and supported configurations.

Write routing

The routing_strategy config selects how a write picks its target backend:

  • pack (default) — fill backends in config order, first one with available quota wins. Good for stacking free-tier allocations sequentially.
  • spread — pick the backend with the lowest utilization ratio ((bytes_used + orphan_bytes) / bytes_limit). Good for distributing load evenly.

Quota updates are written atomically in the same transaction as the object location record. Set quota_bytes: 0 (or omit it) to disable quota enforcement on a backend — useful when you want unified access or replication without cost control. Backends with a max_object_size limit automatically skip objects that exceed the limit during routing, rebalancing, and replication, preventing repeated 413 errors from providers with per-object size restrictions.
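
The two strategies and the eligibility rules above can be sketched roughly as follows. This is an illustration, not the project's routing code: the Backend struct, its field names, and PickBackend are hypothetical, and two details are assumptions, namely that orphan bytes count against the quota during the eligibility check and that a backend with quota enforcement disabled reports a utilization of 0 under spread.

```go
package main

import "fmt"

// Backend is a hypothetical view of one configured backend's quota state.
type Backend struct {
	Name          string
	BytesUsed     int64
	OrphanBytes   int64
	QuotaBytes    int64 // 0 means quota enforcement is disabled
	MaxObjectSize int64 // 0 means no per-object size limit
}

// accepts reports whether b can take an object of the given size.
func (b Backend) accepts(size int64) bool {
	if b.MaxObjectSize > 0 && size > b.MaxObjectSize {
		return false // skip instead of provoking a 413 from the provider
	}
	if b.QuotaBytes == 0 {
		return true
	}
	return b.BytesUsed+b.OrphanBytes+size <= b.QuotaBytes
}

// utilization is (bytes_used + orphan_bytes) / bytes_limit.
// Assumption: backends without a quota report 0.
func (b Backend) utilization() float64 {
	if b.QuotaBytes == 0 {
		return 0
	}
	return float64(b.BytesUsed+b.OrphanBytes) / float64(b.QuotaBytes)
}

// PickBackend returns the target backend for a write, or false if none fits.
// Any strategy other than "spread" is treated as "pack".
func PickBackend(strategy string, backends []Backend, size int64) (string, bool) {
	best := -1
	for i, b := range backends {
		if !b.accepts(size) {
			continue
		}
		if strategy != "spread" {
			return b.Name, true // pack: first eligible backend in config order
		}
		if best < 0 || b.utilization() < backends[best].utilization() {
			best = i
		}
	}
	if best < 0 {
		return "", false
	}
	return backends[best].Name, true
}

func main() {
	backends := []Backend{
		{Name: "oci", BytesUsed: 15 << 30, QuotaBytes: 20 << 30},
		{Name: "b2", BytesUsed: 2 << 30, QuotaBytes: 10 << 30},
		{Name: "aws", BytesUsed: 4 << 30, QuotaBytes: 5 << 30},
	}
	for _, s := range []string{"pack", "spread"} {
		name, ok := PickBackend(s, backends, 1<<30)
		fmt.Println(s, name, ok)
	}
}
```

With these example numbers, pack selects oci because it is first in config order and still has room, while spread selects b2 because its utilization ratio (0.2) is the lowest of the three.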

See docs/backends.md for full routing semantics.

Usage limits

Optionally cap monthly API requests, egress bytes, and ingress bytes per backend. When a backend exceeds a limit, writes overflow to other backends and reads fail over to replicas. Delete and abort operations always bypass limits.

Limits are enforced using cached database totals (refreshed at the configured flush interval) plus unflushed counters held in Redis when configured, otherwise in local in-memory atomics. Adaptive flushing automatically shortens the interval when any backend approaches a limit.
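
A minimal sketch of that check, using the local in-memory variant: the effective usage is the cached database total plus the counter that has not been flushed yet. UsageTracker and its methods are hypothetical names, and the 80% threshold and the divide-by-four shortening are invented values chosen only to illustrate adaptive flushing.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// UsageTracker is a hypothetical counter for one limited dimension
// (API requests, egress bytes, or ingress bytes) of one backend.
type UsageTracker struct {
	limit     int64        // 0 means unlimited
	cached    atomic.Int64 // total as of the last database refresh
	unflushed atomic.Int64 // usage recorded since the last flush
}

// Effective returns cached total plus unflushed usage.
func (u *UsageTracker) Effective() int64 {
	return u.cached.Load() + u.unflushed.Load()
}

// Allow reports whether n more units fit under the limit.
func (u *UsageTracker) Allow(n int64) bool {
	if u.limit == 0 {
		return true
	}
	return u.Effective()+n <= u.limit
}

// Record adds n units to the unflushed counter.
func (u *UsageTracker) Record(n int64) { u.unflushed.Add(n) }

// Flush moves the unflushed delta into the cached total and returns the
// delta a real implementation would write to the database. The sketch
// ignores the brief window between the swap and the add.
func (u *UsageTracker) Flush() int64 {
	delta := u.unflushed.Swap(0)
	u.cached.Add(delta)
	return delta
}

// FlushInterval shortens the base interval near the limit.
// Assumption: "near" means 80% or more, and the interval is quartered.
func (u *UsageTracker) FlushInterval(base time.Duration) time.Duration {
	if u.limit > 0 && float64(u.Effective()) >= 0.8*float64(u.limit) {
		return base / 4
	}
	return base
}

func main() {
	egress := &UsageTracker{limit: 1000}
	egress.cached.Store(600)
	egress.Record(250)
	fmt.Println(egress.Allow(100), egress.Allow(200))
	fmt.Println(egress.FlushInterval(30 * time.Second))
	fmt.Println(egress.Flush(), egress.Effective())
}
```

Because the check reads cached totals, a multi-instance deployment using local atomics can overshoot a limit by whatever the other instances have not flushed yet. That is presumably the gap that shared Redis counters and adaptive flushing narrow.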

See docs/backends.md for the enforcement semantics and docs/monitoring.md for the relevant metrics.

Replication

Set a per-bucket replication.factor and the background replicator ensures every object has N copies across distinct backends. Replication is health-aware: a backend whose circuit breaker is open is excluded from the replica count, and the replicator creates a substitute copy on a healthy backend to maintain the factor. When the unhealthy backend returns, the over-replication cleaner removes the excess copy according to a scoring policy (draining < circuit-broken < healthy-by-utilization).
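
The counting and cleanup rules can be sketched as below. The Copy struct and both functions are hypothetical. The sketch also assumes one particular reading of the scoring policy: copies are removed in the order draining first, then circuit-broken, then healthy, with the fullest healthy backend removed first. See docs/replication.md for the actual behavior.

```go
package main

import (
	"fmt"
	"sort"
)

// Copy is a hypothetical description of one stored replica of an object.
type Copy struct {
	Backend     string
	Draining    bool
	CircuitOpen bool
	Utilization float64
}

// MissingCopies returns how many replicas must still be created.
// Copies on a backend with an open circuit breaker are not counted.
func MissingCopies(copies []Copy, factor int) int {
	healthy := 0
	for _, c := range copies {
		if !c.CircuitOpen {
			healthy++
		}
	}
	if healthy >= factor {
		return 0
	}
	return factor - healthy
}

// removalRank orders copies for cleanup; lower ranks are removed first.
// Assumed reading of "draining < circuit-broken < healthy-by-utilization".
func removalRank(c Copy) int {
	switch {
	case c.Draining:
		return 0
	case c.CircuitOpen:
		return 1
	default:
		return 2
	}
}

// ExcessCopies returns the backends whose copies should be removed so that
// at most factor copies remain, without dropping below factor healthy ones.
func ExcessCopies(copies []Copy, factor int) []string {
	if len(copies) <= factor {
		return nil
	}
	sorted := append([]Copy(nil), copies...)
	sort.SliceStable(sorted, func(i, j int) bool {
		ri, rj := removalRank(sorted[i]), removalRank(sorted[j])
		if ri != rj {
			return ri < rj
		}
		return sorted[i].Utilization > sorted[j].Utilization // assumption: fullest first
	})
	healthy := factor - MissingCopies(copies, factor) // healthy copies, capped at factor
	if MissingCopies(copies, factor) == 0 {
		healthy = 0
		for _, c := range copies {
			if !c.CircuitOpen {
				healthy++
			}
		}
	}
	var out []string
	for _, c := range sorted {
		if len(out) == len(copies)-factor {
			break
		}
		if !c.CircuitOpen {
			if healthy <= factor {
				continue // keep enough healthy copies
			}
			healthy--
		}
		out = append(out, c.Backend)
	}
	return out
}

func main() {
	outage := []Copy{{Backend: "a", CircuitOpen: true}, {Backend: "b", Utilization: 0.3}}
	fmt.Println(MissingCopies(outage, 2))

	recovered := []Copy{
		{Backend: "a", Utilization: 0.9},
		{Backend: "b", Utilization: 0.3},
		{Backend: "c", Utilization: 0.5},
	}
	fmt.Println(ExcessCopies(recovered, 2))
}
```

In the example, the outage leaves one healthy copy against a factor of 2, so one substitute is needed. Once backend a recovers there are three healthy copies, and the sketch selects the copy on the fullest backend for removal.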

See docs/replication.md for the full lifecycle, including orphan reconciliation.

Background services

Long-running workers keep the metadata layer consistent with the backends:

Worker                     What it does
-------------------------  -----------------------------------------------------------------------
Replicator                 Creates missing replicas to reach replication.factor
Rebalancer                 Moves objects between backends to balance utilization
Over-replication cleaner   Removes excess copies when factor decreases or a stale replica appears
Cleanup queue              Retries failed deletions; graduates to DLQ after repeated failures
Pending reaper             Resolves abandoned PUT-before-COMMIT intents
Lifecycle                  Expires objects matching configured prefix/age rules
Scrubber                   Verifies stored SHA-256 hashes against backend content
Reconciler                 Detects orphans (backend objects with no DB record) and stale DB rows
Usage flush                Periodically writes unflushed usage counters to the DB

All workers are advisory-locked so a multi-instance deployment runs each one on exactly one instance at a time.
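
The locking pattern can be sketched as a try-lock around each worker tick. The Locker interface, memLocker, and RunOnce are hypothetical names. On PostgreSQL such a lock could be backed by pg_try_advisory_lock and pg_advisory_unlock, but whether the project takes the lock per tick or holds it for the worker's lifetime is not stated here, so treat the per-tick shape as an assumption.

```go
package main

import (
	"fmt"
	"sync"
)

// Locker is a hypothetical abstraction over a non-blocking, ID-keyed lock.
// A PostgreSQL-backed version could call pg_try_advisory_lock(id) and
// pg_advisory_unlock(id) on a dedicated connection.
type Locker interface {
	TryLock(id int64) bool
	Unlock(id int64)
}

// memLocker is an in-process stand-in used only for this demonstration.
type memLocker struct {
	mu   sync.Mutex
	held map[int64]bool
}

func (l *memLocker) TryLock(id int64) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.held[id] {
		return false
	}
	l.held[id] = true
	return true
}

func (l *memLocker) Unlock(id int64) {
	l.mu.Lock()
	defer l.mu.Unlock()
	delete(l.held, id)
}

// RunOnce runs one worker tick only if the lock is free, and reports
// whether the tick ran. Instances that lose the race skip the tick.
func RunOnce(l Locker, lockID int64, tick func()) bool {
	if !l.TryLock(lockID) {
		return false
	}
	defer l.Unlock(lockID)
	tick()
	return true
}

func main() {
	locks := &memLocker{held: map[int64]bool{}}
	const replicatorLock = 1001 // hypothetical lock ID

	ran := RunOnce(locks, replicatorLock, func() {
		// A second instance trying the same lock during the tick is turned away.
		fmt.Println("second instance ran:", RunOnce(locks, replicatorLock, func() {}))
	})
	fmt.Println("first instance ran:", ran)
}
```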

See docs/background-services.md for the full table with intervals and lock IDs.