
Admin Guide
This guide walks through deploying, configuring, and operating the S3 Orchestrator from scratch. For architecture and feature details, see the README. For client-side usage (AWS CLI, rclone, SDKs), see the User Guide.
Prerequisites
- PostgreSQL — any recent version. The orchestrator auto-applies its schema on startup.
- At least one S3-compatible storage backend — OCI Object Storage, Backblaze B2, AWS S3, MinIO, Wasabi, etc. You need a bucket and access credentials on that backend.
- The orchestrator binary: a Docker image (via make push VERSION=vX.Y.Z), a .deb package (via make deb VERSION=X.Y.Z), or built from source (make run).
- Redis (optional): for shared usage counters in multi-instance deployments. See usage_flush for details.
Quickstart
Get a minimal single-bucket, single-backend orchestrator running in five steps.
1. Generate a config file
Run the interactive config generator:
This prompts for a database driver (SQLite by default), backend credentials, and virtual bucket settings. Or create a config manually:
2. Create a minimal config
Save this as config.yaml.
3. Start the orchestrator
4. Verify it’s running
5. Test with a quick upload/download
Configuration Walkthrough
This section covers each config section in detail. See packaging/config.yaml for a complete template.
All config values support ${ENV_VAR} expansion — the orchestrator calls os.Expand on the entire YAML file before parsing. Use this for secrets:
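As a rough illustration of what this expansion step does to the raw file text before YAML parsing, here is a minimal Python sketch. It only handles the braced ${NAME} form and assumes unset variables expand to an empty string; the helper name expand_env and the sample keys are hypothetical, and the orchestrator itself does this in Go via os.Expand.

```python
import re

# Matches ${NAME}; the unbraced $NAME form is deliberately not handled here.
_VAR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(text, env):
    """Replace each ${NAME} with env[NAME]; unset names become empty strings."""
    return _VAR.sub(lambda m: env.get(m.group(1), ""), text)


# Hypothetical config fragment: one variable is set, the other is not.
raw = 'secret_access_key: "${B2_SECRET}"\nregion: "${UNSET_REGION}"'
expanded = expand_env(raw, {"B2_SECRET": "s3cr3t"})
```

Because the substitution runs on the whole file, a secret that contains YAML-significant characters should be quoted in the config, as in the fragment above.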
server
- listen_addr is the only required field.
- max_object_size caps single-PUT uploads. Larger objects should use multipart upload (most clients do this automatically). PutObject buffers the entire body in memory to support write failover across backends, so peak memory from uploads is approximately max_object_size x max_concurrent_writes.
- max_concurrent_requests limits the number of S3 requests processed simultaneously. When the limit is reached, new requests are rejected with 503 SlowDown and Retry-After: 1. Set to 2-3x database.max_conns for load shedding. 0 disables the limit.
- max_concurrent_reads and max_concurrent_writes provide separate concurrency limits for reads (GET, HEAD) and writes (PUT, POST, DELETE). When both are set, they replace max_concurrent_requests with independent pools so write storms cannot starve reads. Background workers contend with HTTP writes, not reads: cleanup, replication, rebalance, pending reaper, and over-replication acquire admission slots from the same pool sized to max_concurrent_writes. In merged mode (max_concurrent_requests only), every HTTP request and every background worker shares the single global pool. Size max_concurrent_writes to accommodate both peak HTTP write traffic and the worst-case overlap of background worker activity (typically the replication factor × replicator concurrency for the dominant case). See issue #835 for the design rationale.
- load_shed_threshold enables active load shedding. When in-flight requests exceed this fraction of pool capacity (e.g. 0.8), new requests are probabilistically rejected before the hard limit, providing smooth degradation instead of a cliff.
- admission_wait adds a brief wait before rejecting when the semaphore is full (e.g. 50ms). Smooths micro-bursts without adding latency during sustained overload. Default 0 means instant rejection.
- backend_timeout bounds individual S3 API calls to backends. Increase if you have slow backends or large objects.
- read_header_timeout protects against slow-read attacks that hold connections open by sending headers slowly. The 10-second default is generous for any legitimate client.
- read_timeout and write_timeout bound the total time for reading/writing entire requests and responses. The 5-minute defaults accommodate large object transfers.
- idle_timeout controls how long keep-alive connections stay open waiting for the next request.
- shutdown_delay adds a pause between marking the instance as not-ready and starting the HTTP drain on SIGTERM. Set this to ~5s in environments where service deregistration is asynchronous (Consul, Kubernetes) so load balancers stop routing before connections are closed. Default 0 means no delay.
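The two sizing rules above (peak upload memory, and a write pool that also covers background workers) reduce to simple arithmetic. The sketch below is only a back-of-the-envelope calculator; the function names and the example numbers are hypothetical, and the write-pool formula covers only the dominant replication case mentioned above.

```python
MiB = 1024 * 1024


def peak_upload_memory(max_object_size, max_concurrent_writes):
    """Approximate worst-case bytes held by buffered PutObject bodies."""
    return max_object_size * max_concurrent_writes


def write_pool_size(peak_http_writes, replication_factor, replicator_concurrency):
    """Peak HTTP writes plus the worst-case replicator overlap."""
    return peak_http_writes + replication_factor * replicator_concurrency


# Example: 64 MiB single-PUT cap with 32 write slots -> 2 GiB of buffers.
budget = peak_upload_memory(64 * MiB, 32)
# Example: 20 concurrent client writes, factor 2, replicator concurrency 4.
slots = write_pool_size(20, 2, 4)
```

If the resulting memory budget exceeds what the host can spare, lowering max_object_size (and relying on multipart upload for larger objects) is usually the cheaper lever.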
buckets
Each bucket defines a virtual namespace with one or more credential sets.
Generating credentials: Use openssl rand to produce random keys:
Validation rules:
- Bucket names must not contain /.
- Bucket names must be unique across the config.
- Access key IDs must be globally unique across all buckets.
- Each bucket must have at least one credential set.
- Each credential needs either access_key_id + secret_access_key (SigV4) or token (legacy).
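The validation rules above can be expressed as a small pre-flight check. This is a sketch, not the orchestrator's validator: the dictionary layout (name and credentials keys) is an assumption made for illustration, and only access_key_id, secret_access_key, and token come from the rules themselves.

```python
def validate_buckets(buckets):
    """Return a list of human-readable rule violations (empty means valid)."""
    errors = []
    seen_names, seen_keys = set(), set()
    for bucket in buckets:
        name = bucket.get("name", "")  # "name" is an assumed field name
        if "/" in name:
            errors.append(f"bucket {name!r}: name must not contain '/'")
        if name in seen_names:
            errors.append(f"bucket {name!r}: duplicate bucket name")
        seen_names.add(name)

        creds = bucket.get("credentials") or []  # assumed field name
        if not creds:
            errors.append(f"bucket {name!r}: needs at least one credential set")
        for cred in creds:
            sigv4 = bool(cred.get("access_key_id")) and bool(cred.get("secret_access_key"))
            legacy = bool(cred.get("token"))
            if not (sigv4 or legacy):
                errors.append(f"bucket {name!r}: credential needs SigV4 keys or a token")
            key_id = cred.get("access_key_id")
            if key_id:
                if key_id in seen_keys:
                    errors.append(f"access key {key_id!r}: not globally unique")
                seen_keys.add(key_id)
    return errors
```

Running a check like this in CI before deploying a config change catches the most common mistake, a copied credential block with a reused access key ID.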
Multiple credentials on the same bucket let different services share a namespace with independent keys. This is useful when you want a writer service and a reader service accessing the same files.
SigV4 credentials also support presigned URLs automatically. Clients can generate time-limited presigned URLs using any AWS SDK presign client — no additional configuration is needed on the orchestrator side.
database
The driver field selects between SQLite (embedded, zero-dependency) and PostgreSQL (required for multi-instance deployments). When driver is omitted, the orchestrator infers postgres if host is set, otherwise sqlite.
SQLite (default for single-instance):
SQLite requires no external dependencies. The database file is created automatically on first start. Advisory lock-based leader election is replaced by a process-local mutex, so multi-instance deployments are not supported with SQLite.
PostgreSQL (required for multi-instance):
Pool settings (max_conns, min_conns, max_conn_lifetime) control the pgx connection pool. Size max_conns to 2-3x your max_concurrent_requests setting. See Performance Tuning - Connection Pool Sizing for detailed guidance.
routing_strategy
Controls how the orchestrator selects a backend when writing new objects.
- pack (default) — fills the first backend in config order until its quota is full, then overflows to the next. Best for stacking free-tier allocations sequentially.
- spread: places each object on the backend with the lowest utilization ratio ((bytes_used + orphan_bytes) / bytes_limit). Best for distributing storage evenly across backends.
Both strategies respect quota limits and usage limits — full or over-limit backends are always skipped.
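To make the difference between the two strategies concrete, here is a minimal selection sketch. It models quota only (monthly usage limits are left out), treats a bytes_limit of 0 as unlimited with a utilization of 0.0, and counts the incoming object size against the quota; those three choices, the function names, and the sample backends are assumptions for illustration.

```python
def utilization(backend):
    """(bytes_used + orphan_bytes) / bytes_limit; unlimited backends count as 0.0."""
    if backend["bytes_limit"] == 0:
        return 0.0
    return (backend["bytes_used"] + backend["orphan_bytes"]) / backend["bytes_limit"]


def has_room(backend, size):
    """True when the object fits under the quota (or the backend is unlimited)."""
    if backend["bytes_limit"] == 0:
        return True
    return backend["bytes_used"] + backend["orphan_bytes"] + size <= backend["bytes_limit"]


def pick_backend(backends, size, strategy="pack"):
    """Return the chosen backend name, or None when every backend is full."""
    eligible = [b for b in backends if has_room(b, size)]
    if not eligible:
        return None
    if strategy == "spread":
        return min(eligible, key=utilization)["name"]
    return eligible[0]["name"]  # pack: first eligible backend in config order


backends = [
    {"name": "oci", "bytes_limit": 100, "bytes_used": 90, "orphan_bytes": 0},
    {"name": "b2", "bytes_limit": 100, "bytes_used": 10, "orphan_bytes": 5},
]
```

With these sample values, a 5-byte object goes to oci under pack (it still fits) and to b2 under spread (lower utilization), while a 20-byte object overflows to b2 under either strategy.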
backends
Each backend is an S3-compatible storage service with its own credentials and optional quota.
Endpoint URLs by provider:
| Provider | Endpoint format | force_path_style |
|---|---|---|
| OCI Object Storage | https://<namespace>.compat.objectstorage.<region>.oraclecloud.com | true |
| Backblaze B2 | https://s3.<region>.backblazeb2.com | true |
| AWS S3 | https://s3.<region>.amazonaws.com | false |
| MinIO | http://<host>:9000 | true |
| Wasabi | https://s3.<region>.wasabisys.com | true |
Quota: Set quota_bytes to limit how much data a backend can hold. Set to 0 or omit for unlimited. Quota is tracked in PostgreSQL and updated atomically with every write/delete. Note that multipart uploads do not reserve quota upfront — temporary parts consume backend storage without being counted against the quota until CompleteMultipartUpload records the final object size. A client uploading many large parts could temporarily exceed a backend’s quota before completion.
Max object size: Some providers impose per-object size limits (e.g. Supabase rejects uploads over 50 MB with 413 EntityTooLarge). Set max_object_size to prevent the orchestrator from routing writes, rebalance moves, or replication copies to a backend when the object exceeds the limit:
Usage limits: Optional monthly caps on API requests, egress, and ingress per backend:
When a backend exceeds a usage limit, writes overflow to the next eligible backend. Limits reset each month automatically.
Unsigned payload: By default, uploads stream directly to backends without buffering the entire body in memory. The AWS SDK normally buffers the request body to compute a SigV4 payload hash (SHA-256), but the orchestrator uses UNSIGNED-PAYLOAD to skip this. Without streaming, large uploads (multipart completion, replication) can cause out-of-memory kills.
For HTTPS endpoints, unsigned payload is enabled by default. For plain HTTP endpoints, it is auto-disabled unless explicitly set — AWS S3 rejects unsigned payloads over HTTP, but most S3-compatible backends (MinIO, R2, etc.) accept them. Set unsigned_payload: true on HTTP backends to enable streaming:
Set unsigned_payload: false to force payload hashing. This buffers the entire object in memory before uploading — only use this if you have a specific compliance requirement for end-to-end payload integrity independent of TLS.
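The defaulting behavior described above (explicit setting wins, otherwise the endpoint scheme decides) can be summarized in a few lines. This is a sketch of the decision only; the function name is hypothetical and None stands in for "the field was omitted".

```python
from urllib.parse import urlparse


def unsigned_payload_enabled(endpoint, unsigned_payload=None):
    """Resolve the effective setting for one backend.

    unsigned_payload is True/False when set in the config, None when omitted.
    """
    if unsigned_payload is not None:
        return unsigned_payload  # an explicit value always wins
    # Omitted: enabled for HTTPS endpoints, auto-disabled for plain HTTP.
    return urlparse(endpoint).scheme == "https"
```

For example, a MinIO backend at http://minio.local:9000 resolves to False unless the config sets unsigned_payload: true, which is the case where uploads fall back to in-memory buffering.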
Disable checksum: AWS SDK v2 defaults to sending streaming checksums (CRC64NVME) on uploads. Some S3-compatible providers — notably Google Cloud Storage — reject these with SignatureDoesNotMatch. Set disable_checksum: true on backends that don’t support the AWS checksum headers:
This sets the SDK’s RequestChecksumCalculation and ResponseChecksumValidation to WhenRequired, disabling automatic checksum injection without affecting SigV4 request signing.
Strip SDK headers: AWS SDK v2 adds headers (amz-sdk-invocation-id, amz-sdk-request, accept-encoding) and a query parameter (x-id) that are included in the SigV4 signed header set. Google Cloud Storage does not include these when verifying the signature, causing SignatureDoesNotMatch errors. Set strip_sdk_headers: true to remove them before request signing:
For GCS backends, you typically need both disable_checksum: true and strip_sdk_headers: true:
Credential source: credential_source selects how the orchestrator obtains credentials for the backend. Default is static, which uses the access_key_id / secret_access_key fields above. Set to default_chain to delegate to the AWS SDK’s default credential chain (env vars, EC2 IMDS, SSO, ~/.aws/credentials, STS assume-role). When default_chain is set, the two key fields must be omitted — leaving stale keys behind is rejected at validation so they cannot silently shadow the SDK-resolved credentials.
Use default_chain when:
- The orchestrator runs on an EC2 instance with an IAM role attached (IMDS-vended credentials rotate every ~6 hours and cannot be tracked by YAML).
- Local development uses SSO (aws sso login) instead of long-lived keys.
- You want the SDK to resolve credentials via STS assume-role chains.
Note: the config loader already expands ${ENV_VAR} references at load time, so access_key_id: ${AWS_ACCESS_KEY_ID} covers the env-var case under credential_source: static. Use default_chain for credential sources the loader cannot reach (IMDS, SSO, STS) and for cases where refresh matters.
telemetry
Metrics are served on the same port as the S3 API. Tracing exports spans via gRPC OTLP (e.g., to Tempo or Jaeger).
Production sample rate guidance: A sample_rate of 1.0 traces every request, which is appropriate for development and low-traffic deployments. For production workloads above ~100 RPS, reduce to 0.01–0.1 to avoid overwhelming the trace backend with storage, network, and CPU overhead. Metrics and logs are unaffected by sample rate.
circuit_breaker
The circuit breaker is always active. These settings tune its sensitivity.
When the database is unreachable, the orchestrator enters degraded mode: reads broadcast to all backends (with caching), writes return 503. The circuit automatically recovers when the database comes back.
By default, degraded reads try each backend sequentially. When parallel_broadcast is enabled, all backends are tried concurrently and the first success wins — reducing worst-case read latency from N * backend_timeout to roughly the fastest backend’s response time. Enable this if read latency during outages is critical, but note that each parallel broadcast sends API requests to all backends simultaneously, which counts against monthly usage limits.
For fleets with many configured backends, set degraded_broadcast_parallelism to cap how many backends are probed at once. With a positive value, probes run as a rolling window: the first N launch immediately and each failure replenishes the slot with the next pending backend, so at most N goroutines (and at most N concurrent backend API calls / TLS handshakes) are in flight at any time. The default of 0 preserves the historical “fan out to every backend at once” behaviour.
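The bounded, first-success-wins broadcast can be approximated with a fixed-size worker pool: at most N probes run at once, and a worker that finishes (or fails) picks up the next pending backend. The sketch below uses Python threads rather than goroutines; broadcast_read, demo_probe, and the backend names are hypothetical, and parallelism=0 is mapped to "one worker per backend" to mirror the default described above.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def broadcast_read(backends, probe, parallelism=0):
    """Return the first successful probe result, or None if every backend fails."""
    limit = parallelism if parallelism > 0 else max(len(backends), 1)
    pool = ThreadPoolExecutor(max_workers=limit)
    try:
        futures = [pool.submit(probe, b) for b in backends]
        for future in as_completed(futures):
            try:
                return future.result()  # first success wins
            except Exception:
                continue  # a failure frees the slot for the next pending backend
        return None
    finally:
        # Drop probes that have not started yet (requires Python 3.9+).
        pool.shutdown(wait=False, cancel_futures=True)


def demo_probe(name):
    """Stand-in for a backend GET: only backend 'c' holds the object."""
    if name != "c":
        raise RuntimeError("object not found on " + name)
    return "object-from-" + name


result = broadcast_read(["a", "b", "c", "d"], demo_probe, parallelism=2)
```

Probes already in flight when a winner returns still run to completion in this sketch; each of them is a real backend API call, which is why the cap also matters for monthly usage limits.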
The other defaults are sensible for most deployments. Increase cache_ttl if you have many read-heavy clients and want fewer backend round-trips during outages.
backend_circuit_breaker
Per-backend circuit breakers isolate failures at the individual backend level. When a backend’s credentials expire or the provider becomes unreachable, the circuit opens after consecutive failures and the backend is excluded from request routing. A single probe request tests recovery after the timeout elapses. Disabled by default.
Unlike the database circuit breaker, which triggers degraded mode for the entire system, backend circuit breakers affect only the individual backend. Reads fall back to other replicas, and writes route to other backends with available quota. No extra API calls are made — the breaker trips purely on organic traffic failures.
The s3o_circuit_breaker_state{name="<backend>"} metric tracks each backend’s circuit state (0=closed, 1=open, 2=half-open). Alert on > 0 for individual backends to detect credential or provider issues. Requires a restart to change (not hot-reloadable).
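The closed, open, half-open cycle described above can be sketched as a small state machine. The state numbers match the metric values (0, 1, 2); the class name, the failure_threshold and open_timeout parameter names, and their defaults are assumptions for illustration, not the orchestrator's actual settings.

```python
import time

CLOSED, OPEN, HALF_OPEN = 0, 1, 2


class BackendBreaker:
    def __init__(self, failure_threshold=3, open_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.open_timeout = open_timeout
        self.clock = clock
        self.state = CLOSED
        self.failures = 0
        self.opened_at = 0.0

    def allow(self):
        """Decide whether a request may be routed to this backend."""
        if self.state == CLOSED:
            return True
        if self.state == OPEN and self.clock() - self.opened_at >= self.open_timeout:
            self.state = HALF_OPEN  # let exactly one probe request through
            return True
        return False  # still open, or a probe is already in flight

    def record(self, ok):
        """Feed the outcome of an organic request back into the breaker."""
        if ok:
            self.state, self.failures = CLOSED, 0
            return
        self.failures += 1
        if self.state == HALF_OPEN or self.failures >= self.failure_threshold:
            self.state = OPEN
            self.opened_at = self.clock()


# Walk through one trip-and-recover cycle with a fake clock.
now = [0.0]
breaker = BackendBreaker(failure_threshold=2, open_timeout=10.0, clock=lambda: now[0])
breaker.record(False)
breaker.record(False)
trace = [breaker.state, breaker.allow()]
now[0] = 10.0
trace += [breaker.allow(), breaker.state, breaker.allow()]
breaker.record(True)
trace.append(breaker.state)
```

The injected clock keeps the walk-through deterministic; in the trace, the second allow() after the timeout returns False because the single recovery probe is already outstanding.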
rebalance
Moves objects between backends to optimize storage distribution. Disabled by default — enabling it will generate egress/ingress traffic on your backends.
- pack — fills backends in config order, consolidating free space onto the last backend. Good for maximizing free-tier allocations.
- spread — equalizes utilization ratios across all backends. Good for distributing load.
Object moves run concurrently within each batch, bounded by concurrency. Increase for faster rebalancing; decrease to reduce backend load.
replication
Creates additional copies of objects on different backends for redundancy.
The replication factor must be <= number of backends. The worker runs once at startup to catch up on any pending replicas, then continues at the configured interval. Reads automatically fail over to replicas if the primary copy is unavailable.
Replication is asynchronous — writes go to a single backend and the replicator creates additional copies in the background. When a client overwrites an existing key, all old copies (including replicas) are removed and a single new copy is written. The replication factor drops to 1 until the next replicator cycle creates the additional copies. If the single backend holding the new copy fails before replication runs, the new version of the object is at risk. For most workloads this window (up to worker_interval) is acceptable. Lowering worker_interval reduces the exposure at the cost of more frequent DB queries and backend I/O.
Health-aware replication: When backend circuit breakers are enabled, the replicator monitors backend health. If a backend’s circuit breaker has been open longer than unhealthy_threshold, copies on that backend are treated as unavailable and replacement copies are created on healthy backends. This prevents sustained outages from silently reducing redundancy. The threshold prevents churn during brief transient failures. Set to 0 to disable health-aware replication (copies on down backends are still counted).
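The health-aware counting rule can be illustrated with a short helper that decides how many replacement copies an object needs. This is a sketch under stated assumptions: the function name and the open_since mapping (backend name to the time its breaker opened) are hypothetical, and times are plain seconds.

```python
def copies_to_create(factor, copy_backends, open_since, now, unhealthy_threshold):
    """Number of additional copies needed to reach the replication factor.

    open_since maps a backend name to the time its breaker opened;
    backends with a closed breaker are simply absent from the mapping.
    """
    def counts(backend):
        if unhealthy_threshold <= 0:
            return True  # health-aware replication disabled: every copy counts
        opened = open_since.get(backend)
        return opened is None or now - opened <= unhealthy_threshold

    healthy = sum(1 for b in copy_backends if counts(b))
    return max(0, factor - healthy)
```

With a factor of 2, copies on backends a and b, and b's breaker open for 600 seconds against a 300-second threshold, the helper asks for one replacement copy; a 100-second outage asks for none, which is the churn protection the threshold provides.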
Cleanup Queue
The cleanup queue is always active. Tunables:
multipart_stale_timeout is consumed by the hourly CleanupStaleMultipartUploads sweep — uploads that have been open longer than this are aborted, their parts deleted from the backend, and the multipart rows removed. The default 24h matches the AWS S3 SDK’s default abort behavior; lower it on backends with tight free-tier headroom to recover quota faster.
When any backend object deletion fails during normal operations (PutObject orphan cleanup, DeleteObject, overwrite displaced copies, multipart part cleanup, rebalancer, replicator), the failed deletion is automatically enqueued for retry.
Each enqueued item tracks the object’s size_bytes. On enqueue, the backend’s orphan_bytes counter is incremented so that write routing and replication target selection account for the physically unreleased space. On successful cleanup the row is removed and orphan_bytes is decremented in a single atomic CTE; a worker crash between the two operations cannot leave the counter inconsistent.
Per-row claim pattern. Every row carries claimed_at and claimed_by columns. When a worker tick fetches a batch it stamps each row with the current instance’s identifier and timestamp, gated by FOR UPDATE SKIP LOCKED (Postgres) or SQLite’s intrinsic single-writer serialisation. Two instances ticking concurrently always see disjoint row sets, so a connection death or rolling-deploy overlap that would otherwise let two workers process the same row is now structurally impossible. A claim older than claim_grace_period (default 5m) is reclaimable so a worker that died mid-process does not leave the row stuck; reclaims emit s3o_cleanup_queue_stale_claims_recovered_total and a cleanup_queue.claim_recovered audit event.
The background worker runs every minute and retries with exponential backoff (1 minute to 24 hours). Scheduling a retry clears the row’s claim so it is immediately re-eligible for the next tick. After 10 failed attempts, the row is graduated to the cleanup_dlq table via core.MoveCleanupToDLQ (single transaction: read the row, insert it into cleanup_dlq, delete it from cleanup_queue). orphan_bytes is intentionally NOT decremented during the move because the backend object is still on disk. The DLQ entry retains the full row payload (key, backend, size, reason, last_error) plus an original_id correlation column so an operator can find the original queue entry.
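The retry schedule can be sketched as follows. The text above gives the bounds (1 minute to 24 hours) and the 10-attempt limit but not the growth factor, so the doubling used here is an assumption, as are the function names.

```python
BASE_SECONDS = 60             # first retry after 1 minute
CAP_SECONDS = 24 * 60 * 60    # never wait longer than 24 hours
MAX_ATTEMPTS = 10             # after this many failures the row moves to the DLQ


def retry_delay(failed_attempts):
    """Delay before the next try, doubling per failure (assumed growth factor)."""
    return min(BASE_SECONDS * 2 ** (failed_attempts - 1), CAP_SECONDS)


def next_action(failed_attempts):
    """Return ("retry", delay_seconds) or ("dlq", None) for a failed row."""
    if failed_attempts >= MAX_ATTEMPTS:
        return ("dlq", None)
    return ("retry", retry_delay(failed_attempts))
```

Under pure doubling the delay after the ninth failure is a little over four hours, so the 24-hour cap would only bind with a steeper growth factor or a higher attempt limit.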
Monitoring:
- s3o_cleanup_queue_depth staying elevated: orphaned objects are accumulating in the active queue.
- s3o_cleanup_queue_processed_total{status="exhausted"}: counter increments each time an item exhausts retries.
- s3o_cleanup_queue_processed_total{status="success_absent"}: counter increments each time a backend DELETE returned 404 and the row was dropped as idempotent success (the backend already agrees the object is gone). A sustained rate here is benign and just means upstream PUTs are silently failing somewhere; spikes are worth correlating with backend health.
- s3o_cleanup_queue_stale_claims_recovered_total{backend}: non-zero rate means a worker died mid-process or the grace period is too short for realistic worst-case processing time.
- s3o_cleanup_dlq_depth > 0: the DLQ holds at least one unrecoverable orphan; alerting here gives operators a direct signal instead of a counter delta.
- s3o_cleanup_dlq_enqueued_total{backend}: rate of graduations per backend; a single backend dominating means that backend’s delete path is broken.
- s3o_cleanup_enqueue_failures_total{backend,reason,stage}: orphan-leak blind spot signal. The cleanup-queue itself is durable, but its enqueue path is best-effort: when a backend write succeeds and the DB is then unreachable, the orphan cannot be recorded in cleanup_queue and the only signal is this counter plus the matching storage.OrphanEnqueueFailed audit event. stage="enqueue" is the worst case (the cleanup-queue worker will never see this orphan); stage="orphan_bytes" means the row landed but the quota counter drifts. See the runbook below.
- s3o_quota_orphan_bytes: elevated values mean backends have significant physically unreleased space (DLQ entries are the long-tail contributors).
Untracked-orphan recovery (cleanup enqueue failed during DB outage). A non-zero rate of s3o_cleanup_enqueue_failures_total{stage="enqueue"} means at least one orphan exists on a backend with no cleanup_queue row. The cleanup-queue worker will not retry it; the storage will leak until reconciled. Recovery workflow:
1. Query the audit log for event="storage.OrphanEnqueueFailed" to enumerate the specific backend/key/size of each affected orphan during the outage window.
2. Once DB connectivity is restored, run POST /admin/api/reconcile[?backend=name]. The reconciler walks each backend’s actual key list against object_locations using a bounded-memory sorted-merge and emits S3-only keys to the cleanup path (with a fresh cleanup_queue row this time). This is the same diff machinery that runs on the nightly reconcile interval.
3. If the audit log indicates more than a handful of failures, target the reconciler at the affected backends specifically rather than waiting for the next scheduled scan.
stage="orphan_bytes" failures do not need step 2 — the cleanup_queue row landed and the worker will eventually delete the object. The quota counter drift is reset when backend_quotas.orphan_bytes is reconciled against cleanup_queue (a periodic safety pass; not yet automated).
Manual cleanup: Inspect DLQ entries and resolve them deliberately. The bytes are still on the backend, so the workflow is delete the object out-of-band, then write off the row + adjust orphan_bytes by the row’s size:
write_path
The write path can run in two modes. Direct mode (enabled: false) writes to the backend and commits the metadata immediately afterward; a crash between the two leaks bytes onto the backend with no DB record. Pending-intent mode (enabled: true, the default) inserts a row into pending_objects before the backend PUT and atomically deletes that row when the metadata commits — so a crash between the PUT and the commit leaves a recoverable intent the background reaper can resolve.
How recovery works. On every tick the PendingReaper worker (internal/worker/pending.go) claims a batch of pending_objects rows older than min_age, HEADs the backend at the recorded key, and resolves each one:
- HEAD 200 → the backend received the bytes. Promote the intent to a committed object_locations row (pending_reaper.promoted audit event).
- HEAD 404 → the backend never received the bytes. Drop the intent (pending_reaper.dropped audit event). No orphan exists.
- Non-404 HEAD error → leave the intent for the next tick. A sustained backend reachability problem here surfaces as s3o_pending_intents_resolved_total{status="ambiguous"}.
- A later write for the same key already committed → drop the intent as superseded (pending_reaper.superseded).
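The four outcomes above amount to a small decision table. The sketch below encodes it as a pure function together with the min_age eligibility check; the function names are hypothetical, and checking for a superseding write before looking at the HEAD result is an assumption about ordering.

```python
def eligible(intent_age_seconds, min_age_seconds):
    """Only intents older than min_age are handed to the reaper."""
    return intent_age_seconds > min_age_seconds


def resolve_intent(head_status, superseded=False):
    """Map a backend HEAD result to the reaper's resolution for one intent."""
    if superseded:
        return "superseded"  # a later write for the same key already committed
    if head_status == 200:
        return "promoted"    # bytes exist: commit an object_locations row
    if head_status == 404:
        return "dropped"     # bytes never arrived: no orphan to clean up
    return "ambiguous"       # any other error: retry on the next tick
```

Keeping the decision free of I/O like this makes the ambiguous branch easy to reason about: nothing is deleted or committed unless the backend gave a definite answer.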
Why min_age matters. The reaper must not race the foreground write path; if min_age is too short the reaper can interrogate an intent whose backend PUT is still in flight and either prematurely commit it or churn ambiguous resolutions. The 5-minute default is generous; lower it only if you have measured the p99 PUT duration and accept the operational tradeoff.
Monitoring:
- s3o_pending_intents_enqueued_total: should track the PutObject rate closely.
- s3o_pending_intents_resolved_total{status}: committed is the happy path (synchronous commit succeeded); promoted + dropped are reaper resolutions; ambiguous is the alert.
- s3o_pending_intents_depth: gauge of unresolved intents. Alert when consistently above batch_size, which means the reaper is not keeping up (raise batch_size, lower reaper_tick, or add concurrency).
- Audit events: pending_reaper.promoted / pending_reaper.dropped / pending_reaper.superseded.
When to disable. Don’t, unless you are running an embedded SQLite single-instance demo and trust the OS to flush. The pattern adds one DB write per PUT (cheap) and saves you from one entire class of write-path crash leak.
rate_limit
Per-IP token bucket rate limiting. When enabled, rate limiting applies to both the S3 proxy and the admin API. Requests exceeding the limit receive 429 SlowDown.
A background goroutine evicts per-IP entries not seen within cleanup_max_age every cleanup_interval. Under high source-IP cardinality (e.g., DDoS), the map can hold up to cleanup_max_age worth of unique IPs — tune both values down if memory pressure is a concern.
When trusted_proxies is configured, the orchestrator extracts the real client IP from the X-Forwarded-For header using rightmost-untrusted extraction: it walks the XFF chain from right to left, skipping addresses within trusted CIDRs, and uses the first untrusted address for rate limiting. If the direct peer is not in a trusted CIDR, X-Forwarded-For is ignored entirely to prevent spoofing. Without trusted_proxies, the direct connection IP is always used.
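Rightmost-untrusted extraction is easy to get subtly wrong, so a short sketch may help. It uses Python's standard ipaddress module; the function name is hypothetical, and falling back to the direct peer when every hop in the chain is trusted is an assumption about behavior the text does not specify.

```python
import ipaddress


def client_ip(peer, xff, trusted_cidrs):
    """Pick the address to rate-limit on, given the peer IP and the XFF header."""
    nets = [ipaddress.ip_network(cidr) for cidr in trusted_cidrs]

    def trusted(addr):
        try:
            ip = ipaddress.ip_address(addr)
        except ValueError:
            return False  # unparseable entries are never trusted
        return any(ip in net for net in nets)

    # No trusted proxies, an untrusted direct peer, or no header: ignore XFF.
    if not nets or not trusted(peer) or not xff:
        return peer

    # Walk right to left, skipping trusted hops; the first untrusted one wins.
    for hop in reversed([h.strip() for h in xff.split(",")]):
        if hop and not trusted(hop):
            return hop
    return peer  # every hop was a trusted proxy (assumed fallback)
```

Walking from the right matters because a client can prepend anything it likes to the left side of the header, while the rightmost entries are appended by proxies you control.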
Multi-instance note: Rate limits are enforced per-instance. Behind a load balancer with round-robin routing, the effective rate for a given client is requests_per_sec * instance_count. Divide your desired aggregate rate by the number of API instances when configuring.
ui
Built-in web dashboard for operational visibility and management. Disabled by default. Requires authentication via an admin key/secret pair — sessions are HMAC-signed cookies with a 24-hour TTL.
admin_key, admin_secret, and session_secret are all required when enabled is true. Generate credentials the same way as bucket credentials:
Bcrypt-hashed secrets: For bare-metal deployments where the config file is at rest on disk, you can store admin_secret as a bcrypt hash instead of plaintext. The orchestrator detects bcrypt hashes automatically (they start with $2). Generate one with:
Both plaintext and bcrypt secrets are fully supported — no config migration needed.
Session secret: Session keys are derived deterministically from session_secret using HMAC-SHA256, so sessions survive restarts. For multi-instance deployments behind a load balancer, all instances sharing the same session_secret will accept each other’s sessions. Generate a value with:
session_secret is independent of admin_secret — rotating the admin password does not invalidate active sessions, and vice versa.
usage_flush
Controls how often usage counters are flushed to the database. When adaptive flushing is enabled, the interval shortens automatically when any backend approaches a usage limit, improving enforcement accuracy.
- interval: how often counters are flushed under normal conditions. Lower values reduce staleness but increase database writes.
- adaptive_enabled: when true, the flush interval drops to fast_interval whenever any backend’s effective usage exceeds adaptive_threshold of its configured limit.
- adaptive_threshold: the ratio (0–1 exclusive) at which fast flushing kicks in. At 0.8, a backend at 80% of any usage limit triggers the fast interval.
- fast_interval: must be less than interval. Used when adaptive flushing detects a backend near its limits.
Multi-instance note: Without Redis, each instance accumulates usage counters in memory between flushes. With N instances, the enforcement margin near limits is up to N * interval worth of unaccounted operations. Adaptive flushing reduces this near limits but doesn’t eliminate it. For tighter enforcement, configure Redis shared counters to eliminate the cross-instance blind spot entirely, or reduce interval and run fewer API instances.
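The interplay of the four usage_flush settings can be sketched as a single selection function. The function name and the usage_ratios input (each backend's highest usage divided by its limit) are hypothetical, and treating a ratio exactly equal to the threshold as "near the limit" is an assumption about the boundary.

```python
def flush_interval(interval, fast_interval, adaptive_enabled,
                   adaptive_threshold, usage_ratios):
    """Pick the next flush interval in seconds.

    usage_ratios holds, per backend, the highest used/limit ratio
    across its configured usage limits.
    """
    near_limit = any(ratio >= adaptive_threshold for ratio in usage_ratios)
    if adaptive_enabled and near_limit:
        return fast_interval
    return interval
```

For example, with interval=30, fast_interval=5, and a threshold of 0.8, one backend at 85% of its monthly request cap is enough to switch every flush to the 5-second cadence until usage resets.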
redis
Optional shared usage counters for multi-instance deployments. When configured, all instances share usage counters via Redis instead of tracking them independently in memory. This eliminates the cross-instance blind spot between PostgreSQL flushes.
- address: required when the redis section is present. The orchestrator PINGs Redis on startup and fails hard if unreachable.
- key_prefix: namespaces all Redis keys. Use different prefixes if multiple orchestrator deployments share one Redis instance.
- failure_threshold and open_timeout: control the circuit breaker that falls back to local counters when Redis is unavailable.
When Redis is active, the usage flush service acquires a PostgreSQL advisory lock so only one instance performs the destructive GETSET + flush-to-PG operation. When Redis is in fallback (or not configured), each instance flushes independently without a lock.
A background health probe runs every 5 seconds while the breaker is open: it PINGs Redis and, on success, syncs the accumulated local-counter deltas back via an additive INCRBY pipeline (no DEL — keys from before the outage expire via TTL) and recloses the breaker. The breaker recovery is clean: the failure counter is zeroed so the system tolerates the configured failure_threshold of new transient errors before tripping again. No process restart is required after a Redis outage.
Redis is not reloadable — changing Redis settings requires a restart.
lifecycle
Automatically deletes objects whose key matches a prefix and whose age exceeds the configured expiration. Useful for temporary uploads, staging artifacts, or anything with a known retention period.
- prefix: key prefix to match (required, must be non-empty).
- expiration_days: delete objects older than this many days (required, must be > 0).
- Omit the lifecycle section or leave rules empty to disable lifecycle entirely.
- Rules are evaluated every hour by a background worker with an advisory lock.
- Deletions go through the standard DeleteObject path: all copies removed, quotas decremented, failed deletes enqueued to the cleanup queue.
- Hot-reloadable via SIGHUP.
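A lifecycle rule boils down to a prefix match plus an age comparison. The sketch below shows that check in isolation; the function name, the rule dictionaries, and the sample keys are hypothetical, and it assumes age is measured from the object's creation time.

```python
from datetime import datetime, timedelta, timezone


def expired(key, created_at, rules, now):
    """True when any rule's prefix matches and the object is older than its limit."""
    return any(
        key.startswith(rule["prefix"])
        and now - created_at > timedelta(days=rule["expiration_days"])
        for rule in rules
    )


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
rules = [{"prefix": "tmp/", "expiration_days": 7}]
```

With this rule, tmp/upload.bin created eight days ago is eligible for deletion, the same key created three days ago is not, and keys outside tmp/ are never touched regardless of age.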
encryption
Server-side envelope encryption with chunked AES-256-GCM. When enabled, objects are encrypted before being stored on backends and decrypted transparently on read. Exactly one key source is required.
Generating a master key:
Key source options — exactly one of the following must be set:
| Source | Config field | When to use |
|---|---|---|
| Inline | master_key | Base64-encoded 256-bit key in config or env var. Simplest option. |
| File | master_key_file | Path to a file containing exactly 32 raw bytes. Good for bare-metal with config management. |
| Vault Transit | vault | Delegate key wrapping/unwrapping to HashiCorp Vault. Best for production with HSM-backed key management. |
Vault Transit configuration:
The Vault Transit engine handles wrapping and unwrapping DEKs — the orchestrator never sees the master key material. The key_name must reference an existing key in the Transit engine.
Key rotation support:
When rotating to a new master key, move the old key to previous_keys so existing objects can still be decrypted:
After updating the config, call the rotate-encryption-key admin API to re-wrap all DEKs with the new key. See Rotating encryption keys below.
Important notes:
- Encryption is not reloadable — changing encryption settings requires a restart.
- The chunk_size must stay the same for the lifetime of the data. Changing it after objects are encrypted will make those objects unreadable.
- Encrypted objects are slightly larger than their plaintext (header + per-chunk overhead). The exact overhead is: 32 bytes (header) + 28 bytes per chunk (nonce + auth tag).
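The overhead formula above makes it straightforward to estimate stored size for quota planning. This sketch assumes the final partial chunk counts as a full chunk for overhead purposes; the function name and the 64 KiB chunk size in the example are hypothetical.

```python
HEADER_BYTES = 32      # fixed per-object header
PER_CHUNK_BYTES = 28   # nonce + auth tag per chunk


def encrypted_size(plain_size, chunk_size):
    """Estimated on-backend size of an encrypted object."""
    chunks = -(-plain_size // chunk_size)  # ceiling division
    return plain_size + HEADER_BYTES + chunks * PER_CHUNK_BYTES


# Example: a 100,000-byte object with 64 KiB chunks spans two chunks.
example = encrypted_size(100_000, 64 * 1024)
```

The overhead is small for large objects but not negligible for many tiny ones, since every object pays at least the header plus one chunk's worth.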
integrity
SHA-256 content hashing for data integrity verification. When enabled, objects are checksummed on write and the hash is stored alongside the object location in PostgreSQL.
How it works:
- Write path: SHA-256 is computed on the plaintext body (before encryption) and stored in object_locations.content_hash.
- Read path (verify_on_read): A VerifyingReader wraps the response body and computes the hash as data streams to the client. On mismatch at EOF, the corrupted copy is enqueued for cleanup.
- Scrubber: A background worker periodically reads random objects from backends, decrypts if needed, and verifies their hash. Corrupted copies are enqueued for cleanup. Each read counts against the backend’s usage quota.
- Backfill: Objects written before integrity was enabled have no stored hash. Use admin backfill-checksums to read those objects and compute their hashes.
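The read-path idea (hash while streaming, compare at EOF) can be sketched in a few lines. This is a Python analogue for illustration only, not the orchestrator's VerifyingReader; the on_mismatch callback stands in for "enqueue the corrupted copy for cleanup".

```python
import hashlib
import io


class VerifyingReader:
    """Wrap a stream, hash bytes as they pass through, and check at EOF."""

    def __init__(self, body, expected_hex, on_mismatch):
        self._body = body
        self._expected = expected_hex
        self._on_mismatch = on_mismatch
        self._hash = hashlib.sha256()
        self._checked = False

    def read(self, n=-1):
        if n == 0:
            return b""
        data = self._body.read(n)
        self._hash.update(data)
        # EOF is an empty read, or any read-all call (n < 0).
        if (n < 0 or not data) and not self._checked:
            self._checked = True
            if self._hash.hexdigest() != self._expected:
                self._on_mismatch()
        return data


payload = b"hello world"
good_hash = hashlib.sha256(payload).hexdigest()
flagged = []

intact = VerifyingReader(io.BytesIO(payload), good_hash, lambda: flagged.append("intact"))
while intact.read(4):
    pass

corrupt = VerifyingReader(io.BytesIO(b"hellO world"), good_hash, lambda: flagged.append("corrupt"))
while corrupt.read(4):
    pass
```

Note the inherent limitation of this pattern: by the time the mismatch is detected the bytes have already been streamed to the client, so the check protects future reads (by retiring the bad copy) rather than the current one.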
Integrity is hot-reloadable — changes take effect on SIGHUP without a restart.
cache
Optional in-memory LRU cache for full GET responses. Reduces backend API calls and egress by serving repeated reads from memory. Per-instance only — not shared across instances.
- max_size: total memory the cache may consume. Size this based on available container memory after accounting for the Go heap, connection pools, and streaming buffers. A good starting point is 10-25% of the container’s memory allocation.
- max_object_size: objects larger than this are never admitted to the cache. Prevents a single large object from evicting many smaller frequently-accessed objects. Set this below the typical “hot” object size in your workload.
- ttl: maximum time an entry stays cached before automatic expiry. In multi-instance deployments, this bounds how stale a cached object can be when writes happen on another instance. Lower values reduce staleness at the cost of more backend requests.
Cache entries are automatically invalidated on PutObject, DeleteObject, CopyObject, DeleteObjects, and CompleteMultipartUpload. Range requests bypass the cache on miss but are served from cache on hit.
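To show how the three settings interact, here is a minimal sketch of a size-bounded LRU cache with an admission limit and a TTL. It is an illustration of the general technique under stated assumptions, not the orchestrator's implementation; the class and method names are hypothetical.

```python
import time
from collections import OrderedDict


class ObjectCache:
    """Size-bounded LRU cache with per-object admission limit and TTL."""

    def __init__(self, max_size, max_object_size, ttl, clock=time.monotonic):
        self.max_size = max_size
        self.max_object_size = max_object_size
        self.ttl = ttl
        self._clock = clock
        self._entries = OrderedDict()  # key -> (body, expires_at), oldest first
        self._used = 0

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        body, expires_at = entry
        if self._clock() >= expires_at:
            self.invalidate(key)  # expired entries are dropped lazily
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return body

    def put(self, key, body):
        self.invalidate(key)  # a write always replaces the old entry
        if len(body) > self.max_object_size or len(body) > self.max_size:
            return False  # too large to admit
        self._entries[key] = (body, self._clock() + self.ttl)
        self._used += len(body)
        while self._used > self.max_size:
            _, (old_body, _expires) = self._entries.popitem(last=False)
            self._used -= len(old_body)
        return True

    def invalidate(self, key):
        entry = self._entries.pop(key, None)
        if entry is not None:
            self._used -= len(entry[0])
```

In this sketch a write path would call `invalidate` (or `put`) for the affected key, which mirrors the invalidation on PutObject and DeleteObject described above.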
When to enable:
- Read-heavy workloads where the same objects are fetched repeatedly (thumbnails, config files, assets)
- Backends with per-request API charges or egress costs
- High-latency backends where caching improves P99 latency
When to skip:
- Write-heavy workloads with few repeated reads
- Objects are too large to fit meaningfully in memory
- Single-instance with very low read traffic
The cache is not hot-reloadable — changing cache settings requires a restart. When encryption is enabled, the cache stores post-decryption plaintext.
Metrics:
| Metric | Labels | Description |
|---|---|---|
| s3o_integrity_checks_total | operation | Hash verifications performed (read, scrub) |
| s3o_integrity_errors_total | operation | Hash mismatches detected (read, scrub) |
When enabled, the dashboard is served at {path}/ on the same port as the S3 API.
All dashboard responses include security headers (X-Frame-Options: DENY, X-Content-Type-Options: nosniff, Referrer-Policy: strict-origin-when-cross-origin, Content-Security-Policy). The dashboard requires authentication via the configured admin_key/admin_secret — unauthenticated requests are redirected to the login page (HTML) or receive 401 (API).
Multi-Backend Configurations
Single backend with quota
The simplest setup. One backend with a byte cap:
Multiple backends with quotas (pack routing)
Stack multiple free-tier allocations. With the default routing_strategy: "pack", when one backend fills up, writes overflow to the next. Use routing_strategy: "spread" instead to distribute objects evenly by utilization ratio:
This gives you 30 GB of combined storage across two providers.
Multiple backends without quotas (requires replication or spread)
When all backends are unlimited and using the default pack routing, only the first backend would receive writes. To distribute data, either set replication.factor >= 2 to replicate across backends, or use routing_strategy: "spread" to distribute writes by utilization ratio.
Validation rule: You cannot mix unlimited and quota-limited backends. Either all backends have quota_bytes set (quota routing) or all are unlimited (replication or spread routing required).
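The routing and validation rules in this section can be sketched as follows. This is an illustration under assumptions, not the orchestrator's code: the `Backend` type, the function names, and the choice to compare unlimited backends by raw bytes used under "spread" are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Backend:
    name: str
    used_bytes: int
    quota_bytes: Optional[int] = None  # None means unlimited


def validate_backends(backends):
    """Reject configs that mix quota-limited and unlimited backends."""
    limited = [b.quota_bytes is not None for b in backends]
    if any(limited) and not all(limited):
        raise ValueError("cannot mix unlimited and quota-limited backends")


def utilization(b):
    # Assumption: unlimited backends are compared by raw bytes used.
    return b.used_bytes / b.quota_bytes if b.quota_bytes else float(b.used_bytes)


def pick_backend(backends, size, strategy="pack"):
    """Return the backend that should receive a write of `size` bytes."""
    fits = [b for b in backends
            if b.quota_bytes is None or b.used_bytes + size <= b.quota_bytes]
    if not fits:
        return None  # no backend has capacity
    if strategy == "pack":
        return fits[0]  # first backend in config order with room
    if strategy == "spread":
        return min(fits, key=utilization)  # least-utilized backend
    raise ValueError(f"unknown routing strategy: {strategy}")
```

With "pack", writes stay on the first backend until it can no longer fit the object; with "spread", each write goes to whichever backend currently has the lowest utilization.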
Onboarding a New Tenant
To add a new application to the orchestrator:
Generate credentials for the new tenant:
Add a bucket entry to your config:
Reload the configuration by sending SIGHUP (kill -HUP $(pidof s3-orchestrator)), or restart the orchestrator.
Hand the client four pieces of information:
- Endpoint URL (e.g., http://s3-orchestrator.service.consul:9000)
- Bucket name (e.g., new-app)
- Access Key ID
- Secret Access Key
Point them to the User Guide for client setup instructions.
CLI Subcommands
version
Prints the binary version, Go version, and platform:
validate
Validates a configuration file without starting the server. Exits 0 on success with a brief summary, or exits 1 with error details. Useful for CI pipelines or pre-deploy checks:
admin
Operational CLI for inspecting and controlling a running instance. Reads config.yaml to discover the server address and admin token (ui.admin_token, falling back to ui.admin_key), then makes HTTP requests to the admin API.
Flags:
| Flag | Default | Description |
|---|---|---|
| -config | config.yaml | Path to configuration file |
| -addr | from config | Override server address |
Commands:
The admin API requires ui.admin_token (or ui.admin_key as fallback) to be set in the configuration. All requests are authenticated via the X-Admin-Token header.
Importing Existing Data
The sync subcommand imports objects from an existing backend bucket into the orchestrator’s metadata database. Use this when bringing a bucket that already has data under orchestrator management.
Dry run first
Always preview what would be imported before committing:
Run the import
The --bucket flag specifies which virtual bucket the imported objects belong to. Keys are stored internally as {bucket}/{key}, so this determines the namespace.
Partial import with --prefix
Import only objects under a specific key prefix:
Objects already tracked in the database for that backend are automatically skipped. The command logs per-page progress and a final summary.
Sync flags
| Flag | Default | Description |
|---|---|---|
| --config | config.yaml | Path to configuration file |
| --backend | (required) | Backend name to sync from |
| --bucket | (required) | Virtual bucket name to assign to imported objects |
| --prefix | "" | Only sync objects with this key prefix |
| --dry-run | false | Preview without writing to the database |
Monitoring
Web dashboard
When ui.enabled is true, the dashboard at {path}/ shows a live snapshot of:
- Storage summary — total bytes used/capacity across all backends
- Backend quota — bytes used/limit with progress bars per backend, object counts, active multipart uploads
- Monthly usage — API requests, egress, and ingress per backend with limits
- Object tree — interactive collapsible file browser. Buckets and directories are collapsed by default; click to expand. Each directory shows a rollup file count and total size.
- Configuration — virtual bucket names, write routing strategy, replication factor, rebalance status, rate limiting, encryption status
- Logs — recent structured log output from an in-memory ring buffer (last 5,000 entries). Filter by severity level and search by text. Logs are available immediately on page load — no need to SSH into the host.
The dashboard also provides management actions:
- Upload — upload files to any virtual bucket directly from the browser (up to 512 MiB per file)
- Download — download individual objects by clicking the download icon on any file in the tree
- Delete — delete individual objects by clicking the delete icon on any file in the tree
- Rebalance — trigger an on-demand rebalance using the configured strategy and settings
- Clean Excess — remove over-replicated copies that exceed the replication factor
- Sync — import pre-existing objects from a backend’s S3 bucket into the proxy database. Select a backend and a virtual bucket — objects already in the database are skipped, and objects belonging to other virtual buckets are excluded.
On-demand reconciliation
When a backend loses data (expired credentials, provider outage, accidental deletion), the metadata database retains stale entries that cause log noise in the rebalancer, replicator, and scrubber. The reconcile endpoint walks both the backend (paginated ListObjects) and the metadata DB (paginated cursor over object_locations ordered by key) as ascending key streams, then merges them in lockstep. Memory is bounded by a 1000-entry page size on each side regardless of total object count, so a backend holding millions of objects reconciles without OOM. Keys belonging to sibling virtual buckets stored on the same backend are skipped — each virtual bucket reconciles in its own pass.
Response:
- imported: objects on the backend but not in the DB (brought under management)
- removed: DB entries whose objects no longer exist on the backend (stale metadata cleaned up)
- backends_scanned: how many backends were processed
The background reconciler (reconcile.enabled: true) runs the same logic on a timer. The admin endpoint is for immediate use after incidents.
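The lockstep merge of two ascending key streams can be sketched as a generator. This is a minimal illustration, assuming both sides yield keys in the same sort order (Python compares strings by code point, which matches UTF-8 byte order); the function name and the action labels are hypothetical.

```python
def reconcile(backend_keys, db_keys):
    """Merge two ascending key streams and yield (action, key) pairs.

    "import": key exists on the backend but not in the DB.
    "remove": key exists in the DB but not on the backend.
    Only one key per side is held in memory at a time.
    """
    b_iter, d_iter = iter(backend_keys), iter(db_keys)
    b, d = next(b_iter, None), next(d_iter, None)
    while b is not None or d is not None:
        if d is None or (b is not None and b < d):
            yield ("import", b)   # backend is behind: object unknown to DB
            b = next(b_iter, None)
        elif b is None or d < b:
            yield ("remove", d)   # DB is behind: stale metadata row
            d = next(d_iter, None)
        else:
            # Same key on both sides: nothing to do, advance both.
            b, d = next(b_iter, None), next(d_iter, None)
```

Feeding the function from paginated iterators (one page of ListObjects results, one page of DB rows) keeps memory proportional to the page size rather than to the total object count.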
The dashboard requires authentication. Users log in at {path}/login with the admin_key and admin_secret configured in the ui section. Sessions last 24 hours.
The dashboard is server-rendered HTML. The object tree uses JavaScript for lazy-loaded directory expansion: directories fetch their children on click via the {path}/api/tree endpoint.
JSON endpoints at {path}/api/dashboard, {path}/api/tree, and {path}/api/logs return data for programmatic access or integration with other tools. The logs endpoint accepts optional query parameters: level (minimum severity: DEBUG, INFO, WARN, ERROR), since (RFC3339 timestamp), component, and limit. Management endpoints ({path}/api/delete, {path}/api/upload, {path}/api/rebalance, {path}/api/clean-excess, {path}/api/sync) accept POST requests. The download endpoint ({path}/api/download?key=...) accepts GET requests. All API endpoints require authentication.
Health endpoints
Two health endpoints serve different purposes:
Liveness (/health) — always returns HTTP 200. Use this for liveness probes (Consul checks, K8s livenessProbe). The service stays in rotation during temporary database outages.
Readiness (/health/ready) — returns HTTP 200 when the service is ready to handle traffic, HTTP 503 during startup (before migrations and backend initialization complete) and during shutdown drain. Use this for readiness probes (K8s readinessProbe, Nomad on_update = "require_healthy").
The HTTP response body is intentionally minimal — only the status field is returned, so log aggregators can grep on a fixed pattern. To identify which instance answered a probe in a multi-instance deployment, query GET /admin/api/workers (which includes the instance identifier) or correlate by source IP.
Background worker health
/health only reflects database breaker state. Background services
(replicator, cleanup queue, lifecycle, pending reaper, …) are
supervised by the lifecycle manager and recover on their own, but a
service that is running yet failing every tick looks identical to
a healthy one in /health.
GET /admin/api/workers returns a JSON snapshot of every registered
background service’s last-tick health, including last_success,
last_failure, last_error, and consecutive_failures. Use it
during incidents to distinguish a healthy service from one that is
running but failing every tick.
Proxy-only deployments return 503 from this endpoint
because no worker pool is registered.
The same data flows into Prometheus as
s3o_worker_last_success_timestamp_seconds,
s3o_worker_consecutive_failures, and s3o_worker_ticks_total{result}
so alerting can run without scraping the admin endpoint. Suggested
alert shapes are in the metrics table below.
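As a rough sketch of how an operator script might interpret one entry of the workers snapshot, the function below applies the "no success within a multiple of the tick interval" rule. It assumes last_success has already been converted to epoch seconds (the wire format is not specified here), and the three state labels are hypothetical.

```python
def classify_worker(snapshot, now, interval_seconds, margin=4):
    """Classify one worker entry as "healthy", "failing", or "stalled".

    Assumption: snapshot["last_success"] is epoch seconds, or None if
    the worker has never completed a successful tick.
    """
    last_success = snapshot.get("last_success")
    failures = snapshot.get("consecutive_failures", 0)
    if last_success is None or now - last_success > margin * interval_seconds:
        return "stalled"   # no successful tick inside the alert window
    if failures > 0:
        return "failing"   # recent success, but current ticks are erroring
    return "healthy"
```

The default margin of 4 follows the `> 4 * interval` example used for the Prometheus alert on s3o_worker_last_success_timestamp_seconds.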
Grafana dashboard
A comprehensive Grafana dashboard is included at grafana/s3-orchestrator.json. Import it via Grafana’s UI (Dashboards → Import → Upload JSON file) or provision it from disk. It expects a Prometheus datasource with UID prometheus.
The dashboard covers all emitted metrics, organised by domain into rows: overview, quota & storage, request performance, backend operations, manager operations, circuit breaker & degraded mode, replication, usage tracking, rate limiting & rejections, rebalancer, drain & lifecycle, cleanup queue & audit, encryption, object data cache, integrity verification, Redis, over-replication cleanup, pending PUT intents, and authentication (streaming SigV4). Rows for less frequently inspected domains are collapsed by default.
Key Prometheus metrics
If telemetry.metrics.enabled is true, metrics are exposed at /metrics. Two listener modes:
- Inline (telemetry.metrics.listen empty, the default): /metrics is served on the same listener as the public S3 API. Convenient for single-port deployments and the docker-compose / local-dev workflow. In this mode /debug/pprof/* endpoints are not registered: mounting pprof on the public S3 listener would leak runtime internals (de-anonymized stack frames via /debug/pprof/heap, the command line and flag values via /debug/pprof/cmdline) and offer a DoS amplifier (/debug/pprof/profile?seconds=300 triggers minutes of CPU profiling on demand).
- Dedicated listener (telemetry.metrics.listen set, e.g. 127.0.0.1:9001): /metrics and /debug/pprof/* are both mounted on the dedicated listener. Bind to 127.0.0.1 or a private network interface so the surface is only reachable from operators inside the trust boundary. The nomad demo uses 0.0.0.0:9001 so the docker-compose Prometheus container can scrape via the bridge gateway; production deployments should tighten this to 127.0.0.1:9001 or a private network address.
Once the dedicated listener is configured, captures look like:
The standard net/http/pprof endpoints are available: /debug/pprof/, /debug/pprof/profile, /debug/pprof/heap, /debug/pprof/allocs, /debug/pprof/goroutine, /debug/pprof/block, /debug/pprof/mutex, /debug/pprof/trace, /debug/pprof/cmdline, /debug/pprof/symbol.
Key metrics to alert on:
| Metric | What to watch |
|---|---|
| s3o_quota_bytes_available{backend="..."} | Alert when approaching 0: backend is almost full (accounts for orphan bytes) |
| s3o_quota_orphan_bytes{backend="..."} | Elevated values mean backends have physically unreleased space from pending cleanups |
| s3o_circuit_breaker_state{name="database"} | Alert when > 0: database is unreachable (1=open, 2=half-open) |
| s3o_circuit_breaker_state{name="<backend>"} | Alert when > 0: backend is unreachable or credentials expired |
| s3o_replication_pending | Alert when consistently > 0: replicas are falling behind |
| s3o_replication_health_copies_total | Non-zero means health-aware replication is creating replacement copies for circuit-broken backends |
| s3o_over_replication_pending | Objects with more copies than the replication factor; should return to 0 after cleanup runs |
| s3o_over_replication_errors_total | Cleanup errors; indicates backends or metadata issues preventing excess copy removal |
| s3o_requests_total{status_code="5xx"} | Alert on elevated 5xx rates |
| s3o_http_panic_recovered_total{route} | Any non-zero rate is an alert: a handler panicked and the recovery middleware returned a 500. Pivot via the matching http.PanicRecovered audit entry for the captured stack and request id |
| s3o_degraded_write_rejections_total | Writes being rejected due to degraded mode |
| s3o_usage_limit_rejections_total | Operations rejected by usage limits |
| s3o_rate_limit_rejections_total | Requests rejected by per-IP rate limiting |
| s3o_admission_rejections_total | Requests rejected at the hard admission limit |
| s3o_load_shed_total | Requests probabilistically shed before the hard admission limit |
| s3o_early_rejections_total | Uploads rejected before body transmission (no backend capacity) |
| s3o_list_pages_capped_total | Non-zero rate means real workloads are hitting the ListObjects page cap; profile before tuning |
| s3o_cleanup_queue_depth | Alert when consistently > 0: orphaned objects are failing cleanup |
| s3o_cleanup_queue_processed_total{status="exhausted"} | Items that exceeded max retries; graduated to the DLQ |
| s3o_cleanup_dlq_depth | Alert when > 0: at least one unrecoverable orphan needs operator action |
| s3o_cleanup_dlq_enqueued_total{backend="..."} | Rate of graduations per backend; one backend dominating means its delete path is broken |
| s3o_cleanup_queue_stale_claims_recovered_total{backend="..."} | Non-zero rate means a worker died mid-process or cleanup_queue.claim_grace_period is shorter than realistic worst-case row processing time |
| s3o_cleanup_enqueue_failures_total{backend,reason,stage} | Alert on any non-zero rate of stage="enqueue": backend object exists with no cleanup_queue row (orphan-leak risk). Pivot via the matching storage.OrphanEnqueueFailed audit event for backend/key/size, then run POST /admin/api/reconcile after DB recovery |
| s3o_audit_events_total{event="..."} | Audit log volume by event type, useful for detecting unusual activity |
| s3o_pending_intents_enqueued_total | Rate of PUT intents inserted (write-path PUT-before-COMMIT pattern). Should track the PutObject rate closely; a sustained gap suggests the pending pattern is bypassed |
| s3o_pending_intents_resolved_total{status} | Pending intents resolved by status (committed = atomic commit happy path, promoted = reaper found backend object and promoted the intent, dropped = reaper found HEAD 404 and dropped the intent, ambiguous = HEAD failed for non-404 reasons, already_resolved = race). A sustained ambiguous rate means the reaper cannot reach the backend |
| s3o_pending_intents_depth | Current unresolved intents. Alert when consistently above the batch_size of the reaper: the reaper is not keeping up |
| s3o_drain_active | Count of in-flight backend drain operations (Inc/Dec so concurrent drains compose); 0 means no drains are running. Page on s3o_drain_active > 0 for 6h (drain stuck) |
| s3o_drain_race_aborted_total | PutObject attempts aborted after drain started mid-write. Any non-zero rate is benign (the orchestrator recovers automatically) but a sustained rate suggests longer-than-expected gaps between EligibleForWrite and the backend PUT, typically very large objects against a fast-draining backend |
| time() - s3o_worker_last_success_timestamp_seconds{service="..."} | Alert when greater than the worker’s expected tick interval times a margin (e.g. > 4 * interval): the service has not completed a successful tick in that window |
| s3o_worker_consecutive_failures{service="..."} | Alert when consistently > 0: the service is running but every tick fails; logs and /admin/api/workers carry the underlying error |
| rate(s3o_worker_ticks_total{result="error"}[15m]) | Persistent error rate; alongside consecutive_failures distinguishes flapping from sustained failure |
| s3o_auth_streaming_requests_total{variant} | Rate of streaming-payload SigV4 PUTs by variant; track which client SDKs are sending streaming uploads |
| s3o_auth_streaming_rejections_total{reason} | Alert on any non-zero rate: every increment is a chunk-validation failure (tampered body, malformed framing, length mismatch, or signature mismatch) |
| s3o_encryption_errors_total{op,error_type} | Any non-zero rate indicates encryption/decryption failures. error_type="stream_failed" specifically flags transport errors that surfaced mid-stream (after the encryptor/decryptor was constructed). |
| s3o_encrypt_existing_objects_total{status="error"} | Failures during bulk encryption of existing data |
| s3o_decrypt_existing_objects_total{status="error"} | Failures during bulk decryption of existing data |
| s3o_key_rotation_objects_total{status="error"} | Failures during key rotation |
| s3o_redis_fallback_active | Alert when 1: Redis is unavailable, using local counters |
| s3o_redis_operations_total{operation,status} | Track Redis operation success/error rates |
| s3o_cache_hits_total / s3o_cache_misses_total | Cache hit ratio; low hit rate may indicate the cache is undersized or the workload is not read-heavy |
| s3o_cache_evictions_total | High eviction rate suggests max_size is too small for the working set |
| s3o_cache_size_bytes / s3o_cache_entries | Current cache utilization; watch for the cache staying near max_size |
Structured logs
All logs are JSON to stdout. Key fields: msg, level, error, backend, operation.
Audit logs are a subset of structured logs with "audit": true. Every S3 API request and significant internal operation emits an audit entry with a request_id for correlation. Filter audit entries in your log pipeline with a JSON query on the audit field.
Key audit events:
| Event | Source | Description |
|---|---|---|
| s3.PutObject, s3.GetObject, s3.DeleteObjects, etc. | HTTP layer | S3 API request with method, path, bucket, status, duration |
| storage.PutObject, storage.GetObject, storage.DeleteObjects, etc. | Storage layer | Backend operation with key, backend name, size |
| rebalance.start, rebalance.move, rebalance.complete | Rebalancer | Object redistribution runs |
| replication.start, replication.copy, replication.complete | Replicator | Replica creation runs |
| storage.MultipartCleanup | Multipart cleanup | Stale upload cleanup |
| cleanup_queue.processed | Cleanup queue | Orphaned object successfully deleted on retry |
| cleanup_queue.already_absent | Cleanup queue | Backend DELETE returned 404: the object is already gone. Row dropped as idempotent success instead of retrying nine more times |
| cleanup_queue.claim_recovered | Cleanup queue | A row whose claim aged past the grace period was reclaimed by a different worker tick (typical after a process crash or rolling-deploy overlap) |
| cleanup_queue.exhausted_to_dlq | Cleanup queue | Row graduated to cleanup_dlq after exhausting retries |
| over_replication.start, over_replication.complete, over_replication.remove | Over-replication cleaner | Surplus replica removal cycle; .remove carries the per-copy decision |
| pending_reaper.promoted | Pending reaper | The reaper HEADed the backend, found the object, and promoted the pending intent into a committed object_locations row |
| pending_reaper.dropped | Pending reaper | The reaper HEADed the backend and got 404. No orphan exists; the intent row is dropped |
| pending_reaper.superseded | Pending reaper | A later write for the same key completed and superseded the pending intent before the reaper resolved it |
| storage.OrphanEnqueueFailed | Coordinator | The cleanup-queue enqueue path itself failed after a successful backend write (DB outage). Carries backend / key / size / stage so an operator can reconcile manually once DB connectivity returns |
| storage.UploadPart | Multipart | A multipart part upload completed successfully |
| http.PanicRecovered | HTTP panic-recovery middleware | A handler panicked and the recovery layer returned a 500 to the client. Carries route, method, path, and the panic value. The matching error-level slog entry carries the captured stack trace |
Each S3 API request produces two correlated audit entries (HTTP-level and storage-level) sharing the same request_id. Internal operations (rebalance, replication) generate their own correlation IDs. The request_id also appears as a s3o.request_id attribute on OpenTelemetry spans.
HTTP panic recovery
A panic inside an HTTP handler is caught by the panic-recovery middleware applied to the S3 and admin route groups. The recovery contract:
- The client gets a structured response, not a connection reset. S3 routes receive an XML <Error><Code>InternalError</Code><Message>...Request ID: ...</Message></Error> body with HTTP 500; admin routes receive the same shape as a JSON {"error": "...Request ID: ..."}.
- The Prometheus counter s3o_http_panic_recovered_total{route} increments. A non-zero value is an immediate alert candidate.
- A slog.ErrorContext line is written at level ERROR with component=httputil, the route, the method and path, the panic value, the captured Go stack, and a trace_id/span_id if a span was active when the panic occurred.
- An http.PanicRecovered audit entry is emitted with the same correlation request_id as the failing request.
- If an OpenTelemetry span was active on the request, it is marked as failed via SetStatus(Error) + RecordError so traces in Tempo highlight the failure.
UI routes are intentionally not wrapped in the first iteration (mounted on the same mux as the S3 catch-all; the bulk of panic risk is on S3 and admin anyway). The recovery message deliberately does not echo the panic value to the client; only the request ID is returned so support tickets can be correlated back to the orchestrator log line.
Clients can supply their own correlation ID via the X-Request-Id request header; otherwise the orchestrator generates one. The ID is returned in the X-Amz-Request-Id response header.
Trace-to-log correlation — JSON log output includes trace_id and span_id fields on every line emitted within an active OpenTelemetry span. Log aggregators like Grafana Loki can extract these fields to link directly from a log entry to the corresponding trace in Tempo, and vice versa.
Admin API Reference
All /admin/api/* endpoints require the X-Admin-Token header. Set the token via the S3O_ADMIN_TOKEN environment variable or the admin.token config key. All request and response bodies are JSON; request bodies are capped at 1 MiB.
Common error responses are documented under each section; for any unexpected server-side failure the response is {"error": "<short message>"} with HTTP 500 and the original error is logged with the request_id.
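For scripting against the admin API without the CLI, a request can be built with the standard library alone. This is a minimal sketch: the helper names are hypothetical, the base URL is whatever address your instance listens on, and only the X-Admin-Token header, JSON bodies, and the 1 MiB body cap come from the description above.

```python
import json
import urllib.request

MAX_BODY_BYTES = 1024 * 1024  # request bodies are capped at 1 MiB


def admin_request(base_url, path, token, method="GET", body=None):
    """Build (but do not send) an authenticated admin API request."""
    data = None
    if body is not None:
        data = json.dumps(body).encode("utf-8")
        if len(data) > MAX_BODY_BYTES:
            raise ValueError("request body exceeds the 1 MiB cap")
    req = urllib.request.Request(
        base_url.rstrip("/") + path, data=data, method=method
    )
    req.add_header("X-Admin-Token", token)
    if data is not None:
        req.add_header("Content-Type", "application/json")
    return req


def send(req, timeout=10):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

A call such as `send(admin_request("http://127.0.0.1:9000", "/admin/api/status", token))` would then return the decoded status document. Note that `urlopen` raises `urllib.error.HTTPError` for 4xx and 5xx responses, so callers that want the `{"error": ...}` body need to catch it.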
Health & status
GET /admin/api/status
Backend health snapshot: per-backend bytes used, bytes limit, object count, API requests, egress, ingress, plus the database circuit-breaker state.
Response (200):
GET /admin/api/reload-status
Most recent config-reload result. Returns {"status": "no_reload_yet"} on a freshly started instance that has not yet had a SIGHUP. After a reload, returns the structured result (generation, summary of diffs, validation outcome, errors).
GET /admin/api/workers
JSON snapshot of every registered background service’s last-tick state. See Background worker health above for the response schema and full operator workflow. Returns 503 when the lifecycle manager is not wired (proxy-only deployments).
GET /admin/api/log-level · PUT /admin/api/log-level
Inspect or change the runtime log level without restarting. Accepted levels: debug, info, warn, error.
The change is logged via an INFO audit entry. Useful during incidents — flip to debug for a minute, capture the failure mode, then flip back.
Object inspection
GET /admin/api/object-locations?key=<key>
Returns all backend copies of a single object key, with backend name, size, and encryption metadata. Useful for debugging “where did this object actually land?” and for verifying that the replication factor is satisfied.
Returns 400 when key is missing.
Cleanup & quota
GET /admin/api/cleanup-queue
Current cleanup queue depth plus a sample of up to 50 pending rows. The web dashboard’s cleanup panel uses this endpoint.
Response includes depth (total pending count) and items (sample with id, backend, key, reason, attempts, next_retry).
POST /admin/api/usage-flush
Forces an out-of-band flush of the in-memory usage counters to the database. Use after a manual quota adjustment to make the new value visible immediately instead of waiting up to usage_flush.interval seconds. Same effect as the scheduled flush; the operation is idempotent.
Replication & over-replication
POST /admin/api/replicate
Triggers a single replication cycle synchronously. Useful after a backend recovery to fast-forward repair without waiting for the next scheduled tick.
GET /admin/api/over-replication · POST /admin/api/over-replication[?batch_size=N]
The GET form reports how many object keys currently have more copies than the configured replication.factor requires. The POST form runs one cleanup pass that removes surplus replicas (preferring drained, circuit-broken, and most-utilised backends — see the OverReplicationCleaner section in the Background Services diagram).
batch_size defaults to the configured value; the POST form clamps user input to 10000.
Drain & backend lifecycle
POST /admin/api/backends/{name}/drain
Begins draining {name}. Subsequent writes filter the backend out of eligibility (drain race re-check in objects_write.go covers the in-flight case), and the drain worker starts migrating existing objects to other healthy backends.
Returns 400 if the backend is unknown or already draining.
GET /admin/api/backends/{name}/drain
Returns the drain’s current state: total objects/bytes moved, total remaining, started_at, last_progress_at, and any per-batch errors. Poll this during a drain to track progress.
DELETE /admin/api/backends/{name}/drain
Cancels the active drain. In-flight migrations finish; new migrations stop. The backend’s eligibility filter is re-enabled. Useful when a drain was triggered by accident or when the intended target’s free space turns out to be insufficient.
DELETE /admin/api/backends/{name}[?purge=true&confirm=<token>]
Non-purge form (default): drops the backend’s DB records (object_locations, quota, usage). The backend’s actual S3 objects are left in place — reversible by re-adding the backend and running s3-orchestrator admin sync. Use this when retiring a backend whose data has already been migrated.
Purge form (two-phase, irreversible): also deletes every S3 object on the backend.
The confirmation token is HMAC-bound to the backend name and expires after 5 minutes, so the preview cannot be replayed against a different backend or after a typo.
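The idea of a confirmation token that is bound to one backend name and expires after five minutes can be sketched as follows. The token layout (`<expiry>.<hex MAC>`), the function names, and the server-side secret are assumptions made for illustration; the orchestrator's actual token format is not documented here.

```python
import hashlib
import hmac

TOKEN_TTL_SECONDS = 300  # 5 minutes, as described above


def issue_token(secret: bytes, backend: str, now: float) -> str:
    """Create a token that is only valid for `backend` until it expires."""
    expires = int(now) + TOKEN_TTL_SECONDS
    msg = f"{backend}:{expires}".encode("utf-8")
    mac = hmac.new(secret, msg, hashlib.sha256).hexdigest()
    return f"{expires}.{mac}"


def verify_token(secret: bytes, backend: str, token: str, now: float) -> bool:
    """Check expiry first, then the MAC over (backend, expiry)."""
    expires_str, _, mac = token.partition(".")
    if not (expires_str.isascii() and expires_str.isdigit()):
        return False
    if int(expires_str) < now:
        return False  # token is older than the TTL
    msg = f"{backend}:{expires_str}".encode("utf-8")
    expected = hmac.new(secret, msg, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(mac.encode("utf-8"), expected.encode("utf-8"))
```

Because the backend name is part of the signed message, a token previewed for one backend fails verification when replayed against another, which is the property the two-phase purge relies on.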
Encryption operations
POST /admin/api/rotate-encryption-key
Re-wraps every DEK still wrapped with the named old key using the current primary master key. Required after a key rotation (see Rotating encryption keys below).
old_key_id formats: inline keys → config-0 (primary), config-1, …; file-based → file-0; Vault Transit → the key name. Returns 400 if encryption is not enabled or the body is malformed.
POST /admin/api/encrypt-existing
Bulk-encrypts every object that does not yet have encryption metadata. The orchestrator downloads each object, encrypts it, re-uploads the ciphertext, and updates the DB record. Idempotent — re-running only processes objects that are still plaintext.
Counts against backend API quotas (one GET + one PUT per object).
POST /admin/api/decrypt-existing
Inverse of the above: downloads every encrypted object, decrypts it, re-uploads plaintext, and clears the encryption metadata. The orchestrator must still be configured with the key provider (the DEKs need to be unwrapped). Use only when permanently removing encryption from a deployment.
Integrity
POST /admin/api/scrub[?batch_size=N]
Runs one integrity scrub pass: reads random objects from each backend, computes the SHA-256, and compares to the stored content_hash. On mismatch the bad copy is enqueued for cleanup with reason integrity_scrub_failed. The scrubber’s scheduled runs are independent — this endpoint just triggers an immediate pass.
POST /admin/api/backfill-checksums[?batch_size=N]
Computes and stores content_hash for every object that does not have one yet. Use after enabling integrity.enabled: true on an existing deployment so the scrubber has hashes to compare against. Paginates internally until done.
Reconciliation
POST /admin/api/reconcile[?backend=<name>]
On-demand reconciler pass. See On-demand reconciliation above for the operator workflow and the bounded-memory sorted-merge details.
Cache management
The object-data cache is optional (cache.enabled: true). All cache endpoints return 503 {"status":"disabled","reason":"object data cache is not enabled"} when the cache is not configured, so callers can distinguish “no cache” from “cache empty after flush”.
GET /admin/api/cache
Current utilization: entry count, bytes used, configured max. Mirrors the s3o_cache_* gauges so operators without Prometheus access can still inspect cache state.
POST /admin/api/cache/flush
Drops every entry from the cache. Used by load-test tooling to characterise cache-cold GET performance. The response carries entries_cleared so the caller can confirm the cache was actually populated before the flush.
DELETE /admin/api/cache/keys/{key...}
Drops a single object from the cache. The {key...} wildcard pattern accepts keys with embedded slashes. Always returns 200 — invalidating an unknown key is a no-op, matching the cache’s own contract.
DELETE /admin/api/cache/prefix?prefix=<prefix>
Drops every cached key under the given prefix. Useful after a bulk update for one tenant or directory. Returns 400 if prefix is empty — use /admin/api/cache/flush to clear the whole cache.
Common Operations
Reloading configuration
Many settings can be updated without restarting the orchestrator by sending SIGHUP:
- Edit the config file with your changes.
- Send SIGHUP to the running process:
- Check the logs to confirm the reload succeeded:
What takes effect immediately:
- Log level (server.log_level), which can also be changed at runtime via s3-orchestrator admin log-level -set debug
- Bucket credentials (add/remove/rotate credentials without downtime)
- Rate limit settings (requests per second, burst)
- Backend quota limits (quota_bytes)
- Backend usage limits (api_request_limit, egress_byte_limit, ingress_byte_limit)
- Rebalance settings (strategy, interval, batch size, threshold, enable/disable)
- Replication settings (factor, worker interval, batch size, unhealthy threshold)
- Usage flush settings (interval, adaptive enabled/threshold/fast interval)
What requires a restart:
- server.listen_addr, server timeouts, server.shutdown_delay, database, telemetry, ui, routing_strategy, encryption, redis
- Backend structural changes (endpoint, S3 credentials, adding/removing backends)
If any of these fields change, the reload still proceeds for the reloadable settings, and warnings are logged:
If the config file is invalid, the orchestrator keeps the current configuration entirely and logs the parse/validation error:
No partial reload happens — either all reloadable settings update, or none do.
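Before sending SIGHUP it can help to know which of your edits will only produce a warning. The sketch below classifies changed config keys using the restart-required list from this section. It is not the orchestrator's own logic: the function names are hypothetical, and treating every key outside the list as reloadable is a simplifying assumption.

```python
# Settings this guide lists as requiring a restart. "server timeouts" and
# backend structural changes are left out because their exact key names
# are not given here.
RESTART_REQUIRED = (
    "server.listen_addr",
    "server.shutdown_delay",
    "database",
    "telemetry",
    "ui",
    "routing_strategy",
    "encryption",
    "redis",
)


def needs_restart(path: str) -> bool:
    """True if a changed key equals, or sits under, a restart-only entry."""
    return any(path == p or path.startswith(p + ".") for p in RESTART_REQUIRED)


def plan_reload(changed_paths):
    """Split changed keys into (applied on SIGHUP, deferred until restart).

    Assumption: anything not in RESTART_REQUIRED is treated as reloadable.
    """
    applied = sorted(p for p in changed_paths if not needs_restart(p))
    deferred = sorted(p for p in changed_paths if needs_restart(p))
    return applied, deferred


applied, deferred = plan_reload(
    ["server.log_level", "database.max_conns", "server.listen_addr"]
)
print("applied on reload:", applied)
print("needs restart:", deferred)
```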
Adding a new backend
Add the backend to the backends list in your config and restart the orchestrator. Backend count changes are not reloadable — a restart is required. Quota limits are synced to the database on startup.
Draining a backend
Draining migrates all objects off a backend to other backends without data loss. Use this when you are decommissioning a backend and want to preserve all stored objects.
Start the drain:
This immediately excludes the backend from new writes (PutObject and CreateMultipartUpload skip it) and begins migrating objects in batches of 100. Any in-progress multipart uploads on the backend are aborted first.
Monitor progress:
Returns objects remaining, bytes remaining, objects moved so far, and whether the drain is still active. Poll this periodically until active is false and objects_remaining is 0.
Wait for completion. The drain runs as a background goroutine. Each object is read from the source backend, written to the least-utilized eligible backend, and the database record is atomically swapped via compare-and-swap. Failed moves are logged but don’t stop the drain.
Remove the backend from config and restart:
After the drain completes, DeleteBackendData cleans up remaining database records (usage, quota, cleanup queue) automatically. Removing the backend from config on restart prevents it from being re-initialized.
Cancelling a drain:
Objects already moved are not rolled back. The backend becomes eligible for new writes again.
Metrics to watch during drain:
| Metric | Description |
|---|---|
| s3o_drain_active | 1 while a drain is in progress |
| s3o_drain_objects_moved_total | Objects successfully migrated |
| s3o_drain_bytes_moved_total | Bytes migrated |
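The completion condition from the "Monitor progress" step (active is false and objects_remaining is 0) can be wrapped in a small polling loop. This is a minimal sketch: fetch_status is a hypothetical callable standing in for however you query drain progress, and the interval and polling budget are arbitrary choices.

```python
import time
from typing import Callable, Mapping


def drain_finished(status: Mapping) -> bool:
    """Completion test from this guide: not active and nothing remaining."""
    return not status["active"] and status["objects_remaining"] == 0


def wait_for_drain(fetch_status: Callable[[], Mapping],
                   interval_s: float = 10.0,
                   max_polls: int = 360) -> Mapping:
    """Poll until the drain finishes and return the final status.

    fetch_status is hypothetical: wrap your CLI call or admin API request.
    """
    for _ in range(max_polls):
        status = fetch_status()
        if drain_finished(status):
            return status
        time.sleep(interval_s)
    raise TimeoutError("drain did not finish within the polling budget")


# Demo with canned responses instead of a real orchestrator.
responses = iter([
    {"active": True, "objects_remaining": 250},
    {"active": True, "objects_remaining": 0},
    {"active": False, "objects_remaining": 0},
])
final = wait_for_drain(lambda: next(responses), interval_s=0)
print(final)
```

A cancelled drain reports active as false while objects remain, so this loop would run until its budget is exhausted; the timeout is there to surface that case rather than wait forever.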
Removing a backend
Removing deletes all database records for a backend. This is destructive — objects on that backend become inaccessible. Use drain first if you want to preserve data.
Drop database records only (objects remain on the backend’s S3 storage):
Preview what purge would destroy (dry-run):
Drop database records AND delete S3 objects (requires confirmation):
The --purge flag without --confirm shows a preview of what would be destroyed (object count, total bytes) and exits. With --confirm, the CLI obtains a signed confirmation token from the server (valid for 60 seconds) and executes the purge. Individual delete failures are logged but don’t stop the operation.
After removing, edit the config to remove the backend entry and restart.
Note: You cannot remove a backend that is currently draining. Cancel the drain first with drain-cancel.
Important: update the config after drain or remove
Drain and remove state is held in memory only — it is not persisted to the database. This means:
- If the service restarts with a drained/removed backend still in the config, SyncQuotaLimits re-creates the backend’s quota record and the backend is re-initialized as a fresh, empty backend eligible for new writes. No data is lost, but the decommissioned backend silently starts receiving traffic again.
- If the service crashes during an active drain, all drain progress is lost. The backend reverts to active on restart. You would need to restart the drain.
- SIGHUP does not remove backends. For backends, a config reload only updates quota limits and usage limits. The in-memory backend map is set at startup and cannot be modified at runtime.
Always remove the backend from the config file and restart (or redeploy) after a drain or remove operation completes. The dashboard UI shows a pulsing “Draining” badge on backends with an active drain so you can monitor progress visually.
Adjusting quotas
Change quota_bytes in the config and send SIGHUP. Quota limits are synced to the database on reload. Alternatively, restart the orchestrator — SyncQuotaLimits also runs on startup.
Enabling replication after initial setup
Add a replication section with factor > 1 and send SIGHUP (or restart). When restarting, the replication worker runs immediately at startup to begin creating copies of existing objects, then continues at the configured interval. With SIGHUP, the new factor takes effect on the next worker tick.
Remember: the replication factor cannot exceed the number of backends.
Enabling encryption on existing data
If you enable encryption on an orchestrator that already has unencrypted objects, those objects remain unencrypted until you explicitly encrypt them. New objects are encrypted automatically; existing ones need the encrypt-existing admin API.
Enable encryption in the config and restart the orchestrator.
Encrypt existing objects:
This processes all unencrypted objects in batches of 100: downloads from the backend, encrypts, re-uploads the ciphertext (overwriting the plaintext), and updates the database record. The response shows progress:
Monitor via the s3o_encrypt_existing_objects_total metric (labels: success, error).
Failed objects are logged individually and can be retried by calling encrypt-existing again — it only processes objects without encryption metadata.
Disabling encryption / decrypting existing data
To remove encryption from all objects and restore plaintext on backends, use the decrypt-existing admin API. Encryption must still be configured when you run this (the orchestrator needs the key provider to unwrap DEKs). Disable encryption in the config after decryption completes.
Decrypt all encrypted objects:
This processes all encrypted objects in batches of 100: downloads ciphertext from the backend, decrypts, re-uploads plaintext (overwriting the ciphertext), and clears encryption metadata in the database. Each object costs 2 API calls (one GET, one PUT) plus egress and ingress against the backend’s usage quota. The response shows progress:
Disable encryption in the config and restart the orchestrator.
Monitor via the s3o_decrypt_existing_objects_total metric (labels: success, error).
Failed objects are logged individually and can be retried by calling decrypt-existing again — it only processes objects with encryption metadata.
Both encrypt-existing and decrypt-existing keep backend_quotas.bytes_used consistent with the on-disk byte count: each object is rewritten at a different size (encryption inflates by per-chunk overhead, decryption removes it), and the per-backend counter advances by the size delta inside the same transaction as the metadata update. No manual reconciliation against SUM(object_locations.size_bytes) is needed after a run.
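To budget a decrypt-existing run against a backend's usage limits, the per-object cost stated above (one GET plus one PUT, with egress for the ciphertext and ingress for the plaintext) can be summed ahead of time. The sketch below is an estimate only: the names are hypothetical, and the object sizes in the demo are invented, since the actual per-chunk overhead is not specified in this guide.

```python
from dataclasses import dataclass


@dataclass
class RewriteCost:
    api_requests: int      # one GET plus one PUT per object
    egress_bytes: int      # ciphertext downloaded from the backend
    ingress_bytes: int     # plaintext uploaded back to the backend
    bytes_used_delta: int  # net change to backend_quotas.bytes_used


def decrypt_existing_cost(sizes) -> RewriteCost:
    """sizes: iterable of (ciphertext_bytes, plaintext_bytes) per object."""
    sizes = list(sizes)
    egress = sum(cipher for cipher, _ in sizes)
    ingress = sum(plain for _, plain in sizes)
    return RewriteCost(
        api_requests=2 * len(sizes),
        egress_bytes=egress,
        ingress_bytes=ingress,
        # Decryption removes the overhead, so the delta is negative.
        bytes_used_delta=ingress - egress,
    )


# Invented sizes for illustration only.
cost = decrypt_existing_cost([(1_048_604, 1_048_576), (4_194_416, 4_194_304)])
print(cost)
```

For an encrypt-existing run the same arithmetic would presumably apply with the roles swapped (plaintext egress, ciphertext ingress, positive delta), though this guide states the per-object API cost only for decryption.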
Rotating encryption keys
Key rotation re-wraps DEKs with a new master key without re-encrypting object data. This is a metadata-only operation and is fast regardless of object sizes.
Generate a new master key:
Update the config: set the new key as master_key and move the old key to previous_keys:
Restart the orchestrator (encryption config is not reloadable).
Re-wrap all DEKs:
The old-key-id identifies which key’s DEKs to re-wrap. For inline config keys, the ID is config-0 for the primary and config-1, config-2, etc. for previous keys in order. For file-based keys, the ID is file-0. For Vault Transit, it’s the key name.
The response shows progress:
After all DEKs are re-wrapped, you can optionally remove the old key from previous_keys and restart. Objects that were rotated no longer need the old key.
Metrics to watch during rotation:
| Metric | Description |
|---|---|
| s3o_key_rotation_objects_total{status="success"} | DEKs successfully re-wrapped |
| s3o_key_rotation_objects_total{status="error"} | DEKs that failed re-wrapping |
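The key-ID naming rule for inline config keys described above is easy to get off by one. The sketch below simply enumerates the IDs for a config as currently loaded; the function name is hypothetical and it does not inspect a real config file.

```python
def inline_key_ids(previous_key_count: int) -> list:
    """IDs for inline config keys, following the rule in this guide.

    config-0 is master_key; config-1, config-2, ... are the entries of
    previous_keys in the order they appear.
    """
    if previous_key_count < 0:
        raise ValueError("previous_key_count must not be negative")
    return [f"config-{i}" for i in range(previous_key_count + 1)]


# A config with master_key plus two entries in previous_keys.
print(inline_key_ids(2))
```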
Rotating client credentials
Update the credentials in the bucket config and send SIGHUP. The new credentials take effect immediately and old credentials stop working. Coordinate with the tenant to update their client configuration at the same time.
Example: rotating credentials without downtime
To perform a zero-downtime credential rotation, temporarily add both old and new credentials:
- Add the new credential alongside the old one:
- Send SIGHUP. Both credentials now work.
- Update the client to use the new credentials.
- Remove the old credential from the config and send SIGHUP again.
Deployment
Nomad and Kubernetes
Production-ready manifests for both platforms are in deploy/. Each includes a local demo script that stands up a complete environment in one command:
See deploy/README.md for production deployment instructions, Vault integration, TLS/mTLS configuration, and Ingress setup.
Multi-instance deployment
By default, every instance runs both the HTTP API and all background workers (--mode=all). For larger deployments, the --mode flag separates these roles:
| Mode | HTTP API | Background workers | Use case |
|---|---|---|---|
| all (default) | Yes | All 6 services | Single-instance or small deployments |
| api | Yes | Usage flush only | Scale-out API instances behind a load balancer |
| worker | Health + metrics only | All 6 services | Dedicated background processing |
How it works:
- API instances serve S3 requests, the web UI, and rate limiting. They run the usage-flush service to avoid losing counters on restart, but skip all advisory-locked background tasks.
- Worker instances run all background services and expose /health, /health/ready, and /metrics for monitoring, but don’t serve S3 traffic or the web UI.
- Background tasks that modify state (rebalancer, replicator, cleanup, lifecycle, multipart cleanup) use PostgreSQL advisory locks, so only one instance cluster-wide executes each task per cycle. Running multiple worker instances is safe; extra instances simply skip cycles when the lock is held. The circuit breaker watchdog runs on all instances without a lock (it operates on per-instance circuit state).
Recommended topology: N api instances behind a load balancer + 1–2 worker instances for redundancy. All instances share the same config file and PostgreSQL database.
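The "skip the cycle when the lock is held" behaviour that makes extra worker instances safe can be illustrated in-process. The sketch below swaps the PostgreSQL advisory lock for a threading.Lock with non-blocking acquire; PostgreSQL offers comparable try-lock semantics through pg_try_advisory_lock, but whether the orchestrator uses that exact function is an assumption. All class and function names are hypothetical.

```python
import threading


class TaskLock:
    """In-process stand-in for a PostgreSQL advisory lock (try-lock only)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        # Returns immediately instead of waiting for the holder.
        return self._lock.acquire(blocking=False)

    def release(self) -> None:
        self._lock.release()


def run_cycle(instance: str, lock: TaskLock, task, log: list) -> bool:
    """Run task if the lock is free; otherwise skip this cycle."""
    if not lock.try_acquire():
        log.append(f"{instance}: lock held elsewhere, skipping cycle")
        return False
    try:
        task()
        log.append(f"{instance}: ran task")
        return True
    finally:
        lock.release()


log = []
lock = TaskLock()


def worker1_task() -> None:
    # worker-2's cycle fires while worker-1 still holds the lock.
    run_cycle("worker-2", lock, lambda: None, log)


run_cycle("worker-1", lock, worker1_task, log)
print("\n".join(log))
```

Because a skipped cycle does no work and holds nothing, adding a second worker instance costs little and provides failover if the first one stops.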
Docker
The VERSION is baked into the binary via -ldflags and displayed in the web UI and /health endpoint. Use versioned tags (not latest) to avoid Docker layer caching issues on orchestration platforms.
The default entrypoint is s3-orchestrator -config /etc/s3-orchestrator/config.yaml. Mount your config file to that path, or override the command to use a different location.
Environment variables referenced in the config via ${VAR} syntax are expanded at startup, so pass secrets as -e flags or via your orchestration platform’s secret injection.
The listen_addr in your config determines which port the process binds to inside the container — make sure your -p mapping matches.
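A mismatch between listen_addr and the published port is easy to miss. The following sketch compares the two, assuming listen_addr has a host:port shape such as ":9000" and that the -p value is a single-port mapping (port ranges are not handled). The helper names and example values are hypothetical.

```python
def listen_port(listen_addr: str) -> int:
    """Port from a host:port listen_addr such as ":9000" or "0.0.0.0:9000"."""
    return int(listen_addr.rsplit(":", 1)[1])


def container_port(publish: str) -> int:
    """Container-side port of a docker -p value.

    Handles "9000", "8080:9000", and "127.0.0.1:8080:9000/tcp".
    """
    return int(publish.rsplit(":", 1)[-1].split("/", 1)[0])


def mapping_matches(listen_addr: str, publish: str) -> bool:
    """True if the -p mapping targets the port the process binds to."""
    return listen_port(listen_addr) == container_port(publish)


print(mapping_matches(":9000", "8080:9000"))
print(mapping_matches("0.0.0.0:9000", "9000:8080"))
```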
Debian Package (Systemd)
Build a .deb package for bare-metal or VM deployments:
Install and configure:
The package installs:
| Path | Purpose |
|---|---|
| /usr/bin/s3-orchestrator | Binary |
| /etc/s3-orchestrator/config.yaml | Configuration (conffile, preserved on upgrade) |
| /etc/default/s3-orchestrator | Environment variables for ${VAR} expansion in config |
| /usr/lib/systemd/system/s3-orchestrator.service | Systemd unit |
| /var/lib/s3-orchestrator/ | Data directory |
The systemd unit runs as a dedicated s3-orchestrator user with filesystem hardening (ProtectSystem=strict, ProtectHome=yes, NoNewPrivileges=yes). The service is enabled on install but not started automatically, allowing configuration before first start.
Config reload works via systemd:
This sends SIGHUP to the process, reloading bucket credentials, quota limits, rate limits, and rebalance/replication settings without downtime. See Reloading configuration for details on what is and isn’t reloadable.
Uninstall: