
Configuration Walkthrough
This page walks through every YAML block in the config, with examples and validation rules. Each subsection below has a corresponding per-topic doc with operational depth:
- buckets → docs/authentication.md for SigV4 + credential semantics
- database → docs/database.md for engine choice + schema + migrations
- backends → docs/backends.md for routing, quotas, provider quick-ref
- rebalance / replication → docs/replication.md
- encryption / integrity → docs/encryption.md
- cache → docs/backends.md (object data cache)
- cleanup_queue / lifecycle / write_path → docs/cleanup-and-lifecycle.md
- telemetry → docs/monitoring.md
- notifications → docs/notifications.md
See packaging/config.yaml for a complete template.
All config values support ${ENV_VAR} expansion — the orchestrator calls os.Expand on the entire YAML file before parsing. Use this for secrets:
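As a rough illustration of what whole-file expansion means, the sketch below runs os.Expand over a small YAML string. The helper name expandConfig, the DEMO_SECRET variable, and the YAML fragment are hypothetical; only os.Expand and os.Getenv come from the Go standard library.

```go
package main

import (
	"fmt"
	"os"
)

// expandConfig substitutes ${VAR} and $VAR references in raw using lookup.
// Whatever lookup returns is inserted verbatim, so an unset variable
// becomes an empty string when lookup is os.Getenv.
func expandConfig(raw string, lookup func(string) string) string {
	return os.Expand(raw, lookup)
}

func main() {
	// Hypothetical YAML fragment; the field name is illustrative only.
	raw := "secret_access_key: ${DEMO_SECRET}\n"
	os.Setenv("DEMO_SECRET", "s3cr3t")
	fmt.Print(expandConfig(raw, os.Getenv))
}
```

Because plain os.Expand treats every $name or ${name} in its input as a reference, a value that contains a literal $ is also subject to expansion when the whole file is processed this way.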
server
listen_addr is the only required field.
max_object_size caps single-PUT uploads. Larger objects should use multipart upload (most clients do this automatically). PutObject buffers the entire body in memory to support write failover across backends, so peak memory from uploads is approximately max_object_size × max_concurrent_writes.
max_concurrent_requests limits the number of S3 requests processed simultaneously. When the limit is reached, new requests are rejected with 503 SlowDown and Retry-After: 1. Set to 2-3x database.max_conns for load shedding. 0 disables the limit.
max_concurrent_reads and max_concurrent_writes provide separate concurrency limits for reads (GET, HEAD) and writes (PUT, POST, DELETE). When both are set, they replace max_concurrent_requests with independent pools so write storms cannot starve reads. Background workers contend with HTTP writes, not reads: cleanup, replication, rebalance, pending reaper, and over-replication acquire admission slots from the same pool sized to max_concurrent_writes. In merged mode (max_concurrent_requests only), every HTTP request and every background worker shares the single global pool. Size max_concurrent_writes to accommodate both peak HTTP write traffic and the worst-case overlap of background worker activity (typically the replication factor × replicator concurrency for the dominant case). See issue #835 for the design rationale.
load_shed_threshold enables active load shedding. When in-flight requests exceed this fraction of pool capacity (e.g. 0.8), new requests are probabilistically rejected before the hard limit, providing smooth degradation instead of a cliff.
admission_wait adds a brief wait before rejecting when the semaphore is full (e.g. 50ms). Smooths micro-bursts without adding latency during sustained overload. Default 0 means instant rejection.
backend_timeout bounds individual S3 API calls to backends. Increase if you have slow backends or large objects.
read_header_timeout protects against slow-read attacks that hold connections open by sending headers slowly. The 10-second default is generous for any legitimate client.
read_timeout and write_timeout bound the total time for reading/writing entire requests and responses. The 5-minute defaults accommodate large object transfers.
idle_timeout controls how long keep-alive connections stay open waiting for the next request.
shutdown_delay adds a pause between marking the instance as not-ready and starting the HTTP drain on SIGTERM. Set this to ~5s in environments where service deregistration is asynchronous (Consul, Kubernetes) so load balancers stop routing before connections are closed. Default 0 means no delay.
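The interaction between a full admission pool and admission_wait can be pictured with a small counting semaphore. This is a minimal sketch under stated assumptions, not the orchestrator's code: the Admission type and its methods are hypothetical names, and probabilistic load shedding is left out.

```go
package main

import (
	"fmt"
	"time"
)

// Admission is a hypothetical counting semaphore with an optional bounded wait.
type Admission struct {
	slots chan struct{}
	wait  time.Duration
}

func NewAdmission(capacity int, wait time.Duration) *Admission {
	return &Admission{slots: make(chan struct{}, capacity), wait: wait}
}

// Acquire reports whether a slot was obtained. A false result corresponds
// to rejecting the request (503 SlowDown in the text above).
func (a *Admission) Acquire() bool {
	select {
	case a.slots <- struct{}{}:
		return true // fast path: a slot was free
	default:
	}
	if a.wait <= 0 {
		return false // wait of 0 means instant rejection
	}
	timer := time.NewTimer(a.wait)
	defer timer.Stop()
	select {
	case a.slots <- struct{}{}:
		return true // a slot freed up during the brief wait
	case <-timer.C:
		return false
	}
}

// Release returns a previously acquired slot to the pool.
func (a *Admission) Release() { <-a.slots }

func main() {
	adm := NewAdmission(1, 50*time.Millisecond)
	fmt.Println(adm.Acquire()) // the first caller takes the only slot
	fmt.Println(adm.Acquire()) // the second caller waits up to 50ms, then is rejected
}
```

In this picture, separate read and write pools would simply be two such semaphores sized independently.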
buckets
Each bucket defines a virtual namespace with one or more credential sets.
Generating credentials: Use openssl rand to produce random keys:
Validation rules:
- Bucket names must not contain /.
- Bucket names must be unique across the config.
- Access key IDs must be globally unique across all buckets.
- Each bucket must have at least one credential set.
- Each credential needs either access_key_id + secret_access_key (SigV4) or token (legacy).
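The validation rules above can be sketched as a single pass over the bucket list. The Bucket and Credential types and the validateBuckets function are hypothetical shapes chosen for illustration; the real loader's types and error messages are not shown in this document.

```go
package main

import (
	"fmt"
	"strings"
)

// Credential and Bucket are hypothetical shapes mirroring the fields named above.
type Credential struct {
	AccessKeyID, SecretAccessKey, Token string
}

type Bucket struct {
	Name        string
	Credentials []Credential
}

// validateBuckets applies the five rules listed above and returns the first violation.
func validateBuckets(buckets []Bucket) error {
	names := map[string]bool{}
	keys := map[string]bool{}
	for _, b := range buckets {
		if strings.Contains(b.Name, "/") {
			return fmt.Errorf("bucket %q: name must not contain /", b.Name)
		}
		if names[b.Name] {
			return fmt.Errorf("bucket %q: duplicate bucket name", b.Name)
		}
		names[b.Name] = true
		if len(b.Credentials) == 0 {
			return fmt.Errorf("bucket %q: at least one credential set is required", b.Name)
		}
		for _, c := range b.Credentials {
			sigv4 := c.AccessKeyID != "" && c.SecretAccessKey != ""
			if !sigv4 && c.Token == "" {
				return fmt.Errorf("bucket %q: credential needs a key pair or a token", b.Name)
			}
			if c.AccessKeyID != "" {
				if keys[c.AccessKeyID] {
					return fmt.Errorf("access key %q is used more than once", c.AccessKeyID)
				}
				keys[c.AccessKeyID] = true // uniqueness is global, not per bucket
			}
		}
	}
	return nil
}

func main() {
	buckets := []Bucket{
		{Name: "photos", Credentials: []Credential{{AccessKeyID: "AK1", SecretAccessKey: "s1"}}},
		{Name: "docs", Credentials: []Credential{{AccessKeyID: "AK1", SecretAccessKey: "s2"}}},
	}
	fmt.Println(validateBuckets(buckets)) // reports the reused access key
}
```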
Multiple credentials on the same bucket let different services share a namespace with independent keys. This is useful when you want a writer service and a reader service accessing the same files.
SigV4 credentials also support presigned URLs automatically. Clients can generate time-limited presigned URLs using any AWS SDK presign client — no additional configuration is needed on the orchestrator side.
database
The driver field selects between SQLite (embedded, zero-dependency) and PostgreSQL (required for multi-instance deployments). When driver is omitted, the orchestrator infers postgres if host is set, otherwise sqlite.
SQLite (default for single-instance):
SQLite requires no external dependencies. The database file is created automatically on first start. Advisory lock-based leader election is replaced by a process-local mutex, so multi-instance deployments are not supported with SQLite.
PostgreSQL (required for multi-instance):
Pool settings (max_conns, min_conns, max_conn_lifetime) control the pgx connection pool. Size max_conns to 2-3x your max_concurrent_requests setting. See Performance Tuning - Connection Pool Sizing for detailed guidance.
routing_strategy
Controls how the orchestrator selects a backend when writing new objects.
- pack (default): fills the first backend in config order until its quota is full, then overflows to the next. Best for stacking free-tier allocations sequentially.
- spread: places each object on the backend with the lowest utilization ratio ((bytes_used + orphan_bytes) / bytes_limit). Best for distributing storage evenly across backends.
Both strategies respect quota limits and usage limits — full or over-limit backends are always skipped.
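A minimal sketch of the two strategies follows. The Backend struct and pick function are hypothetical, usage limits are ignored, and treating an unlimited backend (limit 0) as 0% utilized is an assumption made only to keep the example short.

```go
package main

import "fmt"

// Backend is a hypothetical view of the quota counters named above.
type Backend struct {
	Name                               string
	BytesUsed, OrphanBytes, BytesLimit int64 // BytesLimit 0 means unlimited
}

// fits reports whether an object of the given size stays within quota.
func (b Backend) fits(size int64) bool {
	return b.BytesLimit == 0 || b.BytesUsed+b.OrphanBytes+size <= b.BytesLimit
}

// utilization is (bytes_used + orphan_bytes) / bytes_limit.
// Assumption: an unlimited backend counts as 0% utilized.
func (b Backend) utilization() float64 {
	if b.BytesLimit == 0 {
		return 0
	}
	return float64(b.BytesUsed+b.OrphanBytes) / float64(b.BytesLimit)
}

// pick returns the backend chosen for a new object, or false if none fits.
func pick(strategy string, backends []Backend, size int64) (string, bool) {
	best := -1
	for i, b := range backends {
		if !b.fits(size) {
			continue // full backends are always skipped
		}
		if strategy == "pack" {
			return b.Name, true // first eligible backend in config order
		}
		if best == -1 || b.utilization() < backends[best].utilization() {
			best = i // spread: lowest utilization ratio wins
		}
	}
	if best == -1 {
		return "", false
	}
	return backends[best].Name, true
}

func main() {
	backends := []Backend{
		{Name: "a", BytesUsed: 90, BytesLimit: 100},
		{Name: "b", BytesUsed: 10, BytesLimit: 100},
	}
	fmt.Println(pick("pack", backends, 5))
	fmt.Println(pick("spread", backends, 5))
}
```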
backends
Each backend is an S3-compatible storage service with its own credentials and optional quota.
Endpoint URLs by provider:
| Provider | Endpoint format | force_path_style |
|---|---|---|
| OCI Object Storage | https://<namespace>.compat.objectstorage.<region>.oraclecloud.com | true |
| Backblaze B2 | https://s3.<region>.backblazeb2.com | true |
| AWS S3 | https://s3.<region>.amazonaws.com | false |
| MinIO | http://<host>:9000 | true |
| Wasabi | https://s3.<region>.wasabisys.com | true |
Quota: Set quota_bytes to limit how much data a backend can hold. Set to 0 or omit for unlimited. Quota is tracked in PostgreSQL and updated atomically with every write/delete. Note that multipart uploads do not reserve quota upfront — temporary parts consume backend storage without being counted against the quota until CompleteMultipartUpload records the final object size. A client uploading many large parts could temporarily exceed a backend’s quota before completion.
Max object size: Some providers impose per-object size limits (e.g. Supabase rejects uploads over 50 MB with 413 EntityTooLarge). Set max_object_size to prevent the orchestrator from routing writes, rebalance moves, or replication copies to a backend when the object exceeds the limit:
Usage limits: Optional monthly caps on API requests, egress, and ingress per backend:
When a backend exceeds a usage limit, writes overflow to the next eligible backend. Limits reset each month automatically.
Unsigned payload: By default, uploads stream directly to backends without buffering the entire body in memory. The AWS SDK normally buffers the request body to compute a SigV4 payload hash (SHA-256), but the orchestrator uses UNSIGNED-PAYLOAD to skip this. Without streaming, large uploads (multipart completion, replication) can cause out-of-memory kills.
For HTTPS endpoints, unsigned payload is enabled by default. For plain HTTP endpoints, it is auto-disabled unless explicitly set — AWS S3 rejects unsigned payloads over HTTP, but most S3-compatible backends (MinIO, R2, etc.) accept them. Set unsigned_payload: true on HTTP backends to enable streaming:
Set unsigned_payload: false to force payload hashing. This buffers the entire object in memory before uploading — only use this if you have a specific compliance requirement for end-to-end payload integrity independent of TLS.
Disable checksum: AWS SDK v2 defaults to sending streaming checksums (CRC64NVME) on uploads. Some S3-compatible providers — notably Google Cloud Storage — reject these with SignatureDoesNotMatch. Set disable_checksum: true on backends that don’t support the AWS checksum headers:
This sets the SDK’s RequestChecksumCalculation and ResponseChecksumValidation to WhenRequired, disabling automatic checksum injection without affecting SigV4 request signing.
Strip SDK headers: AWS SDK v2 adds headers (amz-sdk-invocation-id, amz-sdk-request, accept-encoding) and a query parameter (x-id) that are included in the SigV4 signed header set. Google Cloud Storage does not include these when verifying the signature, causing SignatureDoesNotMatch errors. Set strip_sdk_headers: true to remove them before request signing:
For GCS backends, you typically need both disable_checksum: true and strip_sdk_headers: true:
Credential source: credential_source selects how the orchestrator obtains credentials for the backend. Default is static, which uses the access_key_id / secret_access_key fields above. Set to default_chain to delegate to the AWS SDK’s default credential chain (env vars, EC2 IMDS, SSO, ~/.aws/credentials, STS assume-role). When default_chain is set, the two key fields must be omitted — leaving stale keys behind is rejected at validation so they cannot silently shadow the SDK-resolved credentials.
Use default_chain when:
- The orchestrator runs on an EC2 instance with an IAM role attached (IMDS-vended credentials rotate every ~6 hours and cannot be tracked by YAML).
- Local development uses SSO (aws sso login) instead of long-lived keys.
- You want the SDK to resolve credentials via STS assume-role chains.
Note: the config loader already expands ${ENV_VAR} references at load time, so access_key_id: ${AWS_ACCESS_KEY_ID} covers the env-var case under credential_source: static. Use default_chain for credential sources the loader cannot reach (IMDS, SSO, STS) and for cases where refresh matters.
telemetry
Metrics are served on the same port as the S3 API. Tracing exports spans via gRPC OTLP (e.g., to Tempo or Jaeger).
Production sample rate guidance: A sample_rate of 1.0 traces every request, which is appropriate for development and low-traffic deployments. For production workloads above ~100 RPS, reduce to 0.01–0.1 to avoid overwhelming the trace backend with storage, network, and CPU overhead. Metrics and logs are unaffected by sample rate.
circuit_breaker
The circuit breaker is always active. These settings tune its sensitivity.
When the database is unreachable, the orchestrator enters degraded mode: reads broadcast to all backends (with caching), writes return 503. The circuit automatically recovers when the database comes back.
By default, degraded reads try each backend sequentially. When parallel_broadcast is enabled, all backends are tried concurrently and the first success wins — reducing worst-case read latency from N * backend_timeout to roughly the fastest backend’s response time. Enable this if read latency during outages is critical, but note that each parallel broadcast sends API requests to all backends simultaneously, which counts against monthly usage limits.
For fleets with many configured backends, set degraded_broadcast_parallelism to cap how many backends are probed at once. With a positive value, probes run as a rolling window: the first N launch immediately and each failure replenishes the slot with the next pending backend, so at most N goroutines (and at most N concurrent backend API calls / TLS handshakes) are in flight at any time. The default of 0 preserves the historical “fan out to every backend at once” behaviour.
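The rolling-window behaviour described above can be sketched as follows. This is an illustrative model only: broadcast, probeFunc, and runDemo are hypothetical names, and the fake probe simply sleeps and fails for every backend except "d".

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync/atomic"
	"time"
)

type result struct {
	data []byte
	err  error
}

// probeFunc is a hypothetical signature for "try to read the object from one backend".
type probeFunc func(ctx context.Context, backend string) ([]byte, error)

// broadcast returns the first successful probe. limit <= 0 fans out to every
// backend at once; a positive limit keeps at most that many probes in flight.
func broadcast(ctx context.Context, backends []string, limit int, probe probeFunc) ([]byte, error) {
	if limit <= 0 || limit > len(backends) {
		limit = len(backends)
	}
	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // signals in-flight probes to stop once a winner is found

	results := make(chan result, len(backends)) // buffered so late probes never block
	next := 0
	launch := func() {
		name := backends[next]
		next++
		go func() {
			data, err := probe(ctx, name)
			results <- result{data, err}
		}()
	}
	for next < limit {
		launch()
	}

	lastErr := errors.New("no backends configured")
	for done := 0; done < len(backends); done++ {
		r := <-results
		if r.err == nil {
			return r.data, nil // first success wins
		}
		lastErr = r.err
		if next < len(backends) {
			launch() // a failure frees a slot for the next pending backend
		}
	}
	return nil, lastErr
}

// runDemo probes four fake backends where only "d" succeeds and reports the
// winner plus the highest number of probes observed in flight.
func runDemo(limit int) (string, int32) {
	var inFlight, maxSeen atomic.Int32
	probe := func(ctx context.Context, name string) ([]byte, error) {
		n := inFlight.Add(1)
		defer inFlight.Add(-1)
		for {
			m := maxSeen.Load()
			if n <= m || maxSeen.CompareAndSwap(m, n) {
				break
			}
		}
		time.Sleep(5 * time.Millisecond) // simulated backend latency
		if name != "d" {
			return nil, errors.New(name + ": unavailable")
		}
		return []byte("from " + name), nil
	}
	data, err := broadcast(context.Background(), []string{"a", "b", "c", "d"}, limit, probe)
	if err != nil {
		return "", maxSeen.Load()
	}
	return string(data), maxSeen.Load()
}

func main() {
	winner, peak := runDemo(2)
	fmt.Println(winner, peak)
}
```

With a limit of 2 the demo never has more than two probes running, at the cost of reaching backend "d" later than a full fan-out would.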
The other defaults are sensible for most deployments. Increase cache_ttl if you have many read-heavy clients and want fewer backend round-trips during outages.
backend_circuit_breaker
Per-backend circuit breakers isolate failures at the individual backend level. When a backend’s credentials expire or the provider becomes unreachable, the circuit opens after consecutive failures and the backend is excluded from request routing. A single probe request tests recovery after the timeout elapses. Disabled by default.
Unlike the database circuit breaker, which triggers degraded mode for the entire system, backend circuit breakers affect only the individual backend. Reads fall back to other replicas, and writes route to other backends with available quota. No extra API calls are made — the breaker trips purely on organic traffic failures.
The s3o_circuit_breaker_state{name="<backend>"} metric tracks each backend’s circuit state (0=closed, 1=open, 2=half-open). Alert on > 0 for individual backends to detect credential or provider issues. Requires a restart to change (not hot-reloadable).
rebalance
Moves objects between backends to optimize storage distribution. Disabled by default — enabling it will generate egress/ingress traffic on your backends.
- pack — fills backends in config order, consolidating free space onto the last backend. Good for maximizing free-tier allocations.
- spread — equalizes utilization ratios across all backends. Good for distributing load.
Object moves run concurrently within each batch, bounded by concurrency. Increase for faster rebalancing; decrease to reduce backend load.
replication
Creates additional copies of objects on different backends for redundancy.
The replication factor must be <= number of backends. The worker runs once at startup to catch up on any pending replicas, then continues at the configured interval. Reads automatically fail over to replicas if the primary copy is unavailable.
Replication is asynchronous — writes go to a single backend and the replicator creates additional copies in the background. When a client overwrites an existing key, all old copies (including replicas) are removed and a single new copy is written. The replication factor drops to 1 until the next replicator cycle creates the additional copies. If the single backend holding the new copy fails before replication runs, the new version of the object is at risk. For most workloads this window (up to worker_interval) is acceptable. Lowering worker_interval reduces the exposure at the cost of more frequent DB queries and backend I/O.
Health-aware replication: When backend circuit breakers are enabled, the replicator monitors backend health. If a backend’s circuit breaker has been open longer than unhealthy_threshold, copies on that backend are treated as unavailable and replacement copies are created on healthy backends. This prevents sustained outages from silently reducing redundancy. The threshold prevents churn during brief transient failures. Set to 0 to disable health-aware replication (copies on down backends are still counted).
Cleanup Queue
The cleanup queue is always active. Tunables:
multipart_stale_timeout is consumed by the hourly CleanupStaleMultipartUploads sweep — uploads that have been open longer than this are aborted, their parts deleted from the backend, and the multipart rows removed. The default 24h matches the AWS S3 SDK’s default abort behavior; lower it on backends with tight free-tier headroom to recover quota faster.
When any backend object deletion fails during normal operations (PutObject orphan cleanup, DeleteObject, overwrite displaced copies, multipart part cleanup, rebalancer, replicator), the failed deletion is automatically enqueued for retry.
Each enqueued item tracks the object’s size_bytes. On enqueue, the backend’s orphan_bytes counter is incremented so that write routing and replication target selection account for the physically unreleased space. On successful cleanup the row is removed and orphan_bytes is decremented in a single atomic CTE; a worker crash between the two operations cannot leave the counter inconsistent.
Per-row claim pattern. Every row carries claimed_at and claimed_by columns. When a worker tick fetches a batch it stamps each row with the current instance’s identifier and timestamp, gated by FOR UPDATE SKIP LOCKED (Postgres) or SQLite’s intrinsic single-writer serialisation. Two instances ticking concurrently always see disjoint row sets, so a connection death or rolling-deploy overlap that would otherwise let two workers process the same row is now structurally impossible. A claim older than claim_grace_period (default 5m) is reclaimable so a worker that died mid-process does not leave the row stuck; reclaims emit s3o_cleanup_queue_stale_claims_recovered_total and a cleanup_queue.claim_recovered audit event.
The background worker runs every minute and retries with exponential backoff (1 minute to 24 hours). Scheduling a retry clears the row’s claim so it is immediately re-eligible for the next tick. After 10 failed attempts, the row is graduated to the cleanup_dlq table via core.MoveCleanupToDLQ (single transaction: read the row, insert it into cleanup_dlq, delete it from cleanup_queue). orphan_bytes is intentionally NOT decremented during the move because the backend object is still on disk. The DLQ entry retains the full row payload (key, backend, size, reason, last_error) plus an original_id correlation column so an operator can find the original queue entry.
Monitoring:
- s3o_cleanup_queue_depth staying elevated: orphaned objects are accumulating in the active queue.
- s3o_cleanup_queue_processed_total{status="exhausted"}: counter increments each time an item exhausts retries.
- s3o_cleanup_queue_processed_total{status="success_absent"}: counter increments each time a backend DELETE returned 404 and the row was dropped as idempotent success (the backend already agrees the object is gone). A sustained rate here is benign and just means upstream PUTs are silently failing somewhere; spikes are worth correlating with backend health.
- s3o_cleanup_queue_stale_claims_recovered_total{backend}: non-zero rate means a worker died mid-process or the grace period is too short for realistic worst-case processing time.
- s3o_cleanup_dlq_depth > 0: the DLQ holds at least one unrecoverable orphan; alerting here gives operators a direct signal instead of a counter delta.
- s3o_cleanup_dlq_enqueued_total{backend}: rate of graduations per backend; a single backend dominating means that backend’s delete path is broken.
- s3o_cleanup_enqueue_failures_total{backend,reason,stage}: orphan-leak blind spot signal. The cleanup-queue itself is durable, but its enqueue path is best-effort: when a backend write succeeds and the DB is then unreachable, the orphan cannot be recorded in cleanup_queue and the only signal is this counter plus the matching storage.OrphanEnqueueFailed audit event. stage="enqueue" is the worst case (the cleanup-queue worker will never see this orphan); stage="orphan_bytes" means the row landed but the quota counter drifts. See the runbook below.
- s3o_quota_orphan_bytes: elevated values mean backends have significant physically unreleased space (DLQ entries are the long-tail contributors).
Untracked-orphan recovery (cleanup enqueue failed during DB outage). A non-zero rate of s3o_cleanup_enqueue_failures_total{stage="enqueue"} means at least one orphan exists on a backend with no cleanup_queue row. The cleanup-queue worker will not retry it; the storage will leak until reconciled. Recovery workflow:
- Query the audit log for event="storage.OrphanEnqueueFailed" to enumerate the specific backend/key/size of each affected orphan during the outage window.
- Once DB connectivity is restored, run POST /admin/api/reconcile[?backend=name]. The reconciler walks each backend’s actual key list against object_locations using a bounded-memory sorted-merge and emits S3-only keys to the cleanup path (with a fresh cleanup_queue row this time). This is the same diff machinery that runs on the nightly reconcile interval.
- If the audit log indicates more than a handful of failures, target the reconciler at the affected backends specifically rather than waiting for the next scheduled scan.
stage="orphan_bytes" failures do not need step 2 — the cleanup_queue row landed and the worker will eventually delete the object. The quota counter drift is reset when backend_quotas.orphan_bytes is reconciled against cleanup_queue (a periodic safety pass; not yet automated).
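The sorted-merge diff used by the reconciler can be illustrated with two sorted key lists. The function name s3OnlyKeys is hypothetical, and in-memory slices stand in for what would realistically be two paginated, streamed listings.

```go
package main

import "fmt"

// s3OnlyKeys returns keys present in backendKeys but absent from dbKeys.
// Both inputs must be sorted ascending; each list is walked exactly once.
func s3OnlyKeys(backendKeys, dbKeys []string) []string {
	var orphans []string
	i, j := 0, 0
	for i < len(backendKeys) {
		switch {
		case j >= len(dbKeys) || backendKeys[i] < dbKeys[j]:
			// The backend holds a key the database does not know about.
			orphans = append(orphans, backendKeys[i])
			i++
		case backendKeys[i] == dbKeys[j]:
			i++
			j++
		default:
			// The database key sorts first: skip it, it is not an orphan.
			j++
		}
	}
	return orphans
}

func main() {
	backend := []string{"a", "b", "d", "e"}
	db := []string{"a", "c", "d"}
	fmt.Println(s3OnlyKeys(backend, db)) // [b e]
}
```

Because only the current position in each list is needed, the same loop works over streamed pages, which is what keeps memory bounded.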
Manual cleanup: Inspect DLQ entries and resolve them deliberately. The bytes are still on the backend, so the workflow is: delete the object out-of-band, then write off the row and adjust orphan_bytes by the row’s size:
write_path
The write path can run in two modes. Direct mode (enabled: false) writes to the backend and commits the metadata immediately afterward; a crash between the two leaks bytes onto the backend with no DB record. Pending-intent mode (enabled: true, the default) inserts a row into pending_objects before the backend PUT and atomically deletes that row when the metadata commits — so a crash between the PUT and the commit leaves a recoverable intent the background reaper can resolve.
How recovery works. On every tick the PendingReaper worker (internal/worker/pending.go) claims a batch of pending_objects rows older than min_age, HEADs the backend at the recorded key, and resolves each one:
- HEAD 200 → the backend received the bytes. Promote the intent to a committed object_locations row (pending_reaper.promoted audit event).
- HEAD 404 → the backend never received the bytes. Drop the intent (pending_reaper.dropped audit event). No orphan exists.
- Non-404 HEAD error → leave the intent for the next tick. A sustained backend reachability problem here surfaces as s3o_pending_intents_resolved_total{status="ambiguous"}.
- A later write for the same key already committed → drop the intent as superseded (pending_reaper.superseded).
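The four outcomes above amount to a small decision function. The sketch below uses hypothetical names (Resolution, resolve), and checking for a newer committed write before looking at the HEAD result is an assumption about ordering, not something the text specifies.

```go
package main

import (
	"errors"
	"fmt"
)

// Resolution is a hypothetical label for what the reaper does with one intent.
type Resolution string

const (
	Promoted   Resolution = "promoted"   // HEAD 200: commit the object_locations row
	Dropped    Resolution = "dropped"    // HEAD 404: the bytes never arrived
	Ambiguous  Resolution = "ambiguous"  // other errors: retry on the next tick
	Superseded Resolution = "superseded" // a later write already committed
)

// resolve maps a HEAD outcome to a resolution. headErr models transport-level
// failures where no HTTP status was received at all.
func resolve(headStatus int, headErr error, newerCommitExists bool) Resolution {
	switch {
	case newerCommitExists:
		return Superseded // assumption: this check runs before the HEAD result is used
	case headErr != nil:
		return Ambiguous
	case headStatus == 200:
		return Promoted
	case headStatus == 404:
		return Dropped
	default:
		return Ambiguous // e.g. 403 or 5xx: leave the intent in place
	}
}

func main() {
	fmt.Println(resolve(200, nil, false))
	fmt.Println(resolve(404, nil, false))
	fmt.Println(resolve(503, nil, false))
	fmt.Println(resolve(0, errors.New("dial timeout"), false))
}
```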
Why min_age matters. The reaper must not race the foreground write path; if min_age is too short the reaper can interrogate an intent whose backend PUT is still in flight and either prematurely commit it or churn ambiguous resolutions. The 5-minute default is generous; lower it only if you have measured the p99 PUT duration and accept the operational tradeoff.
Monitoring:
- s3o_pending_intents_enqueued_total: should track the PutObject rate closely.
- s3o_pending_intents_resolved_total{status}: committed is the happy path (synchronous commit succeeded); promoted + dropped are reaper resolutions; ambiguous is the alert.
- s3o_pending_intents_depth: gauge of unresolved intents. Alert when consistently above batch_size, which means the reaper is not keeping up (raise batch_size, lower reaper_tick, or add concurrency).
- Audit events: pending_reaper.promoted / pending_reaper.dropped / pending_reaper.superseded.
When to disable. Don’t, unless you are running an embedded SQLite single-instance demo and trust the OS to flush. The pattern adds one DB write per PUT (cheap) and saves you from one entire class of write-path crash leak.
rate_limit
Per-IP token bucket rate limiting. When enabled, rate limiting applies to both the S3 proxy and the admin API. Requests exceeding the limit receive 429 SlowDown.
A background goroutine evicts per-IP entries not seen within cleanup_max_age every cleanup_interval. Under high source-IP cardinality (e.g., DDoS), the map can hold up to cleanup_max_age worth of unique IPs — tune both values down if memory pressure is a concern.
When trusted_proxies is configured, the orchestrator extracts the real client IP from the X-Forwarded-For header using rightmost-untrusted extraction: it walks the XFF chain from right to left, skipping addresses within trusted CIDRs, and uses the first untrusted address for rate limiting. If the direct peer is not in a trusted CIDR, X-Forwarded-For is ignored entirely to prevent spoofing. Without trusted_proxies, the direct connection IP is always used.
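Rightmost-untrusted extraction can be sketched with the standard net/netip package. The function names clientIP and resolveClient are hypothetical, and falling back to the direct peer when the header contains a malformed entry is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"net/netip"
	"strings"
)

// trusted reports whether addr falls inside any trusted proxy CIDR.
func trusted(addr netip.Addr, proxies []netip.Prefix) bool {
	for _, p := range proxies {
		if p.Contains(addr) {
			return true
		}
	}
	return false
}

// clientIP picks the address to rate-limit on. peer is the direct connection
// address and xff is the raw X-Forwarded-For header value.
func clientIP(peer netip.Addr, xff string, proxies []netip.Prefix) netip.Addr {
	if !trusted(peer, proxies) {
		return peer // untrusted peer: ignore the header entirely
	}
	parts := strings.Split(xff, ",")
	for i := len(parts) - 1; i >= 0; i-- { // walk right to left
		addr, err := netip.ParseAddr(strings.TrimSpace(parts[i]))
		if err != nil {
			return peer // assumption: malformed or empty entries fall back to the peer
		}
		if !trusted(addr, proxies) {
			return addr // first untrusted hop from the right
		}
	}
	return peer // every hop was a trusted proxy
}

// resolveClient is a string-based convenience wrapper for the demo.
func resolveClient(peer, xff string, cidrs ...string) string {
	var proxies []netip.Prefix
	for _, c := range cidrs {
		proxies = append(proxies, netip.MustParsePrefix(c))
	}
	return clientIP(netip.MustParseAddr(peer), xff, proxies).String()
}

func main() {
	// The leftmost entry could be forged by the client; the walk never reaches it.
	fmt.Println(resolveClient("10.0.0.5", "203.0.113.7, 198.51.100.9, 10.0.0.8", "10.0.0.0/8"))
}
```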
Multi-instance note: Rate limits are enforced per-instance. Behind a load balancer with round-robin routing, the effective rate for a given client is requests_per_sec * instance_count. Divide your desired aggregate rate by the number of API instances when configuring.
ui
Built-in web dashboard for operational visibility and management. Disabled by default. Requires authentication via an admin key/secret pair — sessions are HMAC-signed cookies with a 24-hour TTL.
admin_key, admin_secret, and session_secret are all required when enabled is true. Generate credentials the same way as bucket credentials:
Bcrypt-hashed secrets: For bare-metal deployments where the config file is at rest on disk, you can store admin_secret as a bcrypt hash instead of plaintext. The orchestrator detects bcrypt hashes automatically (they start with $2). Generate one with:
Both plaintext and bcrypt secrets are fully supported — no config migration needed.
Session secret: Session keys are derived deterministically from session_secret using HMAC-SHA256, so sessions survive restarts. For multi-instance deployments behind a load balancer, all instances sharing the same session_secret will accept each other’s sessions. Generate a value with:
session_secret is independent of admin_secret — rotating the admin password does not invalidate active sessions, and vice versa.
usage_flush
Controls how often usage counters are flushed to the database. When adaptive flushing is enabled, the interval shortens automatically when any backend approaches a usage limit, improving enforcement accuracy.
- interval: how often counters are flushed under normal conditions. Lower values reduce staleness but increase database writes.
- adaptive_enabled: when true, the flush interval drops to fast_interval whenever any backend’s effective usage exceeds adaptive_threshold of its configured limit.
- adaptive_threshold: the ratio (0–1 exclusive) at which fast flushing kicks in. At 0.8, a backend at 80% of any usage limit triggers the fast interval.
- fast_interval: must be less than interval. Used when adaptive flushing detects a backend near its limits.
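How these four settings combine can be sketched as a single selection function. Usage and flushInterval are hypothetical names, and the sketch treats reaching the threshold exactly (80% at 0.8) as enough to trigger the fast interval, following the example in the list above.

```go
package main

import (
	"fmt"
	"time"
)

// Usage is a hypothetical pairing of one counter with its configured limit.
type Usage struct {
	Used, Limit int64 // Limit 0 means no limit is configured for this counter
}

// flushInterval returns fast when adaptive flushing is on and any counter has
// reached threshold of its limit, and normal otherwise.
func flushInterval(normal, fast time.Duration, adaptive bool, threshold float64, usages []Usage) time.Duration {
	if !adaptive {
		return normal
	}
	for _, u := range usages {
		if u.Limit > 0 && float64(u.Used)/float64(u.Limit) >= threshold {
			return fast
		}
	}
	return normal
}

func main() {
	usages := []Usage{
		{Used: 100, Limit: 1000}, // 10% of an API request limit
		{Used: 850, Limit: 1000}, // 85% of an egress limit
	}
	fmt.Println(flushInterval(30*time.Second, 5*time.Second, true, 0.8, usages))
}
```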
Multi-instance note: Without Redis, each instance accumulates usage counters in memory between flushes. With N instances, the enforcement margin near limits is up to N * interval worth of unaccounted operations. Adaptive flushing reduces this near limits but doesn’t eliminate it. For tighter enforcement, configure Redis shared counters to eliminate the cross-instance blind spot entirely, or reduce interval and run fewer API instances.
redis
Optional shared usage counters for multi-instance deployments. When configured, all instances share usage counters via Redis instead of tracking them independently in memory. This eliminates the cross-instance blind spot between PostgreSQL flushes.
- address: required when the redis section is present. The orchestrator PINGs Redis on startup and fails hard if unreachable.
- key_prefix: namespaces all Redis keys. Use different prefixes if multiple orchestrator deployments share one Redis instance.
- failure_threshold and open_timeout: control the circuit breaker that falls back to local counters when Redis is unavailable.
When Redis is active, the usage flush service acquires a PostgreSQL advisory lock so only one instance performs the destructive GETSET + flush-to-PG operation. When Redis is in fallback (or not configured), each instance flushes independently without a lock.
A background health probe runs every 5 seconds while the breaker is open: it PINGs Redis and, on success, syncs the accumulated local-counter deltas back via an additive INCRBY pipeline (no DEL — keys from before the outage expire via TTL) and recloses the breaker. The breaker recovery is clean: the failure counter is zeroed so the system tolerates the configured failure_threshold of new transient errors before tripping again. No process restart is required after a Redis outage.
Redis is not reloadable — changing Redis settings requires a restart.
lifecycle
Automatically deletes objects whose key matches a prefix and whose age exceeds the configured expiration. Useful for temporary uploads, staging artifacts, or anything with a known retention period.
- prefix: key prefix to match (required, must be non-empty).
- expiration_days: delete objects older than this many days (required, must be > 0).
- Omit the lifecycle section or leave rules empty to disable lifecycle entirely.
- Rules are evaluated every hour by a background worker with an advisory lock.
- Deletions go through the standard DeleteObject path: all copies removed, quotas decremented, failed deletes enqueued to the cleanup queue.
- Hot-reloadable via SIGHUP.
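Rule matching can be pictured as a prefix check plus an age comparison. Rule and expired are hypothetical names, and the sketch takes the object's age as an input instead of guessing which timestamp the worker uses.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

const day = 24 * time.Hour

// Rule is a hypothetical shape for one lifecycle rule.
type Rule struct {
	Prefix         string
	ExpirationDays int
}

// expired reports whether any rule matches the key and the object's age
// exceeds that rule's expiration.
func expired(key string, age time.Duration, rules []Rule) bool {
	for _, r := range rules {
		if strings.HasPrefix(key, r.Prefix) && age > time.Duration(r.ExpirationDays)*day {
			return true
		}
	}
	return false
}

func main() {
	rules := []Rule{{Prefix: "tmp/", ExpirationDays: 7}}
	fmt.Println(expired("tmp/upload.bin", 8*day, rules))   // matches prefix, older than 7 days
	fmt.Println(expired("tmp/upload.bin", 2*day, rules))   // matches prefix, still too young
	fmt.Println(expired("keep/report.pdf", 30*day, rules)) // prefix does not match
}
```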
encryption
Server-side envelope encryption with chunked AES-256-GCM. When enabled, objects are encrypted before being stored on backends and decrypted transparently on read. Exactly one key source is required.
Generating a master key:
Key source options — exactly one of the following must be set:
| Source | Config field | When to use |
|---|---|---|
| Inline | master_key | Base64-encoded 256-bit key in config or env var. Simplest option. |
| File | master_key_file | Path to a file containing exactly 32 raw bytes. Good for bare-metal with config management. |
| Vault Transit | vault | Delegate key wrapping/unwrapping to HashiCorp Vault. Best for production with HSM-backed key management. |
Vault Transit configuration:
The Vault Transit engine handles wrapping and unwrapping DEKs — the orchestrator never sees the master key material. The key_name must reference an existing key in the Transit engine.
Key rotation support:
When rotating to a new master key, move the old key to previous_keys so existing objects can still be decrypted:
After updating the config, call the rotate-encryption-key admin API to re-wrap all DEKs with the new key. See Rotating encryption keys below.
Important notes:
- Encryption is not reloadable — changing encryption settings requires a restart.
- The chunk_size must stay the same for the lifetime of the data. Changing it after objects are encrypted will make those objects unreadable.
- Encrypted objects are slightly larger than their plaintext (header + per-chunk overhead). The exact overhead is: 32 bytes (header) + 28 bytes per chunk (nonce + auth tag).
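The overhead formula is easy to turn into a size estimate for quota planning. The function name encryptedSize and the 64 KiB chunk size in the demo are hypothetical, and the sketch assumes that a final partial chunk counts as a full chunk for overhead purposes and that an empty object has no chunks.

```go
package main

import "fmt"

const (
	headerBytes   = 32 // fixed header per object
	perChunkBytes = 28 // nonce + auth tag per chunk
)

// encryptedSize estimates the stored size of an object with the given
// plaintext length when encrypted in chunks of chunkSize bytes.
func encryptedSize(plaintext, chunkSize int64) int64 {
	chunks := (plaintext + chunkSize - 1) / chunkSize // ceiling division
	return headerBytes + plaintext + chunks*perChunkBytes
}

func main() {
	// 1 MiB of plaintext in 64 KiB chunks is 16 chunks:
	// 32 + 1048576 + 16*28 = 1049056 bytes.
	fmt.Println(encryptedSize(1<<20, 64<<10))
}
```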
integrity
SHA-256 content hashing for data integrity verification. When enabled, objects are checksummed on write and the hash is stored alongside the object location in PostgreSQL.
How it works:
- Write path: SHA-256 is computed on the plaintext body (before encryption) and stored in object_locations.content_hash.
- Read path (verify_on_read): A VerifyingReader wraps the response body and computes the hash as data streams to the client. On mismatch at EOF, the corrupted copy is enqueued for cleanup.
- Scrubber: A background worker periodically reads random objects from backends, decrypts if needed, and verifies their hash. Corrupted copies are enqueued for cleanup. Each read counts against the backend’s usage quota.
- Backfill: Objects written before integrity was enabled have no stored hash. Use admin backfill-checksums to read those objects and compute their hashes.
Integrity is hot-reloadable — changes take effect on SIGHUP without a restart.
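The streaming check performed on the read path can be sketched as an io.Reader wrapper. This is a simplified stand-in for the VerifyingReader mentioned above: the lowercase type and helper names are hypothetical, and the sketch surfaces a mismatch as an error where the text describes enqueueing the corrupted copy for cleanup.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"hash"
	"io"
	"strings"
)

var errMismatch = errors.New("content hash mismatch")

// verifyingReader hashes bytes as they stream through and compares the
// digest with the expected hex value once the source reports EOF.
type verifyingReader struct {
	r    io.Reader
	h    hash.Hash
	want string
}

func newVerifyingReader(r io.Reader, wantHex string) io.Reader {
	return &verifyingReader{r: r, h: sha256.New(), want: wantHex}
}

func (v *verifyingReader) Read(p []byte) (int, error) {
	n, err := v.r.Read(p)
	v.h.Write(p[:n]) // hash.Hash writes never return an error
	if err == io.EOF && hex.EncodeToString(v.h.Sum(nil)) != v.want {
		return n, errMismatch
	}
	return n, err
}

// sha256Hex returns the hex SHA-256 of s, standing in for the stored hash.
func sha256Hex(s string) string {
	sum := sha256.Sum256([]byte(s))
	return hex.EncodeToString(sum[:])
}

// verify streams body through the wrapper and reports any mismatch.
func verify(body, wantHex string) error {
	_, err := io.Copy(io.Discard, newVerifyingReader(strings.NewReader(body), wantHex))
	return err
}

func main() {
	fmt.Println(verify("hello", sha256Hex("hello"))) // hashes agree
	fmt.Println(verify("hellp", sha256Hex("hello"))) // simulated corruption
}
```

Since the hash is only known at EOF, the client has already received the bytes by the time a mismatch is detected, which is why the reaction described above happens after the fact.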
cache
Optional in-memory LRU cache for full GET responses. Reduces backend API calls and egress by serving repeated reads from memory. Per-instance only — not shared across instances.
- max_size: total memory the cache may consume. Size this based on available container memory after accounting for the Go heap, connection pools, and streaming buffers. A good starting point is 10-25% of the container’s memory allocation.
- max_object_size: objects larger than this are never admitted to the cache. Prevents a single large object from evicting many smaller frequently-accessed objects. Set this below the typical “hot” object size in your workload.
- ttl: maximum time an entry stays cached before automatic expiry. In multi-instance deployments, this bounds how stale a cached object can be when writes happen on another instance. Lower values reduce staleness at the cost of more backend requests.
Cache entries are automatically invalidated on PutObject, DeleteObject, CopyObject, DeleteObjects, and CompleteMultipartUpload. Range requests bypass the cache on miss but are served from cache on hit.
When to enable:
- Read-heavy workloads where the same objects are fetched repeatedly (thumbnails, config files, assets)
- Backends with per-request API charges or egress costs
- High-latency backends where caching improves P99 latency
When to skip:
- Write-heavy workloads with few repeated reads
- Objects are too large to fit meaningfully in memory
- Single-instance with very low read traffic
The cache is not hot-reloadable — changing cache settings requires a restart. When encryption is enabled, the cache stores post-decryption plaintext.
Metrics:
| Metric | Labels | Description |
|---|---|---|
| s3o_integrity_checks_total | operation | Hash verifications performed (read, scrub) |
| s3o_integrity_errors_total | operation | Hash mismatches detected (read, scrub) |
When enabled, the dashboard is served at {path}/ on the same port as the S3 API.
All dashboard responses include security headers (X-Frame-Options: DENY, X-Content-Type-Options: nosniff, Referrer-Policy: strict-origin-when-cross-origin, Content-Security-Policy). The dashboard requires authentication via the configured admin_key/admin_secret — unauthenticated requests are redirected to the login page (HTML) or receive 401 (API).
Configuration hot-reload
The orchestrator supports hot-reloading a subset of configuration by sending SIGHUP to the running process. This lets you update credentials, quotas, rate limits, and other operational settings without restarting the service or dropping client connections.
Reloadable vs non-reloadable settings
| Setting | Reloadable | Notes |
|---|---|---|
| buckets (credentials, limits) | Yes | Credentials and max_multipart_uploads take effect immediately |
| rate_limit | Yes | New visitors get updated rates; existing per-IP limiters expire naturally |
| backends[].quota_bytes | Yes | Synced to database on reload |
| backends[].api_request_limit | Yes | |
| backends[].egress_byte_limit | Yes | |
| backends[].ingress_byte_limit | Yes | |
| rebalance | Yes | Strategy, interval, threshold, concurrency, enabled/disabled |
| replication | Yes | Factor, worker interval, batch size |
| usage_flush | Yes | Interval, adaptive enabled/threshold/fast interval |
| lifecycle | Yes | Rules (prefix, expiration_days) |
| integrity | Yes | Enabled, verify_on_read, scrubber interval/batch size |
| server.listen_addr | No | Requires restart |
| server.max_concurrent_requests | No | Requires restart |
| server.max_concurrent_reads | No | Requires restart |
| server.max_concurrent_writes | No | Requires restart |
| server.load_shed_threshold | No | Requires restart |
| server.admission_wait | No | Requires restart |
| server timeouts | No | read_header_timeout, read_timeout, write_timeout, idle_timeout, shutdown_delay |
| server.tls | No | Requires restart |
| database | No | Requires restart |
| telemetry | No | Requires restart |
| circuit_breaker | No | Requires restart |
| backend_circuit_breaker | No | Requires restart |
| ui | No | Requires restart |
| encryption | No | Requires restart |
| cache | No | Requires restart |
| redis | No | Requires restart |
| routing_strategy | No | Requires restart |
| reconcile | No | Requires restart |
| backends (structural: endpoint, credentials, count) | No | Requires restart |
On a successful reload, the orchestrator logs each reloaded section:
If the new config file is invalid, the orchestrator keeps the current configuration and logs the error:
Non-reloadable field changes are logged as warnings but do not prevent the reload of other settings: