Configuration Walkthrough

This page walks through every YAML block in the config, with examples and validation rules. Each subsection has a corresponding per-topic doc with more operational depth. See packaging/config.yaml for a complete template.

All config values support ${ENV_VAR} expansion — the orchestrator calls os.Expand on the entire YAML file before parsing. Use this for secrets:

database:
  password: "${DB_PASSWORD}"
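
To make the expansion step concrete, here is a minimal Python sketch of ${ENV_VAR} substitution applied to the raw config text before it is parsed. It only approximates what os.Expand does: it handles the braced form only, and replacing unset variables with an empty string is an assumption about how the orchestrator looks values up. The name expand_env is hypothetical.

```python
import os
import re

# Matches ${NAME}. Go's os.Expand also accepts the unbraced $NAME form,
# which this sketch deliberately ignores.
_VAR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(text, env=None):
    """Replace every ${NAME} in text with its value from env.

    Unset variables become an empty string (assumption: a Getenv-style lookup).
    """
    env = os.environ if env is None else env
    return _VAR.sub(lambda m: env.get(m.group(1), ""), text)


if __name__ == "__main__":
    raw = 'database:\n  password: "${DB_PASSWORD}"\n'
    print(expand_env(raw, {"DB_PASSWORD": "s3cret"}))
```

Because substitution happens on the whole file before YAML parsing, keeping the placeholder inside quotes (as in the example above) avoids surprises when a secret contains characters that are special in YAML.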

server

server:
  listen_addr: "0.0.0.0:9000"    # required
  log_level: "info"               # debug, info, warn, error (default: info, reloadable via SIGHUP)
  max_object_size: 5368709120     # 5 GB default
  max_concurrent_requests: 0      # max concurrent S3 requests (0 = unlimited, default)
  # max_concurrent_reads: 0       # separate limit for GET/HEAD (0 = use global limit)
  # max_concurrent_writes: 0      # separate limit for PUT/POST/DELETE (0 = use global limit)
  # load_shed_threshold: 0        # active shedding threshold (0.0-1.0, 0 = disabled)
  # admission_wait: "0s"          # brief wait before rejection (0 = instant)
  backend_timeout: "30s"          # per-operation timeout for backend S3 calls
  read_header_timeout: "10s"      # max time to read request headers (default: 10s)
  read_timeout: "5m"              # max time to read entire request including body (default: 5m)
  write_timeout: "5m"             # max time to write response (default: 5m)
  idle_timeout: "120s"            # max time to wait for next request on keep-alive (default: 120s)
  shutdown_delay: "5s"            # delay before HTTP drain on SIGTERM (default: 0, no delay)
  • listen_addr is the only required field.
  • max_object_size caps single-PUT uploads. Larger objects should use multipart upload (most clients do this automatically). PutObject buffers the entire body in memory to support write failover across backends, so peak memory from uploads is approximately max_object_size × max_concurrent_writes.
  • max_concurrent_requests limits the number of S3 requests processed simultaneously. When the limit is reached, new requests are rejected with 503 SlowDown and Retry-After: 1. Set to 2-3x database.max_conns for load shedding. 0 disables the limit.
  • max_concurrent_reads and max_concurrent_writes provide separate concurrency limits for reads (GET, HEAD) and writes (PUT, POST, DELETE). When both are set, they replace max_concurrent_requests with independent pools so write storms cannot starve reads. Background workers contend with HTTP writes, not reads — cleanup, replication, rebalance, pending reaper, and over-replication acquire admission slots from the same pool sized to max_concurrent_writes. In merged mode (max_concurrent_requests only), every HTTP request and every background worker shares the single global pool. Size max_concurrent_writes to accommodate both peak HTTP write traffic and the worst-case overlap of background worker activity (typically the replication factor × replicator concurrency for the dominant case). See issue #835 for the design rationale.
  • load_shed_threshold enables active load shedding. When in-flight requests exceed this fraction of pool capacity (e.g. 0.8), new requests are probabilistically rejected before the hard limit, providing smooth degradation instead of a cliff.
  • admission_wait adds a brief wait before rejecting when the semaphore is full (e.g. 50ms). Smooths micro-bursts without adding latency during sustained overload. Default 0 means instant rejection.
  • backend_timeout bounds individual S3 API calls to backends. Increase if you have slow backends or large objects.
  • read_header_timeout protects against slow-read attacks that hold connections open by sending headers slowly. The 10-second default is generous for any legitimate client.
  • read_timeout and write_timeout bound the total time for reading/writing entire requests and responses. The 5-minute defaults accommodate large object transfers.
  • idle_timeout controls how long keep-alive connections stay open waiting for the next request.
  • shutdown_delay adds a pause between marking the instance as not-ready and starting the HTTP drain on SIGTERM. Set this to ~5s in environments where service deregistration is asynchronous (Consul, Kubernetes) so load balancers stop routing before connections are closed. Default 0 means no delay.
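
The two sizing rules above (upload memory and the write pool) are simple arithmetic, sketched below in Python. The function names and the example traffic figures are hypothetical; the formulas follow the bullets: peak upload memory is roughly max_object_size × max_concurrent_writes, and the write pool should cover peak HTTP writes plus replication factor × replicator concurrency.

```python
def peak_upload_memory(max_object_size, max_concurrent_writes):
    """Approximate worst-case bytes buffered by in-flight PutObject calls."""
    return max_object_size * max_concurrent_writes


def write_pool_size(peak_http_writes, replication_factor, replicator_concurrency):
    """Peak HTTP writes plus the dominant background-worker overlap."""
    return peak_http_writes + replication_factor * replicator_concurrency


if __name__ == "__main__":
    gib = 1024 ** 3
    # Illustrative inputs only: 20 concurrent client writes, factor 2,
    # replicator concurrency 10.
    pool = write_pool_size(peak_http_writes=20, replication_factor=2,
                           replicator_concurrency=10)
    print("max_concurrent_writes:", pool)
    print("worst-case upload memory (GiB):",
          peak_upload_memory(5 * gib, pool) / gib)
```

With the 5 GB default for max_object_size, the worst case grows quickly with the pool size, which is a reason to lower max_object_size when clients use multipart upload for large objects.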

buckets

Each bucket defines a virtual namespace with one or more credential sets.

buckets:
  - name: "app1-files"
    # max_multipart_uploads: 100  # optional; limit active multipart uploads (0 = unlimited)
    credentials:
      - access_key_id: "AKID_APP1"
        secret_access_key: "secret1"

  - name: "app2-files"
    credentials:
      - access_key_id: "AKID_APP2_WRITER"
        secret_access_key: "secret2"
      - access_key_id: "AKID_APP2_READER"
        secret_access_key: "secret3"

Generating credentials: Use openssl rand to produce random keys:

# Generate an access key ID (20 chars, uppercase + digits)
openssl rand -hex 10 | tr '[:lower:]' '[:upper:]'

# Generate a secret access key (40 chars, base64)
openssl rand -base64 30

Validation rules:

  • Bucket names must not contain /.
  • Bucket names must be unique across the config.
  • Access key IDs must be globally unique across all buckets.
  • Each bucket must have at least one credential set.
  • Each credential needs either access_key_id + secret_access_key (SigV4) or token (legacy).
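
The validation rules above can be expressed as a small checker. This is a sketch in Python over an already-parsed config (a list of dicts), not the orchestrator's own validator; the function name and the error messages are hypothetical.

```python
def validate_buckets(buckets):
    """Return a list of human-readable violations of the bucket rules."""
    errors = []
    seen_names = set()
    seen_keys = set()
    for bucket in buckets:
        name = bucket.get("name", "")
        if "/" in name:
            errors.append(f"bucket {name!r}: name must not contain '/'")
        if name in seen_names:
            errors.append(f"bucket {name!r}: duplicate bucket name")
        seen_names.add(name)

        creds = bucket.get("credentials") or []
        if not creds:
            errors.append(f"bucket {name!r}: at least one credential set is required")
        for cred in creds:
            key_id = cred.get("access_key_id")
            has_sigv4 = bool(key_id) and bool(cred.get("secret_access_key"))
            has_token = bool(cred.get("token"))
            if not (has_sigv4 or has_token):
                errors.append(f"bucket {name!r}: credential needs a key pair or a token")
            if key_id:
                # Access key IDs are unique across all buckets, not per bucket.
                if key_id in seen_keys:
                    errors.append(f"bucket {name!r}: access key {key_id!r} is already in use")
                seen_keys.add(key_id)
    return errors
```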

Multiple credentials on the same bucket let different services share a namespace with independent keys. This is useful when you want a writer service and a reader service accessing the same files.

SigV4 credentials also support presigned URLs automatically. Clients can generate time-limited presigned URLs using any AWS SDK presign client — no additional configuration is needed on the orchestrator side.

database

The driver field selects between SQLite (embedded, zero-dependency) and PostgreSQL (required for multi-instance deployments). When driver is omitted, the orchestrator infers postgres if host is set, otherwise sqlite.

SQLite (default for single-instance):

database:
  driver: sqlite
  path: "s3-orchestrator.db"     # default: s3-orchestrator.db

SQLite requires no external dependencies. The database file is created automatically on first start. Advisory lock-based leader election is replaced by a process-local mutex, so multi-instance deployments are not supported with SQLite.

PostgreSQL (required for multi-instance):

database:
  driver: postgres
  host: "db.example.com"        # required
  port: 5432                     # default: 5432
  database: "s3orchestrator"     # required
  user: "s3orchestrator"         # required
  password: "${DB_PASSWORD}"
  ssl_mode: "require"            # default: require (use "disable" for local dev)
  max_conns: 50                  # default: 50; size to 2-3x max_concurrent_requests
  min_conns: 10                  # default: 5
  max_conn_lifetime: "5m"        # default: 5m

Pool settings (max_conns, min_conns, max_conn_lifetime) control the pgx connection pool. Size max_conns to 2-3x your max_concurrent_requests setting. See Performance Tuning - Connection Pool Sizing for detailed guidance.

routing_strategy

Controls how the orchestrator selects a backend when writing new objects.

routing_strategy: "pack"       # "pack" or "spread" (default: pack)
  • pack (default) — fills the first backend in config order until its quota is full, then overflows to the next. Best for stacking free-tier allocations sequentially.
  • spread — places each object on the backend with the lowest utilization ratio ((bytes_used + orphan_bytes) / bytes_limit). Best for distributing storage evenly across backends.

Both strategies respect quota limits and usage limits — full or over-limit backends are always skipped.
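
As an illustration of the difference between the two strategies, the following Python sketch picks a backend for a new object. It models only byte quotas (monthly usage limits and unlimited backends are left out), assumes bytes_limit is greater than zero, and uses hypothetical field and function names.

```python
def utilization(backend):
    """(bytes_used + orphan_bytes) / bytes_limit, the ratio used by spread."""
    return (backend["bytes_used"] + backend["orphan_bytes"]) / backend["bytes_limit"]


def has_room(backend, size):
    """True if the object fits without exceeding the backend's quota."""
    used = backend["bytes_used"] + backend["orphan_bytes"]
    return used + size <= backend["bytes_limit"]


def pick_backend(backends, size, strategy="pack"):
    """Return the name of the backend chosen for a new object, or None."""
    eligible = [b for b in backends if has_room(b, size)]  # full backends are skipped
    if not eligible:
        return None
    if strategy == "spread":
        return min(eligible, key=utilization)["name"]  # lowest utilization wins
    return eligible[0]["name"]  # pack: first eligible backend in config order
```

With one backend at 90% and another at 10%, pack keeps writing to the first until an object no longer fits, while spread sends the next object to the emptier backend.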

backends

Each backend is an S3-compatible storage service with its own credentials and optional quota.

backends:
  - name: "oci"
    endpoint: "https://namespace.compat.objectstorage.us-phoenix-1.oraclecloud.com"
    region: "us-phoenix-1"
    bucket: "my-oci-bucket"
    access_key_id: "${OCI_ACCESS_KEY}"
    secret_access_key: "${OCI_SECRET_KEY}"
    force_path_style: true
    quota_bytes: 21474836480     # 20 GB

Endpoint URLs by provider:

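| Provider           | Endpoint format                                                    | force_path_style |
|--------------------|--------------------------------------------------------------------|------------------|
| OCI Object Storage | https://<namespace>.compat.objectstorage.<region>.oraclecloud.com | true             |
| Backblaze B2       | https://s3.<region>.backblazeb2.com                                | true             |
| AWS S3             | https://s3.<region>.amazonaws.com                                  | false            |
| MinIO              | http://<host>:9000                                                 | true             |
| Wasabi             | https://s3.<region>.wasabisys.com                                  | true             |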

Quota: Set quota_bytes to limit how much data a backend can hold. Set to 0 or omit for unlimited. Quota is tracked in the metadata database and updated atomically with every write/delete. Note that multipart uploads do not reserve quota upfront: temporary parts consume backend storage without being counted against the quota until CompleteMultipartUpload records the final object size. A client uploading many large parts could temporarily exceed a backend’s quota before completion.

Max object size: Some providers impose per-object size limits (e.g. Supabase rejects uploads over 50 MB with 413 EntityTooLarge). Set max_object_size to prevent the orchestrator from routing writes, rebalance moves, or replication copies to a backend when the object exceeds the limit:

    max_object_size: 52428800    # 50 MB (0 = unlimited)

Usage limits: Optional monthly caps on API requests, egress, and ingress per backend:

    api_request_limit: 20000     # monthly API calls (0 = unlimited)
    egress_byte_limit: 1073741824  # 1 GB monthly egress (0 = unlimited)
    ingress_byte_limit: 0        # unlimited ingress

When a backend exceeds a usage limit, writes overflow to the next eligible backend. Limits reset each month automatically.

Unsigned payload: By default, uploads stream directly to backends without buffering the entire body in memory. The AWS SDK normally buffers the request body to compute a SigV4 payload hash (SHA-256), but the orchestrator uses UNSIGNED-PAYLOAD to skip this. Without streaming, large uploads (multipart completion, replication) can cause out-of-memory kills.

For HTTPS endpoints, unsigned payload is enabled by default. For plain HTTP endpoints, it is auto-disabled unless explicitly set — AWS S3 rejects unsigned payloads over HTTP, but most S3-compatible backends (MinIO, R2, etc.) accept them. Set unsigned_payload: true on HTTP backends to enable streaming:

    unsigned_payload: true   # stream uploads without buffering (auto-enabled for HTTPS)

Set unsigned_payload: false to force payload hashing. This buffers the entire object in memory before uploading — only use this if you have a specific compliance requirement for end-to-end payload integrity independent of TLS.
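
The defaulting rule for unsigned_payload reduces to a short decision, sketched here in Python. The function name is hypothetical, and explicit stands for the value in the config (None when the key is omitted).

```python
from urllib.parse import urlparse


def unsigned_payload_enabled(endpoint, explicit=None):
    """Resolve the effective unsigned_payload setting for a backend."""
    if explicit is not None:
        return explicit  # an explicit true/false always wins
    # Omitted: enabled for HTTPS endpoints, disabled for plain HTTP.
    return urlparse(endpoint).scheme == "https"
```

For example, a MinIO backend at http://minio:9000 resolves to False unless unsigned_payload: true is set, which matches the note above about buffering on HTTP backends.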

Disable checksum: AWS SDK v2 defaults to sending streaming checksums (CRC32) on uploads. Some S3-compatible providers, notably Google Cloud Storage, reject these with SignatureDoesNotMatch. Set disable_checksum: true on backends that don’t support the AWS checksum headers:

    disable_checksum: true   # required for GCS HMAC interoperability

This sets the SDK’s RequestChecksumCalculation and ResponseChecksumValidation to WhenRequired, disabling automatic checksum injection without affecting SigV4 request signing.

Strip SDK headers: AWS SDK v2 adds headers (amz-sdk-invocation-id, amz-sdk-request, accept-encoding) and a query parameter (x-id) that are included in the SigV4 signed header set. Google Cloud Storage does not include these when verifying the signature, causing SignatureDoesNotMatch errors. Set strip_sdk_headers: true to remove them before request signing:

    strip_sdk_headers: true   # required for GCS HMAC interoperability

For GCS backends, you typically need both disable_checksum: true and strip_sdk_headers: true:

  - name: "gcs"
    endpoint: "https://storage.googleapis.com"
    region: "auto"
    bucket: "my-bucket"
    access_key_id: "GOOG..."
    secret_access_key: "..."
    force_path_style: true
    disable_checksum: true
    strip_sdk_headers: true

Credential source: credential_source selects how the orchestrator obtains credentials for the backend. Default is static, which uses the access_key_id / secret_access_key fields above. Set to default_chain to delegate to the AWS SDK’s default credential chain (env vars, EC2 IMDS, SSO, ~/.aws/credentials, STS assume-role). When default_chain is set, the two key fields must be omitted — leaving stale keys behind is rejected at validation so they cannot silently shadow the SDK-resolved credentials.

Use default_chain when:

  • The orchestrator runs on an EC2 instance with an IAM role attached (IMDS-vended credentials rotate every ~6 hours and cannot be tracked by YAML).
  • Local development uses SSO (aws sso login) instead of long-lived keys.
  • You want the SDK to resolve credentials via STS assume-role chains.
  - name: "aws-prod"
    endpoint: "https://s3.amazonaws.com"
    region: "us-east-1"
    bucket: "my-prod-bucket"
    credential_source: "default_chain"
    # access_key_id / secret_access_key intentionally omitted

Note: the config loader already expands ${ENV_VAR} references at load time, so access_key_id: ${AWS_ACCESS_KEY_ID} covers the env-var case under credential_source: static. Use default_chain for credential sources the loader cannot reach (IMDS, SSO, STS) and for cases where refresh matters.

telemetry

telemetry:
  metrics:
    enabled: true
    path: "/metrics"             # default: /metrics
    # listen: "127.0.0.1:9091"  # serve on separate address (keeps /metrics off the public port)
  tracing:
    enabled: false
    endpoint: "localhost:4317"   # OTLP gRPC endpoint
    insecure: true               # no TLS to collector
    sample_rate: 1.0             # fraction of requests that generate traces (use 0.01–0.1 in production)

By default, metrics are served on the same port as the S3 API; set listen to serve them on a separate address instead. Tracing exports spans via gRPC OTLP (e.g., to Tempo or Jaeger).

Production sample rate guidance: A sample_rate of 1.0 traces every request, which is appropriate for development and low-traffic deployments. For production workloads above ~100 RPS, reduce to 0.01–0.1 to avoid overwhelming the trace backend with storage, network, and CPU overhead. Metrics and logs are unaffected by sample rate.

circuit_breaker

The circuit breaker is always active. These settings tune its sensitivity.

circuit_breaker:
  failure_threshold: 3           # consecutive DB failures before opening (default: 3)
  open_timeout: "15s"            # delay before probing recovery (default: 15s)
  cache_ttl: "60s"               # key→backend cache TTL during degraded reads (default: 60s)
  parallel_broadcast: false      # fan-out reads to all backends in parallel (default: false)
  degraded_broadcast_parallelism: 0 # cap concurrent probes during parallel broadcast; 0 = no cap (default: 0)

When the database is unreachable, the orchestrator enters degraded mode: reads broadcast to all backends (with caching), writes return 503. The circuit automatically recovers when the database comes back.

By default, degraded reads try each backend sequentially. When parallel_broadcast is enabled, all backends are tried concurrently and the first success wins — reducing worst-case read latency from N * backend_timeout to roughly the fastest backend’s response time. Enable this if read latency during outages is critical, but note that each parallel broadcast sends API requests to all backends simultaneously, which counts against monthly usage limits.

For fleets with many configured backends, set degraded_broadcast_parallelism to cap how many backends are probed at once. With a positive value, probes run as a rolling window: the first N launch immediately and each failure replenishes the slot with the next pending backend, so at most N goroutines (and at most N concurrent backend API calls / TLS handshakes) are in flight at any time. The default of 0 preserves the historical “fan out to every backend at once” behaviour.
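
The rolling-window behaviour can be approximated with a bounded worker pool, as in the Python sketch below (Python 3.9+ for cancel_futures). It is an illustration of "first success wins, at most N probes in flight", not the orchestrator's implementation; broadcast_read and the probes mapping (backend name to a callable that returns the object or raises) are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def broadcast_read(probes, parallelism=0):
    """Try backends concurrently and return (backend_name, result) of the first success.

    parallelism caps how many probes run at once; 0 means no cap.
    """
    if not probes:
        raise LookupError("no backends configured")
    workers = parallelism if parallelism > 0 else len(probes)
    pool = ThreadPoolExecutor(max_workers=workers)
    try:
        # All probes are queued, but only `workers` run at a time. When one
        # finishes, its slot is taken by the next queued probe.
        futures = {pool.submit(fn): name for name, fn in probes.items()}
        for done in as_completed(futures):
            try:
                return futures[done], done.result()
            except Exception:
                continue  # this backend failed; keep waiting for another
        raise LookupError("object not found on any backend")
    finally:
        # Drop probes that have not started yet; running ones finish on their own.
        pool.shutdown(wait=False, cancel_futures=True)
```

With parallelism=1 this degenerates to the sequential behaviour, and with parallelism=0 every backend is probed at once.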

The other defaults are sensible for most deployments. Increase cache_ttl if you have many read-heavy clients and want fewer backend round-trips during outages.

backend_circuit_breaker

Per-backend circuit breakers isolate failures at the individual backend level. When a backend’s credentials expire or the provider becomes unreachable, the circuit opens after consecutive failures and the backend is excluded from request routing. A single probe request tests recovery after the timeout elapses. Disabled by default.

backend_circuit_breaker:
  enabled: true
  failure_threshold: 5             # consecutive failures before opening (default: 10)
  open_timeout: "5m"               # delay before probing recovery (default: 5m)

Unlike the database circuit breaker, which triggers degraded mode for the entire system, backend circuit breakers affect only the individual backend. Reads fall back to other replicas, and writes route to other backends with available quota. No extra API calls are made — the breaker trips purely on organic traffic failures.

The s3o_circuit_breaker_state{name="<backend>"} metric tracks each backend’s circuit state (0=closed, 1=open, 2=half-open). Alert on > 0 for individual backends to detect credential or provider issues. Changes to backend_circuit_breaker settings require a restart (they are not hot-reloadable).
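
For readers who want to see the state machine spelled out, here is a minimal Python sketch of a per-backend breaker with the three states reported by the metric. It assumes the behaviour described in this section (open after consecutive failures, a single probe after open_timeout); the class, its method names, and the injectable clock are hypothetical.

```python
import time

CLOSED, OPEN, HALF_OPEN = 0, 1, 2  # same numbering as the metric values


class BackendBreaker:
    def __init__(self, failure_threshold=10, open_timeout=300.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.open_timeout = open_timeout  # seconds
        self.clock = clock
        self.state = CLOSED
        self.failures = 0
        self.opened_at = 0.0

    def allow(self):
        """Return True if a request may be routed to this backend."""
        if self.state == CLOSED:
            return True
        if self.state == OPEN and self.clock() - self.opened_at >= self.open_timeout:
            self.state = HALF_OPEN  # let exactly one probe through
            return True
        return False  # still open, or a probe is already in flight

    def record(self, ok):
        """Feed the outcome of a real request back into the breaker."""
        if ok:
            self.state, self.failures = CLOSED, 0
            return
        self.failures += 1
        if self.state == HALF_OPEN or self.failures >= self.failure_threshold:
            self.state, self.opened_at = OPEN, self.clock()
```

Only organic request outcomes are passed to record, which matches the note that the breaker makes no extra API calls.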

rebalance

Moves objects between backends to optimize storage distribution. Disabled by default — enabling it will generate egress/ingress traffic on your backends.

rebalance:
  enabled: true
  strategy: "pack"               # "pack" or "spread" (default: pack)
  interval: "6h"                 # default: 6h
  batch_size: 100                # objects per run (default: 100)
  threshold: 0.1                 # min utilization spread to trigger (default: 0.1)
  concurrency: 5                 # parallel moves per run (default: 10)
  • pack — fills backends in config order, consolidating free space onto the last backend. Good for maximizing free-tier allocations.
  • spread — equalizes utilization ratios across all backends. Good for distributing load.

Object moves run concurrently within each batch, bounded by concurrency. Increase for faster rebalancing; decrease to reduce backend load.

replication

Creates additional copies of objects on different backends for redundancy.

replication:
  factor: 2                      # copies per object (default: 1 = no replication)
  worker_interval: "5m"          # replication cycle (default: 5m)
  batch_size: 50                 # objects per cycle (default: 50)
  concurrency: 5                 # parallel object replications per cycle (default: 10)
  unhealthy_threshold: "10m"     # grace period before replacing copies on circuit-broken backends (default: 10m)

The replication factor must be <= number of backends. The worker runs once at startup to catch up on any pending replicas, then continues at the configured interval. Reads automatically fail over to replicas if the primary copy is unavailable.

Replication is asynchronous — writes go to a single backend and the replicator creates additional copies in the background. When a client overwrites an existing key, all old copies (including replicas) are removed and a single new copy is written. The replication factor drops to 1 until the next replicator cycle creates the additional copies. If the single backend holding the new copy fails before replication runs, the new version of the object is at risk. For most workloads this window (up to worker_interval) is acceptable. Lowering worker_interval reduces the exposure at the cost of more frequent DB queries and backend I/O.

Health-aware replication: When backend circuit breakers are enabled, the replicator monitors backend health. If a backend’s circuit breaker has been open longer than unhealthy_threshold, copies on that backend are treated as unavailable and replacement copies are created on healthy backends. This prevents sustained outages from silently reducing redundancy. The threshold prevents churn during brief transient failures. Set to 0 to disable health-aware replication (copies on down backends are still counted).

Cleanup Queue

The cleanup queue is always active. Tunables:

cleanup_queue:
  concurrency: 10                # parallel cleanup deletions per tick (default: 10)
  claim_grace_period: 5m         # reclaim stale per-row claims older than this (default: 5m)
  multipart_stale_timeout: 24h   # abort multipart uploads older than this (default: 24h)

multipart_stale_timeout is consumed by the hourly CleanupStaleMultipartUploads sweep — uploads that have been open longer than this are aborted, their parts deleted from the backend, and the multipart rows removed. The default 24h matches the AWS S3 SDK’s default abort behavior; lower it on backends with tight free-tier headroom to recover quota faster.

When any backend object deletion fails during normal operations (PutObject orphan cleanup, DeleteObject, overwrite displaced copies, multipart part cleanup, rebalancer, replicator), the failed deletion is automatically enqueued for retry.

Each enqueued item tracks the object’s size_bytes. On enqueue, the backend’s orphan_bytes counter is incremented so that write routing and replication target selection account for the physically unreleased space. On successful cleanup the row is removed and orphan_bytes is decremented in a single atomic CTE; a worker crash between the two operations cannot leave the counter inconsistent.

Per-row claim pattern. Every row carries claimed_at and claimed_by columns. When a worker tick fetches a batch it stamps each row with the current instance’s identifier and timestamp, gated by FOR UPDATE SKIP LOCKED (Postgres) or SQLite’s intrinsic single-writer serialisation. Two instances ticking concurrently always see disjoint row sets, so a connection death or rolling-deploy overlap that would otherwise let two workers process the same row is now structurally impossible. A claim older than claim_grace_period (default 5m) is reclaimable so a worker that died mid-process does not leave the row stuck; reclaims emit s3o_cleanup_queue_stale_claims_recovered_total and a cleanup_queue.claim_recovered audit event.

The background worker runs every minute and retries with exponential backoff (1 minute to 24 hours). Scheduling a retry clears the row’s claim so it is immediately re-eligible for the next tick. After 10 failed attempts, the row is graduated to the cleanup_dlq table via core.MoveCleanupToDLQ (single transaction: read the row, insert it into cleanup_dlq, delete it from cleanup_queue). orphan_bytes is intentionally NOT decremented during the move because the backend object is still on disk. The DLQ entry retains the full row payload (key, backend, size, reason, last_error) plus an original_id correlation column so an operator can find the original queue entry.
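
The retry schedule can be sketched as follows. The bounds (1 minute, 24 hours, 10 attempts) come from the paragraph above; the doubling factor is an assumption, since the text only says the backoff is exponential, and next_step is a hypothetical name.

```python
MIN_DELAY_S = 60            # 1 minute
MAX_DELAY_S = 24 * 60 * 60  # 24 hours
MAX_ATTEMPTS = 10


def next_step(attempts):
    """Decide what happens after `attempts` failed cleanup attempts.

    Returns ("retry", delay_seconds) or ("dlq", None).
    """
    if attempts >= MAX_ATTEMPTS:
        return ("dlq", None)  # graduate the row to cleanup_dlq
    # Assumed doubling: 1m, 2m, 4m, ... capped at 24h.
    delay = min(MIN_DELAY_S * 2 ** (max(attempts, 1) - 1), MAX_DELAY_S)
    return ("retry", delay)


if __name__ == "__main__":
    for n in range(1, MAX_ATTEMPTS + 1):
        print(n, next_step(n))
```

Under the doubling assumption the ninth failure schedules a retry a little over four hours out, so the 24-hour cap only matters if the real growth factor is steeper.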

Monitoring:

  • s3o_cleanup_queue_depth staying elevated — orphaned objects are accumulating in the active queue.
  • s3o_cleanup_queue_processed_total{status="exhausted"} — counter increments each time an item exhausts retries.
  • s3o_cleanup_queue_processed_total{status="success_absent"} — counter increments each time a backend DELETE returned 404 and the row was dropped as idempotent success (the backend already agrees the object is gone). A sustained rate here is benign and just means upstream PUTs are silently failing somewhere; spikes are worth correlating with backend health.
  • s3o_cleanup_queue_stale_claims_recovered_total{backend} — non-zero rate means a worker died mid-process or the grace period is too short for realistic worst-case processing time.
  • s3o_cleanup_dlq_depth > 0 — the DLQ holds at least one unrecoverable orphan; alerting here gives operators a direct signal instead of a counter delta.
  • s3o_cleanup_dlq_enqueued_total{backend} — rate of graduations per backend; a single backend dominating means that backend’s delete path is broken.
  • s3o_cleanup_enqueue_failures_total{backend,reason,stage} — orphan-leak blind spot signal. The cleanup-queue itself is durable, but its enqueue path is best-effort: when a backend write succeeds and the DB is then unreachable, the orphan cannot be recorded in cleanup_queue and the only signal is this counter plus the matching storage.OrphanEnqueueFailed audit event. stage="enqueue" is the worst case (the cleanup-queue worker will never see this orphan); stage="orphan_bytes" means the row landed but the quota counter drifts. See the runbook below.
  • s3o_quota_orphan_bytes — elevated values mean backends have significant physically unreleased space (DLQ entries are the long-tail contributors).

Untracked-orphan recovery (cleanup enqueue failed during DB outage). A non-zero rate of s3o_cleanup_enqueue_failures_total{stage="enqueue"} means at least one orphan exists on a backend with no cleanup_queue row. The cleanup-queue worker will not retry it; the storage will leak until reconciled. Recovery workflow:

  1. Query the audit log for event="storage.OrphanEnqueueFailed" to enumerate the specific backend/key/size of each affected orphan during the outage window.
  2. Once DB connectivity is restored, run POST /admin/api/reconcile[?backend=name]. The reconciler walks each backend’s actual key list against object_locations using a bounded-memory sorted-merge and emits S3-only keys to the cleanup path (with a fresh cleanup_queue row this time). This is the same diff machinery that runs on the nightly reconcile interval.
  3. If the audit log indicates more than a handful of failures, target the reconciler at the affected backends specifically rather than waiting for the next scheduled scan.

stage="orphan_bytes" failures do not need step 2 — the cleanup_queue row landed and the worker will eventually delete the object. The quota counter drift is reset when backend_quotas.orphan_bytes is reconciled against cleanup_queue (a periodic safety pass; not yet automated).

Manual cleanup: Inspect DLQ entries and resolve them deliberately. The bytes are still on the backend, so first delete the object out-of-band, then remove the DLQ row and reduce orphan_bytes by the row’s size:

-- View unrecoverable orphans needing manual intervention
SELECT id, original_id, backend_name, object_key, reason, attempts,
       size_bytes, first_enqueued_at, moved_at, last_error
FROM cleanup_dlq
ORDER BY moved_at;

-- After confirming the object is gone (manual S3 delete, reconciler sweep, etc.):
BEGIN;
UPDATE backend_quotas
   SET orphan_bytes = GREATEST(0, orphan_bytes - (SELECT size_bytes FROM cleanup_dlq WHERE id = 42))
 WHERE backend_name = (SELECT backend_name FROM cleanup_dlq WHERE id = 42);
DELETE FROM cleanup_dlq WHERE id = 42;
COMMIT;

-- Or, to push a DLQ entry back through automatic retry (e.g. after fixing the backend).
-- Run both statements in one transaction so the entry is neither lost nor duplicated:
BEGIN;
INSERT INTO cleanup_queue (backend_name, object_key, reason, size_bytes, next_retry, attempts, last_error)
SELECT backend_name, object_key, reason, size_bytes, NOW(), 0, last_error
  FROM cleanup_dlq WHERE id = 42;
DELETE FROM cleanup_dlq WHERE id = 42;
COMMIT;

write_path

The write path can run in two modes. Direct mode (enabled: false) writes to the backend and commits the metadata immediately afterward; a crash between the two leaks bytes onto the backend with no DB record. Pending-intent mode (enabled: true, the default) inserts a row into pending_objects before the backend PUT and atomically deletes that row when the metadata commits — so a crash between the PUT and the commit leaves a recoverable intent the background reaper can resolve.

write_path:
  pending_pattern:
    enabled: true        # default: true; PUT-before-COMMIT crash-recovery pattern
    reaper_tick: 1m      # how often PendingReaper sweeps unresolved intents (default: 1m)
    min_age: 5m          # only intents older than this are eligible (default: 5m) — avoids racing in-flight PUTs
    batch_size: 50       # rows claimed per tick (default: 50)

How recovery works. On every tick the PendingReaper worker (internal/worker/pending.go) claims a batch of pending_objects rows older than min_age, HEADs the backend at the recorded key, and resolves each one:

  • HEAD 200 → the backend received the bytes. Promote the intent to a committed object_locations row (pending_reaper.promoted audit event).
  • HEAD 404 → the backend never received the bytes. Drop the intent (pending_reaper.dropped audit event). No orphan exists.
  • Non-404 HEAD error → leave the intent for the next tick. A sustained backend reachability problem here surfaces as s3o_pending_intents_resolved_total{status="ambiguous"}.
  • A later write for the same key already committed → drop the intent as superseded (pending_reaper.superseded).
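
The four outcomes above amount to a small decision table, sketched here in Python. The function name is hypothetical, and checking for a superseding write before looking at the HEAD result is an assumption about ordering.

```python
def resolve_intent(head_status, superseded=False):
    """Map a backend HEAD result to the reaper outcome for one pending intent.

    head_status is the HTTP status of the HEAD call, or None when the call
    failed without a response (timeout, connection error).
    """
    if superseded:
        return "superseded"  # a later write for the key already committed
    if head_status == 200:
        return "promoted"    # bytes are on the backend: commit the location
    if head_status == 404:
        return "dropped"     # bytes never arrived: no orphan exists
    return "ambiguous"       # leave the row for the next tick
```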

Why min_age matters. The reaper must not race the foreground write path; if min_age is too short the reaper can interrogate an intent whose backend PUT is still in flight and either prematurely commit it or churn ambiguous resolutions. The 5-minute default is generous; lower it only if you have measured the p99 PUT duration and accept the operational tradeoff.

Monitoring:

  • s3o_pending_intents_enqueued_total — should track the PutObject rate closely.
  • s3o_pending_intents_resolved_total{status}: committed is the happy path (synchronous commit succeeded); promoted + dropped are reaper resolutions; ambiguous is the alert.
  • s3o_pending_intents_depth — gauge of unresolved intents. Alert when consistently above batch_size — the reaper is not keeping up (raise batch_size, lower reaper_tick, or add concurrency).
  • Audit events: pending_reaper.promoted / pending_reaper.dropped / pending_reaper.superseded.

When to disable. Don’t, unless you are running an embedded SQLite single-instance demo and trust the OS to flush. The pattern adds one DB write per PUT (cheap) and saves you from one entire class of write-path crash leak.

rate_limit

Per-IP token bucket rate limiting. When enabled, rate limiting applies to both the S3 proxy and the admin API. Requests exceeding the limit receive 429 SlowDown.

rate_limit:
  enabled: true
  requests_per_sec: 100          # token refill rate (default: 100)
  burst: 200                     # max burst size (default: 200)
  cleanup_interval: "1m"         # stale entry eviction interval (default: 1m)
  cleanup_max_age: "5m"          # evict entries not seen within this window (default: 5m)
  trusted_proxies:               # CIDRs whose X-Forwarded-For is trusted
    - "10.0.0.0/8"
    - "172.16.0.0/12"

A background goroutine evicts per-IP entries not seen within cleanup_max_age every cleanup_interval. Under high source-IP cardinality (e.g., DDoS), the map can hold up to cleanup_max_age worth of unique IPs — tune both values down if memory pressure is a concern.
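
To show how the settings interact, here is a minimal per-IP token bucket with stale-entry eviction in Python. It is a sketch of the general technique, not the orchestrator's code: the class and method names are hypothetical, the parameters mirror requests_per_sec, burst, and cleanup_max_age, and evict_stale stands in for the pass that runs every cleanup_interval.

```python
import time


class PerIPLimiter:
    def __init__(self, rate=100.0, burst=200, max_age=300.0, clock=time.monotonic):
        self.rate = rate        # tokens added per second (requests_per_sec)
        self.burst = burst      # bucket capacity
        self.max_age = max_age  # seconds (cleanup_max_age)
        self.clock = clock
        self.buckets = {}       # ip -> (tokens, last_seen)

    def allow(self, ip):
        """Spend one token for ip; False means the request should be rejected."""
        now = self.clock()
        tokens, last = self.buckets.get(ip, (float(self.burst), now))
        tokens = min(self.burst, tokens + (now - last) * self.rate)  # refill
        allowed = tokens >= 1.0
        self.buckets[ip] = (tokens - 1.0 if allowed else tokens, now)
        return allowed

    def evict_stale(self):
        """Drop entries not seen within max_age; returns how many were removed."""
        cutoff = self.clock() - self.max_age
        stale = [ip for ip, (_, last) in self.buckets.items() if last < cutoff]
        for ip in stale:
            del self.buckets[ip]
        return len(stale)
```

The map grows by one entry per distinct source IP until evict_stale runs, which is why lowering both cleanup values bounds memory under high IP cardinality.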

When trusted_proxies is configured, the orchestrator extracts the real client IP from the X-Forwarded-For header using rightmost-untrusted extraction: it walks the XFF chain from right to left, skipping addresses within trusted CIDRs, and uses the first untrusted address for rate limiting. If the direct peer is not in a trusted CIDR, X-Forwarded-For is ignored entirely to prevent spoofing. Without trusted_proxies, the direct connection IP is always used.
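
Rightmost-untrusted extraction is easy to get subtly wrong, so here is a short Python sketch of the walk described above, using the standard ipaddress module. The function name is hypothetical, and two details are assumptions: unparseable entries count as untrusted, and the direct peer is used when every hop in the chain is trusted.

```python
import ipaddress


def client_ip(peer, xff, trusted_cidrs):
    """Pick the address to rate-limit on, using rightmost-untrusted extraction."""
    nets = [ipaddress.ip_network(c) for c in trusted_cidrs]

    def trusted(addr):
        try:
            ip = ipaddress.ip_address(addr)
        except ValueError:
            return False  # assumption: unparseable entries are never trusted
        return any(ip in net for net in nets)

    # Without trusted proxies, or when the direct peer is not one of them,
    # the header is ignored entirely to prevent spoofing.
    if not nets or not xff or not trusted(peer):
        return peer
    # Walk the chain right to left and stop at the first untrusted hop.
    for hop in reversed([h.strip() for h in xff.split(",")]):
        if hop and not trusted(hop):
            return hop
    return peer  # assumption: every hop was a trusted proxy
```

A client that prepends a fake address to X-Forwarded-For cannot influence the result, because the walk stops at the rightmost untrusted address, which is the one appended by the first trusted proxy.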

Multi-instance note: Rate limits are enforced per-instance. Behind a load balancer with round-robin routing, the effective rate for a given client is requests_per_sec * instance_count. Divide your desired aggregate rate by the number of API instances when configuring.

ui

Built-in web dashboard for operational visibility and management. Disabled by default. Requires authentication via an admin key/secret pair — sessions are HMAC-signed cookies with a 24-hour TTL.

ui:
  enabled: true
  path: "/ui"                          # URL prefix (default: /ui)
  admin_key: "${UI_ADMIN_KEY}"         # access key for dashboard login
  admin_secret: "${UI_ADMIN_SECRET}"   # secret key — plaintext or bcrypt hash
  admin_token: "${UI_ADMIN_TOKEN}"     # separate token for admin API (defaults to admin_key)
  session_secret: "${UI_SESSION_SECRET}" # required — HMAC key for session cookies
  force_secure_cookies: true           # always set Secure flag on cookies (for behind TLS proxy)

admin_key, admin_secret, and session_secret are all required when enabled is true. Generate credentials the same way as bucket credentials:

echo "Admin Key: $(openssl rand -hex 10 | tr '[:lower:]' '[:upper:]')"
echo "Admin Secret: $(openssl rand -base64 30)"

Bcrypt-hashed secrets: For bare-metal deployments where the config file is at rest on disk, you can store admin_secret as a bcrypt hash instead of plaintext. The orchestrator detects bcrypt hashes automatically (they start with $2). Generate one with:

htpasswd -nbBC 10 "" 'your-secret' | cut -d: -f2

Both plaintext and bcrypt secrets are fully supported — no config migration needed.

Session secret: Session keys are derived deterministically from session_secret using HMAC-SHA256, so sessions survive restarts. For multi-instance deployments behind a load balancer, all instances sharing the same session_secret will accept each other’s sessions. Generate a value with:

openssl rand -hex 32

session_secret is independent of admin_secret — rotating the admin password does not invalidate active sessions, and vice versa.
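To make the idea of deterministic, restart-safe sessions concrete, here is a minimal sketch of an HMAC-signed cookie with a 24-hour TTL. The cookie layout, the "session-signing" derivation label, and the function names are all assumptions made for illustration; the orchestrator's actual format may differ.

```python
import base64
import hashlib
import hmac
import time

SESSION_TTL = 24 * 60 * 60  # 24-hour TTL, as stated above


def derive_key(session_secret):
    # Assumption: the "session-signing" label is invented for this sketch.
    return hmac.new(session_secret.encode(), b"session-signing",
                    hashlib.sha256).digest()


def sign_session(session_secret, user, now=None):
    expires = int(time.time() if now is None else now) + SESSION_TTL
    payload = f"{user}|{expires}".encode()
    mac = hmac.new(derive_key(session_secret), payload, hashlib.sha256).digest()
    return (base64.urlsafe_b64encode(payload).decode() + "." +
            base64.urlsafe_b64encode(mac).decode())


def verify_session(session_secret, cookie, now=None):
    """Return the user name if the cookie is authentic and unexpired, else None."""
    try:
        p64, m64 = cookie.split(".")
        payload = base64.urlsafe_b64decode(p64)
        mac = base64.urlsafe_b64decode(m64)
        user, expires = payload.decode().rsplit("|", 1)
        expires = int(expires)
    except (ValueError, UnicodeDecodeError):
        return None
    expected = hmac.new(derive_key(session_secret), payload,
                        hashlib.sha256).digest()
    if not hmac.compare_digest(mac, expected):  # constant-time comparison
        return None
    if (time.time() if now is None else now) >= expires:
        return None
    return user
```

Because the key depends only on session_secret, any instance configured with the same value can verify a cookie issued by another, and nothing in the check involves admin_secret.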

usage_flush

Controls how often usage counters are flushed to the database. When adaptive flushing is enabled, the interval shortens automatically when any backend approaches a usage limit, improving enforcement accuracy.

usage_flush:
  interval: "30s"            # base flush interval (default: 30s)
  adaptive_enabled: true     # shorten interval when near limits (default: false)
  adaptive_threshold: 0.8    # usage ratio to trigger fast flush (default: 0.8)
  fast_interval: "5s"        # interval when near limits (default: 5s)
  • interval — how often counters are flushed under normal conditions. Lower values reduce staleness but increase database writes.
  • adaptive_enabled — when true, the flush interval drops to fast_interval whenever any backend’s effective usage exceeds adaptive_threshold of its configured limit.
  • adaptive_threshold — the ratio (0–1 exclusive) at which fast flushing kicks in. At 0.8, a backend at 80% of any usage limit triggers the fast interval.
  • fast_interval — must be less than interval. Used when adaptive flushing detects a backend near its limits.
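The interaction of the four settings can be sketched as a single selection function. This is a simplified model, not the orchestrator's code: flush_interval and its usage_ratios argument are hypothetical, and treating a ratio exactly equal to the threshold as "near the limit" is an assumption based on the 80% example above.

```python
def flush_interval(usage_ratios, interval=30.0, fast_interval=5.0,
                   adaptive_enabled=False, adaptive_threshold=0.8):
    """Pick the next flush interval in seconds.

    usage_ratios: for each backend, its highest usage/limit ratio
    (hypothetical input for this sketch).
    """
    if fast_interval >= interval:
        raise ValueError("fast_interval must be less than interval")
    if not 0.0 < adaptive_threshold < 1.0:
        raise ValueError("adaptive_threshold must be between 0 and 1, exclusive")
    # Any single backend near a limit is enough to switch to fast flushing.
    if adaptive_enabled and any(r >= adaptive_threshold for r in usage_ratios):
        return fast_interval
    return interval
```

With the defaults shown in the YAML above, backends at 20% and 85% of their limits would flush every 5 seconds, while backends at 20% and 50% would stay on the 30-second base interval.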

Multi-instance note: Without Redis, each instance accumulates usage counters in memory between flushes. With N instances, the enforcement margin near limits is up to N * interval worth of unaccounted operations. Adaptive flushing reduces this near limits but doesn’t eliminate it. For tighter enforcement, configure Redis shared counters to eliminate the cross-instance blind spot entirely, or reduce interval and run fewer API instances.

redis

Optional shared usage counters for multi-instance deployments. When configured, all instances share usage counters via Redis instead of tracking them independently in memory. This eliminates the cross-instance blind spot between PostgreSQL flushes.

redis:
  address: "redis.example.com:6379"  # host:port (required when section is present)
  password: "${REDIS_PASSWORD}"       # AUTH password (omit for no auth)
  db: 0                               # Redis database number (default: 0)
  tls: false                          # enable TLS (default: false)
  key_prefix: "s3orch"                # namespace for multi-tenant Redis (default: s3orch)
  failure_threshold: 3                # consecutive failures before local fallback (default: 3)
  open_timeout: "15s"                 # delay before probing Redis recovery (default: 15s)
  • address — required when the redis section is present. The orchestrator PINGs Redis on startup and fails hard if unreachable.
  • key_prefix — namespaces all Redis keys. Use different prefixes if multiple orchestrator deployments share one Redis instance.
  • failure_threshold and open_timeout — control the circuit breaker that falls back to local counters when Redis is unavailable.

When Redis is active, the usage flush service acquires a PostgreSQL advisory lock so only one instance performs the destructive GETSET + flush-to-PG operation. When Redis is in fallback (or not configured), each instance flushes independently without a lock.

A background health probe runs every 5 seconds while the breaker is open: it PINGs Redis and, on success, syncs the accumulated local-counter deltas back via an additive INCRBY pipeline (no DEL — keys from before the outage expire via TTL) and recloses the breaker. The breaker recovery is clean: the failure counter is zeroed so the system tolerates the configured failure_threshold of new transient errors before tripping again. No process restart is required after a Redis outage.
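The breaker behaviour described above can be modelled with a small state machine. This is a sketch under stated assumptions: RedisBreaker and its methods are hypothetical names, and ping and incrby are injected callables standing in for a real Redis client so the example stays self-contained.

```python
class RedisBreaker:
    """Illustrative circuit breaker with local-counter fallback."""

    def __init__(self, failure_threshold=3, open_timeout=15.0):
        self.failure_threshold = failure_threshold
        self.open_timeout = open_timeout
        self.failures = 0       # consecutive failures seen so far
        self.opened_at = None   # None means closed (Redis in use)
        self.local_deltas = {}  # counters accumulated while open

    @property
    def is_open(self):
        return self.opened_at is not None

    def record_success(self):
        self.failures = 0

    def record_failure(self, now):
        self.failures += 1
        if self.failures >= self.failure_threshold and not self.is_open:
            self.opened_at = now  # fall back to local counters

    def add_local(self, key, n):
        self.local_deltas[key] = self.local_deltas.get(key, 0) + n

    def probe(self, ping, incrby, now):
        """Called periodically while open; returns True once recovered."""
        if not self.is_open or now - self.opened_at < self.open_timeout:
            return False
        if not ping():
            return False
        # Additive sync: push local deltas back without deleting any keys.
        for key, delta in self.local_deltas.items():
            incrby(key, delta)
        self.local_deltas.clear()
        self.failures = 0   # clean recovery: full threshold available again
        self.opened_at = None
        return True
```

Syncing with increments rather than overwriting means counts recorded by other instances during the outage are preserved.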

Redis is not reloadable — changing Redis settings requires a restart.

lifecycle

Automatically deletes objects whose key matches a prefix and whose age exceeds the configured expiration. Useful for temporary uploads, staging artifacts, or anything with a known retention period.

lifecycle:
  rules:
    - prefix: "tmp/"
      expiration_days: 7
    - prefix: "uploads/staging/"
      expiration_days: 1
  • prefix — key prefix to match (required, must be non-empty).
  • expiration_days — delete objects older than this many days (required, must be > 0).
  • Omit the lifecycle section or leave rules empty to disable lifecycle entirely.
  • Rules are evaluated every hour by a background worker with an advisory lock.
  • Deletions go through the standard DeleteObject path — all copies removed, quotas decremented, failed deletes enqueued to the cleanup queue.
  • Hot-reloadable via SIGHUP.
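The validation and matching rules above amount to a prefix check plus an age comparison. The following sketch shows one way to express them; validate_rules and is_expired are hypothetical names, and rules are modelled as plain dictionaries mirroring the YAML.

```python
from datetime import datetime, timedelta


def validate_rules(rules):
    """Reject rules that break the constraints listed above."""
    for r in rules:
        if not r.get("prefix"):
            raise ValueError("prefix is required and must be non-empty")
        days = r.get("expiration_days")
        if not isinstance(days, int) or days <= 0:
            raise ValueError("expiration_days is required and must be > 0")


def is_expired(key: str, created_at: datetime, rules, now: datetime) -> bool:
    """True if any rule's prefix matches and the object is older than its limit."""
    return any(
        key.startswith(r["prefix"])
        and now - created_at > timedelta(days=r["expiration_days"])
        for r in rules
    )
```

An object under tmp/ created eight days ago would be selected for deletion by the first example rule, while an object outside both prefixes is never touched regardless of age.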

encryption

Server-side envelope encryption with chunked AES-256-GCM. When enabled, objects are encrypted before being stored on backends and decrypted transparently on read. Exactly one key source is required.

encryption:
  enabled: true
  chunk_size: 65536                    # default: 64KB (range: 4KB–1MB, must be power of 2)
  master_key: "${ENCRYPTION_KEY}"      # base64-encoded 256-bit key

Generating a master key:

openssl rand -base64 32

Key source options — exactly one of the following must be set:

| Source | Config field | When to use |
| --- | --- | --- |
| Inline | master_key | Base64-encoded 256-bit key in config or env var. Simplest option. |
| File | master_key_file | Path to a file containing exactly 32 raw bytes. Good for bare-metal with config management. |
| Vault Transit | vault | Delegate key wrapping/unwrapping to HashiCorp Vault. Best for production with HSM-backed key management. |

Vault Transit configuration:

encryption:
  enabled: true
  vault:
    address: "http://vault.service.consul:8200"
    token: "${VAULT_TOKEN}"
    key_name: "s3-orchestrator"
    mount_path: "transit"     # default: transit

The Vault Transit engine handles wrapping and unwrapping DEKs — the orchestrator never sees the master key material. The key_name must reference an existing key in the Transit engine.

Key rotation support:

When rotating to a new master key, move the old key to previous_keys so existing objects can still be decrypted:

encryption:
  enabled: true
  master_key: "${NEW_ENCRYPTION_KEY}"         # new primary key
  previous_keys:
    - "${OLD_ENCRYPTION_KEY}"                 # old key, kept for unwrapping

After updating the config, call the rotate-encryption-key admin API to re-wrap all DEKs with the new key. See Rotating encryption keys below.

Important notes:

  • Encryption is not reloadable — changing encryption settings requires a restart.
  • The chunk_size must stay the same for the lifetime of the data. Changing it after objects are encrypted will make those objects unreadable.
  • Encrypted objects are slightly larger than their plaintext (header + per-chunk overhead). The exact overhead is: 32 bytes (header) + 28 bytes per chunk (nonce + auth tag).
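The overhead formula above is easy to turn into a capacity estimate. In this sketch the function name encrypted_size is hypothetical, and two details are assumptions: a partial final chunk is counted as a full chunk for overhead purposes, and an empty object is counted as zero chunks.

```python
HEADER_BYTES = 32      # fixed header
PER_CHUNK_BYTES = 28   # nonce + auth tag per chunk


def encrypted_size(plaintext_bytes, chunk_size=65536):
    """Estimate stored size for a plaintext of the given length."""
    is_power_of_two = chunk_size & (chunk_size - 1) == 0
    if not (4096 <= chunk_size <= 1048576 and is_power_of_two):
        raise ValueError("chunk_size must be a power of 2 between 4KB and 1MB")
    # Integer ceiling division; a partial last chunk still carries overhead.
    chunks = -(-plaintext_bytes // chunk_size)
    return plaintext_bytes + HEADER_BYTES + PER_CHUNK_BYTES * chunks


print(encrypted_size(65536))    # one full chunk: 65536 + 32 + 28 = 65596
print(encrypted_size(1048576))  # 16 chunks: 1048576 + 32 + 448 = 1049056
```

At the default 64KB chunk size the per-chunk overhead is about 0.04% of the payload, so smaller chunk sizes increase the relative overhead proportionally.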

integrity

SHA-256 content hashing for data integrity verification. When enabled, objects are checksummed on write and the hash is stored alongside the object location in PostgreSQL.

integrity:
  enabled: true
  verify_on_read: true               # Hash-check every GET response as it streams
  verify_on_replicate: true          # Verify hash when creating replicas (default: true)
  scrubber_interval: "6h"            # Background verification interval (0 = disabled)
  scrubber_batch_size: 100           # Objects per scrub cycle

How it works:

  • Write path: SHA-256 is computed on the plaintext body (before encryption) and stored in object_locations.content_hash.
  • Read path (verify_on_read): A VerifyingReader wraps the response body and computes the hash as data streams to the client. On mismatch at EOF, the corrupted copy is enqueued for cleanup.
  • Scrubber: A background worker periodically reads random objects from backends, decrypts if needed, and verifies their hash. Corrupted copies are enqueued for cleanup. Each read counts against the backend’s usage quota.
  • Backfill: Objects written before integrity was enabled have no stored hash. Use admin backfill-checksums to read those objects and compute their hashes.
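The read-path check can be pictured as a thin wrapper around the response body. The class below borrows the VerifyingReader name from the text, but its interface is an assumption made for this sketch: on_mismatch is a hypothetical callback standing in for "enqueue the corrupted copy for cleanup".

```python
import hashlib
import io


class VerifyingReader:
    """Hashes data as it streams and checks the digest at EOF (sketch only)."""

    def __init__(self, body: io.BufferedIOBase, expected_hex: str, on_mismatch):
        self._body = body
        self._expected = expected_hex
        self._on_mismatch = on_mismatch  # hypothetical cleanup hook
        self._hash = hashlib.sha256()
        self._checked = False

    def read(self, n=-1):
        if n == 0:
            return b""
        chunk = self._body.read(n)
        self._hash.update(chunk)
        # EOF is reached on an empty read or after an unbounded read.
        at_eof = not chunk or n is None or n < 0
        if at_eof and not self._checked:
            self._checked = True
            if self._hash.hexdigest() != self._expected:
                self._on_mismatch()
        return chunk
```

Because the digest is only known once the last byte has been read, a mismatch is detected after the data has already been streamed, which is why the response is to flag the copy rather than to abort the read up front.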

Integrity is hot-reloadable — changes take effect on SIGHUP without a restart.

cache

Optional in-memory LRU cache for full GET responses. Reduces backend API calls and egress by serving repeated reads from memory. Per-instance only — not shared across instances.

cache:
  enabled: true
  max_size: "256MB"            # total cache capacity (default: 256MB)
  max_object_size: "10MB"      # largest cacheable object (default: 10MB)
  ttl: "5m"                    # per-entry time-to-live (default: 5m)
  • max_size — total memory the cache may consume. Size this based on available container memory after accounting for the Go heap, connection pools, and streaming buffers. A good starting point is 10-25% of the container’s memory allocation.
  • max_object_size — objects larger than this are never admitted to the cache. Prevents a single large object from evicting many smaller frequently-accessed objects. Set this at or above the typical “hot” object size in your workload so that frequently read objects are actually admitted.
  • ttl — maximum time an entry stays cached before automatic expiry. In multi-instance deployments, this bounds how stale a cached object can be when writes happen on another instance. Lower values reduce staleness at the cost of more backend requests.

Cache entries are automatically invalidated on PutObject, DeleteObject, CopyObject, DeleteObjects, and CompleteMultipartUpload. Range requests bypass the cache on miss but are served from cache on hit.
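The three settings and the invalidation rule can be illustrated with a small byte-budgeted LRU. This is a sketch rather than the orchestrator's cache: ObjectCache and its methods are hypothetical, and sizes are given in bytes rather than the "256MB" string form used in the YAML.

```python
import time
from collections import OrderedDict


class ObjectCache:
    """Illustrative LRU cache with a byte budget, admission limit, and TTL."""

    def __init__(self, max_size, max_object_size, ttl):
        self.max_size = max_size
        self.max_object_size = max_object_size
        self.ttl = ttl
        self._entries = OrderedDict()  # key -> (body, expires_at), oldest first
        self._used = 0

    def get(self, key, now=None):
        now = time.monotonic() if now is None else now
        entry = self._entries.get(key)
        if entry is None:
            return None
        body, expires_at = entry
        if now >= expires_at:
            self.invalidate(key)  # expired entries are dropped lazily
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return body

    def put(self, key, body, now=None):
        now = time.monotonic() if now is None else now
        self.invalidate(key)
        if len(body) > self.max_object_size or len(body) > self.max_size:
            return False  # never admitted
        # Evict least recently used entries until the new body fits.
        while self._used + len(body) > self.max_size:
            _, (old_body, _) = self._entries.popitem(last=False)
            self._used -= len(old_body)
        self._entries[key] = (body, now + self.ttl)
        self._used += len(body)
        return True

    def invalidate(self, key):
        """What a write or delete on this instance would call."""
        entry = self._entries.pop(key, None)
        if entry is not None:
            self._used -= len(entry[0])
```

Invalidation only reaches the local instance's map, so in a multi-instance deployment the TTL remains the upper bound on staleness for writes handled elsewhere.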

When to enable:

  • Read-heavy workloads where the same objects are fetched repeatedly (thumbnails, config files, assets)
  • Backends with per-request API charges or egress costs
  • High-latency backends where caching improves P99 latency

When to skip:

  • Write-heavy workloads with few repeated reads
  • Workloads where objects are too large to fit meaningfully in memory
  • Single-instance with very low read traffic

The cache is not hot-reloadable — changing cache settings requires a restart. When encryption is enabled, the cache stores post-decryption plaintext.

Integrity metrics:

| Metric | Labels | Description |
| --- | --- | --- |
| s3o_integrity_checks_total | operation | Hash verifications performed (read, scrub) |
| s3o_integrity_errors_total | operation | Hash mismatches detected (read, scrub) |

When ui.enabled is true, the dashboard is served at {path}/ on the same port as the S3 API.

All dashboard responses include security headers (X-Frame-Options: DENY, X-Content-Type-Options: nosniff, Referrer-Policy: strict-origin-when-cross-origin, Content-Security-Policy). The dashboard requires authentication via the configured admin_key/admin_secret — unauthenticated requests are redirected to the login page (HTML) or receive 401 (API).

Configuration hot-reload

The orchestrator supports hot-reloading a subset of configuration by sending SIGHUP to the running process. This lets you update credentials, quotas, rate limits, and other operational settings without restarting the service or dropping client connections.

kill -HUP $(pidof s3-orchestrator)

Reloadable vs non-reloadable settings

| Setting | Reloadable | Notes |
| --- | --- | --- |
| buckets (credentials, limits) | Yes | Credentials and max_multipart_uploads take effect immediately |
| rate_limit | Yes | New visitors get updated rates; existing per-IP limiters expire naturally |
| backends[].quota_bytes | Yes | Synced to database on reload |
| backends[].api_request_limit | Yes | |
| backends[].egress_byte_limit | Yes | |
| backends[].ingress_byte_limit | Yes | |
| rebalance | Yes | Strategy, interval, threshold, concurrency, enabled/disabled |
| replication | Yes | Factor, worker interval, batch size |
| usage_flush | Yes | Interval, adaptive enabled/threshold/fast interval |
| lifecycle | Yes | Rules (prefix, expiration_days) |
| integrity | Yes | Enabled, verify_on_read, scrubber interval/batch size |
| server.listen_addr | No | Requires restart |
| server.max_concurrent_requests | No | Requires restart |
| server.max_concurrent_reads | No | Requires restart |
| server.max_concurrent_writes | No | Requires restart |
| server.load_shed_threshold | No | Requires restart |
| server.admission_wait | No | Requires restart |
| server timeouts | No | read_header_timeout, read_timeout, write_timeout, idle_timeout, shutdown_delay |
| server.tls | No | Requires restart |
| database | No | Requires restart |
| telemetry | No | Requires restart |
| circuit_breaker | No | Requires restart |
| backend_circuit_breaker | No | Requires restart |
| ui | No | Requires restart |
| encryption | No | Requires restart |
| cache | No | Requires restart |
| redis | No | Requires restart |
| routing_strategy | No | Requires restart |
| reconcile | No | Requires restart |
| backends (structural: endpoint, credentials, count) | No | Requires restart |

On a successful reload, the orchestrator logs each reloaded section:

{"level":"INFO","msg":"SIGHUP received, reloading configuration","path":"config.yaml"}
{"level":"INFO","msg":"Reloaded bucket credentials","buckets":2}
{"level":"INFO","msg":"Reloaded rate limits","requests_per_sec":100,"burst":200}
{"level":"INFO","msg":"Reloaded backend quota limits"}
{"level":"INFO","msg":"Reloaded backend usage limits"}
{"level":"INFO","msg":"Reloaded rebalance/replication/usage-flush config"}
{"level":"INFO","msg":"Configuration reload complete"}

If the new config file is invalid, the orchestrator keeps the current configuration and logs the error:

{"level":"ERROR","msg":"Config reload failed, keeping current config","error":"invalid config: ..."}

Non-reloadable field changes are logged as warnings but do not prevent the reload of other settings:

{"level":"WARN","msg":"Config field changed but requires restart to take effect","field":"server.listen_addr"}
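The reload semantics described in this section (keep the old config on a validation failure, apply reloadable sections, warn about the rest) can be summarised in a short sketch. It is a simplified model: apply_reload, load_new, and log are hypothetical names, configs are plain dictionaries, and changes are compared per top-level section. The backends section is left out of the reloadable set because it mixes reloadable limits with structural fields that need a restart.

```python
# Top-level sections the table above marks as fully reloadable.
RELOADABLE = {"buckets", "rate_limit", "rebalance", "replication",
              "usage_flush", "lifecycle", "integrity"}


def apply_reload(current, load_new, log):
    """Return the config to run with after a SIGHUP (illustrative only)."""
    try:
        new = load_new()  # parse and validate; raises ValueError on a bad file
    except ValueError as err:
        log("ERROR", f"Config reload failed, keeping current config: {err}")
        return current

    merged = dict(current)
    for section in sorted(set(current) | set(new)):
        if current.get(section) == new.get(section):
            continue  # unchanged
        if section in RELOADABLE:
            if section in new:
                merged[section] = new[section]
            else:
                merged.pop(section, None)  # section removed from the file
        else:
            # The change is noted but the running value is kept.
            log("WARN", "Config field changed but requires restart "
                        f"to take effect: {section}")
    return merged
```

A change to a non-reloadable section never blocks the reloadable ones, which matches the warning-only behaviour shown in the log line above.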