s3-orchestrator

Performance Tuning

This guide covers configuration knobs that affect throughput, latency, and resource usage.

Connection Pool Sizing

database:
  max_conns: 50          # max pool connections (default: 50)
  min_conns: 10           # min idle connections (default: 10)
  max_conn_lifetime: "5m" # max connection age (default: 5m)

Sizing Formula

Each S3 request uses at least one database connection. Background workers (rebalancer, replicator, cleanup, usage flush) each hold a connection during their tick. Advisory lock acquisition uses a dedicated connection.

max_conns = max_concurrent_requests + (active_workers × 2) + 10
Workloadmax_concurrent_requestsRecommended max_conns
Dev / testing1050 (default)
Low traffic (<100 RPS)5050 (default)
Medium traffic (100–500 RPS)100100–150
High traffic (500+ RPS)200+200+

Guidelines

  • Multi-instance deployment: divide your PostgreSQL max_connections across instances. Each instance’s max_conns should be (max_connections - 20) / instance_count (reserve 20 for superuser and monitoring connections).
  • min_conns keeps idle connections warm. Set it to roughly half of max_conns to avoid connection setup latency on traffic spikes.
  • max_conn_lifetime rotates connections to pick up DNS changes and distribute load across replicas. The 5-minute default works well in most environments.
  • The orchestrator does not export pool-level Prometheus metrics for the pgx connection pool. To check for saturation, query PostgreSQL directly (SELECT count(*) FROM pg_stat_activity WHERE application_name LIKE 's3-orchestrator%') or read the per-process pg_stat_activity row. Orchestrator-side request saturation is best inferred from s3o_inflight_requests and s3o_admission_rejections_total.

Admission Control

server:
  max_concurrent_requests: 1000  # default: 1000 (applied when split pools are not configured)
  # max_concurrent_reads: 0     # separate limit for GET/HEAD (0 = use global)
  # max_concurrent_writes: 0    # separate limit for PUT/POST/DELETE (0 = use global)
  # load_shed_threshold: 0      # 0.0-1.0, start shedding at this capacity ratio (0 = disabled)
  # admission_wait: "0s"        # brief wait before rejection (0 = instant)

Limits the number of S3 requests being processed concurrently. When the limit is reached, new requests are rejected with 503 SlowDown and a Retry-After: 1 header instead of queueing and potentially exhausting backend connections and goroutines.

Global vs split pools

By default, max_concurrent_requests applies a single limit to all operations. For workloads where write storms should not starve reads, set max_concurrent_reads and max_concurrent_writes instead. When both are set, reads (GET, HEAD) and writes (PUT, POST, DELETE) get independent semaphores and cannot block each other.

Active load shedding

When load_shed_threshold is set (e.g. 0.8), the server begins probabilistically rejecting requests before the hard limit. The rejection probability ramps linearly from 0% at the threshold to 100% at full capacity. This provides a smooth degradation curve instead of a cliff at the admission limit.

Admission wait

When admission_wait is set (e.g. 50ms), requests briefly wait for a slot to free up before being rejected. This smooths micro-bursts where a slot opens up within milliseconds. The request’s context deadline is also respected — whichever fires first wins. Set to 0 (default) for instant rejection.

When to set this

  • Default (0): no limit. Fine for low-traffic single-instance deployments.
  • Recommended starting point: set to 2-3x your database.max_conns value. Every S3 operation needs at least one database query, so there’s no benefit admitting more requests than the connection pool can serve.
  • Split pools: if your workload is read-heavy, set max_concurrent_reads higher than max_concurrent_writes to protect read availability during upload bursts.
  • Load shedding: start with load_shed_threshold: 0.8 and adjust based on s3o_load_shed_total rates.
  • Admission wait: 50ms to 100ms is a good starting point for smoothing bursts without adding noticeable latency.

Monitoring

  • s3o_admission_rejections_total — requests rejected at the hard admission limit
  • s3o_load_shed_total — requests probabilistically shed before the hard limit
  • s3o_early_rejections_total — uploads rejected before body transmission (no backend capacity)
  • s3o_inflight_requests — gauge of currently processing requests by method. Use this to find the right limit value.

Backend Timeout

server:
  backend_timeout: "30s"

This timeout applies to each individual S3 API call to a backend (GET, PUT, HEAD, DELETE). It does not cover the full request lifecycle — just the backend leg.

Provider-Specific Recommendations

ProviderRecommended TimeoutNotes
MinIO (local)5sLocal network, should be very fast
AWS S315-30sGenerally fast, occasional slow first-byte for large objects
Backblaze B230sConsistent but not the fastest
OCI Object Storage30-60sCan have slower first-byte times, especially for large objects
Cloudflare R215-30sGenerally fast

If you see frequent context deadline exceeded errors in logs, increase the timeout. If backends are consistently fast, lowering it helps fail fast when a backend is unhealthy.

HTTP Server Timeouts

server:
  read_header_timeout: "10s"  # protects against slowloris attacks
  read_timeout: "5m"          # total time to read request (including body)
  write_timeout: "5m"         # total time to write response
  idle_timeout: "120s"        # keep-alive idle time

Large Object Workloads

For environments that regularly handle multi-gigabyte objects:

  • Increase read_timeout and write_timeout to 15m or 30m. A 5 GB upload at 100 MB/s takes ~50 seconds, but slower networks or larger objects need more headroom.
  • read_header_timeout should stay low (10s) since it only covers headers, not the body.
  • idle_timeout at 120s is generous for most clients. Lower to 60s if you want to reclaim connections faster.

High-Connection Environments

If you serve thousands of concurrent clients:

  • Lower idle_timeout to 30-60s to reduce connection count.
  • Monitor open file descriptors and increase ulimits if needed.

OS and Container Tuning

File descriptor limits

The orchestrator holds many concurrent connections: client requests, backend S3 calls, database pool, and background workers (replicator, rebalancer, cleanup). Set the container’s nofile ulimit to at least 65535 to avoid fd exhaustion under load.

Nomad:

config {
  ulimit {
    nofile = "65535:65535"
  }
}

Kubernetes: Set at the node level or via an init container — pod-level ulimits are not directly configurable in the pod spec.

For hosts running the orchestrator under high concurrency, tune these kernel parameters:

SysctlValuePurpose
net.core.somaxconn65535Pending connection queue size
net.ipv4.ip_local_port_range10000 65535Ephemeral port range for outbound S3 connections
net.ipv4.tcp_tw_reuse1Reuse TIME_WAIT sockets during connection churn
net.ipv4.tcp_fin_timeout15Faster socket reclamation after close

Go runtime: GOMEMLIMIT

Set GOMEMLIMIT to ~90% of the container’s memory allocation so the Go garbage collector can make informed collection decisions. Without it, the runtime has no awareness of the container’s memory ceiling, which can lead to excessive GC (wasted CPU) or insufficient GC (OOM kill).

# Example for a 1024 MB container
GOMEMLIMIT=922MiB

The default GOGC=100 provides smooth GC behavior with GOMEMLIMIT set. Setting GOGC=off minimizes GC CPU overhead but risks latency spikes from large collections — only use it after profiling your workload.

Backend HTTP Transport

Each S3 backend gets a dedicated http.Transport with tuned connection pool settings. These are not configurable via YAML – they are compiled defaults sized for the orchestrator’s concurrent workload (rebalancer, replicator, parallel PUTs/GETs).

SettingValuePurpose
MaxIdleConns100Idle connection pool size per backend
MaxIdleConnsPerHost100Matches MaxIdleConns since each transport talks to one host
MaxConnsPerHost200Hard cap on active connections per backend (prevents unbounded growth during outages)
IdleConnTimeout60sRecycles idle connections, forcing fresh DNS on reconnect
KeepAlive30sTCP keepalive probes to detect stale connections
DialTimeout10sConnection establishment timeout
TLSHandshakeTimeout10sTLS negotiation timeout
ResponseHeaderTimeout30sTime to read response headers from a backend

DNS Freshness

The 60-second IdleConnTimeout addresses DNS staleness for endpoints resolved via Consul DNS or load balancers. When an idle connection is discarded, the next request triggers a fresh DNS lookup, allowing the orchestrator to follow backend endpoint changes without restarts.

Buffer Pool

All streaming operations (GET proxy, PUT body buffering, CopyObject, multipart assembly, UI downloads) use a shared sync.Pool of 32 KB byte buffers via io.CopyBuffer instead of io.Copy. This eliminates per-call buffer allocations that create GC pressure under high concurrency.

The CopyObject and multipart assembly streaming paths additionally wrap io.PipeWriter in pooled 64 KB bufio.Writer buffers, batching small writes to reduce syscall frequency on these background operations. Both pools are sized automatically and require no configuration.

Routing Strategy

routing_strategy: "pack"  # or "spread"

pack (default)

Fills backends in config order. When the first backend reaches its quota, writes overflow to the next.

Best for:

  • Stacking free-tier allocations (fill one provider, then the next)
  • Cost optimization (use the cheapest backend first)
  • Environments where you want predictable backend utilization

spread

Places each write on the backend with the lowest utilization ratio ((bytes_used + orphan_bytes) / bytes_limit).

Best for:

  • Distributing read load across backends (objects spread evenly)
  • Maximizing aggregate throughput (parallel reads from multiple providers)
  • Environments where all backends have similar performance characteristics

Rebalancer Tuning

rebalance:
  enabled: true
  strategy: "spread"       # or "pack"
  interval: "6h"
  batch_size: 100
  threshold: 0.1           # 10% utilization spread before triggering
  concurrency: 5

Tradeoffs

SettingHigher ValueLower Value
batch_sizeMore objects moved per run, more API callsFewer moves, less API cost
concurrencyFaster rebalancing, more backend loadSlower but gentler on backends
thresholdOnly rebalances when significantly unevenRebalances more aggressively
intervalLess frequent checksMore responsive to changes

Cost-Conscious Environments

If backends charge per API request (OCI, B2):

  • Set batch_size: 25-50 to limit API calls per rebalance run
  • Set concurrency: 2-3 to reduce parallel operations
  • Set threshold: 0.2 to only trigger when utilization is notably uneven
  • Set interval: "12h" or "24h" to reduce frequency

Performance-First Environments

If backends are local or have generous API allowances:

  • Set batch_size: 500 for faster convergence
  • Set concurrency: 10-20 for maximum parallelism
  • Set threshold: 0.05 for tighter balance

Replication Worker

replication:
  factor: 2
  worker_interval: "5m"
  batch_size: 50
  concurrency: 5
  unhealthy_threshold: "10m"
  • worker_interval controls how often the worker checks for under-replicated objects. Lower intervals mean faster catch-up after writes but more database queries.
  • batch_size limits objects processed per worker tick. Higher values catch up faster but create more backend I/O.
  • concurrency controls how many objects are replicated in parallel within each tick. Higher values speed up convergence but increase backend load.
  • unhealthy_threshold sets the grace period before the replicator treats copies on a circuit-broken backend as unavailable and creates replacements on healthy backends. Set higher to avoid churn during brief outages; set lower for faster redundancy recovery. Requires backend_circuit_breaker.enabled: true.
  • The worker runs a catch-up pass on startup, so initial replication doesn’t wait for the first interval.

For large backlogs (e.g., enabling replication on an existing dataset), temporarily increase batch_size to 500, set concurrency: 10, and lower worker_interval to "30s", then revert after catch-up.

Cleanup Worker

cleanup_queue:
  concurrency: 10
  • concurrency controls how many orphaned objects are deleted in parallel per worker tick (default: 10). Higher values clear the queue faster during drain operations or sustained backend recovery, but increase backend load.

Usage Flush

usage_flush:
  interval: "30s"           # base flush interval
  adaptive_enabled: false   # default false; set true to shorten interval near limits
  adaptive_threshold: 0.8   # 80% of limit triggers fast flush (when adaptive_enabled: true)
  fast_interval: "5s"       # interval when near limits (when adaptive_enabled: true)

The orchestrator tracks API requests, egress, and ingress in memory and periodically flushes to the database. This is a tradeoff between accuracy and database load:

  • Lower interval: more accurate usage tracking, more database writes
  • Higher interval: less database load, but usage enforcement has more lag
  • Adaptive flushing automatically switches to fast_interval when any backend exceeds adaptive_threshold of its limits. This keeps enforcement tight near limits without paying the cost of frequent flushes during normal operation.

For environments without usage limits, the flush still runs (it powers the dashboard and metrics) but accuracy is less critical. The 30-second default is fine.

Redis Shared Counters

In multi-instance deployments, each instance maintains independent counters by default. Redis replaces per-instance atomics with shared counters, eliminating the cross-instance blind spot:

redis:
  address: "redis.example.com:6379"
  key_prefix: "s3orch"
  failure_threshold: 3
  open_timeout: "15s"

Redis adds ~0.2ms per write operation (pipelined INCRBY) and ~0.2ms per read (GET pipeline). For most workloads the latency is negligible compared to backend S3 calls. If Redis becomes unavailable, the circuit breaker falls back to local counters automatically — no manual intervention needed.

When to use Redis:

  • Running 2+ instances with usage limits enabled
  • High write throughput where the 30s flush gap could cause meaningful overshoot

When Redis is unnecessary:

  • Single instance deployments
  • No usage limits configured
  • Low throughput where the flush interval provides sufficient accuracy

Backend Circuit Breaker

backend_circuit_breaker:
  enabled: true
  failure_threshold: 5   # consecutive failures before opening (default: 10)
  open_timeout: "5m"     # delay before probing recovery (default: 5m)

Per-backend circuit breakers stop sending traffic to a backend after failure_threshold consecutive errors (expired credentials, network failures, provider outages). The circuit opens immediately — no timeout waiting — and all subsequent requests route around the failed backend. After open_timeout, the next organic request is allowed through as a probe.

Tuning Guidelines

SettingHigher ValueLower Value
failure_thresholdMore tolerant of transient errors, slower to openFaster detection, but may trip on brief hiccups
open_timeoutLess probing traffic, slower recovery detectionFaster recovery, more probe requests to a potentially broken backend

Provider-Specific Recommendations

ProviderThresholdTimeoutNotes
AWS S3 / R255mDefaults work well; outages are rare
OCI Object Storage35mOCI can return auth errors in bursts during credential rotation
Backblaze B255mB2 maintenance windows are short
MinIO (local)330sLocal failures are usually configuration issues — probe quickly

When to Enable

  • Multi-backend with replication — highly recommended. Reads fail over to replicas on healthy backends, writes route to other backends, and the replication worker creates replacement copies after a sustained outage (replication.unhealthy_threshold). A single backend failure becomes invisible to clients.
  • Single backend — less useful. The circuit opens but there’s nowhere to fail over to. Requests return ErrBackendUnavailable instead of timing out, which is still an improvement (faster failure).
  • Cost-sensitive environments — the probe request after open_timeout is a real S3 API call. With a 5-minute timeout, that’s at most 12 probes/hour to a down backend. Health-aware replication also creates additional copies during sustained outages, which consume backend I/O and storage.

Monitoring

  • s3o_circuit_breaker_state{name="<backend>"} — 0=closed, 1=open, 2=half-open
  • s3o_circuit_breaker_transitions_total{name="<backend>"} — state change counter

A sustained state=1 for a backend means it’s been unreachable. Check the backend’s credentials and connectivity.

Rate Limiter Memory

rate_limit:
  enabled: true
  requests_per_sec: 100
  burst: 200
  cleanup_interval: "1m"   # how often stale entries are evicted (default: 1m)
  cleanup_max_age: "5m"    # entries not seen within this window are evicted (default: 5m)

The per-IP rate limiter stores a token bucket for every unique client IP. A background goroutine sweeps the map every cleanup_interval and removes entries not seen within cleanup_max_age.

Memory under attack

Under sustained high-cardinality traffic (e.g., DDoS with spoofed source IPs), the map can accumulate up to cleanup_max_age / cleanup_interval sweeps’ worth of unique IPs before eviction catches up. Each entry is roughly 100 bytes, so 1 million unique IPs ≈ 100 MB.

To limit memory growth under attack:

  • Lower cleanup_max_age to 1m or 2m — entries are evicted faster at the cost of re-creating limiters for legitimate clients who pause briefly.
  • Lower cleanup_interval to 30s — sweeps run more frequently, keeping the high-water mark lower.

Monitoring

  • s3o_rate_limit_rejections_total — counter of rejected requests. A sustained non-zero rate means the limiter is actively throttling traffic.

Tracing Sample Rate

The telemetry.sample_rate controls what fraction of requests generate OpenTelemetry trace spans exported to the collector (Tempo, Jaeger, etc.). This does not affect Prometheus metrics or structured logs — only distributed traces.

telemetry:
  sample_rate: 0.1   # 10% of requests traced
WorkloadRecommended RateRationale
Development1.0Trace every request for debugging
Staging0.1Enough to see patterns; 10x less volume
Production < 100 RPS0.1~10 traces/sec is manageable
Production 100–1000 RPS0.01~1–10 traces/sec
Production > 1000 RPS0.001–0.01Reduce collector storage and CPU

At 1.0 with 1000 RPS, the trace collector receives ~1000 spans/sec per request hop (multiple spans per request: HTTP → manager → backend). This can generate gigabytes of trace data per day and significant CPU overhead for serialization and export.

Prometheus metrics always capture 100% of requests regardless of sample rate — use metrics for alerting and dashboards, traces for debugging specific requests.

Load Testing

The loadtest/ directory contains tools for benchmarking the orchestrator under realistic S3 traffic. All tools handle SigV4 authentication automatically.

Tools

vegeta (Go) — constant-rate latency profiling. Best for answering “what is P99 latency at N requests/second?”

k6 — scenario-based workflow simulation. Best for answering “how does the system behave under realistic mixed traffic patterns?”

Quick Start

Start the demo environment, then run the Make targets:

make nomad-demo   # or make kubernetes-demo

# Constant-rate PUT throughput
make loadtest-put LOADTEST_RATE=200 LOADTEST_SIZE=4096

# Constant-rate GET latency (pre-seeds objects first)
make loadtest-get LOADTEST_RATE=500 LOADTEST_SEED=1000

# Mixed PUT/GET workload
make loadtest-mixed LOADTEST_RATE=300 LOADTEST_DURATION=2m

# k6 burst test (admission control / load shedding)
make loadtest-burst

# k6 mixed CRUD workflow
make loadtest-k6

The vegeta targets accept these variables:

VariableDefaultDescription
LOADTEST_RATE100Requests per second
LOADTEST_DURATION30sTest duration
LOADTEST_SIZE1024Object size in bytes
LOADTEST_SEED100Objects to pre-upload for GET/mixed
LOADTEST_WORKERS10Concurrent workers
LOADTEST_ENDPOINThttp://localhost:9000S3 endpoint
LOADTEST_BUCKETphotosTarget bucket

Interpreting Results

vegeta output:

Requests      [total, rate, throughput]         3000, 100.03, 100.02
Latencies     [min, mean, 50, 90, 95, 99, max]  3.2ms, 4.2ms, 4.1ms, 4.8ms, 5.1ms, 6.6ms, 12.3ms
Success       [ratio]                           100.00%
Status Codes  [code:count]                      200:3000
  • P99 latency is the most important number for production sizing. If P99 exceeds your SLA target, reduce the request rate or add instances.
  • Success ratio below 100% means requests were rejected (429 rate limit or 503 admission control). Check Status Codes to see which.
  • Throughput vs rate: if throughput is significantly lower than the configured rate, the server cannot keep up.

k6 output:

  • put_success / get_success — per-operation success rates
  • shed_503 (burst test) — number of requests rejected by admission control
  • http_req_duration — latency percentiles across all operations

What to Watch in Grafana

While load tests run, the Grafana dashboard shows system behavior in real time:

  • Quota & Storage — per-backend utilization during writes
  • Request Performance — latency percentiles, status code distribution, request rate
  • Circuit Breaker — backend health state transitions under load
  • Replication — pending replica count and copy duration after write bursts
  • Cleanup Queue — orphaned objects queued for cleanup after failures

Baseline Workflow

To compare before/after a code change:

  1. Start clean: make clean && make nomad-demo
  2. Run the baseline: make loadtest-put LOADTEST_RATE=500 LOADTEST_DURATION=1m
  3. Save the output (latencies, throughput, success ratio)
  4. Apply the code change, rebuild: make clean && make nomad-demo
  5. Run the same test with identical parameters
  6. Compare P50/P95/P99 latencies and throughput

Sizing Guidelines

These reference numbers were measured on a single instance with three local MinIO backends, replication factor 2, and AES encryption enabled:

WorkloadRateP99 LatencyCPUNotes
PUT 1KB100/s~7ms~50%Steady with replication spikes
GET 1KB100/s~3ms~50%Includes decryption
Mixed 1KB300/s~8ms~140%50/50 PUT/GET
Burst (100 VUs)~280/s~96ms150-200% spikesk6 scenario, all PUTs

CPU percentages are relative to the container’s CPU allocation. Memory remained flat across all tests at 1KB object sizes — the streaming architecture avoids buffering objects in memory. Larger objects (1MB+) will increase memory usage proportionally to concurrency.

Object Data Cache

The optional in-memory LRU cache (cache: config section) trades memory for reduced backend API calls and egress. When the same objects are read repeatedly, cache hits avoid a full round-trip to the storage backend.

cache:
  enabled: true
  max_size: "256MB"
  max_object_size: "10MB"
  ttl: "5m"

Memory vs Egress Tradeoff

The cache consumes container memory proportional to max_size. Size it based on available headroom after accounting for the Go heap, connection pools, buffer pools, and streaming concurrency:

Container MemoryRecommended max_sizeNotes
512 MB64–128 MBLeave room for GC and streaming buffers
1 GB128–256 MBGood for most workloads
2+ GB256–512 MBLarge working sets or many concurrent readers

When GOMEMLIMIT is set (recommended — see Go runtime: GOMEMLIMIT), include the cache’s max_size in your calculation:

# Example: 1024 MB container, 256 MB cache
GOMEMLIMIT=$(( (1024 - 256) * 90 / 100 ))MiB   # ~691 MiB for Go heap

When Caching Helps Most

  • Read-heavy workloads — a small set of objects serves the majority of reads (thumbnails, config files, static assets). A high s3o_cache_hits_total / (hits + misses) ratio confirms the cache is effective.
  • Egress-limited backends — backends with monthly egress caps (OCI free tier, B2 free tier). Each cache hit avoids an egress charge.
  • High-latency backends — caching eliminates backend round-trips, improving P50 and P99 latency for hot objects.

When Caching Adds Little Value

  • Write-heavy or write-once-read-once workloads — objects are rarely read more than once, so the cache turns over constantly with minimal hits.
  • Objects too large for the cache — if most objects exceed max_object_size, they bypass the cache entirely.
  • Uniform access patterns — when reads are evenly distributed across many unique objects, the working set exceeds max_size and eviction rates stay high.

Monitoring

  • s3o_cache_hits_total / s3o_cache_misses_total — compute the hit ratio. Below ~50%, the cache may be undersized or the workload is not a good fit.
  • s3o_cache_evictions_total — a sustained high eviction rate means objects are being evicted before they can be re-read. Increase max_size or lower max_object_size to fit more objects.
  • s3o_cache_size_bytes — if this consistently sits well below max_size, the cache is oversized and you can reclaim memory.
  • s3o_cache_entries — useful alongside cache_size_bytes to understand average cached object size.

Multi-Instance Staleness

The cache is per-instance and not shared. In multi-instance deployments, a write on instance A does not invalidate the cached copy on instance B. The ttl setting bounds how long a stale entry can be served — lower TTL values reduce the staleness window at the cost of more backend requests after expiry. For workloads that require strict read-after-write consistency across instances, either disable the cache or set a very low TTL.