s3-orchestrator

Performance envelope

This document is a runbook + results template for characterising the orchestrator’s performance envelope (closes #367). The tooling in loadtest/ produces the per-scenario JSON matrices referenced below; the results tables are placeholders that operators fill in after running the suite on representative hardware. Numbers without a hardware fingerprint are meaningless, so each table block carries an “Environment” line.

How to run

Prerequisites

  • Working demo or production environment with at least one configured backend and the database reachable from the orchestrator
  • make and go in PATH (the vegeta-based loadtest binary builds via a Makefile target)
  • k6 for the multipart and burst scenarios
  • Admin token from ui.admin_token (or S3O_ADMIN_TOKEN env var) so the loadtest binary can call POST /admin/api/cache/flush between cache-cold runs

Scenario inventory

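Six runs produce the matrix. The warm and cold GET runs (2 and 3) are reported together as Scenario 2 in the results, so the results section has five scenarios: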

  1. Sustained PUT at varying object sizes

    make loadtest-put LOADTEST_SIZES=1024,1048576,104857600 \
      LOADTEST_RATE=200 LOADTEST_DURATION=60s \
      LOADTEST_OUTPUT_JSON=put-sweep.json
  2. Sustained GET, cache-warm baseline (second run after a warmup)

    make loadtest-get LOADTEST_SEED=1000 \
      LOADTEST_SIZES=1024,1048576 \
      LOADTEST_RATE=500 LOADTEST_DURATION=60s \
      LOADTEST_OUTPUT_JSON=get-warm.json
  3. Sustained GET, cache-cold (flush between each size)

    ./loadtest/s3-loadtest \
      -op get -rate 500 -duration 60s \
      -sizes 1024,1048576 \
      -seed 1000 \
      -cache-flush-before -admin-token "$S3O_ADMIN_TOKEN" \
      -output-json get-cold.json
  4. Mixed PUT/GET/DELETE across rates (saturation-find ramp)

    ./loadtest/s3-loadtest \
      -op mixed -rate 100 -ramp-to 2000 -ramp-step 200 \
      -ramp-error-threshold 0.05 \
      -duration 30s -seed 500 \
      -output-json mixed-ramp.json
  5. List performance at increasing namespace sizes

    make loadtest-listobjects LOADTEST_SEED=10000   LOADTEST_OUTPUT_JSON=list-10k.json
    make loadtest-listobjects LOADTEST_SEED=100000  LOADTEST_OUTPUT_JSON=list-100k.json
    make loadtest-listobjects LOADTEST_SEED=1000000 LOADTEST_OUTPUT_JSON=list-1m.json
  6. Concurrent multipart

    make loadtest-multipart LOADTEST_MPU_CONCURRENCY=10
    make loadtest-multipart LOADTEST_MPU_CONCURRENCY=50
    make loadtest-multipart LOADTEST_MPU_CONCURRENCY=100

Out-of-band capture

The vegeta-based loadtest binary captures latency, throughput, and error rate via vegeta’s own metrics. The remaining signals come from existing surfaces. There is no in-tree capture pipeline, because running a profiler shouldn’t require new code:

Signal                        | Source                                    | Method
------------------------------+-------------------------------------------+--------------------------------------------
Postgres pool utilization     | pg_stat_activity                          | psql -c "SELECT count(*) FROM pg_stat_activity WHERE application_name='s3-orchestrator'" sampled before and after each run
Orchestrator goroutine + heap | /debug/pprof/goroutine, /debug/pprof/heap | curl http://orch:9000/debug/pprof/heap > heap.pprof at saturation
Container CPU + RSS           | docker stats or cgroup /sys/fs/cgroup/... | One sample per scenario step
Cache hit / miss / size       | /metrics                                  | Prometheus scrape diff between t=0 and t=end
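
The scrape diff for the cache signal can be computed offline from two saved /metrics snapshots. The following is a minimal sketch, assuming the standard Prometheus text exposition format without trailing timestamps. The metric names in the example (s3o_cache_hits_total, s3o_cache_misses_total) are hypothetical placeholders; substitute whatever names the orchestrator actually exports.

```python
def parse_metrics(text):
    """Parse Prometheus text exposition format into {series: value}.

    Comment lines (# HELP / # TYPE) and blank lines are skipped. The series
    key keeps its label set verbatim, e.g. 'foo_total{backend="a"}'.
    Optional trailing timestamps are not handled in this sketch.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated token on the line.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


def scrape_diff(start_text, end_text):
    """Return end - start for every series present in the end snapshot."""
    start = parse_metrics(start_text)
    end = parse_metrics(end_text)
    return {name: value - start.get(name, 0.0) for name, value in end.items()}


# Hypothetical metric names and values, for illustration only.
t0 = """# TYPE s3o_cache_hits_total counter
s3o_cache_hits_total 100
s3o_cache_misses_total 40
"""
t_end = """# TYPE s3o_cache_hits_total counter
s3o_cache_hits_total 1100
s3o_cache_misses_total 90
"""

delta = scrape_diff(t0, t_end)
hits = delta["s3o_cache_hits_total"]
misses = delta["s3o_cache_misses_total"]
hit_ratio = hits / (hits + misses)
print(f"hits={hits:.0f} misses={misses:.0f} hit_ratio={hit_ratio:.3f}")
```

Gauges such as cache size are better read directly from the end snapshot than diffed, and a counter that resets mid-run (for example after a restart) would produce a negative delta that this sketch does not correct.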

Hardware fingerprint

Record the host’s specs in the Environment row of each results table. The loadtest binary already embeds runtime.GOOS, runtime.GOARCH, runtime.NumCPU(), and the Go version into the JSON output’s hardware block; copy that block plus the actual machine model into each section so the numbers stay interpretable.
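
Copying the hardware block by hand is error-prone, so a small helper can format it into an Environment line. This is a sketch only: it assumes nothing about the JSON beyond a top-level "hardware" object, and the field names in the sample are illustrative rather than the binary's real schema.

```python
import json


def environment_line(results, host_model):
    """Build an 'Environment:' line from a parsed results document.

    Only the top-level "hardware" key is assumed. Its fields are emitted
    as-is, sorted by name, after the operator-supplied host model.
    """
    hardware = results.get("hardware", {})
    fields = ", ".join(f"{key}={hardware[key]}" for key in sorted(hardware))
    if not fields:
        return f"Environment: {host_model}"
    return f"Environment: {host_model}; {fields}"


# Illustrative field names and values, not the loadtest binary's actual output.
sample = json.loads('{"hardware": {"goos": "linux", "goarch": "amd64", "num_cpu": 8}}')
print(environment_line(sample, "example host model"))
```

In practice the input would come from a results file, for example json.load(open("put-sweep.json")), with the machine model typed in by the operator.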

Results

Scenario 1 - Sustained PUT

Environment: fill from put-sweep.json -> hardware + host model

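Size   | RPS achieved | MB/s | P50 ms | P95 ms | P99 ms | Err %
-------+--------------+------+--------+--------+--------+------
1 KB   | TBD          | TBD  | TBD    | TBD    | TBD    | TBD
1 MB   | TBD          | TBD  | TBD    | TBD    | TBD    | TBD
100 MB | TBD          | TBD  | TBD    | TBD    | TBD    | TBD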
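
As a cross-check when filling in the table, the MB/s column should agree with the achieved request rate multiplied by the object size. The sketch below assumes "MB" means 2**20 bytes, which matches the power-of-two sizes passed to LOADTEST_SIZES; if the JSON output reports decimal megabytes, change the divisor accordingly.

```python
MIB = 1024 * 1024  # assumption: "MB" in the table means 2**20 bytes


def throughput_mb_per_s(achieved_rps, object_size_bytes):
    """Payload throughput implied by an achieved request rate."""
    return achieved_rps * object_size_bytes / MIB


# Illustrative inputs only (the requested rate from the example command),
# not measurements.
for size in (1024, 1048576, 104857600):
    print(size, throughput_mb_per_s(200, size))
```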

Scenario 2 - Sustained GET (warm vs cold)

Environment: fill

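Size | Cold P95 ms | Warm P95 ms | Cold MB/s | Warm MB/s | Cache value (warm/cold latency)
-----+-------------+-------------+-----------+-----------+--------------------------------
1 KB | TBD         | TBD         | TBD       | TBD       | TBD
1 MB | TBD         | TBD         | TBD       | TBD       | TBD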
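
The cache value column is the ratio of warm to cold P95 latency, so lower is better and a value near 1.0 means the cache is not helping. A minimal sketch of that arithmetic follows; the convergence tolerance is a hypothetical choice for illustration, not a threshold defined by the project.

```python
def cache_value(warm_p95_ms, cold_p95_ms):
    """Warm/cold P95 latency ratio; lower means the cache helps more."""
    if cold_p95_ms <= 0:
        raise ValueError("cold P95 must be positive")
    return warm_p95_ms / cold_p95_ms


def warm_cold_converged(warm_p95_ms, cold_p95_ms, tolerance=0.1):
    """True when warm latency is within `tolerance` of cold latency.

    The default tolerance is an arbitrary illustrative value.
    """
    return abs(1.0 - cache_value(warm_p95_ms, cold_p95_ms)) <= tolerance


# Illustrative numbers only, not measurements.
print(cache_value(2.0, 8.0))
print(warm_cold_converged(7.6, 8.0))
```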

Scenario 3 - Mixed saturation ramp

Environment: fill

Requested RPS | Achieved RPS | P95 ms | Err %
--------------+--------------+--------+------
100           | TBD          | TBD    | TBD
              | TBD          | TBD    | TBD

Saturation point: fill from mixed-ramp.json -> saturation_rps
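
The saturation_rps value in mistake-free output is authoritative, but it can be sanity-checked against the table rows. The sketch below assumes one plausible definition: saturation is the last requested rate before the error rate first exceeds the ramp threshold (0.05 in the example command). The tool's actual rule may differ, and the tuple layout is a hypothetical stand-in for the JSON schema.

```python
def find_saturation(steps, error_threshold=0.05):
    """Return the last requested RPS before the error rate first exceeds the threshold.

    `steps` is a list of (requested_rps, error_rate) tuples in ramp order.
    Returns None if even the first step exceeds the threshold.
    """
    saturation = None
    for requested_rps, error_rate in steps:
        if error_rate > error_threshold:
            break
        saturation = requested_rps
    return saturation


# Illustrative rows only, not measurements.
ramp = [(100, 0.0), (300, 0.0), (500, 0.01), (700, 0.12), (900, 0.40)]
print(find_saturation(ramp))
```

If the ramp finishes without any step crossing the threshold, the function returns the final requested rate, which means the true saturation point lies above the -ramp-to ceiling and the ramp should be rerun with a higher limit.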

Scenario 4 - List performance

Environment: fill

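Namespace size | P50 ms | P95 ms | P99 ms | Pagination pages hit cap
---------------+--------+--------+--------+-------------------------
10 K           | TBD    | TBD    | TBD    | TBD
100 K          | TBD    | TBD    | TBD    | TBD
1 M            | TBD    | TBD    | TBD    | TBD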

The “pages hit cap” column reads the s3o_list_pages_capped_total counter delta over the run; non-zero values indicate listObjectsMaxPages is firing at this scale.

Scenario 5 - Concurrent multipart

Environment: fill

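Concurrency | Completed uploads / min | Create P95 ms | Part P95 ms | Complete P95 ms | Err %
------------+-------------------------+---------------+-------------+-----------------+------
10          | TBD                     | TBD           | TBD         | TBD             | TBD
50          | TBD                     | TBD           | TBD         | TBD             | TBD
100         | TBD                     | TBD           | TBD         | TBD             | TBD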

Bottlenecks

After populating the tables above, identify the saturation cause per scenario. Typical candidates:

  • Backend round-trip latency dominates at small object sizes
  • Network throughput to backends caps MB/s at large object sizes
  • Postgres connection pool exhaustion appears as a spike in P95 with low CPU on the orchestrator
  • Cache thrashing when cache-warm and cache-cold P95 converge (working set exceeds cache size)
  • Admission control kicking in, visible as non-zero s3o_admission_rejections_total and s3o_load_shed_total
  • Multipart Postgres contention on the per-uploadId advisory lock at high concurrency

Known bottleneck: backend_quotas row contention

PUT throughput is bounded by row-level lock contention on backend_quotas. Every successful write transaction runs UPDATE backend_quotas SET bytes_used = bytes_used + $size WHERE backend_name = $name and holds the resulting row lock until it commits, so all concurrent writes to the same backend serialize on a single row.

Diagnosis pattern in pg_stat_activity:

state | count | wait_event_type | wait_event
------+-------+-----------------+----------------
active |   200 | Lock            | transactionid

with the blocked query being IncrementQuota (on the steady-state write path) or DecrementQuota (when displaced-copy cleanup is running for overwrites). Symptom on the client side: P50 stays sub-ms but P95/P99 blow out to seconds — most requests are fast, but a tail queues behind the row lock and admission control sheds them.
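
To spot this pattern without eyeballing raw output, the sessions can be grouped by wait state. The sketch below keeps the database access out of scope: it assumes the rows have already been fetched as tuples by whatever Postgres client is in use, using standard pg_stat_activity columns and the application name from the capture table. The helper names are hypothetical.

```python
from collections import Counter

# Query to feed the helpers below; these are standard pg_stat_activity columns.
QUERY = """
SELECT state, wait_event_type, wait_event
FROM pg_stat_activity
WHERE application_name = 's3-orchestrator'
"""


def summarize_waits(rows):
    """Count sessions per (state, wait_event_type, wait_event), busiest first."""
    return Counter(rows).most_common()


def row_lock_waiters(rows):
    """Number of active sessions blocked on another transaction's row lock."""
    return sum(1 for row in rows if row == ("active", "Lock", "transactionid"))


# Illustrative rows shaped like the query result, not captured data.
rows = [("active", "Lock", "transactionid")] * 4 + [("idle", "Client", "ClientRead")] * 2

for key, count in summarize_waits(rows):
    print(count, key)
print("row-lock waiters:", row_lock_waiters(rows))
```

A waiter count that grows with offered load while orchestrator CPU stays low is consistent with the contention described here; confirming that the blocked statements are the quota updates still requires looking at the query column.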

Observed wall on the local Nomad demo (3-backend spread, 1 KB objects, max_concurrent_writes: 1000, max_conns: 200): ~500 PUT/s before load shedding fires. Pool size and Postgres max_connections do not move the wall — only the count of contended rows does.

Mitigations available today:

  • Add backends to spread writes across more backend_quotas rows (linear scaling of the per-row write rate)
  • Cap concurrent writes per backend below the lock-serialization rate to avoid the wait queue building up

Architectural fix (not in this branch): batch quota deltas in memory and flush periodically, mirroring the usage_flush worker pattern, so the hot row is updated O(1) times per flush interval instead of once per write.
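
The batching idea can be sketched as follows. This is an illustration of the pattern under stated assumptions, not the project's design: the class and callback names are hypothetical, the real implementation would be in the orchestrator's own language, and the apply_delta callback stands in for a single UPDATE backend_quotas statement per backend per flush.

```python
import threading
from collections import defaultdict


class QuotaDeltaBatcher:
    """Accumulates per-backend byte deltas in memory and flushes them in bulk."""

    def __init__(self, apply_delta):
        # apply_delta(backend_name, delta_bytes) is a hypothetical callback
        # standing in for one UPDATE backend_quotas statement.
        self._apply_delta = apply_delta
        self._lock = threading.Lock()
        self._pending = defaultdict(int)

    def add(self, backend_name, delta_bytes):
        """Record a write (+size) or delete (-size) without touching the database."""
        with self._lock:
            self._pending[backend_name] += delta_bytes

    def flush(self):
        """Swap out the pending map, then apply one net delta per backend.

        Returns the number of backends that were actually updated.
        """
        with self._lock:
            pending, self._pending = self._pending, defaultdict(int)
        applied = 0
        for backend_name, delta_bytes in pending.items():
            if delta_bytes != 0:
                self._apply_delta(backend_name, delta_bytes)
                applied += 1
        return applied


# Usage: 1000 writes to one backend collapse into a single row update.
applied = []
batcher = QuotaDeltaBatcher(lambda name, delta: applied.append((name, delta)))
for _ in range(1000):
    batcher.add("backend-a", 1024)
batcher.add("backend-b", 4096)
batcher.add("backend-b", -4096)  # net zero, so no update is issued
updated = batcher.flush()
print(updated, applied)
```

A periodic timer would call flush once per interval. The trade-offs are that bytes_used lags reality by up to one interval and that deltas still in memory are lost on a crash; this sketch also omits re-queuing deltas when the callback fails.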

After running the suite, fill in the sizing table below, one row per throughput tier:

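Tier                  | Backends | DB pool | Cache max_size | Max concurrent requests | Notes
----------------------+----------+---------+----------------+-------------------------+------
Small (< 100 obj/s)   | TBD      | TBD     | TBD            | TBD                     | TBD
Medium (100-1k obj/s) | TBD      | TBD     | TBD            | TBD                     | TBD
Large (> 1k obj/s)    | TBD      | TBD     | TBD            | TBD                     | TBD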