
Performance envelope
This document is a runbook + results template for characterising
the orchestrator’s performance envelope (closes #367). The tooling in
loadtest/ produces the per-scenario JSON matrices referenced below;
the results tables are placeholders that operators fill in after
running the suite on representative hardware. Numbers without a
hardware fingerprint are meaningless, so each table block carries an
“Environment” line.
How to run
Prerequisites
- Working demo or production environment with at least one configured backend and the database reachable from the orchestrator
- make and go in PATH (the vegeta loadtest binary builds via a Makefile target)
- k6 for the multipart and burst scenarios
- Admin token from ui.admin_token (or the S3O_ADMIN_TOKEN env var) so the loadtest binary can call POST /admin/api/cache/flush between cache-cold runs
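For a manual cache-cold run outside the loadtest binary, the flush call can be sketched as below. This is a minimal sketch, not the loadtest binary's code: the endpoint path comes from the prerequisite above, while the bearer-token header and the base URL are assumptions that should be checked against the admin API before use.

```python
import urllib.request


def build_flush_request(base_url, token):
    """Build (but do not send) the POST request that flushes the cache."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/admin/api/cache/flush",
        method="POST",
        # Assumption: the admin API accepts a bearer token; the real header
        # scheme is not stated in this document.
        headers={"Authorization": "Bearer " + token},
    )
```

To send it, pass the result to urllib.request.urlopen with a timeout, reading the token from S3O_ADMIN_TOKEN rather than hard-coding it.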
Scenario inventory
Six runs produce the matrix (the two GET runs are reported together as Scenario 2 in the results):
- Sustained PUT at varying object sizes
- Sustained GET, cache-warm baseline (second run after a warmup)
- Sustained GET, cache-cold (flush between each size)
- Mixed PUT/GET/DELETE across rates (saturation-find ramp)
- List performance at increasing namespace sizes
- Concurrent multipart
Out-of-band capture
The vegeta binary captures latency, throughput, and error rate via vegeta’s own metrics. The remaining signals come from existing surfaces (no in-tree capture pipeline; running a profiler shouldn’t require new code):
| Signal | Source | Method |
|---|---|---|
| Postgres pool utilization | pg_stat_activity | psql -c "SELECT count(*) FROM pg_stat_activity WHERE application_name='s3-orchestrator'" sampled before and after each run |
| Orchestrator goroutine + heap | /debug/pprof/goroutine, /debug/pprof/heap | curl http://orch:9000/debug/pprof/heap > heap.pprof at saturation |
| Container CPU + RSS | docker stats or cgroup /sys/fs/cgroup/... | One sample per scenario step |
| Cache hit / miss / size | /metrics | Prometheus scrape diff between t=0 and t=end |
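The "Prometheus scrape diff" in the last row can be done by saving the /metrics body at t=0 and t=end and subtracting the counters. The helper below is a rough sketch of that step. It assumes plain text exposition lines without timestamps and without spaces inside label values, and the cache metric name used in the usage note is hypothetical.

```python
def parse_samples(text):
    """Parse Prometheus text exposition into {series: value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks and the "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The series (name plus optional {labels}) and its value are split
        # on the last run of whitespace; timestamps are assumed absent.
        series, value = line.rsplit(None, 1)
        samples[series] = float(value)
    return samples


def counter_deltas(start_text, end_text):
    """Return end-minus-start for every series present in the end scrape."""
    before = parse_samples(start_text)
    after = parse_samples(end_text)
    return {name: after[name] - before.get(name, 0.0) for name in after}
```

For example, if a hypothetical example_cache_hits_total series reads 10 at t=0 and 25 at t=end, counter_deltas reports 15 for it. Gauges such as cache size should be read directly rather than diffed.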
Hardware fingerprint
Record the host’s specs in the Environment row of each results
table. The loadtest binary already embeds runtime.GOOS,
runtime.GOARCH, runtime.NumCPU(), and the Go version into the
JSON output’s hardware block; copy that block plus the actual
machine model into each section so the numbers stay interpretable.
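Copying the hardware block into each Environment line can be scripted. The sketch below assumes the JSON output has a top-level "hardware" object with keys named goos, goarch, num_cpu, and go_version. Those key names are guesses for illustration, so adjust them to whatever the loadtest binary actually emits.

```python
import json


def environment_line(results_json, host_model):
    """Format an Environment line from a results file's hardware block."""
    hw = json.loads(results_json)["hardware"]
    # Key names below are hypothetical; match them to the real JSON output.
    return "Environment: {}, {}/{}, {} CPUs, {}".format(
        host_model, hw["goos"], hw["goarch"], hw["num_cpu"], hw["go_version"]
    )
```

The host model is passed in separately because the runtime values alone do not identify the machine.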
Results
Scenario 1 - Sustained PUT
Environment: fill from put-sweep.json -> hardware + host model
| Size | RPS achieved | MB/s | P50 ms | P95 ms | P99 ms | Err % |
|---|---|---|---|---|---|---|
| 1 KB | TBD | TBD | TBD | TBD | TBD | TBD |
| 1 MB | TBD | TBD | TBD | TBD | TBD | TBD |
| 100 MB | TBD | TBD | TBD | TBD | TBD | TBD |
Scenario 2 - Sustained GET (warm vs cold)
Environment: fill
| Size | Cold P95 ms | Warm P95 ms | Cold MB/s | Warm MB/s | Cache value (warm/cold latency) |
|---|---|---|---|---|---|
| 1 KB | TBD | TBD | TBD | TBD | TBD |
| 1 MB | TBD | TBD | TBD | TBD | TBD |
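The "Cache value" column is the warm-to-cold latency ratio, so lower is better and a value near 1 means the cache is adding little at that object size. As a small worked helper (the function name is illustrative only):

```python
def cache_value(warm_p95_ms, cold_p95_ms):
    """Warm/cold P95 ratio: 0.25 means warm reads take a quarter of the cold latency."""
    if cold_p95_ms <= 0:
        raise ValueError("cold P95 must be positive")
    return warm_p95_ms / cold_p95_ms
```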
Scenario 3 - Mixed saturation ramp
Environment: fill
| Requested RPS | Achieved RPS | P95 ms | Err % |
|---|---|---|---|
| 100 | TBD | TBD | TBD |
| … | TBD | TBD | TBD |
Saturation point: fill from mixed-ramp.json -> saturation_rps
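When cross-checking saturation_rps by hand, one plausible reading of the ramp table is "the highest requested rate that was still served cleanly". The sketch below applies that reading with two assumed thresholds (achieved at least 95% of requested, errors at most 1%). The loadtest tooling may define saturation differently, so treat the value in mixed-ramp.json as authoritative.

```python
def find_saturation(rows, min_ratio=0.95, max_err_pct=1.0):
    """Return the last requested RPS served cleanly, or None if none was.

    rows is a list of (requested_rps, achieved_rps, err_pct) tuples.
    """
    last_ok = None
    for requested, achieved, err_pct in sorted(rows):
        # Stop at the first step that falls behind or starts erroring.
        if achieved < min_ratio * requested or err_pct > max_err_pct:
            break
        last_ok = requested
    return last_ok
```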
Scenario 4 - List performance
Environment: fill
| Namespace size | P50 ms | P95 ms | P99 ms | Pagination pages hit cap |
|---|---|---|---|---|
| 10 K | TBD | TBD | TBD | TBD |
| 100 K | TBD | TBD | TBD | TBD |
| 1 M | TBD | TBD | TBD | TBD |
The “pages hit cap” column reads the
s3o_list_pages_capped_total counter delta over the run; non-zero
values indicate listObjectsMaxPages is firing at this scale.
Scenario 5 - Concurrent multipart
Environment: fill
| Concurrency | Completed uploads / min | Create P95 ms | Part P95 ms | Complete P95 ms | Err % |
|---|---|---|---|---|---|
| 10 | TBD | TBD | TBD | TBD | TBD |
| 50 | TBD | TBD | TBD | TBD | TBD |
| 100 | TBD | TBD | TBD | TBD | TBD |
Bottlenecks
After populating the tables above, identify the saturation cause per scenario. Typical candidates:
- Backend round-trip latency dominates at small object sizes
- Network throughput to backends caps MB/s at large object sizes
- Postgres connection pool exhaustion appears as a spike in P95 with low CPU on the orchestrator
- Cache thrashing when cache-warm and cache-cold P95 converge (working set exceeds cache size)
- Admission control kicks in, visible as non-zero s3o_admission_rejections_total and s3o_load_shed_total
- Multipart Postgres contention on the per-uploadId advisory lock at high concurrency
Known bottleneck: backend_quotas row contention
PUT throughput is bounded by row-level lock contention on
backend_quotas. Every successful write transaction runs
UPDATE backend_quotas SET bytes_used = bytes_used + $size WHERE backend_name = $name and holds the resulting row lock until it commits, so all
concurrent writes to the same backend serialize on a single row.
Diagnosis pattern in pg_stat_activity: sessions waiting on a row lock,
with the blocked query being IncrementQuota (on the steady-state
write path) or DecrementQuota (when displaced-copy cleanup is
running for overwrites). Symptom on the client side: P50 stays sub-ms
but P95/P99 blow out to seconds. Most requests are fast, but a tail
queues behind the row lock and admission control sheds them.
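One way to look for this pattern is to sample lock waiters from pg_stat_activity and keep the ones whose query touches backend_quotas. The sketch below is illustrative, not output captured from the demo: the SQL uses standard pg_stat_activity columns and the application name from the capture table above, and the filter assumes rows arrive as dictionaries keyed by column name.

```python
# Sessions of the orchestrator that are currently waiting on a heavyweight lock.
# Row-lock waits typically show wait_event 'transactionid' or 'tuple'.
LOCK_WAIT_SQL = """
SELECT pid, state, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE application_name = 's3-orchestrator'
  AND wait_event_type = 'Lock'
"""


def quota_lock_waiters(rows):
    """Keep only lock waiters whose query text references backend_quotas."""
    return [
        row
        for row in rows
        if row["wait_event_type"] == "Lock" and "backend_quotas" in row["query"]
    ]
```

A waiter count that grows with offered load while orchestrator CPU stays low is consistent with the contention described above.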
Observed wall on the local Nomad demo (3-backend spread, 1 KB objects,
max_concurrent_writes: 1000, max_conns: 200): ~500 PUT/s before
load shedding fires. Pool size and Postgres max_connections do not
move the wall — only the count of contended rows does.
Mitigations available today:
- Add backends to spread writes across more backend_quotas rows (linear scaling of the per-row write rate)
- Cap concurrent writes per backend below the lock-serialization rate to avoid the wait queue building up
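The linear-scaling claim in the first mitigation can be turned into a rough planning estimate. The arithmetic below assumes the observed wall is an aggregate across evenly loaded backends and that no other bottleneck appears first. Both are assumptions, so treat the result as an upper bound to verify with a rerun.

```python
def estimated_put_ceiling(observed_rps, observed_backends, target_backends):
    """Scale an observed PUT wall linearly with the number of quota rows."""
    if observed_backends <= 0 or target_backends <= 0:
        raise ValueError("backend counts must be positive")
    return observed_rps * target_backends / observed_backends
```

With the demo figures above (about 500 PUT/s on 3 backends), doubling to 6 backends would give an estimated ceiling of about 1000 PUT/s.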
Architectural fix (not in this branch): batch quota deltas in memory
and flush periodically, mirroring the usage_flush worker pattern,
so the hot row is updated O(1) times per flush interval instead of
once per write.
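The batching idea can be illustrated with a small in-memory accumulator. This is a sketch of the general technique only, not the orchestrator's code or the usage_flush worker: the class name and the apply callback are hypothetical, and a real version would need to decide how stale quota reads may be and what happens to pending deltas on crash.

```python
import threading
from collections import defaultdict


class QuotaBatcher:
    """Accumulate per-backend byte deltas and apply them in one pass."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending = defaultdict(int)

    def add(self, backend_name, delta_bytes):
        """Record a write (+size) or delete (-size) without touching the database."""
        with self._lock:
            self._pending[backend_name] += delta_bytes

    def flush(self, apply):
        """Call apply(backend_name, net_delta) once per backend; return rows updated."""
        # Swap the pending map out under the lock so writers are not blocked
        # while the (slow) database updates run.
        with self._lock:
            batch, self._pending = self._pending, defaultdict(int)
        updated = 0
        for backend_name, net_delta in batch.items():
            if net_delta != 0:  # deltas that cancel out need no UPDATE at all
                apply(backend_name, net_delta)
                updated += 1
        return updated
```

A periodic worker would call flush with a function that issues a single UPDATE per backend, so the hot row sees one lock acquisition per interval regardless of how many writes were batched.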
Recommended configuration per scale tier
After running the suite, fill in:
| Tier | Backends | DB pool | Cache max_size | Max concurrent requests | Notes |
|---|---|---|---|---|---|
| Small (< 100 obj/s) | TBD | TBD | TBD | TBD | TBD |
| Medium (100-1k obj/s) | TBD | TBD | TBD | TBD | TBD |
| Large (> 1k obj/s) | TBD | TBD | TBD | TBD | TBD |