
# Storage Benchmarks

Generated on 2026-03-27T12:08:47+09:00 with:

    nix run ./nix/test-cluster#cluster -- bench-storage

## CoronaFS

Cluster network baseline, measured with iperf3 from node04 to node01 before the storage tests:

| Metric | Result |
| --- | --- |
| TCP throughput | 45.92 MiB/s |
| TCP retransmits | 193 |

Measured from node04. Local worker disk is the baseline. CoronaFS now has two relevant data paths in the lab: the controller export sourced from node01, and the node-local export materialized onto the worker that actually attaches the mutable VM disk.

| Metric | Local Disk | Controller Export | Node-local Export |
| --- | --- | --- | --- |
| Sequential write | 679.05 MiB/s | 30.35 MiB/s | 395.06 MiB/s |
| Sequential read | 2723.40 MiB/s | 42.70 MiB/s | 709.14 MiB/s |
| 4k random read | 16958 IOPS | 2034 IOPS | 5087 IOPS |
| 4k queued random read (iodepth=32) | 106026 IOPS | 14261 IOPS | 28898 IOPS |
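The percent-of-baseline figures quoted in the Assessment section can be reproduced directly from this table. A minimal sketch (all figures are copied from the table above; `pct` is an illustrative helper, not part of the benchmark tooling):

```python
# Percent-of-baseline ratios for the CoronaFS exports, as quoted in the
# Assessment section. All figures are copied from the table above.

def pct(value: float, baseline: float) -> float:
    """Return value as a percentage of baseline, rounded to one decimal."""
    return round(100.0 * value / baseline, 1)

local = {"seq read MiB/s": 2723.40, "4k rand IOPS": 16958, "queued 4k IOPS": 106026}
controller = {"seq read MiB/s": 42.70, "4k rand IOPS": 2034, "queued 4k IOPS": 14261}
node_local = {"seq read MiB/s": 709.14, "4k rand IOPS": 5087, "queued 4k IOPS": 28898}

for name, export in (("controller", controller), ("node-local", node_local)):
    for metric, baseline in local.items():
        print(f"{name} {metric}: {pct(export[metric], baseline)}% of local disk")
```

Running this prints the 1.6% / 12.0% / 13.5% controller figures and the 26.0% / 30.0% / 27.3% node-local figures used in the Assessment.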

Queue-depth profile (libaio, iodepth=32) from the same worker:

| Metric | Local Disk | Controller Export | Node-local Export |
| --- | --- | --- | --- |
| Depth-32 write | 3417.45 MiB/s | 39.26 MiB/s | 178.04 MiB/s |
| Depth-32 read | 12996.47 MiB/s | 55.71 MiB/s | 112.88 MiB/s |

Node-local materialization timing and target-node steady-state read path:

| Metric | Result |
| --- | --- |
| Node04 materialize latency | 9.23 s |
| Node05 materialize latency | 5.82 s |
| Node05 node-local sequential read | 709.14 MiB/s |

PlasmaVMC now prefers the worker-local CoronaFS export for mutable node-local volumes, even when the underlying materialization is a qcow2 overlay. The VM runtime section below is therefore the closest end-to-end proxy for real local-attach VM I/O, while the node-local export numbers remain useful for CoronaFS service consumers and for diagnosing exporter overhead.

## LightningStor

Measured from node03 against the S3-compatible endpoint on node01. The object path exercised the distributed backend with replication across the worker storage nodes.

Cluster network baseline for this client, measured with iperf3 from node03 to node01 before the storage tests:

| Metric | Result |
| --- | --- |
| TCP throughput | 45.99 MiB/s |
| TCP retransmits | 207 |

### Large-object path

| Metric | Result |
| --- | --- |
| Object size | 256 MiB |
| Upload throughput | 18.20 MiB/s |
| Download throughput | 39.21 MiB/s |

### Small-object batch

Measured as 32 objects of 4 MiB each (128 MiB total).

| Metric | Result |
| --- | --- |
| Batch upload throughput | 18.96 MiB/s |
| Batch download throughput | 39.88 MiB/s |
| PUT rate | 4.74 objects/s |
| GET rate | 9.97 objects/s |
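The object rates in this table are consistent with the batch throughput: at 4 MiB per object, the rate is simply throughput divided by object size. A quick cross-check (numbers copied from the table above):

```python
# Cross-check the small-object batch table: object rate should equal
# batch throughput divided by the 4 MiB object size.
OBJECT_SIZE_MIB = 4

put_rate = round(18.96 / OBJECT_SIZE_MIB, 2)   # batch upload MiB/s -> objects/s
get_rate = round(39.88 / OBJECT_SIZE_MIB, 2)   # batch download MiB/s -> objects/s
print(put_rate, get_rate)  # matches the 4.74 PUT/s and 9.97 GET/s rows
```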

### Parallel small-object batch

Measured as the same 32 objects of 4 MiB each, but with 8 concurrent client jobs from node03.

| Metric | Result |
| --- | --- |
| Parallel batch upload throughput | 16.23 MiB/s |
| Parallel batch download throughput | 26.07 MiB/s |
| Parallel PUT rate | 4.06 objects/s |
| Parallel GET rate | 6.52 objects/s |
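Comparing this run against the serial batch quantifies how the lab transport responds to concurrency: aggregate throughput actually drops with 8 jobs. A small sketch (figures copied from the two batch tables above):

```python
# Parallel-vs-serial scaling for the small-object batch: values below 100%
# mean 8 concurrent jobs deliver less aggregate throughput than one job.
serial = {"upload MiB/s": 18.96, "download MiB/s": 39.88}
parallel = {"upload MiB/s": 16.23, "download MiB/s": 26.07}

for op, serial_mibs in serial.items():
    scaling = parallel[op] / serial_mibs
    print(f"{op}: parallel reaches {scaling:.0%} of serial aggregate throughput")
```

Uploads hold up better (~86% of serial) than downloads (~65%), which is consistent with the GET rate falling from 9.97 to 6.52 objects/s.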

## VM Image Path

Measured against the PlasmaVMC -> LightningStor artifact -> CoronaFS-backed managed volume clone path on node01.

| Metric | Result |
| --- | --- |
| Guest image artifact size | 2017 MiB |
| Guest image virtual size | 4096 MiB |
| CreateImage latency | 66.49 s |
| First image-backed CreateVolume latency | 16.86 s |
| Second image-backed CreateVolume latency | 0.12 s |
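The gap between the two CreateVolume latencies is the signature of the qcow2 backing-file fast path: the first clone materializes data, while the second only creates an overlay. Quantified from the table above:

```python
# First vs second image-backed CreateVolume: the second clone only creates
# a qcow2 overlay against the existing backing file, so it is ~140x faster.
first_clone_s = 16.86   # materializing clone
second_clone_s = 0.12   # overlay-only clone
speedup = first_clone_s / second_clone_s
print(f"second clone is ~{speedup:.0f}x faster than the first")
```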

## VM Runtime Path

Measured against the real StartVm -> qemu attach -> guest boot -> guest fio path on a worker node, using a CoronaFS-backed root disk and data disk.

| Metric | Result |
| --- | --- |
| StartVm to qemu attach | 0.60 s |
| StartVm to guest benchmark result | 35.69 s |
| Guest sequential write | 123.49 MiB/s |
| Guest sequential read | 1492.71 MiB/s |
| Guest 4k random read | 25550 IOPS |
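The guest-runtime figures can be put in the same percent-of-baseline terms as the Assessment (figures copied from this table and the local-disk column of the CoronaFS section). Note that the guest 4k random-read figure exceeds the raw local baseline; in this nested lab that most likely reflects caching somewhere in the host/guest I/O path rather than genuinely faster storage:

```python
# Guest-runtime I/O as a percentage of the local-disk baseline from the
# CoronaFS section. Figures are copied from the tables in this document.
def pct(value: float, baseline: float) -> float:
    return round(100.0 * value / baseline, 1)

guest_vs_local = {
    "sequential write": pct(123.49, 679.05),
    "sequential read": pct(1492.71, 2723.40),
    "4k random read": pct(25550, 16958),  # >100%: caching in the VM path
}
print(guest_vs_local)
```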

## Assessment

- CoronaFS controller-export sequential reads are currently 1.6% of the measured local-disk baseline on this nested-QEMU lab cluster.
- CoronaFS controller-export 4k random reads are currently 12.0% of the measured local-disk baseline.
- CoronaFS controller-export queued 4k random reads are currently 13.5% of the measured local queued-random-read baseline.
- CoronaFS controller-export sequential reads are currently 93.0% of the measured node04->node01 TCP baseline, which indicates the centralized source path is close to network-bound and separates its own overhead from the raw cluster-network limit.
- CoronaFS controller-export depth-32 reads are currently 0.4% of the local depth-32 baseline.
- CoronaFS node-local sequential reads are currently 26.0% of the measured local-disk baseline, which is the more relevant steady-state signal for mutable VM disks after attachment.
- CoronaFS node-local 4k random reads are currently 30.0% of the measured local-disk baseline.
- CoronaFS node-local queued 4k random reads are currently 27.3% of the measured local queued-random-read baseline.
- CoronaFS node-local depth-32 reads are currently 0.9% of the local depth-32 baseline.
- The target worker's node-local read path is 26.0% of the measured local sequential-read baseline after materialization, which is a better proxy for restart and migration steady state than the old shared-export read.
- PlasmaVMC now attaches writable node-local volumes through the worker-local CoronaFS export, so the guest-runtime section should be treated as the real local VM steady-state path rather than the node-local export numbers alone.
- CoronaFS single-depth writes remain sensitive to the nested-QEMU/VDE lab transport, so the queued-depth and guest-runtime numbers are a more reliable proxy for real VM workload behavior than the single-stream write figure alone.
- The central export path is now best understood as a source/materialization path; the worker-local export is the path that should determine VM-disk readiness going forward.
- LightningStor's replicated S3 path is working correctly, but 18.20 MiB/s upload and 39.21 MiB/s download are still lab-grade numbers rather than strong object-store throughput.
- LightningStor large-object downloads are currently 85.3% of the node03->node01 TCP baseline measured for this client, which bounds how much throughput is being lost above the raw network path.
- The current S3 frontend tuning baseline is the built-in 16 MiB streaming threshold with multipart PUT/FETCH concurrency of 8; that combination is the best default observed on this lab cluster so far.
- LightningStor uploads should be read against the replication write quorum and the same ~45.99 MiB/s lab network ceiling; this environment still limits end-to-end throughput well before modern bare-metal NICs would.
- LightningStor's small-object batch path is also functional, but 4.74 PUT/s and 9.97 GET/s still indicate a lab cluster rather than a tuned object-storage deployment.
- The parallel small-object profile is the more relevant control-plane/object-ingest signal; it currently reaches 4.06 PUT/s and 6.52 GET/s.
- The VM image section measures clone/materialization cost, not guest runtime I/O.
- The PlasmaVMC local image-backed clone fast path is active again; the 0.12 s second clone indicates the CoronaFS qcow2 backing-file path is being hit on node01 rather than falling back to eager raw materialization.
- The VM runtime section is the real PlasmaVMC + CoronaFS + QEMU virtio-blk + guest kernel path; use it to judge whether QEMU/NBD tuning is helping.
- The local sequential-write baseline is noisy in this environment, so the read and random-read deltas are the more reliable signal.