# Storage Benchmarks
Generated on 2026-03-27T12:08:47+09:00 with:

```bash
nix run ./nix/test-cluster#cluster -- bench-storage
```

## CoronaFS
Cluster network baseline, measured with `iperf3` from `node04` to `node01` before the storage tests:

| Metric | Result |
|---|---:|
| TCP throughput | 45.92 MiB/s |
| TCP retransmits | 193 |

Measured from `node04`. Local worker disk is the baseline. CoronaFS now has two relevant data paths in the lab: the controller export sourced from `node01`, and the node-local export materialized onto the worker that actually attaches the mutable VM disk.

| Metric | Local Disk | Controller Export | Node-local Export |
|---|---:|---:|---:|
| Sequential write | 679.05 MiB/s | 30.35 MiB/s | 395.06 MiB/s |
| Sequential read | 2723.40 MiB/s | 42.70 MiB/s | 709.14 MiB/s |
| 4k random read | 16958 IOPS | 2034 IOPS | 5087 IOPS |
| 4k queued random read (`iodepth=32`) | 106026 IOPS | 14261 IOPS | 28898 IOPS |

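These metrics match the shape of standard `fio` microbenchmarks. A minimal sketch of equivalent invocations, assuming a scratch file on the mount under test; `/mnt/coronafs/bench.dat`, the block sizes, and the runtimes are placeholder assumptions, not the harness's actual job definitions:

```shell
# Sequential read at 1 MiB blocks; swap --rw=write for the write row.
fio --name=seq --filename=/mnt/coronafs/bench.dat --rw=read --bs=1M \
    --size=1G --direct=1 --time_based --runtime=30

# 4k random read with a single outstanding I/O.
fio --name=randread --filename=/mnt/coronafs/bench.dat --rw=randread --bs=4k \
    --size=1G --direct=1 --time_based --runtime=30

# 4k queued random read, matching the iodepth=32 rows.
fio --name=qread --filename=/mnt/coronafs/bench.dat --rw=randread --bs=4k \
    --size=1G --direct=1 --ioengine=libaio --iodepth=32 --time_based --runtime=30
```

`--direct=1` bypasses the page cache so the export path, not the client's RAM, is what gets measured.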
Queue-depth profile (`libaio`, `iodepth=32`) from the same worker:

| Metric | Local Disk | Controller Export | Node-local Export |
|---|---:|---:|---:|
| Depth-32 write | 3417.45 MiB/s | 39.26 MiB/s | 178.04 MiB/s |
| Depth-32 read | 12996.47 MiB/s | 55.71 MiB/s | 112.88 MiB/s |

Node-local materialization timing and target-node steady-state read path:

| Metric | Result |
|---|---:|
| Node04 materialize latency | 9.23 s |
| Node05 materialize latency | 5.82 s |
| Node05 node-local sequential read | 709.14 MiB/s |

PlasmaVMC now prefers the worker-local CoronaFS export for mutable node-local volumes, even when the underlying materialization is a qcow2 overlay. The VM runtime section below is therefore the closest end-to-end proxy for real local-attach VM I/O, while the node-local export numbers remain useful for CoronaFS service consumers and for diagnosing exporter overhead.

## LightningStor
Measured from `node03` against the S3-compatible endpoint on `node01`. The object path exercised the distributed backend with replication across the worker storage nodes.

Cluster network baseline for this client, measured with `iperf3` from `node03` to `node01` before the storage tests:

| Metric | Result |
|---|---:|
| TCP throughput | 45.99 MiB/s |
| TCP retransmits | 207 |

### Large-object path

| Metric | Result |
|---|---:|
| Object size | 256 MiB |
| Upload throughput | 18.20 MiB/s |
| Download throughput | 39.21 MiB/s |

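The large-object numbers are easiest to read against the ~45.99 MiB/s network ceiling measured for this client. A quick check of that ratio, using the figures from the tables above:

```shell
# Express large-object throughput as a share of the node03 -> node01
# TCP baseline (45.99 MiB/s) measured before the storage tests.
awk 'BEGIN {
  net = 45.99
  printf "download: %.1f%% of network ceiling\n", 100 * 39.21 / net
  printf "upload:   %.1f%% of network ceiling\n", 100 * 18.20 / net
}'
```

Downloads land at roughly 85% of the raw TCP path; uploads sit much lower, consistent with replication fan-out on the write side.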
### Small-object batch

Measured as 32 objects of 4 MiB each (128 MiB total).

| Metric | Result |
|---|---:|
| Batch upload throughput | 18.96 MiB/s |
| Batch download throughput | 39.88 MiB/s |
| PUT rate | 4.74 objects/s |
| GET rate | 9.97 objects/s |

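The batch figures are internally consistent: each rate times the 4 MiB object size reproduces the corresponding throughput. A quick sanity check:

```shell
# Rate (objects/s) x object size (MiB) should equal batch throughput (MiB/s).
awk 'BEGIN {
  size_mib = 4
  printf "upload:   %.2f MiB/s\n", 4.74 * size_mib   # PUT rate
  printf "download: %.2f MiB/s\n", 9.97 * size_mib   # GET rate
}'
```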
### Parallel small-object batch

Measured as the same 32 objects of 4 MiB each, but with 8 concurrent client jobs from `node03`.

| Metric | Result |
|---|---:|
| Parallel batch upload throughput | 16.23 MiB/s |
| Parallel batch download throughput | 26.07 MiB/s |
| Parallel PUT rate | 4.06 objects/s |
| Parallel GET rate | 6.52 objects/s |

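Worth noting when reading these figures: 8-way concurrency does not help on this cluster; it regresses both directions relative to the serial batch. A quick comparison against the serial numbers:

```shell
# Parallel batch throughput as a share of the serial batch throughput.
awk 'BEGIN {
  printf "upload:   %.1f%% of serial\n", 100 * 16.23 / 18.96
  printf "download: %.1f%% of serial\n", 100 * 26.07 / 39.88
}'
```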
## VM Image Path

Measured against the `PlasmaVMC -> LightningStor artifact -> CoronaFS-backed managed volume` clone path on `node01`.

| Metric | Result |
|---|---:|
| Guest image artifact size | 2017 MiB |
| Guest image virtual size | 4096 MiB |
| `CreateImage` latency | 66.49 s |
| First image-backed `CreateVolume` latency | 16.86 s |
| Second image-backed `CreateVolume` latency | 0.12 s |

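The drop from 16.86 s to 0.12 s between the first and second clone is consistent with a copy-on-write overlay: once the base image is materialized, each further volume only needs a thin qcow2 overlay referencing it. A hedged sketch of that kind of clone; the file names are illustrative, not PlasmaVMC's actual on-disk layout:

```shell
# Subsequent clones: create a thin qcow2 overlay that references the shared
# base image (-b backing file, -F backing format). This writes only metadata,
# which is why it completes in fractions of a second.
qemu-img create -f qcow2 -b base.qcow2 -F qcow2 volume-2.qcow2
```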
## VM Runtime Path

Measured against the real `StartVm -> qemu attach -> guest boot -> guest fio` path on a worker node, using a CoronaFS-backed root disk and data disk.

| Metric | Result |
|---|---:|
| `StartVm` to qemu attach | 0.60 s |
| `StartVm` to guest benchmark result | 35.69 s |
| Guest sequential write | 123.49 MiB/s |
| Guest sequential read | 1492.71 MiB/s |
| Guest 4k random read | 25550 IOPS |

## Assessment

- CoronaFS controller-export sequential reads are currently 1.6% of the measured local-disk baseline on this nested-QEMU lab cluster.
- CoronaFS controller-export 4k random reads are currently 12.0% of the measured local-disk baseline.
- CoronaFS controller-export queued 4k random reads are currently 13.5% of the measured local queued-random-read baseline.
- CoronaFS controller-export sequential reads are currently 93.0% of the measured node04->node01 TCP baseline, which isolates the centralized source path from raw cluster-network limits.
- CoronaFS controller-export depth-32 reads are currently 0.4% of the local depth-32 baseline.
- CoronaFS node-local sequential reads are currently 26.0% of the measured local-disk baseline, which is the more relevant steady-state signal for mutable VM disks after attachment.
- CoronaFS node-local 4k random reads are currently 30.0% of the measured local-disk baseline.
- CoronaFS node-local queued 4k random reads are currently 27.3% of the measured local queued-random-read baseline.
- CoronaFS node-local depth-32 reads are currently 0.9% of the local depth-32 baseline.
- After materialization, the target worker's node-local read path is a better proxy for restart and migration steady state than the old shared-export read.
- PlasmaVMC now attaches writable node-local volumes through the worker-local CoronaFS export, so the guest-runtime section should be treated as the real local VM steady-state path rather than the node-local export numbers alone.
- CoronaFS single-depth writes remain sensitive to the nested-QEMU/VDE lab transport, so the queued-depth and guest-runtime numbers are still a more reliable proxy for real VM workload behavior than the single-stream write figure alone.
- The central export path is best understood as a source/materialization path; the worker-local export is the path that should determine VM-disk readiness going forward.
- LightningStor's replicated S3 path is working correctly, but 18.20 MiB/s upload and 39.21 MiB/s download are still lab-grade numbers rather than strong object-store throughput.
- LightningStor large-object downloads are currently 85.3% of the node03->node01 TCP baseline, which indicates how much of the headroom is being lost above the raw network path.
- The current S3 frontend tuning baseline is the built-in 16 MiB streaming threshold with multipart PUT/FETCH concurrency of 8; that combination is the best default observed on this lab cluster so far.
- LightningStor uploads should be read against the replication write quorum and the same ~45.99 MiB/s lab network ceiling; this environment still limits end-to-end throughput well before modern bare-metal NICs would.
- LightningStor's small-object batch path is also functional, but 4.74 PUT/s and 9.97 GET/s still indicate a lab cluster rather than a tuned object-storage deployment.
- The parallel small-object profile is the more relevant control-plane/object-ingest signal; it currently reaches 4.06 PUT/s and 6.52 GET/s.
- The VM image section measures clone/materialization cost, not guest runtime I/O.
- The PlasmaVMC local image-backed clone fast path is active again; a 0.12 s second clone indicates the CoronaFS qcow2 backing-file path is being hit on node01 rather than falling back to eager raw materialization.
- The VM runtime section is the real `PlasmaVMC + CoronaFS + QEMU virtio-blk + guest kernel` path; use it to judge whether QEMU/NBD tuning is helping.
- The local sequential-write baseline is noisy in this environment, so the read and random-read deltas are the more reliable signal.
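
The percentage figures in this list are simple ratios against the corresponding baseline rows. A small helper (hypothetical, not part of the benchmark harness) reproduces them from the tables above:

```shell
# pct MEASURED BASELINE -> measured value as a percentage of the baseline.
pct() { awk -v m="$1" -v b="$2" 'BEGIN { printf "%.1f\n", 100 * m / b }'; }

pct 42.70 2723.40   # controller-export vs. local sequential read
pct 709.14 2723.40  # node-local vs. local sequential read
pct 5087 16958      # node-local vs. local 4k random read
```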