# PlasmaVMC ChainFire Key Schema
**Date:** 2025-12-08
**Task:** T013 S1
**Status:** Design Complete
## Key Layout
### VM Metadata
```
Key: /plasmavmc/vms/{org_id}/{project_id}/{vm_id}
Value: JSON-serialized VirtualMachine (plasmavmc_types::VirtualMachine)
```
### VM Handle
```
Key: /plasmavmc/handles/{org_id}/{project_id}/{vm_id}
Value: JSON-serialized VmHandle (plasmavmc_types::VmHandle)
```
### Lock Key (for atomic operations)
```
Key: /plasmavmc/locks/{org_id}/{project_id}/{vm_id}
Value: JSON-serialized LockInfo { timestamp: u64, node_id: String }
TTL: 30 seconds (via ChainFire lease)
```
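The three key shapes above differ only in their second path segment, so they can be built from one helper. The following is a minimal sketch; `VmRef` and its methods are hypothetical names used for illustration and are not taken from `plasmavmc_types`. It assumes identifiers never contain `/`.

```rust
/// Root namespace shared by every PlasmaVMC key.
const PREFIX: &str = "/plasmavmc";

/// Identifies one VM within a tenant (org + project).
/// Hypothetical helper type, not part of `plasmavmc_types`.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct VmRef {
    pub org_id: String,
    pub project_id: String,
    pub vm_id: String,
}

impl VmRef {
    pub fn new(org_id: &str, project_id: &str, vm_id: &str) -> Self {
        Self {
            org_id: org_id.to_owned(),
            project_id: project_id.to_owned(),
            vm_id: vm_id.to_owned(),
        }
    }

    /// Builds `/plasmavmc/{kind}/{org_id}/{project_id}/{vm_id}`.
    /// Assumes the identifiers contain no `/` characters.
    fn key(&self, kind: &str) -> String {
        format!(
            "{}/{}/{}/{}/{}",
            PREFIX, kind, self.org_id, self.project_id, self.vm_id
        )
    }

    pub fn vm_key(&self) -> String {
        self.key("vms")
    }

    pub fn handle_key(&self) -> String {
        self.key("handles")
    }

    pub fn lock_key(&self) -> String {
        self.key("locks")
    }
}
```

Keeping the layout in one place means a change to the tenant scoping only has to be made once.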
## Key Structure Rationale
1. **Prefix-based organization**: `/plasmavmc/` namespace isolates PlasmaVMC data
2. **Tenant scoping**: `{org_id}/{project_id}` ensures multi-tenancy
3. **Resource separation**: Separate keys for VM metadata and handles enable independent updates
4. **Lock mechanism**: Uses ChainFire lease TTL for distributed locking without manual cleanup
## Serialization
- **Format**: JSON (via `serde_json`)
- **Rationale**: Human-readable, debuggable, compatible with existing `PersistedState` structure
- **Alternative considered**: bincode (rejected because its binary encoding is harder to inspect and debug)
## Atomic Write Strategy
### Option 1: Transaction-based (Preferred)
Use ChainFire transactions to atomically update VM + handle:
```rust
// Pseudo-code
let txn = TxnRequest {
    // Guard: proceed only if the lock key has never been written.
    compare: vec![Compare {
        key: lock_key,
        result: CompareResult::Equal,
        target: CompareTarget::Version(0), // Lock doesn't exist
    }],
    // Guard passed: all three puts commit together or not at all.
    success: vec![
        RequestOp { request: Some(Request::Put(vm_put)) },
        RequestOp { request: Some(Request::Put(handle_put)) },
        RequestOp { request: Some(Request::Put(lock_put)) },
    ],
    // Guard failed: nothing is written.
    failure: vec![],
};
```
```
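To make the compare-then-write semantics concrete, here is a small in-memory model of the same idea. `MemKv` and `put_all_if_absent` are hypothetical stand-ins written for this document and are not the ChainFire client API. The sketch shows only the guard behavior: if the lock key already exists, none of the writes are applied.

```rust
use std::collections::BTreeMap;

/// In-memory stand-in for a key-value store. NOT the ChainFire client.
#[derive(Default)]
pub struct MemKv {
    data: BTreeMap<String, Vec<u8>>,
}

impl MemKv {
    /// Applies every put only if `guard_key` is absent, which mirrors
    /// comparing the key's version against 0. Returns `true` when the
    /// writes were applied and `false` when the guard failed.
    pub fn put_all_if_absent(
        &mut self,
        guard_key: &str,
        puts: Vec<(String, Vec<u8>)>,
    ) -> bool {
        if self.data.contains_key(guard_key) {
            return false; // Guard failed: nothing is written.
        }
        for (key, value) in puts {
            self.data.insert(key, value);
        }
        true
    }

    pub fn get(&self, key: &str) -> Option<&[u8]> {
        self.data.get(key).map(|v| v.as_slice())
    }
}
```

A second writer that issues the same call while the lock key is present gets `false` back and leaves the first writer's VM and handle values untouched.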
### Option 2: Lease-based Locking (Fallback)
1. Acquire lease (30s TTL)
2. Put lock key with lease_id
3. Update VM + handle
4. Release lease (or let expire)
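The four steps above can be modeled without a cluster. The sketch below is an in-memory approximation, assuming a lock is simply a key with a holder and an expiry time. `LeaseLocks`, `acquire`, and `release` are hypothetical names and do not reflect the ChainFire lease API. Time is passed in explicitly (in seconds) so that expiry is deterministic.

```rust
use std::collections::HashMap;

/// In-memory stand-in for lease-backed lock keys. NOT the ChainFire API.
#[derive(Default)]
pub struct LeaseLocks {
    /// lock key -> (holder node_id, expiry time in seconds)
    held: HashMap<String, (String, u64)>,
}

impl LeaseLocks {
    /// Steps 1-2: grant a lease of `ttl_secs` and attach the lock key to it.
    /// Fails while another lease on the same key is still live.
    pub fn acquire(&mut self, key: &str, node_id: &str, now: u64, ttl_secs: u64) -> bool {
        let live = matches!(self.held.get(key), Some((_, expires_at)) if *expires_at > now);
        if live {
            return false;
        }
        // Either the key was never locked or the previous lease has expired.
        self.held
            .insert(key.to_owned(), (node_id.to_owned(), now + ttl_secs));
        true
    }

    /// Step 4: release early. Only the current holder may release.
    pub fn release(&mut self, key: &str, node_id: &str) -> bool {
        let is_holder = matches!(self.held.get(key), Some((holder, _)) if holder == node_id);
        if is_holder {
            self.held.remove(key);
        }
        is_holder
    }
}
```

Step 3 (updating the VM and handle) happens between `acquire` and `release`. If the holder crashes before releasing, the next `acquire` succeeds once the TTL has elapsed, which is the "let expire" path.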
## Fallback Behavior
### File Fallback Mode
- **Trigger**: `PLASMAVMC_STORAGE_BACKEND=file` or `PLASMAVMC_CHAINFIRE_ENDPOINT` unset
- **Behavior**: Use existing file-based persistence (`PLASMAVMC_STATE_PATH`)
- **Locking**: File-based lockfile (`{state_path}.lock`) with `flock()` or atomic rename
### Migration Path
1. On startup, if ChainFire unavailable and file exists, load from file
2. If ChainFire available, prefer ChainFire; migrate file → ChainFire on first write
3. File fallback remains for development/testing without ChainFire cluster
## Configuration
### Environment Variables
- `PLASMAVMC_STORAGE_BACKEND`: `chainfire` (default) | `file`
- `PLASMAVMC_CHAINFIRE_ENDPOINT`: ChainFire gRPC endpoint (e.g., `http://127.0.0.1:50051`)
- `PLASMAVMC_STATE_PATH`: File fallback path (default: `/var/run/plasmavmc/state.json`)
- `PLASMAVMC_LOCK_TTL_SECONDS`: Lock TTL (default: 30)
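One way to resolve these variables, together with the file fallback trigger described earlier, is sketched below. `StorageConfig`, `Backend`, and `from_lookup` are hypothetical names. The sketch assumes that an unparseable `PLASMAVMC_LOCK_TTL_SECONDS` falls back to the default, which the design does not specify.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Backend {
    ChainFire { endpoint: String },
    File,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct StorageConfig {
    pub backend: Backend,
    pub state_path: String,
    pub lock_ttl_seconds: u64,
}

impl StorageConfig {
    /// `lookup` abstracts the environment so the rules can be tested
    /// without mutating process state.
    pub fn from_lookup(lookup: impl Fn(&str) -> Option<String>) -> Self {
        // File mode is selected explicitly, or implicitly when no endpoint is set.
        let wants_file = lookup("PLASMAVMC_STORAGE_BACKEND").as_deref() == Some("file");
        let backend = match lookup("PLASMAVMC_CHAINFIRE_ENDPOINT") {
            Some(endpoint) if !wants_file => Backend::ChainFire { endpoint },
            _ => Backend::File,
        };
        Self {
            backend,
            state_path: lookup("PLASMAVMC_STATE_PATH")
                .unwrap_or_else(|| "/var/run/plasmavmc/state.json".to_owned()),
            // Assumption: an invalid value falls back to the 30-second default.
            lock_ttl_seconds: lookup("PLASMAVMC_LOCK_TTL_SECONDS")
                .and_then(|v| v.parse().ok())
                .unwrap_or(30),
        }
    }

    /// Reads the real process environment.
    pub fn from_env() -> Self {
        Self::from_lookup(|name| std::env::var(name).ok())
    }
}
```

Production code would call `StorageConfig::from_env()`, while tests can pass a closure over a fixed set of values.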
### Config File (Future)
```toml
[storage]
backend = "chainfire" # or "file"
chainfire_endpoint = "http://127.0.0.1:50051"
state_path = "/var/run/plasmavmc/state.json"
lock_ttl_seconds = 30
```
## Operations
### Create VM
1. Generate `vm_id` (UUID)
2. Acquire lock (transaction or lease)
3. Put VM metadata key
4. Put VM handle key
5. Release lock
### Update VM
1. Acquire lock
2. Get current VM (verify exists)
3. Put updated VM metadata
4. Put updated handle (if changed)
5. Release lock
### Delete VM
1. Acquire lock
2. Delete VM metadata key
3. Delete VM handle key
4. Release lock
### Load on Startup
1. Scan prefix `/plasmavmc/vms/{org_id}/{project_id}/`
2. For each VM key, extract `vm_id`
3. Load VM metadata
4. Load corresponding handle
5. Populate in-memory DashMap
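Steps 2 and 4 depend on turning a scanned VM key back into its identifiers and then deriving the matching handle key. A minimal sketch follows; `parse_vm_key` and `handle_key_for` are hypothetical helpers, and the scan itself is left out because it depends on the ChainFire client.

```rust
/// Splits a key of the form `/plasmavmc/vms/{org_id}/{project_id}/{vm_id}`
/// into `(org_id, project_id, vm_id)`. Returns `None` for any other shape.
pub fn parse_vm_key(key: &str) -> Option<(&str, &str, &str)> {
    let rest = key.strip_prefix("/plasmavmc/vms/")?;
    let mut parts = rest.split('/');
    let org_id = parts.next()?;
    let project_id = parts.next()?;
    let vm_id = parts.next()?;
    // Reject trailing segments and empty identifiers.
    let has_extra = parts.next().is_some();
    let has_empty = [org_id, project_id, vm_id].iter().any(|s| s.is_empty());
    if has_extra || has_empty {
        return None;
    }
    Some((org_id, project_id, vm_id))
}

/// Builds the handle key that corresponds to a parsed VM key.
pub fn handle_key_for(org_id: &str, project_id: &str, vm_id: &str) -> String {
    format!("/plasmavmc/handles/{}/{}/{}", org_id, project_id, vm_id)
}
```

Keys that fail to parse can be logged and skipped so that one malformed entry does not block startup.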
## Error Handling
- **ChainFire unavailable**: Fall back to file mode (if configured)
- **Lock contention**: Retry with exponential backoff (max 3 retries)
- **Serialization error**: Log and return the error to the caller (not expected in normal operation)
- **Partial write**: Prevented by the transaction-based strategy, since a transaction applies all of its operations or none
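The lock-contention policy can be sketched as a small retry helper. `retry_with_backoff` is a hypothetical function, and it reads "max 3 retries" as one initial attempt plus up to three more. For brevity it retries on every error; a real implementation would presumably retry only on contention errors.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Runs `op` once, then retries up to `max_retries` times, doubling the
/// delay after each failed attempt. Returns the last error if all fail.
pub fn retry_with_backoff<T, E>(
    max_retries: u32,
    base_delay: Duration,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut delay = base_delay;
    let mut retries_left = max_retries;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // Out of retries: surface the most recent error.
            Err(err) if retries_left == 0 => return Err(err),
            Err(_) => {
                sleep(delay);
                delay = delay.saturating_mul(2);
                retries_left -= 1;
            }
        }
    }
}
```

With a base delay of 100 ms, the waits between attempts would be 100 ms, 200 ms, and 400 ms. Adding random jitter is a common refinement when many nodes contend for the same lock.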
## Testing Considerations
- Unit tests: Mock ChainFire client
- Integration tests: Real ChainFire server (env-gated)
- Fallback tests: Disable ChainFire, verify file mode works
- Lock tests: Concurrent operations, verify atomicity