# T036 VM Cluster Deployment - Configuration Guide

This document describes the node configurations prepared for the 3-node PlasmaCloud test cluster.

## Overview

**Goal:** Deploy and validate a 3-node PlasmaCloud cluster using T032 bare-metal provisioning tools in a VM environment.

**Deployment Profile:** Control-plane (all 9 PlasmaCloud services on each node)

**Cluster Mode:** Bootstrap (3-node Raft quorum initialization)

## Node Configurations

### Network Topology

| Node | IP | Hostname | MAC | Role |
|------|-----|----------|-----|------|
| node01 | 192.168.100.11 | node01.plasma.local | 52:54:00:00:01:01 | control-plane |
| node02 | 192.168.100.12 | node02.plasma.local | 52:54:00:00:01:02 | control-plane |
| node03 | 192.168.100.13 | node03.plasma.local | 52:54:00:00:01:03 | control-plane |

**Network:** 192.168.100.0/24 (QEMU multicast socket: 230.0.0.1:1234)

**Gateway:** 192.168.100.1 (PXE server)

### Directory Structure

```
T036-vm-cluster-deployment/
├── DEPLOYMENT.md                  (this file)
├── task.yaml
├── node01/
│   ├── configuration.nix          # NixOS system configuration
│   ├── disko.nix                  # Disk partitioning layout
│   └── secrets/
│       ├── cluster-config.json    # Raft cluster configuration
│       ├── ca.crt                 # [S3] CA certificate (to be added)
│       ├── node01.crt             # [S3] Node certificate (to be added)
│       ├── node01.key             # [S3] Node private key (to be added)
│       └── README.md              # Secrets documentation
├── node02/                        (same structure)
└── node03/                        (same structure)
```

## Configuration Details

### Control-Plane Services (Enabled on All Nodes)

1. **Chainfire** - Distributed configuration (ports: 2379/2380/2381)
2. **FlareDB** - KV database (ports: 2479/2480)
3. **IAM** - Identity management (port: 8080)
4. **PlasmaVMC** - VM control plane (port: 8081)
5. **PrismNET** - SDN controller (port: 8082)
6. **FlashDNS** - DNS server (port: 8053)
7. **FiberLB** - Load balancer (port: 8084)
8. **LightningStor** - Block storage (port: 8085)
9. **K8sHost** - Kubernetes component (port: 8086)

### Disk Layout (disko.nix)

All nodes use an identical single-disk LVM layout:

- **Device:** `/dev/vda` (100GB QCOW2)
- **Partitions:**
  - ESP (boot): 512MB, FAT32, mounted at `/boot`
  - LVM physical volume: remaining space (~99.5GB)
- **LVM Volume Group:** `pool`
  - `root` LV: 80GB, ext4, mounted at `/`
  - `data` LV: ~19.5GB, ext4, mounted at `/var/lib`

### Cluster Configuration (cluster-config.json)

All nodes are configured for **bootstrap mode** (3-node simultaneous initialization):

```json
{
  "bootstrap": true,
  "initial_peers": ["node01:2380", "node02:2380", "node03:2380"],
  "flaredb_peers": ["node01:2480", "node02:2480", "node03:2480"]
}
```

**Key Points:**

- All 3 nodes have `bootstrap: true` (Raft bootstrap cluster)
- `leader_url` points to node01 (the first node) for reference
- `initial_peers` is identical on all nodes (required for bootstrap)
- First-boot automation initializes the cluster automatically

### First-Boot Automation

Enabled on all nodes via `services.first-boot-automation`:

1. Wait for local service health (Chainfire, FlareDB, IAM)
2. Detect bootstrap mode (`bootstrap: true`)
3. Skip cluster join (bootstrap nodes auto-form the cluster via `initial_peers`)
4. Create marker files (`.chainfire-initialized`, `.flaredb-initialized`)
5. Run health checks

**Expected Behavior:**

- All 3 nodes start simultaneously
- Raft consensus auto-elects a leader
- Cluster operational within 30-60 seconds

## Next Steps (After S4)

### S3: TLS Certificate Generation (PeerA)

Generate certificates and copy them to each node's `secrets/` directory:

```bash
# Generate CA and node certificates (see T032 QUICKSTART)
cd /home/centra/cloud/baremetal/tls
./generate-ca.sh
./generate-node-cert.sh node01.plasma.local 192.168.100.11
./generate-node-cert.sh node02.plasma.local 192.168.100.12
./generate-node-cert.sh node03.plasma.local 192.168.100.13

# Copy to node configuration directories
cp ca.crt docs/por/T036-vm-cluster-deployment/node01/secrets/
cp node01.crt node01.key docs/por/T036-vm-cluster-deployment/node01/secrets/
# Repeat for node02 and node03
```

### S5: Cluster Provisioning (PeerA + PeerB)

Deploy using nixos-anywhere:

```bash
cd /home/centra/cloud

# Start VMs (S1 - already done by PeerA)
# VMs should be running and accessible via the PXE network

# Deploy all 3 nodes in parallel
for node in node01 node02 node03; do
  nixos-anywhere --flake docs/por/T036-vm-cluster-deployment/$node \
    root@$node.plasma.local &
done
wait

# Monitor first-boot logs
ssh root@node01.plasma.local 'journalctl -u chainfire-cluster-join.service -f'
```

### S6: Cluster Validation (Both)

Verify cluster health:

```bash
# Check Chainfire cluster
curl -k https://192.168.100.11:2379/admin/cluster/members | jq
# Expected: 3 members, all healthy, leader elected

# Check FlareDB cluster
curl -k https://192.168.100.11:2479/admin/cluster/members | jq

# Test CRUD operations
curl -k -X PUT https://192.168.100.11:2479/api/v1/kv/test-key \
  -H "Content-Type: application/json" \
  -d '{"value": "hello-cluster"}'
curl -k https://192.168.100.11:2479/api/v1/kv/test-key

# Verify data replicated to all nodes
curl -k https://192.168.100.12:2479/api/v1/kv/test-key
curl -k https://192.168.100.13:2479/api/v1/kv/test-key
```

## Coordination with PeerA

**PeerA Status (from S1):**

- ✅
VM infrastructure created (QEMU multicast socket)
- ✅ Disk images created (node01/02/03.qcow2, pxe-server.qcow2)
- ✅ Launch scripts ready
- ⏳ S2 (PXE Server) - waiting on Full PXE decision (Foreman MID: 000620)
- ⏳ S3 (TLS Certs) - pending

**PeerB Status (S4):**

- ✅ Node configurations complete (configuration.nix, disko.nix)
- ✅ Cluster configs ready (cluster-config.json)
- ✅ TLS directory structure prepared
- ⏳ Awaiting S3 certificates from PeerA

**Dependency Flow:**

```
S1 (VMs) → S2 (PXE) → S3 (TLS) → S4 (Configs) → S5 (Provision) → S6 (Validate)
  PeerA       PeerA      PeerA       PeerB           Both             Both
```

## Configuration Files Reference

### configuration.nix

- Imports: `hardware-configuration.nix`, `disko.nix`, `nix/modules/default.nix`
- Network: static IP, hostname, firewall rules
- Services: all control-plane services enabled
- First-boot: enabled with cluster-config.json
- SSH: key-based authentication only
- System packages: vim, htop, curl, jq, tcpdump, etc.

### disko.nix

- Based on the disko project format
- Declarative disk partitioning
- Executed by nixos-anywhere during provisioning
- Creates: EFI boot partition + LVM (root + data)

### cluster-config.json

- Read by the first-boot-automation systemd services
- Defines: node identity, Raft peers, bootstrap mode
- Deployed to: `/etc/nixos/secrets/cluster-config.json`

## Troubleshooting

### If Provisioning Fails

1. Check VM network connectivity: `ping 192.168.100.11`
2. Verify the PXE server is serving netboot images (S2)
3. Check that TLS certificates exist in the `secrets/` directories (S3)
4. Review the nixos-anywhere logs
5. Check disko.nix syntax: `nix eval --json -f disko.nix`

### If Cluster Join Fails

1. SSH to the node: `ssh root@192.168.100.11`
2. Check service status: `systemctl status chainfire.service`
3. View first-boot logs: `journalctl -u chainfire-cluster-join.service`
4. Verify cluster-config.json: `jq . /etc/nixos/secrets/cluster-config.json`
5. Test the health endpoint: `curl -k https://localhost:2379/health`

### If Cluster Not Forming

1. Verify all 3 nodes started simultaneously (a bootstrap requirement)
2. Check that `initial_peers` matches on all nodes
3. Check network connectivity between nodes: `ping 192.168.100.12`
4. Check that the firewall allows the Raft ports (2380, 2480)
5. Review Chainfire logs: `journalctl -u chainfire.service`

## Documentation References

- **T032 Bare-Metal Provisioning**: `/home/centra/cloud/docs/por/T032-baremetal-provisioning/`
- **First-Boot Automation**: `/home/centra/cloud/baremetal/first-boot/README.md`
- **Image Builder**: `/home/centra/cloud/baremetal/image-builder/README.md`
- **VM Cluster Setup**: `/home/centra/cloud/baremetal/vm-cluster/README.md`
- **NixOS Modules**: `/home/centra/cloud/nix/modules/`

## Notes

- **Bootstrap vs Join**: All 3 nodes use bootstrap mode (simultaneous start). Additional nodes would use `bootstrap: false` and join via `leader_url`.
- **PXE vs Direct**: The Foreman decision (MID: 000620) confirms Full PXE validation. S2 will build and deploy the netboot artifacts.
- **Hardware Config**: `hardware-configuration.nix` will be auto-generated by nixos-anywhere during provisioning.
- **SSH Keys**: The placeholder key in configuration.nix will be replaced with the actual provisioning key during nixos-anywhere deployment.

## Success Criteria (T036 Acceptance)

- ✅ 3 VMs deployed with QEMU
- ✅ Virtual network configured (multicast socket)
- ⏳ PXE server operational (S2)
- ⏳ All 3 nodes provisioned via nixos-anywhere (S5)
- ⏳ Chainfire + FlareDB Raft clusters formed (S6)
- ⏳ IAM service operational on all nodes (S6)
- ⏳ Health checks passing (S6)
- ⏳ T032 RUNBOOK validated end-to-end (S6)

---

**S4 Status:** COMPLETE (Node Configs Ready for S5)

**Next:** Awaiting S3 (TLS Certs) + S2 (PXE Server) from PeerA
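### Appendix: Pre-Flight Config Check (Sketch)

The cluster-config.json requirements above (bootstrap mode, identical peer lists) can be scripted as a pre-flight check before S5. This is a minimal sketch, assuming `jq` is installed (the guide already uses it); `check_cfg` is a hypothetical helper and `/tmp/cluster-config.json` is just a sample path, not a deployment location.

```bash
#!/usr/bin/env bash
# Pre-flight sketch: validate a node's cluster-config.json before running S5.
# check_cfg is a hypothetical helper; it covers only the fields documented above.

check_cfg() {
  local cfg="$1"
  jq -e '.bootstrap == true' "$cfg" >/dev/null \
    || { echo "FAIL: bootstrap must be true on all three nodes"; return 1; }
  jq -e '.initial_peers == ["node01:2380","node02:2380","node03:2380"]' "$cfg" >/dev/null \
    || { echo "FAIL: initial_peers must be identical on every node"; return 1; }
  jq -e '.flaredb_peers | length == 3' "$cfg" >/dev/null \
    || { echo "FAIL: expected 3 FlareDB peers"; return 1; }
  echo "OK: $cfg is bootstrap-ready"
}

# Example run against the exact config shown in the Cluster Configuration section
cat > /tmp/cluster-config.json <<'EOF'
{
  "bootstrap": true,
  "initial_peers": ["node01:2380", "node02:2380", "node03:2380"],
  "flaredb_peers": ["node01:2480", "node02:2480", "node03:2480"]
}
EOF
check_cfg /tmp/cluster-config.json   # prints: OK: /tmp/cluster-config.json is bootstrap-ready
```

Running this against each of the three `secrets/` directories before invoking nixos-anywhere would catch a mismatched peer list, which is the failure mode described under "If Cluster Not Forming".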
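### Appendix: S6 Health Sweep (Sketch)

The per-node health checks in S6 can be swept in a single loop. This sketch assumes Chainfire, FlareDB, and IAM each expose an HTTPS `/health` endpoint; the guide only shows this for Chainfire (port 2379), so the FlareDB and IAM endpoints are assumptions. The `CURL` override exists only so the report can be previewed without a running cluster.

```bash
#!/usr/bin/env bash
# S6 sketch: probe the /health endpoints of Chainfire, FlareDB, and IAM on all
# three nodes. Only Chainfire's /health (port 2379) appears in this guide; the
# FlareDB and IAM endpoints are assumed. Override CURL for a dry run.

health_sweep() {
  local probe="${CURL:-curl -ks --max-time 5}"
  local nodes=(192.168.100.11 192.168.100.12 192.168.100.13)
  local ports=(2379 2479 8080)   # Chainfire, FlareDB, IAM
  local ip port
  for ip in "${nodes[@]}"; do
    for port in "${ports[@]}"; do
      if $probe "https://${ip}:${port}/health" >/dev/null 2>&1; then
        echo "OK   ${ip}:${port}"
      else
        echo "FAIL ${ip}:${port}"
      fi
    done
  done
}

# Dry run: substitute `true` for curl so every probe "succeeds" and the report
# shape can be previewed without a running cluster (prints nine OK lines).
CURL=true health_sweep
```

On a live cluster, drop the `CURL=true` override; any `FAIL` line then points directly at the node:port to investigate with the Troubleshooting steps above.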