# T036 VM Cluster Deployment - Configuration Guide

This document describes the node configurations prepared for the 3-node PlasmaCloud test cluster.

## Overview

**Goal:** Deploy and validate a 3-node PlasmaCloud cluster using T032 bare-metal provisioning tools in a VM environment.

**Deployment Profile:** Control-plane (all 9 PlasmaCloud services on each node)

**Cluster Mode:** Bootstrap (3-node Raft quorum initialization)

## Node Configurations

### Network Topology

| Node | IP | Hostname | MAC | Role |
|------|----|----------|-----|------|
| node01 | 192.168.100.11 | node01.plasma.local | 52:54:00:00:01:01 | control-plane |
| node02 | 192.168.100.12 | node02.plasma.local | 52:54:00:00:01:02 | control-plane |
| node03 | 192.168.100.13 | node03.plasma.local | 52:54:00:00:01:03 | control-plane |

**Network:** 192.168.100.0/24 (QEMU multicast socket: 230.0.0.1:1234)

**Gateway:** 192.168.100.1 (PXE server)

### Directory Structure

```
T036-vm-cluster-deployment/
├── DEPLOYMENT.md                 (this file)
├── task.yaml
├── node01/
│   ├── configuration.nix         # NixOS system configuration
│   ├── disko.nix                 # Disk partitioning layout
│   └── secrets/
│       ├── cluster-config.json   # Raft cluster configuration
│       ├── ca.crt                # [S3] CA certificate (to be added)
│       ├── node01.crt            # [S3] Node certificate (to be added)
│       ├── node01.key            # [S3] Node private key (to be added)
│       └── README.md             # Secrets documentation
├── node02/  (same structure)
└── node03/  (same structure)
```

## Configuration Details

### Control-Plane Services (Enabled on All Nodes)

1. **Chainfire** - Distributed configuration (ports: 2379/2380/2381)
2. **FlareDB** - KV database (ports: 2479/2480)
3. **IAM** - Identity management (port: 8080)
4. **PlasmaVMC** - VM control plane (port: 8081)
5. **PrismNET** - SDN controller (port: 8082)
6. **FlashDNS** - DNS server (port: 8053)
7. **FiberLB** - Load balancer (port: 8084)
8. **LightningStor** - Block storage (port: 8085)
9. **K8sHost** - Kubernetes component (port: 8086)

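The service-to-port mapping above can be captured in a small helper for ad-hoc health checks. A sketch: the `/health` path is shown later in this guide for Chainfire, and extending it to every service is an assumption; `health_url` is an illustrative name, not part of the tooling.

```shell
#!/usr/bin/env bash
# Map each control-plane service to its primary port (from the list above)
# and build a health-check URL for a given host.
declare -A SERVICE_PORTS=(
  [chainfire]=2379
  [flaredb]=2479
  [iam]=8080
  [plasmavmc]=8081
  [prismnet]=8082
  [flashdns]=8053
  [fiberlb]=8084
  [lightningstor]=8085
  [k8shost]=8086
)

health_url() {
  local host=$1 service=$2
  echo "https://${host}:${SERVICE_PORTS[$service]}/health"
}

# Example (live call commented out; requires a running cluster):
# curl -k "$(health_url 192.168.100.11 chainfire)"
health_url 192.168.100.11 iam
```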
### Disk Layout (disko.nix)

All nodes use an identical single-disk LVM layout:

- **Device:** `/dev/vda` (100GB QCOW2)
- **Partitions:**
  - ESP (boot): 512MB, FAT32, mounted at `/boot`
  - LVM physical volume: remaining space (~99.5GB)
- **LVM Volume Group:** `pool`
  - `root` LV: 80GB, ext4, mounted at `/`
  - `data` LV: ~19.5GB, ext4, mounted at `/var/lib`

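The sizes above can be cross-checked with shell arithmetic, treating GB as GiB (which is how the ~19.5GB `data` figure falls out of a 100GB disk minus the ESP and root LV):

```shell
#!/usr/bin/env bash
# Cross-check the disko.nix layout: 100GiB /dev/vda, minus the 512MiB ESP
# and the 80GiB root LV, leaves ~19.5GiB (19968 MiB) for the data LV.
disk_mib=$((100 * 1024))   # /dev/vda in MiB
esp_mib=512                # ESP boot partition
root_mib=$((80 * 1024))    # root LV
data_mib=$((disk_mib - esp_mib - root_mib))

echo "data LV: ${data_mib} MiB"
```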
### Cluster Configuration (cluster-config.json)

All nodes are configured for **bootstrap mode** (3-node simultaneous initialization):

```json
{
  "bootstrap": true,
  "initial_peers": ["node01:2380", "node02:2380", "node03:2380"],
  "flaredb_peers": ["node01:2480", "node02:2480", "node03:2480"]
}
```

**Key Points:**

- All 3 nodes have `bootstrap: true` (Raft bootstrap cluster)
- `leader_url` (not shown in the excerpt above) points to node01, the first node, for reference
- `initial_peers` is identical on all nodes (required for bootstrap)
- First-boot automation will initialize the cluster automatically

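For contrast, a node added after the initial quorum would join rather than bootstrap. A sketch of such a config, following the Bootstrap vs Join behavior described in this guide; the `leader_url` value (scheme, host, port) is an assumption for illustration, not taken from the repo:

```json
{
  "bootstrap": false,
  "leader_url": "https://node01.plasma.local:2379",
  "initial_peers": ["node01:2380", "node02:2380", "node03:2380"],
  "flaredb_peers": ["node01:2480", "node02:2480", "node03:2480"]
}
```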
### First-Boot Automation

Enabled on all nodes via `services.first-boot-automation`:

1. Wait for local service health (Chainfire, FlareDB, IAM)
2. Detect bootstrap mode (`bootstrap: true`)
3. Skip cluster join (bootstrap nodes auto-form the cluster via `initial_peers`)
4. Create marker files (`.chainfire-initialized`, `.flaredb-initialized`)
5. Run health checks

**Expected Behavior:**

- All 3 nodes start simultaneously
- Raft consensus auto-elects a leader
- Cluster operational within 30-60 seconds

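Step 2 of the sequence above (bootstrap detection) amounts to reading cluster-config.json. A minimal, self-contained sketch; the real first-boot units live in `baremetal/first-boot/`, and `detect_mode` is an illustrative name:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Decide bootstrap vs join from cluster-config.json. Matching the flag with
# grep keeps this sketch dependency-free; the real automation may use jq.
detect_mode() {
  local config=$1
  if grep -Eq '"bootstrap"[[:space:]]*:[[:space:]]*true' "$config"; then
    echo "bootstrap"
  else
    echo "join"
  fi
}

# Demo with a temporary config file (offline-safe)
tmp=$(mktemp)
printf '{"bootstrap": true, "initial_peers": ["node01:2380"]}\n' > "$tmp"
detect_mode "$tmp"   # prints "bootstrap"
rm -f "$tmp"
```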
## Next Steps (After S4)

### S3: TLS Certificate Generation (PeerA)

Generate certificates and copy them to each node's `secrets/` directory:

```bash
# Generate CA and node certificates (see T032 QUICKSTART)
cd /home/centra/cloud/baremetal/tls
./generate-ca.sh
./generate-node-cert.sh node01.plasma.local 192.168.100.11
./generate-node-cert.sh node02.plasma.local 192.168.100.12
./generate-node-cert.sh node03.plasma.local 192.168.100.13

# Copy to node configuration directories
cp ca.crt docs/por/T036-vm-cluster-deployment/node01/secrets/
cp node01.crt node01.key docs/por/T036-vm-cluster-deployment/node01/secrets/
# Repeat for node02 and node03
```

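The "Repeat for node02 and node03" step can be scripted so no node's cert is missed. A sketch; the `install_certs` helper and its arguments are illustrative, not part of the T032 tooling:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Copy ca.crt plus each node's cert/key from a TLS working directory into
# that node's secrets/ directory under the deployment tree.
install_certs() {
  local tls_dir=$1 deploy_dir=$2
  local node dest
  for node in node01 node02 node03; do
    dest="$deploy_dir/$node/secrets"
    mkdir -p "$dest"
    cp "$tls_dir/ca.crt" "$dest/"
    cp "$tls_dir/$node.crt" "$tls_dir/$node.key" "$dest/"
  done
}

# Usage (paths from the S3 steps above):
# install_certs /home/centra/cloud/baremetal/tls docs/por/T036-vm-cluster-deployment
```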
### S5: Cluster Provisioning (PeerA + PeerB)

Deploy using nixos-anywhere:

```bash
cd /home/centra/cloud

# Start VMs (S1 - already done by PeerA)
# VMs should be running and accessible via the PXE network

# Deploy all 3 nodes in parallel
for node in node01 node02 node03; do
  nixos-anywhere --flake docs/por/T036-vm-cluster-deployment/$node \
    root@$node.plasma.local &
done
wait

# Monitor first-boot logs
ssh root@node01.plasma.local 'journalctl -u chainfire-cluster-join.service -f'
```

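After `wait` returns, the nodes reboot into first boot, so an immediate `ssh` may be refused. A small retry helper makes the follow-up checks more robust; `retry_until` and `RETRY_DELAY` are illustrative, not part of the tooling:

```shell
#!/usr/bin/env bash

# Retry a command up to $1 times, sleeping between attempts; succeed as
# soon as the command does. RETRY_DELAY overrides the default 5s delay.
retry_until() {
  local tries=$1; shift
  local i
  for ((i = 0; i < tries; i++)); do
    "$@" && return 0
    sleep "${RETRY_DELAY:-5}"
  done
  return 1
}

# Usage: wait up to ~5 minutes for node01 to accept SSH after provisioning
# retry_until 60 ssh -o ConnectTimeout=5 root@node01.plasma.local true
```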
### S6: Cluster Validation (Both)

Verify cluster health:

```bash
# Check Chainfire cluster
curl -k https://192.168.100.11:2379/admin/cluster/members | jq

# Expected: 3 members, all healthy, leader elected

# Check FlareDB cluster
curl -k https://192.168.100.11:2479/admin/cluster/members | jq

# Test CRUD operations
curl -k -X PUT https://192.168.100.11:2479/api/v1/kv/test-key \
  -H "Content-Type: application/json" \
  -d '{"value": "hello-cluster"}'

curl -k https://192.168.100.11:2479/api/v1/kv/test-key

# Verify data replicated to all nodes
curl -k https://192.168.100.12:2479/api/v1/kv/test-key
curl -k https://192.168.100.13:2479/api/v1/kv/test-key
```

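The replication check at the end can be tightened into a pass/fail test by comparing the three GET responses. A sketch, assuming each GET returns the stored value verbatim; `all_equal` is an illustrative helper:

```shell
#!/usr/bin/env bash

# Succeed only if every argument is byte-identical to the first.
all_equal() {
  local first=$1 v
  shift
  for v in "$@"; do
    [ "$v" = "$first" ] || return 1
  done
}

# Usage against the live cluster (requires the nodes from S5):
# v1=$(curl -ks https://192.168.100.11:2479/api/v1/kv/test-key)
# v2=$(curl -ks https://192.168.100.12:2479/api/v1/kv/test-key)
# v3=$(curl -ks https://192.168.100.13:2479/api/v1/kv/test-key)
# all_equal "$v1" "$v2" "$v3" && echo "replicated" || echo "MISMATCH"
```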
## Coordination with PeerA

**PeerA Status (from S1):**

- ✅ VM infrastructure created (QEMU multicast socket)
- ✅ Disk images created (node01/02/03.qcow2, pxe-server.qcow2)
- ✅ Launch scripts ready
- ⏳ S2 (PXE Server) - Waiting on Full PXE decision (Foreman MID: 000620)
- ⏳ S3 (TLS Certs) - Pending

**PeerB Status (S4):**

- ✅ Node configurations complete (configuration.nix, disko.nix)
- ✅ Cluster configs ready (cluster-config.json)
- ✅ TLS directory structure prepared
- ⏳ Awaiting S3 certificates from PeerA

**Dependency Flow:**

```
S1 (VMs) → S2 (PXE) → S3 (TLS) → S4 (Configs) → S5 (Provision) → S6 (Validate)
  PeerA      PeerA      PeerA       PeerB           Both             Both
```

## Configuration Files Reference

### configuration.nix

- Imports: `hardware-configuration.nix`, `disko.nix`, `nix/modules/default.nix`
- Network: static IP, hostname, firewall rules
- Services: all control-plane services enabled
- First-boot: enabled with cluster-config.json
- SSH: key-based authentication only
- System packages: vim, htop, curl, jq, tcpdump, etc.

### disko.nix

- Based on the disko project format
- Declarative disk partitioning
- Executed by nixos-anywhere during provisioning
- Creates: EFI boot partition + LVM (root + data)

### cluster-config.json

- Read by the first-boot-automation systemd services
- Defines: node identity, Raft peers, bootstrap mode
- Deployed to: `/etc/nixos/secrets/cluster-config.json`

## Troubleshooting

### If Provisioning Fails

1. Check VM network connectivity: `ping 192.168.100.11`
2. Verify the PXE server is serving netboot images (S2)
3. Check that TLS certificates exist in the `secrets/` directories (S3)
4. Review nixos-anywhere logs
5. Check disko.nix syntax: `nix eval --json -f disko.nix`

### If Cluster Join Fails

1. SSH to the node: `ssh root@192.168.100.11`
2. Check service status: `systemctl status chainfire.service`
3. View first-boot logs: `journalctl -u chainfire-cluster-join.service`
4. Verify cluster-config.json: `jq . /etc/nixos/secrets/cluster-config.json`
5. Test the health endpoint: `curl -k https://localhost:2379/health`

### If the Cluster Is Not Forming

1. Verify all 3 nodes started simultaneously (bootstrap requirement)
2. Check that `initial_peers` matches on all nodes
3. Check network connectivity between nodes: `ping 192.168.100.12`
4. Check that the firewall allows the Raft ports (2380, 2480)
5. Review Chainfire logs: `journalctl -u chainfire.service`

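For the firewall/Raft-port check, a quick TCP reachability probe works from any node; bash's `/dev/tcp` needs no extra tools. A sketch with an illustrative helper name:

```shell
#!/usr/bin/env bash

# Return 0 if a TCP connection to host:port succeeds within 2 seconds.
check_port() {
  local host=$1 port=$2
  timeout 2 bash -c ">/dev/tcp/$host/$port" 2>/dev/null
}

# Usage: probe the Raft ports on a peer
# for port in 2380 2480; do
#   check_port 192.168.100.12 "$port" && echo "$port open" || echo "$port closed"
# done
```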
## Documentation References

- **T032 Bare-Metal Provisioning**: `/home/centra/cloud/docs/por/T032-baremetal-provisioning/`
- **First-Boot Automation**: `/home/centra/cloud/baremetal/first-boot/README.md`
- **Image Builder**: `/home/centra/cloud/baremetal/image-builder/README.md`
- **VM Cluster Setup**: `/home/centra/cloud/baremetal/vm-cluster/README.md`
- **NixOS Modules**: `/home/centra/cloud/nix/modules/`

## Notes

- **Bootstrap vs Join**: All 3 nodes use bootstrap mode (simultaneous start). Additional nodes would use `bootstrap: false` and join via `leader_url`.
- **PXE vs Direct**: The Foreman decision (MID: 000620) confirms Full PXE validation. S2 will build and deploy the netboot artifacts.
- **Hardware Config**: `hardware-configuration.nix` will be auto-generated by nixos-anywhere during provisioning.
- **SSH Keys**: The placeholder key in configuration.nix will be replaced by nixos-anywhere with the actual provisioning key.

## Success Criteria (T036 Acceptance)

- ✅ 3 VMs deployed with QEMU
- ✅ Virtual network configured (multicast socket)
- ⏳ PXE server operational (S2)
- ⏳ All 3 nodes provisioned via nixos-anywhere (S5)
- ⏳ Chainfire + FlareDB Raft clusters formed (S6)
- ⏳ IAM service operational on all nodes (S6)
- ⏳ Health checks passing (S6)
- ⏳ T032 RUNBOOK validated end-to-end (S6)

---

**S4 Status:** COMPLETE (Node Configs Ready for S5)

**Next:** Awaiting S3 (TLS Certs) + S2 (PXE Server) from PeerA