T039.S6 Integration Test Plan
Owner: peerA
Prerequisites: S3-S5 complete (NixOS provisioned, services deployed, clusters formed)
Test Categories
1. Service Health Checks
Verify all 11 services respond on all 3 nodes.
# Node IPs (from T036 config)
NODES=(192.168.100.11 192.168.100.12 192.168.100.13)
# Service ports (from nix/modules/*.nix - verified 2025-12-12)
declare -A SERVICES=(
  ["chainfire"]=2379
  ["flaredb"]=2479
  ["iam"]=3000
  ["plasmavmc"]=4000
  ["lightningstor"]=8000
  ["flashdns"]=6000
  ["fiberlb"]=7000
  ["prismnet"]=5000
  ["k8shost"]=6443
  ["nightlight"]=9101
  ["creditservice"]=3010
)
# Health check each service on each node
for node in "${NODES[@]}"; do
  for svc in "${!SERVICES[@]}"; do
    grpcurl -plaintext "$node:${SERVICES[$svc]}" list || echo "FAIL: $svc on $node"
  done
done
Expected: All services respond with gRPC reflection
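For unattended runs it helps to reduce the loop to a single pass/fail signal. A minimal sketch building on the loop above (the 5-second -max-time budget is an arbitrary choice):
# CI wrapper: count failures and exit non-zero if any check fails
failures=0
for node in "${NODES[@]}"; do
  for svc in "${!SERVICES[@]}"; do
    if ! grpcurl -plaintext -max-time 5 "$node:${SERVICES[$svc]}" list >/dev/null 2>&1; then
      echo "FAIL: $svc on $node"
      failures=$((failures + 1))
    fi
  done
done
echo "$failures failed checks"
exit "$failures"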
2. Cluster Formation Validation
2.1 ChainFire Cluster
# Check cluster status on each node
for node in "${NODES[@]}"; do
  grpcurl -plaintext "$node:2379" chainfire.ClusterService/GetStatus
done
Expected:
- 3 nodes in cluster
- Leader elected
- All nodes healthy
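The status output can also be asserted mechanically. A sketch assuming GetStatus returns JSON with members and leader fields (adjust the jq paths to the actual chainfire.ClusterService response shape):
# Assert 3 members and an elected leader (field names are assumptions)
STATUS=$(grpcurl -plaintext "${NODES[0]}:2379" chainfire.ClusterService/GetStatus)
echo "$STATUS" | jq -e '(.members | length) == 3 and (.leader != null and .leader != "")' \
  || echo "FAIL: ChainFire cluster not fully formed"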
2.2 FlareDB Cluster
# Check FlareDB cluster health
for node in "${NODES[@]}"; do
  grpcurl -plaintext "$node:2479" flaredb.AdminService/GetClusterStatus
done
Expected:
- 3 nodes joined
- Quorum formed (2/3 minimum)
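Quorum can be checked the same way. The sketch below assumes GetClusterStatus reports a nodes array with a healthy flag (hypothetical field names; adjust to the real response):
# Assert quorum: at least 2 of 3 nodes healthy (field names are assumptions)
grpcurl -plaintext "${NODES[0]}:2479" flaredb.AdminService/GetClusterStatus \
  | jq -e '[.nodes[] | select(.healthy)] | length >= 2' \
  || echo "FAIL: FlareDB quorum not met"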
3. Cross-Component Integration (T029 Scenarios)
3.1 IAM Authentication Flow
# Create test organization
grpcurl -plaintext -d '{"name":"test-org","display_name":"Test Organization"}' \
  "${NODES[0]}:3000" iam.OrgService/CreateOrg
# Create test user
grpcurl -plaintext -d '{"org_id":"test-org","username":"testuser","password":"testpass"}' \
  "${NODES[0]}:3000" iam.UserService/CreateUser
# Authenticate and capture a token (grpcurl flags must precede the address)
TOKEN=$(grpcurl -plaintext -d '{"username":"testuser","password":"testpass"}' \
  "${NODES[0]}:3000" iam.AuthService/Authenticate | jq -r '.token')
# Validate the token
grpcurl -plaintext -d "{\"token\":\"$TOKEN\"}" "${NODES[0]}:3000" iam.AuthService/ValidateToken
Expected: Token issued and validated successfully
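A negative check guards against a validator that accepts anything. Sketch, with the caveat that depending on the API, ValidateToken may return valid:false with an OK status rather than an RPC error, so inspect the body if the call succeeds:
# Negative test: a garbage token must NOT validate
if grpcurl -plaintext -d '{"token":"invalid-token"}' \
    "${NODES[0]}:3000" iam.AuthService/ValidateToken >/dev/null 2>&1; then
  echo "WARN: ValidateToken returned OK for an invalid token; check the response body"
fi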
3.2 FlareDB Storage
# Write data
grpcurl -plaintext -d '{"key":"test-key","value":"dGVzdC12YWx1ZQ=="}' \
  "${NODES[0]}:2479" flaredb.KVService/Put
# Read from a different node (replication test)
grpcurl -plaintext -d '{"key":"test-key"}' "${NODES[1]}:2479" flaredb.KVService/Get
Expected: Data replicated across nodes
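The payload dGVzdC12YWx1ZQ== is base64 for test-value, so the round-trip can be verified end to end. Sketch assuming Get returns the payload in a value field (field name is an assumption):
# Round-trip check: read from node 2, decode, compare
GOT=$(grpcurl -plaintext -d '{"key":"test-key"}' "${NODES[1]}:2479" flaredb.KVService/Get \
  | jq -r '.value' | base64 -d)
[ "$GOT" = "test-value" ] || echo "FAIL: replicated value mismatch: $GOT"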
3.3 LightningSTOR S3 Operations
# Create bucket via S3 API
curl -X PUT "http://${NODES[0]}:9100/test-bucket"
# Upload object
curl -X PUT "http://${NODES[0]}:9100/test-bucket/test-object" -d "test content"
# Download object from a different node
curl "http://${NODES[1]}:9100/test-bucket/test-object"
Expected: Object storage working, multi-node accessible
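Content integrity across nodes can be asserted with a checksum comparison, using the same endpoints as above:
# Compare checksums of the object fetched from two different nodes
SUM0=$(curl -s "http://${NODES[0]}:9100/test-bucket/test-object" | sha256sum | cut -d' ' -f1)
SUM1=$(curl -s "http://${NODES[1]}:9100/test-bucket/test-object" | sha256sum | cut -d' ' -f1)
[ "$SUM0" = "$SUM1" ] || echo "FAIL: object content differs between nodes"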
3.4 FlashDNS Resolution
# Add DNS record
grpcurl -plaintext -d '{"zone":"test.cloud","name":"test","type":"A","value":"192.168.100.100"}' \
  "${NODES[0]}:6000" flashdns.RecordService/CreateRecord
# Query DNS from a different node
dig @"${NODES[1]}" test.test.cloud A +short
Expected: DNS record created and resolvable
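Since any node may serve queries, the record should resolve identically from all three. Sketch:
# Resolve against every node and compare answers
for node in "${NODES[@]}"; do
  ANSWER=$(dig @"$node" test.test.cloud A +short)
  [ "$ANSWER" = "192.168.100.100" ] || echo "FAIL: $node answered '$ANSWER'"
done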
4. Nightlight Metrics Collection
# Check Prometheus endpoint on each node
for node in "${NODES[@]}"; do
  curl -s "http://$node:9090/api/v1/targets" | jq '.data.activeTargets | length'
done
# Query metrics
curl -s "http://${NODES[0]}:9090/api/v1/query?query=up" | jq '.data.result'
Expected: All targets up, metrics being collected
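"All targets up" can be asserted directly from the query API rather than eyeballed. Sketch:
# Count scrape targets that are NOT up; expect zero
DOWN=$(curl -s "http://${NODES[0]}:9090/api/v1/query?query=up==0" \
  | jq '.data.result | length')
[ "$DOWN" -eq 0 ] || echo "FAIL: $DOWN Prometheus targets are down"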
5. FiberLB Load Balancing (T051 Validation)
# Create load balancer for test service
grpcurl -plaintext -d '{"name":"test-lb","org_id":"test-org"}' \
  "${NODES[0]}:7000" fiberlb.LBService/CreateLoadBalancer
# Create pool with round-robin
grpcurl -plaintext -d '{"lb_id":"...","algorithm":"ROUND_ROBIN","protocol":"TCP"}' \
  "${NODES[0]}:7000" fiberlb.PoolService/CreatePool
# Add backends
for i in 1 2 3; do
  grpcurl -plaintext -d "{\"pool_id\":\"...\",\"address\":\"192.168.100.1$i\",\"port\":8080}" \
    "${NODES[0]}:7000" fiberlb.BackendService/CreateBackend
done
# Verify distribution (requires test backend servers; see the sketch below)
for i in {1..10}; do
  curl -s http://<VIP>:80 | head -1
done | sort | uniq -c
Expected: Requests distributed across backends
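The distribution check needs something listening on 192.168.100.11-13:8080. A throwaway sketch that serves each node's hostname, assuming python3 is available on the nodes (not guaranteed on a minimal NixOS closure); run it before the curl loop above and clean up afterwards:
# Start a trivial HTTP backend on each node that returns its own hostname
for node in "${NODES[@]}"; do
  ssh "root@$node" 'mkdir -p /tmp/lbtest && hostname > /tmp/lbtest/index.html &&
    cd /tmp/lbtest && nohup python3 -m http.server 8080 >/dev/null 2>&1 &'
done
# ...run the distribution loop, then tear the backends down:
for node in "${NODES[@]}"; do ssh "root@$node" 'pkill -f "http.server 8080"'; done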
6. PrismNET Overlay Networking
# Create VPC
grpcurl -plaintext -d '{"name":"test-vpc","cidr":"10.0.0.0/16"}' \
  "${NODES[0]}:5000" prismnet.VPCService/CreateVPC
# Create subnet
grpcurl -plaintext -d '{"vpc_id":"...","name":"test-subnet","cidr":"10.0.1.0/24"}' \
  "${NODES[0]}:5000" prismnet.SubnetService/CreateSubnet
# Create port
grpcurl -plaintext -d '{"subnet_id":"...","name":"test-port"}' \
  "${NODES[0]}:5000" prismnet.PortService/CreatePort
Expected: VPC/subnet/port created successfully
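The "..." placeholders above are the IDs returned by the preceding create calls; they can be chained with jq instead of pasted by hand. Sketch assuming each create response carries the resource ID in an id field (hypothetical field name):
# Chain created-resource IDs (response field name 'id' is an assumption)
VPC_ID=$(grpcurl -plaintext -d '{"name":"test-vpc","cidr":"10.0.0.0/16"}' \
  "${NODES[0]}:5000" prismnet.VPCService/CreateVPC | jq -r '.id')
SUBNET_ID=$(grpcurl -plaintext -d "{\"vpc_id\":\"$VPC_ID\",\"name\":\"test-subnet\",\"cidr\":\"10.0.1.0/24\"}" \
  "${NODES[0]}:5000" prismnet.SubnetService/CreateSubnet | jq -r '.id')
grpcurl -plaintext -d "{\"subnet_id\":\"$SUBNET_ID\",\"name\":\"test-port\"}" \
  "${NODES[0]}:5000" prismnet.PortService/CreatePort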
7. CreditService Quota (If Implemented)
# Check wallet balance
grpcurl -plaintext -d '{"org_id":"test-org","project_id":"test-project"}' \
  "${NODES[0]}:3010" creditservice.WalletService/GetBalance
Expected: Quota system responding
8. Node Failure Resilience
# Shut down node03's cluster services
ssh "root@${NODES[2]}" "systemctl stop chainfire flaredb"
# Verify cluster still operational (quorum: 2/3)
grpcurl -plaintext "${NODES[0]}:2379" chainfire.ClusterService/GetStatus
# Write data while the node is down
grpcurl -plaintext -d '{"key":"failover-test","value":"..."}' \
  "${NODES[0]}:2479" flaredb.KVService/Put
# Read it back from a surviving node
grpcurl -plaintext -d '{"key":"failover-test"}' "${NODES[1]}:2479" flaredb.KVService/Get
# Restart node03
ssh "root@${NODES[2]}" "systemctl start chainfire flaredb"
# Verify rejoin (a polling alternative to the fixed sleep follows below)
sleep 30
grpcurl -plaintext "${NODES[2]}:2379" chainfire.ClusterService/GetStatus
Expected: Cluster survives single node failure, node rejoins
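A fixed sleep 30 can flake if rejoin is slow; polling is more robust. Sketch (polls for up to two minutes):
# Poll for node03 to start answering again instead of sleeping blindly
for _ in $(seq 1 24); do
  if grpcurl -plaintext "${NODES[2]}:2379" chainfire.ClusterService/GetStatus >/dev/null 2>&1; then
    echo "node03 rejoined"; break
  fi
  sleep 5
done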
Test Execution Order
1. Service Health (basic connectivity)
2. Cluster Formation (Raft quorum)
3. IAM Auth (foundation for other tests)
4. FlareDB Storage (data layer)
5. Nightlight Metrics (observability)
6. LightningSTOR S3 (object storage)
7. FlashDNS (name resolution)
8. FiberLB (load balancing)
9. PrismNET (networking)
10. CreditService (quota)
11. Node Failure (resilience)
Success Criteria
- All services respond on all nodes
- ChainFire cluster: 3 nodes, leader elected
- FlareDB cluster: quorum formed, replication working
- IAM: auth tokens issued/validated
- Data: read/write across nodes
- Metrics: targets up, queries working
- LB: traffic distributed
- Failover: survives 1 node loss
Failure Handling
If tests fail:
- Capture service logs: journalctl -u <service> --no-pager (a bulk-capture sketch follows this list)
- Document the failure in the evidence section
- Create a follow-up task if the issue is systemic
- Do not proceed to production traffic
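The bulk log capture referenced above, as a sketch; it assumes the systemd unit names match the SERVICES keys and writes one file per node/service into ./evidence:
# Collect the last hour of logs from every service on every node
mkdir -p evidence
for node in "${NODES[@]}"; do
  for svc in "${!SERVICES[@]}"; do
    ssh "root@$node" "journalctl -u $svc --no-pager --since '1 hour ago'" \
      > "evidence/${node}_${svc}.log" 2>&1
  done
done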