photoncloud-monorepo/docs/por/T039-production-deployment/S6-integration-test-plan.md

T039.S6 Integration Test Plan

Owner: peerA
Prerequisites: S3-S5 complete (NixOS provisioned, services deployed, clusters formed)

Test Categories

1. Service Health Checks

Verify all 11 services respond on all 3 nodes.

# Node IPs (from T036 config)
NODES=(192.168.100.11 192.168.100.12 192.168.100.13)

# Service ports (from nix/modules/*.nix - verified 2025-12-12)
declare -A SERVICES=(
  ["chainfire"]=2379
  ["flaredb"]=2479
  ["iam"]=3000
  ["plasmavmc"]=4000
  ["lightningstor"]=8000
  ["flashdns"]=6000
  ["fiberlb"]=7000
  ["prismnet"]=5000
  ["k8shost"]=6443
  ["nightlight"]=9101
  ["creditservice"]=3010
)

# Health check each service on each node
for node in "${NODES[@]}"; do
  for svc in "${!SERVICES[@]}"; do
    grpcurl -plaintext $node:${SERVICES[$svc]} list || echo "FAIL: $svc on $node"
  done
done

Expected: All services respond with gRPC reflection
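To turn the loop's output into a single pass/fail result, it can be piped through a small counter. This is a sketch; `summarize` is a hypothetical helper, not part of the repo:

```shell
# Aggregate the health-check loop's output: count FAIL lines and
# return non-zero if any service failed. Hypothetical helper.
summarize() {
  fails=$(grep -c '^FAIL' || true)
  echo "failures: $fails"
  [ "$fails" -eq 0 ]
}
# Usage:
#   { for node in "${NODES[@]}"; do ...; done; } | summarize
```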

2. Cluster Formation Validation

2.1 ChainFire Cluster

# Check cluster status on each node
for node in "${NODES[@]}"; do
  grpcurl -plaintext $node:2379 chainfire.ClusterService/GetStatus
done

Expected:

  • 3 nodes in cluster
  • Leader elected
  • All nodes healthy

2.2 FlareDB Cluster

# Check FlareDB cluster health
for node in "${NODES[@]}"; do
  grpcurl -plaintext $node:2479 flaredb.AdminService/GetClusterStatus
done

Expected:

  • 3 nodes joined
  • Quorum formed (2/3 minimum)
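The "2/3 minimum" follows from the usual Raft-style majority rule: an n-node cluster needs floor(n/2) + 1 votes, so 3 nodes tolerate one loss. A quick sanity check:

```shell
# Raft-style majority quorum for an n-node cluster: floor(n/2) + 1.
quorum() { echo $(( $1 / 2 + 1 )); }
quorum 3   # -> 2: a 3-node cluster stays writable with one node down
quorum 5   # -> 3
```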

3. Cross-Component Integration (T029 Scenarios)

3.1 IAM Authentication Flow

# Create test organization
grpcurl -plaintext -d '{"name":"test-org","display_name":"Test Organization"}' \
  "${NODES[0]}:3000" iam.OrgService/CreateOrg

# Create test user
grpcurl -plaintext -d '{"org_id":"test-org","username":"testuser","password":"testpass"}' \
  "${NODES[0]}:3000" iam.UserService/CreateUser

# Authenticate and get token
TOKEN=$(grpcurl -plaintext -d '{"username":"testuser","password":"testpass"}' \
  "${NODES[0]}:3000" iam.AuthService/Authenticate | jq -r '.token')

# Validate token
grpcurl -plaintext -d "{\"token\":\"$TOKEN\"}" \
  "${NODES[0]}:3000" iam.AuthService/ValidateToken

Expected: Token issued and validated successfully
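Since the later steps reuse `$TOKEN`, it is worth failing fast if authentication silently returned nothing. A minimal guard (hypothetical helper, not part of the repo):

```shell
# Hypothetical guard: fail fast when an earlier step yielded no value.
require() {  # usage: require "$VALUE" "description"
  if [ -z "$1" ]; then
    echo "FAIL: empty $2" >&2
    return 1
  fi
}
# require "$TOKEN" "IAM token" || exit 1
```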

3.2 FlareDB Storage

# Write data
grpcurl -plaintext -d '{"key":"test-key","value":"dGVzdC12YWx1ZQ=="}' \
  "${NODES[0]}:2479" flaredb.KVService/Put

# Read from a different node (replication test)
grpcurl -plaintext -d '{"key":"test-key"}' \
  "${NODES[1]}:2479" flaredb.KVService/Get

Expected: Data replicated across nodes
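KV values travel base64-encoded in the JSON payload; the `dGVzdC12YWx1ZQ==` above is the bytes `test-value`. The Get response should carry the same base64 string back:

```shell
# Round-trip the test payload through base64 to confirm what the
# Get response's "value" field should contain.
enc=$(printf 'test-value' | base64)
echo "$enc"                      # dGVzdC12YWx1ZQ== (the value used above)
printf '%s' "$enc" | base64 -d   # test-value
```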

3.3 LightningSTOR S3 Operations

# Create bucket via S3 API
curl -X PUT "http://${NODES[0]}:9100/test-bucket"

# Upload object
curl -X PUT "http://${NODES[0]}:9100/test-bucket/test-object" \
  -d "test content"

# Download object from a different node
curl "http://${NODES[1]}:9100/test-bucket/test-object"

Expected: Object storage working, multi-node accessible
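To verify the cross-node read byte-for-byte rather than by eyeball, compare digests of the uploaded body and the served object (a sketch; the URL follows the curl commands above):

```shell
# Record the digest of the uploaded body; the bytes served by the
# other node should hash to the same value.
expected=$(printf 'test content' | sha256sum | cut -d' ' -f1)
echo "expected sha256: $expected"
# Then compare against:
#   curl -s "http://${NODES[1]}:9100/test-bucket/test-object" | sha256sum
```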

3.4 FlashDNS Resolution

# Add DNS record
grpcurl -plaintext -d '{"zone":"test.cloud","name":"test","type":"A","value":"192.168.100.100"}' \
  "${NODES[0]}:6000" flashdns.RecordService/CreateRecord

# Query DNS from a different node
dig @"${NODES[1]}" test.test.cloud A +short

Expected: DNS record created and resolvable

4. Nightlight Metrics Collection

# Check Prometheus endpoint on each node
for node in "${NODES[@]}"; do
  curl -s http://$node:9090/api/v1/targets | jq '.data.activeTargets | length'
done

# Query metrics
curl -s "http://${NODES[0]}:9090/api/v1/query?query=up" | jq '.data.result'

Expected: All targets up, metrics being collected
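A crude way to assert "all targets up" without hand-reading JSON is to count `"health":"up"` entries in the targets response (sample payload shown; in practice pipe the curl output from above into `count_up`):

```shell
# Count healthy targets in a /api/v1/targets response body.
count_up() { grep -o '"health":"up"' | wc -l; }
sample='{"data":{"activeTargets":[{"health":"up"},{"health":"up"},{"health":"down"}]}}'
printf '%s' "$sample" | count_up   # -> 2
```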

5. FiberLB Load Balancing (T051 Validation)

# Create load balancer for test service
grpcurl -plaintext -d '{"name":"test-lb","org_id":"test-org"}' \
  "${NODES[0]}:7000" fiberlb.LBService/CreateLoadBalancer

# Create pool with round-robin
grpcurl -plaintext -d '{"lb_id":"...","algorithm":"ROUND_ROBIN","protocol":"TCP"}' \
  "${NODES[0]}:7000" fiberlb.PoolService/CreatePool

# Add backends
for i in 1 2 3; do
  grpcurl -plaintext -d "{\"pool_id\":\"...\",\"address\":\"192.168.100.1$i\",\"port\":8080}" \
    "${NODES[0]}:7000" fiberlb.BackendService/CreateBackend
done

# Verify distribution (requires test backend servers)
for i in {1..10}; do
  curl -s http://<VIP>:80 | head -1
done | sort | uniq -c

Expected: Requests distributed across backends
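The `sort | uniq -c` tally should show roughly even hit counts per backend. That check can be automated; the tolerance of 2 below is an arbitrary illustration, not a spec value:

```shell
# Given one backend identifier per line on stdin, pass if the spread
# between the most- and least-hit backend is small (<= 2 here).
even_spread() {
  sort | uniq -c | awk '
    { if (min == "" || $1 < min) min = $1; if ($1 > max) max = $1 }
    END { exit (max - min <= 2) ? 0 : 1 }'
}
printf 'b1\nb2\nb3\nb1\nb2\nb3\n' | even_spread && echo "PASS: even distribution"
```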

6. PrismNET Overlay Networking

# Create VPC
grpcurl -plaintext -d '{"name":"test-vpc","cidr":"10.0.0.0/16"}' \
  "${NODES[0]}:5000" prismnet.VPCService/CreateVPC

# Create subnet
grpcurl -plaintext -d '{"vpc_id":"...","name":"test-subnet","cidr":"10.0.1.0/24"}' \
  "${NODES[0]}:5000" prismnet.SubnetService/CreateSubnet

# Create port
grpcurl -plaintext -d '{"subnet_id":"...","name":"test-port"}' \
  "${NODES[0]}:5000" prismnet.PortService/CreatePort

Expected: VPC/subnet/port created successfully
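The `"..."` placeholders above are the IDs returned by the preceding create calls. Assuming each response carries an `id` field (an assumption about the PrismNET API, not verified here), the calls can be chained mechanically:

```shell
# Hypothetical chaining: capture the generated id from each create
# response and feed it into the next call ("id" field is assumed).
extract_id() { jq -r '.id // empty'; }
# VPC_ID=$(grpcurl -plaintext -d '{"name":"test-vpc","cidr":"10.0.0.0/16"}' \
#   "${NODES[0]}:5000" prismnet.VPCService/CreateVPC | extract_id)
echo '{"id":"vpc-123"}' | extract_id   # -> vpc-123
```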

7. CreditService Quota (If Implemented)

# Check wallet balance
grpcurl -plaintext -d '{"org_id":"test-org","project_id":"test-project"}' \
  "${NODES[0]}:3010" creditservice.WalletService/GetBalance

Expected: Quota system responding

8. Node Failure Resilience

# Shutdown node03
ssh "root@${NODES[2]}" "systemctl stop chainfire flaredb"

# Verify cluster still operational (quorum: 2/3)
grpcurl -plaintext "${NODES[0]}:2379" chainfire.ClusterService/GetStatus

# Write data
grpcurl -plaintext -d '{"key":"failover-test","value":"..."}' \
  "${NODES[0]}:2479" flaredb.KVService/Put

# Read data
grpcurl -plaintext -d '{"key":"failover-test"}' \
  "${NODES[1]}:2479" flaredb.KVService/Get

# Restart node03
ssh "root@${NODES[2]}" "systemctl start chainfire flaredb"

# Verify rejoin
sleep 30
grpcurl -plaintext "${NODES[2]}:2379" chainfire.ClusterService/GetStatus

Expected: Cluster survives single node failure, node rejoins
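A fixed `sleep 30` makes the rejoin check flaky if the node is slow to catch up. A polling wrapper is more robust (attempt count and delay below are arbitrary):

```shell
# Retry a command until it succeeds or attempts run out.
# usage: wait_for <attempts> <delay-seconds> <command...>
wait_for() {
  attempts=$1; delay=$2; shift 2
  i=0
  while [ "$i" -lt "$attempts" ]; do
    "$@" && return 0
    i=$((i + 1)); sleep "$delay"
  done
  return 1
}
# wait_for 10 5 grpcurl -plaintext "${NODES[2]}:2379" chainfire.ClusterService/GetStatus
```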

Test Execution Order

  1. Service Health (basic connectivity)
  2. Cluster Formation (Raft quorum)
  3. IAM Auth (foundation for other tests)
  4. FlareDB Storage (data layer)
  5. Nightlight Metrics (observability)
  6. LightningSTOR S3 (object storage)
  7. FlashDNS (name resolution)
  8. FiberLB (load balancing)
  9. PrismNET (networking)
  10. CreditService (quota)
  11. Node Failure (resilience)
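The ordering above can be enforced by a small driver that stops at the first failure, so later tests never run against a broken foundation. A sketch; the step scripts named in the comment are hypothetical:

```shell
# Run named test steps in order; abort on the first failure.
run_step() {
  name=$1; shift
  echo "== $name"
  if ! "$@"; then
    echo "ABORT: $name failed" >&2
    return 1
  fi
}
# run_step "service health" ./01-health.sh && \
# run_step "cluster formation" ./02-clusters.sh
```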

Success Criteria

  • All services respond on all nodes
  • ChainFire cluster: 3 nodes, leader elected
  • FlareDB cluster: quorum formed, replication working
  • IAM: auth tokens issued/validated
  • Data: read/write across nodes
  • Metrics: targets up, queries working
  • LB: traffic distributed
  • Failover: survives 1 node loss

Failure Handling

If tests fail:

  1. Capture service logs: journalctl -u <service> --no-pager
  2. Document failure in evidence section
  3. Create follow-up task if systemic issue
  4. Do not proceed to production traffic