photoncloud-monorepo/docs/por/T025-k8s-hosting/spec.md

K8s Hosting Specification

Overview

PlasmaCloud's K8s Hosting service provides managed Kubernetes clusters for multi-tenant container orchestration. This specification defines a k3s-based architecture that integrates deeply with existing PlasmaCloud infrastructure components: PrismNET for networking, FiberLB for load balancing, IAM for authentication/authorization, FlashDNS for service discovery, and LightningStor for persistent storage.

Purpose

Enable customers to deploy and manage containerized workloads using standard Kubernetes APIs while benefiting from PlasmaCloud's integrated infrastructure services. The system provides:

  • Standard K8s API compatibility: Use kubectl, Helm, and existing K8s tooling
  • Multi-tenant isolation: Project-based namespaces with IAM-backed RBAC
  • Deep integration: Leverage PrismNET SDN, FiberLB load balancing, LightningStor block storage
  • Production-ready: HA control plane, automated failover, comprehensive monitoring

Scope

Phase 1 (MVP, 3-4 months):

  • Core K8s APIs (Pods, Services, Deployments, ReplicaSets, Namespaces, ConfigMaps, Secrets)
  • LoadBalancer services via FiberLB
  • Persistent storage via LightningStor CSI
  • IAM authentication and RBAC
  • PrismNET CNI for pod networking
  • FlashDNS service discovery

Future Phases:

  • PlasmaVMC integration for VM-backed pods (enhanced isolation)
  • StatefulSets, DaemonSets, Jobs/CronJobs
  • Network policies with PrismNET enforcement
  • Horizontal Pod Autoscaler
  • FlareDB as k3s datastore

Architecture Decision Summary

Base Technology: k3s

  • Lightweight K8s distribution (single binary, minimal dependencies)
  • Production-proven (CNCF certified, widely deployed)
  • Flexible architecture allowing component replacement
  • Embedded SQLite (single-server) or etcd (HA cluster)
  • 3-4 month timeline achievable

Component Replacement Strategy:

  • Disable: servicelb (replaced by FiberLB), traefik (use FiberLB), flannel (replaced by PrismNET), local-path-provisioner (replaced by LightningStor CSI)
  • Keep: kube-apiserver, kube-scheduler, kube-controller-manager, kubelet, containerd
  • Add: Custom controllers for FiberLB, FlashDNS, IAM webhook, LightningStor CSI, PrismNET CNI
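
The disable list above maps onto standard k3s server flags; a sketch of the invocation (the webhook config path matches the IAM section later in this spec, and the custom controllers would be installed separately as manifests or DaemonSets):

```shell
# Disable the stock components that PlasmaCloud replaces:
#   servicelb/traefik -> FiberLB, flannel -> PrismNET CNI,
#   local-path-provisioner -> LightningStor CSI.
k3s server \
  --disable servicelb \
  --disable traefik \
  --disable local-storage \
  --flannel-backend=none \
  --disable-network-policy \
  --kube-apiserver-arg=authentication-token-webhook-config-file=/etc/k8shost/iam-webhook.yaml
```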

Architecture

Base: k3s with Selective Component Replacement

k3s Core (Keep):

  • kube-apiserver: K8s REST API server with IAM webhook authentication
  • kube-scheduler: Pod scheduling with resource awareness
  • kube-controller-manager: Core controllers (replication, endpoints, service accounts, etc.)
  • kubelet: Node agent managing pod lifecycle via containerd CRI
  • containerd: Container runtime (Phase 1), later replaceable by PlasmaVMC CRI
  • kube-proxy: Service networking (iptables/ipvs mode)

k3s Components (Disable):

  • servicelb: Default LoadBalancer implementation → Replaced by FiberLB controller
  • traefik: Ingress controller → Replaced by FiberLB L7 capabilities
  • flannel: CNI plugin → Replaced by PrismNET CNI
  • local-path-provisioner: Storage provisioner → Replaced by LightningStor CSI

PlasmaCloud Custom Components (Add):

  • PrismNET CNI Plugin: Pod networking via OVN logical switches
  • FiberLB Controller: LoadBalancer service reconciliation
  • IAM Webhook Server: Token validation and user mapping
  • FlashDNS Controller: Service DNS record synchronization
  • LightningStor CSI Driver: PersistentVolume provisioning and attachment

Component Topology

┌─────────────────────────────────────────────────────────────┐
│                     k3s Control Plane                       │
│  ┌──────────────┐  ┌────────────┐  ┌──────────────────┐   │
│  │ kube-apiserver│◄─┤ IAM Webhook├──┤ IAM Service      │   │
│  │              │  │            │  │ (Authentication) │   │
│  └──────┬───────┘  └────────────┘  └──────────────────┘   │
│         │                                                   │
│  ┌──────▼───────┐  ┌──────────────┐  ┌────────────────┐   │
│  │kube-scheduler│  │kube-controller│  │ etcd/SQLite    │   │
│  │              │  │   -manager    │  │  (Datastore)   │   │
│  └──────────────┘  └──────────────┘  └────────────────┘   │
└─────────────────────────────────────────────────────────────┘
                           │
        ┌──────────────────┼──────────────────┐
        │                  │                  │
┌───────▼───────┐  ┌───────▼───────┐  ┌──────▼──────┐
│ FiberLB       │  │ FlashDNS      │  │ LightningStor│
│ Controller    │  │ Controller    │  │ CSI Plugin   │
│ (Watch Svcs)  │  │ (Sync DNS)    │  │ (Provision)  │
└───────┬───────┘  └───────┬───────┘  └──────┬───────┘
        │                  │                  │
        ▼                  ▼                  ▼
┌──────────────┐  ┌──────────────┐  ┌────────────────┐
│ FiberLB      │  │ FlashDNS     │  │ LightningStor  │
│ gRPC API     │  │ gRPC API     │  │ gRPC API       │
└──────────────┘  └──────────────┘  └────────────────┘

┌─────────────────────────────────────────────────────────────┐
│                      k3s Worker Nodes                        │
│  ┌──────────────┐  ┌────────────┐  ┌──────────────────┐   │
│  │  kubelet     │◄─┤containerd  ├──┤ Pods (containers)│   │
│  │              │  │    CRI     │  │                  │   │
│  └──────┬───────┘  └────────────┘  └──────────────────┘   │
│         │                                                   │
│  ┌──────▼───────┐  ┌──────────────┐                        │
│  │ PrismNET CNI  │◄─┤ kube-proxy   │                        │
│  │ (Pod Network)│  │ (Service Net)│                        │
│  └──────┬───────┘  └──────────────┘                        │
│         │                                                   │
│         ▼                                                   │
│  ┌──────────────┐                                          │
│  │ PrismNET OVN  │                                          │
│  │ (ovs-vswitchd)│                                         │
│  └──────────────┘                                          │
└─────────────────────────────────────────────────────────────┘

Data Flow Examples

1. Pod Creation:

kubectl create pod → kube-apiserver (IAM auth) → scheduler → kubelet → containerd
                                                                  ↓
                                                            PrismNET CNI
                                                                  ↓
                                                          OVN logical port

2. LoadBalancer Service:

kubectl expose → kube-apiserver → Service created → FiberLB controller watches
                                                           ↓
                                                   FiberLB gRPC API
                                                           ↓
                                               External IP + L4 forwarding

3. PersistentVolume:

PVC created → kube-apiserver → CSI controller → LightningStor CSI driver
                                                         ↓
                                                 LightningStor gRPC
                                                         ↓
                                                   Volume created
                                                         ↓
                                               kubelet → CSI node plugin
                                                         ↓
                                                   Mount to pod

K8s API Subset

Phase 1: Core APIs (Essential)

Pods (v1):

  • Full CRUD operations (create, get, list, update, delete, patch)
  • Watch API for real-time updates
  • Logs streaming (kubectl logs -f)
  • Exec into containers (kubectl exec)
  • Port forwarding (kubectl port-forward)
  • Status: Phase (Pending, Running, Succeeded, Failed), conditions, container states

Services (v1):

  • ClusterIP: Internal cluster networking (default)
  • LoadBalancer: External access via FiberLB
  • Headless: StatefulSet support (clusterIP: None)
  • Service discovery via FlashDNS
  • Endpoint slices for large service backends

Deployments (apps/v1):

  • Declarative desired state (replicas, pod template)
  • Rolling updates with configurable strategy (maxSurge, maxUnavailable)
  • Rollback to previous revision
  • Pause/resume for canary deployments
  • Scaling (manual in Phase 1)

ReplicaSets (apps/v1):

  • Pod replication with label selectors
  • Owned by Deployments (rarely created directly)
  • Orphan/adopt pod ownership

Namespaces (v1):

  • Tenant isolation (one namespace per project)
  • Resource quota enforcement
  • Network policy scope (Phase 2)
  • RBAC scope

ConfigMaps (v1):

  • Non-sensitive configuration data
  • Mount as volumes or environment variables
  • Updates propagate to mounted volumes automatically; pod restarts can be triggered via a checksum annotation on the pod template

Secrets (v1):

  • Sensitive data (passwords, tokens, certificates)
  • Base64 encoded in etcd (at-rest encryption in future phase)
  • Mount as volumes or environment variables
  • Service account tokens

Nodes (v1):

  • Node registration via kubelet
  • Heartbeat and status reporting
  • Capacity and allocatable resources
  • Labels and taints for scheduling

Events (v1):

  • Audit trail of cluster activities
  • Retention policy (1 hour in-memory, longer in etcd)
  • Debugging and troubleshooting

Phase 2: Storage & Config (Required for MVP)

PersistentVolumes (v1):

  • Volume lifecycle independent of pods
  • Access modes: ReadWriteOnce, ReadOnlyMany, ReadWriteMany (as supported by LightningStor; RWX requires a shared filesystem, planned for Phase 2)
  • Reclaim policy: Retain, Delete
  • Status: Available, Bound, Released, Failed

PersistentVolumeClaims (v1):

  • User request for storage
  • Binding to PVs by storage class, capacity, access mode
  • Volume expansion (if storage class allows)

StorageClasses (storage.k8s.io/v1):

  • Dynamic provisioning via LightningStor CSI
  • Parameters: volume type (ssd, hdd), replication factor, org_id, project_id
  • Volume binding mode: Immediate or WaitForFirstConsumer

Phase 3: Advanced (Post-MVP)

StatefulSets (apps/v1):

  • Ordered pod creation/deletion
  • Stable network identities (pod-0, pod-1, ...)
  • Persistent storage per pod via volumeClaimTemplates
  • Use case: Databases, distributed systems

DaemonSets (apps/v1):

  • One pod per node (e.g., log collectors, monitoring agents)
  • Node selector and tolerations

Jobs (batch/v1):

  • Run-to-completion workloads
  • Parallelism and completions
  • Retry policy

CronJobs (batch/v1):

  • Scheduled jobs (cron syntax)
  • Concurrency policy

NetworkPolicies (networking.k8s.io/v1):

  • Ingress and egress rules
  • Label-based pod selection
  • Namespace selectors
  • Requires PrismNET CNI support for OVN ACL translation

Ingress (networking.k8s.io/v1):

  • HTTP/HTTPS routing via FiberLB L7
  • Host-based and path-based routing
  • TLS termination

Deferred APIs (Not in MVP)

  • HorizontalPodAutoscaler (autoscaling/v2): Requires metrics-server
  • VerticalPodAutoscaler: Complex, low priority
  • PodDisruptionBudget: Useful for HA, but post-MVP
  • LimitRange: Resource limits per namespace (future)
  • ResourceQuota: Supported in Phase 1, but advanced features deferred
  • CustomResourceDefinitions (CRDs): Framework exists, but no custom resources in Phase 1
  • APIService: Aggregation layer not needed initially

Integration Specifications

1. PrismNET CNI Plugin

Purpose: Provide pod networking using PrismNET's OVN-based SDN.

Interface: CNI 1.0.0 specification (https://github.com/containernetworking/cni/blob/main/SPEC.md)

Components:

  • CNI binary: /opt/cni/bin/prismnet
  • Configuration: /etc/cni/net.d/10-prismnet.conflist
  • IPAM plugin: /opt/cni/bin/prismnet-ipam (or integrated)

Responsibilities:

  • Create network interface for pod (veth pair)
  • Allocate IP address from namespace-specific subnet
  • Connect pod to OVN logical switch
  • Configure routing for pod egress
  • Enforce network policies (Phase 2)

Configuration Schema:

{
  "cniVersion": "1.0.0",
  "name": "prismnet",
  "type": "prismnet",
  "ipam": {
    "type": "prismnet-ipam",
    "subnet": "10.244.0.0/16",
    "rangeStart": "10.244.0.10",
    "rangeEnd": "10.244.255.254",
    "routes": [
      {"dst": "0.0.0.0/0"}
    ],
    "gateway": "10.244.0.1"
  },
  "ovn": {
    "northbound": "tcp:prismnet-server:6641",
    "southbound": "tcp:prismnet-server:6642",
    "encapType": "geneve"
  },
  "mtu": 1400,
  "prismnetEndpoint": "prismnet-server:5000"
}

CNI Plugin Workflow:

  1. ADD Command (pod creation):

    Input: Container ID, network namespace path, interface name
    Process:
    - Call PrismNET gRPC API: AllocateIP(namespace, pod_name)
    - Create veth pair: one end in pod netns, one in host
    - Add host veth to OVN logical switch port
    - Configure pod veth: IP address, routes, MTU
    - Return: IP config, routes, DNS settings
    
  2. DEL Command (pod deletion):

    Input: Container ID, network namespace path
    Process:
    - Call PrismNET gRPC API: ReleaseIP(namespace, pod_name)
    - Delete OVN logical switch port
    - Delete veth pair
    
  3. CHECK Command (health check):

    Verify interface exists and has expected configuration
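
The ADD command above must print a CNI result object on stdout. A minimal sketch of assembling that result in Python (field names follow the CNI 1.0.0 spec; the sandbox path and DNS server are illustrative assumptions):

```python
import json

def cni_add_result(ip_cidr: str, gateway: str, ifname: str = "eth0") -> str:
    """Build the JSON result a CNI ADD invocation prints on stdout."""
    result = {
        "cniVersion": "1.0.0",
        # sandbox path is hypothetical; the real value is the pod's netns path
        "interfaces": [{"name": ifname, "sandbox": "/var/run/netns/pod"}],
        "ips": [{"address": ip_cidr, "gateway": gateway, "interface": 0}],
        "routes": [{"dst": "0.0.0.0/0"}],
        "dns": {"nameservers": ["10.96.0.10"]},
    }
    return json.dumps(result)

print(cni_add_result("10.244.1.5/24", "10.244.0.1"))
```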
    

API Integration (PrismNET gRPC):

service NetworkService {
  rpc AllocateIP(AllocateIPRequest) returns (AllocateIPResponse);
  rpc ReleaseIP(ReleaseIPRequest) returns (ReleaseIPResponse);
  rpc CreateLogicalSwitch(CreateLogicalSwitchRequest) returns (CreateLogicalSwitchResponse);
}

message AllocateIPRequest {
  string namespace = 1;
  string pod_name = 2;
  string container_id = 3;
}

message AllocateIPResponse {
  string ip_address = 1;  // e.g., "10.244.1.5/24"
  string gateway = 2;
  repeated string dns_servers = 3;
}

OVN Topology:

  • Logical Switch per Namespace: k8s-<namespace> (e.g., k8s-project-123)
  • Logical Router: k8s-cluster-router for inter-namespace routing
  • Logical Switch Ports: One per pod (<pod-name>-<container-id>)
  • ACLs: NetworkPolicy enforcement (Phase 2)

Network Policy Translation (Phase 2):

K8s NetworkPolicy:
  podSelector: app=web
  ingress:
  - from:
    - podSelector: app=frontend
    ports:
    - protocol: TCP
      port: 80

→ OVN ACL:
  direction: to-lport
  match: "ip4.src == $frontend_pods && tcp.dst == 80"
  action: allow-related
  priority: 1000

Address Sets:

  • Dynamic updates as pods are added/removed
  • Efficient ACL matching for large pod groups
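
The NetworkPolicy-to-ACL translation above is mechanical once the source pods are materialized as an OVN address set. A sketch (the function name and address-set naming are hypothetical):

```python
def ovn_acl_match(address_set: str, protocol: str, port: int) -> str:
    """Render an OVN ACL match for 'allow ingress from <set> on <proto>/<port>'."""
    proto = protocol.lower()  # OVN match syntax uses lowercase protocol names
    return f"ip4.src == ${address_set} && {proto}.dst == {port}"

acl = {
    "direction": "to-lport",
    "match": ovn_acl_match("frontend_pods", "TCP", 80),
    "action": "allow-related",
    "priority": 1000,
}
print(acl["match"])  # ip4.src == $frontend_pods && tcp.dst == 80
```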

2. FiberLB LoadBalancer Controller

Purpose: Reconcile K8s Services of type LoadBalancer with FiberLB resources.

Architecture:

  • Controller Process: Runs as a pod in kube-system namespace or embedded in k3s server
  • Watch Resources: Services (type=LoadBalancer), Endpoints
  • Manage Resources: FiberLB LoadBalancers, Listeners, Pools, Members

Controller Logic:

1. Service Watch Loop:

for event := range serviceWatcher {
  if event.Type == Created || event.Type == Updated {
    if service.Spec.Type == "LoadBalancer" {
      reconcileLoadBalancer(service)
    }
  } else if event.Type == Deleted {
    deleteLoadBalancer(service)
  }
}

2. Reconcile Logic:

Input: Service object
Process:
1. Check if FiberLB LoadBalancer exists (by annotation or name mapping)
2. If not exists:
   a. Allocate external IP from pool
   b. Create FiberLB LoadBalancer resource (gRPC CreateLoadBalancer)
   c. Store LoadBalancer ID in service annotation
3. For each service.Spec.Ports:
   a. Create/update FiberLB Listener (protocol, port, algorithm)
4. Get service endpoints:
   a. Create/update FiberLB Pool with backend members (pod IPs, ports)
5. Update service.Status.LoadBalancer.Ingress with external IP
6. If service spec changed:
   a. Update FiberLB resources accordingly

3. Endpoint Watch Loop:

for event := range endpointWatcher {
  service := getServiceForEndpoint(event.Object)
  if service.Spec.Type == "LoadBalancer" {
    updateLoadBalancerPool(service, event.Object)
  }
}

Configuration:

  • External IP Pool: --external-ip-pool=192.168.100.0/24 (CIDR or IP range)
  • FiberLB Endpoint: --fiberlb-endpoint=fiberlb-server:7000 (gRPC address)
  • IP Allocation: First-available or integration with IPAM service
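
First-available allocation from the external pool can be sketched with the stdlib ipaddress module (names are hypothetical; a real controller would persist allocations rather than pass them in):

```python
import ipaddress

def allocate_ip(pool_cidr: str, in_use: set) -> str:
    """Return the first host address in pool_cidr not already allocated."""
    for host in ipaddress.ip_network(pool_cidr).hosts():
        if str(host) not in in_use:
            return str(host)
    # mirrors the IP-exhaustion edge case: surface an error on the service
    raise RuntimeError(f"external IP pool {pool_cidr} exhausted")

print(allocate_ip("192.168.100.0/24", {"192.168.100.1", "192.168.100.2"}))
# -> 192.168.100.3
```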

Service Annotations:

apiVersion: v1
kind: Service
metadata:
  name: web-service
  annotations:
    fiberlb.plasmacloud.io/load-balancer-id: "lb-abc123"
    fiberlb.plasmacloud.io/algorithm: "round-robin"  # round-robin | least-conn | ip-hash
    fiberlb.plasmacloud.io/health-check-path: "/health"
    fiberlb.plasmacloud.io/health-check-interval: "10s"
    fiberlb.plasmacloud.io/health-check-timeout: "5s"
    fiberlb.plasmacloud.io/health-check-retries: "3"
    fiberlb.plasmacloud.io/session-affinity: "client-ip"  # For sticky sessions
spec:
  type: LoadBalancer
  selector:
    app: web
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8080
status:
  loadBalancer:
    ingress:
    - ip: 192.168.100.50

FiberLB gRPC API Integration:

service LoadBalancerService {
  rpc CreateLoadBalancer(CreateLoadBalancerRequest) returns (LoadBalancer);
  rpc UpdateLoadBalancer(UpdateLoadBalancerRequest) returns (LoadBalancer);
  rpc DeleteLoadBalancer(DeleteLoadBalancerRequest) returns (Empty);
  rpc CreateListener(CreateListenerRequest) returns (Listener);
  rpc UpdatePool(UpdatePoolRequest) returns (Pool);
}

message CreateLoadBalancerRequest {
  string name = 1;
  string description = 2;
  string external_ip = 3;  // If empty, allocate from pool
  string org_id = 4;
  string project_id = 5;
}

message CreateListenerRequest {
  string load_balancer_id = 1;
  string protocol = 2;  // TCP, UDP, HTTP, HTTPS
  int32 port = 3;
  string default_pool_id = 4;
  HealthCheck health_check = 5;
}

message UpdatePoolRequest {
  string pool_id = 1;
  repeated PoolMember members = 2;
  string algorithm = 3;
}

message PoolMember {
  string address = 1;  // Pod IP
  int32 port = 2;
  int32 weight = 3;
}

Health Checks:

  • HTTP health checks: Use annotation health-check-path
  • TCP health checks: Connection-based for non-HTTP services
  • Health check failures remove pod from pool (auto-healing)

Edge Cases:

  • Service deletion: Controller must clean up FiberLB resources and release external IP
  • Endpoint churn: Debounce pool updates to avoid excessive FiberLB API calls
  • IP exhaustion: Return error event on service, set status condition
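
Debouncing endpoint churn can be as simple as coalescing pending updates per service and flushing on a timer; a synchronous sketch (class and method names hypothetical, timer omitted):

```python
class PoolUpdateDebouncer:
    """Coalesce per-service pool updates; only the latest member list survives."""

    def __init__(self):
        self.pending = {}  # service key -> latest member list

    def enqueue(self, service_key: str, members: list):
        self.pending[service_key] = members  # later updates overwrite earlier ones

    def flush(self, update_pool) -> int:
        """Push one UpdatePool call per dirty service; returns the call count."""
        count = 0
        for key, members in self.pending.items():
            update_pool(key, members)
            count += 1
        self.pending.clear()
        return count

d = PoolUpdateDebouncer()
d.enqueue("default/web", ["10.244.1.10:8080"])
d.enqueue("default/web", ["10.244.1.10:8080", "10.244.1.11:8080"])
calls = d.flush(lambda key, members: None)
print(calls)  # 1 -- two endpoint events, one FiberLB API call
```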

3. IAM Authentication Webhook

Purpose: Authenticate K8s API requests using PlasmaCloud IAM tokens.

Architecture:

  • Webhook Server: HTTPS endpoint (can be part of IAM service or standalone)
  • Integration Point: kube-apiserver --authentication-token-webhook-config-file
  • Protocol: K8s TokenReview API

Webhook Endpoint: POST /apis/iam.plasmacloud.io/v1/authenticate

Request Flow:

kubectl --token=<IAM-token> get pods
    ↓
kube-apiserver extracts Bearer token
    ↓
POST /apis/iam.plasmacloud.io/v1/authenticate
    body: TokenReview with token
    ↓
IAM webhook validates token
    ↓
Response: authenticated=true, user info, groups
    ↓
kube-apiserver proceeds with RBAC authorization

Request Schema (from kube-apiserver):

{
  "apiVersion": "authentication.k8s.io/v1",
  "kind": "TokenReview",
  "spec": {
    "token": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9..."
  }
}

Response Schema (from IAM webhook):

{
  "apiVersion": "authentication.k8s.io/v1",
  "kind": "TokenReview",
  "status": {
    "authenticated": true,
    "user": {
      "username": "user@example.com",
      "uid": "user-550e8400-e29b-41d4-a716-446655440000",
      "groups": [
        "org:org-123",
        "project:proj-456",
        "system:authenticated"
      ],
      "extra": {
        "org_id": ["org-123"],
        "project_id": ["proj-456"],
        "roles": ["org_admin"]
      }
    }
  }
}

Error Response (invalid token):

{
  "apiVersion": "authentication.k8s.io/v1",
  "kind": "TokenReview",
  "status": {
    "authenticated": false,
    "error": "Invalid or expired token"
  }
}

IAM Token Format:

  • JWT: Signed by IAM service with shared secret or public/private key
  • Claims: sub (user ID), email, org_id, project_id, roles, exp (expiration)
  • Example:
    {
      "sub": "user-550e8400-e29b-41d4-a716-446655440000",
      "email": "user@example.com",
      "org_id": "org-123",
      "project_id": "proj-456",
      "roles": ["org_admin", "project_member"],
      "exp": 1672531200
    }
    
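The shared-secret (HS256) path can be sketched with the Python stdlib alone; a production deployment might instead use asymmetric keys and a JWT library, so treat the secret and claim names below as assumptions taken from the example above:

```python
import base64, hashlib, hmac, json, time

SECRET = b"shared-iam-secret"  # assumption: HS256 with a shared secret

def b64url(data: bytes) -> bytes:
    return base64.urlsafe_b64encode(data).rstrip(b"=")

def b64url_decode(data: str) -> bytes:
    return base64.urlsafe_b64decode(data + "=" * (-len(data) % 4))

def sign(claims: dict) -> str:
    """Issue a compact JWT: b64url(header).b64url(payload).b64url(signature)."""
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url(json.dumps(claims).encode())
    sig = b64url(hmac.new(SECRET, header + b"." + payload, hashlib.sha256).digest())
    return (header + b"." + payload + b"." + sig).decode()

def validate(token: str):
    """Return the claims if the signature verifies and exp is in the future."""
    try:
        header, payload, sig = token.split(".")
        expected = hmac.new(SECRET, f"{header}.{payload}".encode(),
                            hashlib.sha256).digest()
        if not hmac.compare_digest(b64url_decode(sig), expected):
            return None
        claims = json.loads(b64url_decode(payload))
    except Exception:
        return None  # malformed token
    if claims.get("exp", 0) < time.time():
        return None  # expired
    return claims

token = sign({"sub": "user-1", "org_id": "org-123", "exp": time.time() + 3600})
assert validate(token) is not None
h, p, s = token.split(".")
forged = h + "." + b64url(json.dumps({"sub": "admin"}).encode()).decode() + "." + s
assert validate(forged) is None  # tampered payload rejected
```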

User/Group Mapping:

  • User (email): username user@example.com; groups org:<org_id>, project:<project_id>, system:authenticated
  • User (ID): username user-<uid>; groups org:<org_id>, project:<project_id>, system:authenticated
  • Service Account: username sa-<name>@<project_id>; groups org:<org_id>, project:<project_id>, system:serviceaccounts
  • Org Admin: username admin@example.com; groups org:<org_id>, project:<all_projects>, k8s:org-admin

RBAC Integration:

  • Groups are used in RoleBindings and ClusterRoleBindings
  • Example: org:org-123 group gets admin access to all project-* namespaces for that org

Webhook Configuration File (/etc/k8shost/iam-webhook.yaml):

apiVersion: v1
kind: Config
clusters:
- name: iam-webhook
  cluster:
    server: https://iam-server:3000/apis/iam.plasmacloud.io/v1/authenticate
    certificate-authority: /etc/k8shost/ca.crt
users:
- name: k8s-apiserver
  user:
    client-certificate: /etc/k8shost/apiserver-client.crt
    client-key: /etc/k8shost/apiserver-client.key
current-context: webhook
contexts:
- context:
    cluster: iam-webhook
    user: k8s-apiserver
  name: webhook

Performance Considerations:

  • Caching: kube-apiserver caches successful authentications (--authentication-token-webhook-cache-ttl=2m)
  • Timeouts: Webhook must respond within 10s (configurable)
  • Rate Limiting: IAM webhook should handle high request volume (100s of req/s)

4. FlashDNS Service Discovery Controller

Purpose: Synchronize K8s Services and Pods to FlashDNS for cluster DNS resolution.

Architecture:

  • Controller Process: Runs as pod in kube-system or embedded in k3s server
  • Watch Resources: Services, Endpoints, Pods
  • Manage Resources: FlashDNS A/AAAA/SRV records

DNS Hierarchy:

  • Pod A Records: <pod-ip-dashed>.pod.cluster.local → Pod IP
    • Example: 10-244-1-5.pod.cluster.local → 10.244.1.5
  • Service A Records: <service>.<namespace>.svc.cluster.local → ClusterIP or external IP
    • Example: web.default.svc.cluster.local → 10.96.0.100
  • Headless Service: <endpoint>.<service>.<namespace>.svc.cluster.local → Endpoint IPs
    • Example: web-0.web.default.svc.cluster.local → 10.244.1.10
  • SRV Records: _<port>._<protocol>.<service>.<namespace>.svc.cluster.local
    • Example: _http._tcp.web.default.svc.cluster.local → 0 50 80 web.default.svc.cluster.local
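
The naming rules above are mechanical; a sketch (helper names are hypothetical):

```python
def pod_dns_name(pod_ip: str, cluster_domain: str = "cluster.local") -> str:
    """10.244.1.5 -> 10-244-1-5.pod.cluster.local"""
    return pod_ip.replace(".", "-") + ".pod." + cluster_domain

def svc_dns_name(service: str, namespace: str,
                 cluster_domain: str = "cluster.local") -> str:
    return f"{service}.{namespace}.svc.{cluster_domain}"

def srv_dns_name(port_name: str, protocol: str,
                 service: str, namespace: str) -> str:
    return f"_{port_name}._{protocol.lower()}.{svc_dns_name(service, namespace)}"

print(pod_dns_name("10.244.1.5"))               # 10-244-1-5.pod.cluster.local
print(srv_dns_name("http", "TCP", "web", "default"))
# _http._tcp.web.default.svc.cluster.local
```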

Controller Logic:

1. Service Watch:

for event := range serviceWatcher {
  service := event.Object
  switch event.Type {
  case Created, Updated:
    if service.Spec.ClusterIP != "None" {
      // Regular service
      createOrUpdateDNSRecord(
        service.Name+"."+service.Namespace+".svc.cluster.local",  // name
        "A",                                                      // type
        service.Spec.ClusterIP,                                   // value
      )

      if len(service.Status.LoadBalancer.Ingress) > 0 {
        // LoadBalancer service - also publish the external IP
        createOrUpdateDNSRecord(
          service.Name+"."+service.Namespace+".svc.cluster.local",
          "A",
          service.Status.LoadBalancer.Ingress[0].IP,
        )
      }
    } else {
      // Headless service - add one record per endpoint
      endpoints := getEndpoints(service)
      for _, ep := range endpoints {
        createOrUpdateDNSRecord(
          ep.Hostname+"."+service.Name+"."+service.Namespace+".svc.cluster.local",
          "A",
          ep.IP,
        )
      }
    }

    // Create SRV records for each port
    for _, port := range service.Spec.Ports {
      createSRVRecord(service, port)
    }

  case Deleted:
    deleteDNSRecords(service)
  }
}

2. Pod Watch (for pod DNS):

for event := range podWatcher {
  pod := event.Object
  switch event.Type {
  case Created, Updated:
    if pod.Status.PodIP != "" {
      dashedIP := strings.ReplaceAll(pod.Status.PodIP, ".", "-")
      createOrUpdateDNSRecord(
        dashedIP+".pod.cluster.local",  // name
        "A",                            // type
        pod.Status.PodIP,               // value
      )
    }
  case Deleted:
    deleteDNSRecord(pod)
  }
}

FlashDNS gRPC API Integration:

service DNSService {
  rpc CreateRecord(CreateRecordRequest) returns (DNSRecord);
  rpc UpdateRecord(UpdateRecordRequest) returns (DNSRecord);
  rpc DeleteRecord(DeleteRecordRequest) returns (Empty);
  rpc ListRecords(ListRecordsRequest) returns (ListRecordsResponse);
}

message CreateRecordRequest {
  string zone = 1;  // "cluster.local"
  string name = 2;  // "web.default.svc"
  string type = 3;  // "A", "AAAA", "SRV", "CNAME"
  string value = 4; // "10.96.0.100"
  int32 ttl = 5;    // 30 (seconds)
  map<string, string> labels = 6;  // k8s metadata
}

message DNSRecord {
  string id = 1;
  string zone = 2;
  string name = 3;
  string type = 4;
  string value = 5;
  int32 ttl = 6;
}

Configuration:

  • FlashDNS Endpoint: --flashdns-endpoint=flashdns-server:6000
  • Cluster Domain: --cluster-domain=cluster.local (default)
  • Record TTL: --dns-ttl=30 (seconds, low for fast updates)

Example DNS Records:

# Regular service
web.default.svc.cluster.local.  30 IN A 10.96.0.100

# Headless service with 3 pods
web.default.svc.cluster.local.  30 IN A 10.244.1.10
web.default.svc.cluster.local.  30 IN A 10.244.1.11
web.default.svc.cluster.local.  30 IN A 10.244.1.12

# StatefulSet pods (Phase 3)
web-0.web.default.svc.cluster.local.  30 IN A 10.244.1.10
web-1.web.default.svc.cluster.local.  30 IN A 10.244.1.11

# SRV record for service port
_http._tcp.web.default.svc.cluster.local. 30 IN SRV 0 50 80 web.default.svc.cluster.local.

# Pod DNS
10-244-1-10.pod.cluster.local.  30 IN A 10.244.1.10

Integration with kubelet:

  • kubelet configures pod DNS via /etc/resolv.conf
  • nameserver: FlashDNS service IP (a well-known address in the service CIDR, conventionally 10.96.0.10)
  • search: <namespace>.svc.cluster.local svc.cluster.local cluster.local
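
A pod in the default namespace would therefore see a resolv.conf along these lines (the nameserver address assumes FlashDNS at the conventional cluster DNS IP; ndots:5 is the standard kubelet default):

```
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
```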

Edge Cases:

  • Service IP change: Update DNS record atomically
  • Endpoint churn: Debounce updates for headless services with many endpoints
  • DNS caching: Low TTL (30s) for fast convergence

5. LightningStor CSI Driver

Purpose: Provide dynamic PersistentVolume provisioning and lifecycle management.

CSI Driver Name: stor.plasmacloud.io

Architecture:

  • Controller Plugin: Runs as StatefulSet or Deployment in kube-system
    • Provisioning, deletion, attaching, detaching, snapshots
  • Node Plugin: Runs as DaemonSet on every node
    • Staging, publishing (mounting), unpublishing, unstaging

CSI Components:

1. Controller Service (Identity, Controller RPCs):

  • CreateVolume: Provision new volume via LightningStor
  • DeleteVolume: Delete volume
  • ControllerPublishVolume: Attach volume to node
  • ControllerUnpublishVolume: Detach volume from node
  • ValidateVolumeCapabilities: Check if volume supports requested capabilities
  • ListVolumes: List all volumes
  • GetCapacity: Query available storage capacity
  • CreateSnapshot, DeleteSnapshot: Volume snapshots (Phase 2)

2. Node Service (Node RPCs):

  • NodeStageVolume: Mount volume to global staging path on node
  • NodeUnstageVolume: Unmount from staging path
  • NodePublishVolume: Bind mount from staging to pod path
  • NodeUnpublishVolume: Unmount from pod path
  • NodeGetInfo: Return node ID and topology
  • NodeGetCapabilities: Return node capabilities

CSI Driver Workflow:

Volume Provisioning:

1. User creates PVC:
   apiVersion: v1
   kind: PersistentVolumeClaim
   metadata:
     name: my-pvc
   spec:
     accessModes: [ReadWriteOnce]
     resources:
       requests:
         storage: 10Gi
     storageClassName: lightningstor-ssd

2. CSI Controller watches PVC, calls CreateVolume:
   CreateVolumeRequest {
     name: "pvc-550e8400-e29b-41d4-a716-446655440000"
     capacity_range: { required_bytes: 10737418240 }
     volume_capabilities: [{ access_mode: SINGLE_NODE_WRITER }]
     parameters: {
       "type": "ssd",
       "replication": "3",
       "org_id": "org-123",
       "project_id": "proj-456"
     }
   }

3. CSI Controller calls LightningStor gRPC CreateVolume:
   LightningStor creates volume, returns volume_id

4. CSI Controller creates PV:
   apiVersion: v1
   kind: PersistentVolume
   metadata:
     name: pvc-550e8400-e29b-41d4-a716-446655440000
   spec:
     capacity:
       storage: 10Gi
     accessModes: [ReadWriteOnce]
     persistentVolumeReclaimPolicy: Delete
     storageClassName: lightningstor-ssd
     csi:
       driver: stor.plasmacloud.io
       volumeHandle: vol-abc123
       fsType: ext4

5. K8s binds PVC to PV
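
Mapping the PVC's storage request to the CreateVolumeRequest's required_bytes is a quantity conversion; a sketch handling the binary suffixes used in this spec (decimal suffixes like G/M are omitted):

```python
# Binary suffixes as defined for Kubernetes resource quantities.
_SUFFIXES = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}

def quantity_to_bytes(quantity: str) -> int:
    """Convert a K8s storage quantity like '10Gi' to a byte count."""
    for suffix, factor in _SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # plain byte count

print(quantity_to_bytes("10Gi"))  # 10737418240
```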

Volume Attachment (when pod is scheduled):

1. kube-controller-manager creates VolumeAttachment:
   apiVersion: storage.k8s.io/v1
   kind: VolumeAttachment
   metadata:
     name: csi-<hash>
   spec:
     attacher: stor.plasmacloud.io
     nodeName: worker-1
     source:
       persistentVolumeName: pvc-550e8400-e29b-41d4-a716-446655440000

2. CSI Controller watches VolumeAttachment, calls ControllerPublishVolume:
   ControllerPublishVolumeRequest {
     volume_id: "vol-abc123"
     node_id: "worker-1"
     volume_capability: { access_mode: SINGLE_NODE_WRITER }
   }

3. CSI Controller calls LightningStor gRPC AttachVolume:
   LightningStor attaches volume to node (e.g., iSCSI target, NBD)

4. CSI Controller updates VolumeAttachment status: attached=true

Volume Mounting (on node):

1. kubelet calls CSI Node plugin: NodeStageVolume
   NodeStageVolumeRequest {
     volume_id: "vol-abc123"
     staging_target_path: "/var/lib/kubelet/plugins/kubernetes.io/csi/stor.plasmacloud.io/<hash>/globalmount"
     volume_capability: { mount: { fs_type: "ext4" } }
   }

2. CSI Node plugin:
   - Discovers block device (e.g., /dev/nbd0) via LightningStor
   - Formats if needed: mkfs.ext4 /dev/nbd0
   - Mounts to staging path: mount /dev/nbd0 <staging_target_path>

3. kubelet calls CSI Node plugin: NodePublishVolume
   NodePublishVolumeRequest {
     volume_id: "vol-abc123"
     staging_target_path: "/var/lib/kubelet/plugins/kubernetes.io/csi/stor.plasmacloud.io/<hash>/globalmount"
     target_path: "/var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~csi/pvc-<hash>/mount"
   }

4. CSI Node plugin:
   - Bind mount staging path to target path
   - Pod can now read/write to volume

LightningStor gRPC API Integration:

service VolumeService {
  rpc CreateVolume(CreateVolumeRequest) returns (Volume);
  rpc DeleteVolume(DeleteVolumeRequest) returns (Empty);
  rpc AttachVolume(AttachVolumeRequest) returns (VolumeAttachment);
  rpc DetachVolume(DetachVolumeRequest) returns (Empty);
  rpc GetVolume(GetVolumeRequest) returns (Volume);
  rpc ListVolumes(ListVolumesRequest) returns (ListVolumesResponse);
}

message CreateVolumeRequest {
  string name = 1;
  int64 size_bytes = 2;
  string volume_type = 3;  // "ssd", "hdd"
  int32 replication_factor = 4;
  string org_id = 5;
  string project_id = 6;
}

message Volume {
  string id = 1;
  string name = 2;
  int64 size_bytes = 3;
  string status = 4;  // "available", "in-use", "error"
  string volume_type = 5;
}

message AttachVolumeRequest {
  string volume_id = 1;
  string node_id = 2;
  string attach_mode = 3;  // "read-write", "read-only"
}

message VolumeAttachment {
  string id = 1;
  string volume_id = 2;
  string node_id = 3;
  string device_path = 4;  // e.g., "/dev/nbd0"
  string connection_info = 5;  // JSON with iSCSI target, NBD socket, etc.
}
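The connection_info field is opaque JSON that the CSI Node plugin must decode before it can discover the block device. A sketch of that decoding step (the field names protocol, device_path, nbd_socket, and iscsi_target are assumptions for illustration, not a confirmed LightningStor schema):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// connectionInfo models the JSON carried in VolumeAttachment.connection_info.
// Field names here are illustrative assumptions.
type connectionInfo struct {
	Protocol    string `json:"protocol"`              // "nbd" or "iscsi"
	DevicePath  string `json:"device_path,omitempty"` // e.g. "/dev/nbd0"
	NBDSocket   string `json:"nbd_socket,omitempty"`
	ISCSITarget string `json:"iscsi_target,omitempty"`
}

func parseConnectionInfo(raw string) (connectionInfo, error) {
	var ci connectionInfo
	err := json.Unmarshal([]byte(raw), &ci)
	return ci, err
}

func main() {
	ci, err := parseConnectionInfo(`{"protocol":"nbd","device_path":"/dev/nbd0","nbd_socket":"/run/lstor/vol-abc123.sock"}`)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s via %s\n", ci.DevicePath, ci.Protocol) // /dev/nbd0 via nbd
}
```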

StorageClass Examples:

# SSD storage with 3x replication
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: lightningstor-ssd
provisioner: stor.plasmacloud.io
parameters:
  type: "ssd"
  replication: "3"
volumeBindingMode: WaitForFirstConsumer  # Topology-aware scheduling
allowVolumeExpansion: true
reclaimPolicy: Delete

---
# HDD storage with 2x replication
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: lightningstor-hdd
provisioner: stor.plasmacloud.io
parameters:
  type: "hdd"
  replication: "2"
volumeBindingMode: Immediate
allowVolumeExpansion: true
reclaimPolicy: Retain  # Keep volume after PVC deletion

Access Modes:

  • ReadWriteOnce (RWO): Single node read-write (most common)
  • ReadOnlyMany (ROX): Multiple nodes read-only
  • ReadWriteMany (RWX): Multiple nodes read-write (requires shared filesystem like NFS, Phase 2)

Volume Expansion (if allowVolumeExpansion: true):

1. User edits PVC: spec.resources.requests.storage: 20Gi (was 10Gi)
2. CSI Controller calls ControllerExpandVolume
3. LightningStor expands volume backend
4. CSI Node plugin calls NodeExpandVolume
5. Filesystem resize: resize2fs /dev/nbd0
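A sketch of the size validation the CSI Controller could perform before step 3 (the only-grow rule matches standard CSI expansion semantics; rounding up to whole GiB is an assumed LightningStor allocation unit, not confirmed behavior):

```go
package main

import "fmt"

const gib = int64(1 << 30)

// validateExpansion checks an expansion request before calling
// ControllerExpandVolume: volumes can only grow, and the new size is
// rounded up to the next GiB boundary (assumed allocation unit).
func validateExpansion(currentBytes, requestedBytes int64) (int64, error) {
	if requestedBytes <= currentBytes {
		return 0, fmt.Errorf("shrinking volumes is not supported (%d -> %d)", currentBytes, requestedBytes)
	}
	rounded := (requestedBytes + gib - 1) / gib * gib
	return rounded, nil
}

func main() {
	// Expanding 10Gi -> 20Gi, as in the PVC edit above.
	newSize, err := validateExpansion(10*gib, 20*gib)
	if err != nil {
		panic(err)
	}
	fmt.Println(newSize / gib) // 20
}
```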

6. PlasmaVMC Integration

Phase 1 (MVP): Use containerd as default CRI

  • k3s ships with containerd embedded
  • Standard OCI container runtime
  • No changes needed for Phase 1

Phase 3 (Future): Custom CRI for VM-backed pods

Motivation:

  • Enhanced Isolation: Stronger security boundary than containers
  • Multi-Tenant Security: Prevent container escape attacks
  • Consistent Runtime: Unify VM and container workloads on PlasmaVMC

Architecture:

  • PlasmaVMC implements CRI (Container Runtime Interface)
  • Each pod runs as a lightweight VM (Firecracker microVM)
  • Pod containers run inside VM (still using containerd within VM)
  • kubelet communicates with PlasmaVMC CRI endpoint instead of containerd

CRI Interface Implementation:

RuntimeService:

  • RunPodSandbox: Create Firecracker microVM for pod
  • StopPodSandbox: Stop microVM
  • RemovePodSandbox: Delete microVM
  • PodSandboxStatus: Query microVM status
  • ListPodSandbox: List all pod microVMs
  • CreateContainer: Create container inside microVM
  • StartContainer, StopContainer, RemoveContainer: Container lifecycle
  • ExecSync, Exec: Execute commands in container
  • Attach: Attach to container stdio

ImageService:

  • PullImage: Download container image (delegate to internal containerd)
  • RemoveImage: Delete image
  • ListImages: List cached images
  • ImageStatus: Query image metadata

Implementation Strategy:

┌─────────────────────────────────────────┐
│           kubelet (k3s agent)           │
└─────────────┬───────────────────────────┘
              │ CRI gRPC
              ▼
┌─────────────────────────────────────────┐
│      PlasmaVMC CRI Server (Rust)        │
│  - RunPodSandbox → Create microVM       │
│  - CreateContainer → Run in VM          │
└─────────────┬───────────────────────────┘
              │
              ▼
┌─────────────────────────────────────────┐
│      Firecracker VMM (per pod)          │
│  ┌───────────────────────────────────┐  │
│  │  Pod VM (minimal Linux kernel)    │  │
│  │  ┌──────────────────────────────┐ │  │
│  │  │ containerd (in-VM)           │ │  │
│  │  │  - Container 1               │ │  │
│  │  │  - Container 2               │ │  │
│  │  └──────────────────────────────┘ │  │
│  └───────────────────────────────────┘  │
└─────────────────────────────────────────┘

Configuration (Phase 3):

services.k8shost = {
  enable = true;
  cri = "plasmavmc";  # Instead of "containerd"
  plasmavmc = {
    endpoint = "unix:///var/run/plasmavmc/cri.sock";
    vmKernel = "/var/lib/plasmavmc/vmlinux.bin";
    vmRootfs = "/var/lib/plasmavmc/rootfs.ext4";
  };
};

Benefits:

  • Stronger isolation for untrusted workloads
  • Leverage existing PlasmaVMC infrastructure
  • Consistent management across VM and K8s workloads

Challenges:

  • Performance overhead (microVM startup time, memory overhead)
  • Image caching complexity (need containerd inside VM)
  • Networking integration (CNI must configure VM network)

Decision: Defer to Phase 3, focus on standard containerd for MVP.

Multi-Tenant Model

Namespace Strategy

Principle: One K8s namespace per PlasmaCloud project.

Namespace Naming:

  • Project namespaces: project-<project_id> (e.g., project-550e8400-e29b-41d4-a716-446655440000)
  • Org shared namespaces (optional): org-<org_id>-shared (for shared resources like monitoring)
  • System namespaces: kube-system, kube-public, kube-node-lease, default

Namespace Lifecycle:

  • Created automatically when project provisions K8s cluster
  • Labeled with org_id, project_id for RBAC and billing
  • Deleted when project is deleted (with grace period)

Namespace Metadata:

apiVersion: v1
kind: Namespace
metadata:
  name: project-550e8400-e29b-41d4-a716-446655440000
  labels:
    plasmacloud.io/org-id: "org-123"
    plasmacloud.io/project-id: "proj-456"
    plasmacloud.io/tenant-type: "project"
  annotations:
    plasmacloud.io/project-name: "my-web-app"
    plasmacloud.io/created-by: "user@example.com"

RBAC Templates

Org Admin Role (full access to the org's project namespaces; the Role and RoleBinding below are replicated into each project namespace):

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: org-admin
  namespace: project-550e8400-e29b-41d4-a716-446655440000
rules:
- apiGroups: ["*"]
  resources: ["*"]
  verbs: ["*"]

---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: org-admin-binding
  namespace: project-550e8400-e29b-41d4-a716-446655440000
subjects:
- kind: Group
  name: org:org-123
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: org-admin
  apiGroup: rbac.authorization.k8s.io

Project Admin Role (full access to specific project namespace):

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: project-admin
  namespace: project-550e8400-e29b-41d4-a716-446655440000
rules:
- apiGroups: ["", "apps", "batch", "networking.k8s.io", "storage.k8s.io"]
  resources: ["*"]
  verbs: ["*"]

---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: project-admin-binding
  namespace: project-550e8400-e29b-41d4-a716-446655440000
subjects:
- kind: Group
  name: project:proj-456
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: project-admin
  apiGroup: rbac.authorization.k8s.io

Project Viewer Role (read-only access):

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: project-viewer
  namespace: project-550e8400-e29b-41d4-a716-446655440000
rules:
- apiGroups: ["", "apps", "batch", "networking.k8s.io"]
  resources: ["pods", "services", "deployments", "replicasets", "configmaps", "secrets"]
  verbs: ["get", "list", "watch"]
- apiGroups: [""]
  resources: ["pods/log"]
  verbs: ["get", "list"]

---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: project-viewer-binding
  namespace: project-550e8400-e29b-41d4-a716-446655440000
subjects:
- kind: Group
  name: project:proj-456:viewer
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: project-viewer
  apiGroup: rbac.authorization.k8s.io

ClusterRole for Node Access (for cluster admins):

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: plasmacloud-cluster-admin
rules:
- apiGroups: [""]
  resources: ["nodes", "persistentvolumes"]
  verbs: ["*"]
- apiGroups: ["storage.k8s.io"]
  resources: ["storageclasses"]
  verbs: ["*"]

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: plasmacloud-cluster-admin-binding
subjects:
- kind: Group
  name: system:plasmacloud-admins
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: plasmacloud-cluster-admin
  apiGroup: rbac.authorization.k8s.io

Network Isolation

Default NetworkPolicy (deny all, except DNS):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: project-550e8400-e29b-41d4-a716-446655440000
spec:
  podSelector: {}  # Apply to all pods
  policyTypes:
  - Ingress
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
    ports:
    - protocol: UDP
      port: 53  # DNS

Allow Ingress from LoadBalancer:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-loadbalancer
  namespace: project-550e8400-e29b-41d4-a716-446655440000
spec:
  podSelector:
    matchLabels:
      app: web
  policyTypes:
  - Ingress
  ingress:
  - from:
    - ipBlock:
        cidr: 0.0.0.0/0  # Allow from anywhere (LoadBalancer external traffic)
    ports:
    - protocol: TCP
      port: 8080

Allow Inter-Namespace Communication (optional, for org-shared services):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-org-shared
  namespace: project-550e8400-e29b-41d4-a716-446655440000
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          plasmacloud.io/org-id: "org-123"
          plasmacloud.io/tenant-type: "org-shared"

PrismNET Enforcement:

  • NetworkPolicies are translated to OVN ACLs by PrismNET CNI controller
  • Enforced at OVN logical switch level (low-level packet filtering)
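As an illustration of the translation, a single ingress rule (protocol, port, and a source address set of selected pods) could render into an OVN ACL match expression roughly like this; the address-set naming and exact match shape are assumptions about the PrismNET controller, not confirmed behavior:

```go
package main

import "fmt"

// aclMatch renders an ingress rule as an OVN-style ACL match expression.
// addressSet would be populated with the pod IPs selected by the policy's
// label selector (an assumed PrismNET convention).
func aclMatch(addressSet, protocol string, port int) string {
	return fmt.Sprintf("ip4.src == $%s && %s && %s.dst == %d",
		addressSet, protocol, protocol, port)
}

func main() {
	// Allow TCP/8080 from pods in the "web_clients" address set.
	fmt.Println(aclMatch("web_clients", "tcp", 8080))
	// ip4.src == $web_clients && tcp && tcp.dst == 8080
}
```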

Resource Quotas

CPU and Memory Quotas:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: project-compute-quota
  namespace: project-550e8400-e29b-41d4-a716-446655440000
spec:
  hard:
    requests.cpu: "10"       # 10 CPU cores
    requests.memory: "20Gi"  # 20 GB RAM
    limits.cpu: "20"         # Allow bursting to 20 cores
    limits.memory: "40Gi"    # Allow bursting to 40 GB RAM

Storage Quotas:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: project-storage-quota
  namespace: project-550e8400-e29b-41d4-a716-446655440000
spec:
  hard:
    persistentvolumeclaims: "10"  # Max 10 PVCs
    requests.storage: "100Gi"     # Total storage requests

Object Count Quotas:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: project-object-quota
  namespace: project-550e8400-e29b-41d4-a716-446655440000
spec:
  hard:
    pods: "50"
    services: "20"
    services.loadbalancers: "5"   # Max 5 LoadBalancer services (limit external IPs)
    configmaps: "50"
    secrets: "50"

Quota Enforcement:

  • K8s admission controller rejects resource creation exceeding quota
  • User receives clear error message
  • Quota usage visible in kubectl describe quota
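The admission check reduces to simple arithmetic over the quota's hard limit and current usage; a minimal sketch (values in millicores):

```go
package main

import "fmt"

// checkQuota sketches the admission-time decision: reject a request when
// used + requested would exceed the namespace's hard limit.
func checkQuota(hard, used, requested int64) error {
	if used+requested > hard {
		return fmt.Errorf("exceeded quota: requested %d, used %d of %d", requested, used, hard)
	}
	return nil
}

func main() {
	// 10-core hard limit (10000m), 8 cores used, 3 more requested: denied.
	if err := checkQuota(10000, 8000, 3000); err != nil {
		fmt.Println("denied:", err)
	}
}
```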

Deployment Model

Single-Server (Development/Small)

Target Use Case:

  • Development and testing environments
  • Small production workloads (<10 nodes)
  • Cost-sensitive deployments

Architecture:

  • Single k3s server node with embedded SQLite datastore
  • Control plane and worker colocated
  • No HA guarantees

k3s Server Command:

k3s server \
  --data-dir=/var/lib/k8shost \
  --disable=servicelb,traefik,flannel \
  --flannel-backend=none \
  --disable-network-policy \
  --cluster-domain=cluster.local \
  --service-cidr=10.96.0.0/12 \
  --cluster-cidr=10.244.0.0/16 \
  --authentication-token-webhook-config-file=/etc/k8shost/iam-webhook.yaml \
  --bind-address=0.0.0.0 \
  --advertise-address=192.168.1.100 \
  --tls-san=k8s-api.example.com

NixOS Configuration:

{ config, lib, pkgs, ... }:

{
  services.k8shost = {
    enable = true;
    mode = "server";
    datastore = "sqlite";  # Embedded SQLite
    disableComponents = ["servicelb" "traefik" "flannel"];

    networking = {
      serviceCIDR = "10.96.0.0/12";
      clusterCIDR = "10.244.0.0/16";
      clusterDomain = "cluster.local";
    };

    prismnet = {
      enable = true;
      endpoint = "prismnet-server:5000";
      ovnNorthbound = "tcp:prismnet-server:6641";
      ovnSouthbound = "tcp:prismnet-server:6642";
    };

    fiberlb = {
      enable = true;
      endpoint = "fiberlb-server:7000";
      externalIpPool = "192.168.100.0/24";
    };

    iam = {
      enable = true;
      webhookEndpoint = "https://iam-server:3000/apis/iam.plasmacloud.io/v1/authenticate";
      caCertFile = "/etc/k8shost/ca.crt";
      clientCertFile = "/etc/k8shost/client.crt";
      clientKeyFile = "/etc/k8shost/client.key";
    };

    flashdns = {
      enable = true;
      endpoint = "flashdns-server:6000";
      clusterDomain = "cluster.local";
      recordTTL = 30;
    };

    lightningstor = {
      enable = true;
      endpoint = "lightningstor-server:8000";
      csiNodeDaemonSet = true;  # Deploy CSI node plugin as DaemonSet
    };
  };

  # Open firewall for K8s API
  networking.firewall.allowedTCPPorts = [ 6443 ];
}

Limitations:

  • No HA (single point of failure)
  • SQLite has limited concurrency
  • Control plane downtime affects entire cluster

HA Cluster (Production)

Target Use Case:

  • Production workloads requiring high availability
  • Large clusters (>10 nodes)
  • Mission-critical applications

Architecture:

  • 3 or 5 k3s server nodes (odd number for quorum)
  • Embedded etcd (Raft consensus, HA datastore)
  • Load balancer in front of API servers
  • Agent nodes for workload scheduling

k3s Server Command (each server node):

# The first server bootstraps the cluster with --cluster-init;
# additional servers join with --server https://k8s-api-lb.internal:6443 instead.
k3s server \
  --data-dir=/var/lib/k8shost \
  --disable=servicelb,traefik,flannel \
  --flannel-backend=none \
  --disable-network-policy \
  --cluster-domain=cluster.local \
  --service-cidr=10.96.0.0/12 \
  --cluster-cidr=10.244.0.0/16 \
  --authentication-token-webhook-config-file=/etc/k8shost/iam-webhook.yaml \
  --cluster-init \
  --tls-san=k8s-api-lb.example.com \
  --tls-san=k8s-api.example.com

k3s Agent Command (worker nodes):

k3s agent \
  --server https://k8s-api-lb.internal:6443 \
  --token <join-token>

NixOS Configuration (Server Node):

{ config, lib, pkgs, ... }:

{
  services.k8shost = {
    enable = true;
    mode = "server";
    datastore = "etcd";  # Embedded etcd for HA
    clusterInit = true;  # Set to false for joining servers
    serverUrl = "https://k8s-api-lb.internal:6443";  # For joining servers

    # ... same integrations as single-server ...
  };

  # High availability settings
  systemd.services.k8shost = {
    serviceConfig = {
      Restart = "always";
      RestartSec = "10s";
    };
  };
}

Load Balancer Configuration (FiberLB):

# External LoadBalancer for K8s API access (FiberLB resource definition, not a Kubernetes object)
apiVersion: v1
kind: LoadBalancer
metadata:
  name: k8s-api-lb
spec:
  listeners:
  - protocol: TCP
    port: 6443
    backend_pool: k8s-api-servers
  pools:
  - name: k8s-api-servers
    algorithm: round-robin
    members:
    - address: 192.168.1.101  # server-1
      port: 6443
    - address: 192.168.1.102  # server-2
      port: 6443
    - address: 192.168.1.103  # server-3
      port: 6443
    health_check:
      type: tcp
      interval: 10s
      timeout: 5s
      retries: 3
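A sketch of the round-robin selection the pool above implies, skipping members that failed the TCP health check (illustrative only, not FiberLB's actual implementation):

```go
package main

import "fmt"

type member struct {
	addr    string
	healthy bool
}

// pool cycles through members round-robin, skipping unhealthy ones.
type pool struct {
	members []member
	next    int
}

// pick returns the next healthy member, or false if none are healthy.
func (p *pool) pick() (string, bool) {
	for i := 0; i < len(p.members); i++ {
		m := p.members[p.next]
		p.next = (p.next + 1) % len(p.members)
		if m.healthy {
			return m.addr, true
		}
	}
	return "", false
}

func main() {
	p := &pool{members: []member{
		{"192.168.1.101:6443", true},
		{"192.168.1.102:6443", false}, // failed health check
		{"192.168.1.103:6443", true},
	}}
	for i := 0; i < 3; i++ {
		addr, _ := p.pick()
		fmt.Println(addr)
	}
}
```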

Datastore Options:

Option 1: Embedded etcd (recommended for MVP)

Pros:

  • Built-in to k3s, no external dependencies
  • Proven, battle-tested (CNCF etcd project)
  • Automatic HA with Raft consensus
  • Easy setup (just --cluster-init)

Cons:

  • Another distributed datastore (in addition to Chainfire/FlareDB)
  • etcd-specific operations (backup, restore, defragmentation)

Option 2: FlareDB as External Datastore

Pros:

  • Unified storage layer for PlasmaCloud
  • Leverage existing FlareDB deployment
  • Simplified infrastructure (one less system to manage)

Cons:

  • k3s requires etcd API compatibility
  • FlareDB would need to implement etcd v3 API (significant effort)
  • Untested for K8s workloads

Recommendation for MVP: Use embedded etcd for HA mode. Investigate FlareDB etcd compatibility in Phase 2 or 3.

Backup and Disaster Recovery:

# etcd snapshot (on any server node)
k3s etcd-snapshot save --name backup-$(date +%Y%m%d-%H%M%S)

# List snapshots
k3s etcd-snapshot ls

# Restore from snapshot
k3s server --cluster-reset --cluster-reset-restore-path=/var/lib/k8shost/server/db/snapshots/backup-20250101-120000

NixOS Module Integration

Module Structure:

nix/modules/
├── k8shost.nix              # Main module
├── k8shost/
│   ├── controller.nix       # FiberLB, FlashDNS controllers
│   ├── csi.nix              # LightningStor CSI driver
│   └── cni.nix              # PrismNET CNI plugin

Main Module (nix/modules/k8shost.nix):

{ config, lib, pkgs, ... }:

with lib;

let
  cfg = config.services.k8shost;
in
{
  options.services.k8shost = {
    enable = mkEnableOption "PlasmaCloud K8s Hosting Service";

    mode = mkOption {
      type = types.enum ["server" "agent"];
      default = "server";
      description = "Run as server (control plane) or agent (worker)";
    };

    datastore = mkOption {
      type = types.enum ["sqlite" "etcd"];
      default = "sqlite";
      description = "Datastore backend (sqlite for single-server, etcd for HA)";
    };

    disableComponents = mkOption {
      type = types.listOf types.str;
      default = ["servicelb" "traefik" "flannel"];
      description = "k3s components to disable";
    };

    networking = {
      serviceCIDR = mkOption {
        type = types.str;
        default = "10.96.0.0/12";
        description = "CIDR for service ClusterIPs";
      };

      clusterCIDR = mkOption {
        type = types.str;
        default = "10.244.0.0/16";
        description = "CIDR for pod IPs";
      };

      clusterDomain = mkOption {
        type = types.str;
        default = "cluster.local";
        description = "Cluster DNS domain";
      };
    };

    # Integration options (prismnet, fiberlb, iam, flashdns, lightningstor)
    # ...
  };

  config = mkIf cfg.enable {
    # Install k3s package
    environment.systemPackages = [ pkgs.k3s ];

    # Create systemd service
    systemd.services.k8shost = {
      description = "PlasmaCloud K8s Hosting Service (k3s)";
      after = [ "network.target" "iam.service" "prismnet.service" ];
      requires = [ "iam.service" "prismnet.service" ];
      wantedBy = [ "multi-user.target" ];

      serviceConfig = {
        Type = "notify";
        ExecStart = "${pkgs.k3s}/bin/k3s ${cfg.mode} ${concatStringsSep " " (buildServerArgs cfg)}";
        KillMode = "process";
        Delegate = "yes";
        LimitNOFILE = 1048576;
        LimitNPROC = "infinity";
        LimitCORE = "infinity";
        TasksMax = "infinity";
        Restart = "always";
        RestartSec = "5s";
      };
    };

    # Create configuration files
    environment.etc."k8shost/iam-webhook.yaml" = {
      text = generateIAMWebhookConfig cfg.iam;
      mode = "0600";
    };

    # Deploy controllers (FiberLB, FlashDNS, etc.)
    # ... (as separate systemd services or in-cluster deployments)
  };
}

API Server Configuration

k3s Server Flags (Complete)

The listing below is an annotated reference: strip the interleaved comments and blank lines before running it as a single shell command, and use either --cluster-init (first server) or --server plus --token (joining servers), not both.

k3s server \
  # Data and cluster configuration
  --data-dir=/var/lib/k8shost \
  --cluster-init \  # For first server in HA cluster
  --server https://k8s-api-lb.internal:6443 \  # Join existing HA cluster
  --token <cluster-token> \  # Secure join token

  # Disable default components
  --disable=servicelb,traefik,flannel,local-storage \
  --flannel-backend=none \
  --disable-network-policy \

  # Network configuration
  --cluster-domain=cluster.local \
  --service-cidr=10.96.0.0/12 \
  --cluster-cidr=10.244.0.0/16 \
  --service-node-port-range=30000-32767 \

  # API server configuration
  --bind-address=0.0.0.0 \
  --advertise-address=192.168.1.100 \
  --tls-san=k8s-api.example.com \
  --tls-san=k8s-api-lb.example.com \

  # Authentication
  --authentication-token-webhook-config-file=/etc/k8shost/iam-webhook.yaml \
  --authentication-token-webhook-cache-ttl=2m \

  # Authorization (RBAC enabled by default)
  # --authorization-mode=Node,RBAC \  # Default, no need to specify

  # Audit logging
  --kube-apiserver-arg=audit-log-path=/var/log/k8shost/audit.log \
  --kube-apiserver-arg=audit-log-maxage=30 \
  --kube-apiserver-arg=audit-log-maxbackup=10 \
  --kube-apiserver-arg=audit-log-maxsize=100 \

  # Feature gates (if needed)
  # --kube-apiserver-arg=feature-gates=SomeFeature=true

Authentication Webhook Configuration

File: /etc/k8shost/iam-webhook.yaml

apiVersion: v1
kind: Config
clusters:
- name: iam-webhook
  cluster:
    server: https://iam-server:3000/apis/iam.plasmacloud.io/v1/authenticate
    certificate-authority: /etc/k8shost/ca.crt
users:
- name: k8s-apiserver
  user:
    client-certificate: /etc/k8shost/apiserver-client.crt
    client-key: /etc/k8shost/apiserver-client.key
current-context: webhook
contexts:
- context:
    cluster: iam-webhook
    user: k8s-apiserver
  name: webhook

Certificate Management:

  • CA certificate: Issued by PlasmaCloud IAM PKI
  • Client certificate: For kube-apiserver to authenticate to IAM webhook
  • Rotation: Certificates expire after 1 year, auto-renewed by IAM

Security

TLS/mTLS

Component Communication:

| Source              | Destination        | Protocol     | Auth Method        |
|---------------------|--------------------|--------------|--------------------|
| kube-apiserver      | IAM webhook        | HTTPS + mTLS | Client cert        |
| FiberLB controller  | FiberLB gRPC       | gRPC + TLS   | IAM token          |
| FlashDNS controller | FlashDNS gRPC      | gRPC + TLS   | IAM token          |
| LightningStor CSI   | LightningStor gRPC | gRPC + TLS   | IAM token          |
| PrismNET CNI        | PrismNET gRPC      | gRPC + TLS   | IAM token          |
| kubectl             | kube-apiserver     | HTTPS        | IAM token (Bearer) |

Certificate Issuance:

  • All certificates issued by IAM service (centralized PKI)
  • Automatic renewal before expiration
  • Certificate revocation via IAM CRL

Pod Security

Pod Security Standards (PSS):

  • Baseline Profile: Enforced on all namespaces by default
    • Deny privileged containers
    • Deny host network/PID/IPC
    • Deny hostPath volumes
    • Deny privilege escalation
  • Restricted Profile: Optional, for highly sensitive workloads

Example PodSecurityPolicy (deprecated since K8s 1.21 and removed in 1.25; shown for legacy reference only, use Pod Security Standards instead):

apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: restricted
spec:
  privileged: false
  allowPrivilegeEscalation: false
  requiredDropCapabilities:
  - ALL
  volumes:
  - configMap
  - emptyDir
  - projected
  - secret
  - downwardAPI
  - persistentVolumeClaim
  runAsUser:
    rule: MustRunAsNonRoot
  seLinux:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny

Security Contexts (enforced):

apiVersion: v1
kind: Pod
metadata:
  name: secure-pod
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 2000
  containers:
  - name: app
    image: myapp:latest
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop:
        - ALL

Service Account Permissions:

  • Minimal RBAC permissions by default
  • Principle of least privilege
  • No cluster-admin access for user workloads

Testing Strategy

Unit Tests

Controllers (Go):

// fiberlb_controller_test.go
func TestReconcileLoadBalancer(t *testing.T) {
    // Mock K8s client
    client := fake.NewSimpleClientset()

    // Mock FiberLB gRPC client
    mockFiberLB := &mockFiberLBClient{}

    controller := NewFiberLBController(client, mockFiberLB)

    // Create test service
    svc := &corev1.Service{
        ObjectMeta: metav1.ObjectMeta{Name: "test-svc", Namespace: "default"},
        Spec: corev1.ServiceSpec{Type: corev1.ServiceTypeLoadBalancer},
    }

    // Reconcile
    err := controller.Reconcile(svc)
    assert.NoError(t, err)

    // Verify FiberLB API called
    assert.Equal(t, 1, mockFiberLB.createLoadBalancerCalls)
}

CNI Plugin (Rust):

#[test]
fn test_cni_add() {
    let mut mock_ovn = MockOVNClient::new();
    mock_ovn.expect_allocate_ip()
        .returning(|_ns, _pod| Ok("10.244.1.5/24".to_string()));

    let plugin = PrismNETPlugin::new(mock_ovn);
    let result = plugin.handle_add(/* ... */);

    assert!(result.is_ok());
    assert_eq!(result.unwrap().ip, "10.244.1.5");
}

CSI Driver (Go):

func TestCreateVolume(t *testing.T) {
    mockLightningStor := &mockLightningStorClient{}
    mockLightningStor.On("CreateVolume", mock.Anything).Return(&Volume{ID: "vol-123"}, nil)

    driver := NewCSIDriver(mockLightningStor)

    req := &csi.CreateVolumeRequest{
        Name: "test-vol",
        CapacityRange: &csi.CapacityRange{RequiredBytes: 10 * 1024 * 1024 * 1024},
    }

    resp, err := driver.CreateVolume(context.Background(), req)
    assert.NoError(t, err)
    assert.Equal(t, "vol-123", resp.Volume.VolumeId)
}

Integration Tests

Test Environment:

  • Single-node k3s cluster (kind or k3s in Docker)
  • Mock or real PlasmaCloud services (PrismNET, FiberLB, etc.)
  • Automated setup and teardown

Test Cases:

1. Single-Pod Deployment:

#!/bin/bash
set -e

# Deploy nginx pod
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx:latest
    ports:
    - containerPort: 80
EOF

# Wait for pod to be running
kubectl wait --for=condition=Ready pod/nginx --timeout=60s

# Verify pod IP allocated
POD_IP=$(kubectl get pod nginx -o jsonpath='{.status.podIP}')
[ -n "$POD_IP" ] || exit 1

# Cleanup
kubectl delete pod nginx

2. Service Exposure (LoadBalancer):

#!/bin/bash
set -e

# Create deployment
kubectl create deployment web --image=nginx:latest --replicas=2

# Expose as LoadBalancer
kubectl expose deployment web --type=LoadBalancer --port=80

# Wait for external IP
for i in {1..30}; do
  EXTERNAL_IP=$(kubectl get svc web -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
  [ -n "$EXTERNAL_IP" ] && break
  sleep 2
done

[ -n "$EXTERNAL_IP" ] || exit 1

# Verify HTTP access
curl -f http://$EXTERNAL_IP || exit 1

# Cleanup
kubectl delete svc web
kubectl delete deployment web

3. PersistentVolume Provisioning:

#!/bin/bash
set -e

# Create PVC
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pvc
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 1Gi
  storageClassName: lightningstor-ssd
EOF

# Wait for PVC to be bound (PVCs expose status.phase, not a "Bound" condition)
kubectl wait --for=jsonpath='{.status.phase}'=Bound pvc/test-pvc --timeout=60s

# Create pod using PVC
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  containers:
  - name: app
    image: busybox
    command: ["sh", "-c", "echo hello > /data/test.txt && sleep 3600"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: test-pvc
EOF

kubectl wait --for=condition=Ready pod/test-pod --timeout=60s

# Verify file written
kubectl exec test-pod -- cat /data/test.txt | grep hello || exit 1

# Cleanup
kubectl delete pod test-pod
kubectl delete pvc test-pvc

4. Multi-Tenant Isolation:

#!/bin/bash
set -e

# Create two namespaces
kubectl create namespace project-a
kubectl create namespace project-b

# Deploy pod in each
kubectl run pod-a --image=nginx -n project-a
kubectl run pod-b --image=nginx -n project-b

# Verify network isolation (if NetworkPolicies enabled)
# Pod A should NOT be able to reach Pod B
POD_B_IP=$(kubectl get pod pod-b -n project-b -o jsonpath='{.status.podIP}')
kubectl exec pod-a -n project-a -- curl --max-time 5 http://$POD_B_IP && exit 1 || true

# Cleanup
kubectl delete ns project-a project-b

E2E Test Scenario

End-to-End Test: Deploy Multi-Tier Application

#!/bin/bash
set -ex

NAMESPACE="project-123"

# 1. Create namespace
kubectl create namespace $NAMESPACE

# 2. Deploy PostgreSQL with PVC
kubectl apply -n $NAMESPACE -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-pvc
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 5Gi
  storageClassName: lightningstor-ssd
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres
spec:
  replicas: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
      - name: postgres
        image: postgres:15
        env:
        - name: POSTGRES_PASSWORD
          value: testpass
        ports:
        - containerPort: 5432
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: postgres-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: postgres
spec:
  selector:
    app: postgres
  ports:
  - port: 5432
EOF

# 3. Deploy web application
kubectl apply -n $NAMESPACE -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: myapp:latest
        env:
        - name: DATABASE_URL
          value: postgres://postgres:testpass@postgres:5432/mydb
        ports:
        - containerPort: 8080
EOF

# 4. Expose web via LoadBalancer
kubectl expose deployment web -n $NAMESPACE --type=LoadBalancer --port=80 --target-port=8080

# 5. Wait for resources
kubectl wait -n $NAMESPACE --for=condition=Ready pod -l app=postgres --timeout=120s
kubectl wait -n $NAMESPACE --for=condition=Ready pod -l app=web --timeout=120s

# 6. Verify LoadBalancer external IP
for i in {1..60}; do
  EXTERNAL_IP=$(kubectl get svc web -n $NAMESPACE -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
  [ -n "$EXTERNAL_IP" ] && break
  sleep 2
done
[ -n "$EXTERNAL_IP" ] || { echo "No external IP assigned"; exit 1; }

# 7. Verify DNS resolution
kubectl run -n $NAMESPACE --rm -i --restart=Never test-dns --image=busybox -- nslookup postgres.${NAMESPACE}.svc.cluster.local

# 8. Verify HTTP access
curl -f http://$EXTERNAL_IP/health || { echo "Health check failed"; exit 1; }

# 9. Verify PVC mounted
kubectl exec -n $NAMESPACE deployment/postgres -- ls /var/lib/postgresql/data | grep pg_wal

# 10. Verify network isolation (optional, if NetworkPolicies enabled)
# ...

# Cleanup
kubectl delete namespace $NAMESPACE

echo "E2E test passed!"

Implementation Phases

Phase 1: Foundation (4-5 weeks)

Week 1-2: k3s Setup and IAM Integration

  • Install and configure k3s with disabled components
  • Implement IAM authentication webhook server
  • Configure kube-apiserver to use IAM webhook
  • Create RBAC templates (org admin, project admin, viewer)
  • Test: Authenticate with IAM token, verify RBAC enforcement

Week 3: PrismNET CNI Plugin

  • Implement CNI binary (ADD, DEL, CHECK commands)
  • Integrate with PrismNET gRPC API (AllocateIP, ReleaseIP)
  • Configure OVN logical switches per namespace
  • Test: Create pod, verify network interface and IP allocation

Week 4: FiberLB Controller

  • Implement controller watch loop (Services, Endpoints)
  • Integrate with FiberLB gRPC API (CreateLoadBalancer, UpdatePool)
  • Implement external IP allocation from pool
  • Test: Expose service as LoadBalancer, verify external IP and routing

Week 5: Basic RBAC and Multi-Tenancy

  • Implement namespace-per-project provisioning
  • Deploy default RBAC roles and bindings
  • Test: Create multiple projects, verify isolation

Deliverables:

  • Functional k3s cluster with IAM authentication
  • Pod networking via PrismNET
  • LoadBalancer services via FiberLB
  • Multi-tenant namespaces with RBAC

Phase 2: Storage & DNS (5-6 weeks)

Week 6-7: LightningStor CSI Driver

  • Implement CSI Controller Service (CreateVolume, DeleteVolume, ControllerPublishVolume)
  • Implement CSI Node Service (NodeStageVolume, NodePublishVolume)
  • Integrate with LightningStor gRPC API
  • Deploy CSI driver as pods (controller + node DaemonSet)
  • Create StorageClasses for SSD and HDD
  • Test: Create PVC, attach to pod, write/read data

Week 8: FlashDNS Controller

  • Implement controller watch loop (Services, Pods)
  • Integrate with FlashDNS gRPC API (CreateRecord, UpdateRecord)
  • Generate DNS records (A, SRV) for services and pods
  • Configure kubelet DNS settings
  • Test: Resolve service DNS from pod, verify DNS updates

Week 9: Network Policy Support

  • Extend PrismNET CNI with NetworkPolicy controller
  • Translate K8s NetworkPolicy to OVN ACLs
  • Implement address sets for pod label selectors
  • Test: Create NetworkPolicy, verify ingress/egress enforcement

Week 10-11: Integration Testing

  • Write integration test suite (pod, service, PVC, DNS)
  • Test multi-tier application deployment
  • Performance testing (pod creation time, network throughput)
  • Fix bugs and optimize

Deliverables:

  • Persistent storage via LightningStor CSI
  • Service discovery via FlashDNS
  • Network policies enforced by PrismNET
  • Comprehensive integration tests

Phase 3: Advanced Features (Post-MVP, 6-8 weeks)

StatefulSets:

  • Verify StatefulSet controller functionality (built into k3s)
  • Test with headless services and volumeClaimTemplates
  • Example: Deploy Cassandra or Kafka cluster
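A minimal manifest pair for this verification step (image, replica count, sizes, and the SSD StorageClass name are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: cassandra
  namespace: project-alpha
spec:
  clusterIP: None        # headless: per-pod DNS records
  selector:
    app: cassandra
  ports:
    - port: 9042
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: cassandra
  namespace: project-alpha
spec:
  serviceName: cassandra
  replicas: 3
  selector:
    matchLabels:
      app: cassandra
  template:
    metadata:
      labels:
        app: cassandra
    spec:
      containers:
        - name: cassandra
          image: cassandra:4.1
          volumeMounts:
            - name: data
              mountPath: /var/lib/cassandra
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: lightningstor-ssd   # assumed class name
        resources:
          requests:
            storage: 50Gi
```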

PlasmaVMC CRI Integration:

  • Implement CRI server in PlasmaVMC (Rust)
  • Create Firecracker microVM per pod
  • Test pod lifecycle (create, start, stop, delete)
  • Performance benchmarking (startup time, resource overhead)

FlareDB as k3s Datastore:

  • Investigate etcd API compatibility layer for FlareDB
  • Implement etcd v3 gRPC API shim
  • Test k3s with FlareDB backend
  • Benchmarking and stability testing
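Much of the shim's work is mapping etcd v3 range reads onto the backing store: k3s lists keys by prefix, which etcd clients express as a `Range` over `[key, rangeEnd)` with `rangeEnd` set to the prefix with its last byte incremented. A self-contained sketch of those semantics (an in-memory map stands in for FlareDB; revisions and watch are omitted):

```go
package main

import (
	"fmt"
	"sort"
)

// kvStore sketches the subset of etcd v3 KV semantics the shim must
// map onto FlareDB: Put, and Range over [key, rangeEnd).
type kvStore struct {
	data map[string]string
}

func newKVStore() *kvStore { return &kvStore{data: map[string]string{}} }

func (s *kvStore) Put(key, val string) { s.data[key] = val }

// Range returns keys in [key, rangeEnd) in lexical order; an empty
// rangeEnd means a single-key lookup, as in the etcd v3 API.
func (s *kvStore) Range(key, rangeEnd string) []string {
	var keys []string
	for k := range s.data {
		if rangeEnd == "" {
			if k == key {
				keys = append(keys, k)
			}
		} else if k >= key && k < rangeEnd {
			keys = append(keys, k)
		}
	}
	sort.Strings(keys)
	return keys
}

// PrefixEnd computes the rangeEnd that turns Range into a prefix scan.
func PrefixEnd(prefix string) string {
	b := []byte(prefix)
	b[len(b)-1]++
	return string(b)
}

func main() {
	s := newKVStore()
	s.Put("/registry/pods/ns1/a", "{}")
	s.Put("/registry/pods/ns1/b", "{}")
	s.Put("/registry/services/ns1/web", "{}")
	fmt.Println(s.Range("/registry/pods/", PrefixEnd("/registry/pods/")))
}
```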

Autoscaling:

  • Deploy metrics-server
  • Configure HorizontalPodAutoscaler (controller built into k3s)
  • Test autoscaling based on CPU/memory metrics
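The test could drive a standard autoscaling/v2 HPA such as the following (target Deployment, replica bounds, and utilization threshold are examples):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
  namespace: project-alpha
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```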

Ingress (L7 LoadBalancer):

  • Implement Ingress controller using FiberLB L7 capabilities
  • Support host-based and path-based routing
  • Support TLS termination
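A representative Ingress the controller would translate into FiberLB virtual hosts and path rules (the `fiberlb` class name, hostnames, and TLS secret are assumptions):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web
  namespace: project-alpha
spec:
  ingressClassName: fiberlb
  tls:
    - hosts: ["app.example.com"]
      secretName: app-example-tls    # terminated at the load balancer
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /api
            pathType: Prefix
            backend:
              service:
                name: api
                port:
                  number: 8080
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web
                port:
                  number: 80
```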

Success Criteria

Functional Requirements:

  1. Deploy pods, services, deployments using kubectl
  2. LoadBalancer services receive external IPs from FiberLB
  3. PersistentVolumes provisioned from LightningStor and mounted to pods
  4. DNS resolution works for services and pods (via FlashDNS)
  5. Multi-tenant isolation enforced (namespaces, RBAC, network policies)
  6. IAM authentication and RBAC functional (token validation, user/group mapping)
  7. E2E test passes (multi-tier application deployment)

Performance Requirements:

  1. Pod creation time: <10 seconds (from API call to running state)
  2. Service LoadBalancer IP allocation: <5 seconds
  3. PersistentVolume provisioning: <30 seconds
  4. DNS record updates: <10 seconds (after service creation)
  5. Support 100+ pods per cluster
  6. Support 10+ concurrent namespaces

Operational Requirements:

  1. NixOS module for declarative deployment
  2. Cluster upgrade path (k3s version upgrades)
  3. Backup and restore procedures (etcd snapshots)
  4. Monitoring and alerting integration (Prometheus, Grafana)
  5. Logging aggregation (Fluent Bit → centralized log storage)

Next Steps (S3-S6)

S3: Workspace Scaffold

  • Create k8shost/ workspace directory structure
  • Set up Go module for controllers (FiberLB, FlashDNS)
  • Set up Rust workspace for CNI plugin
  • Set up Go module for CSI driver
  • Create NixOS module skeleton

Directory Structure:

k8shost/
├── controllers/          # Go: FiberLB, FlashDNS, IAM webhook
│   ├── fiberlb/
│   ├── flashdns/
│   ├── iamwebhook/
│   └── main.go
├── cni/                  # Rust: PrismNET CNI plugin
│   ├── src/
│   └── Cargo.toml
├── csi/                  # Go: LightningStor CSI driver
│   ├── controller/
│   ├── node/
│   └── main.go
├── nix/
│   └── modules/
│       └── k8shost.nix
└── tests/
    ├── integration/
    └── e2e/

S4: Controllers Implementation

  • Implement FiberLB controller (Service watch, gRPC integration)
  • Implement FlashDNS controller (Service/Pod watch, DNS record sync)
  • Implement IAM webhook server (TokenReview API, IAM validation)
  • Unit tests for each controller

S5: CNI + CSI Implementation

  • Implement PrismNET CNI plugin (ADD/DEL/CHECK, OVN integration)
  • Implement LightningStor CSI driver (Controller and Node services)
  • Deploy CSI driver as pods (Deployment + DaemonSet)
  • Unit tests for CNI and CSI

S6: Integration Testing

  • Set up integration test environment (k3s cluster + mock services)
  • Write integration tests (pod, service, PVC, DNS, multi-tenant)
  • Write E2E test (multi-tier application)
  • CI/CD pipeline for automated testing

Document Version: 1.0
Last Updated: 2025-12-09
Authors: PlasmaCloud Platform Team
Status: Draft for Review