# K8s Hosting Specification

## Overview

PlasmaCloud's K8s Hosting service provides managed Kubernetes clusters for multi-tenant container orchestration. This specification defines a k3s-based architecture that integrates deeply with existing PlasmaCloud infrastructure components: PrismNET for networking, FiberLB for load balancing, IAM for authentication/authorization, FlashDNS for service discovery, and LightningStor for persistent storage.

### Purpose

Enable customers to deploy and manage containerized workloads using standard Kubernetes APIs while benefiting from PlasmaCloud's integrated infrastructure services. The system provides:

- **Standard K8s API compatibility**: Use kubectl, Helm, and existing K8s tooling
- **Multi-tenant isolation**: Project-based namespaces with IAM-backed RBAC
- **Deep integration**: Leverage PrismNET SDN, FiberLB load balancing, LightningStor block storage
- **Production-ready**: HA control plane, automated failover, comprehensive monitoring

### Scope

**Phase 1 (MVP, 3-4 months):**

- Core K8s APIs (Pods, Services, Deployments, ReplicaSets, Namespaces, ConfigMaps, Secrets)
- LoadBalancer services via FiberLB
- Persistent storage via LightningStor CSI
- IAM authentication and RBAC
- PrismNET CNI for pod networking
- FlashDNS service discovery

**Future Phases:**

- PlasmaVMC integration for VM-backed pods (enhanced isolation)
- StatefulSets, DaemonSets, Jobs/CronJobs
- Network policies with PrismNET enforcement
- Horizontal Pod Autoscaler
- FlareDB as k3s datastore

### Architecture Decision Summary

**Base Technology: k3s**

- Lightweight K8s distribution (single binary, minimal dependencies)
- Production-proven (CNCF certified, widely deployed)
- Flexible architecture allowing component replacement
- Embedded SQLite (single-server) or etcd (HA cluster)
- 3-4 month timeline achievable

**Component Replacement Strategy:**

- **Disable**: servicelb (replaced by FiberLB), traefik (use FiberLB), flannel (replaced by PrismNET)
- **Keep**: kube-apiserver, kube-scheduler, kube-controller-manager, kubelet, containerd
- **Add**: Custom controllers for FiberLB, FlashDNS, IAM webhook, LightningStor CSI, PrismNET CNI

## Architecture

### Base: k3s with Selective Component Replacement

**k3s Core (Keep):**

- **kube-apiserver**: K8s REST API server with IAM webhook authentication
- **kube-scheduler**: Pod scheduling with resource awareness
- **kube-controller-manager**: Core controllers (replication, endpoints, service accounts, etc.)
- **kubelet**: Node agent managing pod lifecycle via containerd CRI
- **containerd**: Container runtime (Phase 1), later replaceable by PlasmaVMC CRI
- **kube-proxy**: Service networking (iptables/ipvs mode)

**k3s Components (Disable):**

- **servicelb**: Default LoadBalancer implementation → Replaced by FiberLB controller
- **traefik**: Ingress controller → Replaced by FiberLB L7 capabilities
- **flannel**: CNI plugin → Replaced by PrismNET CNI
- **local-path-provisioner**: Storage provisioner → Replaced by LightningStor CSI

**PlasmaCloud Custom Components (Add):**

- **PrismNET CNI Plugin**: Pod networking via OVN logical switches
- **FiberLB Controller**: LoadBalancer service reconciliation
- **IAM Webhook Server**: Token validation and user mapping
- **FlashDNS Controller**: Service DNS record synchronization
- **LightningStor CSI Driver**: PersistentVolume provisioning and attachment

### Component Topology

```
┌─────────────────────────────────────────────────────────────┐
│                      k3s Control Plane                      │
│  ┌───────────────┐   ┌─────────────┐   ┌─────────────────┐  │
│  │ kube-apiserver│◄──┤ IAM Webhook ├───┤ IAM Service     │  │
│  │               │   │             │   │ (Authentication)│  │
│  └──────┬────────┘   └─────────────┘   └─────────────────┘  │
│         │                                                   │
│  ┌──────▼───────┐  ┌────────────────┐  ┌────────────────┐   │
│  │kube-scheduler│  │kube-controller-│  │  etcd/SQLite   │   │
│  │              │  │    manager     │  │  (Datastore)   │   │
│  └──────────────┘  └────────────────┘  └────────────────┘   │
└─────────────────────────────┬───────────────────────────────┘
                              │
          ┌───────────────────┼───────────────────┐
          │                   │                   │
  ┌───────▼───────┐   ┌───────▼───────┐   ┌───────▼────────┐
  │    FiberLB    │   │   FlashDNS    │   │ LightningStor  │
  │  Controller   │   │  Controller   │   │   CSI Plugin   │
  │ (Watch Svcs)  │   │  (Sync DNS)   │   │  (Provision)   │
  └───────┬───────┘   └───────┬───────┘   └───────┬────────┘
          │                   │                   │
          ▼                   ▼                   ▼
  ┌───────────────┐   ┌───────────────┐   ┌────────────────┐
  │    FiberLB    │   │   FlashDNS    │   │ LightningStor  │
  │   gRPC API    │   │   gRPC API    │   │    gRPC API    │
  └───────────────┘   └───────────────┘   └────────────────┘

┌─────────────────────────────────────────────────────────────┐
│                      k3s Worker Nodes                       │
│  ┌──────────────┐   ┌────────────┐   ┌──────────────────┐   │
│  │   kubelet    │◄──┤ containerd ├───┤ Pods (containers)│   │
│  │              │   │    CRI     │   │                  │   │
│  └──────┬───────┘   └────────────┘   └──────────────────┘   │
│         │                                                   │
│  ┌──────▼───────┐   ┌──────────────┐                        │
│  │ PrismNET CNI │◄──┤  kube-proxy  │                        │
│  │ (Pod Network)│   │ (Service Net)│                        │
│  └──────┬───────┘   └──────────────┘                        │
│         │                                                   │
│  ┌──────▼───────┐                                           │
│  │ PrismNET OVN │                                           │
│  │(ovs-vswitchd)│                                           │
│  └──────────────┘                                           │
└─────────────────────────────────────────────────────────────┘
```

### Data Flow Examples

**1. Pod Creation:**

```
kubectl create pod → kube-apiserver (IAM auth) → scheduler → kubelet → containerd
                                                                 ↓
                                                           PrismNET CNI
                                                                 ↓
                                                         OVN logical port
```

**2. LoadBalancer Service:**

```
kubectl expose → kube-apiserver → Service created → FiberLB controller watches
                                                          ↓
                                                   FiberLB gRPC API
                                                          ↓
                                            External IP + L4 forwarding
```
**3. PersistentVolume:**

```
PVC created → kube-apiserver → CSI controller → LightningStor CSI driver
                                     ↓
                             LightningStor gRPC
                                     ↓
                               Volume created
                                     ↓
                        kubelet → CSI node plugin
                                     ↓
                                Mount to pod
```

## K8s API Subset

### Phase 1: Core APIs (Essential)

**Pods (v1):**

- Full CRUD operations (create, get, list, update, delete, patch)
- Watch API for real-time updates
- Log streaming (`kubectl logs -f`)
- Exec into containers (`kubectl exec`)
- Port forwarding (`kubectl port-forward`)
- Status: phase (Pending, Running, Succeeded, Failed), conditions, container states

**Services (v1):**

- **ClusterIP**: Internal cluster networking (default)
- **LoadBalancer**: External access via FiberLB
- **Headless**: StatefulSet support (`clusterIP: None`)
- Service discovery via FlashDNS
- EndpointSlices for large service backends

**Deployments (apps/v1):**

- Declarative desired state (replicas, pod template)
- Rolling updates with configurable strategy (maxSurge, maxUnavailable)
- Rollback to a previous revision
- Pause/resume for canary deployments
- Scaling (manual in Phase 1)

**ReplicaSets (apps/v1):**

- Pod replication with label selectors
- Owned by Deployments (rarely created directly)
- Orphan/adopt pod ownership

**Namespaces (v1):**

- Tenant isolation (one namespace per project)
- Resource quota enforcement
- Network policy scope (Phase 2)
- RBAC scope

**ConfigMaps (v1):**

- Non-sensitive configuration data
- Mount as volumes or environment variables
- Update triggers pod restarts (via annotation)

**Secrets (v1):**

- Sensitive data (passwords, tokens, certificates)
- Base64-encoded in etcd (at-rest encryption in a future phase)
- Mount as volumes or environment variables
- Service account tokens

**Nodes (v1):**

- Node registration via kubelet
- Heartbeat and status reporting
- Capacity and allocatable resources
- Labels and taints for scheduling

**Events (v1):**

- Audit trail of cluster activities
- Retention policy (1 hour in-memory, longer in etcd)
- Debugging and troubleshooting
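The rolling-update strategy fields above accept either absolute counts or percentages. As a small worked example (not from this spec, but standard Kubernetes semantics: surge rounds up, unavailability rounds down), resolving those fields to pod counts can be sketched in Go; the function names are illustrative:

```go
package main

import (
	"fmt"
	"math"
	"strconv"
	"strings"
)

// resolveRollingUpdate converts a Deployment's maxSurge / maxUnavailable
// values (absolute numbers or percentages) into absolute pod counts.
// Kubernetes rounds maxSurge up and maxUnavailable down for percentages.
func resolveRollingUpdate(replicas int, maxSurge, maxUnavailable string) (surge, unavailable int) {
	surge = resolve(maxSurge, replicas, true)
	unavailable = resolve(maxUnavailable, replicas, false)
	return
}

func resolve(v string, replicas int, roundUp bool) int {
	if strings.HasSuffix(v, "%") {
		pct, _ := strconv.Atoi(strings.TrimSuffix(v, "%"))
		f := float64(replicas) * float64(pct) / 100.0
		if roundUp {
			return int(math.Ceil(f))
		}
		return int(math.Floor(f))
	}
	n, _ := strconv.Atoi(v)
	return n
}

func main() {
	s, u := resolveRollingUpdate(10, "25%", "25%")
	fmt.Println(s, u) // 3 2
}
```

So a 10-replica Deployment with the default `25%`/`25%` strategy may briefly run 13 pods while never dropping below 8 ready pods.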
### Phase 2: Storage & Config (Required for MVP)

**PersistentVolumes (v1):**

- Volume lifecycle independent of pods
- Access modes: ReadWriteOnce, ReadOnlyMany, ReadWriteMany (LightningStor support)
- Reclaim policy: Retain, Delete
- Status: Available, Bound, Released, Failed

**PersistentVolumeClaims (v1):**

- User request for storage
- Binding to PVs by storage class, capacity, access mode
- Volume expansion (if the storage class allows)

**StorageClasses (storage.k8s.io/v1):**

- Dynamic provisioning via LightningStor CSI
- Parameters: volume type (ssd, hdd), replication factor, org_id, project_id
- Volume binding mode: Immediate or WaitForFirstConsumer

### Phase 3: Advanced (Post-MVP)

**StatefulSets (apps/v1):**

- Ordered pod creation/deletion
- Stable network identities (pod-0, pod-1, ...)
- Persistent storage per pod via volumeClaimTemplates
- Use case: databases, distributed systems

**DaemonSets (apps/v1):**

- One pod per node (e.g., log collectors, monitoring agents)
- Node selector and tolerations

**Jobs (batch/v1):**

- Run-to-completion workloads
- Parallelism and completions
- Retry policy

**CronJobs (batch/v1):**

- Scheduled jobs (cron syntax)
- Concurrency policy

**NetworkPolicies (networking.k8s.io/v1):**

- Ingress and egress rules
- Label-based pod selection
- Namespace selectors
- Requires PrismNET CNI support for OVN ACL translation

**Ingress (networking.k8s.io/v1):**

- HTTP/HTTPS routing via FiberLB L7
- Host-based and path-based routing
- TLS termination

### Deferred APIs (Not in MVP)

- HorizontalPodAutoscaler (autoscaling/v2): Requires metrics-server
- VerticalPodAutoscaler: Complex, low priority
- PodDisruptionBudget: Useful for HA, but post-MVP
- LimitRange: Resource limits per namespace (future)
- ResourceQuota: Supported in Phase 1, but advanced features deferred
- CustomResourceDefinitions (CRDs): Framework exists, but no custom resources in Phase 1
- APIService: Aggregation layer not needed initially
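The integration controllers specified in the following sections (FiberLB, FlashDNS, LightningStor CSI) all share the same watch-and-reconcile shape: consume API-server watch events, reconcile on create/update, clean up on delete. A minimal Go sketch of that pattern, with an illustrative `Event` type and callbacks standing in for the real client-go machinery:

```go
package main

import "fmt"

// Event is a simplified stand-in for a K8s watch event.
type Event struct {
	Type string // "Created", "Updated", "Deleted"
	Name string // object name
}

// runController drains watch events and dispatches to the reconcile or
// cleanup callback, mirroring how the FiberLB, FlashDNS, and CSI
// controllers react to kube-apiserver watches.
func runController(events <-chan Event, reconcile, cleanup func(string)) {
	for e := range events {
		switch e.Type {
		case "Created", "Updated":
			reconcile(e.Name)
		case "Deleted":
			cleanup(e.Name)
		}
	}
}

func main() {
	ch := make(chan Event, 2)
	ch <- Event{Type: "Created", Name: "web"}
	ch <- Event{Type: "Deleted", Name: "web"}
	close(ch)
	runController(ch,
		func(n string) { fmt.Println("reconcile", n) },
		func(n string) { fmt.Println("cleanup", n) })
}
```

The real controllers additionally debounce endpoint churn and re-list on watch restarts; this sketch only shows the dispatch skeleton they have in common.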
## Integration Specifications

### 1. PrismNET CNI Plugin

**Purpose:** Provide pod networking using PrismNET's OVN-based SDN.

**Interface:** CNI 1.0.0 specification (https://github.com/containernetworking/cni/blob/main/SPEC.md)

**Components:**

- **CNI binary**: `/opt/cni/bin/prismnet`
- **Configuration**: `/etc/cni/net.d/10-prismnet.conflist`
- **IPAM plugin**: `/opt/cni/bin/prismnet-ipam` (or integrated)

**Responsibilities:**

- Create the network interface for the pod (veth pair)
- Allocate an IP address from the namespace-specific subnet
- Connect the pod to an OVN logical switch
- Configure routing for pod egress
- Enforce network policies (Phase 2)

**Configuration Schema:**

```json
{
  "cniVersion": "1.0.0",
  "name": "prismnet",
  "type": "prismnet",
  "ipam": {
    "type": "prismnet-ipam",
    "subnet": "10.244.0.0/16",
    "rangeStart": "10.244.0.10",
    "rangeEnd": "10.244.255.254",
    "routes": [{"dst": "0.0.0.0/0"}],
    "gateway": "10.244.0.1"
  },
  "ovn": {
    "northbound": "tcp:prismnet-server:6641",
    "southbound": "tcp:prismnet-server:6642",
    "encapType": "geneve"
  },
  "mtu": 1400,
  "prismnetEndpoint": "prismnet-server:5000"
}
```

**CNI Plugin Workflow:**

1. **ADD Command** (pod creation):

   ```
   Input: Container ID, network namespace path, interface name
   Process:
   - Call PrismNET gRPC API: AllocateIP(namespace, pod_name)
   - Create veth pair: one end in pod netns, one in host
   - Add host veth to OVN logical switch port
   - Configure pod veth: IP address, routes, MTU
   - Return: IP config, routes, DNS settings
   ```

2. **DEL Command** (pod deletion):

   ```
   Input: Container ID, network namespace path
   Process:
   - Call PrismNET gRPC API: ReleaseIP(namespace, pod_name)
   - Delete OVN logical switch port
   - Delete veth pair
   ```
3. **CHECK Command** (health check):

   ```
   Verify the interface exists and has the expected configuration
   ```

**API Integration (PrismNET gRPC):**

```protobuf
service NetworkService {
  rpc AllocateIP(AllocateIPRequest) returns (AllocateIPResponse);
  rpc ReleaseIP(ReleaseIPRequest) returns (ReleaseIPResponse);
  rpc CreateLogicalSwitch(CreateLogicalSwitchRequest) returns (CreateLogicalSwitchResponse);
}

message AllocateIPRequest {
  string namespace = 1;
  string pod_name = 2;
  string container_id = 3;
}

message AllocateIPResponse {
  string ip_address = 1;          // e.g., "10.244.1.5/24"
  string gateway = 2;
  repeated string dns_servers = 3;
}
```

**OVN Topology:**

- **Logical Switch per Namespace**: `k8s-<namespace>` (e.g., `k8s-project-123`)
- **Logical Router**: `k8s-cluster-router` for inter-namespace routing
- **Logical Switch Ports**: One per pod (`<namespace>-<pod-name>`)
- **ACLs**: NetworkPolicy enforcement (Phase 2)

**Network Policy Translation (Phase 2):**

```
K8s NetworkPolicy:
  podSelector: app=web
  ingress:
  - from:
    - podSelector: app=frontend
    ports:
    - protocol: TCP
      port: 80

→ OVN ACL:
  direction: to-lport
  match: "ip4.src == $frontend_pods && tcp.dst == 80"
  action: allow-related
  priority: 1000
```

**Address Sets:**

- Dynamic updates as pods are added/removed
- Efficient ACL matching for large pod groups

### 2. FiberLB LoadBalancer Controller

**Purpose:** Reconcile K8s Services of type LoadBalancer with FiberLB resources.

**Architecture:**

- **Controller Process**: Runs as a pod in the `kube-system` namespace or embedded in the k3s server
- **Watch Resources**: Services (type=LoadBalancer), Endpoints
- **Manage Resources**: FiberLB LoadBalancers, Listeners, Pools, Members

**Controller Logic:**

**1. Service Watch Loop:**

```go
for event := range serviceWatcher {
	if event.Type == Created || event.Type == Updated {
		if service.Spec.Type == "LoadBalancer" {
			reconcileLoadBalancer(service)
		}
	} else if event.Type == Deleted {
		deleteLoadBalancer(service)
	}
}
```
**2. Reconcile Logic:**

```
Input: Service object
Process:
1. Check if FiberLB LoadBalancer exists (by annotation or name mapping)
2. If not exists:
   a. Allocate external IP from pool
   b. Create FiberLB LoadBalancer resource (gRPC CreateLoadBalancer)
   c. Store LoadBalancer ID in service annotation
3. For each service.Spec.Ports:
   a. Create/update FiberLB Listener (protocol, port, algorithm)
4. Get service endpoints:
   a. Create/update FiberLB Pool with backend members (pod IPs, ports)
5. Update service.Status.LoadBalancer.Ingress with external IP
6. If service spec changed:
   a. Update FiberLB resources accordingly
```

**3. Endpoint Watch Loop:**

```go
for event := range endpointWatcher {
	service := getServiceForEndpoint(event.Object)
	if service.Spec.Type == "LoadBalancer" {
		updateLoadBalancerPool(service, event.Object)
	}
}
```

**Configuration:**

- **External IP Pool**: `--external-ip-pool=192.168.100.0/24` (CIDR or IP range)
- **FiberLB Endpoint**: `--fiberlb-endpoint=fiberlb-server:7000` (gRPC address)
- **IP Allocation**: First-available, or integration with an IPAM service

**Service Annotations:**

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web-service
  annotations:
    fiberlb.plasmacloud.io/load-balancer-id: "lb-abc123"
    fiberlb.plasmacloud.io/algorithm: "round-robin"  # round-robin | least-conn | ip-hash
    fiberlb.plasmacloud.io/health-check-path: "/health"
    fiberlb.plasmacloud.io/health-check-interval: "10s"
    fiberlb.plasmacloud.io/health-check-timeout: "5s"
    fiberlb.plasmacloud.io/health-check-retries: "3"
    fiberlb.plasmacloud.io/session-affinity: "client-ip"  # For sticky sessions
spec:
  type: LoadBalancer
  selector:
    app: web
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8080
status:
  loadBalancer:
    ingress:
    - ip: 192.168.100.50
```
**FiberLB gRPC API Integration:**

```protobuf
service LoadBalancerService {
  rpc CreateLoadBalancer(CreateLoadBalancerRequest) returns (LoadBalancer);
  rpc UpdateLoadBalancer(UpdateLoadBalancerRequest) returns (LoadBalancer);
  rpc DeleteLoadBalancer(DeleteLoadBalancerRequest) returns (Empty);
  rpc CreateListener(CreateListenerRequest) returns (Listener);
  rpc UpdatePool(UpdatePoolRequest) returns (Pool);
}

message CreateLoadBalancerRequest {
  string name = 1;
  string description = 2;
  string external_ip = 3;       // If empty, allocate from pool
  string org_id = 4;
  string project_id = 5;
}

message CreateListenerRequest {
  string load_balancer_id = 1;
  string protocol = 2;          // TCP, UDP, HTTP, HTTPS
  int32 port = 3;
  string default_pool_id = 4;
  HealthCheck health_check = 5;
}

message UpdatePoolRequest {
  string pool_id = 1;
  repeated PoolMember members = 2;
  string algorithm = 3;
}

message PoolMember {
  string address = 1;           // Pod IP
  int32 port = 2;
  int32 weight = 3;
}
```

**Health Checks:**

- HTTP health checks: Use the `health-check-path` annotation
- TCP health checks: Connection-based, for non-HTTP services
- Health check failures remove the pod from the pool (auto-healing)

**Edge Cases:**

- **Service deletion**: The controller must clean up FiberLB resources and release the external IP
- **Endpoint churn**: Debounce pool updates to avoid excessive FiberLB API calls
- **IP exhaustion**: Emit an error event on the service and set a status condition

### 3. IAM Authentication Webhook

**Purpose:** Authenticate K8s API requests using PlasmaCloud IAM tokens.

**Architecture:**

- **Webhook Server**: HTTPS endpoint (can be part of the IAM service or standalone)
- **Integration Point**: kube-apiserver `--authentication-token-webhook-config-file`
- **Protocol**: K8s TokenReview API

**Webhook Endpoint:** `POST /apis/iam.plasmacloud.io/v1/authenticate`

**Request Flow:**

```
kubectl --token=<token> get pods
        ↓
kube-apiserver extracts the Bearer token
        ↓
POST /apis/iam.plasmacloud.io/v1/authenticate
  body: TokenReview with token
        ↓
IAM webhook validates the token
        ↓
Response: authenticated=true, user info, groups
        ↓
kube-apiserver proceeds with RBAC authorization
```

**Request Schema (from kube-apiserver):**

```json
{
  "apiVersion": "authentication.k8s.io/v1",
  "kind": "TokenReview",
  "spec": {
    "token": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9..."
  }
}
```

**Response Schema (from IAM webhook):**

```json
{
  "apiVersion": "authentication.k8s.io/v1",
  "kind": "TokenReview",
  "status": {
    "authenticated": true,
    "user": {
      "username": "user@example.com",
      "uid": "user-550e8400-e29b-41d4-a716-446655440000",
      "groups": [
        "org:org-123",
        "project:proj-456",
        "system:authenticated"
      ],
      "extra": {
        "org_id": ["org-123"],
        "project_id": ["proj-456"],
        "roles": ["org_admin"]
      }
    }
  }
}
```

**Error Response (invalid token):**

```json
{
  "apiVersion": "authentication.k8s.io/v1",
  "kind": "TokenReview",
  "status": {
    "authenticated": false,
    "error": "Invalid or expired token"
  }
}
```

**IAM Token Format:**

- **JWT**: Signed by the IAM service with a shared secret or public/private key
- **Claims**: sub (user ID), email, org_id, project_id, roles, exp (expiration)
- **Example**:

```json
{
  "sub": "user-550e8400-e29b-41d4-a716-446655440000",
  "email": "user@example.com",
  "org_id": "org-123",
  "project_id": "proj-456",
  "roles": ["org_admin", "project_member"],
  "exp": 1672531200
}
```

**User/Group Mapping:**

| IAM Principal | K8s Username | K8s Groups |
|---------------|--------------|------------|
| User (email) | user@example.com | `org:<org-id>`, `project:<project-id>`, `system:authenticated` |
| User (ID) | `user-<uuid>` | `org:<org-id>`, `project:<project-id>`, `system:authenticated` |
| Service Account | `sa-<name>@<project-id>` | `org:<org-id>`, `project:<project-id>`, `system:serviceaccounts` |
| Org Admin | admin@example.com | `org:<org-id>`, `project:<project-id>`, `k8s:org-admin` |

**RBAC Integration:**

- Groups are used in RoleBindings and ClusterRoleBindings
- Example: the `org:org-123` group gets admin access to all `project-*` namespaces for that org
**Webhook Configuration File (`/etc/k8shost/iam-webhook.yaml`):**

```yaml
apiVersion: v1
kind: Config
clusters:
- name: iam-webhook
  cluster:
    server: https://iam-server:3000/apis/iam.plasmacloud.io/v1/authenticate
    certificate-authority: /etc/k8shost/ca.crt
users:
- name: k8s-apiserver
  user:
    client-certificate: /etc/k8shost/apiserver-client.crt
    client-key: /etc/k8shost/apiserver-client.key
current-context: webhook
contexts:
- context:
    cluster: iam-webhook
    user: k8s-apiserver
  name: webhook
```

**Performance Considerations:**

- **Caching**: kube-apiserver caches successful authentications (`--authentication-token-webhook-cache-ttl=2m`)
- **Timeouts**: The webhook must respond within 10s (configurable)
- **Rate Limiting**: The IAM webhook should handle high request volume (hundreds of req/s)

### 4. FlashDNS Service Discovery Controller

**Purpose:** Synchronize K8s Services and Pods to FlashDNS for cluster DNS resolution.

**Architecture:**

- **Controller Process**: Runs as a pod in `kube-system` or embedded in the k3s server
- **Watch Resources**: Services, Endpoints, Pods
- **Manage Resources**: FlashDNS A/AAAA/SRV records

**DNS Hierarchy:**

- **Pod A Records**: `<pod-ip-dashed>.pod.cluster.local` → Pod IP
  - Example: `10-244-1-5.pod.cluster.local` → `10.244.1.5`
- **Service A Records**: `<service>.<namespace>.svc.cluster.local` → ClusterIP or external IP
  - Example: `web.default.svc.cluster.local` → `10.96.0.100`
- **Headless Service**: `<hostname>.<service>.<namespace>.svc.cluster.local` → Endpoint IPs
  - Example: `web-0.web.default.svc.cluster.local` → `10.244.1.10`
- **SRV Records**: `_<port>._<protocol>.<service>.<namespace>.svc.cluster.local`
  - Example: `_http._tcp.web.default.svc.cluster.local` → `0 50 80 web.default.svc.cluster.local`

**Controller Logic:**

**1. Service Watch:**

```
for event := range serviceWatcher {
    service := event.Object
    switch event.Type {
    case Created, Updated:
        if service.Spec.ClusterIP != "None":
            // Regular service
            createOrUpdateDNSRecord(
                name: service.Name + "." + service.Namespace + ".svc.cluster.local",
                type: "A",
                value: service.Spec.ClusterIP)
            if len(service.Status.LoadBalancer.Ingress) > 0:
                // LoadBalancer service - also add the external IP
                createOrUpdateDNSRecord(
                    name: service.Name + "." + service.Namespace + ".svc.cluster.local",
                    type: "A",
                    value: service.Status.LoadBalancer.Ingress[0].IP)
        else:
            // Headless service - add endpoint records
            endpoints := getEndpoints(service)
            for _, ep := range endpoints:
                createOrUpdateDNSRecord(
                    name: ep.Hostname + "." + service.Name + "." + service.Namespace + ".svc.cluster.local",
                    type: "A",
                    value: ep.IP)
        // Create SRV records for each port
        for _, port := range service.Spec.Ports:
            createSRVRecord(service, port)
    case Deleted:
        deleteDNSRecords(service)
    }
}
```

**2. Pod Watch (for pod DNS):**

```
for event := range podWatcher {
    pod := event.Object
    switch event.Type {
    case Created, Updated:
        if pod.Status.PodIP != "":
            dashedIP := strings.ReplaceAll(pod.Status.PodIP, ".", "-")
            createOrUpdateDNSRecord(
                name: dashedIP + ".pod.cluster.local",
                type: "A",
                value: pod.Status.PodIP)
    case Deleted:
        deleteDNSRecord(pod)
    }
}
```

**FlashDNS gRPC API Integration:**

```protobuf
service DNSService {
  rpc CreateRecord(CreateRecordRequest) returns (DNSRecord);
  rpc UpdateRecord(UpdateRecordRequest) returns (DNSRecord);
  rpc DeleteRecord(DeleteRecordRequest) returns (Empty);
  rpc ListRecords(ListRecordsRequest) returns (ListRecordsResponse);
}

message CreateRecordRequest {
  string zone = 1;                 // "cluster.local"
  string name = 2;                 // "web.default.svc"
  string type = 3;                 // "A", "AAAA", "SRV", "CNAME"
  string value = 4;                // "10.96.0.100"
  int32 ttl = 5;                   // 30 (seconds)
  map<string, string> labels = 6;  // k8s metadata
}

message DNSRecord {
  string id = 1;
  string zone = 2;
  string name = 3;
  string type = 4;
  string value = 5;
  int32 ttl = 6;
}
```

**Configuration:**

- **FlashDNS Endpoint**: `--flashdns-endpoint=flashdns-server:6000`
- **Cluster Domain**: `--cluster-domain=cluster.local` (default)
- **Record TTL**: `--dns-ttl=30` (seconds; low for fast updates)
**Example DNS Records:**

```
# Regular service
web.default.svc.cluster.local.        30 IN A   10.96.0.100

# Headless service with 3 pods
web.default.svc.cluster.local.        30 IN A   10.244.1.10
web.default.svc.cluster.local.        30 IN A   10.244.1.11
web.default.svc.cluster.local.        30 IN A   10.244.1.12

# StatefulSet pods (Phase 3)
web-0.web.default.svc.cluster.local.  30 IN A   10.244.1.10
web-1.web.default.svc.cluster.local.  30 IN A   10.244.1.11

# SRV record for service port
_http._tcp.web.default.svc.cluster.local. 30 IN SRV 0 50 80 web.default.svc.cluster.local.

# Pod DNS
10-244-1-10.pod.cluster.local.        30 IN A   10.244.1.10
```

**Integration with kubelet:**

- kubelet configures pod DNS via `/etc/resolv.conf`
- `nameserver`: FlashDNS service IP (typically the first IP in the service CIDR, e.g., `10.96.0.10`)
- `search`: `<namespace>.svc.cluster.local svc.cluster.local cluster.local`

**Edge Cases:**

- **Service IP change**: Update the DNS record atomically
- **Endpoint churn**: Debounce updates for headless services with many endpoints
- **DNS caching**: Low TTL (30s) for fast convergence

### 5. LightningStor CSI Driver

**Purpose:** Provide dynamic PersistentVolume provisioning and lifecycle management.

**CSI Driver Name:** `stor.plasmacloud.io`

**Architecture:**

- **Controller Plugin**: Runs as a StatefulSet or Deployment in `kube-system`
  - Provisioning, deletion, attaching, detaching, snapshots
- **Node Plugin**: Runs as a DaemonSet on every node
  - Staging, publishing (mounting), unpublishing, unstaging

**CSI Components:**

**1. Controller Service (Identity, Controller RPCs):**

- `CreateVolume`: Provision a new volume via LightningStor
- `DeleteVolume`: Delete a volume
- `ControllerPublishVolume`: Attach a volume to a node
- `ControllerUnpublishVolume`: Detach a volume from a node
- `ValidateVolumeCapabilities`: Check whether a volume supports the requested capabilities
- `ListVolumes`: List all volumes
- `GetCapacity`: Query available storage capacity
- `CreateSnapshot`, `DeleteSnapshot`: Volume snapshots (Phase 2)

**2. Node Service (Node RPCs):**

- `NodeStageVolume`: Mount the volume to a global staging path on the node
- `NodeUnstageVolume`: Unmount from the staging path
- `NodePublishVolume`: Bind-mount from the staging path to the pod path
- `NodeUnpublishVolume`: Unmount from the pod path
- `NodeGetInfo`: Return node ID and topology
- `NodeGetCapabilities`: Return node capabilities
**CSI Driver Workflow:**

**Volume Provisioning:**

```
1. User creates a PVC:
   apiVersion: v1
   kind: PersistentVolumeClaim
   metadata:
     name: my-pvc
   spec:
     accessModes: [ReadWriteOnce]
     resources:
       requests:
         storage: 10Gi
     storageClassName: lightningstor-ssd

2. The CSI controller watches the PVC and calls CreateVolume:
   CreateVolumeRequest {
     name: "pvc-550e8400-e29b-41d4-a716-446655440000"
     capacity_range: { required_bytes: 10737418240 }
     volume_capabilities: [{ access_mode: SINGLE_NODE_WRITER }]
     parameters: {
       "type": "ssd",
       "replication": "3",
       "org_id": "org-123",
       "project_id": "proj-456"
     }
   }

3. The CSI controller calls LightningStor gRPC CreateVolume:
   LightningStor creates the volume and returns volume_id

4. The CSI controller creates a PV:
   apiVersion: v1
   kind: PersistentVolume
   metadata:
     name: pvc-550e8400-e29b-41d4-a716-446655440000
   spec:
     capacity:
       storage: 10Gi
     accessModes: [ReadWriteOnce]
     persistentVolumeReclaimPolicy: Delete
     storageClassName: lightningstor-ssd
     csi:
       driver: stor.plasmacloud.io
       volumeHandle: vol-abc123
       fsType: ext4

5. K8s binds the PVC to the PV
```

**Volume Attachment (when the pod is scheduled):**

```
1. kube-controller-manager creates a VolumeAttachment:
   apiVersion: storage.k8s.io/v1
   kind: VolumeAttachment
   metadata:
     name: csi-<hash>
   spec:
     attacher: stor.plasmacloud.io
     nodeName: worker-1
     source:
       persistentVolumeName: pvc-550e8400-e29b-41d4-a716-446655440000

2. The CSI controller watches the VolumeAttachment and calls ControllerPublishVolume:
   ControllerPublishVolumeRequest {
     volume_id: "vol-abc123"
     node_id: "worker-1"
     volume_capability: { access_mode: SINGLE_NODE_WRITER }
   }

3. The CSI controller calls LightningStor gRPC AttachVolume:
   LightningStor attaches the volume to the node (e.g., iSCSI target, NBD)

4. The CSI controller updates the VolumeAttachment status: attached=true
```
**Volume Mounting (on the node):**

```
1. kubelet calls the CSI node plugin: NodeStageVolume
   NodeStageVolumeRequest {
     volume_id: "vol-abc123"
     staging_target_path: "/var/lib/kubelet/plugins/kubernetes.io/csi/stor.plasmacloud.io/<hash>/globalmount"
     volume_capability: { mount: { fs_type: "ext4" } }
   }

2. The CSI node plugin:
   - Discovers the block device (e.g., /dev/nbd0) via LightningStor
   - Formats it if needed: mkfs.ext4 /dev/nbd0
   - Mounts it to the staging path: mount /dev/nbd0 <staging_target_path>

3. kubelet calls the CSI node plugin: NodePublishVolume
   NodePublishVolumeRequest {
     volume_id: "vol-abc123"
     staging_target_path: "/var/lib/kubelet/plugins/kubernetes.io/csi/stor.plasmacloud.io/<hash>/globalmount"
     target_path: "/var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~csi/pvc-<uid>/mount"
   }

4. The CSI node plugin:
   - Bind-mounts the staging path to the target path
   - The pod can now read/write the volume
```

**LightningStor gRPC API Integration:**

```protobuf
service VolumeService {
  rpc CreateVolume(CreateVolumeRequest) returns (Volume);
  rpc DeleteVolume(DeleteVolumeRequest) returns (Empty);
  rpc AttachVolume(AttachVolumeRequest) returns (VolumeAttachment);
  rpc DetachVolume(DetachVolumeRequest) returns (Empty);
  rpc GetVolume(GetVolumeRequest) returns (Volume);
  rpc ListVolumes(ListVolumesRequest) returns (ListVolumesResponse);
}

message CreateVolumeRequest {
  string name = 1;
  int64 size_bytes = 2;
  string volume_type = 3;         // "ssd", "hdd"
  int32 replication_factor = 4;
  string org_id = 5;
  string project_id = 6;
}

message Volume {
  string id = 1;
  string name = 2;
  int64 size_bytes = 3;
  string status = 4;              // "available", "in-use", "error"
  string volume_type = 5;
}

message AttachVolumeRequest {
  string volume_id = 1;
  string node_id = 2;
  string attach_mode = 3;         // "read-write", "read-only"
}

message VolumeAttachment {
  string id = 1;
  string volume_id = 2;
  string node_id = 3;
  string device_path = 4;         // e.g., "/dev/nbd0"
  string connection_info = 5;     // JSON with iSCSI target, NBD socket, etc.
}
```

**StorageClass Examples:**

```yaml
# SSD storage with 3x replication
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: lightningstor-ssd
provisioner: stor.plasmacloud.io
parameters:
  type: "ssd"
  replication: "3"
volumeBindingMode: WaitForFirstConsumer  # Topology-aware scheduling
allowVolumeExpansion: true
reclaimPolicy: Delete
---
# HDD storage with 2x replication
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: lightningstor-hdd
provisioner: stor.plasmacloud.io
parameters:
  type: "hdd"
  replication: "2"
volumeBindingMode: Immediate
allowVolumeExpansion: true
reclaimPolicy: Retain  # Keep the volume after PVC deletion
```

**Access Modes:**

- **ReadWriteOnce (RWO)**: Single-node read-write (most common)
- **ReadOnlyMany (ROX)**: Multiple nodes, read-only
- **ReadWriteMany (RWX)**: Multiple nodes, read-write (requires a shared filesystem like NFS; Phase 2)

**Volume Expansion (if `allowVolumeExpansion: true`):**

```
1. User edits the PVC: spec.resources.requests.storage: 20Gi (was 10Gi)
2. The CSI controller calls ControllerExpandVolume
3. LightningStor expands the volume backend
4. The CSI node plugin calls NodeExpandVolume
5. Filesystem resize: resize2fs /dev/nbd0
```
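Both `CreateVolume` and `ControllerExpandVolume` receive a `capacity_range` in bytes, which drivers typically normalize to a whole allocation unit before calling the backend. A minimal Go sketch, under the assumption (not stated by this spec) that LightningStor allocates in whole GiB:

```go
package main

import "fmt"

// gib is one gibibyte in bytes.
const gib = 1 << 30

// roundUpToGiB rounds a CSI capacity_range.required_bytes value up to a
// whole GiB before it is passed to the storage backend. This normalization
// step is a common CSI-driver convention, assumed here for LightningStor.
func roundUpToGiB(requiredBytes int64) int64 {
	return (requiredBytes + gib - 1) / gib * gib
}

func main() {
	fmt.Println(roundUpToGiB(10 * gib))   // 10737418240 (the 10Gi request above)
	fmt.Println(roundUpToGiB(10*gib + 1)) // rounds up to 11 GiB
}
```

Rounding up (rather than down) keeps the provisioned volume at least as large as the PVC request, which is what the PV `capacity.storage` field must reflect.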
### 6. PlasmaVMC Integration

**Phase 1 (MVP):** Use containerd as the default CRI

- k3s ships with containerd embedded
- Standard OCI container runtime
- No changes needed for Phase 1

**Phase 3 (Future):** Custom CRI for VM-backed pods

**Motivation:**

- **Enhanced Isolation**: Stronger security boundary than containers
- **Multi-Tenant Security**: Prevent container-escape attacks
- **Consistent Runtime**: Unify VM and container workloads on PlasmaVMC

**Architecture:**

- PlasmaVMC implements the CRI (Container Runtime Interface)
- Each pod runs as a lightweight VM (Firecracker microVM)
- Pod containers run inside the VM (still using containerd within the VM)
- kubelet communicates with the PlasmaVMC CRI endpoint instead of containerd

**CRI Interface Implementation:**

**RuntimeService:**

- `RunPodSandbox`: Create a Firecracker microVM for the pod
- `StopPodSandbox`: Stop the microVM
- `RemovePodSandbox`: Delete the microVM
- `PodSandboxStatus`: Query microVM status
- `ListPodSandbox`: List all pod microVMs
- `CreateContainer`: Create a container inside the microVM
- `StartContainer`, `StopContainer`, `RemoveContainer`: Container lifecycle
- `ExecSync`, `Exec`: Execute commands in a container
- `Attach`: Attach to container stdio

**ImageService:**

- `PullImage`: Download a container image (delegate to internal containerd)
- `RemoveImage`: Delete an image
- `ListImages`: List cached images
- `ImageStatus`: Query image metadata

**Implementation Strategy:**

```
┌─────────────────────────────────────────┐
│         kubelet (k3s agent)             │
└─────────────┬───────────────────────────┘
              │ CRI gRPC
              ▼
┌─────────────────────────────────────────┐
│     PlasmaVMC CRI Server (Rust)         │
│  - RunPodSandbox   → Create microVM     │
│  - CreateContainer → Run in VM          │
└─────────────┬───────────────────────────┘
              │
              ▼
┌─────────────────────────────────────────┐
│      Firecracker VMM (per pod)          │
│  ┌───────────────────────────────────┐  │
│  │  Pod VM (minimal Linux kernel)    │  │
│  │  ┌──────────────────────────────┐ │  │
│  │  │  containerd (in-VM)          │ │  │
│  │  │  - Container 1               │ │  │
│  │  │  - Container 2               │ │  │
│  │  └──────────────────────────────┘ │  │
│  └───────────────────────────────────┘  │
└─────────────────────────────────────────┘
```

**Configuration (Phase 3):**

```nix
services.k8shost = {
  enable = true;
  cri = "plasmavmc";  # Instead of "containerd"
  plasmavmc = {
    endpoint = "unix:///var/run/plasmavmc/cri.sock";
    vmKernel = "/var/lib/plasmavmc/vmlinux.bin";
    vmRootfs = "/var/lib/plasmavmc/rootfs.ext4";
  };
};
```

**Benefits:**

- Stronger isolation for untrusted workloads
- Leverages existing PlasmaVMC infrastructure
- Consistent management across VM and K8s workloads

**Challenges:**

- Performance overhead (microVM startup time, memory overhead)
- Image caching complexity (containerd is needed inside the VM)
- Networking integration (the CNI must configure the VM network)

**Decision:** Defer to Phase 3; focus on standard containerd for the MVP.

## Multi-Tenant Model

### Namespace Strategy

**Principle:** One K8s namespace per PlasmaCloud project.

**Namespace Naming:**

- **Project namespaces**: `project-<project-id>` (e.g., `project-550e8400-e29b-41d4-a716-446655440000`)
- **Org shared namespaces** (optional): `org-<org-id>-shared` (for shared resources like monitoring)
- **System namespaces**: `kube-system`, `kube-public`, `kube-node-lease`, `default`

**Namespace Lifecycle:**

- Created automatically when a project provisions a K8s cluster
- Labeled with `org_id` and `project_id` for RBAC and billing
- Deleted when the project is deleted (with a grace period)

**Namespace Metadata:**

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: project-550e8400-e29b-41d4-a716-446655440000
  labels:
    plasmacloud.io/org-id: "org-123"
    plasmacloud.io/project-id: "proj-456"
    plasmacloud.io/tenant-type: "project"
  annotations:
    plasmacloud.io/project-name: "my-web-app"
    plasmacloud.io/created-by: "user@example.com"
```

### RBAC Templates
project-550e8400-e29b-41d4-a716-446655440000 rules: - apiGroups: ["*"] resources: ["*"] verbs: ["*"] --- apiVersion: rbac.authorization.k8s.io/v1 kind: RoleBinding metadata: name: org-admin-binding namespace: project-550e8400-e29b-41d4-a716-446655440000 subjects: - kind: Group name: org:org-123 apiGroup: rbac.authorization.k8s.io roleRef: kind: Role name: org-admin apiGroup: rbac.authorization.k8s.io ``` **Project Admin Role (full access to specific project namespace):** ```yaml apiVersion: rbac.authorization.k8s.io/v1 kind: Role metadata: name: project-admin namespace: project-550e8400-e29b-41d4-a716-446655440000 rules: - apiGroups: ["", "apps", "batch", "networking.k8s.io", "storage.k8s.io"] resources: ["*"] verbs: ["*"] --- apiVersion: rbac.authorization.k8s.io/v1 kind: RoleBinding metadata: name: project-admin-binding namespace: project-550e8400-e29b-41d4-a716-446655440000 subjects: - kind: Group name: project:proj-456 apiGroup: rbac.authorization.k8s.io roleRef: kind: Role name: project-admin apiGroup: rbac.authorization.k8s.io ``` **Project Viewer Role (read-only access):** ```yaml apiVersion: rbac.authorization.k8s.io/v1 kind: Role metadata: name: project-viewer namespace: project-550e8400-e29b-41d4-a716-446655440000 rules: - apiGroups: ["", "apps", "batch", "networking.k8s.io"] resources: ["pods", "services", "deployments", "replicasets", "configmaps", "secrets"] verbs: ["get", "list", "watch"] - apiGroups: [""] resources: ["pods/log"] verbs: ["get", "list"] --- apiVersion: rbac.authorization.k8s.io/v1 kind: RoleBinding metadata: name: project-viewer-binding namespace: project-550e8400-e29b-41d4-a716-446655440000 subjects: - kind: Group name: project:proj-456:viewer apiGroup: rbac.authorization.k8s.io roleRef: kind: Role name: project-viewer apiGroup: rbac.authorization.k8s.io ``` **ClusterRole for Node Access (for cluster admins):** ```yaml apiVersion: rbac.authorization.k8s.io/v1 kind: ClusterRole metadata: name: plasmacloud-cluster-admin rules: - 
apiGroups: [""] resources: ["nodes", "persistentvolumes"] verbs: ["*"] - apiGroups: ["storage.k8s.io"] resources: ["storageclasses"] verbs: ["*"] --- apiVersion: rbac.authorization.k8s.io/v1 kind: ClusterRoleBinding metadata: name: plasmacloud-cluster-admin-binding subjects: - kind: Group name: system:plasmacloud-admins apiGroup: rbac.authorization.k8s.io roleRef: kind: ClusterRole name: plasmacloud-cluster-admin apiGroup: rbac.authorization.k8s.io ``` ### Network Isolation **Default NetworkPolicy (deny all, except DNS):** ```yaml apiVersion: networking.k8s.io/v1 kind: NetworkPolicy metadata: name: default-deny-all namespace: project-550e8400-e29b-41d4-a716-446655440000 spec: podSelector: {} # Apply to all pods policyTypes: - Ingress - Egress egress: - to: - namespaceSelector: matchLabels: kubernetes.io/metadata.name: kube-system ports: - protocol: UDP port: 53 # DNS ``` **Allow Ingress from LoadBalancer:** ```yaml apiVersion: networking.k8s.io/v1 kind: NetworkPolicy metadata: name: allow-loadbalancer namespace: project-550e8400-e29b-41d4-a716-446655440000 spec: podSelector: matchLabels: app: web policyTypes: - Ingress ingress: - from: - ipBlock: cidr: 0.0.0.0/0 # Allow from anywhere (LoadBalancer external traffic) ports: - protocol: TCP port: 8080 ``` **Allow Inter-Namespace Communication (optional, for org-shared services):** ```yaml apiVersion: networking.k8s.io/v1 kind: NetworkPolicy metadata: name: allow-org-shared namespace: project-550e8400-e29b-41d4-a716-446655440000 spec: podSelector: {} policyTypes: - Egress egress: - to: - namespaceSelector: matchLabels: plasmacloud.io/org-id: "org-123" plasmacloud.io/tenant-type: "org-shared" ``` **PrismNET Enforcement:** - NetworkPolicies are translated to OVN ACLs by PrismNET CNI controller - Enforced at OVN logical switch level (low-level packet filtering) ### Resource Quotas **CPU and Memory Quotas:** ```yaml apiVersion: v1 kind: ResourceQuota metadata: name: project-compute-quota namespace: 
project-550e8400-e29b-41d4-a716-446655440000 spec: hard: requests.cpu: "10" # 10 CPU cores requests.memory: "20Gi" # 20 GB RAM limits.cpu: "20" # Allow bursting to 20 cores limits.memory: "40Gi" # Allow bursting to 40 GB RAM ``` **Storage Quotas:** ```yaml apiVersion: v1 kind: ResourceQuota metadata: name: project-storage-quota namespace: project-550e8400-e29b-41d4-a716-446655440000 spec: hard: persistentvolumeclaims: "10" # Max 10 PVCs requests.storage: "100Gi" # Total storage requests ``` **Object Count Quotas:** ```yaml apiVersion: v1 kind: ResourceQuota metadata: name: project-object-quota namespace: project-550e8400-e29b-41d4-a716-446655440000 spec: hard: pods: "50" services: "20" services.loadbalancers: "5" # Max 5 LoadBalancer services (limit external IPs) configmaps: "50" secrets: "50" ``` **Quota Enforcement:** - K8s admission controller rejects resource creation exceeding quota - User receives clear error message - Quota usage visible in `kubectl describe quota` ## Deployment Model ### Single-Server (Development/Small) **Target Use Case:** - Development and testing environments - Small production workloads (<10 nodes) - Cost-sensitive deployments **Architecture:** - Single k3s server node with embedded SQLite datastore - Control plane and worker colocated - No HA guarantees **k3s Server Command:** ```bash k3s server \ --data-dir=/var/lib/k8shost \ --disable=servicelb,traefik,flannel \ --flannel-backend=none \ --disable-network-policy \ --cluster-domain=cluster.local \ --service-cidr=10.96.0.0/12 \ --cluster-cidr=10.244.0.0/16 \ --authentication-token-webhook-config-file=/etc/k8shost/iam-webhook.yaml \ --bind-address=0.0.0.0 \ --advertise-address=192.168.1.100 \ --tls-san=k8s-api.example.com ``` **NixOS Configuration:** ```nix { config, lib, pkgs, ... 
}: { services.k8shost = { enable = true; mode = "server"; datastore = "sqlite"; # Embedded SQLite disableComponents = ["servicelb" "traefik" "flannel"]; networking = { serviceCIDR = "10.96.0.0/12"; clusterCIDR = "10.244.0.0/16"; clusterDomain = "cluster.local"; }; prismnet = { enable = true; endpoint = "prismnet-server:5000"; ovnNorthbound = "tcp:prismnet-server:6641"; ovnSouthbound = "tcp:prismnet-server:6642"; }; fiberlb = { enable = true; endpoint = "fiberlb-server:7000"; externalIpPool = "192.168.100.0/24"; }; iam = { enable = true; webhookEndpoint = "https://iam-server:3000/apis/iam.plasmacloud.io/v1/authenticate"; caCertFile = "/etc/k8shost/ca.crt"; clientCertFile = "/etc/k8shost/client.crt"; clientKeyFile = "/etc/k8shost/client.key"; }; flashdns = { enable = true; endpoint = "flashdns-server:6000"; clusterDomain = "cluster.local"; recordTTL = 30; }; lightningstor = { enable = true; endpoint = "lightningstor-server:8000"; csiNodeDaemonSet = true; # Deploy CSI node plugin as DaemonSet }; }; # Open firewall for K8s API networking.firewall.allowedTCPPorts = [ 6443 ]; } ``` **Limitations:** - No HA (single point of failure) - SQLite has limited concurrency - Control plane downtime affects entire cluster ### HA Cluster (Production) **Target Use Case:** - Production workloads requiring high availability - Large clusters (>10 nodes) - Mission-critical applications **Architecture:** - 3 or 5 k3s server nodes (odd number for quorum) - Embedded etcd (Raft consensus, HA datastore) - Load balancer in front of API servers - Agent nodes for workload scheduling **k3s Server Command (each server node):** ```bash k3s server \ --data-dir=/var/lib/k8shost \ --disable=servicelb,traefik,flannel \ --flannel-backend=none \ --disable-network-policy \ --cluster-domain=cluster.local \ --service-cidr=10.96.0.0/12 \ --cluster-cidr=10.244.0.0/16 \ --authentication-token-webhook-config-file=/etc/k8shost/iam-webhook.yaml \ --cluster-init \ # First server only --server 
https://k8s-api-lb.internal:6443 \ # Join existing cluster (not for first server) --tls-san=k8s-api-lb.example.com \ --tls-san=k8s-api.example.com ``` **k3s Agent Command (worker nodes):** ```bash k3s agent \ --server https://k8s-api-lb.internal:6443 \ --token ``` **NixOS Configuration (Server Node):** ```nix { config, lib, pkgs, ... }: { services.k8shost = { enable = true; mode = "server"; datastore = "etcd"; # Embedded etcd for HA clusterInit = true; # Set to false for joining servers serverUrl = "https://k8s-api-lb.internal:6443"; # For joining servers # ... same integrations as single-server ... }; # High availability settings systemd.services.k8shost = { serviceConfig = { Restart = "always"; RestartSec = "10s"; }; }; } ``` **Load Balancer Configuration (FiberLB):** ```yaml # External LoadBalancer for API access apiVersion: v1 kind: LoadBalancer metadata: name: k8s-api-lb spec: listeners: - protocol: TCP port: 6443 backend_pool: k8s-api-servers pools: - name: k8s-api-servers algorithm: round-robin members: - address: 192.168.1.101 # server-1 port: 6443 - address: 192.168.1.102 # server-2 port: 6443 - address: 192.168.1.103 # server-3 port: 6443 health_check: type: tcp interval: 10s timeout: 5s retries: 3 ``` **Datastore Options:** #### Option 1: Embedded etcd (Recommended for MVP) **Pros:** - Built-in to k3s, no external dependencies - Proven, battle-tested (CNCF etcd project) - Automatic HA with Raft consensus - Easy setup (just `--cluster-init`) **Cons:** - Another distributed datastore (in addition to Chainfire/FlareDB) - etcd-specific operations (backup, restore, defragmentation) #### Option 2: FlareDB as External Datastore **Pros:** - Unified storage layer for PlasmaCloud - Leverage existing FlareDB deployment - Simplified infrastructure (one less system to manage) **Cons:** - k3s requires etcd API compatibility - FlareDB would need to implement etcd v3 API (significant effort) - Untested for K8s workloads **Recommendation for MVP:** Use embedded etcd for 
HA mode. Investigate FlareDB etcd compatibility in Phase 2 or 3. **Backup and Disaster Recovery:** ```bash # etcd snapshot (on any server node) k3s etcd-snapshot save --name backup-$(date +%Y%m%d-%H%M%S) # List snapshots k3s etcd-snapshot ls # Restore from snapshot k3s server --cluster-reset --cluster-reset-restore-path=/var/lib/k8shost/server/db/snapshots/backup-20250101-120000 ``` ### NixOS Module Integration **Module Structure:** ``` nix/modules/ ├── k8shost.nix # Main module ├── k8shost/ │ ├── controller.nix # FiberLB, FlashDNS controllers │ ├── csi.nix # LightningStor CSI driver │ └── cni.nix # PrismNET CNI plugin ``` **Main Module (`nix/modules/k8shost.nix`):** ```nix { config, lib, pkgs, ... }: with lib; let cfg = config.services.k8shost; in { options.services.k8shost = { enable = mkEnableOption "PlasmaCloud K8s Hosting Service"; mode = mkOption { type = types.enum ["server" "agent"]; default = "server"; description = "Run as server (control plane) or agent (worker)"; }; datastore = mkOption { type = types.enum ["sqlite" "etcd"]; default = "sqlite"; description = "Datastore backend (sqlite for single-server, etcd for HA)"; }; disableComponents = mkOption { type = types.listOf types.str; default = ["servicelb" "traefik" "flannel"]; description = "k3s components to disable"; }; networking = { serviceCIDR = mkOption { type = types.str; default = "10.96.0.0/12"; description = "CIDR for service ClusterIPs"; }; clusterCIDR = mkOption { type = types.str; default = "10.244.0.0/16"; description = "CIDR for pod IPs"; }; clusterDomain = mkOption { type = types.str; default = "cluster.local"; description = "Cluster DNS domain"; }; }; # Integration options (prismnet, fiberlb, iam, flashdns, lightningstor) # ... 
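
    # NOTE: a sketch of one of the elided integration option groups. The option
    # names below are assumptions mirroring the single-server configuration
    # example earlier in this spec (fiberlb shown; prismnet, iam, flashdns, and
    # lightningstor would follow the same pattern).
    fiberlb = {
      enable = mkEnableOption "FiberLB LoadBalancer controller";
      endpoint = mkOption {
        type = types.str;
        example = "fiberlb-server:7000";
        description = "FiberLB gRPC endpoint used for LoadBalancer reconciliation";
      };
      externalIpPool = mkOption {
        type = types.str;
        example = "192.168.100.0/24";
        description = "CIDR pool from which external LoadBalancer IPs are allocated";
      };
    };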
  };

  config = mkIf cfg.enable {
    # Install k3s package
    environment.systemPackages = [ pkgs.k3s ];

    # Create systemd service
    systemd.services.k8shost = {
      description = "PlasmaCloud K8s Hosting Service (k3s)";
      after = [ "network.target" "iam.service" "prismnet.service" ];
      requires = [ "iam.service" "prismnet.service" ];
      wantedBy = [ "multi-user.target" ];
      serviceConfig = {
        Type = "notify";
        ExecStart = "${pkgs.k3s}/bin/k3s ${cfg.mode} ${concatStringsSep " " (buildServerArgs cfg)}";
        KillMode = "process";
        Delegate = "yes";
        LimitNOFILE = 1048576;
        LimitNPROC = "infinity";
        LimitCORE = "infinity";
        TasksMax = "infinity";
        Restart = "always";
        RestartSec = "5s";
      };
    };

    # Create configuration files
    environment.etc."k8shost/iam-webhook.yaml" = {
      text = generateIAMWebhookConfig cfg.iam;
      mode = "0600";
    };

    # Deploy controllers (FiberLB, FlashDNS, etc.)
    # ... (as separate systemd services or in-cluster deployments)
  };
}
```

## API Server Configuration

### k3s Server Flags (Complete)

```bash
k3s server \
  # Data and cluster configuration
  --data-dir=/var/lib/k8shost \
  --cluster-init \                             # For first server in HA cluster
  --server https://k8s-api-lb.internal:6443 \  # Join existing HA cluster
  --token <token> \                            # Secure join token
  # Disable default components
  --disable=servicelb,traefik,flannel,local-storage \
  --flannel-backend=none \
  --disable-network-policy \
  # Network configuration
  --cluster-domain=cluster.local \
  --service-cidr=10.96.0.0/12 \
  --cluster-cidr=10.244.0.0/16 \
  --service-node-port-range=30000-32767 \
  # API server configuration
  --bind-address=0.0.0.0 \
  --advertise-address=192.168.1.100 \
  --tls-san=k8s-api.example.com \
  --tls-san=k8s-api-lb.example.com \
  # Authentication
  --authentication-token-webhook-config-file=/etc/k8shost/iam-webhook.yaml \
  --authentication-token-webhook-cache-ttl=2m \
  # Authorization (RBAC enabled by default)
  # --authorization-mode=Node,RBAC \           # Default, no need to specify
  # Audit logging
  --kube-apiserver-arg=audit-log-path=/var/log/k8shost/audit.log \
  --kube-apiserver-arg=audit-log-maxage=30 \
  --kube-apiserver-arg=audit-log-maxbackup=10 \
  --kube-apiserver-arg=audit-log-maxsize=100 \
  # Feature gates (if needed)
  # --kube-apiserver-arg=feature-gates=SomeFeature=true
```

### Authentication Webhook Configuration

**File: `/etc/k8shost/iam-webhook.yaml`**

```yaml
apiVersion: v1
kind: Config
clusters:
- name: iam-webhook
  cluster:
    server: https://iam-server:3000/apis/iam.plasmacloud.io/v1/authenticate
    certificate-authority: /etc/k8shost/ca.crt
users:
- name: k8s-apiserver
  user:
    client-certificate: /etc/k8shost/apiserver-client.crt
    client-key: /etc/k8shost/apiserver-client.key
current-context: webhook
contexts:
- context:
    cluster: iam-webhook
    user: k8s-apiserver
  name: webhook
```

**Certificate Management:**

- CA certificate: Issued by the PlasmaCloud IAM PKI
- Client certificate: Used by kube-apiserver to authenticate to the IAM webhook
- Rotation: Certificates expire after 1 year and are auto-renewed by IAM

## Security

### TLS/mTLS

**Component Communication:**

| Source | Destination | Protocol | Auth Method |
|--------|-------------|----------|-------------|
| kube-apiserver | IAM webhook | HTTPS + mTLS | Client cert |
| FiberLB controller | FiberLB gRPC | gRPC + TLS | IAM token |
| FlashDNS controller | FlashDNS gRPC | gRPC + TLS | IAM token |
| LightningStor CSI | LightningStor gRPC | gRPC + TLS | IAM token |
| PrismNET CNI | PrismNET gRPC | gRPC + TLS | IAM token |
| kubectl | kube-apiserver | HTTPS | IAM token (Bearer) |

**Certificate Issuance:**

- All certificates issued by the IAM service (centralized PKI)
- Automatic renewal before expiration
- Certificate revocation via the IAM CRL

### Pod Security

**Pod Security Standards (PSS):**

- **Baseline Profile**: Enforced on all namespaces by default
  - Deny privileged containers
  - Deny host network/PID/IPC
  - Deny hostPath volumes
  - Deny privilege escalation
- **Restricted Profile**: Optional, for highly sensitive workloads

**Example PodSecurityPolicy (deprecated since K8s 1.25; use PSS instead):**

```yaml
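# NOTE (sketch): with Pod Security Standards, the profile is selected via
# namespace labels rather than a cluster-scoped policy object. The Baseline
# enforcement described above would correspond to labels such as:
#
#   apiVersion: v1
#   kind: Namespace
#   metadata:
#     name: project-550e8400-e29b-41d4-a716-446655440000
#     labels:
#       pod-security.kubernetes.io/enforce: baseline
#       pod-security.kubernetes.io/warn: restricted
#
# The deprecated PodSecurityPolicy form, for reference: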
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: restricted
spec:
  privileged: false
  allowPrivilegeEscalation: false
  requiredDropCapabilities:
  - ALL
  volumes:
  - configMap
  - emptyDir
  - projected
  - secret
  - downwardAPI
  - persistentVolumeClaim
  runAsUser:
    rule: MustRunAsNonRoot
  seLinux:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
```

**Security Contexts (enforced):**

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: secure-pod
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 2000
  containers:
  - name: app
    image: myapp:latest
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop:
        - ALL
```

**Service Account Permissions:**

- Minimal RBAC permissions by default
- Principle of least privilege
- No cluster-admin access for user workloads

## Testing Strategy

### Unit Tests

**Controllers (Go):**

```go
// fiberlb_controller_test.go
func TestReconcileLoadBalancer(t *testing.T) {
	// Mock K8s client
	client := fake.NewSimpleClientset()

	// Mock FiberLB gRPC client
	mockFiberLB := &mockFiberLBClient{}

	controller := NewFiberLBController(client, mockFiberLB)

	// Create test service
	svc := &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: "test-svc", Namespace: "default"},
		Spec:       corev1.ServiceSpec{Type: corev1.ServiceTypeLoadBalancer},
	}

	// Reconcile
	err := controller.Reconcile(svc)
	assert.NoError(t, err)

	// Verify FiberLB API called
	assert.Equal(t, 1, mockFiberLB.createLoadBalancerCalls)
}
```

**CNI Plugin (Rust):**

```rust
#[test]
fn test_cni_add() {
    let mut mock_ovn = MockOVNClient::new();
    mock_ovn
        .expect_allocate_ip()
        .returning(|_ns, _pod| Ok("10.244.1.5/24".to_string()));

    let plugin = PrismNETPlugin::new(mock_ovn);
    let result = plugin.handle_add(/* ... */);

    assert!(result.is_ok());
    assert_eq!(result.unwrap().ip, "10.244.1.5");
}
```

**CSI Driver (Go):**

```go
func TestCreateVolume(t *testing.T) {
	mockLightningStor := &mockLightningStorClient{}
	mockLightningStor.On("CreateVolume", mock.Anything).Return(&Volume{ID: "vol-123"}, nil)

	driver := NewCSIDriver(mockLightningStor)

	req := &csi.CreateVolumeRequest{
		Name:          "test-vol",
		CapacityRange: &csi.CapacityRange{RequiredBytes: 10 * 1024 * 1024 * 1024},
	}

	resp, err := driver.CreateVolume(context.Background(), req)
	assert.NoError(t, err)
	assert.Equal(t, "vol-123", resp.Volume.VolumeId)
}
```

### Integration Tests

**Test Environment:**

- Single-node k3s cluster (kind or k3s in Docker)
- Mock or real PlasmaCloud services (PrismNET, FiberLB, etc.)
- Automated setup and teardown

**Test Cases:**

**1. Single-Pod Deployment:**

```bash
#!/bin/bash
set -e

# Deploy nginx pod
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx
EOF
kubectl wait --for=condition=Ready pod/nginx --timeout=60s

# Cleanup
kubectl delete pod nginx
```

**3. Persistent Storage:**

```bash
#!/bin/bash
set -e

# Create PVC and a pod that writes to it
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pvc
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  containers:
  - name: app
    image: busybox
    command: ["sh", "-c", "echo hello > /data/test.txt && sleep 3600"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: test-pvc
EOF
kubectl wait --for=condition=Ready pod/test-pod --timeout=60s

# Verify file written
kubectl exec test-pod -- cat /data/test.txt | grep hello || exit 1

# Cleanup
kubectl delete pod test-pod
kubectl delete pvc test-pvc
```

**4. Multi-Tenant Isolation:**

```bash
#!/bin/bash
set -e

# Create two namespaces
kubectl create namespace project-a
kubectl create namespace project-b

# Deploy a pod in each
kubectl run pod-a --image=nginx -n project-a
kubectl run pod-b --image=nginx -n project-b

# Verify network isolation (if NetworkPolicies enabled):
# Pod A should NOT be able to reach Pod B
POD_B_IP=$(kubectl get pod pod-b -n project-b -o jsonpath='{.status.podIP}')
kubectl exec pod-a -n project-a -- curl --max-time 5 http://$POD_B_IP && exit 1 || true

# Cleanup
kubectl delete ns project-a project-b
```

### E2E Test Scenario

**End-to-End Test: Deploy Multi-Tier Application**

```bash
#!/bin/bash
set -ex

NAMESPACE="project-123"

# 1. Create namespace
kubectl create namespace $NAMESPACE

# 2. Deploy PostgreSQL with PVC
kubectl apply -n $NAMESPACE -f - <