photoncloud-monorepo/docs/por/T025-k8s-hosting/spec.md

K8s Hosting Specification

Overview

PlasmaCloud's K8s Hosting service provides managed Kubernetes clusters for multi-tenant container orchestration. This specification defines a k3s-based architecture that integrates deeply with existing PlasmaCloud infrastructure components: PrismNET for networking, FiberLB for load balancing, IAM for authentication/authorization, FlashDNS for service discovery, and LightningStor for persistent storage.

Purpose

Enable customers to deploy and manage containerized workloads using standard Kubernetes APIs while benefiting from PlasmaCloud's integrated infrastructure services. The system provides:

  • Standard K8s API compatibility: Use kubectl, Helm, and existing K8s tooling
  • Multi-tenant isolation: Project-based namespaces with IAM-backed RBAC
  • Deep integration: Leverage PrismNET SDN, FiberLB load balancing, LightningStor block storage
  • Production-ready: HA control plane, automated failover, comprehensive monitoring

Scope

Phase 1 (MVP, 3-4 months):

  • Core K8s APIs (Pods, Services, Deployments, ReplicaSets, Namespaces, ConfigMaps, Secrets)
  • LoadBalancer services via FiberLB
  • Persistent storage via LightningStor CSI
  • IAM authentication and RBAC
  • PrismNET CNI for pod networking
  • FlashDNS service discovery

Future Phases:

  • PlasmaVMC integration for VM-backed pods (enhanced isolation)
  • StatefulSets, DaemonSets, Jobs/CronJobs
  • Network policies with PrismNET enforcement
  • Horizontal Pod Autoscaler
  • FlareDB as k3s datastore

Architecture Decision Summary

Base Technology: k3s

  • Lightweight K8s distribution (single binary, minimal dependencies)
  • Production-proven (CNCF certified, widely deployed)
  • Flexible architecture allowing component replacement
  • Embedded SQLite (single-server) or etcd (HA cluster)
  • 3-4 month timeline achievable

Component Replacement Strategy:

  • Disable: servicelb (replaced by FiberLB), traefik (use FiberLB), flannel (replaced by PrismNET), local-path-provisioner (replaced by LightningStor CSI)
  • Keep: kube-apiserver, kube-scheduler, kube-controller-manager, kubelet, containerd
  • Add: Custom controllers for FiberLB, FlashDNS, IAM webhook, LightningStor CSI, PrismNET CNI
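
The disable list above maps onto standard k3s server flags; a sketch of the invocation (the webhook config path matches the IAM section later in this spec, and the custom controllers would be installed separately as manifests or DaemonSets):

```shell
# Disable the stock components that PlasmaCloud replaces:
#   servicelb/traefik -> FiberLB, flannel -> PrismNET CNI,
#   local-path-provisioner -> LightningStor CSI.
k3s server \
  --disable servicelb \
  --disable traefik \
  --disable local-storage \
  --flannel-backend=none \
  --disable-network-policy \
  --kube-apiserver-arg=authentication-token-webhook-config-file=/etc/k8shost/iam-webhook.yaml
```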

Architecture

Base: k3s with Selective Component Replacement

k3s Core (Keep):

  • kube-apiserver: K8s REST API server with IAM webhook authentication
  • kube-scheduler: Pod scheduling with resource awareness
  • kube-controller-manager: Core controllers (replication, endpoints, service accounts, etc.)
  • kubelet: Node agent managing pod lifecycle via containerd CRI
  • containerd: Container runtime (Phase 1), later replaceable by PlasmaVMC CRI
  • kube-proxy: Service networking (iptables/ipvs mode)

k3s Components (Disable):

  • servicelb: Default LoadBalancer implementation → Replaced by FiberLB controller
  • traefik: Ingress controller → Replaced by FiberLB L7 capabilities
  • flannel: CNI plugin → Replaced by PrismNET CNI
  • local-path-provisioner: Storage provisioner → Replaced by LightningStor CSI

PlasmaCloud Custom Components (Add):

  • PrismNET CNI Plugin: Pod networking via OVN logical switches
  • FiberLB Controller: LoadBalancer service reconciliation
  • IAM Webhook Server: Token validation and user mapping
  • FlashDNS Controller: Service DNS record synchronization
  • LightningStor CSI Driver: PersistentVolume provisioning and attachment

Component Topology

┌─────────────────────────────────────────────────────────────┐
│                     k3s Control Plane                       │
│  ┌──────────────┐  ┌────────────┐  ┌──────────────────┐   │
│  │ kube-apiserver│◄─┤ IAM Webhook├──┤ IAM Service      │   │
│  │              │  │            │  │ (Authentication) │   │
│  └──────┬───────┘  └────────────┘  └──────────────────┘   │
│         │                                                   │
│  ┌──────▼───────┐  ┌──────────────┐  ┌────────────────┐   │
│  │kube-scheduler│  │kube-controller│  │ etcd/SQLite    │   │
│  │              │  │   -manager    │  │  (Datastore)   │   │
│  └──────────────┘  └──────────────┘  └────────────────┘   │
└─────────────────────────────────────────────────────────────┘
                           │
        ┌──────────────────┼──────────────────┐
        │                  │                  │
┌───────▼───────┐  ┌───────▼───────┐  ┌──────▼──────┐
│ FiberLB       │  │ FlashDNS      │  │ LightningStor│
│ Controller    │  │ Controller    │  │ CSI Plugin   │
│ (Watch Svcs)  │  │ (Sync DNS)    │  │ (Provision)  │
└───────┬───────┘  └───────┬───────┘  └──────┬───────┘
        │                  │                  │
        ▼                  ▼                  ▼
┌──────────────┐  ┌──────────────┐  ┌────────────────┐
│ FiberLB      │  │ FlashDNS     │  │ LightningStor  │
│ gRPC API     │  │ gRPC API     │  │ gRPC API       │
└──────────────┘  └──────────────┘  └────────────────┘

┌─────────────────────────────────────────────────────────────┐
│                      k3s Worker Nodes                        │
│  ┌──────────────┐  ┌────────────┐  ┌──────────────────┐   │
│  │  kubelet     │◄─┤containerd  ├──┤ Pods (containers)│   │
│  │              │  │    CRI     │  │                  │   │
│  └──────┬───────┘  └────────────┘  └──────────────────┘   │
│         │                                                   │
│  ┌──────▼───────┐  ┌──────────────┐                        │
│  │ PrismNET CNI  │◄─┤ kube-proxy   │                        │
│  │ (Pod Network)│  │ (Service Net)│                        │
│  └──────┬───────┘  └──────────────┘                        │
│         │                                                   │
│         ▼                                                   │
│  ┌──────────────┐                                          │
│  │ PrismNET OVN  │                                          │
│  │ (ovs-vswitchd)│                                         │
│  └──────────────┘                                          │
└─────────────────────────────────────────────────────────────┘

Data Flow Examples

1. Pod Creation:

kubectl create pod → kube-apiserver (IAM auth) → scheduler → kubelet → containerd
                                                                  ↓
                                                            PrismNET CNI
                                                                  ↓
                                                          OVN logical port

2. LoadBalancer Service:

kubectl expose → kube-apiserver → Service created → FiberLB controller watches
                                                           ↓
                                                   FiberLB gRPC API
                                                           ↓
                                               External IP + L4 forwarding

3. PersistentVolume:

PVC created → kube-apiserver → CSI controller → LightningStor CSI driver
                                                         ↓
                                                 LightningStor gRPC
                                                         ↓
                                                   Volume created
                                                         ↓
                                               kubelet → CSI node plugin
                                                         ↓
                                                   Mount to pod

K8s API Subset

Phase 1: Core APIs (Essential)

Pods (v1):

  • Full CRUD operations (create, get, list, update, delete, patch)
  • Watch API for real-time updates
  • Logs streaming (kubectl logs -f)
  • Exec into containers (kubectl exec)
  • Port forwarding (kubectl port-forward)
  • Status: Phase (Pending, Running, Succeeded, Failed), conditions, container states

Services (v1):

  • ClusterIP: Internal cluster networking (default)
  • LoadBalancer: External access via FiberLB
  • Headless: StatefulSet support (clusterIP: None)
  • Service discovery via FlashDNS
  • Endpoint slices for large service backends

Deployments (apps/v1):

  • Declarative desired state (replicas, pod template)
  • Rolling updates with configurable strategy (maxSurge, maxUnavailable)
  • Rollback to previous revision
  • Pause/resume for canary deployments
  • Scaling (manual in Phase 1)

ReplicaSets (apps/v1):

  • Pod replication with label selectors
  • Owned by Deployments (rarely created directly)
  • Orphan/adopt pod ownership

Namespaces (v1):

  • Tenant isolation (one namespace per project)
  • Resource quota enforcement
  • Network policy scope (Phase 2)
  • RBAC scope

ConfigMaps (v1):

  • Non-sensitive configuration data
  • Mount as volumes or environment variables
  • Updates propagate to mounted volumes automatically; pod restarts can be triggered via a checksum annotation on the pod template

Secrets (v1):

  • Sensitive data (passwords, tokens, certificates)
  • Base64 encoded in etcd (at-rest encryption in future phase)
  • Mount as volumes or environment variables
  • Service account tokens

Nodes (v1):

  • Node registration via kubelet
  • Heartbeat and status reporting
  • Capacity and allocatable resources
  • Labels and taints for scheduling

Events (v1):

  • Audit trail of cluster activities
  • Retention policy (1 hour in-memory, longer in etcd)
  • Debugging and troubleshooting

Phase 2: Storage & Config (Required for MVP)

PersistentVolumes (v1):

  • Volume lifecycle independent of pods
  • Access modes: ReadWriteOnce, ReadOnlyMany, ReadWriteMany (as supported by LightningStor; RWX requires a shared filesystem, planned for Phase 2)
  • Reclaim policy: Retain, Delete
  • Status: Available, Bound, Released, Failed

PersistentVolumeClaims (v1):

  • User request for storage
  • Binding to PVs by storage class, capacity, access mode
  • Volume expansion (if storage class allows)

StorageClasses (storage.k8s.io/v1):

  • Dynamic provisioning via LightningStor CSI
  • Parameters: volume type (ssd, hdd), replication factor, org_id, project_id
  • Volume binding mode: Immediate or WaitForFirstConsumer

Phase 3: Advanced (Post-MVP)

StatefulSets (apps/v1):

  • Ordered pod creation/deletion
  • Stable network identities (pod-0, pod-1, ...)
  • Persistent storage per pod via volumeClaimTemplates
  • Use case: Databases, distributed systems

DaemonSets (apps/v1):

  • One pod per node (e.g., log collectors, monitoring agents)
  • Node selector and tolerations

Jobs (batch/v1):

  • Run-to-completion workloads
  • Parallelism and completions
  • Retry policy

CronJobs (batch/v1):

  • Scheduled jobs (cron syntax)
  • Concurrency policy

NetworkPolicies (networking.k8s.io/v1):

  • Ingress and egress rules
  • Label-based pod selection
  • Namespace selectors
  • Requires PrismNET CNI support for OVN ACL translation

Ingress (networking.k8s.io/v1):

  • HTTP/HTTPS routing via FiberLB L7
  • Host-based and path-based routing
  • TLS termination

Deferred APIs (Not in MVP)

  • HorizontalPodAutoscaler (autoscaling/v2): Requires metrics-server
  • VerticalPodAutoscaler: Complex, low priority
  • PodDisruptionBudget: Useful for HA, but post-MVP
  • LimitRange: Resource limits per namespace (future)
  • ResourceQuota: Supported in Phase 1, but advanced features deferred
  • CustomResourceDefinitions (CRDs): Framework exists, but no custom resources in Phase 1
  • APIService: Aggregation layer not needed initially

Integration Specifications

1. PrismNET CNI Plugin

Purpose: Provide pod networking using PrismNET's OVN-based SDN.

Interface: CNI 1.0.0 specification (https://github.com/containernetworking/cni/blob/main/SPEC.md)

Components:

  • CNI binary: /opt/cni/bin/prismnet
  • Configuration: /etc/cni/net.d/10-prismnet.conflist
  • IPAM plugin: /opt/cni/bin/prismnet-ipam (or integrated)

Responsibilities:

  • Create network interface for pod (veth pair)
  • Allocate IP address from namespace-specific subnet
  • Connect pod to OVN logical switch
  • Configure routing for pod egress
  • Enforce network policies (Phase 2)

Configuration Schema:

{
  "cniVersion": "1.0.0",
  "name": "prismnet",
  "type": "prismnet",
  "ipam": {
    "type": "prismnet-ipam",
    "subnet": "10.244.0.0/16",
    "rangeStart": "10.244.0.10",
    "rangeEnd": "10.244.255.254",
    "routes": [
      {"dst": "0.0.0.0/0"}
    ],
    "gateway": "10.244.0.1"
  },
  "ovn": {
    "northbound": "tcp:prismnet-server:6641",
    "southbound": "tcp:prismnet-server:6642",
    "encapType": "geneve"
  },
  "mtu": 1400,
  "prismnetEndpoint": "prismnet-server:5000"
}

CNI Plugin Workflow:

  1. ADD Command (pod creation):

    Input: Container ID, network namespace path, interface name
    Process:
    - Call PrismNET gRPC API: AllocateIP(namespace, pod_name)
    - Create veth pair: one end in pod netns, one in host
    - Add host veth to OVN logical switch port
    - Configure pod veth: IP address, routes, MTU
    - Return: IP config, routes, DNS settings
    
  2. DEL Command (pod deletion):

    Input: Container ID, network namespace path
    Process:
    - Call PrismNET gRPC API: ReleaseIP(namespace, pod_name)
    - Delete OVN logical switch port
    - Delete veth pair
    
  3. CHECK Command (health check):

    Verify interface exists and has expected configuration
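
The ADD command above must print a CNI result object on stdout. A minimal sketch of assembling that result in Python (field names follow the CNI 1.0.0 spec; the sandbox path and DNS server are illustrative assumptions):

```python
import json

def cni_add_result(ip_cidr: str, gateway: str, ifname: str = "eth0") -> str:
    """Build the JSON result a CNI ADD invocation prints on stdout."""
    result = {
        "cniVersion": "1.0.0",
        # sandbox path is hypothetical; the real value is the pod's netns path
        "interfaces": [{"name": ifname, "sandbox": "/var/run/netns/pod"}],
        "ips": [{"address": ip_cidr, "gateway": gateway, "interface": 0}],
        "routes": [{"dst": "0.0.0.0/0"}],
        "dns": {"nameservers": ["10.96.0.10"]},
    }
    return json.dumps(result)

print(cni_add_result("10.244.1.5/24", "10.244.0.1"))
```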
    

API Integration (PrismNET gRPC):

service NetworkService {
  rpc AllocateIP(AllocateIPRequest) returns (AllocateIPResponse);
  rpc ReleaseIP(ReleaseIPRequest) returns (ReleaseIPResponse);
  rpc CreateLogicalSwitch(CreateLogicalSwitchRequest) returns (CreateLogicalSwitchResponse);
}

message AllocateIPRequest {
  string namespace = 1;
  string pod_name = 2;
  string container_id = 3;
}

message AllocateIPResponse {
  string ip_address = 1;  // e.g., "10.244.1.5/24"
  string gateway = 2;
  repeated string dns_servers = 3;
}

OVN Topology:

  • Logical Switch per Namespace: k8s-<namespace> (e.g., k8s-project-123)
  • Logical Router: k8s-cluster-router for inter-namespace routing
  • Logical Switch Ports: One per pod (<pod-name>-<container-id>)
  • ACLs: NetworkPolicy enforcement (Phase 2)

Network Policy Translation (Phase 2):

K8s NetworkPolicy:
  podSelector: app=web
  ingress:
  - from:
    - podSelector: app=frontend
    ports:
    - protocol: TCP
      port: 80

→ OVN ACL:
  direction: to-lport
  match: "ip4.src == $frontend_pods && tcp.dst == 80"
  action: allow-related
  priority: 1000

Address Sets:

  • Dynamic updates as pods are added/removed
  • Efficient ACL matching for large pod groups
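
The NetworkPolicy-to-ACL translation above is mechanical once the source pods are materialized as an OVN address set. A sketch (the function name and address-set naming are hypothetical):

```python
def ovn_acl_match(address_set: str, protocol: str, port: int) -> str:
    """Render an OVN ACL match for 'allow ingress from <set> on <proto>/<port>'."""
    proto = protocol.lower()  # OVN match syntax uses lowercase protocol names
    return f"ip4.src == ${address_set} && {proto}.dst == {port}"

acl = {
    "direction": "to-lport",
    "match": ovn_acl_match("frontend_pods", "TCP", 80),
    "action": "allow-related",
    "priority": 1000,
}
print(acl["match"])  # ip4.src == $frontend_pods && tcp.dst == 80
```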

2. FiberLB LoadBalancer Controller

Purpose: Reconcile K8s Services of type LoadBalancer with FiberLB resources.

Architecture:

  • Controller Process: Runs as a pod in kube-system namespace or embedded in k3s server
  • Watch Resources: Services (type=LoadBalancer), Endpoints
  • Manage Resources: FiberLB LoadBalancers, Listeners, Pools, Members

Controller Logic:

1. Service Watch Loop:

for event := range serviceWatcher {
  if event.Type == Created || event.Type == Updated {
    if service.Spec.Type == "LoadBalancer" {
      reconcileLoadBalancer(service)
    }
  } else if event.Type == Deleted {
    deleteLoadBalancer(service)
  }
}

2. Reconcile Logic:

Input: Service object
Process:
1. Check if FiberLB LoadBalancer exists (by annotation or name mapping)
2. If not exists:
   a. Allocate external IP from pool
   b. Create FiberLB LoadBalancer resource (gRPC CreateLoadBalancer)
   c. Store LoadBalancer ID in service annotation
3. For each service.Spec.Ports:
   a. Create/update FiberLB Listener (protocol, port, algorithm)
4. Get service endpoints:
   a. Create/update FiberLB Pool with backend members (pod IPs, ports)
5. Update service.Status.LoadBalancer.Ingress with external IP
6. If service spec changed:
   a. Update FiberLB resources accordingly

3. Endpoint Watch Loop:

for event := range endpointWatcher {
  service := getServiceForEndpoint(event.Object)
  if service.Spec.Type == "LoadBalancer" {
    updateLoadBalancerPool(service, event.Object)
  }
}

Configuration:

  • External IP Pool: --external-ip-pool=192.168.100.0/24 (CIDR or IP range)
  • FiberLB Endpoint: --fiberlb-endpoint=fiberlb-server:7000 (gRPC address)
  • IP Allocation: First-available or integration with IPAM service
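
First-available allocation from the external pool can be sketched with the stdlib ipaddress module (names are hypothetical; a real controller would persist allocations rather than pass them in):

```python
import ipaddress

def allocate_ip(pool_cidr: str, in_use: set) -> str:
    """Return the first host address in pool_cidr not already allocated."""
    for host in ipaddress.ip_network(pool_cidr).hosts():
        if str(host) not in in_use:
            return str(host)
    # mirrors the IP-exhaustion edge case: surface an error on the service
    raise RuntimeError(f"external IP pool {pool_cidr} exhausted")

print(allocate_ip("192.168.100.0/24", {"192.168.100.1", "192.168.100.2"}))
# -> 192.168.100.3
```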

Service Annotations:

apiVersion: v1
kind: Service
metadata:
  name: web-service
  annotations:
    fiberlb.plasmacloud.io/load-balancer-id: "lb-abc123"
    fiberlb.plasmacloud.io/algorithm: "round-robin"  # round-robin | least-conn | ip-hash
    fiberlb.plasmacloud.io/health-check-path: "/health"
    fiberlb.plasmacloud.io/health-check-interval: "10s"
    fiberlb.plasmacloud.io/health-check-timeout: "5s"
    fiberlb.plasmacloud.io/health-check-retries: "3"
    fiberlb.plasmacloud.io/session-affinity: "client-ip"  # For sticky sessions
spec:
  type: LoadBalancer
  selector:
    app: web
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8080
status:
  loadBalancer:
    ingress:
    - ip: 192.168.100.50

FiberLB gRPC API Integration:

service LoadBalancerService {
  rpc CreateLoadBalancer(CreateLoadBalancerRequest) returns (LoadBalancer);
  rpc UpdateLoadBalancer(UpdateLoadBalancerRequest) returns (LoadBalancer);
  rpc DeleteLoadBalancer(DeleteLoadBalancerRequest) returns (Empty);
  rpc CreateListener(CreateListenerRequest) returns (Listener);
  rpc UpdatePool(UpdatePoolRequest) returns (Pool);
}

message CreateLoadBalancerRequest {
  string name = 1;
  string description = 2;
  string external_ip = 3;  // If empty, allocate from pool
  string org_id = 4;
  string project_id = 5;
}

message CreateListenerRequest {
  string load_balancer_id = 1;
  string protocol = 2;  // TCP, UDP, HTTP, HTTPS
  int32 port = 3;
  string default_pool_id = 4;
  HealthCheck health_check = 5;
}

message UpdatePoolRequest {
  string pool_id = 1;
  repeated PoolMember members = 2;
  string algorithm = 3;
}

message PoolMember {
  string address = 1;  // Pod IP
  int32 port = 2;
  int32 weight = 3;
}

Health Checks:

  • HTTP health checks: Use annotation health-check-path
  • TCP health checks: Connection-based for non-HTTP services
  • Health check failures remove pod from pool (auto-healing)

Edge Cases:

  • Service deletion: Controller must clean up FiberLB resources and release external IP
  • Endpoint churn: Debounce pool updates to avoid excessive FiberLB API calls
  • IP exhaustion: Return error event on service, set status condition
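
Debouncing endpoint churn can be as simple as coalescing pending updates per service and flushing on a timer; a synchronous sketch (class and method names hypothetical, timer omitted):

```python
class PoolUpdateDebouncer:
    """Coalesce per-service pool updates; only the latest member list survives."""

    def __init__(self):
        self.pending = {}  # service key -> latest member list

    def enqueue(self, service_key: str, members: list):
        self.pending[service_key] = members  # later updates overwrite earlier ones

    def flush(self, update_pool) -> int:
        """Push one UpdatePool call per dirty service; returns the call count."""
        count = 0
        for key, members in self.pending.items():
            update_pool(key, members)
            count += 1
        self.pending.clear()
        return count

d = PoolUpdateDebouncer()
d.enqueue("default/web", ["10.244.1.10:8080"])
d.enqueue("default/web", ["10.244.1.10:8080", "10.244.1.11:8080"])
calls = d.flush(lambda key, members: None)
print(calls)  # 1 -- two endpoint events, one FiberLB API call
```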

3. IAM Authentication Webhook

Purpose: Authenticate K8s API requests using PlasmaCloud IAM tokens.

Architecture:

  • Webhook Server: HTTPS endpoint (can be part of IAM service or standalone)
  • Integration Point: kube-apiserver --authentication-token-webhook-config-file
  • Protocol: K8s TokenReview API

Webhook Endpoint: POST /apis/iam.plasmacloud.io/v1/authenticate

Request Flow:

kubectl --token=<IAM-token> get pods
    ↓
kube-apiserver extracts Bearer token
    ↓
POST /apis/iam.plasmacloud.io/v1/authenticate
    body: TokenReview with token
    ↓
IAM webhook validates token
    ↓
Response: authenticated=true, user info, groups
    ↓
kube-apiserver proceeds with RBAC authorization

Request Schema (from kube-apiserver):

{
  "apiVersion": "authentication.k8s.io/v1",
  "kind": "TokenReview",
  "spec": {
    "token": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9..."
  }
}

Response Schema (from IAM webhook):

{
  "apiVersion": "authentication.k8s.io/v1",
  "kind": "TokenReview",
  "status": {
    "authenticated": true,
    "user": {
      "username": "user@example.com",
      "uid": "user-550e8400-e29b-41d4-a716-446655440000",
      "groups": [
        "org:org-123",
        "project:proj-456",
        "system:authenticated"
      ],
      "extra": {
        "org_id": ["org-123"],
        "project_id": ["proj-456"],
        "roles": ["org_admin"]
      }
    }
  }
}

Error Response (invalid token):

{
  "apiVersion": "authentication.k8s.io/v1",
  "kind": "TokenReview",
  "status": {
    "authenticated": false,
    "error": "Invalid or expired token"
  }
}

IAM Token Format:

  • JWT: Signed by IAM service with shared secret or public/private key
  • Claims: sub (user ID), email, org_id, project_id, roles, exp (expiration)
  • Example:
    {
      "sub": "user-550e8400-e29b-41d4-a716-446655440000",
      "email": "user@example.com",
      "org_id": "org-123",
      "project_id": "proj-456",
      "roles": ["org_admin", "project_member"],
      "exp": 1672531200
    }
    
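The shared-secret (HS256) path can be sketched with the Python stdlib alone; a production deployment might instead use asymmetric keys and a JWT library, so treat the secret and claim names below as assumptions taken from the example above:

```python
import base64, hashlib, hmac, json, time

SECRET = b"shared-iam-secret"  # assumption: HS256 with a shared secret

def b64url(data: bytes) -> bytes:
    return base64.urlsafe_b64encode(data).rstrip(b"=")

def b64url_decode(data: str) -> bytes:
    return base64.urlsafe_b64decode(data + "=" * (-len(data) % 4))

def sign(claims: dict) -> str:
    """Issue a compact JWT: b64url(header).b64url(payload).b64url(signature)."""
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url(json.dumps(claims).encode())
    sig = b64url(hmac.new(SECRET, header + b"." + payload, hashlib.sha256).digest())
    return (header + b"." + payload + b"." + sig).decode()

def validate(token: str):
    """Return the claims if the signature verifies and exp is in the future."""
    try:
        header, payload, sig = token.split(".")
        expected = hmac.new(SECRET, f"{header}.{payload}".encode(),
                            hashlib.sha256).digest()
        if not hmac.compare_digest(b64url_decode(sig), expected):
            return None
        claims = json.loads(b64url_decode(payload))
    except Exception:
        return None  # malformed token
    if claims.get("exp", 0) < time.time():
        return None  # expired
    return claims

token = sign({"sub": "user-1", "org_id": "org-123", "exp": time.time() + 3600})
assert validate(token) is not None
h, p, s = token.split(".")
forged = h + "." + b64url(json.dumps({"sub": "admin"}).encode()).decode() + "." + s
assert validate(forged) is None  # tampered payload rejected
```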

User/Group Mapping:

  • User (email): username user@example.com; groups org:<org_id>, project:<project_id>, system:authenticated
  • User (ID): username user-<uid>; groups org:<org_id>, project:<project_id>, system:authenticated
  • Service Account: username sa-<name>@<project_id>; groups org:<org_id>, project:<project_id>, system:serviceaccounts
  • Org Admin: username admin@example.com; groups org:<org_id>, project:<all_projects>, k8s:org-admin

RBAC Integration:

  • Groups are used in RoleBindings and ClusterRoleBindings
  • Example: org:org-123 group gets admin access to all project-* namespaces for that org

Webhook Configuration File (/etc/k8shost/iam-webhook.yaml):

apiVersion: v1
kind: Config
clusters:
- name: iam-webhook
  cluster:
    server: https://iam-server:3000/apis/iam.plasmacloud.io/v1/authenticate
    certificate-authority: /etc/k8shost/ca.crt
users:
- name: k8s-apiserver
  user:
    client-certificate: /etc/k8shost/apiserver-client.crt
    client-key: /etc/k8shost/apiserver-client.key
current-context: webhook
contexts:
- context:
    cluster: iam-webhook
    user: k8s-apiserver
  name: webhook

Performance Considerations:

  • Caching: kube-apiserver caches successful authentications (--authentication-token-webhook-cache-ttl=2m)
  • Timeouts: Webhook must respond within 10s (configurable)
  • Rate Limiting: IAM webhook should handle high request volume (100s of req/s)

4. FlashDNS Service Discovery Controller

Purpose: Synchronize K8s Services and Pods to FlashDNS for cluster DNS resolution.

Architecture:

  • Controller Process: Runs as pod in kube-system or embedded in k3s server
  • Watch Resources: Services, Endpoints, Pods
  • Manage Resources: FlashDNS A/AAAA/SRV records

DNS Hierarchy:

  • Pod A Records: <pod-ip-dashed>.pod.cluster.local → Pod IP
    • Example: 10-244-1-5.pod.cluster.local → 10.244.1.5
  • Service A Records: <service>.<namespace>.svc.cluster.local → ClusterIP or external IP
    • Example: web.default.svc.cluster.local → 10.96.0.100
  • Headless Service: <endpoint>.<service>.<namespace>.svc.cluster.local → Endpoint IPs
    • Example: web-0.web.default.svc.cluster.local → 10.244.1.10
  • SRV Records: _<port>._<protocol>.<service>.<namespace>.svc.cluster.local
    • Example: _http._tcp.web.default.svc.cluster.local → 0 50 80 web.default.svc.cluster.local
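
The naming rules above are mechanical; a sketch (helper names are hypothetical):

```python
def pod_dns_name(pod_ip: str, cluster_domain: str = "cluster.local") -> str:
    """10.244.1.5 -> 10-244-1-5.pod.cluster.local"""
    return pod_ip.replace(".", "-") + ".pod." + cluster_domain

def svc_dns_name(service: str, namespace: str,
                 cluster_domain: str = "cluster.local") -> str:
    return f"{service}.{namespace}.svc.{cluster_domain}"

def srv_dns_name(port_name: str, protocol: str,
                 service: str, namespace: str) -> str:
    return f"_{port_name}._{protocol.lower()}.{svc_dns_name(service, namespace)}"

print(pod_dns_name("10.244.1.5"))               # 10-244-1-5.pod.cluster.local
print(srv_dns_name("http", "TCP", "web", "default"))
# _http._tcp.web.default.svc.cluster.local
```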

Controller Logic:

1. Service Watch:

for event := range serviceWatcher {
  service := event.Object
  switch event.Type {
  case Created, Updated:
    if service.Spec.ClusterIP != "None" {
      // Regular service
      createOrUpdateDNSRecord(
        service.Name+"."+service.Namespace+".svc.cluster.local",  // name
        "A",                                                      // type
        service.Spec.ClusterIP,                                   // value
      )

      if len(service.Status.LoadBalancer.Ingress) > 0 {
        // LoadBalancer service - also publish the external IP
        createOrUpdateDNSRecord(
          service.Name+"."+service.Namespace+".svc.cluster.local",
          "A",
          service.Status.LoadBalancer.Ingress[0].IP,
        )
      }
    } else {
      // Headless service - add one record per endpoint
      endpoints := getEndpoints(service)
      for _, ep := range endpoints {
        createOrUpdateDNSRecord(
          ep.Hostname+"."+service.Name+"."+service.Namespace+".svc.cluster.local",
          "A",
          ep.IP,
        )
      }
    }

    // Create SRV records for each port
    for _, port := range service.Spec.Ports {
      createSRVRecord(service, port)
    }

  case Deleted:
    deleteDNSRecords(service)
  }
}

2. Pod Watch (for pod DNS):

for event := range podWatcher {
  pod := event.Object
  switch event.Type {
  case Created, Updated:
    if pod.Status.PodIP != "" {
      dashedIP := strings.ReplaceAll(pod.Status.PodIP, ".", "-")
      createOrUpdateDNSRecord(
        dashedIP+".pod.cluster.local",  // name
        "A",                            // type
        pod.Status.PodIP,               // value
      )
    }
  case Deleted:
    deleteDNSRecord(pod)
  }
}

FlashDNS gRPC API Integration:

service DNSService {
  rpc CreateRecord(CreateRecordRequest) returns (DNSRecord);
  rpc UpdateRecord(UpdateRecordRequest) returns (DNSRecord);
  rpc DeleteRecord(DeleteRecordRequest) returns (Empty);
  rpc ListRecords(ListRecordsRequest) returns (ListRecordsResponse);
}

message CreateRecordRequest {
  string zone = 1;  // "cluster.local"
  string name = 2;  // "web.default.svc"
  string type = 3;  // "A", "AAAA", "SRV", "CNAME"
  string value = 4; // "10.96.0.100"
  int32 ttl = 5;    // 30 (seconds)
  map<string, string> labels = 6;  // k8s metadata
}

message DNSRecord {
  string id = 1;
  string zone = 2;
  string name = 3;
  string type = 4;
  string value = 5;
  int32 ttl = 6;
}

Configuration:

  • FlashDNS Endpoint: --flashdns-endpoint=flashdns-server:6000
  • Cluster Domain: --cluster-domain=cluster.local (default)
  • Record TTL: --dns-ttl=30 (seconds, low for fast updates)

Example DNS Records:

# Regular service
web.default.svc.cluster.local.  30 IN A 10.96.0.100

# Headless service with 3 pods
web.default.svc.cluster.local.  30 IN A 10.244.1.10
web.default.svc.cluster.local.  30 IN A 10.244.1.11
web.default.svc.cluster.local.  30 IN A 10.244.1.12

# StatefulSet pods (Phase 3)
web-0.web.default.svc.cluster.local.  30 IN A 10.244.1.10
web-1.web.default.svc.cluster.local.  30 IN A 10.244.1.11

# SRV record for service port
_http._tcp.web.default.svc.cluster.local. 30 IN SRV 0 50 80 web.default.svc.cluster.local.

# Pod DNS
10-244-1-10.pod.cluster.local.  30 IN A 10.244.1.10

Integration with kubelet:

  • kubelet configures pod DNS via /etc/resolv.conf
  • nameserver: FlashDNS service IP (a well-known address in the service CIDR, conventionally 10.96.0.10)
  • search: <namespace>.svc.cluster.local svc.cluster.local cluster.local
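
A pod in the default namespace would therefore see a resolv.conf along these lines (the nameserver address assumes FlashDNS at the conventional cluster DNS IP; ndots:5 is the standard kubelet default):

```
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
```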

Edge Cases:

  • Service IP change: Update DNS record atomically
  • Endpoint churn: Debounce updates for headless services with many endpoints
  • DNS caching: Low TTL (30s) for fast convergence

5. LightningStor CSI Driver

Purpose: Provide dynamic PersistentVolume provisioning and lifecycle management.

CSI Driver Name: stor.plasmacloud.io

Architecture:

  • Controller Plugin: Runs as StatefulSet or Deployment in kube-system
    • Provisioning, deletion, attaching, detaching, snapshots
  • Node Plugin: Runs as DaemonSet on every node
    • Staging, publishing (mounting), unpublishing, unstaging

CSI Components:

1. Controller Service (Identity, Controller RPCs):

  • CreateVolume: Provision new volume via LightningStor
  • DeleteVolume: Delete volume
  • ControllerPublishVolume: Attach volume to node
  • ControllerUnpublishVolume: Detach volume from node
  • ValidateVolumeCapabilities: Check if volume supports requested capabilities
  • ListVolumes: List all volumes
  • GetCapacity: Query available storage capacity
  • CreateSnapshot, DeleteSnapshot: Volume snapshots (Phase 2)

2. Node Service (Node RPCs):

  • NodeStageVolume: Mount volume to global staging path on node
  • NodeUnstageVolume: Unmount from staging path
  • NodePublishVolume: Bind mount from staging to pod path
  • NodeUnpublishVolume: Unmount from pod path
  • NodeGetInfo: Return node ID and topology
  • NodeGetCapabilities: Return node capabilities

CSI Driver Workflow:

Volume Provisioning:

1. User creates PVC:
   apiVersion: v1
   kind: PersistentVolumeClaim
   metadata:
     name: my-pvc
   spec:
     accessModes: [ReadWriteOnce]
     resources:
       requests:
         storage: 10Gi
     storageClassName: lightningstor-ssd

2. CSI Controller watches PVC, calls CreateVolume:
   CreateVolumeRequest {
     name: "pvc-550e8400-e29b-41d4-a716-446655440000"
     capacity_range: { required_bytes: 10737418240 }
     volume_capabilities: [{ access_mode: SINGLE_NODE_WRITER }]
     parameters: {
       "type": "ssd",
       "replication": "3",
       "org_id": "org-123",
       "project_id": "proj-456"
     }
   }

3. CSI Controller calls LightningStor gRPC CreateVolume:
   LightningStor creates volume, returns volume_id

4. CSI Controller creates PV:
   apiVersion: v1
   kind: PersistentVolume
   metadata:
     name: pvc-550e8400-e29b-41d4-a716-446655440000
   spec:
     capacity:
       storage: 10Gi
     accessModes: [ReadWriteOnce]
     persistentVolumeReclaimPolicy: Delete
     storageClassName: lightningstor-ssd
     csi:
       driver: stor.plasmacloud.io
       volumeHandle: vol-abc123
       fsType: ext4

5. K8s binds PVC to PV
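
Mapping the PVC's storage request to the CreateVolumeRequest's required_bytes is a quantity conversion; a sketch handling the binary suffixes used in this spec (decimal suffixes like G/M are omitted):

```python
# Binary suffixes as defined for Kubernetes resource quantities.
_SUFFIXES = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}

def quantity_to_bytes(quantity: str) -> int:
    """Convert a K8s storage quantity like '10Gi' to a byte count."""
    for suffix, factor in _SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # plain byte count

print(quantity_to_bytes("10Gi"))  # 10737418240
```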

Volume Attachment (when pod is scheduled):

1. kube-controller-manager creates VolumeAttachment:
   apiVersion: storage.k8s.io/v1
   kind: VolumeAttachment
   metadata:
     name: csi-<hash>
   spec:
     attacher: stor.plasmacloud.io
     nodeName: worker-1
     source:
       persistentVolumeName: pvc-550e8400-e29b-41d4-a716-446655440000

2. CSI Controller watches VolumeAttachment, calls ControllerPublishVolume:
   ControllerPublishVolumeRequest {
     volume_id: "vol-abc123"
     node_id: "worker-1"
     volume_capability: { access_mode: SINGLE_NODE_WRITER }
   }

3. CSI Controller calls LightningStor gRPC AttachVolume:
   LightningStor attaches volume to node (e.g., iSCSI target, NBD)

4. CSI Controller updates VolumeAttachment status: attached=true

Volume Mounting (on node):

1. kubelet calls CSI Node plugin: NodeStageVolume
   NodeStageVolumeRequest {
     volume_id: "vol-abc123"
     staging_target_path: "/var/lib/kubelet/plugins/kubernetes.io/csi/stor.plasmacloud.io/<hash>/globalmount"
     volume_capability: { mount: { fs_type: "ext4" } }
   }

2. CSI Node plugin:
   - Discovers block device (e.g., /dev/nbd0) via LightningStor
   - Formats if needed: mkfs.ext4 /dev/nbd0
   - Mounts to staging path: mount /dev/nbd0 <staging_target_path>

3. kubelet calls CSI Node plugin: NodePublishVolume
   NodePublishVolumeRequest {
     volume_id: "vol-abc123"
     staging_target_path: "/var/lib/kubelet/plugins/kubernetes.io/csi/stor.plasmacloud.io/<hash>/globalmount"
     target_path: "/var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~csi/pvc-<hash>/mount"
   }

4. CSI Node plugin:
   - Bind mount staging path to target path
   - Pod can now read/write to volume

LightningStor gRPC API Integration:

service VolumeService {
  rpc CreateVolume(CreateVolumeRequest) returns (Volume);
  rpc DeleteVolume(DeleteVolumeRequest) returns (Empty);
  rpc AttachVolume(AttachVolumeRequest) returns (VolumeAttachment);
  rpc DetachVolume(DetachVolumeRequest) returns (Empty);
  rpc GetVolume(GetVolumeRequest) returns (Volume);
  rpc ListVolumes(ListVolumesRequest) returns (ListVolumesResponse);
}

message CreateVolumeRequest {
  string name = 1;
  int64 size_bytes = 2;
  string volume_type = 3;  // "ssd", "hdd"
  int32 replication_factor = 4;
  string org_id = 5;
  string project_id = 6;
}

message Volume {
  string id = 1;
  string name = 2;
  int64 size_bytes = 3;
  string status = 4;  // "available", "in-use", "error"
  string volume_type = 5;
}

message AttachVolumeRequest {
  string volume_id = 1;
  string node_id = 2;
  string attach_mode = 3;  // "read-write", "read-only"
}

message VolumeAttachment {
  string id = 1;
  string volume_id = 2;
  string node_id = 3;
  string device_path = 4;  // e.g., "/dev/nbd0"
  string connection_info = 5;  // JSON with iSCSI target, NBD socket, etc.
}
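The connection_info field is opaque JSON that the CSI Node plugin must decode before it can discover the block device. A sketch of that decoding step (the field names protocol, device_path, nbd_socket, and iscsi_target are assumptions for illustration, not a confirmed LightningStor schema):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// connectionInfo models the JSON carried in VolumeAttachment.connection_info.
// Field names here are illustrative assumptions.
type connectionInfo struct {
	Protocol    string `json:"protocol"`              // "nbd" or "iscsi"
	DevicePath  string `json:"device_path,omitempty"` // e.g. "/dev/nbd0"
	NBDSocket   string `json:"nbd_socket,omitempty"`
	ISCSITarget string `json:"iscsi_target,omitempty"`
}

func parseConnectionInfo(raw string) (connectionInfo, error) {
	var ci connectionInfo
	err := json.Unmarshal([]byte(raw), &ci)
	return ci, err
}

func main() {
	ci, err := parseConnectionInfo(`{"protocol":"nbd","device_path":"/dev/nbd0","nbd_socket":"/run/lstor/vol-abc123.sock"}`)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s via %s\n", ci.DevicePath, ci.Protocol) // /dev/nbd0 via nbd
}
```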

StorageClass Examples:

# SSD storage with 3x replication
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: lightningstor-ssd
provisioner: stor.plasmacloud.io
parameters:
  type: "ssd"
  replication: "3"
volumeBindingMode: WaitForFirstConsumer  # Topology-aware scheduling
allowVolumeExpansion: true
reclaimPolicy: Delete

---
# HDD storage with 2x replication
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: lightningstor-hdd
provisioner: stor.plasmacloud.io
parameters:
  type: "hdd"
  replication: "2"
volumeBindingMode: Immediate
allowVolumeExpansion: true
reclaimPolicy: Retain  # Keep volume after PVC deletion

Access Modes:

  • ReadWriteOnce (RWO): Single node read-write (most common)
  • ReadOnlyMany (ROX): Multiple nodes read-only
  • ReadWriteMany (RWX): Multiple nodes read-write (requires shared filesystem like NFS, Phase 2)

Volume Expansion (if allowVolumeExpansion: true):

1. User edits PVC: spec.resources.requests.storage: 20Gi (was 10Gi)
2. CSI Controller calls ControllerExpandVolume
3. LightningStor expands volume backend
4. CSI Node plugin calls NodeExpandVolume
5. Filesystem resize: resize2fs /dev/nbd0
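A sketch of the size validation the CSI Controller could perform before step 3 (the only-grow rule matches standard CSI expansion semantics; rounding up to whole GiB is an assumed LightningStor allocation unit, not confirmed behavior):

```go
package main

import "fmt"

const gib = int64(1 << 30)

// validateExpansion checks an expansion request before calling
// ControllerExpandVolume: volumes can only grow, and the new size is
// rounded up to the next GiB boundary (assumed allocation unit).
func validateExpansion(currentBytes, requestedBytes int64) (int64, error) {
	if requestedBytes <= currentBytes {
		return 0, fmt.Errorf("shrinking volumes is not supported (%d -> %d)", currentBytes, requestedBytes)
	}
	rounded := (requestedBytes + gib - 1) / gib * gib
	return rounded, nil
}

func main() {
	// Expanding 10Gi -> 20Gi, as in the PVC edit above.
	newSize, err := validateExpansion(10*gib, 20*gib)
	if err != nil {
		panic(err)
	}
	fmt.Println(newSize / gib) // 20
}
```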

6. PlasmaVMC Integration

Phase 1 (MVP): Use containerd as default CRI

  • k3s ships with containerd embedded
  • Standard OCI container runtime
  • No changes needed for Phase 1

Phase 3 (Future): Custom CRI for VM-backed pods

Motivation:

  • Enhanced Isolation: Stronger security boundary than containers
  • Multi-Tenant Security: Prevent container escape attacks
  • Consistent Runtime: Unify VM and container workloads on PlasmaVMC

Architecture:

  • PlasmaVMC implements CRI (Container Runtime Interface)
  • Each pod runs as a lightweight VM (Firecracker microVM)
  • Pod containers run inside VM (still using containerd within VM)
  • kubelet communicates with PlasmaVMC CRI endpoint instead of containerd

CRI Interface Implementation:

RuntimeService:

  • RunPodSandbox: Create Firecracker microVM for pod
  • StopPodSandbox: Stop microVM
  • RemovePodSandbox: Delete microVM
  • PodSandboxStatus: Query microVM status
  • ListPodSandbox: List all pod microVMs
  • CreateContainer: Create container inside microVM
  • StartContainer, StopContainer, RemoveContainer: Container lifecycle
  • ExecSync, Exec: Execute commands in container
  • Attach: Attach to container stdio

ImageService:

  • PullImage: Download container image (delegate to internal containerd)
  • RemoveImage: Delete image
  • ListImages: List cached images
  • ImageStatus: Query image metadata

Implementation Strategy:

┌─────────────────────────────────────────┐
│           kubelet (k3s agent)           │
└─────────────┬───────────────────────────┘
              │ CRI gRPC
              ▼
┌─────────────────────────────────────────┐
│      PlasmaVMC CRI Server (Rust)        │
│  - RunPodSandbox → Create microVM       │
│  - CreateContainer → Run in VM          │
└─────────────┬───────────────────────────┘
              │
              ▼
┌─────────────────────────────────────────┐
│      Firecracker VMM (per pod)          │
│  ┌───────────────────────────────────┐  │
│  │  Pod VM (minimal Linux kernel)    │  │
│  │  ┌──────────────────────────────┐ │  │
│  │  │ containerd (in-VM)           │ │  │
│  │  │  - Container 1               │ │  │
│  │  │  - Container 2               │ │  │
│  │  └──────────────────────────────┘ │  │
│  └───────────────────────────────────┘  │
└─────────────────────────────────────────┘

Configuration (Phase 3):

services.k8shost = {
  enable = true;
  cri = "plasmavmc";  # Instead of "containerd"
  plasmavmc = {
    endpoint = "unix:///var/run/plasmavmc/cri.sock";
    vmKernel = "/var/lib/plasmavmc/vmlinux.bin";
    vmRootfs = "/var/lib/plasmavmc/rootfs.ext4";
  };
};

Benefits:

  • Stronger isolation for untrusted workloads
  • Leverage existing PlasmaVMC infrastructure
  • Consistent management across VM and K8s workloads

Challenges:

  • Performance overhead (microVM startup time, memory overhead)
  • Image caching complexity (need containerd inside VM)
  • Networking integration (CNI must configure VM network)

Decision: Defer to Phase 3, focus on standard containerd for MVP.

Multi-Tenant Model

Namespace Strategy

Principle: One K8s namespace per PlasmaCloud project.

Namespace Naming:

  • Project namespaces: project-<project_id> (e.g., project-550e8400-e29b-41d4-a716-446655440000)
  • Org shared namespaces (optional): org-<org_id>-shared (for shared resources like monitoring)
  • System namespaces: kube-system, kube-public, kube-node-lease, default

Namespace Lifecycle:

  • Created automatically when project provisions K8s cluster
  • Labeled with org_id, project_id for RBAC and billing
  • Deleted when project is deleted (with grace period)

Namespace Metadata:

apiVersion: v1
kind: Namespace
metadata:
  name: project-550e8400-e29b-41d4-a716-446655440000
  labels:
    plasmacloud.io/org-id: "org-123"
    plasmacloud.io/project-id: "proj-456"
    plasmacloud.io/tenant-type: "project"
  annotations:
    plasmacloud.io/project-name: "my-web-app"
    plasmacloud.io/created-by: "user@example.com"

RBAC Templates

Org Admin Role (full access to the org's project namespaces; the Role and RoleBinding below are replicated into each project namespace):

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: org-admin
  namespace: project-550e8400-e29b-41d4-a716-446655440000
rules:
- apiGroups: ["*"]
  resources: ["*"]
  verbs: ["*"]

---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: org-admin-binding
  namespace: project-550e8400-e29b-41d4-a716-446655440000
subjects:
- kind: Group
  name: org:org-123
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: org-admin
  apiGroup: rbac.authorization.k8s.io

Project Admin Role (full access to specific project namespace):

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: project-admin
  namespace: project-550e8400-e29b-41d4-a716-446655440000
rules:
- apiGroups: ["", "apps", "batch", "networking.k8s.io", "storage.k8s.io"]
  resources: ["*"]
  verbs: ["*"]

---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: project-admin-binding
  namespace: project-550e8400-e29b-41d4-a716-446655440000
subjects:
- kind: Group
  name: project:proj-456
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: project-admin
  apiGroup: rbac.authorization.k8s.io

Project Viewer Role (read-only access):

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: project-viewer
  namespace: project-550e8400-e29b-41d4-a716-446655440000
rules:
- apiGroups: ["", "apps", "batch", "networking.k8s.io"]
  resources: ["pods", "services", "deployments", "replicasets", "configmaps", "secrets"]
  verbs: ["get", "list", "watch"]
- apiGroups: [""]
  resources: ["pods/log"]
  verbs: ["get", "list"]

---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: project-viewer-binding
  namespace: project-550e8400-e29b-41d4-a716-446655440000
subjects:
- kind: Group
  name: project:proj-456:viewer
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: project-viewer
  apiGroup: rbac.authorization.k8s.io

ClusterRole for Node Access (for cluster admins):

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: plasmacloud-cluster-admin
rules:
- apiGroups: [""]
  resources: ["nodes", "persistentvolumes"]
  verbs: ["*"]
- apiGroups: ["storage.k8s.io"]
  resources: ["storageclasses"]
  verbs: ["*"]

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: plasmacloud-cluster-admin-binding
subjects:
- kind: Group
  name: system:plasmacloud-admins
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: plasmacloud-cluster-admin
  apiGroup: rbac.authorization.k8s.io

Network Isolation

Default NetworkPolicy (deny all, except DNS):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: project-550e8400-e29b-41d4-a716-446655440000
spec:
  podSelector: {}  # Apply to all pods
  policyTypes:
  - Ingress
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
    ports:
    - protocol: UDP
      port: 53  # DNS

Allow Ingress from LoadBalancer:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-loadbalancer
  namespace: project-550e8400-e29b-41d4-a716-446655440000
spec:
  podSelector:
    matchLabels:
      app: web
  policyTypes:
  - Ingress
  ingress:
  - from:
    - ipBlock:
        cidr: 0.0.0.0/0  # Allow from anywhere (LoadBalancer external traffic)
    ports:
    - protocol: TCP
      port: 8080

Allow Inter-Namespace Communication (optional, for org-shared services):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-org-shared
  namespace: project-550e8400-e29b-41d4-a716-446655440000
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          plasmacloud.io/org-id: "org-123"
          plasmacloud.io/tenant-type: "org-shared"

PrismNET Enforcement:

  • NetworkPolicies are translated to OVN ACLs by PrismNET CNI controller
  • Enforced at OVN logical switch level (low-level packet filtering)
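As an illustration of the translation, a single ingress rule (protocol, port, and a source address set of selected pods) could render into an OVN ACL match expression roughly like this; the address-set naming and exact match shape are assumptions about the PrismNET controller, not confirmed behavior:

```go
package main

import "fmt"

// aclMatch renders an ingress rule as an OVN-style ACL match expression.
// addressSet would be populated with the pod IPs selected by the policy's
// label selector (an assumed PrismNET convention).
func aclMatch(addressSet, protocol string, port int) string {
	return fmt.Sprintf("ip4.src == $%s && %s && %s.dst == %d",
		addressSet, protocol, protocol, port)
}

func main() {
	// Allow TCP/8080 from pods in the "web_clients" address set.
	fmt.Println(aclMatch("web_clients", "tcp", 8080))
	// ip4.src == $web_clients && tcp && tcp.dst == 8080
}
```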

Resource Quotas

CPU and Memory Quotas:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: project-compute-quota
  namespace: project-550e8400-e29b-41d4-a716-446655440000
spec:
  hard:
    requests.cpu: "10"       # 10 CPU cores
    requests.memory: "20Gi"  # 20 GB RAM
    limits.cpu: "20"         # Allow bursting to 20 cores
    limits.memory: "40Gi"    # Allow bursting to 40 GB RAM

Storage Quotas:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: project-storage-quota
  namespace: project-550e8400-e29b-41d4-a716-446655440000
spec:
  hard:
    persistentvolumeclaims: "10"  # Max 10 PVCs
    requests.storage: "100Gi"     # Total storage requests

Object Count Quotas:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: project-object-quota
  namespace: project-550e8400-e29b-41d4-a716-446655440000
spec:
  hard:
    pods: "50"
    services: "20"
    services.loadbalancers: "5"   # Max 5 LoadBalancer services (limit external IPs)
    configmaps: "50"
    secrets: "50"

Quota Enforcement:

  • K8s admission controller rejects resource creation exceeding quota
  • User receives clear error message
  • Quota usage visible in kubectl describe quota
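The admission check reduces to simple arithmetic over the quota's hard limit and current usage; a minimal sketch (values in millicores):

```go
package main

import "fmt"

// checkQuota sketches the admission-time decision: reject a request when
// used + requested would exceed the namespace's hard limit.
func checkQuota(hard, used, requested int64) error {
	if used+requested > hard {
		return fmt.Errorf("exceeded quota: requested %d, used %d of %d", requested, used, hard)
	}
	return nil
}

func main() {
	// 10-core hard limit (10000m), 8 cores used, 3 more requested: denied.
	if err := checkQuota(10000, 8000, 3000); err != nil {
		fmt.Println("denied:", err)
	}
}
```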

Deployment Model

Single-Server (Development/Small)

Target Use Case:

  • Development and testing environments
  • Small production workloads (<10 nodes)
  • Cost-sensitive deployments

Architecture:

  • Single k3s server node with embedded SQLite datastore
  • Control plane and worker colocated
  • No HA guarantees

k3s Server Command:

k3s server \
  --data-dir=/var/lib/k8shost \
  --disable=servicelb,traefik,flannel \
  --flannel-backend=none \
  --disable-network-policy \
  --cluster-domain=cluster.local \
  --service-cidr=10.96.0.0/12 \
  --cluster-cidr=10.244.0.0/16 \
  --authentication-token-webhook-config-file=/etc/k8shost/iam-webhook.yaml \
  --bind-address=0.0.0.0 \
  --advertise-address=192.168.1.100 \
  --tls-san=k8s-api.example.com

NixOS Configuration:

{ config, lib, pkgs, ... }:

{
  services.k8shost = {
    enable = true;
    mode = "server";
    datastore = "sqlite";  # Embedded SQLite
    disableComponents = ["servicelb" "traefik" "flannel"];

    networking = {
      serviceCIDR = "10.96.0.0/12";
      clusterCIDR = "10.244.0.0/16";
      clusterDomain = "cluster.local";
    };

    prismnet = {
      enable = true;
      endpoint = "prismnet-server:5000";
      ovnNorthbound = "tcp:prismnet-server:6641";
      ovnSouthbound = "tcp:prismnet-server:6642";
    };

    fiberlb = {
      enable = true;
      endpoint = "fiberlb-server:7000";
      externalIpPool = "192.168.100.0/24";
    };

    iam = {
      enable = true;
      webhookEndpoint = "https://iam-server:3000/apis/iam.plasmacloud.io/v1/authenticate";
      caCertFile = "/etc/k8shost/ca.crt";
      clientCertFile = "/etc/k8shost/client.crt";
      clientKeyFile = "/etc/k8shost/client.key";
    };

    flashdns = {
      enable = true;
      endpoint = "flashdns-server:6000";
      clusterDomain = "cluster.local";
      recordTTL = 30;
    };

    lightningstor = {
      enable = true;
      endpoint = "lightningstor-server:8000";
      csiNodeDaemonSet = true;  # Deploy CSI node plugin as DaemonSet
    };
  };

  # Open firewall for K8s API
  networking.firewall.allowedTCPPorts = [ 6443 ];
}

Limitations:

  • No HA (single point of failure)
  • SQLite has limited concurrency
  • Control plane downtime affects entire cluster

HA Cluster (Production)

Target Use Case:

  • Production workloads requiring high availability
  • Large clusters (>10 nodes)
  • Mission-critical applications

Architecture:

  • 3 or 5 k3s server nodes (odd number for quorum)
  • Embedded etcd (Raft consensus, HA datastore)
  • Load balancer in front of API servers
  • Agent nodes for workload scheduling

k3s Server Command (each server node):

# The first server bootstraps the cluster with --cluster-init;
# additional servers join with --server https://k8s-api-lb.internal:6443 instead.
k3s server \
  --data-dir=/var/lib/k8shost \
  --disable=servicelb,traefik,flannel \
  --flannel-backend=none \
  --disable-network-policy \
  --cluster-domain=cluster.local \
  --service-cidr=10.96.0.0/12 \
  --cluster-cidr=10.244.0.0/16 \
  --authentication-token-webhook-config-file=/etc/k8shost/iam-webhook.yaml \
  --cluster-init \
  --tls-san=k8s-api-lb.example.com \
  --tls-san=k8s-api.example.com

k3s Agent Command (worker nodes):

k3s agent \
  --server https://k8s-api-lb.internal:6443 \
  --token <join-token>

NixOS Configuration (Server Node):

{ config, lib, pkgs, ... }:

{
  services.k8shost = {
    enable = true;
    mode = "server";
    datastore = "etcd";  # Embedded etcd for HA
    clusterInit = true;  # Set to false for joining servers
    serverUrl = "https://k8s-api-lb.internal:6443";  # For joining servers

    # ... same integrations as single-server ...
  };

  # High availability settings
  systemd.services.k8shost = {
    serviceConfig = {
      Restart = "always";
      RestartSec = "10s";
    };
  };
}

Load Balancer Configuration (FiberLB):

# External LoadBalancer for K8s API access (FiberLB resource definition, not a Kubernetes object)
apiVersion: v1
kind: LoadBalancer
metadata:
  name: k8s-api-lb
spec:
  listeners:
  - protocol: TCP
    port: 6443
    backend_pool: k8s-api-servers
  pools:
  - name: k8s-api-servers
    algorithm: round-robin
    members:
    - address: 192.168.1.101  # server-1
      port: 6443
    - address: 192.168.1.102  # server-2
      port: 6443
    - address: 192.168.1.103  # server-3
      port: 6443
    health_check:
      type: tcp
      interval: 10s
      timeout: 5s
      retries: 3
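A sketch of the round-robin selection the pool above implies, skipping members that failed the TCP health check (illustrative only, not FiberLB's actual implementation):

```go
package main

import "fmt"

type member struct {
	addr    string
	healthy bool
}

// pool cycles through members round-robin, skipping unhealthy ones.
type pool struct {
	members []member
	next    int
}

// pick returns the next healthy member, or false if none are healthy.
func (p *pool) pick() (string, bool) {
	for i := 0; i < len(p.members); i++ {
		m := p.members[p.next]
		p.next = (p.next + 1) % len(p.members)
		if m.healthy {
			return m.addr, true
		}
	}
	return "", false
}

func main() {
	p := &pool{members: []member{
		{"192.168.1.101:6443", true},
		{"192.168.1.102:6443", false}, // failed health check
		{"192.168.1.103:6443", true},
	}}
	for i := 0; i < 3; i++ {
		addr, _ := p.pick()
		fmt.Println(addr)
	}
}
```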

Datastore Options:

Option 1: Embedded etcd (recommended for MVP)

Pros:

  • Built-in to k3s, no external dependencies
  • Proven, battle-tested (CNCF etcd project)
  • Automatic HA with Raft consensus
  • Easy setup (just --cluster-init)

Cons:

  • Another distributed datastore (in addition to Chainfire/FlareDB)
  • etcd-specific operations (backup, restore, defragmentation)

Option 2: FlareDB as External Datastore

Pros:

  • Unified storage layer for PlasmaCloud
  • Leverage existing FlareDB deployment
  • Simplified infrastructure (one less system to manage)

Cons:

  • k3s requires etcd API compatibility
  • FlareDB would need to implement etcd v3 API (significant effort)
  • Untested for K8s workloads

Recommendation for MVP: Use embedded etcd for HA mode. Investigate FlareDB etcd compatibility in Phase 2 or 3.

Backup and Disaster Recovery:

# etcd snapshot (on any server node)
k3s etcd-snapshot save --name backup-$(date +%Y%m%d-%H%M%S)

# List snapshots
k3s etcd-snapshot ls

# Restore from snapshot
k3s server --cluster-reset --cluster-reset-restore-path=/var/lib/k8shost/server/db/snapshots/backup-20250101-120000

NixOS Module Integration

Module Structure:

nix/modules/
├── k8shost.nix              # Main module
├── k8shost/
│   ├── controller.nix       # FiberLB, FlashDNS controllers
│   ├── csi.nix              # LightningStor CSI driver
│   └── cni.nix              # PrismNET CNI plugin

Main Module (nix/modules/k8shost.nix):

{ config, lib, pkgs, ... }:

with lib;

let
  cfg = config.services.k8shost;
in
{
  options.services.k8shost = {
    enable = mkEnableOption "PlasmaCloud K8s Hosting Service";

    mode = mkOption {
      type = types.enum ["server" "agent"];
      default = "server";
      description = "Run as server (control plane) or agent (worker)";
    };

    datastore = mkOption {
      type = types.enum ["sqlite" "etcd"];
      default = "sqlite";
      description = "Datastore backend (sqlite for single-server, etcd for HA)";
    };

    disableComponents = mkOption {
      type = types.listOf types.str;
      default = ["servicelb" "traefik" "flannel"];
      description = "k3s components to disable";
    };

    networking = {
      serviceCIDR = mkOption {
        type = types.str;
        default = "10.96.0.0/12";
        description = "CIDR for service ClusterIPs";
      };

      clusterCIDR = mkOption {
        type = types.str;
        default = "10.244.0.0/16";
        description = "CIDR for pod IPs";
      };

      clusterDomain = mkOption {
        type = types.str;
        default = "cluster.local";
        description = "Cluster DNS domain";
      };
    };

    # Integration options (prismnet, fiberlb, iam, flashdns, lightningstor)
    # ...
  };

  config = mkIf cfg.enable {
    # Install k3s package
    environment.systemPackages = [ pkgs.k3s ];

    # Create systemd service
    systemd.services.k8shost = {
      description = "PlasmaCloud K8s Hosting Service (k3s)";
      after = [ "network.target" "iam.service" "prismnet.service" ];
      requires = [ "iam.service" "prismnet.service" ];
      wantedBy = [ "multi-user.target" ];

      serviceConfig = {
        Type = "notify";
        ExecStart = "${pkgs.k3s}/bin/k3s ${cfg.mode} ${concatStringsSep " " (buildServerArgs cfg)}";
        KillMode = "process";
        Delegate = "yes";
        LimitNOFILE = 1048576;
        LimitNPROC = "infinity";
        LimitCORE = "infinity";
        TasksMax = "infinity";
        Restart = "always";
        RestartSec = "5s";
      };
    };

    # Create configuration files
    environment.etc."k8shost/iam-webhook.yaml" = {
      text = generateIAMWebhookConfig cfg.iam;
      mode = "0600";
    };

    # Deploy controllers (FiberLB, FlashDNS, etc.)
    # ... (as separate systemd services or in-cluster deployments)
  };
}

API Server Configuration

k3s Server Flags (Complete)

The listing below is an annotated reference: strip the interleaved comments and blank lines before running it as a single shell command, and use either --cluster-init (first server) or --server plus --token (joining servers), not both.

k3s server \
  # Data and cluster configuration
  --data-dir=/var/lib/k8shost \
  --cluster-init \  # For first server in HA cluster
  --server https://k8s-api-lb.internal:6443 \  # Join existing HA cluster
  --token <cluster-token> \  # Secure join token

  # Disable default components
  --disable=servicelb,traefik,flannel,local-storage \
  --flannel-backend=none \
  --disable-network-policy \

  # Network configuration
  --cluster-domain=cluster.local \
  --service-cidr=10.96.0.0/12 \
  --cluster-cidr=10.244.0.0/16 \
  --service-node-port-range=30000-32767 \

  # API server configuration
  --bind-address=0.0.0.0 \
  --advertise-address=192.168.1.100 \
  --tls-san=k8s-api.example.com \
  --tls-san=k8s-api-lb.example.com \

  # Authentication
  --authentication-token-webhook-config-file=/etc/k8shost/iam-webhook.yaml \
  --authentication-token-webhook-cache-ttl=2m \

  # Authorization (RBAC enabled by default)
  # --authorization-mode=Node,RBAC \  # Default, no need to specify

  # Audit logging
  --kube-apiserver-arg=audit-log-path=/var/log/k8shost/audit.log \
  --kube-apiserver-arg=audit-log-maxage=30 \
  --kube-apiserver-arg=audit-log-maxbackup=10 \
  --kube-apiserver-arg=audit-log-maxsize=100 \

  # Feature gates (if needed)
  # --kube-apiserver-arg=feature-gates=SomeFeature=true

Authentication Webhook Configuration

File: /etc/k8shost/iam-webhook.yaml

apiVersion: v1
kind: Config
clusters:
- name: iam-webhook
  cluster:
    server: https://iam-server:3000/apis/iam.plasmacloud.io/v1/authenticate
    certificate-authority: /etc/k8shost/ca.crt
users:
- name: k8s-apiserver
  user:
    client-certificate: /etc/k8shost/apiserver-client.crt
    client-key: /etc/k8shost/apiserver-client.key
current-context: webhook
contexts:
- context:
    cluster: iam-webhook
    user: k8s-apiserver
  name: webhook

Certificate Management:

  • CA certificate: Issued by PlasmaCloud IAM PKI
  • Client certificate: For kube-apiserver to authenticate to IAM webhook
  • Rotation: Certificates expire after 1 year, auto-renewed by IAM

Security

TLS/mTLS

Component Communication:

| Source              | Destination        | Protocol     | Auth Method        |
|---------------------|--------------------|--------------|--------------------|
| kube-apiserver      | IAM webhook        | HTTPS + mTLS | Client cert        |
| FiberLB controller  | FiberLB gRPC       | gRPC + TLS   | IAM token          |
| FlashDNS controller | FlashDNS gRPC      | gRPC + TLS   | IAM token          |
| LightningStor CSI   | LightningStor gRPC | gRPC + TLS   | IAM token          |
| PrismNET CNI        | PrismNET gRPC      | gRPC + TLS   | IAM token          |
| kubectl             | kube-apiserver     | HTTPS        | IAM token (Bearer) |

Certificate Issuance:

  • All certificates issued by IAM service (centralized PKI)
  • Automatic renewal before expiration
  • Certificate revocation via IAM CRL

Pod Security

Pod Security Standards (PSS):

  • Baseline Profile: Enforced on all namespaces by default
    • Deny privileged containers
    • Deny host network/PID/IPC
    • Deny hostPath volumes
    • Deny privilege escalation
  • Restricted Profile: Optional, for highly sensitive workloads

Example PodSecurityPolicy (deprecated since K8s 1.21 and removed in 1.25; shown for legacy reference only, use Pod Security Standards instead):

apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: restricted
spec:
  privileged: false
  allowPrivilegeEscalation: false
  requiredDropCapabilities:
  - ALL
  volumes:
  - configMap
  - emptyDir
  - projected
  - secret
  - downwardAPI
  - persistentVolumeClaim
  runAsUser:
    rule: MustRunAsNonRoot
  seLinux:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny

Security Contexts (enforced):

apiVersion: v1
kind: Pod
metadata:
  name: secure-pod
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 2000
  containers:
  - name: app
    image: myapp:latest
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop:
        - ALL

Service Account Permissions:

  • Minimal RBAC permissions by default
  • Principle of least privilege
  • No cluster-admin access for user workloads

Testing Strategy

Unit Tests

Controllers (Go):

// fiberlb_controller_test.go
func TestReconcileLoadBalancer(t *testing.T) {
    // Mock K8s client
    client := fake.NewSimpleClientset()

    // Mock FiberLB gRPC client
    mockFiberLB := &mockFiberLBClient{}

    controller := NewFiberLBController(client, mockFiberLB)

    // Create test service
    svc := &corev1.Service{
        ObjectMeta: metav1.ObjectMeta{Name: "test-svc", Namespace: "default"},
        Spec: corev1.ServiceSpec{Type: corev1.ServiceTypeLoadBalancer},
    }

    // Reconcile
    err := controller.Reconcile(svc)
    assert.NoError(t, err)

    // Verify FiberLB API called
    assert.Equal(t, 1, mockFiberLB.createLoadBalancerCalls)
}

CNI Plugin (Rust):

#[test]
fn test_cni_add() {
    let mut mock_ovn = MockOVNClient::new();
    mock_ovn.expect_allocate_ip()
        .returning(|_ns, _pod| Ok("10.244.1.5/24".to_string()));

    let plugin = PrismNETPlugin::new(mock_ovn);
    let result = plugin.handle_add(/* ... */);

    assert!(result.is_ok());
    assert_eq!(result.unwrap().ip, "10.244.1.5");
}

CSI Driver (Go):

func TestCreateVolume(t *testing.T) {
    mockLightningStor := &mockLightningStorClient{}
    mockLightningStor.On("CreateVolume", mock.Anything).Return(&Volume{ID: "vol-123"}, nil)

    driver := NewCSIDriver(mockLightningStor)

    req := &csi.CreateVolumeRequest{
        Name: "test-vol",
        CapacityRange: &csi.CapacityRange{RequiredBytes: 10 * 1024 * 1024 * 1024},
    }

    resp, err := driver.CreateVolume(context.Background(), req)
    assert.NoError(t, err)
    assert.Equal(t, "vol-123", resp.Volume.VolumeId)
}

Integration Tests

Test Environment:

  • Single-node k3s cluster (kind or k3s in Docker)
  • Mock or real PlasmaCloud services (PrismNET, FiberLB, etc.)
  • Automated setup and teardown

Test Cases:

1. Single-Pod Deployment:

#!/bin/bash
set -e

# Deploy nginx pod
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx:latest
    ports:
    - containerPort: 80
EOF

# Wait for pod to be running
kubectl wait --for=condition=Ready pod/nginx --timeout=60s

# Verify pod IP allocated
POD_IP=$(kubectl get pod nginx -o jsonpath='{.status.podIP}')
[ -n "$POD_IP" ] || exit 1

# Cleanup
kubectl delete pod nginx

2. Service Exposure (LoadBalancer):

#!/bin/bash
set -e

# Create deployment
kubectl create deployment web --image=nginx:latest --replicas=2

# Expose as LoadBalancer
kubectl expose deployment web --type=LoadBalancer --port=80

# Wait for external IP
for i in {1..30}; do
  EXTERNAL_IP=$(kubectl get svc web -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
  [ -n "$EXTERNAL_IP" ] && break
  sleep 2
done

[ -n "$EXTERNAL_IP" ] || exit 1

# Verify HTTP access
curl -f http://$EXTERNAL_IP || exit 1

# Cleanup
kubectl delete svc web
kubectl delete deployment web

3. PersistentVolume Provisioning:

#!/bin/bash
set -e

# Create PVC
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pvc
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 1Gi
  storageClassName: lightningstor-ssd
EOF

# Wait for PVC to be bound (PVCs expose status.phase, not a "Bound" condition)
kubectl wait --for=jsonpath='{.status.phase}'=Bound pvc/test-pvc --timeout=60s

# Create pod using PVC
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  containers:
  - name: app
    image: busybox
    command: ["sh", "-c", "echo hello > /data/test.txt && sleep 3600"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: test-pvc
EOF

kubectl wait --for=condition=Ready pod/test-pod --timeout=60s

# Verify file written
kubectl exec test-pod -- cat /data/test.txt | grep hello || exit 1

# Cleanup
kubectl delete pod test-pod
kubectl delete pvc test-pvc

4. Multi-Tenant Isolation:

#!/bin/bash
set -e

# Create two namespaces
kubectl create namespace project-a
kubectl create namespace project-b

# Deploy pod in each
kubectl run pod-a --image=nginx -n project-a
kubectl run pod-b --image=nginx -n project-b

# Verify network isolation (if NetworkPolicies enabled)
# Pod A should NOT be able to reach Pod B
POD_B_IP=$(kubectl get pod pod-b -n project-b -o jsonpath='{.status.podIP}')
kubectl exec pod-a -n project-a -- curl --max-time 5 http://$POD_B_IP && exit 1 || true

# Cleanup
kubectl delete ns project-a project-b

E2E Test Scenario

End-to-End Test: Deploy Multi-Tier Application

#!/bin/bash
set -ex

NAMESPACE="project-123"

# 1. Create namespace
kubectl create namespace $NAMESPACE

# 2. Deploy PostgreSQL with PVC
kubectl apply -n $NAMESPACE -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-pvc
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 5Gi
  storageClassName: lightningstor-ssd
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres
spec:
  replicas: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
      - name: postgres
        image: postgres:15
        env:
        - name: POSTGRES_PASSWORD
          value: testpass
        ports:
        - containerPort: 5432
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: postgres-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: postgres
spec:
  selector:
    app: postgres
  ports:
  - port: 5432
EOF

# 3. Deploy web application
kubectl apply -n $NAMESPACE -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: myapp:latest
        env:
        - name: DATABASE_URL
          value: postgres://postgres:testpass@postgres:5432/mydb
        ports:
        - containerPort: 8080
EOF

# 4. Expose web via LoadBalancer
kubectl expose deployment web -n $NAMESPACE --type=LoadBalancer --port=80 --target-port=8080

# 5. Wait for resources
kubectl wait -n $NAMESPACE --for=condition=Ready pod -l app=postgres --timeout=120s
kubectl wait -n $NAMESPACE --for=condition=Ready pod -l app=web --timeout=120s

# 6. Verify LoadBalancer external IP
for i in {1..60}; do
  EXTERNAL_IP=$(kubectl get svc web -n $NAMESPACE -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
  [ -n "$EXTERNAL_IP" ] && break
  sleep 2
done
[ -n "$EXTERNAL_IP" ] || { echo "No external IP assigned"; exit 1; }

# 7. Verify DNS resolution
kubectl run -n $NAMESPACE --rm -i --restart=Never test-dns --image=busybox -- nslookup postgres.${NAMESPACE}.svc.cluster.local

# 8. Verify HTTP access
curl -f http://$EXTERNAL_IP/health || { echo "Health check failed"; exit 1; }

# 9. Verify PVC mounted
kubectl exec -n $NAMESPACE deployment/postgres -- ls /var/lib/postgresql/data | grep pg_wal

# 10. Verify network isolation (optional, if NetworkPolicies enabled)
# ...

# Cleanup
kubectl delete namespace $NAMESPACE

echo "E2E test passed!"

Implementation Phases

Phase 1: Foundation (4-5 weeks)

Week 1-2: k3s Setup and IAM Integration

  • Install and configure k3s with disabled components
  • Implement IAM authentication webhook server
  • Configure kube-apiserver to use IAM webhook
  • Create RBAC templates (org admin, project admin, viewer)
  • Test: Authenticate with IAM token, verify RBAC enforcement

Week 3: PrismNET CNI Plugin

  • Implement CNI binary (ADD, DEL, CHECK commands)
  • Integrate with PrismNET gRPC API (AllocateIP, ReleaseIP)
  • Configure OVN logical switches per namespace
  • Test: Create pod, verify network interface and IP allocation

Week 4: FiberLB Controller

  • Implement controller watch loop (Services, Endpoints)
  • Integrate with FiberLB gRPC API (CreateLoadBalancer, UpdatePool)
  • Implement external IP allocation from pool
  • Test: Expose service as LoadBalancer, verify external IP and routing

Week 5: Basic RBAC and Multi-Tenancy

  • Implement namespace-per-project provisioning
  • Deploy default RBAC roles and bindings
  • Test: Create multiple projects, verify isolation

Deliverables:

  • Functional k3s cluster with IAM authentication
  • Pod networking via PrismNET
  • LoadBalancer services via FiberLB
  • Multi-tenant namespaces with RBAC

Phase 2: Storage & DNS (5-6 weeks)

Week 6-7: LightningStor CSI Driver

  • Implement CSI Controller Service (CreateVolume, DeleteVolume, ControllerPublishVolume)
  • Implement CSI Node Service (NodeStageVolume, NodePublishVolume)
  • Integrate with LightningStor gRPC API
  • Deploy CSI driver as pods (controller + node DaemonSet)
  • Create StorageClasses for SSD and HDD
  • Test: Create PVC, attach to pod, write/read data

Week 8: FlashDNS Controller

  • Implement controller watch loop (Services, Pods)
  • Integrate with FlashDNS gRPC API (CreateRecord, UpdateRecord)
  • Generate DNS records (A, SRV) for services and pods
  • Configure kubelet DNS settings
  • Test: Resolve service DNS from pod, verify DNS updates

Week 9: Network Policy Support

  • Extend PrismNET CNI with NetworkPolicy controller
  • Translate K8s NetworkPolicy to OVN ACLs
  • Implement address sets for pod label selectors
  • Test: Create NetworkPolicy, verify ingress/egress enforcement

Week 10-11: Integration Testing

  • Write integration test suite (pod, service, PVC, DNS)
  • Test multi-tier application deployment
  • Performance testing (pod creation time, network throughput)
  • Fix bugs and optimize

Deliverables:

  • Persistent storage via LightningStor CSI
  • Service discovery via FlashDNS
  • Network policies enforced by PrismNET
  • Comprehensive integration tests

Phase 3: Advanced Features (Post-MVP, 6-8 weeks)

StatefulSets:

  • Verify StatefulSet controller functionality (built into k3s)
  • Test with headless services and volumeClaimTemplates
  • Example: Deploy Cassandra or Kafka cluster
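A minimal manifest pair for this verification step (image, replica count, sizes, and the SSD StorageClass name are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: cassandra
  namespace: project-alpha
spec:
  clusterIP: None        # headless: per-pod DNS records
  selector:
    app: cassandra
  ports:
    - port: 9042
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: cassandra
  namespace: project-alpha
spec:
  serviceName: cassandra
  replicas: 3
  selector:
    matchLabels:
      app: cassandra
  template:
    metadata:
      labels:
        app: cassandra
    spec:
      containers:
        - name: cassandra
          image: cassandra:4.1
          volumeMounts:
            - name: data
              mountPath: /var/lib/cassandra
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: lightningstor-ssd   # assumed class name
        resources:
          requests:
            storage: 50Gi
```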

PlasmaVMC CRI Integration:

  • Implement CRI server in PlasmaVMC (Rust)
  • Create Firecracker microVM per pod
  • Test pod lifecycle (create, start, stop, delete)
  • Performance benchmarking (startup time, resource overhead)

FlareDB as k3s Datastore:

  • Investigate etcd API compatibility layer for FlareDB
  • Implement etcd v3 gRPC API shim
  • Test k3s with FlareDB backend
  • Benchmarking and stability testing
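Much of the shim's work is mapping etcd v3 range reads onto the backing store: k3s lists keys by prefix, which etcd clients express as a `Range` over `[key, rangeEnd)` with `rangeEnd` set to the prefix with its last byte incremented. A self-contained sketch of those semantics (an in-memory map stands in for FlareDB; revisions and watch are omitted):

```go
package main

import (
	"fmt"
	"sort"
)

// kvStore sketches the subset of etcd v3 KV semantics the shim must
// map onto FlareDB: Put, and Range over [key, rangeEnd).
type kvStore struct {
	data map[string]string
}

func newKVStore() *kvStore { return &kvStore{data: map[string]string{}} }

func (s *kvStore) Put(key, val string) { s.data[key] = val }

// Range returns keys in [key, rangeEnd) in lexical order; an empty
// rangeEnd means a single-key lookup, as in the etcd v3 API.
func (s *kvStore) Range(key, rangeEnd string) []string {
	var keys []string
	for k := range s.data {
		if rangeEnd == "" {
			if k == key {
				keys = append(keys, k)
			}
		} else if k >= key && k < rangeEnd {
			keys = append(keys, k)
		}
	}
	sort.Strings(keys)
	return keys
}

// PrefixEnd computes the rangeEnd that turns Range into a prefix scan.
func PrefixEnd(prefix string) string {
	b := []byte(prefix)
	b[len(b)-1]++
	return string(b)
}

func main() {
	s := newKVStore()
	s.Put("/registry/pods/ns1/a", "{}")
	s.Put("/registry/pods/ns1/b", "{}")
	s.Put("/registry/services/ns1/web", "{}")
	fmt.Println(s.Range("/registry/pods/", PrefixEnd("/registry/pods/")))
}
```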

Autoscaling:

  • Deploy metrics-server
  • Configure HorizontalPodAutoscaler (controller built into k3s)
  • Test autoscaling based on CPU/memory metrics
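The test could drive a standard autoscaling/v2 HPA such as the following (target Deployment, replica bounds, and utilization threshold are examples):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
  namespace: project-alpha
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```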

Ingress (L7 LoadBalancer):

  • Implement Ingress controller using FiberLB L7 capabilities
  • Support host-based and path-based routing
  • Support TLS termination
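A representative Ingress the controller would translate into FiberLB virtual hosts and path rules (the `fiberlb` class name, hostnames, and TLS secret are assumptions):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web
  namespace: project-alpha
spec:
  ingressClassName: fiberlb
  tls:
    - hosts: ["app.example.com"]
      secretName: app-example-tls    # terminated at the load balancer
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /api
            pathType: Prefix
            backend:
              service:
                name: api
                port:
                  number: 8080
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web
                port:
                  number: 80
```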

Success Criteria

Functional Requirements:

  1. Deploy pods, services, deployments using kubectl
  2. LoadBalancer services receive external IPs from FiberLB
  3. PersistentVolumes provisioned from LightningStor and mounted to pods
  4. DNS resolution works for services and pods (via FlashDNS)
  5. Multi-tenant isolation enforced (namespaces, RBAC, network policies)
  6. IAM authentication and RBAC functional (token validation, user/group mapping)
  7. E2E test passes (multi-tier application deployment)

Performance Requirements:

  1. Pod creation time: <10 seconds (from API call to running state)
  2. Service LoadBalancer IP allocation: <5 seconds
  3. PersistentVolume provisioning: <30 seconds
  4. DNS record updates: <10 seconds (after service creation)
  5. Support 100+ pods per cluster
  6. Support 10+ concurrent namespaces

Operational Requirements:

  1. NixOS module for declarative deployment
  2. Cluster upgrade path (k3s version upgrades)
  3. Backup and restore procedures (etcd snapshots)
  4. Monitoring and alerting integration (Prometheus, Grafana)
  5. Logging aggregation (Fluent Bit → centralized log storage)

Next Steps (S3-S6)

S3: Workspace Scaffold

  • Create k8shost/ workspace directory structure
  • Set up Go module for controllers (FiberLB, FlashDNS)
  • Set up Rust workspace for CNI plugin
  • Set up Go module for CSI driver
  • Create NixOS module skeleton

Directory Structure:

k8shost/
├── controllers/          # Go: FiberLB, FlashDNS, IAM webhook
│   ├── fiberlb/
│   ├── flashdns/
│   ├── iamwebhook/
│   └── main.go
├── cni/                  # Rust: PrismNET CNI plugin
│   ├── src/
│   └── Cargo.toml
├── csi/                  # Go: LightningStor CSI driver
│   ├── controller/
│   ├── node/
│   └── main.go
├── nix/
│   └── modules/
│       └── k8shost.nix
└── tests/
    ├── integration/
    └── e2e/

S4: Controllers Implementation

  • Implement FiberLB controller (Service watch, gRPC integration)
  • Implement FlashDNS controller (Service/Pod watch, DNS record sync)
  • Implement IAM webhook server (TokenReview API, IAM validation)
  • Unit tests for each controller

S5: CNI + CSI Implementation

  • Implement PrismNET CNI plugin (ADD/DEL/CHECK, OVN integration)
  • Implement LightningStor CSI driver (Controller and Node services)
  • Deploy CSI driver as pods (Deployment + DaemonSet)
  • Unit tests for CNI and CSI

S6: Integration Testing

  • Set up integration test environment (k3s cluster + mock services)
  • Write integration tests (pod, service, PVC, DNS, multi-tenant)
  • Write E2E test (multi-tier application)
  • CI/CD pipeline for automated testing

Document Version: 1.0
Last Updated: 2025-12-09
Authors: PlasmaCloud Platform Team
Status: Draft for Review