K8s Hosting Specification
Overview
PlasmaCloud's K8s Hosting service provides managed Kubernetes clusters for multi-tenant container orchestration. This specification defines a k3s-based architecture that integrates deeply with existing PlasmaCloud infrastructure components: NovaNET for networking, FiberLB for load balancing, IAM for authentication/authorization, FlashDNS for service discovery, and LightningStor for persistent storage.
Purpose
Enable customers to deploy and manage containerized workloads using standard Kubernetes APIs while benefiting from PlasmaCloud's integrated infrastructure services. The system provides:
- Standard K8s API compatibility: Use kubectl, Helm, and existing K8s tooling
- Multi-tenant isolation: Project-based namespaces with IAM-backed RBAC
- Deep integration: Leverage NovaNET SDN, FiberLB load balancing, LightningStor block storage
- Production-ready: HA control plane, automated failover, comprehensive monitoring
Scope
Phase 1 (MVP, 3-4 months):
- Core K8s APIs (Pods, Services, Deployments, ReplicaSets, Namespaces, ConfigMaps, Secrets)
- LoadBalancer services via FiberLB
- Persistent storage via LightningStor CSI
- IAM authentication and RBAC
- NovaNET CNI for pod networking
- FlashDNS service discovery
Future Phases:
- PlasmaVMC integration for VM-backed pods (enhanced isolation)
- StatefulSets, DaemonSets, Jobs/CronJobs
- Network policies with NovaNET enforcement
- Horizontal Pod Autoscaler
- FlareDB as k3s datastore
Architecture Decision Summary
Base Technology: k3s
- Lightweight K8s distribution (single binary, minimal dependencies)
- Production-proven (CNCF certified, widely deployed)
- Flexible architecture allowing component replacement
- Embedded SQLite (single-server) or etcd (HA cluster)
- 3-4 month timeline achievable
Component Replacement Strategy:
- Disable: servicelb (replaced by FiberLB), traefik (use FiberLB), flannel (replaced by NovaNET)
- Keep: kube-apiserver, kube-scheduler, kube-controller-manager, kubelet, containerd
- Add: Custom controllers for FiberLB, FlashDNS, IAM webhook, LightningStor CSI, NovaNET CNI
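As a sketch, the replacement strategy above maps to a k3s server invocation along these lines (flag spellings per upstream k3s, where the local-path-provisioner addon is named local-storage; verify against the deployed k3s release):

```shell
# Disable the stock k3s components that PlasmaCloud replaces,
# and wire the IAM authentication webhook into kube-apiserver.
k3s server \
  --disable servicelb \
  --disable traefik \
  --disable local-storage \
  --flannel-backend=none \
  --kube-apiserver-arg=authentication-token-webhook-config-file=/etc/k8shost/iam-webhook.yaml
```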
Architecture
Base: k3s with Selective Component Replacement
k3s Core (Keep):
- kube-apiserver: K8s REST API server with IAM webhook authentication
- kube-scheduler: Pod scheduling with resource awareness
- kube-controller-manager: Core controllers (replication, endpoints, service accounts, etc.)
- kubelet: Node agent managing pod lifecycle via containerd CRI
- containerd: Container runtime (Phase 1), later replaceable by PlasmaVMC CRI
- kube-proxy: Service networking (iptables/ipvs mode)
k3s Components (Disable):
- servicelb: Default LoadBalancer implementation → Replaced by FiberLB controller
- traefik: Ingress controller → Replaced by FiberLB L7 capabilities
- flannel: CNI plugin → Replaced by NovaNET CNI
- local-path-provisioner: Storage provisioner → Replaced by LightningStor CSI
PlasmaCloud Custom Components (Add):
- NovaNET CNI Plugin: Pod networking via OVN logical switches
- FiberLB Controller: LoadBalancer service reconciliation
- IAM Webhook Server: Token validation and user mapping
- FlashDNS Controller: Service DNS record synchronization
- LightningStor CSI Driver: PersistentVolume provisioning and attachment
Component Topology
┌─────────────────────────────────────────────────────────────┐
│ k3s Control Plane │
│ ┌──────────────┐ ┌────────────┐ ┌──────────────────┐ │
│ │ kube-apiserver│◄─┤ IAM Webhook├──┤ IAM Service │ │
│ │ │ │ │ │ (Authentication) │ │
│ └──────┬───────┘ └────────────┘ └──────────────────┘ │
│ │ │
│ ┌──────▼───────┐ ┌──────────────┐ ┌────────────────┐ │
│ │kube-scheduler│ │kube-controller│ │ etcd/SQLite │ │
│ │ │ │ -manager │ │ (Datastore) │ │
│ └──────────────┘ └──────────────┘ └────────────────┘ │
└─────────────────────────────────────────────────────────────┘
│
┌──────────────────┼──────────────────┐
│ │ │
┌───────▼───────┐ ┌───────▼───────┐ ┌──────▼──────┐
│ FiberLB │ │ FlashDNS │ │ LightningStor│
│ Controller │ │ Controller │ │ CSI Plugin │
│ (Watch Svcs) │ │ (Sync DNS) │ │ (Provision) │
└───────┬───────┘ └───────┬───────┘ └──────┬───────┘
│ │ │
▼ ▼ ▼
┌──────────────┐ ┌──────────────┐ ┌────────────────┐
│ FiberLB │ │ FlashDNS │ │ LightningStor │
│ gRPC API │ │ gRPC API │ │ gRPC API │
└──────────────┘ └──────────────┘ └────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ k3s Worker Nodes │
│ ┌──────────────┐ ┌────────────┐ ┌──────────────────┐ │
│ │ kubelet │◄─┤containerd ├──┤ Pods (containers)│ │
│ │ │ │ CRI │ │ │ │
│ └──────┬───────┘ └────────────┘ └──────────────────┘ │
│ │ │
│ ┌──────▼───────┐ ┌──────────────┐ │
│ │ NovaNET CNI │◄─┤ kube-proxy │ │
│ │ (Pod Network)│ │ (Service Net)│ │
│ └──────┬───────┘ └──────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────┐ │
│ │ NovaNET OVN │ │
│ │ (ovs-vswitchd)│ │
│ └──────────────┘ │
└─────────────────────────────────────────────────────────────┘
Data Flow Examples
1. Pod Creation:
kubectl create pod → kube-apiserver (IAM auth) → scheduler → kubelet → containerd
↓
NovaNET CNI
↓
OVN logical port
2. LoadBalancer Service:
kubectl expose → kube-apiserver → Service created → FiberLB controller watches
↓
FiberLB gRPC API
↓
External IP + L4 forwarding
3. PersistentVolume:
PVC created → kube-apiserver → CSI controller → LightningStor CSI driver
↓
LightningStor gRPC
↓
Volume created
↓
kubelet → CSI node plugin
↓
Mount to pod
K8s API Subset
Phase 1: Core APIs (Essential)
Pods (v1):
- Full CRUD operations (create, get, list, update, delete, patch)
- Watch API for real-time updates
- Logs streaming (kubectl logs -f)
- Exec into containers (kubectl exec)
- Port forwarding (kubectl port-forward)
- Status: Phase (Pending, Running, Succeeded, Failed), conditions, container states
Services (v1):
- ClusterIP: Internal cluster networking (default)
- LoadBalancer: External access via FiberLB
- Headless: StatefulSet support (clusterIP: None)
- Service discovery via FlashDNS
- Endpoint slices for large service backends
Deployments (apps/v1):
- Declarative desired state (replicas, pod template)
- Rolling updates with configurable strategy (maxSurge, maxUnavailable)
- Rollback to previous revision
- Pause/resume for canary deployments
- Scaling (manual in Phase 1)
ReplicaSets (apps/v1):
- Pod replication with label selectors
- Owned by Deployments (rarely created directly)
- Orphan/adopt pod ownership
Namespaces (v1):
- Tenant isolation (one namespace per project)
- Resource quota enforcement
- Network policy scope (Phase 2)
- RBAC scope
ConfigMaps (v1):
- Non-sensitive configuration data
- Mount as volumes or environment variables
- Update triggers pod restarts (via annotation)
Secrets (v1):
- Sensitive data (passwords, tokens, certificates)
- Base64 encoded in etcd (at-rest encryption in future phase)
- Mount as volumes or environment variables
- Service account tokens
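To make the "base64 encoded, not encrypted" point concrete, this sketch reproduces what kubectl does with string values before writing a Secret's data field (helper name is illustrative):

```python
import base64

def encode_secret_data(data: dict) -> dict:
    """Base64-encode each string value for a Secret's `data` field.
    This is encoding, not encryption: anyone with etcd access can
    decode it, hence the future at-rest encryption phase."""
    return {k: base64.b64encode(v.encode()).decode() for k, v in data.items()}
```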
Nodes (v1):
- Node registration via kubelet
- Heartbeat and status reporting
- Capacity and allocatable resources
- Labels and taints for scheduling
Events (v1):
- Audit trail of cluster activities
- Retention policy (1 hour in-memory, longer in etcd)
- Debugging and troubleshooting
Phase 2: Storage & Config (Required for MVP)
PersistentVolumes (v1):
- Volume lifecycle independent of pods
- Access modes: ReadWriteOnce, ReadOnlyMany, ReadWriteMany (LightningStor support)
- Reclaim policy: Retain, Delete
- Status: Available, Bound, Released, Failed
PersistentVolumeClaims (v1):
- User request for storage
- Binding to PVs by storage class, capacity, access mode
- Volume expansion (if storage class allows)
StorageClasses (storage.k8s.io/v1):
- Dynamic provisioning via LightningStor CSI
- Parameters: volume type (ssd, hdd), replication factor, org_id, project_id
- Volume binding mode: Immediate or WaitForFirstConsumer
Phase 3: Advanced (Post-MVP)
StatefulSets (apps/v1):
- Ordered pod creation/deletion
- Stable network identities (pod-0, pod-1, ...)
- Persistent storage per pod via volumeClaimTemplates
- Use case: Databases, distributed systems
DaemonSets (apps/v1):
- One pod per node (e.g., log collectors, monitoring agents)
- Node selector and tolerations
Jobs (batch/v1):
- Run-to-completion workloads
- Parallelism and completions
- Retry policy
CronJobs (batch/v1):
- Scheduled jobs (cron syntax)
- Concurrency policy
NetworkPolicies (networking.k8s.io/v1):
- Ingress and egress rules
- Label-based pod selection
- Namespace selectors
- Requires NovaNET CNI support for OVN ACL translation
Ingress (networking.k8s.io/v1):
- HTTP/HTTPS routing via FiberLB L7
- Host-based and path-based routing
- TLS termination
Deferred APIs (Not in MVP)
- HorizontalPodAutoscaler (autoscaling/v2): Requires metrics-server
- VerticalPodAutoscaler: Complex, low priority
- PodDisruptionBudget: Useful for HA, but post-MVP
- LimitRange: Resource limits per namespace (future)
- ResourceQuota: Supported in Phase 1, but advanced features deferred
- CustomResourceDefinitions (CRDs): Framework exists, but no custom resources in Phase 1
- APIService: Aggregation layer not needed initially
Integration Specifications
1. NovaNET CNI Plugin
Purpose: Provide pod networking using NovaNET's OVN-based SDN.
Interface: CNI 1.0.0 specification (https://github.com/containernetworking/cni/blob/main/SPEC.md)
Components:
- CNI binary: /opt/cni/bin/novanet
- Configuration: /etc/cni/net.d/10-novanet.conflist
- IPAM plugin: /opt/cni/bin/novanet-ipam (or integrated)
Responsibilities:
- Create network interface for pod (veth pair)
- Allocate IP address from namespace-specific subnet
- Connect pod to OVN logical switch
- Configure routing for pod egress
- Enforce network policies (Phase 2)
Configuration Schema:
{
"cniVersion": "1.0.0",
"name": "novanet",
"type": "novanet",
"ipam": {
"type": "novanet-ipam",
"subnet": "10.244.0.0/16",
"rangeStart": "10.244.0.10",
"rangeEnd": "10.244.255.254",
"routes": [
{"dst": "0.0.0.0/0"}
],
"gateway": "10.244.0.1"
},
"ovn": {
"northbound": "tcp:novanet-server:6641",
"southbound": "tcp:novanet-server:6642",
"encapType": "geneve"
},
"mtu": 1400,
"novanetEndpoint": "novanet-server:5000"
}
CNI Plugin Workflow:
1. ADD Command (pod creation):
   Input: Container ID, network namespace path, interface name
   Process:
   - Call NovaNET gRPC API: AllocateIP(namespace, pod_name)
   - Create veth pair: one end in pod netns, one in host
   - Add host veth to OVN logical switch port
   - Configure pod veth: IP address, routes, MTU
   - Return: IP config, routes, DNS settings
2. DEL Command (pod deletion):
   Input: Container ID, network namespace path
   Process:
   - Call NovaNET gRPC API: ReleaseIP(namespace, pod_name)
   - Delete OVN logical switch port
   - Delete veth pair
3. CHECK Command (health check):
   Verify interface exists and has expected configuration
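The last step of ADD is printing a result document on stdout. A minimal sketch of assembling that result, with field names following the CNI 1.0.0 result schema (the helper and its inputs are illustrative, fed from the hypothetical AllocateIP response):

```python
import json

def cni_add_result(ip_cidr, gateway, dns_servers, netns_path, ifname="eth0"):
    """Assemble the CNI 1.0.0 ADD result printed to stdout after the
    veth pair is wired up and AllocateIP has returned."""
    return {
        "cniVersion": "1.0.0",
        "interfaces": [{"name": ifname, "sandbox": netns_path}],
        "ips": [{"address": ip_cidr, "gateway": gateway, "interface": 0}],
        "routes": [{"dst": "0.0.0.0/0"}],
        "dns": {"nameservers": dns_servers},
    }

# The plugin would finish with: print(json.dumps(result))
```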
API Integration (NovaNET gRPC):
service NetworkService {
rpc AllocateIP(AllocateIPRequest) returns (AllocateIPResponse);
rpc ReleaseIP(ReleaseIPRequest) returns (ReleaseIPResponse);
rpc CreateLogicalSwitch(CreateLogicalSwitchRequest) returns (CreateLogicalSwitchResponse);
}
message AllocateIPRequest {
string namespace = 1;
string pod_name = 2;
string container_id = 3;
}
message AllocateIPResponse {
string ip_address = 1; // e.g., "10.244.1.5/24"
string gateway = 2;
repeated string dns_servers = 3;
}
OVN Topology:
- Logical Switch per Namespace: k8s-<namespace> (e.g., k8s-project-123)
- Logical Router: k8s-cluster-router for inter-namespace routing
- Logical Switch Ports: One per pod (<pod-name>-<container-id>)
- ACLs: NetworkPolicy enforcement (Phase 2)
Network Policy Translation (Phase 2):
K8s NetworkPolicy:
podSelector: app=web
ingress:
- from:
- podSelector: app=frontend
ports:
- protocol: TCP
port: 80
→ OVN ACL:
direction: to-lport
match: "ip4.src == $frontend_pods && tcp.dst == 80"
action: allow-related
priority: 1000
Address Sets:
- Dynamic updates as pods are added/removed
- Efficient ACL matching for large pod groups
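The NetworkPolicy-to-ACL translation above is mostly string assembly. A sketch of the ingress-rule case, assuming an OVN address set already exists per selected pod group (function names are illustrative, not an existing API):

```python
def translate_ingress_rule(from_address_set: str, protocol: str, port: int) -> dict:
    """Translate one K8s NetworkPolicy ingress rule into an OVN ACL,
    matching the to-lport/allow-related example above.
    `from_address_set` is the OVN address set holding the source pods' IPs."""
    return {
        "direction": "to-lport",
        "match": f"ip4.src == ${from_address_set} && {protocol.lower()}.dst == {port}",
        "action": "allow-related",
        "priority": 1000,
    }
```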
2. FiberLB LoadBalancer Controller
Purpose: Reconcile K8s Services of type LoadBalancer with FiberLB resources.
Architecture:
- Controller Process: Runs as a pod in the kube-system namespace or embedded in the k3s server
- Watch Resources: Services (type=LoadBalancer), Endpoints
- Manage Resources: FiberLB LoadBalancers, Listeners, Pools, Members
Controller Logic:
1. Service Watch Loop:
for event := range serviceWatcher {
if event.Type == Created || event.Type == Updated {
if service.Spec.Type == "LoadBalancer" {
reconcileLoadBalancer(service)
}
} else if event.Type == Deleted {
deleteLoadBalancer(service)
}
}
2. Reconcile Logic:
Input: Service object
Process:
1. Check if FiberLB LoadBalancer exists (by annotation or name mapping)
2. If not exists:
a. Allocate external IP from pool
b. Create FiberLB LoadBalancer resource (gRPC CreateLoadBalancer)
c. Store LoadBalancer ID in service annotation
3. For each service.Spec.Ports:
a. Create/update FiberLB Listener (protocol, port, algorithm)
4. Get service endpoints:
a. Create/update FiberLB Pool with backend members (pod IPs, ports)
5. Update service.Status.LoadBalancer.Ingress with external IP
6. If service spec changed:
a. Update FiberLB resources accordingly
3. Endpoint Watch Loop:
for event := range endpointWatcher {
service := getServiceForEndpoint(event.Object)
if service.Spec.Type == "LoadBalancer" {
updateLoadBalancerPool(service, event.Object)
}
}
Configuration:
- External IP Pool: --external-ip-pool=192.168.100.0/24 (CIDR or IP range)
- FiberLB Endpoint: --fiberlb-endpoint=fiberlb-server:7000 (gRPC address)
- IP Allocation: First-available or integration with IPAM service
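First-available allocation from the external IP pool can be sketched with the standard library alone (in-memory `in_use` set stands in for whatever state store the controller uses):

```python
import ipaddress

def allocate_external_ip(pool_cidr: str, in_use: set) -> str:
    """Return the first host address in the pool not already allocated.
    Network and broadcast addresses are skipped by hosts(). Exhaustion
    raises; the controller would surface that as an event on the Service."""
    net = ipaddress.ip_network(pool_cidr)
    for ip in net.hosts():
        if str(ip) not in in_use:
            return str(ip)
    raise RuntimeError(f"external IP pool {pool_cidr} exhausted")
```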
Service Annotations:
apiVersion: v1
kind: Service
metadata:
name: web-service
annotations:
fiberlb.plasmacloud.io/load-balancer-id: "lb-abc123"
fiberlb.plasmacloud.io/algorithm: "round-robin" # round-robin | least-conn | ip-hash
fiberlb.plasmacloud.io/health-check-path: "/health"
fiberlb.plasmacloud.io/health-check-interval: "10s"
fiberlb.plasmacloud.io/health-check-timeout: "5s"
fiberlb.plasmacloud.io/health-check-retries: "3"
fiberlb.plasmacloud.io/session-affinity: "client-ip" # For sticky sessions
spec:
type: LoadBalancer
selector:
app: web
ports:
- protocol: TCP
port: 80
targetPort: 8080
status:
loadBalancer:
ingress:
- ip: 192.168.100.50
FiberLB gRPC API Integration:
service LoadBalancerService {
rpc CreateLoadBalancer(CreateLoadBalancerRequest) returns (LoadBalancer);
rpc UpdateLoadBalancer(UpdateLoadBalancerRequest) returns (LoadBalancer);
rpc DeleteLoadBalancer(DeleteLoadBalancerRequest) returns (Empty);
rpc CreateListener(CreateListenerRequest) returns (Listener);
rpc UpdatePool(UpdatePoolRequest) returns (Pool);
}
message CreateLoadBalancerRequest {
string name = 1;
string description = 2;
string external_ip = 3; // If empty, allocate from pool
string org_id = 4;
string project_id = 5;
}
message CreateListenerRequest {
string load_balancer_id = 1;
string protocol = 2; // TCP, UDP, HTTP, HTTPS
int32 port = 3;
string default_pool_id = 4;
HealthCheck health_check = 5;
}
message UpdatePoolRequest {
string pool_id = 1;
repeated PoolMember members = 2;
string algorithm = 3;
}
message PoolMember {
string address = 1; // Pod IP
int32 port = 2;
int32 weight = 3;
}
Health Checks:
- HTTP health checks: Use the health-check-path annotation
- TCP health checks: Connection-based for non-HTTP services
- Health check failures remove pod from pool (auto-healing)
Edge Cases:
- Service deletion: Controller must clean up FiberLB resources and release external IP
- Endpoint churn: Debounce pool updates to avoid excessive FiberLB API calls
- IP exhaustion: Return error event on service, set status condition
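The endpoint-churn mitigation can be sketched as a coalescing debouncer: rapid Endpoint events for the same service overwrite each other, and only the latest member list reaches FiberLB (class and parameter names are illustrative; `flush_fn` stands in for the gRPC UpdatePool call):

```python
import threading

class PoolUpdateDebouncer:
    """Coalesce bursts of Endpoint updates into one FiberLB pool update
    per service, issued at most once per `delay` seconds."""
    def __init__(self, flush_fn, delay=0.5):
        self.flush_fn = flush_fn
        self.delay = delay
        self._lock = threading.Lock()
        self._pending = {}   # service key -> latest member list
        self._timer = None

    def update(self, service_key, members):
        with self._lock:
            self._pending[service_key] = members  # later events win
            if self._timer is None:
                self._timer = threading.Timer(self.delay, self._flush)
                self._timer.start()

    def _flush(self):
        with self._lock:
            pending, self._pending = self._pending, {}
            self._timer = None
        for key, members in pending.items():
            self.flush_fn(key, members)
```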
3. IAM Authentication Webhook
Purpose: Authenticate K8s API requests using PlasmaCloud IAM tokens.
Architecture:
- Webhook Server: HTTPS endpoint (can be part of IAM service or standalone)
- Integration Point: kube-apiserver --authentication-token-webhook-config-file
- Protocol: K8s TokenReview API
Webhook Endpoint: POST /apis/iam.plasmacloud.io/v1/authenticate
Request Flow:
kubectl --token=<IAM-token> get pods
↓
kube-apiserver extracts Bearer token
↓
POST /apis/iam.plasmacloud.io/v1/authenticate
body: TokenReview with token
↓
IAM webhook validates token
↓
Response: authenticated=true, user info, groups
↓
kube-apiserver proceeds with RBAC authorization
Request Schema (from kube-apiserver):
{
"apiVersion": "authentication.k8s.io/v1",
"kind": "TokenReview",
"spec": {
"token": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9..."
}
}
Response Schema (from IAM webhook):
{
"apiVersion": "authentication.k8s.io/v1",
"kind": "TokenReview",
"status": {
"authenticated": true,
"user": {
"username": "user@example.com",
"uid": "user-550e8400-e29b-41d4-a716-446655440000",
"groups": [
"org:org-123",
"project:proj-456",
"system:authenticated"
],
"extra": {
"org_id": ["org-123"],
"project_id": ["proj-456"],
"roles": ["org_admin"]
}
}
}
}
Error Response (invalid token):
{
"apiVersion": "authentication.k8s.io/v1",
"kind": "TokenReview",
"status": {
"authenticated": false,
"error": "Invalid or expired token"
}
}
IAM Token Format:
- JWT: Signed by IAM service with shared secret or public/private key
- Claims: sub (user ID), email, org_id, project_id, roles, exp (expiration)
- Example:
{
  "sub": "user-550e8400-e29b-41d4-a716-446655440000",
  "email": "user@example.com",
  "org_id": "org-123",
  "project_id": "proj-456",
  "roles": ["org_admin", "project_member"],
  "exp": 1672531200
}
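Since the token is a plain JWT, the shared-secret (HS256) variant of signing and validation can be sketched with the standard library; the asymmetric-key variant differs only in the signing primitive. Function names are illustrative, not the IAM service's actual API:

```python
import base64, hashlib, hmac, json, time

def _b64url(data: bytes) -> bytes:
    return base64.urlsafe_b64encode(data).rstrip(b"=")

def sign_jwt(claims: dict, secret: bytes) -> str:
    """Mint an HS256 JWT carrying the claim set shown above."""
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url(json.dumps(claims).encode())
    signing_input = header + b"." + payload
    sig = _b64url(hmac.new(secret, signing_input, hashlib.sha256).digest())
    return (signing_input + b"." + sig).decode()

def verify_jwt(token: str, secret: bytes) -> dict:
    """What the webhook does before answering authenticated=true:
    check the signature, then the exp claim. Raises ValueError on failure."""
    head_b64, payload_b64, sig_b64 = token.encode().split(b".")
    signing_input = head_b64 + b"." + payload_b64
    expected = _b64url(hmac.new(secret, signing_input, hashlib.sha256).digest())
    if not hmac.compare_digest(expected, sig_b64):
        raise ValueError("invalid signature")
    claims = json.loads(base64.urlsafe_b64decode(payload_b64 + b"=" * (-len(payload_b64) % 4)))
    if claims["exp"] < time.time():
        raise ValueError("token expired")
    return claims
```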
User/Group Mapping:
| IAM Principal | K8s Username | K8s Groups |
|---|---|---|
| User (email) | user@example.com | org:<org_id>, project:<project_id>, system:authenticated |
| User (ID) | user-<uid> | org:<org_id>, project:<project_id>, system:authenticated |
| Service Account | sa-<name>@<project_id> | org:<org_id>, project:<project_id>, system:serviceaccounts |
| Org Admin | admin@example.com | org:<org_id>, project:<all_projects>, k8s:org-admin |
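A sketch of the mapping for the regular-user row, turning validated claims into the TokenReview user info (service accounts and org admins would add their own groups; the function name is illustrative):

```python
def map_claims_to_k8s_user(claims: dict) -> dict:
    """Build the TokenReview `status.user` object from validated IAM
    claims, per the mapping table above."""
    return {
        "username": claims["email"],
        "uid": claims["sub"],
        "groups": [
            f"org:{claims['org_id']}",
            f"project:{claims['project_id']}",
            "system:authenticated",
        ],
        "extra": {
            "org_id": [claims["org_id"]],
            "project_id": [claims["project_id"]],
            "roles": claims.get("roles", []),
        },
    }
```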
RBAC Integration:
- Groups are used in RoleBindings and ClusterRoleBindings
- Example: the org:org-123 group gets admin access to all project-* namespaces for that org
Webhook Configuration File (/etc/k8shost/iam-webhook.yaml):
apiVersion: v1
kind: Config
clusters:
- name: iam-webhook
cluster:
server: https://iam-server:3000/apis/iam.plasmacloud.io/v1/authenticate
certificate-authority: /etc/k8shost/ca.crt
users:
- name: k8s-apiserver
user:
client-certificate: /etc/k8shost/apiserver-client.crt
client-key: /etc/k8shost/apiserver-client.key
current-context: webhook
contexts:
- context:
cluster: iam-webhook
user: k8s-apiserver
name: webhook
Performance Considerations:
- Caching: kube-apiserver caches successful authentications (--authentication-token-webhook-cache-ttl=2m)
- Timeouts: Webhook must respond within 10s (configurable)
- Rate Limiting: IAM webhook should handle high request volume (100s of req/s)
4. FlashDNS Service Discovery Controller
Purpose: Synchronize K8s Services and Pods to FlashDNS for cluster DNS resolution.
Architecture:
- Controller Process: Runs as a pod in kube-system or embedded in the k3s server
- Watch Resources: Services, Endpoints, Pods
- Manage Resources: FlashDNS A/AAAA/SRV records
DNS Hierarchy:
- Pod A Records: <pod-ip-dashed>.pod.cluster.local → Pod IP
  - Example: 10-244-1-5.pod.cluster.local → 10.244.1.5
- Service A Records: <service>.<namespace>.svc.cluster.local → ClusterIP or external IP
  - Example: web.default.svc.cluster.local → 10.96.0.100
- Headless Service: <endpoint>.<service>.<namespace>.svc.cluster.local → Endpoint IPs
  - Example: web-0.web.default.svc.cluster.local → 10.244.1.10
- SRV Records: _<port>._<protocol>.<service>.<namespace>.svc.cluster.local
  - Example: _http._tcp.web.default.svc.cluster.local → 0 50 80 web.default.svc.cluster.local
Controller Logic:
1. Service Watch:
for event := range serviceWatcher {
    service := event.Object
    switch event.Type {
    case Created, Updated:
        if service.Spec.ClusterIP != "None" {
            // Regular service
            createOrUpdateDNSRecord(
                name: service.Name + "." + service.Namespace + ".svc.cluster.local",
                type: "A",
                value: service.Spec.ClusterIP,
            )
            if len(service.Status.LoadBalancer.Ingress) > 0 {
                // LoadBalancer service - also add external IP
                createOrUpdateDNSRecord(
                    name: service.Name + "." + service.Namespace + ".svc.cluster.local",
                    type: "A",
                    value: service.Status.LoadBalancer.Ingress[0].IP,
                )
            }
        } else {
            // Headless service - add endpoint records
            endpoints := getEndpoints(service)
            for _, ep := range endpoints {
                createOrUpdateDNSRecord(
                    name: ep.Hostname + "." + service.Name + "." + service.Namespace + ".svc.cluster.local",
                    type: "A",
                    value: ep.IP,
                )
            }
        }
        // Create SRV records for each port
        for _, port := range service.Spec.Ports {
            createSRVRecord(service, port)
        }
    case Deleted:
        deleteDNSRecords(service)
    }
}
2. Pod Watch (for pod DNS):
for event := range podWatcher {
    pod := event.Object
    switch event.Type {
    case Created, Updated:
        if pod.Status.PodIP != "" {
            dashedIP := strings.ReplaceAll(pod.Status.PodIP, ".", "-")
            createOrUpdateDNSRecord(
                name: dashedIP + ".pod.cluster.local",
                type: "A",
                value: pod.Status.PodIP,
            )
        }
    case Deleted:
        deleteDNSRecord(pod)
    }
}
FlashDNS gRPC API Integration:
service DNSService {
rpc CreateRecord(CreateRecordRequest) returns (DNSRecord);
rpc UpdateRecord(UpdateRecordRequest) returns (DNSRecord);
rpc DeleteRecord(DeleteRecordRequest) returns (Empty);
rpc ListRecords(ListRecordsRequest) returns (ListRecordsResponse);
}
message CreateRecordRequest {
string zone = 1; // "cluster.local"
string name = 2; // "web.default.svc"
string type = 3; // "A", "AAAA", "SRV", "CNAME"
string value = 4; // "10.96.0.100"
int32 ttl = 5; // 30 (seconds)
map<string, string> labels = 6; // k8s metadata
}
message DNSRecord {
string id = 1;
string zone = 2;
string name = 3;
string type = 4;
string value = 5;
int32 ttl = 6;
}
Configuration:
- FlashDNS Endpoint: --flashdns-endpoint=flashdns-server:6000
- Cluster Domain: --cluster-domain=cluster.local (default)
- Record TTL: --dns-ttl=30 (seconds, low for fast updates)
Example DNS Records:
# Regular service
web.default.svc.cluster.local. 30 IN A 10.96.0.100
# Headless service with 3 pods
web.default.svc.cluster.local. 30 IN A 10.244.1.10
web.default.svc.cluster.local. 30 IN A 10.244.1.11
web.default.svc.cluster.local. 30 IN A 10.244.1.12
# StatefulSet pods (Phase 3)
web-0.web.default.svc.cluster.local. 30 IN A 10.244.1.10
web-1.web.default.svc.cluster.local. 30 IN A 10.244.1.11
# SRV record for service port
_http._tcp.web.default.svc.cluster.local. 30 IN SRV 0 50 80 web.default.svc.cluster.local.
# Pod DNS
10-244-1-10.pod.cluster.local. 30 IN A 10.244.1.10
Integration with kubelet:
- kubelet configures pod DNS via /etc/resolv.conf:
  - nameserver: FlashDNS service IP (a well-known IP in the service CIDR, e.g., 10.96.0.10)
  - search: <namespace>.svc.cluster.local svc.cluster.local cluster.local
Edge Cases:
- Service IP change: Update DNS record atomically
- Endpoint churn: Debounce updates for headless services with many endpoints
- DNS caching: Low TTL (30s) for fast convergence
5. LightningStor CSI Driver
Purpose: Provide dynamic PersistentVolume provisioning and lifecycle management.
CSI Driver Name: stor.plasmacloud.io
Architecture:
- Controller Plugin: Runs as a StatefulSet or Deployment in kube-system
  - Provisioning, deletion, attaching, detaching, snapshots
- Node Plugin: Runs as a DaemonSet on every node
  - Staging, publishing (mounting), unpublishing, unstaging
CSI Components:
1. Controller Service (Identity, Controller RPCs):
   - CreateVolume: Provision new volume via LightningStor
   - DeleteVolume: Delete volume
   - ControllerPublishVolume: Attach volume to node
   - ControllerUnpublishVolume: Detach volume from node
   - ValidateVolumeCapabilities: Check if volume supports requested capabilities
   - ListVolumes: List all volumes
   - GetCapacity: Query available storage capacity
   - CreateSnapshot, DeleteSnapshot: Volume snapshots (Phase 2)
2. Node Service (Node RPCs):
   - NodeStageVolume: Mount volume to global staging path on node
   - NodeUnstageVolume: Unmount from staging path
   - NodePublishVolume: Bind mount from staging to pod path
   - NodeUnpublishVolume: Unmount from pod path
   - NodeGetInfo: Return node ID and topology
   - NodeGetCapabilities: Return node capabilities
CSI Driver Workflow:
Volume Provisioning:
1. User creates PVC:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: my-pvc
spec:
accessModes: [ReadWriteOnce]
resources:
requests:
storage: 10Gi
storageClassName: lightningstor-ssd
2. CSI Controller watches PVC, calls CreateVolume:
CreateVolumeRequest {
name: "pvc-550e8400-e29b-41d4-a716-446655440000"
capacity_range: { required_bytes: 10737418240 }
volume_capabilities: [{ access_mode: SINGLE_NODE_WRITER }]
parameters: {
"type": "ssd",
"replication": "3",
"org_id": "org-123",
"project_id": "proj-456"
}
}
3. CSI Controller calls LightningStor gRPC CreateVolume:
LightningStor creates volume, returns volume_id
4. CSI Controller creates PV:
apiVersion: v1
kind: PersistentVolume
metadata:
name: pvc-550e8400-e29b-41d4-a716-446655440000
spec:
capacity:
storage: 10Gi
accessModes: [ReadWriteOnce]
persistentVolumeReclaimPolicy: Delete
storageClassName: lightningstor-ssd
csi:
driver: stor.plasmacloud.io
volumeHandle: vol-abc123
fsType: ext4
5. K8s binds PVC to PV
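Step 2 requires converting the PVC's storage request into the required_bytes field of CreateVolumeRequest. A sketch covering the binary-suffix Kubernetes quantities used in this spec (a full implementation would also handle decimal suffixes like G and M):

```python
def storage_to_bytes(quantity: str) -> int:
    """Convert a PVC storage quantity like "10Gi" into bytes for
    CreateVolumeRequest.capacity_range.required_bytes."""
    suffixes = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}
    for suffix, factor in suffixes.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # bare byte count
```

For the PVC above, "10Gi" yields the 10737418240 shown in the CreateVolumeRequest.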
Volume Attachment (when pod is scheduled):
1. kube-controller-manager creates VolumeAttachment:
apiVersion: storage.k8s.io/v1
kind: VolumeAttachment
metadata:
name: csi-<hash>
spec:
attacher: stor.plasmacloud.io
nodeName: worker-1
source:
persistentVolumeName: pvc-550e8400-e29b-41d4-a716-446655440000
2. CSI Controller watches VolumeAttachment, calls ControllerPublishVolume:
ControllerPublishVolumeRequest {
volume_id: "vol-abc123"
node_id: "worker-1"
volume_capability: { access_mode: SINGLE_NODE_WRITER }
}
3. CSI Controller calls LightningStor gRPC AttachVolume:
LightningStor attaches volume to node (e.g., iSCSI target, NBD)
4. CSI Controller updates VolumeAttachment status: attached=true
Volume Mounting (on node):
1. kubelet calls CSI Node plugin: NodeStageVolume
NodeStageVolumeRequest {
volume_id: "vol-abc123"
staging_target_path: "/var/lib/kubelet/plugins/kubernetes.io/csi/stor.plasmacloud.io/<hash>/globalmount"
volume_capability: { mount: { fs_type: "ext4" } }
}
2. CSI Node plugin:
- Discovers block device (e.g., /dev/nbd0) via LightningStor
- Formats if needed: mkfs.ext4 /dev/nbd0
- Mounts to staging path: mount /dev/nbd0 <staging_target_path>
3. kubelet calls CSI Node plugin: NodePublishVolume
NodePublishVolumeRequest {
volume_id: "vol-abc123"
staging_target_path: "/var/lib/kubelet/plugins/kubernetes.io/csi/stor.plasmacloud.io/<hash>/globalmount"
target_path: "/var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~csi/pvc-<hash>/mount"
}
4. CSI Node plugin:
- Bind mount staging path to target path
- Pod can now read/write to volume
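The stage/publish steps reduce to a short command sequence on the node. A sketch that returns the commands rather than running them (illustrative helpers; a real driver would probe for an existing filesystem with blkid rather than take a flag):

```python
def node_stage_commands(device, staging_path, fs_type="ext4", formatted=False):
    """Commands NodeStageVolume runs for a block device:
    format if needed, then mount at the global staging path."""
    cmds = [["mkdir", "-p", staging_path]]
    if not formatted:
        cmds.append([f"mkfs.{fs_type}", device])
    cmds.append(["mount", "-t", fs_type, device, staging_path])
    return cmds

def node_publish_commands(staging_path, target_path):
    """NodePublishVolume is a bind mount from the staging path
    into the pod's volume directory."""
    return [["mkdir", "-p", target_path],
            ["mount", "--bind", staging_path, target_path]]
```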
LightningStor gRPC API Integration:
service VolumeService {
rpc CreateVolume(CreateVolumeRequest) returns (Volume);
rpc DeleteVolume(DeleteVolumeRequest) returns (Empty);
rpc AttachVolume(AttachVolumeRequest) returns (VolumeAttachment);
rpc DetachVolume(DetachVolumeRequest) returns (Empty);
rpc GetVolume(GetVolumeRequest) returns (Volume);
rpc ListVolumes(ListVolumesRequest) returns (ListVolumesResponse);
}
message CreateVolumeRequest {
string name = 1;
int64 size_bytes = 2;
string volume_type = 3; // "ssd", "hdd"
int32 replication_factor = 4;
string org_id = 5;
string project_id = 6;
}
message Volume {
string id = 1;
string name = 2;
int64 size_bytes = 3;
string status = 4; // "available", "in-use", "error"
string volume_type = 5;
}
message AttachVolumeRequest {
string volume_id = 1;
string node_id = 2;
string attach_mode = 3; // "read-write", "read-only"
}
message VolumeAttachment {
string id = 1;
string volume_id = 2;
string node_id = 3;
string device_path = 4; // e.g., "/dev/nbd0"
string connection_info = 5; // JSON with iSCSI target, NBD socket, etc.
}
StorageClass Examples:
# SSD storage with 3x replication
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: lightningstor-ssd
provisioner: stor.plasmacloud.io
parameters:
type: "ssd"
replication: "3"
volumeBindingMode: WaitForFirstConsumer # Topology-aware scheduling
allowVolumeExpansion: true
reclaimPolicy: Delete
---
# HDD storage with 2x replication
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: lightningstor-hdd
provisioner: stor.plasmacloud.io
parameters:
type: "hdd"
replication: "2"
volumeBindingMode: Immediate
allowVolumeExpansion: true
reclaimPolicy: Retain # Keep volume after PVC deletion
Access Modes:
- ReadWriteOnce (RWO): Single node read-write (most common)
- ReadOnlyMany (ROX): Multiple nodes read-only
- ReadWriteMany (RWX): Multiple nodes read-write (requires shared filesystem like NFS, Phase 2)
Volume Expansion (if allowVolumeExpansion: true):
1. User edits PVC: spec.resources.requests.storage: 20Gi (was 10Gi)
2. CSI Controller calls ControllerExpandVolume
3. LightningStor expands volume backend
4. CSI Node plugin calls NodeExpandVolume
5. Filesystem resize: resize2fs /dev/nbd0
6. PlasmaVMC Integration
Phase 1 (MVP): Use containerd as default CRI
- k3s ships with containerd embedded
- Standard OCI container runtime
- No changes needed for Phase 1
Phase 3 (Future): Custom CRI for VM-backed pods
Motivation:
- Enhanced Isolation: Stronger security boundary than containers
- Multi-Tenant Security: Prevent container escape attacks
- Consistent Runtime: Unify VM and container workloads on PlasmaVMC
Architecture:
- PlasmaVMC implements CRI (Container Runtime Interface)
- Each pod runs as a lightweight VM (Firecracker microVM)
- Pod containers run inside VM (still using containerd within VM)
- kubelet communicates with PlasmaVMC CRI endpoint instead of containerd
CRI Interface Implementation:
RuntimeService:
- RunPodSandbox: Create Firecracker microVM for pod
- StopPodSandbox: Stop microVM
- RemovePodSandbox: Delete microVM
- PodSandboxStatus: Query microVM status
- ListPodSandbox: List all pod microVMs
- CreateContainer: Create container inside microVM
- StartContainer, StopContainer, RemoveContainer: Container lifecycle
- ExecSync, Exec: Execute commands in container
- Attach: Attach to container stdio
ImageService:
- PullImage: Download container image (delegate to internal containerd)
- RemoveImage: Delete image
- ListImages: List cached images
- ImageStatus: Query image metadata
Implementation Strategy:
┌─────────────────────────────────────────┐
│ kubelet (k3s agent) │
└─────────────┬───────────────────────────┘
│ CRI gRPC
▼
┌─────────────────────────────────────────┐
│ PlasmaVMC CRI Server (Rust) │
│ - RunPodSandbox → Create microVM │
│ - CreateContainer → Run in VM │
└─────────────┬───────────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ Firecracker VMM (per pod) │
│ ┌───────────────────────────────────┐ │
│ │ Pod VM (minimal Linux kernel) │ │
│ │ ┌──────────────────────────────┐ │ │
│ │ │ containerd (in-VM) │ │ │
│ │ │ - Container 1 │ │ │
│ │ │ - Container 2 │ │ │
│ │ └──────────────────────────────┘ │ │
│ └───────────────────────────────────┘ │
└─────────────────────────────────────────┘
Configuration (Phase 3):
services.k8shost = {
enable = true;
cri = "plasmavmc"; # Instead of "containerd"
plasmavmc = {
endpoint = "unix:///var/run/plasmavmc/cri.sock";
vmKernel = "/var/lib/plasmavmc/vmlinux.bin";
vmRootfs = "/var/lib/plasmavmc/rootfs.ext4";
};
};
Benefits:
- Stronger isolation for untrusted workloads
- Leverage existing PlasmaVMC infrastructure
- Consistent management across VM and K8s workloads
Challenges:
- Performance overhead (microVM startup time, memory overhead)
- Image caching complexity (need containerd inside VM)
- Networking integration (CNI must configure VM network)
Decision: Defer to Phase 3, focus on standard containerd for MVP.
Multi-Tenant Model
Namespace Strategy
Principle: One K8s namespace per PlasmaCloud project.
Namespace Naming:
- Project namespaces: project-<project_id> (e.g., project-550e8400-e29b-41d4-a716-446655440000)
- Org shared namespaces (optional): org-<org_id>-shared (for shared resources like monitoring)
- System namespaces: kube-system, kube-public, kube-node-lease, default
Namespace Lifecycle:
- Created automatically when project provisions K8s cluster
- Labeled with
org_id,project_idfor RBAC and billing - Deleted when project is deleted (with grace period)
Namespace Metadata:
apiVersion: v1
kind: Namespace
metadata:
name: project-550e8400-e29b-41d4-a716-446655440000
labels:
plasmacloud.io/org-id: "org-123"
plasmacloud.io/project-id: "proj-456"
plasmacloud.io/tenant-type: "project"
annotations:
plasmacloud.io/project-name: "my-web-app"
plasmacloud.io/created-by: "user@example.com"
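The naming and labeling convention above can be captured in a small helper, as the namespace provisioner might implement it. ProjectNamespace is a hypothetical function name; the label keys are the ones defined in this spec.

```go
package main

import "fmt"

// ProjectNamespace derives the K8s namespace name and tenant labels for a
// PlasmaCloud project, following the project-<project_id> convention.
func ProjectNamespace(orgID, projectID string) (string, map[string]string) {
	name := "project-" + projectID
	labels := map[string]string{
		"plasmacloud.io/org-id":      orgID,
		"plasmacloud.io/project-id":  projectID,
		"plasmacloud.io/tenant-type": "project",
	}
	return name, labels
}

func main() {
	name, labels := ProjectNamespace("org-123", "550e8400-e29b-41d4-a716-446655440000")
	fmt.Println(name)
	fmt.Println(labels["plasmacloud.io/tenant-type"]) // project
}
```

Keeping this derivation in one place ensures RBAC, billing, and network isolation all agree on the same labels.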
RBAC Templates
Org Admin Role (full access to all project namespaces):
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: org-admin
namespace: project-550e8400-e29b-41d4-a716-446655440000
rules:
- apiGroups: ["*"]
resources: ["*"]
verbs: ["*"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: org-admin-binding
namespace: project-550e8400-e29b-41d4-a716-446655440000
subjects:
- kind: Group
name: org:org-123
apiGroup: rbac.authorization.k8s.io
roleRef:
kind: Role
name: org-admin
apiGroup: rbac.authorization.k8s.io
Project Admin Role (full access to specific project namespace):
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: project-admin
namespace: project-550e8400-e29b-41d4-a716-446655440000
rules:
- apiGroups: ["", "apps", "batch", "networking.k8s.io", "storage.k8s.io"]
resources: ["*"]
verbs: ["*"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: project-admin-binding
namespace: project-550e8400-e29b-41d4-a716-446655440000
subjects:
- kind: Group
name: project:proj-456
apiGroup: rbac.authorization.k8s.io
roleRef:
kind: Role
name: project-admin
apiGroup: rbac.authorization.k8s.io
Project Viewer Role (read-only access):
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: project-viewer
namespace: project-550e8400-e29b-41d4-a716-446655440000
rules:
- apiGroups: ["", "apps", "batch", "networking.k8s.io"]
resources: ["pods", "services", "deployments", "replicasets", "configmaps", "secrets"]
verbs: ["get", "list", "watch"]
- apiGroups: [""]
resources: ["pods/log"]
verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: project-viewer-binding
namespace: project-550e8400-e29b-41d4-a716-446655440000
subjects:
- kind: Group
name: project:proj-456:viewer
apiGroup: rbac.authorization.k8s.io
roleRef:
kind: Role
name: project-viewer
apiGroup: rbac.authorization.k8s.io
ClusterRole for Node Access (for cluster admins):
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: plasmacloud-cluster-admin
rules:
- apiGroups: [""]
resources: ["nodes", "persistentvolumes"]
verbs: ["*"]
- apiGroups: ["storage.k8s.io"]
resources: ["storageclasses"]
verbs: ["*"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: plasmacloud-cluster-admin-binding
subjects:
- kind: Group
name: system:plasmacloud-admins
apiGroup: rbac.authorization.k8s.io
roleRef:
kind: ClusterRole
name: plasmacloud-cluster-admin
apiGroup: rbac.authorization.k8s.io
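The RoleBindings above depend on a consistent group-naming convention coming out of the IAM webhook (org:<org_id>, project:<project_id>, project:<project_id>:viewer). A sketch of helpers that both the webhook and the RBAC provisioner could share (function names are illustrative):

```go
package main

import "fmt"

// Group-name helpers matching the subjects used in the RoleBindings above.
func OrgAdminGroup(orgID string) string         { return "org:" + orgID }
func ProjectAdminGroup(projectID string) string { return "project:" + projectID }
func ProjectViewerGroup(projectID string) string {
	return "project:" + projectID + ":viewer"
}

func main() {
	fmt.Println(OrgAdminGroup("org-123"))       // org:org-123
	fmt.Println(ProjectViewerGroup("proj-456")) // project:proj-456:viewer
}
```

If the webhook and the binding templates ever disagree on these strings, RBAC silently denies access, so centralizing the format is worth the few lines.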
Network Isolation
Default NetworkPolicy (deny all, except DNS):
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny-all
namespace: project-550e8400-e29b-41d4-a716-446655440000
spec:
podSelector: {} # Apply to all pods
policyTypes:
- Ingress
- Egress
egress:
- to:
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: kube-system
ports:
- protocol: UDP
port: 53 # DNS
Allow Ingress from LoadBalancer:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-loadbalancer
namespace: project-550e8400-e29b-41d4-a716-446655440000
spec:
podSelector:
matchLabels:
app: web
policyTypes:
- Ingress
ingress:
- from:
- ipBlock:
cidr: 0.0.0.0/0 # Allow from anywhere (LoadBalancer external traffic)
ports:
- protocol: TCP
port: 8080
Allow Inter-Namespace Communication (optional, for org-shared services):
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-org-shared
namespace: project-550e8400-e29b-41d4-a716-446655440000
spec:
podSelector: {}
policyTypes:
- Egress
egress:
- to:
- namespaceSelector:
matchLabels:
plasmacloud.io/org-id: "org-123"
plasmacloud.io/tenant-type: "org-shared"
NovaNET Enforcement:
- NetworkPolicies are translated to OVN ACLs by NovaNET CNI controller
- Enforced at OVN logical switch level (low-level packet filtering)
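A minimal sketch of the NetworkPolicy-to-ACL translation the NovaNET CNI controller would perform. The match-expression syntax follows OVN conventions (address sets referenced as $name), but the exact strings NovaNET emits, and the function name, are assumptions here.

```go
package main

import (
	"fmt"
	"strings"
)

// ACLFromIngressRule translates one NetworkPolicy ingress rule (a peer
// selector already resolved into an OVN address set, plus protocol/port)
// into an OVN ACL match expression. The resulting match would be attached
// to a to-lport allow ACL on the namespace's logical switch.
func ACLFromIngressRule(addressSet, protocol string, port int) string {
	proto := strings.ToLower(protocol) // OVN matches use lowercase protocol names
	return fmt.Sprintf("ip4.src == $%s && %s.dst == %d", addressSet, proto, port)
}

func main() {
	// Allow TCP/8080 from pods selected into the "web_clients" address set.
	fmt.Println(ACLFromIngressRule("web_clients", "TCP", 8080))
	// ip4.src == $web_clients && tcp.dst == 8080
}
```

The controller would additionally maintain the address sets themselves, updating them as pods matching the label selectors come and go.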
Resource Quotas
CPU and Memory Quotas:
apiVersion: v1
kind: ResourceQuota
metadata:
name: project-compute-quota
namespace: project-550e8400-e29b-41d4-a716-446655440000
spec:
hard:
requests.cpu: "10" # 10 CPU cores
requests.memory: "20Gi" # 20 GB RAM
limits.cpu: "20" # Allow bursting to 20 cores
limits.memory: "40Gi" # Allow bursting to 40 GB RAM
Storage Quotas:
apiVersion: v1
kind: ResourceQuota
metadata:
name: project-storage-quota
namespace: project-550e8400-e29b-41d4-a716-446655440000
spec:
hard:
persistentvolumeclaims: "10" # Max 10 PVCs
requests.storage: "100Gi" # Total storage requests
Object Count Quotas:
apiVersion: v1
kind: ResourceQuota
metadata:
name: project-object-quota
namespace: project-550e8400-e29b-41d4-a716-446655440000
spec:
hard:
pods: "50"
services: "20"
services.loadbalancers: "5" # Max 5 LoadBalancer services (limit external IPs)
configmaps: "50"
secrets: "50"
Quota Enforcement:
- K8s admission controller rejects resource creation exceeding quota
- User receives clear error message
- Quota usage visible via kubectl describe quota
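The admission decision described above reduces to a simple check per resource: reject if used + requested would exceed the hard limit. A sketch with quantities simplified to int64 units (e.g. millicores, bytes); QuotaCheck is a hypothetical name, not the K8s admission plugin itself.

```go
package main

import "fmt"

// QuotaCheck mimics the ResourceQuota admission decision: a request is
// rejected if used + requested would exceed the hard limit for any
// resource tracked by the quota.
func QuotaCheck(hard, used, requested map[string]int64) error {
	for res, req := range requested {
		limit, ok := hard[res]
		if !ok {
			continue // resource not governed by this quota
		}
		if used[res]+req > limit {
			return fmt.Errorf("exceeded quota for %s: used %d + requested %d > limit %d",
				res, used[res], req, limit)
		}
	}
	return nil
}

func main() {
	hard := map[string]int64{"requests.cpu": 10000, "pods": 50} // 10 cores, in millicores
	used := map[string]int64{"requests.cpu": 9500, "pods": 48}
	err := QuotaCheck(hard, used, map[string]int64{"requests.cpu": 1000, "pods": 1})
	fmt.Println(err != nil) // true: the CPU request would exceed the quota
}
```

The error string doubles as the "clear error message" the user sees when kubectl apply is rejected.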
Deployment Model
Single-Server (Development/Small)
Target Use Case:
- Development and testing environments
- Small production workloads (<10 nodes)
- Cost-sensitive deployments
Architecture:
- Single k3s server node with embedded SQLite datastore
- Control plane and worker colocated
- No HA guarantees
k3s Server Command:
k3s server \
--data-dir=/var/lib/k8shost \
--disable=servicelb,traefik,flannel \
--flannel-backend=none \
--disable-network-policy \
--cluster-domain=cluster.local \
--service-cidr=10.96.0.0/12 \
--cluster-cidr=10.244.0.0/16 \
--authentication-token-webhook-config-file=/etc/k8shost/iam-webhook.yaml \
--bind-address=0.0.0.0 \
--advertise-address=192.168.1.100 \
--tls-san=k8s-api.example.com
NixOS Configuration:
{ config, lib, pkgs, ... }:
{
services.k8shost = {
enable = true;
mode = "server";
datastore = "sqlite"; # Embedded SQLite
disableComponents = ["servicelb" "traefik" "flannel"];
networking = {
serviceCIDR = "10.96.0.0/12";
clusterCIDR = "10.244.0.0/16";
clusterDomain = "cluster.local";
};
novanet = {
enable = true;
endpoint = "novanet-server:5000";
ovnNorthbound = "tcp:novanet-server:6641";
ovnSouthbound = "tcp:novanet-server:6642";
};
fiberlb = {
enable = true;
endpoint = "fiberlb-server:7000";
externalIpPool = "192.168.100.0/24";
};
iam = {
enable = true;
webhookEndpoint = "https://iam-server:3000/apis/iam.plasmacloud.io/v1/authenticate";
caCertFile = "/etc/k8shost/ca.crt";
clientCertFile = "/etc/k8shost/client.crt";
clientKeyFile = "/etc/k8shost/client.key";
};
flashdns = {
enable = true;
endpoint = "flashdns-server:6000";
clusterDomain = "cluster.local";
recordTTL = 30;
};
lightningstor = {
enable = true;
endpoint = "lightningstor-server:8000";
csiNodeDaemonSet = true; # Deploy CSI node plugin as DaemonSet
};
};
# Open firewall for K8s API
networking.firewall.allowedTCPPorts = [ 6443 ];
}
Limitations:
- No HA (single point of failure)
- SQLite has limited concurrency
- Control plane downtime affects entire cluster
HA Cluster (Production)
Target Use Case:
- Production workloads requiring high availability
- Large clusters (>10 nodes)
- Mission-critical applications
Architecture:
- 3 or 5 k3s server nodes (odd number for quorum)
- Embedded etcd (Raft consensus, HA datastore)
- Load balancer in front of API servers
- Agent nodes for workload scheduling
k3s Server Command (first server node):
k3s server \
--data-dir=/var/lib/k8shost \
--disable=servicelb,traefik,flannel \
--flannel-backend=none \
--disable-network-policy \
--cluster-domain=cluster.local \
--service-cidr=10.96.0.0/12 \
--cluster-cidr=10.244.0.0/16 \
--authentication-token-webhook-config-file=/etc/k8shost/iam-webhook.yaml \
--cluster-init \
--tls-san=k8s-api-lb.example.com \
--tls-san=k8s-api.example.com
Additional server nodes join the existing cluster by replacing --cluster-init with --server https://k8s-api-lb.internal:6443 and adding --token <join-token>.
k3s Agent Command (worker nodes):
k3s agent \
--server https://k8s-api-lb.internal:6443 \
--token <join-token>
NixOS Configuration (Server Node):
{ config, lib, pkgs, ... }:
{
services.k8shost = {
enable = true;
mode = "server";
datastore = "etcd"; # Embedded etcd for HA
clusterInit = true; # Set to false for joining servers
serverUrl = "https://k8s-api-lb.internal:6443"; # For joining servers
# ... same integrations as single-server ...
};
# High availability settings
systemd.services.k8shost = {
serviceConfig = {
Restart = "always";
RestartSec = "10s";
};
};
}
Load Balancer Configuration (FiberLB):
# External LoadBalancer for API access
apiVersion: v1
kind: LoadBalancer
metadata:
name: k8s-api-lb
spec:
listeners:
- protocol: TCP
port: 6443
backend_pool: k8s-api-servers
pools:
- name: k8s-api-servers
algorithm: round-robin
members:
- address: 192.168.1.101 # server-1
port: 6443
- address: 192.168.1.102 # server-2
port: 6443
- address: 192.168.1.103 # server-3
port: 6443
health_check:
type: tcp
interval: 10s
timeout: 5s
retries: 3
Datastore Options:
Option 1: Embedded etcd (Recommended for MVP)
Pros:
- Built-in to k3s, no external dependencies
- Proven, battle-tested (CNCF etcd project)
- Automatic HA with Raft consensus
- Easy setup (just --cluster-init)
Cons:
- Another distributed datastore (in addition to Chainfire/FlareDB)
- etcd-specific operations (backup, restore, defragmentation)
Option 2: FlareDB as External Datastore
Pros:
- Unified storage layer for PlasmaCloud
- Leverage existing FlareDB deployment
- Simplified infrastructure (one less system to manage)
Cons:
- k3s requires etcd API compatibility
- FlareDB would need to implement etcd v3 API (significant effort)
- Untested for K8s workloads
Recommendation for MVP: Use embedded etcd for HA mode. Investigate FlareDB etcd compatibility in Phase 2 or 3.
Backup and Disaster Recovery:
# etcd snapshot (on any server node)
k3s etcd-snapshot save --name backup-$(date +%Y%m%d-%H%M%S)
# List snapshots
k3s etcd-snapshot ls
# Restore from snapshot
k3s server --cluster-reset --cluster-reset-restore-path=/var/lib/k8shost/server/db/snapshots/backup-20250101-120000
NixOS Module Integration
Module Structure:
nix/modules/
├── k8shost.nix # Main module
├── k8shost/
│ ├── controller.nix # FiberLB, FlashDNS controllers
│ ├── csi.nix # LightningStor CSI driver
│ └── cni.nix # NovaNET CNI plugin
Main Module (nix/modules/k8shost.nix):
{ config, lib, pkgs, ... }:
with lib;
let
cfg = config.services.k8shost;
in
{
options.services.k8shost = {
enable = mkEnableOption "PlasmaCloud K8s Hosting Service";
mode = mkOption {
type = types.enum ["server" "agent"];
default = "server";
description = "Run as server (control plane) or agent (worker)";
};
datastore = mkOption {
type = types.enum ["sqlite" "etcd"];
default = "sqlite";
description = "Datastore backend (sqlite for single-server, etcd for HA)";
};
disableComponents = mkOption {
type = types.listOf types.str;
default = ["servicelb" "traefik" "flannel"];
description = "k3s components to disable";
};
networking = {
serviceCIDR = mkOption {
type = types.str;
default = "10.96.0.0/12";
description = "CIDR for service ClusterIPs";
};
clusterCIDR = mkOption {
type = types.str;
default = "10.244.0.0/16";
description = "CIDR for pod IPs";
};
clusterDomain = mkOption {
type = types.str;
default = "cluster.local";
description = "Cluster DNS domain";
};
};
# Integration options (novanet, fiberlb, iam, flashdns, lightningstor)
# ...
};
config = mkIf cfg.enable {
# Install k3s package
environment.systemPackages = [ pkgs.k3s ];
# Create systemd service
systemd.services.k8shost = {
description = "PlasmaCloud K8s Hosting Service (k3s)";
after = [ "network.target" "iam.service" "novanet.service" ];
requires = [ "iam.service" "novanet.service" ];
wantedBy = [ "multi-user.target" ];
serviceConfig = {
Type = "notify";
ExecStart = "${pkgs.k3s}/bin/k3s ${cfg.mode} ${concatStringsSep " " (buildServerArgs cfg)}";
KillMode = "process";
Delegate = "yes";
LimitNOFILE = 1048576;
LimitNPROC = "infinity";
LimitCORE = "infinity";
TasksMax = "infinity";
Restart = "always";
RestartSec = "5s";
};
};
# Create configuration files
environment.etc."k8shost/iam-webhook.yaml" = {
text = generateIAMWebhookConfig cfg.iam;
mode = "0600";
};
# Deploy controllers (FiberLB, FlashDNS, etc.)
# ... (as separate systemd services or in-cluster deployments)
};
}
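The module above delegates flag assembly to a buildServerArgs helper (elided in the Nix snippet). The same logic, sketched in Go for clarity; the K8shostConfig struct mirrors a subset of the module options and is illustrative only:

```go
package main

import (
	"fmt"
	"strings"
)

// K8shostConfig mirrors (a subset of) the NixOS module options.
type K8shostConfig struct {
	DataDir           string
	DisableComponents []string
	ServiceCIDR       string
	ClusterCIDR       string
	ClusterDomain     string
}

// buildServerArgs assembles the k3s server command-line flags from the
// configuration, the way the module's Nix helper of the same name would.
func buildServerArgs(cfg K8shostConfig) []string {
	return []string{
		"--data-dir=" + cfg.DataDir,
		"--disable=" + strings.Join(cfg.DisableComponents, ","),
		"--flannel-backend=none",
		"--service-cidr=" + cfg.ServiceCIDR,
		"--cluster-cidr=" + cfg.ClusterCIDR,
		"--cluster-domain=" + cfg.ClusterDomain,
	}
}

func main() {
	cfg := K8shostConfig{
		DataDir:           "/var/lib/k8shost",
		DisableComponents: []string{"servicelb", "traefik", "flannel"},
		ServiceCIDR:       "10.96.0.0/12",
		ClusterCIDR:       "10.244.0.0/16",
		ClusterDomain:     "cluster.local",
	}
	fmt.Println(strings.Join(buildServerArgs(cfg), " "))
}
```

Generating the flag list from typed options keeps the systemd ExecStart line and the documented defaults from drifting apart.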
API Server Configuration
k3s Server Flags (Complete)
The flags below are grouped as an annotated reference; when assembling an actual k3s server command line, drop the inline comments (a comment after a trailing backslash breaks the line continuation).
# Data and cluster configuration
--data-dir=/var/lib/k8shost
--cluster-init                                    # First server in HA cluster
--server https://k8s-api-lb.internal:6443         # Joining servers (instead of --cluster-init)
--token <cluster-token>                           # Secure join token
# Disable default components
--disable=servicelb,traefik,flannel,local-storage
--flannel-backend=none
--disable-network-policy
# Network configuration
--cluster-domain=cluster.local
--service-cidr=10.96.0.0/12
--cluster-cidr=10.244.0.0/16
--service-node-port-range=30000-32767
# API server configuration
--bind-address=0.0.0.0
--advertise-address=192.168.1.100
--tls-san=k8s-api.example.com
--tls-san=k8s-api-lb.example.com
# Authentication
--authentication-token-webhook-config-file=/etc/k8shost/iam-webhook.yaml
--authentication-token-webhook-cache-ttl=2m
# Authorization: RBAC is enabled by default (--authorization-mode=Node,RBAC)
# Audit logging
--kube-apiserver-arg=audit-log-path=/var/log/k8shost/audit.log
--kube-apiserver-arg=audit-log-maxage=30
--kube-apiserver-arg=audit-log-maxbackup=10
--kube-apiserver-arg=audit-log-maxsize=100
# Feature gates (if needed)
# --kube-apiserver-arg=feature-gates=SomeFeature=true
Authentication Webhook Configuration
File: /etc/k8shost/iam-webhook.yaml
apiVersion: v1
kind: Config
clusters:
- name: iam-webhook
cluster:
server: https://iam-server:3000/apis/iam.plasmacloud.io/v1/authenticate
certificate-authority: /etc/k8shost/ca.crt
users:
- name: k8s-apiserver
user:
client-certificate: /etc/k8shost/apiserver-client.crt
client-key: /etc/k8shost/apiserver-client.key
current-context: webhook
contexts:
- context:
cluster: iam-webhook
user: k8s-apiserver
name: webhook
Certificate Management:
- CA certificate: Issued by PlasmaCloud IAM PKI
- Client certificate: For kube-apiserver to authenticate to IAM webhook
- Rotation: Certificates expire after 1 year, auto-renewed by IAM
Security
TLS/mTLS
Component Communication:
| Source | Destination | Protocol | Auth Method |
|---|---|---|---|
| kube-apiserver | IAM webhook | HTTPS + mTLS | Client cert |
| FiberLB controller | FiberLB gRPC | gRPC + TLS | IAM token |
| FlashDNS controller | FlashDNS gRPC | gRPC + TLS | IAM token |
| LightningStor CSI | LightningStor gRPC | gRPC + TLS | IAM token |
| NovaNET CNI | NovaNET gRPC | gRPC + TLS | IAM token |
| kubectl | kube-apiserver | HTTPS | IAM token (Bearer) |
Certificate Issuance:
- All certificates issued by IAM service (centralized PKI)
- Automatic renewal before expiration
- Certificate revocation via IAM CRL
Pod Security
Pod Security Standards (PSS):
- Baseline Profile: Enforced on all namespaces by default
- Deny privileged containers
- Deny host network/PID/IPC
- Deny hostPath volumes
- Deny privilege escalation
- Restricted Profile: Optional, for highly sensitive workloads
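The Baseline bullets above amount to a small predicate over a pod's security settings. A sketch of that check with a simplified stand-in for the real PodSpec types (SecurityContext and ViolatesBaseline are illustrative names; the rule set mirrors the bullets, not the full upstream PSS definition):

```go
package main

import "fmt"

// SecurityContext captures just the fields the baseline check below
// inspects; a simplified stand-in for the real K8s types.
type SecurityContext struct {
	Privileged               bool
	AllowPrivilegeEscalation bool
	HostNetwork              bool
	UsesHostPath             bool
}

// ViolatesBaseline returns the first baseline-profile violation found,
// or "" if the settings are acceptable.
func ViolatesBaseline(sc SecurityContext) string {
	switch {
	case sc.Privileged:
		return "privileged containers are denied"
	case sc.HostNetwork:
		return "host network is denied"
	case sc.UsesHostPath:
		return "hostPath volumes are denied"
	case sc.AllowPrivilegeEscalation:
		return "privilege escalation is denied"
	}
	return ""
}

func main() {
	fmt.Println(ViolatesBaseline(SecurityContext{}) == "") // true
	fmt.Println(ViolatesBaseline(SecurityContext{Privileged: true}))
	// privileged containers are denied
}
```

In practice this enforcement comes from the built-in Pod Security admission controller via namespace labels, not custom code; the sketch only makes the rule set explicit.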
Example PodSecurityPolicy (deprecated in K8s 1.21 and removed in 1.25; shown for reference only, use PSS instead):
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
name: restricted
spec:
privileged: false
allowPrivilegeEscalation: false
requiredDropCapabilities:
- ALL
volumes:
- configMap
- emptyDir
- projected
- secret
- downwardAPI
- persistentVolumeClaim
runAsUser:
rule: MustRunAsNonRoot
seLinux:
rule: RunAsAny
fsGroup:
rule: RunAsAny
Security Contexts (enforced):
apiVersion: v1
kind: Pod
metadata:
name: secure-pod
spec:
securityContext:
runAsNonRoot: true
runAsUser: 1000
fsGroup: 2000
containers:
- name: app
image: myapp:latest
securityContext:
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
capabilities:
drop:
- ALL
Service Account Permissions:
- Minimal RBAC permissions by default
- Principle of least privilege
- No cluster-admin access for user workloads
Testing Strategy
Unit Tests
Controllers (Go):
// fiberlb_controller_test.go
func TestReconcileLoadBalancer(t *testing.T) {
// Mock K8s client
client := fake.NewSimpleClientset()
// Mock FiberLB gRPC client
mockFiberLB := &mockFiberLBClient{}
controller := NewFiberLBController(client, mockFiberLB)
// Create test service
svc := &corev1.Service{
ObjectMeta: metav1.ObjectMeta{Name: "test-svc", Namespace: "default"},
Spec: corev1.ServiceSpec{Type: corev1.ServiceTypeLoadBalancer},
}
// Reconcile
err := controller.Reconcile(svc)
assert.NoError(t, err)
// Verify FiberLB API called
assert.Equal(t, 1, mockFiberLB.createLoadBalancerCalls)
}
CNI Plugin (Rust):
#[test]
fn test_cni_add() {
let mut mock_ovn = MockOVNClient::new();
mock_ovn.expect_allocate_ip()
.returning(|_ns, _pod| Ok("10.244.1.5/24".to_string()));
let plugin = NovaNETPlugin::new(mock_ovn);
let result = plugin.handle_add(/* ... */);
assert!(result.is_ok());
assert_eq!(result.unwrap().ip, "10.244.1.5");
}
CSI Driver (Go):
func TestCreateVolume(t *testing.T) {
mockLightningStor := &mockLightningStorClient{}
mockLightningStor.On("CreateVolume", mock.Anything).Return(&Volume{ID: "vol-123"}, nil)
driver := NewCSIDriver(mockLightningStor)
req := &csi.CreateVolumeRequest{
Name: "test-vol",
CapacityRange: &csi.CapacityRange{RequiredBytes: 10 * 1024 * 1024 * 1024},
}
resp, err := driver.CreateVolume(context.Background(), req)
assert.NoError(t, err)
assert.Equal(t, "vol-123", resp.Volume.VolumeId)
}
Integration Tests
Test Environment:
- Single-node k3s cluster (kind or k3s in Docker)
- Mock or real PlasmaCloud services (NovaNET, FiberLB, etc.)
- Automated setup and teardown
Test Cases:
1. Single-Pod Deployment:
#!/bin/bash
set -e
# Deploy nginx pod
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
name: nginx
spec:
containers:
- name: nginx
image: nginx:latest
ports:
- containerPort: 80
EOF
# Wait for pod to be running
kubectl wait --for=condition=Ready pod/nginx --timeout=60s
# Verify pod IP allocated
POD_IP=$(kubectl get pod nginx -o jsonpath='{.status.podIP}')
[ -n "$POD_IP" ] || exit 1
# Cleanup
kubectl delete pod nginx
2. Service Exposure (LoadBalancer):
#!/bin/bash
set -e
# Create deployment
kubectl create deployment web --image=nginx:latest --replicas=2
# Expose as LoadBalancer
kubectl expose deployment web --type=LoadBalancer --port=80
# Wait for external IP
for i in {1..30}; do
EXTERNAL_IP=$(kubectl get svc web -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
[ -n "$EXTERNAL_IP" ] && break
sleep 2
done
[ -n "$EXTERNAL_IP" ] || exit 1
# Verify HTTP access
curl -f http://$EXTERNAL_IP || exit 1
# Cleanup
kubectl delete svc web
kubectl delete deployment web
3. PersistentVolume Provisioning:
#!/bin/bash
set -e
# Create PVC
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: test-pvc
spec:
accessModes: [ReadWriteOnce]
resources:
requests:
storage: 1Gi
storageClassName: lightningstor-ssd
EOF
# Wait for PVC to be bound (PVCs expose status.phase, not conditions)
kubectl wait --for=jsonpath='{.status.phase}'=Bound pvc/test-pvc --timeout=60s
# Create pod using PVC
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
name: test-pod
spec:
containers:
- name: app
image: busybox
command: ["sh", "-c", "echo hello > /data/test.txt && sleep 3600"]
volumeMounts:
- name: data
mountPath: /data
volumes:
- name: data
persistentVolumeClaim:
claimName: test-pvc
EOF
kubectl wait --for=condition=Ready pod/test-pod --timeout=60s
# Verify file written
kubectl exec test-pod -- cat /data/test.txt | grep hello || exit 1
# Cleanup
kubectl delete pod test-pod
kubectl delete pvc test-pvc
4. Multi-Tenant Isolation:
#!/bin/bash
set -e
# Create two namespaces
kubectl create namespace project-a
kubectl create namespace project-b
# Deploy pod in each
kubectl run pod-a --image=nginx -n project-a
kubectl run pod-b --image=nginx -n project-b
# Verify network isolation (if NetworkPolicies enabled)
# Pod A should NOT be able to reach Pod B
POD_B_IP=$(kubectl get pod pod-b -n project-b -o jsonpath='{.status.podIP}')
kubectl exec pod-a -n project-a -- curl --max-time 5 http://$POD_B_IP && exit 1 || true
# Cleanup
kubectl delete ns project-a project-b
E2E Test Scenario
End-to-End Test: Deploy Multi-Tier Application
#!/bin/bash
set -ex
NAMESPACE="project-123"
# 1. Create namespace
kubectl create namespace $NAMESPACE
# 2. Deploy PostgreSQL with PVC
kubectl apply -n $NAMESPACE -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: postgres-pvc
spec:
accessModes: [ReadWriteOnce]
resources:
requests:
storage: 5Gi
storageClassName: lightningstor-ssd
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: postgres
spec:
replicas: 1
selector:
matchLabels:
app: postgres
template:
metadata:
labels:
app: postgres
spec:
containers:
- name: postgres
image: postgres:15
env:
- name: POSTGRES_PASSWORD
value: testpass
ports:
- containerPort: 5432
volumeMounts:
- name: data
mountPath: /var/lib/postgresql/data
volumes:
- name: data
persistentVolumeClaim:
claimName: postgres-pvc
---
apiVersion: v1
kind: Service
metadata:
name: postgres
spec:
selector:
app: postgres
ports:
- port: 5432
EOF
# 3. Deploy web application
kubectl apply -n $NAMESPACE -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
name: web
spec:
replicas: 3
selector:
matchLabels:
app: web
template:
metadata:
labels:
app: web
spec:
containers:
- name: web
image: myapp:latest
env:
- name: DATABASE_URL
value: postgres://postgres:testpass@postgres:5432/mydb
ports:
- containerPort: 8080
EOF
# 4. Expose web via LoadBalancer
kubectl expose deployment web -n $NAMESPACE --type=LoadBalancer --port=80 --target-port=8080
# 5. Wait for resources
kubectl wait -n $NAMESPACE --for=condition=Ready pod -l app=postgres --timeout=120s
kubectl wait -n $NAMESPACE --for=condition=Ready pod -l app=web --timeout=120s
# 6. Verify LoadBalancer external IP
for i in {1..60}; do
EXTERNAL_IP=$(kubectl get svc web -n $NAMESPACE -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
[ -n "$EXTERNAL_IP" ] && break
sleep 2
done
[ -n "$EXTERNAL_IP" ] || { echo "No external IP assigned"; exit 1; }
# 7. Verify DNS resolution
kubectl run -n $NAMESPACE --rm -i --restart=Never test-dns --image=busybox -- nslookup postgres.${NAMESPACE}.svc.cluster.local
# 8. Verify HTTP access
curl -f http://$EXTERNAL_IP/health || { echo "Health check failed"; exit 1; }
# 9. Verify PVC mounted
kubectl exec -n $NAMESPACE deployment/postgres -- ls /var/lib/postgresql/data | grep pg_wal
# 10. Verify network isolation (optional, if NetworkPolicies enabled)
# ...
# Cleanup
kubectl delete namespace $NAMESPACE
echo "E2E test passed!"
Implementation Phases
Phase 1: Foundation (4-5 weeks)
Week 1-2: k3s Setup and IAM Integration
- Install and configure k3s with disabled components
- Implement IAM authentication webhook server
- Configure kube-apiserver to use IAM webhook
- Create RBAC templates (org admin, project admin, viewer)
- Test: Authenticate with IAM token, verify RBAC enforcement
Week 3: NovaNET CNI Plugin
- Implement CNI binary (ADD, DEL, CHECK commands)
- Integrate with NovaNET gRPC API (AllocateIP, ReleaseIP)
- Configure OVN logical switches per namespace
- Test: Create pod, verify network interface and IP allocation
Week 4: FiberLB Controller
- Implement controller watch loop (Services, Endpoints)
- Integrate with FiberLB gRPC API (CreateLoadBalancer, UpdatePool)
- Implement external IP allocation from pool
- Test: Expose service as LoadBalancer, verify external IP and routing
Week 5: Basic RBAC and Multi-Tenancy
- Implement namespace-per-project provisioning
- Deploy default RBAC roles and bindings
- Test: Create multiple projects, verify isolation
Deliverables:
- Functional k3s cluster with IAM authentication
- Pod networking via NovaNET
- LoadBalancer services via FiberLB
- Multi-tenant namespaces with RBAC
Phase 2: Storage & DNS (5-6 weeks)
Week 6-7: LightningStor CSI Driver
- Implement CSI Controller Service (CreateVolume, DeleteVolume, ControllerPublishVolume)
- Implement CSI Node Service (NodeStageVolume, NodePublishVolume)
- Integrate with LightningStor gRPC API
- Deploy CSI driver as pods (controller + node DaemonSet)
- Create StorageClasses for SSD and HDD
- Test: Create PVC, attach to pod, write/read data
Week 8: FlashDNS Controller
- Implement controller watch loop (Services, Pods)
- Integrate with FlashDNS gRPC API (CreateRecord, UpdateRecord)
- Generate DNS records (A, SRV) for services and pods
- Configure kubelet DNS settings
- Test: Resolve service DNS from pod, verify DNS updates
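The record-generation step above follows the standard K8s service DNS scheme: <name>.<namespace>.svc.<cluster-domain>. A sketch of the A record the FlashDNS controller would create, using the 30-second TTL from the module configuration (Record and ServiceARecord are illustrative names, not the FlashDNS API):

```go
package main

import "fmt"

// Record mirrors the A record the FlashDNS controller would create for a
// Service ClusterIP.
type Record struct {
	FQDN string
	Type string
	IP   string
	TTL  int
}

// ServiceARecord builds the <name>.<namespace>.svc.<cluster-domain> record
// for a Service, with the recordTTL (30s) from the module config.
func ServiceARecord(name, namespace, clusterDomain, clusterIP string) Record {
	return Record{
		FQDN: fmt.Sprintf("%s.%s.svc.%s", name, namespace, clusterDomain),
		Type: "A",
		IP:   clusterIP,
		TTL:  30,
	}
}

func main() {
	r := ServiceARecord("postgres", "project-123", "cluster.local", "10.96.5.10")
	fmt.Println(r.FQDN) // postgres.project-123.svc.cluster.local
}
```

This is the name the E2E test later resolves with nslookup from inside a pod.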
Week 9: Network Policy Support
- Extend NovaNET CNI with NetworkPolicy controller
- Translate K8s NetworkPolicy to OVN ACLs
- Implement address sets for pod label selectors
- Test: Create NetworkPolicy, verify ingress/egress enforcement
Week 10-11: Integration Testing
- Write integration test suite (pod, service, PVC, DNS)
- Test multi-tier application deployment
- Performance testing (pod creation time, network throughput)
- Fix bugs and optimize
Deliverables:
- Persistent storage via LightningStor CSI
- Service discovery via FlashDNS
- Network policies enforced by NovaNET
- Comprehensive integration tests
Phase 3: Advanced Features (Post-MVP, 6-8 weeks)
StatefulSets:
- Verify StatefulSet controller functionality (built-in to k3s)
- Test with headless services and volumeClaimTemplates
- Example: Deploy Cassandra or Kafka cluster
PlasmaVMC CRI Integration:
- Implement CRI server in PlasmaVMC (Rust)
- Create Firecracker microVM per pod
- Test pod lifecycle (create, start, stop, delete)
- Performance benchmarking (startup time, resource overhead)
FlareDB as k3s Datastore:
- Investigate etcd API compatibility layer for FlareDB
- Implement etcd v3 gRPC API shim
- Test k3s with FlareDB backend
- Benchmarking and stability testing
Autoscaling:
- Deploy metrics-server
- Implement HorizontalPodAutoscaler
- Test autoscaling based on CPU/memory metrics
Ingress (L7 LoadBalancer):
- Implement Ingress controller using FiberLB L7 capabilities
- Support host-based and path-based routing
- TLS termination
Success Criteria
Functional Requirements:
- ✅ Deploy pods, services, deployments using kubectl
- ✅ LoadBalancer services receive external IPs from FiberLB
- ✅ PersistentVolumes provisioned from LightningStor and mounted to pods
- ✅ DNS resolution works for services and pods (via FlashDNS)
- ✅ Multi-tenant isolation enforced (namespaces, RBAC, network policies)
- ✅ IAM authentication and RBAC functional (token validation, user/group mapping)
- ✅ E2E test passes (multi-tier application deployment)
Performance Requirements:
- Pod creation time: <10 seconds (from API call to running state)
- Service LoadBalancer IP allocation: <5 seconds
- PersistentVolume provisioning: <30 seconds
- DNS record updates: <10 seconds (after service creation)
- Support 100+ pods per cluster
- Support 10+ concurrent namespaces
Operational Requirements:
- NixOS module for declarative deployment
- Cluster upgrade path (k3s version upgrades)
- Backup and restore procedures (etcd snapshots)
- Monitoring and alerting integration (Prometheus, Grafana)
- Logging aggregation (FluentBit → centralized log storage)
Next Steps (S3-S6)
S3: Workspace Scaffold
- Create
k8shost/workspace directory structure - Set up Go module for controllers (FiberLB, FlashDNS)
- Set up Rust workspace for CNI plugin
- Set up Go module for CSI driver
- Create NixOS module skeleton
Directory Structure:
k8shost/
├── controllers/ # Go: FiberLB, FlashDNS, IAM webhook
│ ├── fiberlb/
│ ├── flashdns/
│ ├── iamwebhook/
│ └── main.go
├── cni/ # Rust: NovaNET CNI plugin
│ ├── src/
│ └── Cargo.toml
├── csi/ # Go: LightningStor CSI driver
│ ├── controller/
│ ├── node/
│ └── main.go
├── nix/
│ └── modules/
│ └── k8shost.nix
└── tests/
├── integration/
└── e2e/
S4: Controllers Implementation
- Implement FiberLB controller (Service watch, gRPC integration)
- Implement FlashDNS controller (Service/Pod watch, DNS record sync)
- Implement IAM webhook server (TokenReview API, IAM validation)
- Unit tests for each controller
S5: CNI + CSI Implementation
- Implement NovaNET CNI plugin (ADD/DEL/CHECK, OVN integration)
- Implement LightningStor CSI driver (Controller and Node services)
- Deploy CSI driver as pods (Deployment + DaemonSet)
- Unit tests for CNI and CSI
S6: Integration Testing
- Set up integration test environment (k3s cluster + mock services)
- Write integration tests (pod, service, PVC, DNS, multi-tenant)
- Write E2E test (multi-tier application)
- CI/CD pipeline for automated testing
References
- k3s Architecture: https://docs.k3s.io/architecture
- k3s Installation: https://docs.k3s.io/installation
- k3s HA Setup: https://docs.k3s.io/datastore/ha-embedded
- CNI Specification: https://github.com/containernetworking/cni/blob/main/SPEC.md
- CSI Specification: https://github.com/container-storage-interface/spec/blob/master/spec.md
- K8s Authentication Webhooks: https://kubernetes.io/docs/reference/access-authn-authz/authentication/#webhook-token-authentication
- K8s Authorization (RBAC): https://kubernetes.io/docs/reference/access-authn-authz/rbac/
- K8s Network Policies: https://kubernetes.io/docs/concepts/services-networking/network-policies/
- OVN Architecture: https://www.ovn.org/support/dist-docs/ovn-architecture.7.html
- Kubernetes API Reference: https://kubernetes.io/docs/reference/kubernetes-api/
Document Version: 1.0 Last Updated: 2025-12-09 Authors: PlasmaCloud Platform Team Status: Draft for Review