# K8s Hosting Specification

## Overview

PlasmaCloud's K8s Hosting service provides managed Kubernetes clusters for multi-tenant container orchestration. This specification defines a k3s-based architecture that integrates deeply with existing PlasmaCloud infrastructure components: PrismNET for networking, FiberLB for load balancing, IAM for authentication/authorization, FlashDNS for service discovery, and LightningStor for persistent storage.

### Purpose

Enable customers to deploy and manage containerized workloads using standard Kubernetes APIs while benefiting from PlasmaCloud's integrated infrastructure services. The system provides:

- **Standard K8s API compatibility**: Use kubectl, Helm, and existing K8s tooling
- **Multi-tenant isolation**: Project-based namespaces with IAM-backed RBAC
- **Deep integration**: Leverage PrismNET SDN, FiberLB load balancing, LightningStor block storage
- **Production-ready**: HA control plane, automated failover, comprehensive monitoring

### Scope

**Phase 1 (MVP, 3-4 months):**

- Core K8s APIs (Pods, Services, Deployments, ReplicaSets, Namespaces, ConfigMaps, Secrets)
- LoadBalancer services via FiberLB
- Persistent storage via LightningStor CSI
- IAM authentication and RBAC
- PrismNET CNI for pod networking
- FlashDNS service discovery

**Future Phases:**

- PlasmaVMC integration for VM-backed pods (enhanced isolation)
- StatefulSets, DaemonSets, Jobs/CronJobs
- Network policies with PrismNET enforcement
- Horizontal Pod Autoscaler
- FlareDB as k3s datastore

### Architecture Decision Summary

**Base Technology: k3s**

- Lightweight K8s distribution (single binary, minimal dependencies)
- Production-proven (CNCF certified, widely deployed)
- Flexible architecture allowing component replacement
- Embedded SQLite (single-server) or etcd (HA cluster)
- 3-4 month timeline achievable

**Component Replacement Strategy:**

- **Disable**: servicelb (replaced by FiberLB), traefik (replaced by FiberLB L7), flannel (replaced by PrismNET)
- **Keep**: kube-apiserver, kube-scheduler, kube-controller-manager, kubelet, containerd
- **Add**: Custom controllers for FiberLB, FlashDNS, IAM webhook, LightningStor CSI, PrismNET CNI
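The disable list above maps onto k3s server flags. A sketch of the invocation, assuming a current k3s release (exact flag spellings should be verified against the deployed k3s version):

```sh
# Start a k3s server with the replaced components disabled.
# servicelb/traefik/local-storage are replaced by FiberLB and LightningStor;
# flannel is disabled so the PrismNET CNI can take over pod networking.
k3s server \
  --disable servicelb \
  --disable traefik \
  --disable local-storage \
  --flannel-backend=none \
  --disable-network-policy
```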
## Architecture

### Base: k3s with Selective Component Replacement

**k3s Core (Keep):**

- **kube-apiserver**: K8s REST API server with IAM webhook authentication
- **kube-scheduler**: Pod scheduling with resource awareness
- **kube-controller-manager**: Core controllers (replication, endpoints, service accounts, etc.)
- **kubelet**: Node agent managing pod lifecycle via containerd CRI
- **containerd**: Container runtime (Phase 1), later replaceable by PlasmaVMC CRI
- **kube-proxy**: Service networking (iptables/ipvs mode)

**k3s Components (Disable):**

- **servicelb**: Default LoadBalancer implementation → Replaced by FiberLB controller
- **traefik**: Ingress controller → Replaced by FiberLB L7 capabilities
- **flannel**: CNI plugin → Replaced by PrismNET CNI
- **local-path-provisioner**: Storage provisioner → Replaced by LightningStor CSI

**PlasmaCloud Custom Components (Add):**

- **PrismNET CNI Plugin**: Pod networking via OVN logical switches
- **FiberLB Controller**: LoadBalancer service reconciliation
- **IAM Webhook Server**: Token validation and user mapping
- **FlashDNS Controller**: Service DNS record synchronization
- **LightningStor CSI Driver**: PersistentVolume provisioning and attachment

### Component Topology
```
┌─────────────────────────────────────────────────────────────┐
│                     k3s Control Plane                       │
│  ┌──────────────┐   ┌────────────┐   ┌──────────────────┐  │
│  │kube-apiserver│◄──┤ IAM Webhook├───┤ IAM Service      │  │
│  │              │   │            │   │ (Authentication) │  │
│  └──────┬───────┘   └────────────┘   └──────────────────┘  │
│         │                                                  │
│  ┌──────▼───────┐  ┌───────────────┐  ┌────────────────┐   │
│  │kube-scheduler│  │kube-controller│  │  etcd/SQLite   │    │
│  │              │  │   -manager    │  │  (Datastore)   │    │
│  └──────────────┘  └───────────────┘  └────────────────┘   │
└─────────────────────────────────────────────────────────────┘
                           │
        ┌──────────────────┼──────────────────┐
        │                  │                  │
┌───────▼───────┐  ┌───────▼───────┐  ┌───────▼───────┐
│    FiberLB    │  │   FlashDNS    │  │ LightningStor │
│  Controller   │  │  Controller   │  │  CSI Plugin   │
│ (Watch Svcs)  │  │  (Sync DNS)   │  │  (Provision)  │
└───────┬───────┘  └───────┬───────┘  └───────┬───────┘
        │                  │                  │
        ▼                  ▼                  ▼
┌───────────────┐  ┌───────────────┐  ┌───────────────┐
│    FiberLB    │  │   FlashDNS    │  │ LightningStor │
│   gRPC API    │  │   gRPC API    │  │   gRPC API    │
└───────────────┘  └───────────────┘  └───────────────┘

┌─────────────────────────────────────────────────────────────┐
│                      k3s Worker Nodes                       │
│  ┌──────────────┐   ┌────────────┐   ┌──────────────────┐  │
│  │   kubelet    │◄──┤ containerd ├───┤ Pods (containers)│  │
│  │              │   │    CRI     │   │                  │  │
│  └──────┬───────┘   └────────────┘   └──────────────────┘  │
│         │                                                  │
│  ┌──────▼───────┐   ┌──────────────┐                       │
│  │ PrismNET CNI │◄──┤  kube-proxy  │                       │
│  │ (Pod Network)│   │ (Service Net)│                       │
│  └──────┬───────┘   └──────────────┘                       │
│         │                                                  │
│         ▼                                                  │
│  ┌──────────────┐                                          │
│  │ PrismNET OVN │                                          │
│  │(ovs-vswitchd)│                                          │
│  └──────────────┘                                          │
└─────────────────────────────────────────────────────────────┘
```

### Data Flow Examples

**1. Pod Creation:**
```
kubectl create pod → kube-apiserver (IAM auth) → scheduler → kubelet → containerd
                                                                           ↓
                                                                     PrismNET CNI
                                                                           ↓
                                                                  OVN logical port
```

**2. LoadBalancer Service:**
```
kubectl expose → kube-apiserver → Service created → FiberLB controller watches
                                                              ↓
                                                      FiberLB gRPC API
                                                              ↓
                                                External IP + L4 forwarding
```

**3. PersistentVolume:**
```
PVC created → kube-apiserver → CSI controller → LightningStor CSI driver
                                                          ↓
                                                 LightningStor gRPC
                                                          ↓
                                                    Volume created
                                                          ↓
                                              kubelet → CSI node plugin
                                                          ↓
                                                     Mount to pod
```
## K8s API Subset

### Phase 1: Core APIs (Essential)

**Pods (v1):**

- Full CRUD operations (create, get, list, update, delete, patch)
- Watch API for real-time updates
- Logs streaming (`kubectl logs -f`)
- Exec into containers (`kubectl exec`)
- Port forwarding (`kubectl port-forward`)
- Status: Phase (Pending, Running, Succeeded, Failed), conditions, container states

**Services (v1):**

- **ClusterIP**: Internal cluster networking (default)
- **LoadBalancer**: External access via FiberLB
- **Headless**: StatefulSet support (clusterIP: None)
- Service discovery via FlashDNS
- Endpoint slices for large service backends

**Deployments (apps/v1):**

- Declarative desired state (replicas, pod template)
- Rolling updates with configurable strategy (maxSurge, maxUnavailable)
- Rollback to previous revision
- Pause/resume for canary deployments
- Scaling (manual in Phase 1)

**ReplicaSets (apps/v1):**

- Pod replication with label selectors
- Owned by Deployments (rarely created directly)
- Orphan/adopt pod ownership

**Namespaces (v1):**

- Tenant isolation (one namespace per project)
- Resource quota enforcement
- Network policy scope (Phase 2)
- RBAC scope

**ConfigMaps (v1):**

- Non-sensitive configuration data
- Mount as volumes or environment variables
- Update triggers pod restarts (via annotation)

**Secrets (v1):**

- Sensitive data (passwords, tokens, certificates)
- Base64 encoded in etcd (at-rest encryption in future phase)
- Mount as volumes or environment variables
- Service account tokens

**Nodes (v1):**

- Node registration via kubelet
- Heartbeat and status reporting
- Capacity and allocatable resources
- Labels and taints for scheduling

**Events (v1):**

- Audit trail of cluster activities
- Retention policy (1 hour in-memory, longer in etcd)
- Debugging and troubleshooting
### Phase 2: Storage & Config (Required for MVP)

**PersistentVolumes (v1):**

- Volume lifecycle independent of pods
- Access modes: ReadWriteOnce, ReadOnlyMany, ReadWriteMany (LightningStor support)
- Reclaim policy: Retain, Delete
- Status: Available, Bound, Released, Failed

**PersistentVolumeClaims (v1):**

- User request for storage
- Binding to PVs by storage class, capacity, access mode
- Volume expansion (if storage class allows)

**StorageClasses (storage.k8s.io/v1):**

- Dynamic provisioning via LightningStor CSI
- Parameters: volume type (ssd, hdd), replication factor, org_id, project_id
- Volume binding mode: Immediate or WaitForFirstConsumer
### Phase 3: Advanced (Post-MVP)

**StatefulSets (apps/v1):**

- Ordered pod creation/deletion
- Stable network identities (pod-0, pod-1, ...)
- Persistent storage per pod via volumeClaimTemplates
- Use case: Databases, distributed systems

**DaemonSets (apps/v1):**

- One pod per node (e.g., log collectors, monitoring agents)
- Node selector and tolerations

**Jobs (batch/v1):**

- Run-to-completion workloads
- Parallelism and completions
- Retry policy

**CronJobs (batch/v1):**

- Scheduled jobs (cron syntax)
- Concurrency policy

**NetworkPolicies (networking.k8s.io/v1):**

- Ingress and egress rules
- Label-based pod selection
- Namespace selectors
- Requires PrismNET CNI support for OVN ACL translation

**Ingress (networking.k8s.io/v1):**

- HTTP/HTTPS routing via FiberLB L7
- Host-based and path-based routing
- TLS termination

### Deferred APIs (Not in MVP)

- HorizontalPodAutoscaler (autoscaling/v2): Requires metrics-server
- VerticalPodAutoscaler: Complex, low priority
- PodDisruptionBudget: Useful for HA, but post-MVP
- LimitRange: Resource limits per namespace (future)
- ResourceQuota: Supported in Phase 1, but advanced features deferred
- CustomResourceDefinitions (CRDs): Framework exists, but no custom resources in Phase 1
- APIService: Aggregation layer not needed initially
## Integration Specifications

### 1. PrismNET CNI Plugin

**Purpose:** Provide pod networking using PrismNET's OVN-based SDN.

**Interface:** CNI 1.0.0 specification (https://github.com/containernetworking/cni/blob/main/SPEC.md)

**Components:**

- **CNI binary**: `/opt/cni/bin/prismnet`
- **Configuration**: `/etc/cni/net.d/10-prismnet.conflist`
- **IPAM plugin**: `/opt/cni/bin/prismnet-ipam` (or integrated)

**Responsibilities:**

- Create network interface for pod (veth pair)
- Allocate IP address from namespace-specific subnet
- Connect pod to OVN logical switch
- Configure routing for pod egress
- Enforce network policies (Phase 2)

**Configuration Schema:**
```json
{
  "cniVersion": "1.0.0",
  "name": "prismnet",
  "type": "prismnet",
  "ipam": {
    "type": "prismnet-ipam",
    "subnet": "10.244.0.0/16",
    "rangeStart": "10.244.0.10",
    "rangeEnd": "10.244.255.254",
    "routes": [
      {"dst": "0.0.0.0/0"}
    ],
    "gateway": "10.244.0.1"
  },
  "ovn": {
    "northbound": "tcp:prismnet-server:6641",
    "southbound": "tcp:prismnet-server:6642",
    "encapType": "geneve"
  },
  "mtu": 1400,
  "prismnetEndpoint": "prismnet-server:5000"
}
```
**CNI Plugin Workflow:**

1. **ADD Command** (pod creation):
```
Input: Container ID, network namespace path, interface name
Process:
  - Call PrismNET gRPC API: AllocateIP(namespace, pod_name)
  - Create veth pair: one end in pod netns, one in host
  - Add host veth to OVN logical switch port
  - Configure pod veth: IP address, routes, MTU
  - Return: IP config, routes, DNS settings
```

2. **DEL Command** (pod deletion):
```
Input: Container ID, network namespace path
Process:
  - Call PrismNET gRPC API: ReleaseIP(namespace, pod_name)
  - Delete OVN logical switch port
  - Delete veth pair
```

3. **CHECK Command** (health check):
```
Verify interface exists and has expected configuration
```
**API Integration (PrismNET gRPC):**

```protobuf
service NetworkService {
  rpc AllocateIP(AllocateIPRequest) returns (AllocateIPResponse);
  rpc ReleaseIP(ReleaseIPRequest) returns (ReleaseIPResponse);
  rpc CreateLogicalSwitch(CreateLogicalSwitchRequest) returns (CreateLogicalSwitchResponse);
}

message AllocateIPRequest {
  string namespace = 1;
  string pod_name = 2;
  string container_id = 3;
}

message AllocateIPResponse {
  string ip_address = 1;  // e.g., "10.244.1.5/24"
  string gateway = 2;
  repeated string dns_servers = 3;
}
```
**OVN Topology:**

- **Logical Switch per Namespace**: `k8s-<namespace>` (e.g., `k8s-project-123`)
- **Logical Router**: `k8s-cluster-router` for inter-namespace routing
- **Logical Switch Ports**: One per pod (`<pod-name>-<container-id>`)
- **ACLs**: NetworkPolicy enforcement (Phase 2)

**Network Policy Translation (Phase 2):**
```
K8s NetworkPolicy:
  podSelector: app=web
  ingress:
  - from:
    - podSelector: app=frontend
    ports:
    - protocol: TCP
      port: 80

→ OVN ACL:
  direction: to-lport
  match: "ip4.src == $frontend_pods && tcp.dst == 80"
  action: allow-related
  priority: 1000
```

**Address Sets:**

- Dynamic updates as pods are added/removed
- Efficient ACL matching for large pod groups
### 2. FiberLB LoadBalancer Controller

**Purpose:** Reconcile K8s Services of type LoadBalancer with FiberLB resources.

**Architecture:**

- **Controller Process**: Runs as a pod in `kube-system` namespace or embedded in k3s server
- **Watch Resources**: Services (type=LoadBalancer), Endpoints
- **Manage Resources**: FiberLB LoadBalancers, Listeners, Pools, Members

**Controller Logic:**

**1. Service Watch Loop:**
```go
for event := range serviceWatcher {
    if event.Type == Created || event.Type == Updated {
        if service.Spec.Type == "LoadBalancer" {
            reconcileLoadBalancer(service)
        }
    } else if event.Type == Deleted {
        deleteLoadBalancer(service)
    }
}
```
**2. Reconcile Logic:**
```
Input: Service object
Process:
  1. Check if FiberLB LoadBalancer exists (by annotation or name mapping)
  2. If not exists:
     a. Allocate external IP from pool
     b. Create FiberLB LoadBalancer resource (gRPC CreateLoadBalancer)
     c. Store LoadBalancer ID in service annotation
  3. For each service.Spec.Ports:
     a. Create/update FiberLB Listener (protocol, port, algorithm)
  4. Get service endpoints:
     a. Create/update FiberLB Pool with backend members (pod IPs, ports)
  5. Update service.Status.LoadBalancer.Ingress with external IP
  6. If service spec changed:
     a. Update FiberLB resources accordingly
```

**3. Endpoint Watch Loop:**
```
for event := range endpointWatcher {
    service := getServiceForEndpoint(event.Object)
    if service.Spec.Type == "LoadBalancer" {
        updateLoadBalancerPool(service, event.Object)
    }
}
```
**Configuration:**

- **External IP Pool**: `--external-ip-pool=192.168.100.0/24` (CIDR or IP range)
- **FiberLB Endpoint**: `--fiberlb-endpoint=fiberlb-server:7000` (gRPC address)
- **IP Allocation**: First-available or integration with IPAM service

**Service Annotations:**
```yaml
apiVersion: v1
kind: Service
metadata:
  name: web-service
  annotations:
    fiberlb.plasmacloud.io/load-balancer-id: "lb-abc123"
    fiberlb.plasmacloud.io/algorithm: "round-robin"  # round-robin | least-conn | ip-hash
    fiberlb.plasmacloud.io/health-check-path: "/health"
    fiberlb.plasmacloud.io/health-check-interval: "10s"
    fiberlb.plasmacloud.io/health-check-timeout: "5s"
    fiberlb.plasmacloud.io/health-check-retries: "3"
    fiberlb.plasmacloud.io/session-affinity: "client-ip"  # For sticky sessions
spec:
  type: LoadBalancer
  selector:
    app: web
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8080
status:
  loadBalancer:
    ingress:
    - ip: 192.168.100.50
```
**FiberLB gRPC API Integration:**
```protobuf
service LoadBalancerService {
  rpc CreateLoadBalancer(CreateLoadBalancerRequest) returns (LoadBalancer);
  rpc UpdateLoadBalancer(UpdateLoadBalancerRequest) returns (LoadBalancer);
  rpc DeleteLoadBalancer(DeleteLoadBalancerRequest) returns (Empty);
  rpc CreateListener(CreateListenerRequest) returns (Listener);
  rpc UpdatePool(UpdatePoolRequest) returns (Pool);
}

message CreateLoadBalancerRequest {
  string name = 1;
  string description = 2;
  string external_ip = 3;  // If empty, allocate from pool
  string org_id = 4;
  string project_id = 5;
}

message CreateListenerRequest {
  string load_balancer_id = 1;
  string protocol = 2;  // TCP, UDP, HTTP, HTTPS
  int32 port = 3;
  string default_pool_id = 4;
  HealthCheck health_check = 5;
}

message UpdatePoolRequest {
  string pool_id = 1;
  repeated PoolMember members = 2;
  string algorithm = 3;
}

message PoolMember {
  string address = 1;  // Pod IP
  int32 port = 2;
  int32 weight = 3;
}
```

**Health Checks:**

- HTTP health checks: Use annotation `health-check-path`
- TCP health checks: Connection-based for non-HTTP services
- Health check failures remove pod from pool (auto-healing)

**Edge Cases:**

- **Service deletion**: Controller must clean up FiberLB resources and release external IP
- **Endpoint churn**: Debounce pool updates to avoid excessive FiberLB API calls
- **IP exhaustion**: Return error event on service, set status condition
### 3. IAM Authentication Webhook

**Purpose:** Authenticate K8s API requests using PlasmaCloud IAM tokens.

**Architecture:**

- **Webhook Server**: HTTPS endpoint (can be part of IAM service or standalone)
- **Integration Point**: kube-apiserver `--authentication-token-webhook-config-file`
- **Protocol**: K8s TokenReview API

**Webhook Endpoint:** `POST /apis/iam.plasmacloud.io/v1/authenticate`

**Request Flow:**
```
kubectl --token=<IAM-token> get pods
        ↓
kube-apiserver extracts Bearer token
        ↓
POST /apis/iam.plasmacloud.io/v1/authenticate
     body: TokenReview with token
        ↓
IAM webhook validates token
        ↓
Response: authenticated=true, user info, groups
        ↓
kube-apiserver proceeds with RBAC authorization
```
**Request Schema (from kube-apiserver):**
```json
{
  "apiVersion": "authentication.k8s.io/v1",
  "kind": "TokenReview",
  "spec": {
    "token": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9..."
  }
}
```

**Response Schema (from IAM webhook):**
```json
{
  "apiVersion": "authentication.k8s.io/v1",
  "kind": "TokenReview",
  "status": {
    "authenticated": true,
    "user": {
      "username": "user@example.com",
      "uid": "user-550e8400-e29b-41d4-a716-446655440000",
      "groups": [
        "org:org-123",
        "project:proj-456",
        "system:authenticated"
      ],
      "extra": {
        "org_id": ["org-123"],
        "project_id": ["proj-456"],
        "roles": ["org_admin"]
      }
    }
  }
}
```

**Error Response (invalid token):**
```json
{
  "apiVersion": "authentication.k8s.io/v1",
  "kind": "TokenReview",
  "status": {
    "authenticated": false,
    "error": "Invalid or expired token"
  }
}
```
**IAM Token Format:**

- **JWT**: Signed by IAM service with shared secret or public/private key
- **Claims**: sub (user ID), email, org_id, project_id, roles, exp (expiration)
- **Example**:
```json
{
  "sub": "user-550e8400-e29b-41d4-a716-446655440000",
  "email": "user@example.com",
  "org_id": "org-123",
  "project_id": "proj-456",
  "roles": ["org_admin", "project_member"],
  "exp": 1672531200
}
```

**User/Group Mapping:**

| IAM Principal | K8s Username | K8s Groups |
|---------------|--------------|------------|
| User (email) | user@example.com | org:<org_id>, project:<project_id>, system:authenticated |
| User (ID) | user-<uuid> | org:<org_id>, project:<project_id>, system:authenticated |
| Service Account | sa-<name>@<project> | org:<org_id>, project:<project_id>, system:serviceaccounts |
| Org Admin | admin@example.com | org:<org_id>, project:<all_projects>, k8s:org-admin |

**RBAC Integration:**

- Groups are used in RoleBindings and ClusterRoleBindings
- Example: `org:org-123` group gets admin access to all `project-*` namespaces for that org
**Webhook Configuration File (`/etc/k8shost/iam-webhook.yaml`):**
```yaml
apiVersion: v1
kind: Config
clusters:
- name: iam-webhook
  cluster:
    server: https://iam-server:3000/apis/iam.plasmacloud.io/v1/authenticate
    certificate-authority: /etc/k8shost/ca.crt
users:
- name: k8s-apiserver
  user:
    client-certificate: /etc/k8shost/apiserver-client.crt
    client-key: /etc/k8shost/apiserver-client.key
current-context: webhook
contexts:
- context:
    cluster: iam-webhook
    user: k8s-apiserver
  name: webhook
```

**Performance Considerations:**

- **Caching**: kube-apiserver caches successful authentications (--authentication-token-webhook-cache-ttl=2m)
- **Timeouts**: Webhook must respond within 10s (configurable)
- **Rate Limiting**: IAM webhook should handle high request volume (100s of req/s)
### 4. FlashDNS Service Discovery Controller

**Purpose:** Synchronize K8s Services and Pods to FlashDNS for cluster DNS resolution.

**Architecture:**

- **Controller Process**: Runs as pod in `kube-system` or embedded in k3s server
- **Watch Resources**: Services, Endpoints, Pods
- **Manage Resources**: FlashDNS A/AAAA/SRV records

**DNS Hierarchy:**

- **Pod A Records**: `<pod-ip-dashed>.pod.cluster.local` → Pod IP
  - Example: `10-244-1-5.pod.cluster.local` → `10.244.1.5`
- **Service A Records**: `<service>.<namespace>.svc.cluster.local` → ClusterIP or external IP
  - Example: `web.default.svc.cluster.local` → `10.96.0.100`
- **Headless Service**: `<endpoint>.<service>.<namespace>.svc.cluster.local` → Endpoint IPs
  - Example: `web-0.web.default.svc.cluster.local` → `10.244.1.10`
- **SRV Records**: `_<port>._<protocol>.<service>.<namespace>.svc.cluster.local`
  - Example: `_http._tcp.web.default.svc.cluster.local` → `0 50 80 web.default.svc.cluster.local`
**Controller Logic:**

**1. Service Watch:**
```
for event := range serviceWatcher {
    service := event.Object
    switch event.Type {
    case Created, Updated:
        if service.Spec.ClusterIP != "None":
            // Regular service
            createOrUpdateDNSRecord(
                name: service.Name + "." + service.Namespace + ".svc.cluster.local",
                type: "A",
                value: service.Spec.ClusterIP
            )

            if len(service.Status.LoadBalancer.Ingress) > 0:
                // LoadBalancer service - also add external IP
                createOrUpdateDNSRecord(
                    name: service.Name + "." + service.Namespace + ".svc.cluster.local",
                    type: "A",
                    value: service.Status.LoadBalancer.Ingress[0].IP
                )
        else:
            // Headless service - add endpoint records
            endpoints := getEndpoints(service)
            for _, ep := range endpoints:
                createOrUpdateDNSRecord(
                    name: ep.Hostname + "." + service.Name + "." + service.Namespace + ".svc.cluster.local",
                    type: "A",
                    value: ep.IP
                )

        // Create SRV records for each port
        for _, port := range service.Spec.Ports:
            createSRVRecord(service, port)

    case Deleted:
        deleteDNSRecords(service)
    }
}
```

**2. Pod Watch (for pod DNS):**
```
for event := range podWatcher {
    pod := event.Object
    switch event.Type {
    case Created, Updated:
        if pod.Status.PodIP != "":
            dashedIP := strings.ReplaceAll(pod.Status.PodIP, ".", "-")
            createOrUpdateDNSRecord(
                name: dashedIP + ".pod.cluster.local",
                type: "A",
                value: pod.Status.PodIP
            )
    case Deleted:
        deleteDNSRecord(pod)
    }
}
```
**FlashDNS gRPC API Integration:**
```protobuf
service DNSService {
  rpc CreateRecord(CreateRecordRequest) returns (DNSRecord);
  rpc UpdateRecord(UpdateRecordRequest) returns (DNSRecord);
  rpc DeleteRecord(DeleteRecordRequest) returns (Empty);
  rpc ListRecords(ListRecordsRequest) returns (ListRecordsResponse);
}

message CreateRecordRequest {
  string zone = 1;              // "cluster.local"
  string name = 2;              // "web.default.svc"
  string type = 3;              // "A", "AAAA", "SRV", "CNAME"
  string value = 4;             // "10.96.0.100"
  int32 ttl = 5;                // 30 (seconds)
  map<string, string> labels = 6;  // k8s metadata
}

message DNSRecord {
  string id = 1;
  string zone = 2;
  string name = 3;
  string type = 4;
  string value = 5;
  int32 ttl = 6;
}
```

**Configuration:**

- **FlashDNS Endpoint**: `--flashdns-endpoint=flashdns-server:6000`
- **Cluster Domain**: `--cluster-domain=cluster.local` (default)
- **Record TTL**: `--dns-ttl=30` (seconds, low for fast updates)
**Example DNS Records:**

```
# Regular service
web.default.svc.cluster.local.     30  IN  A    10.96.0.100

# Headless service with 3 pods
web.default.svc.cluster.local.     30  IN  A    10.244.1.10
web.default.svc.cluster.local.     30  IN  A    10.244.1.11
web.default.svc.cluster.local.     30  IN  A    10.244.1.12

# StatefulSet pods (Phase 3)
web-0.web.default.svc.cluster.local.  30  IN  A  10.244.1.10
web-1.web.default.svc.cluster.local.  30  IN  A  10.244.1.11

# SRV record for service port
_http._tcp.web.default.svc.cluster.local.  30  IN  SRV  0 50 80 web.default.svc.cluster.local.

# Pod DNS
10-244-1-10.pod.cluster.local.     30  IN  A    10.244.1.10
```

**Integration with kubelet:**

- kubelet configures pod DNS via `/etc/resolv.conf`
- `nameserver`: FlashDNS service IP (typically first IP in service CIDR, e.g., `10.96.0.10`)
- `search`: `<namespace>.svc.cluster.local svc.cluster.local cluster.local`

**Edge Cases:**

- **Service IP change**: Update DNS record atomically
- **Endpoint churn**: Debounce updates for headless services with many endpoints
- **DNS caching**: Low TTL (30s) for fast convergence
### 5. LightningStor CSI Driver

**Purpose:** Provide dynamic PersistentVolume provisioning and lifecycle management.

**CSI Driver Name:** `stor.plasmacloud.io`

**Architecture:**

- **Controller Plugin**: Runs as StatefulSet or Deployment in `kube-system`
  - Provisioning, deletion, attaching, detaching, snapshots
- **Node Plugin**: Runs as DaemonSet on every node
  - Staging, publishing (mounting), unpublishing, unstaging

**CSI Components:**

**1. Controller Service (Identity, Controller RPCs):**

- `CreateVolume`: Provision new volume via LightningStor
- `DeleteVolume`: Delete volume
- `ControllerPublishVolume`: Attach volume to node
- `ControllerUnpublishVolume`: Detach volume from node
- `ValidateVolumeCapabilities`: Check if volume supports requested capabilities
- `ListVolumes`: List all volumes
- `GetCapacity`: Query available storage capacity
- `CreateSnapshot`, `DeleteSnapshot`: Volume snapshots (Phase 2)

**2. Node Service (Node RPCs):**

- `NodeStageVolume`: Mount volume to global staging path on node
- `NodeUnstageVolume`: Unmount from staging path
- `NodePublishVolume`: Bind mount from staging to pod path
- `NodeUnpublishVolume`: Unmount from pod path
- `NodeGetInfo`: Return node ID and topology
- `NodeGetCapabilities`: Return node capabilities

**CSI Driver Workflow:**
**Volume Provisioning:**
```
1. User creates PVC:
   apiVersion: v1
   kind: PersistentVolumeClaim
   metadata:
     name: my-pvc
   spec:
     accessModes: [ReadWriteOnce]
     resources:
       requests:
         storage: 10Gi
     storageClassName: lightningstor-ssd

2. CSI Controller watches PVC, calls CreateVolume:
   CreateVolumeRequest {
     name: "pvc-550e8400-e29b-41d4-a716-446655440000"
     capacity_range: { required_bytes: 10737418240 }
     volume_capabilities: [{ access_mode: SINGLE_NODE_WRITER }]
     parameters: {
       "type": "ssd",
       "replication": "3",
       "org_id": "org-123",
       "project_id": "proj-456"
     }
   }

3. CSI Controller calls LightningStor gRPC CreateVolume:
   LightningStor creates volume, returns volume_id

4. CSI Controller creates PV:
   apiVersion: v1
   kind: PersistentVolume
   metadata:
     name: pvc-550e8400-e29b-41d4-a716-446655440000
   spec:
     capacity:
       storage: 10Gi
     accessModes: [ReadWriteOnce]
     persistentVolumeReclaimPolicy: Delete
     storageClassName: lightningstor-ssd
     csi:
       driver: stor.plasmacloud.io
       volumeHandle: vol-abc123
       fsType: ext4

5. K8s binds PVC to PV
```
**Volume Attachment (when pod is scheduled):**
```
1. kube-controller-manager creates VolumeAttachment:
   apiVersion: storage.k8s.io/v1
   kind: VolumeAttachment
   metadata:
     name: csi-<hash>
   spec:
     attacher: stor.plasmacloud.io
     nodeName: worker-1
     source:
       persistentVolumeName: pvc-550e8400-e29b-41d4-a716-446655440000

2. CSI Controller watches VolumeAttachment, calls ControllerPublishVolume:
   ControllerPublishVolumeRequest {
     volume_id: "vol-abc123"
     node_id: "worker-1"
     volume_capability: { access_mode: SINGLE_NODE_WRITER }
   }

3. CSI Controller calls LightningStor gRPC AttachVolume:
   LightningStor attaches volume to node (e.g., iSCSI target, NBD)

4. CSI Controller updates VolumeAttachment status: attached=true
```
**Volume Mounting (on node):**
```
1. kubelet calls CSI Node plugin: NodeStageVolume
   NodeStageVolumeRequest {
     volume_id: "vol-abc123"
     staging_target_path: "/var/lib/kubelet/plugins/kubernetes.io/csi/stor.plasmacloud.io/<hash>/globalmount"
     volume_capability: { mount: { fs_type: "ext4" } }
   }

2. CSI Node plugin:
   - Discovers block device (e.g., /dev/nbd0) via LightningStor
   - Formats if needed: mkfs.ext4 /dev/nbd0
   - Mounts to staging path: mount /dev/nbd0 <staging_target_path>

3. kubelet calls CSI Node plugin: NodePublishVolume
   NodePublishVolumeRequest {
     volume_id: "vol-abc123"
     staging_target_path: "/var/lib/kubelet/plugins/kubernetes.io/csi/stor.plasmacloud.io/<hash>/globalmount"
     target_path: "/var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~csi/pvc-<hash>/mount"
   }

4. CSI Node plugin:
   - Bind mount staging path to target path
   - Pod can now read/write to volume
```

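The kubelet-side paths in the flow above follow a fixed layout. As a rough sketch (assumptions: upstream kubelet derives the `<hash>` segment from a SHA-256 of the volume handle, and the constants below are illustrative, not part of this spec), the convention can be made concrete in Go:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"path/filepath"
)

const driverName = "stor.plasmacloud.io"

// stagingPath returns the global staging mount point used by
// NodeStageVolume; the <hash> segment is a hash of the volume handle.
func stagingPath(kubeletDir, volumeHandle string) string {
	sum := sha256.Sum256([]byte(volumeHandle))
	return filepath.Join(kubeletDir, "plugins/kubernetes.io/csi",
		driverName, hex.EncodeToString(sum[:]), "globalmount")
}

// publishPath returns the per-pod bind-mount target used by
// NodePublishVolume.
func publishPath(kubeletDir, podUID, pvName string) string {
	return filepath.Join(kubeletDir, "pods", podUID,
		"volumes/kubernetes.io~csi", pvName, "mount")
}

func main() {
	fmt.Println(stagingPath("/var/lib/kubelet", "vol-abc123"))
	fmt.Println(publishPath("/var/lib/kubelet", "1234-pod-uid", "pvc-550e8400"))
}
```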
**LightningStor gRPC API Integration:**
```protobuf
service VolumeService {
  rpc CreateVolume(CreateVolumeRequest) returns (Volume);
  rpc DeleteVolume(DeleteVolumeRequest) returns (Empty);
  rpc AttachVolume(AttachVolumeRequest) returns (VolumeAttachment);
  rpc DetachVolume(DetachVolumeRequest) returns (Empty);
  rpc GetVolume(GetVolumeRequest) returns (Volume);
  rpc ListVolumes(ListVolumesRequest) returns (ListVolumesResponse);
}

message CreateVolumeRequest {
  string name = 1;
  int64 size_bytes = 2;
  string volume_type = 3; // "ssd", "hdd"
  int32 replication_factor = 4;
  string org_id = 5;
  string project_id = 6;
}

message Volume {
  string id = 1;
  string name = 2;
  int64 size_bytes = 3;
  string status = 4; // "available", "in-use", "error"
  string volume_type = 5;
}

message AttachVolumeRequest {
  string volume_id = 1;
  string node_id = 2;
  string attach_mode = 3; // "read-write", "read-only"
}

message VolumeAttachment {
  string id = 1;
  string volume_id = 2;
  string node_id = 3;
  string device_path = 4; // e.g., "/dev/nbd0"
  string connection_info = 5; // JSON with iSCSI target, NBD socket, etc.
}
```

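The CSI controller has to translate StorageClass `parameters` into a `CreateVolumeRequest`. A minimal sketch of that mapping (the plain struct mirrors the proto message above so the example stays self-contained; the `"ssd"`/replication-3 defaults are assumptions, not spec):

```go
package main

import (
	"fmt"
	"strconv"
)

// CreateVolumeRequest mirrors the LightningStor proto message above.
type CreateVolumeRequest struct {
	Name              string
	SizeBytes         int64
	VolumeType        string
	ReplicationFactor int32
	OrgID, ProjectID  string
}

// fromStorageClass maps StorageClass parameters (as passed to the CSI
// controller) onto the request. Defaults when parameters are omitted
// are illustrative assumptions: type "ssd", replication 3.
func fromStorageClass(name string, sizeBytes int64, params map[string]string) (CreateVolumeRequest, error) {
	req := CreateVolumeRequest{Name: name, SizeBytes: sizeBytes, VolumeType: "ssd", ReplicationFactor: 3}
	if t, ok := params["type"]; ok {
		req.VolumeType = t
	}
	if r, ok := params["replication"]; ok {
		n, err := strconv.Atoi(r)
		if err != nil {
			return req, fmt.Errorf("invalid replication %q: %w", r, err)
		}
		req.ReplicationFactor = int32(n)
	}
	return req, nil
}

func main() {
	req, _ := fromStorageClass("pvc-123", 10<<30, map[string]string{"type": "hdd", "replication": "2"})
	fmt.Printf("%+v\n", req)
}
```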
**StorageClass Examples:**
```yaml
# SSD storage with 3x replication
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: lightningstor-ssd
provisioner: stor.plasmacloud.io
parameters:
  type: "ssd"
  replication: "3"
volumeBindingMode: WaitForFirstConsumer # Topology-aware scheduling
allowVolumeExpansion: true
reclaimPolicy: Delete

---
# HDD storage with 2x replication
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: lightningstor-hdd
provisioner: stor.plasmacloud.io
parameters:
  type: "hdd"
  replication: "2"
volumeBindingMode: Immediate
allowVolumeExpansion: true
reclaimPolicy: Retain # Keep volume after PVC deletion
```

**Access Modes:**
- **ReadWriteOnce (RWO)**: Single node read-write (most common)
- **ReadOnlyMany (ROX)**: Multiple nodes read-only
- **ReadWriteMany (RWX)**: Multiple nodes read-write (requires shared filesystem like NFS, Phase 2)

**Volume Expansion (if allowVolumeExpansion: true):**
```
1. User edits PVC: spec.resources.requests.storage: 20Gi (was 10Gi)
2. CSI Controller calls ControllerExpandVolume
3. LightningStor expands volume backend
4. CSI Node plugin calls NodeExpandVolume
5. Filesystem resize: resize2fs /dev/nbd0
```

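The expansion flow above hinges on one comparison inside `ControllerExpandVolume`: does the requested size, after alignment, exceed what the backend already provides? A small sketch (the whole-GiB granularity is an assumption for illustration, not a documented LightningStor constraint):

```go
package main

import "fmt"

const GiB = 1 << 30

// roundUpToGiB aligns a requested size to whole GiB, the allocation
// granularity assumed here for LightningStor volumes.
func roundUpToGiB(bytes int64) int64 {
	return (bytes + GiB - 1) / GiB * GiB
}

// needsExpansion reports whether the backend volume must grow to
// satisfy the new PVC request; equal-or-smaller requests are no-ops.
func needsExpansion(currentBytes, requestedBytes int64) bool {
	return roundUpToGiB(requestedBytes) > currentBytes
}

func main() {
	fmt.Println(roundUpToGiB(10*GiB + 1)) // 10Gi plus one byte rounds up to 11Gi
	fmt.Println(needsExpansion(10*GiB, 20*GiB))
}
```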
### 6. PlasmaVMC Integration

**Phase 1 (MVP):** Use containerd as default CRI
- k3s ships with containerd embedded
- Standard OCI container runtime
- No changes needed for Phase 1

**Phase 3 (Future):** Custom CRI for VM-backed pods

**Motivation:**
- **Enhanced Isolation**: Stronger security boundary than containers
- **Multi-Tenant Security**: Prevent container escape attacks
- **Consistent Runtime**: Unify VM and container workloads on PlasmaVMC

**Architecture:**
- PlasmaVMC implements CRI (Container Runtime Interface)
- Each pod runs as a lightweight VM (Firecracker microVM)
- Pod containers run inside VM (still using containerd within VM)
- kubelet communicates with PlasmaVMC CRI endpoint instead of containerd

**CRI Interface Implementation:**

**RuntimeService:**
- `RunPodSandbox`: Create Firecracker microVM for pod
- `StopPodSandbox`: Stop microVM
- `RemovePodSandbox`: Delete microVM
- `PodSandboxStatus`: Query microVM status
- `ListPodSandbox`: List all pod microVMs
- `CreateContainer`: Create container inside microVM
- `StartContainer`, `StopContainer`, `RemoveContainer`: Container lifecycle
- `ExecSync`, `Exec`: Execute commands in container
- `Attach`: Attach to container stdio

**ImageService:**
- `PullImage`: Download container image (delegate to internal containerd)
- `RemoveImage`: Delete image
- `ListImages`: List cached images
- `ImageStatus`: Query image metadata

**Implementation Strategy:**
```
┌─────────────────────────────────────────┐
│ kubelet (k3s agent)                     │
└─────────────┬───────────────────────────┘
              │ CRI gRPC
              ▼
┌─────────────────────────────────────────┐
│ PlasmaVMC CRI Server (Rust)             │
│ - RunPodSandbox → Create microVM        │
│ - CreateContainer → Run in VM           │
└─────────────┬───────────────────────────┘
              │
              ▼
┌─────────────────────────────────────────┐
│ Firecracker VMM (per pod)               │
│ ┌───────────────────────────────────┐   │
│ │ Pod VM (minimal Linux kernel)     │   │
│ │ ┌──────────────────────────────┐  │   │
│ │ │ containerd (in-VM)           │  │   │
│ │ │ - Container 1                │  │   │
│ │ │ - Container 2                │  │   │
│ │ └──────────────────────────────┘  │   │
│ └───────────────────────────────────┘   │
└─────────────────────────────────────────┘
```

**Configuration (Phase 3):**
```nix
services.k8shost = {
  enable = true;
  cri = "plasmavmc"; # Instead of "containerd"
  plasmavmc = {
    endpoint = "unix:///var/run/plasmavmc/cri.sock";
    vmKernel = "/var/lib/plasmavmc/vmlinux.bin";
    vmRootfs = "/var/lib/plasmavmc/rootfs.ext4";
  };
};
```

**Benefits:**
- Stronger isolation for untrusted workloads
- Leverage existing PlasmaVMC infrastructure
- Consistent management across VM and K8s workloads

**Challenges:**
- Performance overhead (microVM startup time, memory overhead)
- Image caching complexity (need containerd inside VM)
- Networking integration (CNI must configure VM network)

**Decision:** Defer to Phase 3, focus on standard containerd for MVP.

## Multi-Tenant Model

### Namespace Strategy

**Principle:** One K8s namespace per PlasmaCloud project.

**Namespace Naming:**
- **Project namespaces**: `project-<project_id>` (e.g., `project-550e8400-e29b-41d4-a716-446655440000`)
- **Org shared namespaces** (optional): `org-<org_id>-shared` (for shared resources like monitoring)
- **System namespaces**: `kube-system`, `kube-public`, `kube-node-lease`, `default`

**Namespace Lifecycle:**
- Created automatically when project provisions K8s cluster
- Labeled with `org_id`, `project_id` for RBAC and billing
- Deleted when project is deleted (with grace period)

**Namespace Metadata:**
```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: project-550e8400-e29b-41d4-a716-446655440000
  labels:
    plasmacloud.io/org-id: "org-123"
    plasmacloud.io/project-id: "proj-456"
    plasmacloud.io/tenant-type: "project"
  annotations:
    plasmacloud.io/project-name: "my-web-app"
    plasmacloud.io/created-by: "user@example.com"
```

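The naming scheme and label keys above are mechanical, so the provisioner can derive them from the project record alone. A minimal sketch (the function name and return shape are illustrative; the label keys and `project-` prefix come from the spec):

```go
package main

import "fmt"

// namespaceFor derives the K8s namespace name and tenancy labels for a
// PlasmaCloud project, per the naming scheme and label keys above.
func namespaceFor(orgID, projectID string) (string, map[string]string) {
	name := "project-" + projectID
	labels := map[string]string{
		"plasmacloud.io/org-id":      orgID,
		"plasmacloud.io/project-id":  projectID,
		"plasmacloud.io/tenant-type": "project",
	}
	return name, labels
}

func main() {
	name, labels := namespaceFor("org-123", "proj-456")
	fmt.Println(name, labels["plasmacloud.io/org-id"])
}
```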
### RBAC Templates

**Org Admin Role (full access to all project namespaces):**
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: org-admin
  namespace: project-550e8400-e29b-41d4-a716-446655440000
rules:
  - apiGroups: ["*"]
    resources: ["*"]
    verbs: ["*"]

---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: org-admin-binding
  namespace: project-550e8400-e29b-41d4-a716-446655440000
subjects:
  - kind: Group
    name: org:org-123
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: org-admin
  apiGroup: rbac.authorization.k8s.io
```

**Project Admin Role (full access to specific project namespace):**
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: project-admin
  namespace: project-550e8400-e29b-41d4-a716-446655440000
rules:
  - apiGroups: ["", "apps", "batch", "networking.k8s.io", "storage.k8s.io"]
    resources: ["*"]
    verbs: ["*"]

---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: project-admin-binding
  namespace: project-550e8400-e29b-41d4-a716-446655440000
subjects:
  - kind: Group
    name: project:proj-456
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: project-admin
  apiGroup: rbac.authorization.k8s.io
```

**Project Viewer Role (read-only access):**
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: project-viewer
  namespace: project-550e8400-e29b-41d4-a716-446655440000
rules:
  - apiGroups: ["", "apps", "batch", "networking.k8s.io"]
    resources: ["pods", "services", "deployments", "replicasets", "configmaps", "secrets"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["pods/log"]
    verbs: ["get", "list"]

---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: project-viewer-binding
  namespace: project-550e8400-e29b-41d4-a716-446655440000
subjects:
  - kind: Group
    name: project:proj-456:viewer
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: project-viewer
  apiGroup: rbac.authorization.k8s.io
```

**ClusterRole for Node Access (for cluster admins):**
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: plasmacloud-cluster-admin
rules:
  - apiGroups: [""]
    resources: ["nodes", "persistentvolumes"]
    verbs: ["*"]
  - apiGroups: ["storage.k8s.io"]
    resources: ["storageclasses"]
    verbs: ["*"]

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: plasmacloud-cluster-admin-binding
subjects:
  - kind: Group
    name: system:plasmacloud-admins
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: plasmacloud-cluster-admin
  apiGroup: rbac.authorization.k8s.io
```

### Network Isolation

**Default NetworkPolicy (deny all, except DNS):**
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: project-550e8400-e29b-41d4-a716-446655440000
spec:
  podSelector: {} # Apply to all pods
  policyTypes:
    - Ingress
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53 # DNS
```

**Allow Ingress from LoadBalancer:**
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-loadbalancer
  namespace: project-550e8400-e29b-41d4-a716-446655440000
spec:
  podSelector:
    matchLabels:
      app: web
  policyTypes:
    - Ingress
  ingress:
    - from:
        - ipBlock:
            cidr: 0.0.0.0/0 # Allow from anywhere (LoadBalancer external traffic)
      ports:
        - protocol: TCP
          port: 8080
```

**Allow Inter-Namespace Communication (optional, for org-shared services):**
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-org-shared
  namespace: project-550e8400-e29b-41d4-a716-446655440000
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              plasmacloud.io/org-id: "org-123"
              plasmacloud.io/tenant-type: "org-shared"
```

**PrismNET Enforcement:**
- NetworkPolicies are translated to OVN ACLs by PrismNET CNI controller
- Enforced at OVN logical switch level (low-level packet filtering)

### Resource Quotas

**CPU and Memory Quotas:**
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: project-compute-quota
  namespace: project-550e8400-e29b-41d4-a716-446655440000
spec:
  hard:
    requests.cpu: "10"      # 10 CPU cores
    requests.memory: "20Gi" # 20 GB RAM
    limits.cpu: "20"        # Allow bursting to 20 cores
    limits.memory: "40Gi"   # Allow bursting to 40 GB RAM
```

**Storage Quotas:**
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: project-storage-quota
  namespace: project-550e8400-e29b-41d4-a716-446655440000
spec:
  hard:
    persistentvolumeclaims: "10" # Max 10 PVCs
    requests.storage: "100Gi"    # Total storage requests
```

**Object Count Quotas:**
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: project-object-quota
  namespace: project-550e8400-e29b-41d4-a716-446655440000
spec:
  hard:
    pods: "50"
    services: "20"
    services.loadbalancers: "5" # Max 5 LoadBalancer services (limit external IPs)
    configmaps: "50"
    secrets: "50"
```

**Quota Enforcement:**
- K8s admission controller rejects resource creation exceeding quota
- User receives clear error message
- Quota usage visible in `kubectl describe quota`

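The enforcement rule itself is a single comparison per resource: admission fails when current usage plus the incoming request passes the hard limit. A sketch of that rule only (real enforcement is done by the K8s ResourceQuota admission controller, not by PlasmaCloud code):

```go
package main

import "fmt"

// wouldExceed mimics the admission-time check for one resource: a
// request is rejected when usage plus the new request passes the hard
// limit. Quantities are in milliCPU to match K8s resource accounting.
func wouldExceed(usedMilliCPU, requestMilliCPU, hardMilliCPU int64) bool {
	return usedMilliCPU+requestMilliCPU > hardMilliCPU
}

func main() {
	// Quota hard limit requests.cpu: "10" means 10000 milliCPU.
	fmt.Println(wouldExceed(9500, 1000, 10000)) // 9.5 + 1 cores exceeds 10
	fmt.Println(wouldExceed(9500, 500, 10000))  // exactly at the limit is allowed
}
```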
## Deployment Model

### Single-Server (Development/Small)

**Target Use Case:**
- Development and testing environments
- Small production workloads (<10 nodes)
- Cost-sensitive deployments

**Architecture:**
- Single k3s server node with embedded SQLite datastore
- Control plane and worker colocated
- No HA guarantees

**k3s Server Command:**
```bash
k3s server \
  --data-dir=/var/lib/k8shost \
  --disable=servicelb,traefik,flannel \
  --flannel-backend=none \
  --disable-network-policy \
  --cluster-domain=cluster.local \
  --service-cidr=10.96.0.0/12 \
  --cluster-cidr=10.244.0.0/16 \
  --authentication-token-webhook-config-file=/etc/k8shost/iam-webhook.yaml \
  --bind-address=0.0.0.0 \
  --advertise-address=192.168.1.100 \
  --tls-san=k8s-api.example.com
```

**NixOS Configuration:**
```nix
{ config, lib, pkgs, ... }:

{
  services.k8shost = {
    enable = true;
    mode = "server";
    datastore = "sqlite"; # Embedded SQLite
    disableComponents = ["servicelb" "traefik" "flannel"];

    networking = {
      serviceCIDR = "10.96.0.0/12";
      clusterCIDR = "10.244.0.0/16";
      clusterDomain = "cluster.local";
    };

    prismnet = {
      enable = true;
      endpoint = "prismnet-server:5000";
      ovnNorthbound = "tcp:prismnet-server:6641";
      ovnSouthbound = "tcp:prismnet-server:6642";
    };

    fiberlb = {
      enable = true;
      endpoint = "fiberlb-server:7000";
      externalIpPool = "192.168.100.0/24";
    };

    iam = {
      enable = true;
      webhookEndpoint = "https://iam-server:3000/apis/iam.plasmacloud.io/v1/authenticate";
      caCertFile = "/etc/k8shost/ca.crt";
      clientCertFile = "/etc/k8shost/client.crt";
      clientKeyFile = "/etc/k8shost/client.key";
    };

    flashdns = {
      enable = true;
      endpoint = "flashdns-server:6000";
      clusterDomain = "cluster.local";
      recordTTL = 30;
    };

    lightningstor = {
      enable = true;
      endpoint = "lightningstor-server:8000";
      csiNodeDaemonSet = true; # Deploy CSI node plugin as DaemonSet
    };
  };

  # Open firewall for K8s API
  networking.firewall.allowedTCPPorts = [ 6443 ];
}
```

**Limitations:**
- No HA (single point of failure)
- SQLite has limited concurrency
- Control plane downtime affects entire cluster

### HA Cluster (Production)

**Target Use Case:**
- Production workloads requiring high availability
- Large clusters (>10 nodes)
- Mission-critical applications

**Architecture:**
- 3 or 5 k3s server nodes (odd number for quorum)
- Embedded etcd (Raft consensus, HA datastore)
- Load balancer in front of API servers
- Agent nodes for workload scheduling

**k3s Server Command (each server node):**
```bash
k3s server \
  --data-dir=/var/lib/k8shost \
  --disable=servicelb,traefik,flannel \
  --flannel-backend=none \
  --disable-network-policy \
  --cluster-domain=cluster.local \
  --service-cidr=10.96.0.0/12 \
  --cluster-cidr=10.244.0.0/16 \
  --authentication-token-webhook-config-file=/etc/k8shost/iam-webhook.yaml \
  --tls-san=k8s-api-lb.example.com \
  --tls-san=k8s-api.example.com \
  --cluster-init

# First server only: --cluster-init bootstraps the embedded etcd cluster.
# Joining servers omit --cluster-init and instead pass:
#   --server https://k8s-api-lb.internal:6443
```

**k3s Agent Command (worker nodes):**
```bash
k3s agent \
  --server https://k8s-api-lb.internal:6443 \
  --token <join-token>
```

**NixOS Configuration (Server Node):**
```nix
{ config, lib, pkgs, ... }:

{
  services.k8shost = {
    enable = true;
    mode = "server";
    datastore = "etcd"; # Embedded etcd for HA
    clusterInit = true; # Set to false for joining servers
    serverUrl = "https://k8s-api-lb.internal:6443"; # For joining servers

    # ... same integrations as single-server ...
  };

  # High availability settings
  systemd.services.k8shost = {
    serviceConfig = {
      Restart = "always";
      RestartSec = "10s";
    };
  };
}
```

**Load Balancer Configuration (FiberLB):**
```yaml
# External LoadBalancer for API access
apiVersion: v1
kind: LoadBalancer
metadata:
  name: k8s-api-lb
spec:
  listeners:
    - protocol: TCP
      port: 6443
      backend_pool: k8s-api-servers
  pools:
    - name: k8s-api-servers
      algorithm: round-robin
      members:
        - address: 192.168.1.101 # server-1
          port: 6443
        - address: 192.168.1.102 # server-2
          port: 6443
        - address: 192.168.1.103 # server-3
          port: 6443
      health_check:
        type: tcp
        interval: 10s
        timeout: 5s
        retries: 3
```

**Datastore Options:**

#### Option 1: Embedded etcd (Recommended for MVP)
**Pros:**
- Built-in to k3s, no external dependencies
- Proven, battle-tested (CNCF etcd project)
- Automatic HA with Raft consensus
- Easy setup (just `--cluster-init`)

**Cons:**
- Another distributed datastore (in addition to Chainfire/FlareDB)
- etcd-specific operations (backup, restore, defragmentation)

#### Option 2: FlareDB as External Datastore
**Pros:**
- Unified storage layer for PlasmaCloud
- Leverage existing FlareDB deployment
- Simplified infrastructure (one less system to manage)

**Cons:**
- k3s requires etcd API compatibility
- FlareDB would need to implement the etcd v3 API (significant effort)
- Untested for K8s workloads

**Recommendation for MVP:** Use embedded etcd for HA mode. Investigate FlareDB etcd compatibility in Phase 2 or 3.

**Backup and Disaster Recovery:**
```bash
# etcd snapshot (on any server node)
k3s etcd-snapshot save --name backup-$(date +%Y%m%d-%H%M%S)

# List snapshots
k3s etcd-snapshot ls

# Restore from snapshot
k3s server --cluster-reset --cluster-reset-restore-path=/var/lib/k8shost/server/db/snapshots/backup-20250101-120000
```

### NixOS Module Integration

**Module Structure:**
```
nix/modules/
├── k8shost.nix          # Main module
└── k8shost/
    ├── controller.nix   # FiberLB, FlashDNS controllers
    ├── csi.nix          # LightningStor CSI driver
    └── cni.nix          # PrismNET CNI plugin
```

**Main Module (`nix/modules/k8shost.nix`):**
```nix
{ config, lib, pkgs, ... }:

with lib;

let
  cfg = config.services.k8shost;
  # buildServerArgs and generateIAMWebhookConfig are module-local
  # helper functions (definitions elided here).
in
{
  options.services.k8shost = {
    enable = mkEnableOption "PlasmaCloud K8s Hosting Service";

    mode = mkOption {
      type = types.enum ["server" "agent"];
      default = "server";
      description = "Run as server (control plane) or agent (worker)";
    };

    datastore = mkOption {
      type = types.enum ["sqlite" "etcd"];
      default = "sqlite";
      description = "Datastore backend (sqlite for single-server, etcd for HA)";
    };

    disableComponents = mkOption {
      type = types.listOf types.str;
      default = ["servicelb" "traefik" "flannel"];
      description = "k3s components to disable";
    };

    networking = {
      serviceCIDR = mkOption {
        type = types.str;
        default = "10.96.0.0/12";
        description = "CIDR for service ClusterIPs";
      };

      clusterCIDR = mkOption {
        type = types.str;
        default = "10.244.0.0/16";
        description = "CIDR for pod IPs";
      };

      clusterDomain = mkOption {
        type = types.str;
        default = "cluster.local";
        description = "Cluster DNS domain";
      };
    };

    # Integration options (prismnet, fiberlb, iam, flashdns, lightningstor)
    # ...
  };

  config = mkIf cfg.enable {
    # Install k3s package
    environment.systemPackages = [ pkgs.k3s ];

    # Create systemd service
    systemd.services.k8shost = {
      description = "PlasmaCloud K8s Hosting Service (k3s)";
      after = [ "network.target" "iam.service" "prismnet.service" ];
      requires = [ "iam.service" "prismnet.service" ];
      wantedBy = [ "multi-user.target" ];

      serviceConfig = {
        Type = "notify";
        ExecStart = "${pkgs.k3s}/bin/k3s ${cfg.mode} ${concatStringsSep " " (buildServerArgs cfg)}";
        KillMode = "process";
        Delegate = "yes";
        LimitNOFILE = 1048576;
        LimitNPROC = "infinity";
        LimitCORE = "infinity";
        TasksMax = "infinity";
        Restart = "always";
        RestartSec = "5s";
      };
    };

    # Create configuration files
    environment.etc."k8shost/iam-webhook.yaml" = {
      text = generateIAMWebhookConfig cfg.iam;
      mode = "0600";
    };

    # Deploy controllers (FiberLB, FlashDNS, etc.)
    # ... (as separate systemd services or in-cluster deployments)
  };
}
```

## API Server Configuration

### k3s Server Flags (Complete)

Annotated reference; strip the comment lines before running, since a `#` inside a backslash-continued command breaks the continuation:

```bash
k3s server \
  # Data and cluster configuration
  --data-dir=/var/lib/k8shost \
  --cluster-init \                              # first server in an HA cluster
  --server https://k8s-api-lb.internal:6443 \   # join an existing HA cluster instead
  --token <cluster-token> \                     # secure join token

  # Disable default components
  --disable=servicelb,traefik,flannel,local-storage \
  --flannel-backend=none \
  --disable-network-policy \

  # Network configuration
  --cluster-domain=cluster.local \
  --service-cidr=10.96.0.0/12 \
  --cluster-cidr=10.244.0.0/16 \
  --service-node-port-range=30000-32767 \

  # API server configuration
  --bind-address=0.0.0.0 \
  --advertise-address=192.168.1.100 \
  --tls-san=k8s-api.example.com \
  --tls-san=k8s-api-lb.example.com \

  # Authentication
  --authentication-token-webhook-config-file=/etc/k8shost/iam-webhook.yaml \
  --authentication-token-webhook-cache-ttl=2m \

  # Authorization (RBAC enabled by default)
  # --authorization-mode=Node,RBAC              # default, no need to specify

  # Audit logging
  --kube-apiserver-arg=audit-log-path=/var/log/k8shost/audit.log \
  --kube-apiserver-arg=audit-log-maxage=30 \
  --kube-apiserver-arg=audit-log-maxbackup=10 \
  --kube-apiserver-arg=audit-log-maxsize=100

  # Feature gates (if needed)
  # --kube-apiserver-arg=feature-gates=SomeFeature=true
```

### Authentication Webhook Configuration

**File: `/etc/k8shost/iam-webhook.yaml`**
```yaml
apiVersion: v1
kind: Config
clusters:
  - name: iam-webhook
    cluster:
      server: https://iam-server:3000/apis/iam.plasmacloud.io/v1/authenticate
      certificate-authority: /etc/k8shost/ca.crt
users:
  - name: k8s-apiserver
    user:
      client-certificate: /etc/k8shost/apiserver-client.crt
      client-key: /etc/k8shost/apiserver-client.key
current-context: webhook
contexts:
  - context:
      cluster: iam-webhook
      user: k8s-apiserver
    name: webhook
```

**Certificate Management:**
- CA certificate: Issued by PlasmaCloud IAM PKI
- Client certificate: For kube-apiserver to authenticate to IAM webhook
- Rotation: Certificates expire after 1 year, auto-renewed by IAM

## Security

### TLS/mTLS

**Component Communication:**

| Source | Destination | Protocol | Auth Method |
|--------|-------------|----------|-------------|
| kube-apiserver | IAM webhook | HTTPS + mTLS | Client cert |
| FiberLB controller | FiberLB gRPC | gRPC + TLS | IAM token |
| FlashDNS controller | FlashDNS gRPC | gRPC + TLS | IAM token |
| LightningStor CSI | LightningStor gRPC | gRPC + TLS | IAM token |
| PrismNET CNI | PrismNET gRPC | gRPC + TLS | IAM token |
| kubectl | kube-apiserver | HTTPS | IAM token (Bearer) |

**Certificate Issuance:**
- All certificates issued by IAM service (centralized PKI)
- Automatic renewal before expiration
- Certificate revocation via IAM CRL

### Pod Security

**Pod Security Standards (PSS):**
- **Baseline Profile**: Enforced on all namespaces by default
  - Deny privileged containers
  - Deny host network/PID/IPC
  - Deny hostPath volumes
  - Deny privilege escalation
- **Restricted Profile**: Optional, for highly sensitive workloads

**Example PodSecurityPolicy (deprecated since K8s 1.21 and removed in 1.25; shown for reference only, use PSS instead):**
```yaml
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: restricted
spec:
  privileged: false
  allowPrivilegeEscalation: false
  requiredDropCapabilities:
    - ALL
  volumes:
    - configMap
    - emptyDir
    - projected
    - secret
    - downwardAPI
    - persistentVolumeClaim
  runAsUser:
    rule: MustRunAsNonRoot
  seLinux:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
```

**Security Contexts (enforced):**
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: secure-pod
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 2000
  containers:
    - name: app
      image: myapp:latest
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop:
            - ALL
```

**Service Account Permissions:**
- Minimal RBAC permissions by default
- Principle of least privilege
- No cluster-admin access for user workloads

## Testing Strategy

### Unit Tests

**Controllers (Go):**
```go
// fiberlb_controller_test.go
func TestReconcileLoadBalancer(t *testing.T) {
	// Mock K8s client
	client := fake.NewSimpleClientset()

	// Mock FiberLB gRPC client
	mockFiberLB := &mockFiberLBClient{}

	controller := NewFiberLBController(client, mockFiberLB)

	// Create test service
	svc := &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: "test-svc", Namespace: "default"},
		Spec:       corev1.ServiceSpec{Type: corev1.ServiceTypeLoadBalancer},
	}

	// Reconcile
	err := controller.Reconcile(svc)
	assert.NoError(t, err)

	// Verify FiberLB API called
	assert.Equal(t, 1, mockFiberLB.createLoadBalancerCalls)
}
```

**CNI Plugin (Rust):**
```rust
#[test]
fn test_cni_add() {
    let mut mock_ovn = MockOVNClient::new();
    mock_ovn.expect_allocate_ip()
        .returning(|_ns, _pod| Ok("10.244.1.5/24".to_string()));

    let plugin = PrismNETPlugin::new(mock_ovn);
    let result = plugin.handle_add(/* ... */);

    assert!(result.is_ok());
    assert_eq!(result.unwrap().ip, "10.244.1.5");
}
```

**CSI Driver (Go):**
```go
func TestCreateVolume(t *testing.T) {
	mockLightningStor := &mockLightningStorClient{}
	mockLightningStor.On("CreateVolume", mock.Anything).Return(&Volume{ID: "vol-123"}, nil)

	driver := NewCSIDriver(mockLightningStor)

	req := &csi.CreateVolumeRequest{
		Name:          "test-vol",
		CapacityRange: &csi.CapacityRange{RequiredBytes: 10 * 1024 * 1024 * 1024},
	}

	resp, err := driver.CreateVolume(context.Background(), req)
	assert.NoError(t, err)
	assert.Equal(t, "vol-123", resp.Volume.VolumeId)
}
```

### Integration Tests

**Test Environment:**
- Single-node k3s cluster (kind or k3s in Docker)
- Mock or real PlasmaCloud services (PrismNET, FiberLB, etc.)
- Automated setup and teardown

**Test Cases:**

**1. Single-Pod Deployment:**
```bash
#!/bin/bash
set -e

# Deploy nginx pod
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
    - name: nginx
      image: nginx:latest
      ports:
        - containerPort: 80
EOF

# Wait for pod to be running
kubectl wait --for=condition=Ready pod/nginx --timeout=60s

# Verify pod IP allocated
POD_IP=$(kubectl get pod nginx -o jsonpath='{.status.podIP}')
[ -n "$POD_IP" ] || exit 1

# Cleanup
kubectl delete pod nginx
```

**2. Service Exposure (LoadBalancer):**
```bash
#!/bin/bash
set -e

# Create deployment
kubectl create deployment web --image=nginx:latest --replicas=2

# Expose as LoadBalancer
kubectl expose deployment web --type=LoadBalancer --port=80

# Wait for external IP
for i in {1..30}; do
  EXTERNAL_IP=$(kubectl get svc web -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
  [ -n "$EXTERNAL_IP" ] && break
  sleep 2
done

[ -n "$EXTERNAL_IP" ] || exit 1

# Verify HTTP access
curl -f http://$EXTERNAL_IP || exit 1

# Cleanup
kubectl delete svc web
kubectl delete deployment web
```

**3. PersistentVolume Provisioning:**
```bash
#!/bin/bash
set -e

# Create PVC (lightningstor-ssd uses WaitForFirstConsumer, so the claim
# stays Pending until a pod consumes it)
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pvc
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 1Gi
  storageClassName: lightningstor-ssd
EOF

# Create pod using PVC
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  containers:
    - name: app
      image: busybox
      command: ["sh", "-c", "echo hello > /data/test.txt && sleep 3600"]
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: test-pvc
EOF

kubectl wait --for=condition=Ready pod/test-pod --timeout=120s

# Verify PVC bound (PVCs have no "Bound" condition, so match the phase)
kubectl wait --for=jsonpath='{.status.phase}'=Bound pvc/test-pvc --timeout=60s

# Verify file written
kubectl exec test-pod -- cat /data/test.txt | grep hello || exit 1

# Cleanup
kubectl delete pod test-pod
kubectl delete pvc test-pvc
```

**4. Multi-Tenant Isolation:**
```bash
#!/bin/bash
set -e

# Create two namespaces
kubectl create namespace project-a
kubectl create namespace project-b

# Deploy a pod in each (busybox in project-a, so the connectivity probe
# can exec wget; the stock nginx image ships neither curl nor wget)
kubectl run pod-a --image=busybox -n project-a --restart=Never -- sleep 3600
kubectl run pod-b --image=nginx -n project-b --restart=Never
kubectl wait --for=condition=Ready pod/pod-a -n project-a --timeout=60s
kubectl wait --for=condition=Ready pod/pod-b -n project-b --timeout=60s

# Verify network isolation (if NetworkPolicies enabled):
# pod-a should NOT be able to reach pod-b
POD_B_IP=$(kubectl get pod pod-b -n project-b -o jsonpath='{.status.podIP}')
kubectl exec pod-a -n project-a -- wget -qO- -T 5 http://$POD_B_IP && exit 1 || true

# Cleanup
kubectl delete ns project-a project-b
```

### E2E Test Scenario

**End-to-End Test: Deploy Multi-Tier Application**

```bash
#!/bin/bash
set -ex

NAMESPACE="project-123"

# 1. Create namespace
kubectl create namespace $NAMESPACE

# 2. Deploy PostgreSQL with PVC
kubectl apply -n $NAMESPACE -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-pvc
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 5Gi
  storageClassName: lightningstor-ssd
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres
spec:
  replicas: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
      - name: postgres
        image: postgres:15
        env:
        - name: POSTGRES_PASSWORD
          value: testpass
        ports:
        - containerPort: 5432
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: postgres-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: postgres
spec:
  selector:
    app: postgres
  ports:
  - port: 5432
EOF

# 3. Deploy web application
kubectl apply -n $NAMESPACE -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: myapp:latest
        env:
        - name: DATABASE_URL
          value: postgres://postgres:testpass@postgres:5432/mydb
        ports:
        - containerPort: 8080
EOF

# 4. Expose web via LoadBalancer
kubectl expose deployment web -n $NAMESPACE --type=LoadBalancer --port=80 --target-port=8080

# 5. Wait for resources
kubectl wait -n $NAMESPACE --for=condition=Ready pod -l app=postgres --timeout=120s
kubectl wait -n $NAMESPACE --for=condition=Ready pod -l app=web --timeout=120s

# 6. Verify LoadBalancer external IP
# (if/fi instead of "&& break" so an empty result does not trip set -e)
for i in {1..60}; do
  EXTERNAL_IP=$(kubectl get svc web -n $NAMESPACE -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
  if [ -n "$EXTERNAL_IP" ]; then break; fi
  sleep 2
done
[ -n "$EXTERNAL_IP" ] || { echo "No external IP assigned"; exit 1; }

# 7. Verify DNS resolution (-i without -t: scripts have no TTY)
kubectl run -n $NAMESPACE --rm -i --restart=Never test-dns --image=busybox -- nslookup postgres.${NAMESPACE}.svc.cluster.local

# 8. Verify HTTP access
curl -f http://$EXTERNAL_IP/health || { echo "Health check failed"; exit 1; }

# 9. Verify PVC mounted
kubectl exec -n $NAMESPACE deployment/postgres -- ls /var/lib/postgresql/data | grep pg_wal

# 10. Verify network isolation (optional, if NetworkPolicies enabled)
# ...

# Cleanup
kubectl delete namespace $NAMESPACE

echo "E2E test passed!"
```

## Implementation Phases

### Phase 1: Foundation (4-5 weeks)

**Week 1-2: k3s Setup and IAM Integration**
- [ ] Install and configure k3s with unneeded bundled components disabled
- [ ] Implement IAM authentication webhook server
- [ ] Configure kube-apiserver to use IAM webhook
- [ ] Create RBAC templates (org admin, project admin, viewer)
- [ ] Test: Authenticate with IAM token, verify RBAC enforcement

**Week 3: PrismNET CNI Plugin**
- [ ] Implement CNI binary (ADD, DEL, CHECK commands)
- [ ] Integrate with PrismNET gRPC API (AllocateIP, ReleaseIP)
- [ ] Configure OVN logical switches per namespace
- [ ] Test: Create pod, verify network interface and IP allocation

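The CNI binary is wired into k3s through a network config file under `/etc/cni/net.d/`. A sketch of what that config might look like, where the plugin `type` name and the `prismnet_endpoint` key are assumptions for illustration (the real keys are defined by the plugin itself):

```json
{
  "cniVersion": "1.0.0",
  "name": "prismnet",
  "plugins": [
    {
      "type": "prismnet-cni",
      "prismnet_endpoint": "unix:///run/prismnet/api.sock",
      "mtu": 1400
    }
  ]
}
```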
**Week 4: FiberLB Controller**
- [ ] Implement controller watch loop (Services, Endpoints)
- [ ] Integrate with FiberLB gRPC API (CreateLoadBalancer, UpdatePool)
- [ ] Implement external IP allocation from pool
- [ ] Test: Expose service as LoadBalancer, verify external IP and routing

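The core of the controller's reconciliation can be sketched as a pure function from watched objects to desired FiberLB pool state. The `Service`/`Endpoints` types below are minimal local stand-ins (in the real controller they come from client-go informers), and `PoolMember` is an assumed shape for the FiberLB `UpdatePool` request:

```go
package main

import "fmt"

// Minimal stand-ins for the K8s objects the controller watches.
type Service struct {
	Name, Namespace  string
	Port, TargetPort int32
}

type Endpoints struct {
	PodIPs []string
}

// PoolMember is an assumed entry shape for FiberLB's UpdatePool call.
type PoolMember struct {
	IP   string
	Port int32
}

// desiredPool computes the FiberLB backend pool for a LoadBalancer
// Service: one member per ready pod IP, on the Service's targetPort.
func desiredPool(svc Service, eps Endpoints) []PoolMember {
	members := make([]PoolMember, 0, len(eps.PodIPs))
	for _, ip := range eps.PodIPs {
		members = append(members, PoolMember{IP: ip, Port: svc.TargetPort})
	}
	return members
}

func main() {
	svc := Service{Name: "web", Namespace: "project-123", Port: 80, TargetPort: 8080}
	eps := Endpoints{PodIPs: []string{"10.42.0.5", "10.42.1.7"}}
	for _, m := range desiredPool(svc, eps) {
		fmt.Printf("%s/%s -> %s:%d\n", svc.Namespace, svc.Name, m.IP, m.Port)
	}
}
```

Keeping the mapping pure makes it trivially unit-testable; the watch loop then just diffs desired state against FiberLB's current pool and issues `UpdatePool`.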
**Week 5: Basic RBAC and Multi-Tenancy**
- [ ] Implement namespace-per-project provisioning
- [ ] Deploy default RBAC roles and bindings
- [ ] Test: Create multiple projects, verify isolation

**Deliverables:**
- Functional k3s cluster with IAM authentication
- Pod networking via PrismNET
- LoadBalancer services via FiberLB
- Multi-tenant namespaces with RBAC

### Phase 2: Storage & DNS (5-6 weeks)

**Week 6-7: LightningStor CSI Driver**
- [ ] Implement CSI Controller Service (CreateVolume, DeleteVolume, ControllerPublishVolume)
- [ ] Implement CSI Node Service (NodeStageVolume, NodePublishVolume)
- [ ] Integrate with LightningStor gRPC API
- [ ] Deploy CSI driver as pods (controller + node DaemonSet)
- [ ] Create StorageClasses for SSD and HDD
- [ ] Test: Create PVC, attach to pod, write/read data

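Two CSI spec requirements dominate the `CreateVolume` design: requests must be idempotent (keyed by name), and requested capacity is rounded up to the backend's allocation unit. A minimal sketch of that logic, with a local `Volume` type standing in for a LightningStor block volume (the actual LightningStor gRPC call is elided as a comment):

```go
package main

import (
	"fmt"
	"sync"
)

const GiB = 1 << 30

// Volume is a stand-in for a LightningStor block volume.
type Volume struct {
	ID    string
	Name  string
	SizeB int64
	Class string // "ssd" or "hdd", from StorageClass parameters
}

// controller keeps created volumes keyed by name so CreateVolume is
// idempotent, as the CSI spec requires.
type controller struct {
	mu   sync.Mutex
	vols map[string]*Volume
}

// CreateVolume rounds the requested bytes up to a whole GiB and
// returns the existing volume when the name was already provisioned.
func (c *controller) CreateVolume(name string, requestedB int64, class string) (*Volume, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if v, ok := c.vols[name]; ok {
		if v.SizeB < requestedB {
			return nil, fmt.Errorf("volume %q exists with smaller size", name)
		}
		return v, nil // idempotent success
	}
	size := (requestedB + GiB - 1) / GiB * GiB
	v := &Volume{ID: "vol-" + name, Name: name, SizeB: size, Class: class}
	// Hypothetical: call LightningStor's gRPC CreateVolume here.
	c.vols[name] = v
	return v, nil
}

func main() {
	c := &controller{vols: map[string]*Volume{}}
	v, _ := c.CreateVolume("pvc-abc123", 5*GiB+1, "ssd")
	fmt.Println(v.ID, v.SizeB/GiB, "GiB")
}
```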
**Week 8: FlashDNS Controller**
- [ ] Implement controller watch loop (Services, Pods)
- [ ] Integrate with FlashDNS gRPC API (CreateRecord, UpdateRecord)
- [ ] Generate DNS records (A, SRV) for services and pods
- [ ] Configure kubelet DNS settings
- [ ] Test: Resolve service DNS from pod, verify DNS updates

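The record names the controller must register follow the Kubernetes DNS specification. A small sketch of the name-building helpers (the FlashDNS `CreateRecord` call itself is out of scope here):

```go
package main

import (
	"fmt"
	"strings"
)

// serviceFQDN builds the cluster-internal name registered as an
// A record for a Service.
func serviceFQDN(svc, namespace, clusterDomain string) string {
	return fmt.Sprintf("%s.%s.svc.%s", svc, namespace, clusterDomain)
}

// podFQDN builds the per-pod name using the dashed-IP convention
// from the Kubernetes DNS spec (e.g. 10-42-0-5.ns.pod.cluster.local).
func podFQDN(podIP, namespace, clusterDomain string) string {
	return fmt.Sprintf("%s.%s.pod.%s", strings.ReplaceAll(podIP, ".", "-"), namespace, clusterDomain)
}

func main() {
	fmt.Println(serviceFQDN("postgres", "project-123", "cluster.local"))
	fmt.Println(podFQDN("10.42.0.5", "project-123", "cluster.local"))
}
```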
**Week 9: Network Policy Support**
- [ ] Extend PrismNET CNI with NetworkPolicy controller
- [ ] Translate K8s NetworkPolicy to OVN ACLs
- [ ] Implement address sets for pod label selectors
- [ ] Test: Create NetworkPolicy, verify ingress/egress enforcement

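As a concrete target for this translation work, a policy like the one below (names reused from the E2E scenario, for illustration only) would map to one OVN address set (pods labeled `app=web`) and one ACL permitting TCP/5432 into pods labeled `app=postgres`:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-web-to-postgres
  namespace: project-123
spec:
  podSelector:
    matchLabels:
      app: postgres
  policyTypes: [Ingress]
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: web
    ports:
    - protocol: TCP
      port: 5432
```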
**Week 10-11: Integration Testing**
- [ ] Write integration test suite (pod, service, PVC, DNS)
- [ ] Test multi-tier application deployment
- [ ] Performance testing (pod creation time, network throughput)
- [ ] Fix bugs and optimize

**Deliverables:**
- Persistent storage via LightningStor CSI
- Service discovery via FlashDNS
- Network policies enforced by PrismNET
- Comprehensive integration tests

### Phase 3: Advanced Features (Post-MVP, 6-8 weeks)

**StatefulSets:**
- [ ] Verify StatefulSet controller functionality (built-in to k3s)
- [ ] Test with headless services and volumeClaimTemplates
- [ ] Example: Deploy Cassandra or Kafka cluster

**PlasmaVMC CRI Integration:**
- [ ] Implement CRI server in PlasmaVMC (Rust)
- [ ] Create Firecracker microVM per pod
- [ ] Test pod lifecycle (create, start, stop, delete)
- [ ] Performance benchmarking (startup time, resource overhead)

**FlareDB as k3s Datastore:**
- [ ] Investigate etcd API compatibility layer for FlareDB
- [ ] Implement etcd v3 gRPC API shim
- [ ] Test k3s with FlareDB backend
- [ ] Benchmarking and stability testing

**Autoscaling:**
- [ ] Deploy metrics-server
- [ ] Implement HorizontalPodAutoscaler
- [ ] Test autoscaling based on CPU/memory metrics

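The autoscaling milestone can be exercised with a manifest along these lines, targeting the `web` Deployment from the E2E scenario (the replica bounds and CPU threshold are illustrative, not requirements); `autoscaling/v2` depends on metrics-server, deployed in the same milestone:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
  namespace: project-123
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```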
**Ingress (L7 LoadBalancer):**
- [ ] Implement Ingress controller using FiberLB L7 capabilities
- [ ] Support host-based and path-based routing
- [ ] TLS termination

## Success Criteria

**Functional Requirements:**
1. ✅ Deploy pods, services, deployments using kubectl
2. ✅ LoadBalancer services receive external IPs from FiberLB
3. ✅ PersistentVolumes provisioned from LightningStor and mounted to pods
4. ✅ DNS resolution works for services and pods (via FlashDNS)
5. ✅ Multi-tenant isolation enforced (namespaces, RBAC, network policies)
6. ✅ IAM authentication and RBAC functional (token validation, user/group mapping)
7. ✅ E2E test passes (multi-tier application deployment)

**Performance Requirements:**
1. Pod creation time: <10 seconds (from API call to running state)
2. Service LoadBalancer IP allocation: <5 seconds
3. PersistentVolume provisioning: <30 seconds
4. DNS record updates: <10 seconds (after service creation)
5. Support 100+ pods per cluster
6. Support 10+ concurrent namespaces

**Operational Requirements:**
1. NixOS module for declarative deployment
2. Cluster upgrade path (k3s version upgrades)
3. Backup and restore procedures (etcd snapshots)
4. Monitoring and alerting integration (Prometheus, Grafana)
5. Logging aggregation (FluentBit → centralized log storage)

## Next Steps (S3-S6)

### S3: Workspace Scaffold
- Create `k8shost/` workspace directory structure
- Set up Go module for controllers (FiberLB, FlashDNS)
- Set up Rust workspace for CNI plugin
- Set up Go module for CSI driver
- Create NixOS module skeleton

**Directory Structure:**
```
k8shost/
├── controllers/          # Go: FiberLB, FlashDNS, IAM webhook
│   ├── fiberlb/
│   ├── flashdns/
│   ├── iamwebhook/
│   └── main.go
├── cni/                  # Rust: PrismNET CNI plugin
│   ├── src/
│   └── Cargo.toml
├── csi/                  # Go: LightningStor CSI driver
│   ├── controller/
│   ├── node/
│   └── main.go
├── nix/
│   └── modules/
│       └── k8shost.nix
└── tests/
    ├── integration/
    └── e2e/
```

### S4: Controllers Implementation
- Implement FiberLB controller (Service watch, gRPC integration)
- Implement FlashDNS controller (Service/Pod watch, DNS record sync)
- Implement IAM webhook server (TokenReview API, IAM validation)
- Unit tests for each controller

### S5: CNI + CSI Implementation
- Implement PrismNET CNI plugin (ADD/DEL/CHECK, OVN integration)
- Implement LightningStor CSI driver (Controller and Node services)
- Deploy CSI driver as pods (Deployment + DaemonSet)
- Unit tests for CNI and CSI

### S6: Integration Testing
- Set up integration test environment (k3s cluster + mock services)
- Write integration tests (pod, service, PVC, DNS, multi-tenant)
- Write E2E test (multi-tier application)
- CI/CD pipeline for automated testing

## References

- **k3s Architecture**: https://docs.k3s.io/architecture
- **k3s Installation**: https://docs.k3s.io/installation
- **k3s HA Setup**: https://docs.k3s.io/datastore/ha-embedded
- **CNI Specification**: https://github.com/containernetworking/cni/blob/main/SPEC.md
- **CSI Specification**: https://github.com/container-storage-interface/spec/blob/master/spec.md
- **K8s Authentication Webhooks**: https://kubernetes.io/docs/reference/access-authn-authz/authentication/#webhook-token-authentication
- **K8s Authorization (RBAC)**: https://kubernetes.io/docs/reference/access-authn-authz/rbac/
- **K8s Network Policies**: https://kubernetes.io/docs/concepts/services-networking/network-policies/
- **OVN Architecture**: https://www.ovn.org/support/dist-docs/ovn-architecture.7.html
- **Kubernetes API Reference**: https://kubernetes.io/docs/reference/kubernetes-api/

---

**Document Version:** 1.0
**Last Updated:** 2025-12-09
**Authors:** PlasmaCloud Platform Team
**Status:** Draft for Review