Aegis (IAM) Specification
Version: 1.0 | Status: Draft | Last Updated: 2025-12-08
1. Overview
1.1 Purpose
Aegis is the Identity and Access Management (IAM) platform providing authentication, authorization, and multi-tenant access control for all cloud services. It implements RBAC (Role-Based Access Control) with ABAC (Attribute-Based Access Control) extensions.
The name "Aegis" (shield of Zeus) reflects its role as the protective layer that guards access to all platform resources.
1.2 Scope
- In scope: Principals (users, service accounts, groups), roles, permissions, policy bindings, scope hierarchy (System > Org > Project > Resource), internal token issuance/validation, external identity federation (OIDC/JWT), authorization decision service (PDP), audit event generation
- Out of scope: User password management (delegated to external IdP), UI for authentication, API gateway/rate limiting
1.3 Design Goals
- AWS IAM / GCP IAM compatible: Familiar concepts and API patterns
- Multi-tenant from day one: Full org/project hierarchy with scope isolation
- Flexible RBAC + ABAC hybrid: Roles with conditional permissions
- High-performance authorization: Sub-millisecond decisions with caching
- Zero-trust security: Default deny, explicit grants, audit everything
- Cloud-grade scalability: Handle millions of decisions per second
2. Architecture
2.1 Crate Structure
iam/
├── crates/
│ ├── iam-api/ # gRPC service implementations
│ ├── iam-audit/ # Audit logging (planned)
│ ├── iam-authn/ # Authentication (tokens, OIDC)
│ ├── iam-authz/ # Authorization engine (PDP)
│ ├── iam-client/ # Rust client library
│ ├── iam-server/ # Server binary
│ ├── iam-store/ # Storage backends (Chainfire, FlareDB, Memory)
│ └── iam-types/ # Core types
└── proto/
└── iam.proto # gRPC definitions
2.2 Authorization Flow
[Client Request] → [IamAuthz Service]
↓
[Fetch Principal]
↓
[Build Resource Context]
↓
[PolicyEvaluator]
↓
┌───────────────┼───────────────┐
↓ ↓ ↓
[Get Bindings] [Get Roles] [Cache Lookup]
↓ ↓ ↓
└───────────────┼───────────────┘
↓
[Evaluate Permissions]
↓
[Condition Check]
↓
[ALLOW / DENY]
2.3 Dependencies
| Crate | Version | Purpose |
|---|---|---|
| tokio | 1.x | Async runtime |
| tonic | 0.12 | gRPC framework |
| prost | 0.13 | Protocol buffers |
| dashmap | 6.x | Concurrent cache |
| ipnetwork | 0.20 | CIDR matching |
| glob-match | 0.2 | Resource pattern matching |
3. Core Concepts
3.1 Principals
Identities that can be authenticated and authorized.
pub struct Principal {
pub id: String, // Unique identifier
pub kind: PrincipalKind, // User | ServiceAccount | Group
pub name: String, // Display name
pub org_id: Option<String>, // Organization membership
pub project_id: Option<String>, // For service accounts
pub email: Option<String>, // For users
pub oidc_sub: Option<String>, // Federated identity subject
pub node_id: Option<String>, // For node-bound service accounts
pub metadata: HashMap<String, String>,
pub created_at: u64,
pub updated_at: u64,
pub enabled: bool,
}
pub enum PrincipalKind {
User, // Human users
ServiceAccount, // Machine identities
Group, // Collections (future)
}
Principal Reference: kind:id format
user:alice
service_account:compute-agent
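The `kind:id` reference format can be parsed and printed with a small round-trip type. This is a minimal sketch, not the `iam-types` implementation: the `PrincipalRef` fields, the `group` prefix, and the error strings are assumptions; only the `user` and `service_account` prefixes appear in the examples above.

```rust
use std::fmt;
use std::str::FromStr;

#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub enum PrincipalKind {
    User,
    ServiceAccount,
    Group,
}

/// Hypothetical shape of a principal reference: kind plus id.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct PrincipalRef {
    pub kind: PrincipalKind,
    pub id: String,
}

impl FromStr for PrincipalRef {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        // Split on the first ':' only, so an id may itself contain ':'.
        let (kind, id) = s
            .split_once(':')
            .ok_or_else(|| format!("missing ':' in principal reference: {s}"))?;
        let kind = match kind {
            "user" => PrincipalKind::User,
            "service_account" => PrincipalKind::ServiceAccount,
            "group" => PrincipalKind::Group, // assumed prefix, not shown in the spec
            other => return Err(format!("unknown principal kind: {other}")),
        };
        if id.is_empty() {
            return Err("empty principal id".to_string());
        }
        Ok(PrincipalRef { kind, id: id.to_string() })
    }
}

impl fmt::Display for PrincipalRef {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let kind = match self.kind {
            PrincipalKind::User => "user",
            PrincipalKind::ServiceAccount => "service_account",
            PrincipalKind::Group => "group",
        };
        write!(f, "{}:{}", kind, self.id)
    }
}
```

With this sketch, `"user:alice".parse::<PrincipalRef>()` yields a `User` reference, and `to_string()` reproduces the original text.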
3.2 Roles
Named collections of permissions.
pub struct Role {
pub name: String, // e.g., "ProjectAdmin"
pub display_name: String,
pub description: String,
pub scope: Scope, // Where role can be assigned
pub permissions: Vec<Permission>,
pub builtin: bool, // System-defined, immutable
pub created_at: u64,
pub updated_at: u64,
}
Builtin Roles:
| Role | Scope | Description |
|---|---|---|
| SystemAdmin | System | Full cluster access |
| OrgAdmin | Org | Full organization access |
| ProjectAdmin | Project | Full project access |
| ProjectMember | Project | Own resources + read all |
| ReadOnly | Project | Read-only project access |
| ServiceRole-ComputeAgent | Resource | Node-scoped compute |
| ServiceRole-StorageAgent | Resource | Node-scoped storage |
3.3 Permissions
Individual access rights within roles.
pub struct Permission {
pub action: String, // e.g., "compute:instances:create"
pub resource_pattern: String, // e.g., "org/*/project/${project}/instances/*"
pub condition: Option<Condition>,
}
Action Format: service:resource:operation
- Wildcards: *, compute:*, compute:instances:*
- Examples: compute:instances:create, storage:volumes:delete
Resource Pattern Format: org/{org_id}/project/{project_id}/{kind}/{id}
- Wildcards: org/*/project/*/instances/*
- Variables: ${principal.id}, ${project}
3.4 Policy Bindings
Assignments of roles to principals within a scope.
pub struct PolicyBinding {
pub id: String, // UUID
pub principal_ref: PrincipalRef,
pub role_ref: String, // "roles/ProjectAdmin"
pub scope: Scope,
pub condition: Option<Condition>,
pub created_at: u64,
pub updated_at: u64,
pub created_by: String,
pub expires_at: Option<u64>, // Time-limited access
pub enabled: bool,
}
4. Scope Hierarchy
Four-level hierarchical boundary for permissions.
System (level 0) ← Cluster-wide
└─ Organization (level 1) ← Tenant boundary
└─ Project (level 2) ← Workload isolation
└─ Resource (level 3) ← Individual resource
4.1 Scope Types
pub enum Scope {
System,
Org { id: String },
Project { id: String, org_id: String },
Resource { id: String, project_id: String, org_id: String },
}
4.2 Scope Containment
impl Scope {
// System contains everything
// Org contains its projects and resources
// Project contains its resources
fn contains(&self, other: &Scope) -> bool;
// Get parent scope
fn parent(&self) -> Option<Scope>;
// Get all ancestors up to System
fn ancestors(&self) -> Vec<Scope>;
}
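One possible way to fill in these three methods is to derive `ancestors` from `parent` and `contains` from `ancestors`. The sketch below repeats the `Scope` enum from 4.1 so it compiles on its own. Treating a scope as containing itself is an assumption; the spec does not say whether containment is reflexive.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Scope {
    System,
    Org { id: String },
    Project { id: String, org_id: String },
    Resource { id: String, project_id: String, org_id: String },
}

impl Scope {
    /// Parent scope, one level up; `System` has none.
    pub fn parent(&self) -> Option<Scope> {
        match self {
            Scope::System => None,
            Scope::Org { .. } => Some(Scope::System),
            Scope::Project { org_id, .. } => Some(Scope::Org { id: org_id.clone() }),
            Scope::Resource { project_id, org_id, .. } => Some(Scope::Project {
                id: project_id.clone(),
                org_id: org_id.clone(),
            }),
        }
    }

    /// Ancestors ordered from the nearest parent up to `System`.
    pub fn ancestors(&self) -> Vec<Scope> {
        let mut out = Vec::new();
        let mut current = self.parent();
        while let Some(scope) = current {
            current = scope.parent();
            out.push(scope);
        }
        out
    }

    /// True if `other` is this scope or lies below it (reflexivity is assumed).
    pub fn contains(&self, other: &Scope) -> bool {
        self == other || other.ancestors().contains(self)
    }
}
```

Because containment is computed from the ids carried in each variant, a project in one org is never contained by a different org, which is the isolation property the hierarchy is meant to enforce.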
4.3 Scope Storage Keys
system
org/{org_id}
org/{org_id}/project/{project_id}
org/{org_id}/project/{project_id}/resource/{resource_id}
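The key layout above maps directly onto the `Scope` variants. The following sketch shows one way to render it; the method name `storage_key` is hypothetical, and the enum is repeated from 4.1 so the block stands alone.

```rust
pub enum Scope {
    System,
    Org { id: String },
    Project { id: String, org_id: String },
    Resource { id: String, project_id: String, org_id: String },
}

impl Scope {
    /// Render the scope as a storage key (hypothetical method name).
    pub fn storage_key(&self) -> String {
        match self {
            Scope::System => "system".to_string(),
            Scope::Org { id } => format!("org/{id}"),
            Scope::Project { id, org_id } => format!("org/{org_id}/project/{id}"),
            Scope::Resource { id, project_id, org_id } => {
                format!("org/{org_id}/project/{project_id}/resource/{id}")
            }
        }
    }
}
```

A useful side effect of this layout is that, below `System`, a parent's key is a string prefix of every descendant's key, so a prefix scan over `org/{org_id}` reaches everything in that organization.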
5. API
5.1 Authorization Service (PDP)
service IamAuthz {
rpc Authorize(AuthorizeRequest) returns (AuthorizeResponse);
rpc BatchAuthorize(BatchAuthorizeRequest) returns (BatchAuthorizeResponse);
}
message AuthorizeRequest {
PrincipalRef principal = 1;
string action = 2; // "compute:instances:create"
ResourceRef resource = 3;
AuthzContext context = 4; // IP, timestamp, metadata
}
message AuthorizeResponse {
bool allowed = 1;
string reason = 2;
string matched_binding = 3;
string matched_role = 4;
}
message ResourceRef {
string kind = 1; // "instance"
string id = 2; // "vm-123"
string org_id = 3; // Required
string project_id = 4; // Required
optional string owner_id = 5;
optional string node_id = 6;
optional string region = 7;
map<string, string> tags = 8;
}
5.2 Admin Service (Management)
service IamAdmin {
// Principals
rpc CreatePrincipal(CreatePrincipalRequest) returns (Principal);
rpc GetPrincipal(GetPrincipalRequest) returns (Principal);
rpc UpdatePrincipal(UpdatePrincipalRequest) returns (Principal);
rpc DeletePrincipal(DeletePrincipalRequest) returns (Empty);
rpc ListPrincipals(ListPrincipalsRequest) returns (ListPrincipalsResponse);
// Roles
rpc CreateRole(CreateRoleRequest) returns (Role);
rpc GetRole(GetRoleRequest) returns (Role);
rpc UpdateRole(UpdateRoleRequest) returns (Role);
rpc DeleteRole(DeleteRoleRequest) returns (Empty);
rpc ListRoles(ListRolesRequest) returns (ListRolesResponse);
// Bindings
rpc CreateBinding(CreateBindingRequest) returns (PolicyBinding);
rpc GetBinding(GetBindingRequest) returns (PolicyBinding);
rpc UpdateBinding(UpdateBindingRequest) returns (PolicyBinding);
rpc DeleteBinding(DeleteBindingRequest) returns (Empty);
rpc ListBindings(ListBindingsRequest) returns (ListBindingsResponse);
}
5.3 Token Service
service IamToken {
rpc IssueToken(IssueTokenRequest) returns (IssueTokenResponse);
rpc ValidateToken(ValidateTokenRequest) returns (ValidateTokenResponse);
rpc RevokeToken(RevokeTokenRequest) returns (Empty);
rpc RefreshToken(RefreshTokenRequest) returns (RefreshTokenResponse);
}
message InternalTokenClaims {
string principal_id = 1;
PrincipalKind principal_kind = 2;
string principal_name = 3;
repeated string roles = 4; // Pre-loaded roles
Scope scope = 5;
optional string org_id = 6;
optional string project_id = 7;
optional string node_id = 8;
uint64 iat = 9; // Issued at (TSO)
uint64 exp = 10; // Expires at (TSO)
string session_id = 11;
AuthMethod auth_method = 12; // Jwt | Mtls | ApiKey
}
6. Authorization Logic
6.1 Evaluation Algorithm
evaluate(request):
1. Default DENY
2. resource_scope = Scope::from(request.resource)
3. bindings = get_effective_bindings(principal, resource_scope)
4. For each binding where binding.is_active(now):
a. role = get_role(binding.role_ref)
b. If binding.condition exists and !evaluate_condition(binding.condition):
continue
c. If evaluate_role(role, request):
return ALLOW
5. Return DENY
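The default-deny loop above can be sketched in a few lines. This is a simplified illustration, not the `iam-authz` code: `Binding` keeps only the fields needed here, binding conditions are omitted, and role lookup plus permission evaluation are collapsed into a caller-supplied `role_allows` closure (a hypothetical stand-in for `get_role` and `evaluate_role`). Treating `expires_at` as an exclusive bound is also an assumption.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum Decision {
    Allow,
    Deny,
}

/// Trimmed-down binding: only the fields that decide whether it is active.
pub struct Binding {
    pub role_ref: String,
    pub enabled: bool,
    pub expires_at: Option<u64>,
}

impl Binding {
    /// Active means enabled and not yet expired (expiry treated as exclusive).
    pub fn is_active(&self, now: u64) -> bool {
        self.enabled && self.expires_at.map_or(true, |exp| now < exp)
    }
}

/// Returns Allow on the first active binding whose role grants the request,
/// and Deny when no binding matches.
pub fn evaluate<F>(bindings: &[Binding], now: u64, role_allows: F) -> Decision
where
    F: Fn(&str) -> bool,
{
    for binding in bindings.iter().filter(|b| b.is_active(now)) {
        if role_allows(&binding.role_ref) {
            return Decision::Allow;
        }
    }
    Decision::Deny
}
```

Since there is no explicit deny rule in the algorithm, evaluation order does not change the result: any single matching binding is sufficient, and an empty binding list always denies.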
6.2 Role Permission Evaluation
evaluate_role(role, request):
For each permission in role.permissions:
1. If !matches_action(permission.action, request.action):
continue
2. resource_path = request.resource.to_path()
pattern = substitute_variables(permission.resource_pattern)
If !matches_resource(pattern, resource_path):
continue
3. If permission.condition exists and !evaluate_condition(permission.condition):
continue
4. return true // Permission matches
return false
6.3 Action Matching
matches_action("compute:*", "compute:instances:create") // true
matches_action("compute:instances:*", "compute:volumes:create") // false
matches_action("*", "anything:here:works") // true
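A segment-wise matcher reproduces the three results above, and the same rule appears to fit resource paths when splitting on `/` instead of `:`. The sketch below uses only the standard library as a stand-in for the `glob-match` crate listed in 2.3. The exact wildcard semantics are an assumption: a `*` in the middle matches exactly one segment, and a trailing `*` matches one or more remaining segments.

```rust
/// Compare `pattern` and `value` segment by segment, split on `sep`.
fn matches_segments(pattern: &str, value: &str, sep: char) -> bool {
    let pat: Vec<&str> = pattern.split(sep).collect();
    let val: Vec<&str> = value.split(sep).collect();
    for (i, p) in pat.iter().enumerate() {
        // Trailing wildcard: accept the rest, as long as something is left.
        if *p == "*" && i + 1 == pat.len() {
            return val.len() > i;
        }
        match val.get(i) {
            // Inner wildcard or literal match: move to the next segment.
            Some(v) if *p == "*" || p == v => {}
            _ => return false,
        }
    }
    // No trailing wildcard: the segment counts must agree exactly.
    pat.len() == val.len()
}

/// Actions use ':' as the separator, e.g. "compute:instances:create".
pub fn matches_action(pattern: &str, action: &str) -> bool {
    matches_segments(pattern, action, ':')
}

/// Resource paths use '/' as the separator, e.g. "org/o/project/p/instance/vm-1".
pub fn matches_resource(pattern: &str, path: &str) -> bool {
    matches_segments(pattern, path, '/')
}
```

Under these assumptions, `compute:*` grants every operation on every compute resource type, which is why broad patterns are best reserved for admin roles.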
6.4 Resource Matching
// Path format: org/{org}/project/{proj}/{kind}/{id}
matches_resource("org/*/project/*/instance/*",
"org/org-1/project/proj-1/instance/vm-1") // true
matches_resource("org/org-1/project/proj-1/*",
"org/org-1/project/proj-1/instance/vm-1") // true (trailing /*)
7. Conditions (ABAC)
7.1 Condition Types
pub enum Condition {
// String
StringEquals { key: String, value: String },
StringNotEquals { key: String, value: String },
StringLike { key: String, pattern: String }, // Glob pattern
StringEqualsAny { key: String, values: Vec<String> },
// Numeric
NumericEquals { key: String, value: i64 },
NumericLessThan { key: String, value: i64 },
NumericGreaterThan { key: String, value: i64 },
// Network
IpAddress { key: String, cidr: String }, // CIDR matching
NotIpAddress { key: String, cidr: String },
// Temporal
TimeBetween { start: String, end: String }, // HH:MM or Unix timestamp
// Existence
Exists { key: String },
// Boolean
Bool { key: String, value: bool },
// Logical
And { conditions: Vec<Condition> },
Or { conditions: Vec<Condition> },
Not { condition: Box<Condition> },
}
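To show how such a condition tree might be evaluated, the sketch below implements a subset of the variants against a flat string context. It is an illustration under stated assumptions: the context is modeled as a `HashMap<String, String>` keyed by names such as `resource.owner`, a missing key makes a comparison false, and numeric values are parsed from strings. None of this is confirmed by the spec.

```rust
use std::collections::HashMap;

/// Subset of the condition variants, enough to show recursive evaluation.
pub enum Condition {
    StringEquals { key: String, value: String },
    StringEqualsAny { key: String, values: Vec<String> },
    NumericGreaterThan { key: String, value: i64 },
    Exists { key: String },
    And { conditions: Vec<Condition> },
    Or { conditions: Vec<Condition> },
    Not { condition: Box<Condition> },
}

/// Evaluate a condition against a flat key/value context.
/// Missing keys and unparsable numbers evaluate to false (assumed behavior).
pub fn evaluate_condition(cond: &Condition, ctx: &HashMap<String, String>) -> bool {
    match cond {
        Condition::StringEquals { key, value } => ctx.get(key) == Some(value),
        Condition::StringEqualsAny { key, values } => {
            ctx.get(key).map_or(false, |v| values.contains(v))
        }
        Condition::NumericGreaterThan { key, value } => ctx
            .get(key)
            .and_then(|v| v.parse::<i64>().ok())
            .map_or(false, |n| n > *value),
        Condition::Exists { key } => ctx.contains_key(key),
        Condition::And { conditions } => conditions.iter().all(|c| evaluate_condition(c, ctx)),
        Condition::Or { conditions } => conditions.iter().any(|c| evaluate_condition(c, ctx)),
        Condition::Not { condition } => !evaluate_condition(condition, ctx),
    }
}
```

Failing closed on missing keys keeps the evaluator aligned with the default-deny goal in 1.3, though `Not` can invert that, so negated conditions deserve extra care when written.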
7.2 Variable Context
// Available variables for condition evaluation
principal.id, principal.kind, principal.name
principal.org_id, principal.project_id, principal.node_id
principal.email, principal.metadata.{key}
resource.kind, resource.id
resource.org_id, resource.project_id
resource.owner, resource.node, resource.region
resource.tags.{key}
request.source_ip, request.time
request.method, request.path
request.metadata.{key}
7.3 Variable Substitution
// In permission patterns
"org/${principal.org_id}/project/${project}/*"
// In conditions
Condition::string_equals("resource.owner", "${principal.id}")
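Substitution of `${...}` placeholders can be done with a single left-to-right scan. The sketch below is one possible reading, not the actual `substitute_variables` from 6.2: the `HashMap` context and the choice to return `None` for an unknown variable or an unterminated placeholder are assumptions, picked so that a bad pattern fails closed instead of matching something unintended.

```rust
use std::collections::HashMap;

/// Replace every `${name}` in `template` with its value from `vars`.
/// Returns None if a variable is unknown or a placeholder is not closed.
pub fn substitute_variables(template: &str, vars: &HashMap<String, String>) -> Option<String> {
    let mut out = String::with_capacity(template.len());
    let mut rest = template;
    while let Some(start) = rest.find("${") {
        // Copy the literal text before the placeholder.
        out.push_str(&rest[..start]);
        let after = &rest[start + 2..];
        let end = after.find('}')?; // unterminated placeholder
        let name = &after[..end];
        out.push_str(vars.get(name)?); // unknown variable
        rest = &after[end + 1..];
    }
    out.push_str(rest);
    Some(out)
}
```

For a principal in org `acme` working in project `web-app`, the pattern `org/${principal.org_id}/project/${project}/*` would expand to `org/acme/project/web-app/*` before resource matching runs.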
7.4 Example: Owner-Only Access
Permission {
action: "compute:instances:*",
resource_pattern: "org/*/project/*/instance/*",
condition: Some(Condition::string_equals(
"resource.owner",
"${principal.id}"
)),
}
8. Storage
8.1 Backend Abstraction
pub trait StorageBackend: Send + Sync {
async fn get(&self, key: &str) -> Result<Option<(Vec<u8>, u64)>>;
async fn put(&self, key: &str, value: &[u8]) -> Result<u64>;
async fn cas(&self, key: &str, expected: u64, value: &[u8]) -> Result<CasResult>;
async fn delete(&self, key: &str) -> Result<bool>;
async fn scan_prefix(&self, prefix: &str, limit: usize) -> Result<Vec<KvPair>>;
}
Supported Backends:
- Chainfire: Production distributed KV
- FlareDB: Alternative distributed DB
- Memory: Testing
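The `cas` signature suggests optimistic concurrency keyed on a per-key version. The synchronous, standard-library-only sketch below illustrates that idea for a Memory-style backend. It is not the `iam-store` implementation: the `CasResult` variants, the convention that version 0 means "key absent", and the tuple returned by `scan_prefix` are all assumptions.

```rust
use std::collections::BTreeMap;

/// Hypothetical result type; the real `CasResult` is not shown in the spec.
#[derive(Debug, PartialEq, Eq)]
pub enum CasResult {
    Success { version: u64 },
    Conflict { current: u64 },
}

/// In-memory store mapping each key to (value, version).
#[derive(Default)]
pub struct MemoryStore {
    entries: BTreeMap<String, (Vec<u8>, u64)>,
}

impl MemoryStore {
    pub fn get(&self, key: &str) -> Option<(Vec<u8>, u64)> {
        self.entries.get(key).cloned()
    }

    /// Write only if the stored version equals `expected` (0 = key absent).
    pub fn cas(&mut self, key: &str, expected: u64, value: &[u8]) -> CasResult {
        let current = self.entries.get(key).map_or(0, |(_, v)| *v);
        if current != expected {
            return CasResult::Conflict { current };
        }
        let version = current + 1;
        self.entries.insert(key.to_string(), (value.to_vec(), version));
        CasResult::Success { version }
    }

    /// Return up to `limit` entries whose key starts with `prefix`, in key order.
    pub fn scan_prefix(&self, prefix: &str, limit: usize) -> Vec<(String, Vec<u8>)> {
        self.entries
            .range(prefix.to_string()..)
            .take_while(|(k, _)| k.starts_with(prefix))
            .take(limit)
            .map(|(k, (v, _))| (k.clone(), v.clone()))
            .collect()
    }
}
```

A writer that receives `Conflict` would typically re-read the key, reapply its change, and retry with the new version, which avoids lost updates without holding a lock across requests.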
8.2 Key Schema
Principals:
iam/principals/{kind}/{id} # Primary
iam/principals/by-org/{org_id}/{kind}/{id} # Org index
iam/principals/by-project/{project_id}/{id} # Project index
iam/principals/by-email/{email} # Email lookup
iam/principals/by-oidc/{iss_hash}/{sub} # OIDC lookup
Roles:
iam/roles/{name} # Primary
iam/roles/by-scope/{scope}/{name} # Scope index
iam/roles/builtin/{name} # Builtin marker
Bindings:
iam/bindings/scope/{scope}/principal/{principal}/{id} # Primary
iam/bindings/by-principal/{principal}/{id} # Principal index
iam/bindings/by-role/{role}/{id} # Role index
8.3 Caching
pub struct PolicyCache {
bindings: DashMap<PrincipalRef, Vec<PolicyBinding>>,
roles: DashMap<String, Role>,
config: CacheConfig,
}
impl PolicyCache {
fn get_bindings(&self, principal: &PrincipalRef) -> Option<Vec<PolicyBinding>>;
fn put_bindings(&self, principal: &PrincipalRef, bindings: Vec<PolicyBinding>);
fn invalidate_principal(&self, principal: &PrincipalRef);
fn invalidate_role(&self, name: &str);
}
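The read-through pattern behind this cache can be sketched without external crates. The version below swaps `DashMap` for `RwLock<HashMap>` from the standard library, keys bindings by a plain string such as `user:alice`, and stores role names instead of full `PolicyBinding` values; all three are simplifications. The `get_or_load` helper is hypothetical and not part of the interface listed above.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

/// Simplified cache: principal reference string -> list of role names.
#[derive(Default)]
pub struct PolicyCache {
    bindings: RwLock<HashMap<String, Vec<String>>>,
}

impl PolicyCache {
    pub fn get_bindings(&self, principal: &str) -> Option<Vec<String>> {
        self.bindings.read().unwrap().get(principal).cloned()
    }

    pub fn put_bindings(&self, principal: &str, bindings: Vec<String>) {
        self.bindings.write().unwrap().insert(principal.to_string(), bindings);
    }

    /// Drop the cached entry so the next lookup goes back to storage.
    pub fn invalidate_principal(&self, principal: &str) {
        self.bindings.write().unwrap().remove(principal);
    }

    /// Hypothetical helper: return the cached entry or load and cache it.
    pub fn get_or_load<F>(&self, principal: &str, load: F) -> Vec<String>
    where
        F: FnOnce() -> Vec<String>,
    {
        if let Some(hit) = self.get_bindings(principal) {
            return hit;
        }
        let loaded = load();
        self.put_bindings(principal, loaded.clone());
        loaded
    }
}
```

Any write through the Admin API that changes a principal's bindings would need to call `invalidate_principal`, otherwise a revoked binding could keep authorizing requests until the entry is evicted.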
9. Configuration
9.1 Config File Format (TOML)
[server]
addr = "0.0.0.0:50051"
[server.tls]
cert_file = "/etc/aegis/tls/server.crt"
key_file = "/etc/aegis/tls/server.key"
ca_file = "/etc/aegis/tls/ca.crt" # For client cert verification
require_client_cert = true # Enable mTLS
[store]
backend = "chainfire" # "memory" | "chainfire" | "flaredb"
chainfire_endpoints = ["http://localhost:2379"]
# flaredb_endpoint = "http://localhost:5000"
# flaredb_namespace = "iam"
[authn]
[authn.jwt]
jwks_url = "https://auth.example.com/.well-known/jwks.json"
issuer = "https://auth.example.com"
audience = "aegis"
jwks_cache_ttl_seconds = 3600
[authn.internal_token]
signing_key = "base64-encoded-256-bit-key"
issuer = "aegis"
default_ttl_seconds = 3600 # 1 hour
max_ttl_seconds = 604800 # 7 days
[logging]
level = "info" # "debug" | "info" | "warn" | "error"
format = "json" # "json" | "text"
9.2 Environment Variables
| Variable | Default | Description |
|---|---|---|
| IAM_CONFIG | - | Path to config file |
| IAM_ADDR | 0.0.0.0:50051 | Server listen address |
| IAM_LOG_LEVEL | info | Log level |
| IAM_SIGNING_KEY | - | Token signing key (overrides config) |
| IAM_STORE_BACKEND | memory | Storage backend type |
9.3 CLI Arguments
aegis-server [OPTIONS]
Options:
-c, --config <PATH> Config file path
-a, --addr <ADDR> Listen address (overrides config)
-l, --log-level <LEVEL> Log level
-h, --help Print help
-V, --version Print version
10. Multi-Tenancy
10.1 Organization Isolation
- All principals have org_id (except System scope)
- All resources require org_id and project_id
- Scope containment enforces org boundaries
10.2 Project Isolation
- Service accounts bound to projects
- Resources belong to projects
- Permissions scoped to project/${project}/*
10.3 Cross-Tenant Access Patterns
| Pattern | Scope | Use Case |
|---|---|---|
| System Admin | System | Platform operators |
| Org Admin | Org | Organization administrators |
| Project Admin | Project | Project owners |
| Node Agent | Resource | Node-bound service accounts |
10.4 Node-Bound Service Accounts
// Service account with node binding
Principal {
kind: ServiceAccount,
node_id: Some("node-001"),
...
}
// Permission with node condition
Permission {
action: "compute:*",
resource_pattern: "org/*/project/*/instance/*",
condition: Some(Condition::string_equals(
"resource.node",
"${principal.node_id}"
)),
}
11. Security
11.1 Authentication
External Identity (OIDC/JWT):
- Validate JWT signature using JWKS from configured IdP
- Verify issuer, audience, and expiration claims
- Map OIDC sub claim to internal principal
- JWKS cached with configurable TTL
Internal Tokens:
- HMAC-SHA256 signed tokens for service-to-service auth
- Contains: principal_id, kind, roles, scope, org_id, project_id, exp, iat, session_id
- Short-lived (default 1 hour, max 7 days)
- Revocable via session_id
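The default and maximum lifetimes translate into a small expiry calculation at issue time. The sketch below assumes that a requested TTL above the maximum is clamped (the spec does not say whether it is clamped or rejected) and that a token counts as expired once `now` reaches `exp`. The constants mirror `default_ttl_seconds` and `max_ttl_seconds` from 9.1; the function names are hypothetical.

```rust
/// Mirrors `default_ttl_seconds` in the config (1 hour).
pub const DEFAULT_TTL_SECONDS: u64 = 3_600;
/// Mirrors `max_ttl_seconds` in the config (7 days).
pub const MAX_TTL_SECONDS: u64 = 604_800;

/// Compute the `exp` claim from `iat` and an optional requested TTL.
/// Requests above the maximum are clamped (assumed policy).
pub fn token_expiry(issued_at: u64, requested_ttl: Option<u64>) -> u64 {
    let ttl = requested_ttl
        .unwrap_or(DEFAULT_TTL_SECONDS)
        .min(MAX_TTL_SECONDS);
    issued_at.saturating_add(ttl)
}

/// A token is treated as expired once `now` reaches `exp`.
pub fn is_expired(exp: u64, now: u64) -> bool {
    now >= exp
}
```

Validation would combine `is_expired` with a lookup of `session_id` against the revocation list, since a signature check alone cannot detect a revoked session.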
mTLS:
- Optional client certificate authentication
- Certificate CN mapped to service account ID
- Used for node-to-control-plane communication
11.2 Authorization Properties
- Default Deny: No binding = denied
- Explicit Allow: Must match binding + role + permission
- Scope Enforcement: Automatic via containment
- Temporal Bounds: expires_at for time-limited access
- Soft Disable: enabled flag for quick revocation
11.3 Immutable Builtins
- System roles cannot be modified/deleted
- Prevents privilege escalation via role modification
11.4 Audit Trail
- created_by on all entities
- Timestamps for creation/modification
- Audit event generation via iam-audit crate
12. Operations
12.1 Deployment
Single Node:
aegis-server --config /etc/aegis/aegis.toml
Cluster Mode:
- Multiple Aegis instances behind load balancer
- Shared storage backend (Chainfire or FlareDB)
- Stateless - any instance can handle any request
- Session affinity not required
High Availability:
- Deploy 3+ instances across availability zones
- Use Chainfire Raft cluster for storage
- Health checks on /health endpoint
12.2 Initialization
// Initialize builtin roles (idempotent)
role_store.init_builtin_roles().await?;
12.3 Client Library
use iam_client::IamClient;
let client = IamClient::connect("http://127.0.0.1:9090").await?;
// Check authorization
let allowed = client.authorize(
PrincipalRef::user("alice"),
"compute:instances:create",
ResourceRef::new("instance", "org-1", "proj-1", "vm-1"),
).await?;
// Create binding
client.create_binding(CreateBindingRequest {
principal: PrincipalRef::user("alice"),
role: "roles/ProjectAdmin".into(),
scope: Scope::project("proj-1", "org-1"),
..Default::default()
}).await?;
12.4 Monitoring
Metrics (Prometheus format):
| Metric | Type | Description |
|---|---|---|
| aegis_authz_requests_total | Counter | Total authorization requests |
| aegis_authz_decisions{result} | Counter | Decisions by allow/deny |
| aegis_authz_latency_seconds | Histogram | Authorization latency |
| aegis_token_issued_total | Counter | Tokens issued |
| aegis_token_validated_total | Counter | Token validations |
| aegis_cache_hits_total | Counter | Policy cache hits |
| aegis_cache_misses_total | Counter | Policy cache misses |
| aegis_bindings_total | Gauge | Total active bindings |
| aegis_principals_total | Gauge | Total principals |
Health Endpoints:
- GET /health - Liveness check
- GET /ready - Readiness check (storage connected)
12.5 Backup & Recovery
Backup:
- Export all principals, roles, and bindings via Admin API
- Or snapshot underlying storage (Chainfire/FlareDB)
- Recommended: Daily full backup + continuous WAL archiving
Recovery:
- Restore from storage snapshot
- Or reimport via Admin API
- Builtin roles auto-created on startup
13. Compatibility
13.1 API Versioning
- gRPC package: iam.v1
- Semantic versioning for breaking changes
- Backward compatible additions within major version
- Deprecation warnings before removal
13.2 Wire Protocol
- Protocol Buffers v3
- gRPC with HTTP/2 transport
- TLS 1.3 required in production
13.3 Storage Migration
- Schema version tracked in metadata key
- Automatic migration on startup
- Backward compatible within major version
Appendix
A. Error Codes
| Error | Meaning |
|---|---|
| PRINCIPAL_NOT_FOUND | Principal does not exist |
| ROLE_NOT_FOUND | Role does not exist |
| BINDING_NOT_FOUND | Binding does not exist |
| BUILTIN_IMMUTABLE | Cannot modify builtin role |
| SCOPE_VIOLATION | Operation violates scope boundary |
| CONDITION_FAILED | Condition evaluation failed |
B. Proto Scope Messages
message Scope {
oneof scope {
bool system = 1;
OrgScope org = 2;
ProjectScope project = 3;
ResourceScope resource = 4;
}
}
message OrgScope { string id = 1; }
message ProjectScope { string id = 1; string org_id = 2; }
message ResourceScope { string id = 1; string project_id = 2; string org_id = 3; }
C. Port Assignments
| Port | Protocol | Purpose |
|---|---|---|
| 9090 | gRPC | IAM API |
D. Performance Considerations
- Cache bindings and roles for hot path
- Batch authorization for bulk checks
- Prefix scans for hierarchical queries
- CAS for conflict-free updates
E. Glossary
- Principal: An identity that can be authenticated (user, service account, or group)
- Role: A named collection of permissions that can be assigned to principals
- Permission: A specific action allowed on a resource pattern with optional conditions
- Binding: Assignment of a role to a principal within a specific scope
- Scope: Hierarchical boundary for permission application (System > Org > Project > Resource)
- Condition: ABAC expression that must evaluate to true for access to be granted
- PDP: Policy Decision Point - the authorization evaluation engine
- RBAC: Role-Based Access Control - permissions assigned via roles
- ABAC: Attribute-Based Access Control - permissions based on attributes/conditions
F. Example Policies
Allow user to manage own instances:
{
"principal": "user:alice",
"role": "roles/ProjectMember",
"scope": { "type": "project", "id": "web-app", "org_id": "acme" }
}
Time-limited admin access:
{
"principal": "user:bob",
"role": "roles/ProjectAdmin",
"scope": { "type": "project", "id": "staging", "org_id": "acme" },
"expires_at": 1735689600,
"condition": {
"expression": {
"type": "time_between",
"start": "09:00",
"end": "18:00"
}
}
}
Node-bound service account:
{
"principal": "service_account:compute-agent-node-1",
"role": "roles/ServiceRole-ComputeAgent",
"scope": { "type": "system" },
"condition": {
"expression": {
"type": "string_equals",
"key": "resource.node",
"value": "${principal.node_id}"
}
}
}
IP-restricted access:
{
"principal": "user:admin",
"role": "roles/SystemAdmin",
"scope": { "type": "system" },
"condition": {
"expression": {
"type": "ip_address",
"key": "request.source_ip",
"cidr": "10.0.0.0/8"
}
}
}
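The `ip_address` condition in the last policy reduces to a masked comparison of the source address against the CIDR network. The sketch below handles IPv4 only and uses the standard library in place of the `ipnetwork` crate listed in 2.3; the function name and the `Option<bool>` return (with `None` for malformed input) are assumptions made for illustration.

```rust
use std::net::Ipv4Addr;

/// Check whether `ip` falls inside `cidr` (e.g. "10.0.0.0/8").
/// Returns None when either argument cannot be parsed.
pub fn ipv4_in_cidr(ip: &str, cidr: &str) -> Option<bool> {
    let (net, prefix) = cidr.split_once('/')?;
    let net: Ipv4Addr = net.parse().ok()?;
    let prefix: u32 = prefix.parse().ok()?;
    if prefix > 32 {
        return None;
    }
    let ip: Ipv4Addr = ip.parse().ok()?;
    // A /0 prefix matches everything; shifting a u32 by 32 bits would overflow.
    let mask = if prefix == 0 { 0 } else { u32::MAX << (32 - prefix) };
    Some((u32::from(ip) & mask) == (u32::from(net) & mask))
}
```

With the policy above, a request from `10.1.2.3` would satisfy the condition and one from `192.168.1.1` would not. A caller enforcing default deny should treat `None` the same as `Some(false)`.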