Aegis (IAM) Specification

Version: 1.0 | Status: Draft | Last Updated: 2025-12-08

1. Overview

1.1 Purpose

Aegis is the Identity and Access Management (IAM) platform providing authentication, authorization, and multi-tenant access control for all cloud services. It implements RBAC (Role-Based Access Control) with ABAC (Attribute-Based Access Control) extensions.

The name "Aegis" (shield of Zeus) reflects its role as the protective layer that guards access to all platform resources.

1.2 Scope

  • In scope: Principals (users, service accounts, groups), roles, permissions, policy bindings, scope hierarchy (System > Org > Project > Resource), internal token issuance/validation, external identity federation (OIDC/JWT), authorization decision service (PDP), audit event generation
  • Out of scope: User password management (delegated to external IdP), UI for authentication, API gateway/rate limiting

1.3 Design Goals

  • AWS IAM / GCP IAM compatible: Familiar concepts and API patterns
  • Multi-tenant from day one: Full org/project hierarchy with scope isolation
  • Flexible RBAC + ABAC hybrid: Roles with conditional permissions
  • High-performance authorization: Sub-millisecond decisions with caching
  • Zero-trust security: Default deny, explicit grants, audit everything
  • Cloud-grade scalability: Handle millions of decisions per second

2. Architecture

2.1 Crate Structure

iam/
├── crates/
│   ├── iam-api/      # gRPC service implementations
│   ├── iam-audit/    # Audit logging (planned)
│   ├── iam-authn/    # Authentication (tokens, OIDC)
│   ├── iam-authz/    # Authorization engine (PDP)
│   ├── iam-client/   # Rust client library
│   ├── iam-server/   # Server binary
│   ├── iam-store/    # Storage backends (Chainfire, FlareDB, Memory)
│   └── iam-types/    # Core types
└── proto/
    └── iam.proto     # gRPC definitions

2.2 Authorization Flow

[Client Request] → [IamAuthz Service]
                         ↓
                  [Fetch Principal]
                         ↓
                  [Build Resource Context]
                         ↓
                  [PolicyEvaluator]
                         ↓
         ┌───────────────┼───────────────┐
         ↓               ↓               ↓
  [Get Bindings]   [Get Roles]    [Cache Lookup]
         ↓               ↓               ↓
         └───────────────┼───────────────┘
                         ↓
              [Evaluate Permissions]
                         ↓
                [Condition Check]
                         ↓
                [ALLOW / DENY]

2.3 Dependencies

Crate       Version  Purpose
tokio       1.x      Async runtime
tonic       0.12     gRPC framework
prost       0.13     Protocol buffers
dashmap     6.x      Concurrent cache
ipnetwork   0.20     CIDR matching
glob-match  0.2      Resource pattern matching

3. Core Concepts

3.1 Principals

Identities that can be authenticated and authorized.

pub struct Principal {
    pub id: String,              // Unique identifier
    pub kind: PrincipalKind,     // User | ServiceAccount | Group
    pub name: String,            // Display name
    pub org_id: Option<String>,  // Organization membership
    pub project_id: Option<String>, // For service accounts
    pub email: Option<String>,   // For users
    pub oidc_sub: Option<String>, // Federated identity subject
    pub node_id: Option<String>, // For node-bound service accounts
    pub metadata: HashMap<String, String>,
    pub created_at: u64,
    pub updated_at: u64,
    pub enabled: bool,
}

pub enum PrincipalKind {
    User,           // Human users
    ServiceAccount, // Machine identities
    Group,          // Collections (future)
}

Principal Reference: kind:id format

  • user:alice
  • service_account:compute-agent
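
A minimal sketch of parsing the kind:id form, assuming the kind prefixes are the snake_case names shown above. The function name parse_principal_ref and the "group" prefix are illustrative assumptions, not part of the specification.

```rust
/// Kinds of principal, mirroring `PrincipalKind` above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PrincipalKind {
    User,
    ServiceAccount,
    Group,
}

/// Parse a `kind:id` reference such as `user:alice`.
/// Returns `None` for an unknown kind, a missing ':' or an empty id.
pub fn parse_principal_ref(s: &str) -> Option<(PrincipalKind, String)> {
    // Split on the first ':' only, so an id may itself contain ':'.
    let (kind, id) = s.split_once(':')?;
    let kind = match kind {
        "user" => PrincipalKind::User,
        "service_account" => PrincipalKind::ServiceAccount,
        "group" => PrincipalKind::Group, // assumed spelling; not shown in the spec
        _ => return None,
    };
    if id.is_empty() {
        return None;
    }
    Some((kind, id.to_string()))
}
```

Rejecting unknown kinds up front keeps malformed references out of storage keys, which embed the kind.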

3.2 Roles

Named collections of permissions.

pub struct Role {
    pub name: String,            // e.g., "ProjectAdmin"
    pub display_name: String,
    pub description: String,
    pub scope: Scope,            // Where role can be assigned
    pub permissions: Vec<Permission>,
    pub builtin: bool,           // System-defined, immutable
    pub created_at: u64,
    pub updated_at: u64,
}

Builtin Roles:

Role                      Scope     Description
SystemAdmin               System    Full cluster access
OrgAdmin                  Org       Full organization access
ProjectAdmin              Project   Full project access
ProjectMember             Project   Own resources + read all
ReadOnly                  Project   Read-only project access
ServiceRole-ComputeAgent  Resource  Node-scoped compute
ServiceRole-StorageAgent  Resource  Node-scoped storage

3.3 Permissions

Individual access rights within roles.

pub struct Permission {
    pub action: String,          // e.g., "compute:instances:create"
    pub resource_pattern: String, // e.g., "org/*/project/${project}/instance/*"
    pub condition: Option<Condition>,
}

Action Format: service:resource:operation

  • Wildcards: *, compute:*, compute:instances:*
  • Examples: compute:instances:create, storage:volumes:delete

Resource Pattern Format: org/{org_id}/project/{project_id}/{kind}/{id}

  • Wildcards: org/*/project/*/instance/* (the {kind} segment is the singular resource kind, e.g. instance)
  • Variables: ${principal.id}, ${project}

3.4 Policy Bindings

Assignments of roles to principals within a scope.

pub struct PolicyBinding {
    pub id: String,              // UUID
    pub principal_ref: PrincipalRef,
    pub role_ref: String,        // "roles/ProjectAdmin"
    pub scope: Scope,
    pub condition: Option<Condition>,
    pub created_at: u64,
    pub updated_at: u64,
    pub created_by: String,
    pub expires_at: Option<u64>, // Time-limited access
    pub enabled: bool,
}

4. Scope Hierarchy

Four-level hierarchical boundary for permissions.

System (level 0)          ← Cluster-wide
  └─ Organization (level 1)   ← Tenant boundary
      └─ Project (level 2)        ← Workload isolation
          └─ Resource (level 3)       ← Individual resource

4.1 Scope Types

pub enum Scope {
    System,
    Org { id: String },
    Project { id: String, org_id: String },
    Resource { id: String, project_id: String, org_id: String },
}

4.2 Scope Containment

impl Scope {
    // System contains everything
    // Org contains its projects and resources
    // Project contains its resources
    fn contains(&self, other: &Scope) -> bool;

    // Get parent scope
    fn parent(&self) -> Option<Scope>;

    // Get all ancestors up to System
    fn ancestors(&self) -> Vec<Scope>;
}
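
The containment rules in the comments above can be sketched as follows. This is one possible implementation, not the crate's actual code; it assumes that a scope contains itself and that ancestors() lists the nearest parent first.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Scope {
    System,
    Org { id: String },
    Project { id: String, org_id: String },
    Resource { id: String, project_id: String, org_id: String },
}

impl Scope {
    /// Parent scope, or `None` for `System`.
    pub fn parent(&self) -> Option<Scope> {
        match self {
            Scope::System => None,
            Scope::Org { .. } => Some(Scope::System),
            Scope::Project { org_id, .. } => Some(Scope::Org { id: org_id.clone() }),
            Scope::Resource { project_id, org_id, .. } => Some(Scope::Project {
                id: project_id.clone(),
                org_id: org_id.clone(),
            }),
        }
    }

    /// All ancestors, nearest first, ending with `System`.
    pub fn ancestors(&self) -> Vec<Scope> {
        let mut out = Vec::new();
        let mut current = self.parent();
        while let Some(scope) = current {
            current = scope.parent();
            out.push(scope);
        }
        out
    }

    /// A scope contains itself (assumption) and every scope below it.
    pub fn contains(&self, other: &Scope) -> bool {
        self == other || other.ancestors().contains(self)
    }
}
```

Deriving containment from the parent chain means a project in one org can never be contained by another org, which is the tenant boundary the hierarchy is meant to enforce.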

4.3 Scope Storage Keys

system
org/{org_id}
org/{org_id}/project/{project_id}
org/{org_id}/project/{project_id}/resource/{resource_id}
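
A short sketch of how a Scope value could be rendered into the keys listed above. The method name storage_key is an assumption for illustration.

```rust
pub enum Scope {
    System,
    Org { id: String },
    Project { id: String, org_id: String },
    Resource { id: String, project_id: String, org_id: String },
}

impl Scope {
    /// Render the scope as its storage key (hypothetical helper name).
    pub fn storage_key(&self) -> String {
        match self {
            Scope::System => "system".to_string(),
            Scope::Org { id } => format!("org/{}", id),
            Scope::Project { id, org_id } => format!("org/{}/project/{}", org_id, id),
            Scope::Resource { id, project_id, org_id } => {
                format!("org/{}/project/{}/resource/{}", org_id, project_id, id)
            }
        }
    }
}
```

Because each child key extends its parent's key, a prefix scan on an org key also returns that org's projects and resources.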

5. API

5.1 Authorization Service (PDP)

service IamAuthz {
  rpc Authorize(AuthorizeRequest) returns (AuthorizeResponse);
  rpc BatchAuthorize(BatchAuthorizeRequest) returns (BatchAuthorizeResponse);
}

message AuthorizeRequest {
  PrincipalRef principal = 1;
  string action = 2;           // "compute:instances:create"
  ResourceRef resource = 3;
  AuthzContext context = 4;    // IP, timestamp, metadata
}

message AuthorizeResponse {
  bool allowed = 1;
  string reason = 2;
  string matched_binding = 3;
  string matched_role = 4;
}

message ResourceRef {
  string kind = 1;             // "instance"
  string id = 2;               // "vm-123"
  string org_id = 3;           // Required
  string project_id = 4;       // Required
  optional string owner_id = 5;
  optional string node_id = 6;
  optional string region = 7;
  map<string, string> tags = 8;
}

5.2 Admin Service (Management)

service IamAdmin {
  // Principals
  rpc CreatePrincipal(CreatePrincipalRequest) returns (Principal);
  rpc GetPrincipal(GetPrincipalRequest) returns (Principal);
  rpc UpdatePrincipal(UpdatePrincipalRequest) returns (Principal);
  rpc DeletePrincipal(DeletePrincipalRequest) returns (Empty);
  rpc ListPrincipals(ListPrincipalsRequest) returns (ListPrincipalsResponse);

  // Roles
  rpc CreateRole(CreateRoleRequest) returns (Role);
  rpc GetRole(GetRoleRequest) returns (Role);
  rpc UpdateRole(UpdateRoleRequest) returns (Role);
  rpc DeleteRole(DeleteRoleRequest) returns (Empty);
  rpc ListRoles(ListRolesRequest) returns (ListRolesResponse);

  // Bindings
  rpc CreateBinding(CreateBindingRequest) returns (PolicyBinding);
  rpc GetBinding(GetBindingRequest) returns (PolicyBinding);
  rpc UpdateBinding(UpdateBindingRequest) returns (PolicyBinding);
  rpc DeleteBinding(DeleteBindingRequest) returns (Empty);
  rpc ListBindings(ListBindingsRequest) returns (ListBindingsResponse);
}

5.3 Token Service

service IamToken {
  rpc IssueToken(IssueTokenRequest) returns (IssueTokenResponse);
  rpc ValidateToken(ValidateTokenRequest) returns (ValidateTokenResponse);
  rpc RevokeToken(RevokeTokenRequest) returns (Empty);
  rpc RefreshToken(RefreshTokenRequest) returns (RefreshTokenResponse);
}

message InternalTokenClaims {
  string principal_id = 1;
  PrincipalKind principal_kind = 2;
  string principal_name = 3;
  repeated string roles = 4;   // Pre-loaded roles
  Scope scope = 5;
  optional string org_id = 6;
  optional string project_id = 7;
  optional string node_id = 8;
  uint64 iat = 9;              // Issued at (TSO)
  uint64 exp = 10;             // Expires at (TSO)
  string session_id = 11;
  AuthMethod auth_method = 12; // Jwt | Mtls | ApiKey
}

6. Authorization Logic

6.1 Evaluation Algorithm

evaluate(request):
  1. Default DENY
  2. resource_scope = Scope::from(request.resource)
  3. bindings = get_effective_bindings(principal, resource_scope)
  4. For each binding where binding.is_active(now):
     a. role = get_role(binding.role_ref)
     b. If binding.condition exists and !evaluate_condition(binding.condition):
        continue
     c. If evaluate_role(role, request):
        return ALLOW
  5. Return DENY
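
The default-deny loop can be sketched in Rust with trimmed-down types. Several things here are assumptions for illustration: is_active is taken to mean "enabled and not yet expired", roles are reduced to a list of exact action strings, and conditions and resource patterns are left out.

```rust
use std::collections::HashMap;

/// Trimmed-down binding: only the fields the loop needs.
pub struct Binding {
    pub role_ref: String,
    pub enabled: bool,
    pub expires_at: Option<u64>,
}

impl Binding {
    /// Assumed meaning of `is_active`: enabled and strictly before `expires_at`.
    pub fn is_active(&self, now: u64) -> bool {
        self.enabled && self.expires_at.map_or(true, |exp| now < exp)
    }
}

/// Trimmed-down role: actions only, matched by exact string equality here.
pub struct Role {
    pub actions: Vec<String>,
}

/// Default deny: only an active binding whose role grants the action yields ALLOW.
pub fn evaluate(
    bindings: &[Binding],
    roles: &HashMap<String, Role>,
    action: &str,
    now: u64,
) -> bool {
    for binding in bindings.iter().filter(|b| b.is_active(now)) {
        // A binding that points at a missing role grants nothing.
        if let Some(role) = roles.get(&binding.role_ref) {
            if role.actions.iter().any(|a| a == action) {
                return true; // ALLOW on first match
            }
        }
    }
    false // DENY
}
```

Since there is no explicit deny rule in this model, the order in which bindings are visited does not change the result, only how early the loop exits.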

6.2 Role Permission Evaluation

evaluate_role(role, request):
  For each permission in role.permissions:
    1. If !matches_action(permission.action, request.action):
       continue
    2. resource_path = request.resource.to_path()
       pattern = substitute_variables(permission.resource_pattern)
       If !matches_resource(pattern, resource_path):
          continue
    3. If permission.condition exists and !evaluate_condition(permission.condition):
       continue
    4. return true  // Permission matches
  return false

6.3 Action Matching

matches_action("compute:*", "compute:instances:create")     // true
matches_action("compute:instances:*", "compute:volumes:create") // false
matches_action("*", "anything:here:works")                  // true
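
One way to get the behaviour shown in these examples is a segment-wise comparison on ':'. The sketch below assumes a trailing * swallows all remaining segments while a * elsewhere matches exactly one; the spec does not state the latter.

```rust
/// Segment-wise action matching. A `*` segment matches exactly one segment,
/// except as the final pattern segment, where it matches everything that remains.
pub fn matches_action(pattern: &str, action: &str) -> bool {
    let pat: Vec<&str> = pattern.split(':').collect();
    let act: Vec<&str> = action.split(':').collect();
    for (i, p) in pat.iter().enumerate() {
        let is_last = i + 1 == pat.len();
        if *p == "*" && is_last {
            // Trailing wildcard: require at least one remaining action segment.
            return act.len() > i;
        }
        match act.get(i) {
            Some(a) if *p == "*" || p == a => continue,
            _ => return false,
        }
    }
    // No trailing wildcard: both sides must have the same number of segments.
    pat.len() == act.len()
}
```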

6.4 Resource Matching

// Path format: org/{org}/project/{proj}/{kind}/{id}
matches_resource("org/*/project/*/instance/*",
                 "org/org-1/project/proj-1/instance/vm-1")  // true
matches_resource("org/org-1/project/proj-1/*",
                 "org/org-1/project/proj-1/instance/vm-1")  // true (trailing /*)
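
The two examples can be reproduced with a small hand-rolled matcher over '/' segments. This is a sketch, not the glob-match crate listed in the dependencies, and its wildcard rules (a * matches one segment, a trailing * matches the rest of the path) are assumptions chosen to fit the examples.

```rust
/// Match a resource path against a pattern, segment by segment.
pub fn matches_resource(pattern: &str, path: &str) -> bool {
    let mut pat = pattern.split('/').peekable();
    let mut segs = path.split('/');
    while let Some(p) = pat.next() {
        let seg = match segs.next() {
            Some(s) => s,
            None => return false, // path is shorter than the pattern
        };
        if p == "*" {
            if pat.peek().is_none() {
                return true; // trailing "*" swallows the rest of the path
            }
        } else if p != seg {
            return false;
        }
    }
    // Pattern exhausted without a trailing wildcard: the path must be exhausted too.
    segs.next().is_none()
}
```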

7. Conditions (ABAC)

7.1 Condition Types

pub enum Condition {
    // String
    StringEquals { key: String, value: String },
    StringNotEquals { key: String, value: String },
    StringLike { key: String, pattern: String },     // Glob pattern
    StringEqualsAny { key: String, values: Vec<String> },

    // Numeric
    NumericEquals { key: String, value: i64 },
    NumericLessThan { key: String, value: i64 },
    NumericGreaterThan { key: String, value: i64 },

    // Network
    IpAddress { key: String, cidr: String },         // CIDR matching
    NotIpAddress { key: String, cidr: String },

    // Temporal
    TimeBetween { start: String, end: String },      // HH:MM or Unix timestamp

    // Existence
    Exists { key: String },

    // Boolean
    Bool { key: String, value: bool },

    // Logical
    And { conditions: Vec<Condition> },
    Or { conditions: Vec<Condition> },
    Not { condition: Box<Condition> },
}
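
A sketch of how such conditions might be evaluated recursively, using a subset of the variants and a flat string map as the context. Treating a missing key or an unparseable number as false is an assumption; the spec does not define that case.

```rust
use std::collections::HashMap;

/// Subset of the `Condition` enum above, enough to show the recursion.
pub enum Condition {
    StringEquals { key: String, value: String },
    NumericLessThan { key: String, value: i64 },
    Exists { key: String },
    And { conditions: Vec<Condition> },
    Or { conditions: Vec<Condition> },
    Not { condition: Box<Condition> },
}

/// Evaluate against a flat context such as {"resource.owner": "alice"}.
/// A missing key, or a value that fails to parse, makes a leaf condition false.
pub fn evaluate_condition(cond: &Condition, ctx: &HashMap<String, String>) -> bool {
    match cond {
        Condition::StringEquals { key, value } => ctx.get(key) == Some(value),
        Condition::NumericLessThan { key, value } => ctx
            .get(key)
            .and_then(|v| v.parse::<i64>().ok())
            .map_or(false, |n| n < *value),
        Condition::Exists { key } => ctx.contains_key(key),
        Condition::And { conditions } => conditions.iter().all(|c| evaluate_condition(c, ctx)),
        Condition::Or { conditions } => conditions.iter().any(|c| evaluate_condition(c, ctx)),
        Condition::Not { condition } => !evaluate_condition(condition, ctx),
    }
}
```

Under these rules an empty And is true and an empty Or is false, and Not applied to a leaf with a missing key is true, so negated conditions deserve care when the key may be absent.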

7.2 Variable Context

// Available variables for condition evaluation
principal.id, principal.kind, principal.name
principal.org_id, principal.project_id, principal.node_id
principal.email, principal.metadata.{key}

resource.kind, resource.id
resource.org_id, resource.project_id
resource.owner, resource.node, resource.region
resource.tags.{key}

request.source_ip, request.time
request.method, request.path
request.metadata.{key}

7.3 Variable Substitution

// In permission patterns
"org/${principal.org_id}/project/${project}/*"

// In conditions
Condition::string_equals("resource.owner", "${principal.id}")
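
A minimal sketch of the ${...} expansion, assuming variables are supplied as a flat map. Leaving unknown variables in place verbatim is an assumption made here; an implementation could equally reject the pattern.

```rust
use std::collections::HashMap;

/// Replace each `${name}` with its value from `vars`.
/// Unknown variables and an unterminated `${` are left in place verbatim.
pub fn substitute_variables(template: &str, vars: &HashMap<String, String>) -> String {
    let mut out = String::with_capacity(template.len());
    let mut rest = template;
    while let Some(start) = rest.find("${") {
        out.push_str(&rest[..start]);
        let after = &rest[start + 2..];
        match after.find('}') {
            Some(end) => {
                let name = &after[..end];
                match vars.get(name) {
                    Some(value) => out.push_str(value),
                    None => {
                        out.push_str("${");
                        out.push_str(name);
                        out.push('}');
                    }
                }
                rest = &after[end + 1..];
            }
            None => {
                // No closing brace: copy the remainder unchanged and stop.
                out.push_str(&rest[start..]);
                rest = "";
            }
        }
    }
    out.push_str(rest);
    out
}
```

An unexpanded ${...} segment will not equal any real path segment unless a path literally contains that text, so a missing variable tends to fail closed in resource matching.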

7.4 Example: Owner-Only Access

Permission {
    action: "compute:instances:*",
    resource_pattern: "org/*/project/*/instance/*",
    condition: Some(Condition::string_equals(
        "resource.owner",
        "${principal.id}"
    )),
}

8. Storage

8.1 Backend Abstraction

pub trait StorageBackend: Send + Sync {
    async fn get(&self, key: &str) -> Result<Option<(Vec<u8>, u64)>>;
    async fn put(&self, key: &str, value: &[u8]) -> Result<u64>;
    async fn cas(&self, key: &str, expected: u64, value: &[u8]) -> Result<CasResult>;
    async fn delete(&self, key: &str) -> Result<bool>;
    async fn scan_prefix(&self, prefix: &str, limit: usize) -> Result<Vec<KvPair>>;
}

Supported Backends:

  • Chainfire: Production distributed KV
  • FlareDB: Alternative distributed DB
  • Memory: Testing
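
To illustrate the versioned get/put/cas contract, here is a synchronous, std-only sketch of an in-memory store. It is not the iam-store Memory backend. It assumes the u64 in the trait is a per-key version, that version 0 means "key absent", and it uses a hypothetical CasOutcome enum because CasResult is not defined in this document.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Hypothetical CAS outcome; the real `CasResult` is not shown in this document.
#[derive(Debug, PartialEq, Eq)]
pub enum CasOutcome {
    Success { new_version: u64 },
    Conflict { current_version: u64 },
}

/// Synchronous in-memory store. Version 0 stands for "key absent".
#[derive(Default)]
pub struct MemoryStore {
    data: Mutex<HashMap<String, (Vec<u8>, u64)>>,
}

impl MemoryStore {
    /// Returns the value and its version, if the key exists.
    pub fn get(&self, key: &str) -> Option<(Vec<u8>, u64)> {
        let data = self.data.lock().unwrap();
        data.get(key).cloned()
    }

    /// Unconditional write; returns the new version.
    pub fn put(&self, key: &str, value: &[u8]) -> u64 {
        let mut data = self.data.lock().unwrap();
        let version = data.get(key).map_or(0, |(_, v)| *v) + 1;
        data.insert(key.to_string(), (value.to_vec(), version));
        version
    }

    /// Write only if the stored version still equals `expected`.
    pub fn cas(&self, key: &str, expected: u64, value: &[u8]) -> CasOutcome {
        let mut data = self.data.lock().unwrap();
        let current = data.get(key).map_or(0, |(_, v)| *v);
        if current != expected {
            return CasOutcome::Conflict { current_version: current };
        }
        data.insert(key.to_string(), (value.to_vec(), current + 1));
        CasOutcome::Success { new_version: current + 1 }
    }
}
```

A caller that loses a CAS race would typically re-read the key, reapply its change, and retry with the version reported in the conflict.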

8.2 Key Schema

Principals:

iam/principals/{kind}/{id}                    # Primary
iam/principals/by-org/{org_id}/{kind}/{id}    # Org index
iam/principals/by-project/{project_id}/{id}   # Project index
iam/principals/by-email/{email}               # Email lookup
iam/principals/by-oidc/{iss_hash}/{sub}       # OIDC lookup

Roles:

iam/roles/{name}                              # Primary
iam/roles/by-scope/{scope}/{name}             # Scope index
iam/roles/builtin/{name}                      # Builtin marker

Bindings:

iam/bindings/scope/{scope}/principal/{principal}/{id}  # Primary
iam/bindings/by-principal/{principal}/{id}             # Principal index
iam/bindings/by-role/{role}/{id}                       # Role index

8.3 Caching

pub struct PolicyCache {
    bindings: DashMap<PrincipalRef, Vec<PolicyBinding>>,
    roles: DashMap<String, Role>,
    config: CacheConfig,
}

impl PolicyCache {
    fn get_bindings(&self, principal: &PrincipalRef) -> Option<Vec<PolicyBinding>>;
    fn put_bindings(&self, principal: &PrincipalRef, bindings: Vec<PolicyBinding>);
    fn invalidate_principal(&self, principal: &PrincipalRef);
    fn invalidate_role(&self, name: &str);
}
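
The cache interface can be illustrated with a std-only sketch that swaps DashMap for RwLock<HashMap> and adds a time-to-live. The TTL field is an assumption about what CacheConfig holds, and BindingCache is a hypothetical name; keys are principal reference strings such as "user:alice".

```rust
use std::collections::HashMap;
use std::sync::RwLock;
use std::time::{Duration, Instant};

/// Cached value plus the instant it was stored.
struct Entry<V> {
    value: V,
    stored_at: Instant,
}

/// TTL cache keyed by principal reference string.
/// `V` would be `Vec<PolicyBinding>` in the spec's terms.
pub struct BindingCache<V> {
    entries: RwLock<HashMap<String, Entry<V>>>,
    ttl: Duration, // assumed to come from `CacheConfig`
}

impl<V: Clone> BindingCache<V> {
    pub fn new(ttl: Duration) -> Self {
        Self { entries: RwLock::new(HashMap::new()), ttl }
    }

    /// Returns a clone of the cached value unless it is missing or older than the TTL.
    pub fn get(&self, principal: &str) -> Option<V> {
        let entries = self.entries.read().unwrap();
        let hit = entries
            .get(principal)
            .filter(|e| e.stored_at.elapsed() < self.ttl)
            .map(|e| e.value.clone());
        hit
    }

    pub fn put(&self, principal: &str, value: V) {
        let entry = Entry { value, stored_at: Instant::now() };
        self.entries.write().unwrap().insert(principal.to_string(), entry);
    }

    /// Drop the entry so the next lookup falls through to storage.
    pub fn invalidate_principal(&self, principal: &str) {
        self.entries.write().unwrap().remove(principal);
    }
}
```

Explicit invalidation on binding changes keeps a single instance fresh; in cluster mode the TTL bounds how long another instance may serve a stale entry.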

9. Configuration

9.1 Config File Format (TOML)

[server]
addr = "0.0.0.0:50051"

[server.tls]
cert_file = "/etc/aegis/tls/server.crt"
key_file = "/etc/aegis/tls/server.key"
ca_file = "/etc/aegis/tls/ca.crt"        # For client cert verification
require_client_cert = true                # Enable mTLS

[store]
backend = "chainfire"                     # "memory" | "chainfire" | "flaredb"
chainfire_endpoints = ["http://localhost:2379"]
# flaredb_endpoint = "http://localhost:5000"
# flaredb_namespace = "iam"

[authn]
[authn.jwt]
jwks_url = "https://auth.example.com/.well-known/jwks.json"
issuer = "https://auth.example.com"
audience = "aegis"
jwks_cache_ttl_seconds = 3600

[authn.internal_token]
signing_key = "base64-encoded-256-bit-key"
issuer = "aegis"
default_ttl_seconds = 3600                # 1 hour
max_ttl_seconds = 604800                  # 7 days

[logging]
level = "info"                            # "debug" | "info" | "warn" | "error"
format = "json"                           # "json" | "text"

9.2 Environment Variables

Variable           Default        Description
IAM_CONFIG         -              Path to config file
IAM_ADDR           0.0.0.0:50051  Server listen address
IAM_LOG_LEVEL      info           Log level
IAM_SIGNING_KEY    -              Token signing key (overrides config)
IAM_STORE_BACKEND  memory         Storage backend type

9.3 CLI Arguments

aegis-server [OPTIONS]

Options:
  -c, --config <PATH>     Config file path
  -a, --addr <ADDR>       Listen address (overrides config)
  -l, --log-level <LEVEL> Log level
  -h, --help              Print help
  -V, --version           Print version

10. Multi-Tenancy

10.1 Organization Isolation

  • All principals have org_id (except System scope)
  • All resources require org_id and project_id
  • Scope containment enforces org boundaries

10.2 Project Isolation

  • Service accounts bound to projects
  • Resources belong to projects
  • Permissions scoped to project/${project}/*

10.3 Cross-Tenant Access Patterns

Pattern        Scope     Use Case
System Admin   System    Platform operators
Org Admin      Org       Organization administrators
Project Admin  Project   Project owners
Node Agent     Resource  Node-bound service accounts

10.4 Node-Bound Service Accounts

// Service account with node binding
Principal {
    kind: ServiceAccount,
    node_id: Some("node-001"),
    ...
}

// Permission with node condition
Permission {
    action: "compute:*",
    resource_pattern: "org/*/project/*/instance/*",
    condition: Some(Condition::string_equals(
        "resource.node",
        "${principal.node_id}"
    )),
}

11. Security

11.1 Authentication

External Identity (OIDC/JWT):

  • Validate JWT signature using JWKS from configured IdP
  • Verify issuer, audience, and expiration claims
  • Map OIDC sub claim to internal principal
  • JWKS cached with configurable TTL

Internal Tokens:

  • HMAC-SHA256 signed tokens for service-to-service auth
  • Contains: principal_id, kind, roles, scope, org_id, project_id, exp, iat, session_id
  • Short-lived (default 1 hour, max 7 days)
  • Revocable via session_id
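
The lifetime rules for internal tokens can be sketched as below. The struct and function names are hypothetical, and treating exp as exclusive is an assumption; the default and maximum values correspond to default_ttl_seconds and max_ttl_seconds in the configuration.

```rust
/// Values from the `[authn.internal_token]` config section.
pub struct TokenTtlConfig {
    pub default_ttl_seconds: u64,
    pub max_ttl_seconds: u64,
}

/// Pick the expiry for a new token: the default TTL when none is requested,
/// and never more than the configured maximum.
pub fn compute_expiry(cfg: &TokenTtlConfig, issued_at: u64, requested_ttl: Option<u64>) -> u64 {
    let ttl = requested_ttl
        .unwrap_or(cfg.default_ttl_seconds)
        .min(cfg.max_ttl_seconds);
    issued_at.saturating_add(ttl)
}

/// A token is treated as valid from `iat` (inclusive) to `exp` (exclusive).
pub fn is_within_lifetime(iat: u64, exp: u64, now: u64) -> bool {
    iat <= now && now < exp
}
```

A full validation would also verify the HMAC signature and check the session_id against revocations; this sketch covers only the time bounds.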

mTLS:

  • Optional client certificate authentication
  • Certificate CN mapped to service account ID
  • Used for node-to-control-plane communication

11.2 Authorization Properties

  • Default Deny: No binding = denied
  • Explicit Allow: Must match binding + role + permission
  • Scope Enforcement: Automatic via containment
  • Temporal Bounds: expires_at for time-limited access
  • Soft Disable: enabled flag for quick revocation

11.3 Immutable Builtins

  • System roles cannot be modified/deleted
  • Prevents privilege escalation via role modification

11.4 Audit Trail

  • created_by on all entities
  • Timestamps for creation/modification
  • Audit event generation via iam-audit crate

12. Operations

12.1 Deployment

Single Node:

aegis-server --config /etc/aegis/aegis.toml

Cluster Mode:

  • Multiple Aegis instances behind load balancer
  • Shared storage backend (Chainfire or FlareDB)
  • Stateless - any instance can handle any request
  • Session affinity not required

High Availability:

  • Deploy 3+ instances across availability zones
  • Use Chainfire Raft cluster for storage
  • Health checks on /health endpoint

12.2 Initialization

// Initialize builtin roles (idempotent)
role_store.init_builtin_roles().await?;

12.3 Client Library

use iam_client::IamClient;

let client = IamClient::connect("http://127.0.0.1:9090").await?;

// Check authorization
let allowed = client.authorize(
    PrincipalRef::user("alice"),
    "compute:instances:create",
    ResourceRef::new("instance", "org-1", "proj-1", "vm-1"),
).await?;

// Create binding
client.create_binding(CreateBindingRequest {
    principal: PrincipalRef::user("alice"),
    role: "roles/ProjectAdmin".into(),
    scope: Scope::project("proj-1", "org-1"),
    ..Default::default()
}).await?;

12.4 Monitoring

Metrics (Prometheus format):

Metric                         Type       Description
aegis_authz_requests_total     Counter    Total authorization requests
aegis_authz_decisions{result}  Counter    Decisions by allow/deny
aegis_authz_latency_seconds    Histogram  Authorization latency
aegis_token_issued_total       Counter    Tokens issued
aegis_token_validated_total    Counter    Token validations
aegis_cache_hits_total         Counter    Policy cache hits
aegis_cache_misses_total       Counter    Policy cache misses
aegis_bindings_total           Gauge      Total active bindings
aegis_principals_total         Gauge      Total principals

Health Endpoints:

  • GET /health - Liveness check
  • GET /ready - Readiness check (storage connected)

12.5 Backup & Recovery

Backup:

  • Export all principals, roles, and bindings via Admin API
  • Or snapshot underlying storage (Chainfire/FlareDB)
  • Recommended: Daily full backup + continuous WAL archiving

Recovery:

  • Restore from storage snapshot
  • Or reimport via Admin API
  • Builtin roles auto-created on startup

13. Compatibility

13.1 API Versioning

  • gRPC package: iam.v1
  • Semantic versioning for breaking changes
  • Backward compatible additions within major version
  • Deprecation warnings before removal

13.2 Wire Protocol

  • Protocol Buffers v3
  • gRPC with HTTP/2 transport
  • TLS 1.3 required in production

13.3 Storage Migration

  • Schema version tracked in metadata key
  • Automatic migration on startup
  • Backward compatible within major version

Appendix

A. Error Codes

Error                Meaning
PRINCIPAL_NOT_FOUND  Principal does not exist
ROLE_NOT_FOUND       Role does not exist
BINDING_NOT_FOUND    Binding does not exist
BUILTIN_IMMUTABLE    Cannot modify builtin role
SCOPE_VIOLATION      Operation violates scope boundary
CONDITION_FAILED     Condition evaluation failed

B. Proto Scope Messages

message Scope {
  oneof scope {
    bool system = 1;
    OrgScope org = 2;
    ProjectScope project = 3;
    ResourceScope resource = 4;
  }
}

message OrgScope { string id = 1; }
message ProjectScope { string id = 1; string org_id = 2; }
message ResourceScope { string id = 1; string project_id = 2; string org_id = 3; }

C. Port Assignments

Port  Protocol  Purpose
9090  gRPC      IAM API

D. Performance Considerations

  • Cache bindings and roles for hot path
  • Batch authorization for bulk checks
  • Prefix scans for hierarchical queries
  • CAS for conflict-free updates

E. Glossary

  • Principal: An identity that can be authenticated (user, service account, or group)
  • Role: A named collection of permissions that can be assigned to principals
  • Permission: A specific action allowed on a resource pattern with optional conditions
  • Binding: Assignment of a role to a principal within a specific scope
  • Scope: Hierarchical boundary for permission application (System > Org > Project > Resource)
  • Condition: ABAC expression that must evaluate to true for access to be granted
  • PDP: Policy Decision Point - the authorization evaluation engine
  • RBAC: Role-Based Access Control - permissions assigned via roles
  • ABAC: Attribute-Based Access Control - permissions based on attributes/conditions

F. Example Policies

Allow user to manage own instances:

{
  "principal": "user:alice",
  "role": "roles/ProjectMember",
  "scope": { "type": "project", "id": "web-app", "org_id": "acme" }
}

Time-limited admin access:

{
  "principal": "user:bob",
  "role": "roles/ProjectAdmin",
  "scope": { "type": "project", "id": "staging", "org_id": "acme" },
  "expires_at": 1735689600,
  "condition": {
    "expression": {
      "type": "time_between",
      "start": "09:00",
      "end": "18:00"
    }
  }
}
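
The time_between check in this policy could be evaluated along these lines for the HH:MM form. The half-open window, the wrap-around past midnight when start is later than end, and the function names are all assumptions; the spec also does not say which time zone applies.

```rust
/// Parse "HH:MM" into minutes since midnight.
pub fn parse_hhmm(s: &str) -> Option<u32> {
    let (h, m) = s.split_once(':')?;
    let (h, m) = (h.parse::<u32>().ok()?, m.parse::<u32>().ok()?);
    if h < 24 && m < 60 {
        Some(h * 60 + m)
    } else {
        None
    }
}

/// True when `now` falls in [start, end). If start is later than end,
/// the window is taken to wrap past midnight. `None` means malformed input.
pub fn time_between(start: &str, end: &str, now: &str) -> Option<bool> {
    let (s, e, n) = (parse_hhmm(start)?, parse_hhmm(end)?, parse_hhmm(now)?);
    Some(if s <= e { s <= n && n < e } else { n >= s || n < e })
}
```

In the policy above the binding's expires_at and the condition are independent limits: access requires both an unexpired binding and a request time inside the window.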

Node-bound service account:

{
  "principal": "service_account:compute-agent-node-1",
  "role": "roles/ServiceRole-ComputeAgent",
  "scope": { "type": "system" },
  "condition": {
    "expression": {
      "type": "string_equals",
      "key": "resource.node",
      "value": "${principal.node_id}"
    }
  }
}

IP-restricted access:

{
  "principal": "user:admin",
  "role": "roles/SystemAdmin",
  "scope": { "type": "system" },
  "condition": {
    "expression": {
      "type": "ip_address",
      "key": "request.source_ip",
      "cidr": "10.0.0.0/8"
    }
  }
}
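
The ip_address condition in the last policy comes down to a CIDR membership test. The sketch below uses only the standard library and handles IPv4 alone; the dependency table lists the ipnetwork crate for this job, so treat this as an illustration of the check rather than the actual implementation. The function name is hypothetical.

```rust
use std::net::Ipv4Addr;

/// Check whether `ip` lies inside `cidr` (IPv4 only).
/// Returns `None` on malformed input or a prefix longer than 32 bits.
pub fn ipv4_in_cidr(ip: &str, cidr: &str) -> Option<bool> {
    let ip: Ipv4Addr = ip.parse().ok()?;
    let (net, prefix) = cidr.split_once('/')?;
    let net: Ipv4Addr = net.parse().ok()?;
    let prefix: u32 = prefix.parse().ok()?;
    if prefix > 32 {
        return None;
    }
    // A /0 mask is built specially: shifting a u32 by 32 bits would overflow.
    let mask = if prefix == 0 { 0 } else { u32::MAX << (32 - prefix) };
    Some((u32::from(ip) & mask) == (u32::from(net) & mask))
}
```

A caller enforcing default deny would treat `None` (malformed address or CIDR) the same as a failed match.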