# Aegis (IAM) Specification
> Version: 1.0 | Status: Draft | Last Updated: 2025-12-08
## 1. Overview
### 1.1 Purpose
Aegis is the Identity and Access Management (IAM) platform providing authentication, authorization, and multi-tenant access control for all cloud services. It implements RBAC (Role-Based Access Control) with ABAC (Attribute-Based Access Control) extensions.
The name "Aegis" (shield of Zeus) reflects its role as the protective layer that guards access to all platform resources.
### 1.2 Scope
- **In scope**: Principals (users, service accounts, groups), roles, permissions, policy bindings, scope hierarchy (System > Org > Project > Resource), internal token issuance/validation, external identity federation (OIDC/JWT), authorization decision service (PDP), audit event generation
- **Out of scope**: User password management (delegated to external IdP), UI for authentication, API gateway/rate limiting
### 1.3 Design Goals
- **AWS IAM / GCP IAM compatible**: Familiar concepts and API patterns
- **Multi-tenant from day one**: Full org/project hierarchy with scope isolation
- **Flexible RBAC + ABAC hybrid**: Roles with conditional permissions
- **High-performance authorization**: Sub-millisecond decisions with caching
- **Zero-trust security**: Default deny, explicit grants, audit everything
- **Cloud-grade scalability**: Handle millions of decisions per second
## 2. Architecture
### 2.1 Crate Structure
```
iam/
├── crates/
│   ├── iam-api/      # gRPC service implementations
│   ├── iam-audit/    # Audit logging (planned)
│   ├── iam-authn/    # Authentication (tokens, OIDC)
│   ├── iam-authz/    # Authorization engine (PDP)
│   ├── iam-client/   # Rust client library
│   ├── iam-server/   # Server binary
│   ├── iam-store/    # Storage backends (Chainfire, FlareDB, Memory)
│   └── iam-types/    # Core types
└── proto/
    └── iam.proto     # gRPC definitions
```
### 2.2 Authorization Flow
```
[Client Request] → [IamAuthz Service]
                           ↓
                   [Fetch Principal]
                           ↓
               [Build Resource Context]
                           ↓
                   [PolicyEvaluator]
                           ↓
           ┌───────────────┼───────────────┐
           ↓               ↓               ↓
    [Get Bindings]    [Get Roles]   [Cache Lookup]
           ↓               ↓               ↓
           └───────────────┼───────────────┘
                           ↓
                [Evaluate Permissions]
                           ↓
                   [Condition Check]
                           ↓
                    [ALLOW / DENY]
```
### 2.3 Dependencies
| Crate | Version | Purpose |
|-------|---------|---------|
| tokio | 1.x | Async runtime |
| tonic | 0.12 | gRPC framework |
| prost | 0.13 | Protocol buffers |
| dashmap | 6.x | Concurrent cache |
| ipnetwork | 0.20 | CIDR matching |
| glob-match | 0.2 | Resource pattern matching |
## 3. Core Concepts
### 3.1 Principals
Identities that can be authenticated and authorized.
```rust
pub struct Principal {
pub id: String, // Unique identifier
pub kind: PrincipalKind, // User | ServiceAccount | Group
pub name: String, // Display name
pub org_id: Option<String>, // Organization membership
pub project_id: Option<String>, // For service accounts
pub email: Option<String>, // For users
pub oidc_sub: Option<String>, // Federated identity subject
pub node_id: Option<String>, // For node-bound service accounts
pub metadata: HashMap<String, String>,
pub created_at: u64,
pub updated_at: u64,
pub enabled: bool,
}
pub enum PrincipalKind {
User, // Human users
ServiceAccount, // Machine identities
Group, // Collections (future)
}
```
**Principal Reference**: `kind:id` format
- `user:alice`
- `service_account:compute-agent`
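A minimal sketch of parsing the `kind:id` reference format, assuming the reference is split on the first colon. The `PrincipalRef` fields, the `parse_principal_ref` helper, and the `group` prefix are illustrative assumptions; the specification only shows the `user` and `service_account` prefixes.

```rust
/// Kinds of principal, mirroring `PrincipalKind` above.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum PrincipalKind {
    User,
    ServiceAccount,
    Group,
}

/// Hypothetical parsed form of a `kind:id` reference.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct PrincipalRef {
    pub kind: PrincipalKind,
    pub id: String,
}

/// Parses `kind:id`, splitting on the first colon only.
/// Returns `None` for a missing colon, an unknown kind, or an empty id.
pub fn parse_principal_ref(s: &str) -> Option<PrincipalRef> {
    let (kind, id) = s.split_once(':')?;
    if id.is_empty() {
        return None;
    }
    let kind = match kind {
        "user" => PrincipalKind::User,
        "service_account" => PrincipalKind::ServiceAccount,
        // Assumed prefix: the spec gives no example for groups.
        "group" => PrincipalKind::Group,
        _ => return None,
    };
    Some(PrincipalRef { kind, id: id.to_string() })
}
```

Rejecting unknown kinds instead of defaulting to `User` keeps a malformed reference from resolving to a real identity.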
### 3.2 Roles
Named collections of permissions.
```rust
pub struct Role {
pub name: String, // e.g., "ProjectAdmin"
pub display_name: String,
pub description: String,
pub scope: Scope, // Where role can be assigned
pub permissions: Vec<Permission>,
pub builtin: bool, // System-defined, immutable
pub created_at: u64,
pub updated_at: u64,
}
```
**Builtin Roles**:
| Role | Scope | Description |
|------|-------|-------------|
| SystemAdmin | System | Full cluster access |
| OrgAdmin | Org | Full organization access |
| ProjectAdmin | Project | Full project access |
| ProjectMember | Project | Manage own resources; read all project resources |
| ReadOnly | Project | Read-only project access |
| ServiceRole-ComputeAgent | Resource | Node-scoped compute |
| ServiceRole-StorageAgent | Resource | Node-scoped storage |
### 3.3 Permissions
Individual access rights within roles.
```rust
pub struct Permission {
pub action: String, // e.g., "compute:instances:create"
pub resource_pattern: String, // e.g., "org/*/project/${project}/instance/*"
pub condition: Option<Condition>,
}
```
**Action Format**: `service:resource:operation`
- Wildcards: `*`, `compute:*`, `compute:instances:*`
- Examples: `compute:instances:create`, `storage:volumes:delete`
**Resource Pattern Format**: `org/{org_id}/project/{project_id}/{kind}/{id}`
- Wildcards: `org/*/project/*/instance/*`
- Variables: `${principal.id}`, `${project}`
### 3.4 Policy Bindings
Assignments of roles to principals within a scope.
```rust
pub struct PolicyBinding {
pub id: String, // UUID
pub principal_ref: PrincipalRef,
pub role_ref: String, // "roles/ProjectAdmin"
pub scope: Scope,
pub condition: Option<Condition>,
pub created_at: u64,
pub updated_at: u64,
pub created_by: String,
pub expires_at: Option<u64>, // Time-limited access
pub enabled: bool,
}
```
## 4. Scope Hierarchy
Four-level hierarchical boundary for permissions.
```
System (level 0)               ← Cluster-wide
└─ Organization (level 1)      ← Tenant boundary
   └─ Project (level 2)        ← Workload isolation
      └─ Resource (level 3)    ← Individual resource
```
### 4.1 Scope Types
```rust
pub enum Scope {
System,
Org { id: String },
Project { id: String, org_id: String },
Resource { id: String, project_id: String, org_id: String },
}
```
### 4.2 Scope Containment
```rust
impl Scope {
// System contains everything
// Org contains its projects and resources
// Project contains its resources
fn contains(&self, other: &Scope) -> bool;
// Get parent scope
fn parent(&self) -> Option<Scope>;
// Get all ancestors up to System
fn ancestors(&self) -> Vec<Scope>;
}
```
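The three method signatures above could be implemented along these lines. This is a sketch under one assumption that the specification does not state: a scope is treated as containing itself. The `derive` attributes are added only so the sketch can compare scopes.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Scope {
    System,
    Org { id: String },
    Project { id: String, org_id: String },
    Resource { id: String, project_id: String, org_id: String },
}

impl Scope {
    /// Scope one level up; `System` has no parent.
    pub fn parent(&self) -> Option<Scope> {
        match self {
            Scope::System => None,
            Scope::Org { .. } => Some(Scope::System),
            Scope::Project { org_id, .. } => Some(Scope::Org { id: org_id.clone() }),
            Scope::Resource { project_id, org_id, .. } => Some(Scope::Project {
                id: project_id.clone(),
                org_id: org_id.clone(),
            }),
        }
    }

    /// Ancestors ordered from the nearest parent up to `System`.
    pub fn ancestors(&self) -> Vec<Scope> {
        let mut out = Vec::new();
        let mut current = self.parent();
        while let Some(scope) = current {
            current = scope.parent();
            out.push(scope);
        }
        out
    }

    /// Assumption: a scope contains itself and every scope below it.
    pub fn contains(&self, other: &Scope) -> bool {
        self == other || other.ancestors().contains(self)
    }
}
```

Deriving containment from `ancestors` keeps the org and project identifiers in the comparison, so an org never contains another org's projects.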
### 4.3 Scope Storage Keys
```
system
org/{org_id}
org/{org_id}/project/{project_id}
org/{org_id}/project/{project_id}/resource/{resource_id}
```
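As an illustration, the key layout above maps onto the `Scope` enum roughly as follows. The method name `storage_key` is hypothetical and not taken from the codebase.

```rust
pub enum Scope {
    System,
    Org { id: String },
    Project { id: String, org_id: String },
    Resource { id: String, project_id: String, org_id: String },
}

impl Scope {
    /// Renders the scope using the storage key layout listed above.
    pub fn storage_key(&self) -> String {
        match self {
            Scope::System => "system".to_string(),
            Scope::Org { id } => format!("org/{id}"),
            Scope::Project { id, org_id } => format!("org/{org_id}/project/{id}"),
            Scope::Resource { id, project_id, org_id } => {
                format!("org/{org_id}/project/{project_id}/resource/{id}")
            }
        }
    }
}
```

Because each child key extends its parent's key, a prefix scan on `org/{org_id}/` returns everything stored under that organization.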
## 5. API
### 5.1 Authorization Service (PDP)
```protobuf
service IamAuthz {
rpc Authorize(AuthorizeRequest) returns (AuthorizeResponse);
rpc BatchAuthorize(BatchAuthorizeRequest) returns (BatchAuthorizeResponse);
}
message AuthorizeRequest {
PrincipalRef principal = 1;
string action = 2; // "compute:instances:create"
ResourceRef resource = 3;
AuthzContext context = 4; // IP, timestamp, metadata
}
message AuthorizeResponse {
bool allowed = 1;
string reason = 2;
string matched_binding = 3;
string matched_role = 4;
}
message ResourceRef {
string kind = 1; // "instance"
string id = 2; // "vm-123"
string org_id = 3; // Required
string project_id = 4; // Required
optional string owner_id = 5;
optional string node_id = 6;
optional string region = 7;
map<string, string> tags = 8;
}
```
### 5.2 Admin Service (Management)
```protobuf
service IamAdmin {
// Principals
rpc CreatePrincipal(CreatePrincipalRequest) returns (Principal);
rpc GetPrincipal(GetPrincipalRequest) returns (Principal);
rpc UpdatePrincipal(UpdatePrincipalRequest) returns (Principal);
rpc DeletePrincipal(DeletePrincipalRequest) returns (Empty);
rpc ListPrincipals(ListPrincipalsRequest) returns (ListPrincipalsResponse);
// Roles
rpc CreateRole(CreateRoleRequest) returns (Role);
rpc GetRole(GetRoleRequest) returns (Role);
rpc UpdateRole(UpdateRoleRequest) returns (Role);
rpc DeleteRole(DeleteRoleRequest) returns (Empty);
rpc ListRoles(ListRolesRequest) returns (ListRolesResponse);
// Bindings
rpc CreateBinding(CreateBindingRequest) returns (PolicyBinding);
rpc GetBinding(GetBindingRequest) returns (PolicyBinding);
rpc UpdateBinding(UpdateBindingRequest) returns (PolicyBinding);
rpc DeleteBinding(DeleteBindingRequest) returns (Empty);
rpc ListBindings(ListBindingsRequest) returns (ListBindingsResponse);
}
```
### 5.3 Token Service
```protobuf
service IamToken {
rpc IssueToken(IssueTokenRequest) returns (IssueTokenResponse);
rpc ValidateToken(ValidateTokenRequest) returns (ValidateTokenResponse);
rpc RevokeToken(RevokeTokenRequest) returns (Empty);
rpc RefreshToken(RefreshTokenRequest) returns (RefreshTokenResponse);
}
message InternalTokenClaims {
string principal_id = 1;
PrincipalKind principal_kind = 2;
string principal_name = 3;
repeated string roles = 4; // Pre-loaded roles
Scope scope = 5;
optional string org_id = 6;
optional string project_id = 7;
optional string node_id = 8;
uint64 iat = 9; // Issued at (TSO)
uint64 exp = 10; // Expires at (TSO)
string session_id = 11;
AuthMethod auth_method = 12; // Jwt | Mtls | ApiKey
}
```
## 6. Authorization Logic
### 6.1 Evaluation Algorithm
```
evaluate(request):
  1. Default DENY
  2. resource_scope = Scope::from(request.resource)
  3. bindings = get_effective_bindings(principal, resource_scope)
  4. For each binding where binding.is_active(now):
     a. role = get_role(binding.role_ref)
     b. If binding.condition exists and !evaluate_condition(binding.condition):
          continue
     c. If evaluate_role(role, request):
          return ALLOW
  5. Return DENY
```
### 6.2 Role Permission Evaluation
```
evaluate_role(role, request):
  For each permission in role.permissions:
    1. If !matches_action(permission.action, request.action):
         continue
    2. resource_path = request.resource.to_path()
       pattern = substitute_variables(permission.resource_pattern)
       If !matches_resource(pattern, resource_path):
         continue
    3. If permission.condition exists and !evaluate_condition(permission.condition):
         continue
    4. return true  // Permission matches
  return false
```
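The two procedures in 6.1 and 6.2 can be sketched in Rust as below. This is a simplified illustration, not the engine itself: conditions are omitted, bindings are assumed to be already resolved for the principal and scope, and `simple_match`, `Binding`, `Request`, and `Decision` are hypothetical trimmed-down stand-ins for the real types and matchers.

```rust
use std::collections::HashMap;

pub struct Permission {
    pub action: String,
    pub resource_pattern: String,
}

pub struct Role {
    pub permissions: Vec<Permission>,
}

/// Trimmed-down binding: only the fields the loop below needs.
pub struct Binding {
    pub role_ref: String,
    pub enabled: bool,
    pub expires_at: Option<u64>,
}

pub struct Request {
    pub action: String,
    pub resource_path: String,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Decision {
    Allow,
    Deny,
}

/// Placeholder matcher: exact match, or prefix match when the pattern ends in `*`.
fn simple_match(pattern: &str, value: &str) -> bool {
    match pattern.strip_suffix('*') {
        Some(prefix) => value.starts_with(prefix),
        None => pattern == value,
    }
}

/// Section 6.2 without conditions: any permission matching both action and resource.
fn evaluate_role(role: &Role, req: &Request) -> bool {
    role.permissions.iter().any(|p| {
        simple_match(&p.action, &req.action)
            && simple_match(&p.resource_pattern, &req.resource_path)
    })
}

/// Section 6.1: default deny; the first active binding whose role matches allows.
pub fn evaluate(
    bindings: &[Binding],
    roles: &HashMap<String, Role>,
    req: &Request,
    now: u64,
) -> Decision {
    for binding in bindings {
        // Assumption: active means enabled and not yet past `expires_at`.
        let active = binding.enabled && binding.expires_at.map_or(true, |t| now < t);
        if !active {
            continue;
        }
        // A binding whose role cannot be found grants nothing.
        let Some(role) = roles.get(&binding.role_ref) else {
            continue;
        };
        if evaluate_role(role, req) {
            return Decision::Allow;
        }
    }
    Decision::Deny
}
```

Every early exit in the loop falls through to `Deny`, so a missing role, a disabled binding, or an expired binding can never widen access.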
### 6.3 Action Matching
```rust
matches_action("compute:*", "compute:instances:create") // true
matches_action("compute:instances:*", "compute:volumes:create") // false
matches_action("*", "anything:here:works") // true
```
### 6.4 Resource Matching
```rust
// Path format: org/{org}/project/{proj}/{kind}/{id}
matches_resource("org/*/project/*/instance/*",
"org/org-1/project/proj-1/instance/vm-1") // true
matches_resource("org/org-1/project/proj-1/*",
"org/org-1/project/proj-1/instance/vm-1") // true (trailing /*)
```
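The action and resource examples in 6.3 and 6.4 are consistent with a single segment-wise matcher. The sketch below is a hand-rolled stand-in using only the standard library, not the `glob-match` crate listed in the dependencies. It assumes `*` matches exactly one segment, except as the final pattern segment, where it matches one or more remaining segments. `matches_segments` is a hypothetical helper name.

```rust
/// Segment-wise wildcard match on `sep`-delimited strings.
/// `*` matches exactly one segment, except as the final pattern segment,
/// where it matches one or more remaining segments.
fn matches_segments(pattern: &str, value: &str, sep: char) -> bool {
    let pat: Vec<&str> = pattern.split(sep).collect();
    let val: Vec<&str> = value.split(sep).collect();
    for (i, p) in pat.iter().enumerate() {
        if *p == "*" && i + 1 == pat.len() {
            // Trailing wildcard: require at least one segment at this position.
            return val.len() > i;
        }
        match val.get(i) {
            Some(v) if *p == "*" || p == v => continue,
            _ => return false,
        }
    }
    // No trailing wildcard: every segment must have been consumed.
    pat.len() == val.len()
}

/// Actions use `:` as the separator, e.g. `compute:instances:create`.
pub fn matches_action(pattern: &str, action: &str) -> bool {
    matches_segments(pattern, action, ':')
}

/// Resource paths use `/` as the separator.
pub fn matches_resource(pattern: &str, path: &str) -> bool {
    matches_segments(pattern, path, '/')
}
```

Matching whole segments rather than raw prefixes means a pattern such as `org/org-1/*` cannot accidentally match `org/org-10/...`.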
## 7. Conditions (ABAC)
### 7.1 Condition Types
```rust
pub enum Condition {
// String
StringEquals { key: String, value: String },
StringNotEquals { key: String, value: String },
StringLike { key: String, pattern: String }, // Glob pattern
StringEqualsAny { key: String, values: Vec<String> },
// Numeric
NumericEquals { key: String, value: i64 },
NumericLessThan { key: String, value: i64 },
NumericGreaterThan { key: String, value: i64 },
// Network
IpAddress { key: String, cidr: String }, // CIDR matching
NotIpAddress { key: String, cidr: String },
// Temporal
TimeBetween { start: String, end: String }, // HH:MM or Unix timestamp
// Existence
Exists { key: String },
// Boolean
Bool { key: String, value: bool },
// Logical
And { conditions: Vec<Condition> },
Or { conditions: Vec<Condition> },
Not { condition: Box<Condition> },
}
```
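To show how the logical variants compose, here is a sketch of an evaluator over a subset of the condition types. It assumes the variable context is a flat string map and that a comparison against a missing key evaluates to false; the specification states neither. `Context` and this trimmed `Condition` enum are illustrative only.

```rust
use std::collections::HashMap;

/// Subset of the condition types, enough to show recursion.
pub enum Condition {
    StringEquals { key: String, value: String },
    Exists { key: String },
    And { conditions: Vec<Condition> },
    Or { conditions: Vec<Condition> },
    Not { condition: Box<Condition> },
}

/// Hypothetical flat context, e.g. "resource.owner" -> "alice".
pub type Context = HashMap<String, String>;

pub fn evaluate_condition(cond: &Condition, ctx: &Context) -> bool {
    match cond {
        // Assumption: a missing key makes the comparison false.
        Condition::StringEquals { key, value } => ctx.get(key) == Some(value),
        Condition::Exists { key } => ctx.contains_key(key),
        // An empty `And` is true and an empty `Or` is false, as with `all` and `any`.
        Condition::And { conditions } => conditions.iter().all(|c| evaluate_condition(c, ctx)),
        Condition::Or { conditions } => conditions.iter().any(|c| evaluate_condition(c, ctx)),
        Condition::Not { condition } => !evaluate_condition(condition, ctx),
    }
}
```

Note that with the missing-key assumption, `Not` over a comparison on an absent attribute evaluates to true, so policies that negate a condition may want to pair it with `Exists`.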
### 7.2 Variable Context
```
// Available variables for condition evaluation
principal.id, principal.kind, principal.name
principal.org_id, principal.project_id, principal.node_id
principal.email, principal.metadata.{key}
resource.kind, resource.id
resource.org_id, resource.project_id
resource.owner, resource.node, resource.region
resource.tags.{key}
request.source_ip, request.time
request.method, request.path
request.metadata.{key}
```
### 7.3 Variable Substitution
```rust
// In permission patterns
"org/${principal.org_id}/project/${project}/*"
// In conditions
Condition::string_equals("resource.owner", "${principal.id}")
```
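A sketch of how `${...}` placeholders might be expanded before matching. It assumes variables are supplied as a flat name-to-value map and that unknown or unterminated placeholders are left verbatim, so the resulting pattern simply fails to match. The signature below is illustrative and takes an explicit `vars` argument that the pseudocode in 6.2 leaves implicit.

```rust
use std::collections::HashMap;

/// Replaces each `${name}` in `pattern` with its value from `vars`.
/// Unknown or unterminated placeholders are copied through unchanged.
pub fn substitute_variables(pattern: &str, vars: &HashMap<&str, &str>) -> String {
    let mut out = String::with_capacity(pattern.len());
    let mut rest = pattern;
    while let Some(start) = rest.find("${") {
        out.push_str(&rest[..start]);
        let after = &rest[start + 2..];
        match after.find('}') {
            Some(end) => {
                let name = &after[..end];
                match vars.get(name) {
                    Some(value) => out.push_str(value),
                    None => {
                        // Unknown variable: keep the placeholder as written.
                        out.push_str("${");
                        out.push_str(name);
                        out.push('}');
                    }
                }
                rest = &after[end + 1..];
            }
            None => {
                // Unterminated `${`: copy the remainder unchanged and stop.
                out.push_str(&rest[start..]);
                rest = "";
            }
        }
    }
    out.push_str(rest);
    out
}
```

A real implementation would likely also need to reject substituted values containing `*` or `/`, since those could change what the expanded pattern matches.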
### 7.4 Example: Owner-Only Access
```rust
Permission {
action: "compute:instances:*".into(),
resource_pattern: "org/*/project/*/instance/*".into(),
condition: Some(Condition::string_equals(
"resource.owner",
"${principal.id}"
)),
}
```
## 8. Storage
### 8.1 Backend Abstraction
```rust
pub trait StorageBackend: Send + Sync {
async fn get(&self, key: &str) -> Result<Option<(Vec<u8>, u64)>>;
async fn put(&self, key: &str, value: &[u8]) -> Result<u64>;
async fn cas(&self, key: &str, expected: u64, value: &[u8]) -> Result<CasResult>;
async fn delete(&self, key: &str) -> Result<bool>;
async fn scan_prefix(&self, prefix: &str, limit: usize) -> Result<Vec<KvPair>>;
}
```
**Supported Backends**:
- **Chainfire**: Production distributed KV
- **FlareDB**: Alternative distributed DB
- **Memory**: Testing
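To illustrate the versioned `get`/`put`/`cas` contract, here is a synchronous in-memory sketch. It is not the actual Memory backend: the methods are not `async`, errors are not modelled, and the `CasResult` variants are hypothetical because the specification does not define them. It also assumes that an `expected` version of `0` means "the key must not exist yet".

```rust
use std::collections::BTreeMap;
use std::sync::Mutex;

/// Hypothetical shape of the CAS outcome.
#[derive(Debug, PartialEq, Eq)]
pub enum CasResult {
    Success { new_version: u64 },
    Conflict { current_version: u64 },
}

#[derive(Default)]
pub struct MemoryBackend {
    // key -> (value, version); a BTreeMap keeps keys ordered for prefix scans.
    data: Mutex<BTreeMap<String, (Vec<u8>, u64)>>,
}

impl MemoryBackend {
    pub fn get(&self, key: &str) -> Option<(Vec<u8>, u64)> {
        self.data.lock().unwrap().get(key).cloned()
    }

    /// Unconditional write; returns the new version (1 for a fresh key).
    pub fn put(&self, key: &str, value: &[u8]) -> u64 {
        let mut data = self.data.lock().unwrap();
        let version = data.get(key).map_or(1, |(_, v)| v + 1);
        data.insert(key.to_string(), (value.to_vec(), version));
        version
    }

    /// Writes only if the stored version equals `expected`.
    /// Assumption: `expected == 0` means the key must be absent.
    pub fn cas(&self, key: &str, expected: u64, value: &[u8]) -> CasResult {
        let mut data = self.data.lock().unwrap();
        let current = data.get(key).map_or(0, |(_, v)| *v);
        if current != expected {
            return CasResult::Conflict { current_version: current };
        }
        data.insert(key.to_string(), (value.to_vec(), current + 1));
        CasResult::Success { new_version: current + 1 }
    }

    /// Returns up to `limit` entries whose key starts with `prefix`, in key order.
    pub fn scan_prefix(&self, prefix: &str, limit: usize) -> Vec<(String, Vec<u8>)> {
        let data = self.data.lock().unwrap();
        data.range(prefix.to_string()..)
            .take_while(|(k, _)| k.starts_with(prefix))
            .take(limit)
            .map(|(k, (v, _))| (k.clone(), v.clone()))
            .collect()
    }
}
```

A caller that loses a CAS race receives the current version in `Conflict`, re-reads the value, reapplies its change, and retries.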
### 8.2 Key Schema
**Principals**:
```
iam/principals/{kind}/{id} # Primary
iam/principals/by-org/{org_id}/{kind}/{id} # Org index
iam/principals/by-project/{project_id}/{id} # Project index
iam/principals/by-email/{email} # Email lookup
iam/principals/by-oidc/{iss_hash}/{sub} # OIDC lookup
```
**Roles**:
```
iam/roles/{name} # Primary
iam/roles/by-scope/{scope}/{name} # Scope index
iam/roles/builtin/{name} # Builtin marker
```
**Bindings**:
```
iam/bindings/scope/{scope}/principal/{principal}/{id} # Primary
iam/bindings/by-principal/{principal}/{id} # Principal index
iam/bindings/by-role/{role}/{id} # Role index
```
### 8.3 Caching
```rust
pub struct PolicyCache {
bindings: DashMap<PrincipalRef, Vec<PolicyBinding>>,
roles: DashMap<String, Role>,
config: CacheConfig,
}
impl PolicyCache {
fn get_bindings(&self, principal: &PrincipalRef) -> Option<Vec<PolicyBinding>>;
fn put_bindings(&self, principal: &PrincipalRef, bindings: Vec<PolicyBinding>);
fn invalidate_principal(&self, principal: &PrincipalRef);
fn invalidate_role(&self, name: &str);
}
```
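The cache interface above could behave roughly as in the following sketch. It swaps `DashMap` for a standard-library `RwLock<HashMap>` to stay dependency-free, stores binding IDs as plain strings, and assumes a time-to-live setting; the contents of `CacheConfig` are not specified, so the `ttl` field and the `BindingCache` name are hypothetical.

```rust
use std::collections::HashMap;
use std::sync::RwLock;
use std::time::{Duration, Instant};

/// Simplified stand-in for `PolicyCache`, keyed by the principal reference string.
pub struct BindingCache {
    ttl: Duration,
    entries: RwLock<HashMap<String, (Instant, Vec<String>)>>,
}

impl BindingCache {
    pub fn new(ttl: Duration) -> Self {
        Self { ttl, entries: RwLock::new(HashMap::new()) }
    }

    /// Returns the cached binding IDs unless the entry has reached `ttl`.
    pub fn get_bindings(&self, principal: &str) -> Option<Vec<String>> {
        let entries = self.entries.read().unwrap();
        let (stored_at, bindings) = entries.get(principal)?;
        if stored_at.elapsed() >= self.ttl {
            return None;
        }
        Some(bindings.clone())
    }

    pub fn put_bindings(&self, principal: &str, bindings: Vec<String>) {
        self.entries
            .write()
            .unwrap()
            .insert(principal.to_string(), (Instant::now(), bindings));
    }

    /// Drops the entry so the next lookup falls through to storage.
    pub fn invalidate_principal(&self, principal: &str) {
        self.entries.write().unwrap().remove(principal);
    }
}
```

Explicit invalidation on binding changes keeps revocations prompt on the instance that handled the write, while the TTL bounds staleness on other instances.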
## 9. Configuration
### 9.1 Config File Format (TOML)
```toml
[server]
addr = "0.0.0.0:50051"
[server.tls]
cert_file = "/etc/aegis/tls/server.crt"
key_file = "/etc/aegis/tls/server.key"
ca_file = "/etc/aegis/tls/ca.crt" # For client cert verification
require_client_cert = true # Enable mTLS
[store]
backend = "chainfire" # "memory" | "chainfire" | "flaredb"
chainfire_endpoints = ["http://localhost:2379"]
# flaredb_endpoint = "http://localhost:5000"
# flaredb_namespace = "iam"
[authn]
[authn.jwt]
jwks_url = "https://auth.example.com/.well-known/jwks.json"
issuer = "https://auth.example.com"
audience = "aegis"
jwks_cache_ttl_seconds = 3600
[authn.internal_token]
signing_key = "base64-encoded-256-bit-key"
issuer = "aegis"
default_ttl_seconds = 3600 # 1 hour
max_ttl_seconds = 604800 # 7 days
[logging]
level = "info" # "debug" | "info" | "warn" | "error"
format = "json" # "json" | "text"
```
### 9.2 Environment Variables
| Variable | Default | Description |
|----------|---------|-------------|
| `IAM_CONFIG` | - | Path to config file |
| `IAM_ADDR` | `0.0.0.0:50051` | Server listen address |
| `IAM_LOG_LEVEL` | `info` | Log level |
| `IAM_SIGNING_KEY` | - | Token signing key (overrides config) |
| `IAM_STORE_BACKEND` | `memory` | Storage backend type |
### 9.3 CLI Arguments
```
aegis-server [OPTIONS]
Options:
-c, --config <PATH> Config file path
-a, --addr <ADDR> Listen address (overrides config)
-l, --log-level <LEVEL> Log level
-h, --help Print help
-V, --version Print version
```
## 10. Multi-Tenancy
### 10.1 Organization Isolation
- All principals carry an `org_id`, except System-scoped principals
- All resources require `org_id` and `project_id`
- Scope containment enforces org boundaries
### 10.2 Project Isolation
- Service accounts bound to projects
- Resources belong to projects
- Permissions scoped to `project/${project}/*`
### 10.3 Cross-Tenant Access Patterns
| Pattern | Scope | Use Case |
|---------|-------|----------|
| System Admin | System | Platform operators |
| Org Admin | Org | Organization administrators |
| Project Admin | Project | Project owners |
| Node Agent | Resource | Node-bound service accounts |
### 10.4 Node-Bound Service Accounts
```rust
// Service account with node binding
Principal {
kind: ServiceAccount,
node_id: Some("node-001"),
...
}
// Permission with node condition
Permission {
action: "compute:*".into(),
resource_pattern: "org/*/project/*/instance/*".into(),
condition: Some(Condition::string_equals(
"resource.node",
"${principal.node_id}"
)),
}
```
## 11. Security
### 11.1 Authentication
**External Identity (OIDC/JWT)**:
- Validate JWT signature using JWKS from configured IdP
- Verify issuer, audience, and expiration claims
- Map OIDC `sub` claim to internal principal
- JWKS cached with configurable TTL
**Internal Tokens**:
- HMAC-SHA256 signed tokens for service-to-service auth
- Contains the `InternalTokenClaims` fields: principal_id, principal_kind, principal_name, roles, scope, org_id, project_id, node_id, iat, exp, session_id, auth_method
- Short-lived (default 1 hour, max 7 days)
- Revocable via session_id
**mTLS**:
- Optional client certificate authentication
- Certificate CN mapped to service account ID
- Used for node-to-control-plane communication
### 11.2 Authorization Properties
- **Default Deny**: No binding = denied
- **Explicit Allow**: Must match binding + role + permission
- **Scope Enforcement**: Automatic via containment
- **Temporal Bounds**: `expires_at` for time-limited access
- **Soft Disable**: `enabled` flag for quick revocation
### 11.3 Immutable Builtins
- System roles cannot be modified/deleted
- Prevents privilege escalation via role modification
### 11.4 Audit Trail
- `created_by` on all entities
- Timestamps for creation/modification
- Audit event generation via iam-audit crate
## 12. Operations
### 12.1 Deployment
**Single Node**:
```bash
aegis-server --config /etc/aegis/aegis.toml
```
**Cluster Mode**:
- Multiple Aegis instances behind load balancer
- Shared storage backend (Chainfire or FlareDB)
- Stateless - any instance can handle any request
- Session affinity not required
**High Availability**:
- Deploy 3+ instances across availability zones
- Use Chainfire Raft cluster for storage
- Health checks on `/health` endpoint
### 12.2 Initialization
```rust
// Initialize builtin roles (idempotent)
role_store.init_builtin_roles().await?;
```
### 12.3 Client Library
```rust
use iam_client::IamClient;
let client = IamClient::connect("http://127.0.0.1:9090").await?;
// Check authorization
let allowed = client.authorize(
PrincipalRef::user("alice"),
"compute:instances:create",
ResourceRef::new("instance", "org-1", "proj-1", "vm-1"),
).await?;
// Create binding
client.create_binding(CreateBindingRequest {
principal: PrincipalRef::user("alice"),
role: "roles/ProjectAdmin".into(),
scope: Scope::project("proj-1", "org-1"),
..Default::default()
}).await?;
```
### 12.4 Monitoring
**Metrics (Prometheus format)**:
| Metric | Type | Description |
|--------|------|-------------|
| `aegis_authz_requests_total` | Counter | Total authorization requests |
| `aegis_authz_decisions{result}` | Counter | Decisions by allow/deny |
| `aegis_authz_latency_seconds` | Histogram | Authorization latency |
| `aegis_token_issued_total` | Counter | Tokens issued |
| `aegis_token_validated_total` | Counter | Token validations |
| `aegis_cache_hits_total` | Counter | Policy cache hits |
| `aegis_cache_misses_total` | Counter | Policy cache misses |
| `aegis_bindings_total` | Gauge | Total active bindings |
| `aegis_principals_total` | Gauge | Total principals |
**Health Endpoints**:
- `GET /health` - Liveness check
- `GET /ready` - Readiness check (storage connected)
### 12.5 Backup & Recovery
**Backup**:
- Export all principals, roles, and bindings via Admin API
- Or snapshot underlying storage (Chainfire/FlareDB)
- Recommended: Daily full backup + continuous WAL archiving
**Recovery**:
- Restore from storage snapshot
- Or reimport via Admin API
- Builtin roles auto-created on startup
## 13. Compatibility
### 13.1 API Versioning
- gRPC package: `iam.v1`
- Semantic versioning for breaking changes
- Backward compatible additions within major version
- Deprecation warnings before removal
### 13.2 Wire Protocol
- Protocol Buffers v3
- gRPC with HTTP/2 transport
- TLS 1.3 required in production
### 13.3 Storage Migration
- Schema version tracked in metadata key
- Automatic migration on startup
- Backward compatible within major version
## Appendix
### A. Error Codes
| Error | Meaning |
|-------|---------|
| PRINCIPAL_NOT_FOUND | Principal does not exist |
| ROLE_NOT_FOUND | Role does not exist |
| BINDING_NOT_FOUND | Binding does not exist |
| BUILTIN_IMMUTABLE | Cannot modify builtin role |
| SCOPE_VIOLATION | Operation violates scope boundary |
| CONDITION_FAILED | Condition evaluation failed |
### B. Proto Scope Messages
```protobuf
message Scope {
oneof scope {
bool system = 1;
OrgScope org = 2;
ProjectScope project = 3;
ResourceScope resource = 4;
}
}
message OrgScope { string id = 1; }
message ProjectScope { string id = 1; string org_id = 2; }
message ResourceScope { string id = 1; string project_id = 2; string org_id = 3; }
```
### C. Port Assignments
| Port | Protocol | Purpose |
|------|----------|---------|
| 9090 | gRPC | IAM API |
### D. Performance Considerations
- Cache bindings and roles for hot path
- Batch authorization for bulk checks
- Prefix scans for hierarchical queries
- CAS for conflict-free updates
### E. Glossary
- **Principal**: An identity that can be authenticated (user, service account, or group)
- **Role**: A named collection of permissions that can be assigned to principals
- **Permission**: A specific action allowed on a resource pattern with optional conditions
- **Binding**: Assignment of a role to a principal within a specific scope
- **Scope**: Hierarchical boundary for permission application (System > Org > Project > Resource)
- **Condition**: ABAC expression that must evaluate to true for access to be granted
- **PDP**: Policy Decision Point - the authorization evaluation engine
- **RBAC**: Role-Based Access Control - permissions assigned via roles
- **ABAC**: Attribute-Based Access Control - permissions based on attributes/conditions
### F. Example Policies
**Allow user to manage own instances**:
```json
{
"principal": "user:alice",
"role": "roles/ProjectMember",
"scope": { "type": "project", "id": "web-app", "org_id": "acme" }
}
```
**Time-limited admin access**:
```json
{
"principal": "user:bob",
"role": "roles/ProjectAdmin",
"scope": { "type": "project", "id": "staging", "org_id": "acme" },
"expires_at": 1735689600,
"condition": {
"expression": {
"type": "time_between",
"start": "09:00",
"end": "18:00"
}
}
}
```
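The `time_between` condition in this example uses `HH:MM` bounds. A sketch of how such a window could be checked is shown below. It assumes a half-open interval (start inclusive, end exclusive) and that a window whose start is later than its end wraps past midnight. The specification does not state either rule, nor which time zone applies. The `parse_hhmm` helper and the `now` parameter are illustrative.

```rust
/// Parses "HH:MM" into minutes since midnight; `None` if malformed or out of range.
fn parse_hhmm(s: &str) -> Option<u32> {
    let (h, m) = s.split_once(':')?;
    let (h, m): (u32, u32) = (h.parse().ok()?, m.parse().ok()?);
    (h < 24 && m < 60).then_some(h * 60 + m)
}

/// Checks whether `now` lies in the window from `start` (inclusive) to `end` (exclusive).
/// Returns `None` when any input is malformed.
pub fn time_between(start: &str, end: &str, now: &str) -> Option<bool> {
    let (start, end, now) = (parse_hhmm(start)?, parse_hhmm(end)?, parse_hhmm(now)?);
    Some(if start <= end {
        start <= now && now < end
    } else {
        // Assumption: start > end means the window wraps past midnight.
        now >= start || now < end
    })
}
```

Under the default-deny model, a caller would treat `None` as a failed condition instead of letting malformed bounds grant access.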
**Node-bound service account**:
```json
{
"principal": "service_account:compute-agent-node-1",
"role": "roles/ServiceRole-ComputeAgent",
"scope": { "type": "system" },
"condition": {
"expression": {
"type": "string_equals",
"key": "resource.node",
"value": "${principal.node_id}"
}
}
}
```
**IP-restricted access**:
```json
{
"principal": "user:admin",
"role": "roles/SystemAdmin",
"scope": { "type": "system" },
"condition": {
"expression": {
"type": "ip_address",
"key": "request.source_ip",
"cidr": "10.0.0.0/8"
}
}
}
```
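For the `ip_address` condition, CIDR membership comes down to comparing masked addresses. The sketch below handles IPv4 only and uses the standard library in place of the `ipnetwork` crate listed in the dependencies; `ipv4_in_cidr` is a hypothetical helper name.

```rust
use std::net::Ipv4Addr;

/// Returns `Some(true)` when `ip` falls inside `cidr` (e.g. "10.0.0.0/8"),
/// `Some(false)` when it does not, and `None` when either input is malformed.
pub fn ipv4_in_cidr(ip: &str, cidr: &str) -> Option<bool> {
    let ip: Ipv4Addr = ip.parse().ok()?;
    let (network, prefix_len) = cidr.split_once('/')?;
    let network: Ipv4Addr = network.parse().ok()?;
    let prefix_len: u32 = prefix_len.parse().ok()?;
    if prefix_len > 32 {
        return None;
    }
    // A /0 mask is all zeros; shifting a u32 by 32 bits would overflow,
    // so that case is handled explicitly.
    let mask = if prefix_len == 0 { 0 } else { u32::MAX << (32 - prefix_len) };
    Some((u32::from(ip) & mask) == (u32::from(network) & mask))
}
```

Returning `None` for malformed input lets the caller treat an unparseable source IP or CIDR as a failed condition, which matches the default-deny stance in section 11.2.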