# Node Capability Model This document defines the capability model for the homelab infrastructure. The goal is to provide a declarative way to describe what each node can do, its constraints, and its suitability for various workloads. ## Overview Capabilities are defined per host in `hosts//capabilities.yaml`. This metadata allows infrastructure tooling and future AI agents to reason about workload placement, recovery, and compatibility without hardcoding logic into the orchestration system. ## Schema Definition The `capabilities.yaml` file follows this structure: ```yaml capabilities: hardware: cpu: arch: # e.g., x86_64, arm64 cores: threads: memory: total_gb: acceleration: type: # e.g., none, cuda, tpu, vaapi model: # e.g., "NVIDIA RTX 3060", "Coral Edge TPU" virtualization: supported: type: # e.g., kvm, docker-only storage: persistence: # ephemeral, persistent, redundant type: # ssd, hdd, nvme, sd-card capacity_gb: networking: reachability: # public, tailscale-only, lan-only ingress_suitability: bandwidth: # e.g., "1Gbps", "100Mbps", "LTE" runtime: container_engine: # docker, podman, containerd os: # debian, ubuntu, alpine, nixos operational: power_constraint: # low-power, mains, battery-backed connectivity: # stable, intermittent availability_target: # high, medium, best-effort deployment: suitability: [] # list of workload types (e.g., ai, database, edge, web) restricted: # if true, only specific workloads are allowed ``` ## Placement Reasoning Examples ### AI Workloads A service requiring `cuda` acceleration will be matched against nodes where `capabilities.hardware.acceleration.type == "cuda"`. * **Target:** `solaria` ### Public Ingress A service requiring public exposure will look for `capabilities.networking.ingress_suitability == true`. * **Target:** `vps` ### Low-Power Staging Staging workloads that should not consume significant power or are tolerant of intermittent connectivity. * **Target:** `chelsty` ## Recovery Reasoning Examples ### Failover Strategy If `saturn` (the primary orchestrator) fails: 1. Identify nodes with `roles: [control]` or `roles: [infra]`. 2. Check `capabilities.operational.availability_target == "high"`. 3. Propose migration of critical infra services to `piha`. ### Storage-Bound Services If a node with `persistence: persistent` fails, the agent must check if there are other nodes with `persistence: persistent` and compatible `storage.type` before attempting recovery, or warn about potential data loss if moved to an `ephemeral` node. ## Future Usage by AI Agents Future autonomous agents will use this metadata to: 1. **Evaluate Suitability:** Match service requirements (from `service.yaml`) against node capabilities. 2. **Generate Plans:** Create step-by-step deployment or migration plans based on hardware compatibility. 3. **Validate Topology:** Ensure that a proposed multi-node setup doesn't violate networking or operational constraints (e.g., don't put a DB on an intermittent node). 4. **Propose Failover:** Automatically suggest the best alternative node during an outage. ## Agent Reasoning Logic When an agent parses `capabilities.yaml`, it should apply these heuristics: - **Intermittent Connectivity**: If `operational.connectivity == "intermittent"`, do not schedule high-bandwidth syncs or critical cloud-dependent services. - **Power Constraints**: If `operational.power_constraint == "low-power"`, avoid heavy LLM inference or continuous high-CPU tasks. - **Availability Target**: If `availability_target == "high"`, this node is a candidate for hosting control-plane failovers.