Node Capability Model
This document defines the capability model for the homelab infrastructure. The goal is to provide a declarative way to describe what each node can do, its constraints, and its suitability for various workloads.
Overview
Capabilities are defined per host in hosts/<hostname>/capabilities.yaml. This metadata allows infrastructure tooling and future AI agents to reason about workload placement, recovery, and compatibility without hardcoding logic into the orchestration system.
Schema Definition
The capabilities.yaml file follows this structure:
capabilities:
  hardware:
    cpu:
      arch: <string>                # e.g., x86_64, arm64
      cores: <int>
      threads: <int>
    memory:
      total_gb: <int>
    acceleration:
      type: <string>                # e.g., none, cuda, tpu, vaapi
      model: <string>               # e.g., "NVIDIA RTX 3060", "Coral Edge TPU"
    virtualization:
      supported: <boolean>
      type: <string>                # e.g., kvm, docker-only
  storage:
    persistence: <string>           # ephemeral, persistent, redundant
    type: <string>                  # ssd, hdd, nvme, sd-card
    capacity_gb: <int>
  networking:
    reachability: <string>          # public, tailscale-only, lan-only
    ingress_suitability: <boolean>
    bandwidth: <string>             # e.g., "1Gbps", "100Mbps", "LTE"
  runtime:
    container_engine: <string>      # docker, podman, containerd
    os: <string>                    # debian, ubuntu, alpine, nixos
  operational:
    power_constraint: <string>      # low-power, mains, battery-backed
    connectivity: <string>          # stable, intermittent
    availability_target: <string>   # high, medium, best-effort
  deployment:
    suitability: [<string>]         # list of workload types (e.g., ai, database, edge, web)
    restricted: <boolean>           # if true, only specific workloads are allowed
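A quick structural check can catch incomplete files before any placement logic runs. The following is a minimal sketch, assuming the file has already been parsed into a Python dict and that all six top-level sections are mandatory (the schema above does not say which sections are optional). The function name `missing_sections` and the sample values are hypothetical.

```python
# Top-level sections taken from the schema above.
# Treating all of them as required is an assumption of this sketch.
REQUIRED_SECTIONS = (
    "hardware", "storage", "networking", "runtime", "operational", "deployment",
)


def missing_sections(doc: dict) -> list[str]:
    """Return the schema sections absent from a parsed capabilities.yaml document."""
    caps = doc.get("capabilities")
    if not isinstance(caps, dict):
        return ["capabilities"]
    return [name for name in REQUIRED_SECTIONS if name not in caps]


# Hypothetical, partially filled-in document (values are illustrative only).
example = {
    "capabilities": {
        "hardware": {"cpu": {"arch": "x86_64", "cores": 4, "threads": 8}},
        "storage": {"persistence": "persistent", "type": "ssd", "capacity_gb": 500},
        "networking": {"reachability": "lan-only", "ingress_suitability": False},
    }
}

print(missing_sections(example))  # ['runtime', 'operational', 'deployment']
```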
Placement Reasoning Examples
AI Workloads
A service requiring cuda acceleration will be matched against nodes where capabilities.hardware.acceleration.type == "cuda".
- Target: solaria
Public Ingress
A service requiring public exposure will look for capabilities.networking.ingress_suitability == true.
- Target: vps
Low-Power Staging
Staging workloads that should not consume significant power, or that tolerate intermittent connectivity, can run on a low-power node.
- Target: chelsty
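The explicit placement rules above (cuda acceleration, ingress suitability) reduce to equality checks on dotted paths into each node's capabilities. A minimal sketch of that idea follows. The helper names `lookup` and `matching_nodes` are hypothetical, and the inventory values are illustrative: only solaria's cuda acceleration and vps's ingress suitability are implied by the examples above.

```python
from functools import reduce


def lookup(doc, dotted_path):
    """Follow a dotted path such as 'capabilities.networking.reachability'.

    Returns None when any segment of the path is missing.
    """
    def step(node, key):
        return node.get(key) if isinstance(node, dict) else None

    return reduce(step, dotted_path.split("."), doc)


def matching_nodes(nodes, requirements):
    """Return names of nodes whose capabilities equal every required value."""
    return [
        name
        for name, doc in nodes.items()
        if all(lookup(doc, path) == want for path, want in requirements.items())
    ]


# Illustrative inventory; values not implied by the text are assumptions.
nodes = {
    "solaria": {"capabilities": {
        "hardware": {"acceleration": {"type": "cuda"}},
        "networking": {"ingress_suitability": False},
    }},
    "vps": {"capabilities": {
        "hardware": {"acceleration": {"type": "none"}},
        "networking": {"ingress_suitability": True},
    }},
}

ai_targets = matching_nodes(nodes, {"capabilities.hardware.acceleration.type": "cuda"})
ingress_targets = matching_nodes(nodes, {"capabilities.networking.ingress_suitability": True})
print(ai_targets)       # ['solaria']
print(ingress_targets)  # ['vps']
```

Keeping requirements as a path-to-value mapping means new rules need no new code, which fits the goal of not hardcoding logic into the orchestration system.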
Recovery Reasoning Examples
Failover Strategy
If saturn (the primary orchestrator) fails:
- Identify nodes with roles: [control] or roles: [infra].
- Check capabilities.operational.availability_target == "high".
- Propose migration of critical infra services to piha.
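The first two failover steps can be sketched as a filter over the host inventory. This is a rough illustration, not the orchestrator's actual logic: it assumes each host's metadata exposes a `roles` list alongside `capabilities` (the schema above does not define where `roles` lives), and the role and availability values shown for each host are illustrative.

```python
def failover_candidates(nodes, failed, wanted_roles=("control", "infra")):
    """Nodes other than `failed` that hold a wanted role and target high availability."""
    candidates = []
    for name, meta in nodes.items():
        if name == failed:
            continue
        has_role = any(role in wanted_roles for role in meta.get("roles", []))
        operational = meta.get("capabilities", {}).get("operational", {})
        if has_role and operational.get("availability_target") == "high":
            candidates.append(name)
    return candidates


# Illustrative inventory; the roles and targets below are assumptions.
nodes = {
    "saturn": {"roles": ["control"],
               "capabilities": {"operational": {"availability_target": "high"}}},
    "piha": {"roles": ["infra"],
             "capabilities": {"operational": {"availability_target": "high"}}},
    "chelsty": {"roles": ["staging"],
                "capabilities": {"operational": {"availability_target": "best-effort"}}},
}

print(failover_candidates(nodes, failed="saturn"))  # ['piha']
```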
Storage-Bound Services
If a node with persistence: persistent fails, the agent must first check for other nodes with persistence: persistent and a compatible storage.type before attempting recovery. If the service would instead have to move to an ephemeral node, the agent must warn about potential data loss.
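One way to express this check is sketched below. It assumes that "compatible" means an identical storage.type, which is a simplification the text does not confirm. The function name `recovery_targets` and the node names are hypothetical.

```python
def recovery_targets(failed_storage, survivors):
    """Return (safe_targets, warning) for a service leaving a failed node.

    `failed_storage` is the failed node's `storage` section; `survivors` maps
    node name -> `storage` section. Compatibility is assumed to mean an
    identical storage type.
    """
    if failed_storage.get("persistence") != "persistent":
        return list(survivors), None  # nothing persistent to protect

    safe = [
        name
        for name, storage in survivors.items()
        if storage.get("persistence") == "persistent"
        and storage.get("type") == failed_storage.get("type")
    ]
    warning = None if safe else (
        "no persistent, compatible node available: moving the service risks data loss"
    )
    return safe, warning


# Hypothetical nodes and values, for illustration only.
failed = {"persistence": "persistent", "type": "ssd"}
survivors = {
    "node-a": {"persistence": "persistent", "type": "ssd"},
    "node-b": {"persistence": "ephemeral", "type": "sd-card"},
}

print(recovery_targets(failed, survivors))
# (['node-a'], None)
```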
Future Usage by AI Agents
Future autonomous agents will use this metadata to:
- Evaluate Suitability: Match service requirements (from service.yaml) against node capabilities.
- Generate Plans: Create step-by-step deployment or migration plans based on hardware compatibility.
- Validate Topology: Ensure that a proposed multi-node setup doesn't violate networking or operational constraints (e.g., don't put a DB on an intermittent node).
- Propose Failover: Automatically suggest the best alternative node during an outage.
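As an illustration of the topology validation item, the sketch below checks a proposed placement against two constraints: the "no DB on an intermittent node" example, and the deployment.restricted flag from the schema. The plan format, the function name `topology_violations`, and all node and service names are hypothetical. A real agent would presumably read requirements from service.yaml, whose format is not defined here.

```python
def topology_violations(placement, nodes):
    """Return human-readable problems for a {service: (workload_type, node)} plan."""
    problems = []
    for service, (workload, node) in placement.items():
        caps = nodes[node]
        operational = caps.get("operational", {})
        deployment = caps.get("deployment", {})

        # Rule from the text: databases should not land on intermittent nodes.
        if workload == "database" and operational.get("connectivity") == "intermittent":
            problems.append(f"{service}: database placed on intermittent node {node}")

        # Schema rule: restricted nodes accept only their listed workload types.
        if deployment.get("restricted") and workload not in deployment.get("suitability", []):
            problems.append(f"{service}: node {node} is restricted and does not allow {workload}")
    return problems


# Hypothetical nodes and plan, for illustration only.
nodes = {
    "node-a": {"operational": {"connectivity": "stable"},
               "deployment": {"suitability": ["database", "web"], "restricted": False}},
    "node-b": {"operational": {"connectivity": "intermittent"},
               "deployment": {"suitability": ["edge"], "restricted": True}},
}
plan = {"postgres": ("database", "node-b"), "blog": ("web", "node-a")}

for problem in topology_violations(plan, nodes):
    print(problem)
```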