homelab-codex-ws/docs/capabilities.md

93 lines
3.9 KiB
Markdown
Raw Normal View History

2026-05-11 20:46:50 +02:00
# Node Capability Model
This document defines the capability model for the homelab infrastructure. The goal is to provide a declarative way to describe what each node can do, its constraints, and its suitability for various workloads.
## Overview
Capabilities are defined per host in `hosts/<hostname>/capabilities.yaml`. This metadata allows infrastructure tooling and future AI agents to reason about workload placement, recovery, and compatibility without hardcoding logic into the orchestration system.
## Schema Definition
The `capabilities.yaml` file follows this structure:
```yaml
capabilities:
hardware:
cpu:
arch: <string> # e.g., x86_64, arm64
cores: <int>
threads: <int>
memory:
total_gb: <int>
acceleration:
type: <string> # e.g., none, cuda, tpu, vaapi
model: <string> # e.g., "NVIDIA RTX 3060", "Coral Edge TPU"
virtualization:
supported: <boolean>
type: <string> # e.g., kvm, docker-only
storage:
persistence: <string> # ephemeral, persistent, redundant
type: <string> # ssd, hdd, nvme, sd-card
capacity_gb: <int>
networking:
reachability: <string> # public, tailscale-only, lan-only
ingress_suitability: <boolean>
bandwidth: <string> # e.g., "1Gbps", "100Mbps", "LTE"
runtime:
container_engine: <string> # docker, podman, containerd
os: <string> # debian, ubuntu, alpine, nixos
operational:
power_constraint: <string> # low-power, mains, battery-backed
connectivity: <string> # stable, intermittent
availability_target: <string> # high, medium, best-effort
deployment:
suitability: [<string>] # list of workload types (e.g., ai, database, edge, web)
restricted: <boolean> # if true, only specific workloads are allowed
```
## Placement Reasoning Examples
### AI Workloads
A service requiring `cuda` acceleration will be matched against nodes where `capabilities.hardware.acceleration.type == "cuda"`.
* **Target:** `solaria`
### Public Ingress
A service requiring public exposure will look for `capabilities.networking.ingress_suitability == true`.
* **Target:** `vps`
### Low-Power Staging
Staging workloads that should not consume significant power or are tolerant of intermittent connectivity.
* **Target:** `chelsty`
## Recovery Reasoning Examples
### Failover Strategy
If `saturn` (the primary orchestrator) fails:
1. Identify nodes with `roles: [control]` or `roles: [infra]`.
2. Check `capabilities.operational.availability_target == "high"`.
3. Propose migration of critical infra services to `piha`.
### Storage-Bound Services
If a node with `persistence: persistent` fails, the agent must check if there are other nodes with `persistence: persistent` and compatible `storage.type` before attempting recovery, or warn about potential data loss if moved to an `ephemeral` node.
## Future Usage by AI Agents
Future autonomous agents will use this metadata to:
1. **Evaluate Suitability:** Match service requirements (from `service.yaml`) against node capabilities.
2. **Generate Plans:** Create step-by-step deployment or migration plans based on hardware compatibility.
3. **Validate Topology:** Ensure that a proposed multi-node setup doesn't violate networking or operational constraints (e.g., don't put a DB on an intermittent node).
4. **Propose Failover:** Automatically suggest the best alternative node during an outage.
## Agent Reasoning Logic
When an agent parses `capabilities.yaml`, it should apply these heuristics:
- **Intermittent Connectivity**: If `operational.connectivity == "intermittent"`, do not schedule high-bandwidth syncs or critical cloud-dependent services.
- **Power Constraints**: If `operational.power_constraint == "low-power"`, avoid heavy LLM inference or continuous high-CPU tasks.
- **Availability Target**: If `availability_target == "high"`, this node is a candidate for hosting control-plane failovers.