86 lines
3.4 KiB
Markdown
86 lines
3.4 KiB
Markdown
|
|
# Node Capability Model
|
||
|
|
|
||
|
|
This document defines the capability model for the homelab infrastructure. The goal is to provide a declarative way to describe what each node can do, its constraints, and its suitability for various workloads.
|
||
|
|
|
||
|
|
## Overview
|
||
|
|
|
||
|
|
Capabilities are defined per host in `hosts/<hostname>/capabilities.yaml`. This metadata allows infrastructure tooling and future AI agents to reason about workload placement, recovery, and compatibility without hardcoding logic into the orchestration system.
|
||
|
|
|
||
|
|
## Schema Definition
|
||
|
|
|
||
|
|
The `capabilities.yaml` file follows this structure:
|
||
|
|
|
||
|
|
```yaml
|
||
|
|
capabilities:
|
||
|
|
hardware:
|
||
|
|
cpu:
|
||
|
|
arch: <string> # e.g., x86_64, arm64
|
||
|
|
cores: <int>
|
||
|
|
threads: <int>
|
||
|
|
memory:
|
||
|
|
total_gb: <int>
|
||
|
|
acceleration:
|
||
|
|
type: <string> # e.g., none, cuda, tpu, vaapi
|
||
|
|
model: <string> # e.g., "NVIDIA RTX 3060", "Coral Edge TPU"
|
||
|
|
|
||
|
|
virtualization:
|
||
|
|
supported: <boolean>
|
||
|
|
type: <string> # e.g., kvm, docker-only
|
||
|
|
|
||
|
|
storage:
|
||
|
|
persistence: <string> # ephemeral, persistent, redundant
|
||
|
|
type: <string> # ssd, hdd, nvme, sd-card
|
||
|
|
capacity_gb: <int>
|
||
|
|
|
||
|
|
networking:
|
||
|
|
reachability: <string> # public, tailscale-only, lan-only
|
||
|
|
ingress_suitability: <boolean>
|
||
|
|
bandwidth: <string> # e.g., "1Gbps", "100Mbps", "LTE"
|
||
|
|
|
||
|
|
runtime:
|
||
|
|
container_engine: <string> # docker, podman, containerd
|
||
|
|
os: <string> # debian, ubuntu, alpine, nixos
|
||
|
|
|
||
|
|
operational:
|
||
|
|
power_constraint: <string> # low-power, mains, battery-backed
|
||
|
|
connectivity: <string> # stable, intermittent
|
||
|
|
availability_target: <string> # high, medium, best-effort
|
||
|
|
|
||
|
|
deployment:
|
||
|
|
suitability: [<string>] # list of workload types (e.g., ai, database, edge, web)
|
||
|
|
restricted: <boolean> # if true, only specific workloads are allowed
|
||
|
|
```
|
||
|
|
|
||
|
|
## Placement Reasoning Examples
|
||
|
|
|
||
|
|
### AI Workloads
|
||
|
|
A service requiring `cuda` acceleration will be matched against nodes where `capabilities.hardware.acceleration.type == "cuda"`.
|
||
|
|
* **Target:** `solaria`
|
||
|
|
|
||
|
|
### Public Ingress
|
||
|
|
A service requiring public exposure will look for `capabilities.networking.ingress_suitability == true`.
|
||
|
|
* **Target:** `vps`
|
||
|
|
|
||
|
|
### Low-Power Staging
|
||
|
|
Staging workloads that should not consume significant power or are tolerant of intermittent connectivity.
|
||
|
|
* **Target:** `chelsty`
|
||
|
|
|
||
|
|
## Recovery Reasoning Examples
|
||
|
|
|
||
|
|
### Failover Strategy
|
||
|
|
If `saturn` (the primary orchestrator) fails:
|
||
|
|
1. Identify nodes with `roles: [control]` or `roles: [infra]`.
|
||
|
|
2. Check `capabilities.operational.availability_target == "high"`.
|
||
|
|
3. Propose migration of critical infra services to `piha`.
|
||
|
|
|
||
|
|
### Storage-Bound Services
|
||
|
|
If a node with `persistence: persistent` fails, the agent must check if there are other nodes with `persistence: persistent` and compatible `storage.type` before attempting recovery, or warn about potential data loss if moved to an `ephemeral` node.
|
||
|
|
|
||
|
|
## Future Usage by AI Agents
|
||
|
|
|
||
|
|
Future autonomous agents will use this metadata to:
|
||
|
|
1. **Evaluate Suitability:** Match service requirements (from `service.yaml`) against node capabilities.
|
||
|
|
2. **Generate Plans:** Create step-by-step deployment or migration plans based on hardware compatibility.
|
||
|
|
3. **Validate Topology:** Ensure that a proposed multi-node setup doesn't violate networking or operational constraints (e.g., don't put a DB on an intermittent node).
|
||
|
|
4. **Propose Failover:** Automatically suggest the best alternative node during an outage.
|