homelab-codex-ws/docs/capabilities.md

# Node Capability Model

This document defines the capability model for the homelab infrastructure. The goal is to provide a declarative way to describe what each node can do, its constraints, and its suitability for various workloads.

## Overview

Capabilities are defined per host in `hosts/<hostname>/capabilities.yaml`. This metadata allows infrastructure tooling and future AI agents to reason about workload placement, recovery, and compatibility without hardcoding logic into the orchestration system.

## Schema Definition

The `capabilities.yaml` file follows this structure:

```yaml
capabilities:
  hardware:
    cpu:
      arch: <string>          # e.g., x86_64, arm64
      cores: <int>
      threads: <int>
    memory:
      total_gb: <int>
    acceleration:
      type: <string>          # e.g., none, cuda, tpu, vaapi
      model: <string>         # e.g., "NVIDIA RTX 3060", "Coral Edge TPU"
  
  virtualization:
    supported: <boolean>
    type: <string>            # e.g., kvm, docker-only
  
  storage:
    persistence: <string>     # ephemeral, persistent, redundant
    type: <string>            # ssd, hdd, nvme, sd-card
    capacity_gb: <int>
  
  networking:
    reachability: <string>    # public, tailscale-only, lan-only
    ingress_suitability: <boolean>
    bandwidth: <string>       # e.g., "1Gbps", "100Mbps", "LTE"
  
  runtime:
    container_engine: <string> # docker, podman, containerd
    os: <string>              # debian, ubuntu, alpine, nixos
  
  operational:
    power_constraint: <string> # low-power, mains, battery-backed
    connectivity: <string>     # stable, intermittent
    availability_target: <string> # high, medium, best-effort
  
  deployment:
    suitability: [<string>]    # list of workload types (e.g., ai, database, edge, web)
    restricted: <boolean>      # if true, only specific workloads are allowed
```

## Placement Reasoning Examples

### AI Workloads
A service requiring `cuda` acceleration will be matched against nodes where `capabilities.hardware.acceleration.type == "cuda"`.
*   **Target:** `solaria`

### Public Ingress
A service requiring public exposure will look for `capabilities.networking.ingress_suitability == true`.
*   **Target:** `vps`

### Low-Power Staging
Staging workloads that should not consume significant power or are tolerant of intermittent connectivity.
*   **Target:** `chelsty`

## Recovery Reasoning Examples

### Failover Strategy
If `saturn` (the primary orchestrator) fails:
1.  Identify nodes with `roles: [control]` or `roles: [infra]`.
2.  Check `capabilities.operational.availability_target == "high"`.
3.  Propose migration of critical infra services to `piha`.

### Storage-Bound Services
If a node with `persistence: persistent` fails, the agent must check if there are other nodes with `persistence: persistent` and compatible `storage.type` before attempting recovery, or warn about potential data loss if moved to an `ephemeral` node.

## Future Usage by AI Agents

Future autonomous agents will use this metadata to:
1.  **Evaluate Suitability:** Match service requirements (from `service.yaml`) against node capabilities.
2.  **Generate Plans:** Create step-by-step deployment or migration plans based on hardware compatibility.
3.  **Validate Topology:** Ensure that a proposed multi-node setup doesn't violate networking or operational constraints (e.g., don't put a DB on an intermittent node).
4.  **Propose Failover:** Automatically suggest the best alternative node during an outage.
Add node capability model 2026-05-11 20:46:50 +02:00			`# Node Capability Model`

			`This document defines the capability model for the homelab infrastructure. The goal is to provide a declarative way to describe what each node can do, its constraints, and its suitability for various workloads.`

			`## Overview`

			Capabilities are defined per host in `hosts/<hostname>/capabilities.yaml`. This metadata allows infrastructure tooling and future AI agents to reason about workload placement, recovery, and compatibility without hardcoding logic into the orchestration system.

			`## Schema Definition`

			The `capabilities.yaml` file follows this structure:

			```yaml
			`capabilities:`
			`hardware:`
			`cpu:`
			`arch: <string> # e.g., x86_64, arm64`
			`cores: <int>`
			`threads: <int>`
			`memory:`
			`total_gb: <int>`
			`acceleration:`
			`type: <string> # e.g., none, cuda, tpu, vaapi`
			`model: <string> # e.g., "NVIDIA RTX 3060", "Coral Edge TPU"`

			`virtualization:`
			`supported: <boolean>`
			`type: <string> # e.g., kvm, docker-only`

			`storage:`
			`persistence: <string> # ephemeral, persistent, redundant`
			`type: <string> # ssd, hdd, nvme, sd-card`
			`capacity_gb: <int>`

			`networking:`
			`reachability: <string> # public, tailscale-only, lan-only`
			`ingress_suitability: <boolean>`
			`bandwidth: <string> # e.g., "1Gbps", "100Mbps", "LTE"`

			`runtime:`
			`container_engine: <string> # docker, podman, containerd`
			`os: <string> # debian, ubuntu, alpine, nixos`

			`operational:`
			`power_constraint: <string> # low-power, mains, battery-backed`
			`connectivity: <string> # stable, intermittent`
			`availability_target: <string> # high, medium, best-effort`

			`deployment:`
			`suitability: [<string>] # list of workload types (e.g., ai, database, edge, web)`
			`restricted: <boolean> # if true, only specific workloads are allowed`
			```

			`## Placement Reasoning Examples`

			`### AI Workloads`
			A service requiring `cuda` acceleration will be matched against nodes where `capabilities.hardware.acceleration.type == "cuda"`.
			* Target: `solaria`

			`### Public Ingress`
			A service requiring public exposure will look for `capabilities.networking.ingress_suitability == true`.
			* Target: `vps`

			`### Low-Power Staging`
			`Staging workloads that should not consume significant power or are tolerant of intermittent connectivity.`
			* Target: `chelsty`

			`## Recovery Reasoning Examples`

			`### Failover Strategy`
			If `saturn` (the primary orchestrator) fails:
			1. Identify nodes with `roles: [control]` or `roles: [infra]`.
			2. Check `capabilities.operational.availability_target == "high"`.
			3. Propose migration of critical infra services to `piha`.

			`### Storage-Bound Services`
			If a node with `persistence: persistent` fails, the agent must check if there are other nodes with `persistence: persistent` and compatible `storage.type` before attempting recovery, or warn about potential data loss if moved to an `ephemeral` node.

			`## Future Usage by AI Agents`

			`Future autonomous agents will use this metadata to:`
			1. Evaluate Suitability: Match service requirements (from `service.yaml`) against node capabilities.
			`2. Generate Plans: Create step-by-step deployment or migration plans based on hardware compatibility.`
			`3. Validate Topology: Ensure that a proposed multi-node setup doesn't violate networking or operational constraints (e.g., don't put a DB on an intermittent node).`
			`4. Propose Failover: Automatically suggest the best alternative node during an outage.`