Node Capability Model

This document defines the capability model for the homelab infrastructure. The goal is to provide a declarative way to describe what each node can do, its constraints, and its suitability for various workloads.

Overview

Capabilities are defined per host in hosts/<hostname>/capabilities.yaml. This metadata allows infrastructure tooling and future AI agents to reason about workload placement, recovery, and compatibility without hardcoding logic into the orchestration system.

Schema Definition

The capabilities.yaml file follows this structure:

capabilities:
  hardware:
    cpu:
      arch: <string>          # e.g., x86_64, arm64
      cores: <int>
      threads: <int>
    memory:
      total_gb: <int>
    acceleration:
      type: <string>          # e.g., none, cuda, tpu, vaapi
      model: <string>         # e.g., "NVIDIA RTX 3060", "Coral Edge TPU"
  
  virtualization:
    supported: <boolean>
    type: <string>            # e.g., kvm, docker-only
  
  storage:
    persistence: <string>     # ephemeral, persistent, redundant
    type: <string>            # ssd, hdd, nvme, sd-card
    capacity_gb: <int>
  
  networking:
    reachability: <string>    # public, tailscale-only, lan-only
    ingress_suitability: <boolean>
    bandwidth: <string>       # e.g., "1Gbps", "100Mbps", "LTE"
  
  runtime:
    container_engine: <string> # docker, podman, containerd
    os: <string>              # debian, ubuntu, alpine, nixos
  
  operational:
    power_constraint: <string> # low-power, mains, battery-backed
    connectivity: <string>     # stable, intermittent
    availability_target: <string> # high, medium, best-effort
  
  deployment:
    suitability: [<string>]    # list of workload types (e.g., ai, database, edge, web)
    restricted: <boolean>      # if true, only specific workloads are allowed
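As a rough illustration of how tooling might sanity-check a loaded capabilities.yaml, the sketch below validates the enumerated fields of an already-parsed document. The function name validate_capabilities is hypothetical, and treating the values listed in the schema comments as closed sets is an assumption of this sketch, not something the schema states.

```python
from typing import Any

# Allowed values copied from the schema comments above. Treating them as
# closed sets is an assumption of this sketch.
ALLOWED = {
    "storage.persistence": {"ephemeral", "persistent", "redundant"},
    "storage.type": {"ssd", "hdd", "nvme", "sd-card"},
    "networking.reachability": {"public", "tailscale-only", "lan-only"},
    "operational.power_constraint": {"low-power", "mains", "battery-backed"},
    "operational.connectivity": {"stable", "intermittent"},
    "operational.availability_target": {"high", "medium", "best-effort"},
}


def validate_capabilities(doc: dict[str, Any]) -> list[str]:
    """Return human-readable problems; an empty list means none were found."""
    caps = doc.get("capabilities")
    if not isinstance(caps, dict):
        return ["missing top-level 'capabilities' mapping"]
    problems = []
    for path, allowed in ALLOWED.items():
        section, key = path.split(".")
        section_data = caps.get(section)
        # A missing or malformed section is reported as a None value.
        value = section_data.get(key) if isinstance(section_data, dict) else None
        if value not in allowed:
            problems.append(
                f"capabilities.{path}: {value!r} not in {sorted(allowed)}"
            )
    return problems
```

The function takes a plain dictionary, so it works with whichever YAML parser the tooling already uses to load the file.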

Placement Reasoning Examples

AI Workloads

A service requiring cuda acceleration will be matched against nodes where capabilities.hardware.acceleration.type == "cuda".

  • Target: solaria
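The match described above can be sketched as a dotted-path lookup over parsed capability documents. The helper names lookup and matching_nodes are hypothetical, and the inventory values are illustrative rather than copied from the real hosts/ files.

```python
from typing import Any


def lookup(doc: Any, dotted_path: str) -> Any:
    """Walk a nested mapping along 'a.b.c'; return None if a key is missing."""
    node = doc
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def matching_nodes(inventory: dict[str, dict], requirements: dict[str, Any]) -> list[str]:
    """Return hosts whose capabilities satisfy every path == value requirement."""
    return sorted(
        host
        for host, doc in inventory.items()
        if all(lookup(doc, path) == want for path, want in requirements.items())
    )


# Illustrative values only (assumption), keyed by hostname.
inventory = {
    "solaria": {"capabilities": {"hardware": {"acceleration": {"type": "cuda"}}}},
    "vps": {"capabilities": {"hardware": {"acceleration": {"type": "none"}}}},
}

print(matching_nodes(inventory, {"capabilities.hardware.acceleration.type": "cuda"}))
# prints ['solaria']
```

Because requirements are plain path/value pairs, the same helper would cover other equality checks, such as capabilities.networking.ingress_suitability set to True.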

Public Ingress

A service requiring public exposure is matched against nodes where capabilities.networking.ingress_suitability == true.

  • Target: vps

Low-Power Staging

Staging workloads that should not consume significant power, or that tolerate intermittent connectivity, are matched on capabilities.operational.power_constraint and capabilities.operational.connectivity.

  • Target: chelsty
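One possible reading of this rule, sketched below, is to keep only low-power nodes and to accept intermittent ones only when the workload tolerates it. The function name staging_candidates, the tolerate_intermittent flag, and the inventory values are all assumptions made for illustration.

```python
def staging_candidates(inventory: dict[str, dict], tolerate_intermittent: bool = True) -> list[str]:
    """Return low-power hosts, optionally excluding intermittent ones."""
    result = []
    for host, doc in inventory.items():
        operational = doc["capabilities"]["operational"]
        if operational["power_constraint"] != "low-power":
            continue
        if operational["connectivity"] == "intermittent" and not tolerate_intermittent:
            continue
        result.append(host)
    return sorted(result)


# Illustrative values only (assumption), not the real host metadata.
inventory = {
    "chelsty": {"capabilities": {"operational": {
        "power_constraint": "low-power", "connectivity": "intermittent"}}},
    "saturn": {"capabilities": {"operational": {
        "power_constraint": "mains", "connectivity": "stable"}}},
}

print(staging_candidates(inventory))
# prints ['chelsty']
```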

Recovery Reasoning Examples

Failover Strategy

If saturn (the primary orchestrator) fails:

  1. Identify nodes with roles: [control] or roles: [infra].
  2. Among those, keep nodes where capabilities.operational.availability_target == "high".
  3. Propose migration of critical infra services to piha.
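The steps above can be sketched as a simple filter. The capability schema does not define roles, so this sketch assumes each host's metadata carries a top-level roles list alongside capabilities. The function name propose_failover and the inventory values, including the "edge" role, are hypothetical.

```python
def propose_failover(inventory: dict[str, dict], failed_host: str) -> list[str]:
    """Return candidate hosts that could take over critical infra services."""
    candidates = []
    for host, meta in inventory.items():
        if host == failed_host:
            continue
        # Step 1: the node must carry a control or infra role.
        if not set(meta.get("roles", [])) & {"control", "infra"}:
            continue
        # Step 2: the node must target high availability.
        operational = meta.get("capabilities", {}).get("operational", {})
        if operational.get("availability_target") != "high":
            continue
        candidates.append(host)
    # Step 3: the caller proposes migration to one of these candidates.
    return sorted(candidates)


# Illustrative values only (assumption), not the real host metadata.
inventory = {
    "saturn": {"roles": ["control"],
               "capabilities": {"operational": {"availability_target": "high"}}},
    "piha": {"roles": ["infra"],
             "capabilities": {"operational": {"availability_target": "high"}}},
    "chelsty": {"roles": ["edge"],
                "capabilities": {"operational": {"availability_target": "best-effort"}}},
}

print(propose_failover(inventory, "saturn"))
# prints ['piha']
```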

Storage-Bound Services

If a node with persistence: persistent fails, the agent must check whether another node offers persistence: persistent and a compatible storage.type before attempting recovery. If the service can only be moved to an ephemeral node, the agent must warn about potential data loss.
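A minimal sketch of that check follows. It treats a "compatible" storage.type as an identical one, which is a simplifying assumption. The function name recovery_options and the inventory values are hypothetical.

```python
def recovery_options(inventory: dict[str, dict], failed_host: str) -> tuple[list[str], list[str]]:
    """Return (safe recovery targets, warnings about risky nodes)."""
    failed = inventory[failed_host]["capabilities"]["storage"]
    targets, warnings = [], []
    for host in sorted(inventory):
        if host == failed_host:
            continue
        storage = inventory[host]["capabilities"]["storage"]
        # Assumption: "compatible" means the same storage type.
        if storage["persistence"] == "persistent" and storage["type"] == failed["type"]:
            targets.append(host)
        elif storage["persistence"] == "ephemeral":
            warnings.append(f"{host}: ephemeral storage, moving data here risks data loss")
    return targets, warnings


# Illustrative values only (assumption), not the real host metadata.
inventory = {
    "saturn": {"capabilities": {"storage": {"persistence": "persistent", "type": "ssd"}}},
    "piha": {"capabilities": {"storage": {"persistence": "persistent", "type": "ssd"}}},
    "chelsty": {"capabilities": {"storage": {"persistence": "ephemeral", "type": "sd-card"}}},
}

targets, warnings = recovery_options(inventory, "saturn")
print(targets)
# prints ['piha']
```

Nodes that are persistent but use a different storage type are neither proposed nor warned about here. A fuller implementation would need an explicit compatibility table.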

Future Usage by AI Agents

Future autonomous agents will use this metadata to:

  1. Evaluate Suitability: Match service requirements (from service.yaml) against node capabilities.
  2. Generate Plans: Create step-by-step deployment or migration plans based on hardware compatibility.
  3. Validate Topology: Ensure that a proposed multi-node setup doesn't violate networking or operational constraints (e.g., don't put a DB on an intermittent node).
  4. Propose Failover: Automatically suggest the best alternative node during an outage.
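As one hedged illustration of the topology validation step, the sketch below checks two rules against a proposed placement: no database on a node with intermittent connectivity, and no workload type outside deployment.suitability on a restricted node. Reading restricted as "only the listed suitability types are allowed" is an assumption, as are the function name validate_topology, the placement and workload_types mappings, and the sample values.

```python
def validate_topology(placement: dict[str, str],
                      workload_types: dict[str, str],
                      inventory: dict[str, dict]) -> list[str]:
    """Return a list of constraint violations for a service-to-host placement."""
    violations = []
    for service, host in sorted(placement.items()):
        caps = inventory[host]["capabilities"]
        kind = workload_types[service]
        if kind == "database" and caps["operational"]["connectivity"] == "intermittent":
            violations.append(f"{service}: database on intermittent node {host}")
        deployment = caps["deployment"]
        # Assumption: a restricted node accepts only its listed workload types.
        if deployment["restricted"] and kind not in deployment["suitability"]:
            violations.append(f"{service}: {kind} not allowed on restricted node {host}")
    return violations


# Illustrative values only (assumption), not the real host metadata.
inventory = {
    "chelsty": {"capabilities": {
        "operational": {"connectivity": "intermittent"},
        "deployment": {"suitability": ["edge"], "restricted": False}}},
    "vps": {"capabilities": {
        "operational": {"connectivity": "stable"},
        "deployment": {"suitability": ["web"], "restricted": True}}},
}
placement = {"postgres": "chelsty", "blog": "vps", "llm": "vps"}
workload_types = {"postgres": "database", "blog": "web", "llm": "ai"}

for problem in validate_topology(placement, workload_types, inventory):
    print(problem)
# prints:
# llm: ai not allowed on restricted node vps
# postgres: database on intermittent node chelsty
```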