# Node Capability Model

This document defines the capability model for the homelab infrastructure. The goal is to provide a declarative way to describe what each node can do, its constraints, and its suitability for various workloads.

## Overview

Capabilities are defined per host in `hosts/<hostname>/capabilities.yaml`. This metadata allows infrastructure tooling and future AI agents to reason about workload placement, recovery, and compatibility without hardcoding logic into the orchestration system.

## Schema Definition

The `capabilities.yaml` file follows this structure:

```yaml
capabilities:
  hardware:
    cpu:
      arch: <string>          # e.g., x86_64, arm64
      cores: <int>
      threads: <int>
    memory:
      total_gb: <int>
    acceleration:
      type: <string>          # e.g., none, cuda, tpu, vaapi
      model: <string>         # e.g., "NVIDIA RTX 3060", "Coral Edge TPU"
  
  virtualization:
    supported: <boolean>
    type: <string>            # e.g., kvm, docker-only
  
  storage:
    persistence: <string>     # ephemeral, persistent, redundant
    type: <string>            # ssd, hdd, nvme, sd-card
    capacity_gb: <int>
  
  networking:
    reachability: <string>    # public, tailscale-only, lan-only
    ingress_suitability: <boolean>
    bandwidth: <string>       # e.g., "1Gbps", "100Mbps", "LTE"
  
  runtime:
    container_engine: <string> # docker, podman, containerd
    os: <string>              # debian, ubuntu, alpine, nixos
  
  operational:
    power_constraint: <string> # low-power, mains, battery-backed
    connectivity: <string>     # stable, intermittent
    availability_target: <string> # high, medium, best-effort
  
  deployment:
    suitability: [<string>]    # list of workload types (e.g., ai, database, edge, web)
    restricted: <boolean>      # if true, only specific workloads are allowed
```

## Placement Reasoning Examples

### AI Workloads
A service requiring `cuda` acceleration will be matched against nodes where `capabilities.hardware.acceleration.type == "cuda"`.
*   **Target:** `solaria`

### Public Ingress
A service requiring public exposure will look for `capabilities.networking.ingress_suitability == true`.
*   **Target:** `vps`

### Low-Power Staging
Staging workloads that should not consume significant power or are tolerant of intermittent connectivity.
*   **Target:** `chelsty`

## Recovery Reasoning Examples

### Failover Strategy
If `saturn` (the primary orchestrator) fails:
1.  Identify nodes with `roles: [control]` or `roles: [infra]`.
2.  Check `capabilities.operational.availability_target == "high"`.
3.  Propose migration of critical infra services to `piha`.

### Storage-Bound Services
If a node with `persistence: persistent` fails, the agent must check if there are other nodes with `persistence: persistent` and compatible `storage.type` before attempting recovery, or warn about potential data loss if moved to an `ephemeral` node.

## Future Usage by AI Agents

Future autonomous agents will use this metadata to:
1.  **Evaluate Suitability:** Match service requirements (from `service.yaml`) against node capabilities.
2.  **Generate Plans:** Create step-by-step deployment or migration plans based on hardware compatibility.
3.  **Validate Topology:** Ensure that a proposed multi-node setup doesn't violate networking or operational constraints (e.g., don't put a DB on an intermittent node).
4.  **Propose Failover:** Automatically suggest the best alternative node during an outage.

## Agent Reasoning Logic

When an agent parses `capabilities.yaml`, it should apply these heuristics:
- **Intermittent Connectivity**: If `operational.connectivity == "intermittent"`, do not schedule high-bandwidth syncs or critical cloud-dependent services.
- **Power Constraints**: If `operational.power_constraint == "low-power"`, avoid heavy LLM inference or continuous high-CPU tasks.
- **Availability Target**: If `availability_target == "high"`, this node is a candidate for hosting control-plane failovers.