# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## What This Repo Is

GitOps-lite orchestration for a distributed homelab. The repo is the source of truth for infrastructure definitions; runtime state lives at `/opt/homelab/` on each execution node and is never committed.

## Node Roles

| Host | Role |
|------|------|
| **SATURN** | Primary control node — only node where commits are made |
| **SOLARIA** | GPU/compute/AI workloads |
| **PIHA** | Infra, monitoring |
| **VPS** | Public ingress, reverse proxy, control plane host |
| **CHELSTY-INFRA** | LTE edge hypervisor (site: chelsty); Zigbee2MQTT, Mosquitto, stability-agent — offline-first |
| **CHELSTY-HA** | LTE Home Assistant VM (site: chelsty); connects to CHELSTY-INFRA MQTT broker — offline-first |

All nodes communicate over Tailscale. CHELSTY-INFRA and CHELSTY-HA have an intermittent LTE uplink; their services must never depend on SATURN, VPS, or Forgejo at runtime.

Full node capabilities: `hosts//capabilities.yaml`.

## Deployment

```bash
scripts/deploy/deploy.sh                          # fresh deploy on current node
scripts/deploy/deploy.sh --resume                 # resume after interruption
scripts/deploy/deploy.sh --stage verify           # specific stage only
scripts/deploy/deploy.sh --service mosquitto      # specific service only
./scripts/deploy/deploy-control-plane.sh --ssh    # SATURN/SOLARIA → VPS
./scripts/deploy/deploy-node.sh chelsty-infra     # CHELSTY nodes (individually)
./scripts/bootstrap/prepare-node.sh               # general node bootstrap
./scripts/bootstrap/chelsty-runtime.sh            # CHELSTY-specific bootstrap
```

Pipeline stages: **prepare → validate → deploy → verify → diagnose (on failure) → complete**. Stage state persisted in `/opt/homelab/state/deploy/`.
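
The stage order and the `--resume` behaviour can be pictured as a small state machine. The sketch below is illustrative only and is not how `deploy.sh` is implemented: the names `next_stage` and `remaining_stages` are hypothetical, and it assumes that a failed stage jumps to `diagnose` and that the run stops there.

```python
from typing import List, Optional

# Happy-path order as listed in this file; `diagnose` is entered only on failure.
STAGES: List[str] = ["prepare", "validate", "deploy", "verify", "complete"]


def next_stage(current: str, succeeded: bool) -> Optional[str]:
    """Return the stage that follows `current`, or None when the run ends."""
    if current in ("diagnose", "complete"):
        return None  # assumption: both are terminal in this sketch
    if not succeeded:
        return "diagnose"
    return STAGES[STAGES.index(current) + 1]


def remaining_stages(last_completed: Optional[str]) -> List[str]:
    """Stages a resumed run would still need, given the last completed stage."""
    if last_completed is None:
        return list(STAGES)
    return STAGES[STAGES.index(last_completed) + 1:]


if __name__ == "__main__":
    print(next_stage("deploy", succeeded=False))
    print(remaining_stages("validate"))
```

A resumed run would presumably read the last completed stage from the state directory; where and how that marker is stored is not specified here, so the sketch takes it as an argument.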
## Service Structure

Every service must follow this layout:

```
services//
├── docker-compose.yml
├── service.yaml          # Machine-readable contract (primary source of truth for agents)
├── README.md
├── env.example           # Template — never commit actual secrets
└── healthcheck.sh        # Returns 0 (healthy) or 1 (unhealthy)
```

`service.yaml` defines `owner_node`, `exposure`, `dependencies`, `healthcheck`, `restart_policy`, `persistence.paths`, and `runtime.env_vars`. This is what AI agents read to understand how to manage a service.

Host-specific runtime config and secrets live at `/opt/homelab/config//` on the target node (not in Git). Docker Compose overrides are version-controlled at `hosts//runtime//docker-compose.override.yml` in this repo and applied during deployment.

## Agent System Architecture

The platform uses a multi-agent model with **human-in-the-loop** for destructive actions:

1. **Stability Agent** (`services/stability-agent/`) — Per-node watchdog. Monitors Docker containers, disk, Tailscale, MQTT. Emits filesystem events. Does NOT restart services autonomously.
2. **Observer** (`services/control-plane/src/`) — Synthesizes world state from events into `/opt/homelab/world/{nodes,services,deployments,incidents}.json`.
3. **Supervisor** — Detects drift between desired state (from `hosts/*/services.yaml`) and actual state (from Observer output). Writes `pending` action JSON files.
4. **Executor** — Executes actions only after they transition to `approved`.
5. **Operator UI** + **Telegram Bot** — Operators review and approve/reject pending actions.

### Action approval flow

```
Agent → /opt/homelab/actions/pending/.json
      → Telegram notification → Operator approves
      → /opt/homelab/actions/approved/.json
      → Executor runs → completed / failed
```

Agents must never execute destructive actions (restarts, deploys, config changes) without a corresponding approved action file.

## Event System

Events are append-only JSON lines at `/opt/homelab/events/YYYY-MM-DD//events.jsonl`.
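
To make the append-only JSON-lines format concrete, here is a minimal sketch of writing and reading such a file. It is not the repo's emitter. The field names (`ts`, `type`, `source`, `detail`) and the helpers `append_event` and `read_events` are assumptions made for illustration, and the target file is passed in rather than derived from the runtime path.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def append_event(events_file: Path, event_type: str, source: str, detail: dict) -> dict:
    """Append one event as a single JSON line; existing lines are never rewritten."""
    event = {
        "ts": datetime.now(timezone.utc).isoformat(),  # hypothetical field name
        "type": event_type,
        "source": source,
        "detail": detail,
    }
    events_file.parent.mkdir(parents=True, exist_ok=True)
    # Mode "a" only ever adds to the end of the file.
    with events_file.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(event, sort_keys=True) + "\n")
    return event


def read_events(events_file: Path) -> list:
    """Parse every non-empty line of the file as one JSON event."""
    with events_file.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "events.jsonl"
        append_event(target, "service_unhealthy", "stability-agent", {"service": "mosquitto"})
        append_event(target, "service_recovered", "stability-agent", {"service": "mosquitto"})
        for ev in read_events(target):
            print(ev["type"], ev["detail"])
```

One object per line means a reader can process the file incrementally, and a partially written final line affects only that one event.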
Emit via `scripts/lib/events.sh` (shell) or `scripts/lib/events.py` (Python).

Normalized event types: `deployment_started/completed/failed`, `service_unhealthy/recovered`, `node_offline/online`, `healthcheck_failed`, `remediation_started/completed`.

### Supervisor event routing table

| Event type | Source | Action generated | Cooldown |
|---|---|---|---|
| `containers_not_running` | stability-agent | `container_restart` | dedup via stable ID |
| `mqtt_unreachable` | stability-agent | `container_restart` | dedup via stable ID |
| `service_unhealthy` / other | stability-agent | `redeploy` | dedup via stable ID |
| `disk_pressure` (high) | stability-agent | `disk_cleanup` | dedup via stable ID |
| `ha_websocket_dead` | ha-diag-agent | `container_restart` (homeassistant) | 30 min after completion |
| `ha_websocket_recovered` | ha-diag-agent | cancels matching restart | — |
| `ha_integration_failed` | ha-diag-agent | `alert_only` | 1 hour |
| `ha_entity_unavailable_long` | ha-diag-agent | `alert_only` | 1 hour |
| `ha_automation_failing` | ha-diag-agent | `alert_only` | 1 hour |
| `ha_update_available` | ha-diag-agent | `alert_only` | 1 hour |
| `ha_recorder_lag` | ha-diag-agent | `alert_only` | 1 hour |
| `ha_system_health_degraded` | ha-diag-agent | `alert_only` | 1 hour |

HA events are routed directly from the events directory by the supervisor (not via world-state drift loop) to avoid conflicts with stability-agent's independent container health tracking. HA events are suppressed if `homeassistant` had a `containers_not_running` incident within the last 5 minutes (planned restart/update in progress).

## Discovery Entry Points for Agents

When exploring the system, use these files in order:

1. `inventory/topology.yaml` — node list, roles, mesh type
2. `hosts//capabilities.yaml` — hardware and software constraints
3. `hosts//services.yaml` — desired services and exposure classes for that host
4. `services//service.yaml` — operational contract for a service

## CHELSTY-Specific Rules

- Zigbee coordinator is **SLZB-06U** over TCP (`192.168.1.105:6638`, `ezsp` adapter). Never use `/dev/ttyUSB0`.
- CHELSTY nodes run **docker-compose v1** (1.29.2) — use `docker-compose` (hyphenated), not `docker compose`.
- Critical backup sets: HA config+data, Zigbee2MQTT config+db+network key, Mosquitto config+persistence, SLZB-06U coordinator state.

## Runtime Path Conventions

`/opt/homelab/` layout on each node:

- `data//` — persistent volumes
- `config//` — secrets and host-local overrides (not in Git)
- `logs//` — service logs
- `state/` — deployment stage markers, agent heartbeats
- `events/` — append-only event store
- `world/` — Observer output (synthesized state)
- `actions/` — pending / approved / running / completed / failed

## Naming Conventions

- Hosts: ALL CAPS (`SATURN`, `PIHA`)
- Services: kebab-case (`stability-agent`, `zigbee2mqtt`)
- Container names must match service names
- Always `restart: unless-stopped` unless `service.yaml` says otherwise
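
The naming conventions above are mechanical enough to check in code. The following is a rough sketch, not an existing repo tool: the function names are hypothetical, and the host pattern assumes hyphen-separated upper-case segments are allowed because `CHELSTY-INFRA` appears in the node table.

```python
import re

# Hosts: ALL CAPS, optionally hyphen-separated (e.g. SATURN, CHELSTY-INFRA).
HOST_RE = re.compile(r"[A-Z][A-Z0-9]*(-[A-Z0-9]+)*")
# Services: lowercase kebab-case (e.g. stability-agent, zigbee2mqtt).
SERVICE_RE = re.compile(r"[a-z][a-z0-9]*(-[a-z0-9]+)*")


def is_valid_host(name: str) -> bool:
    return HOST_RE.fullmatch(name) is not None


def is_valid_service(name: str) -> bool:
    return SERVICE_RE.fullmatch(name) is not None


def check_container(service: str, container_name: str,
                    restart: str = "unless-stopped") -> list:
    """Return a list of human-readable convention violations (empty if none)."""
    problems = []
    if not is_valid_service(service):
        problems.append(f"service name {service!r} is not kebab-case")
    if container_name != service:
        problems.append(
            f"container name {container_name!r} does not match service {service!r}"
        )
    if restart != "unless-stopped":
        # Not necessarily an error: service.yaml may override the default.
        problems.append(
            f"restart policy is {restart!r}; confirm service.yaml overrides the default"
        )
    return problems


if __name__ == "__main__":
    print(check_container("mosquitto", "mosquitto"))
    print(check_container("mosquitto", "mqtt", restart="always"))
```

A non-default restart policy is reported as something to confirm, not rejected outright, since `service.yaml` is allowed to say otherwise.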