diff --git a/CLAUDE.md b/CLAUDE.md
new file mode 100644
index 0000000..a459b5c
--- /dev/null
+++ b/CLAUDE.md
@@ -0,0 +1,156 @@
+# CLAUDE.md
+
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+## What This Repo Is
+
+GitOps-lite orchestration for a distributed homelab. The repo is the source of truth for infrastructure definitions; runtime state lives at `/opt/homelab/` on each execution node and is never committed.
+
+## Node Roles
+
+| Host | Role |
+|------|------|
+| **SATURN** | Primary control node — only node where commits are made |
+| **SOLARIA** | GPU/compute/AI workloads |
+| **PIHA** | Infra, monitoring |
+| **VPS** | Public ingress, reverse proxy, control plane host |
+| **CHELSTY-INFRA** | LTE edge hypervisor (site: chelsty); Zigbee2MQTT, Mosquitto, stability-agent — offline-first |
+| **CHELSTY-HA** | LTE Home Assistant VM (site: chelsty); connects to CHELSTY-INFRA MQTT broker — offline-first |
+
+All nodes communicate over Tailscale. CHELSTY-INFRA and CHELSTY-HA have an intermittent LTE uplink; their services (`zigbee2mqtt`, `mosquitto`, `homeassistant`, `stability-agent`) must never depend on SATURN, VPS, or Forgejo at runtime.
+
+## Deployment
+
+### Run a fresh deployment on the current node
+```bash
+scripts/deploy/deploy.sh
+```
+
+### Resume after interruption
+```bash
+scripts/deploy/deploy.sh --resume
+```
+
+### Run a specific stage only
+```bash
+scripts/deploy/deploy.sh --stage verify
+scripts/deploy/deploy.sh --stage diagnose
+```
+
+### Deploy a specific service
+```bash
+scripts/deploy/deploy.sh --service mosquitto
+```
+
+### Deploy from SATURN/SOLARIA to VPS (control plane)
+```bash
+./scripts/deploy/deploy-control-plane.sh --ssh
+```
+
+### Bootstrap a new node
+```bash
+./scripts/bootstrap/chelsty-runtime.sh   # CHELSTY-specific
+./scripts/bootstrap/prepare-node.sh      # General node prep
+```
+
+The staged deploy pipeline runs: **prepare → validate → deploy → verify → diagnose (on failure) → complete**. Stage state is persisted in `/opt/homelab/state/deploy/` allowing safe resumption.
+
+## Service Structure
+
+Every service must follow this layout:
+
+```
+services//
+├── docker-compose.yml
+├── service.yaml       # Machine-readable contract (primary source of truth for agents)
+├── README.md
+├── env.example        # Template — never commit actual secrets
+└── healthcheck.sh     # Returns 0 (healthy) or 1 (unhealthy)
+```
+
+`service.yaml` defines `owner_node`, `exposure`, `dependencies`, `healthcheck`, `restart_policy`, `persistence.paths`, and `runtime.env_vars`. This is what AI agents read to understand how to manage a service.
+
+Host-specific runtime config and secrets live at `/opt/homelab/config//` on the target node (not in Git). Docker Compose overrides are version-controlled at `hosts//runtime//docker-compose.override.yml` in this repo and applied during deployment.
+
+## Agent System Architecture
+
+The platform uses a multi-agent model with **human-in-the-loop** for destructive actions:
+
+1. **Stability Agent** (`services/stability-agent/`) — Per-node watchdog. Monitors Docker containers, disk, Tailscale, MQTT. Emits filesystem events. Does NOT restart services autonomously.
+2. **Observer** (`services/control-plane/src/`) — Synthesizes world state from events into `/opt/homelab/world/{nodes,services,deployments,incidents}.json`.
+3. **Supervisor** — Detects drift between desired state (from `hosts/*/services.yaml`) and actual state (from Observer output). Writes `pending` action JSON files.
+4. **Executor** — Executes actions only after they transition to `approved`.
+5. **Operator UI** + **Telegram Bot** — Operators review and approve/reject pending actions.
+
+### Action approval flow
+```
+Agent → /opt/homelab/actions/pending/.json
+      → Telegram notification → Operator approves
+      → /opt/homelab/actions/approved/.json
+      → Executor runs → completed / failed
+```
+
+Agents must never execute destructive actions (restarts, deploys, config changes) without a corresponding approved action file.
+
+## Event System
+
+Events are append-only JSON lines at:
+```
+/opt/homelab/events/YYYY-MM-DD//events.jsonl
+```
+
+Emit from shell:
+```bash
+source scripts/lib/events.sh
+emit_event "deployment_started" "info" "my-script.sh" "mosquitto" "cid-123" '{}'
+```
+
+Emit from Python:
+```python
+from scripts.lib.events import emit_event
+emit_event("service_unhealthy", "error", "monitor.py", "ollama", "cid-123", {"error": "OOM"})
+```
+
+Normalized event types: `deployment_started/completed/failed`, `service_unhealthy/recovered`, `node_offline/online`, `healthcheck_failed`, `remediation_started/completed`.
+
+## Discovery Entry Points for Agents
+
+When exploring the system, use these files in order:
+1. `inventory/topology.yaml` — node list, roles, mesh type
+2. `hosts//capabilities.yaml` — hardware and software constraints
+3. `hosts//services.yaml` — desired services and exposure classes for that host
+4. `services//service.yaml` — operational contract for a service
+
+## CHELSTY-Specific Rules
+
+- Zigbee coordinator is **SLZB-06U** over TCP (`192.168.1.105:6638`, `ezsp` adapter). Never use `/dev/ttyUSB0`.
+- Deploy CHELSTY nodes individually:
+  ```bash
+  ./scripts/deploy/deploy-node.sh chelsty-infra
+  ./scripts/deploy/deploy-node.sh chelsty-ha
+  ```
+- Bootstrap CHELSTY runtime:
+  ```bash
+  ./scripts/bootstrap/chelsty-runtime.sh
+  ```
+- Critical backup sets: HA config+data, Zigbee2MQTT config+db+network key, Mosquitto config+persistence, SLZB-06U coordinator state.
+
+## Runtime Path Conventions
+
+```
+/opt/homelab/
+├── data//       # Persistent volumes
+├── config//     # Secrets and host-local overrides (not in Git)
+├── logs//       # Service logs
+├── state/       # Deployment stage markers, agent heartbeats
+├── events/      # Append-only event store
+├── world/       # Observer output (synthesized state)
+└── actions/     # pending / approved / running / completed / failed
+```
+
+## Naming Conventions
+
+- Hosts: ALL CAPS (`SATURN`, `PIHA`)
+- Services: kebab-case (`stability-agent`, `zigbee2mqtt`)
+- Container names must match service names
+- Always `restart: unless-stopped` unless `service.yaml` says otherwise
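
The naming conventions listed at the end of this file could be checked mechanically before a host or service entry is committed. The following is a minimal sketch and is not part of this change. The regular expressions and the function names (`is_valid_host`, `is_valid_service`, `container_matches_service`) are assumptions inferred from the examples in the file (`SATURN`, `CHELSTY-INFRA`, `stability-agent`, `zigbee2mqtt`), since the conventions are stated only informally.

```python
import re

# Assumed grammar: the file gives examples only, not a formal pattern.
# Hosts: upper-case letters/digits, optionally hyphen-separated (e.g. CHELSTY-INFRA).
HOST_RE = re.compile(r"[A-Z0-9]+(-[A-Z0-9]+)*")
# Services: kebab-case, lower-case letters/digits (e.g. stability-agent, zigbee2mqtt).
SERVICE_RE = re.compile(r"[a-z0-9]+(-[a-z0-9]+)*")


def is_valid_host(name: str) -> bool:
    """Return True if the whole name matches the assumed host pattern."""
    return HOST_RE.fullmatch(name) is not None


def is_valid_service(name: str) -> bool:
    """Return True if the whole name matches the assumed kebab-case pattern."""
    return SERVICE_RE.fullmatch(name) is not None


def container_matches_service(service: str, container: str) -> bool:
    """Container names must match service names exactly."""
    return is_valid_service(service) and container == service
```

A check like this might run in a pre-commit hook on SATURN, the only node where commits are made, though where to wire it in is left open here.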