oskar 35e57cc789 docs(CLAUDE.md): update node model and override path convention

- split CHELSTY into CHELSTY-INFRA and CHELSTY-HA in node roles table
- correct docker-compose override path to hosts/<node>/runtime/<service>/

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-05-20 15:27:46 +02:00

6.1 KiB


CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

What This Repo Is

GitOps-lite orchestration for a distributed homelab. The repo is the source of truth for infrastructure definitions; runtime state lives at /opt/homelab/ on each execution node and is never committed.

Node Roles

Host	Role
SATURN	Primary control node — only node where commits are made
SOLARIA	GPU/compute/AI workloads
PIHA	Infra, monitoring
VPS	Public ingress, reverse proxy, control plane host
CHELSTY-INFRA	LTE edge hypervisor (site: chelsty); Zigbee2MQTT, Mosquitto, stability-agent — offline-first
CHELSTY-HA	LTE Home Assistant VM (site: chelsty); connects to CHELSTY-INFRA MQTT broker — offline-first

All nodes communicate over Tailscale. CHELSTY-INFRA and CHELSTY-HA have an intermittent LTE uplink; their services (zigbee2mqtt, mosquitto, homeassistant, stability-agent) must never depend on SATURN, VPS, or Forgejo at runtime.

Deployment

Run a fresh deployment on the current node

scripts/deploy/deploy.sh

Resume after interruption

scripts/deploy/deploy.sh --resume

Run a specific stage only

scripts/deploy/deploy.sh --stage verify
scripts/deploy/deploy.sh --stage diagnose

Deploy a specific service

scripts/deploy/deploy.sh --service mosquitto

Deploy from SATURN/SOLARIA to VPS (control plane)

./scripts/deploy/deploy-control-plane.sh --ssh

Bootstrap a new node

./scripts/bootstrap/chelsty-runtime.sh  # CHELSTY-specific
./scripts/bootstrap/prepare-node.sh     # General node prep

The staged deploy pipeline runs: prepare → validate → deploy → verify → complete. If a stage fails, diagnose runs instead of complete. Stage state is persisted in /opt/homelab/state/deploy/, so an interrupted run can be resumed safely with --resume.
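As a rough illustration of how persisted stage state makes resumption safe, here is a minimal Python sketch. The `<stage>.done` marker naming and the helper names are assumptions for illustration; the actual format that deploy.sh writes under /opt/homelab/state/deploy/ is not documented here.

```python
from pathlib import Path
from typing import Optional, Set

# Stage order taken from the pipeline description; "diagnose" is omitted
# because it only runs when an earlier stage fails.
STAGES = ["prepare", "validate", "deploy", "verify", "complete"]


def completed_stages(state_dir: Path) -> Set[str]:
    """Stages that have a marker file (hypothetical '<stage>.done' naming)."""
    if not state_dir.is_dir():
        return set()
    return {p.stem for p in state_dir.glob("*.done")}


def next_stage(state_dir: Path) -> Optional[str]:
    """First stage in pipeline order without a marker, or None if all are done."""
    done = completed_stages(state_dir)
    for stage in STAGES:
        if stage not in done:
            return stage
    return None


def mark_done(state_dir: Path, stage: str) -> None:
    """Record a finished stage so a later resume can skip it."""
    state_dir.mkdir(parents=True, exist_ok=True)
    (state_dir / f"{stage}.done").touch()
```

A resumed run would call `next_stage()` and continue from whatever it returns, which is why re-running after an interruption does not repeat completed work.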

Service Structure

Every service must follow this layout:

services/<service>/
├── docker-compose.yml
├── service.yaml       # Machine-readable contract (primary source of truth for agents)
├── README.md
├── env.example        # Template — never commit actual secrets
└── healthcheck.sh     # Returns 0 (healthy) or 1 (unhealthy)

service.yaml defines owner_node, exposure, dependencies, healthcheck, restart_policy, persistence.paths, and runtime.env_vars. This is what AI agents read to understand how to manage a service.
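An agent could sanity-check a loaded contract before acting on it. The sketch below assumes the YAML has already been parsed into a dict, that dotted names such as persistence.paths map to nested keys, and that every listed field is required. None of that is confirmed by this file, and the example values are invented.

```python
from typing import Any, Dict, List

# Fields named in this section; dotted names are treated as nested keys.
CONTRACT_FIELDS = [
    "owner_node", "exposure", "dependencies", "healthcheck",
    "restart_policy", "persistence.paths", "runtime.env_vars",
]


def has_field(contract: Dict[str, Any], dotted: str) -> bool:
    """Walk a dotted key path through nested dicts."""
    node: Any = contract
    for part in dotted.split("."):
        if not isinstance(node, dict) or part not in node:
            return False
        node = node[part]
    return True


def missing_fields(contract: Dict[str, Any]) -> List[str]:
    """Return the contract fields that are absent, in declaration order."""
    return [f for f in CONTRACT_FIELDS if not has_field(contract, f)]


# Illustrative values only; they are not copied from a real service.yaml.
EXAMPLE_CONTRACT = {
    "owner_node": "CHELSTY-INFRA",
    "exposure": "internal",
    "dependencies": [],
    "healthcheck": {"command": "./healthcheck.sh"},
    "restart_policy": "unless-stopped",
    "persistence": {"paths": ["/opt/homelab/data/mosquitto"]},
    "runtime": {"env_vars": ["MQTT_USER"]},
}
```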

Host-specific runtime config and secrets live at /opt/homelab/config/<service>/ on the target node (not in Git). Docker Compose overrides are version-controlled at hosts/<node>/runtime/<service>/docker-compose.override.yml in this repo and applied during deployment.
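One common way to apply such an override is to pass both files to `docker compose` with repeated `-f` flags, where later files take precedence. The sketch below only builds the command. Whether deploy.sh does it this way is an assumption, as are the function name and the lowercase node directory name used in the test.

```python
from pathlib import Path
from typing import List


def compose_command(repo_root: Path, node: str, service: str) -> List[str]:
    """Build a 'docker compose up' command, adding the node override if present."""
    base = repo_root / "services" / service / "docker-compose.yml"
    override = (
        repo_root / "hosts" / node / "runtime" / service
        / "docker-compose.override.yml"
    )
    cmd = ["docker", "compose", "-f", str(base)]
    # The override is optional: not every node customises every service.
    if override.is_file():
        cmd += ["-f", str(override)]
    return cmd + ["up", "-d"]
```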

Agent System Architecture

The platform uses a multi-agent model with human-in-the-loop for destructive actions:

Stability Agent (services/stability-agent/) — Per-node watchdog. Monitors Docker containers, disk, Tailscale, MQTT. Emits filesystem events. Does NOT restart services autonomously.
Observer (services/control-plane/src/) — Synthesizes world state from events into /opt/homelab/world/{nodes,services,deployments,incidents}.json.
Supervisor — Detects drift between desired state (from hosts/*/services.yaml) and actual state (from Observer output). Writes pending action JSON files.
Executor — Executes actions only after they transition to approved.
Operator UI + Telegram Bot — Operators review and approve/reject pending actions.
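The Supervisor's drift check can be pictured as a set comparison between desired and observed services. This is a simplified sketch: the real schemas of hosts/*/services.yaml and the Observer's world files are not shown here, so both inputs are reduced to plain lists of service names and the function name is hypothetical.

```python
from typing import Dict, Iterable, List


def detect_drift(desired: Iterable[str], actual: Iterable[str]) -> Dict[str, List[str]]:
    """Compare desired vs. observed service names for one host."""
    desired_set, actual_set = set(desired), set(actual)
    return {
        # Declared for the host but not observed running.
        "missing": sorted(desired_set - actual_set),
        # Observed running but not declared for the host.
        "unexpected": sorted(actual_set - desired_set),
    }
```

Each non-empty entry would then become a pending action for operator review rather than being fixed automatically.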

Action approval flow

Agent → /opt/homelab/actions/pending/<id>.json
      → Telegram notification → Operator approves
      → /opt/homelab/actions/approved/<id>.json
      → Executor runs → completed / failed

Agents must never execute destructive actions (restarts, deploys, config changes) without a corresponding approved action file.
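A guard for this rule might look like the following sketch. It assumes approval is represented by the action file moving from pending/ to approved/, as the flow above suggests. The function names are hypothetical and the real Executor may track state differently.

```python
import shutil
from pathlib import Path


def approve(actions_dir: Path, action_id: str) -> Path:
    """Move a pending action file to approved/ (raises if it is not pending)."""
    src = actions_dir / "pending" / f"{action_id}.json"
    dst = actions_dir / "approved" / f"{action_id}.json"
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(src), str(dst))
    return dst


def may_execute(actions_dir: Path, action_id: str) -> bool:
    """An action is executable only if its approved file exists."""
    return (actions_dir / "approved" / f"{action_id}.json").is_file()
```

An agent would call `may_execute()` immediately before any restart, deploy, or config change and stop if it returns False.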

Event System

Events are append-only JSON lines at:

/opt/homelab/events/YYYY-MM-DD/<node>/events.jsonl

Emit from shell:

source scripts/lib/events.sh
emit_event "deployment_started" "info" "my-script.sh" "mosquitto" "cid-123" '{}'

Emit from Python:

from scripts.lib.events import emit_event
emit_event("service_unhealthy", "error", "monitor.py", "ollama", "cid-123", {"error": "OOM"})

Normalized event types: deployment_started, deployment_completed, deployment_failed, service_unhealthy, service_recovered, node_offline, node_online, healthcheck_failed, remediation_started, remediation_completed.
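To make the storage layout concrete, here is a small sketch of appending to and reading from a per-day, per-node JSON Lines file. It is not the implementation in scripts/lib/events. The helper names are hypothetical and the record fields used in any caller are assumptions, since the event schema is not specified in this file.

```python
import json
from datetime import date
from pathlib import Path
from typing import Any, Dict, List


def event_log_path(root: Path, node: str, day: date) -> Path:
    """Resolve <root>/events/YYYY-MM-DD/<node>/events.jsonl."""
    return root / "events" / day.isoformat() / node / "events.jsonl"


def append_event(path: Path, record: Dict[str, Any]) -> None:
    """Append one event as a single JSON line; existing lines are never rewritten."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")


def read_events(path: Path) -> List[Dict[str, Any]]:
    """Load all events from one file, skipping blank lines."""
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Opening in append mode keeps the store append-only from the writer's side, and one JSON object per line means a truncated final line cannot corrupt earlier events.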

Discovery Entry Points for Agents

When exploring the system, use these files in order:

inventory/topology.yaml — node list, roles, mesh type
hosts/<node>/capabilities.yaml — hardware and software constraints
hosts/<node>/services.yaml — desired services and exposure classes for that host
services/<service>/service.yaml — operational contract for a service
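The lookup order above can be expressed as a small helper that an agent calls before reading anything else. The function name is hypothetical, and the sketch only resolves paths without parsing the files.

```python
from pathlib import Path
from typing import List


def discovery_files(repo_root: Path, node: str, service: str) -> List[Path]:
    """Return the discovery entry points in the order they should be read."""
    return [
        repo_root / "inventory" / "topology.yaml",
        repo_root / "hosts" / node / "capabilities.yaml",
        repo_root / "hosts" / node / "services.yaml",
        repo_root / "services" / service / "service.yaml",
    ]
```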

CHELSTY-Specific Rules

Zigbee coordinator is SLZB-06U over TCP (192.168.1.105:6638, ezsp adapter). Never use /dev/ttyUSB0.
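Because the coordinator is reached over the network rather than a serial device, a basic reachability probe is a plain TCP connect. The sketch below only confirms that the port accepts connections and does not speak the adapter protocol. The function name and timeout are assumptions.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

On site this would be called as `tcp_reachable("192.168.1.105", 6638)`. Since it runs on the local LAN, it does not depend on the LTE uplink.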

Deploy CHELSTY nodes individually:

./scripts/deploy/deploy-node.sh chelsty-infra
./scripts/deploy/deploy-node.sh chelsty-ha

Bootstrap CHELSTY runtime:

```
./scripts/bootstrap/chelsty-runtime.sh
```

Critical backup sets: HA config+data, Zigbee2MQTT config+db+network key, Mosquitto config+persistence, SLZB-06U coordinator state.

Runtime Path Conventions

/opt/homelab/
├── data/<service>/     # Persistent volumes
├── config/<service>/   # Secrets and host-specific runtime config (not in Git)
├── logs/<service>/     # Service logs
├── state/              # Deployment stage markers, agent heartbeats
├── events/             # Append-only event store
├── world/              # Observer output (synthesized state)
└── actions/            # pending / approved / running / completed / failed

Naming Conventions

Hosts: ALL CAPS (SATURN, PIHA)
Services: kebab-case (stability-agent, zigbee2mqtt)
Container names must match service names
Restart policy: always use restart: unless-stopped, unless service.yaml specifies otherwise
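These conventions are easy to check mechanically. The sketch below is one possible interpretation: it allows hyphens in host names because CHELSTY-INFRA and CHELSTY-HA use them, and it allows digits in both patterns because of zigbee2mqtt. The regexes and function names are assumptions, not rules taken from the repo's tooling.

```python
import re

# Upper-case letters and digits, optionally hyphen-separated (e.g. CHELSTY-INFRA).
HOST_RE = re.compile(r"[A-Z0-9]+(?:-[A-Z0-9]+)*")
# Lower-case kebab-case (e.g. stability-agent, zigbee2mqtt).
SERVICE_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def is_valid_host(name: str) -> bool:
    return HOST_RE.fullmatch(name) is not None


def is_valid_service(name: str) -> bool:
    return SERVICE_RE.fullmatch(name) is not None


def container_matches_service(container_name: str, service: str) -> bool:
    """Container names must equal the service name exactly."""
    return container_name == service
```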
