CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
What This Repo Is
GitOps-lite orchestration for a distributed homelab. The repo is the source of truth for infrastructure definitions; runtime state lives at /opt/homelab/ on each execution node and is never committed.
Node Roles
| Host | Role |
|---|---|
| SATURN | Primary control node — only node where commits are made |
| SOLARIA | GPU/compute/AI workloads |
| PIHA | Infra, monitoring |
| VPS | Public ingress, reverse proxy, control plane host |
| CHELSTY-INFRA | LTE edge hypervisor (site: chelsty); Zigbee2MQTT, Mosquitto, stability-agent — offline-first |
| CHELSTY-HA | LTE Home Assistant VM (site: chelsty); connects to CHELSTY-INFRA MQTT broker — offline-first |
All nodes communicate over Tailscale. CHELSTY-INFRA and CHELSTY-HA have an intermittent LTE uplink; their services must never depend on SATURN, VPS, or Forgejo at runtime. Full node capabilities: hosts/<node>/capabilities.yaml.
Deployment
scripts/deploy/deploy.sh # fresh deploy on current node
scripts/deploy/deploy.sh --resume # resume after interruption
scripts/deploy/deploy.sh --stage verify # specific stage only
scripts/deploy/deploy.sh --service mosquitto # specific service only
./scripts/deploy/deploy-control-plane.sh --ssh # SATURN/SOLARIA → VPS
./scripts/deploy/deploy-node.sh chelsty-infra # CHELSTY nodes (individually)
./scripts/bootstrap/prepare-node.sh # general node bootstrap
./scripts/bootstrap/chelsty-runtime.sh # CHELSTY-specific bootstrap
Pipeline stages: prepare → validate → deploy → verify → diagnose (on failure) → complete. Stage state persisted in /opt/homelab/state/deploy/.
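As a rough illustration of how --resume could pick up after an interruption, the sketch below returns the first stage that has no marker yet. The marker naming (`<stage>.done`) and the function name `next_stage` are assumptions made for this example; the real deploy.sh may record stage state differently. The diagnose stage is left out because it only runs on failure.

```python
from pathlib import Path
from typing import Optional

# Stage order from the pipeline description; "diagnose" runs only on failure.
STAGES = ["prepare", "validate", "deploy", "verify", "complete"]


def next_stage(state_dir: Path) -> Optional[str]:
    """Return the first stage without a marker file, or None if all are done.

    Assumption: each finished stage leaves an empty file "<stage>.done"
    in state_dir (hypothetical naming, not confirmed by the repo).
    """
    for stage in STAGES:
        if not (state_dir / f"{stage}.done").exists():
            return stage
    return None
```

Called with the state directory (for example Path("/opt/homelab/state/deploy")), this yields the stage at which a resumed run would continue.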
Service Structure
Every service must follow this layout:
services/<service>/
├── docker-compose.yml
├── service.yaml # Machine-readable contract (primary source of truth for agents)
├── README.md
├── env.example # Template — never commit actual secrets
└── healthcheck.sh # Returns 0 (healthy) or 1 (unhealthy)
service.yaml defines owner_node, exposure, dependencies, healthcheck, restart_policy, persistence.paths, and runtime.env_vars. This is what AI agents read to understand how to manage a service.
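A minimal sketch of a contract check an agent might run before acting on a service. It works on an already-parsed service.yaml (a plain dict) so it needs no YAML library. The function name `missing_fields` is hypothetical; only the field names come from the list above.

```python
# Field names taken from the service.yaml description; dotted names are nested keys.
REQUIRED_FIELDS = [
    "owner_node",
    "exposure",
    "dependencies",
    "healthcheck",
    "restart_policy",
    "persistence.paths",
    "runtime.env_vars",
]


def missing_fields(contract: dict) -> list:
    """Return the required fields (dotted form) that are absent from contract."""
    missing = []
    for dotted in REQUIRED_FIELDS:
        node = contract
        for key in dotted.split("."):
            # Stop descending as soon as a level is missing or not a mapping.
            if not isinstance(node, dict) or key not in node:
                missing.append(dotted)
                break
            node = node[key]
    return missing
```

An empty result means every field named above is present; the check says nothing about whether the values themselves are valid.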
Host-specific runtime config and secrets live at /opt/homelab/config/<service>/ on the target node (not in Git). Docker Compose overrides are version-controlled at hosts/<node>/runtime/<service>/docker-compose.override.yml in this repo and applied during deployment.
Agent System Architecture
The platform uses a multi-agent model with human-in-the-loop for destructive actions:
- Stability Agent (services/stability-agent/): per-node watchdog. Monitors Docker containers, disk, Tailscale, MQTT. Emits filesystem events. Does NOT restart services autonomously.
- Observer (services/control-plane/src/): synthesizes world state from events into /opt/homelab/world/{nodes,services,deployments,incidents}.json.
- Supervisor: detects drift between desired state (from hosts/*/services.yaml) and actual state (from Observer output). Writes pending action JSON files.
- Executor: executes actions only after they transition to approved.
- Operator UI + Telegram Bot: operators review and approve/reject pending actions.
Action approval flow
Agent → /opt/homelab/actions/pending/<id>.json
→ Telegram notification → Operator approves
→ /opt/homelab/actions/approved/<id>.json
→ Executor runs → completed / failed
Agents must never execute destructive actions (restarts, deploys, config changes) without a corresponding approved action file.
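The gate described above can be sketched as a file-existence check in front of any destructive call. This is an illustration under stated assumptions, not the Executor's actual code: `is_approved` and `run_if_approved` are hypothetical names, and only the approved/<id>.json layout comes from the flow above.

```python
from pathlib import Path
from typing import Callable


def is_approved(actions_root: Path, action_id: str) -> bool:
    """True only if an approved action file exists for this id."""
    return (actions_root / "approved" / f"{action_id}.json").is_file()


def run_if_approved(
    actions_root: Path, action_id: str, execute: Callable[[str], None]
) -> bool:
    """Run execute(action_id) only when the action has been approved.

    Returns True if the action ran, False if it was refused.
    """
    if not is_approved(actions_root, action_id):
        return False
    execute(action_id)
    return True
```

On a node, actions_root would be Path("/opt/homelab/actions"). The sketch does not move the file to running, completed or failed afterwards.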
Event System
Events are append-only JSON lines at /opt/homelab/events/YYYY-MM-DD/<node>/events.jsonl.
Emit via scripts/lib/events.sh (shell) or scripts/lib/events.py (Python).
Normalized event types: deployment_started/completed/failed, service_unhealthy/recovered, node_offline/online, healthcheck_failed, remediation_started/completed.
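To make the path convention concrete, here is a small sketch of appending one event as a JSON line. It is not the API of scripts/lib/events.py. The function names and the record fields (ts, node, type, payload) are assumptions for illustration; only the directory layout and the append-only JSON-lines format come from the text above.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def event_path(root: Path, node: str, when: datetime) -> Path:
    """Build <root>/YYYY-MM-DD/<node>/events.jsonl for the given timestamp."""
    return root / when.strftime("%Y-%m-%d") / node / "events.jsonl"


def append_event(
    root: Path,
    node: str,
    event_type: str,
    payload: dict,
    when: Optional[datetime] = None,
) -> Path:
    """Append one event as a single JSON line; existing lines are never rewritten."""
    when = when or datetime.now(timezone.utc)
    path = event_path(root, node, when)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Hypothetical record schema; the real events library may use other keys.
    record = {"ts": when.isoformat(), "node": node, "type": event_type, "payload": payload}
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return path
```

Opening the file in append mode keeps the store append-only from the writer's side. Concurrent writers would still need their own coordination, which this sketch leaves out.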
Supervisor event routing table
| Event type | Source | Action generated | Cooldown |
|---|---|---|---|
| containers_not_running | stability-agent | container_restart | dedup via stable ID |
| mqtt_unreachable | stability-agent | container_restart | dedup via stable ID |
| service_unhealthy / other | stability-agent | redeploy | dedup via stable ID |
| disk_pressure (high) | stability-agent | disk_cleanup | dedup via stable ID |
| ha_websocket_dead | ha-diag-agent | container_restart (homeassistant) | 30 min after completion |
| ha_websocket_recovered | ha-diag-agent | cancels matching restart | n/a |
| ha_integration_failed | ha-diag-agent | alert_only | 1 hour |
| ha_entity_unavailable_long | ha-diag-agent | alert_only | 1 hour |
| ha_automation_failing | ha-diag-agent | alert_only | 1 hour |
| ha_update_available | ha-diag-agent | alert_only | 1 hour |
| ha_recorder_lag | ha-diag-agent | alert_only | 1 hour |
| ha_system_health_degraded | ha-diag-agent | alert_only | 1 hour |
HA events are routed directly from the events directory by the supervisor (not via world-state drift loop) to avoid conflicts with stability-agent's independent container health tracking. HA events are suppressed if homeassistant had a containers_not_running incident within the last 5 minutes (planned restart/update in progress).
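The HA routing and the 5-minute suppression rule can be sketched as follows. This is a simplified illustration, not the supervisor's implementation: the function name, the "cancel_restart" label and the way the last containers_not_running time is passed in are all assumptions, and cooldowns are left out.

```python
from datetime import datetime, timedelta
from typing import Optional

SUPPRESSION_WINDOW = timedelta(minutes=5)

# HA event types that only raise an alert, per the routing table.
ALERT_ONLY = {
    "ha_integration_failed",
    "ha_entity_unavailable_long",
    "ha_automation_failing",
    "ha_update_available",
    "ha_recorder_lag",
    "ha_system_health_degraded",
}


def route_ha_event(
    event_type: str,
    event_time: datetime,
    last_containers_not_running: Optional[datetime],
) -> Optional[str]:
    """Return the action for an HA event, or None if suppressed or unknown."""
    if last_containers_not_running is not None:
        age = event_time - last_containers_not_running
        # A recent containers_not_running incident suggests a planned restart.
        if timedelta(0) <= age < SUPPRESSION_WINDOW:
            return None
    if event_type == "ha_websocket_dead":
        return "container_restart"
    if event_type == "ha_websocket_recovered":
        return "cancel_restart"  # hypothetical label for "cancels matching restart"
    if event_type in ALERT_ONLY:
        return "alert_only"
    return None
```

In this sketch, suppression applies to every HA event type, matching the sentence above. Whether recovery events should bypass it is a design decision the source does not settle.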
Discovery Entry Points for Agents
When exploring the system, use these files in order:
1. inventory/topology.yaml: node list, roles, mesh type
2. hosts/<node>/capabilities.yaml: hardware and software constraints
3. hosts/<node>/services.yaml: desired services and exposure classes for that host
4. services/<service>/service.yaml: operational contract for a service
VPS-Specific Rules
VPS has 4 GiB RAM, no swap. Every repo-managed service must declare memory limits in its hosts/vps/runtime/<service>/docker-compose.override.yml.
Memory limit convention
Use top-level Compose properties (not deploy.resources.limits, which requires Swarm mode):
services:
  myservice:
    mem_limit: 256m        # cgroup ceiling; container is OOM-killed on breach and restarted per its restart policy
    oom_score_adj: -900    # host kernel OOM killer strongly avoids this container
Rules:
- Control-plane containers (executor, observer, supervisor, operator-ui), node-agent, stability-agent: always set oom_score_adj: -900; these must never be a system-level OOM victim.
- mem_limit still applies even with oom_score_adj: -900; the cgroup OOM killer is independent of the host OOM killer and will restart the container via Docker when the limit is exceeded.
- Budget: OS+Docker reserves ~800 MiB; sum of all mem_limit values must stay ≤ 3200 MiB (3.1 GiB).
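The budget rule is simple arithmetic and easy to check mechanically. The sketch below sums a set of mem_limit strings and compares the total against the 3200 MiB ceiling. The helper names are hypothetical, and any service names and limits passed in are illustrative, not the real VPS allocation.

```python
# Byte multipliers for the suffixes Docker accepts on mem_limit (b, k, m, g).
UNITS = {"b": 1, "k": 1024, "m": 1024**2, "g": 1024**3}

BUDGET_MIB = 3200  # from the rule above: sum of mem_limit values must stay <= 3200 MiB


def to_mib(limit: str) -> float:
    """Convert a mem_limit string such as "256m" or "1g" to MiB."""
    limit = limit.strip().lower()
    if limit[-1] in UNITS:
        value, unit = limit[:-1], limit[-1]
    else:
        value, unit = limit, "b"  # a bare number is a byte count
    return float(value) * UNITS[unit] / 1024**2


def check_budget(limits: dict, budget_mib: float = BUDGET_MIB) -> tuple:
    """Return (total_mib, within_budget) for a mapping of service -> mem_limit."""
    total = sum(to_mib(v) for v in limits.values())
    return total, total <= budget_mib
```

For example, limits of 256m and 1g total 1280 MiB and pass, while 3g plus 512m total 3584 MiB and fail.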
Repo-managed services on VPS
All VPS services are now GitOps-managed. Service definitions live in services/<name>/docker-compose.yml; host-specific overrides (mem_limit, env) live in hosts/vps/runtime/<name>/docker-compose.override.yml.
| Service | Compose stack | Data path |
|---|---|---|
| npm | services/npm/ | /home/dockeruser/docker/npm/{data,letsencrypt} (bind mount) |
| outline | services/outline/ | Docker named volumes: outline_outline_storage, outline_postgres_data, outline_redis_data |
| joplin | services/joplin/ | Docker named volume: joplin_postgres_data |
| ai-cluster | services/ai-cluster/ | Mosquitto config bind: /home/dockeruser/docker/ai-cluster/mosquitto/ |
Data migration rule: data paths stay in place at cutover. Never move volumes or bind-mount sources without a dedicated migration plan.
Cutover checklist (before running docker compose up for any migrated service):
- git pull on VPS
- Populate /opt/homelab/config/<service>/.env from the env.example template
- For ai-cluster: copy /home/dockeruser/docker/ai-cluster/.env to /opt/homelab/config/ai-cluster/.env
- For mosquitto: config stays at old bind path until explicitly migrated
- Verify named volumes exist: docker volume ls | grep <project>
ai-cluster architectural note: compute workloads (codex-worker, planner-worker) belong on SOLARIA (GPU/compute node), not the 4 GiB ingress VPS. Migrate when feasible; for now, hard mem_limit values contain the blast radius.
CHELSTY-Specific Rules
- Zigbee coordinator is SLZB-06U over TCP (192.168.1.105:6638, ezsp adapter). Never use /dev/ttyUSB0.
- CHELSTY nodes run docker-compose v1 (1.29.2): use docker-compose (hyphenated), not docker compose.
- Critical backup sets: HA config+data, Zigbee2MQTT config+db+network key, Mosquitto config+persistence, SLZB-06U coordinator state.
Runtime Path Conventions
/opt/homelab/ layout on each node:
- data/<service>/: persistent volumes
- config/<service>/: secrets and host-local overrides (not in Git)
- logs/<service>/: service logs
- state/: deployment stage markers, agent heartbeats
- events/: append-only event store
- world/: Observer output (synthesized state)
- actions/: pending / approved / running / completed / failed
Definition of Done (services)
Before any new or changed service is considered ready:
- docker build + smoke run: build the image locally and run it for a few seconds; confirm the process starts its main loop without crashing. This catches packaging/import errors (e.g. ModuleNotFoundError) before they reach a node.
- pytest: run the service's test suite. If no tests exist yet, add a minimal one (at minimum: import passes, core logic has at least one case). Tests live in services/<service>/tests/.
- Never commit or deploy code that has never been run. If a smoke run or test fails, fix it first.
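One way to automate the smoke run is a helper that starts the process, waits a few seconds, and reports whether it is still alive. This is a sketch under assumptions: `smoke_run` is a hypothetical helper, not something the repo provides, and it treats any early exit (even with code 0) as a failure, which suits long-running services but not one-shot jobs.

```python
import subprocess


def smoke_run(cmd: list, seconds: float = 3.0) -> bool:
    """Start cmd and return True if it is still running after `seconds`.

    A process that dies during startup (for example on ModuleNotFoundError)
    exits before the window closes and yields False.
    """
    proc = subprocess.Popen(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    try:
        proc.wait(timeout=seconds)
    except subprocess.TimeoutExpired:
        return True  # still alive after the window: startup succeeded
    finally:
        # Clean up the process if it is still running.
        if proc.poll() is None:
            proc.terminate()
            proc.wait()
    return False  # exited inside the window: treat as a startup crash
```

After a local docker build, a call might look like smoke_run(["docker", "run", "--rm", "my-service:dev"], seconds=5), where the image tag is a placeholder. The same helper can be called from a pytest case so both checks run together.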
Naming Conventions
- Hosts: ALL CAPS (SATURN, PIHA)
- Services: kebab-case (stability-agent, zigbee2mqtt)
- Container names must match service names
- Always restart: unless-stopped unless service.yaml says otherwise
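These conventions are easy to lint. The sketch below is one possible check; the regular expressions are an interpretation of "ALL CAPS" and "kebab-case" (hyphens are allowed in host names because the node table includes CHELSTY-INFRA), and the function names are hypothetical.

```python
import re

# Interpretation of the conventions above, not patterns taken from the repo.
HOST_RE = re.compile(r"^[A-Z0-9]+(-[A-Z0-9]+)*$")
SERVICE_RE = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")


def is_valid_host(name: str) -> bool:
    """ALL CAPS host name, optionally hyphen-separated (e.g. CHELSTY-INFRA)."""
    return HOST_RE.match(name) is not None


def is_valid_service(name: str) -> bool:
    """kebab-case service name (e.g. stability-agent, zigbee2mqtt)."""
    return SERVICE_RE.match(name) is not None


def naming_problems(service: str, container_name: str) -> list:
    """Return human-readable problems for a service/container name pair."""
    problems = []
    if not is_valid_service(service):
        problems.append(f"service name {service!r} is not kebab-case")
    if container_name != service:
        problems.append(
            f"container name {container_name!r} does not match service {service!r}"
        )
    return problems
```

An empty list from naming_problems means the pair satisfies both the kebab-case rule and the container-name rule; the restart policy rule is not checked here.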