Oskar Kapala f381023206 docs(claude): add Definition of Done for services (smoke test + pytest)

Lesson from brain-watchdog: code that was never run had a packaging bug
that caused a crash loop in production. New rule: docker build + short
smoke-run + pytest before any commit or deploy.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-06-01 20:38:39 +02:00

10 KiB


CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

What This Repo Is

GitOps-lite orchestration for a distributed homelab. The repo is the source of truth for infrastructure definitions; runtime state lives at /opt/homelab/ on each execution node and is never committed.

Node Roles

Host	Role
SATURN	Primary control node — only node where commits are made
SOLARIA	GPU/compute/AI workloads
PIHA	Infra, monitoring
VPS	Public ingress, reverse proxy, control plane host
CHELSTY-INFRA	LTE edge hypervisor (site: chelsty); Zigbee2MQTT, Mosquitto, stability-agent — offline-first
CHELSTY-HA	LTE Home Assistant VM (site: chelsty); connects to CHELSTY-INFRA MQTT broker — offline-first

All nodes communicate over Tailscale. CHELSTY-INFRA and CHELSTY-HA have an intermittent LTE uplink; their services must never depend on SATURN, VPS, or Forgejo at runtime. Full node capabilities: hosts/<node>/capabilities.yaml.

Deployment

scripts/deploy/deploy.sh                        # fresh deploy on current node
scripts/deploy/deploy.sh --resume              # resume after interruption
scripts/deploy/deploy.sh --stage verify        # specific stage only
scripts/deploy/deploy.sh --service mosquitto   # specific service only
./scripts/deploy/deploy-control-plane.sh --ssh # SATURN/SOLARIA → VPS
./scripts/deploy/deploy-node.sh chelsty-infra  # CHELSTY nodes (individually)
./scripts/bootstrap/prepare-node.sh            # general node bootstrap
./scripts/bootstrap/chelsty-runtime.sh         # CHELSTY-specific bootstrap

Pipeline stages: prepare → validate → deploy → verify → diagnose (on failure) → complete. Stage state persisted in /opt/homelab/state/deploy/.

Service Structure

Every service must follow this layout:

services/<service>/
├── docker-compose.yml
├── service.yaml       # Machine-readable contract (primary source of truth for agents)
├── README.md
├── env.example        # Template — never commit actual secrets
└── healthcheck.sh     # Returns 0 (healthy) or 1 (unhealthy)

service.yaml defines owner_node, exposure, dependencies, healthcheck, restart_policy, persistence.paths, and runtime.env_vars. This is what AI agents read to understand how to manage a service.
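As a rough illustration of how an agent might check this contract, the sketch below validates an already-parsed service.yaml against the fields listed above. Only the field names come from this document; the function name `missing_contract_fields` and the example values are hypothetical.

```python
# Hypothetical sketch: check a parsed service.yaml for the documented fields.
# Field names come from this document; everything else is illustrative.
REQUIRED_FIELDS = [
    "owner_node",
    "exposure",
    "dependencies",
    "healthcheck",
    "restart_policy",
    "persistence.paths",
    "runtime.env_vars",
]


def missing_contract_fields(contract: dict) -> list[str]:
    """Return the documented fields that are absent from a parsed service.yaml."""
    missing = []
    for dotted in REQUIRED_FIELDS:
        node = contract
        for key in dotted.split("."):
            if not isinstance(node, dict) or key not in node:
                missing.append(dotted)
                break
            node = node[key]
    return missing


# Example values are made up for illustration.
example = {
    "owner_node": "CHELSTY-INFRA",
    "exposure": "internal",
    "dependencies": [],
    "healthcheck": "./healthcheck.sh",
    "restart_policy": "unless-stopped",
    "persistence": {"paths": ["data/mosquitto"]},
    "runtime": {},  # env_vars deliberately left out
}
print(missing_contract_fields(example))  # ['runtime.env_vars']
```

In practice the dict would come from a YAML parser; it is built by hand here so the sketch needs only the standard library.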

Host-specific runtime config and secrets live at /opt/homelab/config/<service>/ on the target node (not in Git). Docker Compose overrides are version-controlled at hosts/<node>/runtime/<service>/docker-compose.override.yml in this repo and applied during deployment.

Agent System Architecture

The platform uses a multi-agent model with human-in-the-loop for destructive actions:

Stability Agent (services/stability-agent/) — Per-node watchdog. Monitors Docker containers, disk, Tailscale, MQTT. Emits filesystem events. Does NOT restart services autonomously.
Observer (services/control-plane/src/) — Synthesizes world state from events into /opt/homelab/world/{nodes,services,deployments,incidents}.json.
Supervisor — Detects drift between desired state (from hosts/*/services.yaml) and actual state (from Observer output). Writes pending action JSON files.
Executor — Executes actions only after they transition to approved.
Operator UI + Telegram Bot — Operators review and approve/reject pending actions.

Action approval flow

Agent → /opt/homelab/actions/pending/<id>.json
      → Telegram notification → Operator approves
      → /opt/homelab/actions/approved/<id>.json
      → Executor runs → completed / failed

Agents must never execute destructive actions (restarts, deploys, config changes) without a corresponding approved action file.
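The guard implied by this rule can be sketched as follows. This is a minimal illustration, assuming approval is represented by moving the file from pending/ to approved/; the function names `approve` and `may_execute` and the JSON payload are hypothetical and not taken from the Executor's code.

```python
import json
import tempfile
from pathlib import Path


def approve(actions_root: Path, action_id: str) -> Path:
    """Operator step (assumed): move pending/<id>.json to approved/<id>.json."""
    src = actions_root / "pending" / f"{action_id}.json"
    dst = actions_root / "approved" / f"{action_id}.json"
    dst.parent.mkdir(parents=True, exist_ok=True)
    src.rename(dst)
    return dst


def may_execute(actions_root: Path, action_id: str) -> bool:
    """Executor guard: act only if an approved action file exists."""
    return (actions_root / "approved" / f"{action_id}.json").is_file()


# Demo against a temporary directory instead of /opt/homelab/actions.
root = Path(tempfile.mkdtemp())
(root / "pending").mkdir()
(root / "pending" / "a1.json").write_text(json.dumps({"type": "container_restart"}))

before = may_execute(root, "a1")  # still pending, so not executable
approve(root, "a1")
after = may_execute(root, "a1")   # approved file now exists
print(before, after)              # False True
```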

Event System

Events are append-only JSON lines at /opt/homelab/events/YYYY-MM-DD/<node>/events.jsonl.

Emit via scripts/lib/events.sh (shell) or scripts/lib/events.py (Python).

Normalized event types: deployment_started/completed/failed, service_unhealthy/recovered, node_offline/online, healthcheck_failed, remediation_started/completed.
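To make the path layout concrete, here is a minimal sketch of an append-only emitter. It is not the contents of scripts/lib/events.py: the function name `emit_event` and the record fields (`ts`, `node`, `type`) are assumptions chosen for illustration, and only the directory layout and event type names come from this document.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def emit_event(events_root: Path, node: str, event_type: str,
               now: Optional[datetime] = None, **fields) -> Path:
    """Append one JSON line to <events_root>/YYYY-MM-DD/<node>/events.jsonl."""
    now = now or datetime.now(timezone.utc)
    path = events_root / now.strftime("%Y-%m-%d") / node / "events.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    # Record schema is hypothetical; the real emitter may use other keys.
    record = {"ts": now.isoformat(), "node": node, "type": event_type, **fields}
    with path.open("a", encoding="utf-8") as fh:  # "a" never rewrites old lines
        fh.write(json.dumps(record) + "\n")
    return path


root = Path(tempfile.mkdtemp())  # stand-in for /opt/homelab/events
when = datetime(2026, 6, 1, 12, 0, tzinfo=timezone.utc)
emit_event(root, "PIHA", "service_unhealthy", now=when, service="mosquitto")
log = emit_event(root, "PIHA", "service_recovered", now=when, service="mosquitto")
lines = log.read_text(encoding="utf-8").splitlines()
print(log.relative_to(root), len(lines))
```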

Supervisor event routing table

Event type	Source	Action generated	Cooldown
`containers_not_running`	stability-agent	`container_restart`	dedup via stable ID
`mqtt_unreachable`	stability-agent	`container_restart`	dedup via stable ID
`service_unhealthy` / other	stability-agent	`redeploy`	dedup via stable ID
`disk_pressure` (high)	stability-agent	`disk_cleanup`	dedup via stable ID
`ha_websocket_dead`	ha-diag-agent	`container_restart` (homeassistant)	30 min after completion
`ha_websocket_recovered`	ha-diag-agent	cancels matching restart	—
`ha_integration_failed`	ha-diag-agent	`alert_only`	1 hour
`ha_entity_unavailable_long`	ha-diag-agent	`alert_only`	1 hour
`ha_automation_failing`	ha-diag-agent	`alert_only`	1 hour
`ha_update_available`	ha-diag-agent	`alert_only`	1 hour
`ha_recorder_lag`	ha-diag-agent	`alert_only`	1 hour
`ha_system_health_degraded`	ha-diag-agent	`alert_only`	1 hour

HA events are routed directly from the events directory by the supervisor (not via world-state drift loop) to avoid conflicts with stability-agent's independent container health tracking. HA events are suppressed if homeassistant had a containers_not_running incident within the last 5 minutes (planned restart/update in progress).
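The suppression rule can be expressed as a small predicate. This is a sketch under stated assumptions, not the supervisor's code: the function name `should_route_ha_event` is hypothetical, and treating an event at exactly 5 minutes as still suppressed is an arbitrary choice.

```python
from datetime import datetime, timedelta

SUPPRESSION_WINDOW = timedelta(minutes=5)  # from the rule above


def should_route_ha_event(event_time, last_containers_not_running):
    """Return False while homeassistant is inside the suppression window.

    last_containers_not_running is the time of the most recent
    containers_not_running incident for homeassistant, or None if there is none.
    """
    if last_containers_not_running is None:
        return True
    return event_time - last_containers_not_running > SUPPRESSION_WINDOW


incident = datetime(2026, 6, 1, 12, 0, 0)
print(should_route_ha_event(datetime(2026, 6, 1, 12, 3, 0), incident))   # False
print(should_route_ha_event(datetime(2026, 6, 1, 12, 10, 0), incident))  # True
```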

Discovery Entry Points for Agents

When exploring the system, use these files in order:

inventory/topology.yaml — node list, roles, mesh type
hosts/<node>/capabilities.yaml — hardware and software constraints
hosts/<node>/services.yaml — desired services and exposure classes for that host
services/<service>/service.yaml — operational contract for a service
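For an agent that resolves these paths programmatically, the order above might be encoded like this. The helper name `discovery_paths` is hypothetical; the paths are the ones listed above, with lowercase directory names as used elsewhere in this file (for example hosts/vps/).

```python
def discovery_paths(node: str, service: str) -> list[str]:
    """Return the discovery files in the order given above."""
    return [
        "inventory/topology.yaml",
        f"hosts/{node}/capabilities.yaml",
        f"hosts/{node}/services.yaml",
        f"services/{service}/service.yaml",
    ]


paths = discovery_paths("vps", "npm")
print("\n".join(paths))
```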

VPS-Specific Rules

VPS has 4 GiB RAM, no swap. Every repo-managed service must declare memory limits in its hosts/vps/runtime/<service>/docker-compose.override.yml.

Memory limit convention

Use service-level Compose properties (not deploy.resources.limits, which requires Swarm mode):

services:
  myservice:
    mem_limit: 256m      # cgroup ceiling; container is OOM-killed on breach, then restarted per its restart policy
    oom_score_adj: -900  # host kernel OOM killer strongly deprioritizes this container

Rules:

Control-plane containers (executor, observer, supervisor, operator-ui), node-agent, stability-agent: always set oom_score_adj: -900 — these must never be a system-level OOM victim.
mem_limit still applies even with oom_score_adj: -900; the cgroup OOM killer is independent of the host OOM killer and kills the container when the limit is exceeded, after which Docker restarts it according to its restart policy.
Budget: OS+Docker reserves ~800 MiB; sum of all mem_limit values must stay ≤ 3200 MiB (3.1 GiB).
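The budget rule is simple arithmetic and could be checked before deployment. The sketch below is one possible way to do that; the function names and the example limits are made up, and the parser handles only simple sizes such as 256m or 1g rather than every format Compose accepts.

```python
import re

BUDGET_MIB = 3200  # ceiling for the sum of all mem_limit values (rule above)
_UNITS = {"k": 1 / 1024, "m": 1, "g": 1024}


def to_mib(limit: str) -> float:
    """Convert a simple Compose-style size such as '256m' or '1g' to MiB."""
    match = re.fullmatch(r"(\d+)([kmg])b?", limit.strip().lower())
    if not match:
        raise ValueError(f"unsupported mem_limit: {limit!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def check_budget(limits: dict) -> float:
    """Return the total in MiB, or raise if it exceeds the VPS budget."""
    total = sum(to_mib(value) for value in limits.values())
    if total > BUDGET_MIB:
        raise ValueError(f"mem_limit total {total} MiB exceeds {BUDGET_MIB} MiB")
    return total


# Made-up limits for illustration; real values live in the override files.
limits = {"npm": "256m", "outline": "1g", "joplin": "512m", "supervisor": "128m"}
print(check_budget(limits))  # 1920
```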

Repo-managed services on VPS

All VPS services are now GitOps-managed. Service definitions live in services/<name>/docker-compose.yml; host-specific overrides (mem_limit, env) live in hosts/vps/runtime/<name>/docker-compose.override.yml.

Service	Compose stack	Data path
npm	`services/npm/`	`/home/dockeruser/docker/npm/{data,letsencrypt}` (bind mount)
outline	`services/outline/`	Docker named volumes: `outline_outline_storage`, `outline_postgres_data`, `outline_redis_data`
joplin	`services/joplin/`	Docker named volume: `joplin_postgres_data`
ai-cluster	`services/ai-cluster/`	Mosquitto config bind: `/home/dockeruser/docker/ai-cluster/mosquitto/`

Data migration rule: data paths stay in place at cutover. Never move volumes or bind-mount sources without a dedicated migration plan.

Cutover checklist (before running docker compose up for any migrated service):

git pull on VPS
Populate /opt/homelab/config/<service>/.env from the env.example template
For ai-cluster: copy /home/dockeruser/docker/ai-cluster/.env to /opt/homelab/config/ai-cluster/.env
For mosquitto: config stays at old bind path until explicitly migrated
Verify named volumes exist: docker volume ls | grep <project>

ai-cluster architectural note: compute workloads (codex-worker, planner-worker) belong on SOLARIA (GPU/compute node), not the 4 GiB ingress VPS. Migrate when feasible; for now, hard mem_limit values contain the blast radius.

CHELSTY-Specific Rules

Zigbee coordinator is SLZB-06U over TCP (192.168.1.105:6638, ezsp adapter). Never use /dev/ttyUSB0.
CHELSTY nodes run docker-compose v1 (1.29.2) — use docker-compose (hyphenated), not docker compose.
Critical backup sets: HA config+data, Zigbee2MQTT config+db+network key, Mosquitto config+persistence, SLZB-06U coordinator state.

Runtime Path Conventions

/opt/homelab/ layout on each node:

data/<service>/ — persistent volumes
config/<service>/ — secrets and host-local overrides (not in Git)
logs/<service>/ — service logs
state/ — deployment stage markers, agent heartbeats
events/ — append-only event store
world/ — Observer output (synthesized state)
actions/ — pending / approved / running / completed / failed

Definition of Done (services)

Before any new or changed service is considered ready:

docker build + smoke run — build the image locally and run it for a few seconds; confirm the process starts its main loop without crashing. This catches packaging/import errors (e.g. ModuleNotFoundError) before they reach a node.
pytest — run the service's test suite. If no tests exist yet, add a minimal one (at minimum: import passes, core logic has at least one case). Tests live in services/<service>/tests/.
Never commit or deploy code that has never been run. If a smoke run or test fails, fix it first.
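A smoke run can itself be automated as a test. The sketch below is one possible shape, not a prescribed implementation: `smoke_run` is a hypothetical helper, and the two commands are Python stand-ins for a real service command (for example a docker run invocation of the freshly built image).

```python
import subprocess
import sys
import time


def smoke_run(cmd: list[str], seconds: float = 5.0) -> bool:
    """Start cmd, wait, and report whether it is still running (did not crash)."""
    proc = subprocess.Popen(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    try:
        time.sleep(seconds)
        return proc.poll() is None  # None means the process has not exited
    finally:
        if proc.poll() is None:
            proc.terminate()
        proc.wait()


# Stand-ins: one process that keeps running, one that dies on a bad import.
healthy = [sys.executable, "-c", "import time; time.sleep(30)"]
broken = [sys.executable, "-c", "import module_that_does_not_exist"]


def test_smoke_run_detects_crash():
    assert smoke_run(healthy, seconds=2.0) is True
    assert smoke_run(broken, seconds=2.0) is False
```

A file like this could sit in services/<service>/tests/ next to the unit tests, so that pytest covers both the import check and the short run.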

Naming Conventions

Hosts: ALL CAPS (SATURN, PIHA)
Services: kebab-case (stability-agent, zigbee2mqtt)
Container names must match service names
Restart policy is always restart: unless-stopped, unless service.yaml says otherwise
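The first two conventions are easy to lint. The patterns below are an assumption about what counts as valid: they allow digits and hyphenated host names such as CHELSTY-INFRA, which this document uses but does not formally specify, and the function names are hypothetical.

```python
import re

# Assumed patterns, inferred from the names used in this document.
HOST_RE = re.compile(r"[A-Z0-9]+(-[A-Z0-9]+)*")     # SATURN, CHELSTY-INFRA
SERVICE_RE = re.compile(r"[a-z0-9]+(-[a-z0-9]+)*")  # stability-agent, zigbee2mqtt


def is_valid_host(name: str) -> bool:
    """True if the name is ALL CAPS, optionally hyphen-separated."""
    return HOST_RE.fullmatch(name) is not None


def is_valid_service(name: str) -> bool:
    """True if the name is kebab-case."""
    return SERVICE_RE.fullmatch(name) is not None


print(is_valid_host("CHELSTY-INFRA"), is_valid_host("saturn"))
print(is_valid_service("zigbee2mqtt"), is_valid_service("Stability_Agent"))
```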
