# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## What This Repo Is

GitOps-lite orchestration for a distributed homelab. The repo is the source of truth for infrastructure definitions; runtime state lives at `/opt/homelab/` on each execution node and is never committed.

## Node Roles

| Host | Role |
|------|------|
| **SATURN** | Primary control node — only node where commits are made |
| **SOLARIA** | GPU/compute/AI workloads |
| **PIHA** | Infra, monitoring |
| **VPS** | Public ingress, reverse proxy, control plane host |
| **CHELSTY-INFRA** | LTE edge hypervisor (site: chelsty); Zigbee2MQTT, Mosquitto, stability-agent — offline-first |
| **CHELSTY-HA** | LTE Home Assistant VM (site: chelsty); connects to CHELSTY-INFRA MQTT broker — offline-first |

All nodes communicate over Tailscale. CHELSTY-INFRA and CHELSTY-HA have an intermittent LTE uplink; their services must never depend on SATURN, VPS, or Forgejo at runtime. Full node capabilities: `hosts/<node>/capabilities.yaml`.

## Deployment

```bash
scripts/deploy/deploy.sh                        # fresh deploy on current node
scripts/deploy/deploy.sh --resume              # resume after interruption
scripts/deploy/deploy.sh --stage verify        # specific stage only
scripts/deploy/deploy.sh --service mosquitto   # specific service only
./scripts/deploy/deploy-control-plane.sh --ssh # SATURN/SOLARIA → VPS
./scripts/deploy/deploy-node.sh chelsty-infra  # CHELSTY nodes (individually)
./scripts/bootstrap/prepare-node.sh            # general node bootstrap
./scripts/bootstrap/chelsty-runtime.sh         # CHELSTY-specific bootstrap
```

Pipeline stages: **prepare → validate → deploy → verify → diagnose (on failure) → complete**. Stage state persisted in `/opt/homelab/state/deploy/`.
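
Stage markers are what make `--resume` possible. Below is a minimal sketch that assumes one empty `<stage>.done` file per finished stage under the state directory. The marker format, `run_pipeline`, and the runner callback are hypothetical and are not taken from `deploy.sh`:

```bash
# Sketch of resumable stage markers; NOT deploy.sh's actual implementation.
# Assumes one empty "<stage>.done" file per finished stage (hypothetical format).
STATE_DIR="${STATE_DIR:-/opt/homelab/state/deploy}"
STAGES=(prepare validate deploy verify)

# run_pipeline <runner> [--resume]
# <runner> is any command called as "<runner> <stage>"; non-zero means failure.
run_pipeline() {
  local runner="$1" mode="${2:-}" stage
  mkdir -p "$STATE_DIR"
  # A fresh run forgets old progress; --resume keeps finished stages.
  [[ "$mode" == "--resume" ]] || rm -f "$STATE_DIR"/*.done
  for stage in "${STAGES[@]}"; do
    if [[ -e "$STATE_DIR/$stage.done" ]]; then
      echo "skip $stage (already done)"
      continue
    fi
    if "$runner" "$stage"; then
      touch "$STATE_DIR/$stage.done"
    else
      # On failure, run the diagnose stage and stop; finished markers survive.
      echo "stage $stage failed, running diagnose" >&2
      "$runner" diagnose || true
      return 1
    fi
  done
  touch "$STATE_DIR/complete.done"
}
```

With this in place, `run_pipeline my_stage_runner --resume` skips every stage that already has a marker and restarts at the first one that does not.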

## Service Structure

Every service must follow this layout:

```
services/<service>/
├── docker-compose.yml
├── service.yaml       # Machine-readable contract (primary source of truth for agents)
├── README.md
├── env.example        # Template — never commit actual secrets
└── healthcheck.sh     # Returns 0 (healthy) or 1 (unhealthy)
```
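
A very small script can satisfy the `healthcheck.sh` contract (exit 0 when healthy, 1 when unhealthy). The sketch below targets a TCP service such as Mosquitto and assumes the service only needs its port to accept connections. `HC_HOST`, `HC_PORT` and `HC_TIMEOUT` are hypothetical variable names. The body is wrapped in a function so it can be tested:

```bash
# Sketch of a healthcheck.sh body for a TCP service (default: MQTT on 1883).
# HC_HOST, HC_PORT and HC_TIMEOUT are hypothetical variable names.
healthcheck() {
  local host="${HC_HOST:-127.0.0.1}" port="${HC_PORT:-1883}" secs="${HC_TIMEOUT:-3}"
  # bash's /dev/tcp pseudo-device opens a TCP connection; `timeout` bounds it.
  if timeout "$secs" bash -c ": </dev/tcp/$host/$port" 2>/dev/null; then
    return 0   # port accepted the connection: healthy
  fi
  return 1     # refused, unreachable, or timed out: unhealthy
}
```

A real `healthcheck.sh` would end with `healthcheck; exit $?`. An open port is a weak signal, so services with an HTTP health endpoint may warrant a stronger probe.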

`service.yaml` defines `owner_node`, `exposure`, `dependencies`, `healthcheck`, `restart_policy`, `persistence.paths`, and `runtime.env_vars`. This is what AI agents read to understand how to manage a service.
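
Both the layout and the `service.yaml` fields can be checked with a cheap lint. The sketch below assumes the listed keys appear as top-level YAML keys, so `persistence.paths` and `runtime.env_vars` are only checked through their parent keys. `check_service` is a hypothetical helper, and it uses grep rather than a YAML parser:

```bash
# Sketch: verify the required files exist and service.yaml has the top-level
# keys named above. grep-based, so it does not validate YAML structure.
REQUIRED_FILES=(docker-compose.yml service.yaml README.md env.example healthcheck.sh)
REQUIRED_KEYS=(owner_node exposure dependencies healthcheck restart_policy persistence runtime)

check_service() {  # check_service <services/<service> dir>
  local dir="$1" f k missing=0
  for f in "${REQUIRED_FILES[@]}"; do
    [[ -e "$dir/$f" ]] || { echo "missing file: $dir/$f" >&2; missing=1; }
  done
  [[ -f "$dir/service.yaml" ]] || return 1
  for k in "${REQUIRED_KEYS[@]}"; do
    # A top-level key starts at column 0 and is followed by a colon.
    grep -q "^${k}:" "$dir/service.yaml" || { echo "missing key: $k" >&2; missing=1; }
  done
  return "$missing"
}
```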

Host-specific runtime config and secrets live at `/opt/homelab/config/<service>/` on the target node (not in Git). Docker Compose overrides are version-controlled at `hosts/<node>/runtime/<service>/docker-compose.override.yml` in this repo and applied during deployment.
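
The base file, the host override and the host-local `.env` could be combined into one Compose invocation as sketched below. `compose_args`, `REPO_ROOT` and `CONFIG_ROOT` are hypothetical. Using the service name as the project name (`-p`) is an assumption, inferred from volume names such as `outline_postgres_data`:

```bash
# Sketch: print one Compose argument per line for <node> <service>.
compose_args() {
  local node="$1" svc="$2"
  local repo="${REPO_ROOT:-.}" cfg="${CONFIG_ROOT:-/opt/homelab/config}"
  local base="$repo/services/$svc/docker-compose.yml"
  local override="$repo/hosts/$node/runtime/$svc/docker-compose.override.yml"
  local args=(-p "$svc" -f "$base")
  # Later -f files override earlier ones, so the host override goes last.
  [[ -f "$override" ]] && args+=(-f "$override")
  # Host-local secrets stay outside Git under /opt/homelab/config/<service>/.
  [[ -f "$cfg/$svc/.env" ]] && args+=(--env-file "$cfg/$svc/.env")
  printf '%s\n' "${args[@]}"
}
```

On the VPS: `mapfile -t args < <(compose_args vps outline); docker compose "${args[@]}" up -d`. `--env-file` only feeds `${VAR}` interpolation in the Compose files. A container sees a variable only if the Compose file passes it through, for example via `environment:` or `env_file:`.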

## Agent System Architecture

The platform uses a multi-agent model with **human-in-the-loop** for destructive actions:

1. **Stability Agent** (`services/stability-agent/`) — Per-node watchdog. Monitors Docker containers, disk, Tailscale, MQTT. Emits filesystem events. Does NOT restart services autonomously.
2. **Observer** (`services/control-plane/src/`) — Synthesizes world state from events into `/opt/homelab/world/{nodes,services,deployments,incidents}.json`.
3. **Supervisor** — Detects drift between desired state (from `hosts/*/services.yaml`) and actual state (from Observer output). Writes `pending` action JSON files.
4. **Executor** — Executes actions only after they transition to `approved`.
5. **Operator UI** + **Telegram Bot** — Operators review and approve/reject pending actions.

### Action approval flow
```
Agent → /opt/homelab/actions/pending/<id>.json
      → Telegram notification → Operator approves
      → /opt/homelab/actions/approved/<id>.json
      → Executor runs → completed / failed
```

Agents must never execute destructive actions (restarts, deploys, config changes) without a corresponding approved action file.
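
The approval flow maps onto atomic renames between state directories. The minimal sketch below assumes one `<id>.json` per action and the `pending/approved/running/completed/failed` directories listed under Runtime Path Conventions. `transition`, `approve` and `execute` are hypothetical helpers, not the control plane's actual code:

```bash
# Sketch of file-based action states. mv within one filesystem is a rename(2),
# so an action file is never visible in two state directories at once.
ACTIONS_ROOT="${ACTIONS_ROOT:-/opt/homelab/actions}"

transition() {  # transition <id> <from> <to>
  local id="$1" from="$2" to="$3"
  local src="$ACTIONS_ROOT/$from/$id.json" dst="$ACTIONS_ROOT/$to/$id.json"
  [[ -f "$src" ]] || { echo "no $from action $id" >&2; return 1; }
  [[ -e "$dst" ]] && { echo "$id already in $to" >&2; return 1; }
  mkdir -p "$ACTIONS_ROOT/$to"
  mv "$src" "$dst"
}

approve() { transition "$1" pending approved; }   # operator side

# Executor side: anything not in approved/ is refused before it runs.
execute() {  # execute <id> <command...>
  local id="$1"; shift
  transition "$id" approved running || return 1
  if "$@"; then
    transition "$id" running completed
  else
    transition "$id" running failed
    return 1
  fi
}
```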

## Event System

Events are append-only JSON lines at `/opt/homelab/events/YYYY-MM-DD/<node>/events.jsonl`.

Emit via `scripts/lib/events.sh` (shell) or `scripts/lib/events.py` (Python).

Normalized event types: `deployment_started/completed/failed`, `service_unhealthy/recovered`, `node_offline/online`, `healthcheck_failed`, `remediation_started/completed`.
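
The sketch below shows what appending one event to that layout involves. It is not the `events.sh` API. `emit_event_sketch`, `EVENTS_ROOT`, the UTC date directory and the JSON field names are all assumptions:

```bash
# Sketch only: NOT events.sh. Function name and JSON fields are hypothetical.
EVENTS_ROOT="${EVENTS_ROOT:-/opt/homelab/events}"

emit_event_sketch() {  # emit_event_sketch <node> <type> <service>
  local node="$1" type="$2" service="$3" dir
  dir="$EVENTS_ROOT/$(date -u +%Y-%m-%d)/$node"   # assumes UTC day directories
  mkdir -p "$dir"
  # Inputs are assumed to be plain identifiers, so no JSON escaping is done.
  printf '{"ts":"%s","node":"%s","type":"%s","service":"%s"}\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$node" "$type" "$service" >> "$dir/events.jsonl"
}
```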

### Supervisor event routing table

| Event type | Source | Action generated | Cooldown |
|---|---|---|---|
| `containers_not_running` | stability-agent | `container_restart` | dedup via stable ID |
| `mqtt_unreachable` | stability-agent | `container_restart` | dedup via stable ID |
| `service_unhealthy` / other | stability-agent | `redeploy` | dedup via stable ID |
| `disk_pressure` (high) | stability-agent | `disk_cleanup` | dedup via stable ID |
| `ha_websocket_dead` | ha-diag-agent | `container_restart` (homeassistant) | 30 min after completion |
| `ha_websocket_recovered` | ha-diag-agent | cancels matching restart | — |
| `ha_integration_failed` | ha-diag-agent | `alert_only` | 1 hour |
| `ha_entity_unavailable_long` | ha-diag-agent | `alert_only` | 1 hour |
| `ha_automation_failing` | ha-diag-agent | `alert_only` | 1 hour |
| `ha_update_available` | ha-diag-agent | `alert_only` | 1 hour |
| `ha_recorder_lag` | ha-diag-agent | `alert_only` | 1 hour |
| `ha_system_health_degraded` | ha-diag-agent | `alert_only` | 1 hour |

HA events are routed directly from the events directory by the supervisor (not via the world-state drift loop) to avoid conflicts with stability-agent's independent container health tracking. HA events are suppressed if `homeassistant` had a `containers_not_running` incident within the last 5 minutes, which is treated as a planned restart or update in progress.
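
The routing table and the 5-minute suppression window fit into two small functions. The sketch below prints `<action> <cooldown_seconds>`, using `dedup` where the table relies on a stable ID instead of a timer. `route_event`, `suppress_ha_event` and the `cancel_restart` label are hypothetical, and the sketch leaves out the `(high)` severity filter for `disk_pressure`:

```bash
# Sketch of the routing table as a lookup: prints "<action> <cooldown_s>".
route_event() {
  case "$1" in
    containers_not_running|mqtt_unreachable) echo "container_restart dedup" ;;
    disk_pressure)          echo "disk_cleanup dedup" ;;      # severity check omitted
    ha_websocket_dead)      echo "container_restart 1800" ;;  # 30 min after completion
    ha_websocket_recovered) echo "cancel_restart -" ;;        # hypothetical label
    ha_integration_failed|ha_entity_unavailable_long|ha_automation_failing)
                            echo "alert_only 3600" ;;
    ha_update_available|ha_recorder_lag|ha_system_health_degraded)
                            echo "alert_only 3600" ;;
    *)                      echo "redeploy dedup" ;;          # service_unhealthy / other
  esac
}

# Succeeds (suppress) if homeassistant's last containers_not_running incident
# was less than 300 s ago. Both arguments are Unix epoch seconds.
suppress_ha_event() {  # suppress_ha_event <last_incident_epoch> <now_epoch>
  local last="$1" now="$2"
  (( now - last < 300 ))
}
```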

## Discovery Entry Points for Agents

When exploring the system, use these files in order:
1. `inventory/topology.yaml` — node list, roles, mesh type
2. `hosts/<node>/capabilities.yaml` — hardware and software constraints
3. `hosts/<node>/services.yaml` — desired services and exposure classes for that host
4. `services/<service>/service.yaml` — operational contract for a service
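
The sketch below prints those four files for one node and service, in the order above, and flags any that are missing. `discovery_order` is hypothetical. It assumes host directories use lowercase node names, as in `hosts/vps/`:

```bash
# Sketch: list discovery files for <node> <service> in reading order.
discovery_order() {
  local repo="${REPO_ROOT:-.}" node="$1" svc="$2" f
  for f in inventory/topology.yaml \
           "hosts/$node/capabilities.yaml" \
           "hosts/$node/services.yaml" \
           "services/$svc/service.yaml"; do
    if [[ -f "$repo/$f" ]]; then echo "ok      $f"; else echo "MISSING $f"; fi
  done
}
```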

## VPS-Specific Rules

VPS has **4 GiB RAM, no swap**. Every repo-managed service must declare memory limits in its `hosts/vps/runtime/<service>/docker-compose.override.yml`.

### Memory limit convention

Use top-level Compose properties (not `deploy.resources.limits`, which requires Swarm mode):

```yaml
services:
  myservice:
    mem_limit: 256m      # cgroup ceiling; on breach the kernel OOM-kills inside the container
    oom_score_adj: -900  # host OOM killer strongly avoids this container (only -1000 exempts it)
```

Rules:
- **Control-plane containers** (executor, observer, supervisor, operator-ui), **node-agent**, **stability-agent**: always set `oom_score_adj: -900` — these must never be a system-level OOM victim.
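- `mem_limit` still applies even with `oom_score_adj: -900`. The cgroup OOM killer works independently of the host OOM killer and kills a process inside the container when the limit is exceeded. If that process is the main one, Docker restarts the container according to its restart policy.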
- Budget: OS+Docker reserves ~800 MiB; sum of all `mem_limit` values must stay ≤ 3200 MiB (3.1 GiB).
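
The budget works out as follows: 4 GiB is 4096 MiB, and reserving ~800 MiB for the OS and Docker leaves ~3296 MiB, so the 3200 MiB cap keeps roughly 96 MiB of slack. The sketch below sums the declared limits. `to_mib` and `vps_mem_budget` are hypothetical, and the grep-based extraction assumes one `mem_limit:` per line with integer values:

```bash
# Sketch: convert a Compose byte value to MiB. Accepts integers with optional
# b, k/kb, m/mb, g/gb suffixes (a bare number is bytes).
to_mib() {
  local v n unit
  v="$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"
  n="${v%%[[:alpha:]]*}"            # numeric prefix
  unit="${v#"$n"}"                  # remaining suffix
  [[ "$n" =~ ^[0-9]+$ ]] || { echo "unsupported value: $1" >&2; return 1; }
  n=$(( 10#$n ))                    # force base 10 (no octal surprises)
  case "$unit" in
    ""|b) echo $(( n / 1048576 )) ;;
    k|kb) echo $(( n / 1024 )) ;;
    m|mb) echo "$n" ;;
    g|gb) echo $(( n * 1024 )) ;;
    *)    echo "unsupported unit: $1" >&2; return 1 ;;
  esac
}

# Sum every mem_limit in the VPS overrides; fails if the total exceeds the cap.
vps_mem_budget() {  # vps_mem_budget [overrides_dir] [limit_mib]
  local dir="${1:-hosts/vps/runtime}" limit="${2:-3200}" total=0 v mib
  while read -r v; do
    [[ -n "$v" ]] || continue
    mib="$(to_mib "$v")" || return 2
    total=$(( total + mib ))
  done < <(grep -h '^[[:space:]]*mem_limit:' "$dir"/*/docker-compose.override.yml 2>/dev/null \
             | sed -e 's/.*mem_limit:[[:space:]]*//' -e 's/[[:space:]]*#.*//' \
             | tr -d "\"'")
  echo "total ${total} MiB / limit ${limit} MiB"
  (( total <= limit ))
}
```

Running `vps_mem_budget` from the repo root before committing a VPS override would catch a budget overrun before it reaches the host.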

### Repo-managed services on VPS

All VPS services are now GitOps-managed. Service definitions live in `services/<name>/docker-compose.yml`; host-specific overrides (mem_limit, env) live in `hosts/vps/runtime/<name>/docker-compose.override.yml`.

| Service | Compose stack | Data path |
|---|---|---|
| npm | `services/npm/` | `/home/dockeruser/docker/npm/{data,letsencrypt}` (bind mount) |
| outline | `services/outline/` | Docker named volumes: `outline_outline_storage`, `outline_postgres_data`, `outline_redis_data` |
| joplin | `services/joplin/` | Docker named volume: `joplin_postgres_data` |
| ai-cluster | `services/ai-cluster/` | Mosquitto config bind: `/home/dockeruser/docker/ai-cluster/mosquitto/` |

**Data migration rule**: data paths stay in place at cutover. Never move volumes or bind-mount sources without a dedicated migration plan.

**Cutover checklist** (before running `docker compose up` for any migrated service):
1. `git pull` on VPS
2. Populate `/opt/homelab/config/<service>/.env` from the `env.example` template
3. For ai-cluster: copy `/home/dockeruser/docker/ai-cluster/.env` to `/opt/homelab/config/ai-cluster/.env`
4. For ai-cluster's Mosquitto: config stays at the old bind path (`/home/dockeruser/docker/ai-cluster/mosquitto/`) until explicitly migrated
5. Verify named volumes exist: `docker volume ls | grep <project>`
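
Steps 2 and 5 are easy to automate before `docker compose up`. In the sketch below, `cutover_preflight` and `CONFIG_ROOT` are hypothetical. Checking volumes first matters because Compose creates a missing (non-external) named volume on `up`, which would silently start the service on empty data:

```bash
# Sketch of a pre-cutover check for one service on the VPS.
cutover_preflight() {  # cutover_preflight <service> [named_volume...]
  local svc="$1"; shift
  local env="${CONFIG_ROOT:-/opt/homelab/config}/$svc/.env" v rc=0
  # Step 2: the host-local .env must exist and be non-empty.
  [[ -s "$env" ]] || { echo "missing or empty $env" >&2; return 1; }
  # Step 5: every named volume must already exist.
  for v in "$@"; do
    docker volume inspect "$v" >/dev/null 2>&1 || { echo "volume $v not found" >&2; rc=1; }
  done
  return "$rc"
}
```

For example: `cutover_preflight outline outline_outline_storage outline_postgres_data outline_redis_data && docker compose up -d`.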

**ai-cluster architectural note**: compute workloads (codex-worker, planner-worker) belong on SOLARIA (GPU/compute node), not the 4 GiB ingress VPS. Migrate when feasible; for now, hard `mem_limit` values contain the blast radius.

## CHELSTY-Specific Rules

- Zigbee coordinator is **SLZB-06U** over TCP (`192.168.1.105:6638`, `ezsp` adapter). Never use `/dev/ttyUSB0`.
- CHELSTY nodes run **docker-compose v1** (1.29.2) — use `docker-compose` (hyphenated), not `docker compose`.
- Critical backup sets: HA config+data, Zigbee2MQTT config+db+network key, Mosquitto config+persistence, SLZB-06U coordinator state.
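
Scripts that run on every node can pick the Compose CLI from the node name. The sketch below assumes every non-CHELSTY node has the v2 `docker compose` plugin. `compose_cli` is hypothetical:

```bash
# Sketch: print the Compose command words for a node name.
compose_cli() {  # compose_cli <node>
  case "$1" in
    chelsty-*|CHELSTY-*) echo "docker-compose" ;;   # v1 standalone (1.29.2)
    *)                   echo "docker compose" ;;   # v2 plugin (assumed)
  esac
}
```

Usage: `read -ra cmd <<< "$(compose_cli chelsty-infra)"; "${cmd[@]}" up -d`.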

## Runtime Path Conventions

`/opt/homelab/` layout on each node:

- `data/<service>/` — persistent volumes
- `config/<service>/` — secrets and host-local overrides (not in Git)
- `logs/<service>/` — service logs
- `state/` — deployment stage markers, agent heartbeats
- `events/` — append-only event store
- `world/` — Observer output (synthesized state)
- `actions/` — pending / approved / running / completed / failed
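
The sketch below creates this layout for one service. `init_runtime` and the `HOMELAB_ROOT` override are hypothetical; the override exists so the sketch can run outside `/opt/homelab`:

```bash
# Sketch: create the per-node runtime tree for one service.
init_runtime() {  # init_runtime <service>
  local root="${HOMELAB_ROOT:-/opt/homelab}" svc="$1" s
  # Per-service directories plus node-wide state/event/world directories.
  mkdir -p "$root"/{data,config,logs}/"$svc" "$root"/{state,events,world}
  for s in pending approved running completed failed; do
    mkdir -p "$root/actions/$s"
  done
}
```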

## Definition of Done (services)

Before any new or changed service is considered ready:

1. **docker build + smoke run** — build the image locally and run it for a few seconds; confirm the process starts its main loop without crashing. This catches packaging/import errors (e.g. `ModuleNotFoundError`) before they reach a node.
2. **pytest** — run the service's test suite. If no tests exist yet, add a minimal one (at minimum: import passes, core logic has at least one case). Tests live in `services/<service>/tests/`.
3. **Never commit or deploy code that has never been run.** If a smoke run or test fails, fix it first.
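
The sketch below covers steps 1 and 2 for one service and is meant to run from the repo root. The `:smoke` tag, the 10-second window, the `definition_of_done` and `smoke_verdict` names, and the rule that an early exit 0 counts as a failure are all assumptions. The sketch relies on GNU `timeout` exiting with 124 when it had to stop the command. Here that means the container was still running its main loop:

```bash
# Interpret the exit status of `timeout <secs> docker run ...`.
smoke_verdict() {  # smoke_verdict <exit_status>
  case "$1" in
    124) return 0 ;;   # still alive when the window closed: pass
    0)   echo "exited early with 0 (expected a long-running loop)" >&2; return 1 ;;
    *)   echo "crashed with status $1" >&2; return 1 ;;
  esac
}

definition_of_done() {  # definition_of_done <service> [seconds]
  local svc="$1" secs="${2:-10}" tag="$1:smoke" rc
  docker build -t "$tag" "services/$svc" || return 1
  timeout "$secs" docker run --rm --name "$svc-smoke" "$tag"
  rc=$?
  docker rm -f "$svc-smoke" >/dev/null 2>&1   # in case the container outlived the CLI
  smoke_verdict "$rc" || return 1
  pytest -q "services/$svc/tests"
}
```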

## Naming Conventions

- Hosts: ALL CAPS (`SATURN`, `PIHA`)
- Services: kebab-case (`stability-agent`, `zigbee2mqtt`)
- Container names must match service names
- Use `restart: unless-stopped` unless `service.yaml` sets a different `restart_policy`
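
These conventions are mechanical enough to lint. The sketch below uses hypothetical helper names. It treats a service as compliant if any `container_name:` in its Compose file equals the directory name, which allows sidecars such as a database to carry other names:

```bash
# Sketch of naming checks. POSIX classes avoid locale-dependent [a-z] ranges.
is_kebab()     { [[ "$1" =~ ^[[:lower:][:digit:]]+(-[[:lower:][:digit:]]+)*$ ]]; }
is_host_name() { [[ "$1" =~ ^[[:upper:][:digit:]]+(-[[:upper:][:digit:]]+)*$ ]]; }

container_name_matches() {  # container_name_matches <services/<service> dir>
  local dir="$1" svc
  svc="$(basename "$dir")"
  # Accept optional single or double quotes around the name.
  grep -Eq "^[[:space:]]*container_name:[[:space:]]*[\"']?${svc}[\"']?[[:space:]]*\$" \
    "$dir/docker-compose.yml"
}
```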