# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## What This Repo Is

GitOps-lite orchestration for a distributed homelab. The repo is the source of truth for infrastructure definitions; runtime state lives at `/opt/homelab/` on each execution node and is never committed.

## Node Roles

| Host | Role |
|------|------|
| **SATURN** | Primary control node — only node where commits are made |
| **SOLARIA** | GPU/compute/AI workloads |
| **PIHA** | Infra, monitoring |
| **VPS** | Public ingress, reverse proxy, control plane host |
| **CHELSTY-INFRA** | LTE edge hypervisor (site: chelsty); Zigbee2MQTT, Mosquitto, stability-agent — offline-first |
| **CHELSTY-HA** | LTE Home Assistant VM (site: chelsty); connects to CHELSTY-INFRA MQTT broker — offline-first |

All nodes communicate over Tailscale. CHELSTY-INFRA and CHELSTY-HA have an intermittent LTE uplink; their services must never depend on SATURN, VPS, or Forgejo at runtime. Full node capabilities: `hosts/<node>/capabilities.yaml`.

## Deployment

```bash
./scripts/deploy/deploy.sh                      # fresh deploy on current node
./scripts/deploy/deploy.sh --resume             # resume after interruption
./scripts/deploy/deploy.sh --stage verify       # specific stage only
./scripts/deploy/deploy.sh --service mosquitto  # specific service only
./scripts/deploy/deploy-control-plane.sh --ssh  # SATURN/SOLARIA → VPS
./scripts/deploy/deploy-node.sh chelsty-infra   # CHELSTY nodes (individually)
./scripts/bootstrap/prepare-node.sh             # general node bootstrap
./scripts/bootstrap/chelsty-runtime.sh          # CHELSTY-specific bootstrap
```

Pipeline stages: **prepare → validate → deploy → verify → diagnose (on failure) → complete**. Stage state is persisted in `/opt/homelab/state/deploy/`.
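
The stage order and the `--resume` behaviour can be sketched in Python. This is an illustration only, not the logic of `deploy.sh`: the `run_pipeline` function, the one-callable-per-stage interface, and the `<stage>.done` marker files are assumptions. Only the stage names and the idea of persisted stage state come from this file.

```python
from pathlib import Path
from typing import Callable, Dict, List

# Happy-path order; "diagnose" only runs when a stage fails.
STAGES = ["prepare", "validate", "deploy", "verify"]


def run_pipeline(
    state_dir: Path,
    handlers: Dict[str, Callable[[], bool]],
    resume: bool = False,
) -> List[str]:
    """Run stages in order and return the stages executed in this call.

    Hypothetical marker scheme: an empty file '<stage>.done' in state_dir
    records that a stage finished successfully.
    """
    state_dir.mkdir(parents=True, exist_ok=True)
    if not resume:
        # Fresh deploy: discard progress from any earlier run.
        for marker in state_dir.glob("*.done"):
            marker.unlink()

    executed: List[str] = []
    for stage in STAGES:
        marker = state_dir / f"{stage}.done"
        if marker.exists():
            continue  # finished before the interruption
        executed.append(stage)
        if not handlers[stage]():
            executed.append("diagnose")
            handlers["diagnose"]()
            return executed
        marker.touch()
    executed.append("complete")
    return executed
```

With this shape, a failed `deploy` stage leaves the `prepare` and `validate` markers in place, so a later call with `resume=True` starts again at `deploy`.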

## Service Structure

Every service must follow this layout:

```
services/<service>/
├── docker-compose.yml
├── service.yaml       # Machine-readable contract (primary source of truth for agents)
├── README.md
├── env.example        # Template — never commit actual secrets
└── healthcheck.sh     # Returns 0 (healthy) or 1 (unhealthy)
```

`service.yaml` defines `owner_node`, `exposure`, `dependencies`, `healthcheck`, `restart_policy`, `persistence.paths`, and `runtime.env_vars`. This is what AI agents read to understand how to manage a service.
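
A minimal sketch of checking a parsed `service.yaml` against the fields listed above. The `missing_contract_fields` helper is hypothetical, the nesting is inferred from the dotted names (`persistence.paths`, `runtime.env_vars`), and the example values are illustrative rather than taken from a real service. It takes an already-parsed mapping so that it needs no YAML library.

```python
from typing import Any, Dict, List

# Field names as listed in this file; dots are read as nesting (assumption).
REQUIRED_FIELDS = [
    "owner_node",
    "exposure",
    "dependencies",
    "healthcheck",
    "restart_policy",
    "persistence.paths",
    "runtime.env_vars",
]


def missing_contract_fields(contract: Dict[str, Any]) -> List[str]:
    """Return the required dotted field names that are absent from contract."""
    missing: List[str] = []
    for dotted in REQUIRED_FIELDS:
        node: Any = contract
        for key in dotted.split("."):
            if not isinstance(node, dict) or key not in node:
                missing.append(dotted)
                break
            node = node[key]
    return missing


example = {
    "owner_node": "chelsty-infra",
    "exposure": "internal",              # illustrative value
    "dependencies": [],
    "healthcheck": "./healthcheck.sh",   # illustrative value
    "restart_policy": "unless-stopped",
    "persistence": {"paths": ["/opt/homelab/data/mosquitto"]},
    "runtime": {},                       # env_vars left out on purpose
}
print(missing_contract_fields(example))  # prints ['runtime.env_vars']
```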

Host-specific runtime config and secrets live at `/opt/homelab/config/<service>/` on the target node (not in Git). Docker Compose overrides are version-controlled at `hosts/<node>/runtime/<service>/docker-compose.override.yml` in this repo and applied during deployment.
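
One way the base Compose file and a host override could be combined is with repeated `-f` flags, where later files override earlier ones. The sketch below only builds the argument list. The `compose_file_args` helper is hypothetical, and whether `deploy.sh` applies overrides this way is not stated in this file.

```python
from pathlib import Path
from typing import List


def compose_file_args(repo_root: Path, node: str, service: str) -> List[str]:
    """Build '-f' arguments: base file first, host override second if present."""
    base = repo_root / "services" / service / "docker-compose.yml"
    override = (
        repo_root / "hosts" / node / "runtime" / service
        / "docker-compose.override.yml"
    )
    args = ["-f", str(base)]
    if override.is_file():
        # Compose merges files left to right, so the override wins.
        args += ["-f", str(override)]
    return args
```

The result can be placed between the Compose command and its subcommand, for example `docker-compose <args> up -d`.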

## Agent System Architecture

The platform uses a multi-agent model with **human-in-the-loop** for destructive actions:

1. **Stability Agent** (`services/stability-agent/`) — Per-node watchdog. Monitors Docker containers, disk, Tailscale, MQTT. Emits filesystem events. Does NOT restart services autonomously.
2. **Observer** (`services/control-plane/src/`) — Synthesizes world state from events into `/opt/homelab/world/{nodes,services,deployments,incidents}.json`.
3. **Supervisor** — Detects drift between desired state (from `hosts/*/services.yaml`) and actual state (from Observer output). Writes `pending` action JSON files.
4. **Executor** — Executes actions only after they transition to `approved`.
5. **Operator UI** + **Telegram Bot** — Operators review and approve/reject pending actions.
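
The Supervisor's drift check can be pictured as a set comparison per node. This is a sketch under assumptions, not the Supervisor's code: the `detect_drift` name, the input shape (node name mapped to a set of service names), and the keys of the returned action records are all hypothetical.

```python
from typing import Dict, List, Set


def detect_drift(
    desired: Dict[str, Set[str]],
    running: Dict[str, Set[str]],
) -> List[Dict[str, str]]:
    """Return one pending action per service whose actual state differs.

    desired: services each node should run (from hosts/*/services.yaml).
    running: services each node does run (from Observer output).
    """
    actions: List[Dict[str, str]] = []
    for node in sorted(set(desired) | set(running)):
        want = desired.get(node, set())
        have = running.get(node, set())
        for service in sorted(want - have):
            actions.append(
                {"node": node, "service": service,
                 "action": "deploy", "status": "pending"}
            )
        for service in sorted(have - want):
            actions.append(
                {"node": node, "service": service,
                 "action": "stop", "status": "pending"}
            )
    return actions
```

Every record starts as `pending`; nothing in this function executes an action, which matches the human-in-the-loop model described above.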

### Action approval flow

```
Agent → /opt/homelab/actions/pending/<id>.json
      → Telegram notification → Operator approves
      → /opt/homelab/actions/approved/<id>.json
      → Executor runs → completed / failed
```

Agents must never execute destructive actions (restarts, deploys, config changes) without a corresponding approved action file.
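
The rule above amounts to a gate in front of every destructive operation. A minimal sketch, assuming that approval is represented by moving `<id>.json` from `pending/` to `approved/`; the `approve` and `load_approved` helpers and the `NotApprovedError` exception are hypothetical names, and the JSON schema of an action is not defined in this file.

```python
import json
import os
from pathlib import Path
from typing import Any, Dict


class NotApprovedError(RuntimeError):
    """Raised when an action has no approved file."""


def approve(actions_dir: Path, action_id: str) -> Path:
    """Move pending/<id>.json to approved/<id>.json (operator side)."""
    src = actions_dir / "pending" / f"{action_id}.json"
    dst_dir = actions_dir / "approved"
    dst_dir.mkdir(parents=True, exist_ok=True)
    dst = dst_dir / src.name
    os.replace(src, dst)  # atomic when both paths are on one filesystem
    return dst


def load_approved(actions_dir: Path, action_id: str) -> Dict[str, Any]:
    """Return the action payload, or refuse if it was never approved."""
    path = actions_dir / "approved" / f"{action_id}.json"
    if not path.is_file():
        raise NotApprovedError(f"action {action_id} is not approved")
    return json.loads(path.read_text(encoding="utf-8"))
```

An executor built this way calls `load_approved` first and does nothing when it raises.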

## Event System

Events are append-only JSON lines at `/opt/homelab/events/YYYY-MM-DD/<node>/events.jsonl`.

Emit via `scripts/lib/events.sh` (shell) or `scripts/lib/events.py` (Python).

Normalized event types: `deployment_started/completed/failed`, `service_unhealthy/recovered`, `node_offline/online`, `healthcheck_failed`, `remediation_started/completed`.
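
As a rough illustration of the path scheme and the append-only format, the sketch below writes one JSON line per event. It is not the API of `scripts/lib/events.py`: the `append_event` name, the record keys (`ts`, `node`, `type`), the use of the UTC date for the directory, and the expansion of the slash shorthand into full type names are assumptions.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Optional

# The slash shorthand above, expanded (one reading of it).
EVENT_TYPES = {
    "deployment_started", "deployment_completed", "deployment_failed",
    "service_unhealthy", "service_recovered",
    "node_offline", "node_online",
    "healthcheck_failed",
    "remediation_started", "remediation_completed",
}


def append_event(
    events_root: Path,
    node: str,
    event_type: str,
    now: Optional[datetime] = None,
    **fields: Any,
) -> Path:
    """Append one event to <root>/YYYY-MM-DD/<node>/events.jsonl."""
    if event_type not in EVENT_TYPES:
        raise ValueError(f"unknown event type: {event_type}")
    now = now or datetime.now(timezone.utc)
    path = events_root / now.strftime("%Y-%m-%d") / node / "events.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    record = {"ts": now.isoformat(), "node": node, "type": event_type, **fields}
    # Mode "a" only ever adds to the end of the file.
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")
    return path
```

On a real node `events_root` would be `/opt/homelab/events`; passing it in keeps the sketch testable against a temporary directory.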

## Discovery Entry Points for Agents

When exploring the system, use these files in order:
1. `inventory/topology.yaml` — node list, roles, mesh type
2. `hosts/<node>/capabilities.yaml` — hardware and software constraints
3. `hosts/<node>/services.yaml` — desired services and exposure classes for that host
4. `services/<service>/service.yaml` — operational contract for a service

## CHELSTY-Specific Rules

- Zigbee coordinator is **SLZB-06U** over TCP (`192.168.1.105:6638`, `ezsp` adapter). Never point Zigbee2MQTT at a local serial device such as `/dev/ttyUSB0`.
- CHELSTY nodes run **docker-compose v1** (1.29.2) — use `docker-compose` (hyphenated), not `docker compose`.
- Critical backup sets: HA config+data, Zigbee2MQTT config+db+network key, Mosquitto config+persistence, SLZB-06U coordinator state.
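
Tooling that runs Compose on more than one node has to pick the right binary. A small sketch of that choice follows. The `compose_command` helper is hypothetical, and the assumption that non-CHELSTY nodes have the v2 `docker compose` plugin is not stated in this file.

```python
from typing import List


def compose_command(node: str) -> List[str]:
    """Return the Compose invocation to use on the given node."""
    if node.upper().startswith("CHELSTY"):
        # CHELSTY nodes run docker-compose v1 (1.29.2): hyphenated binary.
        return ["docker-compose"]
    # Assumption: every other node has the Compose v2 plugin installed.
    return ["docker", "compose"]
```

The check is case-insensitive because node names appear both in capitals (`CHELSTY-INFRA`) and in lowercase (`chelsty-infra`) in this file.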

## Runtime Path Conventions

`/opt/homelab/` layout on each node:

- `data/<service>/` — persistent volumes
- `config/<service>/` — secrets and host-local overrides (not in Git)
- `logs/<service>/` — service logs
- `state/` — deployment stage markers, agent heartbeats
- `events/` — append-only event store
- `world/` — Observer output (synthesized state)
- `actions/` — pending / approved / running / completed / failed

## Naming Conventions

- Hosts: ALL CAPS (`SATURN`, `PIHA`)
- Services: kebab-case (`stability-agent`, `zigbee2mqtt`)
- Container names must match service names
- Always use `restart: unless-stopped` unless `service.yaml` says otherwise
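
The two name formats can be checked mechanically. The regular expressions below are one interpretation of "ALL CAPS" and "kebab-case" (digits and single hyphens allowed, as in `CHELSTY-INFRA` and `zigbee2mqtt`); the helper names are hypothetical.

```python
import re

# Lowercase alphanumeric words joined by single hyphens.
SERVICE_NAME = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")
# Uppercase alphanumeric words joined by single hyphens.
HOST_NAME = re.compile(r"[A-Z0-9]+(?:-[A-Z0-9]+)*")


def is_service_name(name: str) -> bool:
    """True if name is kebab-case under the interpretation above."""
    return SERVICE_NAME.fullmatch(name) is not None


def is_host_name(name: str) -> bool:
    """True if name is all caps under the interpretation above."""
    return HOST_NAME.fullmatch(name) is not None
```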