# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## What This Repo Is

GitOps-lite orchestration for a distributed homelab. The repo is the source of truth for infrastructure definitions; runtime state lives at `/opt/homelab/` on each execution node and is never committed.

## Node Roles

| Host | Role |
|------|------|
| **SATURN** | Primary control node — only node where commits are made |
| **SOLARIA** | GPU/compute/AI workloads |
| **PIHA** | Infra, monitoring |
| **VPS** | Public ingress, reverse proxy, control plane host |
| **CHELSTY-INFRA** | LTE edge hypervisor (site: chelsty); Zigbee2MQTT, Mosquitto, stability-agent — offline-first |
| **CHELSTY-HA** | LTE Home Assistant VM (site: chelsty); connects to CHELSTY-INFRA MQTT broker — offline-first |

All nodes communicate over Tailscale. CHELSTY-INFRA and CHELSTY-HA have an intermittent LTE uplink; their services (`zigbee2mqtt`, `mosquitto`, `homeassistant`, `stability-agent`) must never depend on SATURN, VPS, or Forgejo at runtime.

## Deployment

### Run a fresh deployment on the current node
```bash
scripts/deploy/deploy.sh
```

### Resume after interruption
```bash
scripts/deploy/deploy.sh --resume
```

### Run a specific stage only
```bash
scripts/deploy/deploy.sh --stage verify
scripts/deploy/deploy.sh --stage diagnose
```

### Deploy a specific service
```bash
scripts/deploy/deploy.sh --service mosquitto
```

### Deploy from SATURN/SOLARIA to VPS (control plane)
```bash
./scripts/deploy/deploy-control-plane.sh --ssh
```

### Bootstrap a new node
```bash
./scripts/bootstrap/chelsty-runtime.sh   # CHELSTY-specific
./scripts/bootstrap/prepare-node.sh      # General node prep
```

The staged deploy pipeline runs: **prepare → validate → deploy → verify → diagnose (on failure) → complete**. Stage state is persisted in `/opt/homelab/state/deploy/`, so an interrupted run can be resumed safely with `--resume`.

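The resume behaviour described above can be pictured with a minimal Python sketch. It is not the logic of `deploy.sh`: the `<stage>.done` marker naming and both function names are assumptions made for illustration, and only the stage order comes from the pipeline description.

```python
from pathlib import Path
from typing import Optional

# Stage order from the pipeline description; "diagnose" is left out
# because it only runs when a stage fails.
STAGES = ["prepare", "validate", "deploy", "verify", "complete"]


def next_stage(state_dir: Path) -> Optional[str]:
    """Return the first stage without a marker file, or None if all are done."""
    for stage in STAGES:
        # Hypothetical marker naming: one empty "<stage>.done" file per stage.
        if not (state_dir / f"{stage}.done").exists():
            return stage
    return None


def mark_done(state_dir: Path, stage: str) -> None:
    """Record a finished stage so a later resume can skip it."""
    state_dir.mkdir(parents=True, exist_ok=True)
    (state_dir / f"{stage}.done").touch()
```

Under these assumptions, a resumed run would call `next_stage(Path("/opt/homelab/state/deploy"))` and continue from the returned stage instead of starting at `prepare`.
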
## Service Structure

Every service must follow this layout:

```
services/<service>/
├── docker-compose.yml
├── service.yaml        # Machine-readable contract (primary source of truth for agents)
├── README.md
├── env.example         # Template — never commit actual secrets
└── healthcheck.sh      # Returns 0 (healthy) or 1 (unhealthy)
```

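As a rough illustration of how the layout above could be checked before a deploy, the following sketch lists the required files that are missing from a service directory. The function name is hypothetical and the repository may already validate this differently.

```python
from pathlib import Path
from typing import List

# File names taken from the layout above.
REQUIRED_FILES = [
    "docker-compose.yml",
    "service.yaml",
    "README.md",
    "env.example",
    "healthcheck.sh",
]


def missing_files(service_dir: Path) -> List[str]:
    """Return the required files that are absent from a service directory."""
    return [name for name in REQUIRED_FILES if not (service_dir / name).is_file()]
```

An empty result from `missing_files(Path("services/mosquitto"))` would mean the directory satisfies the layout.
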
`service.yaml` defines `owner_node`, `exposure`, `dependencies`, `healthcheck`, `restart_policy`, `persistence.paths`, and `runtime.env_vars`. This is what AI agents read to understand how to manage a service.

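To make the contract fields concrete, here is a small sketch that reports which of the fields named above are missing from an already parsed `service.yaml`. The nesting under `persistence` and `runtime` is inferred from the dotted names rather than from a real schema, and the function name is an assumption.

```python
from typing import Any, Dict, List

# Dotted paths for the fields listed above; the nesting is inferred
# from the dotted names, not taken from an actual schema file.
REQUIRED_FIELDS = [
    "owner_node",
    "exposure",
    "dependencies",
    "healthcheck",
    "restart_policy",
    "persistence.paths",
    "runtime.env_vars",
]


def missing_fields(contract: Dict[str, Any]) -> List[str]:
    """Return the dotted field paths that are absent from a parsed contract."""
    missing = []
    for dotted in REQUIRED_FIELDS:
        node: Any = contract
        for key in dotted.split("."):
            # Stop at the first level where the key cannot be found.
            if not isinstance(node, dict) or key not in node:
                missing.append(dotted)
                break
            node = node[key]
    return missing
```

The input is a plain dictionary, for example the result of loading the file with a YAML parser such as PyYAML's `yaml.safe_load`.
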
Host-specific runtime config and secrets live at `/opt/homelab/config/<service>/` on the target node (not in Git). Docker Compose overrides are version-controlled at `hosts/<node>/runtime/<service>/docker-compose.override.yml` in this repo and applied during deployment.

## Agent System Architecture

The platform uses a multi-agent model with **human-in-the-loop** for destructive actions:

1. **Stability Agent** (`services/stability-agent/`) — Per-node watchdog. Monitors Docker containers, disk, Tailscale, MQTT. Emits filesystem events. Does NOT restart services autonomously.
2. **Observer** (`services/control-plane/src/`) — Synthesizes world state from events into `/opt/homelab/world/{nodes,services,deployments,incidents}.json`.
3. **Supervisor** — Detects drift between desired state (from `hosts/*/services.yaml`) and actual state (from Observer output). Writes `pending` action JSON files.
4. **Executor** — Executes actions only after they transition to `approved`.
5. **Operator UI** + **Telegram Bot** — Operators review and approve/reject pending actions.

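The Supervisor's drift detection can be sketched as a set comparison between desired and running services. This is an illustration only: the action dictionary keys and the `deploy` / `stop` action names are hypothetical, not the control plane's real format.

```python
from typing import Dict, List, Set


def plan_actions(desired: Set[str], running: Set[str], node: str) -> List[Dict[str, str]]:
    """Compare desired and running services and propose pending actions."""
    actions = []
    # Services that should run but do not: propose a deploy.
    for service in sorted(desired - running):
        actions.append({"node": node, "service": service, "action": "deploy", "status": "pending"})
    # Services that run but are not desired: propose a stop.
    for service in sorted(running - desired):
        actions.append({"node": node, "service": service, "action": "stop", "status": "pending"})
    return actions
```

Every proposed action starts as `pending`, which matches the rule that nothing runs until an operator approves it.
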
### Action approval flow
```
Agent → /opt/homelab/actions/pending/<id>.json
      → Telegram notification → Operator approves
      → /opt/homelab/actions/approved/<id>.json
      → Executor runs → completed / failed
```

Agents must never execute destructive actions (restarts, deploys, config changes) without a corresponding approved action file.

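One way to enforce this rule in code is to have the executor refuse any action that has no file under `approved/`. The sketch below assumes the directory names from the flow above. The function names and the idea of moving the file on approval are illustrative assumptions.

```python
import json
from pathlib import Path
from typing import Any, Dict, Optional


def load_approved(actions_root: Path, action_id: str) -> Optional[Dict[str, Any]]:
    """Return the action only if an approved file exists for it, else None."""
    approved = actions_root / "approved" / f"{action_id}.json"
    if not approved.is_file():
        # Still pending, rejected, or unknown: the caller must not execute.
        return None
    return json.loads(approved.read_text(encoding="utf-8"))


def approve(actions_root: Path, action_id: str) -> None:
    """Move an action file from pending/ to approved/ (the operator's step)."""
    src = actions_root / "pending" / f"{action_id}.json"
    dst = actions_root / "approved" / f"{action_id}.json"
    dst.parent.mkdir(parents=True, exist_ok=True)
    src.replace(dst)
```

An executor built this way would call `load_approved(Path("/opt/homelab/actions"), action_id)` and do nothing when the result is `None`.
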
## Event System

Events are append-only JSON lines at:
```
/opt/homelab/events/YYYY-MM-DD/<node>/events.jsonl
```

Emit from shell:
```bash
source scripts/lib/events.sh
emit_event "deployment_started" "info" "my-script.sh" "mosquitto" "cid-123" '{}'
```

Emit from Python:
```python
from scripts.lib.events import emit_event
emit_event("service_unhealthy", "error", "monitor.py", "ollama", "cid-123", {"error": "OOM"})
```

Normalized event types: `deployment_started`, `deployment_completed`, `deployment_failed`, `service_unhealthy`, `service_recovered`, `node_offline`, `node_online`, `healthcheck_failed`, `remediation_started`, `remediation_completed`.

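For readers who want to see what an append-only JSON Lines store looks like on disk, here is a self-contained sketch. It is not the implementation behind `emit_event`: the function name, the record field names, and the reading of the positional arguments as type, severity, source, service, correlation ID, and payload are all assumptions.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Dict, Optional


def append_event(root: Path, node: str, event_type: str, severity: str,
                 source: str, service: str, correlation_id: str,
                 payload: Dict[str, Any], now: Optional[datetime] = None) -> Path:
    """Append one event as a JSON line under <root>/YYYY-MM-DD/<node>/events.jsonl."""
    now = now or datetime.now(timezone.utc)
    path = root / now.strftime("%Y-%m-%d") / node / "events.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    # Illustrative field names; the real schema lives in scripts/lib/events.
    record = {
        "timestamp": now.isoformat(),
        "type": event_type,
        "severity": severity,
        "source": source,
        "service": service,
        "correlation_id": correlation_id,
        "payload": payload,
    }
    # Mode "a" only adds to the end of the file, so existing lines are never rewritten.
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return path
```

Keeping one JSON object per line means a reader can process the file line by line without loading a whole day's events at once.
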
## Discovery Entry Points for Agents

When exploring the system, use these files in order:
1. `inventory/topology.yaml` — node list, roles, mesh type
2. `hosts/<node>/capabilities.yaml` — hardware and software constraints
3. `hosts/<node>/services.yaml` — desired services and exposure classes for that host
4. `services/<service>/service.yaml` — operational contract for a service

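The reading order above can be expressed as a small helper that builds the four paths for a given node and service. The function name is hypothetical, and the caller supplies the node directory name exactly as it appears under `hosts/`.

```python
from pathlib import Path
from typing import List


def discovery_paths(repo: Path, node: str, service: str) -> List[Path]:
    """Return the four entry-point files in the order they should be read."""
    return [
        repo / "inventory" / "topology.yaml",
        repo / "hosts" / node / "capabilities.yaml",
        repo / "hosts" / node / "services.yaml",
        repo / "services" / service / "service.yaml",
    ]
```
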
## CHELSTY-Specific Rules

- Zigbee coordinator is **SLZB-06U** over TCP (`192.168.1.105:6638`, `ezsp` adapter). Never use `/dev/ttyUSB0`.
- Deploy CHELSTY nodes individually:
  ```bash
  ./scripts/deploy/deploy-node.sh chelsty-infra
  ./scripts/deploy/deploy-node.sh chelsty-ha
  ```
- Bootstrap CHELSTY runtime:
  ```bash
  ./scripts/bootstrap/chelsty-runtime.sh
  ```
- Critical backup sets: HA config+data, Zigbee2MQTT config+db+network key, Mosquitto config+persistence, SLZB-06U coordinator state.

## Runtime Path Conventions

```
/opt/homelab/
├── data/<service>/     # Persistent volumes
├── config/<service>/   # Secrets and host-local overrides (not in Git)
├── logs/<service>/     # Service logs
├── state/              # Deployment stage markers, agent heartbeats
├── events/             # Append-only event store
├── world/              # Observer output (synthesized state)
└── actions/            # pending / approved / running / completed / failed
```

## Naming Conventions

- Hosts: ALL CAPS (`SATURN`, `PIHA`)
- Services: kebab-case (`stability-agent`, `zigbee2mqtt`)
- Container names must match service names
- Always set `restart: unless-stopped` unless `service.yaml` says otherwise

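The host and service naming rules above lend themselves to a quick check. The sketch below is one possible reading of them: it assumes hosts may contain digits and hyphens (as in `CHELSTY-INFRA`) and that service names may contain digits (as in `zigbee2mqtt`). The patterns and function names are illustrative.

```python
import re

# Hosts: upper-case letters and digits, optionally hyphen-separated.
HOST_RE = re.compile(r"[A-Z0-9]+(?:-[A-Z0-9]+)*")
# Services: lower-case kebab-case, digits allowed.
SERVICE_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def is_valid_host(name: str) -> bool:
    """True if the whole name matches the ALL CAPS host convention."""
    return HOST_RE.fullmatch(name) is not None


def is_valid_service(name: str) -> bool:
    """True if the whole name matches the kebab-case service convention."""
    return SERVICE_RE.fullmatch(name) is not None
```

Because container names must match service names, `is_valid_service` can be applied to both.
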