Compare commits

No commits in common. "master" and "chelsty-stability-agent" have entirely different histories.

191 changed files with 518 additions and 19298 deletions


@ -1,43 +0,0 @@
---
name: deploy
description: Deploy, redeploy, or ship homelab services to a target node. Trigger on any request containing deploy / redeploy / wdróż / zredeployuj / ship for targets control-plane, vps, piha, solaria, or chelsty-infra.
---
Always invoke `scripts/deploy/deploy.sh <target> [--dry-run] [--no-gate]` as the **sole entry point**.
Never call `deploy-control-plane.sh`, `deploy-node.sh`, or `deploy-local.sh` directly.
## Targets
| Target | What it deploys |
|---|---|
| `control-plane` | observer, supervisor, executor, operator-ui on VPS |
| `vps` | all VPS GitOps services (node-agent, npm, outline, joplin, ai-cluster, …) |
| `piha` | PIHA services (ha-diag-agent, node-agent, redis, …) |
| `solaria` | SOLARIA compute services |
| `chelsty-infra` | CHELSTY LTE edge node (30 s SSH timeout) |
## Invocation
```bash
scripts/deploy/deploy.sh <target> # full pipeline
scripts/deploy/deploy.sh <target> --dry-run # preflight + gate only
scripts/deploy/deploy.sh <target> --no-gate # emergency: bypass tests
```
## Exit Code Handling
| Code | Meaning | Required action |
|---|---|---|
| 0 | Success | Report: target, commit hash, gate status, verify status, elapsed time |
| 1 | Preflight failed | Fix the upstream issue (push commits, wake node, switch to master). Never bypass. |
| 2 | Gate failed | Show exactly which test/build failed. Do **not** deploy. Fix the failure first. |
| 3 | Execute failed | Show full deploy output. Ask user whether to investigate or rollback. |
| 4 | Verify failed | Show docker ps output. Discuss rollback with the user. |
| 5 | Sudo handoff | Print the exact manual command from stderr **verbatim** and stop. User must run it. |
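The table above can be turned into a small dispatcher. This is a minimal sketch, not part of `deploy.sh`: `describe_deploy_exit` is a hypothetical helper name, and its messages only paraphrase the "Required action" column.

```bash
# Hypothetical helper: map a deploy.sh exit code to the required action.
describe_deploy_exit() {
  case "$1" in
    0) echo "success: report target, commit hash, gate status, verify status, elapsed time" ;;
    1) echo "preflight failed: fix the upstream issue, never bypass" ;;
    2) echo "gate failed: show the failing test or build, do not deploy" ;;
    3) echo "execute failed: show full deploy output, ask whether to investigate or roll back" ;;
    4) echo "verify failed: show docker ps output, discuss rollback" ;;
    5) echo "sudo handoff: print the manual command from stderr verbatim and stop" ;;
    *) echo "unknown exit code: $1" ;;
  esac
}

# Usage sketch (commented out so the block is safe to paste):
# scripts/deploy/deploy.sh piha --dry-run; describe_deploy_exit "$?"
```

Keeping the mapping in one `case` statement means a new exit code needs exactly one new branch, and unknown codes are surfaced instead of being treated as success.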
## Rules
- Never pass `--no-gate` unless the user explicitly requests emergency/bypass mode.
- Never deploy uncommitted or unpushed code — preflight enforces this; do not help circumvent it.
- Canonical branch is `master` — preflight enforces this.
- For exit 5: reproduce the handoff command exactly as printed to stderr, then stop.


@ -1,152 +0,0 @@
---
name: node-onboarding
description: >
Use when the user wants to add or onboard a new node to homelab-codex —
repo manifest, Tailscale mesh, node-agent, monitoring, and UI registration.
Keywords: "nowy node", "dodaj node", "onboarding", "onboard node".
living_doc: true
maturity: partial # PROVEN: 00-access, 20-base, 30-node-agent; WRITTEN: 40-register, 50-verify (live pending). Update after each step lands on a real node.
---
> **Living document** — sections marked **SCAFFOLD** are stubs waiting for battle-testing on a real node.
> Promote to **PROVEN** after each step passes end-to-end. Do not treat SCAFFOLD sections as authoritative.
## Trigger
User asks to onboard / add a new node. Load this skill before touching any onboarding script or node.yaml.
---
## Workflow — one step at a time
```
preflight (read-only)
└─ 00-access [PROVEN]
   └─ 20-base [PROVEN]
      └─ 30-node-agent [PROVEN]
         └─ 40-register [WRITTEN, live pending]
            └─ 50-verify [WRITTEN, live pending]
```
Never skip ahead. Each step must exit 0 before the next begins.
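
The "exit 0 before the next begins" rule can be sketched as a loop that stops at the first failure. This is an illustration only: `run_steps` is a hypothetical name, and `onboard.sh` may implement its own sequencing differently.

```bash
# run_steps RUNNER STEP...: call RUNNER once per step, in order,
# and stop at the first step that exits non-zero.
run_steps() {
  runner="$1"; shift
  for step in "$@"; do
    echo "==> $step"
    "$runner" "$step" || {
      rc=$?
      echo "step $step failed (exit $rc); stopping" >&2
      return "$rc"
    }
  done
}

# Usage sketch (commented out; lustro is only an example node name):
# onboard_step() { scripts/onboard/onboard.sh --node lustro --step "$1"; }
# run_steps onboard_step 00-access 20-base 30-node-agent 40-register 50-verify
```

Passing the runner as an argument keeps the loop testable with a stub before it is pointed at a real node.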
---
## Invocation
```bash
# Full onboarding (all steps in order)
scripts/onboard/onboard.sh --node <name>
# Single step
scripts/onboard/onboard.sh --node <name> --step 00-access
# Resume from a step
scripts/onboard/onboard.sh --node <name> --from 10-bootstrap-runtime
# Dry-run — probes run for real; mutations are printed, not executed
scripts/onboard/onboard.sh --node <name> --dry-run
```
---
## Step status table
| Step | File | Status | What it does |
|------|------|--------|--------------|
| `00-preflight` | `steps/00-preflight.sh` | SCAFFOLD | Read-only: arch, RAM, docker, swap, MM runtime → YAML snippet for node.yaml |
| `00-access` | `steps/00-access.sh` | **PROVEN** | SSH key → `first_contact`, install Tailscale, `tailscale up` (interactive URL), verify over mesh |
| `10-bootstrap-runtime` | `steps/10-bootstrap-runtime.sh` | SCAFFOLD | Create `/opt/homelab/` layout, `chown <ssh_user>` |
| `20-base` | `steps/20-base.sh` | **PROVEN** | swap→zram, `/opt/homelab/` layout, event dir `/opt/homelab/events/<node>/` |
| `20-install-docker` | `steps/20-install-docker.sh` | SCAFFOLD | Install Docker Engine if `docker_present=false`; skip if already installed |
| `30-node-agent` | `steps/30-node-agent.sh` | **PROVEN** | rsync base compose + override, `docker compose up -d --build`, verify container + events |
| `40-register` | `steps/40-register.sh` | WRITTEN | Appends the node to `inventory/topology.yaml`, creates `hosts/<node>/services.yaml`, and commits on the branch (no push) |
| `50-verify` | `steps/50-verify.sh` | WRITTEN | SSH node: container+events; SSH VPS: restart observer + heartbeat poll + world/nodes.json |
---
## node.yaml — key fields
```yaml
name: LUSTRO                      # ALL CAPS
role: edge                        # edge | compute | infra
ssh_user: pi                      # existing user on the node
first_contact: pi@192.168.31.19   # LAN IP; NEVER .local (mDNS unreliable in automation)
tailscale:
  hostname: lustro                # mesh name; switch to this after tailscale up
  ip:                             # fill after join
deploy_autonomy: true             # false → print manual instructions and stop
git_control: false                # false → push-based from SATURN (edge nodes)
hardware:
  arch: arm64                     # filled by 00-preflight
  ram_mb: 4096                    # filled by 00-preflight
  swap:
    kind: zram                    # zram | file | none
  docker_present: true            # filled by 00-preflight
  mm_runtime: systemd:magicmirror.service  # filled by 00-preflight; none if absent
services:
  node-agent:
    runtime:
      engine: docker
      mem_limit: 256m             # mandatory on RAM-constrained hosts (≤4 GB)
```
Preflight fills `arch`, `ram_mb`, `docker_present`, and `mm_runtime`; do NOT guess these.
Full schema: `scripts/onboard/README.md`.
---
## Operational rules (PROVEN)
**PLAN-FIRST** — before any mutation, show exactly what will touch the remote host.
Always run `--dry-run` first; dry-run must print real commands (`run()` propagation).
**Idempotency** — every step is safe to re-run. Keys, Tailscale join, Docker install → skip if already done.
**Isolation** — do NOT touch existing services on the node (e.g. MagicMirror as systemd unit).
**Worktree discipline** — onboarding is a feature. Work in a task worktree (`agent.sh new`), never in the main checkout (`~/homelab-codex-ws` is deploy-only). See [[worktree-aware]].
---
## Gotchas (battle-tested)
| Problem | Fix |
|---------|-----|
| mDNS `.local` resolution fails | Always use the LAN IP in `first_contact`; `.local` is fine interactively, not in automation |
| uid=1000 collision on RPi OS | If `pi` already holds uid=1000 → USE that user, don't create `oskar`. node-agent `1000:1000` matches out-of-box; creating a second uid=1000 breaks MM ownership |
| passwordless sudo not guaranteed | Verify `sudo -n true` exits 0 before any sudo-over-SSH step. RPi OS default may require password; ssh without TTY will hang |
| swap file on SD card | Use zram, not a swap file (SD wear). Add migration to `10-bootstrap-runtime` |
| RAM ≤4 GB with heavy app | `mem_limit` on node-agent is mandatory — same OOM profile as VPS |
| Docker already installed | Check `docker_present` from preflight; skip install step if true |
| SSH known-hosts warning in parsed output | Pass `-o LogLevel=ERROR` to SSH for new mesh hosts |
| `yaml_get` drops value prefix after `:` | Strip only up to the first colon: `s/^[[:space:]]*[^:]*:[[:space:]]*//` handles `systemd:unit` correctly |
| `yaml_get` keeps inline YAML comments | Strip with `s/[[:space:]]\+#.*$//` after extraction (requires ≥1 space before `#`) |
| dry-run stops at orchestrator level | `run()` wrapper + `export DRY_RUN=1` propagated to all step scripts; probes execute for real |
| rsync push Permission denied to VPS events/ | The SSH user must be in the **group that owns `/opt/homelab/events/`** (aerbot/1000 on VPS). Symptom: silent WARNING in node-agent log, a backlog of 292k files, stale panel. Fix: `usermod -aG 1000 <user>` on VPS, then log in again |
| node-agent SSH key mount target | Mount the push key under the **container's HOME**: `/home/homelab/.ssh` (uid 1000 `homelab`), **NOT `/root/.ssh`** — ssh in `_ship_events_to_vps()` has no `-i` and only looks in `$HOME/.ssh`; a `/root/.ssh` mount is blind → `Permission denied` (lustro 2026-06-11, fix `a5a1352`). The new node's pubkey must also land in `authorized_keys` of `oskar@VPS` |
| observer not seeing new node after topology.yaml edit | `_load_inventory()` runs once at `__init__`. After `git pull` on VPS (bind-mount is live), **`docker restart control-plane-observer`** is required — no redeploy needed |
| worktree on wrong branch | Always check `git branch --show-current` on entry. One task = one worktree (`agent.sh new`). Never manually `git checkout` between task branches in the same worktree |
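
The two `yaml_get` rows above combine into one extraction pipeline. The following is a hypothetical re-creation for flat `key: value` lines, not the function from `lib/common.sh`; it also swaps the GNU-only `\+` for the portable `[[:space:]][[:space:]]*`.

```bash
# Hypothetical yaml_get sketch: print the value of a flat "key: value" line.
# $1 = key (treated as a regex fragment), $2 = file
yaml_get() {
  grep -E "^[[:space:]]*$1:" "$2" | head -n 1 \
    | sed -e 's/^[[:space:]]*[^:]*:[[:space:]]*//' \
          -e 's/[[:space:]][[:space:]]*#.*$//'
}

# Usage sketch (commented out; the path follows the hosts/<node>/node.yaml layout):
# yaml_get mm_runtime hosts/lustro/node.yaml
```

The first expression removes everything up to the first colon, so `systemd:magicmirror.service` keeps its inner colon. The second strips a trailing comment only when whitespace precedes `#`, which also means a quoted value containing ` #` would be cut short.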
---
## lib/ reference
```
lib/common.sh — log/warn/die/step/dryrun, run(), yaml_get, ensure_line, git() wrapper
lib/remote.sh — rrun/rcopy/rsync_dir/rcheck (SSH wrappers; uses ONBOARD_SSH_USER / ONBOARD_HOST)
```
`run()` contract: in dry-run mode prints intent without executing; probes (ssh BatchMode=yes, `command -v`, status queries) always execute so the plan is realistic.
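
A minimal sketch of that contract, assuming `DRY_RUN=1` is the switch (as in the gotchas table); the real `run()` in `lib/common.sh` may log differently.

```bash
# run CMD [ARGS...]: execute a mutation, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "[dry-run] $*"
  else
    "$@"
  fi
}

# Probes are called directly, never through run(), so they execute in both modes:
# command -v docker >/dev/null && echo "docker present"
# Mutations go through run():
# run mkdir -p /opt/homelab/events
```

Because the check reads the environment, `export DRY_RUN=1` in the orchestrator reaches every step script that uses the same wrapper.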
---
## Definition of Done
A node is fully onboarded when:
1. `50-verify` exits 0 — event visible in control-plane UI and Telegram alert path confirmed.
2. `hosts/<node>/node.yaml` committed with all preflight fields filled.
3. `hosts/<node>/capabilities.yaml` present and accurate.
4. Node appears in `inventory/topology.yaml`.


@ -1,65 +0,0 @@
---
name: save-session
description: Save and record the current work session to docs/sessions/. Trigger ONLY on explicit "save session", "zapisz sesję", or "wrap up" — never invoke proactively between tasks.
---
**Trigger condition**: user explicitly says "save session", "zapisz sesję", "wrap up", or equivalent.
Never invoke proactively. Never invoke mid-task.
## 1. Determine Session Boundary
1. Read the latest entry file in `docs/sessions/` — use its last `## Session HH:MM` heading timestamp as the start boundary.
2. Fallback if no previous entry exists: 24 hours ago.
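
Step 1 can be sketched as follows, assuming headings follow the `## Session HH:MM` form exactly. `last_session_time` is a hypothetical helper, not an existing script in the repo; it prints nothing when the file has no session heading, which is the cue to apply the 24-hour fallback.

```bash
# last_session_time FILE: print HH:MM from the last "## Session HH:MM" heading.
last_session_time() {
  sed -n 's/^## Session \([0-9][0-9]:[0-9][0-9]\).*$/\1/p' "$1" | tail -n 1
}

# Usage sketch (commented out; the file name is only an example):
# last_session_time docs/sessions/2026-06-11.md
```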
## 2. Collect Facts (deterministic only — no invention)
Run exactly:
```bash
# All commits since boundary
git --no-pager log --oneline <boundary>..HEAD
# Changed file summary
git --no-pager diff --stat <boundary>..HEAD
```
From the visible conversation transcript, collect the deploys that were run and their outcomes, plus any test results seen.
## 3. Write the Session Entry
**APPEND** to `docs/sessions/YYYY-MM-DD.md` (create the file if it doesn't exist for today).
Never overwrite existing content.
```markdown
## Session HH:MM
### Commits
<output of git log --oneline>
### Files changed
<output of git diff --stat>
### Deploys
<list from transcript, or "None recorded">
### Narrative
> _user-provided summary_
```
The `> _user-provided summary_` placeholder is **mandatory**. Never fill it in. The user supplies the narrative separately if desired.
## 4. What NOT to Touch
- `backlog.md` — only on explicit "update backlog" instruction
- `CLAUDE.md` — only on explicit "update CLAUDE.md" instruction
- Any other file not listed above
## 5. Commit
Stage and commit **only** the session file:
```bash
git add docs/sessions/YYYY-MM-DD.md
git commit -m "docs: session YYYY-MM-DD HH:MM"
```
No other files. No `git add -A`.


@ -1,81 +0,0 @@
---
name: worktree-aware
description: >
Use when working in a git worktree checkout for a parallel agent task.
The presence of an .agent-task file in the current working directory indicates
a task worktree (NOT the main checkout). Encodes branch hygiene: commit only
to the assigned task branch, NEVER push origin master, NEVER touch the main
checkout at ~/homelab-codex-ws, NEVER manage worktrees yourself. On task
completion, report the branch name verbatim and stop — the human merges via
scripts/dev/agent.sh.
---
## When this applies
- `.agent-task` present in your `cwd` → you are in a task worktree. Apply all rules below.
- `.agent-task` absent → you are in the main checkout. Do NOT treat yourself as a task agent.
In the main checkout these rules do not apply.
## Reading the marker
`.agent-task` is a YAML file. Your assigned branch is the value of the `branch:` key, e.g.:
```yaml
task: my-feature
branch: task/my-feature
parent_commit: abc1234
created_utc: 2026-06-03T10:00:00Z
worktree_path: /home/oskar/homelab-codex-ws-my-feature
```
Always read this file first before taking any action.
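
Reading the marker and checking the branch can be sketched as below. Both function names are hypothetical, and the parser assumes the flat `branch: value` form shown above with no trailing comment.

```bash
# agent_task_branch FILE: print the value of the branch: key.
agent_task_branch() {
  sed -n 's/^branch:[[:space:]]*//p' "$1" | head -n 1
}

# on_assigned_branch FILE CURRENT: succeed only when CURRENT is the assigned branch.
on_assigned_branch() {
  assigned=$(agent_task_branch "$1")
  [ -n "$assigned" ] && [ "$assigned" = "$2" ]
}

# Usage sketch (commented out):
# on_assigned_branch .agent-task "$(git branch --show-current)" \
#   || { echo "not on the assigned task branch; stopping" >&2; exit 1; }
```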
## Rules
1. **Commit only to your branch.**
Before any `git commit`, run `git status` and confirm it says `On branch task/<name>`.
If it does not, stop immediately and report the discrepancy.
2. **Push only to your branch.**
The only permitted push is `git push origin task/<name>`.
NEVER `git push origin master` or any other branch.
3. **Do not touch the main checkout.**
`~/homelab-codex-ws/` is the main checkout — deploy-only, owned by the human.
Do not read from, write to, or execute commands inside it.
4. **Stay scoped.**
Only change files directly related to your assigned task.
If you notice other problems, report them in your final summary as separate follow-up proposals.
Do not fix them in this worktree.
5. **Never `git add -A`.**
Always stage specific files by name: `git add path/to/file`.
6. **Do not manage worktrees.**
Never run `git worktree add/remove` or invoke `scripts/dev/agent.sh`.
Worktree lifecycle is the human's responsibility.
7. **Final report before stopping.**
When the task is done, provide a structured report containing:
- Files changed (path and one-line summary of change)
- Tests run and results
- All commit hashes on the task branch
- **Branch name verbatim** (copy-paste ready)
- Follow-up items as bulleted proposals for separate tasks
## Definition of Done
- All commits are on `task/<name>` (verify with `git log --oneline master..task/<name>`)
- Test suite passes
- Branch pushed: `git push origin task/<name>`
- Full report delivered in conversation
## What you do NOT do
- Merge branches
- Create or push tags
- Run deploys or healthchecks against production nodes
- Delete branches or worktrees
- Modify files in other worktrees
- Push to `origin master` under any circumstances

3
.gitignore vendored

@ -15,13 +15,10 @@ __pycache__/
*$py.class
venv/
.venv/
*.egg-info/
# Tools
.aider*
.codex
# worktree task marker created by scripts/dev/agent.sh new — must stay untracked per worktree
.agent-task
# OS files
.DS_Store

212
CLAUDE.md

@ -1,212 +0,0 @@
# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## What This Repo Is
GitOps-lite orchestration for a distributed homelab. The repo is the source of truth for infrastructure definitions; runtime state lives at `/opt/homelab/` on each execution node and is never committed.
## Node Roles
| Host | Role |
|------|------|
| **SATURN** | Primary control node — only node where commits are made |
| **SOLARIA** | GPU/compute/AI workloads |
| **PIHA** | Infra, monitoring |
| **VPS** | Public ingress, reverse proxy, control plane host |
| **CHELSTY-INFRA** | LTE edge hypervisor (site: chelsty); Zigbee2MQTT, Mosquitto, stability-agent — offline-first |
| **CHELSTY-HA** | LTE Home Assistant VM (site: chelsty); connects to CHELSTY-INFRA MQTT broker — offline-first |
All nodes communicate over Tailscale. CHELSTY-INFRA and CHELSTY-HA have an intermittent LTE uplink; their services must never depend on SATURN, VPS, or Forgejo at runtime. Full node capabilities: `hosts/<node>/capabilities.yaml`.
## Deployment
```bash
scripts/deploy/deploy.sh # fresh deploy on current node
scripts/deploy/deploy.sh --resume # resume after interruption
scripts/deploy/deploy.sh --stage verify # specific stage only
scripts/deploy/deploy.sh --service mosquitto # specific service only
./scripts/deploy/deploy-control-plane.sh --ssh # SATURN/SOLARIA → VPS
./scripts/deploy/deploy-node.sh chelsty-infra # CHELSTY nodes (individually)
./scripts/bootstrap/prepare-node.sh # general node bootstrap
./scripts/bootstrap/chelsty-runtime.sh # CHELSTY-specific bootstrap
scripts/onboard/onboard.sh --node <name> # onboard a new node (idempotent, bash)
scripts/onboard/onboard.sh --node <name> --step 00-access # single step
scripts/onboard/onboard.sh --node <name> --dry-run # simulate
```
Pipeline stages: **prepare → validate → deploy → verify → diagnose (on failure) → complete**. Stage state persisted in `/opt/homelab/state/deploy/`.
## Node Onboarding
New nodes are onboarded via `scripts/onboard/` — an idempotent bash tool driven by
`hosts/<node>/node.yaml` manifests (no Ansible). See `scripts/onboard/README.md` for
the full schema, step status table, and gotchas.
Key fields in `node.yaml`: `ssh_user`, `first_contact` (LAN IP — not `.local`),
`tailscale.hostname`, `deploy_autonomy`, `git_control`, `hardware.*`.
## Service Structure
Every service must follow this layout:
```
services/<service>/
├── docker-compose.yml
├── service.yaml # Machine-readable contract (primary source of truth for agents)
├── README.md
├── env.example # Template — never commit actual secrets
└── healthcheck.sh # Returns 0 (healthy) or 1 (unhealthy)
```
`service.yaml` defines `owner_node`, `exposure`, `dependencies`, `healthcheck`, `restart_policy`, `persistence.paths`, and `runtime.env_vars`. This is what AI agents read to understand how to manage a service.
Host-specific runtime config and secrets live at `/opt/homelab/config/<service>/` on the target node (not in Git). Docker Compose overrides are version-controlled at `hosts/<node>/runtime/<service>/docker-compose.override.yml` in this repo and applied during deployment.
## Agent System Architecture
The platform uses a multi-agent model with **human-in-the-loop** for destructive actions:
1. **Stability Agent** (`services/stability-agent/`) — Per-node watchdog. Monitors Docker containers, disk, Tailscale, MQTT. Emits filesystem events. Does NOT restart services autonomously.
2. **Observer** (`services/control-plane/src/`) — Synthesizes world state from events into `/opt/homelab/world/{nodes,services,deployments,incidents}.json`.
3. **Supervisor** — Detects drift between desired state (from `hosts/*/services.yaml`) and actual state (from Observer output). Writes `pending` action JSON files.
4. **Executor** — Executes actions only after they transition to `approved`.
5. **Operator UI** + **Telegram Bot** — Operators review and approve/reject pending actions.
### Action approval flow
```
Agent → /opt/homelab/actions/pending/<id>.json
→ Telegram notification → Operator approves
→ /opt/homelab/actions/approved/<id>.json
→ Executor runs → completed / failed
```
Agents must never execute destructive actions (restarts, deploys, config changes) without a corresponding approved action file.
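
The directory layout implies a simple guard on the executor side. The sketch below assumes a state transition is a file move between `pending/` and `approved/`, which the flow above suggests but does not confirm; both function names are hypothetical, and approval itself stays with the operator.

```bash
# approve_action ROOT ID: move a pending action to approved/ (operator side).
approve_action() {
  src="$1/pending/$2.json"
  [ -f "$src" ] || { echo "no pending action: $2" >&2; return 1; }
  mkdir -p "$1/approved"
  mv "$src" "$1/approved/$2.json"
}

# can_execute ROOT ID: executor-side guard, true only for approved actions.
can_execute() {
  [ -f "$1/approved/$2.json" ]
}

# Usage sketch (commented out; the action id is only an example):
# can_execute /opt/homelab/actions restart-redis || echo "not approved; refusing to run"
```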
## Event System
Events are append-only JSON lines at `/opt/homelab/events/YYYY-MM-DD/<node>/events.jsonl`.
Emit via `scripts/lib/events.sh` (shell) or `scripts/lib/events.py` (Python).
Normalized event types: `deployment_started/completed/failed`, `service_unhealthy/recovered`, `node_offline/online`, `healthcheck_failed`, `remediation_started/completed`.
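
For orientation, appending one event line could look like the sketch below. It is not the API of `scripts/lib/events.sh`: the function name and the JSON field names (`ts`, `node`, `type`, `message`) are assumptions, and there is no JSON escaping, so the message must not contain quotes or backslashes.

```bash
# emit_event ROOT NODE TYPE MESSAGE: append one JSON line to
# ROOT/YYYY-MM-DD/NODE/events.jsonl (field names are hypothetical).
emit_event() {
  day=$(date -u +%Y-%m-%d)
  dir="$1/$day/$2"
  mkdir -p "$dir"
  printf '{"ts":"%s","node":"%s","type":"%s","message":"%s"}\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$2" "$3" "$4" >> "$dir/events.jsonl"
}

# Usage sketch (commented out):
# emit_event /opt/homelab/events piha service_unhealthy "redis healthcheck failed"
```

Using `>>` keeps the store append-only: existing lines are never rewritten, and each event occupies exactly one line.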
### Supervisor event routing table
| Event type | Source | Action generated | Cooldown |
|---|---|---|---|
| `containers_not_running` | stability-agent | `container_restart` | dedup via stable ID |
| `mqtt_unreachable` | stability-agent | `container_restart` | dedup via stable ID |
| `service_unhealthy` / other | stability-agent | `redeploy` | dedup via stable ID |
| `disk_pressure` (high) | stability-agent | `disk_cleanup` | dedup via stable ID |
| `ha_websocket_dead` | ha-diag-agent | `container_restart` (homeassistant) | 30 min after completion |
| `ha_websocket_recovered` | ha-diag-agent | cancels matching restart | — |
| `ha_integration_failed` | ha-diag-agent | `alert_only` | 1 hour |
| `ha_entity_unavailable_long` | ha-diag-agent | `alert_only` | 1 hour |
| `ha_automation_failing` | ha-diag-agent | `alert_only` | 1 hour |
| `ha_update_available` | ha-diag-agent | `alert_only` | 1 hour |
| `ha_recorder_lag` | ha-diag-agent | `alert_only` | 1 hour |
| `ha_system_health_degraded` | ha-diag-agent | `alert_only` | 1 hour |
HA events are routed directly from the events directory by the supervisor (not via world-state drift loop) to avoid conflicts with stability-agent's independent container health tracking. HA events are suppressed if `homeassistant` had a `containers_not_running` incident within the last 5 minutes (planned restart/update in progress).
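
The routing table can be read as a single dispatch. The sketch below is a simplification under stated assumptions: it ignores the source column, severity, cooldowns, and the 5-minute suppression window; it treats every other `ha_*` type as `alert_only`; and `cancel_restart` is a made-up label for "cancels matching restart". It is not the supervisor's actual code.

```bash
# route_event TYPE: print the action the routing table assigns to an event type.
route_event() {
  case "$1" in
    containers_not_running|mqtt_unreachable) echo container_restart ;;
    disk_pressure)          echo disk_cleanup ;;
    ha_websocket_dead)      echo container_restart ;;
    ha_websocket_recovered) echo cancel_restart ;;   # hypothetical label
    ha_*)                   echo alert_only ;;
    *)                      echo redeploy ;;          # service_unhealthy / other
  esac
}
```

Pattern order matters: the two specific `ha_websocket_*` branches must precede the `ha_*` catch-all, otherwise they would be routed to `alert_only`.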
## Discovery Entry Points for Agents
When exploring the system, use these files in order:
1. `inventory/topology.yaml` — node list, roles, mesh type
2. `hosts/<node>/capabilities.yaml` — hardware and software constraints
3. `hosts/<node>/services.yaml` — desired services and exposure classes for that host
4. `services/<service>/service.yaml` — operational contract for a service
## VPS-Specific Rules
VPS has **4 GiB RAM, no swap**. Every repo-managed service must declare memory limits in its `hosts/vps/runtime/<service>/docker-compose.override.yml`.
### Memory limit convention
Use top-level Compose properties (not `deploy.resources.limits`, which requires Swarm mode):
```yaml
services:
  myservice:
    mem_limit: 256m      # cgroup ceiling; the container is OOM-killed on breach and restarted per its restart policy
    oom_score_adj: -900  # host kernel OOM killer strongly avoids this container
```
Rules:
- **Control-plane containers** (executor, observer, supervisor, operator-ui), **node-agent**, **stability-agent**: always set `oom_score_adj: -900` — these must never be a system-level OOM victim.
- `mem_limit` still applies even with `oom_score_adj: -900`; the cgroup OOM killer is independent of the host OOM killer and kills the container when it exceeds the limit, after which Docker restarts it per its restart policy.
- Budget: OS+Docker reserves ~800 MiB; sum of all `mem_limit` values must stay ≤ 3200 MiB (3.1 GiB).
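
The budget rule is easy to check mechanically. A minimal sketch, assuming limits are written as whole numbers with an `m` or `g` suffix; both function names are hypothetical, and collecting the values from the override files is left out.

```bash
# mem_to_mib VALUE: convert a Compose-style size such as 256m or 1g to MiB.
mem_to_mib() {
  case "$1" in
    *g|*G) echo $(( ${1%?} * 1024 )) ;;
    *m|*M) echo "${1%?}" ;;
    *) echo "unsupported unit: $1" >&2; return 1 ;;
  esac
}

# within_budget LIMIT...: print the total and succeed when it is at most 3200 MiB.
within_budget() {
  total=0
  for limit in "$@"; do
    mib=$(mem_to_mib "$limit") || return 2
    total=$(( $total + $mib ))
  done
  echo "total: ${total} MiB"
  [ "$total" -le 3200 ]
}

# Usage sketch (commented out; the values are examples, not the real VPS limits):
# within_budget 256m 512m 1g || echo "over budget"
```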
### Repo-managed services on VPS
All VPS services are now GitOps-managed. Service definitions live in `services/<name>/docker-compose.yml`; host-specific overrides (mem_limit, env) live in `hosts/vps/runtime/<name>/docker-compose.override.yml`.
| Service | Compose stack | Data path |
|---|---|---|
| npm | `services/npm/` | `/home/dockeruser/docker/npm/{data,letsencrypt}` (bind mount) |
| outline | `services/outline/` | Docker named volumes: `outline_outline_storage`, `outline_postgres_data`, `outline_redis_data` |
| joplin | `services/joplin/` | Docker named volume: `joplin_postgres_data` |
| ai-cluster | `services/ai-cluster/` | Mosquitto config bind: `/home/dockeruser/docker/ai-cluster/mosquitto/` |
**Data migration rule**: data paths stay in place at cutover. Never move volumes or bind-mount sources without a dedicated migration plan.
**Cutover checklist** (before running `docker compose up` for any migrated service):
1. `git pull` on VPS
2. Populate `/opt/homelab/config/<service>/.env` from the `env.example` template
3. For ai-cluster: copy `/home/dockeruser/docker/ai-cluster/.env` to `/opt/homelab/config/ai-cluster/.env`
4. For mosquitto: config stays at old bind path until explicitly migrated
5. Verify named volumes exist: `docker volume ls | grep <project>`
**ai-cluster architectural note**: compute workloads (codex-worker, planner-worker) belong on SOLARIA (GPU/compute node), not the 4 GB ingress VPS. Migrate when feasible; for now, hard mem_limits contain the blast radius.
## CHELSTY-Specific Rules
- Zigbee coordinator is **SLZB-06U** over TCP (`192.168.1.105:6638`, `ezsp` adapter). Never use `/dev/ttyUSB0`.
- CHELSTY nodes run **docker-compose v1** (1.29.2) — use `docker-compose` (hyphenated), not `docker compose`.
- Critical backup sets: HA config+data, Zigbee2MQTT config+db+network key, Mosquitto config+persistence, SLZB-06U coordinator state.
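
The Compose v1 rule can be encoded once instead of remembered per command. A minimal sketch: `compose_cmd` is a hypothetical helper, and it assumes that every non-CHELSTY target has the v2 plugin, which this file does not state explicitly.

```bash
# compose_cmd TARGET: print the Compose invocation to use for a deploy target.
compose_cmd() {
  case "$1" in
    chelsty-*) echo "docker-compose" ;;   # v1 (1.29.2), hyphenated
    *)         echo "docker compose" ;;   # assumption: v2 plugin elsewhere
  esac
}

# Usage sketch (commented out; word splitting of the result is intentional):
# $(compose_cmd chelsty-infra) ps
```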
## Runtime Path Conventions
`/opt/homelab/` layout on each node:
- `data/<service>/` — persistent volumes
- `config/<service>/` — secrets and host-local overrides (not in Git)
- `logs/<service>/` — service logs
- `state/` — deployment stage markers, agent heartbeats
- `events/` — append-only event store
- `world/` — Observer output (synthesized state)
- `actions/` — pending / approved / running / completed / failed
## Definition of Done (services)
Before any new or changed service is considered ready:
1. **docker build + smoke run** — build the image locally and run it for a few seconds; confirm the process starts its main loop without crashing. This catches packaging/import errors (e.g. `ModuleNotFoundError`) before they reach a node.
2. **pytest** — run the service's test suite. If no tests exist yet, add a minimal one (at minimum: import passes, core logic has at least one case). Tests live in `services/<service>/tests/`.
3. **Never commit or deploy code that has never been run.** If a smoke run or test fails, fix it first.
## Naming Conventions
- Hosts: ALL CAPS (`SATURN`, `PIHA`)
- Services: kebab-case (`stability-agent`, `zigbee2mqtt`)
- Container names must match service names
- Always `restart: unless-stopped` unless `service.yaml` says otherwise
## Multi-agent worktree mode
`~/homelab-codex-ws` (main checkout) is **deploy-only** and belongs to the human operator.
Parallel agent tasks run in isolated git worktrees created by `scripts/dev/agent.sh new <name>`.
**DISCIPLINE RULE — enforced after 2026-06-08 session violation:**
All feature/implementation work MUST happen in a task worktree, never directly in the main
checkout. The main checkout is for reading context and running deploys only. If you are
about to create a new branch or make implementation commits while `pwd` is
`~/homelab-codex-ws`, stop and ask the operator to run `agent.sh new <name>` first.
If `.agent-task` exists in your current working directory, you are in a task worktree.
**You must immediately read `.agent-task` and load `.claude/skills/worktree-aware/SKILL.md`
before taking any action.** That skill defines all branch-hygiene rules for task worktrees.
Worktree lifecycle commands: `agent.sh new | list | merge | clean`.
Agents never invoke these — only the human does.


@ -13,22 +13,6 @@ The homelab consists of several nodes connected via a Tailscale internal mesh.
| **PIHA** | Infra Node | Core infrastructure services, automation, and monitoring. |
| **VPS** | Edge Node | Public ingress, reverse proxy, and edge services. |
## Agent System
The homelab uses a multi-agent orchestration model with human-in-the-loop for destructive actions:
| Agent | Node | Role |
|-------|------|------|
| **stability-agent** | all nodes | Per-node watchdog — monitors Docker, disk, Tailscale, MQTT; emits events |
| **node-agent** | all nodes | Publishes container health events to Redis pub/sub |
| **observer** | VPS | Synthesizes world state from events into `/opt/homelab/world/*.json` |
| **supervisor** | VPS | Detects drift between desired and actual state; writes `pending` actions |
| **planner-agent** | SOLARIA | LLM-powered diagnosis — listens to Redis, proposes remediation actions |
| **executor** | VPS | Executes actions only after operator approval |
| **operator-ui** + **telegram-bot** | VPS / PIHA | Operator reviews and approves/rejects pending actions |
Action approval flow: `pending/` → operator approves → `approved/` → executor runs.
## Repository Structure
- `docs/`: [Infrastructure Standards](docs/standards.md) and [Deployment Conventions](docs/deployment.md).
@ -45,13 +29,10 @@ Action approval flow: `pending/` → operator approves → `approved/` → execu
## Documentation Index
- [Infrastructure Standards](docs/standards.md)
- [Agent Operating Procedures](docs/agents.md) (For AI/Non-Human Agents)
- [Deployment Conventions](docs/deployment.md)
- [Hardware](docs/hardware.md)
- [Networking](docs/networking.md)
- [Services](docs/services.md)
- [Node Capabilities](docs/capabilities.md)
- [Action Model](services/agent-system/action-model.md)
---
*Note: This repository documents the state of the homelab. Runtime state lives outside the repository in `/opt/homelab`.*


@ -1,49 +0,0 @@
# Agent Operating Procedures
This document defines the operating procedures, constraints, and interaction protocols for non-human agents (AI agents, autonomous scripts) within the Homelab Codex ecosystem.
## 1. Core Principles for Agents
1. **Read-Only by Default**: Agents should assume read-only access to the `/opt/homelab` runtime unless explicitly executing an approved action.
2. **Git as Authority**: The repository on **SATURN** is the source of truth. Agents must not modify the runtime state on nodes directly without corresponding (or pending) Git state, unless it's an emergency mitigation.
3. **Human-in-the-Loop (HIL)**: All destructive or structural changes (restarts, deployments, config changes) must follow the [Action Approval Model](../services/agent-system/action-model.md).
4. **Idempotency**: All scripts and actions proposed or executed by agents MUST be idempotent.
5. **Context-Awareness**: Agents MUST read the `README.md` and `docs/agents.md` at the start of every session to align with current infrastructure standards.
## 2. Agent Roles
| Role | Responsibility | Scope |
|------|----------------|-------|
| **Observer** | Monitors health, logs, and events. | Read-only access to `/opt/homelab/events` and `logs`. |
| **Stability Agent** | Local node watchdog, event emitter. | Local node runtime, `service.yaml` healthchecks. |
| **Orchestrator** | High-level planning, workload placement. | Repository-wide, multi-node topology. |
| **Materializer** | Translates high-level intent into Docker/System state. | Execution of `approved` actions. |
## 3. Discovery Protocol
Agents must use the following entry points to understand the system:
1. **Topology**: `inventory/topology.yaml` for node list and roles.
2. **Capabilities**: `hosts/<node>/capabilities.yaml` to understand hardware/software constraints.
3. **Service Contract**: `services/<service>/service.yaml` to understand how to check health and manage a service.
4. **Operational State**: `/opt/homelab/state/` on local nodes for real-time status.
## 4. Interaction with Humans
Agents communicate with the operator via the `agent-system/telegram-bot`.
- **Alerting**: Agents emit events to the event system. Critical events are forwarded to Telegram.
- **Proposals**: When an agent identifies a need for change (e.g., "Service X is failing, suggest restart"), it creates a `pending` action in `/opt/homelab/actions/pending/`.
- **Approval**: Agents must wait for the action status to transition to `approved` before execution.
## 5. Decision Logic (Reasoning)
When making decisions, agents MUST prioritize:
1. **Safety**: Do not violate power constraints (see `capabilities.yaml`).
2. **Stability**: Prefer keeping services on their `owner_node` unless it's down.
3. **Connectivity**: On intermittent nodes (CHELSTY), avoid actions requiring heavy WAN traffic during low-signal periods.
## 6. Access Control for Agents
- **Filesystem**: Agents should run as the `homelab` user or equivalent with restricted sudo access to `docker compose`.
- **Secrets**: Agents MUST NOT attempt to read `.env` files unless specifically tasked with credential rotation. They should treat secrets as opaque handles.


@ -1,123 +0,0 @@
# Tech-debt backlog
Central tracker for tech debt and known defects. Entries come from sessions; add them with a date and context.
---
## Active
### 🔴 BLOKUJĄCE — FLOTA-BOMBA: node-agent SSH mount ślepy po recreate
**Data**: 2026-06-11
**Źródło**: sesja lustro ssh shipping fix
**Problem**: solaria/piha/chelsty to stare **root** kontenery node-agenta (piha Created
2026-05-27, uid 0) — sprzed dodania `user: "1000:1000"` do bazowego compose. Ich override
montuje klucz SSH w `/root/.ssh`, co działa tylko dla uid 0. Pierwszy `--force-recreate` /
reboot hosta / update obrazu przełączy kontener na uid 1000 (`homelab`, HOME=/home/homelab)
i shipping eventów na VPS padnie z "Permission denied" — dokładnie jak na lustrze
(naprawione `a5a1352`). `ssh` w `_ship_events_to_vps()` nie ma `-i` i szuka klucza
w `$HOME/.ssh`.
**⚠️ NIE RECREATE node-agenta na solaria/piha/chelsty przed fixem.**
**Fix**: ujednolicić mount → `/home/homelab/.ssh` we wszystkich
`hosts/*/runtime/node-agent/docker-compose.override.yml` (wzór: `hosts/lustro/`)
ALBO dodać `-i $HOME/.ssh/id_rsa` w `_ship_events_to_vps()`.
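The second fix option (an explicit `-i`) could look roughly like the sketch below. `build_ssh_command` is a hypothetical helper, not the actual body of `_ship_events_to_vps()`; the `/home/homelab` fallback and the `IdentitiesOnly=yes` option are assumptions added for illustration.

```python
import os
from pathlib import Path
from typing import Optional


def build_ssh_command(remote: str, remote_command: str, home: Optional[str] = None) -> list:
    """Build an ssh argv that names the identity file instead of relying on defaults."""
    # Resolve HOME at call time so a uid switch (root -> homelab) is picked up.
    home_dir = Path(home if home is not None else os.environ.get("HOME", "/home/homelab"))
    key_path = home_dir / ".ssh" / "id_rsa"
    # IdentitiesOnly=yes makes ssh offer only the key given with -i.
    return ["ssh", "-i", str(key_path), "-o", "IdentitiesOnly=yes", remote, remote_command]
```

Passing the key path explicitly makes a missing or unreadable key fail loudly at one known location, whichever uid the container runs as.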
---
### ha-diag-agent deploy BLOCKED (placeholder token)
**Date**: 2026-06-11
**Source**: session; deploy config merged (`5e9db5c`), `.env` created on piha
(`/opt/homelab/config/ha-diag-agent/.env`, chmod 600), but token = PLACEHOLDER.
**Blocker**: chelsty-ha is offline, so there is no token and no connection.
**To decide**: the HA target, chelsty-ha vs HA Ken (`homeassistant5` on piha; from inside
the container it is NOT `localhost`).
**Before `shadow_mode=false`**: the supervisor restart target must be the container name
`homeassistant5`; a curl of the HA endpoint with the token must return HTTP 200.
---
### observer-poison-quarantine: branch review (`78c9e4a`)
**Date**: 2026-06-11
**Source**: session; Codex's patch is preserved on `task/observer-poison-quarantine`, NOT in master.
**To do**: verify whether the observer really hangs on a malformed event
(poison was NOT the cause of the lustro outage; that hypothesis was unverified and was
refuted by verify-before-fix). Real bug → merge; otherwise → drop the branch and worktree.
---
### node_agent.py: minor shipping cleanup
**Date**: 2026-06-11
**Source**: lustro SSH shipping fix session
1. **Stale comment** at `node_agent.py:546-548` claims the container "runs as root";
outdated since `user: "1000:1000"`.
2. **Shipping success is logged at `logger.debug`** → raise it to `info` or add a counter.
Working shipping is invisible in the logs at INFO level, which made diagnosis harder
(a silent failure looked identical to silent success).
---
### event-bloat: clean up the drained lustro backlog on the VPS
**Date**: 2026-06-11
**Source**: session; after the shipping fix, 7600+ backlog files drained into
`/opt/homelab/events/lustro/` on the VPS.
**Fix**: delete the old files (the observer has already processed them); longer term, add a
retention policy to event-store.
---
### rsync `--omit-dir-times` (node-agent)
**Date**: 2026-06-09
**Source**: fleet recovery session
**Symptom**: rsync exits with code 23 after every push because `set-times` on the
`/opt/homelab/events/` directory returns EPERM (oskar does not own the directory; aerbot does).
Files are copied correctly, but exit 23 clutters the logs and can mask real errors.
**Fix**: add `--omit-dir-times` to the `rsync` call in `node_agent.py`.
**Location**: `services/node-agent/src/node_agent.py`, the rsync call in the push loop.
**Update 2026-06-11**: confirmed fleet-wide. Every node logs a false
"Event shipping failed" (rsync code 23) each cycle even though the files get through; the
`/opt/homelab/events/*` directories on the VPS belong to `aerbot`, so the client cannot set times on them.
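A minimal sketch of the proposed rsync call, assuming the push loop shells out through `subprocess`. The helper names `build_rsync_command` and `ship` are hypothetical; only the `--omit-dir-times` flag and exit code 23 come from the entry above.

```python
import subprocess

RSYNC_PARTIAL_TRANSFER = 23  # rsync exit code for "partial transfer due to error"


def build_rsync_command(src_dir: str, dest: str) -> list:
    """Archive-mode push that skips setting mtimes on directories."""
    # The trailing slash on the source copies the directory's contents, not the directory.
    return ["rsync", "-a", "--omit-dir-times", f"{src_dir.rstrip('/')}/", dest]


def ship(src_dir: str, dest: str, run=subprocess.run) -> bool:
    """Return True only on a clean exit; `run` is injectable so tests need no rsync."""
    result = run(build_rsync_command(src_dir, dest), capture_output=True, text=True)
    return result.returncode == 0
```

With directory times omitted, a remaining exit 23 should point at a real transfer problem rather than the ownership mismatch on the VPS directories.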
---
### Declarative record of `oskar ∈ aerbot` in the VPS manifest
**Date**: 2026-06-09
**Source**: fleet recovery; root cause: oskar was outside the aerbot(1000) group → rsync Permission denied
**Problem**: group membership is managed by hand (ad-hoc `usermod -aG 1000 oskar`).
There is no guarantee it survives a VPS reinstall or a user change.
**Fix**: add a `users: oskar: groups: [aerbot]` section to `hosts/vps/host.yaml` or
`hosts/vps/capabilities.yaml`, and enforce it in the VPS deploy/bootstrap script.
Alternative: change the owner of `/opt/homelab/events/` to `oskar:oskar` and update the
node-agent deploy scripts.
---
### One worktree per task (agent.sh)
**Date**: 2026-06-09
**Source**: session; `homelab-codex-ws-node-onboarding` was used once for `task/node-onboarding`
and once for `task/fix-event-bloat` via manual `git checkout`.
**Problem**: one worktree shared by two branches is an anti-pattern. `git branch`
could point at the wrong branch; a `+` in the listing suggests "checked out in another worktree"
when that is not true. This leads to committing on the wrong branch.
**Fix**: enforce one task = one worktree (`agent.sh new <task-name>`). On entering a
worktree, always run `git branch --show-current` and verify `.agent-task`.
Long term: `agent.sh new` should refuse if the requested branch is already checked out.
---
## Closed
### Observer staleness: dead node shown as NOMINAL
**Date**: 2026-06-08 (caught), status: OPEN with respect to implementation
**Problem**: observer/supervisor keeps the last known state; there is no heartbeat TTL.
Chelsty-infra is silent, yet it still shows NOMINAL, which undermines trust in the panel.
**Fix**: heartbeat TTL → once it is exceeded, mark the status `stale` or `down`.
**Related**: brain-watchdog is blind to per-node freshness.
*(Open as an implementation TODO; carried over from the 2026-06-08 session)*
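The heartbeat TTL fix can be sketched as a pure function over the last-seen timestamp. The threshold values and the `node_status` name are illustrative assumptions; the entry above does not specify them.

```python
def node_status(last_seen: float, now: float,
                stale_after: float = 300.0, down_after: float = 900.0) -> str:
    """Classify a node by heartbeat age instead of trusting its last reported state."""
    age = now - last_seen
    if age >= down_after:
        return "down"
    if age >= stale_after:
        return "stale"
    return "nominal"
```

Taking `now` as a parameter keeps the function deterministic, so the observer can evaluate all nodes against one timestamp per reconcile cycle.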

View file

@ -83,10 +83,3 @@ Future autonomous agents will use this metadata to:
2. **Generate Plans:** Create step-by-step deployment or migration plans based on hardware compatibility.
3. **Validate Topology:** Ensure that a proposed multi-node setup doesn't violate networking or operational constraints (e.g., don't put a DB on an intermittent node).
4. **Propose Failover:** Automatically suggest the best alternative node during an outage.
## Agent Reasoning Logic
When an agent parses `capabilities.yaml`, it should apply these heuristics:
- **Intermittent Connectivity**: If `operational.connectivity == "intermittent"`, do not schedule high-bandwidth syncs or critical cloud-dependent services.
- **Power Constraints**: If `operational.power_constraint == "low-power"`, avoid heavy LLM inference or continuous high-CPU tasks.
- **Availability Target**: If `availability_target == "high"`, this node is a candidate for hosting control-plane failovers.
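The heuristics above can be sketched as a small placement check. The field names under `operational` and `availability_target` follow the text; the workload flags (`high_bandwidth`, `cloud_dependent`, `heavy_cpu`) and both function names are hypothetical, introduced only for this example.

```python
def placement_concerns(capabilities: dict, workload: dict) -> list:
    """Return the reasons a workload is a poor fit for a node (empty list = no objection)."""
    operational = capabilities.get("operational", {})
    concerns = []
    if operational.get("connectivity") == "intermittent":
        if workload.get("high_bandwidth") or workload.get("cloud_dependent"):
            concerns.append("intermittent connectivity")
    if operational.get("power_constraint") == "low-power" and workload.get("heavy_cpu"):
        concerns.append("low-power node")
    return concerns


def is_failover_candidate(capabilities: dict) -> bool:
    """Assumption: availability_target is a top-level key of capabilities.yaml."""
    return capabilities.get("availability_target") == "high"
```

Returning reasons rather than a bare boolean lets a planner show the operator why a node was rejected.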

View file

@ -1,154 +1,60 @@
# CHELSTY Runtime
This document describes the runtime environment and deployment flow for CHELSTY, an offline-capable home automation edge node split across two VMs.
| Node | Role | Services |
|------|------|----------|
| `chelsty-infra` | LTE edge hypervisor | Mosquitto, Zigbee2MQTT, stability-agent, node-agent |
| `chelsty-ha` | Home Assistant VM | homeassistant (no node-agent — see below) |
Both nodes share an LTE uplink and must function fully offline (Zigbee, MQTT, HA automations) without any connectivity to SATURN, VPS, or Forgejo.
This document describes the runtime environment and deployment flow for CHELSTY, an offline-capable home automation edge node.
## Runtime Layout
```
/opt/homelab/
├── config/ # Service-specific configs and secrets (not in Git)
│ ├── mosquitto/
│ └── zigbee2mqtt/
├── data/ # Persistent service data
│ ├── mosquitto/ # Persistence DB, password file
│ └── zigbee2mqtt/
│ └── data/ # z2m config, coordinator backup, network key
└── logs/
```
The CHELSTY runtime is located at `/opt/homelab`.
- `/opt/homelab/config/`: Service-specific configurations and compose overrides.
- `/opt/homelab/data/`: Persistent data for services.
- `/opt/homelab/logs/`: Service logs.
### Key Service Locations
- **Mosquitto**: `/opt/homelab/config/mosquitto/`
- **Zigbee2MQTT**: `/opt/homelab/config/zigbee2mqtt/`
## SLZB-06U Integration
CHELSTY uses a SMLIGHT SLZB-06U Zigbee coordinator connected over Ethernet/TCP.
CHELSTY uses a SMLIGHT SLZB-06U Zigbee coordinator connected via Ethernet/TCP.
- **Coordinator IP**: `192.168.1.105`
- **Port**: `6638`
- **Adapter**: `ezsp` (deprecated — migration to `ember` recommended, requires only changing `adapter: ember` in `configuration.yaml`)
- **Zigbee2MQTT config key**: `serial.port: tcp://192.168.1.105:6638`
- **Coordinator IP**: 192.168.1.105
- **Port**: 6638
- **Protocol**: TCP (ezsp adapter)
⚠️ Never use `/dev/ttyUSB0` — the coordinator is always TCP-only on this site.
Zigbee2MQTT is configured to connect to this coordinator over the local network.
## Networking Constraints
## Offline & LTE Assumptions
### Mosquitto — `network_mode: host`
Mosquitto runs with `network_mode: host` so that all containers on the same host can reach it at `localhost:1883`. **Do not change this.**
### Zigbee2MQTT — bridge network + extra_hosts
Zigbee2MQTT runs in a bridge-networked container (needed for port mapping compatibility with docker-compose v1). To reach the host-networked Mosquitto:
```yaml
# hosts/chelsty-infra/runtime/zigbee2mqtt/docker-compose.override.yml
services:
  zigbee2mqtt:
    extra_hosts:
      - "mosquitto:host-gateway"
```
This maps the `mosquitto` hostname inside the z2m container to the Docker host gateway IP, so `mqtt://mosquitto:1883` reaches the host-networked Mosquitto process.
**Why not `network_mode: host` for z2m?**
chelsty-infra runs docker-compose v1 (1.29.2). In v1, `network_mode: host` cannot coexist with `ports:` declared in the base `docker-compose.yml` — raises `InvalidArgument`. The `extra_hosts` approach avoids this.
## Zigbee2MQTT Config Location
The `configuration.yaml` **must be writable** — z2m migrates and rewrites it on startup. It lives in the data directory:
```
/opt/homelab/data/zigbee2mqtt/data/configuration.yaml
```
This path is mounted read-write by the base `docker-compose.yml`:
```yaml
volumes:
  - /opt/homelab/data/zigbee2mqtt/data:/app/data
```
Do **not** mount `configuration.yaml` as a separate `:ro` volume — z2m will fail with `EROFS`.
### Minimal configuration.yaml
```yaml
homeassistant: true
permit_join: false
mqtt:
  base_topic: zigbee2mqtt
  server: mqtt://mosquitto:1883
serial:
  port: tcp://192.168.1.105:6638
  adapter: ezsp
frontend:
  port: 8080
advanced:
  log_level: info
```
## chelsty-ha — No node-agent
`chelsty-ha` does not have a node-agent deployed. Home Assistant is monitored indirectly: if MQTT goes silent on `chelsty-infra`, HA is likely down.
In `hosts/chelsty-ha/services.yaml`:
```yaml
services:
  homeassistant:
    monitor: false # No node-agent; suppresses supervisor action generation
```
Remove `monitor: false` once node-agent is bootstrapped on this VM.
- **WAN Resilience**: All core automation (Zigbee, MQTT) runs locally on CHELSTY.
- **Connectivity**: LTE provides intermittent uplink for remote management and Tailscale access.
- **Home Assistant**: Runs in a separate VM, connecting to the Mosquitto broker on CHELSTY.
## Deployment Flow
### Initial Bootstrap
```bash
./scripts/bootstrap/chelsty-runtime.sh
```
1. **Initial Bootstrap**:
Run the bootstrap script on the CHELSTY node:
```bash
./scripts/bootstrap/chelsty-runtime.sh
```
### Deploy services
```bash
./scripts/deploy/deploy-node.sh chelsty-infra
./scripts/deploy/deploy-node.sh chelsty-ha
```
2. **Manual Configuration**:
- Edit `/opt/homelab/config/zigbee2mqtt/.env` with MQTT credentials.
- Add Mosquitto user:
```bash
sudo mosquitto_passwd -b /opt/homelab/data/mosquitto/config/password.txt <user> <password>
```
### Manual (SSH) — chelsty-infra uses docker-compose v1
```bash
ssh oskar@100.122.201.22
cd ~/homelab-codex-ws/services/<service>
docker-compose -f docker-compose.yml \
-f ../../hosts/chelsty-infra/runtime/<service>/docker-compose.override.yml \
up -d --build --force-recreate
```
3. **Service Deployment**:
Use the staged deployment runtime:
```bash
./scripts/deploy/deploy-node.sh chelsty
```
> **Note:** `docker compose` (v2) is **not** available on chelsty-infra — always use `docker-compose` (hyphenated, v1 1.29.2).
## Recovery Procedure
## Recovery Procedures
### Mosquitto stopped
```bash
ssh oskar@100.122.201.22 "docker start mosquitto"
# Ensure restart policy is correct:
docker update --restart unless-stopped mosquitto
```
### Zigbee2MQTT won't start
1. Check logs: `docker logs zigbee2mqtt --tail 50`
2. Verify SLZB-06U reachable from host: `nc -zv 192.168.1.105 6638`
3. Verify config is not empty: `cat /opt/homelab/data/zigbee2mqtt/data/configuration.yaml`
4. If config missing, recreate from the minimal template above
### SLZB-06U unreachable
`192.168.1.105:6638` EHOSTUNREACH means the coordinator is offline or the LAN is down. Zigbee2MQTT will keep retrying — no restart needed once the coordinator returns.
## Critical Backup Sets
| Data | Path |
|------|------|
| HA config + DB | `/opt/homelab/data/homeassistant/` on chelsty-ha |
| z2m config + coordinator backup + network key | `/opt/homelab/data/zigbee2mqtt/data/` |
| Mosquitto persistence + password file | `/opt/homelab/data/mosquitto/` |
| SLZB-06U coordinator state | Backup via SLZB-06U web UI at `192.168.1.105` |
> ⚠️ The Zigbee network key is in `configuration.yaml` or `coordinator_backup.json` — losing it requires re-pairing all devices.
In case of runtime failure:
1. Verify Docker and Compose plugin: `docker compose version`
2. Re-run bootstrap script to ensure directory structure and basic configs.
3. Check Mosquitto logs: `tail -f /opt/homelab/data/mosquitto/log/mosquitto.log`
4. Verify SLZB-06U reachability: `ping 192.168.1.105`

View file

@ -19,7 +19,7 @@ It acts as a filesystem-first watchdog that detects anomalies in the local runti
* **Heartbeat**: Updated every cycle at `/opt/homelab/state/stability-agent.heartbeat`.
* **State Summary**: A JSON summary of all latest checks at `/opt/homelab/state/stability-agent.json`.
* **Events**: Append-only JSON lines at `/opt/homelab/events/YYYY-MM-DD/chelsty-infra/events.jsonl`.
* **Events**: Append-only JSON lines at `/opt/homelab/events/YYYY-MM-DD/chelsty/events.jsonl`.
#### Deployment

View file

@ -1,98 +0,0 @@
# Observer Runtime
The Observer Runtime is a lightweight agent responsible for synthesizing the operational world state of the homelab from raw events, logs, and state files.
## Architecture
The observer follows a filesystem-first approach, consuming append-only events and generating a normalized world model. It is designed to be idempotent, resumable, and resilient to intermittent node connectivity.
### Inputs
- `/opt/homelab/events/`: Normalized JSON events (one `.json` file per event, organized by date and node).
- `/opt/homelab/state/observer_checkpoint.json`: Per-node checkpoint dict (see below).
- Repository Inventory: `inventory/topology.yaml` and `hosts/*/services.yaml`.
### World Model Output
Generated under `/opt/homelab/world/`:
- `nodes.json`: Current node availability, roles, disk/memory pressure, last seen timestamps. Dict keyed by node name.
- `services.json`: Service health status and links to active incidents. Dict keyed by `"node/service"`.
- `deployments.json`: Tracking of active and historical deployment runs by `correlation_id`.
- `incidents.json`: Correlated operational issues, including repeat failures and resolution status.
- `runtime-summary.json`: High-level overview for dashboards and planner agents.
## Checkpoint Format
The observer tracks per-node progress to avoid silently skipping event directories:
```json
{
  "node_checkpoints": {
    "vps": "/opt/homelab/events/2026-05-27/vps/evt-vps-1234.json",
    "piha": "/opt/homelab/events/2026-05-27/piha/evt-piha-5678.json",
    "chelsty-infra": "/opt/homelab/events/2026-05-27/chelsty-infra/evt-chelsty-infra-9012.json"
  }
}
```
A single global checkpoint (`last_processed_file`) was replaced with this per-node dict because the old approach silently skipped any node directory that sorts alphabetically before the last-seen node (e.g. `piha/` would be skipped when the checkpoint pointed to `vps/`).
**Reset:** Delete `/opt/homelab/state/observer_checkpoint.json`. The observer will reprocess all events and rebuild world state from scratch.
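The difference between the two checkpoint schemes can be sketched as below. `pending_events` is a hypothetical helper, not the observer's code; it assumes the node name is the file's parent directory and that file names within one node sort in processing order.

```python
from pathlib import PurePosixPath


def pending_events(event_paths: list, node_checkpoints: dict) -> list:
    """Return events newer than their own node's checkpoint, in sorted order."""
    pending = []
    for path in sorted(event_paths):
        node = PurePosixPath(path).parent.name  # .../<date>/<node>/<file>.json
        checkpoint = node_checkpoints.get(node)
        # A node without a checkpoint has never been processed: take everything.
        if checkpoint is None or path > checkpoint:
            pending.append(path)
    return pending
```

A single global checkpoint compares every path against the same value, so with a checkpoint under `vps/` any new file under `piha/` sorts below it and is never picked up. Keying the comparison by node removes that cross-node ordering dependency.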
## Event Types
### Negative events (create/escalate incidents)
- `service_unhealthy`, `healthcheck_failed` — open or increment an active incident
- `deployment_failed` — record failure in deployments.json
### Positive events (resolve state)
- `service_healthy` — marks service status as `healthy` **and** resolves any active incident for that service
- `service_recovered` — alias, same effect
- `deployment_completed` — marks deployment as completed
### Node events
- `node_online`, `node_offline` — update node status in nodes.json
- `disk_pressure_*` — set `disk_pressure` field on the node record
## Incident Lifecycle
1. **Detection**: A `service_unhealthy` or `healthcheck_failed` event creates or increments an active incident.
2. **Correlation**: Multiple failure events for the same `node/service` are collapsed into one incident, incrementing `occurrence_count`.
3. **Resolution**: A `service_healthy` or `service_recovered` event resolves any active incident for that service, setting `status: resolved` and `resolved_at`.
4. **Expiry**: Resolved incidents older than 7 days are pruned from world state by `_prune_stale_world()`.
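The four lifecycle steps can be sketched as a small state machine over the incidents dict. This is an illustration under assumptions, not the observer's implementation: the `active` status value, the epoch-float `timestamp` field on events, and the function names are hypothetical; the incident id format mirrors the documented example.

```python
NEGATIVE = {"service_unhealthy", "healthcheck_failed"}
POSITIVE = {"service_healthy", "service_recovered"}
RETENTION_SECONDS = 7 * 24 * 3600  # resolved incidents are kept for 7 days


def _active_incident(incidents: dict, node: str, service: str):
    for incident in incidents.values():
        if (incident["node"], incident["service"], incident["status"]) == (node, service, "active"):
            return incident
    return None


def apply_event(incidents: dict, event: dict) -> None:
    """Detection, correlation and resolution for one event (mutates `incidents`)."""
    node, service, ts = event["node"], event["service"], event["timestamp"]
    active = _active_incident(incidents, node, service)
    if event["type"] in NEGATIVE:
        if active is None:
            inc_id = f"inc-{int(ts)}-{node}-{service}"
            incidents[inc_id] = {"id": inc_id, "node": node, "service": service,
                                 "status": "active", "started_at": ts,
                                 "last_occurrence": ts, "occurrence_count": 1}
        else:
            # Repeat failure for the same node/service: collapse into the open incident.
            active["last_occurrence"] = ts
            active["occurrence_count"] += 1
    elif event["type"] in POSITIVE and active is not None:
        active["status"] = "resolved"
        active["resolved_at"] = ts


def prune_expired(incidents: dict, now: float) -> None:
    """Expiry: drop resolved incidents older than the retention window."""
    expired = [inc_id for inc_id, inc in incidents.items()
               if inc["status"] == "resolved" and now - inc["resolved_at"] > RETENTION_SECONDS]
    for inc_id in expired:
        del incidents[inc_id]
```

Replaying the same ordered events into an empty dict yields the same incidents, which is the property the checkpoint reset relies on.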
### Example Incident JSON
```json
{
  "inc-1715518800-vps-observer": {
    "id": "inc-1715518800-vps-observer",
    "node": "vps",
    "service": "observer",
    "status": "resolved",
    "severity": "error",
    "started_at": 1715518800.0,
    "last_occurrence": 1715518860.0,
    "occurrence_count": 2,
    "trigger_type": "containers_not_running",
    "resolved_at": 1715519100.0
  }
}
```
## World State Pruning
`_prune_stale_world()` runs every reconcile cycle and removes:
1. **Stale nodes** — nodes not present in `inventory/topology.yaml` (e.g. ghost nodes created when `NODE_NAME` was unset and fell back to the container's 12-char hex ID).
2. **Services of stale nodes** — all `node/service` keys whose node was pruned.
3. **Ghost service keys** — service keys whose service-name portion matches the pattern `<12hexchars>_<name>` (Docker internal stale-state artifacts, created when node-agent used `c.name` instead of the compose label).
4. **Expired incidents** — resolved incidents older than 7 days.
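Rules 1 to 3 can be sketched as a filter over the `services.json` dict. The regex is one reading of the `<12hexchars>_<name>` pattern (lowercase hex assumed), and `is_ghost_key` / `prune_services` are hypothetical names, not the body of `_prune_stale_world()`.

```python
import re

# Twelve hex characters, an underscore, then the real service name.
GHOST_SERVICE = re.compile(r"^[0-9a-f]{12}_.+$")


def is_ghost_key(service_key: str) -> bool:
    """True when the service part of a 'node/service' key looks like a Docker artifact."""
    _, _, service = service_key.partition("/")
    return bool(GHOST_SERVICE.match(service))


def prune_services(services: dict, known_nodes: set) -> dict:
    """Keep only services on nodes present in the topology and with non-ghost names."""
    return {key: value for key, value in services.items()
            if key.partition("/")[0] in known_nodes and not is_ghost_key(key)}
```

Building a new dict instead of deleting in place keeps the pruning step easy to test and safe to run every cycle.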
## Runtime Behavior
### Idempotency
The observer processes events in order. Deleting the checkpoint and restarting replays all events and produces the same world state.
### Deployment Tracking
Deployments are tracked via `correlation_id`. The observer synthesizes the start, end, and status of each deployment run from events.
### Topology Filtering
Events from nodes not listed in `inventory/topology.yaml` are discarded during pruning. This prevents transient bootstrap noise from polluting world state.

View file

@ -1,234 +0,0 @@
# SESSION: Building planner-agent (LLM-based diagnostics)
**DATE:** 2026-05-27
**RESULT:** planner-agent is running on SOLARIA (`healthy`), Ollama primary, cloud fallback ready to be enabled
---
## What was built
### `services/planner-agent/src/llm_router.py`
LLM routing module with a local-first fallback chain:
- **`LLMRouter`**: main class, routes through litellm
- **`ModelConfig`**: configuration of a single model (name, timeout, api_base, extra_kwargs)
- **`ModelMetrics`**: counters per model × outcome (`success`/`fallback`/`error`); success_rate
- **`RouteResult`**: routing result with `content`, `model_used`, `attempts`, `latency_ms`
- **`AttemptRecord`**: record of a single attempt (model, outcome, reason, latency_ms)
- **`_extract_json_from_fence()`**: extracts JSON from ` ```json ``` ` blocks when the model does not reply with clean JSON
Default chain: `ollama/qwen2.5:7b` (8s) → `claude-haiku-4-5-20251001` (30s) → `claude-sonnet-4-6` (30s)
Metrics for every call are published to the Redis channel `llm_router_metrics`.
### `services/planner-agent/src/planner.py`
Main agent loop:
- **`PlannerAgent`**: async agent: Redis sub → LLM diagnosis → pending action file → event
- **`HealthEvent`**: normalized health event from Redis (node, service, event_type, severity, payload)
- **`ActionProposal`**: action proposal with full metadata; `.to_action_file()` → executor format
- **`CooldownTracker`**: 5-minute gate per `svc_key` (node/service); does NOT register when the LLM call failed
- **`parse_event()`**: normalizes the two input formats (node-agent / control-plane)
- **`write_pending_action()`**: atomic write: `.tmp` → rename
- **`emit_event()`**: writes a `remediation_started` event to the filesystem (no imports from control-plane)
Pipeline:
```
Redis msg → parse_event() → benign skip → cooldown gate → _propose_action() (LLM)
→ write_pending_action() → emit_event("remediation_started")
```
### Companion files
| File | Description |
|------|------|
| `service.yaml` | Operational contract: owner_node=solaria, deps=redis+ollama, healthcheck=file |
| `docker-compose.yml` | env_file + extra_hosts:host-gateway + ANTHROPIC_API_KEY in environment |
| `Dockerfile` | python:3.11-slim, litellm, redis, jsonschema, structlog |
| `healthcheck.sh` | Checks the age of the heartbeat file (max 300s) |
| `requirements.txt` | litellm, redis, jsonschema, structlog |
| `tests/test_planner.py` | 49 unit tests |
| `tests/test_llm_router.py` | 34 unit tests |
---
## Key architectural decisions
### 1. HITL invariant (Human-in-the-loop)
The planner writes **only** to `actions/pending/`. The executor requires a file in `actions/approved/`.
The planner never executes an action on its own; this is a fundamental rule of the system.
Implementation: `write_pending_action()` writes to `pending/`, and no code path touches `approved/`.
### 2. Cooldown gate
Per `svc_key` (= `node/service`), 5 minutes by default. Goal: avoid flooding the operator with
repeated proposals for the same service.
**Key decision:** the cooldown is NOT registered if the entire LLM chain failed.
That way the next event can retry instead of being silently blocked
for 5 minutes even though no proposal was produced.
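The cooldown rule can be sketched as follows. `CooldownGate` and `propose_with_cooldown` are hypothetical names for this example, not the `CooldownTracker` from `planner.py`; the sketch assumes a failed LLM chain is signalled by a `None` proposal.

```python
import time


class CooldownGate:
    """Per-service gate that only starts counting after a successful proposal."""

    def __init__(self, window_seconds: float = 300.0, clock=time.monotonic):
        self._window = window_seconds
        self._clock = clock
        self._last = {}  # svc_key -> time of the last recorded proposal

    def allows(self, svc_key: str) -> bool:
        last = self._last.get(svc_key)
        return last is None or self._clock() - last >= self._window

    def record(self, svc_key: str) -> None:
        self._last[svc_key] = self._clock()


def propose_with_cooldown(gate: CooldownGate, svc_key: str, propose):
    """Call `propose()` unless the service is cooling down; record only on success."""
    if not gate.allows(svc_key):
        return None
    proposal = propose()  # assumed to return None when every model in the chain failed
    if proposal is not None:
        gate.record(svc_key)
    return proposal
```

Recording after the call rather than before it is the whole point: a failed attempt leaves the gate open, so the next event for the same service gets another try.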
### 3. Fallback chain: local-first
Order: Ollama (local GPU) → Haiku → Sonnet.
Rationale:
- Ollama sends no data to external services; low latency for simple cases
- Haiku = fast and cheap cloud fallback
- Sonnet = last resort for hard cases
A model is rejected on: timeout, network error, refusal pattern, invalid JSON, schema error.
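A stripped-down sketch of the fallback idea, using plain callables instead of litellm so it runs without network access. The `route` function and its tuple-based attempt records are assumptions for illustration; refusal-pattern and schema checks are left out.

```python
import json
from typing import Callable, List, Tuple


def route(prompt: str, models: List[Tuple[str, Callable[[str], str]]]):
    """Try each model in order; fall through on timeout, network error or invalid JSON."""
    attempts = []  # (model name, outcome, reason)
    for name, call in models:
        try:
            parsed = json.loads(call(prompt))
        except (TimeoutError, ConnectionError) as exc:
            attempts.append((name, "fallback", type(exc).__name__))
            continue
        except json.JSONDecodeError:
            attempts.append((name, "fallback", "invalid_json"))
            continue
        attempts.append((name, "success", ""))
        return parsed, name, attempts
    # Every model was rejected: the caller decides what to do with an empty result.
    return None, None, attempts
```

Keeping the attempt list alongside the result makes it cheap to publish per-model metrics afterwards, in the spirit of the `AttemptRecord` and `ModelMetrics` classes listed earlier.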
### 4. No imports from control-plane
`services/planner-agent/` is fully self-contained. It imports nothing from
`services/control-plane/`. Event emission is implemented locally (a copy of the logic in
`scripts/lib/events.py`).
Rationale: the planner must keep working even if control-plane is offline, and it has a
separate deployment cycle.
### 5. structlog with PrintLoggerFactory
We do not use `structlog.stdlib.add_logger_name`, because `PrintLogger` has no `.name` attribute.
Instead the processor chain is: `add_log_level` → `TimeStamper` → `StackInfoRenderer` →
`format_exc_info` → `JSONRenderer`.
### 6. NODE_NAME read at call time, not import time
`_emit_event_sync` reads `NODE_NAME` from the module-level `NODE_NAME` on every call
(not as a default parameter). This makes it patchable in tests.
---
## Problems encountered and solutions
### Problem: `localhost` inside the container does not reach the host
**Context:** Ollama runs on SOLARIA at `localhost:11434`. A Docker container
on the default bridge network cannot reach the host through `localhost`.
**Solution:**
1. Added `extra_hosts: - "host-gateway:host-gateway"` to docker-compose.yml
2. `.env` uses `OLLAMA_HOST=http://host-gateway:11434`
### Problem: `environment` vs `env_file`, duplicated variables
**Context:** The first version of docker-compose.yml hardcoded all variables
in the `environment` section with fallback values (`${VAR:-default}`). As a result
`.env` was optional rather than required.
**Solution:** Removed all runtime variables from `environment` and moved them to `env_file`.
Only `ANTHROPIC_API_KEY` remains in `environment` (an optional secret that should not live in a file on disk).
### Problem: `structlog.stdlib.add_logger_name` crashes with PrintLogger
**Symptom:** `AttributeError: 'PrintLogger' object has no attribute 'name'`
**Solution:** Removed `add_logger_name` from the processor chain. It is not
compatible with `PrintLoggerFactory`.
### Problem: verify stage fails right after startup
**Symptom:** `deploy.sh` reports FAILED at verify because the heartbeat does not exist yet.
**Cause:** Race condition: the agent needs a few seconds to start its
loop and perform the first heartbeat `touch()`.
**Solution:** This is not a real failure. The Docker healthcheck has `start_period: 30s`.
The container shows `(healthy)` 30s after start.
### Problem: git pull with divergent branches on solaria
**Symptom:** Solaria had 2 local commits that were not on Forgejo, plus manual changes in the working tree.
`git pull` failed with "Need to specify how to reconcile divergent branches."
**Solution:**
```bash
git checkout -- services/planner-agent/docker-compose.yml # discard manual changes
git fetch origin
git rebase origin/master # rebase local commits on top of master
```
---
## Deployment status on SOLARIA
```
Container: planner-agent Up ~30m (healthy)
Image: planner-agent-planner-agent
Node: solaria (100.100.231.104)
Heartbeat: /opt/homelab/state/planner-agent.heartbeat (age 0s)
Channels subscribed:
- health_events
- world_updates
LLM chain:
PRIMARY: ollama/qwen2.5-coder:14b @ http://host-gateway:11434
FALLBACK: claude-haiku-4-5-20251001 (disabled: no ANTHROPIC_API_KEY)
FALLBACK: claude-sonnet-4-6 (disabled: no ANTHROPIC_API_KEY)
Redis: redis://100.108.208.3:6379 ✓ connected
```
---
## Left for later
### 1. ANTHROPIC_API_KEY: cloud fallback disabled
Haiku and Sonnet are configured in the chain but have no API key.
When Ollama cannot cope (complex case / timeout), the chain fails with no fallback.
To enable:
```bash
ssh oskar@100.100.231.104
echo "ANTHROPIC_API_KEY=sk-ant-..." >> /opt/homelab/config/planner-agent/.env
docker compose -f ~/homelab-codex-ws/services/planner-agent/docker-compose.yml up -d
```
### 2. End-to-end test with a real event
The planner is connected to Redis and listening, but no event has yet
gone through the full path (LLM call → pending action → operator UI).
Test:
```bash
redis-cli -h 100.108.208.3 PUBLISH health_events '{
"type": "service_unhealthy",
"node": "piha",
"service": "mosquitto",
"severity": "error",
"payload": {"reason": "container exited"},
"timestamp": "2026-05-27T20:00:00Z"
}'
# Watch: docker logs planner-agent -f
# Check: ls /opt/homelab/actions/pending/
```
### 3. Solaria local commits
Solaria has 2 local commits (`feat: add ECC skills`, `fix: remove duplicate CLAUDE.md sections`)
that are not on Forgejo. They were rebased on top of master but not pushed.
They should be pushed, or reviewed and possibly squashed.
### 4. Integration with operator UI / Telegram
Proposals in `actions/pending/` do not yet have a notification channel to the operator.
The Telegram bot should send a notification when a new file appears in `pending/`.
---
## Commits from this session
```
ff6fda1 planner-agent: use env_file, keep only ANTHROPIC_API_KEY in environment
ca37fca Add planner-agent: LLM-powered remediation planner
(llm_router.py, planner.py, tests, service.yaml, docker-compose.yml,
healthcheck.sh, Dockerfile)
```

View file

@ -1,103 +0,0 @@
# SESSION: Stabilizing the homelab multi-agent system
**DATE:** 2026-05-27
**RESULT:** System NOMINAL (97/97 services, 0 errors)
---
## PROBLEMS FOUND
- stability-agent generated no remediation actions: only redeploy, no container_restart
- mosquitto on chelsty-infra went down and nothing restarted it (restart policy was `no`)
- zigbee2mqtt had never been deployed on chelsty-infra
- node-agent was an empty skeleton: it did not emit `service_healthy`, so `services.json` was always empty
- ghost services: node-agent used `c.name` (which can return `<12hex>_real-name`) instead of the `com.docker.compose.service` label
- the materializer on piha read from its local Redis instead of the control-plane API; Redis held 80 stale entries with ghost keys, and "Copy for AI" returned old data
- observer used a single global checkpoint instead of per-node ones and silently skipped event directories that sort before the current checkpoint
- supervisor did not cancel resolved actions, so the pending queue grew without bound
- the `service_healthy` event did not close active incidents
- NODE_ALIAS_MAP was not configured: node names in events did not match the topology
- chelsty-ha was wrongly in monitoring scope: it has no node-agent
---
## FIXES SHIPPED (commits in master)
```
7277bdc Fix Copy for AI: materializer fetches from control-plane API instead of Redis
b40b832 Fix ghost service keys from hash-prefixed Docker container names
28e9534 observer: service_healthy resolves active incidents
46ae92b supervisor: also cancel pending actions for services removed from desired state
410bfe7 zigbee2mqtt: config goes in data dir (writable), not separate ro mount
b3912fe zigbee2mqtt: use extra_hosts host-gateway instead of network_mode: host
61e07f4 zigbee2mqtt override: clear ports list for docker-compose v1 host network compat
51002d4 Fix pending actions: node_exporter, zigbee2mqtt, chelsty-ha monitoring
fb7828b supervisor: auto-cancel pending actions when drift is resolved
2f19657 fix(node-agent): unique event IDs per service to prevent same-second overwrites
267742c vps/node-agent: add network_mode: host for control-plane health probe
4e8968f Fix service health tracking: emit service_healthy, control-plane endpoint, checkpoint migration
f4a8db9 fix(observer): per-node-directory checkpoints replace single global checkpoint
a5a3e22 fix(node-agent): skip SSH config file in rsync to avoid UID ownership errors
2349de5 fix(node-agent): correct VPS_EVENTS_HOST to actual VPS Tailscale IP
65bac4e fix(node-agent): mount host SSH key into container for event shipping
96bf326 fix(observer+operator-ui): fix stale world state, dict→list API, event time filter
ae33cce feat(node-agent): add runtime overrides for piha, solaria, chelsty-infra
c5c080b feat(vps): add node-agent runtime override with NODE_NAME=vps
01b7758 feat(node-agent): implement health monitor and safe cleanup policy
```
### Details of key fixes
**fix(observer): per-node checkpoints**
A single global checkpoint `last_processed_file` silently skipped event directories that sort alphabetically before the last processed node (e.g. piha/ < vps/). It was replaced with a per-node dict `{"node_checkpoints": {"piha": "...", "vps": "..."}}`.
**fix(observer): ghost key pruning**
`_prune_stale_world()` now removes services.json entries whose service key matches the pattern `<12hexchars>_<name>`, which are artifacts of Docker internal state tracking.
**fix(node-agent): canonical container name**
`check_containers()` now uses the `com.docker.compose.service` label as the canonical name. Fallback: strip the hash prefix from `c.name`. Containers in the `created` state are skipped (Docker stale-state artifacts).
**fix(node-agent): service_healthy emission**
Node-agent now emits `service_healthy` for every running managed container each cycle. Without it `services.json` was always empty and the supervisor generated a flood of "missing service" redeploys.
**fix(supervisor): auto-cancel resolved actions**
`_cancel_resolved_pending_actions()` moves pending actions to `cancelled/` when:
- the service became healthy (`drift_resolved_auto`)
- the service was removed from desired state (`service_removed_from_desired_state`)
**fix(supervisor): monitor:false**
The `monitor: false` field in `services.yaml` excludes a service from supervisor action generation. Used for `homeassistant` on chelsty-ha (no node-agent).
**fix(agent-system/materializer): control-plane API as source**
The materializer on piha now fetches data from the VPS control-plane API (`CONTROL_PLANE_URL=http://100.95.58.48:18180`) instead of the local Redis. Redis held 80 stale entries. The Redis path is kept as a fallback.
**fix(chelsty-infra/zigbee2mqtt): mosquitto networking**
Mosquitto runs with `network_mode: host`, so bridge containers cannot reach it through localhost. Solution: `extra_hosts: - "mosquitto:host-gateway"` in the z2m override. We do not use `network_mode: host` for z2m because it conflicts with `ports:` in docker-compose v1 (1.29.2 on chelsty-infra).
**fix(chelsty-infra/zigbee2mqtt): writable config**
z2m migrates and rewrites `configuration.yaml` at startup. The config must live in the data directory, `/opt/homelab/data/zigbee2mqtt/data/configuration.yaml` (read-write mount), not in a separate `:ro` volume.
---
## FINAL STATE
| Node | Status | Services |
|------|--------|---------|
| vps | online | control-plane (4), node-agent, node_exporter, stability-agent |
| piha | online | agent-system (4), node-agent, stability-agent, monitoring stack |
| solaria | online | node-agent, stability-agent, AI workloads |
| chelsty-infra | online | mosquitto, zigbee2mqtt (z2m connects once SLZB-06U is back online), node-agent, stability-agent |
| chelsty-ha | - | homeassistant (monitor:false, no node-agent; HA is monitored indirectly via MQTT) |
**Action queue:** 0 pending, 0 approved, 0 running
**Incidents:** 0 active
**Ghost service keys:** 0
---
## KNOWN LIMITATIONS / TODO
- SLZB-06U (Zigbee coordinator) is offline: `192.168.1.105:6638` returns EHOSTUNREACH from chelsty-infra. Probably a hardware or network problem on the 192.168.1.0/24 side. z2m starts and serves an error page on :8080; it will connect automatically once the coordinator is back.
- The `ezsp` adapter in the z2m configuration is deprecated; migrating to `ember` is recommended. No new configuration is needed, only changing the field to `adapter: ember` in `configuration.yaml`.
- chelsty-ha has no node-agent. Add one when a machine is available, or bootstrap manually.
- Redis on piha still holds old keys (`homelab:nodes:*`, `homelab:incidents:*`, etc.). They are no longer used by the materializer in API mode and can be cleaned up.

View file

@ -1,100 +0,0 @@
# Session 2026-06-08: LUSTRO onboarding (RPi4 / Magic Mirror / KEN)
## Goal
Build a reusable node onboarding tool, `scripts/onboard/` (idempotent bash,
NOT Ansible, a deliberate decision), driven by a declarative manifest,
`hosts/<node>/node.yaml`. First real node: LUSTRO.
## Node LUSTRO (facts from preflight)
- RPi4, aarch64, Debian bookworm, hostname pimirror2, KEN network 192.168.31.x
- RAM 4 GB (MM uses ~1.7 Gi, the same profile as the VPS with the 2026-06-01 OOM → `mem_limit` mandatory)
- disk 58 G / 48% (headroom)
- docker 29.5.3 already installed (step `20-install-docker` is unnecessary for this node)
- user `pi`: uid=1000, passwordless sudo (confirmed: `sudo -n true` exits 0), groups docker+ollama
- Magic Mirror = systemd unit `magicmirror.service` (Electron running as pi), **UNTOUCHED** throughout the session
- swap = 200 M file `/var/swap` on the SD card → to be migrated to zram (card wear)
- Tailscale: installed in this session, Running, IP 100.99.85.73
## Decisions
- **user = the existing `pi`** (we do NOT create `oskar`: `pi` already occupies uid 1000, owns
  MM, and has docker+sudo; the node-agent docker `1000:1000` fits out of the box).
  A deliberate deviation from the "oskar everywhere" convention.
- node-agent runtime = docker
- `first_contact` = LAN IP `pi@192.168.31.19` (mDNS `.local` proved unreliable:
  transient resolve failures); after `tailscale up` the mesh takes over contact (`pi@lustro`)
- Tailscale auth = interactive login (URL), no authkey
- swap target = zram
## Status: 00-access CLOSED
Idempotent; passed a live run, and the re-run was clean. Lustro is in the mesh, and the SATURN→lustro
channel over Tailscale works without a password. Verify clean (arch=aarch64).
## Tool bugs fixed in this session
1. **dry-run was shallow** (orchestrator only) → `run()` helper + propagation of `DRY_RUN=1`
   to steps (`lib/common.sh`, `onboard.sh`, `remote.sh`, `00-access.sh`)
2. **`yaml_get` fallback** (without `yq`):
   - inline-comment stripping: `[[:space:]]+#.*$` after the value
   - PRE-EXISTING greedy-colon bug: `.*:` cut at the last colon and lost the prefix
     in `systemd:magicmirror.service`; fix: `^[[:space:]]*[^:]*:[[:space:]]*`
3. **`00-access` verify**: the ssh known-hosts warning leaked into the parsed `arch`
   (`WARN "Unexpected arch 'Warning:Permanently…'"`); fix: `-o LogLevel=ERROR`
   + clean stdout (no `2>&1`)
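The greedy-colon bug and its fix can be reproduced in a few lines. The tool itself is bash; this sketch transliterates the POSIX classes to Python's `\s` purely to show the behaviour. The key name `unit` and the helper `yaml_value` are hypothetical.

```python
import re

GREEDY = re.compile(r".*:")                # old: consumes up to the LAST colon
KEY_PREFIX = re.compile(r"^\s*[^:]*:\s*")  # fix: stops at the FIRST colon
INLINE_COMMENT = re.compile(r"\s+#.*$")    # whitespace, then '#' to end of line


def yaml_value(line: str) -> str:
    """Extract the scalar value from a single 'key: value  # comment' line."""
    value = KEY_PREFIX.sub("", line, count=1)
    return INLINE_COMMENT.sub("", value).strip()


line = "  unit: systemd:magicmirror.service  # MM runs as pi"
broken = GREEDY.sub("", line, count=1)  # loses the "systemd:" prefix
fixed = yaml_value(line)                # keeps it and drops the comment
```

Because `[^:]*` cannot cross a colon, the key match ends at the first one, so any colons inside the value survive.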
## Branch / commits
`feat/node-onboarding` (6 commits):
| Hash | Description |
|------|------|
| `adb8407` | scaffold: onboard.sh, lib/, steps/00-preflight, hosts/lustro/node.yaml draft |
| `9012a36` | 00-access.sh + node.yaml ssh_user/first_contact/hardware |
| `931fd46` | dry-run propagation: run() helper, DRY_RUN=0/1 |
| `eed0ad0` | yaml_get fix: inline-comment + greedy-colon |
| `1bed855` | first_contact: IP instead of mDNS .local |
| `471ba09` | verify fix: LogLevel=ERROR, clean stdout |
## OPEN: for the next session (in order)
1. **WORKTREE HYGIENE** (first thing): the whole session ran in the MAIN checkout, against
   the "main = deploy-only" rule. Decision unresolved:
   - (A) rename `feat/` → `task/node-onboarding` + worktree + main→master
     (full compliance with `agent.sh`; merge=FF)
   - (B) stay on `feat/` + manual `git merge --ff-only`

   `agent.sh new` creates `task/<name>` from `master` and does NOT take an existing branch.
   `git worktree list` has not been read yet (needed for the path pattern).
2. **base step**: migrate swap from the 200 M file → zram; `/opt/homelab` + `chown pi`
   (uid 1000 already matches); event dir `/opt/homelab/events/lustro/`
3. **node-agent step**: docker override, user 1000:1000 (pi=1000), `mem_limit: 256m`
4. **register step**: observer/supervisor inventory + redis sub + UI panel agents.okit.pl
5. **verify step (50)**: end-to-end smoke test (event reached the control plane, visible in the UI,
   real Telegram alert path)
6. **mm-watch**: health check `systemctl is-active magicmirror.service`
7. **minor items**: the URL banner in 00-access has an alignment defect; `locale pl_PL`
   is not generated on lustro (harmless)
## Learnings
(also reflected in `scripts/onboard/README.md`)
- mDNS `.local` is unreliable for automation → `first_contact` via IP or tailscale, not `.local`
- existing node with a uid=1000 user: use it instead of creating `oskar` (uid collision)
- swap on SD = wear → zram
- dry-run MUST propagate to step scripts (`run()` wrapper), otherwise it is useless
- a yaml fallback without `yq` must strip inline comments and must not be greedy on `:`
## Update: worktree hygiene
- feat/node-onboarding → task/node-onboarding. The main checkout (~/homelab-codex-ws) is back on master (deploy-only). Onboarding work lives in ~/homelab-codex-ws-node-onboarding.
- Origin: task/ pushed with tracking, feat/ deleted.
- MINOR: the worktree was created manually (git worktree add) → agent.sh list shows "(no marker)"/parent=?. It works; at the final `agent.sh merge node-onboarding`, verify that the missing marker does not get in the way; otherwise add a marker (pattern: ha-piha) or do a manual `git merge --ff-only`.
- NEXT: base step (zram, /opt/homelab, event dir /opt/homelab/events/lustro/), from the node-onboarding worktree.
- Separate future project: parent-layout refactor (bare + worktree under one directory); requires rewriting agent.sh and safeguarding the dirty ha-piha.
## Tech debt caught in the session
- OBSERVER STALENESS: a dead node (chelsty-infra) shows NOMINAL in agents.okit.pl. The observer/supervisor keeps the last known state and does not degrade it when heartbeats stop (events: only the VPS reports fresh data; chelsty is silent, yet its status is NOMINAL). FIX (remote, software): heartbeat TTL → once exceeded, mark `stale`/`down`. Important: a false NOMINAL undermines trust in the status of all nodes. Move to the main tech-debt backlog if a separate one exists.
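A heartbeat TTL of the kind proposed here could look roughly like this. The thresholds, the function name `node_status`, and the two-level `stale`/`down` split are assumptions for illustration; the notes do not specify values.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(minutes=5)   # assumed threshold
DOWN_AFTER = timedelta(minutes=30)   # assumed threshold


def node_status(last_seen: datetime, now: datetime, reported: str = "NOMINAL") -> str:
    """Degrade the last reported status once the heartbeat is too old."""
    age = now - last_seen
    if age > DOWN_AFTER:
        return "down"
    if age > STALE_AFTER:
        return "stale"
    return reported  # fresh heartbeat: trust what the node reported
```

The important property is that the displayed status is a function of both the last report and its age, so a silent node can never stay NOMINAL indefinitely.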

View file

@ -1,124 +0,0 @@
# Session 2026-06-09: fleet recovery + LUSTRO register
## Goal
Diagnose the silent failure of fleet reporting; finish the REGISTER step for LUSTRO
(40-register.sh + 50-verify.sh); update the node-onboarding skill.
---
## MAIN: 8-day silent fleet-reporting failure (RESOLVED)
### Root cause
`oskar` (uid 1002) was **not in the aerbot group (1000)** on the VPS.
`/opt/homelab/events/*` = `aerbot:aerbot 775` → `oskar` falls under "other" (r-x).
`rsync` push from every node (as `oskar` over SSH) = **Permission denied** on
write → `--remove-source-files` never cleared the backlog → **292 000 files** accumulated
in the node-agents' staging cache.
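Why mode 775 blocks `oskar` can be shown with the classic Unix permission check, in which exactly one of the owner, group, or other triplets applies. This sketch ignores root and ACLs; `can_write` is a hypothetical helper, and the values used with it (`aerbot` uid 1000, `oskar` primary gid 1002) are assumptions, since the notes give only the group id 1000 and `oskar`'s uid 1002.

```python
import stat


def can_write(mode: int, owner_uid: int, group_gid: int, uid: int, gids: set[int]) -> bool:
    """Return True if (uid, gids) may write, per owner/group/other bits."""
    if uid == owner_uid:
        return bool(mode & stat.S_IWUSR)
    if group_gid in gids:
        return bool(mode & stat.S_IWGRP)
    return bool(mode & stat.S_IWOTH)  # "other": r-x only under 775


before = can_write(0o775, 1000, 1000, uid=1002, gids={1002})
after = can_write(0o775, 1000, 1000, uid=1002, gids={1002, 1000})
```

`before` models `oskar` outside the group and `after` models the state once the group membership is added.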
### Fix
```bash
usermod -aG 1000 oskar   # on the VPS; SSH re-login required
```
### Verification
- VPS `events/piha`: 3443 files (growing)
- `piha` locally: 2 files (staging cleared)
- agents.okit.pl panel: vps / piha / solaria, Last Seen fresh
### Diagnosis: 5 layers, 4 refuted hypotheses
Verify-before-fix refuted, in order:
1. `authorized_keys` missing: the key was there and SSH worked (piha→VPS manually OK)
2. Remote agent down: `rsync` processes visible in `ps`, logs show no crash
3. VPS IP change: Tailscale IP unchanged, 100.95.58.48
4. Bridge/relay cutoff: ping VPS→piha OK over the mesh

The 5th layer (a permissions error) was found by running `rsync` manually as `oskar` on the VPS →
`Permission denied (13)` → `stat /opt/homelab/events/` → `aerbot:aerbot 775`.
### Why the failure was SILENT (3 masking layers)
| Layer | Mechanism |
|---------|-----------|
| (a) shipping fail | Logged as `WARNING`, not a crash: node-agent did not fail, it stayed silent |
| (b) observer staleness | A stale node is shown as NOMINAL: no heartbeat TTL, the observer keeps the last known state |
| (c) brain-watchdog | Blind to per-node freshness: it does not monitor event freshness per node |
### Remaining minor error
`rsync` exit code 23: `set-times` on the directory = `EPERM` (oskar is not the owner of
`/opt/homelab/events/`; `aerbot` is). Cosmetic: rsync works correctly.
**Fix**: add `--omit-dir-times` to the rsync invocation in node-agent (added to the backlog).
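As a rough sketch of the backlog item, the argument list for the push could be built like this. Only `--omit-dir-times` and `--remove-source-files` come from the notes; the `-a` flag, the function name, and the staging path in the usage line are assumptions, not the node-agent's real invocation.

```python
def rsync_push_args(src_dir: str, user: str, host: str, dest_dir: str) -> list[str]:
    """Build an rsync argv that ships files and drains the source directory."""
    return [
        "rsync",
        "-a",                     # assumed: archive mode (implies preserving times)
        "--omit-dir-times",       # skip setting mtimes on directories we do not own
        "--remove-source-files",  # delete each source file once it is transferred
        f"{src_dir.rstrip('/')}/",
        f"{user}@{host}:{dest_dir.rstrip('/')}/",
    ]


# Hypothetical staging path, for illustration only.
args = rsync_push_args("/tmp/staging/piha", "oskar", "100.95.58.48", "/opt/homelab/events/piha")
```

Passing the result to `subprocess.run(args)` as a list avoids shell quoting issues.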
---
## LUSTRO register: status after the session
### Done
- `40-register.sh`: written and committed on `task/node-onboarding`
  - Idempotent: grep topology, `[[ -f services.yaml ]]`, `git diff --quiet`
  - Commits only `inventory/topology.yaml` + `hosts/lustro/services.yaml` on the current branch
  - NO `git push` (the merge belongs to the operator)
- `50-verify.sh`: written and committed
  - 4 checks: node-agent running, events, observer restart + heartbeat poll, world/nodes.json
  - Pass/fail table; exit 1 on failure
- `40-deploy-node-agent.sh`: scaffold removed (deploy lives in 30-node-agent.sh)
- Dry run `40-register.sh --dry-run` passed cleanly
### Observer activation mechanism (investigated)
The observer bind-mounts the repo root as `/repo:ro` from `services/control-plane/docker-compose.yml`
(`../..:/repo:ro` → `/home/oskar/homelab-codex-ws` on the VPS). `_load_inventory()` is called
once at startup. **Activation after merge**: `git pull` on the VPS + `docker restart control-plane-observer`,
no redeploy needed.
### lustro entry in topology.yaml (minimal, 1:1 with piha)
```yaml
lustro:
  roles:
    - edge
  services:
    - node-agent
```
### PENDING (tomorrow)
1. Commit B: `onboard.sh --node lustro --step 40-register` live → commit on the branch
2. `agent.sh merge task/node-onboarding` → master
3. `git pull` on the VPS + `docker restart control-plane-observer`
4. `onboard.sh --node lustro --step 50-verify` → lustro visible in agents.okit.pl
---
## fix-event-bloat (task/fix-event-bloat)
Commit `d483274` on the branch: batch rsync, backlog trim, timeout 120s, backlog warn.
**PENDING**: review + deploy to the fleet.
---
## OOM ai-cluster (live observation)
Observed on the VPS during the session: cgroup OOM restart loop, python workers ~195 MB,
0 swap. **PENDING**: migrate `ai-cluster` → SOLARIA + add swap on the VPS.
---
## Gotcha of the session
**Worktree branch confusion**: `~/homelab-codex-ws-node-onboarding` had been switched
manually to `task/fix-event-bloat` (one worktree, two branches switched by hand).
This is an anti-pattern; always check `git branch --show-current` when entering a worktree.
Long term: a separate worktree per task.
---
## Tech debt caught in the session
→ recorded in `docs/backlog.md`

View file

@ -1,114 +0,0 @@
# Session 2026-06-10/11: lustro SSH shipping fix + ha-diag-agent piha
## Goal
Fix event shipping lustro → VPS; finalize the ha-diag-agent deploy config on piha;
preserve poison-quarantine (Codex) for a separate review.
---
## MAIN: LUSTRO event shipping FIXED (merged `a5a1352`)
### Root cause
`_ship_events_to_vps()` (`services/node-agent/src/node_agent.py`) calls `ssh` **without `-i`**,
so the key is looked up in `$HOME/.ssh` = `/home/homelab/.ssh` (the container has run as
uid 1000 `homelab` since `user: "1000:1000"` was added to the base
`services/node-agent/docker-compose.yml`). The lustro override mounted the key at `/root/.ssh`:
a **blind mount**, because ssh does not look there → `oskar@100.95.58.48: Permission denied`.
### Fix
`hosts/lustro/runtime/node-agent/docker-compose.override.yml`:
```yaml
- /home/pi/.ssh:/home/homelab/.ssh:ro   # was: /root/.ssh (blind mount)
```
The `pi@pimirror2` key was added to the `authorized_keys` of `oskar@VPS`.
The uid match (pi=1000 = homelab=1000) satisfies OpenSSH's strict ownership check.
### Verification
- 5 nodes NOMINAL in world state; lustro present in `/opt/homelab/world/nodes.json` (online, fresh `last_seen`)
- 7600+ backlog events drained to the VPS (`/opt/homelab/events/lustro/`)
- Staging on lustro drained to zero (`--remove-source-files` works)
- "Permission denied" is gone from the node-agent logs
### Diagnosis: a verify-before-fix lesson
Both agents (Claude Code, Codex) wrongly blamed the observer (poison event / race)
based on **stale state** (`events=2` from a manual test). Verify-before-fix refuted
both hypotheses: `events/lustro` on the VPS was empty → the problem was in the **delivery** layer
(SSH key), not in the observer.
---
## ha-diag-agent piha: deploy config merged (`5e9db5c`), deploy UNFINISHED
- `.env` created on piha: `/opt/homelab/config/ha-diag-agent/.env`, chmod 600
- **BUT token = PLACEHOLDER**: chelsty-ha is offline → no token and no connection
- Before `shadow_mode=false`: the restart target in the supervisor = container name
  `homeassistant5`; a curl of the endpoint with the token must return HTTP 200
- Decision PENDING: HA target = chelsty-ha vs HA Ken (`homeassistant5` on piha;
  from inside the container it is NOT `localhost`)
---
## observer poison-quarantine (Codex)
Preserved on branch `task/observer-poison-quarantine` (`78c9e4a`), **NOT in master**.
For a separate review: does the observer really hang on a malformed event?
(Poison was NOT the cause of the lustro issue; the hypothesis is unverified.)
Real bug → merge; otherwise → drop.
---
## 🔴 FLEET BOMB: discovered, NOT fixed (backlog, BLOCKING)
solaria / piha / chelsty still run **old root containers** of node-agent
(piha Created 2026-05-27, uid 0). Their `/root/.ssh` mount works only because
the containers predate `user: "1000:1000"`. The first `--force-recreate`, host
reboot, or image update will switch them to uid 1000, and shipping will break as it did on lustro.
**DO NOT RECREATE without the fix.** Details and fix: `docs/backlog.md`.
---
## Tech debt caught in the session
→ recorded in `docs/backlog.md` (fleet bomb, ha-diag-agent blocked,
poison-quarantine review, `--omit-dir-times`, stale comment in node_agent.py,
shipping success logged at `logger.debug`, lustro event bloat on the VPS).
## Session 20:19
### Commits
fa59625 docs(ha-diag-agent): replace curl verify commands with docker exec
d7e0d31 fix(ha-diag-agent): remove host port mapping for 8087
### Files changed
services/ha-diag-agent/DEPLOY.md | 4 ++--
services/ha-diag-agent/README.md | 4 ++--
services/ha-diag-agent/docker-compose.yml | 3 ---
services/ha-diag-agent/service.yaml | 3 ---
4 files changed, 4 insertions(+), 10 deletions(-)
### Deploys
None recorded
### Narrative
> _user-provided summary_
## Session 20:35
### Commits
(no new ones: commits d7e0d31 and fa59625 from this session landed in master before this entry)
### Files changed
(no changes; see Session 20:19)
### Deploys
None recorded
### Narrative
> _user-provided summary_

View file

@ -1,62 +0,0 @@
# Stability Agent Multi-Node Rollout
## Architecture Summary
The `stability-agent` is a lightweight Python service that monitors node health (disk, Docker containers, Tailscale, MQTT) and publishes state to a central Redis instance running on **PIHA**.
- **Source**: `services/stability-agent`
- **State Path**: `/opt/homelab/state`
- **Events Path**: `/opt/homelab/events`
- **Redis Target**: `100.108.208.3:6379` (PIHA)
## Why UI only showed CHELSTY
Previously, the `stability-agent` had `NODE_NAME` defaulted to `chelsty` and was only deployed there. The Agent System UI materializer on PIHA filters nodes based on the Redis keys `homelab:nodes:<NODE_NAME>`. Without other agents publishing their specific `NODE_NAME`, the UI remained limited to the single active node.
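The key-based filtering can be illustrated with a short sketch. It assumes keys arrive as already-decoded strings; `node_names` is a hypothetical helper, not the materializer's real function.

```python
PREFIX = "homelab:nodes:"


def node_names(keys):
    """Derive the visible node list from Redis key names."""
    return sorted(
        key[len(PREFIX):]
        for key in keys
        if key.startswith(PREFIX) and len(key) > len(PREFIX)
    )


# With a single agent publishing under the default name, only one node appears.
only_chelsty = node_names(["homelab:nodes:chelsty", "homelab:incidents:1"])
```

Each additional agent that publishes under its own `NODE_NAME` adds one key and therefore one entry.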
## Deployment
Use the helper script to deploy or generate commands. The script uses explicit Tailscale IPs for remote targets (piha, chelsty, vps) and runs locally for solaria.
```bash
# Print commands
./scripts/deploy/deploy-stability-agent.sh <node-name>
# Deploy via SSH (executes ssh oskar@<ip>)
./scripts/deploy/deploy-stability-agent.sh <node-name> --ssh
```
### Manual Steps per Node
The manual steps are encapsulated in `services/stability-agent/deploy-local.sh`. On the target node:
```bash
cd /home/oskar/homelab-codex-ws
git fetch origin
git checkout master
git pull origin master
cd services/stability-agent
./deploy-local.sh <node-name>
```
## Verification
### Fleet Overview
Run the verification script from any node with `redis-cli` access:
```bash
./scripts/deploy/verify-agent-fleet.sh
```
### Redis Inspection (on PIHA)
```bash
docker exec agent-system-redis redis-cli KEYS 'homelab:nodes:*'
docker exec agent-system-redis redis-cli HGETALL homelab:nodes:<node-name>
```
Verify Web UI backend:
```bash
curl -s http://127.0.0.1:18180/nodes
curl -k https://agents.okit.pl/nodes
```
## Troubleshooting
- **Redis empty after compose down**: The `agent-system-redis` on PIHA uses transient storage if not configured with a volume. If it restarts, agents must republish their state (they do this automatically every `CHECK_INTERVAL`).
- **Secrets**: `.env` files and local secrets are not committed to the repo. Ensure `MQTT_HOST` and other specific secrets are set via overrides if needed.
- **Telegram**: Telegram bot notifications can remain disabled if `TELEGRAM_BOT_TOKEN` is absent.
- **Docker Socket**: If the agent reports `unavailable` for Docker, ensure `/var/run/docker.sock` is mounted and the user has permissions.

View file

@ -49,10 +49,9 @@ Runtime state must live outside the repository to keep it immutable and clean.
## Service Standards
1. **Normalization**: Every service MUST follow the `services/<service>/` layout.
2. **Metadata**: Every service MUST have a `service.yaml` defining its operational contract. This is the primary source of truth for AI agents.
3. **Healthchecks**: Every service MUST have a `healthcheck.sh` for verification. Agents use this to emit stability events.
4. **Actionability**: Any automated recovery action proposed by an agent must be backed by a `service.yaml` definition.
5. **Secrets**: NEVER commit secrets to Git. Use `env.example` as a template and populate `/opt/homelab/config/<service>/.env` on the host. Agents must treat these as "black box" configurations.
2. **Metadata**: Every service MUST have a `service.yaml` defining its operational contract.
3. **Healthchecks**: Every service MUST have a `healthcheck.sh` for verification.
4. **Secrets**: NEVER commit secrets to Git. Use `env.example` as a template and populate `/opt/homelab/config/<service>/.env` on the host.
## Docker Compose Standards

View file

@ -1,126 +0,0 @@
# VPS Control Plane
The VPS Control Plane is the orchestration brain of the homelab platform. It runs on the Hetzner VPS (Tailscale IP: `100.95.58.48`) and provides observability, automated reconciliation, and a web-based operator interface.
## Architecture
The control plane consists of four core services running as a Docker Compose stack under `services/control-plane/`:
| Container | Role |
|-----------|------|
| `control-plane-observer` | Synthesizes world state from events in `/opt/homelab/events/` |
| `control-plane-supervisor` | Detects drift between desired state (`hosts/*/services.yaml`) and actual state (`world/services.json`); writes pending actions |
| `control-plane-executor` | Executes approved actions from `/opt/homelab/actions/approved/` |
| `control-plane-ui` | Web interface for system monitoring and action approval; serves port 18180 |
All services use **filesystem-first** semantics with `/opt/homelab/` as the data exchange layer. All four run with `network_mode: host` and as UID 1000 (`homelab` user).
## Supervisor Behavior
### Desired State
Loaded from `hosts/*/services.yaml` each reconcile cycle. Services with `monitor: false` are silently skipped — use this for services without a node-agent (e.g. `homeassistant` on `chelsty-ha`).
### Drift Types
- `missing_service` — service is in desired state but absent from `services.json`
- `unhealthy_service` — service exists in `services.json` but `status != healthy`
### Action Types
| Trigger | Action type | Risk |
|---------|-------------|------|
| `containers_not_running`, `mqtt_unreachable` | `container_restart` | low |
| Any other / unknown | `redeploy` | guarded |
| Node `disk_pressure: high` | `disk_cleanup` | guarded |
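The table can be read as a small lookup. This is a sketch of the mapping only; `plan_action` and `plan_node_action` are hypothetical names, and the real supervisor may carry more context per action.

```python
RESTART_TRIGGERS = {"containers_not_running", "mqtt_unreachable"}


def plan_action(trigger: str) -> tuple[str, str]:
    """Return (action_type, risk) for a service-level drift trigger."""
    if trigger in RESTART_TRIGGERS:
        return "container_restart", "low"
    return "redeploy", "guarded"  # any other or unknown trigger


def plan_node_action(disk_pressure: str):
    """Return the node-level action for a disk pressure reading, if any."""
    if disk_pressure == "high":
        return "disk_cleanup", "guarded"
    return None
```

Defaulting unknown triggers to the guarded `redeploy` keeps the low-risk path limited to an explicit allow-list.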
### Action ID Stability
Action IDs are deterministic: `redeploy-{node}-{service}` or `container-restart-{node}-{service}`. The same drift always produces the same filename, making reconcile truly idempotent across supervisor restarts.
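A minimal sketch of why deterministic IDs make reconcile idempotent, using an in-memory dict in place of the `pending/` directory. The helper names and the dict-based queue are assumptions for illustration.

```python
def action_id(action_type: str, node: str, service: str) -> str:
    """Build the stable ID, e.g. 'container-restart-piha-redis'."""
    return f"{action_type.replace('_', '-')}-{node}-{service}"


def enqueue(pending: dict, action_type: str, node: str, service: str) -> str:
    """Add an action unless one with the same ID is already pending."""
    aid = action_id(action_type, node, service)
    pending.setdefault(aid, {"type": action_type, "node": node, "service": service})
    return aid


pending = {}
enqueue(pending, "redeploy", "chelsty-infra", "zigbee2mqtt")
enqueue(pending, "redeploy", "chelsty-infra", "zigbee2mqtt")  # second cycle: no duplicate
```

Since the ID contains no timestamp or random part, repeating the same reconcile cycle cannot grow the queue.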
### Auto-Cancel
Pending `redeploy` and `container_restart` actions are automatically moved to `cancelled/` when:
- **`drift_resolved_auto`** — the service becomes `healthy` in actual state
- **`service_removed_from_desired_state`** — the service was removed from `services.yaml` or marked `monitor: false`
Only `pending` actions are auto-cancelled. Approved/running actions have been committed to by the operator and are never cancelled automatically.
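The auto-cancel rules can be sketched as a pure decision function. The dict shapes and the name `cancel_reason` are assumptions, as is the precedence (removal from desired state is checked before health); the source does not say which reason wins when both apply.

```python
CANCELLABLE = {"redeploy", "container_restart"}


def cancel_reason(action: dict, desired: dict, actual: dict):
    """Return the cancel reason for an action, or None to leave it alone."""
    if action["state"] != "pending" or action["type"] not in CANCELLABLE:
        return None  # approved/running actions are never auto-cancelled
    key = (action["node"], action["service"])
    spec = desired.get(key)
    if spec is None or spec.get("monitor") is False:
        return "service_removed_from_desired_state"
    if actual.get(key) == "healthy":
        return "drift_resolved_auto"
    return None
```

Here `desired` maps `(node, service)` to its `services.yaml` entry and `actual` maps it to the status from world state.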
### Node Name Resolution
The supervisor supports a `NODE_ALIAS_MAP` environment variable (JSON string) to map event/world-state node names to canonical topology names:
```bash
NODE_ALIAS_MAP='{"node-2": "chelsty-infra", "node-1": "piha"}'
```
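Reading and applying such a map could look like the sketch below. The function names are hypothetical; only the `NODE_ALIAS_MAP` variable and its JSON format come from the text above.

```python
import json
import os


def load_alias_map(env=os.environ) -> dict:
    """Parse NODE_ALIAS_MAP (a JSON object) or return an empty map."""
    raw = env.get("NODE_ALIAS_MAP", "")
    return json.loads(raw) if raw else {}


def canonical_node(name: str, aliases: dict) -> str:
    """Map an event/world-state node name to its topology name."""
    return aliases.get(name, name)  # unknown names pass through unchanged


aliases = load_alias_map({"NODE_ALIAS_MAP": '{"node-2": "chelsty-infra", "node-1": "piha"}'})
```

Names without an alias are returned as-is, so nodes that already use their topology name need no entry.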
## Deployment
### From SATURN (primary control node)
```bash
# Full deploy via SSH
./scripts/deploy/deploy-control-plane.sh --ssh
# Or manually:
ssh oskar@100.95.58.48 "cd ~/homelab-codex-ws && git pull origin master && cd services/control-plane && docker compose up -d --build --force-recreate"
```
### Direct on VPS
```bash
cd ~/homelab-codex-ws/services/control-plane
docker compose up -d --build --force-recreate
```
`deploy-local.sh` also creates the required `/opt/homelab/` directory structure and sets ownership to UID 1000 (requires `sudo`). If directories already exist, skip to the `docker compose` step directly.
### Verification
```bash
# On VPS
docker ps --filter "name=control-plane"
curl -s http://localhost:18180/summary | python3 -m json.tool
```
## Action Approval Workflow
```
Supervisor writes → /opt/homelab/actions/pending/<id>.json
→ Operator UI (port 18180) or Telegram Bot notifies
→ Operator clicks Approve
→ /opt/homelab/actions/approved/<id>.json
→ Executor executes → completed / failed
```
Possible action states: `pending → approved → running → completed / failed / rejected`
Auto-cancel path: `pending → cancelled/`
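The lifecycle can be expressed as a transition table. This sketch assumes that `rejected` is reached from `pending` (an operator declining an action), which the state list above does not spell out; `TRANSITIONS` and `can_move` are hypothetical names.

```python
TRANSITIONS = {
    "pending": {"approved", "rejected", "cancelled"},  # 'rejected' source is assumed
    "approved": {"running"},
    "running": {"completed", "failed"},
}


def can_move(src: str, dst: str) -> bool:
    """Return True if an action may move from state src to state dst."""
    return dst in TRANSITIONS.get(src, set())
```

Terminal states have no entry in the table, so every move out of them is refused.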
## Recovery
### World state is stale or corrupt
```bash
# On VPS — delete checkpoint to force full replay
rm /opt/homelab/state/observer_checkpoint.json
docker restart control-plane-observer
```
### Flood of pending actions after bootstrap
Check if node-agent is running and emitting `service_healthy` events on each node. Without `service_healthy`, the supervisor sees all services as missing and queues redeployments every cycle.
```bash
# Check node-agent on each node
ssh oskar@<node> "docker ps --filter name=node-agent && docker logs node-agent --tail 20"
```
### Rebuild from scratch
```bash
ssh oskar@100.95.58.48 "cd ~/homelab-codex-ws/services/control-plane && docker compose up -d --build --force-recreate"
```
## Integration
### piha agent-system webui (port 18180 on piha)
The `agent-system-runtime-materializer` on piha polls the VPS control-plane API every 10 seconds and mirrors world state to piha's local `/opt/homelab/world/`. This ensures the **"Copy for AI"** button in the piha webui (`agent-system-webui`) reflects the same clean state as the VPS API.
Override: `hosts/piha/runtime/agent-system/docker-compose.override.yml` — sets `CONTROL_PLANE_URL=http://100.95.58.48:18180`.
### Nginx Proxy Manager
The operator UI at port 18180 can be proxied via NPM for external access. No WebSocket support required.
### Log Locations
- Container logs: `docker compose logs -f` (from `services/control-plane/`)
- Runtime events: `/opt/homelab/events/YYYY-MM-DD/`
- World state: `/opt/homelab/world/`
- Action queue: `/opt/homelab/actions/{pending,approved,running,completed,failed,cancelled}/`

View file

@ -1,24 +0,0 @@
host: chelsty-ha
site: chelsty
capabilities:
  networking:
    reachability: tailscale-only
    tailscale_ip: 100.122.201.23
    ingress_suitability: false
    bandwidth: LTE
  runtime:
    container_engine: docker
    os: debian
  operational:
    connectivity: intermittent
    availability_target: best-effort
    offline_first: true
    uplink: lte
  deployment:
    suitability:
      - homeassistant
    restricted: false

View file

@ -1,20 +0,0 @@
hostname: chelsty-ha
site: chelsty
roles:
- homeassistant
network:
tailscale_ip: 100.122.201.23
runtime:
root: /opt/homelab
deployment:
mode: pull
managed_by: saturn
constraints:
connectivity:
intermittent: true
uplink: lte

View file

@ -1,12 +0,0 @@
host: chelsty-ha
site: chelsty
services:
  homeassistant:
    role: home-automation-controller
    offline_required: true
    # monitor: false because chelsty-ha has no node-agent deployed, so there are no
    # container-health events for the observer to track. HA is monitored
    # indirectly via the chelsty-infra MQTT broker (if MQTT goes silent, HA
    # is likely down). Re-enable once node-agent is bootstrapped on this VM.
    monitor: false

View file

@ -1,88 +0,0 @@
# Frigate NVR: chelsty-infra
# Hardware decode: Intel UHD 630 via VAAPI (/dev/dri/renderD128)
# Object detection: CPU (no Coral TPU)
# Cameras: 2x Reolink RLC-540 (5MP, WiFi)
#
# Required env vars in /opt/homelab/config/frigate/frigate.env:
#   CAMERA1_IP, CAMERA1_USER, CAMERA1_PASS
#   CAMERA2_IP, CAMERA2_USER, CAMERA2_PASS
#   MQTT_USER, MQTT_PASS (if mosquitto auth is enabled)
mqtt:
  enabled: true
  host: 127.0.0.1
  port: 1883
  # user: "{MQTT_USER}"
  # password: "{MQTT_PASS}"
detectors:
  cpu1:
    type: cpu
    num_threads: 3
ffmpeg:
  hwaccel_args: preset-vaapi
  global_args:
    - -hide_banner
    - -loglevel
    - warning
record:
  enabled: true
  retain:
    days: 7
    mode: all
  events:
    retain:
      default: 14
      mode: motion
snapshots:
  enabled: true
  retain:
    default: 7
  quality: 70
objects:
  track:
    - person
    - car
    - bicycle
  filters:
    person:
      min_area: 5000
      max_area: 100000
      threshold: 0.7
cameras:
  camera1:
    ffmpeg:
      inputs:
        # Main stream: high-res recording
        - path: rtsp://{CAMERA1_USER}:{CAMERA1_PASS}@{CAMERA1_IP}:554/h264Preview_01_main
          roles:
            - record
        # Sub stream: low-res detection (lower CPU cost)
        - path: rtsp://{CAMERA1_USER}:{CAMERA1_PASS}@{CAMERA1_IP}:554/h264Preview_01_sub
          roles:
            - detect
    detect:
      enabled: true
      width: 640
      height: 480
      fps: 5
  camera2:
    ffmpeg:
      inputs:
        - path: rtsp://{CAMERA2_USER}:{CAMERA2_PASS}@{CAMERA2_IP}:554/h264Preview_01_main
          roles:
            - record
        - path: rtsp://{CAMERA2_USER}:{CAMERA2_PASS}@{CAMERA2_IP}:554/h264Preview_01_sub
          roles:
            - detect
    detect:
      enabled: true
      width: 640
      height: 480
      fps: 5

View file

@ -1,25 +0,0 @@
services:
  frigate:
    container_name: frigate
    image: ghcr.io/blakeblackshear/frigate:stable
    restart: unless-stopped
    privileged: true
    shm_size: "256mb"
    network_mode: host
    devices:
      - /dev/dri/renderD128:/dev/dri/renderD128
    volumes:
      - /etc/localtime:/etc/localtime:ro
      - /opt/homelab/config/frigate/config.yml:/config/config.yml
      - /opt/homelab/config/frigate:/config/credentials:ro
      - /opt/homelab/data/frigate:/media/frigate
    tmpfs:
      - /tmp/cache
    env_file:
      - /opt/homelab/config/frigate/frigate.env
    healthcheck:
      test: ["CMD-SHELL", "wget -q --spider http://localhost:5000/api/version 2>&1 || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s

View file

@ -1,11 +0,0 @@
services:
  node-agent:
    environment:
      - NODE_NAME=chelsty-infra
      - NODE_TYPE=lte_node
      - VPS_EVENTS_HOST=100.95.58.48
      - VPS_EVENTS_USER=oskar
      - VPS_EVENTS_PATH=/opt/homelab/events
      - CHECK_INTERVAL=60
    volumes:
      - /home/oskar/.ssh:/home/homelab/.ssh:ro

View file

@ -1,21 +0,0 @@
services:
  zigbee2mqtt:
    # mosquitto runs with network_mode: host on chelsty-infra.
    # extra_hosts maps the 'mosquitto' hostname to the host gateway IP so that
    # mqtt://mosquitto:1883 in configuration.yaml reaches the host-networked
    # mosquitto process. Requires Docker 20.10+ (present on chelsty-infra).
    extra_hosts:
      - "mosquitto:host-gateway"
    environment:
      - TZ=Europe/Warsaw
    healthcheck:
      test: ["CMD-SHELL", "wget -qO- http://localhost:8080 > /dev/null 2>&1 || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 90s
    # Note: volumes NOT overridden here.
    # The base docker-compose.yml mounts /opt/homelab/data/zigbee2mqtt/data:/app/data
    # (read-write). configuration.yaml must be placed in that directory on the node:
    #   /opt/homelab/data/zigbee2mqtt/data/configuration.yaml
    # z2m rewrites this file during migrations, so a read-only mount is not viable.

View file

@ -1,37 +0,0 @@
host: chelsty-infra
site: chelsty
services:
  ha-diag-agent:
    role: ha-diagnostic-agent
    deployment_model: docker-compose
    exposure: local-only
    offline_required: false
    depends_on:
      local: []
      external: [homeassistant]
    config:
      target_url: http://100.70.180.90:8123   # chelsty-ha via Tailscale (HAOS, separate VM)
      location_tag: "chelsty"
      events_dir: /opt/homelab/events/chelsty-infra
    runtime:
      config_path: /opt/homelab/config/ha-diag-agent
      data_path: /var/lib/ha-diag-agent
  node-agent:
    role: node-stability-monitor
    # LTE node: node-agent monitors and emits events but does NO Docker cleanup.
    # Disk pressure on chelsty-infra is typically Frigate recordings; Frigate's
    # own retain policy is the correct remediation, not docker prune.
    deployment_model: docker-compose
    exposure: local-only
    offline_required: true
  mosquitto:
    role: local-mqtt-broker
  zigbee2mqtt:
    role: zigbee-mqtt-bridge
  frigate:
    role: nvr

View file

@ -1,6 +1,3 @@
host: chelsty-infra
site: chelsty
capabilities:
hardware:
cpu:
@ -34,11 +31,10 @@ capabilities:
power_constraint: low-power
connectivity: intermittent
availability_target: best-effort
offline_operation_required: true
deployment:
suitability:
- staging
- infra
- homeassistant
- edge
restricted: false

View file

@ -1,10 +1,9 @@
hostname: chelsty-infra
site: chelsty
hostname: chelsty
roles:
- edge
- hypervisor
- infra
- homeassistant
- staging
network:

View file

@ -1,4 +1,4 @@
host: chelsty-infra
host: chelsty
uplink:
type: lte
@ -20,7 +20,7 @@ exposure_classes:
networks:
home_automation_lan:
purpose: MQTT broker, Zigbee coordinator, and local device control.
purpose: Home Assistant, MQTT, Zigbee coordinator, and local device control.
offline_required: true
internet_required_for_core_operation: false

View file

@ -1,4 +1,4 @@
host: chelsty-infra
host: chelsty
runtime_root: /opt/homelab
@ -9,6 +9,12 @@ conventions:
logs: /opt/homelab/logs
services:
homeassistant:
data: /opt/homelab/data/homeassistant
config: /opt/homelab/config/homeassistant
logs: /opt/homelab/logs/homeassistant
backup_priority: critical
zigbee2mqtt:
data: /opt/homelab/data/zigbee2mqtt
config: /opt/homelab/config/zigbee2mqtt
@ -21,13 +27,13 @@ services:
logs: /opt/homelab/logs/mosquitto
backup_priority: high
stability-agent:
data: /opt/homelab/state
config: /opt/homelab/config/stability-agent
logs: /opt/homelab/events
backup_priority: low
backup_sets:
homeassistant:
include:
- /opt/homelab/config/homeassistant
- /opt/homelab/data/homeassistant
restore_note: Restore before starting the Home Assistant container.
zigbee2mqtt:
include:
- /opt/homelab/config/zigbee2mqtt

View file

@ -1,11 +1,6 @@
services:
stability-agent:
environment:
- NODE_NAME=chelsty-infra
- SITE_NAME=chelsty
- REDIS_HOST=100.108.208.3
- REDIS_PORT=6379
- REDIS_ENABLED=true
- STABILITY_CHECK_INTERVAL=60
- DISK_THRESHOLD_PCT=85
- MQTT_HOST=mosquitto

View file

@ -0,0 +1,13 @@
services:
  zigbee2mqtt:
    volumes:
      - ./configuration.yaml:/app/data/configuration.yaml:ro
    environment:
      - MQTT_USER=${MQTT_USER}
      - MQTT_PASSWORD=${MQTT_PASSWORD}
    # Healthcheck is already defined in base service, but we ensure compatibility
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080"]
      interval: 10s
      timeout: 5s
      retries: 3

126
hosts/chelsty/services.yaml Normal file
View file

@ -0,0 +1,126 @@
host: chelsty
exposure_classes:
  local-only:
    description: Reachable only from CHELSTY-local networks or container networks.
    public_ingress: false
    tailscale_required: false
  tailscale-internal:
    description: Reachable through the Tailscale mesh by approved tailnet clients.
    public_ingress: false
    tailscale_required: true
  public:
    description: Reachable from the public internet through an explicit ingress path.
    public_ingress: true
    tailscale_required: false
operational_constraints:
  uplink: lte
  connectivity: intermittent
  offline_operation_required: true
  must_not_depend_on:
    - saturn
    - vps
    - forgejo
services:
  homeassistant:
    role: home-automation-controller
    deployment_model: docker-compose
    exposure: tailscale-internal
    offline_required: true
    depends_on:
      local:
        - mosquitto
        - zigbee2mqtt
      external: []
    ports:
      - name: http
        container_port: 8123
        protocol: tcp
    runtime:
      config_path: /opt/homelab/config/homeassistant
      data_path: /opt/homelab/data/homeassistant
      logs_path: /opt/homelab/logs/homeassistant
    backup:
      recommended: true
      include:
        - /opt/homelab/config/homeassistant
        - /opt/homelab/data/homeassistant
      notes:
        - Back up before Home Assistant core, supervisor-equivalent, or integration upgrades.
        - Keep local restore copies on CHELSTY because LTE connectivity may be unavailable during recovery.
  zigbee2mqtt:
    role: zigbee-mqtt-bridge
    deployment_model: docker-compose
    exposure: local-only
    offline_required: true
    depends_on:
      local:
        - mosquitto
      external:
        - slzb-06u
    coordinator:
      name: slzb-06u
      connection: network
      usb_device: null
    ports:
      - name: frontend
        container_port: 8080
        protocol: tcp
        exposure: tailscale-internal
    runtime:
      config_path: /opt/homelab/config/zigbee2mqtt
      data_path: /opt/homelab/data/zigbee2mqtt
      logs_path: /opt/homelab/logs/zigbee2mqtt
    backup:
      recommended: true
      include:
        - /opt/homelab/config/zigbee2mqtt
        - /opt/homelab/data/zigbee2mqtt
      notes:
        - Include configuration.yaml, database.db, coordinator backup files, and network key material.
        - Restore Zigbee2MQTT state together with the SLZB-06U coordinator state when replacing hardware.
  mosquitto:
    role: local-mqtt-broker
    deployment_model: docker-compose
    exposure: local-only
    offline_required: true
    depends_on:
      local: []
      external: []
    ports:
      - name: mqtt
        container_port: 1883
        protocol: tcp
    runtime:
      config_path: /opt/homelab/config/mosquitto
      data_path: /opt/homelab/data/mosquitto
      logs_path: /opt/homelab/logs/mosquitto
    backup:
      recommended: true
      include:
        - /opt/homelab/config/mosquitto
        - /opt/homelab/data/mosquitto
      notes:
        - Retain ACL, password, persistence, and bridge configuration if enabled.
  stability-agent:
role: node-stability-monitor
deployment_model: docker-compose
exposure: local-only
offline_required: true
depends_on:
local:
- mosquitto
external: []
runtime:
config_path: null
data_path: /opt/homelab/state
logs_path: /opt/homelab/events
backup:
recommended: false
notes:
- Events and state are transient or can be reconstructed, and they are written at high frequency, so backups are not recommended.
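
The `must_not_depend_on` constraint in this manifest can be checked mechanically. The following is a minimal sketch, assuming the manifest has already been loaded into a plain dict with the same keys as the YAML, and assuming an external dependency is written either as `host` or as `host:service` (the form `piha:redis` appears in another host's manifest). The function name `forbidden_dependencies` is hypothetical and not part of the repository.

```python
def forbidden_dependencies(manifest):
    """Return {service: [offending deps]} for external deps on forbidden hosts."""
    constraints = manifest.get("operational_constraints", {})
    forbidden = set(constraints.get("must_not_depend_on", []))
    violations = {}
    for name, spec in manifest.get("services", {}).items():
        external = spec.get("depends_on", {}).get("external", [])
        # Assumption: an entry is "host" or "host:service"; compare the host part only.
        bad = [dep for dep in external if dep.split(":", 1)[0] in forbidden]
        if bad:
            violations[name] = bad
    return violations


# Trimmed copy of the manifest above, reduced to the keys the check reads.
manifest = {
    "operational_constraints": {"must_not_depend_on": ["saturn", "vps", "forgejo"]},
    "services": {
        "zigbee2mqtt": {"depends_on": {"local": ["mosquitto"], "external": ["slzb-06u"]}},
        "stability-agent": {"depends_on": {"local": ["mosquitto"], "external": []}},
    },
}
print(forbidden_dependencies(manifest))
```

A check like this could run in a pre-deploy gate so that an offline-required node never gains a dependency on a host it cannot reach over LTE.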

View file

@ -1,32 +0,0 @@
# hosts/lustro/node.yaml — LUSTRO edge node manifest
# First-contact bootstrap: scripts/onboard/onboard.sh --node lustro --step 00-access
# Full onboarding: scripts/onboard/onboard.sh --node lustro
name: LUSTRO
role: edge
location: KEN
ssh_user: pi
first_contact: pi@192.168.31.19 # LAN IP at KEN; mDNS .local is unreliable; the mesh takes over after tailscale up
tailscale:
hostname: lustro
# ip: TODO — fill after tailscale join (step 30-install-tailscale)
deploy_autonomy: true # onboard.sh may run mutating steps autonomously
git_control: false # node does NOT pull from Forgejo; push-based via SATURN
hardware:
arch: arm64
ram_mb: 4096
swap:
kind: zram
mb: 2048
docker_present: true
mm_runtime: systemd:magicmirror.service
services:
node-agent:
runtime:
engine: docker
mem_limit: 256m

View file

@ -1,23 +0,0 @@
services:
node-agent:
# Docker GID on LUSTRO is 991 (not the Debian default 999).
# Compose concatenates group_add lists; 991 is what gives socket access here.
group_add:
- "991"
mem_limit: 256m # RPi4 4 GiB; MagicMirror consumes ~1.9 GiB — agent must be bounded
environment:
- NODE_NAME=lustro
- NODE_TYPE=sd_card
- VPS_EVENTS_HOST=100.95.58.48
- VPS_EVENTS_USER=oskar
- VPS_EVENTS_PATH=/opt/homelab/events
- CHECK_INTERVAL=60
volumes:
# pi's SSH key for rsync event shipping to VPS (push-based node, no repo
# checkout). Container runs as uid 1000 (homelab, HOME=/home/homelab) per
# the base compose — ssh has no -i flag, so the key must land in
# /home/homelab/.ssh, NOT /root/.ssh. uid match (pi=1000) satisfies
# OpenSSH strict ownership checks on the mounted key.
- /home/pi/.ssh:/home/homelab/.ssh:ro
# Override ../.. from the base compose to the pushed deploy dir (no repo on node)
- /opt/homelab/deploy/node-agent:/repo:ro

View file

@ -1,15 +0,0 @@
host: lustro
services:
node-agent:
role: node-stability-monitor
deployment_model: docker-compose
exposure: local-only
offline_required: true
depends_on:
local: []
external: []
runtime:
config_path: /opt/homelab/config/node-agent
data_path: /opt/homelab/state
logs_path: /opt/homelab/events

View file

@ -1,8 +0,0 @@
services:
runtime-materializer:
environment:
# Pull world state from the VPS control-plane API instead of local Redis.
# The observer on VPS is the authoritative writer; mirroring its API output
# here ensures the webui /snapshot matches the clean 97-service state that
# the control-plane /summary endpoint serves.
CONTROL_PLANE_URL: "http://100.95.58.48:18180"

View file

@ -1,4 +0,0 @@
services:
brain-watchdog:
mem_limit: 64m
restart: unless-stopped

View file

@ -1,12 +0,0 @@
services:
ha-diag-agent:
environment:
- NODE_NAME=piha
# Pin events to the piha-specific subdirectory. This overrides the ${NODE_NAME}
# substitution in the base compose file, which requires a shell env var.
volumes:
- /opt/homelab/events/piha:/events
- /var/lib/ha-diag-agent:/data
- /opt/homelab/config/ha-diag-agent:/config:ro
mem_limit: 128m
restart: unless-stopped

View file

@ -1,11 +0,0 @@
services:
node-agent:
environment:
- NODE_NAME=piha
- NODE_TYPE=sd_card
- VPS_EVENTS_HOST=100.95.58.48
- VPS_EVENTS_USER=oskar
- VPS_EVENTS_PATH=/opt/homelab/events
- CHECK_INTERVAL=60
volumes:
- /home/oskar/.ssh:/home/homelab/.ssh:ro

View file

@ -1,7 +0,0 @@
services:
stability-agent:
environment:
- NODE_NAME=piha
- REDIS_HOST=100.108.208.3
- REDIS_PORT=6379
- REDIS_ENABLED=true

View file

@ -1,42 +0,0 @@
host: piha
services:
ha-diag-agent:
role: ha-diagnostic-agent
deployment_model: docker-compose
exposure: local-only
offline_required: false
depends_on:
local: []
external: [homeassistant]
config:
target_url: http://localhost:8123
location_tag: "ken"
events_dir: /opt/homelab/events/piha
runtime:
config_path: /opt/homelab/config/ha-diag-agent
data_path: /var/lib/ha-diag-agent
node-agent:
role: node-stability-monitor
deployment_model: docker-compose
exposure: local-only
offline_required: true
depends_on:
local: []
external: []
runtime:
config_path: /opt/homelab/config/node-agent
data_path: /opt/homelab/state
logs_path: /opt/homelab/events
brain-watchdog:
role: control-plane-watchdog
deployment_model: docker-compose
exposure: private
offline_required: false
depends_on:
local: []
external: [control-plane]
runtime:
config_path: /opt/homelab/config/brain-watchdog

View file

@ -1,11 +0,0 @@
services:
node-agent:
environment:
- NODE_NAME=solaria
- NODE_TYPE=ai_node
- VPS_EVENTS_HOST=100.95.58.48
- VPS_EVENTS_USER=oskar
- VPS_EVENTS_PATH=/opt/homelab/events
- CHECK_INTERVAL=60
volumes:
- /home/oskar/.ssh:/home/homelab/.ssh:ro

View file

@ -1,7 +0,0 @@
services:
stability-agent:
environment:
- NODE_NAME=solaria
- REDIS_HOST=100.108.208.3
- REDIS_PORT=6379
- REDIS_ENABLED=true

View file

@ -1,15 +0,0 @@
host: solaria
services:
node-agent:
role: node-stability-monitor
deployment_model: docker-compose
exposure: local-only
offline_required: true
depends_on:
local: []
external: []
runtime:
config_path: /opt/homelab/config/node-agent
data_path: /opt/homelab/state
logs_path: /opt/homelab/events

View file

@ -1,39 +0,0 @@
# Control-plane production overrides for the VPS deployment.
#
# NODE_ALIAS_MAP translates the node names that appear in raw event files
# (written by node agents / seed scripts) to the canonical names used in
# inventory/topology.yaml and hosts/*/services.yaml.
#
# Current live mapping (from /opt/homelab/events/ inspection):
# node-2 → chelsty (zigbee2mqtt / mosquitto / homeassistant node)
#
# Add further entries when new nodes come online and their event-source names
# differ from their topology names. Format is a single-line JSON object, e.g.:
# NODE_ALIAS_MAP='{"node-2":"chelsty","node-3":"piha"}'
#
# The executor inherits the canonical name from the action JSON written by the
# supervisor, so NODE_ALIAS_MAP is only required on the supervisor service.
#
# Memory limits: VPS has 4 GiB RAM, no swap. oom_score_adj -900 makes the
# host kernel OOM-killer very unlikely to target control-plane containers
# (only -1000 exempts a process entirely). mem_limit provides a per-container
# cgroup ceiling so a leaking process is killed at the limit and restarted by
# Docker before it can exhaust host memory.
services:
operator-ui:
mem_limit: 192m
oom_score_adj: -900
observer:
mem_limit: 192m
oom_score_adj: -900
supervisor:
mem_limit: 400m
oom_score_adj: -900
environment:
- NODE_ALIAS_MAP={"node-2":"chelsty"}
executor:
mem_limit: 64m
oom_score_adj: -900
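
The alias comment above describes `NODE_ALIAS_MAP` as a single-line JSON object that maps event-source names to canonical topology names. The following is a minimal sketch of how a consumer might apply it, assuming the value is read from the environment and that names without an alias pass through unchanged. The helper names `load_alias_map` and `canonical_node` are hypothetical and are not taken from the supervisor code.

```python
import json
import os


def load_alias_map(raw=None):
    """Parse NODE_ALIAS_MAP (a JSON object); unset or empty means no aliases."""
    if raw is None:
        raw = os.environ.get("NODE_ALIAS_MAP", "")
    if not raw.strip():
        return {}
    mapping = json.loads(raw)
    if not isinstance(mapping, dict):
        raise ValueError("NODE_ALIAS_MAP must be a JSON object")
    return {str(key): str(value) for key, value in mapping.items()}


def canonical_node(name, mapping):
    """Return the canonical name; unknown names pass through (assumption)."""
    return mapping.get(name, name)


aliases = load_alias_map('{"node-2":"chelsty"}')
print(canonical_node("node-2", aliases))
print(canonical_node("piha", aliases))
```

Rejecting non-object JSON early keeps a malformed value from silently disabling the mapping.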

View file

@ -1,7 +0,0 @@
# Control Plane Environment Variables
PORT=8080
HOMELAB_STATE_ROOT=/opt/homelab/state
HOMELAB_EVENTS_ROOT=/opt/homelab/events
HOMELAB_WORLD_ROOT=/opt/homelab/world
HOMELAB_ACTIONS_ROOT=/opt/homelab/actions
HOMELAB_CONFIG_ROOT=/opt/homelab/config

View file

@ -1,16 +0,0 @@
services:
node-agent:
environment:
- NODE_NAME=vps
- CHECK_INTERVAL=60
# host network mode: node-agent on VPS shares the host's network namespace
# so that localhost:18180 resolves to the control-plane's exposed port.
# Without this, localhost inside the container is the container's own loopback
# and the _check_control_plane_health() probe would always fail.
network_mode: host
# HARD memory ceiling: node-agent mounts /opt/homelab/events/ (page cache)
# and may accumulate Python RSS over hours; the 640m cap ensures it is killed and
# auto-restarted by Docker before consuming host memory. oom_score_adj -900
# makes the host kernel OOM-killer very unlikely to pick it as a global victim.
mem_limit: 640m
oom_score_adj: -900

View file

@ -1,9 +0,0 @@
services:
stability-agent:
environment:
- NODE_NAME=vps
- REDIS_HOST=100.108.208.3
- REDIS_PORT=6379
- REDIS_ENABLED=true
mem_limit: 96m
oom_score_adj: -900

hosts/vps/services.txt (new file, 1 line added)
View file

@ -0,0 +1 @@
npm

View file

@ -1,43 +0,0 @@
host: vps
services:
node-agent:
role: node-stability-monitor
deployment_model: docker-compose
exposure: local-only
offline_required: true
depends_on:
local: []
external: []
runtime:
config_path: /opt/homelab/config/node-agent
data_path: /opt/homelab/state
logs_path: /opt/homelab/events
control-plane:
role: management-and-orchestration
deployment_model: docker-compose
exposure: tailscale-internal
offline_required: false
depends_on:
local:
- node-agent
external:
- piha:redis
ports:
- name: http
container_port: 18180
protocol: tcp
runtime:
config_path: /opt/homelab/config/control-plane
data_path: /opt/homelab/data/control-plane
logs_path: /opt/homelab/logs/control-plane
node_exporter:
role: metrics-exporter
deployment_model: docker-compose
exposure: local-only
offline_required: true
depends_on:
local: []
external: []

View file

@ -17,10 +17,6 @@ nodes:
roles:
- infra
- monitoring
services:
- node-agent
- ha-diag-agent
- brain-watchdog
solaria:
roles:
@ -31,25 +27,12 @@ nodes:
roles:
- edge
- ingress
- control-plane
services:
# Repo-managed GitOps services (hosts/vps/services.yaml is authoritative)
- node-agent
- control-plane # executor, observer, supervisor, operator-ui
- node_exporter
- stability-agent
- npm # Nginx Proxy Manager — public ingress, TLS termination
- outline # Team wiki (outline + postgres + redis)
- joplin # Note sync server (joplin-server + postgres)
- ai-cluster # AI workers: codex-worker, openclaw, planner-worker,
# service-ops-worker, redis, mosquitto
chelsty-infra:
site: chelsty
chelsty:
roles:
- remote
- hypervisor
- infra
- homeassistant
- staging
connectivity:
uplink: lte
@ -57,28 +40,10 @@ nodes:
home_automation:
offline_operation_required: true
services:
- homeassistant
- zigbee2mqtt
- mosquitto
coordinator:
model: SLZB-06U
connection: network
usb: false
chelsty-ha:
site: chelsty
roles:
- remote
- homeassistant
connectivity:
uplink: lte
intermittent: true
home_automation:
offline_operation_required: true
services:
- homeassistant
lustro:
roles:
- edge
services:
- node-agent

View file

@ -1,75 +0,0 @@
#!/usr/bin/env bash
# vps-control-plane.sh - Bootstrap script for VPS control plane
set -e
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
REPO_ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)"
RUNTIME_DIR="/opt/homelab"
VPS_CONFIG="$REPO_ROOT/hosts/vps/runtime"
# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color
log() { echo -e "${GREEN}[INFO]${NC} $1"; }
warn() { echo -e "${YELLOW}[WARN]${NC} $1"; }
error() { echo -e "${RED}[ERROR]${NC} $1"; exit 1; }
log "Starting VPS control plane bootstrap..."
# 1. Validate Docker availability
if ! command -v docker &> /dev/null; then
error "Docker is not installed. Please install Docker first."
fi
# 2. Validate compose plugin
if ! docker compose version &> /dev/null; then
error "Docker Compose plugin is not installed."
fi
log "Docker and Compose plugin verified."
# 3. Create filesystem-first runtime structure
log "Creating filesystem-first runtime structure in $RUNTIME_DIR..."
sudo mkdir -p "$RUNTIME_DIR/events" \
"$RUNTIME_DIR/state" \
"$RUNTIME_DIR/world" \
"$RUNTIME_DIR/actions/pending" \
"$RUNTIME_DIR/actions/approved" \
"$RUNTIME_DIR/actions/running" \
"$RUNTIME_DIR/actions/completed" \
"$RUNTIME_DIR/actions/failed" \
"$RUNTIME_DIR/actions/rejected" \
"$RUNTIME_DIR/config" \
"$RUNTIME_DIR/logs"
# 4. Set permissions
log "Setting permissions..."
sudo chown -R "$USER:$USER" "$RUNTIME_DIR"
chmod -R 755 "$RUNTIME_DIR"
# 5. Install environment file
log "Installing environment configuration..."
if [ ! -f "$RUNTIME_DIR/config/control-plane.env" ]; then
cp "$VPS_CONFIG/control-plane/env.example" "$RUNTIME_DIR/config/control-plane.env"
log "Created $RUNTIME_DIR/config/control-plane.env from template."
else
warn "Environment file already exists, skipping installation."
fi
# 6. Build and start the control plane
log "Building and starting control plane services..."
cd "$REPO_ROOT/services/control-plane"
docker compose build
docker compose up -d
log "VPS control plane bootstrap complete!"
echo -e "\n${YELLOW}Verification commands:${NC}"
echo "1. Check container status: docker compose ps"
echo "2. Check operator UI: curl http://localhost:8080/summary"
echo "3. Validate world state: ls -l $RUNTIME_DIR/world"
echo "4. Monitor events: tail -f $RUNTIME_DIR/events/*/*/*.json"

View file

@ -1,23 +0,0 @@
#!/bin/bash
# scripts/deploy/deploy-control-plane.sh
set -e
VPS_IP="100.95.58.48"
USER="oskar"
REMOTE_REPO_PATH="/home/oskar/homelab-codex-ws"
MODE=$1
case "$MODE" in
"--ssh")
echo "Deploying to VPS ($VPS_IP) via SSH..."
ssh -t "$USER@$VPS_IP" "cd $REMOTE_REPO_PATH && git pull origin master && cd services/control-plane && bash deploy-local.sh"
;;
"--print")
echo "ssh -t $USER@$VPS_IP \"cd $REMOTE_REPO_PATH && git pull origin master && cd services/control-plane && bash deploy-local.sh\""
;;
*)
echo "Usage: $0 [--ssh|--print]"
exit 1
;;
esac

View file

@ -1,26 +0,0 @@
#!/usr/bin/env bash
# deploy-frigate.sh - Deploy Frigate NVR on chelsty-infra (print or SSH)
MODE="print"
[[ "$1" == "--ssh" ]] && MODE="ssh"
TARGET="100.122.201.22"
NODE="chelsty-infra"
REPO_PATH="/home/oskar/homelab-codex-ws"
SERVICE_PATH="$REPO_PATH/hosts/chelsty-infra/runtime/frigate"
echo "HOST: $NODE"
echo "MODE: $MODE"
echo "TARGET: $TARGET"
# Secrets must exist at /opt/homelab/config/frigate/frigate.env on the node
# before first deploy. See config.yml for required variables.
DEPLOY_CMD="cd $REPO_PATH && git fetch origin && git checkout master && git pull origin master && cd $SERVICE_PATH && docker-compose pull && docker-compose up -d"
if [[ "$MODE" == "ssh" ]]; then
echo "--- Deploying Frigate to $NODE ($TARGET) via SSH ---"
ssh oskar@$TARGET "$DEPLOY_CMD"
else
echo "# --- Deployment commands for $NODE ---"
echo "ssh oskar@$TARGET '$DEPLOY_CMD'"
fi

View file

@ -8,7 +8,6 @@ set -e
REPO_PATH="${HOME}/homelab-codex-ws"
RUNTIME_PATH="/opt/homelab"
HOSTNAME=$(hostname | tr '[:lower:]' '[:upper:]')
HOST_DIR="${REPO_PATH}/hosts/$(hostname | tr '[:upper:]' '[:lower:]')"
echo "--- Starting Deployment on ${HOSTNAME} ---"
@ -23,33 +22,20 @@ echo "Pulling latest changes..."
git pull
# 2. Identify Services
SERVICES=()
if [ -f "${HOST_DIR}/services.txt" ]; then
mapfile -t SERVICES < <(grep -v '^\s*#' "${HOST_DIR}/services.txt" | grep -v '^\s*$')
elif [ -f "${HOST_DIR}/services.yaml" ]; then
SERVICES=($(python3 -c "
import yaml, sys
try:
with open('${HOST_DIR}/services.yaml', 'r') as f:
data = yaml.safe_load(f)
if data and 'services' in data:
if isinstance(data['services'], dict):
print(' '.join(data['services'].keys()))
elif isinstance(data['services'], list):
print(' '.join(data['services']))
except Exception as e:
print(f'Error parsing YAML: {e}', file=sys.stderr)
sys.exit(1)
"))
fi
# Based on our convention, we look for services assigned to this host
# For now, we'll check if a 'services.txt' exists in the host folder
SERVICE_LIST="${REPO_PATH}/hosts/$(hostname | tr '[:upper:]' '[:lower:]')/services.txt"
if [ ${#SERVICES[@]} -eq 0 ]; then
echo "No services found for ${HOSTNAME}. Skipping service deployment."
if [ ! -f "$SERVICE_LIST" ]; then
echo "No services.txt found for ${HOSTNAME}. Skipping service deployment."
exit 0
fi
# 3. Deploy Services
for service in "${SERVICES[@]}"; do
while IFS= read -r service || [ -n "$service" ]; do
[[ "$service" =~ ^#.*$ ]] && continue # Skip comments
[[ -z "$service" ]] && continue # Skip empty lines
echo "Deploying service: ${service}..."
COMPOSE_FILE="${REPO_PATH}/services/${service}/docker-compose.yml"
@ -59,10 +45,13 @@ for service in "${SERVICES[@]}"; do
continue
fi
# Target directory in runtime
TARGET_DIR="${RUNTIME_PATH}/services/${service}"
mkdir -p "$TARGET_DIR"
OVERRIDE_FILE="${HOST_DIR}/runtime/${service}/docker-compose.override.yml"
# We use the compose file from the repo directly
# but we can also handle overrides here
OVERRIDE_FILE="${RUNTIME_PATH}/config/${service}/docker-compose.override.yml"
COMPOSE_CMD="docker compose -f ${COMPOSE_FILE}"
if [ -f "$OVERRIDE_FILE" ]; then
@ -71,6 +60,7 @@ for service in "${SERVICES[@]}"; do
fi
$COMPOSE_CMD up -d --remove-orphans
done
done < "$SERVICE_LIST"
echo "--- Deployment Complete ---"
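
Both variants of the script above read a per-host `services.txt` with one service name per line and skip comments and empty lines. The following is a minimal sketch of that parsing rule, written in Python for illustration only. The function name `parse_service_list` is hypothetical, and the sketch strips surrounding whitespace before testing for `#`, which matches the `grep -v '^\s*#'` variant but not the `^#` check in the `while read` loop.

```python
def parse_service_list(text):
    """Return service names from services.txt content, one per line."""
    services = []
    for line in text.splitlines():
        stripped = line.strip()
        # Skip blank lines and comment lines (leading whitespace allowed).
        if not stripped or stripped.startswith("#"):
            continue
        services.append(stripped)
    return services


sample = "# VPS services\nnpm\n\n  outline\n"
print(parse_service_list(sample))
```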

View file

@ -1,55 +0,0 @@
#!/usr/bin/env bash
# deploy-stability-agent.sh - Helper to deploy stability-agent (print or SSH)
NODE=$1
MODE="print"
[[ "$2" == "--ssh" ]] && MODE="ssh"
if [[ -z "$NODE" ]]; then
echo "Usage: $0 <node-name> [--ssh]"
echo "Supported nodes: chelsty, piha, solaria, vps"
exit 1
fi
case "$NODE" in
piha) TARGET="100.108.208.3" ;;
chelsty) TARGET="100.122.201.22" ;;
vps) TARGET="100.95.58.48" ;;
solaria) TARGET="local" ;;
*)
echo "Error: Unknown node '$NODE'"
echo "Supported nodes: chelsty, piha, solaria, vps"
exit 1
;;
esac
echo "HOST: $NODE"
echo "MODE: $MODE"
echo "TARGET: $TARGET"
REPO_PATH="/home/oskar/homelab-codex-ws"
if [[ "$NODE" == "solaria" ]]; then
if [[ "$MODE" == "ssh" ]]; then
echo "--- Running local deployment for solaria ---"
cd "$REPO_PATH" && git fetch origin && git checkout master && git pull origin master && cd services/stability-agent && ./deploy-local.sh solaria
else
echo "# --- Deployment commands for solaria ---"
echo "cd $REPO_PATH"
echo "git fetch origin"
echo "git checkout master"
echo "git pull origin master"
echo "cd services/stability-agent"
echo "./deploy-local.sh solaria"
fi
else
# Remote nodes
SSH_CMD="ssh oskar@$TARGET 'cd $REPO_PATH && git fetch origin && git checkout master && git pull origin master && cd services/stability-agent && ./deploy-local.sh $NODE'"
if [[ "$MODE" == "ssh" ]]; then
echo "--- Deploying to $NODE ($TARGET) via SSH ---"
eval "$SSH_CMD"
else
echo "# --- Deployment commands for $NODE ---"
echo "$SSH_CMD"
fi
fi

View file

@ -1,321 +1,270 @@
#!/usr/bin/env bash
# scripts/deploy/deploy.sh — Saturn-side deploy dispatcher
# Usage: deploy.sh <target> [--dry-run] [--no-gate]
# target ∈ {control-plane, vps, piha, solaria, chelsty-infra}
# Exit codes: 0=ok 1=preflight 2=gate 3=execute 4=verify 5=handoff(sudo)
# deploy.sh - Staged deployment framework for homelab nodes.
set -uo pipefail
set -o pipefail
REPO_ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)"
SSH_USER="${SSH_USER:-oskar}"
START_TIME=$(date +%s)
TARGET=""
DRY_RUN=false
NO_GATE=false
# --- Configuration ---
export RUNTIME_PATH="/opt/homelab"
export STATE_DIR="${RUNTIME_PATH}/state/deploy"
export LOG_DIR="${RUNTIME_PATH}/logs/deploy"
export REPO_PATH="${HOME}/homelab-codex-ws"
export TIMESTAMP=$(date +%Y%m%d_%H%M%S)
export LOG_FILE="${LOG_DIR}/deploy_${TIMESTAMP}.log"
usage() {
cat >&2 <<'EOF'
Usage: deploy.sh <target> [--dry-run] [--no-gate]
# --- Initialization ---
mkdir -p "$STATE_DIR" "$LOG_DIR"
Targets:
control-plane observer/supervisor/executor/operator-ui on VPS
vps all VPS GitOps services
piha PIHA services
solaria SOLARIA compute services
chelsty-infra CHELSTY edge node (LTE, longer SSH timeout)
# Redirection for logging
exec > >(tee -a "$LOG_FILE") 2>&1
Flags:
--dry-run run preflight + gate only; stop before deploy
--no-gate skip pytest + docker build (emergency only; logged as WARNING)
# --- Load Libraries ---
LIB_PATH="${REPO_PATH}/scripts/lib"
source "${LIB_PATH}/log.sh"
source "${LIB_PATH}/state.sh"
source "${LIB_PATH}/inventory.sh"
source "${LIB_PATH}/compose.sh"
source "${LIB_PATH}/diagnostics.sh"
Exit codes: 0=ok 1=preflight 2=gate 3=execute 4=verify 5=handoff(sudo)
EOF
exit 1
}
# --- CLI Parsing ---
TARGET_HOST=$(hostname)
TARGET_SERVICE=""
RESUME=false
REQUESTED_STAGE=""
while [[ $# -gt 0 ]]; do
case $1 in
control-plane|vps|piha|solaria|chelsty-infra)
TARGET="$1"; shift ;;
--dry-run)
DRY_RUN=true; shift ;;
--no-gate)
NO_GATE=true; shift ;;
-h|--help)
usage ;;
--host)
TARGET_HOST="$2"
shift 2
;;
--service)
TARGET_SERVICE="$2"
shift 2
;;
--resume)
RESUME=true
shift
;;
--stage)
REQUESTED_STAGE="$2"
shift 2
;;
*)
echo "Unknown argument: $1" >&2
usage ;;
if [[ "$1" =~ ^(prepare|validate|deploy|verify|diagnose|complete)$ ]]; then
REQUESTED_STAGE="$1"
fi
shift
;;
esac
done
[[ -z "$TARGET" ]] && { echo "Error: target is required." >&2; usage; }
# --- Stages ---
case "$TARGET" in
control-plane) SSH_HOST="vps" ;;
*) SSH_HOST="$TARGET" ;;
esac
case "$TARGET" in
chelsty-*) SSH_TIMEOUT=30 ;;
*) SSH_TIMEOUT=5 ;;
esac
# ── PREFLIGHT ────────────────────────────────────────────────────────────────
preflight() {
echo "=== PREFLIGHT ==="
local branch
branch=$(git -C "$REPO_ROOT" rev-parse --abbrev-ref HEAD)
if [[ "$branch" != "master" ]]; then
echo "ERROR: On branch '${branch}', not master. Switch to master and push first." >&2
exit 1
fi
echo "[ok] branch: master"
if ! git -C "$REPO_ROOT" diff --quiet; then
echo "ERROR: Unstaged changes in working tree. Commit or stash before deploying." >&2
exit 1
fi
if ! git -C "$REPO_ROOT" diff --cached --quiet; then
echo "ERROR: Staged but uncommitted changes. Commit before deploying." >&2
exit 1
fi
echo "[ok] working tree clean"
git -C "$REPO_ROOT" fetch origin master --quiet
local unpushed
unpushed=$(git -C "$REPO_ROOT" log origin/master..HEAD --oneline)
if [[ -n "$unpushed" ]]; then
echo "ERROR: Unpushed commits on master:" >&2
echo "$unpushed" >&2
echo "Push first: git push origin master" >&2
exit 1
fi
echo "[ok] no unpushed commits"
echo "Checking SSH: ${SSH_USER}@${SSH_HOST} (ConnectTimeout=${SSH_TIMEOUT}s)..."
if ! ssh -o "ConnectTimeout=${SSH_TIMEOUT}" -o BatchMode=yes \
"${SSH_USER}@${SSH_HOST}" true 2>/dev/null; then
echo "ERROR: Cannot reach ${SSH_HOST} via SSH (timeout ${SSH_TIMEOUT}s)." >&2
exit 1
fi
echo "[ok] ${SSH_HOST} reachable"
}
# ── GATE ─────────────────────────────────────────────────────────────────────
gate() {
if [[ "$NO_GATE" == "true" ]]; then
echo "=== GATE: SKIPPED ==="
echo "WARNING: --no-gate active — pytest + docker build bypassed (emergency mode)." >&2
stage_prepare() {
local host=$1
if is_stage_complete "prepare" && [[ "$RESUME" == "true" ]]; then
log "INFO" "Skipping PREPARE (already complete)"
return 0
fi
echo "=== GATE ==="
log "INFO" "Stage: PREPARE ($host)"
set_stage "prepare"
local services=()
emit_event "deployment_started" "info" "deploy.sh" "all" "${TIMESTAMP}" "{\"stage\": \"prepare\"}"
if [[ "$TARGET" == "control-plane" ]]; then
services=("control-plane")
else
local svc_yaml="${REPO_ROOT}/hosts/${TARGET}/services.yaml"
if [[ ! -f "$svc_yaml" ]]; then
echo "ERROR: ${svc_yaml} not found." >&2
exit 2
fi
local svc_list
svc_list=$(python3 -c "
import yaml
with open('${svc_yaml}') as f:
data = yaml.safe_load(f)
svcs = data.get('services', {})
if isinstance(svcs, dict):
print('\n'.join(svcs.keys()))
elif isinstance(svcs, list):
print('\n'.join(svcs))
")
while IFS= read -r svc; do
[[ -z "$svc" ]] && continue
if [[ -f "${REPO_ROOT}/services/${svc}/Dockerfile" ]]; then
services+=("$svc")
fi
done <<< "$svc_list"
cd "$REPO_PATH" || exit 1
log "INFO" "Pulling latest changes..."
if ! git pull; then
log "WARN" "Git pull failed, proceeding with local state (offline mode or network flap)"
fi
if [[ ${#services[@]} -eq 0 ]]; then
echo "[info] No services with local Dockerfile found for ${TARGET} — gate trivially passes."
# Ensure runtime directories exist
mkdir -p "${RUNTIME_PATH}/config" "${RUNTIME_PATH}/data" "${RUNTIME_PATH}/state" "${RUNTIME_PATH}/logs"
struct_log "prepare" "$host" "all" "success" "repo_updated"
mark_stage_complete "prepare"
}
stage_validate() {
local host=$1
if is_stage_complete "validate" && [[ "$RESUME" == "true" ]]; then
log "INFO" "Skipping VALIDATE (already complete)"
return 0
fi
echo "Services under gate: ${services[*]}"
local gate_failed=false
log "INFO" "Stage: VALIDATE ($host)"
set_stage "validate"
for svc in "${services[@]}"; do
local svc_dir="${REPO_ROOT}/services/${svc}"
if [[ -d "${svc_dir}/tests" ]]; then
echo "--- pytest: ${svc} ---"
if ! python3 -m pytest "${svc_dir}/tests" -q; then
echo "GATE FAIL: pytest failed for ${svc}" >&2
gate_failed=true
fi
fi
echo "--- docker build: ${svc} ---"
if ! docker build --quiet "${svc_dir}" >/dev/null; then
echo "GATE FAIL: docker build failed for ${svc}" >&2
gate_failed=true
for service in "${SERVICES[@]}"; do
log "INFO" "Validating $service..."
if [[ ! -d "${REPO_PATH}/services/$service" ]]; then
log "ERROR" "Service definition not found: $service"
struct_log "validate" "$host" "$service" "fail" "not_found"
return 1
fi
done
if [[ "$gate_failed" == "true" ]]; then
exit 2
fi
echo "[ok] gate passed"
struct_log "validate" "$host" "all" "success" "validated"
mark_stage_complete "validate"
}
# ── EXECUTE ──────────────────────────────────────────────────────────────────
stage_deploy() {
local host=$1
if is_stage_complete "deploy" && [[ "$RESUME" == "true" ]]; then
log "INFO" "Skipping DEPLOY (already complete)"
return 0
fi
execute() {
echo "=== EXECUTE ==="
log "INFO" "Stage: DEPLOY ($host)"
set_stage "deploy"
local cmd_output
local cmd_exit=0
local last_s=$(get_last_service)
local skip=false
if [[ "$RESUME" == "true" && -n "$last_s" ]]; then
skip=true
fi
if [[ "$TARGET" == "control-plane" ]]; then
echo "Running deploy-control-plane.sh --ssh..."
cmd_output=$("${REPO_ROOT}/scripts/deploy/deploy-control-plane.sh" --ssh 2>&1) \
|| cmd_exit=$?
for service in "${SERVICES[@]}"; do
if [[ "$skip" == "true" ]]; then
if [[ "$service" == "$last_s" ]]; then
skip=false
log "INFO" "Resuming from $service..."
else
echo "SSHing to ${SSH_HOST}: git pull + deploy-node.sh..."
cmd_output=$(ssh -o "ConnectTimeout=${SSH_TIMEOUT}" -o BatchMode=yes \
"${SSH_USER}@${SSH_HOST}" \
'cd ~/homelab-codex-ws && git pull && ./scripts/deploy/deploy-node.sh' 2>&1) \
|| cmd_exit=$?
log "INFO" "Skipping $service (already processed)"
continue
fi
fi
echo "$cmd_output"
log "INFO" "Deploying $service..."
set_last_service "$service"
if echo "$cmd_output" | grep -qF "[sudo] password"; then
echo "" >&2
echo "ERROR (exit 5): Deploy hit an interactive sudo prompt." >&2
echo "Run manually:" >&2
if [[ "$TARGET" == "control-plane" ]]; then
echo " ssh -t ${SSH_USER}@${SSH_HOST} 'cd ~/homelab-codex-ws && git pull origin master && cd services/control-plane && bash deploy-local.sh'" >&2
else
echo " ssh -t ${SSH_USER}@${SSH_HOST} 'cd ~/homelab-codex-ws && git pull && ./scripts/deploy/deploy-node.sh'" >&2
fi
exit 5
if ! run_compose_up "$service"; then
struct_log "deploy" "$host" "$service" "fail" "docker_compose_failed"
collect_diagnostics "$host" "$service"
return 1
fi
if [[ $cmd_exit -ne 0 ]]; then
echo "ERROR: Deploy command exited ${cmd_exit}." >&2
exit 3
fi
echo "[ok] execute completed"
}
# ── VERIFY ───────────────────────────────────────────────────────────────────
verify() {
echo "=== VERIFY ==="
local ps_output
local ps_exit=0
ps_output=$(ssh -o "ConnectTimeout=${SSH_TIMEOUT}" -o BatchMode=yes \
"${SSH_USER}@${SSH_HOST}" \
'docker ps --format "{{.Names}}\t{{.Status}}"' 2>&1) \
|| ps_exit=$?
if [[ $ps_exit -ne 0 ]]; then
echo "ERROR: docker ps failed on ${SSH_HOST}:" >&2
echo "$ps_output" >&2
exit 4
fi
echo "$ps_output"
local failed=false
local not_up
not_up=$(echo "$ps_output" | grep -v '^$' | grep -v $'\tUp' || true)
if [[ -n "$not_up" ]]; then
echo "ERROR: Containers not in Up state:" >&2
echo "$not_up" >&2
failed=true
fi
local unhealthy
unhealthy=$(echo "$ps_output" | grep '(unhealthy)' || true)
if [[ -n "$unhealthy" ]]; then
echo "ERROR: Unhealthy containers:" >&2
echo "$unhealthy" >&2
failed=true
fi
if [[ "$TARGET" == "control-plane" ]]; then
for cp_svc in supervisor observer executor operator-ui; do
if ! echo "$ps_output" | grep -q "$cp_svc"; then
echo "ERROR: control-plane component absent from docker ps: ${cp_svc}" >&2
failed=true
fi
struct_log "deploy" "$host" "$service" "success" "deployed"
done
fi
if [[ "$failed" == "true" ]]; then
echo "" >&2
echo "Full docker ps output above." >&2
exit 4
fi
echo "[ok] all containers healthy"
set_last_service ""
mark_stage_complete "deploy"
}
# ── REPORT ───────────────────────────────────────────────────────────────────
report() {
local mode="${1:-deploy}"
local end_time
end_time=$(date +%s)
local elapsed
elapsed=$(( end_time - START_TIME ))
local commit_hash
commit_hash=$(git -C "$REPO_ROOT" rev-parse --short HEAD)
local gate_s verify_s
if [[ "$NO_GATE" == "true" ]]; then
gate_s="skip"
else
gate_s="ok"
stage_verify() {
local host=$1
if is_stage_complete "verify" && [[ "$RESUME" == "true" ]]; then
log "INFO" "Skipping VERIFY (already complete)"
return 0
fi
if [[ "$mode" == "dry-run" ]]; then
verify_s="skip(dry-run)"
else
verify_s="green"
fi
log "INFO" "Stage: VERIFY ($host)"
set_stage "verify"
echo ""
if [[ "$mode" == "dry-run" ]]; then
echo "DRY RUN OK | target=${TARGET} | commit=${commit_hash} | gate=${gate_s} | verify=${verify_s} | ${elapsed}s"
else
echo "DEPLOY OK | target=${TARGET} | commit=${commit_hash} | gate=${gate_s} | verify=${verify_s} | ${elapsed}s"
for service in "${SERVICES[@]}"; do
log "INFO" "Verifying $service..."
local health_script="${REPO_PATH}/services/${service}/healthcheck.sh"
if [[ -f "$health_script" ]]; then
if ! bash "$health_script"; then
log "ERROR" "Healthcheck failed for $service"
struct_log "verify" "$host" "$service" "fail" "healthcheck_failed"
collect_diagnostics "$host" "$service"
return 1
fi
else
# Generic check if container is running
if ! docker ps --filter "name=$service" --filter "status=running" | grep -q "$service"; then
log "ERROR" "Container $service is not running"
struct_log "verify" "$host" "$service" "fail" "container_not_running"
collect_diagnostics "$host" "$service"
return 1
fi
fi
struct_log "verify" "$host" "$service" "success" "verified"
done
mark_stage_complete "verify"
}
# ── MAIN ─────────────────────────────────────────────────────────────────────
stage_complete() {
local host=$1
log "INFO" "Stage: COMPLETE ($host)"
set_stage "complete"
struct_log "complete" "$host" "all" "success" "deployment_finished"
clear_deployment_state
}
preflight
gate
# --- Execution Logic ---
if [[ "$DRY_RUN" == "true" ]]; then
report dry-run
exit 0
run_deployment() {
local start_stage=$1
# Sequential execution from start_stage
case "$start_stage" in
prepare)
stage_prepare "$TARGET_HOST" || return 1
;&
validate)
stage_validate "$TARGET_HOST" || return 1
;&
deploy)
stage_deploy "$TARGET_HOST" || return 1
;&
verify)
stage_verify "$TARGET_HOST" || return 1
;&
complete)
stage_complete "$TARGET_HOST" || return 1
;;
*)
log "ERROR" "Invalid stage: $start_stage"
return 1
;;
esac
}
# --- Main ---
log "INFO" "--- Homelab Deployment Started (Host: $TARGET_HOST, Service: ${TARGET_SERVICE:-all}) ---"
if ! load_inventory "$TARGET_HOST" "$TARGET_SERVICE"; then
log "ERROR" "Failed to load inventory"
exit 1
fi
execute
verify
report
EXIT_STATUS=0
if [[ "$RESUME" == "true" ]]; then
CURRENT=$(get_stage)
log "INFO" "Resuming from state: $CURRENT"
case "$CURRENT" in
prepare|validate|deploy|verify)
run_deployment "$CURRENT" || EXIT_STATUS=1
;;
complete|none)
log "INFO" "No interrupted deployment found. Starting from scratch..."
run_deployment "prepare" || EXIT_STATUS=1
;;
*)
log "INFO" "Unknown state. Starting from prepare..."
run_deployment "prepare" || EXIT_STATUS=1
;;
esac
elif [[ -n "$REQUESTED_STAGE" ]]; then
if [[ "$REQUESTED_STAGE" == "diagnose" ]]; then
collect_diagnostics "$TARGET_HOST" "$TARGET_SERVICE"
else
run_deployment "$REQUESTED_STAGE" || EXIT_STATUS=1
fi
else
# New deployment - clear previous state
clear_deployment_state
run_deployment "prepare" || EXIT_STATUS=1
fi
if [[ $EXIT_STATUS -eq 0 ]]; then
print_summary "$TARGET_HOST" "SUCCESS"
log "INFO" "--- Homelab Deployment Finished Successfully ---"
else
print_summary "$TARGET_HOST" "FAILED"
log "ERROR" "--- Homelab Deployment Failed ---"
exit 1
fi
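
The `run_deployment` function above relies on bash's `;&` case terminator, which falls through into the next branch without re-testing its pattern. That is what lets a resumed run start at any stage and still continue through to `complete`. A minimal sketch of the same pattern, assuming bash 4 or newer; `run_stage` and `run_from` are hypothetical stand-ins, not functions from the script:

```shell
#!/usr/bin/env bash
# Sketch: resume a staged pipeline from any stage using `;&` fallthrough.
# run_stage is a hypothetical stub; real stages would do work and may fail.

run_stage() {
  echo "stage:$1"
}

run_from() {
  local start_stage=$1
  case "$start_stage" in
    # `;&` continues into the next branch body without matching its pattern.
    prepare)  run_stage prepare  || return 1 ;&
    validate) run_stage validate || return 1 ;&
    deploy)   run_stage deploy   || return 1 ;&
    verify)   run_stage verify   || return 1 ;&
    complete) run_stage complete || return 1 ;;
    *)        echo "invalid stage: $start_stage" >&2; return 1 ;;
  esac
}

# Starting at "deploy" runs deploy, verify and complete, in that order.
run_from deploy
```

Because each branch ends in `|| return 1`, a failing stage stops the chain before the fallthrough reaches the next one.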


@ -1,30 +1,15 @@
#!/usr/bin/env bash
# orchestrate-deploy.sh - To be run on SATURN
# Triggers deployment on remote execution nodes via inventory.
# Triggers deployment on remote execution nodes.
set -e
REPO_PATH="${HOME}/homelab-codex-ws"
USER="oskar"
HOSTS=("solaria" "piha" "vps")
USER="oskar" # Default user
while IFS=' ' read -r HOST TAG; do
for HOST in "${HOSTS[@]}"; do
echo ">>> Triggering deployment on ${HOST}..."
if [[ "$TAG" == "lte" ]]; then
ssh -o ConnectTimeout=30 "${USER}@${HOST}" "bash ~/homelab-codex-ws/scripts/deploy/deploy-node.sh" || \
echo "WARNING: Deployment on ${HOST} failed or timed out (LTE/intermittent node, skipping)"
else
ssh "${USER}@${HOST}" "bash ~/homelab-codex-ws/scripts/deploy/deploy-node.sh"
fi
done < <(python3 -c "
import yaml, sys
with open('${REPO_PATH}/inventory/topology.yaml') as f:
data = yaml.safe_load(f)
skip = {'saturn', 'solaria'}
for name, info in (data.get('nodes') or {}).items():
if name in skip:
continue
uplink = ((info or {}).get('connectivity') or {}).get('uplink', '')
print(name, 'lte' if uplink == 'lte' else 'standard')
")
done
echo ">>> All deployments triggered."


@ -1,68 +0,0 @@
#!/usr/bin/env bash
# verify-agent-fleet.sh - Check the status of stability agents across the fleet
REDIS_CMD="docker exec agent-system-redis redis-cli --raw"
# Check if docker is available
if ! command -v docker &> /dev/null; then
echo "Error: docker command not found."
exit 1
fi
# Check if container is running
if ! docker ps --filter "name=agent-system-redis" --format "{{.Names}}" | grep -q "agent-system-redis"; then
echo "Error: agent-system-redis container not found or not running."
echo "This script must be run on PIHA (the node hosting the Redis container)."
exit 1
fi
REQUIRED_NODES=("piha" "chelsty" "solaria" "vps")
MISSING_NODES=0
echo "--- Homelab Agent Fleet Status ---"
printf "%-10s %-15s %-10s %-10s %-30s\n" "NODE" "HOSTNAME" "HEALTH" "STATUS" "LAST_SEEN"
printf "%s\n" "--------------------------------------------------------------------------------"
for NODE in "${REQUIRED_NODES[@]}"; do
KEY="homelab:nodes:$NODE"
# Check if key exists
EXISTS=$($REDIS_CMD EXISTS "$KEY" 2>/dev/null | tr -d '\r\n')
if [[ "$EXISTS" != "1" ]]; then
printf "%-10s %-15s %-10s %-10s %-30s\n" "$NODE" "MISSING" "N/A" "N/A" "N/A"
MISSING_NODES=$((MISSING_NODES + 1))
continue
fi
HOSTNAME=$($REDIS_CMD HGET "$KEY" hostname 2>/dev/null | tr -d '\r\n')
HEALTH=$($REDIS_CMD HGET "$KEY" health 2>/dev/null | tr -d '\r\n')
STATUS=$($REDIS_CMD HGET "$KEY" status 2>/dev/null | tr -d '\r\n')
LAST_SEEN=$($REDIS_CMD HGET "$KEY" last_seen 2>/dev/null | tr -d '\r\n')
printf "%-10s %-15s %-10s %-10s %-30s\n" "$NODE" "$HOSTNAME" "$HEALTH" "$STATUS" "$LAST_SEEN"
done
echo ""
echo "--- Control Plane Summary ---"
if command -v jq >/dev/null; then
curl -s http://127.0.0.1:18180/summary | jq .
else
curl -s http://127.0.0.1:18180/summary
fi
echo ""
echo "--- Control Plane Nodes ---"
if command -v jq >/dev/null; then
curl -s http://127.0.0.1:18180/nodes | jq .
else
curl -s http://127.0.0.1:18180/nodes
fi
if [[ $MISSING_NODES -gt 0 ]]; then
echo ""
echo "Error: $MISSING_NODES required nodes are missing from Redis."
exit 1
fi
exit 0
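
The required-node loop above mixes the Redis lookup with the counting logic. A hedged sketch of how the counting part could be exercised without a Redis container: `lookup_node` is a hypothetical stub standing in for the `redis-cli EXISTS` call, and `count_missing` is an illustrative name, not part of the script.

```shell
#!/usr/bin/env bash
# Sketch: count required nodes that are absent from a registry.
# lookup_node is a hypothetical stand-in for `redis-cli EXISTS homelab:nodes:<node>`.

lookup_node() {
  # Pretend only these two nodes have registered.
  case "$1" in
    piha|vps) echo 1 ;;
    *)        echo 0 ;;
  esac
}

count_missing() {
  local missing=0 node
  for node in "$@"; do
    if [[ "$(lookup_node "$node")" != "1" ]]; then
      # Per-node detail goes to stderr so stdout carries only the count.
      printf '%-10s %s\n' "$node" "MISSING" >&2
      missing=$((missing + 1))
    fi
  done
  echo "$missing"
}

count_missing piha chelsty solaria vps
```

Keeping the count on stdout and the per-node rows on stderr lets a caller capture the number with `$(...)` and decide the exit code from it.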


@ -1,361 +0,0 @@
#!/usr/bin/env bash
# Multi-agent worktree manager.
# EXIT: 0 ok, 1 preflight, 2 operation failed.
set -euo pipefail
trap 'echo "agent.sh: failed at line $LINENO (exit $?)" >&2' ERR
RESERVED_NAMES=(master main HEAD list merge clean new)
MAX_WORKTREES=4
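# die <message> [exit-code]: print only the message; the optional second
# argument is the exit code and must not leak into the error text.
die() { echo "ERROR: $1" >&2; exit "${2:-2}"; }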
prefail(){ echo "PREFLIGHT: $*" >&2; exit 1; }
# ── helpers ──────────────────────────────────────────────────────────────────
is_main_checkout() {
local git_dir common_dir
git_dir=$(git rev-parse --git-dir 2>/dev/null) || return 1
common_dir=$(git rev-parse --git-common-dir 2>/dev/null) || return 1
[ "$git_dir" = "$common_dir" ]
}
require_main_checkout() {
is_main_checkout || prefail "must run from the main checkout, not a worktree"
}
require_master_branch() {
local branch
branch=$(git rev-parse --abbrev-ref HEAD)
[ "$branch" = "master" ] || prefail "must be on master (currently on '$branch')"
}
require_clean_tree() {
local dirty
dirty=$(git status --porcelain)
[ -z "$dirty" ] || prefail "working tree is not clean — stash or commit first"
}
worktree_paths() {
# list worktree paths (excluding main); || true prevents grep exit-1 when empty
local main_path
main_path=$(git rev-parse --show-toplevel)
git worktree list --porcelain \
| awk '/^worktree /{p=$2} /^$/{print p}' \
| grep -vxF "${main_path}" \
|| true
}
worktree_count() {
worktree_paths | wc -l
}
branch_exists_local() { git show-ref --verify --quiet "refs/heads/$1"; }
branch_exists_remote() { git ls-remote --exit-code origin "$1" >/dev/null 2>&1; }
utc_now() { date -u +"%Y-%m-%dT%H:%M:%SZ"; }
age_str() {
local created_utc="$1"
local now_ts created_ts diff_s
now_ts=$(date -u +%s)
# strip trailing Z and replace T with a space for `date -d`; -u keeps it UTC
local stamp="${created_utc%Z}"
created_ts=$(date -u -d "${stamp//T/ }" +%s 2>/dev/null) || { echo "?"; return; }
diff_s=$(( now_ts - created_ts ))
if (( diff_s < 60 )); then echo "${diff_s}s"
elif (( diff_s < 3600 )); then echo "$(( diff_s/60 ))m"
elif (( diff_s < 86400 )); then echo "$(( diff_s/3600 ))h"
else echo "$(( diff_s/86400 ))d"
fi
}
validate_name() {
local name="$1"
if ! [[ "$name" =~ ^[a-z][a-z0-9-]*$ ]]; then
prefail "name '$name' must match ^[a-z][a-z0-9-]*$"
fi
for r in "${RESERVED_NAMES[@]}"; do
if [ "$name" = "$r" ]; then
prefail "'$name' is a reserved word"
fi
done
}
# ── subcommands ───────────────────────────────────────────────────────────────
cmd_new() {
local name="${1:-}"
[ -n "$name" ] || { usage; exit 1; }
validate_name "$name"
require_main_checkout
require_master_branch
require_clean_tree
# worktree limit
local count
count=$(worktree_count)
if (( count >= MAX_WORKTREES )); then
echo "ERROR: already at maximum of $MAX_WORKTREES active worktrees:" >&2
cmd_list
exit 1
fi
# branch collision
if branch_exists_local "task/$name"; then
prefail "branch task/$name already exists locally"
fi
git fetch origin master --quiet
if branch_exists_remote "refs/heads/task/$name"; then
prefail "branch task/$name already exists on origin"
fi
# directory collision
local main_path wt_path
main_path=$(git rev-parse --show-toplevel)
wt_path="$(dirname "$main_path")/homelab-codex-ws-${name}"
[ ! -e "$wt_path" ] || prefail "directory $wt_path already exists"
# create worktree
git worktree add -b "task/$name" "$wt_path" origin/master \
|| die "git worktree add failed"
# write marker
local parent_commit
parent_commit=$(git rev-parse origin/master)
cat > "$wt_path/.agent-task" <<EOF
task: $name
branch: task/$name
parent_commit: $parent_commit
created_utc: $(utc_now)
worktree_path: $wt_path
EOF
echo ""
echo "Worktree created: $wt_path"
echo "Branch: task/$name"
echo ""
echo "── Start Claude Code in this worktree ──────────────────────────────────────"
echo "cd ~/homelab-codex-ws-${name} && claude --dangerously-skip-permissions \"You are in the worktree for task '${name}' (branch task/${name}). FIRST read .agent-task and .claude/skills/worktree-aware/SKILL.md, and only then start working. Commit only to your own branch; do not push to origin master.\""
echo "─────────────────────────────────────────────────────────────────────────────"
}
cmd_list() {
local main_path
main_path=$(git rev-parse --show-toplevel)
# fetch to get up-to-date ahead/behind
git fetch origin master --quiet 2>/dev/null || true
local paths
paths=$(worktree_paths)
if [ -z "$paths" ]; then
echo "(no active task worktrees)"
return
fi
printf "%-20s %-25s %-10s %-8s %-8s %-7s %s\n" \
"NAME" "BRANCH" "CREATED" "AGE" "STATUS" "A/B" "PARENT"
while IFS= read -r wt_path; do
[ -z "$wt_path" ] && continue
local marker="$wt_path/.agent-task"
local task_name branch parent_commit created_utc
if [ -f "$marker" ]; then
task_name=$( grep '^task:' "$marker" | awk '{print $2}')
branch=$( grep '^branch:' "$marker" | awk '{print $2}')
parent_commit=$(grep '^parent_commit:' "$marker" | awk '{print $2}')
created_utc=$(grep '^created_utc:' "$marker" | awk '{print $2}')
else
task_name="(no marker)"
branch=$(git -C "$wt_path" rev-parse --abbrev-ref HEAD 2>/dev/null || echo "?")
parent_commit="?"
created_utc=""
fi
local status="clean"
local dirty
dirty=$(git -C "$wt_path" status --porcelain 2>/dev/null || echo "?")
[ -n "$dirty" ] && status="dirty"
local ahead behind ab
ahead=$(git -C "$wt_path" rev-list --count "origin/master..${branch}" 2>/dev/null || echo "?")
behind=$(git -C "$wt_path" rev-list --count "${branch}..origin/master" 2>/dev/null || echo "?")
ab="+${ahead}/-${behind}"
local age=""
[ -n "$created_utc" ] && age=$(age_str "$created_utc")
local short_parent="${parent_commit:0:7}"
local short_created="${created_utc:0:10}"
printf "%-20s %-25s %-10s %-8s %-8s %-7s %s\n" \
"$task_name" "$branch" "$short_created" "$age" "$status" "$ab" "$short_parent"
done <<< "$paths"
}
cmd_merge() {
local name="${1:-}"
[ -n "$name" ] || { usage; exit 1; }
require_main_checkout
require_master_branch
require_clean_tree
git fetch origin --quiet
branch_exists_local "task/$name" || die "branch task/$name not found locally" 1
local main_path wt_path
main_path=$(git rev-parse --show-toplevel)
wt_path="$(dirname "$main_path")/homelab-codex-ws-${name}"
# attempt ff-only merge
local merge_failed=0
git merge --ff-only "task/$name" || merge_failed=1
if (( merge_failed )); then
# abort any partial merge state
git merge --abort 2>/dev/null || true
echo ""
echo "ERROR: task/$name cannot be fast-forwarded into master." >&2
echo " The branch has likely diverged from master." >&2
echo "" >&2
echo "Diagnose with:" >&2
echo " git log master..task/$name # commits only on task branch" >&2
echo " git log task/$name..master # commits master has that task doesn't" >&2
echo "" >&2
echo "Then decide: rebase task/$name onto master, or merge manually." >&2
echo "Worktree and branch are preserved — no changes made." >&2
exit 2
fi
echo "Merged task/$name into master (fast-forward)."
git push origin master || die "git push origin master failed"
echo "Pushed master to origin."
if [ -d "$wt_path" ]; then
git worktree remove "$wt_path" || die "git worktree remove $wt_path failed"
echo "Removed worktree: $wt_path"
else
echo "(worktree directory $wt_path not found — skipping worktree remove)"
fi
git branch -d "task/$name" || die "git branch -d task/$name failed"
echo "Deleted local branch task/$name."
git push origin --delete "task/$name" 2>/dev/null \
&& echo "Deleted remote branch task/$name." \
|| echo "(remote branch task/$name not found — nothing to delete)"
echo ""
echo "Done. task/$name merged and cleaned up."
}
cmd_clean() {
local main_path
main_path=$(git rev-parse --show-toplevel)
git fetch origin --quiet 2>/dev/null || true
local to_remove=()
# orphaned registered worktrees: branch deleted or fully merged into master
local paths
paths=$(worktree_paths)
while IFS= read -r wt_path; do
[ -z "$wt_path" ] && continue
local branch
branch=$(git -C "$wt_path" rev-parse --abbrev-ref HEAD 2>/dev/null || echo "")
[ -z "$branch" ] && { to_remove+=("worktree:$wt_path (unreadable branch)"); continue; }
# branch gone locally?
if ! branch_exists_local "$branch"; then
to_remove+=("worktree:$wt_path (branch $branch no longer exists)")
continue
fi
# branch fully merged into master?
local ahead
ahead=$(git rev-list --count "origin/master..${branch}" 2>/dev/null || echo "1")
if [ "$ahead" = "0" ]; then
to_remove+=("worktree:$wt_path (branch $branch fully merged into origin/master)")
fi
done <<< "$paths"
# dangling directories: ../homelab-codex-ws-* not registered
local registered_paths
registered_paths=$(git worktree list --porcelain | awk '/^worktree /{print $2}')
local parent_dir
parent_dir=$(dirname "$main_path")
while IFS= read -r candidate; do
[ -d "$candidate" ] || continue
if ! echo "$registered_paths" | grep -qxF "$candidate"; then
to_remove+=("dangling:$candidate")
fi
done < <(find "$parent_dir" -maxdepth 1 -name "homelab-codex-ws-*" -type d 2>/dev/null)
if [ ${#to_remove[@]} -eq 0 ]; then
echo "Nothing to clean."
return 0
fi
echo "Found ${#to_remove[@]} item(s) to clean:"
for entry in "${to_remove[@]}"; do
echo " $entry"
done
echo ""
local overall_rc=0
for entry in "${to_remove[@]}"; do
local kind="${entry%%:*}"
local path="${entry#*:}"
# strip trailing annotation in parens
local raw_path
raw_path="${path%% (*}"
local confirm
read -r -p "Remove $kind '$raw_path'? [y/N] " confirm
if [[ "$confirm" =~ ^[Yy]$ ]]; then
if [ "$kind" = "worktree" ]; then
git worktree remove --force "$raw_path" 2>/dev/null \
|| { echo " WARNING: git worktree remove failed, trying rm -rf"; rm -rf "$raw_path" || true; }
else
rm -rf "$raw_path"
fi
echo " Removed."
else
echo " Skipped."
fi
done
return $overall_rc
}
usage() {
cat <<'EOF'
Usage: agent.sh <subcommand> [args]
agent.sh new <name> Create a new task worktree (branch task/<name>)
agent.sh list List active task worktrees with status
agent.sh merge <name> Fast-forward merge task/<name> into master and clean up
agent.sh clean Remove orphaned or dangling worktrees (interactive)
EXIT: 0 ok, 1 preflight, 2 operation failed.
EOF
}
# ── dispatch ──────────────────────────────────────────────────────────────────
SUBCOMMAND="${1:-}"
shift || true
case "$SUBCOMMAND" in
new) cmd_new "$@" ;;
list) cmd_list "$@" ;;
merge) cmd_merge "$@" ;;
clean) cmd_clean "$@" ;;
*) usage; exit 1 ;;
esac
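
The AGE column printed by `agent.sh list` comes from bucketing an elapsed number of seconds into s/m/h/d. A small sketch of that bucketing on its own, with the date parsing left out so it can be checked with fixed inputs; `format_age` is a hypothetical name used only for this illustration.

```shell
#!/usr/bin/env bash
# Sketch: turn an elapsed time in seconds into a short age string.
# Integer division truncates, so 3599 seconds is reported as 59m, not 1h.

format_age() {
  local diff_s=$1
  if   (( diff_s < 60 ));    then echo "${diff_s}s"
  elif (( diff_s < 3600 ));  then echo "$(( diff_s / 60 ))m"
  elif (( diff_s < 86400 )); then echo "$(( diff_s / 3600 ))h"
  else                            echo "$(( diff_s / 86400 ))d"
  fi
}

format_age 45
format_age 7200
```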


@ -1,338 +0,0 @@
#!/usr/bin/env bash
# health-monitor.sh - Homelab node health monitor and safe disk cleanup
#
# Designed to run standalone on the host (cron or direct) or to be called by
# the node-agent Python daemon. All cleanup decisions follow the conservative
# policy agreed in the design review:
#
# lte_node (chelsty-infra, chelsty-ha) : NO cleanup at all
# sd_card (piha, saturn) : dangling images + stopped containers,
# rate-limited to once per 24 h
# ai_node (solaria) : dangling images + stopped containers
# + build cache (NEVER -a)
# standard (vps) : dangling images + stopped containers
# + build cache
#
# VPS additionally rotates control-plane filesystem artefacts:
# actions/completed + failed > 7 days
# logs/deploy > 30 days
# events/** > 3 days AND past observer checkpoint
#
# NEVER TOUCHED (any node): /opt/homelab/data/, config/, state/,
# actions/pending|approved|running, Frigate recordings, Ollama models,
# Zigbee2MQTT data, Mosquitto data, HA database/config.
set -euo pipefail
# ---------------------------------------------------------------------------
# Configuration
# ---------------------------------------------------------------------------
RUNTIME_PATH="${RUNTIME_PATH:-/opt/homelab}"
EVENTS_DIR="${RUNTIME_PATH}/events"
STATE_DIR="${RUNTIME_PATH}/state"
LOGS_DIR="${RUNTIME_PATH}/logs"
ACTIONS_DIR="${RUNTIME_PATH}/actions"
NODE_NAME="${NODE_NAME:-$(hostname)}"
TIMESTAMP=$(date +%s)
DATE=$(date -u +%Y-%m-%dT%H:%M:%SZ)
# Thresholds
DISK_WARN_PCT=75
DISK_CRIT_PCT=85
MEM_WARN_PCT=85
MEM_CRIT_PCT=95
# Rate-limit file for SD-card nodes (max one Docker cleanup per 24 h)
CLEANUP_LOCK="${STATE_DIR}/last-docker-cleanup"
CLEANUP_INTERVAL=86400 # seconds
# Node classifications
LTE_NODES="chelsty-infra chelsty-ha"
SD_CARD_NODES="piha saturn"
AI_NODES="solaria"
# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------
log() { echo "$(date -u +%H:%M:%S) [INFO] $*"; }
warn() { echo "$(date -u +%H:%M:%S) [WARN] $*" >&2; }
err() { echo "$(date -u +%H:%M:%S) [ERROR] $*" >&2; }
contains() {
local word="$1"; shift
for w in "$@"; do [[ "$w" == "$word" ]] && return 0; done
return 1
}
get_node_type() {
# shellcheck disable=SC2086
if contains "$NODE_NAME" $LTE_NODES; then echo "lte_node"; return; fi
if contains "$NODE_NAME" $SD_CARD_NODES; then echo "sd_card"; return; fi
if contains "$NODE_NAME" $AI_NODES; then echo "ai_node"; return; fi
echo "standard"
}
# ---------------------------------------------------------------------------
# Event emission
# ---------------------------------------------------------------------------
emit_event() {
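local type="$1" severity="$2" service="${3:-}" message="$4"
# Default the payload separately: "${5:-{}}" would end the expansion at the
# first "}" and append a stray "}" to every payload that is passed in.
local payload="${5:-}"
[[ -n "$payload" ]] || payload='{}'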
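# Include the service (when set) so several events of one type in the same
# run do not overwrite each other.
local id="evt-${NODE_NAME}-${TIMESTAMP}-${type}${service:+-${service}}"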
local dir="${EVENTS_DIR}/${NODE_NAME}"
mkdir -p "$dir"
cat > "${dir}/${id}.json" <<EOF
{
"id": "${id}",
"timestamp": ${TIMESTAMP},
"date": "${DATE}",
"type": "${type}",
"severity": "${severity}",
"node": "${NODE_NAME}",
"service": "${service}",
"message": "${message}",
"payload": ${payload}
}
EOF
}
# ---------------------------------------------------------------------------
# Health checks
# ---------------------------------------------------------------------------
check_disk() {
# Use /opt/homelab as the check target — it lives on the host filesystem
# and this path is correct both when running natively and in a container
# that mounts /opt/homelab from the host.
local mount="${RUNTIME_PATH}"
local usage_pct avail_mb total_mb
usage_pct=$(df "${mount}" 2>/dev/null | awk 'NR==2 {gsub(/%/,"",$5); print $5}') || return
avail_mb=$(df "${mount}" 2>/dev/null | awk 'NR==2 {printf "%d", $4/1024}') || return
total_mb=$(df "${mount}" 2>/dev/null | awk 'NR==2 {printf "%d", $2/1024}') || return
if [[ "${usage_pct}" -ge "${DISK_CRIT_PCT}" ]]; then
warn "Disk CRITICAL: ${usage_pct}% used (${avail_mb} MB free)"
emit_event "disk_pressure" "high" "" \
"Disk usage critical: ${usage_pct}% on ${mount} (${avail_mb} MB free)" \
"{\"usage_pct\": ${usage_pct}, \"avail_mb\": ${avail_mb}, \"total_mb\": ${total_mb}, \"mount\": \"${mount}\"}"
elif [[ "${usage_pct}" -ge "${DISK_WARN_PCT}" ]]; then
warn "Disk elevated: ${usage_pct}% used"
emit_event "disk_pressure" "medium" "" \
"Disk usage elevated: ${usage_pct}% on ${mount} (${avail_mb} MB free)" \
"{\"usage_pct\": ${usage_pct}, \"avail_mb\": ${avail_mb}, \"total_mb\": ${total_mb}, \"mount\": \"${mount}\"}"
fi
echo "${usage_pct}"
}
check_memory() {
local total avail pct avail_mb
total=$(awk '/^MemTotal/ {print $2}' /proc/meminfo)
avail=$(awk '/^MemAvailable/ {print $2}' /proc/meminfo)
pct=$(( (total - avail) * 100 / total ))
avail_mb=$(( avail / 1024 ))
if [[ "${pct}" -ge "${MEM_CRIT_PCT}" ]]; then
warn "Memory CRITICAL: ${pct}% used"
emit_event "high_memory" "high" "" \
"Memory usage critical: ${pct}% (${avail_mb} MB available)" \
"{\"usage_pct\": ${pct}, \"avail_mb\": ${avail_mb}, \"total_mb\": $((total/1024))}"
elif [[ "${pct}" -ge "${MEM_WARN_PCT}" ]]; then
warn "Memory elevated: ${pct}%"
emit_event "high_memory" "medium" "" \
"Memory usage elevated: ${pct}% (${avail_mb} MB available)" \
"{\"usage_pct\": ${pct}, \"avail_mb\": ${avail_mb}, \"total_mb\": $((total/1024))}"
fi
echo "${pct}"
}
check_cpu() {
# Two-sample /proc/stat delta for accurate instantaneous CPU usage.
local idle1 total1 idle2 total2 pct
read -r idle1 total1 < <(awk '/^cpu / {idle=$5; total=0; for(i=2;i<=NF;i++) total+=$i; print idle, total}' /proc/stat)
sleep 1
read -r idle2 total2 < <(awk '/^cpu / {idle=$5; total=0; for(i=2;i<=NF;i++) total+=$i; print idle, total}' /proc/stat)
local d_idle=$(( idle2 - idle1 ))
local d_total=$(( total2 - total1 ))
pct=$(( d_total > 0 ? 100 - d_idle * 100 / d_total : 0 ))
if [[ "${pct}" -ge 90 ]]; then
warn "CPU elevated: ${pct}%"
emit_event "high_cpu" "medium" "" \
"CPU usage elevated: ${pct}%" \
"{\"usage_pct\": ${pct}}"
fi
echo "${pct}"
}
check_containers() {
command -v docker &>/dev/null || return
# Containers that have exited but belong to a Compose project. The project
# label is used as a proxy for "should be running"; the restart policy itself
# is not inspected here.
local cname
while IFS= read -r cname; do
[[ -z "$cname" ]] && continue
warn "Container exited (should be running): ${cname}"
emit_event "containers_not_running" "high" "${cname}" \
"Container '${cname}' has exited unexpectedly (restart=unless-stopped)" \
"{\"container\": \"${cname}\"}"
done < <(docker ps -a \
--filter "status=exited" \
--filter "label=com.docker.compose.project" \
--format "{{.Names}}" 2>/dev/null || true)
# Containers that are running but their health check is failing
while IFS= read -r cname; do
[[ -z "$cname" ]] && continue
warn "Container unhealthy: ${cname}"
emit_event "healthcheck_failed" "high" "${cname}" \
"Container '${cname}' is running but health check is failing" \
"{\"container\": \"${cname}\"}"
done < <(docker ps \
--filter "health=unhealthy" \
--format "{{.Names}}" 2>/dev/null || true)
}
# ---------------------------------------------------------------------------
# Safe Docker cleanup (per policy)
# ---------------------------------------------------------------------------
_sd_card_rate_ok() {
if [[ -f "${CLEANUP_LOCK}" ]]; then
local last_ts elapsed
last_ts=$(cat "${CLEANUP_LOCK}" 2>/dev/null || echo 0)
elapsed=$(( TIMESTAMP - last_ts ))
if [[ "${elapsed}" -lt "${CLEANUP_INTERVAL}" ]]; then
log "Docker cleanup skipped: last run ${elapsed}s ago (limit ${CLEANUP_INTERVAL}s)"
return 1
fi
fi
return 0
}
_mark_cleanup_done() {
echo "${TIMESTAMP}" > "${CLEANUP_LOCK}"
}
run_safe_cleanup() {
command -v docker &>/dev/null || return
local node_type
node_type=$(get_node_type)
case "${node_type}" in
lte_node)
# NO cleanup on LTE nodes. Any docker operation risks triggering
# a pull over a metered/intermittent connection.
log "Skipping Docker cleanup: LTE node (${NODE_NAME})"
;;
sd_card)
# Dangling images + stopped containers only.
# Rate-limited to once per 24 hours to protect SD card write endurance.
_sd_card_rate_ok || return
log "Running rate-limited Docker cleanup (SD card node)"
docker image prune -f >/dev/null 2>&1 || true
docker container prune -f >/dev/null 2>&1 || true
_mark_cleanup_done
;;
ai_node)
# Dangling images + stopped containers + build cache.
# NEVER docker image prune -a (would remove Ollama runtime images,
# requiring a multi-hour re-pull of model weights).
log "Running AI-node Docker cleanup (dangling images + containers + build cache)"
docker image prune -f >/dev/null 2>&1 || true
docker container prune -f >/dev/null 2>&1 || true
docker builder prune -f >/dev/null 2>&1 || true
;;
standard)
# VPS and other standard nodes: full safe cleanup.
log "Running standard Docker cleanup"
docker image prune -f >/dev/null 2>&1 || true
docker container prune -f >/dev/null 2>&1 || true
docker builder prune -f >/dev/null 2>&1 || true
;;
esac
}
# ---------------------------------------------------------------------------
# VPS-specific: control-plane filesystem rotation
# ---------------------------------------------------------------------------
cleanup_control_plane_fs() {
log "Running control-plane filesystem rotation"
# Completed / failed actions older than 7 days
for status in completed failed; do
local dir="${ACTIONS_DIR}/${status}"
[[ -d "${dir}" ]] || continue
find "${dir}" -name "*.json" -mtime +7 -delete 2>/dev/null && \
log "Cleaned ${status} actions older than 7 days" || true
done
# Deploy logs older than 30 days
local deploy_logs="${LOGS_DIR}/deploy"
if [[ -d "${deploy_logs}" ]]; then
find "${deploy_logs}" -name "*.log" -mtime +30 -delete 2>/dev/null && \
log "Cleaned deploy logs older than 30 days" || true
fi
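# Event files older than 3 days AND already past the observer checkpoint.
# The dual condition ensures we never delete an event the observer hasn't seen.
# NOTE: this reads the legacy single `last_processed_file` key. A checkpoint
# that only contains per-node `node_checkpoints` yields an empty value here,
# so event cleanup is skipped.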
local checkpoint="${STATE_DIR}/observer_checkpoint.json"
if [[ -f "${checkpoint}" ]] && command -v python3 &>/dev/null; then
local last_processed
last_processed=$(python3 -c "
import json, sys
try:
d = json.load(open('${checkpoint}'))
print(d.get('last_processed_file', ''))
except Exception:
print('')
" 2>/dev/null || echo "")
if [[ -n "${last_processed}" ]]; then
find "${EVENTS_DIR}" -name "*.json" -mtime +3 | while IFS= read -r f; do
# Only delete files that sort before the checkpoint path
# (i.e., the observer has already processed them).
if [[ "$f" < "${last_processed}" ]]; then
rm -f "$f"
log "Cleaned old event: $(basename "$f")"
fi
done
else
log "No observer checkpoint set; skipping event file cleanup"
fi
fi
}
# ---------------------------------------------------------------------------
# Main
# ---------------------------------------------------------------------------
mkdir -p "${EVENTS_DIR}/${NODE_NAME}" "${STATE_DIR}"
log "Health check starting on ${NODE_NAME} (type=$(get_node_type))"
disk_pct=$(check_disk || echo 0)
mem_pct=$(check_memory || echo 0)
cpu_pct=$(check_cpu || echo 0)
check_containers
run_safe_cleanup
# VPS: also rotate control-plane filesystem artefacts
if [[ "${NODE_NAME}" == "vps" ]]; then
cleanup_control_plane_fs
fi
# Emit a node_health heartbeat so the observer can update node status
# and the supervisor can see up-to-date resource metrics.
emit_event "node_health" "info" "" \
"Health check completed on ${NODE_NAME}" \
"{\"disk_pct\": ${disk_pct}, \"mem_pct\": ${mem_pct}, \"cpu_pct\": ${cpu_pct}}"
log "Health check complete (disk=${disk_pct}% mem=${mem_pct}% cpu=${cpu_pct}%)"
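
`emit_event` takes an optional JSON payload that falls back to `{}`. Writing that default inline is easy to get subtly wrong in bash, because a parameter expansion ends at the first closing brace. The sketch below contrasts the two spellings; both function names are hypothetical and exist only for this illustration.

```shell
#!/usr/bin/env bash
# Sketch: defaulting a parameter to the literal string {}.

broken_default() {
  # Parsed as ${1:-{} followed by a literal }, so a supplied value gains a "}".
  local payload="${1:-{}}"
  printf '%s\n' "$payload"
}

safe_default() {
  # Expand first, then substitute the default in a separate step.
  local payload="${1:-}"
  [[ -n "$payload" ]] || payload='{}'
  printf '%s\n' "$payload"
}

broken_default '{"usage_pct": 91}'   # trailing extra brace: invalid JSON
safe_default   '{"usage_pct": 91}'   # payload passed through unchanged
safe_default                         # falls back to {}
```

The inline form only looks correct when the argument is omitted, which is why the problem tends to surface only once callers start passing real payloads.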


@ -1,546 +0,0 @@
import os
import json
import time
import glob
import logging
import yaml
from datetime import datetime, timezone
from pathlib import Path
def _atomic_write_json(path: Path, data) -> None:
"""Write JSON atomically: write to a sibling .tmp, fsync, then os.replace."""
tmp = path.with_suffix(".tmp")
with open(tmp, "w") as f:
json.dump(data, f, indent=2)
f.flush()
os.fsync(f.fileno())
os.replace(tmp, path)
def _parse_ts(ts) -> float:
"""Return a Unix timestamp float from ts, which may be int/float or an ISO-8601 string.
Events from node-agent use int(time.time()); events from stability-agent / events.py
use ISO format ('2026-06-03T10:30:00Z'). Both appear in incident fields such as
last_occurrence and resolved_at, so any arithmetic on them must go through here.
Returns 0.0 on None or unparseable input so callers can use plain comparisons.
"""
if ts is None:
return 0.0
if isinstance(ts, (int, float)):
return float(ts)
try:
return datetime.fromisoformat(str(ts).replace("Z", "+00:00")).timestamp()
except Exception:
return 0.0
# Constants and Paths
RUNTIME_PATH = os.getenv("RUNTIME_PATH", "/opt/homelab")
EVENTS_DIR = Path(RUNTIME_PATH) / "events"
STATE_DIR = Path(RUNTIME_PATH) / "state"
LOGS_DIR = Path(RUNTIME_PATH) / "logs"
WORLD_DIR = Path(RUNTIME_PATH) / "world"
OBSERVER_STATE_FILE = STATE_DIR / "observer_checkpoint.json"
FAILED_EVENTS_DIR = STATE_DIR / "observer_failed_events"
REPO_ROOT = Path(__file__).parent.parent.parent
INVENTORY_TOPOLOGY = REPO_ROOT / "inventory" / "topology.yaml"
# Logging setup
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger("observer")
class Observer:
def __init__(self):
# Per-node-directory checkpoint: {"vps": "last/file/path", "piha": "last/file/path"}
# Replaces the old single last_processed_file which silently skipped event dirs
# that sort alphabetically before the checkpoint (e.g. piha/ < vps/).
self.node_checkpoints: dict = {}
self.world_state = {
"nodes": {},
"services": {},
"deployments": {},
"incidents": {},
"summary": {
"last_update": datetime.now(timezone.utc).isoformat(),
"status": "initializing",
"active_incidents_count": 0
}
}
self.inventory = self._load_inventory()
self._ensure_dirs()
self._load_checkpoint()
def _ensure_dirs(self):
WORLD_DIR.mkdir(parents=True, exist_ok=True)
STATE_DIR.mkdir(parents=True, exist_ok=True)
EVENTS_DIR.mkdir(parents=True, exist_ok=True)
LOGS_DIR.mkdir(parents=True, exist_ok=True)
FAILED_EVENTS_DIR.mkdir(parents=True, exist_ok=True)
def _quarantine_event_file(self, file_path: str, node_dir: str, exc: Exception) -> None:
"""Move an unreadable/unprocessable event out of the hot path."""
src = Path(file_path)
dest_dir = FAILED_EVENTS_DIR / node_dir
dest_dir.mkdir(parents=True, exist_ok=True)
dest = dest_dir / src.name
if dest.exists():
dest = dest_dir / f"{src.stem}-{int(time.time())}{src.suffix}"
try:
os.replace(src, dest)
logger.error(
"Quarantined bad event for node_dir=%s: %s -> %s (%s: %s)",
node_dir, src, dest, type(exc).__name__, exc,
)
except Exception as move_exc:
logger.error(
"Failed to quarantine bad event for node_dir=%s: %s (%s: %s); move error=%s: %s",
node_dir, src, type(exc).__name__, exc, type(move_exc).__name__, move_exc,
)
def _load_inventory(self):
inventory = {"nodes": {}, "services": {}}
try:
if INVENTORY_TOPOLOGY.exists():
with open(INVENTORY_TOPOLOGY, "r") as f:
topo = yaml.safe_load(f)
for node_name, node_info in topo.get("nodes", {}).items():
inventory["nodes"][node_name] = {
"roles": node_info.get("roles", []),
"connectivity": node_info.get("connectivity", {})
}
# Load service assignments from hosts files
hosts_dir = REPO_ROOT / "hosts"
for host_dir in hosts_dir.iterdir():
if host_dir.is_dir():
svc_file = host_dir / "services.yaml"
if svc_file.exists():
with open(svc_file, "r") as f:
svc_data = yaml.safe_load(f)
host_name = svc_data.get("host")
for svc_name, svc_info in svc_data.get("services", {}).items():
if host_name not in inventory["services"]:
inventory["services"][host_name] = {}
inventory["services"][host_name][svc_name] = {
"role": svc_info.get("role"),
"exposure": svc_info.get("exposure")
}
except Exception as e:
logger.error(f"Failed to load inventory: {e}")
return inventory
def _load_checkpoint(self):
if OBSERVER_STATE_FILE.exists():
try:
with open(OBSERVER_STATE_FILE, "r") as f:
checkpoint = json.load(f)
if "node_checkpoints" in checkpoint:
# New format: per-directory checkpoints.
self.node_checkpoints = checkpoint["node_checkpoints"]
elif "last_processed_file" in checkpoint:
# Migrate old single-file checkpoint: extract node dir from path.
old = checkpoint["last_processed_file"]
if old:
try:
node_dir = Path(old).relative_to(EVENTS_DIR).parts[0]
self.node_checkpoints = {node_dir: old}
logger.info(f"Migrated old checkpoint → node_checkpoints: {self.node_checkpoints}")
except Exception:
pass # Bad path — start fresh
self._load_world_from_disk()
except Exception as e:
logger.error(f"Failed to load checkpoint: {e}")
def _load_world_from_disk(self):
# Optional: Load existing state to resume faster
files = {
"nodes": WORLD_DIR / "nodes.json",
"services": WORLD_DIR / "services.json",
"deployments": WORLD_DIR / "deployments.json",
"incidents": WORLD_DIR / "incidents.json",
"summary": WORLD_DIR / "runtime-summary.json"
}
for key, path in files.items():
if path.exists():
try:
with open(path, "r") as f:
self.world_state[key] = json.load(f)
except Exception as e:
logger.error(f"Failed to load {key} state: {e}")
def _save_checkpoint(self):
try:
_atomic_write_json(OBSERVER_STATE_FILE, {"node_checkpoints": self.node_checkpoints})
except Exception as e:
logger.error(f"Failed to save checkpoint: {e}")
    def _prune_stale_world(self):
        """Remove world-state entries for nodes absent from the topology inventory.

        Root cause this guards against: when NODE_NAME env var is unset, node_agent.py
        falls back to socket.gethostname(), which inside a Docker container returns the
        12-char hex container ID (e.g. 'be17cb6eb0f6') instead of the canonical host name
        ('vps'). The observer ingests those events and creates ghost entries that never
        expire on their own.

        Also ages out resolved incidents older than 7 days to keep world state lean.
        """
        known_nodes = set(self.inventory["nodes"].keys())
        if not known_nodes:
            # Inventory failed to load; don't prune, to avoid wiping valid state.
            return
        stale_nodes = [n for n in list(self.world_state["nodes"].keys())
                       if n not in known_nodes]
        for n in stale_nodes:
            logger.info(f"Pruning stale node from world state: {n}")
            del self.world_state["nodes"][n]
        stale_svcs = [k for k in list(self.world_state["services"].keys())
                      if k.split("/")[0] in stale_nodes]
        for k in stale_svcs:
            logger.info(f"Pruning stale service from world state: {k}")
            del self.world_state["services"][k]
        # Prune ghost service keys whose service-name portion is a hash-prefixed
        # Docker stale-state artifact (e.g. "9e36297651e7_control-plane-observer").
        # These are created when node-agent incorrectly uses c.name instead of the
        # compose label, and accumulate on every container rebuild.
        # Pattern: <node>/<12hexchars>_<real-name>
        ghost_svcs = [
            k for k in list(self.world_state["services"].keys())
            if len(k.split("/", 1)) == 2
            and len(k.split("/", 1)[1]) > 13
            and k.split("/", 1)[1][12] == "_"
            and all(ch in "0123456789abcdef" for ch in k.split("/", 1)[1][:12])
        ]
        for k in ghost_svcs:
            logger.info(f"Pruning ghost (hash-prefixed) service key from world state: {k}")
            del self.world_state["services"][k]
        now = time.time()
        try:
            # Collect incident_ids currently referenced by any service entry.
            linked_ids: set = {
                svc.get("incident_id")
                for svc in self.world_state["services"].values()
                if svc.get("incident_id")
            }
            # Case 1: service is healthy but still points at an active incident.
            # process_event already calls _resolve_incident on service_healthy events,
            # but if the observer restarted with on-disk state where the link was
            # intact (inconsistency from a pre-atomic-write crash), it may not get
            # resolved until the next service_healthy event is processed. Resolve
            # immediately: a healthy service cannot have an ongoing incident.
            for svc_key, svc in self.world_state["services"].items():
                if svc.get("status") != "healthy":
                    continue
                inc_id = svc.get("incident_id")
                if not inc_id:
                    continue
                inc = self.world_state["incidents"].get(inc_id, {})
                if inc.get("status") == "active":
                    logger.info(
                        f"Auto-resolving incident {inc_id} for {svc_key}: "
                        f"service is healthy"
                    )
                    inc["status"] = "resolved"
                    inc["resolved_at"] = now
                svc["incident_id"] = None
                linked_ids.discard(inc_id)
            # Case 2: orphaned active incident. No service entry links to it and
            # last_occurrence is older than 5 minutes (guard against creation races).
            # These are the stale records left behind when on-disk state was
            # inconsistent: the service entry had incident_id cleared but incidents.json
            # still had the record as "active".
            for inc_id, inc in self.world_state["incidents"].items():
                if inc.get("status") != "active":
                    continue
                if inc_id in linked_ids:
                    continue
                age = now - _parse_ts(inc.get("last_occurrence"))
                if age > 300:  # 5-minute guard
                    logger.info(
                        f"Auto-resolving orphaned incident {inc_id} "
                        f"(service={inc.get('service')}, node={inc.get('node')}): "
                        f"no service references it, age={int(age)}s"
                    )
                    inc["status"] = "resolved"
                    inc["resolved_at"] = now
        except Exception as exc:
            logger.error(f"Error during incident auto-resolve in _prune_stale_world: {exc}")
        # Remove resolved incidents older than 7 days.
        # Use _parse_ts so ISO-string resolved_at values are handled correctly.
        stale_incidents = [
            k for k, v in self.world_state["incidents"].items()
            if v.get("status") == "resolved"
            and now - _parse_ts(v.get("resolved_at")) > 7 * 86400
        ]
        for k in stale_incidents:
            del self.world_state["incidents"][k]

    def _save_world(self):
        self.world_state["summary"]["last_update"] = datetime.now(timezone.utc).isoformat()
        active_incidents = [
            k for k, v in self.world_state["incidents"].items() if v.get("status") == "active"
        ]
        self.world_state["summary"]["active_incidents_count"] = len(active_incidents)
        self.world_state["summary"]["node_count"] = len(self.world_state["nodes"])
        self.world_state["summary"]["service_count"] = len(self.world_state["services"])
        if active_incidents:
            self.world_state["summary"]["status"] = "degraded"
        else:
            self.world_state["summary"]["status"] = "nominal"
        files = {
            "nodes.json": self.world_state["nodes"],
            "services.json": self.world_state["services"],
            "deployments.json": self.world_state["deployments"],
            "incidents.json": self.world_state["incidents"],
            "recommendations.json": [],
            "runtime-summary.json": self.world_state["summary"]
        }
        for filename, data in files.items():
            try:
                _atomic_write_json(WORLD_DIR / filename, data)
            except Exception as e:
                logger.error(f"Failed to save {filename}: {e}")

    def process_event(self, event):
        etype = event.get("type")
        node = event.get("node")
        service = event.get("service")
        severity = event.get("severity")
        timestamp = event.get("timestamp")
        cid = event.get("correlation_id")
        payload = event.get("payload", {})

        # 1. Update Node State
        if node not in self.world_state["nodes"]:
            self.world_state["nodes"][node] = {
                "status": "unknown",
                "last_seen": None,
                "roles": self.inventory["nodes"].get(node, {}).get("roles", [])
            }
        self.world_state["nodes"][node]["last_seen"] = timestamp
        if etype == "node_online":
            self.world_state["nodes"][node]["status"] = "online"
        elif etype == "node_offline":
            self.world_state["nodes"][node]["status"] = "offline"
        elif etype == "node_health":
            # Regular heartbeat from node-agent; updates resource metrics.
            # Clears disk_pressure if disk is now healthy (< warn threshold).
            self.world_state["nodes"][node]["status"] = "online"
            self.world_state["nodes"][node].update({
                "disk_usage_pct": payload.get("disk_pct"),
                "mem_usage_pct": payload.get("mem_pct"),
                "cpu_usage_pct": payload.get("cpu_pct"),
            })
            if (payload.get("disk_pct") or 0) < 75:
                self.world_state["nodes"][node].pop("disk_pressure", None)
        elif etype == "disk_pressure":
            # Emitted when disk usage crosses 75 % (medium) or 85 % (high).
            # The supervisor reads disk_pressure to generate disk_cleanup actions.
            self.world_state["nodes"][node]["disk_pressure"] = severity
            self.world_state["nodes"][node]["disk_usage_pct"] = payload.get("usage_pct")
        elif etype == "high_memory":
            # Memory pressure observation; recorded on the node for correlation.
            # No automated action: the operator decides if a container restart helps.
            self.world_state["nodes"][node]["memory_pressure"] = severity
            self.world_state["nodes"][node]["mem_usage_pct"] = payload.get("usage_pct")
        elif etype == "high_cpu":
            # CPU pressure observation; recorded for visibility.
            self.world_state["nodes"][node]["cpu_pressure"] = severity
            self.world_state["nodes"][node]["cpu_usage_pct"] = payload.get("usage_pct")

        # 2. Update Service State
        if service and service != "all":
            svc_key = f"{node}/{service}"
            if svc_key not in self.world_state["services"]:
                self.world_state["services"][svc_key] = {
                    "node": node,
                    "service": service,
                    "status": "unknown",
                    "last_check": None,
                    "incident_id": None
                }
            self.world_state["services"][svc_key]["last_check"] = timestamp
            if etype == "service_recovered":
                self.world_state["services"][svc_key]["status"] = "healthy"
                self._resolve_incident(svc_key, timestamp)
            elif etype == "service_healthy":
                # Positive confirmation from node-agent that a managed container
                # is running. This keeps services.json populated so the supervisor
                # can correctly detect drift (absent entry = never reported = unknown,
                # not the same as confirmed missing).
                # Also resolve any active incident: if a service that had been
                # unhealthy/crashing is now confirmed healthy, the incident is over.
                self.world_state["services"][svc_key]["status"] = "healthy"
                self._resolve_incident(svc_key, timestamp)
            elif etype in ["service_unhealthy", "healthcheck_failed"]:
                self.world_state["services"][svc_key]["status"] = "unhealthy"
                self._handle_incident(svc_key, event)

        # 3. Update Deployment State
        # etype is None for events without a "type" field, so check it before
        # calling startswith() to avoid an AttributeError on malformed events.
        if etype and etype.startswith("deployment_") and cid:
            if cid not in self.world_state["deployments"]:
                self.world_state["deployments"][cid] = {
                    "node": node,
                    "service": service,
                    "status": "unknown",
                    "started_at": None,
                    "finished_at": None,
                    "events": []
                }
            self.world_state["deployments"][cid]["events"].append({
                "type": etype,
                "timestamp": timestamp,
                "payload": payload
            })
            if etype == "deployment_started":
                self.world_state["deployments"][cid]["status"] = "in_progress"
                self.world_state["deployments"][cid]["started_at"] = timestamp
            elif etype == "deployment_completed":
                self.world_state["deployments"][cid]["status"] = "completed"
                self.world_state["deployments"][cid]["finished_at"] = timestamp
            elif etype == "deployment_failed":
                self.world_state["deployments"][cid]["status"] = "failed"
                self.world_state["deployments"][cid]["finished_at"] = timestamp
                # Deployment failure often creates an incident
                self._handle_deployment_failure(event)

    def _handle_incident(self, svc_key, event):
        # Correlation: collapse repeated failures for the same service on the same node
        active_incident = self.world_state["services"][svc_key].get("incident_id")
        if active_incident and active_incident in self.world_state["incidents"]:
            incident = self.world_state["incidents"][active_incident]
            if incident["status"] == "active":
                incident["last_occurrence"] = event["timestamp"]
                incident["occurrence_count"] = incident.get("occurrence_count", 1) + 1
                incident["events"].append(event["timestamp"])
                return
        # Create new incident
        incident_id = f"inc-{int(time.time())}-{event.get('node')}-{event.get('service')}"
        self.world_state["incidents"][incident_id] = {
            "id": incident_id,
            "node": event.get("node"),
            "service": event.get("service"),
            "status": "active",
            "severity": event.get("severity"),
            # trigger_type records the event type that opened this incident so that
            # the supervisor can choose the appropriate remediation action
            # (e.g. container_restart for containers_not_running / mqtt_unreachable
            # vs. a full redeploy for other causes).
            "trigger_type": event.get("type"),
            "started_at": event.get("timestamp"),
            "last_occurrence": event.get("timestamp"),
            "occurrence_count": 1,
            "events": [event["timestamp"]],
            "correlation_id": event.get("correlation_id")
        }
        self.world_state["services"][svc_key]["incident_id"] = incident_id

    def _resolve_incident(self, svc_key, timestamp):
        incident_id = self.world_state["services"][svc_key].get("incident_id")
        if incident_id and incident_id in self.world_state["incidents"]:
            if self.world_state["incidents"][incident_id]["status"] == "active":
                self.world_state["incidents"][incident_id]["status"] = "resolved"
                self.world_state["incidents"][incident_id]["resolved_at"] = timestamp
            self.world_state["services"][svc_key]["incident_id"] = None

    def _handle_deployment_failure(self, event):
        # Specific logic for deployment failures
        svc_key = f"{event.get('node')}/{event.get('service')}"
        if svc_key not in self.world_state["services"]:
            # process_event creates no service entry when the service is missing
            # or "all"; without one there is nothing to attach an incident to,
            # and _handle_incident would raise KeyError.
            return
        self._handle_incident(svc_key, event)
        # Link diagnostics if available in payload
        incident_id = self.world_state["services"][svc_key].get("incident_id")
        if incident_id and incident_id in self.world_state["incidents"]:
            payload = event.get("payload", {})
            if "diagnostics_file" in payload:
                self.world_state["incidents"][incident_id]["diagnostics_ref"] = payload["diagnostics_file"]
            elif "error" in payload:
                self.world_state["incidents"][incident_id]["last_error"] = payload["error"]

    def run_once(self):
        # Update heartbeat
        heartbeat_file = STATE_DIR / "observer.heartbeat"
        try:
            heartbeat_file.touch()
        except Exception as e:
            logger.error(f"Failed to touch heartbeat file: {e}")

        # Collect all event files grouped by node directory.
        # Per-node checkpoints are compared within each directory independently,
        # so late-arriving events from remote nodes (sorted earlier in the path)
        # are never skipped just because another node's checkpoint is further ahead.
        all_files = sorted(glob.glob(str(EVENTS_DIR / "**" / "*.json"), recursive=True))
        new_files = []
        for file_path in all_files:
            try:
                node_dir = str(Path(file_path).relative_to(EVENTS_DIR).parts[0])
            except (IndexError, ValueError):
                node_dir = "__unknown__"
            last_for_node = self.node_checkpoints.get(node_dir, "")
            if file_path > last_for_node:
                new_files.append((node_dir, file_path))
        if not new_files:
            # Even if no new events, prune stale entries and refresh summary freshness.
            self._prune_stale_world()
            self._save_world()
            return
        logger.info(f"Processing {len(new_files)} new events across "
                    f"{len({n for n, _ in new_files})} node(s)")
        for node_dir, file_path in new_files:
            try:
                with open(file_path, "r") as f:
                    event = json.load(f)
                self.process_event(event)
                # Advance per-node checkpoint (forward only, never backward).
                if file_path > self.node_checkpoints.get(node_dir, ""):
                    self.node_checkpoints[node_dir] = file_path
            except Exception as e:
                logger.error(
                    "Error processing node_dir=%s file=%s (%s: %s)",
                    node_dir, file_path, type(e).__name__, e,
                )
                self._quarantine_event_file(file_path, node_dir, e)
        self._save_checkpoint()
        self._prune_stale_world()
        self._save_world()

    def loop(self, interval=5):
        logger.info("Starting observer loop")
        while True:
            self.run_once()
            time.sleep(interval)


if __name__ == "__main__":
    import sys
    observer = Observer()
    if "--run-once" in sys.argv:
        observer.run_once()
    else:
        observer.loop()


@ -1,83 +0,0 @@
#!/usr/bin/env bash
mkdir -p /tmp/homelab/events/2026-05-12/saturn
mkdir -p /tmp/homelab/state
mkdir -p /tmp/homelab/logs
mkdir -p /tmp/homelab/world
cat <<EOF > /tmp/homelab/events/2026-05-12/saturn/120000_node_online_1.json
{
"timestamp": "2026-05-12T12:00:00Z",
"node": "saturn",
"type": "node_online",
"severity": "info",
"source": "system",
"service": "all",
"correlation_id": "init",
"payload": {}
}
EOF
cat <<EOF > /tmp/homelab/events/2026-05-12/saturn/120500_service_unhealthy_1.json
{
"timestamp": "2026-05-12T12:05:00Z",
"node": "saturn",
"type": "service_unhealthy",
"severity": "error",
"source": "healthcheck",
"service": "mosquitto",
"correlation_id": "hc-1",
"payload": {"error": "connection refused"}
}
EOF
cat <<EOF > /tmp/homelab/events/2026-05-12/saturn/120600_service_unhealthy_2.json
{
"timestamp": "2026-05-12T12:06:00Z",
"node": "saturn",
"type": "service_unhealthy",
"severity": "error",
"source": "healthcheck",
"service": "mosquitto",
"correlation_id": "hc-2",
"payload": {"error": "connection refused"}
}
EOF
cat <<EOF > /tmp/homelab/events/2026-05-12/saturn/121000_service_recovered_1.json
{
"timestamp": "2026-05-12T12:10:00Z",
"node": "saturn",
"type": "service_recovered",
"severity": "info",
"source": "healthcheck",
"service": "mosquitto",
"correlation_id": "hc-3",
"payload": {}
}
EOF
cat <<EOF > /tmp/homelab/events/2026-05-12/saturn/121500_deployment_started_1.json
{
"timestamp": "2026-05-12T12:15:00Z",
"node": "saturn",
"type": "deployment_started",
"severity": "info",
"source": "deploy_agent",
"service": "mosquitto",
"correlation_id": "deploy-1",
"payload": {"version": "2.0.18"}
}
EOF
cat <<EOF > /tmp/homelab/events/2026-05-12/saturn/121600_deployment_failed_1.json
{
"timestamp": "2026-05-12T12:16:00Z",
"node": "saturn",
"type": "deployment_failed",
"severity": "error",
"source": "deploy_agent",
"service": "mosquitto",
"correlation_id": "deploy-1",
"payload": {"error": "container crash", "diagnostics_file": "/opt/homelab/logs/diagnostics-deploy-1.log"}
}
EOF


@ -1,139 +0,0 @@
# scripts/onboard — Node Onboarding Tool
Idempotent, declarative node onboarding in plain bash, without Ansible.
Each node is described by a manifest, `hosts/<node>/node.yaml`; the
`onboard.sh` script reads the manifest and calls the numbered steps in order.
## Usage
```bash
scripts/onboard/onboard.sh --node <name> [--step <name>] [--from <step>] [--dry-run]
```
| Flag | Description |
|------|-------------|
| `--node <name>` | Node name (required); matches `hosts/<name>/node.yaml` |
| `--step <name>` | Run only this one step (e.g. `00-access`) |
| `--from <step>` | Start from this step and continue to the end |
| `--dry-run` | Sets `DRY_RUN=1`; mutations are simulated by `run()`, probes run for real |
```bash
# Full onboarding
scripts/onboard/onboard.sh --node lustro
# A single step only
scripts/onboard/onboard.sh --node lustro --step 00-access
# From a given step onward
scripts/onboard/onboard.sh --node lustro --from 10-bootstrap-runtime
# Preview without changes (state probes run for real, so the plan is realistic)
scripts/onboard/onboard.sh --node lustro --dry-run
```
## hosts/\<node\>/node.yaml schema
```yaml
name: LUSTRO # node name (ALL CAPS)
role: edge # edge | compute | infra
location: KEN # location identifier
ssh_user: pi # SSH user; may differ from "oskar" on edge nodes
# (uid=1000 collision: use the existing user)
first_contact: pi@192.168.31.19 # SSH target before Tailscale; MUST be an IP, not .local
# (mDNS .local is unreliable in automation)
tailscale:
hostname: lustro # name in the mesh; the SSH target after tailscale up
ip: # filled in after join (optional)
deploy_autonomy: true # true = onboard.sh may perform mutations autonomously
# false = print manual instructions and stop
git_control: false # true = node pulls from Forgejo
# false = push-based from SATURN (edge nodes)
hardware:
arch: arm64 # aarch64 | x86_64 | armv7l; filled in by 00-preflight
ram_mb: 4096 # RAM in MB; filled in by 00-preflight
swap:
kind: zram # zram | file | none; zram recommended (SD wear)
docker_present: true # is docker already installed?; filled in by 00-preflight
mm_runtime: systemd:magicmirror.service
# MagicMirror runtime: systemd:<unit> | pm2 | process | none
# filled in by 00-preflight
services:
node-agent:
runtime:
engine: docker # docker | docker-compose
mem_limit: 256m # mandatory (RPi4 RAM profile like the VPS; OOM risk)
```
### Field notes
- **`ssh_user`**: on edge nodes that already have a uid=1000 user (e.g. `pi` on RPi OS), use
that user instead of creating `oskar`; docker group membership and the node-agent's `mem_limit`
are designed around `1000:1000`.
- **`first_contact`**: always an IP, never a `.local` hostname. mDNS proved unreliable
in automation (transient resolution failures). After `tailscale up`, use `tailscale.hostname`.
- **`deploy_autonomy`**: when `false`, steps numbered 10 and above are skipped with a warning
to run them manually, and nothing is mutated. Useful for nodes managed by someone else.
- **`git_control`**: when `false`, steps with `git`/`repo`/`clone` in their name are skipped.
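
How the two flags gate a step can be pictured with a short sketch. It assumes steps are named `NN-name`, as in this README; `gate_step` and the labels it prints are illustrative names rather than part of `onboard.sh`, and `30-sync-repo` is a hypothetical step name used only to trigger the `git_control` rule.

```bash
# Sketch only: gate_step and its output labels are illustrative.
needs_autonomy() {
  local num="${1%%-*}"            # leading "NN" of "NN-name"
  (( 10#$num >= 10 ))             # force base 10 so "08" is not read as octal
}

needs_git_control() {
  [[ "$1" == *git* || "$1" == *repo* || "$1" == *clone* ]]
}

# gate_step STEP DEPLOY_AUTONOMY GIT_CONTROL: prints the decision for one step
gate_step() {
  local step="$1" autonomy="$2" git_control="$3"
  if needs_autonomy "$step" && [[ "$autonomy" != "true" ]]; then
    echo "skip (deploy_autonomy=false)"
  elif needs_git_control "$step" && [[ "$git_control" != "true" ]]; then
    echo "skip (git_control=false)"
  else
    echo "run"
  fi
}

gate_step 00-preflight false false
gate_step 10-bootstrap-runtime false false
gate_step 40-deploy-node-agent true false
gate_step 30-sync-repo true false          # hypothetical step name
```

Steps below 10 pass regardless of `deploy_autonomy`, which is why a manifest with `deploy_autonomy: false` can still be probed with the `00-` steps.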
## Step status
| Step | File | Status | Description |
|------|------|--------|-------------|
| `00-access` | `steps/00-access.sh` | **DONE** | SSH key → `first_contact`, install Tailscale, `tailscale up` (interactive URL), verify `pi@<ts_hostname>` arch=aarch64 |
| `00-preflight` | `steps/00-preflight.sh` | SCAFFOLD | Read-only: collects facts (arch, RAM, docker, swap, MM runtime), prints a report plus a YAML snippet to paste into node.yaml |
| `10-bootstrap-runtime` | `steps/10-bootstrap-runtime.sh` | TODO | Creates the `/opt/homelab/` layout, `chown <ssh_user>` |
| `20-install-docker` | `steps/20-install-docker.sh` | TODO | Installs Docker Engine if `docker_present=false`; skipped when already installed |
| `30-install-tailscale` | `steps/30-install-tailscale.sh` | TODO | Superseded by `00-access` for new nodes; can serve for re-join |
| `40-deploy-node-agent` | `steps/40-deploy-node-agent.sh` | TODO | Deploys the node-agent container; user 1000:1000; `mem_limit` from node.yaml |
| `50-verify` | `steps/50-verify.sh` | TODO | End-to-end smoke test: the event reached the control plane, is visible in the UI, Telegram alert path works |
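
Because every step carries a two-digit prefix, plain string comparison is enough to resume from a given step. A minimal sketch of that selection, using the step names from this README; `select_from` is an illustrative helper, not a function in the tool.

```bash
# Sketch only: select_from is an illustrative name.
# select_from FROM STEP... prints every STEP that sorts at or after FROM.
select_from() {
  local from="$1" step
  shift
  for step in "$@"; do
    if [[ -n "$from" && "$step" < "$from" ]]; then
      continue                     # sorts before the requested step
    fi
    echo "$step"
  done
}

steps=(00-access 00-preflight 10-bootstrap-runtime 20-install-docker
       30-install-tailscale 40-deploy-node-agent 50-verify)

select_from 20-install-docker "${steps[@]}"
```

An empty `FROM` selects everything, which corresponds to running without `--from`.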
## lib/ architecture
```
lib/common.sh — log/warn/die/step/dryrun, run(), yaml_get, ensure_line, git() wrapper
lib/remote.sh — rrun/rcopy/rsync_dir/rcheck (SSH wrappers, ONBOARD_SSH_USER/HOST)
```
### run() and dry-run
The orchestrator exports `DRY_RUN=1` to all step scripts.
```bash
# Wrap mutations in run(): in dry-run it prints the intent and does not execute
run ssh-copy-id -i ~/.ssh/id_ed25519.pub pi@192.168.31.19
# State probes (ssh BatchMode test, command -v, status query) ALWAYS run:
# dry-run must show a realistic plan based on the current state
if ssh -o BatchMode=yes pi@192.168.31.19 true 2>/dev/null; then
log "key already present — skip"
fi
```
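
To see the convention end to end, here is a self-contained sketch that reuses the `run()` definition from `lib/common.sh`. The temporary directory and the `layout` name are stand-ins for a real mutation target, and `created_in_dry_run` exists only to make the result visible.

```bash
# run() as defined in lib/common.sh: print the intent in dry-run, execute otherwise.
run() {
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "[dry-run] would: $*"
  else
    "$@"
  fi
}

workdir="$(mktemp -d)"             # stand-in for a real target
target="$workdir/layout"

# Probe: runs in both modes, so the plan reflects the real state.
if [ -d "$target" ]; then
  echo "layout already present, nothing to do"
fi

DRY_RUN=1
dry_output="$(run mkdir -p "$target")"    # mutation is only described
echo "$dry_output"
[ -d "$target" ] && created_in_dry_run=yes || created_in_dry_run=no

DRY_RUN=0
run mkdir -p "$target"                    # mutation is executed
```

Keeping the probe outside `run()` is what lets a dry run report "already present" instead of always claiming it would create the directory.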
### yaml_get fallback without yq
When `yq` is not available, a `grep`+`sed` fallback is used. Pitfalls:
- Inline YAML comments (`key: value # comment`) are stripped by
`s/[[:space:]]\+#.*$//`, which requires at least one whitespace character before `#`, so
`url#fragment` stays intact.
- The parser is non-greedy on `:` (`s/^[[:space:]]*[^:]*:[[:space:]]*//`), so
values containing a colon (e.g. `systemd:magicmirror.service`) are read correctly.
- Dot-paths (`tailscale.hostname`) work only with `yq`; the fallback matches on the last
segment (`hostname`). Field names in node.yaml must therefore be unique.
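
The three pitfalls can be checked against a throwaway manifest. This is a sketch under assumptions: `yaml_get_fallback` is an illustrative name that mirrors only the non-`yq` branch, the sample file and its `docs` key are made up for the demonstration, and the POSIX interval `\{1,\}` stands in for the GNU `\+` used in `lib/common.sh`.

```bash
# Sketch only: yaml_get_fallback mirrors the grep/sed branch of yaml_get.
yaml_get_fallback() {
  local file="$1" key="$2"
  local leaf="${key##*.}"          # only the last path segment is matched
  # "|| true" keeps a missing key from failing the pipeline under pipefail.
  { grep -E "^[[:space:]]*${leaf}:" "$file" || true; } | head -n 1 \
    | sed -e 's/^[[:space:]]*[^:]*:[[:space:]]*//' \
          -e 's/[[:space:]]\{1,\}#.*$//' \
          -e 's/[[:space:]]*$//'
}

demo_yaml="$(mktemp)"
cat > "$demo_yaml" <<'EOF'
ssh_user: pi            # existing uid=1000 user
mm_runtime: systemd:magicmirror.service
docs: http://example.invalid/page#anchor
tailscale:
  hostname: lustro
EOF

yaml_get_fallback "$demo_yaml" ssh_user            # inline comment stripped
yaml_get_fallback "$demo_yaml" mm_runtime          # colon inside the value kept
yaml_get_fallback "$demo_yaml" docs                # '#' without leading space kept
yaml_get_fallback "$demo_yaml" tailscale.hostname  # matched by leaf name only
```

Since the last call matches any line whose key is `hostname`, a second `hostname:` elsewhere in the file would be ambiguous, which is the reason for the uniqueness requirement.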
## Gotchas / Learnings
| Problem | Solution |
|---------|----------|
| mDNS `.local` is unreliable | Use an IP in `first_contact`; `.local` is fine interactively, not in automation |
| Existing uid=1000 on an edge node | Use that user; do not create `oskar` (uid collision, would break MM file ownership) |
| Swap file on an SD card | Migrate to zram to reduce wear; add a step to `10-bootstrap-runtime` |
| dry-run stops at the orchestrator | `run()` wrapper + `export DRY_RUN=1`; probes must also work in dry-run |
| SSH known-hosts warning in parsed output | `-o LogLevel=ERROR` on SSH to a new host in the mesh |
| `yaml_get` drops the part of a value before its `:` | Non-greedy `^[[:space:]]*[^:]*:` instead of `.*:` |
| `yaml_get` does not strip inline comments | `s/[[:space:]]\+#.*$//` after extracting the value |
| RPi4 with 4 GB RAM, OOM risk | `mem_limit` in the node-agent override is mandatory (profile like the VPS) |


@ -1,84 +0,0 @@
#!/usr/bin/env bash
# scripts/onboard/lib/common.sh — shared helpers for the onboarding tool
set -euo pipefail
# ── colour codes (disabled when not a tty) ──────────────────────────────────
if [[ -t 1 ]]; then
_C_RESET='\033[0m'
_C_GREEN='\033[0;32m'
_C_YELLOW='\033[1;33m'
_C_RED='\033[0;31m'
_C_CYAN='\033[0;36m'
_C_BOLD='\033[1m'
else
_C_RESET='' _C_GREEN='' _C_YELLOW='' _C_RED='' _C_CYAN='' _C_BOLD=''
fi
# ── logging ──────────────────────────────────────────────────────────────────
log() { echo -e "${_C_GREEN}[onboard]${_C_RESET} $(date +'%H:%M:%S') ${*}"; }
warn() { echo -e "${_C_YELLOW}[WARN]${_C_RESET} $(date +'%H:%M:%S') ${*}" >&2; }
die() { echo -e "${_C_RED}[ERROR]${_C_RESET} $(date +'%H:%M:%S') ${*}" >&2; exit 1; }
step() { echo -e "${_C_CYAN}${_C_BOLD}==> ${*}${_C_RESET}"; }
dryrun() { echo -e "${_C_YELLOW}[dry-run]${_C_RESET} ${*}"; }
# ── command detection ─────────────────────────────────────────────────────────
have_cmd() { command -v "$1" >/dev/null 2>&1; }
# ── dry-run execution wrapper ─────────────────────────────────────────────────
# run CMD [ARGS…] — executes CMD in live mode; prints intent in dry-run.
# Wrap MUTATIONS with this. Read-only probes (SSH BatchMode tests, status
# queries, command -v checks) must run unconditionally — never wrap them.
run() {
if [ "${DRY_RUN:-0}" = 1 ]; then
echo "[dry-run] would: $*"
else
"$@"
fi
}
export -f run
# ── file helpers ──────────────────────────────────────────────────────────────
# ensure_line FILE LINE — appends LINE to FILE if it is not already present (idempotent)
ensure_line() {
local file="$1" line="$2"
[[ -f "$file" ]] || touch "$file"
grep -qxF "$line" "$file" || echo "$line" >> "$file"
}
# ── node.yaml parsing ─────────────────────────────────────────────────────────
# require_node_yaml NODE — sets NODE_YAML; exits if not found
require_node_yaml() {
local node="$1"
NODE_YAML="${REPO_ROOT}/hosts/${node,,}/node.yaml"
[[ -f "$NODE_YAML" ]] || die "node.yaml not found: $NODE_YAML"
export NODE_YAML
}
# yaml_get NODE_YAML KEY — read a scalar value from a YAML file
# Uses yq when available; falls back to grep/sed for simple key: value pairs.
# Supports dot-separated paths (e.g. tailscale.hostname) only in yq mode;
# the grep fallback handles only the last path component.
yaml_get() {
  local file="$1" key="$2"
  if have_cmd yq; then
    yq -r ".${key} // empty" "$file" 2>/dev/null
  else
    # fallback: extract last segment of key, match " key: value"
    # Strip inline YAML comment (space(s)+'#'+rest) and surrounding whitespace.
    # Pattern uses \+ (GNU BRE one-or-more) so a bare '#' inside a value is preserved.
    # "|| true" on grep: with pipefail, a missing key would otherwise make the
    # pipeline fail and abort set -e callers before they can apply a default.
    local leaf="${key##*.}"
    { grep -E "^\s*${leaf}:" "$file" || true; } | head -1 \
      | sed -e 's/^[[:space:]]*[^:]*:[[:space:]]*//' \
            -e 's/[[:space:]]\+#.*$//' \
            -e 's/^[[:space:]]*//' \
            -e 's/[[:space:]]*$//' \
      | tr -d '"' | tr -d "'"
  fi
}
# ── git wrapper ────────────────────────────────────────────────────────────────
# All git calls from onboarding scripts must go through this so --no-pager is
# always set and there is no interactive output.
git() { command git --no-pager "$@"; }
export -f git


@ -1,51 +0,0 @@
#!/usr/bin/env bash
# scripts/onboard/lib/remote.sh — SSH helpers for remote node operations
# Requires: ONBOARD_SSH_USER, ONBOARD_SSH_HOST to be set by the caller.
# Inherits: DRY_RUN ("1" = dry-run, anything else = live; defaults to 0)
set -euo pipefail
: "${ONBOARD_SSH_USER:?remote.sh: ONBOARD_SSH_USER is not set}"
: "${ONBOARD_SSH_HOST:?remote.sh: ONBOARD_SSH_HOST is not set}"
: "${DRY_RUN:=0}"
_SSH_OPTS=(
-o StrictHostKeyChecking=accept-new
-o ConnectTimeout=10
-o BatchMode=yes
)
# rrun CMD [ARGS…] — run a command on the remote node via SSH
rrun() {
if [ "${DRY_RUN:-0}" = 1 ]; then
dryrun "ssh ${ONBOARD_SSH_USER}@${ONBOARD_SSH_HOST} -- $*"
return 0
fi
ssh "${_SSH_OPTS[@]}" "${ONBOARD_SSH_USER}@${ONBOARD_SSH_HOST}" -- "$@"
}
# rcopy LOCAL_PATH REMOTE_PATH — copy a file to the remote node via scp
rcopy() {
local src="$1" dst="$2"
if [ "${DRY_RUN:-0}" = 1 ]; then
dryrun "scp $src ${ONBOARD_SSH_USER}@${ONBOARD_SSH_HOST}:$dst"
return 0
fi
scp "${_SSH_OPTS[@]}" "$src" "${ONBOARD_SSH_USER}@${ONBOARD_SSH_HOST}:$dst"
}
# rsync_dir LOCAL_DIR REMOTE_DIR [EXTRA_RSYNC_ARGS…]
rsync_dir() {
local src="$1" dst="$2"
shift 2
if [ "${DRY_RUN:-0}" = 1 ]; then
dryrun "rsync -az $src ${ONBOARD_SSH_USER}@${ONBOARD_SSH_HOST}:$dst"
return 0
fi
rsync -az -e "ssh ${_SSH_OPTS[*]}" "$src" "${ONBOARD_SSH_USER}@${ONBOARD_SSH_HOST}:$dst" "$@"
}
# rcheck — verify SSH connectivity; returns 0 if reachable
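rcheck() {
  # ssh keeps the first value it sees for an option, so the shorter timeout has
  # to come before _SSH_OPTS (which already carries ConnectTimeout=10).
  ssh -o ConnectTimeout=5 "${_SSH_OPTS[@]}" "${ONBOARD_SSH_USER}@${ONBOARD_SSH_HOST}" -- true 2>/dev/null
}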


@ -1,182 +0,0 @@
#!/usr/bin/env bash
# scripts/onboard/onboard.sh — node onboarding orchestrator
#
# Usage:
# onboard.sh --node <name> [--step <name>] [--from <step>] [--dry-run]
#
# Flags:
# --node <name> node name matching hosts/<name>/node.yaml (required)
# --step <name> run only this step (e.g. 00-preflight)
# --from <step> start from this step, run all subsequent steps
# --dry-run print what would be done without mutating anything
#
# Steps run in lexicographic order from scripts/onboard/steps/.
# Steps that require deploy_autonomy=true are skipped (with a warning) when
# that flag is false in node.yaml. Steps that require git_control=true are
# similarly gated.
set -euo pipefail
REPO_ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)"
STEPS_DIR="${REPO_ROOT}/scripts/onboard/steps"
LIB_DIR="${REPO_ROOT}/scripts/onboard/lib"
# ── load helpers ──────────────────────────────────────────────────────────────
# shellcheck source=lib/common.sh
source "${LIB_DIR}/common.sh"
# ── defaults ──────────────────────────────────────────────────────────────────
NODE_NAME=""
ONLY_STEP=""
FROM_STEP=""
DRY_RUN=0
export DRY_RUN REPO_ROOT
# ── argument parsing ──────────────────────────────────────────────────────────
usage() {
cat >&2 <<'EOF'
Usage: onboard.sh --node <name> [--step <name>] [--from <step>] [--dry-run]
--node <name> node name matching hosts/<name>/node.yaml (required)
--step <name> run only this single step (e.g. 00-preflight)
--from <step> start from this step, continue to end
--dry-run no mutations; show what would run
Examples:
onboard.sh --node lustro
onboard.sh --node lustro --step 00-preflight
onboard.sh --node lustro --from 20-install-docker
onboard.sh --node lustro --dry-run
EOF
exit 1
}
while [[ $# -gt 0 ]]; do
case "$1" in
--node) NODE_NAME="${2:?--node requires a value}"; shift 2 ;;
--step) ONLY_STEP="${2:?--step requires a value}"; shift 2 ;;
--from) FROM_STEP="${2:?--from requires a value}"; shift 2 ;;
--dry-run) DRY_RUN=1; shift ;;
-h|--help) usage ;;
*) die "Unknown argument: $1" ;;
esac
done
[[ -z "$NODE_NAME" ]] && { warn "--node is required"; usage; }
export NODE_NAME
# ── load node.yaml ────────────────────────────────────────────────────────────
require_node_yaml "$NODE_NAME"
log "Loading manifest: $NODE_YAML"
DEPLOY_AUTONOMY=$(yaml_get "$NODE_YAML" "deploy_autonomy")
GIT_CONTROL=$(yaml_get "$NODE_YAML" "git_control")
SSH_USER=$(yaml_get "$NODE_YAML" "ssh_user")
TS_HOSTNAME=$(yaml_get "$NODE_YAML" "tailscale.hostname")
DEPLOY_AUTONOMY="${DEPLOY_AUTONOMY:-false}"
GIT_CONTROL="${GIT_CONTROL:-false}"
[[ -z "$SSH_USER" ]] && die "ssh_user not set in $NODE_YAML"
[[ -z "$TS_HOSTNAME" ]] && die "tailscale.hostname not set in $NODE_YAML"
export ONBOARD_SSH_USER="$SSH_USER"
export ONBOARD_SSH_HOST="$TS_HOSTNAME"
log "Node: ${NODE_NAME} | host: ${TS_HOSTNAME} | user: ${SSH_USER}"
log "deploy_autonomy=${DEPLOY_AUTONOMY} git_control=${GIT_CONTROL} dry_run=${DRY_RUN}"
# ── collect steps ─────────────────────────────────────────────────────────────
# Steps are NN-name.sh files in lexicographic order.
mapfile -t ALL_STEPS < <(find "$STEPS_DIR" -maxdepth 1 -name '[0-9][0-9]-*.sh' | sort)
if [[ ${#ALL_STEPS[@]} -eq 0 ]]; then
die "No steps found in $STEPS_DIR"
fi
# Determine which steps to run based on flags.
declare -a STEPS_TO_RUN=()
for step_path in "${ALL_STEPS[@]}"; do
  step_file=$(basename "$step_path" .sh)
  if [[ -n "$ONLY_STEP" ]]; then
    # Exact match on the step name without the .sh suffix
    # (e.g. "00-preflight" selects "00-preflight.sh")
    [[ "$step_file" == "$ONLY_STEP" ]] || continue
  elif [[ -n "$FROM_STEP" ]]; then
    # Skip steps that sort before FROM_STEP
    [[ "$step_file" < "$FROM_STEP" ]] && continue
  fi
  STEPS_TO_RUN+=("$step_path")
done
if [[ ${#STEPS_TO_RUN[@]} -eq 0 ]]; then
die "No matching steps found (--step='${ONLY_STEP}' --from='${FROM_STEP}')"
fi
log "Steps to run (${#STEPS_TO_RUN[@]}):"
for s in "${STEPS_TO_RUN[@]}"; do
printf " %s\n" "$(basename "$s")"
done
echo ""
# ── step execution loop ───────────────────────────────────────────────────────
# Steps numbered 10+ are "mutating" and require deploy_autonomy=true.
# Steps whose name contains git/repo/clone require git_control=true,
# whatever their number.
# Steps below 10 (e.g. the read-only 00-preflight) are always allowed.
_step_needs_autonomy() {
  local num="${1%%[^0-9]*}" # leading digits
  # 10# forces base 10 so prefixes such as "08" are not rejected as bad octal.
  [[ -n "$num" ]] && (( 10#$num >= 10 ))
}
_step_needs_git_control() {
local name="$1"
[[ "$name" == *"git"* || "$name" == *"repo"* || "$name" == *"clone"* ]]
}
FAILED_STEPS=()
for step_path in "${STEPS_TO_RUN[@]}"; do
step_file=$(basename "$step_path" .sh)
step_num="${step_file%%[^0-9]*}"
# autonomy gate
if _step_needs_autonomy "$step_num" && [[ "$DEPLOY_AUTONOMY" != "true" ]]; then
warn "Skipping $step_file — deploy_autonomy=false in $NODE_YAML"
warn "Run this step manually or set deploy_autonomy: true"
continue
fi
# git_control gate
if _step_needs_git_control "$step_file" && [[ "$GIT_CONTROL" != "true" ]]; then
warn "Skipping $step_file — git_control=false in $NODE_YAML"
continue
fi
step "Running: $step_file"
if bash "$step_path"; then
log "$step_file — OK"
else
rc=$?
warn "$step_file — FAILED (exit $rc)"
FAILED_STEPS+=("$step_file")
fi
echo ""
done
# ── summary ───────────────────────────────────────────────────────────────────
if [[ ${#FAILED_STEPS[@]} -gt 0 ]]; then
die "Onboarding finished with failures: ${FAILED_STEPS[*]}"
fi
if [ "${DRY_RUN:-0}" = 1 ]; then
log "Dry-run complete — no mutations performed."
else
log "All steps completed successfully for node ${NODE_NAME}."
fi


@ -1,156 +0,0 @@
#!/usr/bin/env bash
# scripts/onboard/steps/00-access.sh — establish remote access channel
#
# Stages:
# 1. ensure_ssh_key — copy SATURN public key to first_contact (idempotent)
# 2. ensure_tailscale — install Tailscale and join network (interactive auth URL)
# 3. verify — confirm SSH over Tailscale, assert arch=aarch64
#
# Dry-run convention (DRY_RUN=1):
# - Read-only probes (SSH BatchMode test, tailscale status, command -v) run ALWAYS
# so the plan reflects real current state ("key present → skip" vs "would: install")
# - Mutations (ssh-copy-id, curl installer, tailscale up) are wrapped with run()
#
# Does NOT configure NOPASSWD or /opt/homelab — those are later steps.
# pi user on Raspberry Pi OS has passwordless sudo — required for `tailscale up`.
set -euo pipefail
STEP_NAME="00-access"
: "${REPO_ROOT:?REPO_ROOT is not set — run via onboard.sh}"
: "${NODE_YAML:?NODE_YAML is not set — run via onboard.sh}"
: "${DRY_RUN:=0}"
# Source common.sh when run standalone (orchestrator sources it before calling steps)
if ! declare -f log >/dev/null 2>&1; then
# shellcheck source=../lib/common.sh
source "${REPO_ROOT}/scripts/onboard/lib/common.sh"
fi
# ── parse node.yaml ───────────────────────────────────────────────────────────
FIRST_CONTACT=$(yaml_get "$NODE_YAML" "first_contact")
TS_HOSTNAME=$(yaml_get "$NODE_YAML" "tailscale.hostname")
[[ -z "$FIRST_CONTACT" ]] && die "first_contact not set in $NODE_YAML"
[[ -z "$TS_HOSTNAME" ]] && die "tailscale.hostname not set in $NODE_YAML"
FC_USER="${FIRST_CONTACT%%@*}"
# ONBOARD_SSH_USER/HOST are set by the orchestrator to post-Tailscale coordinates.
# For standalone invocation, fall back to the first_contact user and the
# Tailscale hostname from node.yaml.
export ONBOARD_SSH_USER="${ONBOARD_SSH_USER:-${FC_USER}}"
export ONBOARD_SSH_HOST="${ONBOARD_SSH_HOST:-${TS_HOSTNAME}}"
# shellcheck source=../lib/remote.sh
source "${REPO_ROOT}/scripts/onboard/lib/remote.sh"
# ── SSH option arrays ─────────────────────────────────────────────────────────
# No BatchMode — used for ssh-copy-id where a password prompt may appear
_FC_SSH_NOKEY=(-o StrictHostKeyChecking=accept-new -o ConnectTimeout=10)
# BatchMode — used for all probes and post-key-install operations
_FC_SSH=(-o StrictHostKeyChecking=accept-new -o ConnectTimeout=10 -o BatchMode=yes)
# Tailscale verify — LogLevel=ERROR suppresses the "Permanently added" known-hosts
# INFO message that would otherwise leak into captured stdout on first connection
_TS_SSH=(-o StrictHostKeyChecking=accept-new -o ConnectTimeout=10 -o BatchMode=yes -o LogLevel=ERROR)
# ── tailscale state probe helper ──────────────────────────────────────────────
# Always runs; returns BackendState or "unknown" on any SSH/parse failure.
_ts_state() {
ssh "${_FC_SSH[@]}" "$FIRST_CONTACT" \
'tailscale status --json 2>/dev/null | python3 -c \
"import sys,json; print(json.load(sys.stdin).get(\"BackendState\",\"unknown\"))" \
2>/dev/null || echo "unknown"' 2>/dev/null || echo "unknown"
}
# ═══════════════════════════════════════════════════════════════════════════════
# Stage 1 — ensure_ssh_key
# ═══════════════════════════════════════════════════════════════════════════════
step "[$STEP_NAME] 1/3 ensure_ssh_key → ${FIRST_CONTACT}"
# Probe: test key-based auth — always runs so dry-run reports real current state
if ssh "${_FC_SSH[@]}" "$FIRST_CONTACT" true 2>/dev/null; then
log "SSH key already accepted by ${FIRST_CONTACT} — skip"
else
pubkeys=( "$HOME"/.ssh/id_*.pub )
[[ -f "${pubkeys[0]}" ]] || die "No public key found at ~/.ssh/id_*.pub on SATURN"
log "Key not yet installed on ${FIRST_CONTACT} (password prompt expected)"
# Mutation: install public key
run ssh-copy-id \
"${_FC_SSH_NOKEY[@]}" \
-i "${pubkeys[0]}" \
"$FIRST_CONTACT"
# Probe: verify key was installed. Wrapped in run() so that dry-run prints
# "would:" instead of failing after the skipped ssh-copy-id.
run ssh "${_FC_SSH[@]}" "$FIRST_CONTACT" true
if [ "${DRY_RUN:-0}" != 1 ]; then
log "Key installed and verified"
fi
fi
# ═══════════════════════════════════════════════════════════════════════════════
# Stage 2 — ensure_tailscale
# ═══════════════════════════════════════════════════════════════════════════════
step "[$STEP_NAME] 2/3 ensure_tailscale on ${FIRST_CONTACT} → hostname=${TS_HOSTNAME}"
# Probe: check if tailscale binary present — always runs.
# SSH auth failure (key not yet installed in dry-run) falls through to the
# "not found" branch, which is correct for a fresh node.
if ! ssh "${_FC_SSH[@]}" "$FIRST_CONTACT" 'command -v tailscale' >/dev/null 2>&1; then
log "Tailscale not found on ${FIRST_CONTACT}"
# Mutation: install tailscale
run ssh "${_FC_SSH[@]}" "$FIRST_CONTACT" \
'curl -fsSL https://tailscale.com/install.sh | sh'
else
log "Tailscale already installed on ${FIRST_CONTACT}"
fi
# Probe: check backend state — always runs
ts_state=$(_ts_state)
if [[ "$ts_state" == "Running" ]]; then
log "Tailscale already active (BackendState=Running) — skip"
else
warn "Tailscale BackendState=${ts_state} — joining network..."
echo ""
echo -e "${_C_BOLD}┌─────────────────────────────────────────────────────────────┐"
echo -e "│ ACTION REQUIRED: open the URL below in your browser to │"
echo -e "│ authorize ${TS_HOSTNAME} in your Tailscale account. │"
echo -e "└─────────────────────────────────────────────────────────────┘${_C_RESET}"
echo ""
# Mutation: tailscale up — blocks until user authenticates via printed URL
run ssh "${_FC_SSH[@]}" "$FIRST_CONTACT" "sudo tailscale up --hostname=${TS_HOSTNAME}"
echo ""
# Post-join state check — only meaningful after the mutation actually ran
if [ "${DRY_RUN:-0}" != 1 ]; then
ts_state2=$(_ts_state)
[[ "$ts_state2" == "Running" ]] \
|| die "Tailscale still not active after tailscale up (BackendState=${ts_state2})"
log "Tailscale joined successfully (BackendState=Running)"
fi
fi
# ═══════════════════════════════════════════════════════════════════════════════
# Stage 3 — verify over Tailscale
# ═══════════════════════════════════════════════════════════════════════════════
step "[$STEP_NAME] 3/3 verify SSH over Tailscale → ${ONBOARD_SSH_USER}@${TS_HOSTNAME}"
# Probe: always runs — on a node already joined this works even in dry-run.
# On a fresh node in dry-run mode Tailscale isn't set up yet, so SSH will fail;
# that is reported as a warning (not a fatal error) to keep dry-run informative.
# stderr is NOT merged (no 2>&1) — _TS_SSH uses LogLevel=ERROR so the
# "Permanently added … to known hosts" INFO message is suppressed at source.
if arch=$(ssh "${_TS_SSH[@]}" "${ONBOARD_SSH_USER}@${TS_HOSTNAME}" 'uname -m'); then
# Take the last non-empty stdout line to skip any unexpected preamble
arch=$(printf '%s' "$arch" | grep -v '^[[:space:]]*$' | tail -1 | tr -d '[:space:]')
if [[ "$arch" == "aarch64" ]]; then
log "Verify OK: ${ONBOARD_SSH_USER}@${TS_HOSTNAME} reachable, arch=${arch}"
else
msg="Unexpected arch '${arch}' on ${TS_HOSTNAME} — expected aarch64"
# if/else instead of "A && B || C": die must not run if warn itself returns non-zero
if [ "${DRY_RUN:-0}" = 1 ]; then warn "$msg"; else die "$msg"; fi
fi
else
msg="Verify SSH to ${ONBOARD_SSH_USER}@${TS_HOSTNAME} failed (Tailscale not yet joined?)"
# if/else instead of "A && B || C": die must not run if warn itself returns non-zero
if [ "${DRY_RUN:-0}" = 1 ]; then warn "$msg"; else die "$msg"; fi
fi
log "[$STEP_NAME] done"
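
The dry-run convention described in the header (probes always run, mutations go through `run()`) can be sketched as follows. The real `run()` lives in `lib/common.sh` and is not shown in this diff, so the implementation below is an assumed stand-in; only the `would:` prefix is taken from the comments above.

```shell
#!/usr/bin/env bash
# Sketch only: this run() is an assumed stand-in for the helper in lib/common.sh,
# whose real implementation is not shown here.
run() {
  if [ "${DRY_RUN:-0}" = 1 ]; then
    # Print the command instead of executing it; %q keeps the quoting visible.
    printf 'would:'
    printf ' %q' "$@"
    printf '\n'
    return 0
  fi
  "$@"
}

workdir=$(mktemp -d)

# Probe: always runs, so the plan reflects the real current state.
if [ -e "$workdir/marker" ]; then
  echo "marker present, skip"
else
  # Mutation: wrapped, so DRY_RUN=1 only reports it.
  DRY_RUN=1 run touch "$workdir/marker"
  DRY_RUN=0 run touch "$workdir/marker"
fi

rm -rf "$workdir"
```

Because the probe is never wrapped, a dry run on an already-configured node reports "skip" rather than a misleading "would:" line.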

View file

@ -1,144 +0,0 @@
#!/usr/bin/env bash
# scripts/onboard/steps/00-preflight.sh — READ-ONLY remote node discovery
#
# Collects facts from the remote node and prints:
# 1. A human-readable report block
# 2. A machine-readable YAML snippet ready to paste into hosts/<node>/node.yaml
#
# NO mutations are performed on the remote host.
# Depends on: lib/common.sh (sourced by orchestrator), lib/remote.sh (sourced here)
set -euo pipefail
STEP_NAME="00-preflight"
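# common.sh and remote.sh are sourced here so this step can also be run standalone
# (when REPO_ROOT is in the environment).
: "${REPO_ROOT:?REPO_ROOT is not set — run via onboard.sh}"
# Source common.sh only if its helpers (log, step) are not already defined
if ! declare -f log >/dev/null 2>&1; then
# shellcheck source=../lib/common.sh
source "${REPO_ROOT}/scripts/onboard/lib/common.sh"
fi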
# shellcheck source=../lib/remote.sh
source "${REPO_ROOT}/scripts/onboard/lib/remote.sh"
step "[$STEP_NAME] Collecting facts from ${ONBOARD_SSH_USER}@${ONBOARD_SSH_HOST} (read-only)"
# ── gather all facts in a single SSH session ──────────────────────────────────
raw=$(rrun bash -s <<'REMOTE'
set -euo pipefail
# arch / bitness
arch=$(uname -m)
bits=$(getconf LONG_BIT)
# RAM (kB → MB)
mem_kb=$(grep MemTotal /proc/meminfo | awk '{print $2}')
mem_mb=$(( mem_kb / 1024 ))
# disk root
disk_root=$(df -h / | awk 'NR==2{print $2" total, "$3" used, "$4" free ("$5" used)"}')
# docker
docker_present=false
docker_info=""
if command -v docker >/dev/null 2>&1; then
docker_present=true
docker_info=$(docker info --format '{{.ServerVersion}}' 2>/dev/null || echo "unknown")
fi
# tailscale
tailscale_present=false
tailscale_status=""
if command -v tailscale >/dev/null 2>&1; then
tailscale_present=true
tailscale_status=$(tailscale status --json 2>/dev/null | python3 -c "import sys,json; d=json.load(sys.stdin); print(d.get('BackendState','unknown'))" 2>/dev/null || tailscale status 2>/dev/null | head -1 || echo "unknown")
fi
# Magic Mirror runtime detection
mm_runtime="none"
if systemctl is-active --quiet MagicMirror 2>/dev/null || systemctl is-active --quiet magicmirror 2>/dev/null; then
mm_runtime="systemd"
elif command -v pm2 >/dev/null 2>&1 && pm2 list 2>/dev/null | grep -qi "MagicMirror"; then
mm_runtime="pm2"
elif pgrep -fa "MagicMirror" >/dev/null 2>&1; then
mm_runtime="process"
fi
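# swap (left empty when nothing is found; the report falls back to "none").
# Starting from "none" would yield "none; zram: ..." once zram is appended.
# Multi-line tool output is flattened so it fits on one KEY=VALUE line.
swap_current=""
if command -v swapon >/dev/null 2>&1; then
swap_lines=$(swapon --show --noheadings 2>/dev/null | paste -sd ';' - || true)
if [[ -n "$swap_lines" ]]; then
swap_current="$swap_lines"
fi
fi
if command -v zramctl >/dev/null 2>&1; then
zram_lines=$(zramctl --noheadings 2>/dev/null | paste -sd ';' - || true)
[[ -n "$zram_lines" ]] && swap_current="${swap_current:+$swap_current; }zram: $zram_lines"
fi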
# hostname / os
hostname=$(hostname -f 2>/dev/null || hostname)
os_pretty=$(grep PRETTY_NAME /etc/os-release 2>/dev/null | cut -d= -f2 | tr -d '"' || echo "unknown")
cat <<EOF
ARCH=$arch
BITS=$bits
MEM_MB=$mem_mb
DISK_ROOT=$disk_root
DOCKER_PRESENT=$docker_present
DOCKER_VERSION=$docker_info
TAILSCALE_PRESENT=$tailscale_present
TAILSCALE_STATUS=$tailscale_status
MM_RUNTIME=$mm_runtime
SWAP_CURRENT=$swap_current
HOSTNAME=$hostname
OS=$os_pretty
EOF
REMOTE
)
# ── parse key=value output ────────────────────────────────────────────────────
_val() { echo "$raw" | grep "^${1}=" | head -1 | cut -d= -f2-; }
arch=$(_val ARCH)
bits=$(_val BITS)
mem_mb=$(_val MEM_MB)
disk_root=$(_val DISK_ROOT)
docker_present=$(_val DOCKER_PRESENT)
docker_version=$(_val DOCKER_VERSION)
tailscale_present=$(_val TAILSCALE_PRESENT)
tailscale_status=$(_val TAILSCALE_STATUS)
mm_runtime=$(_val MM_RUNTIME)
swap_current=$(_val SWAP_CURRENT)
remote_hostname=$(_val HOSTNAME)
os_pretty=$(_val OS)
# ── human-readable report ─────────────────────────────────────────────────────
echo ""
echo "┌─────────────────────────────────────────────────────┐"
printf "│ Preflight report: %-33s│\n" "${ONBOARD_SSH_HOST}"
echo "├─────────────────────────────────────────────────────┤"
printf "│ hostname : %-35s│\n" "$remote_hostname"
printf "│ OS : %-35s│\n" "$os_pretty"
printf "│ arch : %-35s│\n" "${arch} (${bits}-bit)"
printf "│ RAM : %-35s│\n" "${mem_mb} MB"
printf "│ disk / : %-35s│\n" "$disk_root"
printf "│ docker : %-35s│\n" "${docker_present} (v${docker_version})"
printf "│ tailscale : %-35s│\n" "${tailscale_present} / ${tailscale_status}"
printf "│ MagicMirror : %-35s│\n" "$mm_runtime"
printf "│ swap : %-35s│\n" "${swap_current:-none}"
echo "└─────────────────────────────────────────────────────┘"
echo ""
# ── machine-readable YAML snippet ────────────────────────────────────────────
echo "# ── paste into hosts/${NODE_NAME,,}/node.yaml ──"
cat <<YAML
hardware:
arch: ${arch}
ram_mb: ${mem_mb}
swap: "${swap_current:-none}"
docker_present: ${docker_present}
docker_version: "${docker_version}"
tailscale_status: "${tailscale_status}"
mm_runtime: ${mm_runtime}
YAML
log "[$STEP_NAME] done — no changes made to remote host"
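
The preflight step moves facts from the remote shell to the local one as `KEY=VALUE` lines and reads them back with `_val`. A small sketch of that parsing, using a made-up report (the values below are illustrative, not output from a real node), shows two properties worth knowing: an empty value still parses, and `cut -d= -f2-` keeps any `=` that appears inside the value.

```shell
#!/usr/bin/env bash
# Sketch only: the sample report below is made up for illustration.
raw='ARCH=aarch64
MEM_MB=3793
DOCKER_VERSION=
OS=Debian GNU/Linux 12 (bookworm) arch=arm64'

# Same shape as the _val helper: first matching line, everything after the first "=".
_val() { printf '%s\n' "$raw" | grep "^${1}=" | head -1 | cut -d= -f2-; }

echo "arch=$(_val ARCH)"
echo "docker_version=[$(_val DOCKER_VERSION)]"   # empty value, key still present
echo "os=$(_val OS)"                             # "=" inside the value survives
```

The format assumes one line per key, so any multi-line value has to be flattened on the remote side before it is emitted.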

View file

@ -1,14 +0,0 @@
#!/usr/bin/env bash
# scripts/onboard/steps/10-bootstrap-runtime.sh — create /opt/homelab layout on remote node
#
# TODO: create /opt/homelab/{data,config,logs,state,events,world,actions/{pending,approved,running,completed,failed}}
# TODO: set ownership to ssh_user (from node.yaml)
# TODO: write /opt/homelab/state/node_name from node.yaml name field
# TODO: idempotent — skip dirs that already exist
set -euo pipefail
: "${REPO_ROOT:?REPO_ROOT is not set — run via onboard.sh}"
source "${REPO_ROOT}/scripts/onboard/lib/remote.sh"
STEP_NAME="10-bootstrap-runtime"
step "[$STEP_NAME] TODO — not yet implemented"

View file

@ -1,152 +0,0 @@
#!/usr/bin/env bash
# scripts/onboard/steps/20-base.sh — base system configuration for LUSTRO
#
# Stages:
# 1. swap→zram — disable dphys-swapfile, install + configure zram-tools
# 2. /opt/homelab — create base directory, chown <ssh_user>:<ssh_user>
# 3. event dir — create /opt/homelab/events/<ts_hostname>, chown -R
#
# Dry-run convention:
# - Probes (state queries) run unconditionally — dry-run reflects real state
# - Mutations use rrun() which skips execution when DRY_RUN=1
set -euo pipefail
STEP_NAME="20-base"
: "${REPO_ROOT:?REPO_ROOT is not set — run via onboard.sh}"
: "${NODE_YAML:?NODE_YAML is not set — run via onboard.sh}"
: "${DRY_RUN:=0}"
# Source common.sh when run standalone (orchestrator sources it before calling steps)
if ! declare -f log >/dev/null 2>&1; then
# shellcheck source=../lib/common.sh
source "${REPO_ROOT}/scripts/onboard/lib/common.sh"
fi
# ── parse node.yaml ───────────────────────────────────────────────────────────
SSH_USER=$(yaml_get "$NODE_YAML" "ssh_user")
TS_HOSTNAME=$(yaml_get "$NODE_YAML" "tailscale.hostname")
[[ -z "$SSH_USER" ]] && die "ssh_user not set in $NODE_YAML"
[[ -z "$TS_HOSTNAME" ]] && die "tailscale.hostname not set in $NODE_YAML"
export ONBOARD_SSH_USER="${ONBOARD_SSH_USER:-${SSH_USER}}"
export ONBOARD_SSH_HOST="${ONBOARD_SSH_HOST:-${TS_HOSTNAME}}"
# shellcheck source=../lib/remote.sh
source "${REPO_ROOT}/scripts/onboard/lib/remote.sh"
# ── rprobe: read-only remote probe — always runs, even in dry-run ─────────────
rprobe() {
ssh "${_SSH_OPTS[@]}" "${ONBOARD_SSH_USER}@${ONBOARD_SSH_HOST}" -- "$@"
}
# ═══════════════════════════════════════════════════════════════════════════════
# Stage 1 — swap→zram
# ═══════════════════════════════════════════════════════════════════════════════
step "[$STEP_NAME] 1/3 swap→zram (PERCENT=50, algo=zstd)"
# Guard by EFFECT: zram device present in swapon AND dphys-swapfile not active
# → desired end-state already reached, skip the whole stage.
_zram_active=0
_dphys_active=0
rprobe 'sudo swapon --show 2>/dev/null | grep -q /dev/zram' && _zram_active=1 || true
rprobe 'systemctl is-active dphys-swapfile' >/dev/null 2>&1 && _dphys_active=1 || true
if [[ "$_zram_active" -eq 1 && "$_dphys_active" -eq 0 ]]; then
log "zram already active, dphys-swapfile not active — skip"
else
# Substage: disable dphys-swapfile if still active
if [[ "$_dphys_active" -eq 1 ]]; then
log "dphys-swapfile active — disabling"
rrun sudo dphys-swapfile swapoff
rrun sudo systemctl disable --now dphys-swapfile
if rprobe '[ -f /var/swap ]' 2>/dev/null; then
rrun sudo rm -f /var/swap
log "Removed /var/swap"
fi
else
log "dphys-swapfile not active — skip disable"
fi
# Substage: install zram-tools if package not present
# Use dpkg -l rather than command -v: zramswap binary may not be on PATH over SSH
if ! rprobe 'dpkg -l zram-tools 2>/dev/null | grep -q "^ii"' 2>/dev/null; then
log "zram-tools not installed — installing"
rrun sudo apt-get install -y zram-tools
else
log "zram-tools already installed"
fi
# Write config and (re)start zramswap
log "Writing /etc/default/zramswap (ALGO=zstd, PERCENT=50)"
rrun bash -c "printf '%s\n' 'ALGO=zstd' 'PERCENT=50' | sudo tee /etc/default/zramswap > /dev/null"
rrun sudo systemctl enable zramswap
rrun sudo systemctl restart zramswap
fi
# Verify (skipped in dry-run — mutations may not have run)
if [ "${DRY_RUN:-0}" != 1 ]; then
if rprobe 'sudo swapon --show 2>/dev/null | grep -q /dev/zram'; then
log "Verify OK: zram swap active"
rprobe 'sudo swapon --show' || true
else
die "zram swap not active after setup — check: systemctl status zramswap on ${TS_HOSTNAME}"
fi
if rprobe 'systemctl is-active dphys-swapfile' >/dev/null 2>&1; then
warn "dphys-swapfile still reports active — manual inspection needed"
else
log "Verify OK: dphys-swapfile not active"
fi
fi
# ═══════════════════════════════════════════════════════════════════════════════
# Stage 2 — /opt/homelab
# ═══════════════════════════════════════════════════════════════════════════════
step "[$STEP_NAME] 2/3 /opt/homelab (owner: ${SSH_USER}:${SSH_USER})"
# Guard: exists AND owned by SSH_USER?
_dir_ok=0
if rprobe '[ -d /opt/homelab ]' 2>/dev/null; then
_owner=$(rprobe "stat -c '%U' /opt/homelab" 2>/dev/null || echo "")
if [[ "$_owner" == "$SSH_USER" ]]; then
_dir_ok=1
log "/opt/homelab exists, owner=${SSH_USER} — skip"
else
log "/opt/homelab exists but owner='${_owner}' — fixing"
fi
else
log "/opt/homelab missing — creating"
fi
if [[ "$_dir_ok" -eq 0 ]]; then
rrun sudo mkdir -p /opt/homelab
rrun sudo chown "${SSH_USER}:${SSH_USER}" /opt/homelab
fi
# ═══════════════════════════════════════════════════════════════════════════════
# Stage 3 — event dir
# ═══════════════════════════════════════════════════════════════════════════════
step "[$STEP_NAME] 3/3 event dir (/opt/homelab/events/${TS_HOSTNAME})"
# Guard: event subdir exists AND /opt/homelab/events owned by SSH_USER?
_evdir_ok=0
if rprobe "[ -d /opt/homelab/events/${TS_HOSTNAME} ]" 2>/dev/null; then
_ev_owner=$(rprobe "stat -c '%U' /opt/homelab/events" 2>/dev/null || echo "")
if [[ "$_ev_owner" == "$SSH_USER" ]]; then
_evdir_ok=1
log "/opt/homelab/events/${TS_HOSTNAME} exists, owner=${SSH_USER} — skip"
else
log "/opt/homelab/events exists but owner='${_ev_owner}' — fixing"
fi
else
log "/opt/homelab/events/${TS_HOSTNAME} missing — creating"
fi
if [[ "$_evdir_ok" -eq 0 ]]; then
rrun sudo mkdir -p "/opt/homelab/events/${TS_HOSTNAME}"
rrun sudo chown -R "${SSH_USER}:${SSH_USER}" /opt/homelab/events
fi
log "[$STEP_NAME] done"
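
Stages 2 and 3 share one pattern: check the desired end state first, and mutate only when it is not met. The sketch below reproduces that pattern against the local filesystem so it can be run without SSH or sudo. `ensure_dir_owned` is a hypothetical helper, and `stat -c '%U'` assumes GNU coreutils, as the remote probes above do.

```shell
#!/usr/bin/env bash
# Sketch only: ensure_dir_owned is a hypothetical local analogue of stages 2 and 3.
# It needs no sudo when the caller already owns the parent directory.
ensure_dir_owned() {
  local dir="$1" owner="$2"
  # Guard by effect: directory exists AND has the expected owner.
  if [ -d "$dir" ] && [ "$(stat -c '%U' "$dir")" = "$owner" ]; then
    echo "skip"
    return 0
  fi
  mkdir -p "$dir"
  chown "$owner" "$dir"
  echo "changed"
}

base=$(mktemp -d)
me=$(id -un)
ensure_dir_owned "$base/homelab/events" "$me"   # first call does the work
ensure_dir_owned "$base/homelab/events" "$me"   # second call is a no-op
rm -rf "$base"
```

Guarding on the effect rather than on "did this script run before" keeps the step safe to re-run after a partial failure.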

View file

@ -1,16 +0,0 @@
#!/usr/bin/env bash
# scripts/onboard/steps/20-install-docker.sh — install Docker Engine on remote node
#
# TODO: skip if docker already present (check from 00-preflight facts or live rrun)
# TODO: detect distro (Debian/Ubuntu/Raspberry Pi OS) and use appropriate apt repo
# TODO: install docker-ce, docker-ce-cli, containerd.io
# TODO: add ssh_user to docker group
# TODO: enable + start docker.service
# TODO: gate on deploy_autonomy=true in node.yaml (skip step if false, warn operator)
set -euo pipefail
: "${REPO_ROOT:?REPO_ROOT is not set — run via onboard.sh}"
source "${REPO_ROOT}/scripts/onboard/lib/remote.sh"
STEP_NAME="20-install-docker"
step "[$STEP_NAME] TODO — not yet implemented"

View file

@ -1,16 +0,0 @@
#!/usr/bin/env bash
# scripts/onboard/steps/30-install-tailscale.sh — install and join Tailscale on remote node
#
# TODO: skip if tailscale already installed and connected
# TODO: install via https://tailscale.com/install.sh (or distro pkg)
# TODO: gate on operator-provided auth key (TAILSCALE_AUTH_KEY env var; never hardcode)
# TODO: tailscale up --auth-key=$TAILSCALE_AUTH_KEY --hostname=<node.yaml name>
# TODO: verify node appears in tailscale status within timeout
# TODO: gate on deploy_autonomy=true in node.yaml
set -euo pipefail
: "${REPO_ROOT:?REPO_ROOT is not set — run via onboard.sh}"
source "${REPO_ROOT}/scripts/onboard/lib/remote.sh"
STEP_NAME="30-install-tailscale"
step "[$STEP_NAME] TODO — not yet implemented"

View file

@ -1,136 +0,0 @@
#!/usr/bin/env bash
# scripts/onboard/steps/30-node-agent.sh — deploy node-agent to remote node
#
# Push-based deploy (git_control=false on LUSTRO): rsync services/node-agent/
# and the host override to /opt/homelab/deploy/node-agent/ on the remote, then
# docker compose build + up via SSH. Mirrors the PIHA pattern but pushes files
# instead of git-pulling them on the node.
#
# Stages:
# 1. push — rsync base compose+src, copy override to remote deploy dir
# 2. up — docker compose up -d --build (guarded: skip if already running)
# 3. verify — container running + fresh event in /opt/homelab/events/<node>/
#
# Dry-run: probes run unconditionally; rsync/rrun mutations honour DRY_RUN.
set -euo pipefail
STEP_NAME="30-node-agent"
: "${REPO_ROOT:?REPO_ROOT is not set — run via onboard.sh}"
: "${NODE_YAML:?NODE_YAML is not set — run via onboard.sh}"
: "${DRY_RUN:=0}"
# Source common.sh when run standalone (orchestrator sources it before calling steps)
if ! declare -f log >/dev/null 2>&1; then
# shellcheck source=../lib/common.sh
source "${REPO_ROOT}/scripts/onboard/lib/common.sh"
fi
# ── parse node.yaml ───────────────────────────────────────────────────────────
SSH_USER=$(yaml_get "$NODE_YAML" "ssh_user")
TS_HOSTNAME=$(yaml_get "$NODE_YAML" "tailscale.hostname")
[[ -z "$SSH_USER" ]] && die "ssh_user not set in $NODE_YAML"
[[ -z "$TS_HOSTNAME" ]] && die "tailscale.hostname not set in $NODE_YAML"
export ONBOARD_SSH_USER="${ONBOARD_SSH_USER:-${SSH_USER}}"
export ONBOARD_SSH_HOST="${ONBOARD_SSH_HOST:-${TS_HOSTNAME}}"
# shellcheck source=../lib/remote.sh
source "${REPO_ROOT}/scripts/onboard/lib/remote.sh"
REMOTE_DEPLOY_DIR="/opt/homelab/deploy/node-agent"
COMPOSE_BASE="${REMOTE_DEPLOY_DIR}/docker-compose.yml"
COMPOSE_OVERRIDE="${REMOTE_DEPLOY_DIR}/docker-compose.override.yml"
LOCAL_SVC_DIR="${REPO_ROOT}/services/node-agent"
LOCAL_OVERRIDE="${REPO_ROOT}/hosts/${TS_HOSTNAME}/runtime/node-agent/docker-compose.override.yml"
# ── rprobe: read-only remote probe — always runs, even in dry-run ─────────────
rprobe() {
ssh "${_SSH_OPTS[@]}" "${ONBOARD_SSH_USER}@${ONBOARD_SSH_HOST}" -- "$@"
}
# ═══════════════════════════════════════════════════════════════════════════════
# Stage 1 — push compose files to remote
# ═══════════════════════════════════════════════════════════════════════════════
step "[$STEP_NAME] 1/3 push compose → ${ONBOARD_SSH_HOST}:${REMOTE_DEPLOY_DIR}"
# Guard by EFFECT: is node-agent already running?
_running=0
if rprobe "docker ps --filter name=^node-agent\$ --filter status=running --format '{{.Names}}' 2>/dev/null | grep -q node-agent" 2>/dev/null; then
_running=1
log "node-agent container already running — skip push+build+up"
fi
if [[ "$_running" -eq 0 ]]; then
[[ -f "$LOCAL_OVERRIDE" ]] \
|| die "Override not found: $LOCAL_OVERRIDE"
# Ensure remote deploy dir exists (rsync does not create intermediate dirs)
# pi owns /opt/homelab, so no sudo needed
rrun mkdir -p "${REMOTE_DEPLOY_DIR}"
# Push base compose + Dockerfile + src/ (rsync_dir handles DRY_RUN)
rsync_dir "${LOCAL_SVC_DIR}/" "${REMOTE_DEPLOY_DIR}/"
# Push host-specific override (rcopy handles DRY_RUN)
rcopy "${LOCAL_OVERRIDE}" "${REMOTE_DEPLOY_DIR}/docker-compose.override.yml"
fi
# ═══════════════════════════════════════════════════════════════════════════════
# Stage 2 — docker compose build + up
# ═══════════════════════════════════════════════════════════════════════════════
step "[$STEP_NAME] 2/3 docker compose up node-agent"
if [[ "$_running" -eq 1 ]]; then
log "node-agent already running — skip"
else
# Build image on remote (arm64 native); then start the service.
# --build rebuilds if context changed; idempotent if image is current.
rrun docker compose \
-f "${COMPOSE_BASE}" \
-f "${COMPOSE_OVERRIDE}" \
up -d --build node-agent
fi
# ═══════════════════════════════════════════════════════════════════════════════
# Stage 3 — verify
# ═══════════════════════════════════════════════════════════════════════════════
step "[$STEP_NAME] 3/3 verify"
if [ "${DRY_RUN:-0}" = 1 ]; then
log "dry-run: skipping verify (mutations may not have run)"
else
# Verify: container running (docker ps — not command -v)
if rprobe "docker ps --filter name=^node-agent\$ --filter status=running --format '{{.Names}}' 2>/dev/null | grep -q node-agent" 2>/dev/null; then
log "Verify OK: node-agent container running"
rprobe "docker ps --filter name=node-agent --format 'table {{.Names}}\t{{.Status}}\t{{.Image}}'" || true
else
die "node-agent container is NOT running — check: docker logs node-agent on ${TS_HOSTNAME}"
fi
# Verify: at least one event file exists in /opt/homelab/events/<node>/ (confirms agent writes).
# Note: this checks presence only, not file age. The first cycle runs at start,
# then the agent sleeps CHECK_INTERVAL; allow 90s.
log "Waiting for first event (up to 90 s, CHECK_INTERVAL=60)..."
_event_ok=0
for _i in $(seq 1 9); do
if rprobe "ls /opt/homelab/events/${TS_HOSTNAME}/*.json 2>/dev/null | head -1 | grep -q '\.json\$'" 2>/dev/null; then
_event_ok=1
break
fi
log " ... ${_i}0 s elapsed, waiting..."
sleep 10
done
if [[ "$_event_ok" -eq 1 ]]; then
log "Verify OK: events present in /opt/homelab/events/${TS_HOSTNAME}/"
rprobe "ls -lth /opt/homelab/events/${TS_HOSTNAME}/ | head -5" || true
else
warn "No events yet in /opt/homelab/events/${TS_HOSTNAME}/ after 90 s — agent may still be initialising (CHECK_INTERVAL=60)"
warn "Re-run verify manually: docker logs node-agent on ${TS_HOSTNAME}"
fi
fi
log "[$STEP_NAME] done"
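
The verify stage polls for the first event with a fixed number of attempts and a fixed delay. That loop can be factored into a small helper, sketched here under stated assumptions: `wait_for` and `has_event` are hypothetical names, and a background `touch` stands in for the agent's first write.

```shell
#!/usr/bin/env bash
# Sketch only: wait_for is a hypothetical helper, not part of the onboarding scripts.
# Usage: wait_for <attempts> <delay_seconds> <command...>
wait_for() {
  local attempts="$1" delay="$2" i
  shift 2
  for (( i = 1; i <= attempts; i++ )); do
    if "$@"; then
      return 0            # condition met
    fi
    # Sleep between attempts only, not after the last one.
    if (( i < attempts )); then
      sleep "$delay"
    fi
  done
  return 1                # gave up
}

events_dir=$(mktemp -d)
( sleep 1; touch "$events_dir/first.json" ) &   # simulate the agent's first write

# True when at least one *.json file exists in the events directory.
has_event() { compgen -G "$events_dir/*.json" >/dev/null; }

if wait_for 5 1 has_event; then
  echo "event present"
else
  echo "no event yet"
fi
wait
rm -rf "$events_dir"
```

Passing the condition as a command keeps the timeout logic separate from what is being checked, so the same helper could wrap an SSH probe.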

View file

@ -1,140 +0,0 @@
#!/usr/bin/env bash
# scripts/onboard/steps/40-register.sh — add the node to the inventory and commit on the current branch
#
# Effects (all idempotent):
# 1. Appends a <node> block to inventory/topology.yaml
# 2. Creates hosts/<node>/services.yaml if it does not exist
# 3. git add + git commit on the current branch (NO push; merging is the operator's job)
#
# Reloading the observer is deliberately outside this step: it is done manually after
# merge→master, git pull on the VPS, and running 50-verify.sh.
set -euo pipefail
STEP_NAME="40-register"
: "${REPO_ROOT:?REPO_ROOT is not set — run via onboard.sh}"
: "${NODE_YAML:?NODE_YAML is not set — run via onboard.sh}"
: "${DRY_RUN:=0}"
if ! declare -f log >/dev/null 2>&1; then
# shellcheck source=../lib/common.sh
source "${REPO_ROOT}/scripts/onboard/lib/common.sh"
fi
NODE_ENTRY=$(yaml_get "${NODE_YAML}" "tailscale.hostname")
[[ -z "${NODE_ENTRY}" ]] && die "tailscale.hostname not set in ${NODE_YAML}"
TOPOLOGY="${REPO_ROOT}/inventory/topology.yaml"
SERVICES_YAML="${REPO_ROOT}/hosts/${NODE_ENTRY}/services.yaml"
# ── 1. inventory/topology.yaml ────────────────────────────────────────────────
step "[${STEP_NAME}] 1/3 inventory/topology.yaml"
_TOPOLOGY_BLOCK=$(cat << 'EOF'
PLACEHOLDER:
roles:
- edge
services:
- node-agent
EOF
)
# Replace the PLACEHOLDER with the actual node name
_TOPOLOGY_BLOCK="${_TOPOLOGY_BLOCK//PLACEHOLDER/${NODE_ENTRY}}"
if grep -q "^ ${NODE_ENTRY}:" "${TOPOLOGY}"; then
log "${NODE_ENTRY} already present in topology.yaml — skip"
else
if [ "${DRY_RUN:-0}" = 1 ]; then
dryrun "Would append to ${TOPOLOGY}:"
echo "${_TOPOLOGY_BLOCK}"
else
printf '%s\n' "${_TOPOLOGY_BLOCK}" >> "${TOPOLOGY}"
log "Appended ${NODE_ENTRY} block to topology.yaml"
fi
fi
# ── 2. hosts/<node>/services.yaml ────────────────────────────────────────────
step "[${STEP_NAME}] 2/3 hosts/${NODE_ENTRY}/services.yaml"
if [[ -f "${SERVICES_YAML}" ]]; then
log "services.yaml already exists — skip"
else
if [ "${DRY_RUN:-0}" = 1 ]; then
dryrun "Would create ${SERVICES_YAML}:"
cat << EOF
host: ${NODE_ENTRY}
services:
  node-agent:
    role: node-stability-monitor
    deployment_model: docker-compose
    exposure: local-only
    offline_required: true
    depends_on:
      local: []
      external: []
    runtime:
      config_path: /opt/homelab/config/node-agent
      data_path: /opt/homelab/state
      logs_path: /opt/homelab/events
EOF
else
mkdir -p "${REPO_ROOT}/hosts/${NODE_ENTRY}"
cat > "${SERVICES_YAML}" << EOF
host: ${NODE_ENTRY}
services:
  node-agent:
    role: node-stability-monitor
    deployment_model: docker-compose
    exposure: local-only
    offline_required: true
    depends_on:
      local: []
      external: []
    runtime:
      config_path: /opt/homelab/config/node-agent
      data_path: /opt/homelab/state
      logs_path: /opt/homelab/events
EOF
log "Created ${SERVICES_YAML}"
fi
fi
# ── 3. git commit ─────────────────────────────────────────────────────────────
step "[${STEP_NAME}] 3/3 git commit"
cd "${REPO_ROOT}"
_changed_files=()
git diff --quiet "${TOPOLOGY}" 2>/dev/null || _changed_files+=("inventory/topology.yaml")
# ls-files prints the path on success; silence stdout as well as stderr
[[ -f "${SERVICES_YAML}" ]] && \
git ls-files --error-unmatch "${SERVICES_YAML}" >/dev/null 2>&1 || \
_changed_files+=("hosts/${NODE_ENTRY}/services.yaml")
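# Re-check: is anything staged or unstaged for these two paths?
# The diff is scoped to them so unrelated working-tree changes do not force a commit.
_needs_commit=0
if git diff --quiet -- "${TOPOLOGY}" "${SERVICES_YAML}" && git diff --cached --quiet -- "${TOPOLOGY}" "${SERVICES_YAML}"; then
# Nothing changed for these paths; they may already be committed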
if git ls-files --error-unmatch "${TOPOLOGY}" "${SERVICES_YAML}" >/dev/null 2>&1 && \
! git diff HEAD -- "${TOPOLOGY}" "${SERVICES_YAML}" | grep -q .; then
log "Nothing to commit — ${NODE_ENTRY} already registered and committed"
else
_needs_commit=1
fi
else
_needs_commit=1
fi
if [[ "${_needs_commit}" -eq 1 ]]; then
run git add "inventory/topology.yaml" "hosts/${NODE_ENTRY}/services.yaml"
run git commit -m "feat(onboard): register ${NODE_ENTRY} in topology + services.yaml"
if [ "${DRY_RUN:-0}" != 1 ]; then
log "Committed on $(git branch --show-current)"
log "Next: agent.sh merge task/node-onboarding → master, git pull VPS, run 50-verify.sh"
fi
fi
log "[${STEP_NAME}] done"
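
The topology update is a guarded append: look for the node's key, and write the block only when it is absent. A self-contained sketch of that idea follows. `append_node_block` is a hypothetical helper, and the two-space indent and the `nodes:` parent key are assumptions about the layout of `topology.yaml`, which is not shown in this diff.

```shell
#!/usr/bin/env bash
# Sketch only: append_node_block is a hypothetical helper; the two-space indent
# and the "nodes:" parent key are assumptions about topology.yaml's layout.
append_node_block() {
  local file="$1" node="$2"
  # Guard: -F matches the name literally, -x requires the whole line to match.
  if grep -qxF "  ${node}:" "$file"; then
    echo "skip"
    return 0
  fi
  printf '  %s:\n    roles:\n      - edge\n    services:\n      - node-agent\n' "$node" >> "$file"
  echo "appended"
}

topology=$(mktemp)
printf 'nodes:\n' > "$topology"
append_node_block "$topology" lustro   # first call writes the block
append_node_block "$topology" lustro   # second call finds it and skips
cat "$topology"
rm -f "$topology"
```

Matching with `-F` avoids treating dots or other regex characters in a hostname as patterns, which a plain `grep -q "^ name:"` would do.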

View file

@ -1,160 +0,0 @@
#!/usr/bin/env bash
# scripts/onboard/steps/50-verify.sh — restart the observer + smoke-test the node in the panel
#
# Run AFTER: merging task/node-onboarding → master + git pull on the VPS.
#
# Checks:
# 1. SSH <node>: node-agent container running
# 2. SSH <node>: events present in /opt/homelab/events/<node>/
# 3. SSH VPS: docker restart control-plane-observer + poll observer.heartbeat
# 4. SSH VPS: <node> visible in /opt/homelab/world/nodes.json
#
# Exit 0: all checks OK | Exit 1: at least one FAIL (summary table)
set -euo pipefail
STEP_NAME="50-verify"
: "${REPO_ROOT:?REPO_ROOT is not set — run via onboard.sh}"
: "${NODE_YAML:?NODE_YAML is not set — run via onboard.sh}"
: "${DRY_RUN:=0}"
if ! declare -f log >/dev/null 2>&1; then
# shellcheck source=../lib/common.sh
source "${REPO_ROOT}/scripts/onboard/lib/common.sh"
fi
SSH_USER=$(yaml_get "${NODE_YAML}" "ssh_user")
TS_HOSTNAME=$(yaml_get "${NODE_YAML}" "tailscale.hostname")
[[ -z "${SSH_USER}" ]] && die "ssh_user not set in ${NODE_YAML}"
[[ -z "${TS_HOSTNAME}" ]] && die "tailscale.hostname not set in ${NODE_YAML}"
VPS_SSH_USER="oskar"
VPS_SSH_HOST="100.95.58.48"
VPS_REPO_PATH="/home/oskar/homelab-codex-ws"
_SSH_OPTS=(-o StrictHostKeyChecking=accept-new -o ConnectTimeout=10 -o BatchMode=yes)
_ssh_node() { ssh "${_SSH_OPTS[@]}" "${SSH_USER}@${TS_HOSTNAME}" -- "$@"; }
_ssh_vps() { ssh "${_SSH_OPTS[@]}" "${VPS_SSH_USER}@${VPS_SSH_HOST}" -- "$@"; }
declare -A RESULTS=()
# ── 1. node-agent running on <node> ──────────────────────────────────────────
step "[${STEP_NAME}] 1/4 ${TS_HOSTNAME}: node-agent container"
if [ "${DRY_RUN:-0}" = 1 ]; then
dryrun "ssh ${SSH_USER}@${TS_HOSTNAME} docker ps --filter name=^node-agent\$"
RESULTS["node-agent-running"]="skip"
elif _ssh_node "docker ps --filter name=^node-agent\$ --filter status=running --format '{{.Names}}'" 2>/dev/null \
| grep -q "node-agent"; then
log "OK: node-agent running"
_ssh_node "docker ps --filter name=node-agent --format 'table {{.Names}}\t{{.Status}}'" 2>/dev/null || true
RESULTS["node-agent-running"]="PASS"
else
warn "FAIL: node-agent is not running on ${TS_HOSTNAME}"
RESULTS["node-agent-running"]="FAIL"
fi
# ── 2. events in /opt/homelab/events/<node>/ ──────────────────────────────────
step "[${STEP_NAME}] 2/4 ${TS_HOSTNAME}: events"
if [ "${DRY_RUN:-0}" = 1 ]; then
dryrun "ssh ${SSH_USER}@${TS_HOSTNAME} find /opt/homelab/events/${TS_HOSTNAME}/ -name '*.json'"
RESULTS["events-present"]="skip"
elif _ssh_node "find /opt/homelab/events/${TS_HOSTNAME}/ -name '*.json' 2>/dev/null | head -1" 2>/dev/null \
| grep -q ".json"; then
_latest=$(_ssh_node "ls -t /opt/homelab/events/${TS_HOSTNAME}/*.json 2>/dev/null | head -1" || echo "?")
log "OK: events present (latest: ${_latest})"
RESULTS["events-present"]="PASS"
else
warn "FAIL: no events in /opt/homelab/events/${TS_HOSTNAME}/"
RESULTS["events-present"]="FAIL"
fi
# ── 3. restart the observer + healthcheck ─────────────────────────────────────
step "[${STEP_NAME}] 3/4 VPS: restart control-plane-observer"
if [ "${DRY_RUN:-0}" = 1 ]; then
dryrun "ssh ${VPS_SSH_USER}@${VPS_SSH_HOST} docker restart control-plane-observer"
dryrun "poll /opt/homelab/state/observer.heartbeat (max 30s)"
RESULTS["observer-healthy"]="skip"
else
log "Restarting control-plane-observer on the VPS..."
_ssh_vps "docker restart control-plane-observer"
log "Polling observer.heartbeat (max 30s)..."
_ok=0
for _i in $(seq 1 6); do
sleep 5
_age=$(_ssh_vps "python3 -c \
\"import os,time; s=os.stat('/opt/homelab/state/observer.heartbeat'); \
print(int(time.time()-s.st_mtime))\" 2>/dev/null" || echo "999")
if [[ "${_age}" -lt 20 ]]; then
log "OK: observer.heartbeat fresh (${_age}s ago)"
_ok=1
break
fi
log " ... ${_i}×5s, heartbeat ${_age}s old..."
done
if [[ "${_ok}" -eq 1 ]]; then
RESULTS["observer-healthy"]="PASS"
else
warn "FAIL: observer.heartbeat not refreshed after 30s"
warn "Check: ssh ${VPS_SSH_USER}@${VPS_SSH_HOST} docker logs control-plane-observer --tail 30"
RESULTS["observer-healthy"]="FAIL"
fi
fi
# ── 4. <node> visible in world/nodes.json ─────────────────────────────────────
step "[${STEP_NAME}] 4/4 VPS: ${TS_HOSTNAME} in world/nodes.json"
if [ "${DRY_RUN:-0}" = 1 ]; then
dryrun "ssh ${VPS_SSH_USER}@${VPS_SSH_HOST} python3 -c \"json.load(.../world/nodes.json)['${TS_HOSTNAME}']\""
RESULTS["world-state"]="skip"
else
_node_status=$(_ssh_vps "python3 -c \"
import json, sys
try:
d = json.load(open('/opt/homelab/world/nodes.json'))
node = d.get('${TS_HOSTNAME}', {})
print(node.get('status', 'missing'))
except Exception as e:
print('error:' + str(e))
\"" 2>/dev/null || echo "ssh-error")
case "${_node_status}" in
online|offline)
log "OK: ${TS_HOSTNAME} in world/nodes.json (status=${_node_status})"
RESULTS["world-state"]="PASS"
;;
missing)
warn "FAIL: ${TS_HOSTNAME} has no entry in world/nodes.json"
warn "Possible cause: the observer has not processed the events yet (wait 60s and retry)"
RESULTS["world-state"]="FAIL"
;;
*)
warn "FAIL: unexpected response: ${_node_status}"
RESULTS["world-state"]="FAIL"
;;
esac
fi
# ── summary table ─────────────────────────────────────────────────────────────
echo ""
printf '%s\n' "══════════════════════════════════════════"
printf " %-30s %s\n" "CHECK" "RESULT"
printf '%s\n' "──────────────────────────────────────────"
for _key in "node-agent-running" "events-present" "observer-healthy" "world-state"; do
_val="${RESULTS[${_key}]:-???}"
printf " %-30s %s\n" "${_key}" "${_val}"
done
printf '%s\n' "══════════════════════════════════════════"
echo ""
for _val in "${RESULTS[@]}"; do
[[ "${_val}" == "FAIL" ]] && { warn "Verify: at least one check did not pass"; exit 1; }
done
log "[${STEP_NAME}] Verify OK: ${TS_HOSTNAME} registered and visible in the panel"

View file

@ -5,24 +5,6 @@ Central runtime materializer and Operator Control Plane UI.
- **Redis**: Central state store (on PIHA).
- **Runtime Materializer**: Converts Redis state to JSON files in `/opt/homelab/world`.
- **Web UI**: Exposes API endpoints and serves the Operator UI.
- **Telegram Bot**: Provides operator commands and action approvals via Telegram.
#### Configuration
Environment variables should be set in `.env` (see `env.example`).
Key variables for the Telegram Bot:
- `TELEGRAM_BOT_TOKEN`: Your bot token from @BotFather.
- `TELEGRAM_ALLOWED_USER_IDS`: Comma-separated list of authorized Telegram User IDs.
- `CONTROL_PLANE_URL`: URL to the `agent-system-webui` (default: `http://webui:8080`).
#### Telegram Commands
- `/status`: Check bot and API connectivity.
- `/summary`: System health overview.
- `/nodes`: List homelab nodes and their status.
- `/services`: Summary of services across nodes.
- `/unhealthy`: List all unhealthy components.
- `/incidents`: View active incidents.
- `/actions`: Summary of operator actions.
- `/help`: List all commands.
#### Deployment (on PIHA)
```bash

View file

@ -1,52 +0,0 @@
### Action Approval Data Model
Actions are JSON files stored in `/opt/homelab/actions/{status}/{action_id}.json`.
#### Statuses
- `pending`: Waiting for operator approval. AI agents create actions in this state.
- `approved`: Approved by operator, ready for execution.
- `rejected`: Rejected by operator, will not be executed.
- `running`: Currently being executed by an agent (e.g. `materializer`).
- `completed`: Successfully executed.
- `failed`: Execution failed.
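Taken together with the Workflow section, these statuses behave like a small state machine. The sketch below is one possible reading of it, not code from the repository: `ALLOWED_TRANSITIONS` and `can_transition` are hypothetical names, and the transition set is inferred from the workflow steps rather than stated as a rule anywhere in this document.

```python
# Hypothetical transition table inferred from the Workflow section:
# pending -> approved/rejected, approved -> running, running -> completed/failed.
ALLOWED_TRANSITIONS = {
    "pending": {"approved", "rejected"},
    "approved": {"running"},
    "running": {"completed", "failed"},
    # Terminal statuses: no further transitions are assumed.
    "rejected": set(),
    "completed": set(),
    "failed": set(),
}


def can_transition(current: str, target: str) -> bool:
    """Return True if an action may move from `current` to `target`."""
    return target in ALLOWED_TRANSITIONS.get(current, set())


if __name__ == "__main__":
    print(can_transition("pending", "approved"))
    print(can_transition("pending", "running"))
```

A check like this could run before any file is moved between status directories, so that an unexpected status is rejected instead of silently creating a new directory.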
#### Human-in-the-Loop (HIL) Protocol
1. **Request**: Agent identifies a required change and writes a JSON to `actions/pending/`.
2. **Notification**: System notifies the human operator.
3. **Audit**: Human reviews `details.reason` and `details.diff`.
4. **Authorization**: Human moves file to `approved/`.
5. **Execution**: Agent monitors `approved/` and executes the task.
#### Schema
```json
{
"action_id": "string",
"service": "string",
"node": "string",
"type": "deploy_service | restart_service | rollback | scale",
"risk": "nominal | guarded | critical",
"status": "pending | approved | rejected | ...",
"created_at": <unix_seconds>,
"updated_at": <unix_seconds>,
"details": {
"image": "string",
"reason": "string",
"diff": "string"
},
"transition_history": [
{
"from": "string | null",
"to": "string",
"timestamp": <unix_seconds>,
"by": "string (system | operator-tg-12345 | webui)"
}
]
}
```
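As an illustration of the schema, the following sketch builds a conforming action and writes it to `actions/pending/`. The helper name `new_pending_action` and its parameters are assumptions made for the example; only the field names and the `pending/` directory come from the schema and workflow described here.

```python
import json
import time
from pathlib import Path


def new_pending_action(actions_root, action_id, service, node,
                       action_type, risk, details, by="system"):
    """Write a schema-shaped action into <actions_root>/pending/ and return its path.

    Hypothetical helper: the field names follow the schema above, everything
    else (name, signature, default `by`) is illustrative.
    """
    now = int(time.time())
    action = {
        "action_id": action_id,
        "service": service,
        "node": node,
        "type": action_type,   # e.g. "deploy_service", "restart_service"
        "risk": risk,          # "nominal", "guarded" or "critical"
        "status": "pending",
        "created_at": now,
        "updated_at": now,
        "details": details,    # expected keys: image, reason, diff
        "transition_history": [
            {"from": None, "to": "pending", "timestamp": now, "by": by}
        ],
    }
    pending_dir = Path(actions_root) / "pending"
    pending_dir.mkdir(parents=True, exist_ok=True)
    path = pending_dir / f"{action_id}.json"
    path.write_text(json.dumps(action, indent=2))
    return path
```

The first `transition_history` entry uses `null` as `from`, matching the `"string | null"` type in the schema.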
#### Workflow
1. A system component (e.g. `runtime-materializer` or a future analyzer) creates a file in `actions/pending/`.
2. `telegram-bot` detects the file, sends a message to allowed users.
3. Operator clicks "Approve" or "Reject".
4. `telegram-bot` moves the file to `actions/approved/` or `actions/rejected/` atomically, appending a transition to `transition_history`.
5. The responsible agent (e.g. `stability-agent` on the target node) picks up the `approved` action, moves it to `running`, executes it, and finally moves it to `completed` or `failed`.
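Step 4 depends on the file never being visible in a half-written state. One way to approximate that in Python is to write the updated JSON to a temporary file in the destination directory and then `os.replace` it into place. This is a sketch under assumptions, not the bot's actual implementation: `move_action` and its parameters are hypothetical names, and `os.replace` is only atomic when source and destination are on the same filesystem.

```python
import json
import os
import tempfile
import time
from pathlib import Path


def move_action(actions_root, action_id, target_status, by):
    """Move a pending action to <target_status>/ and record the transition."""
    root = Path(actions_root)
    source = root / "pending" / f"{action_id}.json"
    target_dir = root / target_status
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / source.name

    data = json.loads(source.read_text())
    now = int(time.time())
    # Record the transition before overwriting the status field.
    data.setdefault("transition_history", []).append(
        {"from": data.get("status"), "to": target_status,
         "timestamp": now, "by": by}
    )
    data["status"] = target_status
    data["updated_at"] = now

    # Write to a temp file next to the destination; the ".tmp" suffix keeps it
    # out of any "*.json" scan until it is complete.
    fd, tmp_name = tempfile.mkstemp(dir=str(target_dir), suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(data, f, indent=2)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_name, target)  # atomic rename on the same filesystem
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise

    source.unlink()
    return target
```

With this approach the action briefly exists in both `pending/` and the target directory, between the replace and the unlink. A reader that also checks the `status` field would see the pending copy as stale during that window.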

View file

@ -8,13 +8,7 @@ echo ">>> Building and starting Agent System services..."
docker compose up -d --build
echo ">>> Services status:"
docker ps --filter "name=agent-system" --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"
if [ -z "$TELEGRAM_BOT_TOKEN" ]; then
echo ">>> Telegram bot status: DISABLED (token missing)"
else
echo ">>> Telegram bot status: ENABLED"
fi
docker ps --filter "name=agent-system"
echo ">>> Verifying API endpoints..."
sleep 5 # Give it a moment to start

View file

@ -31,17 +31,3 @@ services:
depends_on:
- redis
restart: unless-stopped
telegram-bot:
build: ./telegram-bot
container_name: agent-system-telegram-bot
environment:
TELEGRAM_BOT_TOKEN: ${TELEGRAM_BOT_TOKEN}
TELEGRAM_ALLOWED_USER_IDS: ${TELEGRAM_ALLOWED_USER_IDS}
CONTROL_PLANE_URL: ${CONTROL_PLANE_URL:-http://webui:8080}
ENABLE_LLM_FALLBACK: ${ENABLE_LLM_FALLBACK:-false}
OPENCLAW_BASE_URL: ${OPENCLAW_BASE_URL}
ACTIONS_ROOT: /opt/homelab/actions
volumes:
- /opt/homelab:/opt/homelab
restart: on-failure

View file

@ -1,19 +0,0 @@
# Telegram Bot Configuration
# Get token from @BotFather
TELEGRAM_BOT_TOKEN=123456789:ABCdefGHIjklMNOpqrsTUVwxyz
# Comma-separated list of Telegram User IDs
TELEGRAM_ALLOWED_USER_IDS=12345678,87654321
# Local control-plane API (default is internal compose address)
CONTROL_PLANE_URL=http://webui:8080
# Optional LLM fallback logic
ENABLE_LLM_FALLBACK=false
OPENCLAW_BASE_URL=http://openclaw.internal
# Runtime Materializer Configuration
REDIS_HOST=100.108.208.3
REDIS_PORT=6379
# Paths
HOMELAB_ROOT=/opt/homelab
ACTIONS_ROOT=/opt/homelab/actions
WORLD_DIR=/opt/homelab/world

View file

@ -3,8 +3,6 @@ import json
import os
import time
import argparse
import urllib.request
import urllib.error
from datetime import datetime
# Configuration from environment variables
@ -12,15 +10,6 @@ REDIS_HOST = os.environ.get("REDIS_HOST", "redis")
REDIS_PORT = int(os.environ.get("REDIS_PORT", 6379))
WORLD_DIR = os.environ.get("WORLD_DIR", "/opt/homelab/world")
# When set, materialize from the control-plane HTTP API instead of Redis.
# This is the authoritative source of truth: the observer writes clean world
# state to the control-plane API, which the materializer mirrors locally so
# the webui's /snapshot (and all other endpoints) reflect the same data.
#
# Example: CONTROL_PLANE_URL=http://100.95.58.48:18180
CONTROL_PLANE_URL = os.environ.get("CONTROL_PLANE_URL", "").rstrip("/")
def get_redis_client():
"""Returns a Redis client with decoding enabled."""
return redis.Redis(
@ -52,61 +41,6 @@ def normalize_health(health):
return "degraded"
return "error"
def _fetch_json(url):
"""Fetch JSON from a URL, returning parsed data or None on error."""
try:
with urllib.request.urlopen(url, timeout=10) as resp:
return json.loads(resp.read())
except Exception as e:
print(f"[{datetime.now().isoformat()}] Error fetching {url}: {e}")
return None
def write_json(filename, data):
path = os.path.join(WORLD_DIR, filename)
with open(path, "w") as f:
json.dump(data, f, indent=2)
def materialize_from_api():
"""Mirror world state from the control-plane API to local world files.
The control-plane observer on VPS is the single authoritative writer of
world state. By fetching from its HTTP API we get the same clean, pruned
data that the /summary endpoint serves, with no stale Redis artefacts.
Returns True if all fetches succeeded and files were written, False otherwise.
"""
print(f"[{datetime.now().isoformat()}] Materializing from control-plane API: {CONTROL_PLANE_URL}")
endpoints = {
"nodes.json": f"{CONTROL_PLANE_URL}/nodes",
"services.json": f"{CONTROL_PLANE_URL}/services",
"incidents.json": f"{CONTROL_PLANE_URL}/incidents",
"deployments.json": f"{CONTROL_PLANE_URL}/deployments",
"recommendations.json":f"{CONTROL_PLANE_URL}/recommendations",
"runtime-summary.json":f"{CONTROL_PLANE_URL}/summary",
"events.json": f"{CONTROL_PLANE_URL}/events",
}
fetched = {}
for filename, url in endpoints.items():
data = _fetch_json(url)
if data is None:
print(f"[{datetime.now().isoformat()}] Aborting: failed to fetch {url}")
return False
fetched[filename] = data
os.makedirs(WORLD_DIR, exist_ok=True)
for filename, data in fetched.items():
write_json(filename, data)
svc_count = len(fetched.get("services.json") or [])
print(f"[{datetime.now().isoformat()}] Materialized from API: {svc_count} services → {WORLD_DIR}")
return True
def materialize():
"""Reads state from Redis and writes JSON files to the world directory."""
print(f"[{datetime.now().isoformat()}] Materializing world state...")
@ -212,6 +146,11 @@ def materialize():
# Ensure directory exists
os.makedirs(WORLD_DIR, exist_ok=True)
def write_json(filename, data):
path = os.path.join(WORLD_DIR, filename)
with open(path, "w") as f:
json.dump(data, f, indent=2)
write_json("runtime-summary.json", summary)
write_json("nodes.json", nodes)
write_json("services.json", services)
@ -233,19 +172,10 @@ if __name__ == "__main__":
parser.add_argument("--interval", type=int, default=30, help="Sleep interval between runs (seconds)")
args = parser.parse_args()
if CONTROL_PLANE_URL:
print(f"Mode: control-plane API ({CONTROL_PLANE_URL})")
run_fn = materialize_from_api
else:
print(f"Mode: Redis ({REDIS_HOST}:{REDIS_PORT})")
run_fn = materialize
interval = int(os.environ.get("MATERIALIZE_INTERVAL", args.interval))
if args.once:
run_fn()
materialize()
else:
print(f"Starting materializer loop (interval: {interval}s)...")
print(f"Starting materializer loop (interval: {args.interval}s)...")
while True:
run_fn()
time.sleep(interval)
materialize()
time.sleep(args.interval)

View file

@ -1,39 +0,0 @@
#!/bin/bash
# Script to create a test pending action for Telegram bot verification.
ACTIONS_PENDING_DIR=${ACTIONS_ROOT:-/opt/homelab/actions}/pending
mkdir -p "$ACTIONS_PENDING_DIR"
ACTION_ID="test-$(date +%s)"
FILE_PATH="$ACTIONS_PENDING_DIR/$ACTION_ID.json"
TIMESTAMP=$(date +%s)
cat <<EOF > "$FILE_PATH"
{
"action_id": "$ACTION_ID",
"service": "frigate",
"node": "chelsty",
"type": "deploy_service",
"risk": "guarded",
"status": "pending",
"created_at": $TIMESTAMP,
"updated_at": $TIMESTAMP,
"details": {
"image": "blakeblackshear/frigate:0.13.0",
"reason": "Security update for Frigate",
"diff": "image: blakeblackshear/frigate:0.12.0 -> 0.13.0"
},
"transition_history": [
{
"from": null,
"to": "pending",
"timestamp": $TIMESTAMP,
"by": "system-test"
}
]
}
EOF
echo "Test action created: $FILE_PATH"
echo "If the telegram-bot is running and configured, you should receive a notification."

View file

@ -1,10 +0,0 @@
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY bot.py .
CMD ["python", "bot.py"]

View file

@ -1,454 +0,0 @@
import os
import json
import time
import asyncio
import logging
import urllib.request
import urllib.error
from pathlib import Path
from telegram import Update, InlineKeyboardButton, InlineKeyboardMarkup
from telegram.ext import ApplicationBuilder, ContextTypes, CommandHandler, CallbackQueryHandler, MessageHandler, filters
# Setup logging
logging.basicConfig(
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
level=logging.INFO
)
logger = logging.getLogger(__name__)
# Configuration
TOKEN = os.getenv("TELEGRAM_BOT_TOKEN")
ALLOWED_IDS = [int(i.strip()) for i in os.getenv("TELEGRAM_ALLOWED_USER_IDS", "").split(",") if i.strip()]
ACTIONS_ROOT = Path(os.getenv("ACTIONS_ROOT", "/opt/homelab/actions"))
CONTROL_PLANE_URL = os.getenv("CONTROL_PLANE_URL", "http://webui:8080")
ENABLE_LLM_FALLBACK = os.getenv("ENABLE_LLM_FALLBACK", "false").lower() == "true"
OPENCLAW_BASE_URL = os.getenv("OPENCLAW_BASE_URL")
async def fetch_api(path):
"""Helper to fetch JSON from the Control Plane API."""
url = f"{CONTROL_PLANE_URL.rstrip('/')}/{path.lstrip('/')}"
try:
def do_request():
req = urllib.request.Request(url)
with urllib.request.urlopen(req, timeout=5) as response:
if response.status != 200:
return None
return json.loads(response.read().decode())
return await asyncio.to_thread(do_request)
except Exception as e:
logger.error(f"Error fetching {url}: {e}")
return None
async def post_api(path, data):
"""Helper to POST JSON to the Control Plane API."""
url = f"{CONTROL_PLANE_URL.rstrip('/')}/{path.lstrip('/')}"
try:
body = json.dumps(data).encode("utf-8")
def do_request():
req = urllib.request.Request(url, data=body, method="POST")
req.add_header("Content-Type", "application/json")
with urllib.request.urlopen(req, timeout=5) as response:
return response.status == 200
return await asyncio.to_thread(do_request)
except Exception as e:
logger.error(f"Error posting to {url}: {e}")
return False
def _format_pending_action(action_id: str, data: dict) -> str:
"""Build the Telegram Markdown message for a pending action notification.
Extracted so it can be unit-tested without a live Telegram connection.
"""
# Supervisor writes risk_level; action-model.md legacy schema used risk.
risk = data.get("risk_level") or data.get("risk", "unknown")
message = (
f"⚠️ *Pending Action*\n"
f"ID: `{action_id}`\n"
f"Type: `{data.get('type', 'unknown')}`\n"
f"Service: `{data.get('service', 'unknown')}`\n"
f"Node: `{data.get('node', 'unknown')}`\n"
f"Risk: *{risk}*\n"
)
# description carries the human-readable substance of the action (required for
# alert_only actions where it is the entire operator-visible message).
description = data.get("description", "")
if description:
truncated = description[:300] + ("..." if len(description) > 300 else "")
message += f"Description: `{truncated}`\n"
# Legacy details block (old action-model.md schema) — kept for backwards compat.
if "details" in data:
details_str = json.dumps(data["details"], indent=2)
if len(details_str) > 1000:
details_str = details_str[:1000] + "..."
message += f"\nDetails:\n```json\n{details_str}\n```"
return message
class ApprovalBot:
def __init__(self):
self.pending_dir = ACTIONS_ROOT / "pending"
self.approved_dir = ACTIONS_ROOT / "approved"
self.rejected_dir = ACTIONS_ROOT / "rejected"
# Track which action IDs we have already notified in this session to avoid spam
self.notified_actions = set()
async def check_pending_actions(self, context: ContextTypes.DEFAULT_TYPE):
"""Job that periodically checks for new pending action files."""
if not self.pending_dir.exists():
return
try:
for action_file in self.pending_dir.glob("*.json"):
action_id = action_file.stem
if action_id in self.notified_actions:
continue
try:
data = json.loads(action_file.read_text())
# Only notify if it's truly pending
if data.get("status") == "pending":
await self.notify_users(context, action_id, data)
self.notified_actions.add(action_id)
except Exception as e:
logger.error(f"Error processing action file {action_file}: {e}")
except Exception as e:
logger.error(f"Error scanning pending directory: {e}")
async def notify_users(self, context: ContextTypes.DEFAULT_TYPE, action_id: str, data: dict):
"""Sends an approval request message to all allowed users."""
message = _format_pending_action(action_id, data)
keyboard = [
[
InlineKeyboardButton("✅ Approve", callback_data=f"approve:{action_id}"),
InlineKeyboardButton("❌ Reject", callback_data=f"reject:{action_id}"),
]
]
reply_markup = InlineKeyboardMarkup(keyboard)
for user_id in ALLOWED_IDS:
try:
await context.bot.send_message(
chat_id=user_id,
text=message,
parse_mode="Markdown",
reply_markup=reply_markup
)
logger.info(f"Notified user {user_id} about action {action_id}")
except Exception as e:
logger.error(f"Failed to notify user {user_id}: {e}")
async def handle_callback(self, update: Update, context: ContextTypes.DEFAULT_TYPE):
"""Handles button clicks for Approve/Reject."""
query = update.callback_query
user_id = query.from_user.id
if user_id not in ALLOWED_IDS:
await query.answer("Unauthorized", show_alert=True)
return
await query.answer()
cb_data = query.data
if ":" not in cb_data:
return
action, action_id = cb_data.split(":", 1)
target_status = "approved" if action == "approve" else "rejected"
# Use the API for the mutation if available; fall back to a local disk move
success = await post_api("/action/mutate", {"id": action_id, "status": target_status})
msg = "Success" if success else "API call failed"
if not success:
# Fallback to direct disk manipulation (original behavior)
success, msg = self.move_action(action_id, target_status, user_id, query.from_user.username or str(user_id))
if success:
status_text = "✅ Approved" if target_status == "approved" else "❌ Rejected"
await query.edit_message_text(
text=query.message.text + f"\n\n{status_text} by {query.from_user.first_name}",
parse_mode="Markdown"
)
# Remove from notified list as it's no longer pending
if action_id in self.notified_actions:
self.notified_actions.remove(action_id)
else:
await query.message.reply_text(f"Failed to process action {action_id}: {msg}")
def move_action(self, action_id, target_status, user_id, username):
"""Moves action file and updates its status and history."""
source_path = self.pending_dir / f"{action_id}.json"
if not source_path.exists():
return False, "Action file no longer exists in pending."
target_dir = self.approved_dir if target_status == "approved" else self.rejected_dir
target_dir.mkdir(parents=True, exist_ok=True)
target_path = target_dir / f"{action_id}.json"
try:
data = json.loads(source_path.read_text())
current_status = data.get("status", "pending")
# Update data
data["status"] = target_status
data["updated_at"] = time.time()
history = data.get("transition_history", [])
history.append({
"from": current_status,
"to": target_status,
"timestamp": time.time(),
"by": f"tg:{username}"
})
data["transition_history"] = history
# Not a true atomic move: write to the new location, then delete the old file
target_path.write_text(json.dumps(data, indent=2))
source_path.unlink()
logger.info(f"Action {action_id} moved from {current_status} to {target_status} by {username}")
return True, "Success"
except Exception as e:
logger.error(f"Error moving action file: {e}")
return False, str(e)
async def start_command(update: Update, context: ContextTypes.DEFAULT_TYPE):
"""Simple start command to help users find their ID."""
user = update.effective_user
message = (
f"Hello {user.first_name}! 🤖\n"
f"Your Telegram User ID is: `{user.id}`\n\n"
)
if user.id in ALLOWED_IDS:
message += "✅ You are authorized to manage the homelab.\n\n"
message += "Use /help to see available commands."
else:
message += "❌ You are NOT authorized. Add your ID to `TELEGRAM_ALLOWED_USER_IDS`."
await update.message.reply_text(message, parse_mode="Markdown")
async def status_command(update: Update, context: ContextTypes.DEFAULT_TYPE):
if update.effective_user.id not in ALLOWED_IDS: return
res = await fetch_api("/summary")
status = "✅ Online" if res else "❌ Unreachable"
message = (
f"🤖 *Telegram Bot Status*\n"
f"Control Plane API: {status}\n"
f"Target URL: `{CONTROL_PLANE_URL}`\n"
)
await update.message.reply_text(message, parse_mode="Markdown")
async def summary_command(update: Update, context: ContextTypes.DEFAULT_TYPE):
if update.effective_user.id not in ALLOWED_IDS: return
data = await fetch_api("/summary")
if not data:
await update.message.reply_text("❌ Failed to fetch summary from Control Plane.")
return
msg = "📊 *System Summary*\n"
msg += f"Status: `{data.get('status', 'unknown')}`\n"
msg += f"Nodes: {data.get('node_count', 0)}\n"
msg += f"Services: {data.get('service_count', 0)}\n"
msg += f"Active Incidents: {data.get('active_incidents_count', 0)}\n"
if data.get('stale'):
msg += "\n⚠️ *Warning: Data is stale!*"
await update.message.reply_text(msg, parse_mode="Markdown")
async def nodes_command(update: Update, context: ContextTypes.DEFAULT_TYPE):
if update.effective_user.id not in ALLOWED_IDS: return
nodes = await fetch_api("/nodes")
if nodes is None:
await update.message.reply_text("❌ Failed to fetch nodes.")
return
if not nodes:
await update.message.reply_text("No nodes discovered in the fleet.")
return
msg = "🖥️ *Nodes Status*\n"
for node in nodes:
health_icon = "" if node.get('health') == 'nominal' else "⚠️" if node.get('health') == 'degraded' else ""
msg += f"{health_icon} *{node.get('hostname')}*: `{node.get('status', 'unknown')}`\n"
msg += f" Last seen: {node.get('last_seen', 'N/A')}\n"
await update.message.reply_text(msg, parse_mode="Markdown")
async def services_command(update: Update, context: ContextTypes.DEFAULT_TYPE):
if update.effective_user.id not in ALLOWED_IDS: return
services = await fetch_api("/services")
if services is None:
await update.message.reply_text("❌ Failed to fetch services.")
return
# Summarize by node
nodes = {}
for s in services:
node = s.get("node", "unknown")
if node not in nodes: nodes[node] = []
nodes[node].append(s)
msg = "⚙️ *Services Summary*\n"
if not nodes:
msg += "No services discovered."
else:
for node, svc_list in sorted(nodes.items()):
nominal = len([s for s in svc_list if s.get("health") == "nominal"])
msg += f"• *{node}*: {nominal}/{len(svc_list)} nominal\n"
msg += "\nUse /unhealthy to see issues."
await update.message.reply_text(msg, parse_mode="Markdown")
async def unhealthy_command(update: Update, context: ContextTypes.DEFAULT_TYPE):
if update.effective_user.id not in ALLOWED_IDS: return
services = await fetch_api("/services")
nodes = await fetch_api("/nodes")
msg = "⚠️ *Unhealthy Components*\n"
found = False
if services:
for s in services:
health = s.get("health", "").lower()
if health != "nominal":
msg += f"• Service *{s.get('name')}* on *{s.get('node')}*: `{health}`\n"
found = True
if nodes:
for n in nodes:
checks = n.get("checks", {})
if isinstance(checks, str):
try: checks = json.loads(checks)
except json.JSONDecodeError: checks = {}
docker = checks.get("docker", {})
if docker.get("status") == "ok":
for c in docker.get("containers", []):
if c.get("state") != "running":
msg += f"• Container *{c.get('name')}* on *{n.get('hostname')}*: `{c.get('state')}`\n"
found = True
if not found:
msg += "All systems nominal. ✅"
await update.message.reply_text(msg, parse_mode="Markdown")
async def incidents_command(update: Update, context: ContextTypes.DEFAULT_TYPE):
if update.effective_user.id not in ALLOWED_IDS: return
incidents = await fetch_api("/incidents")
if incidents is None:
await update.message.reply_text("❌ Failed to fetch incidents.")
return
active = [i for i in incidents if i.get("status") not in ("resolved", "closed")]
if not active:
await update.message.reply_text("No active incidents. ✅")
return
msg = "🚨 *Active Incidents*\n"
for inc in active:
severity = inc.get('severity', 'info').upper()
msg += f"• [{severity}] *{inc.get('type')}*: {inc.get('message')}\n"
await update.message.reply_text(msg, parse_mode="Markdown")
async def actions_command(update: Update, context: ContextTypes.DEFAULT_TYPE):
if update.effective_user.id not in ALLOWED_IDS: return
actions = await fetch_api("/actions")
if actions is None:
await update.message.reply_text("❌ Actions endpoint unavailable.")
return
msg = "⚡ *Actions Summary*\n"
total = 0
for status, act_list in actions.items():
if act_list:
msg += f"{status.capitalize()}: {len(act_list)}\n"
total += len(act_list)
if total == 0:
msg = "No actions recorded."
await update.message.reply_text(msg, parse_mode="Markdown")
async def help_command(update: Update, context: ContextTypes.DEFAULT_TYPE):
msg = (
"📖 *Supported Commands*\n\n"
"/status - Check bot and API connectivity\n"
"/summary - System health overview\n"
"/nodes - List homelab nodes and their status\n"
"/services - Summary of services across nodes\n"
"/unhealthy - List all unhealthy components\n"
"/incidents - View active incidents\n"
"/actions - Summary of operator actions\n"
"/help - Show this help message\n\n"
"Free text will be handled by the guidance system."
)
await update.message.reply_text(msg, parse_mode="Markdown")
async def handle_fallback(update: Update, context: ContextTypes.DEFAULT_TYPE):
"""Handles non-command messages."""
if update.effective_user.id not in ALLOWED_IDS: return
if ENABLE_LLM_FALLBACK and OPENCLAW_BASE_URL:
# Placeholder for OpenClaw LLM fallback
# In a real scenario, this would call the LLM API
logger.info(f"LLM fallback requested for: {update.message.text}")
await update.message.reply_text(
"Use /summary, /nodes, /services, /unhealthy, /incidents, /actions."
)
async def run_bot():
if not TOKEN:
print("CRITICAL: TELEGRAM_BOT_TOKEN is not set. Telegram bot will not start.")
# The requirement is that a missing token must not fail the stack: the bot stays
# disabled and exits cleanly instead of crashing.
return
bot_logic = ApprovalBot()
application = ApplicationBuilder().token(TOKEN).build()
application.add_handler(CommandHandler("start", start_command))
application.add_handler(CommandHandler("status", status_command))
application.add_handler(CommandHandler("summary", summary_command))
application.add_handler(CommandHandler("nodes", nodes_command))
application.add_handler(CommandHandler("services", services_command))
application.add_handler(CommandHandler("unhealthy", unhealthy_command))
application.add_handler(CommandHandler("incidents", incidents_command))
application.add_handler(CommandHandler("actions", actions_command))
application.add_handler(CommandHandler("help", help_command))
application.add_handler(MessageHandler(filters.TEXT & (~filters.COMMAND), handle_fallback))
application.add_handler(CallbackQueryHandler(bot_logic.handle_callback))
# Schedule the pending actions check
job_queue = application.job_queue
if job_queue:
job_queue.run_repeating(bot_logic.check_pending_actions, interval=10, first=5)
else:
logger.warning("JobQueue is not available. Periodic pending actions check will be skipped.")
logger.info("Starting Telegram Approval Bot...")
await application.initialize()
await application.start()
await application.updater.start_polling()
# Run until the application is stopped
stop_event = asyncio.Event()
try:
await stop_event.wait()
except (KeyboardInterrupt, SystemExit):
logger.info("Stopping bot...")
finally:
await application.stop()
await application.shutdown()
if __name__ == "__main__":
try:
asyncio.run(run_bot())
except KeyboardInterrupt:
pass
except Exception as e:
logger.error(f"Fatal error: {e}")

View file

@ -1 +0,0 @@
python-telegram-bot[job-queue]==20.7

View file

@ -1,38 +0,0 @@
"""Stub telegram before bot.py is imported so pytest doesn't need the real package."""
from __future__ import annotations
import sys
import types
from unittest.mock import MagicMock
def _make_telegram_stub() -> types.ModuleType:
mod = types.ModuleType("telegram")
mod.Update = MagicMock
mod.InlineKeyboardButton = MagicMock
mod.InlineKeyboardMarkup = MagicMock
return mod
def _make_telegram_ext_stub() -> types.ModuleType:
mod = types.ModuleType("telegram.ext")
mod.ApplicationBuilder = MagicMock
# ContextTypes.DEFAULT_TYPE is referenced as a type annotation at class-body
# evaluation time, so it must be a real attribute, not a dynamic MagicMock attr.
ContextTypesMock = MagicMock()
ContextTypesMock.DEFAULT_TYPE = type(None)
mod.ContextTypes = ContextTypesMock
mod.CommandHandler = MagicMock
mod.CallbackQueryHandler = MagicMock
mod.MessageHandler = MagicMock
mod.filters = MagicMock()
return mod
# Insert before any import of bot.py
if "telegram" not in sys.modules:
sys.modules["telegram"] = _make_telegram_stub()
if "telegram.ext" not in sys.modules:
sys.modules["telegram.ext"] = _make_telegram_ext_stub()

View file

@ -1,116 +0,0 @@
"""Tests for _format_pending_action — no Telegram connection required.
telegram stubs are set up in conftest.py before this module is imported.
"""
from __future__ import annotations
import sys
from pathlib import Path
import pytest
sys.path.insert(0, str(Path(__file__).parent.parent))
from bot import _format_pending_action
# ---------------------------------------------------------------------------
# Bug 1 — risk_level field
# ---------------------------------------------------------------------------
def test_risk_level_shown_when_present():
data = {
"type": "container_restart", "service": "homeassistant",
"node": "chelsty-ha", "risk_level": "low",
}
msg = _format_pending_action("container-restart-chelsty-ha-homeassistant", data)
assert "Risk: *low*" in msg
assert "unknown" not in msg
def test_risk_falls_back_to_legacy_risk_key():
data = {
"type": "redeploy", "service": "mosquitto",
"node": "chelsty-infra", "risk": "guarded",
}
msg = _format_pending_action("redeploy-chelsty-infra-mosquitto", data)
assert "Risk: *guarded*" in msg
def test_risk_unknown_when_both_absent():
data = {"type": "redeploy", "service": "foo", "node": "bar"}
msg = _format_pending_action("redeploy-bar-foo", data)
assert "Risk: *unknown*" in msg
# ---------------------------------------------------------------------------
# Bug 2 — description field
# ---------------------------------------------------------------------------
def test_description_shown_for_alert_only():
data = {
"type": "alert_only", "service": "homeassistant",
"node": "chelsty-ha", "risk_level": "info",
"description": "3 entities unavailable for >1h",
}
msg = _format_pending_action("alert-ha-entity-unavailable-chelsty-ha", data)
assert "3 entities unavailable for >1h" in msg
assert "Description:" in msg
def test_description_shown_for_container_restart():
data = {
"type": "container_restart", "service": "homeassistant",
"node": "chelsty-ha", "risk_level": "low",
"description": "Restart 'homeassistant' on chelsty-ha: HA WebSocket unresponsive",
}
msg = _format_pending_action("container-restart-chelsty-ha-homeassistant", data)
assert "HA WebSocket unresponsive" in msg
def test_description_absent_no_crash():
data = {"type": "redeploy", "service": "foo", "node": "bar", "risk_level": "guarded"}
msg = _format_pending_action("redeploy-bar-foo", data)
assert "Description:" not in msg
assert "Risk: *guarded*" in msg
def test_description_truncated_at_300_chars():
long_desc = "x" * 400
data = {
"type": "alert_only", "service": "homeassistant",
"node": "chelsty-ha", "risk_level": "info",
"description": long_desc,
}
msg = _format_pending_action("alert-ha-foo-chelsty-ha", data)
assert "x" * 300 in msg
assert "..." in msg
assert "x" * 301 not in msg
# ---------------------------------------------------------------------------
# Combined — real HA alert_only action shape
# ---------------------------------------------------------------------------
def test_ha_alert_only_full_action():
    """Mirrors an actual alert_only action written by supervisor._generate_ha_alert_only."""
    data = {
        "action_id": "alert-ha-entity-unavailable-chelsty-ha",
        "type": "alert_only",
        "node": "chelsty-ha",
        "service": "homeassistant",
        "risk_level": "info",
        "confidence": 1.0,
        "description": "3 entities unavailable for >1h: sensor.power, binary_sensor.window",
        "status": "pending",
        "payload": {
            "location_tag": "chelsty",
            "reason": "ha_entity_unavailable_long",
            "count": 3,
        },
    }
    msg = _format_pending_action(data["action_id"], data)
    assert "alert_only" in msg
    assert "chelsty-ha" in msg
    assert "Risk: *info*" in msg
    assert "3 entities unavailable" in msg
    assert "unknown" not in msg
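
The tests above pin down the observable behaviour of the formatter: a `Risk: *<level>*` line, a `Description:` line only when a description is present, and truncation to 300 characters followed by `...`. The real `_format_pending_action` is not shown in this diff, so the following is only a minimal sketch of a formatter that would satisfy those assertions. The name `format_pending_action_sketch`, the constant `MAX_DESCRIPTION_CHARS`, and the exact line layout are assumptions made for illustration.

```python
# Hypothetical stand-in for _format_pending_action; not the project's code.
# The 300-character limit is the one implied by test_description_truncated_at_300_chars.
MAX_DESCRIPTION_CHARS = 300


def format_pending_action_sketch(action_id, data):
    """Build a short text message describing a pending action."""
    lines = [
        f"Pending action: `{action_id}`",
        f"Type: {data.get('type', 'unknown')}",
        f"Target: {data.get('service', 'unknown')} on {data.get('node', 'unknown')}",
        f"Risk: *{data.get('risk_level', 'unknown')}*",
    ]

    description = data.get("description")
    if description:
        # Cut long descriptions and mark the cut with an ellipsis.
        if len(description) > MAX_DESCRIPTION_CHARS:
            description = description[:MAX_DESCRIPTION_CHARS] + "..."
        lines.append(f"Description: {description}")

    return "\n".join(lines)
```

Falling back to `unknown` only when a field is missing keeps the word out of messages for fully populated actions, which is what `test_ha_alert_only_full_action` checks.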

View file

@ -277,9 +277,8 @@
<option value="maintenance">MAINTENANCE</option>
</select>
</div>
<div class="header-actions" style="display:flex; gap:8px; align-items:center">
<div class="header-actions">
<button onclick="refreshData()">Refresh</button>
<button id="copy-ai-btn" onclick="copyForAI()">Copy for AI</button>
</div>
</header>
@ -692,73 +691,6 @@
}
}
async function copyForAI() {
    const btn = document.getElementById('copy-ai-btn');
    const original = btn.textContent;
    btn.textContent = 'Copying...';
    btn.disabled = true;
    try {
        const snap = await fetchData('/snapshot');
        if (!snap) throw new Error('snapshot fetch failed');
        const now = new Date(snap.timestamp);
        const dateStr = now.toISOString().slice(0, 16).replace('T', ' ');
        const lines = [];
        lines.push(`=== HOMELAB SNAPSHOT ${dateStr} ===`);
        if (snap.nodes && snap.nodes.length > 0) {
            lines.push('NODES: ' + snap.nodes.map(n =>
                `${(n.hostname || n.id || '?').toUpperCase()} ${(n.health || 'unknown').toUpperCase()}`
            ).join(', '));
        } else {
            lines.push('NODES: none');
        }
        if (snap.non_nominal_services && snap.non_nominal_services.length > 0) {
            lines.push('ERRORS: ' + snap.non_nominal_services.map(s =>
                `${s.name} (${s.node}) - ${s.health}`
            ).join(', '));
        } else {
            lines.push(`ERRORS: none (${snap.nominal_service_count} nominal)`);
        }
        const activeIncidents = (snap.incidents || []).filter(i => !['resolved', 'closed'].includes(i.status));
        if (activeIncidents.length > 0) {
            lines.push('INCIDENTS: ' + activeIncidents.map(i =>
                `[${i.severity}] ${i.message} (${i.node})`
            ).join('; '));
        } else {
            lines.push('INCIDENTS: none');
        }
        if (snap.events && snap.events.length > 0) {
            lines.push(`EVENTS (last ${snap.events.length}):`);
            snap.events.forEach(ev => {
                const ts = ev.timestamp
                    ? new Date(ev.timestamp * 1000).toISOString().slice(11, 19)
                    : '?';
                const svc = ev.service ? '/' + ev.service : '';
                lines.push(` ${ts} [${ev.severity || ev.level || '?'}] ${ev.type} - ${ev.message || ''} (${ev.node || ''}${svc})`);
            });
        } else {
            lines.push('EVENTS (last 10): none');
        }
        const s = snap.summary || {};
        lines.push(`SUMMARY: status=${s.status || '?'} nodes=${s.node_count ?? '?'} services=${s.service_count ?? '?'} incidents=${s.incident_count ?? '?'}`);
        await navigator.clipboard.writeText(lines.join('\n'));
        btn.textContent = 'Copied!';
        setTimeout(() => { btn.textContent = original; btn.disabled = false; }, 2000);
    } catch (e) {
        console.error('copyForAI error:', e);
        btn.textContent = 'Error';
        setTimeout(() => { btn.textContent = original; btn.disabled = false; }, 2000);
    }
}

// Initial load
refreshData();
// Poll for updates

View file

@ -1,7 +1,6 @@
import json
import os
import time
from datetime import datetime, timezone
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from pathlib import Path
@ -68,22 +67,12 @@ def current_recommendations():
def current_summary():
path = WORLD_DIR / "runtime-summary.json"
summary = read_json_file(path, default={})
summary = read_json_file(WORLD_DIR / "runtime-summary.json", default={})
if summary:
last_update_val = summary.get("last_update")
if last_update_val:
try:
if isinstance(last_update_val, str):
last_update = datetime.fromisoformat(last_update_val.replace('Z', '+00:00')).timestamp()
else:
last_update = float(last_update_val)
except Exception:
last_update = os.path.getmtime(path)
else:
last_update = os.path.getmtime(path)
summary["last_update"] = last_update
summary["stale"] = (time.time() - last_update) > 60
# Check for staleness
mtime = os.path.getmtime(WORLD_DIR / "runtime-summary.json")
summary["last_update"] = mtime
summary["stale"] = (time.time() - mtime) > 60 # Stale if older than 60s
return summary
@ -152,28 +141,6 @@ def mutate_action(action_id, target_status):
return False, str(e)
def get_snapshot():
    nodes = current_nodes()
    services = current_services()
    incidents = current_incidents()
    events = current_events()
    summary = current_summary()

    non_nominal = [s for s in services if s.get("health") != "nominal"]
    nominal_count = len(services) - len(non_nominal)

    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "summary": summary,
        "nodes": nodes,
        "non_nominal_services": non_nominal,
        "nominal_service_count": nominal_count,
        "total_service_count": len(services),
        "incidents": incidents,
        "events": events[:10],
    }

def send_json(status, payload, handler):
body = (json.dumps(payload) + "\n").encode("utf-8")
handler.send_response(status)
@ -221,10 +188,6 @@ class Handler(BaseHTTPRequestHandler):
send_json(200, current_actions(), self)
return
if self.path == "/snapshot":
send_json(200, get_snapshot(), self)
return
if self.path in ("/", "/index.html"):
body = (STATIC_DIR / "index.html").read_bytes()
self.send_response(200)

View file

@ -1,10 +0,0 @@
FROM python:3.11-slim
WORKDIR /app
COPY src/ src/
ENV PYTHONUNBUFFERED=1
ENV PYTHONPATH=/app/src
CMD ["python", "-m", "brain_watchdog.main"]

View file

@ -1,30 +0,0 @@
services:
  brain-watchdog:
    build: .
    container_name: brain-watchdog
    restart: unless-stopped
    env_file:
      - /opt/homelab/config/brain-watchdog/.env
    volumes:
      - brain_watchdog_data:/data
    healthcheck:
      test:
        - "CMD"
        - "python"
        - "-c"
        - |
          import os, time, json, sys
          p = '/data/state.json'
          if not os.path.exists(p): sys.exit(1)
          age = time.time() - os.path.getmtime(p)
          sys.exit(0 if age < 300 else 1)
      interval: 1m
      timeout: 10s
      retries: 3
      start_period: 30s

volumes:
  brain_watchdog_data:

Some files were not shown because too many files have changed in this diff.