Compare commits
No commits in common. "master" and "vps-control-plane-deploy" have entirely different histories.
|
|
@ -1,43 +0,0 @@
|
|||
---
|
||||
name: deploy
|
||||
description: Deploy, redeploy, or ship homelab services to a target node. Trigger on any request containing deploy / redeploy / wdróż / zredeployuj / ship for targets control-plane, vps, piha, solaria, or chelsty-infra.
|
||||
---
|
||||
|
||||
Always invoke `scripts/deploy/deploy.sh <target> [--dry-run] [--no-gate]` as the **sole entry point**.
|
||||
Never call `deploy-control-plane.sh`, `deploy-node.sh`, or `deploy-local.sh` directly.
|
||||
|
||||
## Targets
|
||||
|
||||
| Target | What it deploys |
|---|---|
| `control-plane` | observer, supervisor, executor, operator-ui on VPS |
| `vps` | all VPS GitOps services (node-agent, npm, outline, joplin, ai-cluster, …) |
| `piha` | PIHA services (ha-diag-agent, node-agent, redis, …) |
| `solaria` | SOLARIA compute services |
| `chelsty-infra` | CHELSTY LTE edge node (30 s SSH timeout) |
|
||||
|
||||
## Invocation
|
||||
|
||||
```bash
scripts/deploy/deploy.sh <target>            # full pipeline
scripts/deploy/deploy.sh <target> --dry-run  # preflight + gate only
scripts/deploy/deploy.sh <target> --no-gate  # emergency: bypass tests
```
|
||||
|
||||
## Exit Code Handling
|
||||
|
||||
| Code | Meaning | Required action |
|---|---|---|
| 0 | Success | Report: target, commit hash, gate status, verify status, elapsed time |
| 1 | Preflight failed | Fix the upstream issue (push commits, wake node, switch to master). Never bypass. |
| 2 | Gate failed | Show exactly which test/build failed. Do **not** deploy. Fix the failure first. |
| 3 | Execute failed | Show full deploy output. Ask the user whether to investigate or roll back. |
| 4 | Verify failed | Show `docker ps` output. Discuss rollback with the user. |
| 5 | Sudo handoff | Print the exact manual command from stderr **verbatim** and stop. The user must run it. |
|
||||
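The exit-code table above can be sketched as a small lookup helper. This is only an illustration: `deploy_exit_action` is a hypothetical name that does not appear in the repo, and the messages paraphrase the table rather than reproduce any real script output.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repo): maps a deploy.sh exit code
# to the required action listed in the exit-code table.
deploy_exit_action() {
  case "$1" in
    0) echo "success: report target, commit hash, gate status, verify status, elapsed time" ;;
    1) echo "preflight failed: fix the upstream issue, never bypass" ;;
    2) echo "gate failed: show the failing test or build, do not deploy" ;;
    3) echo "execute failed: show full output, ask whether to investigate or roll back" ;;
    4) echo "verify failed: show docker ps output, discuss rollback" ;;
    5) echo "sudo handoff: print the manual command from stderr verbatim and stop" ;;
    *) echo "unknown exit code: $1" ;;
  esac
}
```

A caller would run `scripts/deploy/deploy.sh <target>`, capture `$?`, and pass it to the helper to decide what to report.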
|
||||
## Rules
|
||||
|
||||
- Never pass `--no-gate` unless the user explicitly requests emergency/bypass mode.
|
||||
- Never deploy uncommitted or unpushed code — preflight enforces this; do not help circumvent it.
|
||||
- Canonical branch is `master` — preflight enforces this.
|
||||
- For exit 5: reproduce the handoff command exactly as printed to stderr, then stop.
|
||||
|
|
@ -1,152 +0,0 @@
|
|||
---
|
||||
name: node-onboarding
|
||||
description: >
|
||||
Use when the user wants to add or onboard a new node to homelab-codex —
|
||||
repo manifest, Tailscale mesh, node-agent, monitoring, and UI registration.
|
||||
Keywords: "nowy node", "dodaj node", "onboarding", "onboard node".
|
||||
living_doc: true
|
||||
maturity: partial # PROVEN: 00-access, 20-base, 30-node-agent; WRITTEN: 40-register, 50-verify (live pending). Update after each step lands on a real node.
|
||||
---
|
||||
|
||||
> **Living document** — sections marked **SCAFFOLD** are stubs waiting for battle-testing on a real node.
|
||||
> Promote to **PROVEN** after each step passes end-to-end. Do not treat SCAFFOLD sections as authoritative.
|
||||
|
||||
## Trigger
|
||||
|
||||
User asks to onboard / add a new node. Load this skill before touching any onboarding script or node.yaml.
|
||||
|
||||
---
|
||||
|
||||
## Workflow — one step at a time
|
||||
|
||||
```
preflight (read-only)
└─ 00-access [PROVEN]
   └─ 20-base [PROVEN]
      └─ 30-node-agent [PROVEN]
         └─ 40-register [WRITTEN, live pending]
            └─ 50-verify [WRITTEN, live pending]
```
|
||||
|
||||
Never skip ahead. Each step must exit 0 before the next begins.
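The "each step must exit 0 before the next begins" rule can be sketched as a loop that stops at the first failure. This is a minimal sketch, not the actual `onboard.sh` logic; `run_steps_in_order` is a hypothetical name, and steps are passed here as shell function or command names.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: run steps strictly in order and stop at the first
# non-zero exit, so a later step never runs after a failed one.
run_steps_in_order() {
  local step
  for step in "$@"; do
    echo "running ${step}"
    if ! "$step"; then
      echo "step ${step} failed; stopping" >&2
      return 1
    fi
  done
}
```

Each argument is executed as a command, so real step scripts could be passed by path in the same way.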
|
||||
|
||||
---
|
||||
|
||||
## Invocation
|
||||
|
||||
```bash
# Full onboarding (all steps in order)
scripts/onboard/onboard.sh --node <name>

# Single step
scripts/onboard/onboard.sh --node <name> --step 00-access

# Resume from a step
scripts/onboard/onboard.sh --node <name> --from 10-bootstrap-runtime

# Dry-run: probes run for real; mutations are printed, not executed
scripts/onboard/onboard.sh --node <name> --dry-run
```
|
||||
|
||||
---
|
||||
|
||||
## Step status table
|
||||
|
||||
| Step | File | Status | What it does |
|------|------|--------|--------------|
| `00-preflight` | `steps/00-preflight.sh` | SCAFFOLD | Read-only: arch, RAM, docker, swap, MM runtime → YAML snippet for node.yaml |
| `00-access` | `steps/00-access.sh` | **PROVEN** | SSH key → `first_contact`, install Tailscale, `tailscale up` (interactive URL), verify over mesh |
| `10-bootstrap-runtime` | `steps/10-bootstrap-runtime.sh` | SCAFFOLD | Create `/opt/homelab/` layout, `chown <ssh_user>` |
| `20-base` | `steps/20-base.sh` | **PROVEN** | swap→zram, `/opt/homelab/` layout, event dir `/opt/homelab/events/<node>/` |
| `20-install-docker` | `steps/20-install-docker.sh` | SCAFFOLD | Install Docker Engine if `docker_present=false`; skip if already installed |
| `30-node-agent` | `steps/30-node-agent.sh` | **PROVEN** | rsync base compose + override, `docker compose up -d --build`, verify container + events |
| `40-register` | `steps/40-register.sh` | WRITTEN | Appends the node to `inventory/topology.yaml`, creates `hosts/<node>/services.yaml`, commits on the branch (no push) |
| `50-verify` | `steps/50-verify.sh` | WRITTEN | SSH node: container+events; SSH VPS: restart observer + heartbeat poll + world/nodes.json |
|
||||
|
||||
---
|
||||
|
||||
## node.yaml — key fields
|
||||
|
||||
```yaml
name: LUSTRO # ALL CAPS
role: edge # edge | compute | infra
ssh_user: pi # existing user on the node
first_contact: pi@192.168.31.19 # LAN IP, NEVER .local (mDNS unreliable in automation)
tailscale:
  hostname: lustro # mesh name; switch to this after tailscale up
  ip: # fill after join
deploy_autonomy: true # false → print manual instructions and stop
git_control: false # false → push-based from SATURN (edge nodes)
hardware:
  arch: arm64 # filled by 00-preflight
  ram_mb: 4096 # filled by 00-preflight
  swap:
    kind: zram # zram | file | none
  docker_present: true # filled by 00-preflight
  mm_runtime: systemd:magicmirror.service # filled by 00-preflight; none if absent
services:
  node-agent:
    runtime:
      engine: docker
      mem_limit: 256m # mandatory on RAM-constrained hosts (≤4 GB)
```
|
||||
|
||||
preflight fills `arch`, `ram_mb`, `docker_present`, `mm_runtime` — do NOT guess these.
|
||||
|
||||
Full schema: `scripts/onboard/README.md`.
|
||||
|
||||
---
|
||||
|
||||
## Operational rules (PROVEN)
|
||||
|
||||
**PLAN-FIRST** — before any mutation, show exactly what will touch the remote host.
|
||||
Always run `--dry-run` first; dry-run must print real commands (`run()` propagation).
|
||||
|
||||
**Idempotency** — every step is safe to re-run. Keys, Tailscale join, Docker install → skip if already done.
|
||||
|
||||
**Isolation** — do NOT touch existing services on the node (e.g. MagicMirror as systemd unit).
|
||||
|
||||
**Worktree discipline** — onboarding is a feature. Work in a task worktree (`agent.sh new`), never in the main checkout (`~/homelab-codex-ws` is deploy-only). See [[worktree-aware]].
|
||||
|
||||
---
|
||||
|
||||
## Gotchas (battle-tested)
|
||||
|
||||
| Problem | Fix |
|---------|-----|
| mDNS `.local` resolution fails | Always use the LAN IP in `first_contact`; `.local` is OK interactively, not in automation |
| uid=1000 collision on RPi OS | If `pi` already holds uid=1000 → USE that user, don't create `oskar`. node-agent `1000:1000` matches out of the box; creating a second uid=1000 breaks MM ownership |
| passwordless sudo not guaranteed | Verify `sudo -n true` exits 0 before any sudo-over-SSH step. The RPi OS default may require a password; ssh without a TTY will hang |
| swap file on SD card | Use zram, not a swap file (SD wear). Add migration to `10-bootstrap-runtime` |
| RAM ≤4 GB with heavy app | `mem_limit` on node-agent is mandatory (same OOM profile as VPS) |
| Docker already installed | Check `docker_present` from preflight; skip install step if true |
| SSH known-hosts warning in parsed output | Pass `-o LogLevel=ERROR` to SSH for new mesh hosts |
| `yaml_get` drops value prefix after `:` | Strip only up to the first colon: `s/^[[:space:]]*[^:]*:[[:space:]]*//`, which handles `systemd:unit` correctly |
| `yaml_get` keeps inline YAML comments | Strip with `s/[[:space:]]\+#.*$//` after extraction (requires ≥1 space before `#`) |
| dry-run stops at orchestrator level | `run()` wrapper + `export DRY_RUN=1` propagated to all step scripts; probes execute for real |
| rsync push Permission denied to VPS events/ | ssh-user must be in the **group that owns `/opt/homelab/events/`** (aerbot/1000 on VPS). Symptom: silent WARNING in node-agent log, 292k files backlog, panel stale. Fix: `usermod -aG 1000 <user>` on VPS + re-login |
| node-agent SSH key mount target | Mount the push key under the **container's HOME**: `/home/homelab/.ssh` (uid 1000 `homelab`), **NOT `/root/.ssh`**. ssh in `_ship_events_to_vps()` has no `-i` and only looks in `$HOME/.ssh`; a `/root/.ssh` mount is never read → `Permission denied` (lustro 2026-06-11, fix `a5a1352`). The new node's pubkey must also land in `authorized_keys` of `oskar@VPS` |
| observer not seeing new node after topology.yaml edit | `_load_inventory()` runs once at `__init__`. After `git pull` on VPS (bind-mount is live), **`docker restart control-plane-observer`** is required; no redeploy is needed |
| worktree on wrong branch | Always check `git branch --show-current` on entry. One task = one worktree (`agent.sh new`). Never manually `git checkout` between task branches in the same worktree |
|
||||
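The two `yaml_get` gotchas above can be illustrated with a small extraction function. This is a hedged sketch, not the real `yaml_get` from `lib/common.sh` (whose source is not shown here); `yaml_get_sketch` is a hypothetical name, it only handles simple `key: value` lines, and it uses the portable `[[:space:]][[:space:]]*` form in place of the GNU-only `\+`.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: print the value of the first "key: value" line.
# Expression 1 strips everything up to the FIRST colon, so a value such as
# "systemd:magicmirror.service" keeps its own colon.
# Expression 2 strips an inline comment that is preceded by at least one space.
yaml_get_sketch() {
  local key="$1" file="$2"
  grep -E "^[[:space:]]*${key}:" "$file" | head -n 1 \
    | sed -e 's/^[[:space:]]*[^:]*:[[:space:]]*//' \
          -e 's/[[:space:]][[:space:]]*#.*$//'
}
```

Matching `[^:]*:` instead of `.*:` is what prevents the greedy match from consuming the `systemd:` prefix of the value.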
|
||||
---
|
||||
|
||||
## lib/ reference
|
||||
|
||||
```
|
||||
lib/common.sh — log/warn/die/step/dryrun, run(), yaml_get, ensure_line, git() wrapper
|
||||
lib/remote.sh — rrun/rcopy/rsync_dir/rcheck (SSH wrappers; uses ONBOARD_SSH_USER / ONBOARD_HOST)
|
||||
```
|
||||
|
||||
`run()` contract: in dry-run mode prints intent without executing; probes (ssh BatchMode=yes, `command -v`, status queries) always execute so the plan is realistic.
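The `run()` contract can be sketched as follows. This is an assumption-laden illustration rather than the code in `lib/common.sh`: `run_sketch` is a hypothetical name, and the `[dry-run]` prefix is invented for the example. Only the `DRY_RUN=1` convention comes from the text above.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a dry-run wrapper: mutations go through run_sketch,
# probes are called directly and therefore always execute.
run_sketch() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Dry-run: print the intended command instead of executing it.
    printf '[dry-run] %s\n' "$*"
    return 0
  fi
  "$@"
}
```

Because the flag is read from the environment, exporting `DRY_RUN=1` in the orchestrator makes it visible to every step script it launches.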
|
||||
|
||||
---
|
||||
|
||||
## Definition of Done
|
||||
|
||||
A node is fully onboarded when:
|
||||
|
||||
1. `50-verify` exits 0 — event visible in control-plane UI and Telegram alert path confirmed.
|
||||
2. `hosts/<node>/node.yaml` committed with all preflight fields filled.
|
||||
3. `hosts/<node>/capabilities.yaml` present and accurate.
|
||||
4. Node appears in `inventory/topology.yaml`.
|
||||
|
|
@ -1,65 +0,0 @@
|
|||
---
|
||||
name: save-session
|
||||
description: Save and record the current work session to docs/sessions/. Trigger ONLY on explicit "save session", "zapisz sesję", or "wrap up" — never invoke proactively between tasks.
|
||||
---
|
||||
|
||||
**Trigger condition**: user explicitly says "save session", "zapisz sesję", "wrap up", or equivalent.
|
||||
Never invoke proactively. Never invoke mid-task.
|
||||
|
||||
## 1. Determine Session Boundary
|
||||
|
||||
1. Read the latest entry file in `docs/sessions/` — use its last `## Session HH:MM` heading timestamp as the start boundary.
|
||||
2. Fallback if no previous entry exists: 24 hours ago.
|
||||
|
||||
## 2. Collect Facts (deterministic only — no invention)
|
||||
|
||||
Run exactly:
|
||||
```bash
# All commits since boundary (<boundary> is a timestamp, so filter by date)
git --no-pager log --oneline --since="<boundary>"

# Changed file summary, diffed against the last commit before the boundary
git --no-pager diff --stat "$(git rev-list -1 --before="<boundary>" HEAD)" HEAD
```
|
||||
|
||||
From the visible conversation transcript: deploys run and their outcomes, test results seen.
|
||||
|
||||
## 3. Write the Session Entry
|
||||
|
||||
**APPEND** to `docs/sessions/YYYY-MM-DD.md` (create the file if it doesn't exist for today).
|
||||
Never overwrite existing content.
|
||||
|
||||
```markdown
|
||||
## Session HH:MM
|
||||
|
||||
### Commits
|
||||
<output of git log --oneline>
|
||||
|
||||
### Files changed
|
||||
<output of git diff --stat>
|
||||
|
||||
### Deploys
|
||||
<list from transcript, or "None recorded">
|
||||
|
||||
### Narrative
|
||||
> _user-provided summary_
|
||||
```
|
||||
|
||||
The `> _user-provided summary_` placeholder is **mandatory**. Never fill it in. The user supplies the narrative separately if desired.
|
||||
|
||||
## 4. What NOT to Touch
|
||||
|
||||
- `backlog.md` — only on explicit "update backlog" instruction
|
||||
- `CLAUDE.md` — only on explicit "update CLAUDE.md" instruction
|
||||
- Any other file not listed above
|
||||
|
||||
## 5. Commit
|
||||
|
||||
Stage and commit **only** the session file:
|
||||
|
||||
```bash
|
||||
git add docs/sessions/YYYY-MM-DD.md
|
||||
git commit -m "docs: session YYYY-MM-DD HH:MM"
|
||||
```
|
||||
|
||||
No other files. No `git add -A`.
|
||||
|
|
@ -1,81 +0,0 @@
|
|||
---
|
||||
name: worktree-aware
|
||||
description: >
|
||||
Use when working in a git worktree checkout for a parallel agent task.
|
||||
The presence of an .agent-task file in the current working directory indicates
|
||||
a task worktree (NOT the main checkout). Encodes branch hygiene: commit only
|
||||
to the assigned task branch, NEVER push origin master, NEVER touch the main
|
||||
checkout at ~/homelab-codex-ws, NEVER manage worktrees yourself. On task
|
||||
completion, report the branch name verbatim and stop — the human merges via
|
||||
scripts/dev/agent.sh.
|
||||
---
|
||||
|
||||
## When this applies
|
||||
|
||||
- `.agent-task` present in your `cwd` → you are in a task worktree. Apply all rules below.
|
||||
- `.agent-task` absent → you are in the main checkout. Do NOT treat yourself as a task agent.
|
||||
In the main checkout these rules do not apply.
|
||||
|
||||
## Reading the marker
|
||||
|
||||
`.agent-task` is a YAML file. Your assigned branch is the value of the `branch:` key, e.g.:
|
||||
|
||||
```yaml
task: my-feature
branch: task/my-feature
parent_commit: abc1234
created_utc: 2026-06-03T10:00:00Z
worktree_path: /home/oskar/homelab-codex-ws-my-feature
```
|
||||
|
||||
Always read this file first before taking any action.
|
||||
|
||||
## Rules
|
||||
|
||||
1. **Commit only to your branch.**
|
||||
Before any `git commit`, run `git status` and confirm it says `On branch task/<name>`.
|
||||
If it does not, stop immediately and report the discrepancy.
|
||||
|
||||
2. **Push only to your branch.**
|
||||
The only permitted push is `git push origin task/<name>`.
|
||||
NEVER `git push origin master` or any other branch.
|
||||
|
||||
3. **Do not touch the main checkout.**
|
||||
`~/homelab-codex-ws/` is the main checkout — deploy-only, owned by the human.
|
||||
Do not read from, write to, or execute commands inside it.
|
||||
|
||||
4. **Stay scoped.**
|
||||
Only change files directly related to your assigned task.
|
||||
If you notice other problems, report them in your final summary as separate follow-up proposals.
|
||||
Do not fix them in this worktree.
|
||||
|
||||
5. **Never `git add -A`.**
|
||||
Always stage specific files by name: `git add path/to/file`.
|
||||
|
||||
6. **Do not manage worktrees.**
|
||||
Never run `git worktree add/remove` or invoke `scripts/dev/agent.sh`.
|
||||
Worktree lifecycle is the human's responsibility.
|
||||
|
||||
7. **Final report before stopping.**
|
||||
When the task is done, provide a structured report containing:
|
||||
- Files changed (path and one-line summary of change)
|
||||
- Tests run and results
|
||||
- All commit hashes on the task branch
|
||||
- **Branch name verbatim** (copy-paste ready)
|
||||
- Follow-up items as bulleted proposals for separate tasks
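Rule 1 (confirm the branch before committing) can be sketched as a check against the `branch:` key of `.agent-task`. This is a minimal sketch under the assumption that the marker is flat YAML as in the example above; `assigned_branch` and `on_assigned_branch` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: read the assigned branch from the marker file.
assigned_branch() {
  sed -n 's/^branch:[[:space:]]*//p' "$1" | head -n 1
}

# Succeeds only if the marker exists and names the branch passed as $2.
on_assigned_branch() {
  local marker="$1" current="$2"
  [ -f "$marker" ] && [ "$(assigned_branch "$marker")" = "$current" ]
}
```

In a task worktree this could be called as `on_assigned_branch .agent-task "$(git branch --show-current)"` before any `git commit`, stopping with a report if it fails.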
|
||||
|
||||
## Definition of Done
|
||||
|
||||
- All commits are on `task/<name>` (verify with `git log --oneline master..task/<name>`)
|
||||
- Test suite passes
|
||||
- Branch pushed: `git push origin task/<name>`
|
||||
- Full report delivered in conversation
|
||||
|
||||
## What you do NOT do
|
||||
|
||||
- Merge branches
|
||||
- Create or push tags
|
||||
- Run deploys or healthchecks against production nodes
|
||||
- Delete branches or worktrees
|
||||
- Modify files in other worktrees
|
||||
- Push to `origin master` under any circumstances
|
||||
3
.gitignore
vendored
|
|
@ -15,13 +15,10 @@ __pycache__/
|
|||
*$py.class
|
||||
venv/
|
||||
.venv/
|
||||
*.egg-info/
|
||||
|
||||
# Tools
|
||||
.aider*
|
||||
.codex
|
||||
# worktree task marker created by scripts/dev/agent.sh new — must stay untracked per worktree
|
||||
.agent-task
|
||||
|
||||
# OS files
|
||||
.DS_Store
|
||||
|
|
|
|||
212
CLAUDE.md
|
|
@ -1,212 +0,0 @@
|
|||
# CLAUDE.md
|
||||
|
||||
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
|
||||
|
||||
## What This Repo Is
|
||||
|
||||
GitOps-lite orchestration for a distributed homelab. The repo is the source of truth for infrastructure definitions; runtime state lives at `/opt/homelab/` on each execution node and is never committed.
|
||||
|
||||
## Node Roles
|
||||
|
||||
| Host | Role |
|------|------|
| **SATURN** | Primary control node; the only node where commits are made |
| **SOLARIA** | GPU/compute/AI workloads |
| **PIHA** | Infra, monitoring |
| **VPS** | Public ingress, reverse proxy, control plane host |
| **CHELSTY-INFRA** | LTE edge hypervisor (site: chelsty); Zigbee2MQTT, Mosquitto, stability-agent; offline-first |
| **CHELSTY-HA** | LTE Home Assistant VM (site: chelsty); connects to CHELSTY-INFRA MQTT broker; offline-first |
|
||||
|
||||
All nodes communicate over Tailscale. CHELSTY-INFRA and CHELSTY-HA have an intermittent LTE uplink; their services must never depend on SATURN, VPS, or Forgejo at runtime. Full node capabilities: `hosts/<node>/capabilities.yaml`.
|
||||
|
||||
## Deployment
|
||||
|
||||
```bash
scripts/deploy/deploy.sh                                   # fresh deploy on current node
scripts/deploy/deploy.sh --resume                          # resume after interruption
scripts/deploy/deploy.sh --stage verify                    # specific stage only
scripts/deploy/deploy.sh --service mosquitto               # specific service only
./scripts/deploy/deploy-control-plane.sh --ssh             # SATURN/SOLARIA → VPS
./scripts/deploy/deploy-node.sh chelsty-infra              # CHELSTY nodes (individually)
./scripts/bootstrap/prepare-node.sh                        # general node bootstrap
./scripts/bootstrap/chelsty-runtime.sh                     # CHELSTY-specific bootstrap
scripts/onboard/onboard.sh --node <name>                   # onboard a new node (idempotent, bash)
scripts/onboard/onboard.sh --node <name> --step 00-access  # single step
scripts/onboard/onboard.sh --node <name> --dry-run         # simulate
```
|
||||
|
||||
Pipeline stages: **prepare → validate → deploy → verify → diagnose (on failure) → complete**. Stage state persisted in `/opt/homelab/state/deploy/`.
|
||||
|
||||
## Node Onboarding
|
||||
|
||||
New nodes are onboarded via `scripts/onboard/` — an idempotent bash tool driven by
|
||||
`hosts/<node>/node.yaml` manifests (no Ansible). See `scripts/onboard/README.md` for
|
||||
the full schema, step status table, and gotchas.
|
||||
|
||||
Key fields in `node.yaml`: `ssh_user`, `first_contact` (LAN IP — not `.local`),
|
||||
`tailscale.hostname`, `deploy_autonomy`, `git_control`, `hardware.*`.
|
||||
|
||||
## Service Structure
|
||||
|
||||
Every service must follow this layout:
|
||||
|
||||
```
services/<service>/
├── docker-compose.yml
├── service.yaml      # Machine-readable contract (primary source of truth for agents)
├── README.md
├── env.example       # Template; never commit actual secrets
└── healthcheck.sh    # Returns 0 (healthy) or 1 (unhealthy)
```
|
||||
|
||||
`service.yaml` defines `owner_node`, `exposure`, `dependencies`, `healthcheck`, `restart_policy`, `persistence.paths`, and `runtime.env_vars`. This is what AI agents read to understand how to manage a service.
|
||||
|
||||
Host-specific runtime config and secrets live at `/opt/homelab/config/<service>/` on the target node (not in Git). Docker Compose overrides are version-controlled at `hosts/<node>/runtime/<service>/docker-compose.override.yml` in this repo and applied during deployment.
|
||||
|
||||
## Agent System Architecture
|
||||
|
||||
The platform uses a multi-agent model with **human-in-the-loop** for destructive actions:
|
||||
|
||||
1. **Stability Agent** (`services/stability-agent/`) — Per-node watchdog. Monitors Docker containers, disk, Tailscale, MQTT. Emits filesystem events. Does NOT restart services autonomously.
|
||||
2. **Observer** (`services/control-plane/src/`) — Synthesizes world state from events into `/opt/homelab/world/{nodes,services,deployments,incidents}.json`.
|
||||
3. **Supervisor** — Detects drift between desired state (from `hosts/*/services.yaml`) and actual state (from Observer output). Writes `pending` action JSON files.
|
||||
4. **Executor** — Executes actions only after they transition to `approved`.
|
||||
5. **Operator UI** + **Telegram Bot** — Operators review and approve/reject pending actions.
|
||||
|
||||
### Action approval flow
|
||||
```
|
||||
Agent → /opt/homelab/actions/pending/<id>.json
|
||||
→ Telegram notification → Operator approves
|
||||
→ /opt/homelab/actions/approved/<id>.json
|
||||
→ Executor runs → completed / failed
|
||||
```
|
||||
|
||||
Agents must never execute destructive actions (restarts, deploys, config changes) without a corresponding approved action file.
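The approval gate can be sketched as a file-existence check that an executor-like process performs before acting. This is only an illustration of the rule, not the Executor's actual code; `action_is_approved` is a hypothetical name, and the actions root is passed as a parameter so the sketch does not assume `/opt/homelab/actions/` exists.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: an action may run only if approved/<id>.json exists.
# A file that is still under pending/ does not count as approved.
action_is_approved() {
  local actions_root="$1" action_id="$2"
  [ -f "${actions_root}/approved/${action_id}.json" ]
}
```

A caller would refuse to restart, deploy, or change config whenever the function returns non-zero.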
|
||||
|
||||
## Event System
|
||||
|
||||
Events are append-only JSON lines at `/opt/homelab/events/YYYY-MM-DD/<node>/events.jsonl`.
|
||||
|
||||
Emit via `scripts/lib/events.sh` (shell) or `scripts/lib/events.py` (Python).
|
||||
|
||||
Normalized event types: `deployment_started/completed/failed`, `service_unhealthy/recovered`, `node_offline/online`, `healthcheck_failed`, `remediation_started/completed`.
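Appending one event to the per-day, per-node JSON-lines file can be sketched as below. This is not the API of `scripts/lib/events.sh`, which is not shown here: `emit_event_sketch` is a hypothetical name, and the JSON field names `ts`, `node`, and `type` are assumptions made for illustration. Only the path layout comes from the text above.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: append one JSON line to
# <root>/YYYY-MM-DD/<node>/events.jsonl (append-only, never rewritten).
emit_event_sketch() {
  local root="$1" node="$2" type="$3"
  local day ts dir
  day="$(date -u +%Y-%m-%d)"
  ts="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
  dir="${root}/${day}/${node}"
  mkdir -p "$dir"
  printf '{"ts":"%s","node":"%s","type":"%s"}\n' "$ts" "$node" "$type" \
    >> "${dir}/events.jsonl"
}
```

Using `>>` keeps the store append-only; values containing quotes would need proper JSON escaping, which this sketch omits.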
|
||||
|
||||
### Supervisor event routing table
|
||||
|
||||
| Event type | Source | Action generated | Cooldown |
|---|---|---|---|
| `containers_not_running` | stability-agent | `container_restart` | dedup via stable ID |
| `mqtt_unreachable` | stability-agent | `container_restart` | dedup via stable ID |
| `service_unhealthy` / other | stability-agent | `redeploy` | dedup via stable ID |
| `disk_pressure` (high) | stability-agent | `disk_cleanup` | dedup via stable ID |
| `ha_websocket_dead` | ha-diag-agent | `container_restart` (homeassistant) | 30 min after completion |
| `ha_websocket_recovered` | ha-diag-agent | cancels matching restart | n/a |
| `ha_integration_failed` | ha-diag-agent | `alert_only` | 1 hour |
| `ha_entity_unavailable_long` | ha-diag-agent | `alert_only` | 1 hour |
| `ha_automation_failing` | ha-diag-agent | `alert_only` | 1 hour |
| `ha_update_available` | ha-diag-agent | `alert_only` | 1 hour |
| `ha_recorder_lag` | ha-diag-agent | `alert_only` | 1 hour |
| `ha_system_health_degraded` | ha-diag-agent | `alert_only` | 1 hour |
|
||||
|
||||
HA events are routed directly from the events directory by the supervisor (not via world-state drift loop) to avoid conflicts with stability-agent's independent container health tracking. HA events are suppressed if `homeassistant` had a `containers_not_running` incident within the last 5 minutes (planned restart/update in progress).
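The 5-minute suppression rule reduces to a timestamp comparison. The sketch below assumes both times are available as Unix epoch seconds; `ha_event_suppressed` is a hypothetical name and does not reflect how the supervisor actually stores incident times.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: suppress an HA event if homeassistant had a
# containers_not_running incident within the last 5 minutes (300 s).
ha_event_suppressed() {
  local now_epoch="$1" last_incident_epoch="$2"
  local window_seconds=300
  [ $(( now_epoch - last_incident_epoch )) -le "$window_seconds" ]
}
```

With `now=1000` and an incident at `800`, the event is suppressed (200 s elapsed); with an incident at `600` it is not (400 s elapsed).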
|
||||
|
||||
## Discovery Entry Points for Agents
|
||||
|
||||
When exploring the system, use these files in order:
|
||||
1. `inventory/topology.yaml` — node list, roles, mesh type
|
||||
2. `hosts/<node>/capabilities.yaml` — hardware and software constraints
|
||||
3. `hosts/<node>/services.yaml` — desired services and exposure classes for that host
|
||||
4. `services/<service>/service.yaml` — operational contract for a service
|
||||
|
||||
## VPS-Specific Rules
|
||||
|
||||
VPS has **4 GiB RAM, no swap**. Every repo-managed service must declare memory limits in its `hosts/vps/runtime/<service>/docker-compose.override.yml`.
|
||||
|
||||
### Memory limit convention
|
||||
|
||||
Use top-level Compose properties (not `deploy.resources.limits`, which requires Swarm mode):
|
||||
|
||||
```yaml
services:
  myservice:
    mem_limit: 256m      # cgroup ceiling; container is OOM-killed on breach and restarted per its restart policy
    oom_score_adj: -900  # makes the host kernel OOM killer very unlikely to pick this container
```
|
||||
|
||||
Rules:
|
||||
- **Control-plane containers** (executor, observer, supervisor, operator-ui), **node-agent**, **stability-agent**: always set `oom_score_adj: -900` — these must never be a system-level OOM victim.
|
||||
- `mem_limit` still applies even with `oom_score_adj: -900`; the cgroup OOM killer is independent of the host OOM killer and kills the container when the limit is exceeded, after which Docker restarts it according to its restart policy.
|
||||
- Budget: OS+Docker reserves ~800 MiB; sum of all `mem_limit` values must stay ≤ 3200 MiB (3.1 GiB).
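The budget rule is simple arithmetic: 4 GiB is 4096 MiB, minus roughly 800 MiB for OS and Docker leaves about 3296 MiB, and the stated cap of 3200 MiB sits just under that. A hedged sketch of a check follows; `mem_budget_ok` is a hypothetical name, and it only understands limits written as `<n>m`, as in the example above.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: sum mem_limit values given as "<n>m" (MiB),
# print the total, and succeed only if it fits within the budget.
mem_budget_ok() {
  local budget_mib="$1"; shift
  local total=0 limit
  for limit in "$@"; do
    total=$(( total + ${limit%m} ))   # strip the trailing "m" unit
  done
  printf '%s\n' "$total"
  [ "$total" -le "$budget_mib" ]
}
```

For example, `mem_budget_ok 3200 256m 512m 1024m` prints `1792` and succeeds, while two 2048m services would exceed the budget.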
|
||||
|
||||
### Repo-managed services on VPS
|
||||
|
||||
All VPS services are now GitOps-managed. Service definitions live in `services/<name>/docker-compose.yml`; host-specific overrides (mem_limit, env) live in `hosts/vps/runtime/<name>/docker-compose.override.yml`.
|
||||
|
||||
| Service | Compose stack | Data path |
|---|---|---|
| npm | `services/npm/` | `/home/dockeruser/docker/npm/{data,letsencrypt}` (bind mount) |
| outline | `services/outline/` | Docker named volumes: `outline_outline_storage`, `outline_postgres_data`, `outline_redis_data` |
| joplin | `services/joplin/` | Docker named volume: `joplin_postgres_data` |
| ai-cluster | `services/ai-cluster/` | Mosquitto config bind: `/home/dockeruser/docker/ai-cluster/mosquitto/` |
|
||||
|
||||
**Data migration rule**: data paths stay in place at cutover. Never move volumes or bind-mount sources without a dedicated migration plan.
|
||||
|
||||
**Cutover checklist** (before running `docker compose up` for any migrated service):
|
||||
1. `git pull` on VPS
|
||||
2. Populate `/opt/homelab/config/<service>/.env` from the `env.example` template
|
||||
3. For ai-cluster: copy `/home/dockeruser/docker/ai-cluster/.env` to `/opt/homelab/config/ai-cluster/.env`
|
||||
4. For mosquitto: config stays at old bind path until explicitly migrated
|
||||
5. Verify named volumes exist: `docker volume ls | grep <project>`
|
||||
|
||||
**ai-cluster architectural note**: compute workloads (codex-worker, planner-worker) belong on SOLARIA (GPU/compute node), not the 4 GB ingress VPS. Migrate when feasible; for now, hard mem_limits contain the blast radius.
|
||||
|
||||
## CHELSTY-Specific Rules
|
||||
|
||||
- Zigbee coordinator is **SLZB-06U** over TCP (`192.168.1.105:6638`, `ezsp` adapter). Never use `/dev/ttyUSB0`.
|
||||
- CHELSTY nodes run **docker-compose v1** (1.29.2) — use `docker-compose` (hyphenated), not `docker compose`.
|
||||
- Critical backup sets: HA config+data, Zigbee2MQTT config+db+network key, Mosquitto config+persistence, SLZB-06U coordinator state.
|
||||
|
||||
## Runtime Path Conventions
|
||||
|
||||
`/opt/homelab/` layout on each node:
|
||||
|
||||
- `data/<service>/` — persistent volumes
|
||||
- `config/<service>/` — secrets and host-local overrides (not in Git)
|
||||
- `logs/<service>/` — service logs
|
||||
- `state/` — deployment stage markers, agent heartbeats
|
||||
- `events/` — append-only event store
|
||||
- `world/` — Observer output (synthesized state)
|
||||
- `actions/` — pending / approved / running / completed / failed
|
||||
|
||||
## Definition of Done (services)
|
||||
|
||||
Before any new or changed service is considered ready:
|
||||
|
||||
1. **docker build + smoke run** — build the image locally and run it for a few seconds; confirm the process starts its main loop without crashing. This catches packaging/import errors (e.g. `ModuleNotFoundError`) before they reach a node.
|
||||
2. **pytest** — run the service's test suite. If no tests exist yet, add a minimal one (at minimum: import passes, core logic has at least one case). Tests live in `services/<service>/tests/`.
|
||||
3. **Never commit or deploy code that has never been run.** If a smoke run or test fails, fix it first.
|
||||
|
||||
## Naming Conventions
|
||||
|
||||
- Hosts: ALL CAPS (`SATURN`, `PIHA`)
|
||||
- Services: kebab-case (`stability-agent`, `zigbee2mqtt`)
|
||||
- Container names must match service names
|
||||
- Always `restart: unless-stopped` unless `service.yaml` says otherwise
|
||||
|
||||
## Multi-agent worktree mode
|
||||
|
||||
`~/homelab-codex-ws` (main checkout) is **deploy-only** and belongs to the human operator.
|
||||
Parallel agent tasks run in isolated git worktrees created by `scripts/dev/agent.sh new <name>`.
|
||||
|
||||
**DISCIPLINE RULE — enforced after 2026-06-08 session violation:**
|
||||
All feature/implementation work MUST happen in a task worktree, never directly in the main
|
||||
checkout. The main checkout is for reading context and running deploys only. If you are
|
||||
about to create a new branch or make implementation commits while `pwd` is
|
||||
`~/homelab-codex-ws`, stop and ask the operator to run `agent.sh new <name>` first.
|
||||
|
||||
If `.agent-task` exists in your current working directory, you are in a task worktree.
|
||||
**You must immediately read `.agent-task` and load `.claude/skills/worktree-aware/SKILL.md`
|
||||
before taking any action.** That skill defines all branch-hygiene rules for task worktrees.
|
||||
|
||||
Worktree lifecycle commands: `agent.sh new | list | merge | clean`.
|
||||
Agents never invoke these — only the human does.
|
||||
19
README.md
|
|
@ -13,22 +13,6 @@ The homelab consists of several nodes connected via a Tailscale internal mesh.
|
|||
| **PIHA** | Infra Node | Core infrastructure services, automation, and monitoring. |
|
||||
| **VPS** | Edge Node | Public ingress, reverse proxy, and edge services. |
|
||||
|
||||
## Agent System
|
||||
|
||||
The homelab uses a multi-agent orchestration model with human-in-the-loop for destructive actions:
|
||||
|
||||
| Agent | Node | Role |
|-------|------|------|
| **stability-agent** | all nodes | Per-node watchdog: monitors Docker, disk, Tailscale, MQTT; emits events |
| **node-agent** | all nodes | Publishes container health events to Redis pub/sub |
| **observer** | VPS | Synthesizes world state from events into `/opt/homelab/world/*.json` |
| **supervisor** | VPS | Detects drift between desired and actual state; writes `pending` actions |
| **planner-agent** | SOLARIA | LLM-powered diagnosis: listens to Redis, proposes remediation actions |
| **executor** | VPS | Executes actions only after operator approval |
| **operator-ui** + **telegram-bot** | VPS / PIHA | Operator reviews and approves/rejects pending actions |
|
||||
|
||||
Action approval flow: `pending/` → operator approves → `approved/` → executor runs.
|
||||
|
||||
## Repository Structure
|
||||
|
||||
- `docs/`: [Infrastructure Standards](docs/standards.md) and [Deployment Conventions](docs/deployment.md).
|
||||
|
|
@ -45,13 +29,10 @@ Action approval flow: `pending/` → operator approves → `approved/` → execu
|
|||
## Documentation Index
|
||||
|
||||
- [Infrastructure Standards](docs/standards.md)
|
||||
- [Agent Operating Procedures](docs/agents.md) (For AI/Non-Human Agents)
|
||||
- [Deployment Conventions](docs/deployment.md)
|
||||
- [Hardware](docs/hardware.md)
|
||||
- [Networking](docs/networking.md)
|
||||
- [Services](docs/services.md)
|
||||
- [Node Capabilities](docs/capabilities.md)
|
||||
- [Action Model](services/agent-system/action-model.md)
|
||||
|
||||
---
|
||||
*Note: This repository documents the state of the homelab. Runtime state lives outside the repository in `/opt/homelab`.*
|
||||
|
|
|
|||
|
|
@ -1,31 +0,0 @@
|
|||
{
|
||||
"metadata": {
|
||||
"format": "zigpy/open-coordinator-backup",
|
||||
"version": 1,
|
||||
"source": "zigbee-herdsman@10.0.7",
|
||||
"internal": {
|
||||
"date": "2026-05-14T14:48:35.098Z",
|
||||
"znpVersion": 1
|
||||
}
|
||||
},
|
||||
"stack_specific": {
|
||||
"zstack": {
|
||||
"tclk_seed": "32d69cbe3f0e15471e5d43f9401e485a"
|
||||
}
|
||||
},
|
||||
"coordinator_ieee": "00124b00257bf416",
|
||||
"pan_id": "46bc",
|
||||
"extended_pan_id": "087730b5f614ea4a",
|
||||
"nwk_update_id": 0,
|
||||
"security_level": 5,
|
||||
"channel": 11,
|
||||
"channel_mask": [
|
||||
11
|
||||
],
|
||||
"network_key": {
|
||||
"key": "049909949a950d91522cf10cc369a724",
|
||||
"sequence_number": 0,
|
||||
"frame_counter": 0
|
||||
},
|
||||
"devices": []
|
||||
}
|
||||
|
|
@ -1,49 +0,0 @@
|
|||
# Agent Operating Procedures
|
||||
|
||||
This document defines the operating procedures, constraints, and interaction protocols for non-human agents (AI agents, autonomous scripts) within the Homelab Codex ecosystem.
|
||||
|
||||
## 1. Core Principles for Agents
|
||||
|
||||
1. **Read-Only by Default**: Agents should assume read-only access to the `/opt/homelab` runtime unless explicitly executing an approved action.
|
||||
2. **Git as Authority**: The repository on **SATURN** is the source of truth. Agents must not modify the runtime state on nodes directly without corresponding (or pending) Git state, unless it's an emergency mitigation.
|
||||
3. **Human-in-the-Loop (HIL)**: All destructive or structural changes (restarts, deployments, config changes) must follow the [Action Approval Model](../services/agent-system/action-model.md).
|
||||
4. **Idempotency**: All scripts and actions proposed or executed by agents MUST be idempotent.
|
||||
5. **Context-Awareness**: Agents MUST read the `README.md` and `docs/agents.md` at the start of every session to align with current infrastructure standards.
|
||||
|
||||
## 2. Agent Roles
|
||||
|
||||
| Role | Responsibility | Scope |
|
||||
|------|----------------|-------|
|
||||
| **Observer** | Monitors health, logs, and events. | Read-only access to `/opt/homelab/events` and `logs`. |
|
||||
| **Stability Agent** | Local node watchdog, event emitter. | Local node runtime, `service.yaml` healthchecks. |
|
||||
| **Orchestrator** | High-level planning, workload placement. | Repository-wide, multi-node topology. |
|
||||
| **Materializer** | Translates high-level intent into Docker/System state. | Execution of `approved` actions. |
|
||||
|
||||
## 3. Discovery Protocol
|
||||
|
||||
Agents must use the following entry points to understand the system:
|
||||
|
||||
1. **Topology**: `inventory/topology.yaml` for node list and roles.
|
||||
2. **Capabilities**: `hosts/<node>/capabilities.yaml` to understand hardware/software constraints.
|
||||
3. **Service Contract**: `services/<service>/service.yaml` to understand how to check health and manage a service.
|
||||
4. **Operational State**: `/opt/homelab/state/` on local nodes for real-time status.
|
||||
|
||||
## 4. Interaction with Humans
|
||||
|
||||
Agents communicate with the operator via the `agent-system/telegram-bot`.
|
||||
|
||||
- **Alerting**: Agents emit events to the event system. Critical events are forwarded to Telegram.
|
||||
- **Proposals**: When an agent identifies a need for change (e.g., "Service X is failing, suggest restart"), it creates a `pending` action in `/opt/homelab/actions/pending/`.
|
||||
- **Approval**: Agents must wait for the action status to transition to `approved` before execution.
|
||||
|
||||
## 5. Decision Logic (Reasoning)
|
||||
|
||||
When making decisions, agents MUST prioritize:
|
||||
1. **Safety**: Do not violate power constraints (see `capabilities.yaml`).
|
||||
2. **Stability**: Prefer keeping services on their `owner_node` unless it's down.
|
||||
3. **Connectivity**: On intermittent nodes (CHELSTY), avoid actions requiring heavy WAN traffic during low-signal periods.
|
||||
|
||||
## 6. Access Control for Agents
|
||||
|
||||
- **Filesystem**: Agents should run as the `homelab` user or equivalent with restricted sudo access to `docker compose`.
|
||||
- **Secrets**: Agents MUST NOT attempt to read `.env` files unless specifically tasked with credential rotation. They should treat secrets as opaque handles.
|
||||
123
docs/backlog.md
|
|
@ -1,123 +0,0 @@
|
|||
# Tech-debt backlog
|
||||
|
||||
Central tracker for tech debt and known defects. Entries come from sessions; add each one with a date and context.
|
||||
|
||||
---
|
||||
|
||||
## Active
|
||||
|
||||
### 🔴 BLOCKING (fleet bomb): node-agent SSH mount goes blind after recreate
|
||||
|
||||
**Date**: 2026-06-11
|
||||
**Source**: lustro SSH shipping fix session
|
||||
**Problem**: solaria/piha/chelsty run old **root** node-agent containers (piha Created
|
||||
2026-05-27, uid 0) that predate adding `user: "1000:1000"` to the base compose. Their override
|
||||
mounts the SSH key at `/root/.ssh`, which works only for uid 0. The first `--force-recreate`,
|
||||
host reboot, or image update will switch the container to uid 1000 (`homelab`, HOME=/home/homelab)
|
||||
and event shipping to the VPS will fail with "Permission denied", exactly as it did on lustro
|
||||
(fixed in `a5a1352`). The `ssh` call in `_ship_events_to_vps()` has no `-i` and looks for the key
|
||||
in `$HOME/.ssh`.
|
||||
**⚠️ DO NOT RECREATE node-agent on solaria/piha/chelsty before the fix.**
|
||||
**Fix**: unify the mount to `/home/homelab/.ssh` in every
|
||||
`hosts/*/runtime/node-agent/docker-compose.override.yml` (pattern: `hosts/lustro/`)
|
||||
OR add `-i $HOME/.ssh/id_rsa` in `_ship_events_to_vps()`.
|
||||
|
||||
---
|
||||
|
||||
### ha-diag-agent deploy BLOCKED (placeholder token)
|
||||
|
||||
**Date**: 2026-06-11
|
||||
**Source**: session. Deploy config merged (`5e9db5c`), `.env` created on piha
|
||||
(`/opt/homelab/config/ha-diag-agent/.env`, chmod 600), but token = PLACEHOLDER.
|
||||
**Blocker**: chelsty-ha offline → no token and no connection.
|
||||
**To decide**: HA target, chelsty-ha vs HA Ken (`homeassistant5` on piha; from inside the container
|
||||
NOT `localhost`).
|
||||
**Before `shadow_mode=false`**: restart target in the supervisor = container name
|
||||
`homeassistant5`; curl of the HA endpoint with the token = HTTP 200.
|
||||
|
||||
---
|
||||
|
||||
### observer-poison-quarantine: branch review (`78c9e4a`)
|
||||
|
||||
**Date**: 2026-06-11
|
||||
**Source**: session. Codex's patch is preserved on `task/observer-poison-quarantine`, NOT in master.
|
||||
**To do**: verify whether the observer really hangs on a malformed event
|
||||
(poison was NOT the cause of the lustro outage; that unverified hypothesis was refuted by
|
||||
verify-before-fix). Real bug → merge; otherwise → drop the branch and worktree.
|
||||
|
||||
---
|
||||
|
||||
### node_agent.py: minor shipping cleanup
|
||||
|
||||
**Date**: 2026-06-11
|
||||
**Source**: lustro SSH shipping fix session
|
||||
1. **Stale comment** at `node_agent.py:546-548`: it claims the container "runs as root",
|
||||
which has been outdated since `user: "1000:1000"`.
|
||||
2. **Shipping success is logged at `logger.debug`** → raise it to `info` or add a counter.
|
||||
Working shipping is invisible in the logs at INFO level, which made diagnosis harder
|
||||
(a silent failure looked identical to silent operation).
|
||||
|
||||
---
|
||||
|
||||
### event-bloat: clean up the lustro backlog that landed on the VPS
|
||||
|
||||
**Date**: 2026-06-11
|
||||
**Source**: session. After the shipping fix, 7600+ backlog files flowed into
|
||||
`/opt/homelab/events/lustro/` on the VPS.
|
||||
**Fix**: delete the old files (the observer has already processed them); long term, add a retention policy
|
||||
to the event store.
|
||||
|
||||
---
|
||||
|
||||
### rsync `--omit-dir-times` (node-agent)
|
||||
|
||||
**Date**: 2026-06-09
|
||||
**Source**: fleet recovery session
|
||||
**Symptom**: rsync exit code 23 after every push. `set-times` on the `/opt/homelab/events/` directory
|
||||
returns EPERM (oskar does not own the directory; aerbot does). Files are copied correctly,
|
||||
but exit 23 clutters the logs and can mask real errors.
|
||||
**Fix**: add `--omit-dir-times` to the `rsync` call in `node_agent.py`.
|
||||
**Location**: `services/node-agent/src/node_agent.py`, the rsync call in the push loop.
|
||||
**Update 2026-06-11**: confirmed fleet-wide. Every node logs a false
|
||||
"Event shipping failed" (rsync code 23) each cycle even though the files go through; the
|
||||
`/opt/homelab/events/*` directories on the VPS belong to `aerbot`, so the client cannot set times on them.
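A minimal sketch of the proposed fix, assuming the push loop builds an argument list for `rsync`. The function names and the `vps:` destination alias are hypothetical; only the `--omit-dir-times` flag and the meaning of exit code 23 (partial transfer due to error) come from rsync itself.

```python
from typing import List


def build_rsync_cmd(src_dir: str, dest: str) -> List[str]:
    """Build an rsync argv that skips setting directory mtimes."""
    return [
        "rsync",
        "-a",
        # Do not try to set times on directories the client does not own.
        "--omit-dir-times",
        # Trailing slash: copy the directory contents, not the directory itself.
        src_dir.rstrip("/") + "/",
        dest,
    ]


def describe_rsync_exit(code: int) -> str:
    """Map the exit codes discussed in this entry to a log-friendly label."""
    if code == 0:
        return "ok"
    if code == 23:
        return "partial transfer due to error"
    return f"rsync failed with exit code {code}"


if __name__ == "__main__":
    # "vps:" is a hypothetical SSH host alias used only for illustration.
    print(build_rsync_cmd("/opt/homelab/events", "vps:/opt/homelab/events/"))
    print(describe_rsync_exit(23))
```

With the flag in place, a remaining exit code 23 would point at a real per-file problem rather than the directory timestamp noise described above.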
|
||||
|
||||
---
|
||||
|
||||
### Declarative `oskar ∈ aerbot` entry in the VPS manifest
|
||||
|
||||
**Date**: 2026-06-09
|
||||
**Source**: fleet recovery. Root cause: oskar was outside the aerbot(1000) group → rsync Permission denied
|
||||
**Problem**: group membership is managed by hand (`usermod -aG 1000 oskar`, ad hoc).
|
||||
Nothing guarantees it survives a VPS reinstall or a change of user.
|
||||
**Fix**: add a `users: oskar: groups: [aerbot]` section to `hosts/vps/host.yaml` or
|
||||
`hosts/vps/capabilities.yaml`, and enforce it in the VPS deploy/bootstrap script.
|
||||
Alternative: change the owner of `/opt/homelab/events/` to `oskar:oskar` and update
|
||||
the node-agent deploy scripts.
|
||||
|
||||
---
|
||||
|
||||
### One worktree per task (agent.sh)
|
||||
|
||||
**Date**: 2026-06-09
|
||||
**Source**: session. `homelab-codex-ws-node-onboarding` was used once for `task/node-onboarding`
|
||||
and once for `task/fix-event-bloat` through a manual `git checkout`.
|
||||
**Problem**: one worktree shared by two branches is an anti-pattern. `git branch`
|
||||
could point at the wrong branch; a `+` in the listing suggested "in another worktree", which was not true.
|
||||
This leads to commits on the wrong branch.
|
||||
**Fix**: enforce one task = one worktree (`agent.sh new <task-name>`). On entering
|
||||
a worktree, always run `git branch --show-current` and verify `.agent-task`.
|
||||
Long term: `agent.sh new` should refuse if the requested branch is already checked out.
|
||||
|
||||
---
|
||||
|
||||
## Closed
|
||||
|
||||
### Observer staleness: dead node shown as NOMINAL
|
||||
|
||||
**Date**: 2026-06-08 (caught), status: OPEN as far as implementation goes
|
||||
**Problem**: observer/supervisor keeps the last known state; there is no heartbeat TTL.
|
||||
Chelsty-infra is silent, yet it still shows NOMINAL, which undermines trust in the panel.
|
||||
**Fix**: heartbeat TTL → once exceeded, mark the status `stale` or `down`.
|
||||
**Related**: brain-watchdog is blind to per-node freshness.
|
||||
*(Open as an implementation TODO; carried over from the 2026-06-08 session)*
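The heartbeat TTL idea can be sketched as a pure function over the last-seen timestamp. This is a hedged illustration, not the observer's implementation: the function name, the `nominal` label in lowercase, and both thresholds (180 s and 900 s) are assumptions chosen only for the example.

```python
import time
from typing import Optional


def node_status(
    last_seen: float,
    ttl_seconds: float = 180.0,         # assumed freshness window
    down_after_seconds: float = 900.0,  # assumed cutoff for "down"
    now: Optional[float] = None,
) -> str:
    """Classify a node by the age of its last heartbeat."""
    current = time.time() if now is None else now
    age = current - last_seen
    if age <= ttl_seconds:
        return "nominal"
    if age <= down_after_seconds:
        return "stale"
    return "down"


if __name__ == "__main__":
    # A node last seen 5 minutes ago is no longer reported as nominal.
    print(node_status(last_seen=1000.0, now=1300.0))
```

Passing `now` explicitly keeps the function deterministic, which makes the TTL boundaries easy to unit test.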
|
||||
|
|
@ -83,10 +83,3 @@ Future autonomous agents will use this metadata to:
|
|||
2. **Generate Plans:** Create step-by-step deployment or migration plans based on hardware compatibility.
|
||||
3. **Validate Topology:** Ensure that a proposed multi-node setup doesn't violate networking or operational constraints (e.g., don't put a DB on an intermittent node).
|
||||
4. **Propose Failover:** Automatically suggest the best alternative node during an outage.
|
||||
|
||||
## Agent Reasoning Logic
|
||||
|
||||
When an agent parses `capabilities.yaml`, it should apply these heuristics:
|
||||
- **Intermittent Connectivity**: If `operational.connectivity == "intermittent"`, do not schedule high-bandwidth syncs or critical cloud-dependent services.
|
||||
- **Power Constraints**: If `operational.power_constraint == "low-power"`, avoid heavy LLM inference or continuous high-CPU tasks.
|
||||
- **Availability Target**: If `availability_target == "high"`, this node is a candidate for hosting control-plane failovers.
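The three heuristics above could be expressed as a small function over an already-parsed `capabilities.yaml`. This is a sketch under assumptions: the function name and the hint strings are hypothetical, and `availability_target` is read from the top level because that is how the bullet spells it.

```python
from typing import Any, Dict, List


def placement_hints(capabilities: Dict[str, Any]) -> List[str]:
    """Derive scheduling hints from a parsed capabilities mapping."""
    operational = capabilities.get("operational", {})
    hints: List[str] = []
    if operational.get("connectivity") == "intermittent":
        # No high-bandwidth syncs or critical cloud-dependent services.
        hints.append("avoid-high-bandwidth-sync")
        hints.append("avoid-critical-cloud-dependent-services")
    if operational.get("power_constraint") == "low-power":
        # No heavy LLM inference or continuous high-CPU tasks.
        hints.append("avoid-heavy-compute")
    if capabilities.get("availability_target") == "high":
        hints.append("control-plane-failover-candidate")
    return hints


if __name__ == "__main__":
    edge_node = {
        "operational": {"connectivity": "intermittent", "power_constraint": "low-power"},
        "availability_target": "medium",
    }
    print(placement_hints(edge_node))
```

Returning hints instead of making a placement decision keeps the function usable by both a planner and a validator.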
|
||||
|
|
|
|||
|
|
@ -1,154 +1,60 @@
|
|||
# CHELSTY Runtime
|
||||
|
||||
This document describes the runtime environment and deployment flow for CHELSTY, an offline-capable home automation edge node split across two VMs.
|
||||
|
||||
| Node | Role | Services |
|
||||
|------|------|----------|
|
||||
| `chelsty-infra` | LTE edge hypervisor | Mosquitto, Zigbee2MQTT, stability-agent, node-agent |
|
||||
| `chelsty-ha` | Home Assistant VM | homeassistant (no node-agent — see below) |
|
||||
|
||||
Both nodes share an LTE uplink and must function fully offline (Zigbee, MQTT, HA automations) without any connectivity to SATURN, VPS, or Forgejo.
|
||||
This document describes the runtime environment and deployment flow for CHELSTY, an offline-capable home automation edge node.
|
||||
|
||||
## Runtime Layout
|
||||
|
||||
```
|
||||
/opt/homelab/
|
||||
├── config/ # Service-specific configs and secrets (not in Git)
|
||||
│ ├── mosquitto/
|
||||
│ └── zigbee2mqtt/
|
||||
├── data/ # Persistent service data
|
||||
│ ├── mosquitto/ # Persistence DB, password file
|
||||
│ └── zigbee2mqtt/
|
||||
│ └── data/ # z2m config, coordinator backup, network key
|
||||
└── logs/
|
||||
```
|
||||
The CHELSTY runtime is located at `/opt/homelab`.
|
||||
|
||||
- `/opt/homelab/config/`: Service-specific configurations and compose overrides.
|
||||
- `/opt/homelab/data/`: Persistent data for services.
|
||||
- `/opt/homelab/logs/`: Service logs.
|
||||
|
||||
### Key Service Locations
|
||||
- **Mosquitto**: `/opt/homelab/config/mosquitto/`
|
||||
- **Zigbee2MQTT**: `/opt/homelab/config/zigbee2mqtt/`
|
||||
|
||||
## SLZB-06U Integration
|
||||
|
||||
CHELSTY uses a SMLIGHT SLZB-06U Zigbee coordinator connected over Ethernet/TCP.
|
||||
CHELSTY uses a SMLIGHT SLZB-06U Zigbee coordinator connected via Ethernet/TCP.
|
||||
|
||||
- **Coordinator IP**: `192.168.1.105`
|
||||
- **Port**: `6638`
|
||||
- **Adapter**: `ezsp` (deprecated — migration to `ember` recommended, requires only changing `adapter: ember` in `configuration.yaml`)
|
||||
- **Zigbee2MQTT config key**: `serial.port: tcp://192.168.1.105:6638`
|
||||
- **Coordinator IP**: 192.168.1.105
|
||||
- **Port**: 6638
|
||||
- **Protocol**: TCP (ezsp adapter)
|
||||
|
||||
⚠️ Never use `/dev/ttyUSB0` — the coordinator is always TCP-only on this site.
|
||||
Zigbee2MQTT is configured to connect to this coordinator over the local network.
|
||||
|
||||
## Networking Constraints
|
||||
## Offline & LTE Assumptions
|
||||
|
||||
### Mosquitto — `network_mode: host`
|
||||
Mosquitto runs with `network_mode: host` so that all containers on the same host can reach it at `localhost:1883`. **Do not change this.**
|
||||
|
||||
### Zigbee2MQTT — bridge network + extra_hosts
|
||||
Zigbee2MQTT runs in a bridge-networked container (needed for port mapping compatibility with docker-compose v1). To reach the host-networked Mosquitto:
|
||||
|
||||
```yaml
|
||||
# hosts/chelsty-infra/runtime/zigbee2mqtt/docker-compose.override.yml
|
||||
services:
|
||||
zigbee2mqtt:
|
||||
extra_hosts:
|
||||
- "mosquitto:host-gateway"
|
||||
```
|
||||
|
||||
This maps the `mosquitto` hostname inside the z2m container to the Docker host gateway IP, so `mqtt://mosquitto:1883` reaches the host-networked Mosquitto process.
|
||||
|
||||
**Why not `network_mode: host` for z2m?**
|
||||
chelsty-infra runs docker-compose v1 (1.29.2). In v1, `network_mode: host` cannot coexist with `ports:` declared in the base `docker-compose.yml` — raises `InvalidArgument`. The `extra_hosts` approach avoids this.
|
||||
|
||||
## Zigbee2MQTT Config Location
|
||||
|
||||
The `configuration.yaml` **must be writable** — z2m migrates and rewrites it on startup. It lives in the data directory:
|
||||
|
||||
```
|
||||
/opt/homelab/data/zigbee2mqtt/data/configuration.yaml
|
||||
```
|
||||
|
||||
This path is mounted read-write by the base `docker-compose.yml`:
|
||||
```yaml
|
||||
volumes:
|
||||
- /opt/homelab/data/zigbee2mqtt/data:/app/data
|
||||
```
|
||||
|
||||
Do **not** mount `configuration.yaml` as a separate `:ro` volume — z2m will fail with `EROFS`.
|
||||
|
||||
### Minimal configuration.yaml
|
||||
```yaml
|
||||
homeassistant: true
|
||||
permit_join: false
|
||||
mqtt:
|
||||
base_topic: zigbee2mqtt
|
||||
server: mqtt://mosquitto:1883
|
||||
serial:
|
||||
port: tcp://192.168.1.105:6638
|
||||
adapter: ezsp
|
||||
frontend:
|
||||
port: 8080
|
||||
advanced:
|
||||
log_level: info
|
||||
```
|
||||
|
||||
## chelsty-ha — No node-agent
|
||||
|
||||
`chelsty-ha` does not have a node-agent deployed. Home Assistant is monitored indirectly: if MQTT goes silent on `chelsty-infra`, HA is likely down.
|
||||
|
||||
In `hosts/chelsty-ha/services.yaml`:
|
||||
```yaml
|
||||
services:
|
||||
homeassistant:
|
||||
monitor: false # No node-agent; suppresses supervisor action generation
|
||||
```
|
||||
|
||||
Remove `monitor: false` once node-agent is bootstrapped on this VM.
|
||||
- **WAN Resilience**: All core automation (Zigbee, MQTT) runs locally on CHELSTY.
|
||||
- **Connectivity**: LTE provides intermittent uplink for remote management and Tailscale access.
|
||||
- **Home Assistant**: Runs in a separate VM, connecting to the Mosquitto broker on CHELSTY.
|
||||
|
||||
## Deployment Flow
|
||||
|
||||
### Initial Bootstrap
|
||||
```bash
|
||||
./scripts/bootstrap/chelsty-runtime.sh
|
||||
```
|
||||
1. **Initial Bootstrap**:
|
||||
Run the bootstrap script on the CHELSTY node:
|
||||
```bash
|
||||
./scripts/bootstrap/chelsty-runtime.sh
|
||||
```
|
||||
|
||||
### Deploy services
|
||||
```bash
|
||||
./scripts/deploy/deploy-node.sh chelsty-infra
|
||||
./scripts/deploy/deploy-node.sh chelsty-ha
|
||||
```
|
||||
2. **Manual Configuration**:
|
||||
- Edit `/opt/homelab/config/zigbee2mqtt/.env` with MQTT credentials.
|
||||
- Add Mosquitto user:
|
||||
```bash
|
||||
sudo mosquitto_passwd -b /opt/homelab/data/mosquitto/config/password.txt <user> <password>
|
||||
```
|
||||
|
||||
### Manual (SSH) — chelsty-infra uses docker-compose v1
|
||||
```bash
|
||||
ssh oskar@100.122.201.22
|
||||
cd ~/homelab-codex-ws/services/<service>
|
||||
docker-compose -f docker-compose.yml \
|
||||
-f ../../hosts/chelsty-infra/runtime/<service>/docker-compose.override.yml \
|
||||
up -d --build --force-recreate
|
||||
```
|
||||
3. **Service Deployment**:
|
||||
Use the staged deployment runtime:
|
||||
```bash
|
||||
./scripts/deploy/deploy-node.sh chelsty
|
||||
```
|
||||
|
||||
> **Note:** `docker compose` (v2) is **not** available on chelsty-infra — always use `docker-compose` (hyphenated, v1 1.29.2).
|
||||
## Recovery Procedure
|
||||
|
||||
## Recovery Procedures
|
||||
|
||||
### Mosquitto stopped
|
||||
```bash
|
||||
ssh oskar@100.122.201.22 "docker start mosquitto"
|
||||
# Ensure restart policy is correct:
|
||||
docker update --restart unless-stopped mosquitto
|
||||
```
|
||||
|
||||
### Zigbee2MQTT won't start
|
||||
1. Check logs: `docker logs zigbee2mqtt --tail 50`
|
||||
2. Verify SLZB-06U reachable from host: `nc -zv 192.168.1.105 6638`
|
||||
3. Verify config is not empty: `cat /opt/homelab/data/zigbee2mqtt/data/configuration.yaml`
|
||||
4. If config missing, recreate from the minimal template above
|
||||
|
||||
### SLZB-06U unreachable
|
||||
`192.168.1.105:6638` EHOSTUNREACH means the coordinator is offline or the LAN is down. Zigbee2MQTT will keep retrying — no restart needed once the coordinator returns.
|
||||
|
||||
## Critical Backup Sets
|
||||
|
||||
| Data | Path |
|
||||
|------|------|
|
||||
| HA config + DB | `/opt/homelab/data/homeassistant/` on chelsty-ha |
|
||||
| z2m config + coordinator backup + network key | `/opt/homelab/data/zigbee2mqtt/data/` |
|
||||
| Mosquitto persistence + password file | `/opt/homelab/data/mosquitto/` |
|
||||
| SLZB-06U coordinator state | Backup via SLZB-06U web UI at `192.168.1.105` |
|
||||
|
||||
> ⚠️ The Zigbee network key is in `configuration.yaml` or `coordinator_backup.json` — losing it requires re-pairing all devices.
|
||||
In case of runtime failure:
|
||||
1. Verify Docker and Compose plugin: `docker compose version`
|
||||
2. Re-run bootstrap script to ensure directory structure and basic configs.
|
||||
3. Check Mosquitto logs: `tail -f /opt/homelab/data/mosquitto/log/mosquitto.log`
|
||||
4. Verify SLZB-06U reachability: `ping 192.168.1.105`
|
||||
|
|
|
|||
|
|
@ -1,42 +0,0 @@
|
|||
### CHELSTY Stability Agent
|
||||
|
||||
The stability-agent on CHELSTY provides local observability and health monitoring for the node's services and infrastructure.
|
||||
|
||||
#### Purpose
|
||||
|
||||
It acts as a filesystem-first watchdog that detects anomalies in the local runtime environment without taking autonomous destructive actions (like restarts). It serves as the primary data source for node-level stability metrics.
|
||||
|
||||
#### Monitoring Scope
|
||||
|
||||
* **Docker Containers**: Monitors all local containers. If a container is not in the `running` state, a `containers_not_running` event is generated.
|
||||
* **Disk Usage**: Monitors the root filesystem. Generates `disk_usage_high` events if usage exceeds the configured threshold.
|
||||
* **Connectivity**:
|
||||
* Checks if the Tailscale socket or interface is available.
|
||||
* Checks reachability of the local Mosquitto MQTT broker.
|
||||
* **Zigbee2MQTT**: Specifically tracks the presence and status of the Zigbee2MQTT service.
|
||||
|
||||
#### Storage and Integration
|
||||
|
||||
* **Heartbeat**: Updated every cycle at `/opt/homelab/state/stability-agent.heartbeat`.
|
||||
* **State Summary**: A JSON summary of all latest checks at `/opt/homelab/state/stability-agent.json`.
|
||||
* **Events**: Append-only JSON lines at `/opt/homelab/events/YYYY-MM-DD/chelsty-infra/events.jsonl`.
|
||||
|
||||
#### Deployment
|
||||
|
||||
The service is deployed via Docker Compose on CHELSTY.
|
||||
|
||||
```bash
|
||||
cd services/stability-agent
|
||||
docker compose up -d
|
||||
```
|
||||
|
||||
#### Configuration
|
||||
|
||||
Configuration is managed via environment variables in `docker-compose.override.yml` on the host.
|
||||
|
||||
| Variable | Description | Default |
|
||||
|----------|-------------|---------|
|
||||
| `STABILITY_CHECK_INTERVAL` | Seconds between checks | `60` |
|
||||
| `DISK_THRESHOLD_PCT` | Disk usage alert threshold | `90` |
|
||||
| `MQTT_HOST` | MQTT broker hostname | `mosquitto` |
|
||||
| `MQTT_PORT` | MQTT broker port | `1883` |
|
||||
|
|
@ -7,92 +7,57 @@ The Observer Runtime is a lightweight agent responsible for synthesizing the ope
|
|||
The observer follows a filesystem-first approach, consuming append-only events and generating a normalized world model. It is designed to be idempotent, resumable, and resilient to intermittent node connectivity.
|
||||
|
||||
### Inputs
|
||||
- `/opt/homelab/events/`: Normalized JSON events (one `.json` file per event, organized by date and node).
|
||||
- `/opt/homelab/state/observer_checkpoint.json`: Per-node checkpoint dict (see below).
|
||||
- `/opt/homelab/events/`: Normalized JSON events.
|
||||
- `/opt/homelab/state/`: Deployment stage markers and internal observer checkpoint.
|
||||
- `/opt/homelab/logs/`: Detailed execution logs and diagnostics.
|
||||
- Repository Inventory: `inventory/topology.yaml` and `hosts/*/services.yaml`.
|
||||
|
||||
### World Model Output
|
||||
Generated under `/opt/homelab/world/`:
|
||||
- `nodes.json`: Current node availability, roles, disk/memory pressure, last seen timestamps. Dict keyed by node name.
|
||||
- `services.json`: Service health status and links to active incidents. Dict keyed by `"node/service"`.
|
||||
- `nodes.json`: Current node availability, roles, and last seen timestamps.
|
||||
- `services.json`: Service health status and links to active incidents.
|
||||
- `deployments.json`: Tracking of active and historical deployment runs by `correlation_id`.
|
||||
- `incidents.json`: Correlated operational issues, including repeat failures and resolution status.
|
||||
- `runtime-summary.json`: High-level overview for dashboards and planner agents.
|
||||
|
||||
## Checkpoint Format
|
||||
|
||||
The observer tracks per-node progress to avoid silently skipping event directories:
|
||||
|
||||
```json
|
||||
{
|
||||
"node_checkpoints": {
|
||||
"vps": "/opt/homelab/events/2026-05-27/vps/evt-vps-1234.json",
|
||||
"piha": "/opt/homelab/events/2026-05-27/piha/evt-piha-5678.json",
|
||||
"chelsty-infra": "/opt/homelab/events/2026-05-27/chelsty-infra/evt-chelsty-infra-9012.json"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
A single global checkpoint (`last_processed_file`) was replaced with this per-node dict because the old approach silently skipped any node directory that sorts alphabetically before the last-seen node (e.g. `piha/` would be skipped when the checkpoint pointed to `vps/`).
|
||||
|
||||
**Reset:** Delete `/opt/homelab/state/observer_checkpoint.json`. The observer will reprocess all events and rebuild world state from scratch.
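The difference between the old global checkpoint and the per-node dict can be reproduced in a few lines. This is an illustrative sketch: the function names are hypothetical, and it assumes that event paths are compared as plain strings and that file names within one node sort in arrival order.

```python
from typing import Dict, List


def node_of(path: str) -> str:
    """Extract <node> from .../events/<date>/<node>/<file>.json."""
    return path.split("/")[-2]


def pending_global(paths: List[str], last_processed_file: str) -> List[str]:
    """Old behavior: one checkpoint for every node."""
    return [p for p in sorted(paths) if p > last_processed_file]


def pending_per_node(paths: List[str], node_checkpoints: Dict[str, str]) -> List[str]:
    """New behavior: each node is compared against its own checkpoint."""
    return [p for p in sorted(paths) if p > node_checkpoints.get(node_of(p), "")]


if __name__ == "__main__":
    base = "/opt/homelab/events/2026-05-27"
    paths = [f"{base}/piha/evt-piha-2.json", f"{base}/vps/evt-vps-1.json"]
    # The late piha event sorts before the vps checkpoint, so it is skipped.
    print(pending_global(paths, f"{base}/vps/evt-vps-1.json"))
    checkpoints = {
        "vps": f"{base}/vps/evt-vps-1.json",
        "piha": f"{base}/piha/evt-piha-1.json",
    }
    print(pending_per_node(paths, checkpoints))
```

A node with no checkpoint entry falls back to the empty string, so all of its events are treated as unprocessed.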
|
||||
|
||||
## Event Types
|
||||
|
||||
### Negative events (create/escalate incidents)
|
||||
- `service_unhealthy`, `healthcheck_failed` — open or increment an active incident
|
||||
- `deployment_failed` — record failure in deployments.json
|
||||
|
||||
### Positive events (resolve state)
|
||||
- `service_healthy` — marks service status as `healthy` **and** resolves any active incident for that service
|
||||
- `service_recovered` — alias, same effect
|
||||
- `deployment_completed` — marks deployment as completed
|
||||
|
||||
### Node events
|
||||
- `node_online`, `node_offline` — update node status in nodes.json
|
||||
- `disk_pressure_*` — set `disk_pressure` field on the node record
|
||||
|
||||
## Incident Lifecycle
|
||||
|
||||
1. **Detection**: A `service_unhealthy` or `healthcheck_failed` event creates or increments an active incident.
|
||||
2. **Correlation**: Multiple failure events for the same `node/service` are collapsed into one incident, incrementing `occurrence_count`.
|
||||
3. **Resolution**: A `service_healthy` or `service_recovered` event resolves any active incident for that service, setting `status: resolved` and `resolved_at`.
|
||||
4. **Expiry**: Resolved incidents older than 7 days are pruned from world state by `_prune_stale_world()`.
|
||||
The observer implements lightweight incident correlation:
|
||||
|
||||
1. **Detection**: When a `service_unhealthy` or `healthcheck_failed` event is consumed, a new incident is created or an existing active incident for that service is updated.
|
||||
2. **Correlation**: Multiple failure events for the same service on the same node are collapsed into a single incident, tracking the `occurrence_count`.
|
||||
3. **Diagnostics**: Deployment failures (`deployment_failed`) automatically attach references to diagnostic files if present in the event payload.
|
||||
4. **Resolution**: A `service_recovered` event for a service will transition any active incidents for that service to a `resolved` state.
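The detection, correlation, and resolution steps can be sketched as one function that folds events into an incident dict. The event field names (`type`, `node`, `service`, `timestamp`) and the function name are assumptions; the incident ID shape follows the `inc-<timestamp>-<node>-<service>` pattern seen in the example incident.

```python
from typing import Any, Dict

FAILURE_EVENTS = {"service_unhealthy", "healthcheck_failed"}
RECOVERY_EVENTS = {"service_healthy", "service_recovered"}


def apply_event(incidents: Dict[str, Dict[str, Any]], event: Dict[str, Any]) -> None:
    """Open, increment, or resolve the incident for one node/service pair."""
    node, service, ts = event["node"], event["service"], event["timestamp"]
    active = next(
        (
            inc
            for inc in incidents.values()
            if inc["node"] == node and inc["service"] == service and inc["status"] == "active"
        ),
        None,
    )
    if event["type"] in FAILURE_EVENTS:
        if active is None:
            # Detection: first failure opens a new incident.
            inc_id = f"inc-{int(ts)}-{node}-{service}"
            incidents[inc_id] = {
                "id": inc_id,
                "node": node,
                "service": service,
                "status": "active",
                "started_at": ts,
                "last_occurrence": ts,
                "occurrence_count": 1,
            }
        else:
            # Correlation: repeat failures collapse into the same incident.
            active["occurrence_count"] += 1
            active["last_occurrence"] = ts
    elif event["type"] in RECOVERY_EVENTS and active is not None:
        # Resolution: a positive event closes the active incident.
        active["status"] = "resolved"
        active["resolved_at"] = ts


if __name__ == "__main__":
    state: Dict[str, Dict[str, Any]] = {}
    for kind, ts in [("service_unhealthy", 1715518800.0), ("service_healthy", 1715519100.0)]:
        apply_event(state, {"type": kind, "node": "vps", "service": "observer", "timestamp": ts})
    print(state)
```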
|
||||
|
||||
### Example Incident JSON
|
||||
```json
|
||||
{
|
||||
"inc-1715518800-vps-observer": {
|
||||
"id": "inc-1715518800-vps-observer",
|
||||
"node": "vps",
|
||||
"service": "observer",
|
||||
"inc-1715518800-saturn-mosquitto": {
|
||||
"id": "inc-1715518800-saturn-mosquitto",
|
||||
"node": "saturn",
|
||||
"service": "mosquitto",
|
||||
"status": "resolved",
|
||||
"severity": "error",
|
||||
"started_at": 1715518800.0,
|
||||
"last_occurrence": 1715518860.0,
|
||||
"started_at": "2026-05-12T12:05:00Z",
|
||||
"last_occurrence": "2026-05-12T12:06:00Z",
|
||||
"occurrence_count": 2,
|
||||
"trigger_type": "containers_not_running",
|
||||
"resolved_at": 1715519100.0
|
||||
"events": [
|
||||
"2026-05-12T12:05:00Z",
|
||||
"2026-05-12T12:06:00Z"
|
||||
],
|
||||
"correlation_id": "hc-1",
|
||||
"resolved_at": "2026-05-12T12:10:00Z"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## World State Pruning
|
||||
|
||||
`_prune_stale_world()` runs every reconcile cycle and removes:
|
||||
|
||||
1. **Stale nodes** — nodes not present in `inventory/topology.yaml` (e.g. ghost nodes created when `NODE_NAME` was unset and fell back to the container's 12-char hex ID).
|
||||
2. **Services of stale nodes** — all `node/service` keys whose node was pruned.
|
||||
3. **Ghost service keys** — service keys whose service-name portion matches the pattern `<12hexchars>_<name>` (Docker internal stale-state artifacts, created when node-agent used `c.name` instead of the compose label).
|
||||
4. **Expired incidents** — resolved incidents older than 7 days.
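The ghost service key rule lends itself to a regular expression. A minimal sketch, assuming keys have the `node/service` shape and that the 12-character prefix is lowercase hex; the function name is hypothetical and the real `_prune_stale_world()` may match differently.

```python
import re

# Twelve hex characters, an underscore, then the original service name.
GHOST_SERVICE_RE = re.compile(r"^[0-9a-f]{12}_.+$")


def is_ghost_service_key(key: str) -> bool:
    """Return True when the service part of "node/service" looks like a stale artifact."""
    _node, _sep, service = key.partition("/")
    return bool(GHOST_SERVICE_RE.match(service))


if __name__ == "__main__":
    print(is_ghost_service_key("vps/0123456789ab_redis"))
    print(is_ghost_service_key("vps/redis"))
```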
|
||||
|
||||
## Runtime Behavior
|
||||
|
||||
### Idempotency
|
||||
The observer processes events in order. Deleting the checkpoint and restarting replays all events and produces the same world state.
|
||||
The observer processes events in order. If the world state is lost, deleting the checkpoint file (`/opt/homelab/state/observer_checkpoint.json`) will cause the observer to re-process all events and rebuild the world state.
|
||||
|
||||
### Resumability
|
||||
The observer tracks the last processed event file in its checkpoint. Upon restart, it continues from the next available event.
|
||||
|
||||
### Deployment Tracking
|
||||
Deployments are tracked via `correlation_id`. The observer synthesizes the start, end, and status of each deployment run from events.
|
||||
|
||||
### Topology Filtering
|
||||
Events from nodes not listed in `inventory/topology.yaml` are discarded during pruning. This prevents transient bootstrap noise from polluting world state.
|
||||
Deployments are tracked via `correlation_id`. The observer synthesizes the start, end, and status of each deployment run, providing a clear history of changes to the environment.
|
||||
|
|
|
|||
|
|
@ -1,234 +0,0 @@
|
|||
# SESSION: Building planner-agent, LLM-based diagnostics
|
||||
|
||||
**DATE:** 2026-05-27
|
||||
**RESULT:** planner-agent runs on SOLARIA (`healthy`), Ollama is primary, cloud fallback is ready to be enabled
|
||||
|
||||
---
|
||||
|
||||
## What was built
|
||||
|
||||
### `services/planner-agent/src/llm_router.py`
|
||||
|
||||
LLM routing module with a local-first fallback chain:
|
||||
|
||||
- **`LLMRouter`**: main routing class, built on litellm
|
||||
- **`ModelConfig`**: configuration of a single model (name, timeout, api_base, extra_kwargs)
|
||||
- **`ModelMetrics`**: counters per model × outcome (`success`/`fallback`/`error`); success_rate
|
||||
- **`RouteResult`**: routing result with `content`, `model_used`, `attempts`, `latency_ms`
|
||||
- **`AttemptRecord`**: record of a single attempt (model, outcome, reason, latency_ms)
|
||||
- **`_extract_json_from_fence()`**: extracts JSON from ` ```json ``` ` blocks when the model does not reply with clean JSON
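A helper in the spirit of `_extract_json_from_fence()` might look like the sketch below. It is not the module's actual code: the function name `extract_json` and the exact regular expression are assumptions, and the fence marker is assembled at runtime only so the example can sit inside a fenced block.

```python
import json
import re
from typing import Any, Optional

FENCE = "`" * 3  # three backticks, built at runtime
_FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def extract_json(text: str) -> Optional[Any]:
    """Parse text as JSON, falling back to the first fenced block."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    match = _FENCE_RE.search(text)
    if match is None:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None


if __name__ == "__main__":
    reply = "Here is my proposal:\n" + FENCE + 'json\n{"action": "restart"}\n' + FENCE
    print(extract_json(reply))
```

Returning `None` rather than raising lets the caller treat an unparseable reply as one more reason to move to the next model.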
|
||||
|
||||
Default chain: `ollama/qwen2.5:7b` (8s) → `claude-haiku-4-5-20251001` (30s) → `claude-sonnet-4-6` (30s)
|
||||
|
||||
Metrics for every call are published on the Redis channel `llm_router_metrics`.
|
||||
|
||||
### `services/planner-agent/src/planner.py`
|
||||
|
||||
Main agent loop:
|
||||
|
||||
- **`PlannerAgent`**: async agent: Redis sub → LLM diagnosis → pending action file → event
|
||||
- **`HealthEvent`**: normalized health event from Redis (node, service, event_type, severity, payload)
|
||||
- **`ActionProposal`**: action proposal with full metadata; `.to_action_file()` → executor format
|
||||
- **`CooldownTracker`**: 5-minute gate per `svc_key` (node/service); does NOT register when the LLM failed
|
||||
- **`parse_event()`**: normalizes two input formats (node-agent / control-plane)
|
||||
- **`write_pending_action()`**: atomic write: `.tmp` → rename
|
||||
- **`emit_event()`**: writes a `remediation_started` event to the filesystem (no imports from control-plane)
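The `.tmp` → rename pattern named for `write_pending_action()` can be sketched as follows. This is an illustration only: the function name with the `_sketch` suffix, the file naming, and the `fsync` call are assumptions rather than the planner's actual code.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Any, Dict


def write_pending_action_sketch(pending_dir: Path, action_id: str, action: Dict[str, Any]) -> Path:
    """Write an action file so readers never observe a partial file."""
    pending_dir.mkdir(parents=True, exist_ok=True)
    final_path = pending_dir / f"{action_id}.json"
    tmp_path = pending_dir / f"{action_id}.json.tmp"
    with open(tmp_path, "w", encoding="utf-8") as handle:
        json.dump(action, handle)
        handle.flush()
        os.fsync(handle.fileno())  # push the bytes to disk before the rename
    # Same directory, same filesystem: the rename is atomic.
    os.replace(tmp_path, final_path)
    return final_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = write_pending_action_sketch(Path(tmp) / "pending", "demo", {"action": "restart"})
        print(path.name)
```

Writing the temporary file next to its final location matters, because a rename across filesystems is not atomic.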
|
||||
|
||||
Pipeline:
|
||||
```
|
||||
Redis msg → parse_event() → benign skip → cooldown gate → _propose_action() (LLM)
|
||||
→ write_pending_action() → emit_event("remediation_started")
|
||||
```
|
||||
|
||||
### Companion files
|
||||
|
||||
| File | Description |
|
||||
|------|------|
|
||||
| `service.yaml` | Operational contract: owner_node=solaria, deps=redis+ollama, healthcheck=file |
|
||||
| `docker-compose.yml` | env_file + extra_hosts:host-gateway + ANTHROPIC_API_KEY in environment |
|
||||
| `Dockerfile` | python:3.11-slim, litellm, redis, jsonschema, structlog |
|
||||
| `healthcheck.sh` | Checks the age of the heartbeat file (max 300s) |
|
||||
| `requirements.txt` | litellm, redis, jsonschema, structlog |
|
||||
| `tests/test_planner.py` | 49 unit tests |
|
||||
| `tests/test_llm_router.py` | 34 unit tests |
|
||||
|
||||
---
|
||||
|
||||
## Key architectural decisions

### 1. HITL invariant (human-in-the-loop)

The planner writes **only** to `actions/pending/`. The executor requires a file in `actions/approved/`.
The planner never executes an action on its own; this is a fundamental rule of the system.

Implementation: `write_pending_action()` writes to `pending/`, and no code path touches `approved/`.
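The `.tmp` → rename write into `pending/` can be sketched as follows. The signature, the file naming, and the `fsync` call are assumptions for illustration; the real `write_pending_action()` may differ.

```python
import json
import os
import tempfile
from pathlib import Path


def write_pending_action(pending_dir: Path, action_id: str, payload: dict) -> Path:
    """Write the action to a .tmp file, then rename it into place."""
    pending_dir.mkdir(parents=True, exist_ok=True)
    final_path = pending_dir / f"{action_id}.json"
    tmp_path = pending_dir / f"{action_id}.json.tmp"
    with open(tmp_path, "w", encoding="utf-8") as fh:
        json.dump(payload, fh)
        fh.flush()
        os.fsync(fh.fileno())  # make sure the bytes are on disk before the rename
    # os.replace is atomic when source and target are on the same filesystem,
    # so a reader never sees a half-written action file.
    os.replace(tmp_path, final_path)
    return final_path


with tempfile.TemporaryDirectory() as root:
    path = write_pending_action(Path(root) / "pending", "act-001", {"action": "container_restart"})
    print(path.name, json.loads(path.read_text(encoding="utf-8")))
```

Because the sketch only ever builds paths under the directory it is given, pointing it at `pending/` keeps it away from `approved/`, in line with the HITL invariant.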
|
||||
|
||||
### 2. Cooldown gate

Per `svc_key` (= `node/service`), 5 minutes by default. Goal: do not flood the operator with repeated proposals for the same service.

**Key decision:** the cooldown is NOT registered if the whole LLM chain failed.
That way the next event can try again, instead of being silently blocked for 5 minutes even though no proposal was produced.
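A minimal sketch of that gate, assuming a simple in-memory tracker. The method names, the injectable clock, and the `handle_event` helper are hypothetical and not taken from `planner.py`.

```python
import time


class CooldownTracker:
    """Per-service gate; method names here are illustrative assumptions."""

    def __init__(self, cooldown_s: float = 300.0, clock=time.monotonic):
        self._cooldown_s = cooldown_s
        self._clock = clock
        self._last = {}  # svc_key -> time of the last registered proposal

    def is_cooling_down(self, svc_key: str) -> bool:
        last = self._last.get(svc_key)
        return last is not None and self._clock() - last < self._cooldown_s

    def register(self, svc_key: str) -> None:
        self._last[svc_key] = self._clock()


def handle_event(tracker: CooldownTracker, svc_key: str, propose):
    """`propose` returns a proposal, or None when the whole LLM chain failed."""
    if tracker.is_cooling_down(svc_key):
        return None
    proposal = propose()
    if proposal is not None:
        # Register only on success, so a failed chain does not block retries.
        tracker.register(svc_key)
    return proposal


fake_now = [0.0]
tracker = CooldownTracker(300.0, clock=lambda: fake_now[0])
print(handle_event(tracker, "piha/mosquitto", lambda: None))       # chain failed
print(handle_event(tracker, "piha/mosquitto", lambda: "restart"))  # retried at once
```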
|
||||
|
||||
### 3. Fallback chain: local-first

Order: Ollama (local GPU) → Haiku → Sonnet.

Rationale:
- Ollama sends no data to external services and has low latency for simple cases
- Haiku = fast and cheap cloud fallback
- Sonnet = last resort for hard cases

A model is rejected on: timeout, network error, refusal pattern, invalid JSON, or schema error.
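The rejection rules can be sketched with plain callables standing in for the models. This does not use litellm and is not the `LLMRouter` implementation. The outcome labels and the "must contain an `action` key" schema check are assumptions, and refusal-pattern detection is left out for brevity.

```python
import json
from typing import Callable, List, Tuple

ModelCall = Callable[[str], str]


def route(chain: List[Tuple[str, ModelCall]], prompt: str):
    """Try each model in order; return (parsed, model_used, attempts)."""
    attempts: List[Tuple[str, str]] = []
    for name, call in chain:
        try:
            parsed = json.loads(call(prompt))
        except TimeoutError:  # checked before OSError, its parent class
            attempts.append((name, "timeout"))
            continue
        except OSError:
            attempts.append((name, "network_error"))
            continue
        except json.JSONDecodeError:
            attempts.append((name, "invalid_json"))
            continue
        # Stand-in for a real schema validation step.
        if not isinstance(parsed, dict) or "action" not in parsed:
            attempts.append((name, "schema_error"))
            continue
        attempts.append((name, "success"))
        return parsed, name, attempts
    return None, None, attempts


def local_model(prompt: str) -> str:
    raise TimeoutError("time budget exceeded")


def cloud_small(prompt: str) -> str:
    return "I cannot help with that."


def cloud_large(prompt: str) -> str:
    return '{"action": "container_restart"}'


result = route(
    [("local", local_model), ("cloud-small", cloud_small), ("cloud-large", cloud_large)],
    "diagnose piha/mosquitto",
)
print(result)
```

When every model is rejected the function returns `(None, None, attempts)`, which is the case where the planner skips registering a cooldown.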
|
||||
|
||||
### 4. No imports from control-plane

`services/planner-agent/` is fully self-contained. It imports nothing from `services/control-plane/`. Event emission is implemented locally (a copy of the logic in `scripts/lib/events.py`).

Rationale: the planner must keep working even if control-plane is offline, and it has a separate deployment cycle.
|
||||
|
||||
### 5. structlog with PrintLoggerFactory

We do not use `structlog.stdlib.add_logger_name`: `PrintLogger` has no `.name` attribute.
Instead the processor chain is: `add_log_level` → `TimeStamper` → `StackInfoRenderer` → `format_exc_info` → `JSONRenderer`.

### 6. NODE_NAME read at call time, not import time

`_emit_event_sync` reads the module-level `NODE_NAME` on every call (not as a default parameter). This makes it patchable in tests.
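The difference between the two styles can be shown with a small sketch. The function names and the literal node name are hypothetical; only the module-level `NODE_NAME` idea comes from the notes above.

```python
# Module-level value; hard-coded here to keep the example deterministic.
NODE_NAME = "solaria"


def emit_bound_at_def(event_type: str, node: str = NODE_NAME) -> dict:
    # The default is evaluated once, when the function is defined.
    return {"type": event_type, "node": node}


def emit_read_at_call(event_type: str) -> dict:
    # The global is looked up on every call, so a patched value is seen.
    return {"type": event_type, "node": NODE_NAME}


def demo():
    """Patch NODE_NAME the way a test would, then call both variants."""
    global NODE_NAME
    original = NODE_NAME
    NODE_NAME = "patched-node"
    try:
        return (
            emit_bound_at_def("remediation_started")["node"],
            emit_read_at_call("remediation_started")["node"],
        )
    finally:
        NODE_NAME = original


bound, live = demo()
print(bound, live)
```

Only the call-time variant sees the patched value, which is why a test that patches the module attribute works with that style.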
|
||||
|
||||
---
|
||||
|
||||
## Problems encountered and solutions

### Problem: `localhost` inside the container does not reach the host

**Context:** Ollama runs on SOLARIA at `localhost:11434`. A Docker container on the default bridge network cannot reach the host via `localhost`.

**Solution:**
1. Added `extra_hosts: - "host-gateway:host-gateway"` to docker-compose.yml
2. `.env` uses `OLLAMA_HOST=http://host-gateway:11434`
|
||||
|
||||
### Problem: `environment` vs `env_file`, duplicated variables

**Context:** The first version of docker-compose.yml hard-coded every variable in the `environment` section with fallback values (`${VAR:-default}`). This made `.env` optional rather than required.

**Solution:** All runtime variables were removed from `environment` and moved to `env_file`.
Only `ANTHROPIC_API_KEY` remains in `environment` (an optional secret that should not live in a file on disk).
|
||||
|
||||
### Problem: `structlog.stdlib.add_logger_name` crashes with PrintLogger

**Symptom:** `AttributeError: 'PrintLogger' object has no attribute 'name'`

**Solution:** `add_logger_name` was removed from the processor chain. It is not compatible with `PrintLoggerFactory`.
|
||||
|
||||
### Problem: verify stage fails right after startup

**Symptom:** `deploy.sh` reports FAILED at verify because the heartbeat file does not exist yet.

**Cause:** Race condition: the agent needs a few seconds to start its loop and perform the first heartbeat `touch()`.

**Solution:** This is not a real failure. The Docker healthcheck has `start_period: 30s`.
The container shows `(healthy)` 30 s after start.
|
||||
|
||||
### Problem: git pull with divergent branches on solaria

**Symptom:** Solaria had 2 local commits that were not on Forgejo, plus manual changes in the working tree.
`git pull` failed with "Need to specify how to reconcile divergent branches."

**Solution:**
```bash
git checkout -- services/planner-agent/docker-compose.yml  # discard manual changes
git fetch origin
git rebase origin/master  # rebase local commits on top of master
```
|
||||
|
||||
---
|
||||
|
||||
## Deployment status on SOLARIA

```
Container:  planner-agent   Up ~30m (healthy)
Image:      planner-agent-planner-agent
Node:       solaria (100.100.231.104)
Heartbeat:  /opt/homelab/state/planner-agent.heartbeat (age 0s)

Channels subscribed:
  - health_events
  - world_updates

LLM chain:
  PRIMARY:   ollama/qwen2.5-coder:14b @ http://host-gateway:11434
  FALLBACK:  claude-haiku-4-5-20251001   (disabled: no ANTHROPIC_API_KEY)
  FALLBACK:  claude-sonnet-4-6           (disabled: no ANTHROPIC_API_KEY)

Redis: redis://100.108.208.3:6379 ✓ connected
```
|
||||
|
||||
---
|
||||
|
||||
## Left for later

### 1. ANTHROPIC_API_KEY: cloud fallback disabled

Haiku and Sonnet are configured in the chain but have no API key.
When Ollama cannot cope (complex case / timeout), the chain fails with no fallback.

To enable:
```bash
ssh oskar@100.100.231.104
echo "ANTHROPIC_API_KEY=sk-ant-..." >> /opt/homelab/config/planner-agent/.env
docker compose -f ~/homelab-codex-ws/services/planner-agent/docker-compose.yml up -d
```
|
||||
|
||||
### 2. End-to-end test with a real event

The planner is connected to Redis and listening, but no event has yet gone through the full path (LLM call → pending action → operator UI).

Test:
```bash
redis-cli -h 100.108.208.3 PUBLISH health_events '{
  "type": "service_unhealthy",
  "node": "piha",
  "service": "mosquitto",
  "severity": "error",
  "payload": {"reason": "container exited"},
  "timestamp": "2026-05-27T20:00:00Z"
}'
# Watch: docker logs planner-agent -f
# Check: ls /opt/homelab/actions/pending/
```
|
||||
|
||||
### 3. Solaria local commits

Solaria has 2 local commits (`feat: add ECC skills`, `fix: remove duplicate CLAUDE.md sections`) that are not on Forgejo. They were rebased on top of master but not pushed.
They should be pushed, or reviewed and possibly squashed.

### 4. Integration with operator UI / Telegram

Proposals in `actions/pending/` do not yet have a notification channel to the operator.
The Telegram bot should send a notification when a new file appears in `pending/`.
|
||||
|
||||
---
|
||||
|
||||
## Commits in this session

```
ff6fda1 planner-agent: use env_file, keep only ANTHROPIC_API_KEY in environment
ca37fca Add planner-agent: LLM-powered remediation planner
        (llm_router.py, planner.py, tests, service.yaml, docker-compose.yml,
         healthcheck.sh, Dockerfile)
```
|
||||
|
|
@ -1,103 +0,0 @@
|
|||
# SESSION: Stabilizing the homelab multi-agent system

**DATE:** 2026-05-27
**RESULT:** System NOMINAL (97/97 services, 0 errors)

---

## PROBLEMS FOUND

- stability-agent did not generate remediation actions: only redeploy, no container_restart
- mosquitto on chelsty-infra went down and nothing restarted it (restart policy was `no`)
- zigbee2mqtt had never been deployed on chelsty-infra
- node-agent was an empty skeleton: it did not emit `service_healthy`, so `services.json` was always empty
- ghost services: node-agent used `c.name` (which can return `<12hex>_real-name`) instead of the `com.docker.compose.service` label
- the materializer on piha read from its local Redis instead of the control-plane API; Redis held 80 stale entries with ghost keys, so "Copy for AI" returned old data
- observer used a single global checkpoint instead of per-node ones and silently skipped event directories that sort before the current checkpoint
- supervisor did not cancel resolved actions, so the pending queue grew without bound
- the `service_healthy` event did not close active incidents
- NODE_ALIAS_MAP was not configured, causing node-name mismatches between events and topology
- chelsty-ha was wrongly in monitoring scope: it has no node-agent
|
||||
|
||||
---
|
||||
|
||||
## FIXES SHIPPED (commits in master)
|
||||
|
||||
```
|
||||
7277bdc Fix Copy for AI: materializer fetches from control-plane API instead of Redis
|
||||
b40b832 Fix ghost service keys from hash-prefixed Docker container names
|
||||
28e9534 observer: service_healthy resolves active incidents
|
||||
46ae92b supervisor: also cancel pending actions for services removed from desired state
|
||||
410bfe7 zigbee2mqtt: config goes in data dir (writable), not separate ro mount
|
||||
b3912fe zigbee2mqtt: use extra_hosts host-gateway instead of network_mode: host
|
||||
61e07f4 zigbee2mqtt override: clear ports list for docker-compose v1 host network compat
|
||||
51002d4 Fix pending actions: node_exporter, zigbee2mqtt, chelsty-ha monitoring
|
||||
fb7828b supervisor: auto-cancel pending actions when drift is resolved
|
||||
2f19657 fix(node-agent): unique event IDs per service to prevent same-second overwrites
|
||||
267742c vps/node-agent: add network_mode: host for control-plane health probe
|
||||
4e8968f Fix service health tracking: emit service_healthy, control-plane endpoint, checkpoint migration
|
||||
f4a8db9 fix(observer): per-node-directory checkpoints replace single global checkpoint
|
||||
a5a3e22 fix(node-agent): skip SSH config file in rsync to avoid UID ownership errors
|
||||
2349de5 fix(node-agent): correct VPS_EVENTS_HOST to actual VPS Tailscale IP
|
||||
65bac4e fix(node-agent): mount host SSH key into container for event shipping
|
||||
96bf326 fix(observer+operator-ui): fix stale world state, dict→list API, event time filter
|
||||
ae33cce feat(node-agent): add runtime overrides for piha, solaria, chelsty-infra
|
||||
c5c080b feat(vps): add node-agent runtime override with NODE_NAME=vps
|
||||
01b7758 feat(node-agent): implement health monitor and safe cleanup policy
|
||||
```
|
||||
|
||||
### Details of key fixes

**fix(observer): per-node checkpoints**
A single global `last_processed_file` checkpoint silently skipped event directories that sort alphabetically before the last processed node (e.g. piha/ < vps/). It was replaced with a per-node dictionary: `{"node_checkpoints": {"piha": "...", "vps": "..."}}`.
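The skipping effect can be reproduced with a short sketch. The path layout (`<node>/<file>`), the function names, and the string comparison are assumptions for illustration, not the observer's actual code.

```python
def unprocessed_global(paths, last_processed):
    """Single checkpoint: anything sorting at or before it is treated as done."""
    return [p for p in sorted(paths) if p > last_processed]


def unprocessed_per_node(paths, node_checkpoints):
    """One checkpoint per node directory, compared within that node only."""
    pending = []
    for path in sorted(paths):
        node, _, name = path.partition("/")
        if name > node_checkpoints.get(node, ""):
            pending.append(path)
    return pending


events = ["piha/0002.json", "vps/0001.json"]

# The global checkpoint sits at a vps/ file, so the newer piha/ event is
# skipped, because "piha/..." sorts before "vps/...".
print(unprocessed_global(events, "vps/0001.json"))

# With per-node checkpoints the piha event is still picked up.
print(unprocessed_per_node(events, {"piha": "0001.json", "vps": "0001.json"}))
```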
|
||||
|
||||
**fix(observer): ghost key pruning**
`_prune_stale_world()` now removes entries from services.json whose service key matches the pattern `<12hexchars>_<name>`, artifacts of Docker's internal state tracking.

**fix(node-agent): canonical container name**
`check_containers()` now uses the `com.docker.compose.service` label as the canonical name. Fallback: strip the hash prefix from `c.name`. Containers in the `created` state are skipped (Docker stale-state artifacts).
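A minimal sketch of the naming rule, assuming lowercase hex in the prefix and a plain dict of labels. The function name and the regular expression are illustrative, not copied from node-agent.

```python
import re

# Assumed shape of a ghost name: 12 lowercase hex chars, underscore, real name.
GHOST_PREFIX_RE = re.compile(r"^[0-9a-f]{12}_(?P<name>.+)$")


def canonical_service_name(container_name: str, labels: dict) -> str:
    """Prefer the compose service label; otherwise strip a hash prefix."""
    label = labels.get("com.docker.compose.service")
    if label:
        return label
    match = GHOST_PREFIX_RE.match(container_name)
    return match.group("name") if match else container_name


print(canonical_service_name("a1b2c3d4e5f6_mosquitto", {}))
print(canonical_service_name("a1b2c3d4e5f6_z2m", {"com.docker.compose.service": "zigbee2mqtt"}))
```

The same pattern could serve the observer-side pruning, where keys that match it are dropped instead of renamed.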
|
||||
|
||||
**fix(node-agent): service_healthy emission**
Node-agent now emits `service_healthy` for every running managed container on each cycle. Without this, `services.json` was always empty and the supervisor generated a flood of "missing service" redeploys.

**fix(supervisor): auto-cancel resolved actions**
`_cancel_resolved_pending_actions()` moves pending actions to `cancelled/` when:
- the service became healthy (`drift_resolved_auto`)
- the service was removed from desired state (`service_removed_from_desired_state`)

**fix(supervisor): monitor:false**
The `monitor: false` field in `services.yaml` excludes a service from supervisor action generation. Used for `homeassistant` on chelsty-ha (no node-agent).

**fix(agent-system/materializer): control-plane API as source**
The materializer on piha now fetches data from the VPS control-plane API (`CONTROL_PLANE_URL=http://100.95.58.48:18180`) instead of local Redis. Redis held 80 stale entries. The Redis path is kept as a fallback.

**fix(chelsty-infra/zigbee2mqtt): mosquitto networking**
Mosquitto runs with `network_mode: host`, so bridge containers cannot reach it via localhost. Solution: `extra_hosts: - "mosquitto:host-gateway"` in the z2m override. We do not use `network_mode: host` for z2m because it conflicts with `ports:` in docker-compose v1 (1.29.2 on chelsty-infra).

**fix(chelsty-infra/zigbee2mqtt): writable config**
z2m migrates and overwrites `configuration.yaml` at startup. The config must live in the data directory, `/opt/homelab/data/zigbee2mqtt/data/configuration.yaml` (read-write mount), not in a separate `:ro` volume.
|
||||
|
||||
---
|
||||
|
||||
## FINAL STATE

| Node | Status | Services |
|------|--------|---------|
| vps | online | control-plane (4), node-agent, node_exporter, stability-agent |
| piha | online | agent-system (4), node-agent, stability-agent, monitoring stack |
| solaria | online | node-agent, stability-agent, AI workloads |
| chelsty-infra | online | mosquitto, zigbee2mqtt (z2m connects once SLZB-06U is back online), node-agent, stability-agent |
| chelsty-ha | n/a | homeassistant (monitor:false: no node-agent; HA is monitored indirectly via MQTT) |

**Action queue:** 0 pending, 0 approved, 0 running
**Incidents:** 0 active
**Ghost service keys:** 0
|
||||
|
||||
---
|
||||
|
||||
## KNOWN LIMITATIONS / TODO

- SLZB-06U (Zigbee coordinator) is offline: `192.168.1.105:6638` returns EHOSTUNREACH from chelsty-infra. Probably a hardware/network problem on the 192.168.1.0/24 side. z2m starts and serves an error page on :8080; it will connect automatically once the coordinator is back.
- The `ezsp` adapter in the z2m configuration is deprecated; migrating to `ember` is recommended. No new configuration is needed, only changing the field to `adapter: ember` in `configuration.yaml`.
- chelsty-ha has no node-agent. Add one when a machine is available, or via manual bootstrap.
- Redis on piha still holds old keys (`homelab:nodes:*`, `homelab:incidents:*`, etc.). The materializer no longer uses them in API mode, so they can be cleaned up.
|
||||
|
|
@ -1,100 +0,0 @@
|
|||
# Session 2026-06-08: onboarding LUSTRO (RPi4 / Magic Mirror / KEN)

## Goal

Build a reusable node-onboarding tool, `scripts/onboard/` (idempotent bash, NOT Ansible, a deliberate decision), driven by a declarative manifest `hosts/<node>/node.yaml`. First real node: LUSTRO.

## Node LUSTRO (facts from preflight)

- RPi4, aarch64, Debian bookworm, hostname pimirror2, KEN network 192.168.31.x
- RAM 4 GB (MM uses ~1.7 GiB; same profile as the VPS with the OOM of 2026-06-01 → `mem_limit` mandatory)
- disk 58 G / 48% (headroom)
- docker 29.5.3 already installed (step `20-install-docker` is unnecessary for this node)
- user `pi`: uid=1000, passwordless sudo (confirmed: `sudo -n true` exits 0), groups docker+ollama
- Magic Mirror = systemd unit `magicmirror.service` (Electron running as pi), **UNTOUCHED** for the whole session
- swap = 200 M file `/var/swap` on the SD card → to be migrated to zram (card wear)
- Tailscale: installed in this session, Running, IP 100.99.85.73
|
||||
|
||||
## Decisions

- **user = existing `pi`** (we do NOT create `oskar`: `pi` already holds uid 1000, owns MM, and has docker+sudo; the node-agent docker `1000:1000` fits out of the box). A deliberate deviation from the "oskar everywhere" convention.
- node-agent runtime = docker
- `first_contact` = LAN IP `pi@192.168.31.19` (mDNS `.local` proved unreliable, with transient resolve failures); after `tailscale up` the mesh takes over (`pi@lustro`)
- Tailscale auth = interactive login (URL), no authkey
- swap target = zram

## State: 00-access CLOSED

Idempotent; passed a live run and a clean re-run. Lustro is in the mesh, and the SATURN→lustro channel over Tailscale works without a password. Verify is clean (arch=aarch64).
|
||||
|
||||
## Tool bugs fixed in this session

1. **dry-run was shallow** (orchestrator only) → `run()` helper + `DRY_RUN=1` propagated to steps (`lib/common.sh`, `onboard.sh`, `remote.sh`, `00-access.sh`)
2. **`yaml_get` fallback** (without `yq`):
   - inline-comment stripping: `[[:space:]]+#.*$` after the value
   - PRE-EXISTING greedy-colon bug: `.*:` consumed everything up to the last colon and lost the prefix in `systemd:magicmirror.service`; fix: `^[[:space:]]*[^:]*:[[:space:]]*`
3. **`00-access` verify**: an ssh known-hosts warning leaked into the parsed `arch` (`WARN "Unexpected arch 'Warning:Permanently…'"`); fix: `-o LogLevel=ERROR` + clean stdout (no `2>&1`)
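The greedy-colon bug from item 2 is easy to reproduce outside the shell. The sketch below ports the two patterns to Python's `re` (with `\s` standing in for `[[:space:]]`). The key name `watch` and the helper name are hypothetical, and the real fix lives in the bash `yaml_get`.

```python
import re

GREEDY_KEY = re.compile(r"^.*:\s*")          # old: eats up to the LAST colon
ANCHORED_KEY = re.compile(r"^\s*[^:]*:\s*")  # fix: stops at the FIRST colon
INLINE_COMMENT = re.compile(r"\s+#.*$")      # strips a trailing inline comment


def yaml_value(line: str, key_re=ANCHORED_KEY) -> str:
    """Drop the 'key:' prefix and any inline comment from a flat YAML line."""
    value = key_re.sub("", line, count=1)
    return INLINE_COMMENT.sub("", value).strip()


line = "  watch: systemd:magicmirror.service   # hypothetical key"
print(yaml_value(line, GREEDY_KEY))  # loses the "systemd:" prefix
print(yaml_value(line))              # keeps the full value
```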
|
||||
|
||||
## Branch / commits

`feat/node-onboarding` (6 commits):

| Hash | Description |
|------|------|
| `adb8407` | scaffold: onboard.sh, lib/, steps/00-preflight, hosts/lustro/node.yaml draft |
| `9012a36` | 00-access.sh + node.yaml ssh_user/first_contact/hardware |
| `931fd46` | dry-run propagation: run() helper, DRY_RUN=0/1 |
| `eed0ad0` | yaml_get fix: inline-comment + greedy-colon |
| `1bed855` | first_contact: IP instead of mDNS .local |
| `471ba09` | verify fix: LogLevel=ERROR, clean stdout |
|
||||
|
||||
## OPEN: for the next session (in order)

1. **WORKTREE HYGIENE** (first thing): the whole session ran in the MAIN checkout, against the "main = deploy-only" rule. Decision still open:
   - (A) rename `feat/` → `task/node-onboarding` + worktree + main→master (full compliance with `agent.sh`; merge=FF)
   - (B) stay on `feat/` + manual `git merge --ff-only`

   `agent.sh new` creates `task/<name>` from `master` and does NOT accept an existing branch.
   `git worktree list` has not been read yet (the path pattern is needed).

2. **base step**: migrate swap from the 200 M file to zram; `/opt/homelab` + `chown pi` (uid 1000 already matches); event dir `/opt/homelab/events/lustro/`
3. **node-agent step**: docker override, user 1000:1000 (pi=1000), `mem_limit: 256m`
4. **register step**: observer/supervisor inventory + redis sub + UI panel agents.okit.pl
5. **verify step (50)**: end-to-end smoke test (event reached the control plane, visible in the UI, real Telegram alert path)
6. **mm-watch**: health check `systemctl is-active magicmirror.service`
7. **minor items**: the URL banner in 00-access has an alignment defect; `locale pl_PL` is not generated on lustro (harmless)
|
||||
|
||||
## Learnings

(also reflected in `scripts/onboard/README.md`)

- mDNS `.local` is unreliable for automation → `first_contact` via IP or tailscale, not `.local`
- an existing node with a uid=1000 user: use that user instead of creating `oskar` (uid collision)
- swap on SD = wear → zram
- dry-run MUST propagate to step scripts (`run()` wrapper), otherwise it is useless
- the yaml fallback without `yq` must strip inline comments and must not be greedy on `:`
|
||||
|
||||
## Update: worktree hygiene
- feat/node-onboarding → task/node-onboarding. The main checkout (~/homelab-codex-ws) is back on master (deploy-only). Onboarding work lives in ~/homelab-codex-ws-node-onboarding.
- Origin: task/ pushed with tracking, feat/ deleted.
- MINOR: the worktree was created manually (git worktree add) → agent.sh list shows "(no marker)"/parent=?. It works; at the final `agent.sh merge node-onboarding`, verify that the missing marker is not a problem; otherwise add a marker (pattern: ha-piha) or do a manual `git merge --ff-only`.
- NEXT: base step (zram, /opt/homelab, event dir /opt/homelab/events/lustro/), from the node-onboarding worktree.
- Separate future project: parent-layout refactor (bare + worktrees under one directory); requires rewriting agent.sh and protecting the dirty ha-piha.

## Tech debt caught in this session
- OBSERVER STALENESS: a dead node (chelsty-infra) shows NOMINAL in agents.okit.pl. Observer/supervisor keeps the last known state and does not degrade it when heartbeats stop (events: only the VPS reports fresh data; chelsty is silent yet its status is NOMINAL). FIX (remote, software): heartbeat TTL → once exceeded, mark the node `stale`/`down`. Important: a false NOMINAL undermines trust in the status of every node. Move to the main tech-debt backlog if a separate one exists.
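One possible shape for the proposed heartbeat TTL, as a sketch only. The function name, the status labels other than `stale`/`down`, and both thresholds are assumptions; the notes do not specify TTL values.

```python
from datetime import datetime, timedelta, timezone


def node_status(
    last_seen: datetime,
    now: datetime,
    stale_after: timedelta = timedelta(minutes=5),   # assumed threshold
    down_after: timedelta = timedelta(minutes=30),   # assumed threshold
) -> str:
    """Degrade a node's status as its last heartbeat ages."""
    age = now - last_seen
    if age >= down_after:
        return "down"
    if age >= stale_after:
        return "stale"
    return "nominal"


now = datetime(2026, 6, 8, 12, 0, tzinfo=timezone.utc)
for minutes in (1, 10, 120):
    print(minutes, node_status(now - timedelta(minutes=minutes), now))
```

Computing the status from `last_seen` at read time, rather than storing it, means a silent node degrades without any event having to arrive.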
|
||||
|
|
@ -1,124 +0,0 @@
|
|||
# Session 2026-06-09: fleet recovery + LUSTRO register

## Goal

Diagnose the silent fleet-reporting outage; finish the REGISTER step for LUSTRO (40-register.sh + 50-verify.sh); update the node-onboarding skill.

---

## MAIN: 8-day silent fleet-reporting outage, RESOLVED

### Root cause

`oskar` (uid 1002) was **not in the aerbot group (1000)** on the VPS.
`/opt/homelab/events/*` = `aerbot:aerbot 775` → `oskar` falls under "other" (r-x).
The `rsync` push from every node (as `oskar` over SSH) got **Permission denied** on write → `--remove-source-files` never cleared the backlog → **292,000 files** piled up in the node-agents' staging cache.
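Why mode 775 blocks `oskar` can be checked with a small model of the classic owner/group/other rule. This is a simplified sketch: it ignores root and ACLs, and the owner uid of 1000 for `aerbot` is an assumption (the notes only give the group id).

```python
import stat


def can_write(mode: int, file_uid: int, file_gid: int, uid: int, gids: set) -> bool:
    """Classic Unix check: exactly one of owner/group/other bits applies."""
    if uid == file_uid:
        return bool(mode & stat.S_IWUSR)
    if file_gid in gids:
        return bool(mode & stat.S_IWGRP)
    return bool(mode & stat.S_IWOTH)


MODE = 0o775                       # rwxrwxr-x
AERBOT_UID, AERBOT_GID = 1000, 1000  # owner uid is assumed, gid is from the notes

# Before the fix: oskar (uid 1002) only matches "other" → no write bit.
print(can_write(MODE, AERBOT_UID, AERBOT_GID, 1002, {1002}))

# After adding oskar to group 1000 the group write bit applies.
print(can_write(MODE, AERBOT_UID, AERBOT_GID, 1002, {1002, 1000}))
```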
|
||||
|
||||
### Fix

```bash
usermod -aG 1000 oskar   # on the VPS; ssh re-login required
```

### Verification

- VPS `events/piha`: 3443 files (growing)
- `piha` locally: 2 files (staging cleared)
- Panel agents.okit.pl: vps / piha / solaria show a fresh Last Seen

### Diagnosis: 5 layers, 4 refuted hypotheses

Verify-before-fix refuted, in order:
1. `authorized_keys` missing: the key was there and SSH worked (piha→VPS manually OK)
2. Remote agent down: `rsync` processes visible in `ps`, logs show no crash
3. VPS IP change: Tailscale IP unchanged, 100.95.58.48
4. Bridge/relay cutoff: ping VPS→piha OK over the mesh

The 5th layer (a permission error) was found by running `rsync` manually as `oskar` on the VPS → `Permission denied (13)` → `stat /opt/homelab/events/` → `aerbot:aerbot 775`.
|
||||
|
||||
### Why the outage was SILENT (3 masking layers)

| Layer | Mechanism |
|---------|-----------|
| (a) shipping fail | Logged as `WARNING`, not a crash: node-agent did not fail, it stayed silent |
| (b) observer staleness | A stale node is shown as NOMINAL: no heartbeat TTL, observer keeps the last known state |
| (c) brain-watchdog | Blind to per-node freshness: it does not monitor event freshness per node |

### Remaining minor error

`rsync` exit code 23: `set-times` on the directory fails with `EPERM` (oskar does not own `/opt/homelab/events/`; `aerbot` does). Cosmetic: rsync otherwise works correctly.
**Fix**: add `--omit-dir-times` to the rsync call in node-agent (added to the backlog).
|
||||
|
||||
---
|
||||
|
||||
## LUSTRO register: state after the session

### Done

- `40-register.sh`: written and committed on `task/node-onboarding`
  - Idempotent: grep topology, `[[ -f services.yaml ]]`, `git diff --quiet`
  - Commits only `inventory/topology.yaml` + `hosts/lustro/services.yaml` on the current branch
  - NO `git push` (merging is the operator's job)
- `50-verify.sh`: written and committed
  - 4 checks: node-agent running, events, observer restart + heartbeat poll, world/nodes.json
  - Pass/fail table; exit 1 on failure
- `40-deploy-node-agent.sh`: scaffold removed (deploy lives in 30-node-agent.sh)
- Dry run `40-register.sh --dry-run` passed cleanly

### Observer activation mechanism (investigated)

Observer bind-mounts the repo root as `/repo:ro` from `services/control-plane/docker-compose.yml` (`../..:/repo:ro` → `/home/oskar/homelab-codex-ws` on the VPS). `_load_inventory()` is called once at startup. **Activation after merge**: `git pull` on the VPS + `docker restart control-plane-observer`, with no redeploy.
|
||||
|
||||
### lustro entry in topology.yaml (minimal, 1:1 with piha)

```yaml
lustro:
  roles:
    - edge
  services:
    - node-agent
```

### PENDING (tomorrow)

1. Commit B: run `onboard.sh --node lustro --step 40-register` live → commit on the branch
2. `agent.sh merge task/node-onboarding` → master
3. `git pull` on the VPS + `docker restart control-plane-observer`
4. `onboard.sh --node lustro --step 50-verify` → lustro visible in agents.okit.pl
|
||||
|
||||
---
|
||||
|
||||
## fix-event-bloat (task/fix-event-bloat)

Commit `d483274` on the branch: batch rsync, backlog trim, timeout 120s, backlog warn.
**PENDING**: review + deploy to the fleet.

---

## OOM ai-cluster (live observation)

Observed on the VPS during the session: cgroup OOM restart loop, python workers ~195 MB, 0 swap. **PENDING**: migrate `ai-cluster` → SOLARIA + add swap on the VPS.

---

## Session gotcha

**Worktree branch confusion**: `~/homelab-codex-ws-node-onboarding` had been manually switched to `task/fix-event-bloat` (one worktree, two branches switched by hand).
To avoid this anti-pattern, always check `git branch --show-current` when entering a worktree.
Target state: a separate worktree per task.

---

## Tech debt caught in this session

→ recorded in `docs/backlog.md`
|
||||
|
|
@ -1,114 +0,0 @@
|
|||
# Session 2026-06-10/11: lustro SSH shipping fix + ha-diag-agent piha

## Goal

Fix event shipping lustro → VPS; finish the ha-diag-agent deploy config on piha; keep poison-quarantine (Codex) for a separate review.

---

## MAIN: LUSTRO event shipping, FIXED (merged `a5a1352`)

### Root cause

`_ship_events_to_vps()` (`services/node-agent/src/node_agent.py`) calls `ssh` **without `-i`**, so the key is looked up in `$HOME/.ssh` = `/home/homelab/.ssh` (the container has run as uid 1000 `homelab` since `user: "1000:1000"` was added to the base `services/node-agent/docker-compose.yml`). The lustro override mounted the key at `/root/.ssh`, a **blind mount** that ssh never reads → `oskar@100.95.58.48: Permission denied`.
|
||||
|
||||
### Fix

`hosts/lustro/runtime/node-agent/docker-compose.override.yml`:

```yaml
- /home/pi/.ssh:/home/homelab/.ssh:ro   # was: /root/.ssh (blind)
```

The `pi@pimirror2` key was added to `authorized_keys` of `oskar@VPS`.
The uid match (pi=1000 = homelab=1000) satisfies OpenSSH's strict ownership check.

### Verification

- 5 nodes NOMINAL in world state; lustro in `/opt/homelab/world/nodes.json` (online, fresh `last_seen`)
- 7600+ backlog events flowed to the VPS (`/opt/homelab/events/lustro/`)
- Staging on lustro drained to zero (`--remove-source-files` works)
- "Permission denied" disappeared from the node-agent logs

### Diagnosis: a verify-before-fix lesson

Both agents (Claude Code, Codex) wrongly blamed the observer (poison event / race) based on **stale state** (`events=2` from a manual test). Verify-before-fix refuted both hypotheses: `events/lustro` on the VPS was empty → the problem was in the **delivery** layer (SSH key), not in the observer.
|
||||
|
||||
---
|
||||
|
||||
## ha-diag-agent piha: deploy config merged (`5e9db5c`), deploy UNFINISHED

- `.env` created on piha: `/opt/homelab/config/ha-diag-agent/.env`, chmod 600
- **BUT token = PLACEHOLDER**: chelsty-ha is offline → no token and no connection
- Before `shadow_mode=false`: the restart target in the supervisor = container name `homeassistant5`; a curl against the endpoint with the token must return HTTP 200
- Decision PENDING: HA target = chelsty-ha vs HA Ken (`homeassistant5` on piha; from the container NOT `localhost`)

---

## observer poison-quarantine (Codex)

Kept on branch `task/observer-poison-quarantine` (`78c9e4a`), **NOT in master**.
For a separate review: does the observer actually hang on a malformed event?
(Poison was NOT the cause of the lustro issue; the hypothesis is unverified.)
Real bug → merge; otherwise → drop.
|
||||
|
||||
---
|
||||
|
||||
## 🔴 FLEET BOMB: discovered, NOT fixed (backlog, BLOCKING)

solaria / piha / chelsty still run **old root containers** of node-agent (piha Created 2026-05-27, uid 0). Their `/root/.ssh` mount works only because the containers predate `user: "1000:1000"`. The first `--force-recreate`, host reboot, or image update will switch them to uid 1000, and shipping will break just as it did on lustro.
**DO NOT RECREATE without the fix.** Details and fix: `docs/backlog.md`.

---

## Tech debt caught in this session

→ recorded in `docs/backlog.md` (fleet bomb, ha-diag-agent blocked, poison-quarantine review, `--omit-dir-times`, stale comment in node_agent.py, shipping success logged at `logger.debug`, lustro event bloat on the VPS).
|
||||
|
||||
## Session 20:19
|
||||
|
||||
### Commits
|
||||
fa59625 docs(ha-diag-agent): replace curl verify commands with docker exec
|
||||
d7e0d31 fix(ha-diag-agent): remove host port mapping for 8087
|
||||
|
||||
### Files changed
|
||||
services/ha-diag-agent/DEPLOY.md | 4 ++--
|
||||
services/ha-diag-agent/README.md | 4 ++--
|
||||
services/ha-diag-agent/docker-compose.yml | 3 ---
|
||||
services/ha-diag-agent/service.yaml | 3 ---
|
||||
4 files changed, 4 insertions(+), 10 deletions(-)
|
||||
|
||||
### Deploys
|
||||
None recorded
|
||||
|
||||
### Narrative
|
||||
> _user-provided summary_
|
||||
|
||||
## Session 20:35
|
||||
|
||||
### Commits
|
||||
(no new commits; d7e0d31 and fa59625 from this session reached master before this entry)
|
||||
|
||||
### Files changed
|
||||
(no changes; see Session 20:19)
|
||||
|
||||
### Deploys
|
||||
None recorded
|
||||
|
||||
### Narrative
|
||||
> _user-provided summary_
|
||||
|
|
@ -1,62 +0,0 @@
|
|||
# Stability Agent Multi-Node Rollout
|
||||
|
||||
## Architecture Summary
|
||||
The `stability-agent` is a lightweight Python service that monitors node health (disk, Docker containers, Tailscale, MQTT) and publishes state to a central Redis instance running on **PIHA**.
|
||||
|
||||
- **Source**: `services/stability-agent`
|
||||
- **State Path**: `/opt/homelab/state`
|
||||
- **Events Path**: `/opt/homelab/events`
|
||||
- **Redis Target**: `100.108.208.3:6379` (PIHA)
|
||||
|
||||
## Why UI only showed CHELSTY
|
||||
Previously, the `stability-agent` had `NODE_NAME` defaulted to `chelsty` and was only deployed there. The Agent System UI materializer on PIHA filters nodes based on the Redis keys `homelab:nodes:<NODE_NAME>`. Without other agents publishing their specific `NODE_NAME`, the UI remained limited to the single active node.
|
||||
|
||||
## Deployment
|
||||
|
||||
Use the helper script to deploy or generate commands. The script uses explicit Tailscale IPs for remote targets (piha, chelsty, vps) and runs locally for solaria.
|
||||
|
||||
```bash
|
||||
# Print commands
|
||||
./scripts/deploy/deploy-stability-agent.sh <node-name>
|
||||
|
||||
# Deploy via SSH (executes ssh oskar@<ip>)
|
||||
./scripts/deploy/deploy-stability-agent.sh <node-name> --ssh
|
||||
```
|
||||
|
||||
### Manual Steps per Node
|
||||
The manual steps are encapsulated in `services/stability-agent/deploy-local.sh`. On the target node:
|
||||
```bash
|
||||
cd /home/oskar/homelab-codex-ws
|
||||
git fetch origin
|
||||
git checkout master
|
||||
git pull origin master
|
||||
cd services/stability-agent
|
||||
./deploy-local.sh <node-name>
|
||||
```
|
||||
|
||||
## Verification
|
||||
|
||||
### Fleet Overview
|
||||
Run the verification script from any node with `redis-cli` access:
|
||||
```bash
|
||||
./scripts/deploy/verify-agent-fleet.sh
|
||||
```
|
||||
|
||||
### Redis Inspection (on PIHA)
|
||||
```bash
|
||||
docker exec agent-system-redis redis-cli KEYS 'homelab:nodes:*'
|
||||
docker exec agent-system-redis redis-cli HGETALL homelab:nodes:<node-name>
|
||||
```
|
||||
|
||||
Verify Web UI backend:
|
||||
```bash
|
||||
curl -s http://127.0.0.1:18180/nodes
|
||||
curl -k https://agents.okit.pl/nodes
|
||||
```
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
- **Redis empty after compose down**: The `agent-system-redis` on PIHA uses transient storage if not configured with a volume. If it restarts, agents must republish their state (they do this automatically every `CHECK_INTERVAL`).
|
||||
- **Secrets**: `.env` files and local secrets are not committed to the repo. Ensure `MQTT_HOST` and other specific secrets are set via overrides if needed.
|
||||
- **Telegram**: Telegram bot notifications can remain disabled if `TELEGRAM_BOT_TOKEN` is absent.
|
||||
- **Docker Socket**: If the agent reports `unavailable` for Docker, ensure `/var/run/docker.sock` is mounted and the user has permissions.
|
||||
|
|
@ -49,10 +49,9 @@ Runtime state must live outside the repository to keep it immutable and clean.
|
|||
## Service Standards
|
||||
|
||||
1. **Normalization**: Every service MUST follow the `services/<service>/` layout.
|
||||
2. **Metadata**: Every service MUST have a `service.yaml` defining its operational contract. This is the primary source of truth for AI agents.
|
||||
3. **Healthchecks**: Every service MUST have a `healthcheck.sh` for verification. Agents use this to emit stability events.
|
||||
4. **Actionability**: Any automated recovery action proposed by an agent must be backed by a `service.yaml` definition.
|
||||
5. **Secrets**: NEVER commit secrets to Git. Use `env.example` as a template and populate `/opt/homelab/config/<service>/.env` on the host. Agents must treat these as "black box" configurations.
|
||||
2. **Metadata**: Every service MUST have a `service.yaml` defining its operational contract.
|
||||
3. **Healthchecks**: Every service MUST have a `healthcheck.sh` for verification.
|
||||
4. **Secrets**: NEVER commit secrets to Git. Use `env.example` as a template and populate `/opt/homelab/config/<service>/.env` on the host.
|
||||
|
||||
## Docker Compose Standards
|
||||
|
||||
|
|
|
|||
|
|
@ -1,126 +1,78 @@
|
|||
# VPS Control Plane
|
||||
|
||||
The VPS Control Plane is the orchestration brain of the homelab platform. It runs on the Hetzner VPS (Tailscale IP: `100.95.58.48`) and provides observability, automated reconciliation, and a web-based operator interface.
|
||||
The VPS Control Plane is the orchestration brain of the homelab platform. It runs on the Hetzner VPS and provides observability, automated reconciliation, and a web-based operator interface.
|
||||
|
||||
## Architecture
|
||||
|
||||
The control plane consists of four core services running as a Docker Compose stack under `services/control-plane/`:
|
||||
The control plane consists of four core services running as a Docker Compose stack:
|
||||
|
||||
| Container | Role |
|
||||
|-----------|------|
|
||||
| `control-plane-observer` | Synthesizes world state from events in `/opt/homelab/events/` |
|
||||
| `control-plane-supervisor` | Detects drift between desired state (`hosts/*/services.yaml`) and actual state (`world/services.json`); writes pending actions |
|
||||
| `control-plane-executor` | Executes approved actions from `/opt/homelab/actions/approved/` |
|
||||
| `control-plane-ui` | Web interface for system monitoring and action approval; serves port 18180 |
|
||||
1. **Observer**: Synthesizes world state from events.
|
||||
2. **Supervisor**: Detects drift between desired and actual state.
|
||||
3. **Executor**: Executes approved actions from the queue.
|
||||
4. **Operator UI**: Web interface for system monitoring and action approval.
|
||||
|
||||
All services use **filesystem-first** semantics with `/opt/homelab/` as the data exchange layer. All four run with `network_mode: host` and as UID 1000 (`homelab` user).
|
||||
All services adhere to **filesystem-first** semantics, using `/opt/homelab/` as the primary data exchange and persistence layer.
|
||||
|
||||
## Supervisor Behavior
|
||||
## Deployment Flow
|
||||
|
||||
### Desired State
|
||||
Loaded from `hosts/*/services.yaml` each reconcile cycle. Services with `monitor: false` are silently skipped — use this for services without a node-agent (e.g. `homeassistant` on `chelsty-ha`).
|
||||
### 1. Prerequisites
|
||||
- Target VPS node must be onboarded (Tailscale active, Docker installed).
|
||||
- Repository cloned to `/home/oskar/homelab-codex-ws`.
|
||||
|
||||
### Drift Types
|
||||
- `missing_service` — service is in desired state but absent from `services.json`
|
||||
- `unhealthy_service` — service exists in `services.json` but `status != healthy`
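
The two drift types above can be sketched roughly as follows. This is a minimal illustration, not the supervisor's actual code: the function name `detect_drift` and the in-memory dict shapes for desired and actual state are assumptions. Only the `monitor: false` skip, the `status != healthy` check, and the two drift type names come from the text.

```python
from typing import Dict, List, Optional, Tuple


def detect_drift(
    desired: Dict[str, Optional[dict]],
    actual: Dict[str, dict],
) -> List[Tuple[str, str]]:
    """Return (service, drift_type) pairs for a single node.

    `desired` maps service name -> its entry from services.yaml,
    `actual` maps service name -> its entry from services.json.
    Both shapes are hypothetical simplifications.
    """
    drifts: List[Tuple[str, str]] = []
    for name in sorted(desired):
        spec = desired[name] or {}  # an empty YAML entry loads as None
        # Services marked `monitor: false` are skipped silently.
        if spec.get("monitor", True) is False:
            continue
        state = actual.get(name)
        if state is None:
            drifts.append((name, "missing_service"))
        elif state.get("status") != "healthy":
            drifts.append((name, "unhealthy_service"))
    return drifts
```

Iterating in sorted order is a design choice of this sketch so that repeated reconcile cycles report drift in a stable order.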
|
||||
|
||||
### Action Types
|
||||
| Trigger | Action type | Risk |
|
||||
|---------|-------------|------|
|
||||
| `containers_not_running`, `mqtt_unreachable` | `container_restart` | low |
|
||||
| Any other / unknown | `redeploy` | guarded |
|
||||
| Node `disk_pressure: high` | `disk_cleanup` | guarded |
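
The table above amounts to a small lookup. The sketch below assumes hypothetical helper names (`choose_service_action`, `choose_node_action`) and string-valued triggers. The action types and risk levels are taken from the table, and nothing here is confirmed to match the supervisor's real implementation.

```python
from typing import Optional, Tuple

# Triggers that the table maps to a low-risk container restart.
LOW_RISK_TRIGGERS = {"containers_not_running", "mqtt_unreachable"}


def choose_service_action(trigger: str) -> Tuple[str, str]:
    """Map a service-level trigger to (action_type, risk)."""
    if trigger in LOW_RISK_TRIGGERS:
        return ("container_restart", "low")
    # Any other or unknown trigger falls back to a guarded redeploy.
    return ("redeploy", "guarded")


def choose_node_action(disk_pressure: str) -> Optional[Tuple[str, str]]:
    """Map a node-level disk_pressure value to an action, if any."""
    if disk_pressure == "high":
        return ("disk_cleanup", "guarded")
    return None
```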
|
||||
|
||||
### Action ID Stability
|
||||
Action IDs are deterministic: `redeploy-{node}-{service}` or `container-restart-{node}-{service}`. The same drift always produces the same filename, making reconcile truly idempotent across supervisor restarts.
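
A minimal sketch of how such a deterministic ID could be built, assuming a hypothetical helper `action_id`. The two ID formats are from the text; the ID format for other action types is not stated there, so this sketch rejects them rather than guessing.

```python
def action_id(action_type: str, node: str, service: str) -> str:
    """Build a stable action ID from the drift's coordinates.

    No timestamp or random suffix is used, so the same drift always
    maps to the same ID (and therefore the same filename).
    """
    if action_type not in ("redeploy", "container_restart"):
        raise ValueError(f"ID format not documented for {action_type!r}")
    # The action type uses an underscore, the ID prefix uses a hyphen.
    prefix = action_type.replace("_", "-")
    return f"{prefix}-{node}-{service}"
```

Because the ID carries no per-run component, a supervisor restarted mid-cycle would regenerate the same filename instead of queueing a duplicate.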
|
||||
|
||||
### Auto-Cancel
|
||||
Pending `redeploy` and `container_restart` actions are automatically moved to `cancelled/` when:
|
||||
- **`drift_resolved_auto`** — the service becomes `healthy` in actual state
|
||||
- **`service_removed_from_desired_state`** — the service was removed from `services.yaml` or marked `monitor: false`
|
||||
|
||||
Only `pending` actions are auto-cancelled. Approved/running actions have been committed to by the operator and are never cancelled automatically.
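
The auto-cancel rules can be sketched as a single decision function. The action field names (`state`, `type`, `service`), the dict shapes, and the function name `cancel_reason` are assumptions made for illustration; the two reason strings and the pending-only rule come from the text. Checking removal before health is also an assumption, since the text does not say which reason wins when both apply.

```python
from typing import Dict, Optional

AUTO_CANCELLABLE = ("redeploy", "container_restart")


def cancel_reason(
    action: dict,
    desired: Dict[str, Optional[dict]],
    actual: Dict[str, dict],
) -> Optional[str]:
    """Return the auto-cancel reason for an action, or None to keep it."""
    # Approved or running actions are never cancelled automatically.
    if action.get("state") != "pending":
        return None
    if action.get("type") not in AUTO_CANCELLABLE:
        return None
    service = action["service"]
    if service not in desired:
        return "service_removed_from_desired_state"
    spec = desired[service] or {}
    if spec.get("monitor", True) is False:
        return "service_removed_from_desired_state"
    if (actual.get(service) or {}).get("status") == "healthy":
        return "drift_resolved_auto"
    return None
```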
|
||||
|
||||
### Node Name Resolution
|
||||
The supervisor supports a `NODE_ALIAS_MAP` environment variable (JSON string) to map event/world-state node names to canonical topology names:
|
||||
### 2. Bootstrap
|
||||
Run the bootstrap script to initialize the runtime filesystem and start the stack:
|
||||
|
||||
```bash
|
||||
NODE_ALIAS_MAP='{"node-2": "chelsty-infra", "node-1": "piha"}'
|
||||
./scripts/bootstrap/vps-control-plane.sh
|
||||
```
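
For illustration, resolving node names through `NODE_ALIAS_MAP` might look like the sketch below. The helper names `load_alias_map` and `canonical_node` are hypothetical, and returning unmapped names unchanged is an assumption; the text only states that the variable holds a JSON string mapping event names to topology names.

```python
import json
import os
from typing import Dict, Mapping, Optional


def load_alias_map(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Parse NODE_ALIAS_MAP (a JSON object string) into a dict."""
    env = os.environ if env is None else env
    raw = env.get("NODE_ALIAS_MAP", "")
    if not raw.strip():
        return {}  # variable unset or empty: no aliases
    parsed = json.loads(raw)
    if not isinstance(parsed, dict):
        raise ValueError("NODE_ALIAS_MAP must be a JSON object")
    return {str(k): str(v) for k, v in parsed.items()}


def canonical_node(name: str, aliases: Mapping[str, str]) -> str:
    """Map an event-source node name to its canonical topology name."""
    return aliases.get(name, name)
```

Passing the environment as a parameter keeps the parsing testable without mutating `os.environ`.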
|
||||
|
||||
## Deployment
|
||||
|
||||
### From SATURN (primary control node)
|
||||
```bash
|
||||
# Full deploy via SSH
|
||||
./scripts/deploy/deploy-control-plane.sh --ssh
|
||||
|
||||
# Or manually:
|
||||
ssh oskar@100.95.58.48 "cd ~/homelab-codex-ws && git pull origin master && cd services/control-plane && docker compose up -d --build --force-recreate"
|
||||
```
|
||||
|
||||
### Direct on VPS
|
||||
```bash
|
||||
cd ~/homelab-codex-ws/services/control-plane
|
||||
docker compose up -d --build --force-recreate
|
||||
```
|
||||
|
||||
`deploy-local.sh` also creates the required `/opt/homelab/` directory structure and sets ownership to UID 1000 (requires `sudo`). If directories already exist, skip to the `docker compose` step directly.
|
||||
|
||||
### Verification
|
||||
```bash
|
||||
# On VPS
|
||||
docker ps --filter "name=control-plane"
|
||||
curl -s http://localhost:18180/summary | python3 -m json.tool
|
||||
```
|
||||
|
||||
## Action Approval Workflow
|
||||
|
||||
```
|
||||
Supervisor writes → /opt/homelab/actions/pending/<id>.json
|
||||
→ Operator UI (port 18180) or Telegram Bot notifies
|
||||
→ Operator clicks Approve
|
||||
→ /opt/homelab/actions/approved/<id>.json
|
||||
→ Executor executes → completed / failed
|
||||
```
|
||||
|
||||
Possible action states: `pending → approved → running → completed / failed / rejected`
|
||||
Auto-cancel path: `pending → cancelled/`
|
||||
|
||||
## Recovery
|
||||
|
||||
### World state is stale or corrupt
|
||||
```bash
|
||||
# On VPS — delete checkpoint to force full replay
|
||||
rm /opt/homelab/state/observer_checkpoint.json
|
||||
docker restart control-plane-observer
|
||||
```
|
||||
|
||||
### Flood of pending actions after bootstrap
|
||||
Check if node-agent is running and emitting `service_healthy` events on each node. Without `service_healthy`, the supervisor sees all services as missing and queues redeployments every cycle.
|
||||
### 3. Verification
|
||||
Verify the stack is healthy:
|
||||
|
||||
```bash
|
||||
# Check node-agent on each node
|
||||
ssh oskar@<node> "docker ps --filter name=node-agent && docker logs node-agent --tail 20"
|
||||
cd services/control-plane
|
||||
docker compose ps
|
||||
curl http://localhost:8080/summary
|
||||
```
|
||||
|
||||
### Rebuild from scratch
|
||||
```bash
|
||||
ssh oskar@100.95.58.48 "cd ~/homelab-codex-ws/services/control-plane && docker compose up -d --build --force-recreate"
|
||||
```
|
||||
## Operational Workflows
|
||||
|
||||
### Action Approval
|
||||
1. Access the Operator UI (via Tailscale IP or Nginx Proxy Manager).
|
||||
2. Navigate to **Action Queue**.
|
||||
3. Review **Pending** actions recommended by the Supervisor.
|
||||
4. Click **Approve** to move actions to the execution queue.
|
||||
|
||||
### Recovery Flow
|
||||
In case of control plane failure:
|
||||
1. Check logs: `docker compose logs -f`.
|
||||
2. Restart stack: `docker compose restart`.
|
||||
3. Rebuild world state: Delete `/opt/homelab/state/observer_checkpoint.json` and restart the observer service.
|
||||
|
||||
### Upgrade Flow
|
||||
1. Pull latest changes from git.
|
||||
2. Run bootstrap script again: `./scripts/bootstrap/vps-control-plane.sh`.
|
||||
- This will rebuild images and restart containers with new code.
|
||||
|
||||
### Rollback Semantics
|
||||
Since the runtime is filesystem-first and append-only:
|
||||
1. Roll back the repository state to a previous commit.
|
||||
2. Restart the control plane stack.
|
||||
3. The supervisor will detect drift against the older (rolled-back) desired state and recommend actions to restore it.
|
||||
|
||||
## Runtime Safety
|
||||
|
||||
- **Readonly Mounts**: Most services mount the repository as `:ro` to prevent accidental mutations.
|
||||
- **Least-Privilege**: UI, Observer, and Supervisor run as non-root `homelab` user (UID 1000).
|
||||
- **Filesystem Isolation**: Clear separation between `/repo` (code/inventory) and `/opt/homelab` (runtime state).
|
||||
|
||||
## Integration
|
||||
|
||||
### piha agent-system webui (port 18180 on piha)
|
||||
The `agent-system-runtime-materializer` on piha polls the VPS control-plane API every 10 seconds and mirrors world state to piha's local `/opt/homelab/world/`. This ensures the **"Copy for AI"** button in the piha webui (`agent-system-webui`) reflects the same clean state as the VPS API.
|
||||
|
||||
Override: `hosts/piha/runtime/agent-system/docker-compose.override.yml` — sets `CONTROL_PLANE_URL=http://100.95.58.48:18180`.
|
||||
|
||||
### Nginx Proxy Manager
|
||||
The operator UI at port 18180 can be proxied via NPM for external access. No WebSocket support required.
|
||||
Configure a proxy host in NPM to point to `http://control-plane-ui:8080`. Ensure WebSockets are enabled if the UI uses them.
|
||||
|
||||
### Log Locations
|
||||
- Container logs: `docker compose logs -f` (from `services/control-plane/`)
|
||||
- Container logs: `docker compose logs`
|
||||
- Runtime events: `/opt/homelab/events/YYYY-MM-DD/`
|
||||
- World state: `/opt/homelab/world/`
|
||||
- Action queue: `/opt/homelab/actions/{pending,approved,running,completed,failed,cancelled}/`
|
||||
- Diagnostics: `/opt/homelab/logs/`
|
||||
|
|
|
|||
|
|
@ -1,24 +0,0 @@
|
|||
host: chelsty-ha
|
||||
site: chelsty
|
||||
|
||||
capabilities:
|
||||
networking:
|
||||
reachability: tailscale-only
|
||||
tailscale_ip: 100.122.201.23
|
||||
ingress_suitability: false
|
||||
bandwidth: LTE
|
||||
|
||||
runtime:
|
||||
container_engine: docker
|
||||
os: debian
|
||||
|
||||
operational:
|
||||
connectivity: intermittent
|
||||
availability_target: best-effort
|
||||
offline_first: true
|
||||
uplink: lte
|
||||
|
||||
deployment:
|
||||
suitability:
|
||||
- homeassistant
|
||||
restricted: false
|
||||
|
|
@ -1,20 +0,0 @@
|
|||
hostname: chelsty-ha
|
||||
site: chelsty
|
||||
|
||||
roles:
|
||||
- homeassistant
|
||||
|
||||
network:
|
||||
tailscale_ip: 100.122.201.23
|
||||
|
||||
runtime:
|
||||
root: /opt/homelab
|
||||
|
||||
deployment:
|
||||
mode: pull
|
||||
managed_by: saturn
|
||||
|
||||
constraints:
|
||||
connectivity:
|
||||
intermittent: true
|
||||
uplink: lte
|
||||
|
|
@ -1,12 +0,0 @@
|
|||
host: chelsty-ha
|
||||
site: chelsty
|
||||
|
||||
services:
|
||||
homeassistant:
|
||||
role: home-automation-controller
|
||||
offline_required: true
|
||||
# monitor: false — chelsty-ha has no node-agent deployed, so there are no
|
||||
# container-health events for the observer to track. HA is monitored
|
||||
# indirectly via the chelsty-infra MQTT broker (if MQTT goes silent, HA
|
||||
# is likely down). Re-enable once node-agent is bootstrapped on this VM.
|
||||
monitor: false
|
||||
|
|
@ -1,88 +0,0 @@
|
|||
# Frigate NVR — chelsty-infra
|
||||
# Hardware decode: Intel UHD 630 via VAAPI (/dev/dri/renderD128)
|
||||
# Object detection: CPU (no Coral TPU)
|
||||
# Cameras: 2x Reolink RLC-540 (5MP, WiFi)
|
||||
#
|
||||
# Required env vars in /opt/homelab/config/frigate/frigate.env:
|
||||
# CAMERA1_IP, CAMERA1_USER, CAMERA1_PASS
|
||||
# CAMERA2_IP, CAMERA2_USER, CAMERA2_PASS
|
||||
# MQTT_USER, MQTT_PASS (if mosquitto auth is enabled)
|
||||
|
||||
mqtt:
|
||||
enabled: true
|
||||
host: 127.0.0.1
|
||||
port: 1883
|
||||
# user: "{MQTT_USER}"
|
||||
# password: "{MQTT_PASS}"
|
||||
|
||||
detectors:
|
||||
cpu1:
|
||||
type: cpu
|
||||
num_threads: 3
|
||||
|
||||
ffmpeg:
|
||||
hwaccel_args: preset-vaapi
|
||||
global_args:
|
||||
- -hide_banner
|
||||
- -loglevel
|
||||
- warning
|
||||
|
||||
record:
|
||||
enabled: true
|
||||
retain:
|
||||
days: 7
|
||||
mode: all
|
||||
events:
|
||||
retain:
|
||||
default: 14
|
||||
mode: motion
|
||||
|
||||
snapshots:
|
||||
enabled: true
|
||||
retain:
|
||||
default: 7
|
||||
quality: 70
|
||||
|
||||
objects:
|
||||
track:
|
||||
- person
|
||||
- car
|
||||
- bicycle
|
||||
filters:
|
||||
person:
|
||||
min_area: 5000
|
||||
max_area: 100000
|
||||
threshold: 0.7
|
||||
|
||||
cameras:
|
||||
camera1:
|
||||
ffmpeg:
|
||||
inputs:
|
||||
# Main stream — high-res recording
|
||||
- path: rtsp://{CAMERA1_USER}:{CAMERA1_PASS}@{CAMERA1_IP}:554/h264Preview_01_main
|
||||
roles:
|
||||
- record
|
||||
# Sub stream — low-res detection (lower CPU cost)
|
||||
- path: rtsp://{CAMERA1_USER}:{CAMERA1_PASS}@{CAMERA1_IP}:554/h264Preview_01_sub
|
||||
roles:
|
||||
- detect
|
||||
detect:
|
||||
enabled: true
|
||||
width: 640
|
||||
height: 480
|
||||
fps: 5
|
||||
|
||||
camera2:
|
||||
ffmpeg:
|
||||
inputs:
|
||||
- path: rtsp://{CAMERA2_USER}:{CAMERA2_PASS}@{CAMERA2_IP}:554/h264Preview_01_main
|
||||
roles:
|
||||
- record
|
||||
- path: rtsp://{CAMERA2_USER}:{CAMERA2_PASS}@{CAMERA2_IP}:554/h264Preview_01_sub
|
||||
roles:
|
||||
- detect
|
||||
detect:
|
||||
enabled: true
|
||||
width: 640
|
||||
height: 480
|
||||
fps: 5
|
||||
|
|
@ -1,25 +0,0 @@
|
|||
services:
|
||||
frigate:
|
||||
container_name: frigate
|
||||
image: ghcr.io/blakeblackshear/frigate:stable
|
||||
restart: unless-stopped
|
||||
privileged: true
|
||||
shm_size: "256mb"
|
||||
network_mode: host
|
||||
devices:
|
||||
- /dev/dri/renderD128:/dev/dri/renderD128
|
||||
volumes:
|
||||
- /etc/localtime:/etc/localtime:ro
|
||||
- /opt/homelab/config/frigate/config.yml:/config/config.yml
|
||||
- /opt/homelab/config/frigate:/config/credentials:ro
|
||||
- /opt/homelab/data/frigate:/media/frigate
|
||||
tmpfs:
|
||||
- /tmp/cache
|
||||
env_file:
|
||||
- /opt/homelab/config/frigate/frigate.env
|
||||
healthcheck:
|
||||
test: ["CMD-SHELL", "wget -q --spider http://localhost:5000/api/version 2>&1 || exit 1"]
|
||||
interval: 30s
|
||||
timeout: 10s
|
||||
retries: 3
|
||||
start_period: 60s
|
||||
|
|
@ -1,11 +0,0 @@
|
|||
services:
|
||||
node-agent:
|
||||
environment:
|
||||
- NODE_NAME=chelsty-infra
|
||||
- NODE_TYPE=lte_node
|
||||
- VPS_EVENTS_HOST=100.95.58.48
|
||||
- VPS_EVENTS_USER=oskar
|
||||
- VPS_EVENTS_PATH=/opt/homelab/events
|
||||
- CHECK_INTERVAL=60
|
||||
volumes:
|
||||
- /home/oskar/.ssh:/home/homelab/.ssh:ro
|
||||
|
|
@ -1,12 +0,0 @@
|
|||
services:
|
||||
stability-agent:
|
||||
environment:
|
||||
- NODE_NAME=chelsty-infra
|
||||
- SITE_NAME=chelsty
|
||||
- REDIS_HOST=100.108.208.3
|
||||
- REDIS_PORT=6379
|
||||
- REDIS_ENABLED=true
|
||||
- STABILITY_CHECK_INTERVAL=60
|
||||
- DISK_THRESHOLD_PCT=85
|
||||
- MQTT_HOST=mosquitto
|
||||
- MQTT_PORT=1883
|
||||
|
|
@ -1,21 +0,0 @@
|
|||
services:
|
||||
zigbee2mqtt:
|
||||
# mosquitto runs with network_mode: host on chelsty-infra.
|
||||
# extra_hosts maps the 'mosquitto' hostname to the host gateway IP so that
|
||||
# mqtt://mosquitto:1883 in configuration.yaml reaches the host-networked
|
||||
# mosquitto process. Requires Docker 20.10+ (present on chelsty-infra).
|
||||
extra_hosts:
|
||||
- "mosquitto:host-gateway"
|
||||
environment:
|
||||
- TZ=Europe/Warsaw
|
||||
healthcheck:
|
||||
test: ["CMD-SHELL", "wget -qO- http://localhost:8080 > /dev/null 2>&1 || exit 1"]
|
||||
interval: 30s
|
||||
timeout: 10s
|
||||
retries: 3
|
||||
start_period: 90s
|
||||
# Note: volumes NOT overridden here.
|
||||
# The base docker-compose.yml mounts /opt/homelab/data/zigbee2mqtt/data:/app/data
|
||||
# (read-write). configuration.yaml must be placed in that directory on the node:
|
||||
# /opt/homelab/data/zigbee2mqtt/data/configuration.yaml
|
||||
# z2m rewrites this file during migrations — read-only mount is not viable.
|
||||
|
|
@ -1,37 +0,0 @@
|
|||
host: chelsty-infra
|
||||
site: chelsty
|
||||
|
||||
services:
|
||||
ha-diag-agent:
|
||||
role: ha-diagnostic-agent
|
||||
deployment_model: docker-compose
|
||||
exposure: local-only
|
||||
offline_required: false
|
||||
depends_on:
|
||||
local: []
|
||||
external: [homeassistant]
|
||||
config:
|
||||
target_url: http://100.70.180.90:8123 # chelsty-ha via Tailscale (HAOS, separate VM)
|
||||
location_tag: "chelsty"
|
||||
events_dir: /opt/homelab/events/chelsty-infra
|
||||
runtime:
|
||||
config_path: /opt/homelab/config/ha-diag-agent
|
||||
data_path: /var/lib/ha-diag-agent
|
||||
|
||||
node-agent:
|
||||
role: node-stability-monitor
|
||||
# LTE node: node-agent monitors and emits events but does NO Docker cleanup.
|
||||
# Disk pressure on chelsty-infra is typically Frigate recordings; Frigate's
|
||||
# own retain policy is the correct remediation, not docker prune.
|
||||
deployment_model: docker-compose
|
||||
exposure: local-only
|
||||
offline_required: true
|
||||
|
||||
mosquitto:
|
||||
role: local-mqtt-broker
|
||||
|
||||
zigbee2mqtt:
|
||||
role: zigbee-mqtt-bridge
|
||||
|
||||
frigate:
|
||||
role: nvr
|
||||
|
|
@ -1,6 +1,3 @@
|
|||
host: chelsty-infra
|
||||
site: chelsty
|
||||
|
||||
capabilities:
|
||||
hardware:
|
||||
cpu:
|
||||
|
|
@ -34,11 +31,10 @@ capabilities:
|
|||
power_constraint: low-power
|
||||
connectivity: intermittent
|
||||
availability_target: best-effort
|
||||
offline_operation_required: true
|
||||
|
||||
deployment:
|
||||
suitability:
|
||||
- staging
|
||||
- infra
|
||||
- homeassistant
|
||||
- edge
|
||||
restricted: false
|
||||
|
|
@ -1,10 +1,9 @@
|
|||
hostname: chelsty-infra
|
||||
site: chelsty
|
||||
hostname: chelsty
|
||||
|
||||
roles:
|
||||
- edge
|
||||
- hypervisor
|
||||
- infra
|
||||
- homeassistant
|
||||
- staging
|
||||
|
||||
network:
|
||||
|
|
@ -1,4 +1,4 @@
|
|||
host: chelsty-infra
|
||||
host: chelsty
|
||||
|
||||
uplink:
|
||||
type: lte
|
||||
|
|
@ -20,7 +20,7 @@ exposure_classes:
|
|||
|
||||
networks:
|
||||
home_automation_lan:
|
||||
purpose: MQTT broker, Zigbee coordinator, and local device control.
|
||||
purpose: Home Assistant, MQTT, Zigbee coordinator, and local device control.
|
||||
offline_required: true
|
||||
internet_required_for_core_operation: false
|
||||
|
||||
|
|
@ -1,4 +1,4 @@
|
|||
host: chelsty-infra
|
||||
host: chelsty
|
||||
|
||||
runtime_root: /opt/homelab
|
||||
|
||||
|
|
@ -9,6 +9,12 @@ conventions:
|
|||
logs: /opt/homelab/logs
|
||||
|
||||
services:
|
||||
homeassistant:
|
||||
data: /opt/homelab/data/homeassistant
|
||||
config: /opt/homelab/config/homeassistant
|
||||
logs: /opt/homelab/logs/homeassistant
|
||||
backup_priority: critical
|
||||
|
||||
zigbee2mqtt:
|
||||
data: /opt/homelab/data/zigbee2mqtt
|
||||
config: /opt/homelab/config/zigbee2mqtt
|
||||
|
|
@ -21,13 +27,13 @@ services:
|
|||
logs: /opt/homelab/logs/mosquitto
|
||||
backup_priority: high
|
||||
|
||||
stability-agent:
|
||||
data: /opt/homelab/state
|
||||
config: /opt/homelab/config/stability-agent
|
||||
logs: /opt/homelab/events
|
||||
backup_priority: low
|
||||
|
||||
backup_sets:
|
||||
homeassistant:
|
||||
include:
|
||||
- /opt/homelab/config/homeassistant
|
||||
- /opt/homelab/data/homeassistant
|
||||
restore_note: Restore before starting the Home Assistant container.
|
||||
|
||||
zigbee2mqtt:
|
||||
include:
|
||||
- /opt/homelab/config/zigbee2mqtt
|
||||
|
|
@ -0,0 +1,13 @@
|
|||
services:
|
||||
zigbee2mqtt:
|
||||
volumes:
|
||||
- ./configuration.yaml:/app/data/configuration.yaml:ro
|
||||
environment:
|
||||
- MQTT_USER=${MQTT_USER}
|
||||
- MQTT_PASSWORD=${MQTT_PASSWORD}
|
||||
# Healthcheck is already defined in base service, but we ensure compatibility
|
||||
healthcheck:
|
||||
test: ["CMD", "curl", "-f", "http://localhost:8080"]
|
||||
interval: 10s
|
||||
timeout: 5s
|
||||
retries: 3
|
||||
108
hosts/chelsty/services.yaml
Normal file
|
|
@ -0,0 +1,108 @@
|
|||
host: chelsty
|
||||
|
||||
exposure_classes:
|
||||
local-only:
|
||||
description: Reachable only from CHELSTY-local networks or container networks.
|
||||
public_ingress: false
|
||||
tailscale_required: false
|
||||
tailscale-internal:
|
||||
description: Reachable through the Tailscale mesh by approved tailnet clients.
|
||||
public_ingress: false
|
||||
tailscale_required: true
|
||||
public:
|
||||
description: Reachable from the public internet through an explicit ingress path.
|
||||
public_ingress: true
|
||||
tailscale_required: false
|
||||
|
||||
operational_constraints:
|
||||
uplink: lte
|
||||
connectivity: intermittent
|
||||
offline_operation_required: true
|
||||
must_not_depend_on:
|
||||
- saturn
|
||||
- vps
|
||||
- forgejo
|
||||
|
||||
services:
|
||||
homeassistant:
|
||||
role: home-automation-controller
|
||||
deployment_model: docker-compose
|
||||
exposure: tailscale-internal
|
||||
offline_required: true
|
||||
depends_on:
|
||||
local:
|
||||
- mosquitto
|
||||
- zigbee2mqtt
|
||||
external: []
|
||||
ports:
|
||||
- name: http
|
||||
container_port: 8123
|
||||
protocol: tcp
|
||||
runtime:
|
||||
config_path: /opt/homelab/config/homeassistant
|
||||
data_path: /opt/homelab/data/homeassistant
|
||||
logs_path: /opt/homelab/logs/homeassistant
|
||||
backup:
|
||||
recommended: true
|
||||
include:
|
||||
- /opt/homelab/config/homeassistant
|
||||
- /opt/homelab/data/homeassistant
|
||||
notes:
|
||||
- Back up before Home Assistant core, supervisor-equivalent, or integration upgrades.
|
||||
- Keep local restore copies on CHELSTY because LTE connectivity may be unavailable during recovery.
|
||||
|
||||
zigbee2mqtt:
|
||||
role: zigbee-mqtt-bridge
|
||||
deployment_model: docker-compose
|
||||
exposure: local-only
|
||||
offline_required: true
|
||||
depends_on:
|
||||
local:
|
||||
- mosquitto
|
||||
external:
|
||||
- slzb-06u
|
||||
coordinator:
|
||||
name: slzb-06u
|
||||
connection: network
|
||||
usb_device: null
|
||||
ports:
|
||||
- name: frontend
|
||||
container_port: 8080
|
||||
protocol: tcp
|
||||
exposure: tailscale-internal
|
||||
runtime:
|
||||
config_path: /opt/homelab/config/zigbee2mqtt
|
||||
data_path: /opt/homelab/data/zigbee2mqtt
|
||||
logs_path: /opt/homelab/logs/zigbee2mqtt
|
||||
backup:
|
||||
recommended: true
|
||||
include:
|
||||
- /opt/homelab/config/zigbee2mqtt
|
||||
- /opt/homelab/data/zigbee2mqtt
|
||||
notes:
|
||||
- Include configuration.yaml, database.db, coordinator backup files, and network key material.
|
||||
- Restore Zigbee2MQTT state together with the SLZB-06U coordinator state when replacing hardware.
|
||||
|
||||
mosquitto:
|
||||
role: local-mqtt-broker
|
||||
deployment_model: docker-compose
|
||||
exposure: local-only
|
||||
offline_required: true
|
||||
depends_on:
|
||||
local: []
|
||||
external: []
|
||||
ports:
|
||||
- name: mqtt
|
||||
container_port: 1883
|
||||
protocol: tcp
|
||||
runtime:
|
||||
config_path: /opt/homelab/config/mosquitto
|
||||
data_path: /opt/homelab/data/mosquitto
|
||||
logs_path: /opt/homelab/logs/mosquitto
|
||||
backup:
|
||||
recommended: true
|
||||
include:
|
||||
- /opt/homelab/config/mosquitto
|
||||
- /opt/homelab/data/mosquitto
|
||||
notes:
|
||||
- Retain ACL, password, persistence, and bridge configuration if enabled.
|
||||
|
|
@ -1,32 +0,0 @@
|
|||
# hosts/lustro/node.yaml — LUSTRO edge node manifest
|
||||
# First-contact bootstrap: scripts/onboard/onboard.sh --node lustro --step 00-access
|
||||
# Full onboarding: scripts/onboard/onboard.sh --node lustro
|
||||
|
||||
name: LUSTRO
|
||||
role: edge
|
||||
location: KEN
|
||||
|
||||
ssh_user: pi
|
||||
first_contact: pi@192.168.31.19 # KEN LAN IP; mDNS .local is unreliable; the mesh takes over after tailscale up
|
||||
|
||||
tailscale:
|
||||
hostname: lustro
|
||||
# ip: TODO — fill after tailscale join (step 30-install-tailscale)
|
||||
|
||||
deploy_autonomy: true # onboard.sh may run mutating steps autonomously
|
||||
git_control: false # node does NOT pull from Forgejo; push-based via SATURN
|
||||
|
||||
hardware:
|
||||
arch: arm64
|
||||
ram_mb: 4096
|
||||
swap:
|
||||
kind: zram
|
||||
mb: 2048
|
||||
docker_present: true
|
||||
mm_runtime: systemd:magicmirror.service
|
||||
|
||||
services:
|
||||
node-agent:
|
||||
runtime:
|
||||
engine: docker
|
||||
mem_limit: 256m
|
||||
|
|
@ -1,23 +0,0 @@
|
|||
services:
|
||||
node-agent:
|
||||
# Docker GID on LUSTRO is 991 (not the Debian default 999).
|
||||
# Compose concatenates group_add lists; 991 is what gives socket access here.
|
||||
group_add:
|
||||
- "991"
|
||||
mem_limit: 256m # RPi4 4 GiB; MagicMirror consumes ~1.9 GiB — agent must be bounded
|
||||
environment:
|
||||
- NODE_NAME=lustro
|
||||
- NODE_TYPE=sd_card
|
||||
- VPS_EVENTS_HOST=100.95.58.48
|
||||
- VPS_EVENTS_USER=oskar
|
||||
- VPS_EVENTS_PATH=/opt/homelab/events
|
||||
- CHECK_INTERVAL=60
|
||||
volumes:
|
||||
# pi's SSH key for rsync event shipping to VPS (push-based node, no repo
|
||||
# checkout). Container runs as uid 1000 (homelab, HOME=/home/homelab) per
|
||||
# the base compose — ssh has no -i flag, so the key must land in
|
||||
# /home/homelab/.ssh, NOT /root/.ssh. uid match (pi=1000) satisfies
|
||||
# OpenSSH strict ownership checks on the mounted key.
|
||||
- /home/pi/.ssh:/home/homelab/.ssh:ro
|
||||
# Override ../.. from the base compose to the pushed deploy dir (no repo on node)
|
||||
- /opt/homelab/deploy/node-agent:/repo:ro
|
||||
|
|
@ -1,15 +0,0 @@
|
|||
host: lustro
|
||||
|
||||
services:
|
||||
node-agent:
|
||||
role: node-stability-monitor
|
||||
deployment_model: docker-compose
|
||||
exposure: local-only
|
||||
offline_required: true
|
||||
depends_on:
|
||||
local: []
|
||||
external: []
|
||||
runtime:
|
||||
config_path: /opt/homelab/config/node-agent
|
||||
data_path: /opt/homelab/state
|
||||
logs_path: /opt/homelab/events
|
||||
|
|
@ -1,8 +0,0 @@
|
|||
services:
|
||||
runtime-materializer:
|
||||
environment:
|
||||
# Pull world state from the VPS control-plane API instead of local Redis.
|
||||
# The observer on VPS is the authoritative writer; mirroring its API output
|
||||
# here ensures the webui /snapshot matches the clean 97-service state that
|
||||
# the control-plane /summary endpoint serves.
|
||||
CONTROL_PLANE_URL: "http://100.95.58.48:18180"
|
||||
|
|
@ -1,4 +0,0 @@
|
|||
services:
|
||||
brain-watchdog:
|
||||
mem_limit: 64m
|
||||
restart: unless-stopped
|
||||
|
|
@ -1,12 +0,0 @@
|
|||
services:
|
||||
ha-diag-agent:
|
||||
environment:
|
||||
- NODE_NAME=piha
|
||||
# Pin events to the piha-specific subdirectory; overrides the ${NODE_NAME}
|
||||
# variable substitution in the base compose file which requires a shell env var.
|
||||
volumes:
|
||||
- /opt/homelab/events/piha:/events
|
||||
- /var/lib/ha-diag-agent:/data
|
||||
- /opt/homelab/config/ha-diag-agent:/config:ro
|
||||
mem_limit: 128m
|
||||
restart: unless-stopped
|
||||
|
|
@ -1,11 +0,0 @@
|
|||
services:
|
||||
node-agent:
|
||||
environment:
|
||||
- NODE_NAME=piha
|
||||
- NODE_TYPE=sd_card
|
||||
- VPS_EVENTS_HOST=100.95.58.48
|
||||
- VPS_EVENTS_USER=oskar
|
||||
- VPS_EVENTS_PATH=/opt/homelab/events
|
||||
- CHECK_INTERVAL=60
|
||||
volumes:
|
||||
- /home/oskar/.ssh:/home/homelab/.ssh:ro
|
||||
|
|
@ -1,7 +0,0 @@
|
|||
services:
|
||||
stability-agent:
|
||||
environment:
|
||||
- NODE_NAME=piha
|
||||
- REDIS_HOST=100.108.208.3
|
||||
- REDIS_PORT=6379
|
||||
- REDIS_ENABLED=true
|
||||
|
|
@ -1,42 +0,0 @@
|
|||
host: piha
|
||||
|
||||
services:
|
||||
ha-diag-agent:
|
||||
role: ha-diagnostic-agent
|
||||
deployment_model: docker-compose
|
||||
exposure: local-only
|
||||
offline_required: false
|
||||
depends_on:
|
||||
local: []
|
||||
external: [homeassistant]
|
||||
config:
|
||||
target_url: http://localhost:8123
|
||||
location_tag: "ken"
|
||||
events_dir: /opt/homelab/events/piha
|
||||
runtime:
|
||||
config_path: /opt/homelab/config/ha-diag-agent
|
||||
data_path: /var/lib/ha-diag-agent
|
||||
|
||||
node-agent:
|
||||
role: node-stability-monitor
|
||||
deployment_model: docker-compose
|
||||
exposure: local-only
|
||||
offline_required: true
|
||||
depends_on:
|
||||
local: []
|
||||
external: []
|
||||
runtime:
|
||||
config_path: /opt/homelab/config/node-agent
|
||||
data_path: /opt/homelab/state
|
||||
logs_path: /opt/homelab/events
|
||||
|
||||
brain-watchdog:
|
||||
role: control-plane-watchdog
|
||||
deployment_model: docker-compose
|
||||
exposure: private
|
||||
offline_required: false
|
||||
depends_on:
|
||||
local: []
|
||||
external: [control-plane]
|
||||
runtime:
|
||||
config_path: /opt/homelab/config/brain-watchdog
|
||||
|
|
@ -1,11 +0,0 @@
|
|||
services:
|
||||
node-agent:
|
||||
environment:
|
||||
- NODE_NAME=solaria
|
||||
- NODE_TYPE=ai_node
|
||||
- VPS_EVENTS_HOST=100.95.58.48
|
||||
- VPS_EVENTS_USER=oskar
|
||||
- VPS_EVENTS_PATH=/opt/homelab/events
|
||||
- CHECK_INTERVAL=60
|
||||
volumes:
|
||||
- /home/oskar/.ssh:/home/homelab/.ssh:ro
|
||||
|
|
@ -1,7 +0,0 @@
|
|||
services:
|
||||
stability-agent:
|
||||
environment:
|
||||
- NODE_NAME=solaria
|
||||
- REDIS_HOST=100.108.208.3
|
||||
- REDIS_PORT=6379
|
||||
- REDIS_ENABLED=true
|
||||
|
|
@ -1,15 +0,0 @@
|
|||
host: solaria
|
||||
|
||||
services:
|
||||
node-agent:
|
||||
role: node-stability-monitor
|
||||
deployment_model: docker-compose
|
||||
exposure: local-only
|
||||
offline_required: true
|
||||
depends_on:
|
||||
local: []
|
||||
external: []
|
||||
runtime:
|
||||
config_path: /opt/homelab/config/node-agent
|
||||
data_path: /opt/homelab/state
|
||||
logs_path: /opt/homelab/events
|
||||
|
|
@ -1,39 +0,0 @@
|
|||
# Control-plane production overrides for the VPS deployment.
|
||||
#
|
||||
# NODE_ALIAS_MAP translates the node names that appear in raw event files
|
||||
# (written by node agents / seed scripts) to the canonical names used in
|
||||
# inventory/topology.yaml and hosts/*/services.yaml.
|
||||
#
|
||||
# Current live mapping (from /opt/homelab/events/ inspection):
|
||||
# node-2 → chelsty (zigbee2mqtt / mosquitto / homeassistant node)
|
||||
#
|
||||
# Add further entries when new nodes come online and their event-source names
|
||||
# differ from their topology names. Format is a single-line JSON object, e.g.:
|
||||
# NODE_ALIAS_MAP='{"node-2":"chelsty","node-3":"piha"}'
|
||||
#
|
||||
# The executor inherits the canonical name from the action JSON written by the
|
||||
# supervisor, so NODE_ALIAS_MAP is only required on the supervisor service.
|
||||
#
|
||||
# Memory limits: VPS has 4 GiB RAM, no swap. oom_score_adj -900 ensures the
|
||||
# host kernel OOM-killer never targets control-plane containers. mem_limit
|
||||
# provides a per-container cgroup ceiling so a leaking process is restarted by
|
||||
# Docker before it can exhaust host memory.
|
||||
|
||||
services:
|
||||
operator-ui:
|
||||
mem_limit: 192m
|
||||
oom_score_adj: -900
|
||||
|
||||
observer:
|
||||
mem_limit: 192m
|
||||
oom_score_adj: -900
|
||||
|
||||
supervisor:
|
||||
mem_limit: 400m
|
||||
oom_score_adj: -900
|
||||
environment:
|
||||
- NODE_ALIAS_MAP={"node-2":"chelsty"}
|
||||
|
||||
executor:
|
||||
mem_limit: 64m
|
||||
oom_score_adj: -900
|
||||
|
|
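The header comment above describes NODE_ALIAS_MAP as a single-line JSON object that maps event-source names to canonical topology names. A minimal sketch of that lookup follows. The helper name `resolve_node` is hypothetical, and the pass-through for unmapped names is an assumption; the diff does not show how the supervisor actually consumes the variable.

```bash
#!/usr/bin/env bash
# Hypothetical helper: resolve a raw event-source node name to its canonical
# topology name using NODE_ALIAS_MAP (a single-line JSON object).
NODE_ALIAS_MAP='{"node-2":"chelsty"}'

resolve_node() {
  local raw="$1"
  NODE_ALIAS_MAP="$NODE_ALIAS_MAP" python3 -c '
import json, os, sys
aliases = json.loads(os.environ.get("NODE_ALIAS_MAP") or "{}")
if not isinstance(aliases, dict):
    sys.exit("NODE_ALIAS_MAP must be a JSON object")
# Assumption: names without an alias pass through unchanged.
print(aliases.get(sys.argv[1], sys.argv[1]))
' "$raw"
}

resolve_node node-2   # prints: chelsty
resolve_node piha     # prints: piha
```

Parsing with `json.loads` rather than string matching means a malformed map fails loudly instead of silently resolving nothing.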
@ -1,16 +0,0 @@
|
|||
services:
|
||||
node-agent:
|
||||
environment:
|
||||
- NODE_NAME=vps
|
||||
- CHECK_INTERVAL=60
|
||||
# host network mode: node-agent on VPS shares the host's network namespace
|
||||
# so that localhost:18180 resolves to the control-plane's exposed port.
|
||||
# Without this, localhost inside the container is the container's own loopback
|
||||
# and the _check_control_plane_health() probe would always fail.
|
||||
network_mode: host
|
||||
# HARD memory ceiling: node-agent mounts /opt/homelab/events/ (page cache)
|
||||
# and may accumulate Python RSS over hours; 640m cap ensures it is killed and
|
||||
# auto-restarted by Docker before consuming host memory. oom_score_adj -900
|
||||
# prevents the host kernel OOM-killer from picking it as a global victim.
|
||||
mem_limit: 640m
|
||||
oom_score_adj: -900
|
||||
|
|
@ -1,9 +0,0 @@
|
|||
services:
|
||||
stability-agent:
|
||||
environment:
|
||||
- NODE_NAME=vps
|
||||
- REDIS_HOST=100.108.208.3
|
||||
- REDIS_PORT=6379
|
||||
- REDIS_ENABLED=true
|
||||
mem_limit: 96m
|
||||
oom_score_adj: -900
|
||||
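The VPS overrides above cap six containers with `mem_limit` on a host described as having 4 GiB RAM and no swap. As a rough sanity check, the sketch below sums those ceilings. It assumes Docker's `m` suffix means MiB and covers only the services whose limits appear in this diff; other VPS services (npm, outline, joplin, ai-cluster) have no limit shown here and are not counted.

```bash
#!/usr/bin/env bash
# Sum the mem_limit values from the VPS overrides in this diff (all in MiB).
declare -A MEM_LIMIT_MIB=(
  [operator-ui]=192 [observer]=192 [supervisor]=400
  [executor]=64 [node-agent]=640 [stability-agent]=96
)
HOST_RAM_MIB=4096   # "VPS has 4 GiB RAM, no swap"

total=0
for svc in "${!MEM_LIMIT_MIB[@]}"; do
  total=$(( total + ${MEM_LIMIT_MIB[$svc]} ))
done
echo "sum of limits: ${total} MiB, headroom: $(( HOST_RAM_MIB - total )) MiB"
# prints: sum of limits: 1584 MiB, headroom: 2512 MiB
```

The headroom figure is an upper bound: uncapped containers and the host itself draw from it.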
1
hosts/vps/services.txt
Normal file
|
|
@ -0,0 +1 @@
|
|||
npm
|
||||
|
|
@ -1,43 +0,0 @@
|
|||
host: vps
|
||||
|
||||
services:
|
||||
node-agent:
|
||||
role: node-stability-monitor
|
||||
deployment_model: docker-compose
|
||||
exposure: local-only
|
||||
offline_required: true
|
||||
depends_on:
|
||||
local: []
|
||||
external: []
|
||||
runtime:
|
||||
config_path: /opt/homelab/config/node-agent
|
||||
data_path: /opt/homelab/state
|
||||
logs_path: /opt/homelab/events
|
||||
|
||||
control-plane:
|
||||
role: management-and-orchestration
|
||||
deployment_model: docker-compose
|
||||
exposure: tailscale-internal
|
||||
offline_required: false
|
||||
depends_on:
|
||||
local:
|
||||
- node-agent
|
||||
external:
|
||||
- piha:redis
|
||||
ports:
|
||||
- name: http
|
||||
container_port: 18180
|
||||
protocol: tcp
|
||||
runtime:
|
||||
config_path: /opt/homelab/config/control-plane
|
||||
data_path: /opt/homelab/data/control-plane
|
||||
logs_path: /opt/homelab/logs/control-plane
|
||||
|
||||
node_exporter:
|
||||
role: metrics-exporter
|
||||
deployment_model: docker-compose
|
||||
exposure: local-only
|
||||
offline_required: true
|
||||
depends_on:
|
||||
local: []
|
||||
external: []
|
||||
|
|
@ -17,10 +17,6 @@ nodes:
|
|||
roles:
|
||||
- infra
|
||||
- monitoring
|
||||
services:
|
||||
- node-agent
|
||||
- ha-diag-agent
|
||||
- brain-watchdog
|
||||
|
||||
solaria:
|
||||
roles:
|
||||
|
|
@ -31,25 +27,12 @@ nodes:
|
|||
roles:
|
||||
- edge
|
||||
- ingress
|
||||
- control-plane
|
||||
services:
|
||||
# Repo-managed GitOps services (hosts/vps/services.yaml is authoritative)
|
||||
- node-agent
|
||||
- control-plane # executor, observer, supervisor, operator-ui
|
||||
- node_exporter
|
||||
- stability-agent
|
||||
- npm # Nginx Proxy Manager — public ingress, TLS termination
|
||||
- outline # Team wiki (outline + postgres + redis)
|
||||
- joplin # Note sync server (joplin-server + postgres)
|
||||
- ai-cluster # AI workers: codex-worker, openclaw, planner-worker,
|
||||
# service-ops-worker, redis, mosquitto
|
||||
|
||||
chelsty-infra:
|
||||
site: chelsty
|
||||
chelsty:
|
||||
roles:
|
||||
- remote
|
||||
- hypervisor
|
||||
- infra
|
||||
- homeassistant
|
||||
- staging
|
||||
connectivity:
|
||||
uplink: lte
|
||||
|
|
@ -57,28 +40,10 @@ nodes:
|
|||
home_automation:
|
||||
offline_operation_required: true
|
||||
services:
|
||||
- homeassistant
|
||||
- zigbee2mqtt
|
||||
- mosquitto
|
||||
coordinator:
|
||||
model: SLZB-06U
|
||||
connection: network
|
||||
usb: false
|
||||
|
||||
chelsty-ha:
|
||||
site: chelsty
|
||||
roles:
|
||||
- remote
|
||||
- homeassistant
|
||||
connectivity:
|
||||
uplink: lte
|
||||
intermittent: true
|
||||
home_automation:
|
||||
offline_operation_required: true
|
||||
services:
|
||||
- homeassistant
|
||||
|
||||
lustro:
|
||||
roles:
|
||||
- edge
|
||||
services:
|
||||
- node-agent
|
||||
|
|
|
|||
|
|
@ -1,23 +0,0 @@
|
|||
#!/bin/bash
|
||||
# scripts/deploy/deploy-control-plane.sh
|
||||
set -e
|
||||
|
||||
VPS_IP="100.95.58.48"
|
||||
USER="oskar"
|
||||
REMOTE_REPO_PATH="/home/oskar/homelab-codex-ws"
|
||||
|
||||
MODE=$1
|
||||
|
||||
case "$MODE" in
|
||||
"--ssh")
|
||||
echo "Deploying to VPS ($VPS_IP) via SSH..."
|
||||
ssh -t "$USER@$VPS_IP" "cd $REMOTE_REPO_PATH && git pull origin master && cd services/control-plane && bash deploy-local.sh"
|
||||
;;
|
||||
"--print")
|
||||
echo "ssh -t $USER@$VPS_IP \"cd $REMOTE_REPO_PATH && git pull origin master && cd services/control-plane && bash deploy-local.sh\""
|
||||
;;
|
||||
*)
|
||||
echo "Usage: $0 [--ssh|--print]"
|
||||
exit 1
|
||||
;;
|
||||
esac
|
||||
|
|
@ -1,26 +0,0 @@
|
|||
#!/usr/bin/env bash
|
||||
# deploy-frigate.sh - Deploy Frigate NVR on chelsty-infra (print or SSH)
|
||||
|
||||
MODE="print"
|
||||
[[ "$1" == "--ssh" ]] && MODE="ssh"
|
||||
|
||||
TARGET="100.122.201.22"
|
||||
NODE="chelsty-infra"
|
||||
REPO_PATH="/home/oskar/homelab-codex-ws"
|
||||
SERVICE_PATH="$REPO_PATH/hosts/chelsty-infra/runtime/frigate"
|
||||
|
||||
echo "HOST: $NODE"
|
||||
echo "MODE: $MODE"
|
||||
echo "TARGET: $TARGET"
|
||||
|
||||
# Secrets must exist at /opt/homelab/config/frigate/frigate.env on the node
|
||||
# before first deploy. See config.yml for required variables.
|
||||
DEPLOY_CMD="cd $REPO_PATH && git fetch origin && git checkout master && git pull origin master && cd $SERVICE_PATH && docker-compose pull && docker-compose up -d"
|
||||
|
||||
if [[ "$MODE" == "ssh" ]]; then
|
||||
echo "--- Deploying Frigate to $NODE ($TARGET) via SSH ---"
|
||||
ssh oskar@$TARGET "$DEPLOY_CMD"
|
||||
else
|
||||
echo "# --- Deployment commands for $NODE ---"
|
||||
echo "ssh oskar@$TARGET '$DEPLOY_CMD'"
|
||||
fi
|
||||
|
|
@ -8,7 +8,6 @@ set -e
|
|||
REPO_PATH="${HOME}/homelab-codex-ws"
|
||||
RUNTIME_PATH="/opt/homelab"
|
||||
HOSTNAME=$(hostname | tr '[:lower:]' '[:upper:]')
|
||||
HOST_DIR="${REPO_PATH}/hosts/$(hostname | tr '[:upper:]' '[:lower:]')"
|
||||
|
||||
echo "--- Starting Deployment on ${HOSTNAME} ---"
|
||||
|
||||
|
|
@ -23,33 +22,20 @@ echo "Pulling latest changes..."
|
|||
git pull
|
||||
|
||||
# 2. Identify Services
|
||||
SERVICES=()
|
||||
if [ -f "${HOST_DIR}/services.txt" ]; then
|
||||
mapfile -t SERVICES < <(grep -v '^\s*#' "${HOST_DIR}/services.txt" | grep -v '^\s*$')
|
||||
elif [ -f "${HOST_DIR}/services.yaml" ]; then
|
||||
SERVICES=($(python3 -c "
|
||||
import yaml, sys
|
||||
try:
|
||||
with open('${HOST_DIR}/services.yaml', 'r') as f:
|
||||
data = yaml.safe_load(f)
|
||||
if data and 'services' in data:
|
||||
if isinstance(data['services'], dict):
|
||||
print(' '.join(data['services'].keys()))
|
||||
elif isinstance(data['services'], list):
|
||||
print(' '.join(data['services']))
|
||||
except Exception as e:
|
||||
print(f'Error parsing YAML: {e}', file=sys.stderr)
|
||||
sys.exit(1)
|
||||
"))
|
||||
fi
|
||||
# Based on our convention, we look for services assigned to this host
|
||||
# For now, we'll check if a 'services.txt' exists in the host folder
|
||||
SERVICE_LIST="${REPO_PATH}/hosts/$(hostname | tr '[:upper:]' '[:lower:]')/services.txt"
|
||||
|
||||
if [ ${#SERVICES[@]} -eq 0 ]; then
|
||||
echo "No services found for ${HOSTNAME}. Skipping service deployment."
|
||||
if [ ! -f "$SERVICE_LIST" ]; then
|
||||
echo "No services.txt found for ${HOSTNAME}. Skipping service deployment."
|
||||
exit 0
|
||||
fi
|
||||
|
||||
# 3. Deploy Services
|
||||
for service in "${SERVICES[@]}"; do
|
||||
while IFS= read -r service || [ -n "$service" ]; do
|
||||
[[ "$service" =~ ^#.*$ ]] && continue # Skip comments
|
||||
[[ -z "$service" ]] && continue # Skip empty lines
|
||||
|
||||
echo "Deploying service: ${service}..."
|
||||
|
||||
COMPOSE_FILE="${REPO_PATH}/services/${service}/docker-compose.yml"
|
||||
|
|
@ -59,10 +45,13 @@ for service in "${SERVICES[@]}"; do
|
|||
continue
|
||||
fi
|
||||
|
||||
# Target directory in runtime
|
||||
TARGET_DIR="${RUNTIME_PATH}/services/${service}"
|
||||
mkdir -p "$TARGET_DIR"
|
||||
|
||||
OVERRIDE_FILE="${HOST_DIR}/runtime/${service}/docker-compose.override.yml"
|
||||
# We use the compose file from the repo directly
|
||||
# but we can also handle overrides here
|
||||
OVERRIDE_FILE="${RUNTIME_PATH}/config/${service}/docker-compose.override.yml"
|
||||
|
||||
COMPOSE_CMD="docker compose -f ${COMPOSE_FILE}"
|
||||
if [ -f "$OVERRIDE_FILE" ]; then
|
||||
|
|
@ -71,6 +60,7 @@ for service in "${SERVICES[@]}"; do
|
|||
fi
|
||||
|
||||
$COMPOSE_CMD up -d --remove-orphans
|
||||
done
|
||||
|
||||
done < "$SERVICE_LIST"
|
||||
|
||||
echo "--- Deployment Complete ---"
|
||||
|
|
|
|||
|
|
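Both variants of the service-discovery step above read `services.txt` while skipping comment and blank lines. A self-contained sketch of that filter follows. The function name `read_services` and the sample file contents are illustrative, and the POSIX `[[:space:]]` class is used in place of `\s` as a portability choice rather than something the script requires.

```bash
#!/usr/bin/env bash
# Read service names from a services.txt-style file, skipping comments and blanks.
read_services() {
  local file="$1"
  # '|| true' keeps an all-comment file from aborting a caller that runs under set -e.
  grep -Ev '^[[:space:]]*(#|$)' "$file" || true
}

tmp=$(mktemp)
printf '# ingress\nnpm\n\n  # disabled\noutline\n' > "$tmp"
mapfile -t SERVICES < <(read_services "$tmp")
rm -f "$tmp"
echo "${#SERVICES[@]} services: ${SERVICES[*]}"   # prints: 2 services: npm outline
```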
@ -1,55 +0,0 @@
|
|||
#!/usr/bin/env bash
|
||||
# deploy-stability-agent.sh - Helper to deploy stability-agent (print or SSH)
|
||||
|
||||
NODE=$1
|
||||
MODE="print"
|
||||
[[ "$2" == "--ssh" ]] && MODE="ssh"
|
||||
|
||||
if [[ -z "$NODE" ]]; then
|
||||
echo "Usage: $0 <node-name> [--ssh]"
|
||||
echo "Supported nodes: chelsty, piha, solaria, vps"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
case "$NODE" in
|
||||
piha) TARGET="100.108.208.3" ;;
|
||||
chelsty) TARGET="100.122.201.22" ;;
|
||||
vps) TARGET="100.95.58.48" ;;
|
||||
solaria) TARGET="local" ;;
|
||||
*)
|
||||
echo "Error: Unknown node '$NODE'"
|
||||
echo "Supported nodes: chelsty, piha, solaria, vps"
|
||||
exit 1
|
||||
;;
|
||||
esac
|
||||
|
||||
echo "HOST: $NODE"
|
||||
echo "MODE: $MODE"
|
||||
echo "TARGET: $TARGET"
|
||||
|
||||
REPO_PATH="/home/oskar/homelab-codex-ws"
|
||||
|
||||
if [[ "$NODE" == "solaria" ]]; then
|
||||
if [[ "$MODE" == "ssh" ]]; then
|
||||
echo "--- Running local deployment for solaria ---"
|
||||
cd "$REPO_PATH" && git fetch origin && git checkout master && git pull origin master && cd services/stability-agent && ./deploy-local.sh solaria
|
||||
else
|
||||
echo "# --- Deployment commands for solaria ---"
|
||||
echo "cd $REPO_PATH"
|
||||
echo "git fetch origin"
|
||||
echo "git checkout master"
|
||||
echo "git pull origin master"
|
||||
echo "cd services/stability-agent"
|
||||
echo "./deploy-local.sh solaria"
|
||||
fi
|
||||
else
|
||||
# Remote nodes
|
||||
SSH_CMD="ssh oskar@$TARGET 'cd $REPO_PATH && git fetch origin && git checkout master && git pull origin master && cd services/stability-agent && ./deploy-local.sh $NODE'"
|
||||
if [[ "$MODE" == "ssh" ]]; then
|
||||
echo "--- Deploying to $NODE ($TARGET) via SSH ---"
|
||||
eval "$SSH_CMD"
|
||||
else
|
||||
echo "# --- Deployment commands for $NODE ---"
|
||||
echo "$SSH_CMD"
|
||||
fi
|
||||
fi
|
||||
|
|
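The helper above stores the remote command in a string and runs it through `eval`. One possible alternative, sketched here under the same hard-coded user and repo path, keeps the command in a bash array so the print and SSH modes share one definition without re-parsing. `build_ssh_cmd` is a hypothetical name, not part of the repository.

```bash
#!/usr/bin/env bash
# Build the SSH invocation as an array so it can be printed or run without eval.
build_ssh_cmd() {
  local target="$1" node="$2" repo="/home/oskar/homelab-codex-ws"
  local remote="cd ${repo} && git fetch origin && git checkout master && git pull origin master && cd services/stability-agent && ./deploy-local.sh ${node}"
  SSH_CMD=(ssh "oskar@${target}" "$remote")
}

build_ssh_cmd 100.108.208.3 piha
printf '%q ' "${SSH_CMD[@]}"; echo     # print mode: shell-quoted, copy-pasteable
# "${SSH_CMD[@]}"                      # ssh mode: run it directly, no eval needed
```

With an array, the remote command stays a single argument to `ssh` regardless of what characters it contains.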
@ -1,321 +1,270 @@
|
|||
#!/usr/bin/env bash
|
||||
# scripts/deploy/deploy.sh — Saturn-side deploy dispatcher
|
||||
# Usage: deploy.sh <target> [--dry-run] [--no-gate]
|
||||
# target ∈ {control-plane, vps, piha, solaria, chelsty-infra}
|
||||
# Exit codes: 0=ok 1=preflight 2=gate 3=execute 4=verify 5=handoff(sudo)
|
||||
# deploy.sh - Staged deployment framework for homelab nodes.
|
||||
|
||||
set -uo pipefail
|
||||
set -o pipefail
|
||||
|
||||
REPO_ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)"
|
||||
SSH_USER="${SSH_USER:-oskar}"
|
||||
START_TIME=$(date +%s)
|
||||
TARGET=""
|
||||
DRY_RUN=false
|
||||
NO_GATE=false
|
||||
# --- Configuration ---
|
||||
export RUNTIME_PATH="/opt/homelab"
|
||||
export STATE_DIR="${RUNTIME_PATH}/state/deploy"
|
||||
export LOG_DIR="${RUNTIME_PATH}/logs/deploy"
|
||||
export REPO_PATH="${HOME}/homelab-codex-ws"
|
||||
export TIMESTAMP=$(date +%Y%m%d_%H%M%S)
|
||||
export LOG_FILE="${LOG_DIR}/deploy_${TIMESTAMP}.log"
|
||||
|
||||
usage() {
|
||||
cat >&2 <<'EOF'
|
||||
Usage: deploy.sh <target> [--dry-run] [--no-gate]
|
||||
# --- Initialization ---
|
||||
mkdir -p "$STATE_DIR" "$LOG_DIR"
|
||||
|
||||
Targets:
|
||||
control-plane observer/supervisor/executor/operator-ui on VPS
|
||||
vps all VPS GitOps services
|
||||
piha PIHA services
|
||||
solaria SOLARIA compute services
|
||||
chelsty-infra CHELSTY edge node (LTE, longer SSH timeout)
|
||||
# Redirection for logging
|
||||
exec > >(tee -a "$LOG_FILE") 2>&1
|
||||
|
||||
Flags:
|
||||
--dry-run run preflight + gate only; stop before deploy
|
||||
--no-gate skip pytest + docker build (emergency only; logged as WARNING)
|
||||
# --- Load Libraries ---
|
||||
LIB_PATH="${REPO_PATH}/scripts/lib"
|
||||
source "${LIB_PATH}/log.sh"
|
||||
source "${LIB_PATH}/state.sh"
|
||||
source "${LIB_PATH}/inventory.sh"
|
||||
source "${LIB_PATH}/compose.sh"
|
||||
source "${LIB_PATH}/diagnostics.sh"
|
||||
|
||||
Exit codes: 0=ok 1=preflight 2=gate 3=execute 4=verify 5=handoff(sudo)
|
||||
EOF
|
||||
exit 1
|
||||
}
|
||||
# --- CLI Parsing ---
|
||||
TARGET_HOST=$(hostname)
|
||||
TARGET_SERVICE=""
|
||||
RESUME=false
|
||||
REQUESTED_STAGE=""
|
||||
|
||||
while [[ $# -gt 0 ]]; do
|
||||
case $1 in
|
||||
control-plane|vps|piha|solaria|chelsty-infra)
|
||||
TARGET="$1"; shift ;;
|
||||
--dry-run)
|
||||
DRY_RUN=true; shift ;;
|
||||
--no-gate)
|
||||
NO_GATE=true; shift ;;
|
||||
-h|--help)
|
||||
usage ;;
|
||||
--host)
|
||||
TARGET_HOST="$2"
|
||||
shift 2
|
||||
;;
|
||||
--service)
|
||||
TARGET_SERVICE="$2"
|
||||
shift 2
|
||||
;;
|
||||
--resume)
|
||||
RESUME=true
|
||||
shift
|
||||
;;
|
||||
--stage)
|
||||
REQUESTED_STAGE="$2"
|
||||
shift 2
|
||||
;;
|
||||
*)
|
||||
echo "Unknown argument: $1" >&2
|
||||
usage ;;
|
||||
if [[ "$1" =~ ^(prepare|validate|deploy|verify|diagnose|complete)$ ]]; then
|
||||
REQUESTED_STAGE="$1"
|
||||
fi
|
||||
shift
|
||||
;;
|
||||
esac
|
||||
done
|
||||
|
||||
[[ -z "$TARGET" ]] && { echo "Error: target is required." >&2; usage; }
|
||||
# --- Stages ---
|
||||
|
||||
case "$TARGET" in
|
||||
control-plane) SSH_HOST="vps" ;;
|
||||
*) SSH_HOST="$TARGET" ;;
|
||||
esac
|
||||
|
||||
case "$TARGET" in
|
||||
chelsty-*) SSH_TIMEOUT=30 ;;
|
||||
*) SSH_TIMEOUT=5 ;;
|
||||
esac
|
||||
|
||||
# ── PREFLIGHT ────────────────────────────────────────────────────────────────
|
||||
|
||||
preflight() {
|
||||
echo "=== PREFLIGHT ==="
|
||||
|
||||
local branch
|
||||
branch=$(git -C "$REPO_ROOT" rev-parse --abbrev-ref HEAD)
|
||||
if [[ "$branch" != "master" ]]; then
|
||||
echo "ERROR: On branch '${branch}', not master. Switch to master and push first." >&2
|
||||
exit 1
|
||||
stage_prepare() {
|
||||
local host=$1
|
||||
if is_stage_complete "prepare" && [[ "$RESUME" == "true" ]]; then
|
||||
log "INFO" "Skipping PREPARE (already complete)"
|
||||
return 0
|
||||
fi
|
||||
echo "[ok] branch: master"
|
||||
|
||||
if ! git -C "$REPO_ROOT" diff --quiet; then
|
||||
echo "ERROR: Unstaged changes in working tree. Commit or stash before deploying." >&2
|
||||
exit 1
|
||||
fi
|
||||
if ! git -C "$REPO_ROOT" diff --cached --quiet; then
|
||||
echo "ERROR: Staged but uncommitted changes. Commit before deploying." >&2
|
||||
exit 1
|
||||
fi
|
||||
echo "[ok] working tree clean"
|
||||
log "INFO" "Stage: PREPARE ($host)"
|
||||
set_stage "prepare"
|
||||
|
||||
git -C "$REPO_ROOT" fetch origin master --quiet
|
||||
local unpushed
|
||||
unpushed=$(git -C "$REPO_ROOT" log origin/master..HEAD --oneline)
|
||||
if [[ -n "$unpushed" ]]; then
|
||||
echo "ERROR: Unpushed commits on master:" >&2
|
||||
echo "$unpushed" >&2
|
||||
echo "Push first: git push origin master" >&2
|
||||
exit 1
|
||||
fi
|
||||
echo "[ok] no unpushed commits"
|
||||
emit_event "deployment_started" "info" "deploy.sh" "all" "${TIMESTAMP}" "{\"stage\": \"prepare\"}"
|
||||
|
||||
echo "Checking SSH: ${SSH_USER}@${SSH_HOST} (ConnectTimeout=${SSH_TIMEOUT}s)..."
|
||||
if ! ssh -o "ConnectTimeout=${SSH_TIMEOUT}" -o BatchMode=yes \
|
||||
"${SSH_USER}@${SSH_HOST}" true 2>/dev/null; then
|
||||
echo "ERROR: Cannot reach ${SSH_HOST} via SSH (timeout ${SSH_TIMEOUT}s)." >&2
|
||||
exit 1
|
||||
cd "$REPO_PATH" || exit 1
|
||||
log "INFO" "Pulling latest changes..."
|
||||
if ! git pull; then
|
||||
log "WARN" "Git pull failed, proceeding with local state (offline mode or network flap)"
|
||||
fi
|
||||
echo "[ok] ${SSH_HOST} reachable"
|
||||
|
||||
# Ensure runtime directories exist
|
||||
mkdir -p "${RUNTIME_PATH}/config" "${RUNTIME_PATH}/data" "${RUNTIME_PATH}/state" "${RUNTIME_PATH}/logs"
|
||||
|
||||
struct_log "prepare" "$host" "all" "success" "repo_updated"
|
||||
mark_stage_complete "prepare"
|
||||
}
|
||||
|
||||
# ── GATE ─────────────────────────────────────────────────────────────────────
|
||||
|
||||
gate() {
|
||||
if [[ "$NO_GATE" == "true" ]]; then
|
||||
echo "=== GATE: SKIPPED ==="
|
||||
echo "WARNING: --no-gate active — pytest + docker build bypassed (emergency mode)." >&2
|
||||
stage_validate() {
|
||||
local host=$1
|
||||
if is_stage_complete "validate" && [[ "$RESUME" == "true" ]]; then
|
||||
log "INFO" "Skipping VALIDATE (already complete)"
|
||||
return 0
|
||||
fi
|
||||
|
||||
echo "=== GATE ==="
|
||||
log "INFO" "Stage: VALIDATE ($host)"
|
||||
set_stage "validate"
|
||||
|
||||
local services=()
|
||||
|
||||
if [[ "$TARGET" == "control-plane" ]]; then
|
||||
services=("control-plane")
|
||||
else
|
||||
local svc_yaml="${REPO_ROOT}/hosts/${TARGET}/services.yaml"
|
||||
if [[ ! -f "$svc_yaml" ]]; then
|
||||
echo "ERROR: ${svc_yaml} not found." >&2
|
||||
exit 2
|
||||
fi
|
||||
local svc_list
|
||||
svc_list=$(python3 -c "
|
||||
import yaml
|
||||
with open('${svc_yaml}') as f:
|
||||
data = yaml.safe_load(f)
|
||||
svcs = data.get('services', {})
|
||||
if isinstance(svcs, dict):
|
||||
print('\n'.join(svcs.keys()))
|
||||
elif isinstance(svcs, list):
|
||||
print('\n'.join(svcs))
|
||||
")
|
||||
while IFS= read -r svc; do
|
||||
[[ -z "$svc" ]] && continue
|
||||
if [[ -f "${REPO_ROOT}/services/${svc}/Dockerfile" ]]; then
|
||||
services+=("$svc")
|
||||
fi
|
||||
done <<< "$svc_list"
|
||||
fi
|
||||
|
||||
if [[ ${#services[@]} -eq 0 ]]; then
|
||||
echo "[info] No services with local Dockerfile found for ${TARGET} — gate trivially passes."
|
||||
return 0
|
||||
fi
|
||||
|
||||
echo "Services under gate: ${services[*]}"
|
||||
local gate_failed=false
|
||||
|
||||
for svc in "${services[@]}"; do
|
||||
local svc_dir="${REPO_ROOT}/services/${svc}"
|
||||
|
||||
if [[ -d "${svc_dir}/tests" ]]; then
|
||||
echo "--- pytest: ${svc} ---"
|
||||
if ! python3 -m pytest "${svc_dir}/tests" -q; then
|
||||
echo "GATE FAIL: pytest failed for ${svc}" >&2
|
||||
gate_failed=true
|
||||
fi
|
||||
fi
|
||||
|
||||
echo "--- docker build: ${svc} ---"
|
||||
if ! docker build --quiet "${svc_dir}" >/dev/null; then
|
||||
echo "GATE FAIL: docker build failed for ${svc}" >&2
|
||||
gate_failed=true
|
||||
for service in "${SERVICES[@]}"; do
|
||||
log "INFO" "Validating $service..."
|
||||
if [[ ! -d "${REPO_PATH}/services/$service" ]]; then
|
||||
log "ERROR" "Service definition not found: $service"
|
||||
struct_log "validate" "$host" "$service" "fail" "not_found"
|
||||
return 1
|
||||
fi
|
||||
done
|
||||
|
||||
if [[ "$gate_failed" == "true" ]]; then
|
||||
exit 2
|
||||
fi
|
||||
echo "[ok] gate passed"
|
||||
struct_log "validate" "$host" "all" "success" "validated"
|
||||
mark_stage_complete "validate"
|
||||
}
|
||||
|
||||
# ── EXECUTE ──────────────────────────────────────────────────────────────────
|
||||
|
||||
execute() {
|
||||
echo "=== EXECUTE ==="
|
||||
|
||||
local cmd_output
|
||||
local cmd_exit=0
|
||||
|
||||
if [[ "$TARGET" == "control-plane" ]]; then
|
||||
echo "Running deploy-control-plane.sh --ssh..."
|
||||
cmd_output=$("${REPO_ROOT}/scripts/deploy/deploy-control-plane.sh" --ssh 2>&1) \
|
||||
|| cmd_exit=$?
|
||||
else
|
||||
echo "SSHing to ${SSH_HOST}: git pull + deploy-node.sh..."
|
||||
cmd_output=$(ssh -o "ConnectTimeout=${SSH_TIMEOUT}" -o BatchMode=yes \
|
||||
"${SSH_USER}@${SSH_HOST}" \
|
||||
'cd ~/homelab-codex-ws && git pull && ./scripts/deploy/deploy-node.sh' 2>&1) \
|
||||
|| cmd_exit=$?
|
||||
stage_deploy() {
|
||||
local host=$1
|
||||
if is_stage_complete "deploy" && [[ "$RESUME" == "true" ]]; then
|
||||
log "INFO" "Skipping DEPLOY (already complete)"
|
||||
return 0
|
||||
fi
|
||||
|
||||
echo "$cmd_output"
|
||||
log "INFO" "Stage: DEPLOY ($host)"
|
||||
set_stage "deploy"
|
||||
|
||||
if echo "$cmd_output" | grep -qF "[sudo] password"; then
|
||||
echo "" >&2
|
||||
echo "ERROR (exit 5): Deploy hit an interactive sudo prompt." >&2
|
||||
echo "Run manually:" >&2
|
||||
if [[ "$TARGET" == "control-plane" ]]; then
|
||||
echo " ssh -t ${SSH_USER}@${SSH_HOST} 'cd ~/homelab-codex-ws && git pull origin master && cd services/control-plane && bash deploy-local.sh'" >&2
|
||||
else
|
||||
echo " ssh -t ${SSH_USER}@${SSH_HOST} 'cd ~/homelab-codex-ws && git pull && ./scripts/deploy/deploy-node.sh'" >&2
|
||||
fi
|
||||
exit 5
|
||||
local last_s=$(get_last_service)
|
||||
local skip=false
|
||||
if [[ "$RESUME" == "true" && -n "$last_s" ]]; then
|
||||
skip=true
|
||||
fi
|
||||
|
||||
if [[ $cmd_exit -ne 0 ]]; then
|
||||
echo "ERROR: Deploy command exited ${cmd_exit}." >&2
|
||||
exit 3
|
||||
fi
|
||||
|
||||
echo "[ok] execute completed"
|
||||
}
|
||||
|
||||
# ── VERIFY ───────────────────────────────────────────────────────────────────
|
||||
|
||||
verify() {
|
||||
echo "=== VERIFY ==="
|
||||
|
||||
local ps_output
|
||||
local ps_exit=0
|
||||
ps_output=$(ssh -o "ConnectTimeout=${SSH_TIMEOUT}" -o BatchMode=yes \
|
||||
"${SSH_USER}@${SSH_HOST}" \
|
||||
'docker ps --format "{{.Names}}\t{{.Status}}"' 2>&1) \
|
||||
|| ps_exit=$?
|
||||
|
||||
if [[ $ps_exit -ne 0 ]]; then
|
||||
echo "ERROR: docker ps failed on ${SSH_HOST}:" >&2
|
||||
echo "$ps_output" >&2
|
||||
exit 4
|
||||
fi
|
||||
|
||||
echo "$ps_output"
|
||||
|
||||
local failed=false
|
||||
|
||||
local not_up
|
||||
not_up=$(echo "$ps_output" | grep -v '^$' | grep -v $'\tUp' || true)
|
||||
if [[ -n "$not_up" ]]; then
|
||||
echo "ERROR: Containers not in Up state:" >&2
|
||||
echo "$not_up" >&2
|
||||
failed=true
|
||||
fi
|
||||
|
||||
local unhealthy
|
||||
unhealthy=$(echo "$ps_output" | grep '(unhealthy)' || true)
|
||||
if [[ -n "$unhealthy" ]]; then
|
||||
echo "ERROR: Unhealthy containers:" >&2
|
||||
echo "$unhealthy" >&2
|
||||
failed=true
|
||||
fi
|
||||
|
||||
if [[ "$TARGET" == "control-plane" ]]; then
|
||||
for cp_svc in supervisor observer executor operator-ui; do
|
||||
if ! echo "$ps_output" | grep -q "$cp_svc"; then
|
||||
echo "ERROR: control-plane component absent from docker ps: ${cp_svc}" >&2
|
||||
failed=true
|
||||
for service in "${SERVICES[@]}"; do
|
||||
if [[ "$skip" == "true" ]]; then
|
||||
if [[ "$service" == "$last_s" ]]; then
|
||||
skip=false
|
||||
log "INFO" "Resuming from $service..."
|
||||
else
|
||||
log "INFO" "Skipping $service (already processed)"
|
||||
continue
|
||||
fi
|
||||
done
|
||||
fi
|
||||
fi
|
||||
|
||||
if [[ "$failed" == "true" ]]; then
|
||||
echo "" >&2
|
||||
echo "Full docker ps output above." >&2
|
||||
exit 4
|
||||
fi
|
||||
log "INFO" "Deploying $service..."
|
||||
set_last_service "$service"
|
||||
|
||||
echo "[ok] all containers healthy"
|
||||
if ! run_compose_up "$service"; then
|
||||
struct_log "deploy" "$host" "$service" "fail" "docker_compose_failed"
|
||||
collect_diagnostics "$host" "$service"
|
||||
return 1
|
||||
fi
|
||||
|
||||
struct_log "deploy" "$host" "$service" "success" "deployed"
|
||||
done
|
||||
|
||||
set_last_service ""
|
||||
mark_stage_complete "deploy"
|
||||
}
|
||||
|
||||
# ── REPORT ───────────────────────────────────────────────────────────────────
|
||||
|
||||
report() {
|
||||
local mode="${1:-deploy}"
|
||||
local end_time
|
||||
end_time=$(date +%s)
|
||||
local elapsed
|
||||
elapsed=$(( end_time - START_TIME ))
|
||||
local commit_hash
|
||||
commit_hash=$(git -C "$REPO_ROOT" rev-parse --short HEAD)
|
||||
local gate_s verify_s
|
||||
|
||||
if [[ "$NO_GATE" == "true" ]]; then
|
||||
gate_s="skip"
|
||||
else
|
||||
gate_s="ok"
|
||||
stage_verify() {
|
||||
local host=$1
|
||||
if is_stage_complete "verify" && [[ "$RESUME" == "true" ]]; then
|
||||
log "INFO" "Skipping VERIFY (already complete)"
|
||||
return 0
|
||||
fi
|
||||
|
||||
if [[ "$mode" == "dry-run" ]]; then
|
||||
verify_s="skip(dry-run)"
|
||||
else
|
||||
verify_s="green"
|
||||
fi
|
||||
log "INFO" "Stage: VERIFY ($host)"
|
||||
set_stage "verify"
|
||||
|
||||
echo ""
|
||||
if [[ "$mode" == "dry-run" ]]; then
|
||||
echo "DRY RUN OK | target=${TARGET} | commit=${commit_hash} | gate=${gate_s} | verify=${verify_s} | ${elapsed}s"
|
||||
else
|
||||
echo "DEPLOY OK | target=${TARGET} | commit=${commit_hash} | gate=${gate_s} | verify=${verify_s} | ${elapsed}s"
|
||||
fi
|
||||
for service in "${SERVICES[@]}"; do
|
||||
log "INFO" "Verifying $service..."
|
||||
local health_script="${REPO_PATH}/services/${service}/healthcheck.sh"
|
||||
if [[ -f "$health_script" ]]; then
|
||||
if ! bash "$health_script"; then
|
||||
log "ERROR" "Healthcheck failed for $service"
|
||||
struct_log "verify" "$host" "$service" "fail" "healthcheck_failed"
|
||||
collect_diagnostics "$host" "$service"
|
||||
return 1
|
||||
fi
|
||||
else
|
||||
# Generic check if container is running
|
||||
if ! docker ps --filter "name=$service" --filter "status=running" | grep -q "$service"; then
|
||||
log "ERROR" "Container $service is not running"
|
||||
struct_log "verify" "$host" "$service" "fail" "container_not_running"
|
||||
collect_diagnostics "$host" "$service"
|
||||
return 1
|
||||
fi
|
||||
fi
|
||||
struct_log "verify" "$host" "$service" "success" "verified"
|
||||
done
|
||||
mark_stage_complete "verify"
|
||||
}
|
||||
|
||||
# ── MAIN ─────────────────────────────────────────────────────────────────────
|
||||
stage_complete() {
|
||||
local host=$1
|
||||
log "INFO" "Stage: COMPLETE ($host)"
|
||||
set_stage "complete"
|
||||
struct_log "complete" "$host" "all" "success" "deployment_finished"
|
||||
clear_deployment_state
|
||||
}
|
||||
|
||||
preflight
|
||||
gate
|
||||
# --- Execution Logic ---
|
||||
|
||||
if [[ "$DRY_RUN" == "true" ]]; then
|
||||
report dry-run
|
||||
exit 0
|
||||
run_deployment() {
|
||||
local start_stage=$1
|
||||
|
||||
# Sequential execution from start_stage
|
||||
case "$start_stage" in
|
||||
prepare)
|
||||
stage_prepare "$TARGET_HOST" || return 1
|
||||
;&
|
||||
validate)
|
||||
stage_validate "$TARGET_HOST" || return 1
|
||||
;&
|
||||
deploy)
|
||||
stage_deploy "$TARGET_HOST" || return 1
|
||||
;&
|
||||
verify)
|
||||
stage_verify "$TARGET_HOST" || return 1
|
||||
;&
|
||||
complete)
|
||||
stage_complete "$TARGET_HOST" || return 1
|
||||
;;
|
||||
*)
|
||||
log "ERROR" "Invalid stage: $start_stage"
|
||||
return 1
|
||||
;;
|
||||
esac
|
||||
}
|
||||
|
||||
# --- Main ---
|
||||
|
||||
log "INFO" "--- Homelab Deployment Started (Host: $TARGET_HOST, Service: ${TARGET_SERVICE:-all}) ---"
|
||||
|
||||
if ! load_inventory "$TARGET_HOST" "$TARGET_SERVICE"; then
|
||||
log "ERROR" "Failed to load inventory"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
execute
|
||||
verify
|
||||
report
|
||||
EXIT_STATUS=0
|
||||
if [[ "$RESUME" == "true" ]]; then
|
||||
CURRENT=$(get_stage)
|
||||
log "INFO" "Resuming from state: $CURRENT"
|
||||
case "$CURRENT" in
|
||||
prepare|validate|deploy|verify)
|
||||
run_deployment "$CURRENT" || EXIT_STATUS=1
|
||||
;;
|
||||
complete|none)
|
||||
log "INFO" "No interrupted deployment found. Starting from scratch..."
|
||||
run_deployment "prepare" || EXIT_STATUS=1
|
||||
;;
|
||||
*)
|
||||
log "INFO" "Unknown state. Starting from prepare..."
|
||||
run_deployment "prepare" || EXIT_STATUS=1
|
||||
;;
|
||||
esac
|
||||
elif [[ -n "$REQUESTED_STAGE" ]]; then
|
||||
if [[ "$REQUESTED_STAGE" == "diagnose" ]]; then
|
||||
collect_diagnostics "$TARGET_HOST" "$TARGET_SERVICE"
|
||||
else
|
||||
run_deployment "$REQUESTED_STAGE" || EXIT_STATUS=1
|
||||
fi
|
||||
else
|
||||
# New deployment - clear previous state
|
||||
clear_deployment_state
|
||||
run_deployment "prepare" || EXIT_STATUS=1
|
||||
fi
|
||||
|
||||
if [[ $EXIT_STATUS -eq 0 ]]; then
|
||||
print_summary "$TARGET_HOST" "SUCCESS"
|
||||
log "INFO" "--- Homelab Deployment Finished Successfully ---"
|
||||
else
|
||||
print_summary "$TARGET_HOST" "FAILED"
|
||||
log "ERROR" "--- Homelab Deployment Failed ---"
|
||||
exit 1
|
||||
fi
|
||||
|
|
|
|||
|
|
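The verify step in the dispatcher variant above inspects `docker ps --format "{{.Names}}\t{{.Status}}"` output for containers that are not Up or that report `(unhealthy)`. The sketch below isolates that classification so it can be tried on canned input. The function names and the sample lines are illustrative, and `awk` stands in for the script's chained `grep -v` calls.

```bash
#!/usr/bin/env bash
# Classify "name<TAB>status" lines read from stdin.
not_up()    { awk -F $'\t' 'NF && $2 !~ /^Up/'; }   # status does not start with "Up"
unhealthy() { grep -F '(unhealthy)' || true; }      # '|| true': no match is not an error

sample=$'supervisor\tUp 2 hours (healthy)\nobserver\tExited (1) 3 minutes ago\nexecutor\tUp 5 minutes (unhealthy)'

echo "not up:    $(not_up <<< "$sample")"
echo "unhealthy: $(unhealthy <<< "$sample")"
```

Keeping the two checks separate matters because an unhealthy container still reports a status that begins with `Up`.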
@ -1,30 +1,15 @@
|
|||
#!/usr/bin/env bash
|
||||
# orchestrate-deploy.sh - To be run on SATURN
|
||||
# Triggers deployment on remote execution nodes via inventory.
|
||||
# Triggers deployment on remote execution nodes.
|
||||
|
||||
set -e
|
||||
|
||||
REPO_PATH="${HOME}/homelab-codex-ws"
|
||||
USER="oskar"
|
||||
HOSTS=("solaria" "piha" "vps")
|
||||
USER="oskar" # Default user
|
||||
|
||||
while IFS=' ' read -r HOST TAG; do
|
||||
for HOST in "${HOSTS[@]}"; do
|
||||
echo ">>> Triggering deployment on ${HOST}..."
|
||||
if [[ "$TAG" == "lte" ]]; then
|
||||
ssh -o ConnectTimeout=30 "${USER}@${HOST}" "bash ~/homelab-codex-ws/scripts/deploy/deploy-node.sh" || \
|
||||
echo "WARNING: Deployment on ${HOST} failed or timed out (LTE/intermittent node, skipping)"
|
||||
else
|
||||
ssh "${USER}@${HOST}" "bash ~/homelab-codex-ws/scripts/deploy/deploy-node.sh"
|
||||
fi
|
||||
done < <(python3 -c "
|
||||
import yaml, sys
|
||||
with open('${REPO_PATH}/inventory/topology.yaml') as f:
|
||||
data = yaml.safe_load(f)
|
||||
skip = {'saturn', 'solaria'}
|
||||
for name, info in (data.get('nodes') or {}).items():
|
||||
if name in skip:
|
||||
continue
|
||||
uplink = ((info or {}).get('connectivity') or {}).get('uplink', '')
|
||||
print(name, 'lte' if uplink == 'lte' else 'standard')
|
||||
")
|
||||
ssh "${USER}@${HOST}" "bash ~/homelab-codex-ws/scripts/deploy/deploy-node.sh"
|
||||
done
|
||||
|
||||
echo ">>> All deployments triggered."
|
||||
|
|
|
|||
|
|
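The inventory-driven variant above tags each node as `lte` or `standard` and gives LTE nodes a 30 s connect timeout. A small sketch of that tag-to-timeout choice follows. `connect_timeout_for` is a hypothetical helper, and the 5 s value for standard nodes is an assumption borrowed from deploy.sh, since this script sets no explicit timeout for them.

```bash
#!/usr/bin/env bash
# Pick a ConnectTimeout from the uplink tag emitted by the topology reader.
connect_timeout_for() {
  case "$1" in
    lte) echo 30 ;;   # value used for LTE nodes in the script above
    *)   echo 5 ;;    # assumption: borrowed from deploy.sh's default
  esac
}

# Sample "host tag" pairs in the format the topology reader prints.
while IFS=' ' read -r host tag; do
  echo "${host}: ssh -o ConnectTimeout=$(connect_timeout_for "$tag")"
done <<'EOF'
vps standard
chelsty-infra lte
EOF
```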
@ -1,68 +0,0 @@
|
|||
#!/usr/bin/env bash
|
||||
# verify-agent-fleet.sh - Check the status of stability agents across the fleet
|
||||
|
||||
REDIS_CMD="docker exec agent-system-redis redis-cli --raw"
|
||||
|
||||
# Check if docker is available
|
||||
if ! command -v docker &> /dev/null; then
|
||||
echo "Error: docker command not found."
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# Check if container is running
|
||||
if ! docker ps --filter "name=agent-system-redis" --format "{{.Names}}" | grep -q "agent-system-redis"; then
|
||||
echo "Error: agent-system-redis container not found or not running."
|
||||
echo "This script must be run on PIHA (the node hosting the Redis container)."
|
||||
exit 1
|
||||
fi
|
||||
|
||||
REQUIRED_NODES=("piha" "chelsty" "solaria" "vps")
|
||||
MISSING_NODES=0
|
||||
|
||||
echo "--- Homelab Agent Fleet Status ---"
|
||||
printf "%-10s %-15s %-10s %-10s %-30s\n" "NODE" "HOSTNAME" "HEALTH" "STATUS" "LAST_SEEN"
|
||||
printf "%s\n" "--------------------------------------------------------------------------------"
|
||||
|
||||
for NODE in "${REQUIRED_NODES[@]}"; do
|
||||
KEY="homelab:nodes:$NODE"
|
||||
|
||||
# Check if key exists
|
||||
EXISTS=$($REDIS_CMD EXISTS "$KEY" 2>/dev/null | tr -d '\r\n')
|
||||
|
||||
if [[ "$EXISTS" != "1" ]]; then
|
||||
printf "%-10s %-15s %-10s %-10s %-30s\n" "$NODE" "MISSING" "N/A" "N/A" "N/A"
|
||||
MISSING_NODES=$((MISSING_NODES + 1))
|
||||
continue
|
||||
fi
|
||||
|
||||
HOSTNAME=$($REDIS_CMD HGET "$KEY" hostname 2>/dev/null | tr -d '\r\n')
|
||||
HEALTH=$($REDIS_CMD HGET "$KEY" health 2>/dev/null | tr -d '\r\n')
|
||||
STATUS=$($REDIS_CMD HGET "$KEY" status 2>/dev/null | tr -d '\r\n')
|
||||
LAST_SEEN=$($REDIS_CMD HGET "$KEY" last_seen 2>/dev/null | tr -d '\r\n')
|
||||
|
||||
printf "%-10s %-15s %-10s %-10s %-30s\n" "$NODE" "$HOSTNAME" "$HEALTH" "$STATUS" "$LAST_SEEN"
|
||||
done
|
||||
|
||||
echo ""
|
||||
echo "--- Control Plane Summary ---"
|
||||
if command -v jq >/dev/null; then
|
||||
curl -s http://127.0.0.1:18180/summary | jq .
|
||||
else
|
||||
curl -s http://127.0.0.1:18180/summary
|
||||
fi
|
||||
|
||||
echo ""
|
||||
echo "--- Control Plane Nodes ---"
|
||||
if command -v jq >/dev/null; then
|
||||
curl -s http://127.0.0.1:18180/nodes | jq .
|
||||
else
|
||||
curl -s http://127.0.0.1:18180/nodes
|
||||
fi
|
||||
|
||||
if [[ $MISSING_NODES -gt 0 ]]; then
|
||||
echo ""
|
||||
echo "Error: $MISSING_NODES required nodes are missing from Redis."
|
||||
exit 1
|
||||
fi
|
||||
|
||||
exit 0
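The fleet check above counts a node as missing when `EXISTS homelab:nodes:<node>` does not return `1`. A minimal sketch of that counting logic, assuming a hypothetical stub `fake_redis` in place of the real `docker exec agent-system-redis redis-cli --raw` command so it can run anywhere:

```bash
#!/usr/bin/env bash
# Hypothetical stub standing in for `redis-cli --raw`: only piha and vps exist.
fake_redis() {
    # $1 = Redis command (EXISTS), $2 = key
    case "$2" in
        homelab:nodes:piha|homelab:nodes:vps) echo 1 ;;
        *) echo 0 ;;
    esac
}

# Count nodes whose key is absent. $1 is the command used to reach Redis;
# the remaining arguments are node names.
count_missing() {
    local redis_cmd="$1"; shift
    local node exists missing=0
    for node in "$@"; do
        exists=$("$redis_cmd" EXISTS "homelab:nodes:$node" 2>/dev/null | tr -d '\r\n')
        [[ "$exists" == "1" ]] || missing=$((missing + 1))
    done
    echo "$missing"
}

count_missing fake_redis piha chelsty solaria vps   # prints 2
```

Passing the Redis command as a parameter is a design choice for this sketch only; the script itself uses the fixed `REDIS_CMD` string.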
|
||||
|
|
@ -1,361 +0,0 @@
|
|||
#!/usr/bin/env bash
|
||||
# Multi-agent worktree manager.
|
||||
# EXIT: 0 ok, 1 preflight, 2 operation failed.
|
||||
set -euo pipefail
|
||||
|
||||
trap 'echo "agent.sh: failed at line $LINENO (exit $?)" >&2' ERR
|
||||
|
||||
RESERVED_NAMES=(master main HEAD list merge clean new)
|
||||
MAX_WORKTREES=4
|
||||
|
||||
die() { echo "ERROR: ${1:-}" >&2; exit "${2:-2}"; }  # $1 = message, $2 = optional exit code (default 2); "$*" would also print the code
|
||||
prefail(){ echo "PREFLIGHT: $*" >&2; exit 1; }
|
||||
|
||||
# ── helpers ──────────────────────────────────────────────────────────────────
|
||||
|
||||
is_main_checkout() {
|
||||
local git_dir common_dir
|
||||
git_dir=$(git rev-parse --git-dir 2>/dev/null) || return 1
|
||||
common_dir=$(git rev-parse --git-common-dir 2>/dev/null) || return 1
|
||||
[ "$git_dir" = "$common_dir" ]
|
||||
}
|
||||
|
||||
require_main_checkout() {
|
||||
is_main_checkout || prefail "must run from the main checkout, not a worktree"
|
||||
}
|
||||
|
||||
require_master_branch() {
|
||||
local branch
|
||||
branch=$(git rev-parse --abbrev-ref HEAD)
|
||||
[ "$branch" = "master" ] || prefail "must be on master (currently on '$branch')"
|
||||
}
|
||||
|
||||
require_clean_tree() {
|
||||
local dirty
|
||||
dirty=$(git status --porcelain)
|
||||
[ -z "$dirty" ] || prefail "working tree is not clean — stash or commit first"
|
||||
}
|
||||
|
||||
worktree_paths() {
|
||||
# list worktree paths (excluding main); || true prevents grep exit-1 when empty
|
||||
local main_path
|
||||
main_path=$(git rev-parse --show-toplevel)
|
||||
git worktree list --porcelain \
|
||||
| awk '/^worktree /{p=$2} /^$/{print p}' \
|
||||
| grep -v "^${main_path}$" \
|
||||
|| true
|
||||
}
|
||||
|
||||
worktree_count() {
|
||||
worktree_paths | wc -l
|
||||
}
|
||||
|
||||
branch_exists_local() { git show-ref --verify --quiet "refs/heads/$1"; }
|
||||
branch_exists_remote() { git ls-remote --exit-code origin "$1" >/dev/null 2>&1; }
|
||||
|
||||
utc_now() { date -u +"%Y-%m-%dT%H:%M:%SZ"; }
|
||||
|
||||
age_str() {
|
||||
local created_utc="$1"
|
||||
local now_ts created_ts diff_s
|
||||
now_ts=$(date -u +%s)
|
||||
# replace T with a space for `date -d`; the trailing Z is kept and read as UTC
|
||||
created_ts=$(date -u -d "${created_utc//T/ }" +%s 2>/dev/null) || { echo "?"; return; }
|
||||
diff_s=$(( now_ts - created_ts ))
|
||||
if (( diff_s < 60 )); then echo "${diff_s}s"
|
||||
elif (( diff_s < 3600 )); then echo "$(( diff_s/60 ))m"
|
||||
elif (( diff_s < 86400 )); then echo "$(( diff_s/3600 ))h"
|
||||
else echo "$(( diff_s/86400 ))d"
|
||||
fi
|
||||
}
|
||||
|
||||
validate_name() {
|
||||
local name="$1"
|
||||
if ! [[ "$name" =~ ^[a-z][a-z0-9-]*$ ]]; then
|
||||
prefail "name '$name' must match ^[a-z][a-z0-9-]*$"
|
||||
fi
|
||||
for r in "${RESERVED_NAMES[@]}"; do
|
||||
if [ "$name" = "$r" ]; then
|
||||
prefail "'$name' is a reserved word"
|
||||
fi
|
||||
done
|
||||
}
|
||||
|
||||
# ── subcommands ───────────────────────────────────────────────────────────────
|
||||
|
||||
cmd_new() {
|
||||
local name="${1:-}"
|
||||
[ -n "$name" ] || { usage; exit 1; }
|
||||
|
||||
validate_name "$name"
|
||||
require_main_checkout
|
||||
require_master_branch
|
||||
require_clean_tree
|
||||
|
||||
# worktree limit
|
||||
local count
|
||||
count=$(worktree_count)
|
||||
if (( count >= MAX_WORKTREES )); then
|
||||
echo "ERROR: already at maximum of $MAX_WORKTREES active worktrees:" >&2
|
||||
cmd_list
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# branch collision
|
||||
if branch_exists_local "task/$name"; then
|
||||
prefail "branch task/$name already exists locally"
|
||||
fi
|
||||
git fetch origin master --quiet
|
||||
if branch_exists_remote "refs/heads/task/$name"; then
|
||||
prefail "branch task/$name already exists on origin"
|
||||
fi
|
||||
|
||||
# directory collision
|
||||
local main_path wt_path
|
||||
main_path=$(git rev-parse --show-toplevel)
|
||||
wt_path="$(dirname "$main_path")/homelab-codex-ws-${name}"
|
||||
[ ! -e "$wt_path" ] || prefail "directory $wt_path already exists"
|
||||
|
||||
# create worktree
|
||||
git worktree add -b "task/$name" "$wt_path" origin/master \
|
||||
|| die "git worktree add failed"
|
||||
|
||||
# write marker
|
||||
local parent_commit
|
||||
parent_commit=$(git rev-parse origin/master)
|
||||
cat > "$wt_path/.agent-task" <<EOF
|
||||
task: $name
|
||||
branch: task/$name
|
||||
parent_commit: $parent_commit
|
||||
created_utc: $(utc_now)
|
||||
worktree_path: $wt_path
|
||||
EOF
|
||||
|
||||
echo ""
|
||||
echo "Worktree created: $wt_path"
|
||||
echo "Branch: task/$name"
|
||||
echo ""
|
||||
echo "── Start Claude Code in this worktree ──────────────────────────────────────"
|
||||
echo "cd ~/homelab-codex-ws-${name} && claude --dangerously-skip-permissions \"You are in the worktree for task '${name}' (branch task/${name}). FIRST read .agent-task and .claude/skills/worktree-aware/SKILL.md, and only then start working. Commit only to your own branch; do not push to origin master.\""
|
||||
echo "─────────────────────────────────────────────────────────────────────────────"
|
||||
}
|
||||
|
||||
cmd_list() {
|
||||
local main_path
|
||||
main_path=$(git rev-parse --show-toplevel)
|
||||
|
||||
# fetch to get up-to-date ahead/behind
|
||||
git fetch origin master --quiet 2>/dev/null || true
|
||||
|
||||
local paths
|
||||
paths=$(worktree_paths)
|
||||
|
||||
if [ -z "$paths" ]; then
|
||||
echo "(no active task worktrees)"
|
||||
return
|
||||
fi
|
||||
|
||||
printf "%-20s %-25s %-10s %-8s %-8s %-7s %s\n" \
|
||||
"NAME" "BRANCH" "CREATED" "AGE" "STATUS" "A/B" "PARENT"
|
||||
|
||||
while IFS= read -r wt_path; do
|
||||
[ -z "$wt_path" ] && continue
|
||||
|
||||
local marker="$wt_path/.agent-task"
|
||||
local task_name branch parent_commit created_utc
|
||||
if [ -f "$marker" ]; then
|
||||
task_name=$( grep '^task:' "$marker" | awk '{print $2}')
|
||||
branch=$( grep '^branch:' "$marker" | awk '{print $2}')
|
||||
parent_commit=$(grep '^parent_commit:' "$marker" | awk '{print $2}')
|
||||
created_utc=$(grep '^created_utc:' "$marker" | awk '{print $2}')
|
||||
else
|
||||
task_name="(no marker)"
|
||||
branch=$(git -C "$wt_path" rev-parse --abbrev-ref HEAD 2>/dev/null || echo "?")
|
||||
parent_commit="?"
|
||||
created_utc=""
|
||||
fi
|
||||
|
||||
local status="clean"
|
||||
local dirty
|
||||
dirty=$(git -C "$wt_path" status --porcelain 2>/dev/null || echo "?")
|
||||
[ -n "$dirty" ] && status="dirty"
|
||||
|
||||
local ahead behind ab
|
||||
ahead=$(git -C "$wt_path" rev-list --count "origin/master..${branch}" 2>/dev/null || echo "?")
|
||||
behind=$(git -C "$wt_path" rev-list --count "${branch}..origin/master" 2>/dev/null || echo "?")
|
||||
ab="+${ahead}/-${behind}"
|
||||
|
||||
local age=""
|
||||
[ -n "$created_utc" ] && age=$(age_str "$created_utc")
|
||||
|
||||
local short_parent="${parent_commit:0:7}"
|
||||
local short_created="${created_utc:0:10}"
|
||||
|
||||
printf "%-20s %-25s %-10s %-8s %-8s %-7s %s\n" \
|
||||
"$task_name" "$branch" "$short_created" "$age" "$status" "$ab" "$short_parent"
|
||||
done <<< "$paths"
|
||||
}
|
||||
|
||||
cmd_merge() {
|
||||
local name="${1:-}"
|
||||
[ -n "$name" ] || { usage; exit 1; }
|
||||
|
||||
require_main_checkout
|
||||
require_master_branch
|
||||
require_clean_tree
|
||||
|
||||
git fetch origin --quiet
|
||||
|
||||
branch_exists_local "task/$name" || die "branch task/$name not found locally" 1
|
||||
|
||||
local main_path wt_path
|
||||
main_path=$(git rev-parse --show-toplevel)
|
||||
wt_path="$(dirname "$main_path")/homelab-codex-ws-${name}"
|
||||
|
||||
# attempt ff-only merge
|
||||
local merge_failed=0
|
||||
git merge --ff-only "task/$name" || merge_failed=1
|
||||
|
||||
if (( merge_failed )); then
|
||||
# abort any partial merge state
|
||||
git merge --abort 2>/dev/null || true
|
||||
echo ""
|
||||
echo "ERROR: task/$name cannot be fast-forwarded into master." >&2
|
||||
echo " The branch has likely diverged from master." >&2
|
||||
echo "" >&2
|
||||
echo "Diagnose with:" >&2
|
||||
echo " git log master..task/$name # commits only on task branch" >&2
|
||||
echo " git log task/$name..master # commits master has that task doesn't" >&2
|
||||
echo "" >&2
|
||||
echo "Then decide: rebase task/$name onto master, or merge manually." >&2
|
||||
echo "Worktree and branch are preserved — no changes made." >&2
|
||||
exit 2
|
||||
fi
|
||||
|
||||
echo "Merged task/$name into master (fast-forward)."
|
||||
|
||||
git push origin master || die "git push origin master failed"
|
||||
echo "Pushed master to origin."
|
||||
|
||||
if [ -d "$wt_path" ]; then
|
||||
git worktree remove "$wt_path" || die "git worktree remove $wt_path failed"
|
||||
echo "Removed worktree: $wt_path"
|
||||
else
|
||||
echo "(worktree directory $wt_path not found — skipping worktree remove)"
|
||||
fi
|
||||
|
||||
git branch -d "task/$name" || die "git branch -d task/$name failed"
|
||||
echo "Deleted local branch task/$name."
|
||||
|
||||
git push origin --delete "task/$name" 2>/dev/null \
|
||||
&& echo "Deleted remote branch task/$name." \
|
||||
|| echo "(remote branch task/$name not found — nothing to delete)"
|
||||
|
||||
echo ""
|
||||
echo "Done. task/$name merged and cleaned up."
|
||||
}
|
||||
|
||||
cmd_clean() {
|
||||
local main_path
|
||||
main_path=$(git rev-parse --show-toplevel)
|
||||
git fetch origin --quiet 2>/dev/null || true
|
||||
|
||||
local to_remove=()
|
||||
|
||||
# orphaned registered worktrees: branch deleted or fully merged into master
|
||||
local paths
|
||||
paths=$(worktree_paths)
|
||||
while IFS= read -r wt_path; do
|
||||
[ -z "$wt_path" ] && continue
|
||||
local branch
|
||||
branch=$(git -C "$wt_path" rev-parse --abbrev-ref HEAD 2>/dev/null || echo "")
|
||||
[ -z "$branch" ] && { to_remove+=("worktree:$wt_path (unreadable branch)"); continue; }
|
||||
|
||||
# branch gone locally?
|
||||
if ! branch_exists_local "$branch"; then
|
||||
to_remove+=("worktree:$wt_path (branch $branch no longer exists)")
|
||||
continue
|
||||
fi
|
||||
|
||||
# branch fully merged into master?
|
||||
local ahead
|
||||
ahead=$(git rev-list --count "origin/master..${branch}" 2>/dev/null || echo "1")
|
||||
if [ "$ahead" = "0" ]; then
|
||||
to_remove+=("worktree:$wt_path (branch $branch fully merged into origin/master)")
|
||||
fi
|
||||
done <<< "$paths"
|
||||
|
||||
# dangling directories: ../homelab-codex-ws-* not registered
|
||||
local registered_paths
|
||||
registered_paths=$(git worktree list --porcelain | awk '/^worktree /{print $2}')
|
||||
local parent_dir
|
||||
parent_dir=$(dirname "$main_path")
|
||||
while IFS= read -r candidate; do
|
||||
[ -d "$candidate" ] || continue
|
||||
if ! echo "$registered_paths" | grep -qxF -- "$candidate"; then  # -x: whole-line match, so ws-foo is not hidden by ws-foobar
|
||||
to_remove+=("dangling:$candidate")
|
||||
fi
|
||||
done < <(find "$parent_dir" -maxdepth 1 -name "homelab-codex-ws-*" -type d 2>/dev/null)
|
||||
|
||||
if [ ${#to_remove[@]} -eq 0 ]; then
|
||||
echo "Nothing to clean."
|
||||
return 0
|
||||
fi
|
||||
|
||||
echo "Found ${#to_remove[@]} item(s) to clean:"
|
||||
for entry in "${to_remove[@]}"; do
|
||||
echo " $entry"
|
||||
done
|
||||
echo ""
|
||||
|
||||
local overall_rc=0
|
||||
for entry in "${to_remove[@]}"; do
|
||||
local kind="${entry%%:*}"
|
||||
local path="${entry#*:}"
|
||||
# strip trailing annotation in parens
|
||||
local raw_path
|
||||
raw_path="${path%% (*}"
|
||||
|
||||
local confirm
|
||||
read -r -p "Remove $kind '$raw_path'? [y/N] " confirm
|
||||
if [[ "$confirm" =~ ^[Yy]$ ]]; then
|
||||
if [ "$kind" = "worktree" ]; then
|
||||
git worktree remove --force "$raw_path" 2>/dev/null \
|
||||
|| { echo " WARNING: git worktree remove failed, trying rm -rf"; rm -rf "$raw_path" || true; }
|
||||
else
|
||||
rm -rf "$raw_path"
|
||||
fi
|
||||
echo " Removed."
|
||||
else
|
||||
echo " Skipped."
|
||||
fi
|
||||
done
|
||||
|
||||
return $overall_rc
|
||||
}
|
||||
|
||||
usage() {
|
||||
cat <<'EOF'
|
||||
Usage: agent.sh <subcommand> [args]
|
||||
|
||||
agent.sh new <name> Create a new task worktree (branch task/<name>)
|
||||
agent.sh list List active task worktrees with status
|
||||
agent.sh merge <name> Fast-forward merge task/<name> into master and clean up
|
||||
agent.sh clean Remove orphaned or dangling worktrees (interactive)
|
||||
|
||||
EXIT: 0 ok, 1 preflight, 2 operation failed.
|
||||
EOF
|
||||
}
|
||||
|
||||
# ── dispatch ──────────────────────────────────────────────────────────────────
|
||||
|
||||
SUBCOMMAND="${1:-}"
|
||||
shift || true
|
||||
|
||||
case "$SUBCOMMAND" in
|
||||
new) cmd_new "$@" ;;
|
||||
list) cmd_list "$@" ;;
|
||||
merge) cmd_merge "$@" ;;
|
||||
clean) cmd_clean "$@" ;;
|
||||
*) usage; exit 1 ;;
|
||||
esac
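`cmd_clean` decides whether a `homelab-codex-ws-*` directory is dangling by looking for it in the list of registered worktree paths. A minimal sketch of whole-line matching for that kind of check, assuming a hypothetical helper `is_registered` and made-up paths under `/srv`; a plain substring match would treat `...-foo` as registered whenever `...-foobar` is.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in agent.sh): succeeds only when "$2" equals one
# complete line of the newline-separated list in "$1".
is_registered() {
    local registered="$1" candidate="$2"
    # -F: fixed string, -x: match the whole line, -q: no output
    printf '%s\n' "$registered" | grep -qxF -- "$candidate"
}

# Illustrative paths only.
registered=$'/srv/homelab-codex-ws\n/srv/homelab-codex-ws-foobar'

is_registered "$registered" "/srv/homelab-codex-ws-foobar" && echo "registered"
is_registered "$registered" "/srv/homelab-codex-ws-foo"    || echo "dangling"
```

The second call reports `dangling` because `-x` requires the candidate to match an entire line, not a prefix of one.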
|
||||
|
|
@ -1,338 +0,0 @@
|
|||
#!/usr/bin/env bash
|
||||
# health-monitor.sh - Homelab node health monitor and safe disk cleanup
|
||||
#
|
||||
# Designed to run standalone on the host (cron or direct) or to be called by
|
||||
# the node-agent Python daemon. All cleanup decisions follow the conservative
|
||||
# policy agreed in the design review:
|
||||
#
|
||||
# lte_node (chelsty-infra, chelsty-ha) : NO cleanup at all
|
||||
# sd_card (piha, saturn) : dangling images + stopped containers,
|
||||
# rate-limited to once per 24 h
|
||||
# ai_node (solaria) : dangling images + stopped containers
|
||||
# + build cache (NEVER -a)
|
||||
# standard (vps) : dangling images + stopped containers
|
||||
# + build cache
|
||||
#
|
||||
# VPS additionally rotates control-plane filesystem artefacts:
|
||||
# actions/completed + failed > 7 days
|
||||
# logs/deploy > 30 days
|
||||
# events/** > 3 days AND past observer checkpoint
|
||||
#
|
||||
# NEVER TOUCHED (any node): /opt/homelab/data/, config/, state/,
|
||||
# actions/pending|approved|running, Frigate recordings, Ollama models,
|
||||
# Zigbee2MQTT data, Mosquitto data, HA database/config.
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Configuration
|
||||
# ---------------------------------------------------------------------------
|
||||
RUNTIME_PATH="${RUNTIME_PATH:-/opt/homelab}"
|
||||
EVENTS_DIR="${RUNTIME_PATH}/events"
|
||||
STATE_DIR="${RUNTIME_PATH}/state"
|
||||
LOGS_DIR="${RUNTIME_PATH}/logs"
|
||||
ACTIONS_DIR="${RUNTIME_PATH}/actions"
|
||||
|
||||
NODE_NAME="${NODE_NAME:-$(hostname)}"
|
||||
TIMESTAMP=$(date +%s)
|
||||
DATE=$(date -u +%Y-%m-%dT%H:%M:%SZ)
|
||||
|
||||
# Thresholds
|
||||
DISK_WARN_PCT=75
|
||||
DISK_CRIT_PCT=85
|
||||
MEM_WARN_PCT=85
|
||||
MEM_CRIT_PCT=95
|
||||
|
||||
# Rate-limit file for SD-card nodes (max one Docker cleanup per 24 h)
|
||||
CLEANUP_LOCK="${STATE_DIR}/last-docker-cleanup"
|
||||
CLEANUP_INTERVAL=86400 # seconds
|
||||
|
||||
# Node classifications
|
||||
LTE_NODES="chelsty-infra chelsty-ha"
|
||||
SD_CARD_NODES="piha saturn"
|
||||
AI_NODES="solaria"
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Helpers
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
log() { echo "$(date -u +%H:%M:%S) [INFO] $*"; }
|
||||
warn() { echo "$(date -u +%H:%M:%S) [WARN] $*" >&2; }
|
||||
err() { echo "$(date -u +%H:%M:%S) [ERROR] $*" >&2; }
|
||||
|
||||
contains() {
|
||||
local word="$1"; shift
|
||||
for w in "$@"; do [[ "$w" == "$word" ]] && return 0; done
|
||||
return 1
|
||||
}
|
||||
|
||||
get_node_type() {
|
||||
# shellcheck disable=SC2086
|
||||
if contains "$NODE_NAME" $LTE_NODES; then echo "lte_node"; return; fi
|
||||
if contains "$NODE_NAME" $SD_CARD_NODES; then echo "sd_card"; return; fi
|
||||
if contains "$NODE_NAME" $AI_NODES; then echo "ai_node"; return; fi
|
||||
echo "standard"
|
||||
}
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Event emission
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
emit_event() {
|
||||
local type="$1" severity="$2" service="${3:-}" message="$4" payload="${5:-}"; [[ -n "$payload" ]] || payload='{}'  # "${5:-{}}" would append a stray } whenever $5 is set
|
||||
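local id="evt-${NODE_NAME}-${TIMESTAMP}-${type}${service:+-${service}}"  # include the service so same-type events in one run do not overwrite each other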
|
||||
local dir="${EVENTS_DIR}/${NODE_NAME}"
|
||||
mkdir -p "$dir"
|
||||
cat > "${dir}/${id}.json" <<EOF
|
||||
{
|
||||
"id": "${id}",
|
||||
"timestamp": ${TIMESTAMP},
|
||||
"date": "${DATE}",
|
||||
"type": "${type}",
|
||||
"severity": "${severity}",
|
||||
"node": "${NODE_NAME}",
|
||||
"service": "${service}",
|
||||
"message": "${message}",
|
||||
"payload": ${payload}
|
||||
}
|
||||
EOF
|
||||
}
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Health checks
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
check_disk() {
|
||||
# Use /opt/homelab as the check target — it lives on the host filesystem
|
||||
# and this path is correct both when running natively and in a container
|
||||
# that mounts /opt/homelab from the host.
|
||||
local mount="${RUNTIME_PATH}"
|
||||
local usage_pct avail_mb total_mb
|
||||
usage_pct=$(df "${mount}" 2>/dev/null | awk 'NR==2 {gsub(/%/,"",$5); print $5}') || return
|
||||
avail_mb=$(df "${mount}" 2>/dev/null | awk 'NR==2 {printf "%d", $4/1024}') || return
|
||||
total_mb=$(df "${mount}" 2>/dev/null | awk 'NR==2 {printf "%d", $2/1024}') || return
|
||||
|
||||
if [[ "${usage_pct}" -ge "${DISK_CRIT_PCT}" ]]; then
|
||||
warn "Disk CRITICAL: ${usage_pct}% used (${avail_mb} MB free)"
|
||||
emit_event "disk_pressure" "high" "" \
|
||||
"Disk usage critical: ${usage_pct}% on ${mount} (${avail_mb} MB free)" \
|
||||
"{\"usage_pct\": ${usage_pct}, \"avail_mb\": ${avail_mb}, \"total_mb\": ${total_mb}, \"mount\": \"${mount}\"}"
|
||||
elif [[ "${usage_pct}" -ge "${DISK_WARN_PCT}" ]]; then
|
||||
warn "Disk elevated: ${usage_pct}% used"
|
||||
emit_event "disk_pressure" "medium" "" \
|
||||
"Disk usage elevated: ${usage_pct}% on ${mount} (${avail_mb} MB free)" \
|
||||
"{\"usage_pct\": ${usage_pct}, \"avail_mb\": ${avail_mb}, \"total_mb\": ${total_mb}, \"mount\": \"${mount}\"}"
|
||||
fi
|
||||
echo "${usage_pct}"
|
||||
}
|
||||
|
||||
check_memory() {
|
||||
local total avail pct avail_mb
|
||||
total=$(awk '/^MemTotal/ {print $2}' /proc/meminfo)
|
||||
avail=$(awk '/^MemAvailable/ {print $2}' /proc/meminfo)
|
||||
pct=$(( (total - avail) * 100 / total ))
|
||||
avail_mb=$(( avail / 1024 ))
|
||||
|
||||
if [[ "${pct}" -ge "${MEM_CRIT_PCT}" ]]; then
|
||||
warn "Memory CRITICAL: ${pct}% used"
|
||||
emit_event "high_memory" "high" "" \
|
||||
"Memory usage critical: ${pct}% (${avail_mb} MB available)" \
|
||||
"{\"usage_pct\": ${pct}, \"avail_mb\": ${avail_mb}, \"total_mb\": $((total/1024))}"
|
||||
elif [[ "${pct}" -ge "${MEM_WARN_PCT}" ]]; then
|
||||
warn "Memory elevated: ${pct}%"
|
||||
emit_event "high_memory" "medium" "" \
|
||||
"Memory usage elevated: ${pct}% (${avail_mb} MB available)" \
|
||||
"{\"usage_pct\": ${pct}, \"avail_mb\": ${avail_mb}, \"total_mb\": $((total/1024))}"
|
||||
fi
|
||||
echo "${pct}"
|
||||
}
|
||||
|
||||
check_cpu() {
|
||||
# Two-sample /proc/stat delta for accurate instantaneous CPU usage.
|
||||
local idle1 total1 idle2 total2 pct
|
||||
read -r idle1 total1 < <(awk '/^cpu / {idle=$5; total=0; for(i=2;i<=NF;i++) total+=$i; print idle, total}' /proc/stat)
|
||||
sleep 1
|
||||
read -r idle2 total2 < <(awk '/^cpu / {idle=$5; total=0; for(i=2;i<=NF;i++) total+=$i; print idle, total}' /proc/stat)
|
||||
|
||||
local d_idle=$(( idle2 - idle1 ))
|
||||
local d_total=$(( total2 - total1 ))
|
||||
pct=$(( d_total > 0 ? 100 - d_idle * 100 / d_total : 0 ))
|
||||
|
||||
if [[ "${pct}" -ge 90 ]]; then
|
||||
warn "CPU elevated: ${pct}%"
|
||||
emit_event "high_cpu" "medium" "" \
|
||||
"CPU usage elevated: ${pct}%" \
|
||||
"{\"usage_pct\": ${pct}}"
|
||||
fi
|
||||
echo "${pct}"
|
||||
}
|
||||
|
||||
check_containers() {
|
||||
command -v docker &>/dev/null || return
|
||||
|
||||
# Exited containers that belong to a Compose project (the restart policy itself is not inspected)
|
||||
local cname
|
||||
while IFS= read -r cname; do
|
||||
[[ -z "$cname" ]] && continue
|
||||
warn "Container exited (should be running): ${cname}"
|
||||
emit_event "containers_not_running" "high" "${cname}" \
|
||||
"Container '${cname}' has exited unexpectedly (restart=unless-stopped)" \
|
||||
"{\"container\": \"${cname}\"}"
|
||||
done < <(docker ps -a \
|
||||
--filter "status=exited" \
|
||||
--filter "label=com.docker.compose.project" \
|
||||
--format "{{.Names}}" 2>/dev/null || true)
|
||||
|
||||
# Containers that are running but their health check is failing
|
||||
while IFS= read -r cname; do
|
||||
[[ -z "$cname" ]] && continue
|
||||
warn "Container unhealthy: ${cname}"
|
||||
emit_event "healthcheck_failed" "high" "${cname}" \
|
||||
"Container '${cname}' is running but health check is failing" \
|
||||
"{\"container\": \"${cname}\"}"
|
||||
done < <(docker ps \
|
||||
--filter "health=unhealthy" \
|
||||
--format "{{.Names}}" 2>/dev/null || true)
|
||||
}
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Safe Docker cleanup (per policy)
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
_sd_card_rate_ok() {
|
||||
if [[ -f "${CLEANUP_LOCK}" ]]; then
|
||||
local last_ts elapsed
|
||||
last_ts=$(cat "${CLEANUP_LOCK}" 2>/dev/null || echo 0)
|
||||
elapsed=$(( TIMESTAMP - last_ts ))
|
||||
if [[ "${elapsed}" -lt "${CLEANUP_INTERVAL}" ]]; then
|
||||
log "Docker cleanup skipped: last run ${elapsed}s ago (limit ${CLEANUP_INTERVAL}s)"
|
||||
return 1
|
||||
fi
|
||||
fi
|
||||
return 0
|
||||
}
|
||||
|
||||
_mark_cleanup_done() {
|
||||
echo "${TIMESTAMP}" > "${CLEANUP_LOCK}"
|
||||
}
|
||||
|
||||
run_safe_cleanup() {
|
||||
command -v docker &>/dev/null || return
|
||||
local node_type
|
||||
node_type=$(get_node_type)
|
||||
|
||||
case "${node_type}" in
|
||||
lte_node)
|
||||
# NO cleanup on LTE nodes. Any docker operation risks triggering
|
||||
# a pull over a metered/intermittent connection.
|
||||
log "Skipping Docker cleanup: LTE node (${NODE_NAME})"
|
||||
;;
|
||||
|
||||
sd_card)
|
||||
# Dangling images + stopped containers only.
|
||||
# Rate-limited to once per 24 hours to protect SD card write endurance.
|
||||
_sd_card_rate_ok || return
|
||||
log "Running rate-limited Docker cleanup (SD card node)"
|
||||
docker image prune -f >/dev/null 2>&1 || true
|
||||
docker container prune -f >/dev/null 2>&1 || true
|
||||
_mark_cleanup_done
|
||||
;;
|
||||
|
||||
ai_node)
|
||||
# Dangling images + stopped containers + build cache.
|
||||
# NEVER docker image prune -a (would remove Ollama runtime images,
|
||||
# requiring a multi-hour re-pull of model weights).
|
||||
log "Running AI-node Docker cleanup (dangling images + containers + build cache)"
|
||||
docker image prune -f >/dev/null 2>&1 || true
|
||||
docker container prune -f >/dev/null 2>&1 || true
|
||||
docker builder prune -f >/dev/null 2>&1 || true
|
||||
;;
|
||||
|
||||
standard)
|
||||
# VPS and other standard nodes: full safe cleanup.
|
||||
log "Running standard Docker cleanup"
|
||||
docker image prune -f >/dev/null 2>&1 || true
|
||||
docker container prune -f >/dev/null 2>&1 || true
|
||||
docker builder prune -f >/dev/null 2>&1 || true
|
||||
;;
|
||||
esac
|
||||
}
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# VPS-specific: control-plane filesystem rotation
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
cleanup_control_plane_fs() {
|
||||
log "Running control-plane filesystem rotation"
|
||||
|
||||
# Completed / failed actions older than 7 days
|
||||
for status in completed failed; do
|
||||
local dir="${ACTIONS_DIR}/${status}"
|
||||
[[ -d "${dir}" ]] || continue
|
||||
find "${dir}" -name "*.json" -mtime +7 -delete 2>/dev/null && \
|
||||
log "Cleaned ${status} actions older than 7 days" || true
|
||||
done
|
||||
|
||||
# Deploy logs older than 30 days
|
||||
local deploy_logs="${LOGS_DIR}/deploy"
|
||||
if [[ -d "${deploy_logs}" ]]; then
|
||||
find "${deploy_logs}" -name "*.log" -mtime +30 -delete 2>/dev/null && \
|
||||
log "Cleaned deploy logs older than 30 days" || true
|
||||
fi
|
||||
|
||||
# Event files older than 3 days AND already past the observer checkpoint.
|
||||
# The dual condition ensures we never delete an event the observer hasn't seen.
|
||||
local checkpoint="${STATE_DIR}/observer_checkpoint.json"
|
||||
if [[ -f "${checkpoint}" ]] && command -v python3 &>/dev/null; then
|
||||
local last_processed
|
||||
last_processed=$(python3 -c "
|
||||
import json, sys
|
||||
try:
|
||||
d = json.load(open('${checkpoint}'))
|
||||
print(d.get('last_processed_file', ''))
|
||||
except Exception:
|
||||
print('')
|
||||
" 2>/dev/null || echo "")
|
||||
|
||||
if [[ -n "${last_processed}" ]]; then
|
||||
find "${EVENTS_DIR}" -name "*.json" -mtime +3 | while IFS= read -r f; do
|
||||
# Only delete files that sort before the checkpoint path
|
||||
# (i.e., the observer has already processed them).
|
||||
if [[ "$f" < "${last_processed}" ]]; then
|
||||
rm -f "$f"
|
||||
log "Cleaned old event: $(basename "$f")"
|
||||
fi
|
||||
done
|
||||
else
|
||||
log "No observer checkpoint set; skipping event file cleanup"
|
||||
fi
|
||||
fi
|
||||
}
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Main
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
mkdir -p "${EVENTS_DIR}/${NODE_NAME}" "${STATE_DIR}"
|
||||
|
||||
log "Health check starting on ${NODE_NAME} (type=$(get_node_type))"
|
||||
|
||||
disk_pct=$(check_disk || echo 0)
|
||||
mem_pct=$(check_memory || echo 0)
|
||||
cpu_pct=$(check_cpu || echo 0)
|
||||
check_containers
|
||||
|
||||
run_safe_cleanup
|
||||
|
||||
# VPS: also rotate control-plane filesystem artefacts
|
||||
if [[ "${NODE_NAME}" == "vps" ]]; then
|
||||
cleanup_control_plane_fs
|
||||
fi
|
||||
|
||||
# Emit a node_health heartbeat so the observer can update node status
|
||||
# and the supervisor can see up-to-date resource metrics.
|
||||
emit_event "node_health" "info" "" \
|
||||
"Health check completed on ${NODE_NAME}" \
|
||||
"{\"disk_pct\": ${disk_pct}, \"mem_pct\": ${mem_pct}, \"cpu_pct\": ${cpu_pct}}"
|
||||
|
||||
log "Health check complete (disk=${disk_pct}% mem=${mem_pct}% cpu=${cpu_pct}%)"
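`emit_event` accepts an optional JSON payload as its fifth argument and falls back to an empty object. A minimal sketch, using two hypothetical demo functions, of how that default behaves in bash: a parameter expansion ends at the first unquoted `}`, so `${1:-{}}` is read as `${1:-{}` followed by a literal `}`, and the extra brace is appended whenever the argument is set.

```bash
#!/usr/bin/env bash
# Hypothetical demo functions; neither exists in health-monitor.sh.

# Naive default: parsed as "${1:-{}" plus a literal "}".
payload_naive() {
    local payload="${1:-{}}"
    printf '%s\n' "$payload"
}

# Safe default: expand first, then substitute '{}' only when empty.
payload_safe() {
    local payload="${1:-}"
    [[ -n "$payload" ]] || payload='{}'
    printf '%s\n' "$payload"
}

payload_naive '{"a": 1}'   # prints {"a": 1}}  (stray closing brace)
payload_naive              # prints {}
payload_safe  '{"a": 1}'   # prints {"a": 1}
payload_safe               # prints {}
```

The naive form looks correct when no payload is passed, which is why the problem tends to surface only as invalid JSON in events that do carry a payload.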
|
||||
|
|
@ -7,34 +7,6 @@ import yaml
|
|||
from datetime import datetime, timezone
|
||||
from pathlib import Path
|
||||
|
||||
|
||||
def _atomic_write_json(path: Path, data) -> None:
|
||||
"""Write JSON atomically: write to a sibling .tmp, fsync, then os.replace."""
|
||||
tmp = path.with_suffix(".tmp")
|
||||
with open(tmp, "w") as f:
|
||||
json.dump(data, f, indent=2)
|
||||
f.flush()
|
||||
os.fsync(f.fileno())
|
||||
os.replace(tmp, path)
|
||||
|
||||
|
||||
def _parse_ts(ts) -> float:
|
||||
"""Return a Unix timestamp float from ts, which may be int/float or an ISO-8601 string.
|
||||
|
||||
Events from node-agent use int(time.time()); events from stability-agent / events.py
|
||||
use ISO format ('2026-06-03T10:30:00Z'). Both appear in incident fields such as
|
||||
last_occurrence and resolved_at, so any arithmetic on them must go through here.
|
||||
Returns 0.0 on None or unparseable input so callers can use plain comparisons.
|
||||
"""
|
||||
if ts is None:
|
||||
return 0.0
|
||||
if isinstance(ts, (int, float)):
|
||||
return float(ts)
|
||||
try:
|
||||
return datetime.fromisoformat(str(ts).replace("Z", "+00:00")).timestamp()
|
||||
except Exception:
|
||||
return 0.0
|
||||
|
||||
# Constants and Paths
|
||||
RUNTIME_PATH = os.getenv("RUNTIME_PATH", "/opt/homelab")
|
||||
EVENTS_DIR = Path(RUNTIME_PATH) / "events"
|
||||
|
|
@ -42,7 +14,6 @@ STATE_DIR = Path(RUNTIME_PATH) / "state"
|
|||
LOGS_DIR = Path(RUNTIME_PATH) / "logs"
|
||||
WORLD_DIR = Path(RUNTIME_PATH) / "world"
|
||||
OBSERVER_STATE_FILE = STATE_DIR / "observer_checkpoint.json"
|
||||
FAILED_EVENTS_DIR = STATE_DIR / "observer_failed_events"
|
||||
|
||||
REPO_ROOT = Path(__file__).parent.parent.parent
|
||||
INVENTORY_TOPOLOGY = REPO_ROOT / "inventory" / "topology.yaml"
|
||||
|
|
@ -53,10 +24,7 @@ logger = logging.getLogger("observer")
|
|||
|
||||
class Observer:
|
||||
def __init__(self):
|
||||
# Per-node-directory checkpoint: {"vps": "last/file/path", "piha": "last/file/path"}
|
||||
# Replaces the old single last_processed_file which silently skipped event dirs
|
||||
# that sort alphabetically before the checkpoint (e.g. piha/ < vps/).
|
||||
self.node_checkpoints: dict = {}
|
||||
self.last_processed_file = None
|
||||
self.world_state = {
|
||||
"nodes": {},
|
||||
"services": {},
|
||||
|
|
@ -77,27 +45,6 @@ class Observer:
|
|||
STATE_DIR.mkdir(parents=True, exist_ok=True)
|
||||
EVENTS_DIR.mkdir(parents=True, exist_ok=True)
|
||||
LOGS_DIR.mkdir(parents=True, exist_ok=True)
|
||||
FAILED_EVENTS_DIR.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
def _quarantine_event_file(self, file_path: str, node_dir: str, exc: Exception) -> None:
|
||||
"""Move an unreadable/unprocessable event out of the hot path."""
|
||||
src = Path(file_path)
|
||||
dest_dir = FAILED_EVENTS_DIR / node_dir
|
||||
dest_dir.mkdir(parents=True, exist_ok=True)
|
||||
dest = dest_dir / src.name
|
||||
if dest.exists():
|
||||
dest = dest_dir / f"{src.stem}-{int(time.time())}{src.suffix}"
|
||||
try:
|
||||
os.replace(src, dest)
|
||||
logger.error(
|
||||
"Quarantined bad event for node_dir=%s: %s -> %s (%s: %s)",
|
||||
node_dir, src, dest, type(exc).__name__, exc,
|
||||
)
|
||||
except Exception as move_exc:
|
||||
logger.error(
|
||||
"Failed to quarantine bad event for node_dir=%s: %s (%s: %s); move error=%s: %s",
|
||||
node_dir, src, type(exc).__name__, exc, type(move_exc).__name__, move_exc,
|
||||
)
|
||||
|
||||
def _load_inventory(self):
|
||||
inventory = {"nodes": {}, "services": {}}
|
||||
|
|
@ -136,22 +83,11 @@ class Observer:
|
|||
try:
|
||||
with open(OBSERVER_STATE_FILE, "r") as f:
|
||||
checkpoint = json.load(f)
|
||||
|
||||
if "node_checkpoints" in checkpoint:
|
||||
# New format: per-directory checkpoints.
|
||||
self.node_checkpoints = checkpoint["node_checkpoints"]
|
||||
elif "last_processed_file" in checkpoint:
|
||||
# Migrate old single-file checkpoint: extract node dir from path.
|
||||
old = checkpoint["last_processed_file"]
|
||||
if old:
|
||||
try:
|
||||
node_dir = Path(old).relative_to(EVENTS_DIR).parts[0]
|
||||
self.node_checkpoints = {node_dir: old}
|
||||
logger.info(f"Migrated old checkpoint → node_checkpoints: {self.node_checkpoints}")
|
||||
except Exception:
|
||||
pass # Bad path — start fresh
|
||||
|
||||
self._load_world_from_disk()
|
||||
self.last_processed_file = checkpoint.get("last_processed_file")
|
||||
# We might want to persist partial world state,
|
||||
# but for now we rebuild from events (idempotent)
|
||||
# or we can load existing world state files.
|
||||
self._load_world_from_disk()
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to load checkpoint: {e}")
|
||||
|
||||
|
|
@ -174,128 +110,17 @@ class Observer:
|
|||
|
||||
def _save_checkpoint(self):
|
||||
try:
|
||||
_atomic_write_json(OBSERVER_STATE_FILE, {"node_checkpoints": self.node_checkpoints})
|
||||
with open(OBSERVER_STATE_FILE, "w") as f:
|
||||
json.dump({"last_processed_file": self.last_processed_file}, f)
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to save checkpoint: {e}")
|
||||
|
||||
def _prune_stale_world(self):
|
||||
"""Remove world-state entries for nodes absent from the topology inventory.
|
||||
|
||||
Root cause this guards against: when NODE_NAME env var is unset, node_agent.py
|
||||
falls back to socket.gethostname(), which inside a Docker container returns the
|
||||
12-char hex container ID (e.g. 'be17cb6eb0f6') instead of the canonical host name
|
||||
('vps'). The observer ingests those events and creates ghost entries that never
|
||||
expire on their own.
|
||||
|
||||
Also ages out resolved incidents older than 7 days to keep world state lean.
|
||||
"""
|
||||
known_nodes = set(self.inventory["nodes"].keys())
|
||||
if not known_nodes:
|
||||
# Inventory failed to load — don't prune to avoid wiping valid state.
|
||||
return
|
||||
|
||||
stale_nodes = [n for n in list(self.world_state["nodes"].keys())
|
||||
if n not in known_nodes]
|
||||
for n in stale_nodes:
|
||||
logger.info(f"Pruning stale node from world state: {n}")
|
||||
del self.world_state["nodes"][n]
|
||||
|
||||
stale_svcs = [k for k in list(self.world_state["services"].keys())
|
||||
if k.split("/")[0] in stale_nodes]
|
||||
for k in stale_svcs:
|
||||
logger.info(f"Pruning stale service from world state: {k}")
|
||||
del self.world_state["services"][k]
|
||||
|
||||
# Prune ghost service keys whose service-name portion is a hash-prefixed
|
||||
# Docker stale-state artifact (e.g. "9e36297651e7_control-plane-observer").
|
||||
# These are created when node-agent incorrectly uses c.name instead of the
|
||||
# compose label, and accumulate on every container rebuild.
|
||||
# Pattern: <node>/<12hexchars>_<real-name>
|
||||
ghost_svcs = [
|
||||
k for k in list(self.world_state["services"].keys())
|
||||
if len(k.split("/", 1)) == 2
|
||||
and len(k.split("/", 1)[1]) > 13
|
||||
and k.split("/", 1)[1][12] == "_"
|
||||
and all(ch in "0123456789abcdef" for ch in k.split("/", 1)[1][:12])
|
||||
]
|
||||
for k in ghost_svcs:
|
||||
logger.info(f"Pruning ghost (hash-prefixed) service key from world state: {k}")
|
||||
del self.world_state["services"][k]
|
||||
|
||||
now = time.time()
|
||||
|
||||
try:
|
||||
# Collect incident_ids currently referenced by any service entry.
|
||||
linked_ids: set = {
|
||||
svc.get("incident_id")
|
||||
for svc in self.world_state["services"].values()
|
||||
if svc.get("incident_id")
|
||||
}
|
||||
|
||||
# Case 1 — service is healthy but still points at an active incident.
|
||||
# process_event already calls _resolve_incident on service_healthy events,
|
||||
# but if the observer restarted with on-disk state where the link was
|
||||
# intact (inconsistency from a pre-atomic-write crash), it may not get
|
||||
# resolved until the next service_healthy event is processed. Resolve
|
||||
# immediately — a healthy service cannot have an ongoing incident.
|
||||
for svc_key, svc in self.world_state["services"].items():
|
||||
if svc.get("status") != "healthy":
|
||||
continue
|
||||
inc_id = svc.get("incident_id")
|
||||
if not inc_id:
|
||||
continue
|
||||
inc = self.world_state["incidents"].get(inc_id, {})
|
||||
if inc.get("status") == "active":
|
||||
logger.info(
|
||||
f"Auto-resolving incident {inc_id} for {svc_key}: "
|
||||
f"service is healthy"
|
||||
)
|
||||
inc["status"] = "resolved"
|
||||
inc["resolved_at"] = now
|
||||
svc["incident_id"] = None
|
||||
linked_ids.discard(inc_id)
|
||||
|
||||
# Case 2 — orphaned active incident: no service entry links to it and
|
||||
# last_occurrence is older than 5 minutes (guard against creation races).
|
||||
# These are the stale records left behind when on-disk state was
|
||||
# inconsistent: the service entry had incident_id cleared but incidents.json
|
||||
# still had the record as "active".
|
||||
for inc_id, inc in self.world_state["incidents"].items():
|
||||
if inc.get("status") != "active":
|
||||
continue
|
||||
if inc_id in linked_ids:
|
||||
continue
|
||||
age = now - _parse_ts(inc.get("last_occurrence"))
|
||||
if age > 300: # 5-minute guard
|
||||
logger.info(
|
||||
f"Auto-resolving orphaned incident {inc_id} "
|
||||
f"(service={inc.get('service')}, node={inc.get('node')}): "
|
||||
f"no service references it, age={int(age)}s"
|
||||
)
|
||||
inc["status"] = "resolved"
|
||||
inc["resolved_at"] = now
|
||||
|
||||
except Exception as exc:
|
||||
logger.error(f"Error during incident auto-resolve in _prune_stale_world: {exc}")
|
||||
|
||||
# Remove resolved incidents older than 7 days.
|
||||
# Use _parse_ts so ISO-string resolved_at values are handled correctly.
|
||||
stale_incidents = [
|
||||
k for k, v in self.world_state["incidents"].items()
|
||||
if v.get("status") == "resolved"
|
||||
and now - _parse_ts(v.get("resolved_at")) > 7 * 86400
|
||||
]
|
||||
for k in stale_incidents:
|
||||
del self.world_state["incidents"][k]
|
||||
|
||||
def _save_world(self):
|
||||
self.world_state["summary"]["last_update"] = datetime.now(timezone.utc).isoformat()
|
||||
active_incidents = [
|
||||
k for k, v in self.world_state["incidents"].items() if v.get("status") == "active"
|
||||
]
|
||||
self.world_state["summary"]["active_incidents_count"] = len(active_incidents)
|
||||
self.world_state["summary"]["node_count"] = len(self.world_state["nodes"])
|
||||
self.world_state["summary"]["service_count"] = len(self.world_state["services"])
|
||||
|
||||
if active_incidents:
|
||||
self.world_state["summary"]["status"] = "degraded"
|
||||
|
|
@ -307,12 +132,13 @@ class Observer:
|
|||
"services.json": self.world_state["services"],
|
||||
"deployments.json": self.world_state["deployments"],
|
||||
"incidents.json": self.world_state["incidents"],
|
||||
"recommendations.json": [],
|
||||
"recommendations.json": [], # Placeholder to satisfy requirements
|
||||
"runtime-summary.json": self.world_state["summary"]
|
||||
}
|
||||
for filename, data in files.items():
|
||||
try:
|
||||
_atomic_write_json(WORLD_DIR / filename, data)
|
||||
with open(WORLD_DIR / filename, "w") as f:
|
||||
json.dump(data, f, indent=2)
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to save {filename}: {e}")
|
||||
|
||||
|
|
@ -339,35 +165,6 @@ class Observer:
|
|||
elif etype == "node_offline":
|
||||
self.world_state["nodes"][node]["status"] = "offline"
|
||||
|
||||
elif etype == "node_health":
|
||||
# Regular heartbeat from node-agent; updates resource metrics.
|
||||
# Clears disk_pressure if disk is now healthy (< warn threshold).
|
||||
self.world_state["nodes"][node]["status"] = "online"
|
||||
self.world_state["nodes"][node].update({
|
||||
"disk_usage_pct": payload.get("disk_pct"),
|
||||
"mem_usage_pct": payload.get("mem_pct"),
|
||||
"cpu_usage_pct": payload.get("cpu_pct"),
|
||||
})
|
||||
if (payload.get("disk_pct") or 0) < 75:
|
||||
self.world_state["nodes"][node].pop("disk_pressure", None)
|
||||
|
||||
elif etype == "disk_pressure":
|
||||
# Emitted when disk usage crosses 75 % (medium) or 85 % (high).
|
||||
# The supervisor reads disk_pressure to generate disk_cleanup actions.
|
||||
self.world_state["nodes"][node]["disk_pressure"] = severity
|
||||
self.world_state["nodes"][node]["disk_usage_pct"] = payload.get("usage_pct")
|
||||
|
||||
elif etype == "high_memory":
|
||||
# Memory pressure observation; recorded on the node for correlation.
|
||||
# No automated action — operator decides if a container restart helps.
|
||||
self.world_state["nodes"][node]["memory_pressure"] = severity
|
||||
self.world_state["nodes"][node]["mem_usage_pct"] = payload.get("usage_pct")
|
||||
|
||||
elif etype == "high_cpu":
|
||||
# CPU pressure observation; recorded for visibility.
|
||||
self.world_state["nodes"][node]["cpu_pressure"] = severity
|
||||
self.world_state["nodes"][node]["cpu_usage_pct"] = payload.get("usage_pct")
|
||||
|
||||
# 2. Update Service State
|
||||
if service and service != "all":
|
||||
svc_key = f"{node}/{service}"
|
||||
|
|
@ -384,15 +181,6 @@ class Observer:
|
|||
if etype == "service_recovered":
|
||||
self.world_state["services"][svc_key]["status"] = "healthy"
|
||||
self._resolve_incident(svc_key, timestamp)
|
||||
elif etype == "service_healthy":
|
||||
# Positive confirmation from node-agent that a managed container
|
||||
# is running. This keeps services.json populated so the supervisor
|
||||
# can correctly detect drift (absent entry = never reported = unknown,
|
||||
# not the same as confirmed missing).
|
||||
# Also resolve any active incident — if a service that had been
|
||||
# unhealthy/crashing is now confirmed healthy, the incident is over.
|
||||
self.world_state["services"][svc_key]["status"] = "healthy"
|
||||
self._resolve_incident(svc_key, timestamp)
|
||||
elif etype in ["service_unhealthy", "healthcheck_failed"]:
|
||||
self.world_state["services"][svc_key]["status"] = "unhealthy"
|
||||
self._handle_incident(svc_key, event)
|
||||
|
|
@ -445,11 +233,6 @@ class Observer:
|
|||
"service": event.get("service"),
|
||||
"status": "active",
|
||||
"severity": event.get("severity"),
|
||||
# trigger_type records the event type that opened this incident so that
|
||||
# the supervisor can choose the appropriate remediation action
|
||||
# (e.g. container_restart for containers_not_running / mqtt_unreachable
|
||||
# vs. a full redeploy for other causes).
|
||||
"trigger_type": event.get("type"),
|
||||
"started_at": event.get("timestamp"),
|
||||
"last_occurrence": event.get("timestamp"),
|
||||
"occurrence_count": 1,
|
||||
|
|
@ -488,47 +271,36 @@ class Observer:
|
|||
except Exception as e:
|
||||
logger.error(f"Failed to touch heartbeat file: {e}")
|
||||
|
||||
# Collect all event files grouped by node directory.
|
||||
# Per-node checkpoints are compared within each directory independently,
|
||||
# so late-arriving events from remote nodes (sorted earlier in the path)
|
||||
# are never skipped just because another node's checkpoint is further ahead.
|
||||
all_files = sorted(glob.glob(str(EVENTS_DIR / "**" / "*.json"), recursive=True))
|
||||
# Find all event files
|
||||
event_files = sorted(glob.glob(str(EVENTS_DIR / "**" / "*.json"), recursive=True))
|
||||
|
||||
new_files = []
|
||||
for file_path in all_files:
|
||||
if self.last_processed_file:
|
||||
try:
|
||||
node_dir = str(Path(file_path).relative_to(EVENTS_DIR).parts[0])
|
||||
except (IndexError, ValueError):
|
||||
node_dir = "__unknown__"
|
||||
last_for_node = self.node_checkpoints.get(node_dir, "")
|
||||
if file_path > last_for_node:
|
||||
new_files.append((node_dir, file_path))
|
||||
idx = event_files.index(self.last_processed_file)
|
||||
new_files = event_files[idx+1:]
|
||||
except ValueError:
|
||||
# If last_processed_file is gone or not in list, process all
|
||||
new_files = event_files
|
||||
else:
|
||||
new_files = event_files
|
||||
|
||||
if not new_files:
|
||||
# Even if no new events, prune stale entries and refresh summary freshness.
|
||||
self._prune_stale_world()
|
||||
# Even if no new events, we update freshness of summary
|
||||
self._save_world()
|
||||
return
|
||||
|
||||
logger.info(f"Processing {len(new_files)} new events across "
|
||||
f"{len({n for n, _ in new_files})} node(s)")
|
||||
for node_dir, file_path in new_files:
|
||||
logger.info(f"Processing {len(new_files)} new events")
|
||||
for file_path in new_files:
|
||||
try:
|
||||
with open(file_path, "r") as f:
|
||||
event = json.load(f)
|
||||
self.process_event(event)
|
||||
# Advance per-node checkpoint (only forward — no regression).
|
||||
if file_path > self.node_checkpoints.get(node_dir, ""):
|
||||
self.node_checkpoints[node_dir] = file_path
|
||||
self.last_processed_file = file_path
|
||||
except Exception as e:
|
||||
logger.error(
|
||||
"Error processing node_dir=%s file=%s (%s: %s)",
|
||||
node_dir, file_path, type(e).__name__, e,
|
||||
)
|
||||
self._quarantine_event_file(file_path, node_dir, e)
|
||||
logger.error(f"Error processing {file_path}: {e}")
|
||||
|
||||
self._save_checkpoint()
|
||||
self._prune_stale_world()
|
||||
self._save_world()
|
||||
|
||||
def loop(self, interval=5):
|
||||
|
|
|
|||
|
|
@ -1,139 +0,0 @@
|
|||
# scripts/onboard — Node Onboarding Tool
|
||||
|
||||
Idempotent, declarative node onboarding in bash, without Ansible.
Each node is described by a manifest at `hosts/<node>/node.yaml`; the
`onboard.sh` script reads the manifest and runs the numbered steps in order.
|
||||
|
||||
## Usage
|
||||
|
||||
```bash
|
||||
scripts/onboard/onboard.sh --node <name> [--step <name>] [--from <step>] [--dry-run]
|
||||
```
|
||||
|
||||
| Flag | Description |
|------|-------------|
| `--node <name>` | Node name (required); matches `hosts/<name>/node.yaml` |
| `--step <name>` | Run only this one step (e.g. `00-access`) |
| `--from <step>` | Start from this step and continue to the end |
| `--dry-run` | Sets `DRY_RUN=1`; mutations are simulated by `run()`, probes run for real |
|
||||
|
||||
```bash
|
||||
# Full onboarding
scripts/onboard/onboard.sh --node lustro

# Single step only
scripts/onboard/onboard.sh --node lustro --step 00-access

# From a given step onward
scripts/onboard/onboard.sh --node lustro --from 10-bootstrap-runtime

# Preview without changes (state probes still run for real, so the plan is realistic)
scripts/onboard/onboard.sh --node lustro --dry-run
|
||||
```
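How `--step` and `--from` narrow the step list can be sketched as follows. This is a simplified illustration, not the orchestrator itself: `select_steps` is a hypothetical helper, and it assumes that `--step` is an exact match on the step name and that `--from` relies on lexicographic ordering of the `NN-name` prefixes.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: select_steps ONLY FROM STEP...
# Prints, one per line, the steps that would be selected.
select_steps() {
  local only="$1" from="$2" s
  shift 2
  for s in "$@"; do
    if [[ -n "$only" ]]; then
      # --step: exact match on the step name (no .sh suffix)
      if [[ "$s" != "$only" ]]; then continue; fi
    elif [[ -n "$from" ]]; then
      # --from: skip everything that sorts before the given step
      if [[ "$s" < "$from" ]]; then continue; fi
    fi
    printf '%s\n' "$s"
  done
}

# Step names as listed in this README's step table.
steps=(00-access 00-preflight 10-bootstrap-runtime 20-install-docker
       30-install-tailscale 40-deploy-node-agent 50-verify)

only_out="$(select_steps 00-access "" "${steps[@]}")"
from_out="$(select_steps "" 30-install-tailscale "${steps[@]}" | paste -sd ' ' -)"

echo "--step 00-access            -> ${only_out}"
echo "--from 30-install-tailscale -> ${from_out}"
```

Because selection is purely lexicographic, the two-digit prefix is what keeps the order stable; a step named `5-foo` would sort after `40-deploy-node-agent`.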
|
||||
|
||||
## hosts/\<node\>/node.yaml: schema
|
||||
|
||||
```yaml
|
||||
name: LUSTRO                      # node name (ALL CAPS)
role: edge                        # edge | compute | infra
location: KEN                     # location identifier

ssh_user: pi                      # SSH user; may differ from "oskar" on edge nodes
                                  # (uid=1000 collision: use the existing user)
first_contact: pi@192.168.31.19   # SSH target before Tailscale; MUST be an IP, not .local
                                  # (mDNS .local is unreliable in automation)
tailscale:
  hostname: lustro                # name in the mesh; target after tailscale up
  ip:                             # filled in after join (optional)

deploy_autonomy: true             # true = onboard.sh may perform mutations autonomously
                                  # false = print manual instructions and stop
git_control: false                # true = node pulls from Forgejo
                                  # false = push-based from SATURN (edge nodes)

hardware:
  arch: arm64                     # aarch64 | x86_64 | armv7l; filled in by 00-preflight
  ram_mb: 4096                    # RAM in MB; filled in by 00-preflight
  swap:
    kind: zram                    # zram | file | none; zram recommended (SD wear)
  docker_present: true            # is docker already installed?; filled in by 00-preflight
  mm_runtime: systemd:magicmirror.service
                                  # MagicMirror runtime: systemd:<unit> | pm2 | process | none
                                  # filled in by 00-preflight

services:
  node-agent:
    runtime:
      engine: docker              # docker | docker-compose
      mem_limit: 256m             # mandatory (RPi4 RAM profile like VPS: OOM risk)
|
||||
```
|
||||
|
||||
### Field notes
|
||||
|
||||
- **`ssh_user`**: on edge nodes with an existing uid=1000 user (e.g. `pi` on RPi OS), use
  that user instead of creating `oskar`; docker group membership and the node-agent
  `mem_limit` are designed for `1000:1000`.
- **`first_contact`**: always an IP, never a `.local` hostname. mDNS proved unreliable
  in automation (transient resolution failures). After `tailscale up`, use `tailscale.hostname`.
- **`deploy_autonomy`**: when `false`, steps 10+ print manual instructions and finish
  without mutating anything. Useful for nodes managed by someone else.
- **`git_control`**: when `false`, steps with `git`/`repo`/`clone` in their name are skipped.
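The two gates can be sketched as a small decision function. This is an illustration under assumptions: `decide` and the step name `30-clone-repo` are hypothetical, and the rules are taken from the notes above (steps numbered 10 and up need `deploy_autonomy: true`; steps with `git`/`repo`/`clone` in their name need `git_control: true`).

```bash
#!/usr/bin/env bash
set -euo pipefail

# Steps numbered 10 and above are treated as mutating.
step_needs_autonomy() {
  local num="${1%%[^0-9]*}"        # leading digits of the step name
  [[ "$num" -ge 10 ]] 2>/dev/null
}

# Steps whose name mentions git/repo/clone depend on git_control.
step_needs_git_control() {
  [[ "$1" == *git* || "$1" == *repo* || "$1" == *clone* ]]
}

# Hypothetical helper: decide STEP DEPLOY_AUTONOMY GIT_CONTROL
# Prints "run" or "skip:<reason>".
decide() {
  local step="$1" autonomy="$2" git_control="$3"
  if step_needs_autonomy "$step" && [[ "$autonomy" != "true" ]]; then
    echo "skip:deploy_autonomy"
  elif step_needs_git_control "$step" && [[ "$git_control" != "true" ]]; then
    echo "skip:git_control"
  else
    echo "run"
  fi
}

a="$(decide 00-access false false)"
b="$(decide 40-deploy-node-agent false false)"
c="$(decide 30-clone-repo true false)"      # hypothetical step name
d="$(decide 40-deploy-node-agent true false)"

printf '%s\n' "00-access: $a" "40-deploy-node-agent: $b" \
              "30-clone-repo: $c" "40-deploy-node-agent (autonomy on): $d"
```

Note that steps numbered below 10 are never gated on `deploy_autonomy` in this sketch, regardless of what they do.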
|
||||
|
||||
## Step status
|
||||
|
||||
| Step | File | Status | Description |
|------|------|--------|-------------|
| `00-access` | `steps/00-access.sh` | **DONE** | SSH key → `first_contact`, install Tailscale, `tailscale up` (interactive URL), verify `pi@<ts_hostname>` arch=aarch64 |
| `00-preflight` | `steps/00-preflight.sh` | SCAFFOLD | Read-only: collects facts (arch, RAM, docker, swap, MM runtime), prints a report plus a YAML snippet to paste into node.yaml |
| `10-bootstrap-runtime` | `steps/10-bootstrap-runtime.sh` | TODO | Creates the `/opt/homelab/` layout, `chown <ssh_user>` |
| `20-install-docker` | `steps/20-install-docker.sh` | TODO | Installs Docker Engine if `docker_present=false`; skipped when already installed |
| `30-install-tailscale` | `steps/30-install-tailscale.sh` | TODO | Superseded by `00-access` for new nodes; can be used to re-join |
| `40-deploy-node-agent` | `steps/40-deploy-node-agent.sh` | TODO | Deploys node-agent in Docker; user 1000:1000; `mem_limit` from node.yaml |
| `50-verify` | `steps/50-verify.sh` | TODO | End-to-end smoke test: event reached the control plane, visible in the UI, Telegram alert path |
|
||||
|
||||
## lib/ architecture
|
||||
|
||||
```
|
||||
lib/common.sh — log/warn/die/step/dryrun, run(), yaml_get, ensure_line, git() wrapper
|
||||
lib/remote.sh — rrun/rcopy/rsync_dir/rcheck (SSH wrappers, ONBOARD_SSH_USER/HOST)
|
||||
```
|
||||
|
||||
### run() and dry-run
|
||||
|
||||
The orchestrator exports `DRY_RUN=1` to all step scripts.
|
||||
|
||||
```bash
|
||||
# Wrap mutations in run(): in dry-run it prints the intent and does not execute
run ssh-copy-id -i ~/.ssh/id_ed25519.pub pi@192.168.31.19

# State probes (ssh BatchMode test, command -v, status query) ALWAYS run,
# because a dry-run must show a realistic plan based on the current state
if ssh -o BatchMode=yes pi@192.168.31.19 true 2>/dev/null; then
  log "key already present — skip"
fi
|
||||
```
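The convention can be tried locally without any remote host. The sketch below uses a copy of `run()` as defined in `lib/common.sh`; the scratch directory, the `marker` file, and the `probe` helper are hypothetical stand-ins for a real node and a real state probe.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Local copy of run() as defined in lib/common.sh.
run() {
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "[dry-run] would: $*"
  else
    "$@"
  fi
}

scratch="$(mktemp -d)"          # hypothetical scratch dir standing in for a node
marker="${scratch}/marker"

# Read-only probe: never wrapped in run(), so it works in both modes.
probe() { if [ -e "$marker" ]; then echo present; else echo absent; fi; }

DRY_RUN=1
run touch "$marker"             # mutation is only announced
dry_state="$(probe)"

DRY_RUN=0
run touch "$marker"             # mutation is executed
live_state="$(probe)"

echo "after dry-run: ${dry_state}; after live run: ${live_state}"
rm -rf "$scratch"
```

Keeping probes outside `run()` is what lets a dry-run report "already present, skip" for steps that would be no-ops in a live run.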
|
||||
|
||||
### yaml_get: fallback without yq
|
||||
|
||||
When `yq` is not available, a `grep`+`sed` fallback is used. Pitfalls:

- Inline YAML comments (`key: value # comment`) are stripped by
  `s/[[:space:]]\+#.*$//`. This requires at least one whitespace character before `#`, so
  `url#fragment` stays intact.
- The parser is non-greedy on `:` (`s/^[[:space:]]*[^:]*:[[:space:]]*//`), so
  values containing a colon (e.g. `systemd:magicmirror.service`) are read correctly.
- Dot paths (`tailscale.hostname`) work only with `yq`; the fallback matches on the last
  segment (`hostname`). Field names in node.yaml must therefore be unique.
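The three pitfalls can be checked against a throwaway manifest. The sketch below isolates the fallback branch of `yaml_get` from `lib/common.sh` under the hypothetical name `yaml_get_fallback`; the sample file and its `docs` key are made up for illustration, and the `\s` and `\+` patterns assume GNU grep and GNU sed.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Fallback branch of yaml_get, isolated so that yq is never consulted.
yaml_get_fallback() {
  local file="$1" key="$2"
  local leaf="${key##*.}"            # only the last path segment is matched
  grep -E "^\s*${leaf}:" "$file" | head -1 \
    | sed -e 's/^[[:space:]]*[^:]*:[[:space:]]*//' \
          -e 's/[[:space:]]\+#.*$//' \
          -e 's/^[[:space:]]*//' \
          -e 's/[[:space:]]*$//' \
    | tr -d '"' | tr -d "'"
}

sample="$(mktemp)"                   # hypothetical manifest, not a real node.yaml
cat > "$sample" <<'EOF'
ssh_user: pi            # inline comment gets stripped
mm_runtime: systemd:magicmirror.service
docs: http://example.invalid/page#fragment
tailscale:
  hostname: lustro      # found through the leaf name only
EOF

user="$(yaml_get_fallback "$sample" ssh_user)"
runtime="$(yaml_get_fallback "$sample" mm_runtime)"
docs="$(yaml_get_fallback "$sample" docs)"
ts_host="$(yaml_get_fallback "$sample" tailscale.hostname)"

printf '%s\n' "ssh_user=$user" "mm_runtime=$runtime" "docs=$docs" "hostname=$ts_host"
rm -f "$sample"
```

Since `tailscale.hostname` is resolved by the leaf `hostname` alone, a second `hostname:` key anywhere earlier in the file would silently win, which is why field names have to stay unique.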
|
||||
|
||||
## Gotchas / Learnings
|
||||
|
||||
| Problem | Solution |
|---------|----------|
| mDNS `.local` is unreliable | Use an IP in `first_contact`; `.local` is fine interactively, not in automation |
| Existing uid=1000 on an edge node | Use that user; do not create `oskar` (uid collision, breaks MM file ownership) |
| Swap file on the SD card | Migrate to zram to reduce wear; add a step to `10-bootstrap-runtime` |
| dry-run stops at the orchestrator | `run()` wrapper + `export DRY_RUN=1`; probes must also work in dry-run |
| SSH known-hosts warning in parsed output | `-o LogLevel=ERROR` on SSH to a new host in the mesh |
| `yaml_get` drops the value's prefix when the value contains `:` | Non-greedy `^[[:space:]]*[^:]*:` instead of `.*:` |
| `yaml_get` does not strip inline comments | `s/[[:space:]]\+#.*$//` after extracting the value |
| RPi4 with 4 GB RAM: OOM risk | `mem_limit` in the node-agent override is mandatory (same profile as VPS) |
|
||||
|
|
@ -1,84 +0,0 @@
|
|||
#!/usr/bin/env bash
|
||||
# scripts/onboard/lib/common.sh — shared helpers for the onboarding tool
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
# ── colour codes (disabled when not a tty) ──────────────────────────────────
|
||||
if [[ -t 1 ]]; then
|
||||
_C_RESET='\033[0m'
|
||||
_C_GREEN='\033[0;32m'
|
||||
_C_YELLOW='\033[1;33m'
|
||||
_C_RED='\033[0;31m'
|
||||
_C_CYAN='\033[0;36m'
|
||||
_C_BOLD='\033[1m'
|
||||
else
|
||||
_C_RESET='' _C_GREEN='' _C_YELLOW='' _C_RED='' _C_CYAN='' _C_BOLD=''
|
||||
fi
|
||||
|
||||
# ── logging ──────────────────────────────────────────────────────────────────
|
||||
log() { echo -e "${_C_GREEN}[onboard]${_C_RESET} $(date +'%H:%M:%S') ${*}"; }
|
||||
warn() { echo -e "${_C_YELLOW}[WARN]${_C_RESET} $(date +'%H:%M:%S') ${*}" >&2; }
|
||||
die() { echo -e "${_C_RED}[ERROR]${_C_RESET} $(date +'%H:%M:%S') ${*}" >&2; exit 1; }
|
||||
step() { echo -e "${_C_CYAN}${_C_BOLD}==> ${*}${_C_RESET}"; }
|
||||
dryrun() { echo -e "${_C_YELLOW}[dry-run]${_C_RESET} ${*}"; }
|
||||
|
||||
# ── command detection ─────────────────────────────────────────────────────────
|
||||
have_cmd() { command -v "$1" >/dev/null 2>&1; }
|
||||
|
||||
# ── dry-run execution wrapper ─────────────────────────────────────────────────
|
||||
# run CMD [ARGS…] — executes CMD in live mode; prints intent in dry-run.
|
||||
# Wrap MUTATIONS with this. Read-only probes (SSH BatchMode tests, status
|
||||
# queries, command -v checks) must run unconditionally — never wrap them.
|
||||
run() {
|
||||
if [ "${DRY_RUN:-0}" = 1 ]; then
|
||||
echo "[dry-run] would: $*"
|
||||
else
|
||||
"$@"
|
||||
fi
|
||||
}
|
||||
export -f run
|
||||
|
||||
# ── file helpers ──────────────────────────────────────────────────────────────
|
||||
# ensure_line FILE LINE — appends LINE to FILE if it is not already present (idempotent)
|
||||
ensure_line() {
|
||||
local file="$1" line="$2"
|
||||
[[ -f "$file" ]] || touch "$file"
|
||||
grep -qxF "$line" "$file" || echo "$line" >> "$file"
|
||||
}
|
||||
|
||||
# ── node.yaml parsing ─────────────────────────────────────────────────────────
|
||||
# require_node_yaml NODE — sets NODE_YAML; exits if not found
|
||||
require_node_yaml() {
|
||||
local node="$1"
|
||||
NODE_YAML="${REPO_ROOT}/hosts/${node,,}/node.yaml"
|
||||
[[ -f "$NODE_YAML" ]] || die "node.yaml not found: $NODE_YAML"
|
||||
export NODE_YAML
|
||||
}
|
||||
|
||||
# yaml_get NODE_YAML KEY — read a scalar value from a YAML file
|
||||
# Uses yq when available; falls back to grep/sed for simple key: value pairs.
|
||||
# Supports dot-separated paths (e.g. tailscale.hostname) only in yq mode;
|
||||
# the grep fallback handles only the last path component.
|
||||
yaml_get() {
|
||||
local file="$1" key="$2"
|
||||
if have_cmd yq; then
|
||||
yq -r ".${key} // empty" "$file" 2>/dev/null
|
||||
else
|
||||
# fallback: extract last segment of key, match " key: value"
|
||||
# Strip inline YAML comment (space(s)+'#'+rest) and surrounding whitespace.
|
||||
# Pattern uses \+ (BRE one-or-more) so a bare '#' inside a value is preserved.
|
||||
local leaf="${key##*.}"
|
||||
grep -E "^\s*${leaf}:" "$file" | head -1 \
|
||||
| sed -e 's/^[[:space:]]*[^:]*:[[:space:]]*//' \
|
||||
-e 's/[[:space:]]\+#.*$//' \
|
||||
-e 's/^[[:space:]]*//' \
|
||||
-e 's/[[:space:]]*$//' \
|
||||
| tr -d '"' | tr -d "'"
|
||||
fi
|
||||
}
|
||||
|
||||
# ── git wrapper ────────────────────────────────────────────────────────────────
|
||||
# All git calls from onboarding scripts must go through this so --no-pager is
|
||||
# always set and there is no interactive output.
|
||||
git() { command git --no-pager "$@"; }
|
||||
export -f git
|
||||
|
|
@ -1,51 +0,0 @@
|
|||
#!/usr/bin/env bash
|
||||
# scripts/onboard/lib/remote.sh — SSH helpers for remote node operations
|
||||
# Requires: ONBOARD_SSH_USER, ONBOARD_SSH_HOST to be set by the caller.
|
||||
# Inherits: DRY_RUN ("1" = dry-run; "0" or unset = live)
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
: "${ONBOARD_SSH_USER:?remote.sh: ONBOARD_SSH_USER is not set}"
|
||||
: "${ONBOARD_SSH_HOST:?remote.sh: ONBOARD_SSH_HOST is not set}"
|
||||
: "${DRY_RUN:=0}"
|
||||
|
||||
_SSH_OPTS=(
|
||||
-o StrictHostKeyChecking=accept-new
|
||||
-o ConnectTimeout=10
|
||||
-o BatchMode=yes
|
||||
)
|
||||
|
||||
# rrun CMD [ARGS…] — run a command on the remote node via SSH
|
||||
rrun() {
|
||||
if [ "${DRY_RUN:-0}" = 1 ]; then
|
||||
dryrun "ssh ${ONBOARD_SSH_USER}@${ONBOARD_SSH_HOST} -- $*"
|
||||
return 0
|
||||
fi
|
||||
ssh "${_SSH_OPTS[@]}" "${ONBOARD_SSH_USER}@${ONBOARD_SSH_HOST}" -- "$@"
|
||||
}
|
||||
|
||||
# rcopy LOCAL_PATH REMOTE_PATH — copy a file to the remote node via scp
|
||||
rcopy() {
|
||||
local src="$1" dst="$2"
|
||||
if [ "${DRY_RUN:-0}" = 1 ]; then
|
||||
dryrun "scp $src ${ONBOARD_SSH_USER}@${ONBOARD_SSH_HOST}:$dst"
|
||||
return 0
|
||||
fi
|
||||
scp "${_SSH_OPTS[@]}" "$src" "${ONBOARD_SSH_USER}@${ONBOARD_SSH_HOST}:$dst"
|
||||
}
|
||||
|
||||
# rsync_dir LOCAL_DIR REMOTE_DIR [EXTRA_RSYNC_ARGS…]
|
||||
rsync_dir() {
|
||||
local src="$1" dst="$2"
|
||||
shift 2
|
||||
if [ "${DRY_RUN:-0}" = 1 ]; then
|
||||
dryrun "rsync -az $src ${ONBOARD_SSH_USER}@${ONBOARD_SSH_HOST}:$dst"
|
||||
return 0
|
||||
fi
|
||||
rsync -az -e "ssh ${_SSH_OPTS[*]}" "$src" "${ONBOARD_SSH_USER}@${ONBOARD_SSH_HOST}:$dst" "$@"
|
||||
}
|
||||
|
||||
# rcheck — verify SSH connectivity; returns 0 if reachable
|
||||
rcheck() {
|
||||
ssh "${_SSH_OPTS[@]}" -o ConnectTimeout=5 "${ONBOARD_SSH_USER}@${ONBOARD_SSH_HOST}" -- true 2>/dev/null
|
||||
}
|
||||
|
|
@ -1,182 +0,0 @@
|
|||
#!/usr/bin/env bash
|
||||
# scripts/onboard/onboard.sh — node onboarding orchestrator
|
||||
#
|
||||
# Usage:
|
||||
# onboard.sh --node <name> [--step <name>] [--from <step>] [--dry-run]
|
||||
#
|
||||
# Flags:
|
||||
# --node <name> node name matching hosts/<name>/node.yaml (required)
|
||||
# --step <name> run only this step (e.g. 00-preflight)
|
||||
# --from <step> start from this step, run all subsequent steps
|
||||
# --dry-run print what would be done without mutating anything
|
||||
#
|
||||
# Steps run in lexicographic order from scripts/onboard/steps/.
|
||||
# Steps that require deploy_autonomy=true are skipped (with a warning) when
|
||||
# that flag is false in node.yaml. Steps that require git_control=true are
|
||||
# similarly gated.
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
REPO_ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)"
|
||||
STEPS_DIR="${REPO_ROOT}/scripts/onboard/steps"
|
||||
LIB_DIR="${REPO_ROOT}/scripts/onboard/lib"
|
||||
|
||||
# ── load helpers ──────────────────────────────────────────────────────────────
|
||||
# shellcheck source=lib/common.sh
|
||||
source "${LIB_DIR}/common.sh"
|
||||
|
||||
# ── defaults ──────────────────────────────────────────────────────────────────
|
||||
NODE_NAME=""
|
||||
ONLY_STEP=""
|
||||
FROM_STEP=""
|
||||
DRY_RUN=0
|
||||
export DRY_RUN REPO_ROOT
|
||||
|
||||
# ── argument parsing ──────────────────────────────────────────────────────────
|
||||
usage() {
|
||||
cat >&2 <<'EOF'
|
||||
Usage: onboard.sh --node <name> [--step <name>] [--from <step>] [--dry-run]
|
||||
|
||||
--node <name> node name matching hosts/<name>/node.yaml (required)
|
||||
--step <name> run only this single step (e.g. 00-preflight)
|
||||
--from <step> start from this step, continue to end
|
||||
--dry-run no mutations; show what would run
|
||||
|
||||
Examples:
|
||||
onboard.sh --node lustro
|
||||
onboard.sh --node lustro --step 00-preflight
|
||||
onboard.sh --node lustro --from 20-install-docker
|
||||
onboard.sh --node lustro --dry-run
|
||||
EOF
|
||||
exit 1
|
||||
}
|
||||
|
||||
while [[ $# -gt 0 ]]; do
|
||||
case "$1" in
|
||||
--node) NODE_NAME="${2:?--node requires a value}"; shift 2 ;;
|
||||
--step) ONLY_STEP="${2:?--step requires a value}"; shift 2 ;;
|
||||
--from) FROM_STEP="${2:?--from requires a value}"; shift 2 ;;
|
||||
--dry-run) DRY_RUN=1; shift ;;
|
||||
-h|--help) usage ;;
|
||||
*) die "Unknown argument: $1" ;;
|
||||
esac
|
||||
done
|
||||
|
||||
[[ -z "$NODE_NAME" ]] && { warn "--node is required"; usage; }
|
||||
|
||||
export NODE_NAME
|
||||
|
||||
# ── load node.yaml ────────────────────────────────────────────────────────────
|
||||
require_node_yaml "$NODE_NAME"
|
||||
|
||||
log "Loading manifest: $NODE_YAML"
|
||||
|
||||
DEPLOY_AUTONOMY=$(yaml_get "$NODE_YAML" "deploy_autonomy")
|
||||
GIT_CONTROL=$(yaml_get "$NODE_YAML" "git_control")
|
||||
SSH_USER=$(yaml_get "$NODE_YAML" "ssh_user")
|
||||
TS_HOSTNAME=$(yaml_get "$NODE_YAML" "tailscale.hostname")
|
||||
|
||||
DEPLOY_AUTONOMY="${DEPLOY_AUTONOMY:-false}"
|
||||
GIT_CONTROL="${GIT_CONTROL:-false}"
|
||||
|
||||
[[ -z "$SSH_USER" ]] && die "ssh_user not set in $NODE_YAML"
|
||||
[[ -z "$TS_HOSTNAME" ]] && die "tailscale.hostname not set in $NODE_YAML"
|
||||
|
||||
export ONBOARD_SSH_USER="$SSH_USER"
|
||||
export ONBOARD_SSH_HOST="$TS_HOSTNAME"
|
||||
|
||||
log "Node: ${NODE_NAME} | host: ${TS_HOSTNAME} | user: ${SSH_USER}"
|
||||
log "deploy_autonomy=${DEPLOY_AUTONOMY} git_control=${GIT_CONTROL} dry_run=${DRY_RUN}"
|
||||
|
||||
# ── collect steps ─────────────────────────────────────────────────────────────
|
||||
# Steps are NN-name.sh files in lexicographic order.
|
||||
mapfile -t ALL_STEPS < <(find "$STEPS_DIR" -maxdepth 1 -name '[0-9][0-9]-*.sh' | sort)
|
||||
|
||||
if [[ ${#ALL_STEPS[@]} -eq 0 ]]; then
|
||||
die "No steps found in $STEPS_DIR"
|
||||
fi
|
||||
|
||||
# Determine which steps to run based on flags.
|
||||
declare -a STEPS_TO_RUN=()
|
||||
|
||||
for step_path in "${ALL_STEPS[@]}"; do
|
||||
step_file=$(basename "$step_path" .sh)
|
||||
|
||||
if [[ -n "$ONLY_STEP" ]]; then
|
||||
# Exact match on the step name without the .sh suffix (e.g. "00-preflight" selects 00-preflight.sh)
|
||||
[[ "$step_file" == "$ONLY_STEP" ]] || continue
|
||||
elif [[ -n "$FROM_STEP" ]]; then
|
||||
# Skip steps before FROM_STEP
|
||||
[[ "$step_file" < "$FROM_STEP" && "$step_file" != "$FROM_STEP" ]] && continue
|
||||
fi
|
||||
|
||||
STEPS_TO_RUN+=("$step_path")
|
||||
done
|
||||
|
||||
if [[ ${#STEPS_TO_RUN[@]} -eq 0 ]]; then
|
||||
die "No matching steps found (--step='${ONLY_STEP}' --from='${FROM_STEP}')"
|
||||
fi
|
||||
|
||||
log "Steps to run (${#STEPS_TO_RUN[@]}):"
|
||||
for s in "${STEPS_TO_RUN[@]}"; do
|
||||
printf " %s\n" "$(basename "$s")"
|
||||
done
|
||||
echo ""
|
||||
|
||||
# ── step execution loop ───────────────────────────────────────────────────────
|
||||
# Steps numbered 10+ are "mutating" and require deploy_autonomy=true.
# Steps whose name contains git/repo/clone require git_control=true, whatever their number.
# Steps numbered below 10 (00-access, 00-preflight) are never gated on deploy_autonomy.
|
||||
|
||||
_step_needs_autonomy() {
|
||||
local num="${1%%[^0-9]*}" # leading digits
|
||||
[[ "$num" -ge 10 ]] 2>/dev/null
|
||||
}
|
||||
|
||||
_step_needs_git_control() {
|
||||
local name="$1"
|
||||
[[ "$name" == *"git"* || "$name" == *"repo"* || "$name" == *"clone"* ]]
|
||||
}
|
||||
|
||||
FAILED_STEPS=()
|
||||
|
||||
for step_path in "${STEPS_TO_RUN[@]}"; do
|
||||
step_file=$(basename "$step_path" .sh)
|
||||
step_num="${step_file%%[^0-9]*}"
|
||||
|
||||
# autonomy gate
|
||||
if _step_needs_autonomy "$step_num" && [[ "$DEPLOY_AUTONOMY" != "true" ]]; then
|
||||
warn "Skipping $step_file — deploy_autonomy=false in $NODE_YAML"
|
||||
warn "Run this step manually or set deploy_autonomy: true"
|
||||
continue
|
||||
fi
|
||||
|
||||
# git_control gate
|
||||
if _step_needs_git_control "$step_file" && [[ "$GIT_CONTROL" != "true" ]]; then
|
||||
warn "Skipping $step_file — git_control=false in $NODE_YAML"
|
||||
continue
|
||||
fi
|
||||
|
||||
step "Running: $step_file"
|
||||
|
||||
if bash "$step_path"; then
|
||||
log "$step_file — OK"
|
||||
else
|
||||
rc=$?
|
||||
warn "$step_file — FAILED (exit $rc)"
|
||||
FAILED_STEPS+=("$step_file")
|
||||
fi
|
||||
|
||||
echo ""
|
||||
done
|
||||
|
||||
# ── summary ───────────────────────────────────────────────────────────────────
|
||||
if [[ ${#FAILED_STEPS[@]} -gt 0 ]]; then
|
||||
die "Onboarding finished with failures: ${FAILED_STEPS[*]}"
|
||||
fi
|
||||
|
||||
if [ "${DRY_RUN:-0}" = 1 ]; then
|
||||
log "Dry-run complete — no mutations performed."
|
||||
else
|
||||
log "All steps completed successfully for node ${NODE_NAME}."
|
||||
fi
|
||||
|
|
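The numeric-prefix gate in the orchestrator above can be illustrated in isolation. This is a minimal sketch, assuming two-digit `NN-name` step names as in the steps directory; `step_needs_autonomy_sketch` is a hypothetical name used only here, not a function from the repository.

```bash
#!/usr/bin/env bash
# Hypothetical helper (illustration only): succeeds when a step name such as
# "10-bootstrap-runtime" carries a numeric prefix of 10 or more.
step_needs_autonomy_sketch() {
  local num="${1%%[^0-9]*}"     # leading digits, e.g. "08" from "08-example"
  [[ -n "$num" ]] || return 1   # no numeric prefix: treat the step as ungated
  (( 10#$num >= 10 ))           # 10# forces base 10, so "08" is not read as octal
}

for name in 00-preflight 08-example 10-bootstrap-runtime 30-node-agent; do
  if step_needs_autonomy_sketch "$name"; then
    echo "$name: requires deploy_autonomy=true"
  else
    echo "$name: always allowed"
  fi
done
```

Forcing base 10 matters because bash arithmetic treats a leading zero as an octal marker, which makes `08` and `09` invalid numbers rather than small ones.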
@ -1,156 +0,0 @@
|
|||
#!/usr/bin/env bash
|
||||
# scripts/onboard/steps/00-access.sh — establish remote access channel
|
||||
#
|
||||
# Stages:
|
||||
# 1. ensure_ssh_key — copy SATURN public key to first_contact (idempotent)
|
||||
# 2. ensure_tailscale — install Tailscale and join network (interactive auth URL)
|
||||
# 3. verify — confirm SSH over Tailscale, assert arch=aarch64
|
||||
#
|
||||
# Dry-run convention (DRY_RUN=1):
|
||||
# - Read-only probes (SSH BatchMode test, tailscale status, command -v) run ALWAYS
|
||||
# so the plan reflects real current state ("key present → skip" vs "would: install")
|
||||
# - Mutations (ssh-copy-id, curl installer, tailscale up) are wrapped with run()
|
||||
#
|
||||
# Does NOT configure NOPASSWD or /opt/homelab — those are later steps.
|
||||
# pi user on Raspberry Pi OS has passwordless sudo — required for `tailscale up`.
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
STEP_NAME="00-access"
|
||||
|
||||
: "${REPO_ROOT:?REPO_ROOT is not set — run via onboard.sh}"
|
||||
: "${NODE_YAML:?NODE_YAML is not set — run via onboard.sh}"
|
||||
: "${DRY_RUN:=0}"
|
||||
|
||||
# Source common.sh when run standalone (orchestrator sources it before calling steps)
|
||||
if ! declare -f log >/dev/null 2>&1; then
|
||||
# shellcheck source=../lib/common.sh
|
||||
source "${REPO_ROOT}/scripts/onboard/lib/common.sh"
|
||||
fi
|
||||
|
||||
# ── parse node.yaml ───────────────────────────────────────────────────────────
|
||||
FIRST_CONTACT=$(yaml_get "$NODE_YAML" "first_contact")
|
||||
TS_HOSTNAME=$(yaml_get "$NODE_YAML" "tailscale.hostname")
|
||||
|
||||
[[ -z "$FIRST_CONTACT" ]] && die "first_contact not set in $NODE_YAML"
|
||||
[[ -z "$TS_HOSTNAME" ]] && die "tailscale.hostname not set in $NODE_YAML"
|
||||
|
||||
FC_USER="${FIRST_CONTACT%%@*}"
|
||||
|
||||
# ONBOARD_SSH_USER/HOST set by orchestrator to post-Tailscale coordinates;
|
||||
# fall back to first_contact for standalone invocation.
|
||||
export ONBOARD_SSH_USER="${ONBOARD_SSH_USER:-${FC_USER}}"
|
||||
export ONBOARD_SSH_HOST="${ONBOARD_SSH_HOST:-${TS_HOSTNAME}}"
|
||||
|
||||
# shellcheck source=../lib/remote.sh
|
||||
source "${REPO_ROOT}/scripts/onboard/lib/remote.sh"
|
||||
|
||||
# ── SSH option arrays ─────────────────────────────────────────────────────────
|
||||
# No BatchMode — used for ssh-copy-id where a password prompt may appear
|
||||
_FC_SSH_NOKEY=(-o StrictHostKeyChecking=accept-new -o ConnectTimeout=10)
|
||||
# BatchMode — used for all probes and post-key-install operations
|
||||
_FC_SSH=(-o StrictHostKeyChecking=accept-new -o ConnectTimeout=10 -o BatchMode=yes)
|
||||
# Tailscale verify — LogLevel=ERROR suppresses the "Permanently added" known-hosts
|
||||
# INFO message that would otherwise leak into captured stdout on first connection
|
||||
_TS_SSH=(-o StrictHostKeyChecking=accept-new -o ConnectTimeout=10 -o BatchMode=yes -o LogLevel=ERROR)
|
||||
|
||||
# ── tailscale state probe helper ──────────────────────────────────────────────
|
||||
# Always runs; returns BackendState or "unknown" on any SSH/parse failure.
|
||||
_ts_state() {
|
||||
ssh "${_FC_SSH[@]}" "$FIRST_CONTACT" \
|
||||
'tailscale status --json 2>/dev/null | python3 -c \
|
||||
"import sys,json; print(json.load(sys.stdin).get(\"BackendState\",\"unknown\"))" \
|
||||
2>/dev/null || echo "unknown"' 2>/dev/null || echo "unknown"
|
||||
}
|
||||
|
||||
# ═══════════════════════════════════════════════════════════════════════════════
|
||||
# Stage 1 — ensure_ssh_key
|
||||
# ═══════════════════════════════════════════════════════════════════════════════
|
||||
step "[$STEP_NAME] 1/3 ensure_ssh_key → ${FIRST_CONTACT}"
|
||||
|
||||
# Probe: test key-based auth — always runs so dry-run reports real current state
|
||||
if ssh "${_FC_SSH[@]}" "$FIRST_CONTACT" true 2>/dev/null; then
|
||||
log "SSH key already accepted by ${FIRST_CONTACT} — skip"
|
||||
else
|
||||
pubkeys=( "$HOME"/.ssh/id_*.pub )
|
||||
[[ -f "${pubkeys[0]}" ]] || die "No public key found at ~/.ssh/id_*.pub on SATURN"
|
||||
|
||||
log "Key not yet installed on ${FIRST_CONTACT} (password prompt expected)"
|
||||
# Mutation: install public key
|
||||
run ssh-copy-id \
|
||||
"${_FC_SSH_NOKEY[@]}" \
|
||||
-i "${pubkeys[0]}" \
|
||||
"$FIRST_CONTACT"
|
||||
# Probe: verify key was installed (run() is a no-op in dry-run so this
|
||||
# prints "would:" — avoids a false-failure after a skipped ssh-copy-id)
|
||||
run ssh "${_FC_SSH[@]}" "$FIRST_CONTACT" true
|
||||
log "Key installed and verified"
|
||||
fi
|
||||
|
||||
# ═══════════════════════════════════════════════════════════════════════════════
|
||||
# Stage 2 — ensure_tailscale
|
||||
# ═══════════════════════════════════════════════════════════════════════════════
|
||||
step "[$STEP_NAME] 2/3 ensure_tailscale on ${FIRST_CONTACT} → hostname=${TS_HOSTNAME}"
|
||||
|
||||
# Probe: check if tailscale binary present — always runs.
|
||||
# SSH auth failure (key not yet installed in dry-run) falls through to the
|
||||
# "not found" branch, which is correct for a fresh node.
|
||||
if ! ssh "${_FC_SSH[@]}" "$FIRST_CONTACT" 'command -v tailscale' >/dev/null 2>&1; then
|
||||
log "Tailscale not found on ${FIRST_CONTACT}"
|
||||
# Mutation: install tailscale
|
||||
run ssh "${_FC_SSH[@]}" "$FIRST_CONTACT" \
|
||||
'curl -fsSL https://tailscale.com/install.sh | sh'
|
||||
else
|
||||
log "Tailscale already installed on ${FIRST_CONTACT}"
|
||||
fi
|
||||
|
||||
# Probe: check backend state — always runs
|
||||
ts_state=$(_ts_state)
|
||||
if [[ "$ts_state" == "Running" ]]; then
|
||||
log "Tailscale already active (BackendState=Running) — skip"
|
||||
else
|
||||
warn "Tailscale BackendState=${ts_state} — joining network..."
|
||||
echo ""
|
||||
echo -e "${_C_BOLD}┌─────────────────────────────────────────────────────────────┐"
|
||||
echo -e "│ ACTION REQUIRED: open the URL below in your browser to │"
|
||||
echo -e "│ authorize ${TS_HOSTNAME} in your Tailscale account. │"
|
||||
echo -e "└─────────────────────────────────────────────────────────────┘${_C_RESET}"
|
||||
echo ""
|
||||
# Mutation: tailscale up — blocks until user authenticates via printed URL
|
||||
run ssh "${_FC_SSH[@]}" "$FIRST_CONTACT" "sudo tailscale up --hostname=${TS_HOSTNAME}"
|
||||
echo ""
|
||||
|
||||
# Post-join state check — only meaningful after the mutation actually ran
|
||||
if [ "${DRY_RUN:-0}" != 1 ]; then
|
||||
ts_state2=$(_ts_state)
|
||||
[[ "$ts_state2" == "Running" ]] \
|
||||
|| die "Tailscale still not active after tailscale up (BackendState=${ts_state2})"
|
||||
log "Tailscale joined successfully (BackendState=Running)"
|
||||
fi
|
||||
fi
|
||||
|
||||
# ═══════════════════════════════════════════════════════════════════════════════
|
||||
# Stage 3 — verify over Tailscale
|
||||
# ═══════════════════════════════════════════════════════════════════════════════
|
||||
step "[$STEP_NAME] 3/3 verify SSH over Tailscale → ${ONBOARD_SSH_USER}@${TS_HOSTNAME}"
|
||||
|
||||
# Probe: always runs — on a node already joined this works even in dry-run.
|
||||
# On a fresh node in dry-run mode Tailscale isn't set up yet, so SSH will fail;
|
||||
# that is reported as a warning (not a fatal error) to keep dry-run informative.
|
||||
# stderr is NOT merged (no 2>&1) — _TS_SSH uses LogLevel=ERROR so the
|
||||
# "Permanently added … to known hosts" INFO message is suppressed at source.
|
||||
if arch=$(ssh "${_TS_SSH[@]}" "${ONBOARD_SSH_USER}@${TS_HOSTNAME}" 'uname -m'); then
|
||||
# Take the last non-empty stdout line to skip any unexpected preamble
|
||||
arch=$(printf '%s' "$arch" | grep -v '^[[:space:]]*$' | tail -1 | tr -d '[:space:]' || true)
|
||||
if [[ "$arch" == "aarch64" ]]; then
|
||||
log "Verify OK: ${ONBOARD_SSH_USER}@${TS_HOSTNAME} reachable, arch=${arch}"
|
||||
else
|
||||
msg="Unexpected arch '${arch}' on ${TS_HOSTNAME} — expected aarch64"
|
||||
[ "${DRY_RUN:-0}" = 1 ] && warn "$msg" || die "$msg"
|
||||
fi
|
||||
else
|
||||
msg="Verify SSH to ${ONBOARD_SSH_USER}@${TS_HOSTNAME} failed (Tailscale not yet joined?)"
|
||||
[ "${DRY_RUN:-0}" = 1 ] && warn "$msg" || die "$msg"
|
||||
fi
|
||||
|
||||
log "[$STEP_NAME] done"
|
||||
|
|
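The dry-run convention described in the 00-access.sh header (probes always run, mutations go through `run()`) can be sketched as follows. The real `run()` lives in `lib/common.sh`, which is not shown in this diff, so `run_sketch` below is a hypothetical stand-in that only mirrors the documented behaviour of printing a `would:` line.

```bash
#!/usr/bin/env bash
# Hypothetical stand-in for run() from lib/common.sh (assumed behaviour only).
run_sketch() {
  if [ "${DRY_RUN:-0}" = 1 ]; then
    printf 'would: %s\n' "$*"   # report the mutation instead of performing it
  else
    "$@"                        # execute the command with its arguments intact
  fi
}

DRY_RUN=1 run_sketch rm -f ./example.tmp     # prints the plan, deletes nothing
DRY_RUN=0 run_sketch echo "mutation executed"
```

Keeping probes outside the wrapper is what lets a dry run report the node's real state, for example "key present, skip" rather than a blanket list of planned actions.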
@ -1,144 +0,0 @@
|
|||
#!/usr/bin/env bash
|
||||
# scripts/onboard/steps/00-preflight.sh — READ-ONLY remote node discovery
|
||||
#
|
||||
# Collects facts from the remote node and prints:
|
||||
# 1. A human-readable report block
|
||||
# 2. A machine-readable YAML snippet ready to paste into hosts/<node>/node.yaml
|
||||
#
|
||||
# NO mutations are performed on the remote host.
|
||||
# Depends on: lib/common.sh (sourced by orchestrator), lib/remote.sh (sourced here)
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
STEP_NAME="00-preflight"
|
||||
|
||||
# remote.sh is sourced here so individual steps can also be run standalone
|
||||
# (when REPO_ROOT is in the environment).
|
||||
: "${REPO_ROOT:?REPO_ROOT is not set — run via onboard.sh}"
|
||||
# shellcheck source=../lib/remote.sh
|
||||
source "${REPO_ROOT}/scripts/onboard/lib/remote.sh"
|
||||
|
||||
step "[$STEP_NAME] Collecting facts from ${ONBOARD_SSH_USER}@${ONBOARD_SSH_HOST} (read-only)"
|
||||
|
||||
# ── gather all facts in a single SSH session ──────────────────────────────────
|
||||
raw=$(rrun bash -s <<'REMOTE'
|
||||
set -euo pipefail
|
||||
|
||||
# arch / bitness
|
||||
arch=$(uname -m)
|
||||
bits=$(getconf LONG_BIT)
|
||||
|
||||
# RAM (kB → MB)
|
||||
mem_kb=$(grep MemTotal /proc/meminfo | awk '{print $2}')
|
||||
mem_mb=$(( mem_kb / 1024 ))
|
||||
|
||||
# disk root
|
||||
disk_root=$(df -h / | awk 'NR==2{print $2" total, "$3" used, "$4" free ("$5" used)"}')
|
||||
|
||||
# docker
|
||||
docker_present=false
|
||||
docker_info=""
|
||||
if command -v docker >/dev/null 2>&1; then
|
||||
docker_present=true
|
||||
docker_info=$(docker info --format '{{.ServerVersion}}' 2>/dev/null || echo "unknown")
|
||||
fi
|
||||
|
||||
# tailscale
|
||||
tailscale_present=false
|
||||
tailscale_status=""
|
||||
if command -v tailscale >/dev/null 2>&1; then
|
||||
tailscale_present=true
|
||||
tailscale_status=$(tailscale status --json 2>/dev/null | python3 -c "import sys,json; d=json.load(sys.stdin); print(d.get('BackendState','unknown'))" 2>/dev/null || tailscale status 2>/dev/null | head -1 || echo "unknown")
|
||||
fi
|
||||
|
||||
# Magic Mirror runtime detection
|
||||
mm_runtime="none"
|
||||
if systemctl is-active --quiet MagicMirror 2>/dev/null || systemctl is-active --quiet magicmirror 2>/dev/null; then
|
||||
mm_runtime="systemd"
|
||||
elif command -v pm2 >/dev/null 2>&1 && pm2 list 2>/dev/null | grep -qi "MagicMirror"; then
|
||||
mm_runtime="pm2"
|
||||
elif pgrep -fa "MagicMirror" >/dev/null 2>&1; then
|
||||
mm_runtime="process"
|
||||
fi
|
||||
|
||||
# swap
|
||||
swap_current="none"
|
||||
if command -v swapon >/dev/null 2>&1; then
|
||||
swap_lines=$(swapon --show --noheadings 2>/dev/null || true)
|
||||
if [[ -n "$swap_lines" ]]; then
|
||||
swap_current="$swap_lines"
|
||||
fi
|
||||
fi
|
||||
if command -v zramctl >/dev/null 2>&1; then
|
||||
zram_lines=$(zramctl --noheadings 2>/dev/null || true)
|
||||
if [[ -n "$zram_lines" ]]; then [[ "$swap_current" == "none" ]] && swap_current=""; swap_current="${swap_current:+$swap_current; }zram: $zram_lines"; fi
|
||||
fi
|
||||
|
||||
# hostname / os
|
||||
hostname=$(hostname -f 2>/dev/null || hostname)
|
||||
os_pretty=$(grep PRETTY_NAME /etc/os-release 2>/dev/null | cut -d= -f2 | tr -d '"' || echo "unknown")
|
||||
|
||||
cat <<EOF
|
||||
ARCH=$arch
|
||||
BITS=$bits
|
||||
MEM_MB=$mem_mb
|
||||
DISK_ROOT=$disk_root
|
||||
DOCKER_PRESENT=$docker_present
|
||||
DOCKER_VERSION=$docker_info
|
||||
TAILSCALE_PRESENT=$tailscale_present
|
||||
TAILSCALE_STATUS=$tailscale_status
|
||||
MM_RUNTIME=$mm_runtime
|
||||
SWAP_CURRENT=$swap_current
|
||||
HOSTNAME=$hostname
|
||||
OS=$os_pretty
|
||||
EOF
|
||||
REMOTE
|
||||
)
|
||||
|
||||
# ── parse key=value output ────────────────────────────────────────────────────
|
||||
_val() { echo "$raw" | grep "^${1}=" | head -1 | cut -d= -f2-; }
|
||||
|
||||
arch=$(_val ARCH)
|
||||
bits=$(_val BITS)
|
||||
mem_mb=$(_val MEM_MB)
|
||||
disk_root=$(_val DISK_ROOT)
|
||||
docker_present=$(_val DOCKER_PRESENT)
|
||||
docker_version=$(_val DOCKER_VERSION)
|
||||
tailscale_present=$(_val TAILSCALE_PRESENT)
|
||||
tailscale_status=$(_val TAILSCALE_STATUS)
|
||||
mm_runtime=$(_val MM_RUNTIME)
|
||||
swap_current=$(_val SWAP_CURRENT)
|
||||
remote_hostname=$(_val HOSTNAME)
|
||||
os_pretty=$(_val OS)
|
||||
|
||||
# ── human-readable report ─────────────────────────────────────────────────────
|
||||
echo ""
|
||||
echo "┌─────────────────────────────────────────────────────┐"
|
||||
printf "│ Preflight report: %-33s│\n" "${ONBOARD_SSH_HOST}"
|
||||
echo "├─────────────────────────────────────────────────────┤"
|
||||
printf "│ hostname : %-35s│\n" "$remote_hostname"
|
||||
printf "│ OS : %-35s│\n" "$os_pretty"
|
||||
printf "│ arch : %-35s│\n" "${arch} (${bits}-bit)"
|
||||
printf "│ RAM : %-35s│\n" "${mem_mb} MB"
|
||||
printf "│ disk / : %-35s│\n" "$disk_root"
|
||||
printf "│ docker : %-35s│\n" "${docker_present} (v${docker_version})"
|
||||
printf "│ tailscale : %-35s│\n" "${tailscale_present} / ${tailscale_status}"
|
||||
printf "│ MagicMirror : %-35s│\n" "$mm_runtime"
|
||||
printf "│ swap : %-35s│\n" "${swap_current:-none}"
|
||||
echo "└─────────────────────────────────────────────────────┘"
|
||||
echo ""
|
||||
|
||||
# ── machine-readable YAML snippet ────────────────────────────────────────────
|
||||
echo "# ── paste into hosts/${NODE_NAME,,}/node.yaml ──"
|
||||
cat <<YAML
|
||||
hardware:
|
||||
arch: ${arch}
|
||||
ram_mb: ${mem_mb}
|
||||
swap: "${swap_current:-none}"
|
||||
docker_present: ${docker_present}
|
||||
docker_version: "${docker_version}"
|
||||
tailscale_status: "${tailscale_status}"
|
||||
mm_runtime: ${mm_runtime}
|
||||
YAML
|
||||
|
||||
log "[$STEP_NAME] done — no changes made to remote host"
|
||||
|
|
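The KEY=value hand-off used by 00-preflight.sh (one remote session prints facts, the local side extracts them by key) can be tried locally. This is a sketch under assumptions: the payload values are made up, and `val_sketch` is a hypothetical copy of the `_val` idea rather than the script's own function.

```bash
#!/usr/bin/env bash
# Sample payload standing in for the remote output; the values are invented.
raw=$'ARCH=aarch64\nMEM_MB=3794\nOS=Debian GNU/Linux 12 (bookworm) arch=arm64'

# Print the value of the first line whose key matches exactly.
val_sketch() {
  printf '%s\n' "$raw" | grep "^${1}=" | head -1 | cut -d= -f2-
}

echo "arch: $(val_sketch ARCH)"
echo "os:   $(val_sketch OS)"          # cut -f2- keeps later "=" characters in the value
echo "none: '$(val_sketch MISSING)'"   # an unknown key yields an empty string
```

Note that this format carries one line per key, so a multi-line fact (several swap devices, for instance) would need to be flattened before it is printed.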
@ -1,14 +0,0 @@
|
|||
#!/usr/bin/env bash
|
||||
# scripts/onboard/steps/10-bootstrap-runtime.sh — create /opt/homelab layout on remote node
|
||||
#
|
||||
# TODO: create /opt/homelab/{data,config,logs,state,events,world,actions/{pending,approved,running,completed,failed}}
|
||||
# TODO: set ownership to ssh_user (from node.yaml)
|
||||
# TODO: write /opt/homelab/state/node_name from node.yaml name field
|
||||
# TODO: idempotent — skip dirs that already exist
|
||||
|
||||
set -euo pipefail
|
||||
: "${REPO_ROOT:?REPO_ROOT is not set — run via onboard.sh}"
|
||||
source "${REPO_ROOT}/scripts/onboard/lib/remote.sh"
|
||||
|
||||
STEP_NAME="10-bootstrap-runtime"
|
||||
step "[$STEP_NAME] TODO — not yet implemented"
|
||||
|
|
@ -1,152 +0,0 @@
|
|||
#!/usr/bin/env bash
|
||||
# scripts/onboard/steps/20-base.sh — base system configuration for LUSTRO
|
||||
#
|
||||
# Stages:
|
||||
# 1. swap→zram — disable dphys-swapfile, install + configure zram-tools
|
||||
# 2. /opt/homelab — create base directory, chown <ssh_user>:<ssh_user>
|
||||
# 3. event dir — create /opt/homelab/events/<ts_hostname>, chown -R
|
||||
#
|
||||
# Dry-run convention:
|
||||
# - Probes (state queries) run unconditionally — dry-run reflects real state
|
||||
# - Mutations use rrun() which skips execution when DRY_RUN=1
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
STEP_NAME="20-base"
|
||||
|
||||
: "${REPO_ROOT:?REPO_ROOT is not set — run via onboard.sh}"
|
||||
: "${NODE_YAML:?NODE_YAML is not set — run via onboard.sh}"
|
||||
: "${DRY_RUN:=0}"
|
||||
|
||||
# Source common.sh when run standalone (orchestrator sources it before calling steps)
|
||||
if ! declare -f log >/dev/null 2>&1; then
|
||||
# shellcheck source=../lib/common.sh
|
||||
source "${REPO_ROOT}/scripts/onboard/lib/common.sh"
|
||||
fi
|
||||
|
||||
# ── parse node.yaml ───────────────────────────────────────────────────────────
|
||||
SSH_USER=$(yaml_get "$NODE_YAML" "ssh_user")
|
||||
TS_HOSTNAME=$(yaml_get "$NODE_YAML" "tailscale.hostname")
|
||||
|
||||
[[ -z "$SSH_USER" ]] && die "ssh_user not set in $NODE_YAML"
|
||||
[[ -z "$TS_HOSTNAME" ]] && die "tailscale.hostname not set in $NODE_YAML"
|
||||
|
||||
export ONBOARD_SSH_USER="${ONBOARD_SSH_USER:-${SSH_USER}}"
|
||||
export ONBOARD_SSH_HOST="${ONBOARD_SSH_HOST:-${TS_HOSTNAME}}"
|
||||
|
||||
# shellcheck source=../lib/remote.sh
|
||||
source "${REPO_ROOT}/scripts/onboard/lib/remote.sh"
|
||||
|
||||
# ── rprobe: read-only remote probe — always runs, even in dry-run ─────────────
|
||||
rprobe() {
|
||||
ssh "${_SSH_OPTS[@]}" "${ONBOARD_SSH_USER}@${ONBOARD_SSH_HOST}" -- "$@"
|
||||
}
|
||||
|
||||
# ═══════════════════════════════════════════════════════════════════════════════
|
||||
# Stage 1 — swap→zram
|
||||
# ═══════════════════════════════════════════════════════════════════════════════
|
||||
step "[$STEP_NAME] 1/3 swap→zram (PERCENT=50, algo=zstd)"
|
||||
|
||||
# Guard by EFFECT: zram device present in swapon AND dphys-swapfile not active
|
||||
# → desired end-state already reached, skip the whole stage.
|
||||
_zram_active=0
|
||||
_dphys_active=0
|
||||
rprobe 'sudo swapon --show 2>/dev/null | grep -q /dev/zram' && _zram_active=1 || true
|
||||
rprobe 'systemctl is-active dphys-swapfile' >/dev/null 2>&1 && _dphys_active=1 || true
|
||||
|
||||
if [[ "$_zram_active" -eq 1 && "$_dphys_active" -eq 0 ]]; then
|
||||
log "zram already active, dphys-swapfile not active — skip"
|
||||
else
|
||||
# Substage: disable dphys-swapfile if still active
|
||||
if [[ "$_dphys_active" -eq 1 ]]; then
|
||||
log "dphys-swapfile active — disabling"
|
||||
rrun sudo dphys-swapfile swapoff
|
||||
rrun sudo systemctl disable --now dphys-swapfile
|
||||
if rprobe '[ -f /var/swap ]' 2>/dev/null; then
|
||||
rrun sudo rm -f /var/swap
|
||||
log "Removed /var/swap"
|
||||
fi
|
||||
else
|
||||
log "dphys-swapfile not active — skip disable"
|
||||
fi
|
||||
|
||||
# Substage: install zram-tools if package not present
|
||||
# Use dpkg -l rather than command -v: zramswap binary may not be on PATH over SSH
|
||||
if ! rprobe 'dpkg -l zram-tools 2>/dev/null | grep -q "^ii"' 2>/dev/null; then
|
||||
log "zram-tools not installed — installing"
|
||||
rrun sudo apt-get install -y zram-tools
|
||||
else
|
||||
log "zram-tools already installed"
|
||||
fi
|
||||
|
||||
# Write config and (re)start zramswap
|
||||
log "Writing /etc/default/zramswap (ALGO=zstd, PERCENT=50)"
|
||||
rrun bash -c "printf '%s\n' 'ALGO=zstd' 'PERCENT=50' | sudo tee /etc/default/zramswap > /dev/null"
|
||||
rrun sudo systemctl enable zramswap
|
||||
rrun sudo systemctl restart zramswap
|
||||
fi
|
||||
|
||||
# Verify (skipped in dry-run — mutations may not have run)
|
||||
if [ "${DRY_RUN:-0}" != 1 ]; then
|
||||
if rprobe 'sudo swapon --show 2>/dev/null | grep -q /dev/zram'; then
|
||||
log "Verify OK: zram swap active"
|
||||
rprobe 'sudo swapon --show' || true
|
||||
else
|
||||
die "zram swap not active after setup — check: systemctl status zramswap on ${TS_HOSTNAME}"
|
||||
fi
|
||||
if rprobe 'systemctl is-active dphys-swapfile' >/dev/null 2>&1; then
|
||||
warn "dphys-swapfile still reports active — manual inspection needed"
|
||||
else
|
||||
log "Verify OK: dphys-swapfile not active"
|
||||
fi
|
||||
fi
|
||||
|
||||
# ═══════════════════════════════════════════════════════════════════════════════
|
||||
# Stage 2 — /opt/homelab
|
||||
# ═══════════════════════════════════════════════════════════════════════════════
|
||||
step "[$STEP_NAME] 2/3 /opt/homelab (owner: ${SSH_USER}:${SSH_USER})"
|
||||
|
||||
# Guard: exists AND owned by SSH_USER?
|
||||
_dir_ok=0
|
||||
if rprobe '[ -d /opt/homelab ]' 2>/dev/null; then
|
||||
_owner=$(rprobe "stat -c '%U' /opt/homelab" 2>/dev/null || echo "")
|
||||
if [[ "$_owner" == "$SSH_USER" ]]; then
|
||||
_dir_ok=1
|
||||
log "/opt/homelab exists, owner=${SSH_USER} — skip"
|
||||
else
|
||||
log "/opt/homelab exists but owner='${_owner}' — fixing"
|
||||
fi
|
||||
else
|
||||
log "/opt/homelab missing — creating"
|
||||
fi
|
||||
|
||||
if [[ "$_dir_ok" -eq 0 ]]; then
|
||||
rrun sudo mkdir -p /opt/homelab
|
||||
rrun sudo chown "${SSH_USER}:${SSH_USER}" /opt/homelab
|
||||
fi
|
||||
|
||||
# ═══════════════════════════════════════════════════════════════════════════════
|
||||
# Stage 3 — event dir
|
||||
# ═══════════════════════════════════════════════════════════════════════════════
|
||||
step "[$STEP_NAME] 3/3 event dir (/opt/homelab/events/${TS_HOSTNAME})"
|
||||
|
||||
# Guard: event subdir exists AND /opt/homelab/events owned by SSH_USER?
|
||||
_evdir_ok=0
|
||||
if rprobe "[ -d /opt/homelab/events/${TS_HOSTNAME} ]" 2>/dev/null; then
|
||||
_ev_owner=$(rprobe "stat -c '%U' /opt/homelab/events" 2>/dev/null || echo "")
|
||||
if [[ "$_ev_owner" == "$SSH_USER" ]]; then
|
||||
_evdir_ok=1
|
||||
log "/opt/homelab/events/${TS_HOSTNAME} exists, owner=${SSH_USER} — skip"
|
||||
else
|
||||
log "/opt/homelab/events exists but owner='${_ev_owner}' — fixing"
|
||||
fi
|
||||
else
|
||||
log "/opt/homelab/events/${TS_HOSTNAME} missing — creating"
|
||||
fi
|
||||
|
||||
if [[ "$_evdir_ok" -eq 0 ]]; then
|
||||
rrun sudo mkdir -p "/opt/homelab/events/${TS_HOSTNAME}"
|
||||
rrun sudo chown -R "${SSH_USER}:${SSH_USER}" /opt/homelab/events
|
||||
fi
|
||||
|
||||
log "[$STEP_NAME] done"
|
||||
|
|
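The "guard by effect" pattern in 20-base.sh (probe the end state first, mutate only when needed, and let dry-run print the plan) can be shown with a local directory instead of a remote host. `ensure_dir_sketch` is a hypothetical helper for illustration; the real script uses `rprobe` and `rrun` over SSH.

```bash
#!/usr/bin/env bash
# Hypothetical local analogue of the remote probe/mutate pattern.
ensure_dir_sketch() {
  local dir="$1"
  if [ -d "$dir" ]; then           # probe: read-only, runs even in dry-run
    echo "exists: skip"
    return 0
  fi
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "would: mkdir -p $dir"    # plan derived from the real current state
  else
    mkdir -p "$dir"                # mutation: only when the probe says it is needed
    echo "created"
  fi
}

base=$(mktemp -d)
DRY_RUN=1 ensure_dir_sketch "$base/events"   # reports the plan only
DRY_RUN=0 ensure_dir_sketch "$base/events"   # creates the directory
DRY_RUN=0 ensure_dir_sketch "$base/events"   # second run is a no-op
rm -rf "$base"
```

Because the guard tests the effect rather than a "did I run before" marker, re-running the step after a partial failure picks up exactly where the state diverges.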
@ -1,16 +0,0 @@
|
|||
#!/usr/bin/env bash
|
||||
# scripts/onboard/steps/20-install-docker.sh — install Docker Engine on remote node
|
||||
#
|
||||
# TODO: skip if docker already present (check from 00-preflight facts or live rrun)
|
||||
# TODO: detect distro (Debian/Ubuntu/Raspberry Pi OS) and use appropriate apt repo
|
||||
# TODO: install docker-ce, docker-ce-cli, containerd.io
|
||||
# TODO: add ssh_user to docker group
|
||||
# TODO: enable + start docker.service
|
||||
# TODO: gate on deploy_autonomy=true in node.yaml (skip step if false, warn operator)
|
||||
|
||||
set -euo pipefail
|
||||
: "${REPO_ROOT:?REPO_ROOT is not set — run via onboard.sh}"
|
||||
source "${REPO_ROOT}/scripts/onboard/lib/remote.sh"
|
||||
|
||||
STEP_NAME="20-install-docker"
|
||||
step "[$STEP_NAME] TODO — not yet implemented"
|
||||
|
|
@ -1,16 +0,0 @@
|
|||
#!/usr/bin/env bash
|
||||
# scripts/onboard/steps/30-install-tailscale.sh — install and join Tailscale on remote node
|
||||
#
|
||||
# TODO: skip if tailscale already installed and connected
|
||||
# TODO: install via https://tailscale.com/install.sh (or distro pkg)
|
||||
# TODO: gate on operator-provided auth key (TAILSCALE_AUTH_KEY env var; never hardcode)
|
||||
# TODO: tailscale up --auth-key=$TAILSCALE_AUTH_KEY --hostname=<node.yaml name>
|
||||
# TODO: verify node appears in tailscale status within timeout
|
||||
# TODO: gate on deploy_autonomy=true in node.yaml
|
||||
|
||||
set -euo pipefail
|
||||
: "${REPO_ROOT:?REPO_ROOT is not set — run via onboard.sh}"
|
||||
source "${REPO_ROOT}/scripts/onboard/lib/remote.sh"
|
||||
|
||||
STEP_NAME="30-install-tailscale"
|
||||
step "[$STEP_NAME] TODO — not yet implemented"
|
||||
|
|
@ -1,136 +0,0 @@
|
|||
#!/usr/bin/env bash
|
||||
# scripts/onboard/steps/30-node-agent.sh — deploy node-agent to remote node
|
||||
#
|
||||
# Push-based deploy (git_control=false on LUSTRO): rsync services/node-agent/
|
||||
# and the host override to /opt/homelab/deploy/node-agent/ on the remote, then
|
||||
# docker compose build + up via SSH. Mirrors the PIHA pattern but pushes files
|
||||
# instead of git-pulling them on the node.
|
||||
#
|
||||
# Stages:
|
||||
# 1. push — rsync base compose+src, copy override to remote deploy dir
|
||||
# 2. up — docker compose up -d --build (guarded: skip if already running)
|
||||
# 3. verify — container running + fresh event in /opt/homelab/events/<node>/
|
||||
#
|
||||
# Dry-run: probes run unconditionally; rsync/rrun mutations honour DRY_RUN.
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
STEP_NAME="30-node-agent"
|
||||
|
||||
: "${REPO_ROOT:?REPO_ROOT is not set — run via onboard.sh}"
|
||||
: "${NODE_YAML:?NODE_YAML is not set — run via onboard.sh}"
|
||||
: "${DRY_RUN:=0}"
|
||||
|
||||
# Source common.sh when run standalone (orchestrator sources it before calling steps)
|
||||
if ! declare -f log >/dev/null 2>&1; then
|
||||
# shellcheck source=../lib/common.sh
|
||||
source "${REPO_ROOT}/scripts/onboard/lib/common.sh"
|
||||
fi
|
||||
|
||||
# ── parse node.yaml ───────────────────────────────────────────────────────────
|
||||
SSH_USER=$(yaml_get "$NODE_YAML" "ssh_user")
|
||||
TS_HOSTNAME=$(yaml_get "$NODE_YAML" "tailscale.hostname")
|
||||
|
||||
[[ -z "$SSH_USER" ]] && die "ssh_user not set in $NODE_YAML"
|
||||
[[ -z "$TS_HOSTNAME" ]] && die "tailscale.hostname not set in $NODE_YAML"
|
||||
|
||||
export ONBOARD_SSH_USER="${ONBOARD_SSH_USER:-${SSH_USER}}"
|
||||
export ONBOARD_SSH_HOST="${ONBOARD_SSH_HOST:-${TS_HOSTNAME}}"
|
||||
|
||||
# shellcheck source=../lib/remote.sh
|
||||
source "${REPO_ROOT}/scripts/onboard/lib/remote.sh"
|
||||
|
||||
REMOTE_DEPLOY_DIR="/opt/homelab/deploy/node-agent"
|
||||
COMPOSE_BASE="${REMOTE_DEPLOY_DIR}/docker-compose.yml"
|
||||
COMPOSE_OVERRIDE="${REMOTE_DEPLOY_DIR}/docker-compose.override.yml"
|
||||
|
||||
LOCAL_SVC_DIR="${REPO_ROOT}/services/node-agent"
|
||||
LOCAL_OVERRIDE="${REPO_ROOT}/hosts/${TS_HOSTNAME}/runtime/node-agent/docker-compose.override.yml"
|
||||
|
||||
# ── rprobe: read-only remote probe — always runs, even in dry-run ─────────────
|
||||
rprobe() {
|
||||
ssh "${_SSH_OPTS[@]}" "${ONBOARD_SSH_USER}@${ONBOARD_SSH_HOST}" -- "$@"
|
||||
}
|
||||
|
||||
# ═══════════════════════════════════════════════════════════════════════════════
|
||||
# Stage 1 — push compose files to remote
|
||||
# ═══════════════════════════════════════════════════════════════════════════════
|
||||
step "[$STEP_NAME] 1/3 push compose → ${ONBOARD_SSH_HOST}:${REMOTE_DEPLOY_DIR}"
|
||||
|
||||
# Guard by EFFECT: is node-agent already running?
|
||||
_running=0
|
||||
if rprobe "docker ps --filter name=^node-agent\$ --filter status=running --format '{{.Names}}' 2>/dev/null | grep -q node-agent" 2>/dev/null; then
|
||||
_running=1
|
||||
log "node-agent container already running — skip push+build+up"
|
||||
fi
|
||||
|
||||
if [[ "$_running" -eq 0 ]]; then
|
||||
[[ -f "$LOCAL_OVERRIDE" ]] \
|
||||
|| die "Override not found: $LOCAL_OVERRIDE"
|
||||
|
||||
# Ensure remote deploy dir exists (rsync does not create intermediate dirs)
|
||||
# pi owns /opt/homelab, so no sudo needed
|
||||
rrun mkdir -p "${REMOTE_DEPLOY_DIR}"
|
||||
|
||||
# Push base compose + Dockerfile + src/ (rsync_dir handles DRY_RUN)
|
||||
rsync_dir "${LOCAL_SVC_DIR}/" "${REMOTE_DEPLOY_DIR}/"
|
||||
|
||||
# Push host-specific override (rcopy handles DRY_RUN)
|
||||
rcopy "${LOCAL_OVERRIDE}" "${REMOTE_DEPLOY_DIR}/docker-compose.override.yml"
|
||||
fi
|
||||
|
||||
# ═══════════════════════════════════════════════════════════════════════════════
|
||||
# Stage 2 — docker compose build + up
|
||||
# ═══════════════════════════════════════════════════════════════════════════════
|
||||
step "[$STEP_NAME] 2/3 docker compose up node-agent"
|
||||
|
||||
if [[ "$_running" -eq 1 ]]; then
|
||||
log "node-agent already running — skip"
|
||||
else
|
||||
# Build image on remote (arm64 native); then start the service.
|
||||
# --build rebuilds if context changed; idempotent if image is current.
|
||||
rrun docker compose \
|
||||
-f "${COMPOSE_BASE}" \
|
||||
-f "${COMPOSE_OVERRIDE}" \
|
||||
up -d --build node-agent
|
||||
fi
|
||||
|
||||
# ═══════════════════════════════════════════════════════════════════════════════
|
||||
# Stage 3 — verify
|
||||
# ═══════════════════════════════════════════════════════════════════════════════
|
||||
step "[$STEP_NAME] 3/3 verify"
|
||||
|
||||
if [ "${DRY_RUN:-0}" = 1 ]; then
|
||||
log "dry-run: skipping verify (mutations may not have run)"
|
||||
else
|
||||
# Verify: container running (docker ps — not command -v)
|
||||
if rprobe "docker ps --filter name=^node-agent\$ --filter status=running --format '{{.Names}}' 2>/dev/null | grep -q node-agent" 2>/dev/null; then
|
||||
log "Verify OK: node-agent container running"
|
||||
rprobe "docker ps --filter name=node-agent --format 'table {{.Names}}\t{{.Status}}\t{{.Image}}'" || true
|
||||
else
|
||||
die "node-agent container is NOT running — check: docker logs node-agent on ${TS_HOSTNAME}"
|
||||
fi
|
||||
|
||||
# Verify: at least one event file exists in /opt/homelab/events/<node>/ (any *.json counts; timestamps are not compared)
|
||||
# First cycle runs at start then sleeps CHECK_INTERVAL; allow 90s.
|
||||
log "Waiting for first event (up to 90 s, CHECK_INTERVAL=60)..."
|
||||
_event_ok=0
|
||||
for _i in $(seq 1 9); do
|
||||
if rprobe "ls /opt/homelab/events/${TS_HOSTNAME}/*.json 2>/dev/null | head -1 | grep -q .json" 2>/dev/null; then
|
||||
_event_ok=1
|
||||
break
|
||||
fi
|
||||
sleep 10
|
||||
log " ... ${_i}0 s elapsed, waiting..."
|
||||
done
|
||||
|
||||
if [[ "$_event_ok" -eq 1 ]]; then
|
||||
log "Verify OK: events present in /opt/homelab/events/${TS_HOSTNAME}/"
|
||||
rprobe "ls -lth /opt/homelab/events/${TS_HOSTNAME}/ | head -5" || true
|
||||
else
|
||||
warn "No events yet in /opt/homelab/events/${TS_HOSTNAME}/ after 90 s — agent may still be initialising (CHECK_INTERVAL=60)"
|
||||
warn "Re-run verify manually: docker logs node-agent on ${TS_HOSTNAME}"
|
||||
fi
|
||||
fi
|
||||
|
||||
log "[$STEP_NAME] done"
|
||||
|
|
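The verify stage of 30-node-agent.sh polls for the first event with a bounded number of attempts. A generic version of that loop might look like the sketch below; `wait_for_sketch` and the marker file are hypothetical, and the one-second interval is chosen only to keep the demonstration short (the script itself waits 9 × 10 s).

```bash
#!/usr/bin/env bash
# Hypothetical helper: retry a probe until it succeeds or attempts run out.
# Usage: wait_for_sketch <attempts> <interval_seconds> <command...>
wait_for_sketch() {
  local attempts="$1" interval="$2" i
  shift 2
  for (( i = 1; i <= attempts; i++ )); do
    if "$@"; then
      return 0                   # probe succeeded
    fi
    if (( i < attempts )); then
      sleep "$interval"          # no pointless sleep after the final attempt
    fi
  done
  return 1                       # gave up
}

marker=$(mktemp -u)              # a path that does not exist yet
( sleep 1; : > "$marker" ) &     # simulate the agent writing its first event
if wait_for_sketch 5 1 test -e "$marker"; then
  echo "event appeared"
else
  echo "timed out"
fi
wait                             # let the background writer finish before cleanup
rm -f "$marker"
```

Returning a status instead of calling `die` leaves the caller free to downgrade a timeout to a warning, which is what the step does when the agent is still initialising.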
@ -1,140 +0,0 @@
|
|||
#!/usr/bin/env bash
|
||||
# scripts/onboard/steps/40-register.sh — register the node in the inventory and commit on the current branch
|
||||
#
|
||||
# Effects (all idempotent):
|
||||
# 1. Appends a <node> block to inventory/topology.yaml
|
||||
# 2. Creates hosts/<node>/services.yaml if it does not exist
|
||||
# 3. git add + git commit on the current branch (no push; the merge is up to the operator)
|
||||
#
|
||||
# Reloading the observer is deliberately left out of this step: it is done manually after merge→master,
|
||||
# git pull on the VPS, and a run of 50-verify.sh.
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
STEP_NAME="40-register"
|
||||
|
||||
: "${REPO_ROOT:?REPO_ROOT is not set — run via onboard.sh}"
|
||||
: "${NODE_YAML:?NODE_YAML is not set — run via onboard.sh}"
|
||||
: "${DRY_RUN:=0}"
|
||||
|
||||
if ! declare -f log >/dev/null 2>&1; then
|
||||
# shellcheck source=../lib/common.sh
|
||||
source "${REPO_ROOT}/scripts/onboard/lib/common.sh"
|
||||
fi
|
||||
|
||||
NODE_ENTRY=$(yaml_get "${NODE_YAML}" "tailscale.hostname")
|
||||
[[ -z "${NODE_ENTRY}" ]] && die "tailscale.hostname not set in ${NODE_YAML}"
|
||||
|
||||
TOPOLOGY="${REPO_ROOT}/inventory/topology.yaml"
|
||||
SERVICES_YAML="${REPO_ROOT}/hosts/${NODE_ENTRY}/services.yaml"
|
||||
|
||||
# ── 1. inventory/topology.yaml ────────────────────────────────────────────────
step "[${STEP_NAME}] 1/3 inventory/topology.yaml"

_TOPOLOGY_BLOCK=$(cat << 'EOF'

  PLACEHOLDER:
    roles:
      - edge
    services:
      - node-agent
EOF
)
# Replace the PLACEHOLDER with the actual node name
_TOPOLOGY_BLOCK="${_TOPOLOGY_BLOCK//PLACEHOLDER/${NODE_ENTRY}}"

if grep -q "^  ${NODE_ENTRY}:" "${TOPOLOGY}"; then
  log "${NODE_ENTRY} already present in topology.yaml - skip"
else
  if [ "${DRY_RUN:-0}" = 1 ]; then
    dryrun "Would append to ${TOPOLOGY}:"
    echo "${_TOPOLOGY_BLOCK}"
  else
    printf '%s\n' "${_TOPOLOGY_BLOCK}" >> "${TOPOLOGY}"
    log "Appended ${NODE_ENTRY} block to topology.yaml"
  fi
fi

# ── 2. hosts/<node>/services.yaml ────────────────────────────────────────────
step "[${STEP_NAME}] 2/3 hosts/${NODE_ENTRY}/services.yaml"

# Single source for the services.yaml content, shared by the dry-run preview
# and the real write so the two cannot drift apart.
_render_services_yaml() {
  cat << EOF
host: ${NODE_ENTRY}

services:
  node-agent:
    role: node-stability-monitor
    deployment_model: docker-compose
    exposure: local-only
    offline_required: true
    depends_on:
      local: []
      external: []
    runtime:
      config_path: /opt/homelab/config/node-agent
      data_path: /opt/homelab/state
      logs_path: /opt/homelab/events
EOF
}

if [[ -f "${SERVICES_YAML}" ]]; then
  log "services.yaml already exists - skip"
else
  if [ "${DRY_RUN:-0}" = 1 ]; then
    dryrun "Would create ${SERVICES_YAML}:"
    _render_services_yaml
  else
    mkdir -p "${REPO_ROOT}/hosts/${NODE_ENTRY}"
    _render_services_yaml > "${SERVICES_YAML}"
    log "Created ${SERVICES_YAML}"
  fi
fi

# ── 3. git commit ─────────────────────────────────────────────────────────────
step "[${STEP_NAME}] 3/3 git commit"

cd "${REPO_ROOT}"

_changed_files=()
git diff --quiet -- "${TOPOLOGY}" 2>/dev/null || _changed_files+=("inventory/topology.yaml")
[[ -f "${SERVICES_YAML}" ]] && \
  git ls-files --error-unmatch "${SERVICES_YAML}" >/dev/null 2>&1 || \
  _changed_files+=("hosts/${NODE_ENTRY}/services.yaml")

# Re-check: is anything staged or unstaged for these paths?
# The diffs are limited to the two registration files so that unrelated
# changes in the working tree do not trigger an empty commit.
_needs_commit=0
if git diff --quiet -- "${TOPOLOGY}" "${SERVICES_YAML}" && \
   git diff --cached --quiet -- "${TOPOLOGY}" "${SERVICES_YAML}"; then
  # Nothing changed for these paths; they may already be committed
  if git ls-files --error-unmatch "${TOPOLOGY}" "${SERVICES_YAML}" >/dev/null 2>&1 && \
     git diff --quiet HEAD -- "${TOPOLOGY}" "${SERVICES_YAML}"; then
    log "Nothing to commit - ${NODE_ENTRY} already registered and committed"
  else
    _needs_commit=1
  fi
else
  _needs_commit=1
fi

if [[ "${_needs_commit}" -eq 1 ]]; then
  run git add "inventory/topology.yaml" "hosts/${NODE_ENTRY}/services.yaml"
  run git commit -m "feat(onboard): register ${NODE_ENTRY} in topology + services.yaml"
  if [ "${DRY_RUN:-0}" != 1 ]; then
    log "Committed on $(git branch --show-current)"
    log "Next: agent.sh merge task/node-onboarding → master, git pull VPS, run 50-verify.sh"
  fi
fi

log "[${STEP_NAME}] done"

@ -1,160 +0,0 @@
#!/usr/bin/env bash
# scripts/onboard/steps/50-verify.sh - restart the observer + smoke-test the node in the panel
#
# Run AFTER: merge task/node-onboarding → master + git pull on the VPS.
#
# Checks:
#   1. SSH <node>: node-agent container running
#   2. SSH <node>: events present in /opt/homelab/events/<node>/
#   3. SSH VPS:    docker restart control-plane-observer + poll observer.heartbeat
#   4. SSH VPS:    <node> visible in /opt/homelab/world/nodes.json
#
# Exit 0: all checks OK | Exit 1: at least one FAIL (see the summary table)

set -euo pipefail

STEP_NAME="50-verify"

: "${REPO_ROOT:?REPO_ROOT is not set - run via onboard.sh}"
: "${NODE_YAML:?NODE_YAML is not set - run via onboard.sh}"
: "${DRY_RUN:=0}"

if ! declare -f log >/dev/null 2>&1; then
  # shellcheck source=../lib/common.sh
  source "${REPO_ROOT}/scripts/onboard/lib/common.sh"
fi

SSH_USER=$(yaml_get "${NODE_YAML}" "ssh_user")
TS_HOSTNAME=$(yaml_get "${NODE_YAML}" "tailscale.hostname")
[[ -z "${SSH_USER}" ]] && die "ssh_user not set in ${NODE_YAML}"
[[ -z "${TS_HOSTNAME}" ]] && die "tailscale.hostname not set in ${NODE_YAML}"

VPS_SSH_USER="oskar"
VPS_SSH_HOST="100.95.58.48"
VPS_REPO_PATH="/home/oskar/homelab-codex-ws"

_SSH_OPTS=(-o StrictHostKeyChecking=accept-new -o ConnectTimeout=10 -o BatchMode=yes)

_ssh_node() { ssh "${_SSH_OPTS[@]}" "${SSH_USER}@${TS_HOSTNAME}" -- "$@"; }
_ssh_vps()  { ssh "${_SSH_OPTS[@]}" "${VPS_SSH_USER}@${VPS_SSH_HOST}" -- "$@"; }

declare -A RESULTS=()

# ── 1. node-agent running on <node> ──────────────────────────────────────────
step "[${STEP_NAME}] 1/4 ${TS_HOSTNAME}: node-agent container"

if [ "${DRY_RUN:-0}" = 1 ]; then
  dryrun "ssh ${SSH_USER}@${TS_HOSTNAME} docker ps --filter name=^node-agent\$"
  RESULTS["node-agent-running"]="skip"
elif _ssh_node "docker ps --filter name=^node-agent\$ --filter status=running --format '{{.Names}}'" 2>/dev/null \
     | grep -q "node-agent"; then
  log "OK: node-agent running"
  _ssh_node "docker ps --filter name=node-agent --format 'table {{.Names}}\t{{.Status}}'" 2>/dev/null || true
  RESULTS["node-agent-running"]="PASS"
else
  warn "FAIL: node-agent is not running on ${TS_HOSTNAME}"
  RESULTS["node-agent-running"]="FAIL"
fi

# ── 2. events in /opt/homelab/events/<node>/ ──────────────────────────────────
step "[${STEP_NAME}] 2/4 ${TS_HOSTNAME}: events"

if [ "${DRY_RUN:-0}" = 1 ]; then
  dryrun "ssh ${SSH_USER}@${TS_HOSTNAME} find /opt/homelab/events/${TS_HOSTNAME}/ -name '*.json'"
  RESULTS["events-present"]="skip"
elif _ssh_node "find /opt/homelab/events/${TS_HOSTNAME}/ -name '*.json' 2>/dev/null | head -1" 2>/dev/null \
     | grep -q '\.json'; then
  _latest=$(_ssh_node "ls -t /opt/homelab/events/${TS_HOSTNAME}/*.json 2>/dev/null | head -1" || echo "?")
  log "OK: events present (latest: ${_latest})"
  RESULTS["events-present"]="PASS"
else
  warn "FAIL: no events in /opt/homelab/events/${TS_HOSTNAME}/"
  RESULTS["events-present"]="FAIL"
fi

# ── 3. observer restart + healthcheck ─────────────────────────────────────────
step "[${STEP_NAME}] 3/4 VPS: restart control-plane-observer"

if [ "${DRY_RUN:-0}" = 1 ]; then
  dryrun "ssh ${VPS_SSH_USER}@${VPS_SSH_HOST} docker restart control-plane-observer"
  dryrun "poll /opt/homelab/state/observer.heartbeat (max 30s)"
  RESULTS["observer-healthy"]="skip"
else
  log "Restarting control-plane-observer on the VPS..."
  _ssh_vps "docker restart control-plane-observer"

  log "Polling observer.heartbeat (max 30s)..."
  _ok=0
  for _i in $(seq 1 6); do
    sleep 5
    # Heartbeat age in seconds; falls back to 999 if the file or SSH is unavailable
    _age=$(_ssh_vps "python3 -c \
      \"import os,time; s=os.stat('/opt/homelab/state/observer.heartbeat'); \
      print(int(time.time()-s.st_mtime))\" 2>/dev/null" || echo "999")
    if [[ "${_age}" -lt 20 ]]; then
      log "OK: observer.heartbeat fresh (${_age}s ago)"
      _ok=1
      break
    fi
    log "  ... ${_i}×5s, heartbeat ${_age}s old..."
  done

  if [[ "${_ok}" -eq 1 ]]; then
    RESULTS["observer-healthy"]="PASS"
  else
    warn "FAIL: observer.heartbeat not refreshed after 30s"
    warn "Check: ssh ${VPS_SSH_USER}@${VPS_SSH_HOST} docker logs control-plane-observer --tail 30"
    RESULTS["observer-healthy"]="FAIL"
  fi
fi

# ── 4. <node> visible in world/nodes.json ─────────────────────────────────────
step "[${STEP_NAME}] 4/4 VPS: ${TS_HOSTNAME} in world/nodes.json"

if [ "${DRY_RUN:-0}" = 1 ]; then
  dryrun "ssh ${VPS_SSH_USER}@${VPS_SSH_HOST} python3 -c \"json.load(.../world/nodes.json)['${TS_HOSTNAME}']\""
  RESULTS["world-state"]="skip"
else
  _node_status=$(_ssh_vps "python3 -c \"
import json, sys
try:
    d = json.load(open('/opt/homelab/world/nodes.json'))
    node = d.get('${TS_HOSTNAME}', {})
    print(node.get('status', 'missing'))
except Exception as e:
    print('error:' + str(e))
\"" 2>/dev/null || echo "ssh-error")

  case "${_node_status}" in
    online|offline)
      log "OK: ${TS_HOSTNAME} in world/nodes.json (status=${_node_status})"
      RESULTS["world-state"]="PASS"
      ;;
    missing)
      warn "FAIL: ${TS_HOSTNAME} has no entry in world/nodes.json"
      warn "Possible cause: the observer has not processed the events yet (wait 60s and retry)"
      RESULTS["world-state"]="FAIL"
      ;;
    *)
      warn "FAIL: unexpected response: ${_node_status}"
      RESULTS["world-state"]="FAIL"
      ;;
  esac
fi

# ── summary table ─────────────────────────────────────────────────────────────
echo ""
printf '%s\n' "══════════════════════════════════════════"
printf "  %-30s %s\n" "CHECK" "RESULT"
printf '%s\n' "──────────────────────────────────────────"
for _key in "node-agent-running" "events-present" "observer-healthy" "world-state"; do
  _val="${RESULTS[${_key}]:-???}"
  printf "  %-30s %s\n" "${_key}" "${_val}"
done
printf '%s\n' "══════════════════════════════════════════"
echo ""

for _val in "${RESULTS[@]}"; do
  [[ "${_val}" == "FAIL" ]] && { warn "Verify: at least one check did not pass"; exit 1; }
done

log "[${STEP_NAME}] Verify OK - ${TS_HOSTNAME} registered and visible in the panel"

@ -1,55 +0,0 @@
### Agent System
Central runtime materializer and Operator Control Plane UI.

#### Components
- **Redis**: Central state store (on PIHA).
- **Runtime Materializer**: Converts Redis state to JSON files in `/opt/homelab/world`.
- **Web UI**: Exposes API endpoints and serves the Operator UI.
- **Telegram Bot**: Provides operator commands and action approvals via Telegram.

#### Configuration
Environment variables should be set in `.env` (see `env.example`).
Key variables for the Telegram Bot:
- `TELEGRAM_BOT_TOKEN`: Your bot token from @BotFather.
- `TELEGRAM_ALLOWED_USER_IDS`: Comma-separated list of authorized Telegram User IDs.
- `CONTROL_PLANE_URL`: URL to the `agent-system-webui` (default: `http://webui:8080`).

#### Telegram Commands
- `/status`: Check bot and API connectivity.
- `/summary`: System health overview.
- `/nodes`: List homelab nodes and their status.
- `/services`: Summary of services across nodes.
- `/unhealthy`: List all unhealthy components.
- `/incidents`: View active incidents.
- `/actions`: Summary of operator actions.
- `/help`: List all commands.

#### Deployment (on PIHA)
```bash
cd services/agent-system
./deploy.sh
```

#### Deployment (on CHELSTY)
```bash
cd services/stability-agent
docker compose up -d --build
```

#### Verification
The `deploy.sh` script checks the local endpoints after start-up and prints `OK` or `FAILED` for each one; a failed check is reported but does not abort the script.
You can also check manually:
```bash
# Check runtime summary
curl http://localhost:18180/summary

# Check discovered nodes
curl http://localhost:18180/nodes

# Check discovered services
curl http://localhost:18180/services
```
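
The manual checks can also be scripted. The following is a minimal sketch, assuming the web UI answers on `http://localhost:18180` as in the curl commands and that each endpoint returns JSON. The names `http_fetch` and `check_endpoints`, and the injectable `fetch` parameter, are hypothetical and not part of the repository.

```python
import json
import urllib.request

# Same paths as the curl checks in the Verification section.
ENDPOINTS = ("summary", "nodes", "services")


def http_fetch(url, timeout=5):
    """Fetch a URL and parse the body as JSON (raises on HTTP or parse errors)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode())


def check_endpoints(base_url, fetch=http_fetch):
    """Return {endpoint: bool}; an endpoint passes if it yields parseable JSON."""
    results = {}
    for name in ENDPOINTS:
        try:
            fetch(f"{base_url.rstrip('/')}/{name}")
            results[name] = True
        except Exception:
            # Connection refused, HTTP error status, invalid JSON, timeout, ...
            results[name] = False
    return results
```

Calling `check_endpoints("http://localhost:18180")` on the host returns one boolean per endpoint. Passing a custom `fetch` keeps the function testable without a running stack.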

#### Directory Structure
- `/opt/homelab/world`: Contains materialized JSON state.
- `/opt/homelab/state`: Contains operator configuration and local heartbeats.
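
As an illustration of consuming the materialized state, the sketch below loads every JSON file found in the world directory. It assumes only that the directory holds `*.json` files such as `nodes.json`; the function name `load_world` is hypothetical, and skipping unreadable files is a design choice of this sketch rather than documented behaviour.

```python
import json
from pathlib import Path


def load_world(world_dir="/opt/homelab/world"):
    """Load every *.json file in the world directory, keyed by file stem."""
    state = {}
    for path in sorted(Path(world_dir).glob("*.json")):
        try:
            state[path.stem] = json.loads(path.read_text())
        except (OSError, json.JSONDecodeError):
            # The file may be mid-write or unreadable; skip it instead of failing.
            continue
    return state
```

With a populated directory, `load_world()["nodes"]` would hold the parsed content of `nodes.json`.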
@ -1,52 +0,0 @@
### Action Approval Data Model

Actions are JSON files stored in `/opt/homelab/actions/{status}/{action_id}.json`.

#### Statuses
- `pending`: Waiting for operator approval. AI agents create actions in this state.
- `approved`: Approved by operator, ready for execution.
- `rejected`: Rejected by operator, will not be executed.
- `running`: Currently being executed by an agent (e.g. `materializer`).
- `completed`: Successfully executed.
- `failed`: Execution failed.
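
The statuses form a small state machine. The sketch below encodes the transitions implied by the descriptions in this document (pending to approved or rejected, approved to running, running to completed or failed). The table `ALLOWED_TRANSITIONS` and the helper `can_transition` are hypothetical names, and treating every other transition as forbidden is an assumption.

```python
# Transitions implied by the status descriptions; anything not listed is
# assumed to be disallowed.
ALLOWED_TRANSITIONS = {
    "pending": {"approved", "rejected"},
    "approved": {"running"},
    "running": {"completed", "failed"},
    "rejected": set(),
    "completed": set(),
    "failed": set(),
}


def can_transition(current, target):
    """Return True if moving an action from `current` to `target` is allowed."""
    return target in ALLOWED_TRANSITIONS.get(current, set())
```

A component that moves action files could call `can_transition` before each move and refuse anything outside the table.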

#### Human-in-the-Loop (HIL) Protocol
1. **Request**: Agent identifies a required change and writes a JSON to `actions/pending/`.
2. **Notification**: System notifies the human operator.
3. **Audit**: Human reviews `details.reason` and `details.diff`.
4. **Authorization**: Human moves file to `approved/`.
5. **Execution**: Agent monitors `approved/` and executes the task.

#### Schema
```json
{
  "action_id": "string",
  "service": "string",
  "node": "string",
  "type": "deploy_service | restart_service | rollback | scale",
  "risk": "nominal | guarded | critical",
  "status": "pending | approved | rejected | ...",
  "created_at": <unix_seconds>,
  "updated_at": <unix_seconds>,
  "details": {
    "image": "string",
    "reason": "string",
    "diff": "string"
  },
  "transition_history": [
    {
      "from": "string | null",
      "to": "string",
      "timestamp": <unix_seconds>,
      "by": "string (system | operator-tg-12345 | webui)"
    }
  ]
}
```
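
To make the schema concrete, here is a minimal sketch that builds a new pending action as a Python dict. The field names and the allowed `type` and `risk` values come from the schema; the function name `new_pending_action`, its parameters, and the choice to raise `ValueError` on unknown values are assumptions made for illustration.

```python
import time

ACTION_TYPES = {"deploy_service", "restart_service", "rollback", "scale"}
RISK_LEVELS = {"nominal", "guarded", "critical"}


def new_pending_action(action_id, service, node, action_type, risk, details,
                       by="system", now=None):
    """Build an action dict in the `pending` state, following the schema."""
    if action_type not in ACTION_TYPES:
        raise ValueError(f"unknown action type: {action_type}")
    if risk not in RISK_LEVELS:
        raise ValueError(f"unknown risk level: {risk}")
    ts = int(time.time()) if now is None else int(now)
    return {
        "action_id": action_id,
        "service": service,
        "node": node,
        "type": action_type,
        "risk": risk,
        "status": "pending",
        "created_at": ts,
        "updated_at": ts,
        "details": dict(details),
        # The first history entry records creation: from null to pending.
        "transition_history": [
            {"from": None, "to": "pending", "timestamp": ts, "by": by}
        ],
    }
```

Serialising the result with `json.dump` into `actions/pending/{action_id}.json` would produce a file of the documented shape.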

#### Workflow
1. A system component (e.g. `runtime-materializer` or a future analyzer) creates a file in `actions/pending/`.
2. `telegram-bot` detects the file, sends a message to allowed users.
3. Operator clicks "Approve" or "Reject".
4. `telegram-bot` moves the file to `actions/approved/` or `actions/rejected/` atomically, appending a transition to `transition_history`.
5. The responsible agent (e.g. `stability-agent` on the target node) picks up the `approved` action, moves it to `running`, executes it, and finally moves it to `completed` or `failed`.
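
One way to implement the move in steps 4 and 5 is sketched below. It is not the bot's actual code: `move_action` and its parameters are hypothetical, and the approach (write the updated JSON to a temporary file in the destination directory, rename it into place with `os.replace`, then delete the source) is an assumption about what "atomically" could mean here. The rename is atomic only within one filesystem, and the action briefly exists in both directories before the source is removed.

```python
import json
import os
import tempfile
import time
from pathlib import Path


def move_action(actions_root, action_id, src, dst, by, now=None):
    """Move actions/<src>/<id>.json to actions/<dst>/ and record the transition."""
    root = Path(actions_root)
    src_path = root / src / f"{action_id}.json"
    dst_dir = root / dst
    dst_dir.mkdir(parents=True, exist_ok=True)

    data = json.loads(src_path.read_text())
    ts = int(time.time()) if now is None else int(now)
    data["status"] = dst
    data["updated_at"] = ts
    data.setdefault("transition_history", []).append(
        {"from": src, "to": dst, "timestamp": ts, "by": by}
    )

    # Write to a temp file next to the target, then rename it into place so
    # readers never observe a half-written JSON document.
    fd, tmp_name = tempfile.mkstemp(dir=dst_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(data, f, indent=2)
    os.replace(tmp_name, dst_dir / f"{action_id}.json")
    src_path.unlink()
    return data
```

The `.tmp` suffix keeps the temporary file out of any `*.json` glob that watches the destination directory.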
@ -1,28 +0,0 @@
#!/bin/bash
set -e

echo ">>> Validating docker-compose configuration..."
docker compose config

echo ">>> Building and starting Agent System services..."
docker compose up -d --build

echo ">>> Services status:"
docker ps --filter "name=agent-system" --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"

if [ -z "${TELEGRAM_BOT_TOKEN:-}" ]; then
  echo ">>> Telegram bot status: DISABLED (token missing)"
else
  echo ">>> Telegram bot status: ENABLED"
fi

echo ">>> Verifying API endpoints..."
sleep 5  # Give it a moment to start

endpoints=("summary" "nodes" "services")
for ep in "${endpoints[@]}"; do
  echo "Checking /$ep..."
  # A failed endpoint is reported but does not abort the script
  curl -s -f "http://localhost:18180/${ep}" > /dev/null && echo "  OK" || echo "  FAILED"
done

echo ">>> Deployment complete."

@ -1,47 +0,0 @@
services:
  redis:
    image: redis:7
    container_name: agent-system-redis
    ports:
      - "6379:6379"
    restart: unless-stopped

  webui:
    build: ./webui
    container_name: agent-system-webui
    ports:
      - "18180:8080"
    volumes:
      - /opt/homelab:/opt/homelab
    depends_on:
      - redis
    restart: unless-stopped

  runtime-materializer:
    build: ./runtime-materializer
    container_name: agent-system-runtime-materializer
    environment:
      REDIS_HOST: redis
      REDIS_PORT: "6379"
      HOMELAB_WORLD_ROOT: /opt/homelab/world
      WORLD_DIR: /opt/homelab/world
      MATERIALIZE_INTERVAL: "10"
    volumes:
      - /opt/homelab:/opt/homelab
    depends_on:
      - redis
    restart: unless-stopped

  telegram-bot:
    build: ./telegram-bot
    container_name: agent-system-telegram-bot
    environment:
      TELEGRAM_BOT_TOKEN: ${TELEGRAM_BOT_TOKEN}
      TELEGRAM_ALLOWED_USER_IDS: ${TELEGRAM_ALLOWED_USER_IDS}
      CONTROL_PLANE_URL: ${CONTROL_PLANE_URL:-http://webui:8080}
      ENABLE_LLM_FALLBACK: ${ENABLE_LLM_FALLBACK:-false}
      OPENCLAW_BASE_URL: ${OPENCLAW_BASE_URL}
      ACTIONS_ROOT: /opt/homelab/actions
    volumes:
      - /opt/homelab:/opt/homelab
    restart: on-failure

@ -1,19 +0,0 @@
# Telegram Bot Configuration
# Get token from @BotFather
TELEGRAM_BOT_TOKEN=123456789:ABCdefGHIjklMNOpqrsTUVwxyz
# Comma-separated list of Telegram User IDs
TELEGRAM_ALLOWED_USER_IDS=12345678,87654321
# Local control-plane API (default is internal compose address)
CONTROL_PLANE_URL=http://webui:8080
# Optional LLM fallback logic
ENABLE_LLM_FALLBACK=false
OPENCLAW_BASE_URL=http://openclaw.internal

# Runtime Materializer Configuration
REDIS_HOST=100.108.208.3
REDIS_PORT=6379

# Paths
HOMELAB_ROOT=/opt/homelab
ACTIONS_ROOT=/opt/homelab/actions
WORLD_DIR=/opt/homelab/world

@ -1,16 +0,0 @@
FROM python:3.11-slim

WORKDIR /app

# Install the redis Python package
RUN pip install --no-cache-dir redis

COPY materializer.py .

# Ensure the world directory exists in the container (though it will likely be a volume)
RUN mkdir -p /opt/homelab/world

# Use unbuffered output to see logs in docker
ENV PYTHONUNBUFFERED=1

CMD ["python", "materializer.py"]

@ -1,251 +0,0 @@
import redis
import json
import os
import time
import argparse
import urllib.request
import urllib.error
from datetime import datetime

# Configuration from environment variables
REDIS_HOST = os.environ.get("REDIS_HOST", "redis")
REDIS_PORT = int(os.environ.get("REDIS_PORT", 6379))
WORLD_DIR = os.environ.get("WORLD_DIR", "/opt/homelab/world")

# When set, materialize from the control-plane HTTP API instead of Redis.
# This is the authoritative source of truth: the observer writes clean world
# state to the control-plane API, which the materializer mirrors locally so
# the webui's /snapshot (and all other endpoints) reflect the same data.
#
# Example: CONTROL_PLANE_URL=http://100.95.58.48:18180
CONTROL_PLANE_URL = os.environ.get("CONTROL_PLANE_URL", "").rstrip("/")


def get_redis_client():
    """Returns a Redis client with decoding enabled."""
    return redis.Redis(
        host=REDIS_HOST,
        port=REDIS_PORT,
        decode_responses=True,
        socket_timeout=5
    )


def safe_json_loads(data, default=None):
    """Safely loads JSON from a string.

    Empty input yields `default`; input that is not valid JSON is returned unchanged.
    """
    if not data:
        return default
    try:
        if isinstance(data, (dict, list)):
            return data
        return json.loads(data)
    except (json.JSONDecodeError, TypeError):
        return data


def normalize_health(health):
    """Normalizes health values for the UI."""
    if not health:
        return "nominal"
    h = str(health).lower()
    if h in ["healthy", "ok", "running", "nominal"]:
        return "nominal"
    if h in ["degraded", "warning"]:
        return "degraded"
    return "error"


def _fetch_json(url):
    """Fetch JSON from a URL, returning parsed data or None on error."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return json.loads(resp.read())
    except Exception as e:
        print(f"[{datetime.now().isoformat()}] Error fetching {url}: {e}")
        return None


def write_json(filename, data):
    """Write `data` as indented JSON to `filename` inside WORLD_DIR."""
    path = os.path.join(WORLD_DIR, filename)
    with open(path, "w") as f:
        json.dump(data, f, indent=2)


def materialize_from_api():
    """Mirror world state from the control-plane API to local world files.

    The control-plane observer on VPS is the single authoritative writer of
    world state. By fetching from its HTTP API we get the same clean, pruned
    data that the /summary endpoint serves, with no stale Redis artefacts.

    Returns True if all fetches succeeded and files were written, False otherwise.
    """
    print(f"[{datetime.now().isoformat()}] Materializing from control-plane API: {CONTROL_PLANE_URL}")

    endpoints = {
        "nodes.json":           f"{CONTROL_PLANE_URL}/nodes",
        "services.json":        f"{CONTROL_PLANE_URL}/services",
        "incidents.json":       f"{CONTROL_PLANE_URL}/incidents",
        "deployments.json":     f"{CONTROL_PLANE_URL}/deployments",
        "recommendations.json": f"{CONTROL_PLANE_URL}/recommendations",
        "runtime-summary.json": f"{CONTROL_PLANE_URL}/summary",
        "events.json":          f"{CONTROL_PLANE_URL}/events",
    }

    # Fetch everything first so a partial failure never leaves mixed state on disk
    fetched = {}
    for filename, url in endpoints.items():
        data = _fetch_json(url)
        if data is None:
            print(f"[{datetime.now().isoformat()}] Aborting: failed to fetch {url}")
            return False
        fetched[filename] = data

    os.makedirs(WORLD_DIR, exist_ok=True)
    for filename, data in fetched.items():
        write_json(filename, data)

    svc_count = len(fetched.get("services.json") or [])
    print(f"[{datetime.now().isoformat()}] Materialized from API: {svc_count} services → {WORLD_DIR}")
    return True


def materialize():
    """Reads state from Redis and writes JSON files to the world directory."""
    print(f"[{datetime.now().isoformat()}] Materializing world state...")
    try:
        r = get_redis_client()

        # 1. Nodes
        nodes = []
        node_keys = r.keys("homelab:nodes:*")
        for key in node_keys:
            node_data = r.hgetall(key)
            if node_data:
                # Normalize health
                if "health" in node_data:
                    node_data["health"] = normalize_health(node_data["health"])
                # Parse JSON fields if they exist
                if "capabilities" in node_data:
                    node_data["capabilities"] = safe_json_loads(node_data["capabilities"], [])
                if "checks" in node_data:
                    node_data["checks"] = safe_json_loads(node_data["checks"], {})
                nodes.append(node_data)

        # 2. Services
        services = []
        service_keys = r.keys("homelab:services:*")
        for key in service_keys:
            svc_data = r.hgetall(key)
            if svc_data:
                # Normalize health
                if "health" in svc_data:
                    svc_data["health"] = normalize_health(svc_data["health"])
                if "dependencies" in svc_data:
                    svc_data["dependencies"] = safe_json_loads(svc_data["dependencies"], [])
                if "recommendations" in svc_data:
                    svc_data["recommendations"] = safe_json_loads(svc_data["recommendations"], [])
                services.append(svc_data)

        # 3. Events (Stream)
        events = []
        try:
            # Get last 100 events from the stream
            raw_events = r.xrevrange("homelab:events", count=100)
            for event_id, data in raw_events:
                event = data.copy()
                event["id"] = event_id
                if "details" in event:
                    event["details"] = safe_json_loads(event["details"], {})
                events.append(event)
        except redis.exceptions.ResponseError:
            # homelab:events might not be a stream or doesn't exist
            pass

        # 4. Incidents (Hash)
        incidents = []
        incident_keys = r.keys("homelab:incidents:*")
        for key in incident_keys:
            incident_data = r.hgetall(key)
            if incident_data:
                # Normalize health if present
                if "health" in incident_data:
                    incident_data["health"] = normalize_health(incident_data["health"])
                incidents.append(incident_data)

        # 5. Deployments (Hash)
        deployments = []
        deployment_keys = r.keys("homelab:deployments:*")
        for key in deployment_keys:
            dep_data = r.hgetall(key)
            if dep_data:
                deployments.append(dep_data)

        # 6. Recommendations (Hash)
        recommendations = []
        recommendation_keys = r.keys("homelab:recommendations:*")
        for key in recommendation_keys:
            rec_data = r.hgetall(key)
            if rec_data:
                recommendations.append(rec_data)

        # 7. Runtime Summary
        unhealthy_services = [s for s in services if s.get("health") != "nominal"]
        active_incidents = [i for i in incidents if i.get("status") not in ["resolved", "closed"]]

        # Any active incident or more than five unhealthy services counts as an error
        status = "nominal"
        if len(active_incidents) > 0 or len(unhealthy_services) > 5:
            status = "error"
        elif len(unhealthy_services) > 0:
            status = "degraded"

        summary = {
            "status": status,
            "timestamp": datetime.utcnow().isoformat() + "Z",
            "last_update": int(time.time()),
            "node_count": len(nodes),
            "service_count": len(services),
            "active_incidents_count": len(active_incidents),
            "unhealthy_services_count": len(unhealthy_services),
            "incident_count": len(incidents),
            "recent_events_count": len(events),
            "stale": False
        }

        # Ensure directory exists
        os.makedirs(WORLD_DIR, exist_ok=True)

        write_json("runtime-summary.json", summary)
        write_json("nodes.json", nodes)
        write_json("services.json", services)
        write_json("incidents.json", incidents)
        write_json("events.json", events)
        write_json("deployments.json", deployments)
        write_json("recommendations.json", recommendations)

        print(f"[{datetime.now().isoformat()}] Successfully materialized to {WORLD_DIR}")

    except redis.exceptions.ConnectionError as e:
        print(f"Redis connection error: {e}")
    except Exception as e:
        print(f"Unexpected error during materialization: {e}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Homelab Runtime Materializer")
    parser.add_argument("--once", action="store_true", help="Run once and exit")
    parser.add_argument("--interval", type=int, default=30, help="Sleep interval between runs (seconds)")
    args = parser.parse_args()

    if CONTROL_PLANE_URL:
        print(f"Mode: control-plane API ({CONTROL_PLANE_URL})")
        run_fn = materialize_from_api
    else:
        print(f"Mode: Redis ({REDIS_HOST}:{REDIS_PORT})")
        run_fn = materialize

    # MATERIALIZE_INTERVAL from the environment takes precedence over --interval
    interval = int(os.environ.get("MATERIALIZE_INTERVAL", args.interval))

    if args.once:
        run_fn()
    else:
        print(f"Starting materializer loop (interval: {interval}s)...")
        while True:
            run_fn()
            time.sleep(interval)

@ -1,39 +0,0 @@
#!/bin/bash
# Script to create a test pending action for Telegram bot verification.

ACTIONS_PENDING_DIR="${ACTIONS_ROOT:-/opt/homelab/actions}/pending"
mkdir -p "$ACTIONS_PENDING_DIR"

ACTION_ID="test-$(date +%s)"
FILE_PATH="$ACTIONS_PENDING_DIR/$ACTION_ID.json"

TIMESTAMP=$(date +%s)

cat <<EOF > "$FILE_PATH"
{
  "action_id": "$ACTION_ID",
  "service": "frigate",
  "node": "chelsty",
  "type": "deploy_service",
  "risk": "guarded",
  "status": "pending",
  "created_at": $TIMESTAMP,
  "updated_at": $TIMESTAMP,
  "details": {
    "image": "blakeblackshear/frigate:0.13.0",
    "reason": "Security update for Frigate",
    "diff": "image: blakeblackshear/frigate:0.12.0 -> 0.13.0"
  },
  "transition_history": [
    {
      "from": null,
      "to": "pending",
      "timestamp": $TIMESTAMP,
      "by": "system-test"
    }
  ]
}
EOF

echo "Test action created: $FILE_PATH"
echo "If the telegram-bot is running and configured, you should receive a notification."

@ -1,10 +0,0 @@
FROM python:3.11-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY bot.py .

CMD ["python", "bot.py"]

@ -1,454 +0,0 @@
import os
import json
import time
import asyncio
import logging
import urllib.request
import urllib.error
from pathlib import Path
from telegram import Update, InlineKeyboardButton, InlineKeyboardMarkup
from telegram.ext import ApplicationBuilder, ContextTypes, CommandHandler, CallbackQueryHandler, MessageHandler, filters

# Setup logging
logging.basicConfig(
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    level=logging.INFO
)
logger = logging.getLogger(__name__)

# Configuration
TOKEN = os.getenv("TELEGRAM_BOT_TOKEN")
ALLOWED_IDS = [int(i.strip()) for i in os.getenv("TELEGRAM_ALLOWED_USER_IDS", "").split(",") if i.strip()]
ACTIONS_ROOT = Path(os.getenv("ACTIONS_ROOT", "/opt/homelab/actions"))
CONTROL_PLANE_URL = os.getenv("CONTROL_PLANE_URL", "http://webui:8080")
ENABLE_LLM_FALLBACK = os.getenv("ENABLE_LLM_FALLBACK", "false").lower() == "true"
OPENCLAW_BASE_URL = os.getenv("OPENCLAW_BASE_URL")


async def fetch_api(path):
    """Helper to fetch JSON from the Control Plane API."""
    url = f"{CONTROL_PLANE_URL.rstrip('/')}/{path.lstrip('/')}"
    try:
        def do_request():
            req = urllib.request.Request(url)
            with urllib.request.urlopen(req, timeout=5) as response:
                if response.status != 200:
                    return None
                return json.loads(response.read().decode())
        # urllib is blocking, so run it in a worker thread to keep the event loop free
        return await asyncio.to_thread(do_request)
    except Exception as e:
        logger.error(f"Error fetching {url}: {e}")
        return None


async def post_api(path, data):
    """Helper to POST JSON to the Control Plane API."""
    url = f"{CONTROL_PLANE_URL.rstrip('/')}/{path.lstrip('/')}"
    try:
        body = json.dumps(data).encode("utf-8")
        def do_request():
            req = urllib.request.Request(url, data=body, method="POST")
            req.add_header("Content-Type", "application/json")
            with urllib.request.urlopen(req, timeout=5) as response:
                return response.status == 200
        return await asyncio.to_thread(do_request)
    except Exception as e:
        logger.error(f"Error posting to {url}: {e}")
        return False


def _format_pending_action(action_id: str, data: dict) -> str:
    """Build the Telegram Markdown message for a pending action notification.

    Extracted so it can be unit-tested without a live Telegram connection.
    """
    # Supervisor writes risk_level; action-model.md legacy schema used risk.
    risk = data.get("risk_level") or data.get("risk", "unknown")
    message = (
        f"⚠️ *Pending Action*\n"
        f"ID: `{action_id}`\n"
        f"Type: `{data.get('type', 'unknown')}`\n"
        f"Service: `{data.get('service', 'unknown')}`\n"
        f"Node: `{data.get('node', 'unknown')}`\n"
        f"Risk: *{risk}*\n"
    )
    # description carries the human-readable substance of the action (required for
    # alert_only actions where it is the entire operator-visible message).
    description = data.get("description", "")
    if description:
        truncated = description[:300] + ("..." if len(description) > 300 else "")
        message += f"Description: `{truncated}`\n"
    # Legacy details block (old action-model.md schema), kept for backwards compat.
    if "details" in data:
        details_str = json.dumps(data["details"], indent=2)
        if len(details_str) > 1000:
            details_str = details_str[:1000] + "..."
        message += f"\nDetails:\n```json\n{details_str}\n```"
    return message


class ApprovalBot:
|
||||
def __init__(self):
|
||||
self.pending_dir = ACTIONS_ROOT / "pending"
|
||||
self.approved_dir = ACTIONS_ROOT / "approved"
|
||||
self.rejected_dir = ACTIONS_ROOT / "rejected"
|
||||
# Track which action IDs we have already notified in this session to avoid spam
|
||||
self.notified_actions = set()
|
||||
|
||||
async def check_pending_actions(self, context: ContextTypes.DEFAULT_TYPE):
|
||||
"""Job that periodically checks for new pending action files."""
|
||||
if not self.pending_dir.exists():
|
||||
return
|
||||
|
||||
try:
|
||||
for action_file in self.pending_dir.glob("*.json"):
|
||||
action_id = action_file.stem
|
||||
if action_id in self.notified_actions:
|
||||
continue
|
||||
|
||||
try:
|
||||
data = json.loads(action_file.read_text())
|
||||
# Only notify if it's truly pending
|
||||
if data.get("status") == "pending":
|
||||
await self.notify_users(context, action_id, data)
|
||||
self.notified_actions.add(action_id)
|
||||
except Exception as e:
|
||||
logger.error(f"Error processing action file {action_file}: {e}")
|
||||
except Exception as e:
|
||||
logger.error(f"Error scanning pending directory: {e}")
|
||||
|
||||
async def notify_users(self, context: ContextTypes.DEFAULT_TYPE, action_id: str, data: dict):
|
||||
"""Sends an approval request message to all allowed users."""
|
||||
message = _format_pending_action(action_id, data)
|
||||
|
||||
keyboard = [
|
||||
[
|
||||
InlineKeyboardButton("✅ Approve", callback_data=f"approve:{action_id}"),
|
||||
InlineKeyboardButton("❌ Reject", callback_data=f"reject:{action_id}"),
|
||||
]
|
||||
]
|
||||
reply_markup = InlineKeyboardMarkup(keyboard)
|
||||
|
||||
for user_id in ALLOWED_IDS:
|
||||
try:
|
||||
await context.bot.send_message(
|
||||
chat_id=user_id,
|
||||
text=message,
|
||||
parse_mode="Markdown",
|
||||
reply_markup=reply_markup
|
||||
)
|
||||
logger.info(f"Notified user {user_id} about action {action_id}")
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to notify user {user_id}: {e}")
|
||||
|
||||
    async def handle_callback(self, update: Update, context: ContextTypes.DEFAULT_TYPE):
        """Handles button clicks for Approve/Reject."""
        query = update.callback_query
        user_id = query.from_user.id

        if user_id not in ALLOWED_IDS:
            await query.answer("Unauthorized", show_alert=True)
            return

        await query.answer()

        cb_data = query.data
        if not cb_data or ":" not in cb_data:
            return

        action, action_id = cb_data.split(":", 1)
        # Accept only the two known verbs; anything else must not be treated as a rejection.
        if action not in ("approve", "reject"):
            logger.warning(f"Ignoring unknown callback action: {action!r}")
            return
        target_status = "approved" if action == "approve" else "rejected"

        # Try the Control Plane API first; fall back to a local file move if the call fails.
        success = await post_api("/action/mutate", {"id": action_id, "status": target_status})
        msg = "Success" if success else "API call failed"

        if not success:
            # Fallback to direct disk manipulation (original behavior)
            success, msg = self.move_action(action_id, target_status, user_id, query.from_user.username or str(user_id))

        if success:
            status_text = "✅ Approved" if target_status == "approved" else "❌ Rejected"
            await query.edit_message_text(
                text=query.message.text + f"\n\n{status_text} by {query.from_user.first_name}",
                parse_mode="Markdown"
            )
            # Remove from the notified set as it's no longer pending
            self.notified_actions.discard(action_id)
        else:
            await query.message.reply_text(f"Failed to process action {action_id}: {msg}")
|
||||
|
||||
    def move_action(self, action_id, target_status, user_id, username):
        """Moves action file and updates its status and history."""
        source_path = self.pending_dir / f"{action_id}.json"
        if not source_path.exists():
            return False, "Action file no longer exists in pending."

        target_dir = self.approved_dir if target_status == "approved" else self.rejected_dir
        target_dir.mkdir(parents=True, exist_ok=True)
        target_path = target_dir / f"{action_id}.json"

        try:
            data = json.loads(source_path.read_text())
            current_status = data.get("status", "pending")

            # Update data; use one timestamp so both fields agree.
            now = time.time()
            data["status"] = target_status
            data["updated_at"] = now

            history = data.get("transition_history", [])
            history.append({
                "from": current_status,
                "to": target_status,
                "timestamp": now,
                "by": f"tg:{username}"
            })
            data["transition_history"] = history

            # Write to the new location first, then delete the old file.
            # This is not atomic: a crash between the two steps leaves both copies.
            target_path.write_text(json.dumps(data, indent=2))
            source_path.unlink()
            logger.info(f"Action {action_id} moved from {current_status} to {target_status} by {username}")
            return True, "Success"
        except Exception as e:
            logger.error(f"Error moving action file: {e}")
            return False, str(e)
|
||||
|
||||
async def start_command(update: Update, context: ContextTypes.DEFAULT_TYPE):
|
||||
"""Simple start command to help users find their ID."""
|
||||
user = update.effective_user
|
||||
message = (
|
||||
f"Hello {user.first_name}! 🤖\n"
|
||||
f"Your Telegram User ID is: `{user.id}`\n\n"
|
||||
)
|
||||
if user.id in ALLOWED_IDS:
|
||||
message += "✅ You are authorized to manage the homelab.\n\n"
|
||||
message += "Use /help to see available commands."
|
||||
else:
|
||||
message += "❌ You are NOT authorized. Add your ID to `TELEGRAM_ALLOWED_USER_IDS`."
|
||||
|
||||
await update.message.reply_text(message, parse_mode="Markdown")
|
||||
|
||||
async def status_command(update: Update, context: ContextTypes.DEFAULT_TYPE):
|
||||
if update.effective_user.id not in ALLOWED_IDS: return
|
||||
res = await fetch_api("/summary")
|
||||
status = "✅ Online" if res else "❌ Unreachable"
|
||||
message = (
|
||||
f"🤖 *Telegram Bot Status*\n"
|
||||
f"Control Plane API: {status}\n"
|
||||
f"Target URL: `{CONTROL_PLANE_URL}`\n"
|
||||
)
|
||||
await update.message.reply_text(message, parse_mode="Markdown")
|
||||
|
||||
async def summary_command(update: Update, context: ContextTypes.DEFAULT_TYPE):
|
||||
if update.effective_user.id not in ALLOWED_IDS: return
|
||||
data = await fetch_api("/summary")
|
||||
if not data:
|
||||
await update.message.reply_text("❌ Failed to fetch summary from Control Plane.")
|
||||
return
|
||||
|
||||
msg = "📊 *System Summary*\n"
|
||||
msg += f"Status: `{data.get('status', 'unknown')}`\n"
|
||||
msg += f"Nodes: {data.get('node_count', 0)}\n"
|
||||
msg += f"Services: {data.get('service_count', 0)}\n"
|
||||
msg += f"Active Incidents: {data.get('active_incidents_count', 0)}\n"
|
||||
if data.get('stale'):
|
||||
msg += "\n⚠️ *Warning: Data is stale!*"
|
||||
|
||||
await update.message.reply_text(msg, parse_mode="Markdown")
|
||||
|
||||
async def nodes_command(update: Update, context: ContextTypes.DEFAULT_TYPE):
|
||||
if update.effective_user.id not in ALLOWED_IDS: return
|
||||
nodes = await fetch_api("/nodes")
|
||||
if nodes is None:
|
||||
await update.message.reply_text("❌ Failed to fetch nodes.")
|
||||
return
|
||||
|
||||
if not nodes:
|
||||
await update.message.reply_text("No nodes discovered in the fleet.")
|
||||
return
|
||||
|
||||
msg = "🖥️ *Nodes Status*\n"
|
||||
for node in nodes:
|
||||
health_icon = "✅" if node.get('health') == 'nominal' else "⚠️" if node.get('health') == 'degraded' else "❌"
|
||||
msg += f"{health_icon} *{node.get('hostname')}*: `{node.get('status', 'unknown')}`\n"
|
||||
msg += f" Last seen: {node.get('last_seen', 'N/A')}\n"
|
||||
|
||||
await update.message.reply_text(msg, parse_mode="Markdown")
|
||||
|
||||
async def services_command(update: Update, context: ContextTypes.DEFAULT_TYPE):
|
||||
if update.effective_user.id not in ALLOWED_IDS: return
|
||||
services = await fetch_api("/services")
|
||||
if services is None:
|
||||
await update.message.reply_text("❌ Failed to fetch services.")
|
||||
return
|
||||
|
||||
# Summarize by node
|
||||
nodes = {}
|
||||
for s in services:
|
||||
node = s.get("node", "unknown")
|
||||
if node not in nodes: nodes[node] = []
|
||||
nodes[node].append(s)
|
||||
|
||||
msg = "⚙️ *Services Summary*\n"
|
||||
if not nodes:
|
||||
msg += "No services discovered."
|
||||
else:
|
||||
for node, svc_list in sorted(nodes.items()):
|
||||
nominal = len([s for s in svc_list if s.get("health") == "nominal"])
|
||||
msg += f"• *{node}*: {nominal}/{len(svc_list)} nominal\n"
|
||||
|
||||
msg += "\nUse /unhealthy to see issues."
|
||||
await update.message.reply_text(msg, parse_mode="Markdown")
|
||||
|
||||
async def unhealthy_command(update: Update, context: ContextTypes.DEFAULT_TYPE):
    if update.effective_user.id not in ALLOWED_IDS:
        return
    services = await fetch_api("/services")
    nodes = await fetch_api("/nodes")

    # If both endpoints failed, say so instead of reporting a healthy fleet.
    if services is None and nodes is None:
        await update.message.reply_text("❌ Failed to fetch services and nodes.")
        return

    msg = "⚠️ *Unhealthy Components*\n"
    found = False

    if services:
        for s in services:
            # health may be missing or null; treat both as "unknown"
            health = (s.get("health") or "unknown").lower()
            if health != "nominal":
                msg += f"• Service *{s.get('name')}* on *{s.get('node')}*: `{health}`\n"
                found = True

    if nodes:
        for n in nodes:
            checks = n.get("checks", {})
            if isinstance(checks, str):
                # checks may arrive as a JSON-encoded string
                try:
                    checks = json.loads(checks)
                except ValueError:
                    checks = {}
            if not isinstance(checks, dict):
                checks = {}

            docker = checks.get("docker", {})
            if docker.get("status") == "ok":
                for c in docker.get("containers", []):
                    if c.get("state") != "running":
                        msg += f"• Container *{c.get('name')}* on *{n.get('hostname')}*: `{c.get('state')}`\n"
                        found = True

    if not found:
        msg += "All systems nominal. ✅"

    await update.message.reply_text(msg, parse_mode="Markdown")
|
||||
|
||||
async def incidents_command(update: Update, context: ContextTypes.DEFAULT_TYPE):
|
||||
if update.effective_user.id not in ALLOWED_IDS: return
|
||||
incidents = await fetch_api("/incidents")
|
||||
if incidents is None:
|
||||
await update.message.reply_text("❌ Failed to fetch incidents.")
|
||||
return
|
||||
|
||||
active = [i for i in incidents if i.get("status") not in ("resolved", "closed")]
|
||||
if not active:
|
||||
await update.message.reply_text("No active incidents. ✅")
|
||||
return
|
||||
|
||||
msg = "🚨 *Active Incidents*\n"
|
||||
for inc in active:
|
||||
severity = inc.get('severity', 'info').upper()
|
||||
msg += f"• [{severity}] *{inc.get('type')}*: {inc.get('message')}\n"
|
||||
|
||||
await update.message.reply_text(msg, parse_mode="Markdown")
|
||||
|
||||
async def actions_command(update: Update, context: ContextTypes.DEFAULT_TYPE):
|
||||
if update.effective_user.id not in ALLOWED_IDS: return
|
||||
actions = await fetch_api("/actions")
|
||||
if actions is None:
|
||||
await update.message.reply_text("❌ Actions endpoint unavailable.")
|
||||
return
|
||||
|
||||
msg = "⚡ *Actions Summary*\n"
|
||||
total = 0
|
||||
for status, act_list in actions.items():
|
||||
if act_list:
|
||||
msg += f"• {status.capitalize()}: {len(act_list)}\n"
|
||||
total += len(act_list)
|
||||
|
||||
if total == 0:
|
||||
msg = "No actions recorded."
|
||||
|
||||
await update.message.reply_text(msg, parse_mode="Markdown")
|
||||
|
||||
async def help_command(update: Update, context: ContextTypes.DEFAULT_TYPE):
|
||||
msg = (
|
||||
"📖 *Supported Commands*\n\n"
|
||||
"/status - Check bot and API connectivity\n"
|
||||
"/summary - System health overview\n"
|
||||
"/nodes - List homelab nodes and their status\n"
|
||||
"/services - Summary of services across nodes\n"
|
||||
"/unhealthy - List all unhealthy components\n"
|
||||
"/incidents - View active incidents\n"
|
||||
"/actions - Summary of operator actions\n"
|
||||
"/help - Show this help message\n\n"
|
||||
"Free text will be handled by the guidance system."
|
||||
)
|
||||
await update.message.reply_text(msg, parse_mode="Markdown")
|
||||
|
||||
async def handle_fallback(update: Update, context: ContextTypes.DEFAULT_TYPE):
|
||||
"""Handles non-command messages."""
|
||||
if update.effective_user.id not in ALLOWED_IDS: return
|
||||
|
||||
if ENABLE_LLM_FALLBACK and OPENCLAW_BASE_URL:
|
||||
# Placeholder for OpenClaw LLM fallback
|
||||
# In a real scenario, this would call the LLM API
|
||||
logger.info(f"LLM fallback requested for: {update.message.text}")
|
||||
|
||||
await update.message.reply_text(
|
||||
"Use /summary, /nodes, /services, /unhealthy, /incidents, /actions."
|
||||
)
|
||||
|
||||
async def run_bot():
|
||||
if not TOKEN:
|
||||
print("CRITICAL: TELEGRAM_BOT_TOKEN is not set. Telegram bot will not start.")
|
||||
# Exit cleanly instead of raising: per the requirement, a missing Telegram token
# must disable the bot without failing the rest of the stack.
|
||||
return
|
||||
|
||||
bot_logic = ApprovalBot()
|
||||
|
||||
application = ApplicationBuilder().token(TOKEN).build()
|
||||
|
||||
application.add_handler(CommandHandler("start", start_command))
|
||||
application.add_handler(CommandHandler("status", status_command))
|
||||
application.add_handler(CommandHandler("summary", summary_command))
|
||||
application.add_handler(CommandHandler("nodes", nodes_command))
|
||||
application.add_handler(CommandHandler("services", services_command))
|
||||
application.add_handler(CommandHandler("unhealthy", unhealthy_command))
|
||||
application.add_handler(CommandHandler("incidents", incidents_command))
|
||||
application.add_handler(CommandHandler("actions", actions_command))
|
||||
application.add_handler(CommandHandler("help", help_command))
|
||||
|
||||
application.add_handler(MessageHandler(filters.TEXT & (~filters.COMMAND), handle_fallback))
|
||||
application.add_handler(CallbackQueryHandler(bot_logic.handle_callback))
|
||||
|
||||
# Schedule the pending actions check
|
||||
job_queue = application.job_queue
|
||||
if job_queue:
|
||||
job_queue.run_repeating(bot_logic.check_pending_actions, interval=10, first=5)
|
||||
else:
|
||||
logger.warning("JobQueue is not available. Periodic pending actions check will be skipped.")
|
||||
|
||||
logger.info("Starting Telegram Approval Bot...")
|
||||
await application.initialize()
|
||||
await application.start()
|
||||
await application.updater.start_polling()
|
||||
|
||||
# Run until the application is stopped
|
||||
stop_event = asyncio.Event()
|
||||
try:
|
||||
await stop_event.wait()
|
||||
except (KeyboardInterrupt, SystemExit):
|
||||
logger.info("Stopping bot...")
|
||||
finally:
|
||||
await application.stop()
|
||||
await application.shutdown()
|
||||
|
||||
if __name__ == "__main__":
|
||||
try:
|
||||
asyncio.run(run_bot())
|
||||
except KeyboardInterrupt:
|
||||
pass
|
||||
except Exception as e:
|
||||
logger.error(f"Fatal error: {e}")
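
The `move_action` fallback in `bot.py` writes the updated JSON to the target directory and then unlinks the source, which is two separate steps rather than one atomic operation. A minimal sketch of a safer variant follows, assuming the destination directory is on a filesystem where `os.replace` is atomic (true for a rename within one POSIX filesystem). The helper name `move_json_atomically` and the `.tmp` suffix are illustrative assumptions, not part of the bot.

```python
import json
import os
import tempfile
from pathlib import Path


def move_json_atomically(source: Path, target: Path, data: dict) -> None:
    """Write `data` to `target` via a temp file, then remove `source`.

    Hypothetical helper for illustration only.
    """
    target.parent.mkdir(parents=True, exist_ok=True)
    # Create the temp file inside the destination directory so the final
    # rename never crosses a filesystem boundary.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(data, handle, indent=2)
        # os.replace publishes the file in one step: a reader sees either
        # no file or the complete file, never a partial write.
        os.replace(tmp_name, target)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    # Drop the pending copy only once the new file is fully in place.
    source.unlink()
```

Because the temporary file ends in `.tmp`, a consumer that globs `*.json` in the target directory would not pick up a half-written file. A crash after the rename but before the unlink still leaves both copies, so a reader may need to treat the target directory as authoritative.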
|
||||
|
|
@ -1 +0,0 @@
|
|||
python-telegram-bot[job-queue]==20.7
|
||||
|
|
@ -1,38 +0,0 @@
|
|||
"""Stub telegram before bot.py is imported so pytest doesn't need the real package."""
|
||||
from __future__ import annotations
|
||||
|
||||
import sys
|
||||
import types
|
||||
from unittest.mock import MagicMock
|
||||
|
||||
|
||||
def _make_telegram_stub() -> types.ModuleType:
|
||||
mod = types.ModuleType("telegram")
|
||||
mod.Update = MagicMock
|
||||
mod.InlineKeyboardButton = MagicMock
|
||||
mod.InlineKeyboardMarkup = MagicMock
|
||||
return mod
|
||||
|
||||
|
||||
def _make_telegram_ext_stub() -> types.ModuleType:
|
||||
mod = types.ModuleType("telegram.ext")
|
||||
mod.ApplicationBuilder = MagicMock
|
||||
|
||||
# ContextTypes.DEFAULT_TYPE is referenced as a type annotation at class-body
|
||||
# evaluation time, so it must be a real attribute, not a dynamic MagicMock attr.
|
||||
ContextTypesMock = MagicMock()
|
||||
ContextTypesMock.DEFAULT_TYPE = type(None)
|
||||
mod.ContextTypes = ContextTypesMock
|
||||
|
||||
mod.CommandHandler = MagicMock
|
||||
mod.CallbackQueryHandler = MagicMock
|
||||
mod.MessageHandler = MagicMock
|
||||
mod.filters = MagicMock()
|
||||
return mod
|
||||
|
||||
|
||||
# Insert before any import of bot.py
|
||||
if "telegram" not in sys.modules:
|
||||
sys.modules["telegram"] = _make_telegram_stub()
|
||||
if "telegram.ext" not in sys.modules:
|
||||
sys.modules["telegram.ext"] = _make_telegram_ext_stub()
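
The stubbing in this conftest works because Python's import system consults `sys.modules` before searching the filesystem. A minimal sketch of the same technique in isolation; `fake_sdk` is a made-up package name used only for this demonstration.

```python
import importlib
import sys
import types
from unittest.mock import MagicMock

# Build an empty module object and attach the one attribute we need.
# "fake_sdk" is hypothetical; no such package has to be installed.
stub = types.ModuleType("fake_sdk")
stub.Client = MagicMock
sys.modules["fake_sdk"] = stub

# The import finds the entry in sys.modules and returns the stub directly.
imported = importlib.import_module("fake_sdk")

# Instantiating the stubbed class yields a MagicMock that records calls.
client = imported.Client()
client.send("hello")
client.send.assert_called_once_with("hello")
```

The registration has to happen before the module under test is first imported, which is why the conftest does it at module level rather than inside a fixture.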
|
||||
|
|
@ -1,116 +0,0 @@
|
|||
"""Tests for _format_pending_action — no Telegram connection required.
|
||||
|
||||
telegram stubs are set up in conftest.py before this module is imported.
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import sys
|
||||
from pathlib import Path
|
||||
|
||||
import pytest
|
||||
|
||||
sys.path.insert(0, str(Path(__file__).parent.parent))
|
||||
from bot import _format_pending_action
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Bug 1 — risk_level field
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
def test_risk_level_shown_when_present():
|
||||
data = {
|
||||
"type": "container_restart", "service": "homeassistant",
|
||||
"node": "chelsty-ha", "risk_level": "low",
|
||||
}
|
||||
msg = _format_pending_action("container-restart-chelsty-ha-homeassistant", data)
|
||||
assert "Risk: *low*" in msg
|
||||
assert "unknown" not in msg
|
||||
|
||||
|
||||
def test_risk_falls_back_to_legacy_risk_key():
|
||||
data = {
|
||||
"type": "redeploy", "service": "mosquitto",
|
||||
"node": "chelsty-infra", "risk": "guarded",
|
||||
}
|
||||
msg = _format_pending_action("redeploy-chelsty-infra-mosquitto", data)
|
||||
assert "Risk: *guarded*" in msg
|
||||
|
||||
|
||||
def test_risk_unknown_when_both_absent():
|
||||
data = {"type": "redeploy", "service": "foo", "node": "bar"}
|
||||
msg = _format_pending_action("redeploy-bar-foo", data)
|
||||
assert "Risk: *unknown*" in msg
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Bug 2 — description field
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
def test_description_shown_for_alert_only():
|
||||
data = {
|
||||
"type": "alert_only", "service": "homeassistant",
|
||||
"node": "chelsty-ha", "risk_level": "info",
|
||||
"description": "3 entities unavailable for >1h",
|
||||
}
|
||||
msg = _format_pending_action("alert-ha-entity-unavailable-chelsty-ha", data)
|
||||
assert "3 entities unavailable for >1h" in msg
|
||||
assert "Description:" in msg
|
||||
|
||||
|
||||
def test_description_shown_for_container_restart():
|
||||
data = {
|
||||
"type": "container_restart", "service": "homeassistant",
|
||||
"node": "chelsty-ha", "risk_level": "low",
|
||||
"description": "Restart 'homeassistant' on chelsty-ha: HA WebSocket unresponsive",
|
||||
}
|
||||
msg = _format_pending_action("container-restart-chelsty-ha-homeassistant", data)
|
||||
assert "HA WebSocket unresponsive" in msg
|
||||
|
||||
|
||||
def test_description_absent_no_crash():
|
||||
data = {"type": "redeploy", "service": "foo", "node": "bar", "risk_level": "guarded"}
|
||||
msg = _format_pending_action("redeploy-bar-foo", data)
|
||||
assert "Description:" not in msg
|
||||
assert "Risk: *guarded*" in msg
|
||||
|
||||
|
||||
def test_description_truncated_at_300_chars():
|
||||
long_desc = "x" * 400
|
||||
data = {
|
||||
"type": "alert_only", "service": "homeassistant",
|
||||
"node": "chelsty-ha", "risk_level": "info",
|
||||
"description": long_desc,
|
||||
}
|
||||
msg = _format_pending_action("alert-ha-foo-chelsty-ha", data)
|
||||
assert "x" * 300 in msg
|
||||
assert "..." in msg
|
||||
assert "x" * 301 not in msg
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Combined — real HA alert_only action shape
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
def test_ha_alert_only_full_action():
|
||||
"""Mirrors an actual alert_only action written by supervisor._generate_ha_alert_only."""
|
||||
data = {
|
||||
"action_id": "alert-ha-entity-unavailable-chelsty-ha",
|
||||
"type": "alert_only",
|
||||
"node": "chelsty-ha",
|
||||
"service": "homeassistant",
|
||||
"risk_level": "info",
|
||||
"confidence": 1.0,
|
||||
"description": "3 entities unavailable for >1h: sensor.power, binary_sensor.window",
|
||||
"status": "pending",
|
||||
"payload": {
|
||||
"location_tag": "chelsty",
|
||||
"reason": "ha_entity_unavailable_long",
|
||||
"count": 3,
|
||||
},
|
||||
}
|
||||
msg = _format_pending_action(data["action_id"], data)
|
||||
assert "alert_only" in msg
|
||||
assert "chelsty-ha" in msg
|
||||
assert "Risk: *info*" in msg
|
||||
assert "3 entities unavailable" in msg
|
||||
assert "unknown" not in msg
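
The truncation test in this file pins down a boundary rule: descriptions are cut at 300 characters and an ellipsis is appended only when something was actually cut. A small sketch of that rule as a standalone function; the name `truncate` is an assumption for illustration, since `bot.py` inlines the expression rather than defining a helper.

```python
def truncate(text: str, limit: int = 300) -> str:
    """Return `text` cut to `limit` characters, adding "..." only if it was cut."""
    return text[:limit] + ("..." if len(text) > limit else "")


# Exactly at the limit: nothing is cut, so no ellipsis is added.
at_limit = truncate("x" * 300)

# Over the limit: the first 300 characters survive, followed by "...".
over_limit = truncate("x" * 400)
```

Checking the value exactly at the limit guards against an off-by-one in the comparison, a case the existing tests do not cover directly.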
|
||||
|
|
@ -1,7 +0,0 @@
|
|||
FROM python:3.11-slim
|
||||
|
||||
WORKDIR /app
|
||||
COPY web.py index.html ./
|
||||
|
||||
EXPOSE 8080
|
||||
CMD ["python", "web.py"]
|
||||
|
|
@ -1,769 +0,0 @@
|
|||
<!doctype html>
|
||||
<html lang="en">
|
||||
<head>
|
||||
<meta charset="utf-8">
|
||||
<meta name="viewport" content="width=device-width, initial-scale=1">
|
||||
<title>Operator Control Plane</title>
|
||||
<style>
|
||||
:root {
|
||||
--bg-color: #0a0c0e;
|
||||
--sidebar-color: #14171a;
|
||||
--card-color: #1c2024;
|
||||
--border-color: #2a3540;
|
||||
--text-color: #e7edf3;
|
||||
--text-muted: #94a3b8;
|
||||
--accent-color: #3eaf7c;
|
||||
--nominal: #3eaf7c;
|
||||
--degraded: #e7c000;
|
||||
--unstable: #e67e22;
|
||||
--reconciling: #3498db;
|
||||
--error: #c0392b;
|
||||
--safe: #3eaf7c;
|
||||
--guarded: #e67e22;
|
||||
--dangerous: #c0392b;
|
||||
}
|
||||
|
||||
body {
|
||||
margin: 0;
|
||||
font-family: 'Inter', system-ui, -apple-system, sans-serif;
|
||||
background: var(--bg-color);
|
||||
color: var(--text-color);
|
||||
display: flex;
|
||||
height: 100vh;
|
||||
overflow: hidden;
|
||||
}
|
||||
|
||||
/* Sidebar */
|
||||
.sidebar {
|
||||
width: 240px;
|
||||
background: var(--sidebar-color);
|
||||
border-right: 1px solid var(--border-color);
|
||||
display: flex;
|
||||
flex-direction: column;
|
||||
flex-shrink: 0;
|
||||
}
|
||||
|
||||
.sidebar-header {
|
||||
padding: 24px;
|
||||
font-weight: 800;
|
||||
font-size: 14px;
|
||||
letter-spacing: 0.1em;
|
||||
color: var(--accent-color);
|
||||
border-bottom: 1px solid var(--border-color);
|
||||
}
|
||||
|
||||
.nav-list {
|
||||
list-style: none;
|
||||
padding: 12px 0;
|
||||
margin: 0;
|
||||
flex-grow: 1;
|
||||
}
|
||||
|
||||
.nav-item {
|
||||
padding: 12px 24px;
|
||||
cursor: pointer;
|
||||
font-size: 14px;
|
||||
color: var(--text-muted);
|
||||
transition: all 0.2s;
|
||||
display: flex;
|
||||
align-items: center;
|
||||
gap: 12px;
|
||||
}
|
||||
|
||||
.nav-item:hover {
|
||||
background: rgba(255, 255, 255, 0.05);
|
||||
color: var(--text-color);
|
||||
}
|
||||
|
||||
.nav-item.active {
|
||||
background: rgba(62, 175, 124, 0.1);
|
||||
color: var(--accent-color);
|
||||
border-left: 3px solid var(--accent-color);
|
||||
}
|
||||
|
||||
.sidebar-footer {
|
||||
padding: 16px;
|
||||
border-top: 1px solid var(--border-color);
|
||||
font-size: 12px;
|
||||
}
|
||||
|
||||
/* Content Area */
|
||||
.main-content {
|
||||
flex-grow: 1;
|
||||
display: flex;
|
||||
flex-direction: column;
|
||||
overflow: hidden;
|
||||
}
|
||||
|
||||
header {
|
||||
height: 64px;
|
||||
border-bottom: 1px solid var(--border-color);
|
||||
display: flex;
|
||||
align-items: center;
|
||||
padding: 0 24px;
|
||||
justify-content: space-between;
|
||||
background: var(--bg-color);
|
||||
}
|
||||
|
||||
.view-title {
|
||||
font-size: 18px;
|
||||
font-weight: 600;
|
||||
}
|
||||
|
||||
.content-scroll {
|
||||
flex-grow: 1;
|
||||
overflow-y: auto;
|
||||
padding: 24px;
|
||||
}
|
||||
|
||||
/* Cards & Grids */
|
||||
.grid {
|
||||
display: grid;
|
||||
grid-template-columns: repeat(auto-fill, minmax(350px, 1fr));
|
||||
gap: 20px;
|
||||
}
|
||||
|
||||
.card {
|
||||
background: var(--card-color);
|
||||
border: 1px solid var(--border-color);
|
||||
padding: 20px;
|
||||
border-radius: 4px;
|
||||
position: relative;
|
||||
}
|
||||
|
||||
.card-header {
|
||||
display: flex;
|
||||
justify-content: space-between;
|
||||
align-items: center;
|
||||
margin-bottom: 16px;
|
||||
}
|
||||
|
||||
.card-title {
|
||||
font-weight: 700;
|
||||
font-size: 16px;
|
||||
}
|
||||
|
||||
/* Status Badges */
|
||||
.badge {
|
||||
padding: 4px 8px;
|
||||
border-radius: 4px;
|
||||
font-size: 11px;
|
||||
font-weight: 700;
|
||||
text-transform: uppercase;
|
||||
}
|
||||
|
||||
.status-nominal { background: rgba(62, 175, 124, 0.1); color: var(--nominal); }
|
||||
.status-degraded { background: rgba(231, 192, 0, 0.1); color: var(--degraded); }
|
||||
.status-unstable { background: rgba(230, 126, 34, 0.1); color: var(--unstable); }
|
||||
.status-reconciling { background: rgba(52, 152, 219, 0.1); color: var(--reconciling); }
|
||||
.status-error { background: rgba(192, 57, 43, 0.1); color: var(--error); }
|
||||
|
||||
/* Timeline */
|
||||
.timeline {
|
||||
display: flex;
|
||||
flex-direction: column;
|
||||
gap: 12px;
|
||||
}
|
||||
|
||||
.event {
|
||||
padding: 12px;
|
||||
border-left: 2px solid var(--border-color);
|
||||
background: rgba(255, 255, 255, 0.02);
|
||||
font-family: ui-monospace, monospace;
|
||||
font-size: 13px;
|
||||
}
|
||||
|
||||
.event.high { border-left-color: var(--error); }
|
||||
.event.medium { border-left-color: var(--unstable); }
|
||||
.event.low { border-left-color: var(--nominal); }
|
||||
|
||||
.event-header {
|
||||
display: flex;
|
||||
justify-content: space-between;
|
||||
margin-bottom: 4px;
|
||||
color: var(--text-muted);
|
||||
}
|
||||
|
||||
/* Forms & Inputs */
|
||||
.controls {
|
||||
display: flex;
|
||||
gap: 12px;
|
||||
margin-top: 20px;
|
||||
}
|
||||
|
||||
input, button {
|
||||
background: var(--card-color);
|
||||
border: 1px solid var(--border-color);
|
||||
color: var(--text-color);
|
||||
padding: 8px 16px;
|
||||
font-size: 14px;
|
||||
border-radius: 4px;
|
||||
}
|
||||
|
||||
button {
|
||||
cursor: pointer;
|
||||
font-weight: 600;
|
||||
}
|
||||
|
||||
button:hover { background: var(--border-color); }
|
||||
|
||||
.btn-primary { background: var(--accent-color); color: white; border: none; }
|
||||
.btn-primary:hover { background: #359b6d; }
|
||||
|
||||
/* Utility */
|
||||
.hidden { display: none !important; }
|
||||
.mono { font-family: ui-monospace, monospace; }
|
||||
.label { color: var(--text-muted); font-size: 12px; margin-bottom: 4px; }
|
||||
.value { font-weight: 500; margin-bottom: 12px; }
|
||||
|
||||
.risk-safe { background: rgba(62, 175, 124, 0.1); color: var(--safe); }
|
||||
.risk-guarded { background: rgba(230, 126, 34, 0.1); color: var(--guarded); }
|
||||
.risk-dangerous { background: rgba(192, 57, 43, 0.1); color: var(--dangerous); }
|
||||
|
||||
</style>
|
||||
</head>
|
||||
<body>
|
||||
<aside class="sidebar">
|
||||
<div class="sidebar-header">HOMELAB OPERATOR</div>
|
||||
<ul class="nav-list">
|
||||
<li class="nav-item active" onclick="showView('dashboard', this)">
|
||||
<span>Dashboard</span>
|
||||
</li>
|
||||
<li class="nav-item" onclick="showView('actions', this)">
|
||||
<span>Action Queue</span>
|
||||
</li>
|
||||
<li class="nav-item" onclick="showView('nodes', this)">
|
||||
<span>Nodes</span>
|
||||
</li>
|
||||
<li class="nav-item" onclick="showView('services', this)">
|
||||
<span>Services</span>
|
||||
</li>
|
||||
<li class="nav-item" onclick="showView('deployments', this)">
|
||||
<span>Deployments</span>
|
||||
</li>
|
||||
<li class="nav-item" onclick="showView('topology', this)">
|
||||
<span>Topology</span>
|
||||
</li>
|
||||
<li class="nav-item" onclick="showView('events', this)">
|
||||
<span>Events</span>
|
||||
</li>
|
||||
<li class="nav-item" onclick="showView('correlation', this)">
|
||||
<span>Correlation</span>
|
||||
</li>
|
||||
<li class="nav-item" onclick="showView('recommendations', this)">
|
||||
<span>Recommendations</span>
|
||||
</li>
|
||||
<li class="nav-item" onclick="showView('settings', this)">
|
||||
<span>Settings</span>
|
||||
</li>
|
||||
</ul>
|
||||
<div class="sidebar-footer">
|
||||
<div id="summary-status">System Status: Loading...</div>
|
||||
</div>
|
||||
</aside>
|
||||
|
||||
<main class="main-content">
|
||||
<div id="stale-banner" class="hidden" style="background:var(--error); color:white; padding:8px 24px; font-weight:bold; font-size:12px; text-align:center; letter-spacing:0.05em">
|
||||
RUNTIME STATE IS STALE
|
||||
</div>
|
||||
<header>
|
||||
<div style="display:flex; align-items:center; gap:20px">
|
||||
<div class="view-title" id="current-view-title">Dashboard</div>
|
||||
<select id="operator-mode" onchange="setOperatorMode(this.value)" style="background:var(--sidebar-color); border:1px solid var(--border-color); color:var(--accent-color); font-weight:bold; font-size:12px; padding:4px 8px">
|
||||
<option value="observe">OBSERVE</option>
|
||||
<option value="recommend">RECOMMEND</option>
|
||||
<option value="approval" selected>APPROVAL</option>
|
||||
<option value="autonomous">AUTONOMOUS</option>
|
||||
<option value="maintenance">MAINTENANCE</option>
|
||||
</select>
|
||||
</div>
|
||||
<div class="header-actions" style="display:flex; gap:8px; align-items:center">
|
||||
<button onclick="refreshData()">Refresh</button>
|
||||
<button id="copy-ai-btn" onclick="copyForAI()">Copy for AI</button>
|
||||
</div>
|
||||
</header>
|
||||
|
||||
<div class="content-scroll">
|
||||
<!-- Dashboard View -->
|
||||
<div id="view-dashboard" class="view">
|
||||
<div class="grid">
|
||||
<div class="card">
|
||||
<div class="card-title">System Overview</div>
|
||||
<div id="dashboard-summary" style="margin-top:20px"></div>
|
||||
</div>
|
||||
<div class="card">
|
||||
<div class="card-title">Pending Actions</div>
|
||||
<div id="dashboard-actions-summary" style="margin-top:20px"></div>
|
||||
</div>
|
||||
<div class="card">
|
||||
<div class="card-title">Active Incidents</div>
|
||||
<div id="dashboard-incidents" style="margin-top:20px"></div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<!-- Actions View -->
|
||||
<div id="view-actions" class="view hidden">
|
||||
<div style="display:grid; grid-template-columns: 1fr 1fr; gap:24px">
|
||||
<div>
|
||||
<h3>Pending Approval</h3>
|
||||
<div id="actions-pending" class="timeline"></div>
|
||||
</div>
|
||||
<div>
|
||||
<h3>Active / History</h3>
|
||||
<div id="actions-history" class="timeline"></div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<!-- Nodes View -->
|
||||
<div id="view-nodes" class="view hidden">
|
||||
<div class="grid" id="nodes-list"></div>
|
||||
</div>
|
||||
|
||||
<!-- Services View -->
|
||||
<div id="view-services" class="view hidden">
|
||||
<div class="grid" id="services-list"></div>
|
||||
</div>
|
||||
|
||||
<!-- Deployments View -->
|
||||
<div id="view-deployments" class="view hidden">
|
||||
<div class="grid" id="deployments-list"></div>
|
||||
</div>
|
||||
|
||||
<!-- Topology View -->
|
||||
<div id="view-topology" class="view hidden">
|
||||
<div class="card" style="min-height:500px">
|
||||
<div class="card-title">Runtime Topology</div>
|
||||
<div id="topology-map" style="margin-top:20px; display:flex; flex-wrap:wrap; gap:40px; justify-content:center"></div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<!-- Events View -->
|
||||
<div id="view-events" class="view hidden">
|
||||
<div class="timeline" id="events-timeline"></div>
|
||||
</div>
|
||||
|
||||
<!-- Correlation View -->
|
||||
<div id="view-correlation" class="view hidden">
|
||||
<div id="correlation-chains" class="grid"></div>
|
||||
</div>
|
||||
|
||||
<!-- Recommendations View -->
|
||||
<div id="view-recommendations" class="view hidden">
|
||||
<div class="grid" id="recommendations-list"></div>
|
||||
</div>
|
||||
|
||||
<!-- Settings View -->
|
||||
<div id="view-settings" class="view hidden">
|
||||
<div class="card">
|
||||
<div class="card-title">Configuration</div>
|
||||
<div id="settings-content" style="margin-top:20px"></div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</main>
|
||||
|
||||
<script>
|
||||
let currentView = 'dashboard';
|
||||
const pollInterval = 5000;
|
||||
|
||||
function showView(viewId, el) {
|
||||
document.querySelectorAll('.view').forEach(v => v.classList.add('hidden'));
|
||||
document.getElementById('view-' + viewId).classList.remove('hidden');
|
||||
document.querySelectorAll('.nav-item').forEach(i => i.classList.remove('active'));
|
||||
if (el) el.classList.add('active');
|
||||
currentView = viewId;
|
||||
document.getElementById('current-view-title').textContent = viewId.charAt(0).toUpperCase() + viewId.slice(1);
|
||||
refreshData();
|
||||
}
|
||||
|
||||
async function fetchData(endpoint) {
|
||||
try {
|
||||
const res = await fetch(endpoint, {cache: 'no-store'});
|
||||
return await res.json();
|
||||
} catch (e) {
|
||||
console.error('Fetch error:', endpoint, e);
|
||||
return null;
|
||||
}
|
||||
}
|
||||
|
||||
async function postData(endpoint, data) {
|
||||
try {
|
||||
const res = await fetch(endpoint, {
|
||||
method: 'POST',
|
||||
headers: {'Content-Type': 'application/json'},
|
||||
body: JSON.stringify(data)
|
||||
});
|
||||
return await res.json();
|
||||
} catch (e) {
|
||||
console.error('Post error:', endpoint, e);
|
||||
return null;
|
||||
}
|
||||
}
|
||||
|
||||
async function mutateAction(id, status) {
|
||||
const res = await postData('/action/mutate', {id, status});
|
||||
if (res && res.status === 'ok') {
|
||||
refreshData();
|
||||
} else {
|
||||
alert('Mutation failed');
|
||||
}
|
||||
}
|
||||
|
||||
async function setOperatorMode(mode) {
|
||||
console.log('Setting operator mode to:', mode);
|
||||
const res = await postData('/mode', {mode});
|
||||
if (res && res.status === 'ok') {
|
||||
console.log('Mode updated successfully');
|
||||
}
|
||||
}
|
||||
|
||||
function formatTime(ts) {
|
||||
if (!ts) return 'N/A';
|
||||
return new Date(ts * 1000).toLocaleString();
|
||||
}
|
||||
|
||||
function getStatusClass(status) {
|
||||
status = (status || '').toLowerCase();
|
||||
if (['nominal', 'healthy', 'ok', 'up'].includes(status)) return 'status-nominal';
|
||||
if (['degraded', 'warning'].includes(status)) return 'status-degraded';
|
||||
if (['unstable'].includes(status)) return 'status-unstable';
|
||||
if (['reconciling'].includes(status)) return 'status-reconciling';
|
||||
if (['error', 'down', 'failed'].includes(status)) return 'status-error';
|
||||
return '';
|
||||
}
|
||||
|
||||
async function refreshData() {
|
||||
// Refresh summary always
|
||||
const summary = await fetchData('/summary');
|
||||
if (summary) {
|
||||
const statusEl = document.getElementById('summary-status');
|
||||
statusEl.textContent = `System Status: ${(summary.status || 'unknown').toUpperCase()}`;
|
||||
statusEl.className = 'sidebar-footer ' + getStatusClass(summary.status);
|
||||
|
||||
// Handle stale state
|
||||
const staleBanner = document.getElementById('stale-banner');
|
||||
if (summary.stale) {
|
||||
staleBanner.classList.remove('hidden');
|
||||
staleBanner.textContent = `CRITICAL: Runtime state is STALE (Last update: ${formatTime(summary.last_update)})`;
|
||||
} else {
|
||||
staleBanner.classList.add('hidden');
|
||||
}
|
||||
|
||||
if (currentView === 'dashboard') {
|
||||
const dashSummary = document.getElementById('dashboard-summary');
|
||||
dashSummary.innerHTML = `
|
||||
<div class="label">Nodes</div><div class="value">${summary.node_count}</div>
|
||||
<div class="label">Services</div><div class="value">${summary.service_count}</div>
|
||||
<div class="label">Last Update</div><div class="value">${formatTime(summary.last_update)}</div>
|
||||
`;
|
||||
}
|
||||
}
|
||||
|
||||
if (currentView === 'dashboard' || currentView === 'actions') {
|
||||
const actions = await fetchData('/actions');
|
||||
if (actions) {
|
||||
if (currentView === 'dashboard') {
|
||||
const dashActions = document.getElementById('dashboard-actions-summary');
|
||||
const pendingCount = actions.pending.length;
|
||||
dashActions.innerHTML = `
|
||||
<div class="label">Pending</div><div class="value" style="color:var(--guarded)">${pendingCount}</div>
|
||||
<div class="label">Running</div><div class="value" style="color:var(--reconciling)">${actions.running.length}</div>
|
||||
`;
|
||||
}
|
||||
if (currentView === 'actions') {
|
||||
const pendingEl = document.getElementById('actions-pending');
|
||||
const historyEl = document.getElementById('actions-history');
|
||||
|
||||
pendingEl.innerHTML = actions.pending.map(a => `
|
||||
<div class="card" style="margin-bottom:12px">
|
||||
<div class="card-header">
|
||||
<div class="card-title">${(a.action_type || a.type || 'unknown').toUpperCase()}</div>
|
||||
<span class="badge risk-${a.risk_level}">${a.risk_level}</span>
|
||||
</div>
|
||||
<p>${a.description || a.action_type || 'No description'}</p>
|
||||
<div class="label">Target</div><div class="value">${a.node || (a.target && a.target.node) || 'unknown'} ${(a.service || (a.target && a.target.service)) || ''}</div>
|
||||
<div class="label">Confidence</div><div class="value">${Math.round((a.confidence || 0)*100)}%</div>
|
||||
<div class="controls">
|
||||
<button class="btn-primary" onclick="mutateAction('${a.id}', 'approved')">Approve</button>
|
||||
<button onclick="mutateAction('${a.id}', 'rejected')">Reject</button>
|
||||
</div>
|
||||
</div>
|
||||
`).join('') || 'No pending actions.';
|
||||
|
||||
const history = [...actions.approved, ...actions.running, ...actions.completed, ...actions.failed, ...actions.rejected];
|
||||
historyEl.innerHTML = history.sort((a,b) => (b.timestamp || b.updated_at || 0) - (a.timestamp || a.updated_at || 0)).map(a => `
|
||||
<div class="event">
|
||||
<div class="event-header">
|
||||
<span>${(a.action_type || a.type || 'unknown').toUpperCase()}</span>
|
||||
<span class="badge ${getStatusClass(a.status)}">${a.status}</span>
|
||||
</div>
|
||||
<div>${a.description || a.action_type || 'No description'}</div>
|
||||
<small>${formatTime(a.timestamp || a.updated_at)} | Target: ${a.node || (a.target && a.target.node) || 'unknown'}</small>
|
||||
${a.status === 'approved' ? `<div class="controls"><button class="btn-primary" onclick="mutateAction('${a.id}', 'running')">Execute</button></div>` : ''}
|
||||
${a.transition_history ? `
|
||||
<div style="margin-top:8px; font-size:10px; color:var(--text-muted)">
|
||||
<strong>Trace:</strong> ${a.transition_history.map(h => `${h.from}->${h.to}`).join(' → ')}
|
||||
</div>
|
||||
` : ''}
|
||||
</div>
|
||||
`).join('') || 'No history.';
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
if (currentView === 'dashboard' || currentView === 'events') {
|
||||
const incidents = await fetchData('/incidents');
|
||||
if (currentView === 'dashboard') {
|
||||
const dashIncidents = document.getElementById('dashboard-incidents');
|
||||
if (!incidents || incidents.length === 0) {
|
||||
dashIncidents.textContent = 'No active incidents.';
|
||||
} else {
|
||||
dashIncidents.innerHTML = incidents.map(inc => `
|
||||
<div class="event ${inc.severity}">
|
||||
<strong>${inc.severity.toUpperCase()}:</strong> ${inc.message}<br>
|
||||
<small>${formatTime(inc.timestamp)} | Node: ${inc.node}</small>
|
||||
</div>
|
||||
`).join('');
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
if (currentView === 'nodes') {
|
||||
const nodes = await fetchData('/nodes');
|
||||
const list = document.getElementById('nodes-list');
|
||||
list.innerHTML = (nodes || []).map(node => `
|
||||
<div class="card">
|
||||
<div class="card-header">
|
||||
<div class="card-title">${node.hostname}</div>
|
||||
<span class="badge ${getStatusClass(node.health)}">${node.health}</span>
|
||||
</div>
|
||||
<div class="label">ID</div><div class="value mono">${node.id}</div>
|
||||
<div class="label">Capabilities</div><div class="value">${node.capabilities.join(', ')}</div>
|
||||
<div class="label">Connectivity</div><div class="value">${node.connectivity}</div>
|
||||
<div class="label">Incidents (24h)</div><div class="value">${node.incidents}</div>
|
||||
<div class="label">Last Seen</div><div class="value">${formatTime(node.last_seen)}</div>
|
||||
<div class="label">Runtime Status</div><div class="value">${node.status}</div>
|
||||
</div>
|
||||
`).join('');
|
||||
}
|
||||
|
||||
if (currentView === 'services') {
|
||||
const services = await fetchData('/services');
|
||||
const list = document.getElementById('services-list');
|
||||
list.innerHTML = (services || []).map(svc => `
|
||||
<div class="card">
|
||||
<div class="card-header">
|
||||
<div class="card-title">${svc.name}</div>
|
||||
<span class="badge ${getStatusClass(svc.health)}">${svc.health}</span>
|
||||
</div>
|
||||
<div class="label">State (Desired/Actual)</div><div class="value">${svc.desired_state} / ${svc.actual_state}</div>
|
||||
<div class="label">Deployment</div><div class="value">${svc.deployment_state}</div>
|
||||
<div class="label">Dependencies</div><div class="value">${svc.dependencies.join(', ') || 'None'}</div>
|
||||
<div class="label">Recommendations</div><div class="value">${svc.recommendations.join(', ') || 'None'}</div>
|
||||
</div>
|
||||
`).join('');
|
||||
}
|
||||
|
||||
if (currentView === 'deployments') {
|
||||
const deps = await fetchData('/deployments');
|
||||
const list = document.getElementById('deployments-list');
|
||||
list.innerHTML = (deps || []).map(dep => `
|
||||
<div class="card">
|
||||
<div class="card-header">
|
||||
<div class="card-title">${dep.service}</div>
|
||||
<span class="badge ${dep.status === 'failed' ? 'status-error' : 'status-reconciling'}">${dep.status}</span>
|
||||
</div>
|
||||
<div class="label">ID</div><div class="value mono">${dep.id}</div>
|
||||
<div class="label">Stage</div><div class="value">${dep.stage}</div>
|
||||
<div class="label">Diagnostics</div><div class="value">${dep.diagnostics || 'No data'}</div>
|
||||
<div class="label">Resumable</div><div class="value">${dep.resumable ? 'Yes' : 'No'}</div>
|
||||
${dep.resumable ? '<button class="btn-primary">Resume</button>' : ''}
|
||||
</div>
|
||||
`).join('');
|
||||
}
|
||||
|
||||
if (currentView === 'events') {
|
||||
const events = await fetchData('/events');
|
||||
const timeline = document.getElementById('events-timeline');
|
||||
timeline.innerHTML = (events || []).map(ev => `
|
||||
<div class="event ${ev.severity}">
|
||||
<div class="event-header">
|
||||
<span>${ev.type.toUpperCase()}</span>
|
||||
<span>${formatTime(ev.timestamp)}</span>
|
||||
</div>
|
||||
<div>${ev.message}</div>
|
||||
<div class="label" style="margin-top:8px">Node: ${ev.node} ${ev.service ? '| Service: ' + ev.service : ''}</div>
|
||||
</div>
|
||||
`).join('');
|
||||
}
|
||||
|
||||
if (currentView === 'recommendations') {
|
||||
const recs = await fetchData('/recommendations');
|
||||
const list = document.getElementById('recommendations-list');
|
||||
list.innerHTML = (recs || []).map(rec => `
|
||||
<div class="card">
|
||||
<div class="card-header">
|
||||
<div class="card-title">${rec.title}</div>
|
||||
<span class="badge risk-${rec.risk_level}">${rec.risk_level}</span>
|
||||
</div>
|
||||
<p>${rec.description}</p>
|
||||
<div class="label">Confidence</div><div class="value">${Math.round(rec.confidence * 100)}%</div>
|
||||
<div class="label">Autonomous Eligible</div><div class="value">${rec.autonomous_eligible ? 'Yes' : 'No'}</div>
|
||||
<div class="label">Blocked Actions</div><div class="value">${rec.blocked_actions.join(', ') || 'None'}</div>
|
||||
<div class="controls">
|
||||
<button class="btn-primary" ${rec.risk_level === 'dangerous' ? 'style="background:var(--dangerous)"' : ''}>Approve Action</button>
|
||||
</div>
|
||||
</div>
|
||||
`).join('');
|
||||
}
|
||||
|
||||
if (currentView === 'topology') {
|
||||
const nodes = await fetchData('/nodes');
|
||||
const services = await fetchData('/services');
|
||||
const topMap = document.getElementById('topology-map');
|
||||
if (nodes && services) {
|
||||
topMap.innerHTML = nodes.map(node => {
|
||||
const nodeServices = services.filter(s => s.node === node.hostname || s.node === node.id);
|
||||
return `
|
||||
<div class="card" style="width:250px; border: 1px solid ${node.health === 'nominal' ? 'var(--border-color)' : 'var(--error)'}">
|
||||
<div class="card-header">
|
||||
<div class="card-title">${node.hostname}</div>
|
||||
<span class="badge ${getStatusClass(node.health)}">${node.health}</span>
|
||||
</div>
|
||||
<div class="label">Capabilities</div>
|
||||
<div class="value" style="font-size:11px">${node.capabilities.join(', ')}</div>
|
||||
<div class="label">Services</div>
|
||||
<div style="font-size:12px; margin-bottom:10px">
|
||||
${nodeServices.length > 0 ? nodeServices.map(s => `
|
||||
<div style="display:flex; justify-content:space-between; margin-bottom:4px; padding:4px; background:rgba(255,255,255,0.03)">
|
||||
<span>${s.name}</span>
|
||||
<span class="${getStatusClass(s.health)}" style="font-size:10px">${s.health}</span>
|
||||
</div>
|
||||
${s.dependencies.length > 0 ? `<div style="font-size:9px; color:var(--text-muted); margin-left:8px; margin-bottom:4px">dep: ${s.dependencies.join(', ')}</div>` : ''}
|
||||
`).join('') : '<div class="value">None</div>'}
|
||||
</div>
|
||||
</div>
|
||||
`;
|
||||
}).join('');
|
||||
}
|
||||
}
|
||||
|
||||
if (currentView === 'correlation') {
|
||||
const incidents = await fetchData('/incidents');
|
||||
const actions = await fetchData('/actions');
|
||||
const list = document.getElementById('correlation-chains');
|
||||
if (incidents && actions) {
|
||||
const allActions = Object.values(actions).flat();
|
||||
list.innerHTML = incidents.map(inc => {
|
||||
const related = allActions.filter(a => a.correlation_chain && a.correlation_chain.includes(inc.id));
|
||||
return `
|
||||
<div class="card">
|
||||
<div class="card-header">
|
||||
<div class="card-title">Incident: ${inc.id || 'INC-001'}</div>
|
||||
<span class="badge status-error">Active</span>
|
||||
</div>
|
||||
<p>${inc.message}</p>
|
||||
<div class="label">Related Actions</div>
|
||||
${related.map(a => `
|
||||
<div class="event" style="margin-top:5px">
|
||||
<strong>${a.action_type || a.type || 'unknown'}</strong> (${a.status})<br>
|
||||
<small>${a.description}</small>
|
||||
</div>
|
||||
`).join('') || '<div class="value">No actions yet</div>'}
|
||||
</div>
|
||||
`;
|
||||
}).join('');
|
||||
}
|
||||
}
|
||||
if (currentView === 'settings') {
|
||||
const config = await fetchData('/config');
|
||||
const content = document.getElementById('settings-content');
|
||||
content.innerHTML = !config ? 'Configuration unavailable.' : `
|
||||
<div class="label">Auto Mode</div>
|
||||
<div class="value">${config.auto_mode ? 'Enabled' : 'Disabled'}</div>
|
||||
<div class="label">Action Thresholds</div>
|
||||
<div class="value mono">${JSON.stringify(config.action_thresholds, null, 2)}</div>
|
||||
<div class="label">Telegram Integration</div>
|
||||
<div class="value" style="color:var(--text-muted)">Ready for mobile approval flows. Hook: /api/v1/telegram/webhook</div>
|
||||
<button onclick="alert('Settings update not implemented in this demo')">Edit Configuration</button>
|
||||
`;
|
||||
}
|
||||
}
|
||||
|
||||
async function copyForAI() {
|
||||
const btn = document.getElementById('copy-ai-btn');
|
||||
const original = btn.textContent;
|
||||
btn.textContent = 'Copying...';
|
||||
btn.disabled = true;
|
||||
|
||||
try {
|
||||
const snap = await fetchData('/snapshot');
|
||||
if (!snap) throw new Error('snapshot fetch failed');
|
||||
|
||||
const now = new Date(snap.timestamp);
|
||||
const dateStr = now.toISOString().slice(0, 16).replace('T', ' ');
|
||||
const lines = [];
|
||||
|
||||
lines.push(`=== HOMELAB SNAPSHOT ${dateStr} ===`);
|
||||
|
||||
if (snap.nodes && snap.nodes.length > 0) {
|
||||
lines.push('NODES: ' + snap.nodes.map(n =>
|
||||
`${(n.hostname || n.id || '?').toUpperCase()} ${(n.health || 'unknown').toUpperCase()}`
|
||||
).join(', '));
|
||||
} else {
|
||||
lines.push('NODES: none');
|
||||
}
|
||||
|
||||
if (snap.non_nominal_services && snap.non_nominal_services.length > 0) {
|
||||
lines.push('ERRORS: ' + snap.non_nominal_services.map(s =>
|
||||
`${s.name} (${s.node}) - ${s.health}`
|
||||
).join(', '));
|
||||
} else {
|
||||
lines.push(`ERRORS: none (${snap.nominal_service_count} nominal)`);
|
||||
}
|
||||
|
||||
const activeIncidents = (snap.incidents || []).filter(i => !['resolved', 'closed'].includes(i.status));
|
||||
if (activeIncidents.length > 0) {
|
||||
lines.push('INCIDENTS: ' + activeIncidents.map(i =>
|
||||
`[${i.severity}] ${i.message} (${i.node})`
|
||||
).join('; '));
|
||||
} else {
|
||||
lines.push('INCIDENTS: none');
|
||||
}
|
||||
|
||||
if (snap.events && snap.events.length > 0) {
|
||||
lines.push(`EVENTS (last ${snap.events.length}):`);
|
||||
snap.events.forEach(ev => {
|
||||
const ts = ev.timestamp
|
||||
? new Date(ev.timestamp * 1000).toISOString().slice(11, 19)
|
||||
: '?';
|
||||
const svc = ev.service ? '/' + ev.service : '';
|
||||
lines.push(` ${ts} [${ev.severity || ev.level || '?'}] ${ev.type} - ${ev.message || ''} (${ev.node || ''}${svc})`);
|
||||
});
|
||||
} else {
|
||||
lines.push('EVENTS (last 10): none');
|
||||
}
|
||||
|
||||
const s = snap.summary || {};
|
||||
lines.push(`SUMMARY: status=${s.status || '?'} nodes=${s.node_count ?? '?'} services=${s.service_count ?? '?'} incidents=${s.incident_count ?? '?'}`);
|
||||
|
||||
await navigator.clipboard.writeText(lines.join('\n'));
|
||||
btn.textContent = 'Copied!';
|
||||
setTimeout(() => { btn.textContent = original; btn.disabled = false; }, 2000);
|
||||
} catch (e) {
|
||||
console.error('copyForAI error:', e);
|
||||
btn.textContent = 'Error';
|
||||
setTimeout(() => { btn.textContent = original; btn.disabled = false; }, 2000);
|
||||
}
|
||||
}
|
||||
|
||||
// Initial load
|
||||
refreshData();
|
||||
// Poll for updates
|
||||
setInterval(refreshData, pollInterval);
|
||||
|
||||
</script>
|
||||
</body>
|
||||
</html>
|
||||
|
|
@ -1,301 +0,0 @@
|
|||
import json
import os
import time
from datetime import datetime, timezone
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from pathlib import Path


STATE_DIR = Path(os.getenv("HOMELAB_STATE_ROOT", "/opt/homelab/state"))
EVENTS_DIR = Path(os.getenv("HOMELAB_EVENTS_ROOT", "/opt/homelab/events"))
WORLD_DIR = Path(os.getenv("HOMELAB_WORLD_ROOT", "/opt/homelab/world"))
ACTIONS_DIR = Path(os.getenv("HOMELAB_ACTIONS_ROOT", "/opt/homelab/actions"))
CONFIG_DIR = Path(os.getenv("HOMELAB_CONFIG_ROOT", "/opt/homelab/config"))

STATIC_DIR = Path(__file__).parent

DEFAULT_CONFIG = {
    "operator_mode": "approval",
    "auto_mode": True,
    "action_thresholds": {
        "restart_ha": 0.8,
        "check_network": 0.9,
    },
    "default_threshold": 0.9,
    "allowed_auto_actions": ["restart_ha"],
}
|
||||
|
||||
|
||||
def read_json_file(path, default=None):
    if not path.exists():
        return default if default is not None else []
    try:
        return json.loads(path.read_text())
    except Exception:
        return default if default is not None else []


def get_config():
    config_path = STATE_DIR / "operator-config.json"
    # Hand out copies so callers such as POST /config cannot mutate DEFAULT_CONFIG.
    if config_path.exists():
        return read_json_file(config_path, dict(DEFAULT_CONFIG))
    return dict(DEFAULT_CONFIG)


def save_config(config):
    STATE_DIR.mkdir(parents=True, exist_ok=True)
    (STATE_DIR / "operator-config.json").write_text(json.dumps(config, indent=2))
|
||||
|
||||
|
||||
def current_nodes():
    return read_json_file(WORLD_DIR / "nodes.json")


def current_services():
    return read_json_file(WORLD_DIR / "services.json")


def current_deployments():
    return read_json_file(WORLD_DIR / "deployments.json")


def current_incidents():
    return read_json_file(WORLD_DIR / "incidents.json")


def current_recommendations():
    return read_json_file(WORLD_DIR / "recommendations.json")


def current_summary():
    path = WORLD_DIR / "runtime-summary.json"
    summary = read_json_file(path, default={})
    if summary:
        last_update_val = summary.get("last_update")
        if last_update_val:
            try:
                if isinstance(last_update_val, str):
                    last_update = datetime.fromisoformat(last_update_val.replace('Z', '+00:00')).timestamp()
                else:
                    last_update = float(last_update_val)
            except Exception:
                last_update = os.path.getmtime(path)
        else:
            last_update = os.path.getmtime(path)
        summary["last_update"] = last_update
        summary["stale"] = (time.time() - last_update) > 60
    return summary
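Review note: the staleness rule in `current_summary()` combines timestamp parsing, a file-mtime fallback, and the 60-second comparison in one function, which makes it awkward to test in isolation. A minimal sketch of the same rule as two pure functions is given here. The names `to_epoch`, `is_stale`, and `STALE_AFTER_SECONDS` are hypothetical and do not appear in this diff; the 60-second threshold is copied from the function above.

```python
from datetime import datetime

# Assumption: same threshold as current_summary() in this diff.
STALE_AFTER_SECONDS = 60


def to_epoch(value, fallback: float) -> float:
    """Convert an ISO-8601 string or a number to epoch seconds.

    Returns `fallback` (for example the file's mtime) when the value
    is missing or cannot be parsed.
    """
    if isinstance(value, str):
        try:
            return datetime.fromisoformat(value.replace("Z", "+00:00")).timestamp()
        except ValueError:
            return fallback
    try:
        return float(value)
    except (TypeError, ValueError):
        return fallback


def is_stale(last_update: float, now: float, max_age: float = STALE_AFTER_SECONDS) -> bool:
    """True when the last update is strictly older than `max_age` seconds."""
    return (now - last_update) > max_age
```

Passing `now` explicitly keeps the comparison deterministic in tests; the caller would supply `time.time()` in production.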
|
||||
|
||||
|
||||
def current_events():
    return read_json_file(WORLD_DIR / "events.json", default=[])


def current_actions():
    actions = {}
    statuses = ["pending", "approved", "running", "completed", "failed", "rejected"]
    for status in statuses:
        actions[status] = []
        status_dir = ACTIONS_DIR / status
        if status_dir.exists():
            for f in status_dir.glob("*.json"):
                data = read_json_file(f)
                if data:
                    # Inject metadata for the UI
                    data["id"] = data.get("action_id") or f.stem
                    data["status"] = status
                    actions[status].append(data)
    return actions


def mutate_action(action_id, target_status):
    statuses = ["pending", "approved", "running", "completed", "failed", "rejected"]
    if target_status not in statuses:
        return False, f"Invalid target status: {target_status}"

    # Find the directory that currently holds the action
    source_path = None
    current_status = None
    for status in statuses:
        p = ACTIONS_DIR / status / f"{action_id}.json"
        if p.exists():
            source_path = p
            current_status = status
            break

    if not source_path:
        return False, f"Action {action_id} not found"

    target_dir = ACTIONS_DIR / target_status
    target_dir.mkdir(parents=True, exist_ok=True)
    target_path = target_dir / f"{action_id}.json"

    try:
        data = json.loads(source_path.read_text())
        data["status"] = target_status
        data["updated_at"] = time.time()

        # Keep a history of transitions
        history = data.get("transition_history", [])
        history.append({
            "from": current_status,
            "to": target_status,
            "timestamp": time.time()
        })
        data["transition_history"] = history

        target_path.write_text(json.dumps(data, indent=2))
        if source_path != target_path:
            source_path.unlink()
        return True, "Success"
    except Exception as e:
        return False, str(e)
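Review note: `mutate_action()` interpolates the client-supplied `id` straight into a filesystem path, so a value containing `../` could point outside `ACTIONS_DIR`. A minimal sketch of an ID check is given here. The helper `safe_action_path` and the pattern `ACTION_ID_RE` are hypothetical, and the pattern assumes action IDs are short slugs of letters, digits, `_`, `.`, and `-`, which this diff does not confirm.

```python
import re
from pathlib import Path

# Assumption: action IDs are short slugs; adjust to the real ID format.
ACTION_ID_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9_.-]{0,127}")


def safe_action_path(actions_dir: Path, status: str, action_id: str) -> Path:
    """Build actions_dir/<status>/<action_id>.json, rejecting unsafe IDs.

    Raises ValueError for IDs with path separators, '..' sequences,
    or any character outside the allowed slug alphabet.
    """
    if (
        not isinstance(action_id, str)
        or not ACTION_ID_RE.fullmatch(action_id)
        or ".." in action_id
    ):
        raise ValueError(f"invalid action id: {action_id!r}")
    return actions_dir / status / f"{action_id}.json"
```

A caller could run this check before building `source_path` and `target_path`, and map the `ValueError` to a 400 response.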
|
||||
|
||||
|
||||
def get_snapshot():
    nodes = current_nodes()
    services = current_services()
    incidents = current_incidents()
    events = current_events()
    summary = current_summary()

    non_nominal = [s for s in services if s.get("health") != "nominal"]
    nominal_count = len(services) - len(non_nominal)

    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "summary": summary,
        "nodes": nodes,
        "non_nominal_services": non_nominal,
        "nominal_service_count": nominal_count,
        "total_service_count": len(services),
        "incidents": incidents,
        "events": events[:10],
    }


def send_json(status, payload, handler):
    body = (json.dumps(payload) + "\n").encode("utf-8")
    handler.send_response(status)
    handler.send_header("Content-Type", "application/json")
    handler.send_header("Content-Length", str(len(body)))
    handler.end_headers()
    handler.wfile.write(body)
|
||||
|
||||
|
||||
class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/config":
            send_json(200, get_config(), self)
            return

        if self.path == "/nodes":
            send_json(200, current_nodes(), self)
            return

        if self.path == "/services":
            send_json(200, current_services(), self)
            return

        if self.path == "/deployments":
            send_json(200, current_deployments(), self)
            return

        if self.path == "/incidents":
            send_json(200, current_incidents(), self)
            return

        if self.path == "/recommendations":
            send_json(200, current_recommendations(), self)
            return

        if self.path == "/summary":
            send_json(200, current_summary(), self)
            return

        if self.path == "/events":
            send_json(200, current_events(), self)
            return

        if self.path == "/actions":
            send_json(200, current_actions(), self)
            return

        if self.path == "/snapshot":
            send_json(200, get_snapshot(), self)
            return

        if self.path in ("/", "/index.html"):
            body = (STATIC_DIR / "index.html").read_bytes()
            self.send_response(200)
            self.send_header("Content-Type", "text/html; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
            return

        self.send_error(404)

    def do_POST(self):
        if self.path not in (
            "/config",
            "/action/mutate",
            "/mode",
        ):
            self.send_error(404)
            return

        length = int(self.headers.get("Content-Length", "0"))
        raw_body = self.rfile.read(length).decode("utf-8")
        try:
            payload = json.loads(raw_body)
        except json.JSONDecodeError:
            self.send_error(400, "Invalid JSON")
            return

        if self.path == "/config":
            config = get_config()
            config.update(payload)
            save_config(config)
            send_json(200, {"status": "ok"}, self)
            return

        if self.path == "/mode":
            mode = payload.get("mode")
            if not mode:
                self.send_error(400, "mode is required")
                return
            config = get_config()
            config["operator_mode"] = mode
            save_config(config)
            send_json(200, {"status": "ok"}, self)
            return

        if self.path == "/action/mutate":
            action_id = payload.get("id")
            target = payload.get("status")
            if not action_id or not target:
                self.send_error(400, "id and status are required")
                return
            success, msg = mutate_action(action_id, target)
            if success:
                send_json(200, {"status": "ok"}, self)
            else:
                self.send_error(500, msg)
            return

    def log_message(self, format, *args):
        return
|
||||
|
||||
|
||||
if __name__ == "__main__":
    # Ensure directories exist
    for d in [STATE_DIR, EVENTS_DIR, WORLD_DIR, ACTIONS_DIR, CONFIG_DIR]:
        d.mkdir(parents=True, exist_ok=True)
    for s in ["pending", "approved", "running", "completed", "failed", "rejected"]:
        (ACTIONS_DIR / s).mkdir(parents=True, exist_ok=True)

    port = int(os.getenv("PORT", "8080"))
    print(f"Operator Control Plane starting on 0.0.0.0:{port}")
    server = ThreadingHTTPServer(("0.0.0.0", port), Handler)
    server.serve_forever()
|
||||
|
|
@ -1,10 +0,0 @@
|
|||
FROM python:3.11-slim

WORKDIR /app

COPY src/ src/

ENV PYTHONUNBUFFERED=1
ENV PYTHONPATH=/app/src

CMD ["python", "-m", "brain_watchdog.main"]
|
||||
|
|
@ -1,30 +0,0 @@
|
|||
services:
  brain-watchdog:
    build: .
    container_name: brain-watchdog
    restart: unless-stopped

    env_file:
      - /opt/homelab/config/brain-watchdog/.env

    volumes:
      - brain_watchdog_data:/data

    healthcheck:
      test:
        - "CMD"
        - "python"
        - "-c"
        - |
          import os, time, json, sys
          p = '/data/state.json'
          if not os.path.exists(p): sys.exit(1)
          age = time.time() - os.path.getmtime(p)
          sys.exit(0 if age < 300 else 1)
      interval: 1m
      timeout: 10s
      retries: 3
      start_period: 30s

volumes:
  brain_watchdog_data:
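Review note: the inline healthcheck reports the container healthy when `/data/state.json` exists and was modified less than 300 seconds ago. The same rule can be sketched as a standalone function, which is easier to unit-test than a `python -c` one-liner. The name `state_is_fresh` and the optional `now` parameter are hypothetical and not part of this diff.

```python
import os
import time


def state_is_fresh(path: str, max_age_seconds: float = 300, now=None) -> bool:
    """Return True when `path` exists and its mtime is younger than max_age_seconds.

    A missing or unreadable file counts as not fresh, mirroring the
    healthcheck's exit code 1 for a missing state file.
    """
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        return False
    current = time.time() if now is None else now
    return (current - mtime) < max_age_seconds
```

A healthcheck entry point could then end with `sys.exit(0 if state_is_fresh('/data/state.json') else 1)`.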
|
||||
Some files were not shown because too many files have changed in this diff.